notes on kettle jobs
generally:
- a step is basically a function
- it has inputs, performs some transformation, and returns outputs
- the inputs and outputs are always rows with some fields
- they always enter and leave asynchronously
- you have two main problems when you’re trying to design a tf:
- finding the right step (sometimes guessable from the name, but not often)
- making sure the step has the correct inputs (where applicable) and is configured to return the right output
- the ability to easily switch between data paths in a tf is useful
- the ‘ability’ to use a repository for kettle resources is not
- seriously it’s buggy and poorly-maintained, don’t do it
- just copy the shit to a prd server and cron up kitchen/pan
FNWI/CJKR
- determines, somehow, what BB data it is allowed to get
- FNWI job: determines this dynamically, based on which students answered ‘yes’ to a specific test in BB; this also means the course is hardcoded here
- CJKR job: reads lists of permitted courses + students from text files
- gets data about student activity
- translates it to xAPI
- sends all that to the LRS
get cmd args
- cjkr main tf needs filename to read from
- these will be passed in from the command line as arguments
- this works exactly the same way as it does in annemarie’s job
- however: the whole thing needs to be a job because the timestamp part depends on having a certain order
- therefore: the job receives the arguments, not the tf
- therefore there has to be a separate transform to take the job arguments and make them available to the rest of the job
- caveat: it is tempting to think you can take care of order inside a single tf with blocking steps
- don’t do it
- kettle’s not designed for that, so while it’s possible, it’s fuckin annoying and hairy
get latest timestamp
- main tf needs to know latest timestamp
- it needs to know before querying any DB
- it also can’t be part of the same tf, because variables can’t be used in the tf that defines them
- therefore this has to be a job entry; job entries run in a fixed order in a way that tf steps don’t
generate rows (input/generate rows)
- generates a row with a very old timestamp (the UNIX epoch) to cover courses that are already finished
set header (scripting/modified java script value)
- no real reason to use this over adding a field in Generate Rows, but this is how it’s passed in the main tf
- it’s something of a shortcoming that you can’t specify headers directly in the REST client step but have to pass them in as row fields (sketch below)
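a minimal sketch of what the script might contain, assuming the header in question is xAPI’s mandatory version header (field name and value are mine, not gospel):
// produce a field the REST client step can use as a header value;
// its Headers tab then maps the field to a header name,
// e.g. field xapiVersion -> header X-Experience-API-Version
var xapiVersion = '1.0.1';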
statements from LRS (lookup/rest client)
- the step can’t take separate params on a GET, so they go straight into the URL (arguably good practice, but the fact that y’all had to enforce it says more about you)
- params: ?limit=1&registration=uuidblat (full url spelled out below)
- registration is the conceptual grouping of shit in such a way that it’s directly queryable; i.e. all statements made by one application
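for reference, the assembled GET looks something like this (host and uuid made up):
http://lrs.example.org/xAPI/statements?limit=1&registration=<registration-uuid>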
parse json (input/json input)
- uses JSONPath syntax to retrieve a given field/property; for instance:
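to pull the timestamp out of the LRS response, assuming the standard statement-result shape ({"statements": [...], "more": ...}):
$.statements[0].timestamp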
max of timestamp (statistics/memory group by)
- input fields not in the top (group) set don’t become output fields
- input fields in the bottom (aggregates) set are reduced according to the chosen scheme (in this case, maximum)
- this step technically isn’t necessary anymore, since the GET params already limit the result to one statement, but:
- it applies its own wacky formatting to the timestamp, such that diking it out would probably break the date format in the queries
- leaving it in changes nothing content-wise, since the maximum of a set with one element is that element
write to log (utility/write to log)
- important: if you’re testing something and the fields you want to see have changed (i.e. almost always), you have to Get Fields again here
user data to lrs
courses and consenting students (input/table input)
- magically determine, using specific course and content PKs, which students have answered yes to the consent question in either course
- output a row that looks like the old one: course id, [duped course id,] user id
pre pare down (transform/select values)
- forum queries don’t need to see the course ID twice because they don’t include the clever shit to get the content_path
more queries (input/table input)
- honestly like 70% of the time went into writing these
- which kind of implies that i should be writing notes on how to get intelligible shit out of the bb schema
- but, like, the necessary intelligible shit varies based on requirements
- and the schema itself isn’t that hard, it’s just counter-intuitive in several places
- sql nuggets, i guess (sketch after this list):
- you can substitute a parenthesized select statement (subquery) for a single term in a lot of places, such as the from clause
- you can usually substitute either a case/when or a subquery that returns a single row and field where sql expects a single term
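a hedged illustration of both nuggets at once (every table/column name here is made up):
-- subquery in the from clause; scalar subquery and case/when standing in for single terms
select s.user_pk,
       (select max(l.event_time) from activity_log l where l.user_pk = s.user_pk) as last_active,
       case when s.posts > 0 then 'poster' else 'lurker' end as forum_role
from (select * from course_users where course_pk = 1234) s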
row to statement (scripting/modified java script value)
- probably 5% of the total time, the remaining 25% being oracle stupidity or deployment issues
- contingent on recipes/communities of practice, or else totally pointless, because xAPI is about interoperability (sketch of a statement after this list)
- get a list of reqs from stakeholders beforehand:
- make them supply statements of the form “actor A verb V’d object O [with result R] [in context C] [at time T]” which they want to see in the LRS
- if they cannot do this in a way that makes sense, their data should not be in the LRS and you should tell them this
- most unique id fields in the LRS are IRIs so agree a schema/namespace/whatever with your stakeholders before you craft a recipe
- beware of extensions. infinitely extensible, infinitely meaningless to any other application
- (if stakeholders do not care about this then hard questions need to be asked about why they’re using an LRS instead of some other DW)
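a sketch of what the script builds per row, assuming a generic activity recipe; every IRI and field name below is illustrative, not the actual recipe, and it presumes the bundled JS engine has JSON.stringify (older Rhino versions don’t):
// runs once per row; course_id and user_id are incoming fields
var statement = {
  actor: {
    objectType: 'Agent',
    account: { homePage: 'https://bb.example.org', name: String(user_id) }
  },
  verb: {
    id: 'http://adlnet.gov/expapi/verbs/interacted',
    display: { 'en-US': 'interacted' }
  },
  object: {
    objectType: 'Activity',
    id: 'https://bb.example.org/courses/' + course_id
  }
};
// add statementJson as an output field and point the REST client's body at it
var statementJson = JSON.stringify(statement);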
pare down (transform/select values)
- necessary because you can’t have a step with multiple input steps unless they all output the same fields
statement to LRS (lookup/rest client)
- no params on this one
write to log (utility/write to log)
- correctly stored statements return their (uu)ids as the http response (example below)
- the LRS blats its java error in other cases, which is moderately helpful
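success looks something like this in the log (uuid made up):
["e05aa883-acaf-40ad-bf54-02c8ce485fb0"]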
Annemarie
- gets lists of allowed students and courses from text files
- gets activity data from BB
- compares the personal data of the allowed students to the anonymous, aggregated data of everyone
- outputs personal data + aggregate-based percentage in a csv
- mails it to annemarie
user data to csv
get filenames (job/get variables)
- command line arguments passed as variables by a previous tf
- that could have been consolidated into the main tf here but encapsulation is a Good Thing
read course/user ids (input/text file input)
- annemarie supplied the files in question; she runs a mac, so these were MacRoman-encoded with UNIX line endings
- beware encoding/line-terminator issues
join rows (joins/join rows (cartesian product))
- cartesian products are useful to add single fields to rows but otherwise leave them unchanged, because x * 1 = x for all x
dupe (scripting/modified java script value)
- personalized query needs to see the course ID twice (one-liner below)
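the whole step is essentially one line (field names are mine):
// copy the course id into a second field so downstream sees it twice
var course_id_2 = course_id;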
query db for participants (input/table input)
query db for anon activity/dsll (input/table input)
- anon queries never even get to see the user IDs: it’s not just technically secure, it’s logically secure
anon data to vectors (statistics/memory group by)
- ‘vector’ meaning a serialized list of values for a given field (example below)
- only called it a vector because annemarie did (it’s apparently a thing in R, which makes an amount of sense)
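e.g., with the group-by set to concatenate values comma-separated, a grade field over all students reduces to something like (numbers made up):
grade_vector = 7,55,12,88,41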
join rows 2+3 (joins/join rows (cartesian product))
- once again, these single fields get added to the personalized data
aggregate (scripting/modified java script value)
- uses a start script (runs once per execution of the whole transformation) to define a function that returns the percentage of values in a vector that are equal to or higher than a comparator value
- uses the transform script (runs once per row received) to apply that function to the vector and to that row’s personal value for the property whose values are listed in the vector (see the sketch below)
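roughly what those two scripts look like, assuming comma-separated vectors; the field names (grade_vector, grade) are placeholders:
// start script: runs once, so function definitions live here
function pctAtOrAbove(vectorString, value) {
  var vals = ('' + vectorString).split(',');
  var hits = 0;
  for (var i = 0; i < vals.length; i++) {
    if (parseFloat(vals[i]) >= value) {
      hits++;
    }
  }
  return (hits / vals.length) * 100;
}

// transform script: runs once per row; add grade_pct as an output field
var grade_pct = pctAtOrAbove(grade_vector, grade);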
csv output (output/text file output)
- straightforward
- for the mail step that sends it: you probably want to select “only send comment in mail body” in the “email message” tab, unless you want a bunch of extraneous log shit to be sent along
deploy notes
[sudo] crontab -e
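# m h dom mon dow cmd — the line below fires at minute 0 of every hour,
# appending stdout and stderr to cron.log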
0 * * * * sh path/to/kitchen.sh -norep -file path/to/job.kjb arg arg2 >> cron.log 2>&1
or, if you’re doing a one-time thing:
nohup sh path/to/kitchen.sh -norep -file path/to/job.kjb arg arg2 &
version control
shit’s on subversion