notes on kettle jobs
generally:
- a step is basically a function
- it has inputs, performs some transformation, and returns outputs
- the inputs and outputs are always rows with some fields
- they always enter and leave asynchronously
- you have two main problems when you’re trying to design a tf:
- finding the right step (sometimes guessable from the name, but not often)
- making sure the step has the correct inputs (where applicable) and is configured to return the right output
- the ability to easily switch between data paths in a tf is useful
- the ‘ability’ to use a repository for kettle resources is not
- seriously it’s buggy and poorly-maintained, don’t do it
- just copy the shit to a prd server and cron up kitchen/pan
FNWI/CJKR
- determines, somehow, what BB data it is allowed to get
- FNWI job: determines this dynamically, based on which students answered ‘yes’ to a specific test in BB; this also means the course is hardcoded here
- CJKR job: reads lists of permitted courses + students from text files
- gets data about student activity
- translates it to xAPI
- sends all that to the LRS
get cmd args
- cjkr main tf needs filename to read from
- these will be passed in from the command line as arguments
- this works exactly the same way as it does in annemarie’s job
- however: the whole thing needs to be a job because the timestamp part depends on having a certain order
- therefore: the job receives the arguments, not the tf
- therefore there has to be a separate transform to take the job arguments and make them available to the rest of the job
- caveat: it is tempting to think you can take care of order inside a single tf with blocking steps
- don’t do it
- kettle’s not designed for that, so while it’s possible, it’s fuckin annoying and hairy
get latest timestamp
- main tf needs to know latest timestamp
- it needs to know before querying any DB
- it also can’t be part of the same tf, because variables can’t be used in the tf that defines them
- therefore this has to be a job entry; job entries run in a fixed order in a way that tf steps don’t
generate rows (input/generate rows)
- generates a row with a very old timestamp (the UNIX epoch) to cover courses that are already finished
set header (scripting/modified java script value)
- no real reason to use this over adding a field in Generate Rows, but this is how it’s passed in the main tf
- it’s something of a shortcoming that you can’t specify headers directly in the REST client step but have to pass them in as row fields (sketch below)
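a minimal sketch of what the script might contain, assuming the header in question is xAPI’s mandatory version header (field name and value are mine, not gospel):
// produce a field the REST client step can use as a header value;
// its Headers tab then maps the field to a header name,
// e.g. field xapiVersion -> header X-Experience-API-Version
var xapiVersion = '1.0.1';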
statements from LRS (lookup/rest client)
- the step can’t take separate params on a GET, so they go straight into the URL (arguably good practice, but the fact that y’all had to enforce it says more about you)
- params: ?limit=1&registration=uuidblat (full url spelled out below)
- registration is the conceptual grouping of shit in such a way that it’s directly queryable; i.e. all statements made by one application
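for reference, the assembled GET looks something like this (host and uuid made up):
http://lrs.example.org/xAPI/statements?limit=1&registration=<registration-uuid>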
parse json (input/json input)
- uses JSONPath syntax to retrieve a given field/property; for instance:
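to pull the timestamp out of the LRS response, assuming the standard statement-result shape ({"statements": [...], "more": ...}):
$.statements[0].timestamp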
max of timestamp (statistics/memory group by)
- input fields not in the top (group) set don’t become output fields
- input fields in the bottom (aggregates) set are reduced according to the chosen scheme (in this case, maximum)
- this step technically isn’t necessary anymore, since the GET params already limit the result to one statement, but:
- it applies its own wacky formatting to the timestamp, such that diking it out would probably break the date format in the queries
- leaving it in changes nothing content-wise, since the maximum of a set with one element is that element
write to log (utility/write to log)
- important: if you’re testing something and the fields you want to see have changed (i.e. almost always), you have to Get Fields again here
user data to lrs
courses and consenting students (input/table input)
- magically determine, using specific course and content PKs, which students have answered yes to the consent question in either course
- output a row that looks like the old one: course id, [duped course id,] user id
pre pare down (transform/select values)
- forum queries don’t need to see the course ID twice because they don’t include the clever shit to get the content_path
more queries (input/table input)
- honestly like 70% of the time went into writing these
- which kind of implies that i should be writing notes on how to get intelligible shit out of the bb schema
- but, like, the necessary intelligible shit varies based on requirements
- and the schema itself isn’t that hard, it’s just counter-intuitive in several places
- sql nuggets, i guess (sketch after this list):
- you can substitute a parenthesized select statement (subquery) for a single term in a lot of places, such as the from clause
- you can usually substitute either a case/when or a subquery that returns a single row and field where sql expects a single term
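a hedged illustration of both nuggets at once (every table/column name here is made up):
-- subquery in the from clause; scalar subquery and case/when standing in for single terms
select s.user_pk,
       (select max(l.event_time) from activity_log l where l.user_pk = s.user_pk) as last_active,
       case when s.posts > 0 then 'poster' else 'lurker' end as forum_role
from (select * from course_users where course_pk = 1234) s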
row to statement (scripting/modified java script value)
- probably 5% of the total time, the remaining 25% being oracle stupidity or deployment issues
- contingent on recipes/communities of practice, or else totally pointless, because xAPI is about interoperability (sketch of a statement after this list)
- get a list of reqs from stakeholders beforehand:
- make them supply statements of the form “actor A verb V’d object O [with result R] [in context C] [at time T]” which they want to see in the LRS
- if they cannot do this in a way that makes sense, their data should not be in the LRS and you should tell them this
- most unique id fields in the LRS are IRIs so agree a schema/namespace/whatever with your stakeholders before you craft a recipe
- beware of extensions. infinitely extensible, infinitely meaningless to any other application
- (if stakeholders do not care about this then hard questions need to be asked about why they’re using an LRS instead of some other DW)
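a sketch of what the script builds per row, assuming a generic activity recipe; every IRI and field name below is illustrative, not the actual recipe, and it presumes the bundled JS engine has JSON.stringify (older Rhino versions don’t):
// runs once per row; course_id and user_id are incoming fields
var statement = {
  actor: {
    objectType: 'Agent',
    account: { homePage: 'https://bb.example.org', name: String(user_id) }
  },
  verb: {
    id: 'http://adlnet.gov/expapi/verbs/interacted',
    display: { 'en-US': 'interacted' }
  },
  object: {
    objectType: 'Activity',
    id: 'https://bb.example.org/courses/' + course_id
  }
};
// add statementJson as an output field and point the REST client's body at it
var statementJson = JSON.stringify(statement);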
pare down (transform/select values)
- necessary because you can’t have a step with multiple input steps unless they all output the same fields
statement to LRS (lookup/rest client)
- no params on this one
write to log (utility/write to log)
- correctly stored statements return their (uu)ids as the http response (example below)
- the LRS blats its java error in other cases, which is moderately helpful
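success looks something like this in the log (uuid made up):
["e05aa883-acaf-40ad-bf54-02c8ce485fb0"]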
Annemarie
- gets lists of allowed students and courses from text files
- gets activity data from BB
- compares the personal data of the allowed students to the anonymous, aggregated data of everyone
- outputs personal data + aggregate-based percentage in a csv
- mails it to annemarie
user data to csv
get filenames (job/get variables)
- command line arguments passed as variables by a previous tf
- that could have been consolidated into the main tf here but encapsulation is a Good Thing
read course/user ids (input/text file input)
- annemarie supplied the files in question; she runs a mac, so these were MacRoman-encoded with UNIX line endings
- beware encoding/line-terminator issues
join rows (joins/join rows (cartesian product))
- cartesian products are useful to add single fields to rows but otherwise leave them unchanged, because x * 1 = x for all x
dupe (scripting/modified java script value)
- personalized query needs to see the course ID twice (one-liner below)
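the whole step is essentially one line (field names are mine):
// copy the course id into a second field so downstream sees it twice
var course_id_2 = course_id;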
query db for participants (input/table input)
query db for anon activity/dsll (input/table input)
- anon queries never even get to see the user IDs: it’s not just technically secure, it’s logically secure
anon data to vectors (statistics/memory group by)
- ‘vector’ meaning a serialized list of values for a given field (example below)
- only called it a vector because annemarie did (it’s apparently a thing in R, which makes an amount of sense)
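e.g., with the group-by set to concatenate values comma-separated, a grade field over all students reduces to something like (numbers made up):
grade_vector = 7,55,12,88,41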
join rows 2+3 (joins/join rows (cartesian product))
- once again, these single fields get added to the personalized data
aggregate (scripting/modified java script value)
- uses a start script (runs once per execution of the whole transformation) to define a function that returns the percentage of values in a vector that are equal to or higher than a comparator value
- uses the transform script (runs once per row received) to apply that function to the vector and to that row’s personal value for the property whose values are listed in the vector (see the sketch below)
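roughly what those two scripts look like, assuming comma-separated vectors; the field names (grade_vector, grade) are placeholders:
// start script: runs once, so function definitions live here
function pctAtOrAbove(vectorString, value) {
  var vals = ('' + vectorString).split(',');
  var hits = 0;
  for (var i = 0; i < vals.length; i++) {
    if (parseFloat(vals[i]) >= value) {
      hits++;
    }
  }
  return (hits / vals.length) * 100;
}

// transform script: runs once per row; add grade_pct as an output field
var grade_pct = pctAtOrAbove(grade_vector, grade);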
csv output (output/text file output)
- straightforward
- for the mail step that sends it: you probably want to select “only send comment in mail body” in the “email message” tab, unless you want a bunch of extraneous log shit to be sent along
deploy notes
[sudo] crontab -e
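# m h dom mon dow cmd — the line below fires at minute 0 of every hour,
# appending stdout and stderr to cron.log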
0 * * * * sh path/to/kitchen.sh -norep -file path/to/job.kjb arg arg2 >> cron.log 2>&1
or, if you’re doing a one-time thing:
nohup sh path/to/kitchen.sh -norep -file path/to/job.kjb arg arg2 &
version control
shit’s on subversion