
2019 Sprint Exercises for Day 1


The following suggested exercises are to help acquaint you with MLJ, but just as importantly, to provide the MLJ team with feedback on documentation, functionality and bugs.

  1. If you have not already done so, set up Julia and MLJ, following some variation of these instructions. To ensure you are using the bleeding edge version of MLJ, activate your sprint environment, and at the package manager prompt (sprint) pkg> enter
add MLJ#master
add MLJModels#master
update

Alternatively, for use in the exercises below, instantiate the environment used for the introductory demo. We advise against using an existing environment, full of every package you ever tried out, or might want to try out some day ☺.
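
For example, assuming you have cloned the demo repository and started Julia in the directory containing its Project.toml and Manifest.toml (that location is an assumption):

```
pkg> activate .      # press ] at the Julia REPL to get this prompt
pkg> instantiate     # installs the exact package versions the demo records
```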

  2. Optional. Go over the introductory demo to check that you can at least reproduce it (using the notebook or script, according to your preference).

  3. For experimentation with basic regression or classification, do one of the following (a loading sketch appears after this list):

    • Preferred, for better user testing. Identify a relatively small public, structured, multivariate dataset of your choosing. For now, avoid images, text, one-dimensional data (e.g., time series), and sparse data. No missing data. Ideally, pick data with a mixture of continuous and categorical data types. Load your data into a table raw_table (using the CSV package for CSV-like formats).

    • Expedient. Load the professor salaries data set with: using RDatasets; raw_table = dataset("car", "Salaries").
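
Either way, a loading sketch (the RDatasets call is from the exercise; the CSV file name is a placeholder):

```julia
# Expedient option:
using RDatasets
raw_table = dataset("car", "Salaries")

# Or, for your own CSV data ("my_data.csv" is a placeholder):
using CSV, DataFrames
raw_table = CSV.File("my_data.csv") |> DataFrame
```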

  4. Load MLJ and determine the scientific types MLJ will infer for the table columns. Use the MLJ coerce method to correct any wrong interpretations.
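
A minimal sketch, using current MLJ names and columns from the Salaries data set; inspect the schema output to decide which coercions your own data needs:

```julia
using MLJ

schema(raw_table)   # machine types and the scientific types MLJ infers

# Correct any wrong interpretations (these particular coercions are
# illustrative, using Salaries columns):
table = coerce(raw_table, :Rank => OrderedFactor, :Discipline => Multiclass)
```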

  5. Construct a new table with only continuous features, by one-hot encoding the categoricals.
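
One way to do this is with MLJ's built-in OneHotEncoder transformer, fitted as a machine (a sketch, continuing from table above):

```julia
hot = machine(OneHotEncoder(), table)
fit!(hot)                               # learns the levels of each factor
X_continuous = transform(hot, table)    # categoricals replaced by 0/1 columns
schema(X_continuous)                    # features should now be Continuous
```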

  6. Identify a target and construct a learning task from the data set, ensuring appropriate scitype coercions.
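
The 2019 task API has since been removed from MLJ; in current versions the equivalent step is simply to split out a target vector y and feature table X, for example with unpack (here :Salary is the Salaries target; substitute your own):

```julia
# Columns matching the first predicate go to y; the rest go to X:
y, X = unpack(X_continuous, ==(:Salary), colname -> true)
```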

  7. Identify a registered MLJ model that can solve your task.
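
With features and target in hand, current MLJ can list compatible models; the DecisionTreeRegressor loaded below is just one possible choice:

```julia
models(matching(X, y))     # all registered models compatible with (X, y)

Tree = @load DecisionTreeRegressor pkg=DecisionTree   # returns the model type
tree = Tree()                                         # default hyperparameters
```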

  8. Shuffle the data and split the data rows into train and test parts.
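
For example (the split fraction and seed are arbitrary):

```julia
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=123)
```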

  9. On the train rows of your data only, evaluate your model using default hyperparameters, according to one or two appropriate measures (loss functions), using 5-fold cross-validation.
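
A sketch using evaluate with built-in measures (measure names per current MLJ; for classification substitute, e.g., cross_entropy):

```julia
Xtrain, ytrain = selectrows(X, train), selectrows(y, train)

evaluate(tree, Xtrain, ytrain,
         resampling=CV(nfolds=5, shuffle=true),
         measure=[rms, mae])
```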

  10. If your model is iterative, construct a learning curve to determine an appropriate iteration parameter.
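
A decision tree is not iterative, so the sketch below assumes a gradient-boosting model instead (EvoTreeRegressor from EvoTrees.jl, whose iteration parameter is nrounds); plotting requires Plots.jl:

```julia
Booster = @load EvoTreeRegressor pkg=EvoTrees
booster = Booster()

mach = machine(booster, Xtrain, ytrain)
r = range(booster, :nrounds, lower=10, upper=500)
curve = learning_curve(mach, range=r, resampling=CV(nfolds=5), measure=rms)

using Plots
plot(curve.parameter_values, curve.measurements,
     xlabel="nrounds", ylabel="CV rms estimate")
```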

  11. Construct a "self-tuning" wrapper of your model, based on its two most important hyperparameters, using a grid search and 5-fold cross-validation as the resampling strategy. Fit this model on the train rows. Can you say, in detail, what has actually taken place?
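
A sketch for the decision tree, treating its max_depth and min_samples_split fields as the (illustrative) "two most important" hyperparameters:

```julia
r1 = range(tree, :max_depth, lower=2, upper=10)
r2 = range(tree, :min_samples_split, lower=2, upper=20)

self_tuning_tree = TunedModel(model=tree,
                              tuning=Grid(resolution=5),
                              resampling=CV(nfolds=5),
                              range=[r1, r2],
                              measure=rms)

mach = machine(self_tuning_tree, Xtrain, ytrain)
fit!(mach)                        # grid-searches, then retrains the best
                                  # model on all rows given to the machine
fitted_params(mach).best_model    # inspect the winning hyperparameters
```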

  12. Evaluate the performance of your tuned model on the test set.
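
Continuing the sketch:

```julia
Xtest, ytest = selectrows(X, test), selectrows(y, test)
yhat = predict(mach, Xtest)    # predictions of the retrained best model
rms(yhat, ytest)
```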

  13. Repeat the above exercise for a bagged ensemble of size 10, restricting tuning to just one atomic hyperparameter, together with the bagging fraction.
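
A sketch; note that the keyword naming the atomic model has changed across MLJ versions (atom in older releases, model currently):

```julia
forest = EnsembleModel(model=tree, n=10)    # bagged ensemble of 10 trees

# One atomic hyperparameter (nested syntax) plus the bagging fraction:
r1 = range(forest, :(model.max_depth), lower=2, upper=10)
r2 = range(forest, :bagging_fraction, lower=0.5, upper=1.0)

tuned_forest = TunedModel(model=forest,
                          tuning=Grid(resolution=5),
                          resampling=CV(nfolds=5),
                          range=[r1, r2],
                          measure=rms)

mach = machine(tuned_forest, Xtrain, ytrain)
fit!(mach)
rms(predict(mach, Xtest), ytest)
```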

  14. Advanced. Alternatively, repeat the above exercises with the following workflow: (i) construct a task directly from raw_table, coercing scitypes during task construction; (ii) construct a learning network (pipeline) that combines one-hot encoding with the predictive model you used before; (iii) export the network as a stand-alone model; (iv) tune and evaluate the composite model.
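
The learning-network export macros current in 2019 have since been superseded; in today's MLJ the same workflow reduces to composing a pipeline with |>. A sketch (the coercions and the auto-generated component field names are illustrative):

```julia
# (i) coerce scitypes up front and split off the target:
table2 = coerce(raw_table, :Rank => OrderedFactor)
y2, X2 = unpack(table2, ==(:Salary), colname -> true)

# (ii)-(iii) compose encoder and model into a stand-alone pipeline model:
pipe = OneHotEncoder() |> tree

# (iv) tune and evaluate the composite; component fields get auto-generated
# snake-case names like :decision_tree_regressor (check propertynames(pipe)):
r = range(pipe, :(decision_tree_regressor.max_depth), lower=2, upper=10)
tuned_pipe = TunedModel(model=pipe, tuning=Grid(resolution=5),
                        resampling=CV(nfolds=5), range=r, measure=rms)
evaluate(tuned_pipe, X2, y2, resampling=CV(nfolds=5), measure=rms)
```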