This document describes the model training and WFO automations.
Model training
- Name: Run Model training
- Purpose: Train model on GPU runners and store it in DVC.
- When it runs: whenever someone triggers the workflow manually (choosing one of the run options below), or on any push to a branch starting with delivery/update-data- (data update branches) or ci/model-update- (branches created when a change is merged to develop). A rough sketch of this branch check follows the list below.
- Things to know:
- For automated runs, model training always uses data starting from 2017 and logs training metrics to Datadog.
- For manual runs there are 5 options: a debug and a full model run on data starting from 2021, a debug and a full model run on data starting from 2017, and a debug run on data aggregated to parent vehicle level.
- For the 2017 full and debug models, training typically takes around 1.2 hours.
- Training metrics are tracked in Datadog, and there is an associated dashboard.
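As a rough illustration of the trigger condition above, the sketch below checks whether a branch name matches one of the automation prefixes. The helper function and the use of GITHUB_REF_NAME are illustrative assumptions; the real trigger lives in the workflow definition, not in application code.

```python
import os

# Branch prefixes that trigger the automated run (from the section above).
AUTOMATION_PREFIXES = ("delivery/update-data-", "ci/model-update-")


def is_automation_branch(branch: str) -> bool:
    """Return True if a push to this branch should trigger the automated run."""
    return branch.startswith(AUTOMATION_PREFIXES)


if __name__ == "__main__":
    # GITHUB_REF_NAME is the branch name on push events in GitHub Actions.
    branch = os.environ.get("GITHUB_REF_NAME", "")
    print(f"{branch!r} triggers automation: {is_automation_branch(branch)}")
```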
Error
The most common errors in this process are:
1. For automated runs: hashes were not updated before training was triggered. This only happens if you push something to an automation branch and trigger training before the hdf5 run is complete.
2. For manual runs: the code changes are not compatible with the data. In this case the debug data run will fail, and debugging can be done via the logs.
3. NaN loss. The way to debug is to identify the root cause and reproduce it locally (see the sketch below).
4. Metric logging failures, which are usually caused by environment issues.
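For error 3, one common way to surface the problem early is a fail-fast NaN check inside the training loop, so the offending step can be reproduced locally. This is a minimal PyTorch-style sketch under assumed names (model, loss_fn, optimizer); it is not the pipeline's actual training loop.

```python
import torch


def training_step(model, batch, optimizer, loss_fn, step):
    """One training step with a fail-fast check for non-finite loss (sketch)."""
    optimizer.zero_grad()
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)

    # Abort immediately on NaN/Inf loss so the offending step and batch
    # can be dumped and reproduced locally for debugging.
    if not torch.isfinite(loss):
        raise RuntimeError(f"Non-finite loss {loss.item()} at step {step}")

    loss.backward()
    optimizer.step()
    return loss.item()
```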
WFO Runs
- Name: Run WFO Pipeline
- Purpose: Run WFO on GPU runners, store results in DVC, and log metrics to Datadog.
- When it runs: whenever someone triggers the workflow manually (choosing one of the run options below), or on any push to a branch starting with delivery/update-data- (data update branches) or ci/model-update- (branches created when a change is merged to develop).
- Things to know:
- For automated runs, WFO always uses data starting from 2017 and logs various validation metrics to Datadog.
- For manual runs there are 3 options: a debug run on 2017 data, a full run on 2017 data, and a debug run on data aggregated to parent vehicle level.
- For the 2017 full and debug models, WFO currently takes around 4 hours.
- Validation metrics are tracked in Datadog, and there is an associated dashboard.
- For automated runs the Datadog run tag is PR_<PR_NUMBER>; for an experiment run it is EXPERIMENT_<PR_NUMBER>. A sketch of how such a tag can be built and attached to a metric follows this list.
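The sketch below shows how such a run tag could be built and attached to a metric, assuming the datadogpy (`datadog`) package; the metric name, PR number, and environment variable names are placeholders, not the pipeline's actual values.

```python
import os
import time

from datadog import api, initialize


def build_run_tag(pr_number: int, is_experiment: bool) -> str:
    """Build the run tag described above: PR_<PR_NUMBER> or EXPERIMENT_<PR_NUMBER>."""
    prefix = "EXPERIMENT" if is_experiment else "PR"
    return f"{prefix}_{pr_number}"


if __name__ == "__main__":
    # Conventional Datadog credential env vars; adjust to the real setup.
    initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

    run_tag = build_run_tag(pr_number=1234, is_experiment=False)  # placeholder PR number

    # Placeholder metric name and value, tagged with the run tag.
    api.Metric.send(
        metric="wfo.validation.example_metric",
        points=[(int(time.time()), 0.42)],
        tags=[run_tag],
    )
```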
Error
The most common errors in this process are:
1. Any of the errors that can occur in the training stage.
2. OOM failures if WFO is memory-heavy and memory clearance was not done properly (see the sketch below).
3. WFO currently has 4 periods; if you wish to add more, make sure the run does not exceed the current stage timeout.
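For error 2, the usual remedy is to drop references to the previous period's heavy objects and clear the GPU cache before the next period starts. The sketch below is a generic PyTorch-flavoured illustration, not the pipeline's actual cleanup code; `periods` and `train_period` are placeholders.

```python
import gc

import torch


def run_wfo(periods, train_period):
    """Run WFO period by period, clearing GPU memory between periods (sketch)."""
    results = []
    for period in periods:
        model, metrics = train_period(period)  # placeholder training call
        results.append(metrics)

        # Drop references to the heavy objects from this period, then force a
        # garbage-collection pass and release cached GPU memory; otherwise the
        # next period can OOM on top of leftover allocations.
        del model
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    return results
```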
Notes
- There is a train-from-scratch option; use it only if you are changing something in the core model, or roughly once every 2 months.
- Change the early stopping parameters very carefully (a generic example follows these notes).
- All metrics are logged to Datadog; look carefully at each period and its metrics before raising a model PR.
- When working on a model improvement task, load back the cml model and run viz_bad_wholesalers to identify model shortfalls.
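As background for the early-stopping note above, here is a generic early-stopping helper showing what patience and min-delta parameters control; it is illustrative only, not the project's implementation, and the values used are arbitrary.

```python
class EarlyStopping:
    """Generic early-stopping tracker (illustrative, not the project's own)."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience    # epochs to wait without improvement
        self.min_delta = min_delta  # minimum improvement that resets the counter
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Return True if training should stop after this validation loss."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# A looser patience keeps training longer; a tighter one stops earlier, which is
# why these parameters should only be changed deliberately.
stopper = EarlyStopping(patience=2, min_delta=1e-4)
for loss in [0.50, 0.48, 0.479, 0.4795, 0.4791]:
    if stopper.step(loss):
        print("early stop")
        break
```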