This document covers common errors in model training and WFO runs triggered via automations, along with general notes.

Model training

Error

The most common errors in this process are the following:

  1. For automation runs, training fails if it is triggered before the hashes are updated. This only occurs if you have pushed something to the automation branches and triggered training before the hdf5 run is complete.
  2. For manual runs, errors can occur if code changes are not compatible with the data. In this case the debug data run will fail, and debugging can be done via the logs.
  3. Errors can also occur because of NaN loss. To debug, identify the root cause and reproduce it locally.
  4. Errors can also occur during metric logging, typically because of environment issues.
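
For the NaN loss case, a useful first step when reproducing locally is to find the first training step where the loss stops being finite, and then check whether the inputs for that step already contain NaN or infinite values. The sketch below is only an illustration under assumptions: the function names and the dict-of-columns batch layout are hypothetical and are not part of the actual training code.

```python
import math
from typing import Dict, List, Optional, Sequence


def first_non_finite_step(losses: Sequence[float]) -> Optional[int]:
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


def non_finite_columns(batch: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of input columns that contain a NaN or infinite value.

    `batch` is a hypothetical layout: column name -> list of float values.
    """
    return [
        name
        for name, values in batch.items()
        if any(not math.isfinite(v) for v in values)
    ]
```

If the inputs at the failing step are clean, the cause is more likely numerical (for example a learning rate that is too high, or a division or log on a zero value) than a data problem.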

WFO Runs

Error

The most common errors in this process are the following:

  1. All errors that can happen in the training stage.
  2. If the WFO run is very memory-heavy and memory is not cleared properly, the run will fail with an OOM (out of memory) error.
  3. WFO currently has 4 periods. If you wish to add more periods, make sure the run does not exceed the current stage timeout.
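
The OOM and timeout points can be sketched as follows. This is a minimal illustration, not the actual WFO code: the function names, the per-period callables, and the 10% headroom are all assumptions. The first function checks whether a given number of periods is expected to fit within the stage timeout; the second runs the periods one at a time and keeps only a small metrics dict per period, so that large objects from one period are released before the next one starts.

```python
import gc
from typing import Callable, List


def fits_stage_timeout(n_periods: int, seconds_per_period: float,
                       stage_timeout_seconds: float, margin: float = 0.1) -> bool:
    """Return True if n_periods are expected to finish within the stage timeout.

    `margin` reserves a fraction of the timeout as headroom (assumed value).
    """
    budget = stage_timeout_seconds * (1.0 - margin)
    return n_periods * seconds_per_period <= budget


def run_periods(period_fns: List[Callable[[], dict]]) -> List[dict]:
    """Run each period in turn and keep only its metrics dict."""
    summaries = []
    for run_one in period_fns:
        # Hypothetical callable: trains and evaluates one period. Its large
        # local objects (data, model) go out of scope when it returns.
        metrics = run_one()
        summaries.append(dict(metrics))
        # Force a collection so reference cycles are reclaimed between periods.
        gc.collect()
    return summaries
```

For example, with a 4-hour stage timeout and roughly 50 minutes per period, `fits_stage_timeout(4, 50 * 60, 4 * 3600)` is True while `fits_stage_timeout(5, 50 * 60, 4 * 3600)` is False. If the model runs on a GPU, the framework's own cache-clearing call may also be needed; it is not shown here because the framework is not stated in this document.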

Notes

  1. There is an option to train from scratch. Use it only if you are changing something in the core model, or roughly once every 2 months (this cadence is tentative).
  2. Change early stopping params very carefully.
  3. All metrics are logged in Datadog. Look carefully at each period and its metrics before raising a model PR.
  4. While working on a model improvement task, load the cml model back and try viz_bad_wholesalers to identify model shortfalls.
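
To illustrate why early stopping params need care, the sketch below implements a generic patience-based rule and shows how the same validation curve stops at different points depending on the settings. The parameter names `patience` and `min_delta` and the loss values are assumptions chosen for illustration; the actual params used in the training pipeline may differ.

```python
from typing import Optional, Sequence


def stopping_epoch(val_losses: Sequence[float], patience: int,
                   min_delta: float = 0.0) -> Optional[int]:
    """Return the 0-based epoch at which training would stop, or None.

    Training stops once `patience` consecutive epochs fail to improve the
    best validation loss by more than `min_delta`.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


# Made-up validation losses with a plateau followed by a late improvement.
losses = [1.00, 0.80, 0.79, 0.795, 0.78, 0.60, 0.59]
print(stopping_epoch(losses, patience=2, min_delta=0.05))  # 3
print(stopping_epoch(losses, patience=4, min_delta=0.05))  # None
```

With `patience=2` the run stops at epoch 3, on the plateau, and never reaches the improvement at epoch 5. With `patience=4` it trains through the plateau. A small change in these params can therefore change which model is produced.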