Welcome to the developer guide for the data validator module.
1. The data validator has three main components:
   - data loader:
     - The load_data function is the entry point of data_loader.
     - It takes the parquet, json and hdf5 directories as input and generates the inputs required by the checks.
     - load_data must always return a DataLoader object.
     - It builds all the 2D dataframes, renames their columns and sorts them for data consistency.
     - It also returns the encodings, the date index, and the parquet, json and hdf5 paths, for use in case
       a check does not rely on dataframes.
   - checks:
     - All checks live in the checks directory as scripts.
     - Details on how to add a check can be found in the __init__.py of the checks directory.
   - export:
     - This defines the supported export formats for the reports.
     - Currently, 3 formats are supported: html, markdown and excel.
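
The data loader contract described above can be sketched roughly as follows. This is only an illustration of the shape of the return value, not the module's actual code: the field names on DataLoader, the load_data signature and the check_date_index_sorted helper are all assumptions made for this example, and the file reading itself is left out.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class DataLoader:
    """Hypothetical container; the real DataLoader may expose other fields."""

    parquet_dir: Path
    json_dir: Path
    hdf5_dir: Path
    encodings: dict[str, Any] = field(default_factory=dict)
    date_index: list[str] = field(default_factory=list)
    # Name -> 2D dataframe; typed as Any to keep the sketch dependency-free.
    dataframes: dict[str, Any] = field(default_factory=dict)


def load_data(parquet_dir: str, json_dir: str, hdf5_dir: str) -> DataLoader:
    """Return the single object that every check receives (assumed signature)."""
    # The real function would read the files here, build the dataframes,
    # rename their columns and sort them before returning.
    return DataLoader(Path(parquet_dir), Path(json_dir), Path(hdf5_dir))


def check_date_index_sorted(data: DataLoader) -> bool:
    """Hypothetical check that needs only the date index, not a dataframe."""
    return data.date_index == sorted(data.date_index)
```

Returning one object from load_data means a check can pick whichever input it needs (a dataframe, the date index, or a raw path) without changing the loader's interface.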
2. To run the data validator, use the following command in the terminal:
   python -m wt_ml.dataset.data_validator.data_validator <mode> <export_type>
   Here,
   - <mode> is one of 'dbg' or 'full'.
     'full' is the default if nothing is passed.
   - <export_type> is one of 'html', 'markdown', 'excel' or 'all'.
     'excel' is the default if nothing is passed.
   - All reports are written to the results directory of the wt_ml repository.
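
The argument defaults described above could be handled along these lines. This is a minimal sketch using argparse and assumes both arguments are optional positionals in the order shown in the command; build_parser and resolve_formats are hypothetical names, and the module's real parsing code may differ.

```python
import argparse

MODES = ("dbg", "full")
EXPORT_TYPES = ("html", "markdown", "excel", "all")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with two optional positional arguments (assumed layout)."""
    parser = argparse.ArgumentParser(prog="data_validator")
    # nargs="?" makes a positional optional, so the default applies when omitted.
    parser.add_argument("mode", nargs="?", choices=MODES, default="full")
    parser.add_argument(
        "export_type", nargs="?", choices=EXPORT_TYPES, default="excel"
    )
    return parser


def resolve_formats(export_type: str) -> list[str]:
    """Expand 'all' into the three concrete export formats."""
    if export_type == "all":
        return ["html", "markdown", "excel"]
    return [export_type]


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.mode, resolve_formats(args.export_type))
```

With positional arguments like these, <export_type> can only be given after <mode>, so choosing a non-default export format also means spelling out the mode.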