Code for finetuning the Economics Transformer.
- Train different baseline models and pass them to Aiden for evaluation
- Ablate over different models and sizes (e.g. Qwen-2.5 Instruct, Llama Instruct models), as well as full finetuning vs. LoRA (need to sweep over rank)
- Integrate other optimizers (e.g. SOAP) into the codebase
- First install `torch`, then run `pip install -r requirements.txt`, then run `pip install -e .`.
- To run the training script, you first need to download the FRED time series data and the metadata. These are available on GCS at `gs://humun-storage/path/in/bucket/final_filtered_FRED_data.csv` and `gs://humun-storage/path/in/bucket/all_fred_metadata.csv`. To download them, refer to the first part of the tutorial written by the Data Collection team, which provides the key file for accessing the GCS bucket as well as a Python example script for downloading both files (a minimal download sketch is also given after this list). You will need to pass the downloaded files as the `raw_data_path` and `metadata_path` arguments to the SFT script (see below).
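If you'd rather not adapt the Data Collection team's script, the sketch below shows one way to fetch both files with the `google-cloud-storage` package. It assumes the service-account key file from the tutorial grants read access to the bucket; the key filename and the local `datasets/` folder are placeholders, and `path/in/bucket` is kept exactly as written above.

```python
# Minimal download sketch for the two FRED files on GCS.
# Assumption: the key file from the Data Collection tutorial is saved as
# "humun-gcs-key.json" (placeholder name) and grants read access to the bucket.
import os

from google.cloud import storage

os.makedirs("datasets", exist_ok=True)

client = storage.Client.from_service_account_json("humun-gcs-key.json")
bucket = client.bucket("humun-storage")

# Blob paths copied from above; substitute the actual prefix from the tutorial
# if "path/in/bucket" differs.
for blob_path in [
    "path/in/bucket/final_filtered_FRED_data.csv",
    "path/in/bucket/all_fred_metadata.csv",
]:
    local_path = os.path.join("datasets", os.path.basename(blob_path))
    bucket.blob(blob_path).download_to_filename(local_path)
    print(f"Downloaded gs://humun-storage/{blob_path} -> {local_path}")
```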
The dataset at `gs://humun-storage/path/in/bucket/final_filtered_FRED_data.csv` is preprocessed in the following ways (see the `get_fred_data` function in `humun_econ_transformer/data/fred_data.py`; a standalone script is also available at `humun_econ_transformer/data/preprocess_fred_data.py`):
- Each time series is split into chunks. `max_prediction_window` specifies the maximum length of the forecast (a random value from 2 to `max_prediction_window` is selected for each chunk), and `context_window = context_multiplier * prediction_window` additional values are provided as context for the forecast.
- The final chunk (when sorted in chronological order) is passed to the test set; `max_cutoff_year` can be set to ensure the test set only contains samples from a specific year onwards. The remaining chunks are used for training. This yields train and test DataFrames with columns `['series_id', 'title', 'history', 'forecast']`, where the `history` column contains the first `context_window` values of the chunk and the `forecast` column contains the remaining `prediction_window` values, each saved as a list of `('YYYY-MM-DD', value)` tuples (see the sketch after this list).
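For concreteness, here is a minimal sketch of the chunking scheme described above. It is not the implementation in `fred_data.py`: the function and variable names are illustrative, the toy series is made up, and details such as `max_cutoff_year` filtering are omitted.

```python
# Illustrative restatement of the chunking described above (not the repo code).
import random

import pandas as pd

def chunk_series(dates, values, series_id, title,
                 max_prediction_window=12, context_multiplier=4):
    """Split one series into (history, forecast) chunks as described above."""
    rows, start, n = [], 0, len(values)
    while True:
        prediction_window = random.randint(2, max_prediction_window)
        context_window = context_multiplier * prediction_window
        end = start + context_window + prediction_window
        if end > n:
            break
        # Each point is stored as a ('YYYY-MM-DD', value) tuple.
        points = list(zip(dates[start:end], values[start:end]))
        rows.append({
            "series_id": series_id,
            "title": title,
            "history": points[:context_window],   # first context_window values
            "forecast": points[context_window:],  # remaining prediction_window values
        })
        start = end
    return rows

# Hypothetical toy series: 240 monthly observations with dummy values.
dates = [d.strftime("%Y-%m-%d") for d in pd.date_range("2000-01-01", periods=240, freq="MS")]
rows = chunk_series(dates, list(range(240)), "EXAMPLE_ID", "Example FRED series")

# The final (most recent) chunk goes to the test split; the rest are training data.
train_df, test_df = pd.DataFrame(rows[:-1]), pd.DataFrame(rows[-1:])
print(train_df.columns.tolist())  # ['series_id', 'title', 'history', 'forecast']
```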
The main training script is `humun_econ_transformer/train_sft.py`. Once you have completed the setup instructions above, you can run it with DeepSpeed via `deepspeed --module humun_econ_transformer.train_sft`. The important arguments to pass when running this script are the following:
- `--raw_data_path`: The path to the raw FRED `final_filtered_FRED_data.csv` file.
- `--metadata_path`: The path to the FRED metadata `.csv` file, e.g. `datasets/all_fred_metadata.csv`.
- `--processed_dataset_path`: An optional folder path where the processed SFT dataset is saved, so it can be retrieved if you run the same script again (processing the dataset takes a while, so this saves time in future runs), e.g. `datasets/processed_split`.
- `--input_key`: The key used by the SFT trainer as the input prompt. This should be `history` for most purposes, but you can modify the default prompt, set this to a different field, and pass in a different `input_key` for training.
- `--output_key`: The key used by the SFT trainer as the expected output, i.e. identifying the tokens that the loss is computed on. This should be `forecast` for most purposes (a conceptual sketch of how these two keys affect loss masking is given after this list).
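The sketch below illustrates the standard SFT convention the descriptions above refer to: tokens from the `input_key` field are masked out of the loss, and the loss is computed only on tokens from the `output_key` field. It is not taken from the repo's trainer, which may use a different prompt template; the tokenizer choice and the example row are assumptions for illustration.

```python
# Conceptual SFT label-masking sketch (not the repo's trainer code).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # example model choice

# Hypothetical row with the columns produced by preprocessing.
row = {
    "history": "[('2019-01-01', 1.5), ('2019-02-01', 1.7)]",
    "forecast": "[('2019-03-01', 1.8)]",
}
input_key, output_key = "history", "forecast"  # the defaults described above

prompt_ids = tokenizer(row[input_key], add_special_tokens=False)["input_ids"]
target_ids = tokenizer(row[output_key], add_special_tokens=False)["input_ids"]

input_ids = prompt_ids + target_ids
labels = [-100] * len(prompt_ids) + target_ids  # loss is computed only on the forecast tokens
assert len(labels) == len(input_ids)
```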
To run locally, an example script is in `scripts/train_sft.sh`. If you're running code on a SLURM cluster, an example sbatch script which allows for multi-GPU training is given in `scripts/slurm_train_sft.sh`.
Pretraining project Notion board: https://www.notion.so/humanity-unleashed/Pretraining-131d57b83b5181ebb282ff6569458c59