When you are installing this for the first time, run:

```sh
conda env create --file environment.yml
```

After that, and on every new session, run:

```sh
conda activate mscan2
```

During development, add dependencies that you care about to environment.yml. Then run:

```sh
conda env update --file environment.yml --prune
```

The SCAN repo gives us the pre-split data, while the MCD splits give a JSON description of how to split. The following script converts the SCAN splits to the same format as MCD:

```sh
./scripts/create_split_files.sh
```

The following translates the English to various languages, then uses the split JSON descriptions to produce datasets for each (language, split) pair:

```sh
./scripts/preprocess_all_scan.sh
```
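
Concretely, a split description lists which example indices belong to the train and test sets. Below is a minimal sketch of applying one such file, assuming MCD-style `trainIdxs`/`testIdxs` keys; the key names and paths are assumptions, not necessarily what the repo's scripts use:

```python
import json

# Sketch only: the key names ("trainIdxs"/"testIdxs") follow the MCD split-file
# convention and may differ from what this repo's scripts actually read.
def apply_split(examples, split_json_path):
    with open(split_json_path) as f:
        split = json.load(f)
    train = [examples[i] for i in split["trainIdxs"]]
    test = [examples[i] for i in split["testIdxs"]]
    return train, test
```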

For each language and split, we can generate a bunch of few-shot prompts:

```sh
python src/generate_prompts.py
```
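
Each prompt pairs a handful of train examples with a held-out query. The sketch below is only illustrative of that idea, using SCAN's `IN:`/`OUT:` notation; the actual template produced by `src/generate_prompts.py` may differ:

```python
# Illustrative only: the real template and separators in src/generate_prompts.py
# may differ from this sketch.
def build_prompt(train_pairs, query, context_size=2):
    lines = [f"IN: {cmd} OUT: {acts}" for cmd, acts in train_pairs[:context_size]]
    lines.append(f"IN: {query} OUT:")
    return "\n".join(lines)
```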

Set an environment variable with your Hugging Face token:

```sh
export HF_TOKEN="hf_..."
```

Or even better, create a .env file with your token (the code will pick it up):
HF_TOKEN="hf_..."Then, to run an experiment, you can run:

Then, to run an experiment, you can run:

```sh
python src/inference.py \
    --model-name "bigscience/bloom" \
    --train data/output/en/simple/train.txt \
    --test data/output/en/simple/test.txt \
    --output data/output/results/playground/results.json \
    --context-size 2 \
    --num-queries 1
```
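
At a high level, the experiment script builds few-shot prompts from the train file and asks the model to complete the test queries. A minimal sketch of one such query against the hosted Hugging Face Inference API follows; the use of `InferenceClient` is an assumption, and `src/inference.py` may query the model differently:

```python
import os

from huggingface_hub import InferenceClient

# Assumption: the model is queried through the hosted Inference API; the repo's
# script may instead run the model locally or via another client.
client = InferenceClient(model="bigscience/bloom", token=os.environ["HF_TOKEN"])

def query(prompt: str) -> str:
    return client.text_generation(prompt, max_new_tokens=128)
```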

To run against OpenAI models instead, set an environment variable with your OpenAI token:

```sh
export OPENAI_API_KEY="sk-..."
```
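
As with the Hugging Face path, the exact OpenAI call the code makes is not shown here; a sketch using the `openai` package, with a placeholder model name:

```python
import os

from openai import OpenAI

# Placeholder sketch: the model name and endpoint used by the repo may differ.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def query(prompt: str) -> str:
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct", prompt=prompt, max_tokens=128
    )
    return response.choices[0].text
```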

Edit condor-run.sh with whatever you want to do in the Condor task, then run:

```sh
condor_submit condor-task.cmd
```

To see the queue and check whether your job is running:

```sh
condor_q
```

To use SLURM, start by generating all the submission scripts:

```sh
python src/generate_slurm.py
```

To run a single SLURM job, do e.g.:

```sh
sbatch scripts/generated/slurm/bloomz/run-bloomz-en-mcd1.slurm
```

Then check job status with:

```sh
squeue --me
# or, for more information:
squeue --me -o "%22S %.12i %.45j %.10T %.10M %.30R" --sort="M,j"
```

And check job output with e.g.:

```sh
# Replace job ID and task ID below
cat /mmfs1/gscratch/clmbr/amelie/projects/thesis_multiling_compos/data/output/results/bloomz/en/mcd1/<task_id>_<job_id>.out
```

To submit all generated jobs for a model at once, run e.g.:

```sh
./scripts/generated/slurm/bloomz/slurm-submit-all.sh
```

Once you've run all experiments, you can aggregate scores into a single CSV file for further analysis. Run:

```sh
python src/aggregate_scores.py
```

This will output a CSV file in data/output/results/aggregated_scores.csv.
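
From there, a quick way to start exploring the results; the column names below are hypothetical, so check the actual CSV header:

```python
import pandas as pd

# Column names ("language", "split", ...) are hypothetical examples.
scores = pd.read_csv("data/output/results/aggregated_scores.csv")
print(scores.groupby(["language", "split"]).mean(numeric_only=True))
```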