Simple indexer to integrate selected datasets from ir_datasets into the ChatNoir search engine.
- Install Python 3.10
- Install pipx
- Install Pipenv
- Install dependencies:
pipenv install
I had problems running it with Pipenv, hence I used the one above.
./main.py
pipenv run python -m chatnoir_ir_datasets_indexer
mkdir .metadata/longeval-sci-2024-11
./manage.py ir_datasets_loader_cli --ir_datasets_id 'longeval-sci/2024-11/train' --output_dataset_path inputs --output_dataset_truth_path truths
Upload data to S3:
wget https://github.com/tira-io/tirex-tracker/releases/download/0.2.7/measure-0.2.7-linux -O tirex-tracker
chmod +x tirex-tracker
./tirex-tracker --poll-intervall 2000 -f object-storage-upload.yml -o 's3cmd put longeval-sci-2024-11-train-corpus.jsonl s3://corpora-tirex-small/longeval-sci-2024-11-train-corpus.jsonl'
Document offsets:
./chatnoir_ir_datasets_indexer/document_offsets.py --docno docno .metadata/longeval-sci-2024-11/longeval-sci-2024-11-train-corpus.jsonl .metadata/longeval-sci-2024-11/longeval-sci-offsets.json.gz
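To illustrate what a document-offsets file can contain, here is a minimal sketch that records the byte range of every line in a JSONL corpus, keyed by document ID. It is not the code of document_offsets.py: the function names `compute_offsets` and `write_offsets` and the output layout (one gzipped JSON object with `start`/`end` per ID) are assumptions made for this example.

```python
import gzip
import json


def compute_offsets(jsonl_path, id_field="docno"):
    """Map each document ID to the byte range of its line in the JSONL file."""
    offsets = {}
    position = 0
    # Binary mode, so positions are byte offsets rather than character counts.
    with open(jsonl_path, "rb") as corpus:
        for line in corpus:
            end = position + len(line)
            if line.strip():  # skip blank lines but still advance the position
                doc = json.loads(line)
                offsets[str(doc[id_field])] = {"start": position, "end": end}
            position = end
    return offsets


def write_offsets(offsets, output_path):
    # Assumed output layout: a single gzipped JSON object keyed by document ID.
    with gzip.open(output_path, "wt", encoding="utf-8") as out:
        json.dump(offsets, out)
```

With such a mapping, a single document can be fetched by seeking to `start` and reading `end - start` bytes, without scanning the whole corpus.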
export ES_PASSWORD=PASSWORD
export ES_USERNAME=USER
python3 main.py \
--data-index chatnoir_data_longeval_sci_2024_11 \
--meta-index chatnoir_meta_longeval_sci_2024_11 \
longeval-sci/2024-11/train
md5sum ~/.ir_datasets/trec-tot/2024/corpus.jsonl
gives: 0c535ac8d5cee481add41543bc8cb854
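If md5sum is not available, the same check can be done from Python before uploading. This is a small sketch using only the standard library; `md5_of_file` and `matches_expected` are hypothetical helper names, and the expected value is the checksum quoted above.

```python
import hashlib

# Checksum reported by md5sum for the trec-tot 2024 corpus.jsonl (see above).
EXPECTED_MD5 = "0c535ac8d5cee481add41543bc8cb854"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Chunked reads keep memory use flat even for large corpora.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_expected(os.path.expanduser("~/.ir_datasets/trec-tot/2024/corpus.jsonl"))` (after `import os`) should return True when the local copy is intact.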
Upload data to S3:
s3cmd mb s3://corpus-trec-tot-2024
s3cmd put corpus.jsonl s3://corpus-trec-tot-2024/corpus.jsonl
Export IR-dataset
Create the documents.jsonl file (within the tira repo; store it in /mnt/ceph/tira for easy re-use):
./src/manage.py ...
Upload to S3:
s3cmd mb s3://corpus-msmarco-passage-v1
s3cmd put /mnt/ceph/tira/data/publicly-shared-datasets/msmarco-passage-trec-dl-v1/documents.jsonl s3://corpus-msmarco-passage-v1/corpus.jsonl
Create document offsets:
./chatnoir_ir_datasets_indexer/document_offsets.py --docno doc_id ~/.ir_datasets/trec-tot/2024/corpus.jsonl trec-tot-offsets.json.gz
./chatnoir_ir_datasets_indexer/document_offsets.py --docno docno /mnt/ceph/tira/data/publicly-shared-datasets/msmarco-passage-trec-dl-v1/documents.jsonl msmarco-v1-passage-offsets.json.gz
./chatnoir_ir_datasets_indexer/document_offsets.py --docno docno /mnt/ceph/tira/data/publicly-shared-datasets/ms-marco-document-v1/documents.jsonl msmarco-v1-document-offsets.json.gz
./chatnoir_ir_datasets_indexer/document_offsets.py --docno docno /mnt/ceph/tira/data/publicly-shared-datasets/ms-marco-document-v2/documents.jsonl msmarco-v2-document-offsets.json.gz
./chatnoir_ir_datasets_indexer/document_offsets.py --docno docno /mnt/ceph/tira/data/publicly-shared-datasets/ms-marco-passage-v2/documents.jsonl msmarco-v2-passage-offsets.json.gz
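The commands above pass a different --docno value depending on the corpus (doc_id for trec-tot, docno for the others). A quick way to see which ID key a corpus uses is to look at its first record. The sketch below does that; `first_record_keys` and `guess_id_field` are hypothetical helpers, not part of this repository, and the candidate list is only an assumption based on the two field names used above.

```python
import json


def first_record_keys(jsonl_path):
    """Return the sorted top-level keys of the first non-empty JSONL record."""
    with open(jsonl_path, "r", encoding="utf-8") as corpus:
        for line in corpus:
            if line.strip():
                return sorted(json.loads(line).keys())
    return []


def guess_id_field(jsonl_path, candidates=("docno", "doc_id")):
    """Return the first candidate ID field present in the first record, or None."""
    keys = first_record_keys(jsonl_path)
    for name in candidates:
        if name in keys:
            return name
    return None
```

The returned name can then be passed as the --docno argument.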
Index:
python3 main.py \
--data-index chatnoir_data_trec_tot_2024 \
--meta-index chatnoir_meta_trec_tot_2024 \
--username USER \
--password PASSWORD \
trec-tot/2024