chatnoir-eu/chatnoir-ir-datasets-indexer

chatnoir-ir-datasets-indexer

Simple indexer to integrate selected datasets from ir_datasets into the ChatNoir search engine.

Installation

  1. Install Python 3.10.
  2. Install pipx.
  3. Install Pipenv.
  4. Install dependencies:
    pipenv install

Usage without Pipenv

I had problems running the indexer through Pipenv, so I ran the entry script directly instead:

./main.py

Usage with Pipenv

pipenv run python -m chatnoir_ir_datasets_indexer

Datasets in progress

LongEval-SCI

mkdir -p .metadata/longeval-sci-2024-11
./manage.py ir_datasets_loader_cli --ir_datasets_id 'longeval-sci/2024-11/train' --output_dataset_path inputs --output_dataset_truth_path truths

Upload data to S3:

wget https://github.com/tira-io/tirex-tracker/releases/download/0.2.7/measure-0.2.7-linux -O tirex-tracker
chmod +x tirex-tracker
./tirex-tracker --poll-intervall 2000 -f object-storage-upload.yml -o 's3cmd put longeval-sci-2024-11-train-corpus.jsonl s3://corpora-tirex-small/longeval-sci-2024-11-train-corpus.jsonl'

Document offsets:

./chatnoir_ir_datasets_indexer/document_offsets.py --docno docno .metadata/longeval-sci-2024-11/longeval-sci-2024-11-train-corpus.jsonl .metadata/longeval-sci-2024-11/longeval-sci-offsets.json.gz

Index:

export ES_PASSWORD=PASSWORD
export ES_USERNAME=USER
python3 main.py \
	--data-index chatnoir_data_longeval_sci_2024_11 \
	--meta-index chatnoir_meta_longeval_sci_2024_11 \
	longeval-sci/2024-11/train

TREC TOT

Running md5sum ~/.ir_datasets/trec-tot/2024/corpus.jsonl should give 0c535ac8d5cee481add41543bc8cb854.
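
The same check can be scripted, for example as a guard before uploading. This is a minimal sketch, assuming the corpus sits at the path above; the helper name md5_of_file is made up for illustration and is not part of this repository:

```python
import hashlib
from pathlib import Path

# Checksum quoted above for trec-tot/2024/corpus.jsonl.
EXPECTED_MD5 = "0c535ac8d5cee481add41543bc8cb854"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory flat."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    corpus = Path.home() / ".ir_datasets" / "trec-tot" / "2024" / "corpus.jsonl"
    if corpus.exists():
        actual = md5_of_file(corpus)
        print("OK" if actual == EXPECTED_MD5 else f"MISMATCH: {actual}")
    else:
        print(f"{corpus} not found")
```

Reading in chunks matters here because the corpus files can be far larger than available memory.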

Upload data to S3:

s3cmd mb s3://corpus-trec-tot-2024
s3cmd put corpus.jsonl s3://corpus-trec-tot-2024/corpus.jsonl

Export the IR dataset

Create the documents.jsonl file (from within the tira repository; store it under /mnt/ceph/tira for easy reuse):

./src/manage.py ...

Upload to S3:

s3cmd mb s3://corpus-msmarco-passage-v1
s3cmd put /mnt/ceph/tira/data/publicly-shared-datasets/msmarco-passage-trec-dl-v1/documents.jsonl s3://corpus-msmarco-passage-v1/corpus.jsonl




Create document offsets:

./chatnoir_ir_datasets_indexer/document_offsets.py --docno doc_id ~/.ir_datasets/trec-tot/2024/corpus.jsonl trec-tot-offsets.json.gz

./chatnoir_ir_datasets_indexer/document_offsets.py --docno docno /mnt/ceph/tira/data/publicly-shared-datasets/msmarco-passage-trec-dl-v1/documents.jsonl msmarco-v1-passage-offsets.json.gz

./chatnoir_ir_datasets_indexer/document_offsets.py --docno docno /mnt/ceph/tira/data/publicly-shared-datasets/ms-marco-document-v1/documents.jsonl msmarco-v1-document-offsets.json.gz

./chatnoir_ir_datasets_indexer/document_offsets.py --docno docno /mnt/ceph/tira/data/publicly-shared-datasets/ms-marco-document-v2/documents.jsonl msmarco-v2-document-offsets.json.gz

./chatnoir_ir_datasets_indexer/document_offsets.py --docno docno /mnt/ceph/tira/data/publicly-shared-datasets/ms-marco-passage-v2/documents.jsonl msmarco-v2-passage-offsets.json.gz
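
For orientation: a document-offsets file records where each document's record starts in the corpus file, so that a single document can be fetched with a seek instead of a full scan. The sketch below only illustrates that general idea and is not a copy of document_offsets.py. The function names (build_offsets, write_offsets, read_document) and the output layout (one JSON object mapping each ID to [start, length]) are assumptions, and the real script's format may differ. The id_field parameter plays the role of the --docno argument.

```python
import gzip
import json


def build_offsets(corpus_path, id_field="docno"):
    """Map each document ID to (byte offset, byte length) of its JSONL line."""
    offsets = {}
    position = 0
    # Binary mode so that positions are true byte offsets, not character counts.
    with open(corpus_path, "rb") as corpus:
        for line in corpus:
            if line.strip():
                doc = json.loads(line)
                offsets[str(doc[id_field])] = (position, len(line))
            position += len(line)
    return offsets


def write_offsets(offsets, output_path):
    """Store the offsets as gzip-compressed JSON (hypothetical layout)."""
    with gzip.open(output_path, "wt", encoding="utf-8") as out:
        json.dump(offsets, out)


def read_document(corpus_path, offsets, doc_id):
    """Fetch one document by seeking straight to its recorded offset."""
    start, length = offsets[doc_id]
    with open(corpus_path, "rb") as corpus:
        corpus.seek(start)
        return json.loads(corpus.read(length))
```

Under these assumptions, build_offsets("corpus.jsonl", id_field="doc_id") would mirror the TREC TOT invocation above, where the ID field is doc_id rather than docno.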


Index:

python3 main.py \
	--data-index chatnoir_data_trec_tot_2024 \
	--meta-index chatnoir_meta_trec_tot_2024 \
	--username USER \
	--password PASSWORD \
	trec-tot/2024
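
The indexing commands in this README supply Elasticsearch credentials in two ways: through the ES_USERNAME and ES_PASSWORD environment variables for LongEval-SCI, and through --username and --password for TREC TOT. How main.py resolves the two is not documented here. The following is only a sketch of one common pattern, flags that fall back to environment variables, using a hypothetical parse_args function that is not taken from this repository:

```python
import argparse
import os


def parse_args(argv=None, env=None):
    """Parse indexer-style arguments; credentials fall back to the environment."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Hypothetical indexer CLI sketch")
    parser.add_argument("--data-index", required=True)
    parser.add_argument("--meta-index", required=True)
    # Explicit flags win; otherwise the environment values are used as defaults.
    parser.add_argument("--username", default=env.get("ES_USERNAME"))
    parser.add_argument("--password", default=env.get("ES_PASSWORD"))
    parser.add_argument("dataset_id", help="ir_datasets ID, e.g. trec-tot/2024")
    args = parser.parse_args(argv)
    if not args.username or not args.password:
        parser.error("credentials required via flags or ES_USERNAME/ES_PASSWORD")
    return args
```

Where both options are available, the environment variables are usually the safer choice, since a password passed as a flag ends up in shell history and is visible in the process list.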
