uniprot

Index/query scripts for UniProtKB datasets

  • index.py: Index UniProtKB xml files

    Tested with the Swiss-Prot dataset only (April 2021 release).

    ./nosqlbiosets/uniprot/index.py --help
    usage: index.py [-h] [--index INDEX] [--doctype DOCTYPE] [--host HOST]
                    [--port PORT] [--db DB]
                    infile
    
    Index UniProt xml files, with Elasticsearch or MongoDB
    
    positional arguments:
      infile             Input file name for UniProt Swiss-Prot compressed xml
                         dataset
    
    optional arguments:
      -h, --help         show this help message and exit
      --index INDEX      Name of the Elasticsearch index or MongoDB database
      --doctype DOCTYPE  Document type name for Elasticsearch, collection name for
                         MongoDB
      --host HOST        Elasticsearch or MongoDB server hostname
      --port PORT        Elasticsearch or MongoDB server port number
      --db DB            Database: 'Elasticsearch' or 'MongoDB'
    
  • query.py: Query API, at an early stage of development

  • ../../tests/test_uniprot_queries.py: Tests for the query API
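The help text of index.py shown above has the shape of argparse output. As an illustration only, here is a minimal sketch of a parser that accepts the same options. It is not the actual source of index.py: the function name `build_parser` is hypothetical, and no default values are set here because the real defaults come from the project's configuration.

```python
import argparse


def build_parser():
    """Build a parser that accepts the options listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="index.py",
        description="Index UniProt xml files, with Elasticsearch or MongoDB")
    parser.add_argument(
        "infile",
        help="Input file name for UniProt Swiss-Prot compressed xml dataset")
    parser.add_argument(
        "--index",
        help="Name of the Elasticsearch index or MongoDB database")
    parser.add_argument(
        "--doctype",
        help="Document type name for Elasticsearch, collection name for MongoDB")
    parser.add_argument(
        "--host", help="Elasticsearch or MongoDB server hostname")
    parser.add_argument(
        "--port", help="Elasticsearch or MongoDB server port number")
    parser.add_argument(
        "--db", help="Database: 'Elasticsearch' or 'MongoDB'")
    return parser


if __name__ == "__main__":
    # Assumption: options that are not given stay None in this sketch
    args = build_parser().parse_args()
    print(args)
```

Options that are not given on the command line are left as None in this sketch, so a caller can tell an explicit value apart from a missing one.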

Usage

Example command lines for downloading the uniprot_sprot.xml.gz file and indexing it:

Download UniProt/Swiss-Prot data set

mkdir -p data
# ~760 MB (compressed), ~173.5 million lines, ~565,000 entries
wget -nc -P ./data ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/\
knowledgebase/complete/uniprot_sprot.xml.gz
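
Before starting a long indexing run it can be useful to check that the download is complete. The following is a minimal sketch, not part of this project, that streams the compressed file and counts its `<entry>` elements without loading the whole document into memory. The function name `count_entries` is hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET

# XML namespace used by UniProtKB XML files
UNIPROT_NS = "{http://uniprot.org/uniprot}"


def count_entries(path):
    """Stream a gzip-compressed UniProt XML file and count <entry> elements."""
    count = 0
    with gzip.open(path, "rb") as infile:
        # "end" events fire once an element has been read completely
        for _event, elem in ET.iterparse(infile, events=("end",)):
            if elem.tag == UNIPROT_NS + "entry":
                count += 1
                # Drop the entry's children and text to limit memory use
                elem.clear()
    return count


if __name__ == "__main__":
    print(count_entries("./data/uniprot_sprot.xml.gz"))
```

For a complete Swiss-Prot download the count should be close to the number of entries quoted in the comment above; a parse error or a much smaller number suggests a truncated file.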

Index with Elasticsearch or MongoDB

If you have not yet installed the nosqlbiosets project, see the Installation section of the readme.md file in the project's main folder.

Default server connection settings are read from ../../conf/dbservers.json

# Index with Elasticsearch, typically requires about 1 to 8 hours
./nosqlbiosets/uniprot/index.py ./data/uniprot_sprot.xml.gz\
 --host localhost --db Elasticsearch --index uniprot

# Index with MongoDB, typically requires about 1 to 2 hours
./nosqlbiosets/uniprot/index.py ./data/uniprot_sprot.xml.gz\
 --host localhost --db MongoDB --index biosets

Index/query scripts for InterPro dataset

# Index with Elasticsearch, takes about 10 minutes
./nosqlbiosets/uniprot/interpro.py \
   ~/data/interpro/interpro.xml.gz\
   --esindex interpro\
   --dbtype Elasticsearch --recreateindex true\
   --host localhost

# Index with MongoDB, takes about 3 minutes
./nosqlbiosets/uniprot/interpro.py \
   ~/data/interpro/interpro.xml.gz\
   --dbtype MongoDB --recreateindex true\
   --mdbdb=biosets --mdbcollection interpro\
   --host localhost

PSI-MI TAB support

This folder also includes an index script for PSI-MI TAB protein interaction data files.

Links for the PSI-MI TAB format

# Download the HIPPIE protein interaction data set in PSI-MI TAB format
wget -P ./data http://cbdm-01.zdv.uni-mainz.de/~mschaefer/hippie/HIPPIE-current.mitab.txt

# Index with Elasticsearch
./nosqlbiosets/uniprot/index_mitab.py --infile ./data/HIPPIE-current.mitab.txt\
 --db Elasticsearch

# Index with MongoDB
./nosqlbiosets/uniprot/index_mitab.py --infile ./data/HIPPIE-current.mitab.txt\
 --db MongoDB

HIPPIE indexing takes about 8 minutes with MongoDB and about 2 minutes with Elasticsearch.
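
For readers unfamiliar with the format, the following minimal sketch shows how a PSI-MI TAB 2.5 file can be read in Python. It is not the code of index_mitab.py. The dictionary keys are names chosen for this sketch, and it assumes the 15 standard tab-separated MITAB 2.5 columns, with `|` separating multiple values and `-` marking an empty field. Lines starting with `#` and rows with fewer than 15 columns are skipped; a header row without a leading `#` would need separate handling.

```python
import csv

# Key names for the 15 standard PSI-MI TAB 2.5 columns, in file order
# (the names themselves are this sketch's own choice)
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_methods", "first_authors", "publication_ids",
    "taxid_a", "taxid_b", "interaction_types", "source_dbs",
    "interaction_ids", "confidence_scores",
]


def split_multi(value):
    """Split a '|'-separated MITAB field; '-' marks an empty field."""
    return [] if value in ("", "-") else value.split("|")


def read_mitab(lines):
    """Yield one dict per interaction row from an iterable of text lines."""
    # QUOTE_NONE keeps the double quotes that appear inside MITAB values
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < len(MITAB25_COLUMNS) or row[0].startswith("#"):
            continue  # skip short rows and '#' header or comment lines
        # zip() ignores any columns beyond the first 15
        yield {name: split_multi(value)
               for name, value in zip(MITAB25_COLUMNS, row)}


if __name__ == "__main__":
    with open("./data/HIPPIE-current.mitab.txt", newline="") as infile:
        for interaction in read_mitab(infile):
            print(interaction["id_a"], interaction["id_b"])
            break
```

Each yielded dictionary maps a column name to a list of values, which is a convenient shape for documents stored in MongoDB or Elasticsearch. The simple split on `|` does not handle a `|` character inside a quoted value.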