SeqRepo is a Python package for storing and reading a local collection of biological sequences. The repository is non-redundant, compressed, and journalled, making it efficient to store and transfer multiple snapshots.
Specific, named biological sequences provide the reference and coordinate system for communicating variation and consequential phenotypic changes. Several databases of sequences exist, with significant overlap, all using distinct names. Furthermore, these systems are often difficult to install locally.
SeqRepo provides an efficient, non-redundant and indexed storage system for biological sequences. Clients refer to sequences and metadata using familiar identifiers, such as NM_000551.3 or GRCh38:1, or any of several hash-based identifiers. The interface supports fast slicing of arbitrary regions of large sequences.
A "fully-qualified" identifier includes a namespace to disambiguate accessions from different origins or sequence sets (e.g., "1" in GRCh37 and GRCh38). If the namespace is provided, seqrepo uses it as-is; if the namespace is not provided and the unqualified identifier refers to a unique sequence, it is returned; otherwise, the use of ambiguous identifiers raise an error.
SeqRepo favors namespaces from identifiers.org whenever available. Examples include refseq and ensembl.
seqrepo-rest-service provides a REST interface and docker image.
Released under the Apache License, 2.0.
|
|
| ChangeLog
Hart RK, Prlić A (2020). SeqRepo: A system for managing local collections of biological sequences. PLoS ONE 15(12): e0239883. https://doi.org/10.1371/journal.pone.0239883
- Timestamped, read-only snapshots.
- Space-efficient storage of sequences within a single snapshot and across snapshots.
- Bandwidth-efficient transfer incremental updates.
- Fast fetching of sequence slices on chromosome-scale sequences.
- Precomputed digests that may be used as sequence aliases.
- Mappings of external aliases (i.e., accessions or identifiers like
NM_013305.4
) to sequences.
- Local read-only archive, mirrored from public site, accessed via Python API (see Mirroring documentation)
- Local read-write archive, maintained with command line utility and/or API (see Command Line Interface documentation).
- Docker data-only container that may be linked to application container.
- SeqRepo and refget REST API for local or remote access (see seqrepo-rest-service)
Within a single snapshot, sequences are stored non-redundantly and compressed in an add-only journalled filesystem structure. A truncated SHA-512 hash is used to assess uniquness and as an internal id. (The digest is truncated for space efficiency.)
Sequences are compressed using the Block GZipped Format (BGZF)), which enables pysam to provide fast random access to compressed sequences. (Variable compression typically makes random access impossible.)
Sequence files are immutable, thereby enabling the use of hardlinks across
snapshots and eliminating redundant transfers (e.g., with rsync
).
Each sequence id is associated with a namespaced alias in a sqlite database.
Such as <seguid,rvvuhY0FxFLNwf10FXFIrSQ7AvQ>
, <NCBI,NP_004009.1>
,
<gi,5032303>
, <ensembl-75ENSP00000354464>
, <ensembl-85,ENSP00000354464.4>
.
The sqlite database is mutable across releases.
For calibration, recent releases that include 3 human genome assemblies (including patches), and full RefSeq sets (NM, NR, NP, NT, XM, and XP) consumes approximately 8GB. The minimum marginal size for additional snapshots is approximately 2GB (for the sqlite database, which is not hardlinked).
For more information, see docs/design.rst.
Reading a sequence repository requires several Python packages, all of which are
available from pypi. Installation should be as simple as pip install biocommons.seqrepo
.
Acquiring SeqRepo snapshots using the CLI requires an rsync binary. Note that openrsync, which now ships with new MacOS installs, does not support all required functions. Mac users should install rsync from HomeBrew and use the --rsync-exe
option to declare its exact location.
Writing sequence files also requires bgzip
, which provided in the
htslib repo. Ubuntu users should install
the tabix
package with sudo apt install tabix
.
Development and deployments are on Ubuntu. Other systems may work but are not tested. Patches to get other systems working would be welcomed.
$ brew install python htslib
$ sudo apt install -y python3-dev gcc zlib1g-dev tabix
$ python -m venv venv
$ source venv/bin/activate
$ pip install seqrepo
$ sudo mkdir -p /usr/local/share/seqrepo
$ sudo chown $USER /usr/local/share/seqrepo
$ seqrepo pull -i 2024-12-20
$ seqrepo show-status -i 2024-12-20
seqrepo 0.6.12.dev7+gd311e3e.d20250730
instance directory: /usr/local/share/seqrepo/2024-12-20, 13.5 GB
backends: fastadir (schema 1), seqaliasdb (schema 1)
sequences: 1144093 sequences, 128795051613 residues, 435 files
aliases: 5865328 aliases, 5865328 current, 35 namespaces, 1144093 sequences
# Simple Pythonic interface to sequences
>> from biocommons.seqrepo import SeqRepo
>> sr = SeqRepo("/usr/local/share/seqrepo/2024-12-20")
>> sr["NC_000001.11"][780000:780020]
'TGGTGGCACGCGCTTGTAGT'
# Or, use the seqrepo shell for even easier access
$ seqrepo start-shell -i 2024-12-20
In [1]: sr["NC_000001.11"][780000:780020]
Out[1]: 'TGGTGGCACGCGCTTGTAGT'
# N.B. The following output is edited for simplicity
$ seqrepo export -i 2024-12-20 | head -n100
>MD5:611ff0945aa9eeaaf9ef908d0e744cd0 SEGUID:mirLo912A/MppLuS1cUyFMduLUQ SHA1:9a2acba3dd7603f329a4bb92d5c53214c76e2d44 VMC:GS_---7nAwbv5Fs2Ml2-k3X6Zvj-6ZcjeD3 ga4gh:SQ.---7nAwbv5Fs2Ml2-k3X6Zvj-6ZcjeD3 sha512t24u:---7nAwbv5Fs2Ml2-k3X6Zvj-6ZcjeD3
MDSPLREDDSQTCARLWEAEVKRHSLEGLTVFGTAVQIHNVQRRAIRAKGTQEAQAELLCRGPRLLDRFLEDACILKEGRGTDTGQHCRGDARISSHLEA
SGTHIQLLALFLVSSSDTPPSLLRFCHALEHDIRYNSSFDSYYPLSPHSRHNDDLQTPSSHLGYIITVPDPTLPLTFASLYLGMAPCTSMGSSSMGIFQS
QRIHAFMKGKNKWDEYEGRKESWKIRSNSQTGEPTF
>MD5:8ca1247fe64b17b9d40c2112a8bfc3a2 NCBI:XM_017008743.2 SEGUID:8cijSTmR/FOL+Gtq9gf6JlbRcvY SHA1:f1c8a3493991fc538bf86b6af607fa2656d172f6 VMC:GS_---BUlBwgZN_r5wSII-WCNDd9nn1Owj4 ga4gh:SQ.---BUlBwgZN_r5wSII-WCNDd9nn1Owj4 refseq:XM_017008743.2 sha512t24u:---BUlBwgZN_r5wSII-WCNDd9nn1Owj4
ACTTATGGAAAACAGTGTGGCATATTCTGCTGAGCTTCGCCCTGGAAGAAGCCTCTTTTATACATCTCTTCAGGGAAGAGAGAAGCAATGGGCATGTTAG
TATACAATGATCACAGCCACGCAGGCCTGCAAGCTGCCTTTTGGACAGGCTGTTGACTGCCGTTCCAATTAGCTGATTGGAGAATGTGGAATGCAGAGTG
ATAATGCTGCATATCTGCTATCAGGCAGCAGCAAAGGTTTTTGTCTTGGGAAGGCAAGCTTTCCCTGCAATATTATCTCAGCAGCTCCCTAGCTGCTTAC
See Installation and Mirroring for more information.
SEQREPO_LRU_CACHE_MAXSIZE sets the lru_cache maxsize for the sqlite query response caching. It defaults to 1 million but can also be set to "none" to be unlimited.
SEQREPO_FD_CACHE_MAXSIZE sets the lru_cache size for file handler caching during FASTA sequence retrievals. It defaults to 0 to disable any caching, but can be set to a specific value or "none" to be unlimited. Using a moderate value (>10) will greatly increase performance of sequence retrieval.
brew install python htslib
If you get "xcrun: error: invalid active developer path", you need to install XCode. See this StackOverflow answer.
sudo apt install -y python3-dev gcc zlib1g-dev tabix
Here's how to get started developing:
make devready
source venv/bin/activate
seqrepo --version
Code reformatting:
make reformat
Install pre-commit hook:
# included in `make devready`, not necessary for new installations
pre-commit install
Docker images are available at https://hub.docker.com/r/biocommons/seqrepo. Tags correspond to the version of data, not the version of seqrepo, because the intent is to make it easy to depend on a local version of seqrepo files. Each docker image is an installation of seqrepo that downloads the corresponding version of seqrepo data. When used in conjunction with docker volumes for persistence, this provides an easy way to incorporate seqrepo data into a docker stack.
cd misc/docker
make 2021-01-29.log # builds and pushes to hub.docker.com (i.e., you need creds)