@reece reece commented Mar 5, 2025

Lets seqfetcher callers pass additional arguments to seqfetcher, in particular the sequence type to fetch for Ensembl transcript (ENST) sequences. To address #75, the default ENST type is "cdna", although this differs from the default in Ensembl's REST API (genomic). The default type for ENST sequences may also be set with the ENST_DEFAULT_TYPE environment variable.

Examples

>>> from bioutils import seqfetcher
>>> import os

# set default type to cds
>>> os.environ["ENST_DEFAULT_TYPE"] = "cds"
>>> s_def = seqfetcher.fetch_seq("ENST00000617537")
>>> s_gen = seqfetcher.fetch_seq("ENST00000617537", type="genomic")
>>> s_cdna = seqfetcher.fetch_seq("ENST00000617537", type="cdna")
>>> s_cds = seqfetcher.fetch_seq("ENST00000617537", type="cds")
>>> len(s_def), len(s_gen), len(s_cdna), len(s_cds)
(1728, 211554, 2385, 1728)
# note that default corresponds to the cds size

# now remove the default type; seqfetcher ENST default type is "cdna"
# warning is emitted when falling back to default
>>> del os.environ["ENST_DEFAULT_TYPE"]
>>> s_def = seqfetcher.fetch_seq("ENST00000617537")
ENST00000617537: Transcript type not specified or set in ENST_DEFAULT_TYPE; assuming cdna
>>> s_gen = seqfetcher.fetch_seq("ENST00000617537", type="genomic")
>>> s_cdna = seqfetcher.fetch_seq("ENST00000617537", type="cdna")
>>> s_cds = seqfetcher.fetch_seq("ENST00000617537", type="cds")
>>> len(s_def), len(s_gen), len(s_cdna), len(s_cds)
(2385, 211554, 2385, 1728)
# note that the default size is that of cdna
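
The precedence shown in the examples above (explicit argument first, then ENST_DEFAULT_TYPE, then "cdna" with a warning) could be sketched roughly as follows. This is an illustration only, not the code in this PR: `resolve_enst_type`, `seq_type`, `VALID_ENST_TYPES`, and `FALLBACK_ENST_TYPE` are hypothetical names, and the validation step is an assumption rather than documented seqfetcher behavior.

```python
import logging
import os

_logger = logging.getLogger(__name__)

# Hypothetical constants for illustration; not taken from the bioutils source.
VALID_ENST_TYPES = ("genomic", "cdna", "cds")
FALLBACK_ENST_TYPE = "cdna"


def resolve_enst_type(ac, seq_type=None, environ=None):
    """Pick the sequence type to request for an ENST accession.

    Precedence: explicit argument, then ENST_DEFAULT_TYPE, then "cdna".
    `seq_type` stands in for the `type` keyword used in the examples above.
    """
    environ = os.environ if environ is None else environ

    # 1. explicit argument, 2. environment variable
    seq_type = seq_type or environ.get("ENST_DEFAULT_TYPE")

    # 3. fall back to cdna, warning the caller that a default was assumed
    if not seq_type:
        _logger.warning(
            "%s: Transcript type not specified or set in ENST_DEFAULT_TYPE; assuming %s",
            ac,
            FALLBACK_ENST_TYPE,
        )
        seq_type = FALLBACK_ENST_TYPE

    # Assumption: reject unknown values early instead of passing them to the REST API.
    if seq_type not in VALID_ENST_TYPES:
        raise ValueError(f"{ac}: unsupported sequence type {seq_type!r}")
    return seq_type
```

Passing `environ` explicitly keeps the sketch testable without touching the real process environment; a real implementation might simply read `os.environ` directly.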

Questions

  • What do you think of overriding the Ensembl API default (genomic) with cdna?

@reece reece requested a review from a team as a code owner March 5, 2025 05:05
@reece reece linked an issue Mar 5, 2025 that may be closed by this pull request
reece commented Mar 5, 2025

@jsstevenson @davmlaw: Comments please ↑

@jsstevenson jsstevenson left a comment

thanks! looks good, just a minor note about either altering the docstring in _fetch_seq_ensembl or the type argument name

davmlaw commented Mar 6, 2025

I'm happy with default cds. It's what I think of as "transcript" and how RefSeq behaves

Successfully merging this pull request may close these issues.

Ensembl transcript ENST00000617537.5 sequence is genomic not cdna