Skip to content

USCbiostats/annoq-py

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

annoq-py

A Python package for programmatically accessing SNP data from the AnnoQ API.

Installation

Install directly from GitHub using pip:

pip install git+https://github.com/USCbiostats/annoq-py.git

Requirements

  • Python 3.7 or higher

Quick Start

import annoq

# Get available SNP attributes
attributes = annoq.get_snp_attributes()

# Search SNPs on chromosome 1
snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=100000,
    fields=["chr", "pos", "ref", "alt", "rs_dbSNP151"]
)

Core Functions

The package provides 7 main functions organized into three categories:

Attribute Discovery

  • get_snp_attributes() - List all available SNP attributes

SNP Retrieval

  • get_snps_by_chr() - Query by chromosome and position range
  • get_snps_by_rsid_list() - Query by RSID identifiers
  • get_snps_by_gene_product() - Query by gene information

SNP Counting

  • count_snps_by_chr() - Count SNPs by chromosome
  • count_snps_by_rsid_list() - Count SNPs by RSID list
  • count_snps_by_gene_product() - Count SNPs by gene

Detailed Usage

1. Getting SNP Attributes

Retrieve the list of all available SNP attributes that can be queried.

import annoq

# Get all available attributes
attributes = annoq.get_snp_attributes()

# attributes is a list of dictionaries with attribute metadata
for attr in attributes:
    print(f"{attr['label']}: {attr['description']}")

2. Querying SNPs by Chromosome

Search for SNPs within a specific chromosome region.

Basic Usage

# Query chromosome 1 from position 1 to 100,000 and get basic fields
snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=100000,
    fields=["chr", "pos", "ref", "alt", "rs_dbSNP151"]
)

# Query the X chromosome from position 1,000 to 50,000 and get basic default fields
snps = annoq.get_snps_by_chr(
    chromosome_identifier="X",
    start_position=1000,
    end_position=50000
)

Selecting Specific Fields

You can specify which fields to return in three different ways:

As a list of field names:

snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=10000,
    fields=["chr", "pos", "ref", "alt", "rs_dbSNP151"]
)

As a string config exported from AnnoQ:

snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=10000,
    fields='{"_source":["chr", "pos", "ref", "alt", "rs_dbSNP151"]}'
)

From a JSON config exported from AnnoQ:

# Export the config file: config.txt from AnnoQ
# {"_source":["chr", "pos", "ref", "alt", "rs_dbSNP151"]}

snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=10000,
    fields="/path/to/config.txt"
)

Note: The maximum number of fields you can request is 20. For more fields you can make multiple queries and combine the results.

Filtering by Non-Empty Fields

Return only SNPs where specific annotation fields have values:

snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=100000,
    filter_fields=["ANNOVAR_ucsc_Transcript_ID", "VEP_ensembl_Gene_ID"]
)

Pagination

By default, the API returns 1,000 results per page with a maximum of 10,000 results across all pages.

# Get first 500 results
snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=1000000,
    pagination_from=0,
    pagination_size=500
)

# Get next 500 results
snps_page2 = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=1000000,
    pagination_from=500,
    pagination_size=500
)

# Note: pagination_from + pagination_size must be <= 10,000

Fetching All Results

To retrieve all matching SNPs (up to 1,000,000), use fetch_all=True:

# This will download all matching SNPs
all_snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=100000,
    fetch_all=True
)

# When fetch_all=True, the pagination parameters are ignored

Important: When fetch_all=True, the function downloads a lot of data in a different format and may take a long time for large result sets.

3. Querying SNPs by RSID

Search for SNPs using RSID identifiers.

Basic Usage

# Using a comma-separated string
snps = annoq.get_snps_by_rsid_list(
    rsid_list="rs1219648,rs2912774,rs2981582"
)

# Using a list
snps = annoq.get_snps_by_rsid_list(
    rsid_list=["rs1219648", "rs2912774", "rs2981582"]
)

With Custom Fields

snps = annoq.get_snps_by_rsid_list(
    rsid_list=["rs1219648", "rs2912774", "rs2981582"],
    fields=["chr", "pos", "ref", "alt", "rs_dbSNP151"]
)

With Filtering

snps = annoq.get_snps_by_rsid_list(
    rsid_list="rs1219648,rs2912774,rs2981582",
    filter_fields=["VEP_ensembl_Gene_ID"],
    pagination_from=0,
    pagination_size=100
)

Fetching All Matching RSIDs

# Get all SNPs for a large list of RSIDs
all_snps = annoq.get_snps_by_rsid_list(
    rsid_list=["rs1219648", "rs2912774", "rs2981582", "rs123456", "rs789012"],
    fetch_all=True
)

4. Querying SNPs by Gene Product

Search for SNPs associated with a gene using gene ID, gene symbol, or UniProt ID.

Basic Usage

# Search by gene symbol
snps = annoq.get_snps_by_gene_product(gene="BRCA1")

# Search by gene ID or UniProt ID
snps = annoq.get_snps_by_gene_product(gene="ENSG00000012048")

With Custom Fields and Filtering

snps = annoq.get_snps_by_gene_product(
    gene="TP53",
    fields=["chr", "pos", "ref", "alt", "rs_dbSNP151"],
    filter_fields=["ANNOVAR_ucsc_Transcript_ID"]
)

With Pagination

# Get first 500 SNPs for a gene
snps = annoq.get_snps_by_gene_product(
    gene="APOE",
    pagination_from=0,
    pagination_size=500
)

Fetching All Gene-Associated SNPs

# Get all SNPs associated with a gene
all_snps = annoq.get_snps_by_gene_product(
    gene="ZMYND11",
    fetch_all=True
)

5. Counting SNPs

Count functions return the number of matching SNPs without retrieving the actual data.

Count by Chromosome

# Count all SNPs in a region
count = annoq.count_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=100000
)
print(f"Found {count} SNPs")

# Count with filters
count = annoq.count_snps_by_chr(
    chromosome_identifier="X",
    start_position=1000,
    end_position=50000,
    filter_fields=["VEP_ensembl_Gene_ID", "ANNOVAR_ucsc_Transcript_ID"]
)

Count by RSID List

# Count matching RSIDs
count = annoq.count_snps_by_rsid_list(
    rsid_list=["rs1219648", "rs2912774", "rs2981582"]
)

# Count with filters
count = annoq.count_snps_by_rsid_list(
    rsid_list="rs1219648,rs2912774,rs2981582",
    filter_fields=["ANNOVAR_ucsc_Transcript_ID"]
)

Count by Gene Product

# Count SNPs for a gene
count = annoq.count_snps_by_gene_product(gene="BRCA1")

# Count with filters
count = annoq.count_snps_by_gene_product(
    gene="TP53",
    filter_fields=["VEP_ensembl_Gene_ID"]
)

Common Patterns

Example 1: Progressive Filtering

# First, count to see how many SNPs match
total = annoq.count_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=1000000
)
print(f"Total SNPs: {total}")

# Count with filters applied
filtered_count = annoq.count_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=1000000,
    filter_fields=["VEP_ensembl_Gene_ID"]
)
print(f"Filtered SNPs: {filtered_count}")

# Retrieve the filtered data
snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=1000000,
    filter_fields=["VEP_ensembl_Gene_ID"],
    fields=["chr", "pos", "ref", "alt", "rs_dbSNP151", "VEP_ensembl_Gene_ID"]
)

Example 2: Working with Large Datasets

# For large regions, first check the count
count = annoq.count_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=10000000
)

if count > 1000000:
    print(f"Warning: {count} SNPs found. Consider narrowing your search.")
elif count > 10000:
    # Use fetch_all for counts between 10K and 100K
    snps = annoq.get_snps_by_chr(
        chromosome_identifier="1",
        start_position=1,
        end_position=10000000,
        fetch_all=True
    )
else:
    # Use regular pagination for smaller datasets
    snps = annoq.get_snps_by_chr(
        chromosome_identifier="1",
        start_position=1,
        end_position=10000000,
        pagination_size=count # Get all in one go
    )

Example 3: Gene-Focused Analysis

# Get all SNPs for multiple genes
genes = ["BRCA1", "BRCA2", "TP53"]
all_gene_snps = {}

for gene in genes:
    count = annoq.count_snps_by_gene_product(gene=gene)
    print(f"{gene}: {count} SNPs")
    
    all_gene_snps[gene] = annoq.get_snps_by_gene_product(
        gene=gene,
        fields=["chr", "pos", "ref", "alt", "rs_dbSNP151"],
        fetch_all=True
    )

Example 4: Batch RSID Lookup

# Read RSIDs from a file
with open("rsid_list.txt", "r") as f:
    rsids = [line.strip() for line in f if line.strip()]

# Check how many exist in the database
count = annoq.count_snps_by_rsid_list(rsid_list=rsids)
print(f"{count} out of {len(rsids)} RSIDs found")

# Retrieve all matching SNPs
snps = annoq.get_snps_by_rsid_list(
    rsid_list=rsids,
    fields=["chr", "pos", "ref", "alt", "rs_dbSNP151"],
    fetch_all=True
)

Important Limitations

Pagination Constraints

  • Regular queries: Maximum of 10,000 results across all pages (pagination_from + pagination_size ≤ 10,000)
  • Fetch all queries: Maximum of 1,000,000 total results.
  • Note: For large datasets, the results may be too large and could lead to performance issues. It is recommended to narrow down the query if possible.

Field Selection

  • Maximum of 20 fields can be requested per query
  • Use the get_snp_attributes() function to see all available fields

Rate Limiting

  • The API may implement rate limiting for excessive requests
  • Use count functions before large retrievals to estimate data size

Error Handling

All functions raise exceptions for common error cases:

try:
    snps = annoq.get_snps_by_chr(
        chromosome_identifier="1",
        start_position=1,
        end_position=100000,
        pagination_from=9500,
        pagination_size=1000  # This exceeds the 10,000 limit
    )
except ValueError as e:
    print(f"Pagination error: {e}")

try:
    snps = annoq.get_snps_by_chr(
        chromosome_identifier="1",
        start_position=1,
        end_position=100000,
        fields="/nonexistent/file.json"
    )
except ValueError as e:
    print(f"File error: {e}")

try:
    snps = annoq.get_snps_by_chr(
        chromosome_identifier="invalid",
        start_position=1,
        end_position=100000
    )
except requests.exceptions.HTTPError as e:
    print(f"API error: {e}")

Contributing

Contributions are welcome! If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.

License

This package is licensed under the MIT License.

Support

For questions or issues related to AnnoQ itself, please visit the site AnnoQ

About

Python package for annoq-api-v2

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages