A Python package for programmatically accessing SNP data from the AnnoQ API.
Install directly from GitHub using pip:
pip install git+https://github.com/USCbiostats/annoq-py.git

Requires Python 3.7 or higher.
import annoq
# Get available SNP attributes
attributes = annoq.get_snp_attributes()
# Search SNPs on chromosome 1
snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=100000,
    fields=["chr", "pos", "ref", "alt", "rs_dbSNP151"]
)

The package provides 7 main functions organized into three categories:

- get_snp_attributes(): List all available SNP attributes

- get_snps_by_chr(): Query by chromosome and position range
- get_snps_by_rsid_list(): Query by RSID identifiers
- get_snps_by_gene_product(): Query by gene information

- count_snps_by_chr(): Count SNPs by chromosome
- count_snps_by_rsid_list(): Count SNPs by RSID list
- count_snps_by_gene_product(): Count SNPs by gene
Retrieve the list of all available SNP attributes that can be queried.
import annoq
# Get all available attributes
attributes = annoq.get_snp_attributes()
# attributes is a list of dictionaries with attribute metadata
for attr in attributes:
    print(f"{attr['label']}: {attr['description']}")

Search for SNPs within a specific chromosome region.

# Query chromosome 1 from position 1 to 100,000 and get basic fields
snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=100000,
    fields=["chr", "pos", "ref", "alt", "rs_dbSNP151"]
)
# Query the X chromosome from position 1,000 to 50,000 and return the default fields
snps = annoq.get_snps_by_chr(
    chromosome_identifier="X",
    start_position=1000,
    end_position=50000
)

You can specify which fields to return in three different ways:
As a list of field names:
snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=10000,
    fields=["chr", "pos", "ref", "alt", "rs_dbSNP151"]
)

As a string config exported from AnnoQ:
snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=10000,
    fields='{"_source":["chr", "pos", "ref", "alt", "rs_dbSNP151"]}'
)

From a JSON config file exported from AnnoQ:
# Export the config file (config.txt) from AnnoQ; it contains:
# {"_source":["chr", "pos", "ref", "alt", "rs_dbSNP151"]}
snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=10000,
    fields="/path/to/config.txt"
)

Note: The maximum number of fields you can request is 20. To retrieve more fields, make multiple queries and combine the results.
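
The 20-field limit means that a wider annotation set has to be split across several calls. The following is a minimal sketch of that approach and is not part of the annoq package: chunk_fields and merge_records are hypothetical helpers, and the sketch assumes that each query returns a list of dictionaries (one per SNP) in which chr, pos, ref and alt together identify a variant. The key fields are repeated in every batch so that the partial records can be joined afterwards.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

MAX_FIELDS = 20  # per-query field limit stated in this README
KEY_FIELDS = ("chr", "pos", "ref", "alt")  # assumption: these identify one SNP


def chunk_fields(fields: Sequence[str], max_fields: int = MAX_FIELDS) -> List[List[str]]:
    """Split fields into batches of at most max_fields, each starting with KEY_FIELDS."""
    room = max_fields - len(KEY_FIELDS)
    if room <= 0:
        raise ValueError("max_fields must exceed the number of key fields")
    # Drop duplicates (keeping order) and the key fields, which are added to every batch.
    extra = [f for f in dict.fromkeys(fields) if f not in KEY_FIELDS]
    batches = [list(KEY_FIELDS) + extra[i:i + room] for i in range(0, len(extra), room)]
    return batches or [list(KEY_FIELDS)]


def merge_records(batches: Iterable[Iterable[dict]]) -> List[dict]:
    """Merge per-batch records that share the same KEY_FIELDS values."""
    merged: Dict[Tuple, dict] = {}
    for records in batches:
        for record in records:
            key = tuple(record.get(k) for k in KEY_FIELDS)
            merged.setdefault(key, {}).update(record)
    return list(merged.values())


# Usage sketch (needs network access and the annoq package; wanted_fields is your own list):
# results = [
#     annoq.get_snps_by_chr(chromosome_identifier="1", start_position=1,
#                           end_position=10000, fields=batch)
#     for batch in chunk_fields(wanted_fields)
# ]
# snps = merge_records(results)
```

If the records are returned in a different shape, only merge_records needs to change; the batching logic depends solely on the documented field limit.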
Return only SNPs where specific annotation fields have values:
snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=100000,
    filter_fields=["ANNOVAR_ucsc_Transcript_ID", "VEP_ensembl_Gene_ID"]
)

By default, the API returns 1,000 results per page with a maximum of 10,000 results across all pages.
# Get first 500 results
snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=1000000,
    pagination_from=0,
    pagination_size=500
)
# Get next 500 results
snps_page2 = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=1000000,
    pagination_from=500,
    pagination_size=500
)
# Note: pagination_from + pagination_size must be <= 10,000

To retrieve all matching SNPs (up to 1,000,000), use fetch_all=True:
# This will download all matching SNPs
all_snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=100000,
    fetch_all=True
)
# When fetch_all=True, the pagination parameters are ignored

Important: With fetch_all=True, the function downloads the full result set in a different format, which may take a long time for large result sets.
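
For result sets that fit under the 10,000-result cap, the pagination parameters described in this section can be driven by a small loop instead of hand-written calls. The sketch below is an illustration rather than part of the package: page_windows is a hypothetical helper that only computes (pagination_from, pagination_size) pairs, using the 10,000-result limit and the 1,000-result default page size quoted in this README.

```python
from typing import Iterator, Tuple

MAX_WINDOW = 10_000  # pagination_from + pagination_size must not exceed this


def page_windows(total: int, page_size: int = 1000) -> Iterator[Tuple[int, int]]:
    """Yield (pagination_from, pagination_size) pairs covering min(total, MAX_WINDOW) results."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    reachable = min(total, MAX_WINDOW)  # results beyond the cap need fetch_all=True
    for start in range(0, reachable, page_size):
        # The last page is shortened so the window never passes the cap or the total.
        yield start, min(page_size, reachable - start)


# Usage sketch (needs network access and the annoq package):
# count = annoq.count_snps_by_chr(chromosome_identifier="1",
#                                 start_position=1, end_position=1000000)
# for start, size in page_windows(count, page_size=500):
#     page = annoq.get_snps_by_chr(chromosome_identifier="1", start_position=1,
#                                  end_position=1000000,
#                                  pagination_from=start, pagination_size=size)
```

Keeping the window arithmetic in a pure function makes the limit easy to check without calling the API.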
Search for SNPs using RSID identifiers.
# Using a comma-separated string
snps = annoq.get_snps_by_rsid_list(
    rsid_list="rs1219648,rs2912774,rs2981582"
)
# Using a list
snps = annoq.get_snps_by_rsid_list(
    rsid_list=["rs1219648", "rs2912774", "rs2981582"]
)

# Select specific fields
snps = annoq.get_snps_by_rsid_list(
    rsid_list=["rs1219648", "rs2912774", "rs2981582"],
    fields=["chr", "pos", "ref", "alt", "rs_dbSNP151"]
)

# Combine filters with pagination
snps = annoq.get_snps_by_rsid_list(
    rsid_list="rs1219648,rs2912774,rs2981582",
    filter_fields=["VEP_ensembl_Gene_ID"],
    pagination_from=0,
    pagination_size=100
)

# Get all SNPs for a large list of RSIDs
all_snps = annoq.get_snps_by_rsid_list(
    rsid_list=["rs1219648", "rs2912774", "rs2981582", "rs123456", "rs789012"],
    fetch_all=True
)

Search for SNPs associated with a gene using gene ID, gene symbol, or UniProt ID.
# Search by gene symbol
snps = annoq.get_snps_by_gene_product(gene="BRCA1")
# Search by gene ID or UniProt ID
snps = annoq.get_snps_by_gene_product(gene="ENSG00000012048")

# Select specific fields and filter on an annotation
snps = annoq.get_snps_by_gene_product(
    gene="TP53",
    fields=["chr", "pos", "ref", "alt", "rs_dbSNP151"],
    filter_fields=["ANNOVAR_ucsc_Transcript_ID"]
)

# Get first 500 SNPs for a gene
snps = annoq.get_snps_by_gene_product(
    gene="APOE",
    pagination_from=0,
    pagination_size=500
)

# Get all SNPs associated with a gene
all_snps = annoq.get_snps_by_gene_product(
    gene="ZMYND11",
    fetch_all=True
)

Count functions return the number of matching SNPs without retrieving the actual data.
# Count all SNPs in a region
count = annoq.count_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=100000
)
print(f"Found {count} SNPs")
# Count with filters
count = annoq.count_snps_by_chr(
    chromosome_identifier="X",
    start_position=1000,
    end_position=50000,
    filter_fields=["VEP_ensembl_Gene_ID", "ANNOVAR_ucsc_Transcript_ID"]
)

# Count matching RSIDs
count = annoq.count_snps_by_rsid_list(
    rsid_list=["rs1219648", "rs2912774", "rs2981582"]
)
# Count with filters
count = annoq.count_snps_by_rsid_list(
    rsid_list="rs1219648,rs2912774,rs2981582",
    filter_fields=["ANNOVAR_ucsc_Transcript_ID"]
)

# Count SNPs for a gene
count = annoq.count_snps_by_gene_product(gene="BRCA1")
# Count with filters
count = annoq.count_snps_by_gene_product(
    gene="TP53",
    filter_fields=["VEP_ensembl_Gene_ID"]
)

# First, count to see how many SNPs match
total = annoq.count_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=1000000
)
print(f"Total SNPs: {total}")
# Count with filters applied
filtered_count = annoq.count_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=1000000,
    filter_fields=["VEP_ensembl_Gene_ID"]
)
print(f"Filtered SNPs: {filtered_count}")
# Retrieve the filtered data
snps = annoq.get_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=1000000,
    filter_fields=["VEP_ensembl_Gene_ID"],
    fields=["chr", "pos", "ref", "alt", "rs_dbSNP151", "VEP_ensembl_Gene_ID"]
)

# For large regions, first check the count
count = annoq.count_snps_by_chr(
    chromosome_identifier="1",
    start_position=1,
    end_position=10000000
)
if count > 1000000:
    print(f"Warning: {count} SNPs found. Consider narrowing your search.")
elif count > 10000:
    # Use fetch_all for counts between 10K and 1M
    snps = annoq.get_snps_by_chr(
        chromosome_identifier="1",
        start_position=1,
        end_position=10000000,
        fetch_all=True
    )
else:
    # Use regular pagination for smaller datasets
    snps = annoq.get_snps_by_chr(
        chromosome_identifier="1",
        start_position=1,
        end_position=10000000,
        pagination_size=count  # Get all in one go
    )

# Get all SNPs for multiple genes
genes = ["BRCA1", "BRCA2", "TP53"]
all_gene_snps = {}
for gene in genes:
    count = annoq.count_snps_by_gene_product(gene=gene)
    print(f"{gene}: {count} SNPs")
    all_gene_snps[gene] = annoq.get_snps_by_gene_product(
        gene=gene,
        fields=["chr", "pos", "ref", "alt", "rs_dbSNP151"],
        fetch_all=True
    )

# Read RSIDs from a file
with open("rsid_list.txt", "r") as f:
    rsids = [line.strip() for line in f if line.strip()]
# Check how many exist in the database
count = annoq.count_snps_by_rsid_list(rsid_list=rsids)
print(f"{count} out of {len(rsids)} RSIDs found")
# Retrieve all matching SNPs
snps = annoq.get_snps_by_rsid_list(
    rsid_list=rsids,
    fields=["chr", "pos", "ref", "alt", "rs_dbSNP151"],
    fetch_all=True
)

- Regular queries: Maximum of 10,000 results across all pages (pagination_from + pagination_size ≤ 10,000)
- Fetch all queries: Maximum of 1,000,000 total results.
  - Note: Very large result sets can cause performance issues. Narrow the query where possible.
- Maximum of 20 fields can be requested per query
- Use the get_snp_attributes() function to see all available fields
- The API may implement rate limiting for excessive requests
- Use count functions before large retrievals to estimate data size
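
Because the API may rate-limit heavy use, long-running scripts can benefit from retrying failed calls with a growing delay. The following is a generic sketch and not a feature of annoq: call_with_backoff is a hypothetical helper, and which exception signals a rate limit is an assumption (the error-handling examples in this README show requests.exceptions.HTTPError for API errors), so the exception types to retry on are passed in by the caller.

```python
import time
from typing import Callable, Tuple, Type


def call_with_backoff(
    func: Callable,
    *args,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    **kwargs,
):
    """Call func(*args, **kwargs); on a retry_on error, wait and retry up to `retries` times."""
    for attempt in range(retries + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)


# Usage sketch (needs network access, annoq and requests):
# snps = call_with_backoff(
#     annoq.get_snps_by_gene_product,
#     gene="BRCA1",
#     retry_on=(requests.exceptions.HTTPError,),
# )
```

The sleep function is injectable so the retry schedule can be checked without actually waiting.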
All functions raise exceptions for common error cases:
import requests  # needed for requests.exceptions.HTTPError below

# Pagination window that exceeds the 10,000-result limit
try:
    snps = annoq.get_snps_by_chr(
        chromosome_identifier="1",
        start_position=1,
        end_position=100000,
        pagination_from=9500,
        pagination_size=1000  # This exceeds the 10,000 limit
    )
except ValueError as e:
    print(f"Pagination error: {e}")

# Config file path that does not exist
try:
    snps = annoq.get_snps_by_chr(
        chromosome_identifier="1",
        start_position=1,
        end_position=100000,
        fields="/nonexistent/file.json"
    )
except ValueError as e:
    print(f"File error: {e}")

# Invalid chromosome identifier rejected by the API
try:
    snps = annoq.get_snps_by_chr(
        chromosome_identifier="invalid",
        start_position=1,
        end_position=100000
    )
except requests.exceptions.HTTPError as e:
    print(f"API error: {e}")

Contributions are welcome! If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.
This package is licensed under the MIT License.
For questions or issues related to AnnoQ itself, please visit the AnnoQ website.