Support for constructing and using GZI format files for BGZF compressed FASTA #164
base: master
Conversation
There is a lot here that doesn't work, but mainly I was trying to figure out the format of the GZI file and provide methods to unpack and pack the binary on-disk format. There are also methods for loading the GZI into an object for use by Faidx.
@@ -603,13 +627,75 @@ def build_index(self):
                          % self.indexname)
        elif isinstance(e, FastaIndexingError):
            raise e

    def build_gzi(self):
I think this method should work as-is. The idea is to load the BGZF block boundaries into a list that we can bisect to find the BGZF virtual offset corresponding to the closest genomic coordinate we're seeking.
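As a hedged sketch of that bisection idea (the names `block_cstarts`, `block_ustarts`, and `virtual_offset` are illustrative, not from this PR), assuming parallel sorted lists of block start offsets such as those yielded by `Bio.bgzf.BgzfBlocks`:

```python
from bisect import bisect_right

def virtual_offset(block_cstarts, block_ustarts, uoffset):
    """Map an uncompressed offset to a BGZF virtual offset.

    block_cstarts/block_ustarts are parallel, sorted lists of the
    compressed/uncompressed start of each BGZF block (hypothetical
    names; in practice built from Bio.bgzf.BgzfBlocks).
    """
    # rightmost block whose uncompressed start is <= uoffset
    i = bisect_right(block_ustarts, uoffset) - 1
    within = uoffset - block_ustarts[i]  # offset inside that block (< 65536)
    # a BGZF virtual offset packs (compressed block start << 16) | within-block offset
    return (block_cstarts[i] << 16) | within

# toy block table for three blocks; the offsets are made up
block_cstarts = [0, 18000, 36000]
block_ustarts = [0, 65280, 130560]
print(virtual_offset(block_cstarts, block_ustarts, 70000))  # (18000 << 16) | 4720
```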
if not eof.empty:
    raise IOError("BGZF EOF marker not found. File %s is not a valid BGZF file." % self.filename)
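For reference, the EOF marker being tested here is a fixed 28-byte empty BGZF block defined by the SAM/BGZF specification. A stdlib-only sketch (the helper name `bgzf_block_size` is illustrative) that parses the BSIZE extra subfield of such a block:

```python
import struct

# The standard 28-byte BGZF EOF marker block, as given in the SAM spec.
BGZF_EOF = bytes([
    0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff,
    0x06, 0x00, 0x42, 0x43, 0x02, 0x00, 0x1b, 0x00, 0x03, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
])

def bgzf_block_size(block):
    """Read BSIZE from the 'BC' extra subfield of a BGZF block header."""
    xlen = struct.unpack_from('<H', block, 10)[0]  # XLEN at gzip header offset 10
    extra = block[12:12 + xlen]
    pos = 0
    while pos < len(extra):
        si1, si2, slen = struct.unpack_from('<BBH', extra, pos)
        if (si1, si2) == (ord('B'), ord('C')):
            bsize = struct.unpack_from('<H', extra, pos + 4)[0]
            return bsize + 1  # BSIZE stores total block size minus 1
        pos += 4 + slen
    raise ValueError("not a BGZF block")

print(bgzf_block_size(BGZF_EOF))  # 28
```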
The read_gzi and write_gzi methods should be re-written to use the functions from the end of this file (I think).
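As a minimal illustration of the on-disk layout those methods target (a little-endian uint64 entry count followed by (compressed, uncompressed) uint64 offset pairs; whether htslib records an entry for the first block should be double-checked against htslib itself), a stdlib-only round-trip with made-up offsets:

```python
import struct

# toy (compressed start, uncompressed start) pairs; values are illustrative
blocks = [(0, 0), (18000, 65280), (36000, 130560)]

# pack: [n_entries][caddr_1][uaddr_1]...[caddr_n][uaddr_n]
payload = struct.pack('<Q', len(blocks))
for caddr, uaddr in blocks:
    payload += struct.pack('<QQ', caddr, uaddr)

# unpack: read the count, then each 16-byte pair at its fixed offset
(n,) = struct.unpack_from('<Q', payload, 0)
decoded = [struct.unpack_from('<QQ', payload, 8 + 16 * i) for i in range(n)]
assert decoded == blocks
```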
chunk = start0 + newlines_before + newlines_inside + seq_len
chunk_seq = self.file.read(chunk).decode()
seq = chunk_seq[start0 + newlines_before:]
bstart = i.offset + newlines_before + start0  # uncompressed offset for the start of requested string
I'm not sure this section was really working; it should be tested. This is where most of the work needs to happen to close out this feature.
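The newline bookkeeping in the excerpt above can be exercised in isolation. Here is a hedged, file-free sketch (the helper name `count_newlines` and the fixed one-byte line ending are assumptions) of how newlines_before/newlines_inside relate to a requested 0-based base span:

```python
def count_newlines(start0, seq_len, linebases, line_ending_len=1):
    """Newline characters before and inside a 0-based base span of a
    FASTA record with fixed-width lines (hypothetical helper)."""
    end = start0 + seq_len
    # full lines completed before the span starts
    newlines_before = (start0 // linebases) * line_ending_len
    # line breaks falling strictly inside the span
    newlines_inside = ((end - 1) // linebases - start0 // linebases) * line_ending_len
    return newlines_before, newlines_inside

record = "ACGT\nACGT\nACGT\n"   # linebases=4, linewidth=5
start0, seq_len, linebases = 2, 6, 4
nb, ni = count_newlines(start0, seq_len, linebases)
chunk = record[:start0 + nb + ni + seq_len]   # read enough to cover the span
seq = chunk[start0 + nb:].replace('\n', '')   # drop leading bases, strip newlines
print(seq)  # GTACGT
```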
def build_gzi(self):
    """ Build the htslib .gzi index format """
    from Bio import bgzf
    with open(self.filename, 'rb') as bgzf_file:
        for i, values in enumerate(bgzf.BgzfBlocks(bgzf_file)):
            self.gzi_index[i] = BGZFblock(*values)
def write_gzi(self):
    """ Write the on disk format for the htslib .gzi index
    https://github.com/samtools/htslib/issues/473 """
    with open(self.gzi_indexname, 'wb') as bzi_file:
        bzi_file.write(struct.pack('<Q', len(self.gzi_index)))
        for block in self.gzi_index.values():
            bzi_file.write(block.as_bytes())
def read_gzi(self):
    """ Read the on disk format for the htslib .gzi index
    https://github.com/samtools/htslib/issues/473 """
    from ctypes import c_uint64, sizeof
    with open(self.gzi_indexname, 'rb') as bzi_file:
        number_of_blocks = struct.unpack('<Q', bzi_file.read(sizeof(c_uint64)))[0]
        for i in range(number_of_blocks):
            entry = bzi_file.read(sizeof(c_uint64) * 2)
            if len(entry) < sizeof(c_uint64) * 2:
                # struct.unpack returns integers, so comparing them against ''
                # can never signal truncation; check the byte count instead
                raise IndexError("Unexpected end of .gzi file.")
            cstart, ustart = struct.unpack('<QQ', entry)
            self.gzi_index[i] = BGZFblock(cstart, None, ustart, None)
Is this a duplicate code block?
Hello all - I wanted to ask about the progress of this pull request. Is there any way we can help with testing or contributing? Thanks!
@mdshw5: same as @ThomVett, I'm interested in the progress of this pull request as a solution to whatshap/whatshap#151, which is included in HKU-BAL/Clair3#163, which in turn is the most popular variant calling library for long-read sequencing data.
This is a work-in-progress implementation of #126, and much of it doesn't work properly yet.