Support for constructing and using GZI format files for BGZF compressed FASTA #164
base: master
Conversation
There is a lot here that doesn't work, but mainly I was trying to figure out the format of the GZI file and provide methods to unpack and pack the binary on-disk format. There are also methods for loading the GZI into an object for use by Faidx.
@@ -603,13 +627,75 @@ def build_index(self):
                          % self.indexname)
        elif isinstance(e, FastaIndexingError):
            raise e

    def build_gzi(self):
I think this method should work as-is. The idea is to load the BGZF block boundaries into a list that we can bisect to find the BGZF virtual offset corresponding to the closest genomic coordinate we're seeking.
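As a hedged sketch of that bisection idea (the names `block_cstarts`, `block_ustarts`, and `virtual_offset` are illustrative, not from this PR), assuming parallel sorted lists of block start offsets such as those yielded by `Bio.bgzf.BgzfBlocks`:

```python
from bisect import bisect_right

def virtual_offset(block_cstarts, block_ustarts, uoffset):
    """Map an uncompressed offset to a BGZF virtual offset.

    block_cstarts/block_ustarts are parallel, sorted lists of the
    compressed/uncompressed start of each BGZF block (hypothetical
    names; in practice built from Bio.bgzf.BgzfBlocks).
    """
    # rightmost block whose uncompressed start is <= uoffset
    i = bisect_right(block_ustarts, uoffset) - 1
    within = uoffset - block_ustarts[i]  # offset inside that block (< 65536)
    # a BGZF virtual offset packs (compressed block start << 16) | within-block offset
    return (block_cstarts[i] << 16) | within

# toy block table for three blocks; the offsets are made up
block_cstarts = [0, 18000, 36000]
block_ustarts = [0, 65280, 130560]
print(virtual_offset(block_cstarts, block_ustarts, 70000))  # (18000 << 16) | 4720
```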
if not eof.empty:
    raise IOError("BGZF EOF marker not found. File %s is not a valid BGZF file." % self.filename)
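For reference, the EOF marker being tested here is a fixed 28-byte empty BGZF block defined by the SAM/BGZF specification. A stdlib-only sketch (the helper name `bgzf_block_size` is illustrative) that parses the BSIZE extra subfield of such a block:

```python
import struct

# The standard 28-byte BGZF EOF marker block, as given in the SAM spec.
BGZF_EOF = bytes([
    0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff,
    0x06, 0x00, 0x42, 0x43, 0x02, 0x00, 0x1b, 0x00, 0x03, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
])

def bgzf_block_size(block):
    """Read BSIZE from the 'BC' extra subfield of a BGZF block header."""
    xlen = struct.unpack_from('<H', block, 10)[0]  # XLEN at gzip header offset 10
    extra = block[12:12 + xlen]
    pos = 0
    while pos < len(extra):
        si1, si2, slen = struct.unpack_from('<BBH', extra, pos)
        if (si1, si2) == (ord('B'), ord('C')):
            bsize = struct.unpack_from('<H', extra, pos + 4)[0]
            return bsize + 1  # BSIZE stores total block size minus 1
        pos += 4 + slen
    raise ValueError("not a BGZF block")

print(bgzf_block_size(BGZF_EOF))  # 28
```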
The read_gzi and write_gzi methods should be re-written to use the functions from the end of this file (I think).
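As a minimal illustration of the on-disk layout those methods target (a little-endian uint64 entry count followed by (compressed, uncompressed) uint64 offset pairs; whether htslib records an entry for the first block should be double-checked against htslib itself), a stdlib-only round-trip with made-up offsets:

```python
import struct

# toy (compressed start, uncompressed start) pairs; values are illustrative
blocks = [(0, 0), (18000, 65280), (36000, 130560)]

# pack: [n_entries][caddr_1][uaddr_1]...[caddr_n][uaddr_n]
payload = struct.pack('<Q', len(blocks))
for caddr, uaddr in blocks:
    payload += struct.pack('<QQ', caddr, uaddr)

# unpack: read the count, then each 16-byte pair at its fixed offset
(n,) = struct.unpack_from('<Q', payload, 0)
decoded = [struct.unpack_from('<QQ', payload, 8 + 16 * i) for i in range(n)]
assert decoded == blocks
```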
chunk = start0 + newlines_before + newlines_inside + seq_len
chunk_seq = self.file.read(chunk).decode()
seq = chunk_seq[start0 + newlines_before:]
bstart = i.offset + newlines_before + start0  # uncompressed offset for the start of requested string
I'm not sure this section was really working; it should be tested. This is where most of the work needs to happen to close out this feature.
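The newline bookkeeping in the excerpt above can be exercised in isolation. Here is a hedged, file-free sketch (the helper name `count_newlines` and the fixed one-byte line ending are assumptions) of how newlines_before/newlines_inside relate to a requested 0-based base span:

```python
def count_newlines(start0, seq_len, linebases, line_ending_len=1):
    """Newline characters before and inside a 0-based base span of a
    FASTA record with fixed-width lines (hypothetical helper)."""
    end = start0 + seq_len
    # full lines completed before the span starts
    newlines_before = (start0 // linebases) * line_ending_len
    # line breaks falling strictly inside the span
    newlines_inside = ((end - 1) // linebases - start0 // linebases) * line_ending_len
    return newlines_before, newlines_inside

record = "ACGT\nACGT\nACGT\n"   # linebases=4, linewidth=5
start0, seq_len, linebases = 2, 6, 4
nb, ni = count_newlines(start0, seq_len, linebases)
chunk = record[:start0 + nb + ni + seq_len]   # read enough to cover the span
seq = chunk[start0 + nb:].replace('\n', '')   # drop leading bases, strip newlines
print(seq)  # GTACGT
```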
def build_gzi(self):
    """ Build the htslib .gzi index format """
    from Bio import bgzf
    with open(self.filename, 'rb') as bgzf_file:
        for i, values in enumerate(bgzf.BgzfBlocks(bgzf_file)):
            self.gzi_index[i] = BGZFblock(*values)
def write_gzi(self):
    """ Write the on disk format for the htslib .gzi index
    https://github.com/samtools/htslib/issues/473 """
    with open(self.gzi_indexname, 'wb') as bzi_file:
        bzi_file.write(struct.pack('<Q', len(self.gzi_index)))
        for block in self.gzi_index.values():
            bzi_file.write(block.as_bytes())
def read_gzi(self):
    """ Read the on disk format for the htslib .gzi index
    https://github.com/samtools/htslib/issues/473 """
    from ctypes import c_uint64, sizeof
    with open(self.gzi_indexname, 'rb') as bzi_file:
        number_of_blocks = struct.unpack('<Q', bzi_file.read(sizeof(c_uint64)))[0]
        for i in range(number_of_blocks):
            entry = bzi_file.read(sizeof(c_uint64) * 2)
            if len(entry) < sizeof(c_uint64) * 2:
                # struct.unpack returns integers, so comparing them against ''
                # can never signal truncation; check the byte count instead
                raise IndexError("Unexpected end of .gzi file.")
            cstart, ustart = struct.unpack('<QQ', entry)
            self.gzi_index[i] = BGZFblock(cstart, None, ustart, None)
Is this a duplicate code block?
Hello all - I wanted to ask about the progress of this pull request. Is there any way we can help with testing or contributing? Thanks!
@mdshw5: same as @ThomVett, I'm interested in the progress of this pull request as a solution to whatshap/whatshap#151, which is included in HKU-BAL/Clair3#163, which in turn is the most popular variant calling library for long-read sequencing data.
This is a work-in-progress implementation of #126, and much of it doesn't work properly yet.