failed to read record from bam information, truncated record in SAM/BAM/CRAM file #375

Open
ziphra opened this issue Feb 17, 2025 · 7 comments
Labels
troubleshooting workflow and data preparation questions


ziphra commented Feb 17, 2025

Hello,

I am trying to use the modkit extract full command on my data, but every record fails during processing:

(sturgeon) p2@p2-Precision-5860-Tower:~/tools$ modkit extract full haplotagged.cram /output/HM.txt --log-filepath log.txt
> found BAM index, processing reads in 100000 base pair chunks
> 2952 ~records failed
> 0 ~records skipped
> 0 ~records used
> 0 rows written
|---------------------------------------       1/10000   enqueued processed reads
[00:03:17] ####------------------------------------ 294656422/3299210039 genome positions                   ^C

And the log tells me:

[src/logging.rs::60][2025-02-17 12:52:47][DEBUG] command line: modkit_v0.4.3_u16_x86_64/dist_modkit_v0.4.3_d13b97d/modkit extract full /home/p2/epi2melabs/instances/wf-human-variation_01JKAX74XSDEBXWFFN086R6HTH/output/19H01535.haplotagged.cram /home/p2/epi2melabs/instances/wf-human-variation_01JKAX74XSDEBXWFFN086R6HTH/output/19H01535HM.txt --log-filepath /home/p2/epi2melabs/instances/wf-human-variation_01JKAX74XSDEBXWFFN086R6HTH/log.txt
[src/extract/util.rs::281][2025-02-17 12:52:47][INFO] found BAM index, processing reads in 100000 base pair chunks
[src/interval_chunks.rs::512][2025-02-17 12:52:47][DEBUG] there are 711 contig(s) to work on (711 parts)
[src/mod_bam.rs::112][2025-02-17 12:52:48][DEBUG] failed to read record from bam information, truncated record in SAM/BAM/CRAM file
[src/mod_bam.rs::112][2025-02-17 12:52:48][DEBUG] failed to read record from bam information, truncated record in SAM/BAM/CRAM file
...

Do you have any idea what could cause this issue? Could it be that it needs a BAM?

Many thanks in advance!

Contributor

ArtRand commented Feb 18, 2025

Hello @ziphra,

Could you run modkit modbam check-tags (see the docs) on the CRAM, or on a small subset of it, and post the output here? It will tabulate the errors encountered whilst parsing the records. (You need 0.4.3 for this command.)

CRAM should work fine; you can test the command using the CRAM file in the testing resources directory. Which program is generating the modBAM files? If it's still not clear what's going on, maybe you can attach a sample of the file and I can take a look.

ArtRand added the troubleshooting workflow and data preparation questions label Feb 18, 2025
Author

ziphra commented Feb 18, 2025

Hello, thank you for your reply.

modkit modbam check-tags gave me this:

 modkit modbam check-tags '/home/p2/epi2melabs/instances/wf-human-variation_01JKAX74XSDEBXWFFN086R6HTH/output/19H01535.haplotagged.cram'
[00:00:14] ######################################## 3299210039/3299210039 genome positions
> input modBAM contains 1314 (1.04%) failed records
> num PASS records: 126360 (100.00%)
> num records: 126360
> errors:
+----------------------------------------------------+-------+--------+
| error                                              | count | pct    |
+----------------------------------------------------+-------+--------+
| HtsLib-error-truncated record in SAM/BAM/CRAM file | 1314  | 100.00 |
| total                                              | 1314  | 100    |
+----------------------------------------------------+-------+--------+


> valid record tag headers:
+------------+--------+
| tag_header | count  |
+------------+--------+
| C+h?       | 117320 |
| C+m?       | 117320 |
+------------+--------+


> modified bases:
+--------+--------------+----------+------+
| strand | primary_base | mod_code | mode |
+--------+--------------+----------+------+
| +      | C            | h        | ?    |
| +      | C            | m        | ?    |
+--------+--------------+----------+------+

> Error! input modBAM contains 1314 (1.04%) failed records

My cram was generated with Epi2me, so it's the usual workflow.

It worked with the bam files generated during sequencing.

Contributor

ArtRand commented Feb 18, 2025

@ziphra

It looks like you're using two different files: haplotagged.cram vs. 19H01535.haplotagged.cram? In the first file almost all of the records appear to fail, but in the second only a small sub-population does.

As you saw in the first post, modkit extract will mostly just drop failed records. Of course, if they all fail, that's not much help. I can check with the Epi2Me team about how truncated records could be written.

Aside:

> num PASS records: 126360 (100.00%)

Looks to me like this counter is broken, so I'll fix that.
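
As a quick sanity check on the aside above, the percentages can be recomputed from the counts in the check-tags output. This is only a minimal sketch, assuming that "num records" (126360) includes the 1314 failed records; the helper name `pct` is hypothetical and not part of modkit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of modkit): PART as a percentage of TOTAL, 2 decimals.
pct() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * part / total }'
}

total=126360   # "num records" from the check-tags output
failed=1314    # "total" row of the errors table

pct "$failed" "$total"               # failed share: prints 1.04, matching the report
pct "$((total - failed))" "$total"   # PASS share under the assumption above: prints 98.96
```

Under that assumption the PASS line would be expected to read 125046 (98.96%) rather than 126360 (100.00%).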

Author

ziphra commented Feb 19, 2025

No, it's the same file; I changed the names and removed the path when posting, for readability. I just tried the two commands again and got the same results.
I wonder if this could be due to my reads being phased?

Contributor

ArtRand commented Feb 19, 2025

Hello @ziphra,

Ok. I don't think that having phased reads should matter; the phasing is just a tag in the record (same as the modified base probabilities). Would it be possible for you to attach a small sample CRAM that exposes the issue? You can also email/share it with me at art.rand[at]nanoporetech.com.

Author

ziphra commented Feb 20, 2025

I'm encountering an unexpected issue with this CRAM file. I tried subsetting a region with the following commands, so that I could share it with you:

samtools view -T hg38.fa --output-fmt cram -h -o chr7.cram haplotagged.cram chr7
samtools index chr7.cram

Then, when I run modkit extract full on the subsetted CRAM, it works!

I'm not sure why this behaviour occurs. Any insights would be appreciated.

ziphra closed this as completed Feb 20, 2025
ziphra reopened this Feb 20, 2025
Contributor

ArtRand commented Feb 20, 2025

Hello @ziphra,

If you randomly subsample the CRAM (samtools view --output-fmt cram -h --subsample FRAC) does it work? If not, could you send me that one?
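
The random subsample suggested above could look like the sketch below. It assumes a samtools version that supports `--subsample` and `--subsample-seed`, and it reuses the `hg38.fa` reference from the earlier region-subsetting command. The fraction 0.01, the output name `subsample.cram`, the `run` wrapper, and the `DRY_RUN` variable are illustrative choices, not something specified in this thread.

```shell
#!/usr/bin/env bash
# Illustrative wrapper: print each command, and execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Usage: subsample_cram IN.cram REF.fa FRACTION OUT.cram
subsample_cram() {
    local in_cram="$1" ref="$2" frac="$3" out_cram="$4"
    # Keep roughly FRACTION of the reads; a fixed seed makes the sample reproducible.
    run samtools view -T "$ref" --output-fmt cram -h \
        --subsample "$frac" --subsample-seed 1 \
        -o "$out_cram" "$in_cram" || return 1
    run samtools index "$out_cram" || return 1
    # Tabulate parsing errors on the subsample, as requested earlier in the thread.
    run modkit modbam check-tags "$out_cram"
}

# Example (dry run: prints the commands without executing them):
# DRY_RUN=1 subsample_cram haplotagged.cram hg38.fa 0.01 subsample.cram
```

If check-tags still reports truncated records on the subsample, that file is small enough to share; if it reports none, that mirrors the chr7 result above, where the rewritten CRAM no longer failed.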
