Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

java.lang.ArrayIndexOutOfBoundsException: Index 32770 out of bounds for length 32770 #8192

Closed
shuaiwang2 opened this issue Feb 8, 2023 · 6 comments

Comments

@shuaiwang2
Copy link

Hello, When I implement "HaplotypeCaller" commands, the reference genome is about 15G , every chromosome is more then 600M, I get some errors, could you give me some advice?
the commands

# the step is after marked duplication 
samtools index -c  markdup.bam.gz
gatk --java-options "-Xmx100G -Djava.io.tmpdir=./" HaplotypeCaller -R Triticum_aestivum.IWGSC.dna.toplevel.fa -I rmarkdup.bam.gz -O SRR9851087.gvcf.gz   -ERC GVCF  -OVI >3gvcf.log 2>&1

the bug:

Using GATK jar /home/ywt/anaconda3/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx100G -Djava.io.tmpdir=./ -jar /home/ywt/anaconda3/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar HaplotypeCaller -R Triticum_aestivum.IWGSC.dna.
toplevel.fa -I rmarkdup.bam.gz -O SRR9851087.gvcf.gz -ERC GVCF -OVI
09:01:25.845 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ywt/anaconda3/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
09:01:25.950 INFO  HaplotypeCaller - ------------------------------------------------------------
09:01:25.950 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.3.0.0
09:01:25.950 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
09:01:25.950 INFO  HaplotypeCaller - Executing as ywt@ywt-Precision-5820-Tower on Linux v5.15.0-41-generic amd64
09:01:25.950 INFO  HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.5+8-Ubuntu-2ubuntu122.04
09:01:25.950 INFO  HaplotypeCaller - Start Date/Time: February 8, 2023 at 9:01:25 AM CST
09:01:25.950 INFO  HaplotypeCaller - ------------------------------------------------------------
09:01:25.950 INFO  HaplotypeCaller - ------------------------------------------------------------
09:01:25.951 INFO  HaplotypeCaller - HTSJDK Version: 3.0.1
09:01:25.951 INFO  HaplotypeCaller - Picard Version: 2.27.5
09:01:25.951 INFO  HaplotypeCaller - Built for Spark Version: 2.4.5
09:01:25.951 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:01:25.951 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:01:25.951 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:01:25.951 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:01:25.951 INFO  HaplotypeCaller - Deflater: IntelDeflater
09:01:25.951 INFO  HaplotypeCaller - Inflater: IntelInflater
09:01:25.951 INFO  HaplotypeCaller - GCS max retries/reopens: 20
09:01:25.951 INFO  HaplotypeCaller - Requester pays: disabled
09:01:25.952 INFO  HaplotypeCaller - Initializing engine
09:01:26.059 INFO  HaplotypeCaller - Done initializing engine
09:01:26.060 INFO  HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
09:01:26.067 INFO  HaplotypeCallerEngine - Standard Emitting and Calling confidence set to -0.0 for reference-model confidence output
09:01:26.067 INFO  HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
09:01:26.077 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/home/ywt/anaconda3/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
09:01:26.078 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/home/ywt/anaconda3/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
09:01:26.089 INFO  IntelPairHmm - Using CPU-supported AVX-512 instructions
09:01:26.089 INFO  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
09:01:26.090 INFO  IntelPairHmm - Available threads: 36
09:01:26.090 INFO  IntelPairHmm - Requested threads: 4
09:01:26.090 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
09:01:26.121 INFO  ProgressMeter - Starting traversal
09:01:26.121 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
09:01:26.406 WARN  InbreedingCoeff - InbreedingCoeff will not be calculated at position 1A:145 and possibly subsequent; at least 10 samples must have called genotypes
09:01:33.373 WARN  DepthPerSampleHC - Annotation will not be calculated at position 1A:1702502 and possibly subsequent; genotype for sample SRR9851087 is not called
09:01:33.374 WARN  StrandBiasBySample - Annotation will not be calculated at position 1A:1702502 and possibly subsequent; genotype for sample SRR9851087 is not called
09:01:36.316 INFO  ProgressMeter -           1A:2054431              0.2                  7310          43025.3
09:01:46.831 INFO  ProgressMeter -           1A:3580946              0.3                 12960          37547.1
09:01:56.858 INFO  ProgressMeter -           1A:4888859              0.5                 17840          34824.5
09:02:07.416 INFO  ProgressMeter -           1A:7184455              0.7                 26090          37907.7
09:02:17.850 INFO  ProgressMeter -           1A:9469826              0.9                 34580          40109.0
09:02:28.162 INFO  ProgressMeter -          1A:11632942              1.0                 42480          41082.5
09:02:38.391 INFO  ProgressMeter -          1A:12861813              1.2                 47220          39203.0
09:02:48.460 INFO  ProgressMeter -          1A:15373965              1.4                 56590          41236.8

09:20:43.520 INFO  ProgressMeter -         1A:536836177             19.3               1951140         101147.8
09:20:44.715 INFO  HaplotypeCaller - Shutting down engine
[February 8, 2023 at 9:20:44 AM CST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 19.31 minutes.
Runtime.totalMemory()=1811939328
java.lang.ArrayIndexOutOfBoundsException: Index 32770 out of bounds for length 32770
        at htsjdk.samtools.BinningIndexBuilder.processFeature(BinningIndexBuilder.java:142)
        at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeFeature(TabixIndexCreator.java:106)
        at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeIndex(TabixIndexCreator.java:129)
        at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.close(IndexingVariantContextWriter.java:177)
        at htsjdk.variant.variantcontext.writer.VCFWriter.close(VCFWriter.java:233)
        at org.broadinstitute.hellbender.utils.variant.writers.GVCFWriter.close(GVCFWriter.java:71)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller.closeTool(HaplotypeCaller.java:277)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1101)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)
(END)



@lbergelson
Copy link
Member

@shuaiwang2 Hi, we don't currently support indexes that long. We use a bai index for bams and tabix for vcf which only support up to 512 M. You need to use a CSI index for references that large but we don't support writing those. (Reading them is weird, I think we can read BAM csi indexes but not VCF ones).

It might be possible to work around this issue by setting --create-output-variant-index false, although downstream gatk tools would need an index if you're sharding them.

Otherwise I recommend splitting your chromosomes into two separate parts and calling on the split chromosomes. Splitting along a long region of N's should be a safe way to avoid missing any useful calls. (The telemere might be a good spot unless you have a T2T reference.).

We should probably improve that error message to make it clear what the problem is.

@shuaiwang2
Copy link
Author

thank you, very useful advice for me

@lbergelson
Copy link
Member

I've opened a ticket (samtools/htsjdk#1651) to improve this error message and make it less confusing

@Dentalium
Copy link

@shuaiwang2 Hi, we don't currently support indexes that long. We use a bai index for bams and tabix for vcf which only support up to 512 M. You need to use a CSI index for references that large but we don't support writing those. (Reading them is weird, I think we can read BAM csi indexes but not VCF ones).

It might be possible to work around this issue by setting --create-output-variant-index false, although downstream gatk tools would need an index if you're sharding them.

Otherwise I recommend splitting your chromosomes into two separate parts and calling on the split chromosomes. Splitting along a long region of N's should be a safe way to avoid missing any useful calls. (The telemere might be a good spot unless you have a T2T reference.).

We should probably improve that error message to make it clear what the problem is.

Hi, I think that means that the error only occurs when producing tbi index for vcf files of large chromosomes, while the variant calling progress can still be perform normally and I needn't to re-run the HaplotypeCaller. Am I right?

@shuaiwang2
Copy link
Author

@shuaiwang2 Hi, we don't currently support indexes that long. We use a bai index for bams and tabix for vcf which only support up to 512 M. You need to use a CSI index for references that large but we don't support writing those. (Reading them is weird, I think we can read BAM csi indexes but not VCF ones).
It might be possible to work around this issue by setting --create-output-variant-index false, although downstream gatk tools would need an index if you're sharding them.
Otherwise I recommend splitting your chromosomes into two separate parts and calling on the split chromosomes. Splitting along a long region of N's should be a safe way to avoid missing any useful calls. (The telemere might be a good spot unless you have a T2T reference.).
We should probably improve that error message to make it clear what the problem is.

Hi, I think that means that the error only occurs when producing tbi index for vcf files of large chromosomes, while the variant calling progress can still be perform normally and I needn't to re-run the HaplotypeCaller. Am I right?

I agree with your comment, but I don't test it.

@gokalpcelik
Copy link
Contributor

@Dentalium
Yes you are right. If you need to random access variants within you need to index your file with CSI index using tabix -C

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants