java.lang.ArrayIndexOutOfBoundsException: Index 32770 out of bounds for length 32770 #8192

shuaiwang2 · 2023-02-08T01:28:35Z

Hello, When I implement "HaplotypeCaller" commands, the reference genome is about 15G , every chromosome is more then 600M, I get some errors, could you give me some advice?
the commands

# the step is after marked duplication 
samtools index -c  markdup.bam.gz
gatk --java-options "-Xmx100G -Djava.io.tmpdir=./" HaplotypeCaller -R Triticum_aestivum.IWGSC.dna.toplevel.fa -I rmarkdup.bam.gz -O SRR9851087.gvcf.gz   -ERC GVCF  -OVI >3gvcf.log 2>&1

the bug:

Using GATK jar /home/ywt/anaconda3/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx100G -Djava.io.tmpdir=./ -jar /home/ywt/anaconda3/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar HaplotypeCaller -R Triticum_aestivum.IWGSC.dna.
toplevel.fa -I rmarkdup.bam.gz -O SRR9851087.gvcf.gz -ERC GVCF -OVI
09:01:25.845 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ywt/anaconda3/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
09:01:25.950 INFO  HaplotypeCaller - ------------------------------------------------------------
09:01:25.950 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.3.0.0
09:01:25.950 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
09:01:25.950 INFO  HaplotypeCaller - Executing as ywt@ywt-Precision-5820-Tower on Linux v5.15.0-41-generic amd64
09:01:25.950 INFO  HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.5+8-Ubuntu-2ubuntu122.04
09:01:25.950 INFO  HaplotypeCaller - Start Date/Time: February 8, 2023 at 9:01:25 AM CST
09:01:25.950 INFO  HaplotypeCaller - ------------------------------------------------------------
09:01:25.950 INFO  HaplotypeCaller - ------------------------------------------------------------
09:01:25.951 INFO  HaplotypeCaller - HTSJDK Version: 3.0.1
09:01:25.951 INFO  HaplotypeCaller - Picard Version: 2.27.5
09:01:25.951 INFO  HaplotypeCaller - Built for Spark Version: 2.4.5
09:01:25.951 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:01:25.951 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:01:25.951 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:01:25.951 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:01:25.951 INFO  HaplotypeCaller - Deflater: IntelDeflater
09:01:25.951 INFO  HaplotypeCaller - Inflater: IntelInflater
09:01:25.951 INFO  HaplotypeCaller - GCS max retries/reopens: 20
09:01:25.951 INFO  HaplotypeCaller - Requester pays: disabled
09:01:25.952 INFO  HaplotypeCaller - Initializing engine
09:01:26.059 INFO  HaplotypeCaller - Done initializing engine
09:01:26.060 INFO  HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
09:01:26.067 INFO  HaplotypeCallerEngine - Standard Emitting and Calling confidence set to -0.0 for reference-model confidence output
09:01:26.067 INFO  HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
09:01:26.077 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/home/ywt/anaconda3/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
09:01:26.078 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/home/ywt/anaconda3/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
09:01:26.089 INFO  IntelPairHmm - Using CPU-supported AVX-512 instructions
09:01:26.089 INFO  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
09:01:26.090 INFO  IntelPairHmm - Available threads: 36
09:01:26.090 INFO  IntelPairHmm - Requested threads: 4
09:01:26.090 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
09:01:26.121 INFO  ProgressMeter - Starting traversal
09:01:26.121 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
09:01:26.406 WARN  InbreedingCoeff - InbreedingCoeff will not be calculated at position 1A:145 and possibly subsequent; at least 10 samples must have called genotypes
09:01:33.373 WARN  DepthPerSampleHC - Annotation will not be calculated at position 1A:1702502 and possibly subsequent; genotype for sample SRR9851087 is not called
09:01:33.374 WARN  StrandBiasBySample - Annotation will not be calculated at position 1A:1702502 and possibly subsequent; genotype for sample SRR9851087 is not called
09:01:36.316 INFO  ProgressMeter -           1A:2054431              0.2                  7310          43025.3
09:01:46.831 INFO  ProgressMeter -           1A:3580946              0.3                 12960          37547.1
09:01:56.858 INFO  ProgressMeter -           1A:4888859              0.5                 17840          34824.5
09:02:07.416 INFO  ProgressMeter -           1A:7184455              0.7                 26090          37907.7
09:02:17.850 INFO  ProgressMeter -           1A:9469826              0.9                 34580          40109.0
09:02:28.162 INFO  ProgressMeter -          1A:11632942              1.0                 42480          41082.5
09:02:38.391 INFO  ProgressMeter -          1A:12861813              1.2                 47220          39203.0
09:02:48.460 INFO  ProgressMeter -          1A:15373965              1.4                 56590          41236.8

09:20:43.520 INFO  ProgressMeter -         1A:536836177             19.3               1951140         101147.8
09:20:44.715 INFO  HaplotypeCaller - Shutting down engine
[February 8, 2023 at 9:20:44 AM CST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 19.31 minutes.
Runtime.totalMemory()=1811939328
java.lang.ArrayIndexOutOfBoundsException: Index 32770 out of bounds for length 32770
        at htsjdk.samtools.BinningIndexBuilder.processFeature(BinningIndexBuilder.java:142)
        at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeFeature(TabixIndexCreator.java:106)
        at htsjdk.tribble.index.tabix.TabixIndexCreator.finalizeIndex(TabixIndexCreator.java:129)
        at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.close(IndexingVariantContextWriter.java:177)
        at htsjdk.variant.variantcontext.writer.VCFWriter.close(VCFWriter.java:233)
        at org.broadinstitute.hellbender.utils.variant.writers.GVCFWriter.close(GVCFWriter.java:71)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller.closeTool(HaplotypeCaller.java:277)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1101)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)
(END)

The text was updated successfully, but these errors were encountered:

lbergelson · 2023-02-08T15:45:03Z

@shuaiwang2 Hi, we don't currently support indexes that long. We use a bai index for bams and tabix for vcf which only support up to 512 M. You need to use a CSI index for references that large but we don't support writing those. (Reading them is weird, I think we can read BAM csi indexes but not VCF ones).

It might be possible to work around this issue by setting --create-output-variant-index false, although downstream gatk tools would need an index if you're sharding them.

Otherwise I recommend splitting your chromosomes into two separate parts and calling on the split chromosomes. Splitting along a long region of N's should be a safe way to avoid missing any useful calls. (The telemere might be a good spot unless you have a T2T reference.).

We should probably improve that error message to make it clear what the problem is.

shuaiwang2 · 2023-02-09T08:34:10Z

thank you, very useful advice for me

lbergelson · 2023-02-13T18:43:31Z

I've opened a ticket (samtools/htsjdk#1651) to improve this error message and make it less confusing

Dentalium · 2024-12-05T11:27:02Z

@shuaiwang2 Hi, we don't currently support indexes that long. We use a bai index for bams and tabix for vcf which only support up to 512 M. You need to use a CSI index for references that large but we don't support writing those. (Reading them is weird, I think we can read BAM csi indexes but not VCF ones).

It might be possible to work around this issue by setting --create-output-variant-index false, although downstream gatk tools would need an index if you're sharding them.

Otherwise I recommend splitting your chromosomes into two separate parts and calling on the split chromosomes. Splitting along a long region of N's should be a safe way to avoid missing any useful calls. (The telemere might be a good spot unless you have a T2T reference.).

We should probably improve that error message to make it clear what the problem is.

Hi, I think that means that the error only occurs when producing tbi index for vcf files of large chromosomes, while the variant calling progress can still be perform normally and I needn't to re-run the HaplotypeCaller. Am I right?

shuaiwang2 · 2024-12-05T11:40:35Z

@shuaiwang2 Hi, we don't currently support indexes that long. We use a bai index for bams and tabix for vcf which only support up to 512 M. You need to use a CSI index for references that large but we don't support writing those. (Reading them is weird, I think we can read BAM csi indexes but not VCF ones).
It might be possible to work around this issue by setting --create-output-variant-index false, although downstream gatk tools would need an index if you're sharding them.
Otherwise I recommend splitting your chromosomes into two separate parts and calling on the split chromosomes. Splitting along a long region of N's should be a safe way to avoid missing any useful calls. (The telemere might be a good spot unless you have a T2T reference.).
We should probably improve that error message to make it clear what the problem is.

Hi, I think that means that the error only occurs when producing tbi index for vcf files of large chromosomes, while the variant calling progress can still be perform normally and I needn't to re-run the HaplotypeCaller. Am I right?

I agree with your comment, but I don't test it.

gokalpcelik · 2024-12-05T11:58:26Z

@Dentalium
Yes you are right. If you need to random access variants within you need to index your file with CSI index using tabix -C

shuaiwang2 closed this as completed Feb 9, 2023

lbergelson mentioned this issue Feb 13, 2023

Add a better error message when the reference is too long for indexing samtools/htsjdk#1651

Open

WimSpee mentioned this issue Mar 16, 2023

Support for loading CRAM + CRAI on large reference genomes. Instead of IndexOutOfBound igvteam/igv#1300

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

java.lang.ArrayIndexOutOfBoundsException: Index 32770 out of bounds for length 32770 #8192

java.lang.ArrayIndexOutOfBoundsException: Index 32770 out of bounds for length 32770 #8192

shuaiwang2 commented Feb 8, 2023

lbergelson commented Feb 8, 2023

shuaiwang2 commented Feb 9, 2023

lbergelson commented Feb 13, 2023

Dentalium commented Dec 5, 2024

shuaiwang2 commented Dec 5, 2024

gokalpcelik commented Dec 5, 2024

java.lang.ArrayIndexOutOfBoundsException: Index 32770 out of bounds for length 32770 #8192

java.lang.ArrayIndexOutOfBoundsException: Index 32770 out of bounds for length 32770 #8192

Comments

shuaiwang2 commented Feb 8, 2023

lbergelson commented Feb 8, 2023

shuaiwang2 commented Feb 9, 2023

lbergelson commented Feb 13, 2023

Dentalium commented Dec 5, 2024

shuaiwang2 commented Dec 5, 2024

gokalpcelik commented Dec 5, 2024