format of input files: java.lang.NumberFormatException: For input string: #24

Open
zeneofa opened this issue Jan 12, 2015 · 7 comments

zeneofa commented Jan 12, 2015

Hi,

I am trying to compare a set of VCF files to a set of confirmed SNPs from a Genome in a Bottle database. I do not have access to the raw FASTQ files, so I am unsure which filters were applied during mapping. I only have a set of BAM files, VCF files and a BED region file, so I also don't know what post-mapping alterations have been performed.

I have tried to run:

java -jar ~/Downloads/bcbio.variation-0.2.1-standalone.jar variant-compare ref-grading.yaml

where my ref-grading.yaml file contains the following:

dir:
  out: grading
  prep: grading/prep
experiments:
  - sample: NA00001
    ref: /export/home/pjones/bcbio/genomes/Hsapiens/hg19/seq/hg19.fa
    intervals: ref.bed
    summary-level: quick
    approach: grade
    calls:
      - name: reference
        file: ref.vcf
        remove-refcalls: true
      - name: case1
        prep: true
        preclean: true
        remove-refcalls: true
        file: case1.vcf
        intervals: ref.bed

I get the following error (I am not familiar with Java, though):

2015-01-12 16:48:18,299 [INFO ] MLog clients using log4j logging.
2015-01-12 16:48:18,760 [INFO ] State :begin :: {:desc "Starting variation analysis"}
2015-01-12 16:48:18,788 [INFO ] State :clean :: {:desc "Cleaning input VCF: reference"}
2015-01-12 16:48:18,789 [INFO ] State :merge :: {:desc "Merging multiple input files: reference"}
2015-01-12 16:48:18,790 [INFO ] State :prep :: {:desc "Prepare VCF, resorting to genome build: reference"}
"ava.lang.NumberFormatException: For input string: "14596
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at bcbio.align.ref$prep_bedline_sort$fn__1333.invoke(ref.clj:85)
at bcbio.align.ref$sort_bed_file$fn__1338$fn__1339$fn__1344.invoke(ref.clj:98)
at clojure.core$sort_by$fn__4299.invoke(core.clj:2769)
at clojure.lang.AFunction.compare(AFunction.java:49)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:324)
at java.util.TimSort.sort(TimSort.java:203)
at java.util.TimSort.sort(TimSort.java:173)
at java.util.Arrays.sort(Arrays.java:659)
at clojure.core$sort.invoke(core.clj:2754)
at clojure.core$sort_by.invoke(core.clj:2769)
at clojure.core$sort_by.invoke(core.clj:2767)
at bcbio.align.ref$sort_bed_file$fn__1338$fn__1339.invoke(ref.clj:99)
at bcbio.align.ref$sort_bed_file$fn__1338.invoke(ref.clj:97)
at bcbio.align.ref$sort_bed_file.invoke(ref.clj:96)
at bcbio.run.broad$gatk_cl_intersect_intervals$fn__1816.invoke(broad.clj:56)
at clojure.core$map$fn__4207.invoke(core.clj:2487)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:60)
at clojure.lang.RT.seq(RT.java:484)
at clojure.core$seq.invoke(core.clj:133)
at clojure.core$map$fn__4207.invoke(core.clj:2479)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:60)
at clojure.lang.RT.seq(RT.java:484)
at clojure.core$seq.invoke(core.clj:133)
at clojure.core$tree_seq$walk__4647$fn__4648.invoke(core.clj:4475)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:60)
at clojure.lang.LazySeq.more(LazySeq.java:96)
at clojure.lang.RT.more(RT.java:607)
at clojure.core$rest.invoke(core.clj:73)
at clojure.core$flatten.invoke(core.clj:6478)
at bcbio.run.broad$gatk_cl_intersect_intervals.doInvoke(broad.clj:56)
at clojure.lang.RestFn.invoke(RestFn.java:425)
at bcbio.variation.filter.intervals$select_by_sample.doInvoke(intervals.clj:56)
at clojure.lang.RestFn.invoke(RestFn.java:846)
at bcbio.variation.combine$dirty_prep_work$run_sample_select__1157.invoke(combine.clj:140)
at bcbio.variation.combine$dirty_prep_work.invoke(combine.clj:155)
at bcbio.variation.combine$gatk_normalize.invoke(combine.clj:187)
at bcbio.variation.compare$prepare_vcf_calls$fn__7526.invoke(compare.clj:120)
at clojure.core$map$fn__4207.invoke(core.clj:2487)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:60)
at clojure.lang.RT.seq(RT.java:484)
at clojure.lang.LazilyPersistentVector.create(LazilyPersistentVector.java:31)
at clojure.core$vec.invoke(core.clj:354)
at bcbio.variation.compare$prepare_vcf_calls.invoke(compare.clj:121)
at bcbio.variation.compare$variant_comparison_from_config$iter__7582__7586$fn__7587.invoke(compare.clj:255)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:60)
at clojure.lang.RT.seq(RT.java:484)
at clojure.core$seq.invoke(core.clj:133)
at clojure.core$tree_seq$walk__4647$fn__4648.invoke(core.clj:4475)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:60)
at clojure.lang.LazySeq.more(LazySeq.java:96)
at clojure.lang.RT.more(RT.java:607)
at clojure.core$rest.invoke(core.clj:73)
at clojure.core$flatten.invoke(core.clj:6478)
at bcbio.variation.compare$variant_comparison_from_config.invoke(compare.clj:254)
at bcbio.variation.compare$_main.invoke(compare.clj:274)
at clojure.lang.AFn.applyToHelper(AFn.java:161)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:617)
at bcbio.variation.core$_main.doInvoke(core.clj:35)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at bcbio.variation.core.main(Unknown Source)

I have no idea how to start debugging this. Is there some input file format requirement that I am not aware of? Must my reference.fa be truncated to the same chromosomes as indicated in the BED file?

My aim: to get a good estimate of the false positive/negative rate, as well as possible factors influencing these (such as coverage, entropy of neighbouring regions, mapping quality etc.).

Additional information:
From the header of the VCF file, the reference appears to be hg19 UCSC (which is what I used). It also appears that the additional chromosomes have been removed from the header and the call list in the VCF file (i.e. only chr1-22 + X + Y). The ref.vcf and BED file were downloaded and appear to have the same UCSC naming convention. My reference is indexed and there exists a GATK dictionary file. Java version: JDK 1.7.0_45. OS: CentOS, on a cluster with a Lustre file system.

Kind Regards,
Piet Jones

@chapmanb
Owner

Piet;
Thanks for trying out bcbio.variation and for the very complete report. It looks like something is unexpected with your bed file ref.bed. Specifically, do the start/end coordinates in the file contain quotes around them? It looks like we're complaining about "14596 being present as either the start or end of one of the lines. If you clean that up, hopefully it'll continue without any issues and get you the comparison info. Hope this helps.

@zeneofa
Author

zeneofa commented Jan 12, 2015

Hi Brad,

Thanks for the very quick reply. Getting my feet wet with variant calling
atm :)

I have grep'ed every possible file for a quote followed by that number, but
found nothing. I have also grep'ed without the quote and ensured that what
looked like spaces are actually tabs. But still nothing...

P

@chapmanb
Owner

Piet;
Would you be able to provide your BED file input as a Gist (https://gist.github.com/) or send to me directly? Maybe we'll be able to figure out the underlying issue by looking at it. Sorry to not have any better ideas right now but hopefully this'll help us get things running for you.

@zeneofa
Author

zeneofa commented Jan 14, 2015

Hi Brad,

Unfortunately I can't share the bed file, the data I am using does not
belong to me and I don't have permission to share it :(

Is there a specific BED format that is required, BED6 or BED12? My current
BED file contains only the first three columns, and grep reveals that the
offending line could be line 2 (i.e. it contains the 1456). Is a header also
required for the BED format?

Sorry about the inconvenience with the file sharing.

Kind Regards,
Piet Jones

@chapmanb
Owner

Piet;
bcbio doesn't have any special requirements for headers or columns. Where it is failing it is only trying to split the line by tabs and then take the first 3 columns, then turn the start and end coordinates into integers. I can't do much without being able to see the file but guessing: if there are strange line endings or other non-standard characters in there, maybe that is what is causing the issue. Hope this helps some.
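
The parsing described above (split on tabs, take the first three columns, convert start and end to integers) can be mimicked with a small check that reports which fields would fail a strict integer conversion. This is only a sketch in Python, not bcbio.variation's own code; the function name `find_bad_bed_lines` and the sample lines are made up for illustration.

```python
def find_bad_bed_lines(lines):
    """Yield (line_number, raw_field) for start/end fields that are not plain integers."""
    for line_number, line in enumerate(lines, start=1):
        # Remove only the "\n" terminator so that a stray "\r" stays visible.
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # too few columns (e.g. a track line); not checked here
        for raw in fields[1:3]:
            # Strict check: ASCII digits only, no surrounding whitespace.
            if not (raw.isascii() and raw.isdigit()):
                yield line_number, raw


# Hypothetical BED lines with Windows-style (CRLF) line endings.
sample_lines = ["chr1\t100\t200\r\n", "chr1\t14000\t14596\r\n"]
bad = list(find_bad_bed_lines(sample_lines))
for line_number, raw in bad:
    # repr() makes invisible characters such as "\r" show up.
    print(line_number, repr(raw))
```

When running this against a real file, open it with `open(path, newline="")`. Python's default text mode translates `\r\n` to `\n`, which would hide exactly the kind of character being looked for.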

@zeneofa
Author

zeneofa commented Jan 15, 2015

Hi Brad,

Solved the problem: I parsed the BED file with a Python script (removing
newlines and splitting the respective lines), which removed the offending
item. Now, however, I get this:

INFO 13:52:38,856 HelpFormatter - Date/Time: 2015/01/15 13:52:38
INFO 13:52:38,856 HelpFormatter -
INFO 13:52:38,856 HelpFormatter -
INFO 13:52:39,702 GenomeAnalysisEngine - Strictness is SILENT
INFO 13:52:39,768 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
WARN 13:52:39,786 FSLockWithShared$LockAcquisitionTask - WARNING: Unable to lock file /lustre/SCRATCH5/users/pjones/data_files/bcbio-variation/grading/prep/NA00001-case1-nomnp-nosv.vcf.idx because an IOException occurred with message: Function not implemented.
INFO 13:52:39,788 RMDTrackBuilder - Could not acquire a shared lock on index file /lustre/SCRATCH5/users/pjones/data_files/bcbio-variation/grading/prep/NA00001-case1-nomnp-nosv.vcf.idx, falling back to using an in-memory index for this GATK run.
WARN 13:52:41,002 FSLockWithShared$LockAcquisitionTask - WARNING: Unable to lock file /lustre/SCRATCH5/users/pjones/data_files/bcbio-variation/grading/prep/NA00001-reference-nomnp-nosv.vcf.idx because an IOException occurred with message: Function not implemented.
INFO 13:52:41,003 RMDTrackBuilder - Could not acquire a shared lock on index file /lustre/SCRATCH5/users/pjones/data_files/bcbio-variation/grading/prep/NA00001-reference-nomnp-nosv.vcf.idx, falling back to using an in-memory index for this GATK run.
INFO 13:52:43,298 IntervalUtils - Processing 64190747 bp from intervals
INFO 13:52:43,370 GenomeAnalysisEngine - Preparing for traversal
INFO 13:52:43,421 GenomeAnalysisEngine - Done preparing for traversal
INFO 13:52:43,421 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 13:52:43,421 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 13:52:43,421 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
org.broadinstitute.gatk.utils.exceptions.UserException$BadInput: Bad input: Samples entered on command line (through -sf or -sn) that are not present in the VCF.

A list of these samples:

NA00001

To ignore these samples, run with --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES
at org.broadinstitute.gatk.tools.walkers.variantutils.SelectVariants.initialize(SelectVariants.java:365)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at bcbio.run.broad$run_gatk$fn__1805.invoke(broad.clj:34)
at bcbio.run.broad$run_gatk.invoke(broad.clj:31)

@chapmanb
Owner

Piet;
Thanks much for following up and for the details about the line endings. I pushed a fix which should handle this for future files by stripping off stray whitespace.
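
For files that still trip up an older release, the same idea (strip stray whitespace from each field before converting it) can be applied as a pre-processing step. The following is a rough Python analogue and an assumption about the general approach, not the actual Clojure change; `clean_bed_line` is a hypothetical helper name.

```python
def clean_bed_line(line):
    """Return (chrom, start, end) with stray whitespace stripped, or None to skip the line."""
    # strip() removes spaces, "\r" and "\n" around each tab-separated field.
    fields = [field.strip() for field in line.split("\t")]
    if len(fields) < 3 or not fields[0]:
        return None  # blank line or a line without three tab-separated columns
    return fields[0], int(fields[1]), int(fields[2])


# Hypothetical input: CRLF endings, an empty line, and padded coordinates.
raw_lines = ["chr1\t14000\t14596\r\n", "\r\n", "chr2\t 10\t20 \n"]
cleaned = [rec for rec in map(clean_bed_line, raw_lines) if rec is not None]
for chrom, start, end in cleaned:
    # Re-emit a normalized three-column BED line.
    print(f"{chrom}\t{start}\t{end}")
```

Only the first three columns are kept here, which matches what the comparison step is described as using; any extra BED columns would need to be carried through separately.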

For your second problem, it looks like you used the example sample name (NA00001) in the input YAML, where you probably want it to match the actual sample names in the VCF files. If you want, bcbio.variation can fix that for you by setting fix-sample-header: true:

https://github.com/chapmanb/bcbio.variation#configuration-file

Hope this helps get you going and thanks again for all the help debugging this.
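
To see which sample names a VCF actually carries before choosing the `sample` value in the YAML, the `#CHROM` header line can be read directly: in the VCF format the first eight columns are fixed, the ninth is FORMAT, and any remaining columns are sample names. A minimal sketch in Python, with a made-up header and the hypothetical sample name `SAMPLE_A`:

```python
def vcf_sample_names(lines):
    """Return the sample names from the #CHROM header line of a VCF, or [] if absent."""
    for line in lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\r\n").split("\t")
            # Columns 1-8 are fixed, column 9 is FORMAT, samples follow.
            return columns[9:]
        if not line.startswith("#"):
            break  # reached the variant records without finding a header line
    return []


# Hypothetical VCF header; SAMPLE_A is an invented sample name.
header = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE_A\n",
]
names = vcf_sample_names(header)
configured = "NA00001"  # the sample value from the YAML in this thread
print(names, configured in names)
```

If the configured name is not in the list, either the YAML or the VCF header needs to change, which is the mismatch GATK reported above.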
