Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hap.py benchmarking #896

Open
alisamatisse opened this issue Oct 18, 2024 · 5 comments
Open

hap.py benchmarking #896

alisamatisse opened this issue Oct 18, 2024 · 5 comments

Comments

@alisamatisse
Copy link

Hello,

Sorry for writing again about some questions in the analysis, I don't have any expertise on DeepVariant/variant benchmarking around me :(

I wanted to benchmark identified variants using hap.py as you suggest in the tutorial, and the output surprised me a bit. For the variants I called the recall/precision were from 0 to 0.4, which is very low... I did not change default pacbio parameters when running DeepVariant, except for asking to keep supplementary alignments. Maybe it is partially because of the complexity of HLA region, but I am not sure what is the reason. What do you think, is there something obviously wrong?

singularity exec --bind /usr/lib/locale/ \
docker://google/deepvariant:${BIN_VERSION} \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type PACBIO \
    --ref $REFERENCE \
    --reads $BAM_FILE \
    --make_examples_extra_args=keep_supplementary_alignments=true \
    --output_vcf  $VCF_FILE \
    --num_shards 12 \
    --regions chr6:32541543-32701886

I am using version 1.6.1.

Kind regards,
Alisa

@lucasbrambrink
Copy link
Collaborator

Great question, always happy to help!

Setting keep_supplementary_alignments=true shouldn't dramatically change the accuracy.

Is it possible for you to share the BAM and VCF/truth set used? One possibility is that the truth set doesn't match HG003. For example if you run happy for HG003 using HG002 truth set the accuracy would be around 0.4.

@jrharting
Copy link

That particular region in HLA is likely to have multiple different alleles mapped to it due to the diversity of structural variation in the DRB loci. Hg38 (no_alts) only has DRB5 in chr6. If your sample has DRB1 alleles that are linked to DRB3/4, these reads often map here.

(see fig 4) https://pmc.ncbi.nlm.nih.gov/articles/PMC10849470/

@alisamatisse
Copy link
Author

alisamatisse commented Oct 25, 2024

Hi Lucas and John,

Thank you so much for your input. I’m currently trying to figure out what’s happening… with different references..!

Kind regards,
Alisa

@alisamatisse
Copy link
Author

Hey, sorry to reopen the issue. I called variants against HLA class II genes region for HG003 using HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed

and have very low precision values :( probably I am doing wrong variant calling in the HLA...
also for some reason my coverage for this region is also very low, although WG is okay. I am not sure if it was reported elsewhere.

sorry for dumb questions, I am trying to figure this out on my own (and with AI)

Image

@alisamatisse alisamatisse reopened this Mar 7, 2025
@kishwarshafin
Copy link
Collaborator

@alisamatisse ,

You need to provide the HLA bed with -T option in hap.py cause it seems like it's comparing against everything. Can you paste your hap.py command here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants