REF and Reference do not match, is this expected? #411

Open
djbradshaw2 opened this issue Jun 5, 2024 · 4 comments
Comments

@djbradshaw2

Dear Gubbins Creators,

Thanks for such a great tool! I wanted to check on an observation that I made looking through the *.summary_of_snp_distribution.vcf file. There seems to be a difference in the number of SNPs between the REF column and the Reference column that have matching nucleotides (only 151,919/153,743 SNPs match between the columns). Is this an expected result?

This is despite the *.per_branch_statistics.csv stating that the Reference had no SNPs...
gubbins.per_branch_statistics.csv

Please let me know if you'd like me to email the vcf file; it is 1.12 GB. Please let me know if you need any other information or have any other questions.

Thanks for your time and help.

Sincerely,

David

Gubbins Version: 3.3.0

Scripts:
snippy-clean_full_aln core.full.aln > SX519_Chromosomal_Ref_clean.core.full.aln

run_gubbins.py --first-tree-builder rapidnj --first-model JC -p gubbins -c 32 -v SX519_Chromosomal_Ref_clean.core.full.aln
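
For anyone wanting to reproduce the REF versus Reference comparison described above, here is a minimal sketch. It assumes the VCF is tab-separated, has a `#CHROM` header line naming each sample column, and stores a single nucleotide in each sample column; `count_ref_matches` is a hypothetical helper name, not part of Gubbins.

```shell
# Hypothetical helper: report how many rows have the same value in the
# REF column and in the sample column named by $2.
# Assumes a tab-separated VCF whose sample columns hold bare nucleotides.
count_ref_matches() {
    awk -F'\t' -v sample="$2" '
        /^##/ { next }                      # skip meta-information lines
        /^#CHROM/ {                         # locate the two columns by name
            for (i = 1; i <= NF; i++) {
                if ($i == "REF") r = i
                if ($i == sample) s = i
            }
            next
        }
        r && s {                            # data rows, once both columns are found
            total++
            if ($r != $s) mismatch++
        }
        END { printf "%d/%d\n", total - mismatch, total }
    ' "$1"
}
```

Called as `count_ref_matches gubbins.summary_of_snp_distribution.vcf Reference`, it prints the count as matching/total.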

@nickjcroucher
Owner

Do you have a specific example I could take a look at? Thanks!

@djbradshaw2
Author

Hi Dr. Croucher,

Thanks again for helping me! Please see attached the full set of SNPs for my dataset, with just the CHROM, POS, ID, REF, ALT, Reference, and example isolate (a short-read version of the reference) columns, plus columns to sort on to see which SNPs do not match between REF and Reference, and between REF and the example isolate. Please let me know if you would like any other information.

Thanks for your time and help,

Sincerely,

David

gubbins_all_SNPs_Refs.txt

@djbradshaw2
Author

Hi Dr. Croucher,

Sorry, I was curious if you have had a chance to determine if this was an expected result or not (having REF and Reference not matching in the vcf file)? Please let me know if you need any additional information. I have run into the same issue with a subsetted version of these isolates.

Thank you again for your time and help.

Sincerely,

David

@m6thu

m6thu commented Nov 19, 2024

Hi David,

When Gubbins creates a vcf, it uses the first sample listed in the aln file as the REF column. Snippy, however, likes to put what you feed it as Reference last.
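
A quick way to confirm the ordering before rerunning is to print the first and last sequence names in the alignment. This is only a sketch; `first_and_last_headers` is a hypothetical helper name, and it assumes a standard FASTA alignment where every record starts with a `>` header line.

```shell
# Hypothetical helper: show the first and last FASTA headers in an alignment,
# to see which sample would end up in the REF column.
first_and_last_headers() {
    awk '
        /^>/ {
            if (first == "") first = $0   # remember the first header seen
            last = $0                     # keep overwriting until the end
        }
        END {
            print "first: " first
            print "last: " last
        }
    ' "$1"
}
```

For example, `first_and_last_headers SX519_Chromosomal_Ref_clean.core.full.aln`.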

Try putting the Reference sequence first, something along the lines of

snippy-clean_full_aln core.full.aln > snippy-clean.aln

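# Count the sequences so the last one (the Reference) can be picked out by position.
sequence_count=$(grep -c "^>" snippy-clean.aln)

# Write the last record first...
awk -v last="$sequence_count" '/^>/{n++} n == last' snippy-clean.aln > SX519_Chromosomal_Ref_clean.core.full.aln

# ...then append every record before it, in the original order.
awk -v last="$sequence_count" '/^>/{n++} n == last {exit} {print}' snippy-clean.aln >> SX519_Chromosomal_Ref_clean.core.full.aln

unset sequence_count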

run_gubbins.py --first-tree-builder rapidnj --first-model JC -p gubbins -c 32 -v SX519_Chromosomal_Ref_clean.core.full.aln
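
If the reference turns out not to be the last record, a variant that selects it by name rather than by position may be safer. The sketch below assumes the header line is exactly `>Reference` with nothing after the name; `ref_first` is a hypothetical helper, not part of Snippy or Gubbins.

```shell
# Hypothetical helper: move the FASTA record named "$1" to the front of
# the alignment "$2", leaving the other records in their original order.
# Assumes the header is exactly ">NAME" (no description after the name).
ref_first() {
    ref_name=$1
    aln=$2
    # Pass 1: print only the record whose header matches.
    awk -v ref=">$ref_name" '/^>/{keep = ($0 == ref)} keep' "$aln"
    # Pass 2: print every other record.
    awk -v ref=">$ref_name" '/^>/{keep = ($0 != ref)} keep' "$aln"
}
```

Usage would be along the lines of `ref_first Reference snippy-clean.aln > SX519_Chromosomal_Ref_clean.core.full.aln`, followed by the same `run_gubbins.py` command.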
