Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

insertion.vcf output #38

Open
mrb20045 opened this issue Jul 19, 2023 · 6 comments
Open

insertion.vcf output #38

mrb20045 opened this issue Jul 19, 2023 · 6 comments

Comments

@mrb20045
Copy link

Hi
I ran the pipeline successfully and insertions.vcf file is obtained after place-finish step. But, I can not find any RP (reverse placed) and all the lines are FP (forward placed). So is it OK? Could you please explain this file in details or share some links etc.

Regards

@Krannich479
Copy link
Collaborator

Hi @mrb20045 ,
I looked into the code and found that 'FP' is actually a constant string in the module that writes the VCF records.
I.e. no matter how the insertion is actually placed, there will always be 'FP' written in the variant's record. This is legacy code from popins version 1, at this point I don't precisely understand the design decision why it was implemented like this. I checked the VCF file format specification but couldn't find anything about 'FP' and 'RP' being descriptive for forward and reverse placed, respectively.
But you can still identify the orientation of the variant placement in the VCF record. It is encoded in the square brackets of the ALT field. I couldn't possibly explain it any better than the format specification itself, please see the format specification at chapter 5.4.

Let me know if that helps for your case.


I leave the corresponding code line here for reference

outStream << "\t" << loc.loc.chr << ":" << loc.refPos + 1 << ":" << "FP";

@mrb20045
Copy link
Author

I appreciate you taking the time to explain that. Could you please correct me if I am wrong about this example. For example I found these lines for jcf7180000267774_S115 contig. Based on the ALT field both ends of this contig is placed in the genome. Actually it is located between A....T from 8713105 to 8713437 of NC_049916.1 in the genome. This contig is located in the reverse strand of the genome.

NC_049916.1 8713105 NC_049916.1:8713105:FP A A[jcf7180000267774_S115r:122[ . . AR=197;AS=0.970443;GS=3

NC_049916.1 8713437 NC_049916.1:8713437:FP T ]jcf7180000267774_S115r:435]T . . AR=9;AS=0.5;GS=1

@Krannich479
Copy link
Collaborator

Given your two lines of results, all your above statements are true ✅ .
Your reliable source to determine the orientation here is "r" after contig identifier.

I have to look into the code that prints the square brackets and verify the compliance with the VCF format specifications.
I'll leave this issue open until then.

@mrb20045
Copy link
Author

Thanks, just two questions:

  1. There are more than 2 results for some contigs. I want to report the one with more AS. Do you have any better suggestion?

For example:

NC_028617.1 33639 NC_028617.1:33639:FP C C[jcf7180000052677_S170r:152[ . . AR=426;AS=0.391544;GS=1
NC_049906.1 35668606 NC_049906.1:35668606:FP A ]jcf7180000052677_S170r:11890]A . . AR=10798;AS=0.826356;GS=3
NW_023361349.1 15096 NW_023361349.1:15096:FP G ]jcf7180000052677_S170f:11915]G . . AR=374;AS=0.34375;GS=1

  1. Also, for some lines there is not a number after f/r in ALT filed. I know what is the number but why it is not reported for some results like:

NC_049915.1 14415726 NC_049915.1:14415726:FP A ]jcf7180000052790_S170f]A . . AR=18;AS=0.439024

@Krannich479
Copy link
Collaborator

Hi again,

  1. Given that all your chromosome identifiers differ it is hard to tell what's going on there. Are the chromosomes from a draft genome (i.e. no fully telomere-to-telomere assembled genome)? Or decoy sequences? I can understand why you'd like to report only one record per breakend and using the AS seems reasonable from a computational standpoint. But for your example above I suppose the biological meaningfulness is limited in terms of reporting a completely placed INS.
  2. I'll look into this. Tbd

@mrb20045
Copy link
Author

mrb20045 commented Jul 21, 2023

My reference genome is not at chromosome levels and it is scaffolds.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants