Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fusion-making .py script does not take into account gene orientation #26

Open
nvolkovaGEL opened this issue Jun 11, 2020 · 8 comments
Open

Comments

@nvolkovaGEL
Copy link

Thanks for creating this tool! I have noticed that the scripts/make_fusion_genes.py reports wrong order of exons for genes in '-' orientation, which results in considering some fusions as untranscribed and showing wrong fusion structure.

E.g. NTRK3 in cancer.hg38.csv:

NTRK3_ENST00000394480.6,chr15:87859751-88256768
1,88256644,88256747
2,88256284,88256486
3,88255906,88256168
4,88184225,88184299
...

Versus NTRK3 from the fusions.csv file generated by the script:

NTRK3_NM_002530,chr15:87859750-88256739
1,87859750,87877120
2,87880269,87880428
3,87929190,87929434
4,87933011,87933184
...

Here is a quick fix I introduced to make_fusion_genes.py:

Line 33:
_, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
Line 44:
_, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
And insert these after line 50:
if strand == '-':
exons = exons[::-1]

This returns the exons in the right order.

@sfchen
Copy link
Member

sfchen commented Jun 23, 2020

@ZKai0801

Can you please check and fix this issue?

@nvolkovaGEL
Copy link
Author

nvolkovaGEL commented Jun 26, 2020

Hi, just to add to this, I encountered one more bug in the fusion list making script: it picks up the wrong gene if the gene name is contained in some other gene name (like RET and RETREG1) and the relevant transcript is not supplied. Example:
grep RET refFlat.txt

RETREG1 NM_019000 chr5 - 16473037 16508998 16474740 16508612 7 16473037,16477661,16478033,16478849,16481008,16483345,16508577, 16475234,16477788,16478098,16478987,16481093,16483472,16508998,
RET NM_020975 chr10 + 43077068 43130349 43077258 43128269 20 43077068,43100458,43102341,43104951,43106375,43109030,43111206,43112098,43112852,43113555,43114479,43116583,43118372,43119530,43120080,43121945,43123670,43124882,43126574,43128111, 43077331,43100722,43102629,43105193,43106571,43109230,43111465,43112224,43112963,43113675,43114736,43116731,43118480,43119745,43120203,43122016,43123808,43124982,43126722,43130349,

and if I supply RET to current script, I'll get this:

RET_NM_001034850,chr5:16473037-16617058

I suspect this may be an issue for transcripts as well. I invented a quick fix, but someone more keen with regular expressions can just check if gene[0] or gene[1] plus tab symbol is in the line:

    if len(gene) == 1:
        transcripts = {}
        with open(refflat, "r") as fh:
            for line in fh:
                cur_gene, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
                if cur_gene != gene[0]:
                    continue
                transcripts[transcript] = (chrom, start, end, exonstart, exonend)
        transcript = get_longest_transcript(transcripts.keys(), refflat)
        chrom, start, end, exonstart, exonend  = transcripts[transcript]
 # use user-specified transcript
    elif len(gene) == 2:
        with open(refflat, "r") as fh:
            for line in fh:
                _, transcript, chrom, _, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
                if transcript != gene[1]:
                    continue
                break`

@ZKai0801
Copy link
Contributor

Thank you very much for reporting bugs and providing solutions! I have fixed the two bugs you mentioned as the way you suggested.
@sfchen please could you merge the pull request I just created?

sfchen added a commit that referenced this issue Jun 28, 2020
@nvolkovaGEL
Copy link
Author

Just found out that I did not fully fix the strand issue, should have added strand to the identification of the longest transcript as well:

    if len(gene) == 1:
        transcripts = {}
        with open(refflat, "r") as fh:
            for line in fh:
                cur_gene, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
                if cur_gene != gene[0]:
                        continue
                transcripts[transcript] = (chrom, strand, start, end, exonstart, exonend)
        transcript = get_longest_transcript(transcripts.keys(), refflat)
        chrom, strand, start, end, exonstart, exonend  = transcripts[transcript]

Now should be fully correct! Happy to help!

@sfchen
Copy link
Member

sfchen commented Jun 28, 2020

Just found out that I did not fully fix the strand issue, should have added strand to the identification of the longest transcript as well:

    if len(gene) == 1:
        transcripts = {}
        with open(refflat, "r") as fh:
            for line in fh:
                cur_gene, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
                if cur_gene != gene[0]:
                        continue
                transcripts[transcript] = (chrom, strand, start, end, exonstart, exonend)
        transcript = get_longest_transcript(transcripts.keys(), refflat)
        chrom, strand, start, end, exonstart, exonend  = transcripts[transcript]

Now should be fully correct! Happy to help!

You should submit another PR?

@ZKai0801
Copy link
Contributor

@nvolkovaGEL You are perfectly right, I should have more test before made commit.
Besides adding the amendment as you suggested, I added some error msgs to the code just in case people put wrong gene symbol/transcript to the input file.
@sfchen Could you please merge my second PR? sorry for the trouble

@sfchen
Copy link
Member

sfchen commented Jun 29, 2020

Merged.

@nvolkovaGEL
Copy link
Author

Sorry, should have just made a pull request, but couldn't get my locked local machine to synchronize properly with github. Thanks, please feel free to close this ticket!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants