salmonIndex seq name from genes_bed2fasta #1076

@hanrong498

Hi!

Sorry for posting so many issues over the past few days :) I ran into an error from salmon index:

[2024-10-29 13:22:09.436] [puff::index::jointLog] [info] Running fixFasta
[2024-10-29 13:22:09.444] [puff::index::jointLog] [error] In FixFasta, two references with the same name but different sequences: RefSeq. We require that all input records have a unique name up to the first whitespace (or user-provided separator) character.

The problem turns out to be in genes.fa, where multiple records share the same sequence name:

head /scratch/hhu/Xenopus_laevis_v10_1_lambda_spikein/annotation/genes.original.fa
>RefSeq
acaaactacagctcccagcaaccCTTTGCCACCTCGATAGCAAGAAATGTAACAGTTCTTTCAGTGCAACTGAACTCCAAGCTATTAAACTAG
>RefSeq
TTGAGCCACCCACATCATGGACTTTGCCCCTGAGGGCAGATCAGACCCGACAGAGGGCTTATGGGTTAAATAAATCACCTATTGCactaaa
..
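A quick way to confirm this kind of collision before running salmon index is to list the FASTA names that occur more than once. The sketch below is only an illustration: it builds a small temporary FASTA with made-up sequences (the `>other` record is hypothetical) instead of reading the real genes.fa, and it compares names up to the first whitespace, which is the uniqueness rule quoted in the error message.

```shell
# Toy FASTA mimicking the duplicated headers above (sequences are made up).
fasta=$(mktemp)
printf '>RefSeq\nACGT\n>RefSeq\nTTGA\n>other\nGGCC\n' > "$fasta"

# Keep header lines, take the name up to the first whitespace without '>',
# then report every name that appears more than once.
dups=$(grep '^>' "$fasta" | awk '{print substr($1, 2)}' | sort | uniq -d)
printf '%s\n' "$dups"

rm -f "$fasta"
```

Pointing the same pipeline at the real genes.fa instead of the temporary file should print every name that salmon would reject as a duplicate.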

I think the cause is this command in genes_bed2fasta:

bedtools getfasta -name -s -split -fi /scratch/hhu/Xenopus_laevis_v10_1_lambda_spikein/genome_fasta/genome.fa -bed <(cat /scratch/hhu/Xenopus_laevis_v10_1_lambda_spikein/annotation/genes.bed | cut -f1-12) | sed 's/(.*)//g' | sed 's/:.*//g' > annotation/genes.fa 2> annotation/logs/bed2fasta.log

It cannot handle gene names that contain a colon, because sed 's/:.*//g' removes everything from the first colon onward. For example:

Chr4L   15610   37088   RefSeq:XR_005966836.1   .       -       15610   37088   255,0,0 4       144,83,50,796   0,446,2433,20683
Chr4L   40727   57680   RefSeq:XM_041589621.1   .       +       40727   57680   255,0,0 9       283,98,133,116,111,101,102,139,278      0,1109,6513,9213,9409,11052,12648,16142,16676
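The collapse can be reproduced on the two names above without running bedtools. This is a minimal sketch under assumptions: the header suffix (`::Chr4L:15610-37088(-)`) is a guess at what `bedtools getfasta -name -s` emits, since the exact format differs between bedtools versions, and the `s/::.*//` variant is only one possible workaround, not the pipeline's actual fix.

```shell
# Hypothetical headers; the suffix after the name depends on the bedtools version.
headers='>RefSeq:XR_005966836.1::Chr4L:15610-37088(-)
>RefSeq:XM_041589621.1::Chr4L:40727-57680(+)'

# Cleanup as in the command above: the second sed cuts at the FIRST colon,
# so the accession after "RefSeq:" is lost.
current=$(printf '%s\n' "$headers" | sed 's/(.*)//g' | sed 's/:.*//g')

# Possible alternative (assumption): cut only at the double colon that
# separates the name from the coordinates, keeping single colons in names.
alternative=$(printf '%s\n' "$headers" | sed 's/(.*)//g' | sed 's/::.*//')

printf '%s\n' "$current"
printf '%s\n' "$alternative"
```

With these inputs the first pipeline prints `>RefSeq` twice, while the alternative keeps `>RefSeq:XR_005966836.1` and `>RefSeq:XM_041589621.1` distinct. If the installed bedtools emits no `::` separator, the alternative leaves the names unchanged apart from the strand suffix.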

This could also affect other users whose annotations come from less common GTF files for other species.

Thanks a lot!
Hanrong
