-
Notifications
You must be signed in to change notification settings - Fork 50
Description
Is there anything available for inferring the FASTA record type from the sequence?
In earlier versions of FASTX I think this was done by default, and all of the records read in by Readers
were returned as variants of LongSequence
rather than strings. Now the same functionality is available optionally if you specify the return type when calling FASTX.sequence({desired_return_type}, record)
What I'm looking for is something along the lines of
FASTX.sequence(BioSequences.LongSequence, record) -> Union{LongDNA, LongRNA, LongAA}
#or BioSequences would be a better home for the inference
BioSequences.LongSequence(FASTX.sequence(record)) -> Union{LongDNA, LongRNA, LongAA}
Expected Behavior
Ambiguous interpretations lead to errors
julia> x = FASTX.FASTA.Record("identifier" , "AAAA")
FASTX.FASTA.Record:
description: "identifier"
sequence: "AAAA"
julia> FASTX.sequence(BioSequences.LongDNA{4}, x)
4nt DNA Sequence:
AAAA
julia> FASTX.sequence(BioSequences.LongRNA{4}, x)
4nt RNA Sequence:
AAAA
julia> FASTX.sequence(BioSequences.LongAA, x)
4aa Amino Acid Sequence:
AAAA
julia> FASTX.sequence(BioSequences.LongSequence, x)
Error -> ambiguous sequence type
unambiguous interpretations lead to auto-inferred sequence types
julia> x = FASTX.FASTA.Record("identifier" , "AAAAJ")
FASTX.FASTA.Record:
description: "identifier"
sequence: "AAAAJ"
julia> FASTX.sequence(BioSequences.LongDNA{4}, x)
ERROR: Cannot encode 0x4a to BioSequences.DNAAlphabet{4}()
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] throw_encode_error(A::BioSequences.DNAAlphabet{4}, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64)
@ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:156
[3] encode_chunk
@ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:168 [inlined]
[4] copyto!(dst::BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, doff::Int64, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64, N::Int64, #unused#::BioSequences.AsciiAlphabet)
@ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:298
[5] copyto!
@ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:235 [inlined]
[6] BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}(src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, part::UnitRange{Int64})
@ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:85
[7] LongSequence
@ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:83 [inlined]
[8] sequence(::Type{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}}, record::FASTX.FASTA.Record, part::UnitRange{Int64}) (repeats 2 times)
@ FASTX.BioSequencesExt /opt/julia/packages/FASTX/gzHTk/ext/BioSequencesExt.jl:21
[9] top-level scope
@ REPL[21]:1
julia> FASTX.sequence(BioSequences.LongRNA{4}, x)
ERROR: Cannot encode 0x4a to BioSequences.RNAAlphabet{4}()
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] throw_encode_error(A::BioSequences.RNAAlphabet{4}, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64)
@ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:156
[3] encode_chunk
@ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:168 [inlined]
[4] copyto!(dst::BioSequences.LongSequence{BioSequences.RNAAlphabet{4}}, doff::Int64, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64, N::Int64, #unused#::BioSequences.AsciiAlphabet)
@ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:298
[5] copyto!
@ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:235 [inlined]
[6] BioSequences.LongSequence{BioSequences.RNAAlphabet{4}}(src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, part::UnitRange{Int64})
@ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:85
[7] LongSequence
@ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:83 [inlined]
[8] sequence(::Type{BioSequences.LongSequence{BioSequences.RNAAlphabet{4}}}, record::FASTX.FASTA.Record, part::UnitRange{Int64}) (repeats 2 times)
@ FASTX.BioSequencesExt /opt/julia/packages/FASTX/gzHTk/ext/BioSequencesExt.jl:21
[9] top-level scope
@ REPL[22]:1
julia> FASTX.sequence(BioSequences.LongAA, x)
5aa Amino Acid Sequence:
AAAAJ
julia> FASTX.sequence(BioSequences.LongSequence, x) #can only be an AA seq, per the available alphabets
5aa Amino Acid Sequence:
AAAAJ
Current Behavior
Can't use a generic LongSequence for any record
julia> FASTX.sequence(BioSequences.LongSequence, x)
ERROR: MethodError: no method matching BioSequences.LongSequence(::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true})
Closest candidates are:
BioSequences.LongSequence(::Kmers.Kmer{A, K, N}) where {A, K, N} at /opt/julia/packages/Kmers/7SNBQ/src/kmer.jl:492
BioSequences.LongSequence(::BioSequences.LongSequence, ::AbstractUnitRange{var"#s29"} where var"#s29"<:Integer) at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:49
BioSequences.LongSequence(::BioSequences.LongSubSeq{A}) where A at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/seqview.jl:76
...
Stacktrace:
[1] sequence(::Type{BioSequences.LongSequence}, record::FASTX.FASTA.Record, part::UnitRange{Int64}) (repeats 2 times)
@ FASTX.BioSequencesExt /opt/julia/packages/FASTX/gzHTk/ext/BioSequencesExt.jl:21
[2] top-level scope
@ REPL[19]:1
julia> BioSequences.LongSequence(FASTX.sequence(x))
ERROR: MethodError: no method matching BioSequences.LongSequence(::StringViews.StringView{SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}})
Closest candidates are:
BioSequences.LongSequence(::Kmers.Kmer{A, K, N}) where {A, K, N} at /opt/julia/packages/Kmers/7SNBQ/src/kmer.jl:492
BioSequences.LongSequence(::BioSequences.LongSequence, ::AbstractUnitRange{var"#s29"} where var"#s29"<:Integer) at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:49
BioSequences.LongSequence(::BioSequences.LongSubSeq{A}) where A at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/seqview.jl:76
...
Stacktrace:
[1] top-level scope
@ REPL[26]:1
Context
In addition to validating whether a FASTA is valid https://biojulia.github.io/FASTX.jl/latest/fasta/#FASTX.FASTA.validate_fasta it would be useful to have functionality to auto-infer the type of records in the FASTA
I'd need to think through the most logical way to check, but I think an order of operations to infer the best alphabet match might be like
DNA{2} -> RNA{2} -> DNA{4} -> RNA{4} -> AA
The AA alphabet (letter codes, not molecules) seems to be a superset of DNA/RNA alphabet, and the T/U difference I think is enough to differentiate between DNA/RNA
link to codes https://www.ddbj.nig.ac.jp/ddbj/code-e.html