Skip to content

Record type inference #268

@cjprybol

Description

@cjprybol

Is there anything available for inferring the FASTA record type from the sequence?

In earlier versions of FASTX I think this was done by default, and all of the records read in by Readers were returned as variants of LongSequence rather than strings. Now the same functionality is available optionally if you specify the return type when calling FASTX.sequence({desired_return_type}, record)

What I'm looking for is something along the lines of

FASTX.sequence(BioSequences.LongSequence, record) -> Union{LongDNA, LongRNA, LongAA}

#or BioSequences would be a better home for the inference
BioSequences.LongSequence(FASTX.sequence(record)) -> Union{LongDNA, LongRNA, LongAA}

Expected Behavior

Ambiguous interpretations lead to errors

julia> x = FASTX.FASTA.Record("identifier" , "AAAA")
FASTX.FASTA.Record:
  description: "identifier"
     sequence: "AAAA"

julia> FASTX.sequence(BioSequences.LongDNA{4}, x)
4nt DNA Sequence:
AAAA

julia> FASTX.sequence(BioSequences.LongRNA{4}, x)
4nt RNA Sequence:
AAAA

julia> FASTX.sequence(BioSequences.LongAA, x)
4aa Amino Acid Sequence:
AAAA

julia> FASTX.sequence(BioSequences.LongSequence, x)
Error -> ambiguous sequence type

unambiguous interpretations lead to auto-inferred sequence types

julia> x = FASTX.FASTA.Record("identifier" , "AAAAJ")
FASTX.FASTA.Record:
  description: "identifier"
     sequence: "AAAAJ"

julia> FASTX.sequence(BioSequences.LongDNA{4}, x)
ERROR: Cannot encode 0x4a to BioSequences.DNAAlphabet{4}()
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] throw_encode_error(A::BioSequences.DNAAlphabet{4}, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64)
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:156
 [3] encode_chunk
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:168 [inlined]
 [4] copyto!(dst::BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, doff::Int64, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64, N::Int64, #unused#::BioSequences.AsciiAlphabet)
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:298
 [5] copyto!
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:235 [inlined]
 [6] BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}(src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, part::UnitRange{Int64})
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:85
 [7] LongSequence
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:83 [inlined]
 [8] sequence(::Type{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}}, record::FASTX.FASTA.Record, part::UnitRange{Int64}) (repeats 2 times)
   @ FASTX.BioSequencesExt /opt/julia/packages/FASTX/gzHTk/ext/BioSequencesExt.jl:21
 [9] top-level scope
   @ REPL[21]:1

julia> FASTX.sequence(BioSequences.LongRNA{4}, x)
ERROR: Cannot encode 0x4a to BioSequences.RNAAlphabet{4}()
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] throw_encode_error(A::BioSequences.RNAAlphabet{4}, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64)
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:156
 [3] encode_chunk
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:168 [inlined]
 [4] copyto!(dst::BioSequences.LongSequence{BioSequences.RNAAlphabet{4}}, doff::Int64, src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, soff::Int64, N::Int64, #unused#::BioSequences.AsciiAlphabet)
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:298
 [5] copyto!
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/copying.jl:235 [inlined]
 [6] BioSequences.LongSequence{BioSequences.RNAAlphabet{4}}(src::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}, part::UnitRange{Int64})
   @ BioSequences /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:85
 [7] LongSequence
   @ /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:83 [inlined]
 [8] sequence(::Type{BioSequences.LongSequence{BioSequences.RNAAlphabet{4}}}, record::FASTX.FASTA.Record, part::UnitRange{Int64}) (repeats 2 times)
   @ FASTX.BioSequencesExt /opt/julia/packages/FASTX/gzHTk/ext/BioSequencesExt.jl:21
 [9] top-level scope
   @ REPL[22]:1

julia> FASTX.sequence(BioSequences.LongAA, x)
5aa Amino Acid Sequence:
AAAAJ

julia> FASTX.sequence(BioSequences.LongSequence, x) #can only be an AA seq, per the available alphabets
5aa Amino Acid Sequence:
AAAAJ

Current Behavior

Can't use a generic LongSequence for any record

julia> FASTX.sequence(BioSequences.LongSequence, x)
ERROR: MethodError: no method matching BioSequences.LongSequence(::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true})
Closest candidates are:
  BioSequences.LongSequence(::Kmers.Kmer{A, K, N}) where {A, K, N} at /opt/julia/packages/Kmers/7SNBQ/src/kmer.jl:492
  BioSequences.LongSequence(::BioSequences.LongSequence, ::AbstractUnitRange{var"#s29"} where var"#s29"<:Integer) at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:49
  BioSequences.LongSequence(::BioSequences.LongSubSeq{A}) where A at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/seqview.jl:76
  ...
Stacktrace:
 [1] sequence(::Type{BioSequences.LongSequence}, record::FASTX.FASTA.Record, part::UnitRange{Int64}) (repeats 2 times)
   @ FASTX.BioSequencesExt /opt/julia/packages/FASTX/gzHTk/ext/BioSequencesExt.jl:21
 [2] top-level scope
   @ REPL[19]:1

julia> BioSequences.LongSequence(FASTX.sequence(x))
ERROR: MethodError: no method matching BioSequences.LongSequence(::StringViews.StringView{SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int64}}, true}})
Closest candidates are:
  BioSequences.LongSequence(::Kmers.Kmer{A, K, N}) where {A, K, N} at /opt/julia/packages/Kmers/7SNBQ/src/kmer.jl:492
  BioSequences.LongSequence(::BioSequences.LongSequence, ::AbstractUnitRange{var"#s29"} where var"#s29"<:Integer) at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/constructors.jl:49
  BioSequences.LongSequence(::BioSequences.LongSubSeq{A}) where A at /opt/julia/packages/BioSequences/zp2M8/src/longsequences/seqview.jl:76
  ...
Stacktrace:
 [1] top-level scope
   @ REPL[26]:1

Context

In addition to validating whether a FASTA is valid https://biojulia.github.io/FASTX.jl/latest/fasta/#FASTX.FASTA.validate_fasta it would be useful to have functionality to auto-infer the type of records in the FASTA

I'd need to think through the most logical way to check, but I think an order of operations to infer the best alphabet match might be like

DNA{2} -> RNA{2} -> DNA{4} -> RNA{4} -> AA

The AA alphabet (letter codes, not molecules) seems to be a superset of DNA/RNA alphabet, and the T/U difference I think is enough to differentiate between DNA/RNA

link to codes https://www.ddbj.nig.ac.jp/ddbj/code-e.html

Metadata

Metadata

Assignees

No one assigned

    Labels

    feature requestA request for some desired or missing functionality

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions