Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

several same-score pairwise alignments #3198

Open
notestaff opened this issue Oct 11, 2023 · 6 comments
Open

several same-score pairwise alignments #3198

notestaff opened this issue Oct 11, 2023 · 6 comments
Labels
question a user question how to do certain things

Comments

@notestaff
Copy link

notestaff commented Oct 11, 2023

Platform

  • SeqAn version: 3.3.0
  • Operating system:

Linux bpb23-acc 5.4.0-136-generic #153-Ubuntu SMP Thu Nov 24 15:56:58 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

  • Compiler: x86_64-conda-linux-gnu-c++ (conda-forge gcc 12.3.0-2) 12.3.0

Question

In pairwise_align(), when several possible alignments have the same score, how does the code pick which one to return? Can it be asked to return all of them (lazily), like BioPython's Bio.Align.PairwiseAligner does? Thanks!

@eseiler

@notestaff notestaff added the question a user question how to do certain things label Oct 11, 2023
@eseiler
Copy link
Member

eseiler commented Oct 15, 2023

Hey @notestaff,

if there are multiple valid alignments, the traceback will choose in this order:

  1. Diagonal (match)
  2. Vertical (gap in the first sequence)
  3. Horizontal (gap in the second sequence)

The alignments are always evaluated lazily, i.e., they are only computed if you access them.

As far as I am aware, there is no way to force an output of all possible (alternative) alignments.

Click to show example
#include <vector>

#include <seqan3/alignment/aligned_sequence/debug_stream_alignment.hpp>
#include <seqan3/alignment/configuration/align_config_edit.hpp>
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>

int main()
{
    using namespace seqan3::literals;

    std::vector<seqan3::dna4> const sequence_1{"ACGT"_dna4};
    std::vector<seqan3::dna4> const sequence_2{"AGCT"_dna4};

    auto config = seqan3::align_cfg::method_global{} | seqan3::align_cfg::edit_scheme;

    for (auto const & result : seqan3::align_pairwise(std::tie(sequence_1, sequence_2), config))
        seqan3::debug_stream << result << '\n';
}

Will output:

{sequence1 id: 0, sequence2 id: 0, score: -2, begin: (0,0), end: (4,4), 
alignment:
      0     .
        A-CGT
        | | |
        AGC-T
}

An alternative is:

ACG-T
| | |
A-GCT

but the algorithm will prefer putting the gap in the first sequence first.


@rrahn will also have a look at your questions and correct me/expand on my answers if necessary :)

What would the use case be for iterating over all alternative alignments? Please don't hesitate to ask more questions if something is unclear.

@notestaff
Copy link
Author

Hi @eseiler and @rrahn,

Thanks for the response (and for seqan generally).
My current use case involves converting some code from python/biopython to C++/seqan3, and trying to ensure that the new code exactly reproduces the old results. I've realized now that the code path which used alignment enumeration is inactive, so don't have an immediate need for that function. However, it would help a lot if there was a systematic way to exactly reproduce, in seqan3, biopython's choice of alignment from among the top-scoring alignments. Maybe, the order in which the traceback chooses directions could be explicitly parameterized (with compile-time choices)? Then I could choose the order corresponding to biopython's implementation.

@rrahn
Copy link
Contributor

rrahn commented Oct 19, 2023

Hi @notestaff,

thanks for writing us.
If I understand you correctly, then you are interested in all cooptimal alignments that yield an optimal alignment score?
The short answer is: It is indirectly possible by forcing align_pairwise(...) to output the trace matrix:

#include <seqan3/alignment/configuration/align_config_debug.hpp>
#include <seqan3/alignment/configuration/align_config_gap_cost_affine.hpp>
#include <seqan3/alignment/configuration/align_config_method.hpp>
#include <seqan3/alignment/configuration/align_config_scoring_scheme.hpp>
#include <seqan3/alignment/matrix/detail/debug_matrix.hpp>
#include <seqan3/alignment/matrix/detail/trace_directions.hpp>
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alignment/scoring/nucleotide_scoring_scheme.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>

auto align_config = seqan3::align_cfg::method_global{} | 
                    seqan3::align_cfg::scoring_scheme{
                        seqan3::nucleotide_scoring_scheme{seqan3::match_score{4},
                                                          seqan3::mismatch_score{-5}}
                    } |
                    seqan3::align_cfg::gap_cost_affine{seqan3::align_cfg::open_score{-10},
                                                       seqan3::align_cfg::extension_score{-1}
                    } |
                    seqan3::align_cfg::detail::debug{};

auto source = "AACCGGTTAACCGGTT"_dna4;
auto target = "ACGTACGTA"_dna4;

using trace_matrix_t = seqan3::detail::two_dimensional_matrix<std::optional<seqan3::detail::trace_directions>>; 
auto alignments = seqan3::align_pairwise(std::tie(source, target), align_config);

for (auto && alignment : alignments) {
    trace_matrix_t trace_matrix{alignment.trace_matrix()};
    // in case you are using local alignments you can use 
    // alignment.sequence1_end_position()/alignment.sequence2_end_position() to find the column and row index where the alignment starts in the trace matrix and 
    // alignment.sequence1_begin_position()/alignment.sequence2_begin_position() to find the column and respectively row index where the alignment ends in the trace matrix.
    // you can then implement a traversal over the trace matrix to enumerate all paths between the end and begin point. 
}

Note that this soulution, however, is not implemented with efficiency in mind but thought as a debug feature to evaluate the correctness of the computed alignments.
Still it would give you an entry point to replicate the desired BioPython functionality.
You can find the corresponding API documentation of the trace_matrix in the developer API: http://docs.seqan.de/seqan3/3-master-dev/classseqan3_1_1detail_1_1two__dimensional__matrix.html.

If anything is unclear or does not work as expected, just contact us.

@ChristopherAdelmann
Copy link

Hi @rrahn and @eseiler,

Thanks for that suggestion. I am trying to extract all cooptimal alignments for local alignments from the debug trace matrix. Here, I try to utilize the seqan3::detail::trace_iterator to construct the trace paths from all max score indices obtained from the score matrix using customized methods:

auto tracePath(const seqan3::detail::matrix_coordinate &trace_begin,
                   const seqan3::detail::two_dimensional_matrix<seqan3::detail::trace_directions>
                       &complete_matrix) {
        using matrix_t = seqan3::detail::two_dimensional_matrix<seqan3::detail::trace_directions>;
        using matrix_iter_t = std::ranges::iterator_t<matrix_t const>;
        using trace_iterator_t = seqan3::detail::trace_iterator<matrix_iter_t>;
        using path_t = std::ranges::subrange<trace_iterator_t, std::default_sentinel_t>;

        seqan3::detail::matrix_offset const offset{trace_begin};

        return path_t{trace_iterator_t{complete_matrix.begin() + offset}, std::default_sentinel};
    };

template <typename sequence_pair_t, typename score_t, typename matrix_coordinate_t>
AlignmentResult makeResult(const sequence_pair_t &sequencePair, score_t score,
                               matrix_coordinate_t endPositions, auto const &alignmentMatrix) {
        const size_t elementsN = alignmentMatrix.rows() * alignmentMatrix.cols();
        std::vector<seqan3::detail::trace_directions> traceDirections;
        traceDirections.reserve(elementsN);

        for (const auto &direction : alignmentMatrix) {
            traceDirections.push_back(direction.value_or(seqan3::detail::trace_directions::none));
        }

        seqan3::detail::number_rows rowsN{alignmentMatrix.rows()};
        seqan3::detail::number_cols colsN{alignmentMatrix.cols()};

        seqan3::detail::two_dimensional_matrix<seqan3::detail::trace_directions> traceMatrix(
            rowsN, colsN, traceDirections);

        using std::get;
        seqan3::detail::aligned_sequence_builder builder{get<0>(sequencePair),
                                                         get<1>(sequencePair)};
        auto trace_result = builder(CoOptimalPairwiseAligner::tracePath(endPositions, traceMatrix));

        return {score,
                std::make_pair(trace_result.first_sequence_slice_positions.first,
                               trace_result.second_sequence_slice_positions.first),
                std::make_pair(endPositions.row, endPositions.col)};
    };

This works sometimes, but at some point I got the error: Assertion failed: (current_direction == trace_directions::diagonal), function operator++, file trace_iterator_base.hpp, line 169..

After digging, I discovered that the internal trace matrix is modified before being returned as a debug matrix. This seems to cause the issue. After removing these lines, it worked fine:

if constexpr (traits_t::compute_sequence_alignment)
{
std::ranges::copy(
column
| std::views::transform(
[](auto const & tpl)
{
using std::get;
auto trace = get<1>(tpl).current;
if (auto _up = (trace & trace_directions::up_open); _up == trace_directions::carry_up_open)
trace = trace ^ trace_directions::carry_up_open; // remove silent up open signal
else if (_up == trace_directions::up_open)
trace = trace ^ trace_directions::up; // display up open only with single bit.
if (auto _left = (trace & trace_directions::left_open);
_left == trace_directions::carry_left_open)
trace = trace ^ trace_directions::carry_left_open; // remove silent left open signal
else if (_left == trace_directions::left_open)
trace = trace ^ trace_directions::left; // display left open only with single bit.
return trace;
}),
trace_debug_matrix.begin() + offset);

So my questions are twofold: is there any way to obtain the unmodified trace matrix without modifying your source code, and are there any plans for outputting the cooptimal alignments in the future?

Thanks again!

@rrahn
Copy link
Contributor

rrahn commented Jun 3, 2024

Hi @ChristopherAdelmann thanks for digging into this.
Is there any chance you can provide us minimum working example that fails?
Right now, it is hard for me to tell what exactly causes the bug you are experiencing.
I would need to investigate why this modification of the debug tracematrix causes your assertion.
To your first question, I am afraid, there is no simple way to disable this in the current version, except that I might be able to fix this when I can reproduce the bug.
To you second question, yes, we are going to have this in a future versio of SeqAn. Not sure when this will be ready, as it comes with some major changes of the library modularisation in the back.

@ChristopherAdelmann
Copy link

Hi @rrahn,

of course, here is some sample code that reproduces the issue.
I guess the scoring scheme is not strictly necessary, but it was easier for me to reproduce it this way.
Additionally, I was wondering whether it is necessary to return the trace matrix as a vector of optionals or if it is possible to return the non-optional version.

#include <seqan3/alphabet/nucleotide/dna5.hpp>
#include <seqan3/alignment/configuration/all.hpp>
#include <seqan3/alignment/matrix/detail/aligned_sequence_builder.hpp>
#include <seqan3/alignment/matrix/detail/debug_matrix.hpp>
#include <seqan3/alignment/matrix/detail/matrix_coordinate.hpp>
#include <seqan3/alignment/matrix/detail/trace_directions.hpp>
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alignment/scoring/nucleotide_scoring_scheme.hpp>

void testCooptimalAlignments() {
        using namespace seqan3::literals;
        auto source = "GTTCGCTAGTTTCGCGCGTAGTTTCTCGTTCG"_dna5;
        auto target = "CTTCTACTTTGTTCACTTTGATAGCAAACTTAGCCGCTTTA"_dna5;

        seqan3::nucleotide_scoring_scheme scheme{seqan3::match_score{1},
                                                 seqan3::mismatch_score{-1}};

        scheme.score('A'_dna5, 'T'_dna5) = 1;
        scheme.score('T'_dna5, 'A'_dna5) = 1;
        scheme.score('G'_dna5, 'C'_dna5) = 1;
        scheme.score('C'_dna5, 'G'_dna5) = 1;
        scheme.score('G'_dna5, 'T'_dna5) = 1;
        scheme.score('T'_dna5, 'G'_dna5) = 1;

        scheme.score('T'_dna5, 'T'_dna5) = -1;
        scheme.score('A'_dna5, 'A'_dna5) = -1;
        scheme.score('C'_dna5, 'C'_dna5) = -1;
        scheme.score('G'_dna5, 'G'_dna5) = -1;

        scheme.score('A'_dna5, 'G'_dna5) = -1;
        scheme.score('A'_dna5, 'C'_dna5) = -1;
        scheme.score('G'_dna5, 'A'_dna5) = -1;
        scheme.score('C'_dna5, 'A'_dna5) = -1;

        scheme.score('T'_dna5, 'C'_dna5) = -1;
        scheme.score('C'_dna5, 'T'_dna5) = -1;

        auto alignConfig =
            seqan3::align_cfg::method_local{} | seqan3::align_cfg::scoring_scheme{scheme} |
            seqan3::align_cfg::gap_cost_affine{seqan3::align_cfg::open_score{-2},
                                               seqan3::align_cfg::extension_score{-2}} |
            seqan3::align_cfg::output_score{} | seqan3::align_cfg::output_alignment{} |
            seqan3::align_cfg::detail::debug{};

        auto alignments = seqan3::align_pairwise(std::tie(source, target), alignConfig);

        for (auto &&alignment : alignments) {
            const int score = alignment.score();

            using TraceMatrix = seqan3::detail::two_dimensional_matrix<
                std::optional<seqan3::detail::trace_directions>>;
            using ScoreMatrix = seqan3::detail::row_wise_matrix<std::optional<int>>;
            const TraceMatrix traceMatrixOptional{alignment.trace_matrix()};
            const ScoreMatrix scoreMatrix{alignment.score_matrix()};

            seqan3::debug_stream << "Score: " << score << std::endl;
            seqan3::debug_stream << scoreMatrix << std::endl;

            auto it = std::ranges::find(scoreMatrix, score);
            while (it != scoreMatrix.end()) {
                size_t row = std::distance(scoreMatrix.begin(), it) / scoreMatrix.cols();
                size_t col = std::distance(scoreMatrix.begin(), it) % scoreMatrix.cols();

                seqan3::detail::matrix_coordinate traceBegin{
                    seqan3::detail::row_index_type{row}, seqan3::detail::column_index_type{col}};

                std::vector<seqan3::detail::trace_directions> traceDirections;

                std::transform(
                    traceMatrixOptional.begin(), traceMatrixOptional.end(),
                    std::back_inserter(traceDirections), [](const auto &direction) {
                        return direction.value_or(seqan3::detail::trace_directions::none);
                    });

                seqan3::detail::number_rows rowsN{traceMatrixOptional.rows()};
                seqan3::detail::number_cols colsN{traceMatrixOptional.cols()};

                seqan3::detail::two_dimensional_matrix<seqan3::detail::trace_directions>
                    traceMatrix(rowsN, colsN, traceDirections);

                seqan3::detail::aligned_sequence_builder builder{source, target};

                using matrix_t =
                    seqan3::detail::two_dimensional_matrix<seqan3::detail::trace_directions>;
                using matrix_iter_t = std::ranges::iterator_t<matrix_t const>;
                using trace_iterator_t = seqan3::detail::trace_iterator<matrix_iter_t>;
                using path_t = std::ranges::subrange<trace_iterator_t, std::default_sentinel_t>;

                seqan3::detail::matrix_offset const offset{traceBegin};

                auto cooptimalTracePath =
                    path_t{trace_iterator_t{traceMatrix.begin() + offset}, std::default_sentinel};

                seqan3::debug_stream << "Trace path: " << cooptimalTracePath << std::endl;

                it = std::ranges::find(it + 1, scoreMatrix.end(), score);
            }
        }
    };

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question a user question how to do certain things
Projects
None yet
Development

No branches or pull requests

4 participants