Support degenerate / gap characters #12

fedarko · 2023-10-02T23:44:51Z

Currently, the presence of Ns in a sequence will make matrix construction fail with the following error: "Input sequence contains character N; only DNA nucleotides (A, C, G, T) are currently allowed."

This is a very "safe" way of handling this situation, but it's a bit over-cautious. It would be better to allow these characters, and to treat any k-mer containing one of them as having no matches anywhere.
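To make the proposed behavior concrete, here is a minimal sketch in plain Python. It is not the library's code: the function name `valid_kmers` and its parameters are hypothetical, and it only shows one way to skip k-mers that contain a character outside A, C, G, T so that they never take part in matching.

```python
def valid_kmers(seq, k, alphabet="ACGT"):
    """Yield (start, kmer) for each k-mer made up only of allowed characters.

    Hypothetical helper, for illustration only. K-mers that overlap a
    disallowed character (N, a gap, ...) are skipped, so they can never
    be reported as matching anything.
    """
    allowed = set(alphabet)
    last_bad = -1  # index of the most recent disallowed character seen so far
    for i, ch in enumerate(seq):
        if ch not in allowed:
            last_bad = i
        start = i - k + 1
        # The window seq[start:i + 1] is clean if and only if the most recent
        # disallowed character lies strictly before the window's start.
        if start >= 0 and last_bad < start:
            yield start, seq[start:i + 1]


kmers = list(valid_kmers("ACNGTA", 2))
# "CN" and "NG" are skipped; the start positions still refer to the original
# sequence, so the plot's coordinates would not shift.
```

Because the start positions are kept in the coordinates of the original sequence, the rows and columns covered by skipped k-mers would just stay empty in the resulting matrix.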

Some workaround options, in the meantime:

- Remove these characters from your sequence before creating a dot plot. (If you keep track of where the "breaks" are, you can then label these on the dot plot to explain the situation.)
  - The downside, of course, is that this will create "spurious" k-mers that span the "break".
- Split up your sequence into "islands" of non-degenerate/gap characters, and analyze these independently. You could also concatenate the resulting dot plot matrices together, although that would require some extra programming work.
- Replace these characters with random (?) DNA nucleotides (as is done, for example, in section 2.7.1 of the BWA paper).
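The three workarounds above can each be sketched as a small preprocessing step in plain Python. All of the function names here are hypothetical, none of them is part of the library, and the allowed alphabet is assumed to be A, C, G, T as in the error message.

```python
import random
import re


def strip_and_track_breaks(seq, alphabet="ACGT"):
    """Workaround 1: drop disallowed characters, remembering where they were.

    Returns (cleaned, breaks). Each entry of `breaks` is a position in
    `cleaned` immediately before which one or more characters were removed.
    """
    allowed = set(alphabet)
    cleaned, breaks = [], []
    for ch in seq:
        if ch in allowed:
            cleaned.append(ch)
        elif not breaks or breaks[-1] != len(cleaned):
            # Record each run of removed characters only once.
            breaks.append(len(cleaned))
    return "".join(cleaned), breaks


def split_into_islands(seq, alphabet="ACGT"):
    """Workaround 2: return (start, island) for each maximal allowed run."""
    pattern = re.compile("[" + re.escape(alphabet) + "]+")
    return [(m.start(), m.group()) for m in pattern.finditer(seq)]


def replace_with_random(seq, alphabet="ACGT", seed=None):
    """Workaround 3: replace each disallowed character with a random one."""
    rng = random.Random(seed)  # pass a seed for reproducible output
    allowed = set(alphabet)
    return "".join(ch if ch in allowed else rng.choice(alphabet) for ch in seq)


cleaned, breaks = strip_and_track_breaks("ACNNGT-A")
islands = split_into_islands("ACNNGT-A")
randomized = replace_with_random("ACNNGT-A", seed=0)
```

Note that the first and third helpers can both produce k-mers that do not exist in the real sequence (spanning a break, or containing a randomly chosen nucleotide), so the `breaks` positions are worth keeping around for labeling the plot.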


fedarko added the enhancement New feature or request label Oct 2, 2023

fedarko mentioned this issue Oct 3, 2023

Support protein sequences #13

Open

fedarko mentioned this issue Feb 10, 2025

Support RNA sequences #21

Open
