Hi,

I think it's probably the EOS symbol.

The discrepancy in array lengths stems from how the two decoders treat the log-probability of the end-of-sequence (EOS) token. In GreedyDecoder, the EOS log-probability is deliberately excluded, so the token_log_probabilities array has exactly one entry per predicted token. In contrast, the BeamSearchDecoder appends the EOS log-probability as an extra entry, making its token_log_probabilities one element longer than the number of predicted tokens.

⸻

GreedyDecoder: Excluding EOS

In the GreedyDecoder, the implementation explicitly trims the log‐probabilities to the length of the generated sequence, omitting the EOS entry:

token_log_probabilities = [
    x.cpu().item() 
    for x in all_log_probabilities[i, : len(sequence)]
][::-1]  # list[float] of length == sequence_length, EOS excluded

greedy_search.py:412

• The slice : len(sequence) ensures that only the log‐probabilities for the actual amino‐acid tokens are kept.
• As a result, if the model predicts a sequence of N tokens (excluding EOS), then token_log_probabilities has exactly N entries.
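
The effect of that slice can be seen in a small sketch. Plain Python lists stand in for one row of the tensor; the values, the padding, and the helper name `greedy_token_log_probs` are made up for illustration and are not part of InstaNovo. It also assumes the row holds one entry per decoding step in order, with the EOS step directly after the last token.

```python
def greedy_token_log_probs(row, sequence):
    """Mimic the trim-and-reverse step on one row of log-probabilities.

    `row` is a hypothetical stand-in for all_log_probabilities[i]: one float
    per decoding step, including the EOS step and any padding after it.
    """
    # Keep one entry per predicted token, then reverse as the excerpt does.
    return [float(x) for x in row[: len(sequence)]][::-1]


sequence = ["P", "E", "K"]                 # N = 3 predicted tokens (illustrative)
row = [-0.10, -0.20, -0.30, -0.05, 0.0]    # 3 token steps, 1 EOS step, 1 pad

token_log_probabilities = greedy_token_log_probs(row, sequence)
print(len(token_log_probabilities))   # 3
print(token_log_probabilities)        # [-0.3, -0.2, -0.1]
```

The EOS entry (-0.05 here) falls outside the slice, so the result has the same length as the sequence.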

⸻

BeamSearchDecoder: Including EOS

By contrast, the BeamSearchDecoder explicitly appends the EOS token’s log‐probability when constructing its probability arrays:

completed_token_log_probabilities = torch.column_stack((
    local_token_log_probabilities[beam_index],       # log-probabilities for each token
    last_token_log_probabilities[beam_index, residues],
    eos_log_probabilities[:, self.model.get_eos_index()],  # EOS log-probability appended
))

beam_search.py:299-304

• Here, eos_log_probabilities[:, self.model.get_eos_index()] is stacked as the final column, after the per-token log-probabilities.
• Therefore, for a prediction of N tokens, each row of the resulting tensor has N + 1 entries; the extra one is the EOS log-probability.
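
To make the count concrete, here is a minimal sketch for a single beam, using plain Python lists instead of torch.column_stack. It reads the excerpt as stacking the log-probabilities of the earlier tokens, then the last token, then EOS. The helper names `stack_beam_log_probs` and `drop_eos_entry` are hypothetical, and the trimming step assumes the EOS entry stays in the final position; if downstream code reorders or reverses the array, that assumption would need checking.

```python
def stack_beam_log_probs(previous, last_token, eos):
    """Concatenate the three pieces for a single beam, in the excerpt's order."""
    return list(previous) + [last_token, eos]


def drop_eos_entry(token_log_probs, num_tokens):
    """Trim a beam-search array to one entry per token.

    Assumes the only extra entry is the EOS log-probability and that it
    sits at the end, as in the stacking order above.
    """
    if len(token_log_probs) == num_tokens:
        return list(token_log_probs)       # already aligned (greedy-style)
    if len(token_log_probs) == num_tokens + 1:
        return list(token_log_probs[:-1])  # drop the trailing EOS entry
    raise ValueError("unexpected length mismatch")


tokens = ["P", "E", "K"]                   # N = 3 (illustrative)
beam = stack_beam_log_probs([-0.10, -0.20], last_token=-0.30, eos=-0.05)

print(len(beam))                           # 4, i.e. N + 1
print(drop_eos_entry(beam, len(tokens)))   # [-0.1, -0.2, -0.3]
```

With a helper along these lines, outputs from both decoders could be compared entry by entry, provided the ordering assumption holds.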

Strange instanovo results #106
