Skip to content

fix(core): continue backup workflow even after I/O errors & timeouts#6651

Open
romanz wants to merge 9 commits into
mainfrom
romanz/ignore-host
Open

fix(core): continue backup workflow even after I/O errors & timeouts#6651
romanz wants to merge 9 commits into
mainfrom
romanz/ignore-host

Conversation

@romanz
Copy link
Copy Markdown
Contributor

@romanz romanz commented Mar 25, 2026

Related to #3686 (comment).

No more button requests / errors will be sent later during the backup workflow after an I/O-related error/timeout.

Host will be ignored until the backup workflow is over - sending the final "Success" response may fail.

Other workflows can be aborted after user cancellation on _waiting_screen().

Timeout handling

In case the button request is not ACKed for 2s, the "trouble communicating" layout will allow aborting (for regular workflows) or continuing (for backup workflow):

image image image image

TODOs:

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 25, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

This PR moves the unresponsive-warning timeout into the wire protocol handler, adds a module-level timeout and _waiting_screen(), converts ButtonRequestHandler.br_task() to an async coroutine and adjusts put()/join() signatures, updates ContinueOnErrors to retry on UnexpectedMessageException without failing, removes the UI-side delayed waiting coroutine and adapts Layout to await the handler directly, and adds/refactors tests plus updated UI fixtures and a changelog line.

sequenceDiagram
  participant Client
  participant Layout
  participant ButtonRequestHandler
  participant _waiting_screen
  Client->>Layout: initiates operation requiring button ack
  Layout->>ButtonRequestHandler: put(button request)
  Client->>ButtonRequestHandler: send response (or delayed)
  Layout->>ButtonRequestHandler: join()
  alt response before timeout
    ButtonRequestHandler->>Layout: is_done -> join() returns
  else timeout
    ButtonRequestHandler->>_waiting_screen: start timeout and show comm warning
    _waiting_screen->>ButtonRequestHandler: may raise cancellation -> abort flow
    ButtonRequestHandler->>Layout: join() returns (possibly with Failure)
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

core

Suggested reviewers

  • mmilata
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.81% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: improving backup workflow resilience to I/O errors and timeouts, which aligns with the core modifications across multiple files.
Description check ✅ Passed The PR description provides comprehensive context including related issue, timeout handling explanation with visual examples, and implementation details, but lacks formalized sections for PR setup procedures mentioned in the template.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch romanz/ignore-host

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@trezor-bot trezor-bot Bot added this to Firmware Mar 25, 2026
@github-project-automation github-project-automation Bot moved this to 🔎 Needs review in Firmware Mar 25, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 25, 2026

en main(all)

model device_test click_test persistence_test
T2T1 test(all) main(all) test(all) main(all) test(all) main(all)
T3B1 test(all) main(all) test(all) main(all) test(all) main(all)
T3T1 test(all) main(all) test(all) main(all) test(all) main(all)
T3W1 test(all) main(all) test(all) main(all) test(all) main(all)
Translations

cs main(all)

model device_test click_test
T2T1 test(all) main(all) test(all) main(all)
T3B1 test(all) main(all) test(all) main(all)
T3T1 test(all) main(all) test(all) main(all)
T3W1 test(all) main(all) test(all) main(all)

de main(all)

model device_test click_test
T2T1 test(all) main(all) test(all) main(all)
T3B1 test(all) main(all) test(all) main(all)
T3T1 test(all) main(all) test(all) main(all)
T3W1 test(all) main(all) test(all) main(all)

es main(all)

model device_test click_test
T2T1 test(all) main(all) test(all) main(all)
T3B1 test(all) main(all) test(all) main(all)
T3T1 test(all) main(all) test(all) main(all)
T3W1 test(all) main(all) test(all) main(all)

fr main(all)

model device_test click_test
T2T1 test(all) main(all) test(all) main(all)
T3B1 test(all) main(all) test(all) main(all)
T3T1 test(all) main(all) test(all) main(all)
T3W1 test(all) main(all) test(all) main(all)

pt main(all)

model device_test click_test
T2T1 test(all) main(all) test(all) main(all)
T3B1 test(all) main(all) test(all) main(all)
T3T1 test(all) main(all) test(all) main(all)
T3W1 test(all) main(all) test(all) main(all)

Latest CI run: 26782004626

@romanz romanz force-pushed the romanz/ignore-host branch 3 times, most recently from 4d11c88 to b46944e Compare April 7, 2026 14:43
@romanz romanz moved this from 🔎 Needs review to 🏃‍♀️ In progress in Firmware Apr 23, 2026
@romanz romanz force-pushed the romanz/ignore-host branch from b46944e to efc2321 Compare April 23, 2026 19:43
@romanz romanz changed the title fix(core): continue backup workflow even after I/O errors fix(core): continue backup workflow even after I/O errors & timeouts Apr 23, 2026
@romanz romanz changed the base branch from main to romanz/2604/br-rework April 23, 2026 19:45
@romanz romanz added the translations Put this label on a PR to run tests in all languages label Apr 23, 2026
@romanz romanz force-pushed the romanz/ignore-host branch from efc2321 to d419aa0 Compare April 24, 2026 05:39
@romanz romanz force-pushed the romanz/2604/br-rework branch from c754fc8 to 374d32f Compare April 24, 2026 17:35
@romanz romanz force-pushed the romanz/ignore-host branch from d419aa0 to 7d316d2 Compare April 24, 2026 17:35
@romanz romanz force-pushed the romanz/2604/br-rework branch 2 times, most recently from 80f484b to 1a197d9 Compare April 28, 2026 13:25
Base automatically changed from romanz/2604/br-rework to main April 28, 2026 15:15
@romanz romanz force-pushed the romanz/ignore-host branch from 7d316d2 to 59dfbe0 Compare April 28, 2026 15:28
@romanz romanz force-pushed the romanz/ignore-host branch 2 times, most recently from fdb4567 to a000c84 Compare April 30, 2026 14:02
@romanz romanz marked this pull request as ready for review April 30, 2026 14:03
@romanz romanz requested a review from obrusvit as a code owner April 30, 2026 14:03
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/src/trezor/wire/protocol_common.py`:
- Around line 267-272: The br_task method currently returns None when
self.ignore is true, which breaks callers (Layout.get_result) that always
schedule the returned task; change br_task in protocol_common.py so that when
self.ignore is set it returns a schedulable no-op generator/task instead of None
(i.e., a generator that yields nothing or immediately completes) so callers like
Layout.get_result and code paths that call br_handler.br_task(...) (e.g.,
layout.show_backup_success) can store and schedule it without special-casing “no
task.”
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 79e8abe1-55f8-4539-9230-1a1a6da2f595

📥 Commits

Reviewing files that changed from the base of the PR and between 14c6e43 and a000c84.

📒 Files selected for processing (4)
  • core/.changelog.d/6348.fixed
  • core/src/apps/management/reset_device/__init__.py
  • core/src/trezor/ui/__init__.py
  • core/src/trezor/wire/protocol_common.py

Comment thread core/src/trezor/wire/protocol_common.py Outdated
@romanz romanz changed the base branch from main to romanz/2604/backup-success April 30, 2026 15:20
@romanz romanz force-pushed the romanz/ignore-host branch from a000c84 to f71078d Compare April 30, 2026 16:08
@romanz romanz force-pushed the romanz/ignore-host branch from 36f90ad to 3458712 Compare May 2, 2026 12:19
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
core/src/trezor/wire/protocol_common.py (1)

285-317: 🏗️ Heavy lift

Please lock the ignore-host transition down with a focused regression test.

This path now depends on success surviving UnexpectedMessageException, generic handler failures, and task cancellation, with self.ignore then turning later put()/join()/br_task() calls into no-ops. That behavior is subtle and central to the backup-safety story, so it deserves a dedicated test instead of relying only on fixture updates.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/trezor/wire/protocol_common.py` around lines 285 - 317, Add a
focused regression test for the ignore-host transition in ContinueOnErrors:
exercise the _handle flow to reproduce the three paths mentioned (Successful
loop iteration, an UnexpectedMessageException during the inner loop, and task
cancellation/GeneratorExit) and assert that self.ignore is set to False only
when success=True and True otherwise; also assert downstream methods that rely
on ignore (e.g., put(), join(), br_task() on the ButtonRequest handler) become
no-ops when ignore is True. Use the ContinueOnErrors class, trigger
UnexpectedMessageException and a generic Exception/cancellation, and validate
the final value of the instance attribute ignore and the observable behavior of
put()/join()/br_task() to lock this regression in a concise unit test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@core/src/trezor/wire/protocol_common.py`:
- Around line 285-317: Add a focused regression test for the ignore-host
transition in ContinueOnErrors: exercise the _handle flow to reproduce the three
paths mentioned (Successful loop iteration, an UnexpectedMessageException during
the inner loop, and task cancellation/GeneratorExit) and assert that self.ignore
is set to False only when success=True and True otherwise; also assert
downstream methods that rely on ignore (e.g., put(), join(), br_task() on the
ButtonRequest handler) become no-ops when ignore is True. Use the
ContinueOnErrors class, trigger UnexpectedMessageException and a generic
Exception/cancellation, and validate the final value of the instance attribute
ignore and the observable behavior of put()/join()/br_task() to lock this
regression in a concise unit test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 274b60e8-fb77-4754-bd8d-479d026fb110

📥 Commits

Reviewing files that changed from the base of the PR and between 36f90ad and 3458712.

📒 Files selected for processing (1)
  • core/src/trezor/wire/protocol_common.py

Copy link
Copy Markdown

@lapohoda lapohoda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I'd omit the use of icon-only header unless we need to.
  2. There's an update proposal from Chris regarding copy:
    "Communication with your connected device failed. Cancel and try again."
    Button: Cancel.

Design file updated: https://www.figma.com/design/XzPdVcLsDdvJrMfzyLho5Y/GitHub--6348?node-id=8028-49&t=Gqe4fyfXxmgBVyEu-11

@github-project-automation github-project-automation Bot moved this from 🔎 Needs review to 🏃‍♀️ In progress in Firmware May 4, 2026
@romanz
Copy link
Copy Markdown
Contributor Author

romanz commented May 22, 2026

Rebasing to update UI fixtures.

@romanz romanz force-pushed the romanz/ignore-host branch 2 times, most recently from a503508 to bc6da12 Compare May 23, 2026 19:48
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/device_tests/test_msg_delayed_ack.py (1)

42-50: ⚡ Quick win

Add clarifying comment for the second press_yes() call.

The second press_yes() on line 46 is not immediately obvious—it's pressing the abort button on the "waiting" screen that appears after the 2.5-second timeout, not the initial button request confirmation (which was already done in ping_without_ack). A brief comment would help readers understand the flow.

📝 Suggested comment
 `@pytest.mark.models`("core")
 def test_abort(session: Session):
     ping_without_ack(session)
 
+    # Abort the "waiting" screen that appears after the timeout
     session.debug.press_yes()
     assert session.read() == messages.Failure(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/device_tests/test_msg_delayed_ack.py` around lines 42 - 50, Add a brief
clarifying comment above the second session.debug.press_yes() in test_abort to
explain that this press is dismissing the "waiting" screen shown after the 2.5s
timeout (not the initial confirmation already handled by ping_without_ack);
reference the test name test_abort, the helper ping_without_ack, and the press
call session.debug.press_yes() so reviewers understand the flow before the
subsequent assertions against messages.Failure and session.client.ping.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/device_tests/test_msg_delayed_ack.py`:
- Around line 42-50: Add a brief clarifying comment above the second
session.debug.press_yes() in test_abort to explain that this press is dismissing
the "waiting" screen shown after the 2.5s timeout (not the initial confirmation
already handled by ping_without_ack); reference the test name test_abort, the
helper ping_without_ack, and the press call session.debug.press_yes() so
reviewers understand the flow before the subsequent assertions against
messages.Failure and session.client.ping.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3790924d-6a4e-40cb-8ed8-c0fd37a8f325

📥 Commits

Reviewing files that changed from the base of the PR and between a503508 and bc6da12.

📒 Files selected for processing (2)
  • tests/device_tests/test_msg_delayed_ack.py
  • tests/ui_tests/fixtures.json

@romanz
Copy link
Copy Markdown
Contributor Author

romanz commented May 24, 2026

  1. I'd omit the use of icon-only header unless we need to.

Sounds good, will update.

  1. There's an update proposal from Chris regarding copy:
    "Communication with your connected device failed. Cancel and try again."
    Button: Cancel.

Design file updated: https://www.figma.com/design/XzPdVcLsDdvJrMfzyLho5Y/GitHub--6348?node-id=8028-49&t=Gqe4fyfXxmgBVyEu-11

Will it be please possible to update the copy to allow also continuation (in case there is no response from the host during backup flow)?

The current implementation changes the "verb" from "Abort" to "Continue" (in case continuation is possible):

verb = TR.buttons__abort if raise_on_cancel is not None else TR.buttons__continue

EDIT: let's use "Communication with your connected device failed. It’s safe to continue."  + Continue button in this case.

@obrusvit obrusvit added the blocked Blocked by external force. Third party inputs required. label May 25, 2026
@obrusvit
Copy link
Copy Markdown
Contributor

Let's postpone this for 26.07.

romanz added 8 commits June 1, 2026 21:50
Use `None` as a special marker for explicitly hiding the header.

[no changelog]
Implemented for Bolt and Eckhart.

It was done for Delizia in a6090dc.

[no changelog]
No more button requests / errors will be sent later during
the backup workflow after an I/O-related error/timeout.

Host will be ignored until the backup workflow is over -
sending the final "Success" response may fail.

Other workflows can be aborted after user cancellation via `_waiting_screen()`.
@romanz romanz force-pushed the romanz/ignore-host branch from 14dc12b to ad3a86c Compare June 1, 2026 21:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

blocked Blocked by external force. Third party inputs required. translations Put this label on a PR to run tests in all languages

Projects

Status: 🏃‍♀️ In progress

Development

Successfully merging this pull request may close these issues.

3 participants