-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CHIP Persistent data storage writes are not power-loss safe #977
Comments
… Actually I think I'm looking at the wrong code here - the error messages don't match up. Is the code in |
Looking closer, the actual code responsible for this might be in the Matter SDK itself, here: Unfortunately, this code has the same issue, where a power loss during write (prior to the fsync) can result in an incompletely written file which will be corrupt on restart. It's arguably worse in a different way, since it doesn't remove/truncate the file before doing a write, so on a power loss you can get a mix of the old and new files present. I've opened project-chip/connectedhomeip#36654 regarding the Matter SDK code. |
The only way to get that resolved is create a PR on the matter SDK repo. Nobody is going to pick up that issue report as the python bindings are never meant to be used the way we do. They're only created for the testing framework. So if you wnat this fixed, you'll need to create a PR there or wait for one of us to have some spare time ;-) Even though this could be improved I still think you just need to make backups of your server. |
Yes, backups are always needed, but if it can be easily fixed/improved, why not? |
Files getting corrupted on hard power off is entirely preventable in software (assuming you're not running on broken hardware). We should be expecting Home Assistant will be able to automatically recover and keep working when booted up after a hard power off. The current situation, where it can hit a corrupt data file and drop all your Matter devices with nothing but a log message is unacceptable IMO. Fixing this will prevent an entire class of "oh no, there was a power outage last night and none of my automations on matter devices worked this morning - what are you saying i need to restore a backup!?" issues. Unfortunately while I have the ability to write and contribute the fix to the Matter SDK side, I'm not willing to sign their CLA at the moment. |
I'm not saying we should not fix this (we're looking at it) but that besides making the software more forgiving you should always have backups as well. Sorry if that didn't come across. English is not my main language so my Dutch-style English is not always understood haha. We are a super limited team (basically its just me and Stefan helps me out now and then) and low on contributions so we don't have lots of time to spend coding, its a constant balance of what to pickup and our time has very much been consumed by all these issues with Thread and bugs with TREL. I'll do my best to create at least a temporary patch for this (as we now have seen a couple of reports already about this) before/on monday. Thanks! |
I think I just hit this issue myself. Does the Docker container contain this fix? I'm currently running (
|
It seems up-to-date: I'm not a fan of how difficult Github makes it to find the built images with their hash, etc. Should I be looking somewhere else? All my Matter devices stopped responding for some reason. I have it working again now though. I still have the backed up data folder if it's useful. I'll open another bug if it happens again. |
Unfortunately, the fix that has been committed isn't complete - it improves the situation, but there's an important data file written by the Matter SDK itself which also has this issue: project-chip/connectedhomeip#36654 (in logs I've seen for other users hitting this problem, the Matter SDK file is the one that actually got corrupted). |
This comment was marked as resolved.
This comment was marked as resolved.
@aidanharris please do not pollute this issue report with an unrelated report. The issue you mention has been fixed this morning and a new release will be issued soon. |
Sorry, about that. I did mark it as resolved (I should have checked first) but that's too late for people's email notifications. |
In this discord thread: https://discord.com/channels/330944238910963714/1311124568081174628 a user reported an issue where following a power loss, all of their matter devices disappeared. From the Matter server logs:
It looks like the configuration file was present, but had invalid data in it. The user was able to restore a backup correct the issue, but when I looked at the code to persist the storage: (edit: this code doesn't seem to be responsible for the user's issue, which happened in the Matter SDK code - but this code is incorrect.)
python-matter-server/matter_server/server/storage.py
Lines 149 to 163 in a55bb8b
I noticed that the logic here is not power loss safe, and might result in a partially written or missing file for unlucky users.
The current logic does:
chip.json
tochip.json.backup
. Following this, if power is lost, there may be nochip.json
file present at all.chip.json
file. If the write is interrupted, or power is lost before the system flushes the write cache, a partially written or corrupt configuration file may be present on reboot, resulting in errors like this user saw.I'd recommend the following logic instead:
chip.json
tochip.json.backup
, leaving the existing file in place. (Ideally, run fsync on the newly written file to ensure the backup is updated before the original file gets replaced.)chip.json.new
.The text was updated successfully, but these errors were encountered: