Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CHIP Persistent data storage writes are not power-loss safe #977

Open
kepstin opened this issue Nov 27, 2024 · 12 comments
Open

CHIP Persistent data storage writes are not power-loss safe #977

kepstin opened this issue Nov 27, 2024 · 12 comments

Comments

@kepstin
Copy link

kepstin commented Nov 27, 2024

In this discord thread: https://discord.com/channels/330944238910963714/1311124568081174628 a user reported an issue where following a power loss, all of their matter devices disappeared. From the Matter server logs:

2024-11-26 20:12:41.236 (MainThread) ERROR [chip.storage] Expecting ',' delimiter: line 29 column 111 (char 4111)
2024-11-26 20:12:41.236 (MainThread) CRITICAL [chip.storage] Could not load configuration from /data/chip.json - resetting configuration...

It looks like the configuration file was present, but had invalid data in it. The user was able to restore a backup correct the issue, but when I looked at the code to persist the storage: (edit: this code doesn't seem to be responsible for the user's issue, which happened in the Matter SDK code - but this code is incorrect.)

async def async_save(self) -> None:
"""Save persistent data to disk."""
assert self.server.loop is not None
def do_save() -> None:
# make backup before we write a new file
if self.filename.is_file():
self.filename.replace(self.filename_backup)
with open(self.filename, "w", encoding="utf-8") as _file:
_file.write(json_dumps(self._data))
LOGGER.debug("Saved data to persistent storage")
async with self._save_lock:
await self.server.loop.run_in_executor(None, do_save)

I noticed that the logic here is not power loss safe, and might result in a partially written or missing file for unlucky users.

The current logic does:

  • Rename chip.json to chip.json.backup. Following this, if power is lost, there may be no chip.json file present at all.
  • Write a new chip.json file. If the write is interrupted, or power is lost before the system flushes the write cache, a partially written or corrupt configuration file may be present on reboot, resulting in errors like this user saw.

I'd recommend the following logic instead:

  • (optional) Copy chip.json to chip.json.backup, leaving the existing file in place. (Ideally, run fsync on the newly written file to ensure the backup is updated before the original file gets replaced.)
  • Write the new data to a temporary file, like chip.json.new.
  • Run fsync (or fdatasync) on the newly written file to ensure the file contents are persisted on the disk. (see the notes on https://docs.python.org/3/library/os.html#os.fsync regarding flushing the python buffers). This is needed to ensure ordering - making sure that the file contents will be written before the file gets renamed.
  • Rename the temporary file to the final filename, atomically replacing the existing file.
  • (optional) Run fsync on the renamed file to ensure the rename is committed to storage. (This is only needed if data stored elsewhere - e.g. in a sqlite database - might depend on this data having been persisted)
@kepstin kepstin changed the title Persistent data storage writes are not power-loss safe CHIP Persistent data storage writes are not power-loss safe Nov 27, 2024
@kepstin
Copy link
Author

kepstin commented Nov 27, 2024

… Actually I think I'm looking at the wrong code here - the error messages don't match up. Is the code in matter-server/server/storage.py used? is it for something else?

@kepstin
Copy link
Author

kepstin commented Nov 27, 2024

Looking closer, the actual code responsible for this might be in the Matter SDK itself, here:

https://github.com/project-chip/connectedhomeip/blob/e98f2b239aae657a10043bad9aa61f4717d059be/src/controller/python/chip/storage/__init__.py#L164-L183

Unfortunately, this code has the same issue, where a power loss during write (prior to the fsync) can result in an incompletely written file which will be corrupt on restart. It's arguably worse in a different way, since it doesn't remove/truncate the file before doing a write, so on a power loss you can get a mix of the old and new files present.

I've opened project-chip/connectedhomeip#36654 regarding the Matter SDK code.

@marcelveldt
Copy link
Contributor

The only way to get that resolved is create a PR on the matter SDK repo. Nobody is going to pick up that issue report as the python bindings are never meant to be used the way we do. They're only created for the testing framework. So if you wnat this fixed, you'll need to create a PR there or wait for one of us to have some spare time ;-)

Even though this could be improved I still think you just need to make backups of your server.

@nerijus
Copy link

nerijus commented Dec 4, 2024

Yes, backups are always needed, but if it can be easily fixed/improved, why not?

@kepstin
Copy link
Author

kepstin commented Dec 16, 2024

Files getting corrupted on hard power off is entirely preventable in software (assuming you're not running on broken hardware). We should be expecting Home Assistant will be able to automatically recover and keep working when booted up after a hard power off. The current situation, where it can hit a corrupt data file and drop all your Matter devices with nothing but a log message is unacceptable IMO.

Fixing this will prevent an entire class of "oh no, there was a power outage last night and none of my automations on matter devices worked this morning - what are you saying i need to restore a backup!?" issues.

Unfortunately while I have the ability to write and contribute the fix to the Matter SDK side, I'm not willing to sign their CLA at the moment.

@marcelveldt
Copy link
Contributor

I'm not saying we should not fix this (we're looking at it) but that besides making the software more forgiving you should always have backups as well. Sorry if that didn't come across. English is not my main language so my Dutch-style English is not always understood haha.

We are a super limited team (basically its just me and Stefan helps me out now and then) and low on contributions so we don't have lots of time to spend coding, its a constant balance of what to pickup and our time has very much been consumed by all these issues with Thread and bugs with TREL.

I'll do my best to create at least a temporary patch for this (as we now have seen a couple of reports already about this) before/on monday. Thanks!

@aidanharris
Copy link

I think I just hit this issue myself. Does the Docker container contain this fix?

I'm currently running (f07a9b2ff8bafeaf1be20ded7f4c2ff3452fe7b87d9c22aa72beaf980978a07b):

ghcr.io/home-assistant-libs/python-matter-server     stable             f07a9b2ff8ba   2 days ago      448MB

@aidanharris
Copy link

docker image inspect f07a9b2ff8bafeaf1be20ded7f4c2ff3452fe7b87d9c22aa72beaf980978a07b

        "RepoDigests": [
            "ghcr.io/home-assistant-libs/python-matter-server@sha256:1e371d6936c179fec67896180a8697448c2ccec628078f709bb4b271b040701b"
        ],

It seems up-to-date:
https://github.com/home-assistant-libs/python-matter-server/actions/runs/12470272943/job/34805323062

I'm not a fan of how difficult Github makes it to find the built images with their hash, etc. Should I be looking somewhere else?

All my Matter devices stopped responding for some reason. I have it working again now though. I still have the backed up data folder if it's useful. I'll open another bug if it happens again.

@kepstin
Copy link
Author

kepstin commented Dec 26, 2024

I think I just hit this issue myself. Does the Docker container contain this fix?

Unfortunately, the fix that has been committed isn't complete - it improves the situation, but there's an important data file written by the Matter SDK itself which also has this issue: project-chip/connectedhomeip#36654 (in logs I've seen for other users hitting this problem, the Matter SDK file is the one that actually got corrupted).

@aidanharris

This comment was marked as resolved.

@marcelveldt
Copy link
Contributor

@aidanharris please do not pollute this issue report with an unrelated report. The issue you mention has been fixed this morning and a new release will be issued soon.

@aidanharris
Copy link

@aidanharris please do not pollute this issue report with an unrelated report. The issue you mention has been fixed this morning and a new release will be issued soon.

Sorry, about that. I did mark it as resolved (I should have checked first) but that's too late for people's email notifications.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants