Problems with ood-daemon restore backup data #269

lurenpluto · 2023-05-15T08:37:26Z

According to feedback from some related products, data was found to be missing when using gateway after restoring an OOD through the following process:

1. Initialize a new OOD environment with ood-installer. At this point the OOD is in an unbound state and only the ood-daemon process is running.
2. Call the remote-restore interface of ood-daemon to restore the data from another OOD's backup. This may take a long time, depending on the size of the backup data, etc. Because the {cyfs}/etc/desc directory is part of the backup, this step also includes the binding process.
3. After the restore is completed, use gateway directly.

In some cases data is found to be missing. In particular, when accessing the root-state to get a DEC app's state, the state is completely different from that of the backed-up source OOD.


lurenpluto · 2023-05-15T09:12:12Z

After analyzing the actual restored data and the local data, and reviewing the whole remote-restore process, the problem appears to be caused by the order of the restore steps in uni restore, combined with the existing activation (binding) detection of ood-daemon.

Restore process

The current uni restore recovery steps are defined as follows:

CYFS/src/component/cyfs-backup-lib/src/backup/restore_status.rs

Lines 8 to 16 in 89110cc

    
    #[derive(Clone, Copy, Debug, Serialize, Deserialize)]
    pub enum RestoreTaskPhase {
        Init,
        LoadAndVerify,
        RestoreKeyData,
        RestoreObject,
        RestoreChunk,
        Complete,
    }

As shown, the restore process restores the key-data first, then the objects and chunks. The identity files device.desc/device.sec in the {cyfs}/etc/desc directory are part of the key-data, so the files in etc/desc are restored early in the overall process, before any objects or chunks.
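The consequence of this ordering can be checked with a small sketch. It assumes a simplified mirror of the enum above; `Phase`, `ORIGINAL_ORDER` and `phases_after_identity` are hypothetical names for illustration and are not part of cyfs-backup-lib.

```rust
/// Hypothetical, simplified mirror of RestoreTaskPhase (no serde derives).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Phase {
    Init,
    LoadAndVerify,
    RestoreKeyData,
    RestoreObject,
    RestoreChunk,
    Complete,
}

/// Phase order as declared in the enum quoted above.
pub const ORIGINAL_ORDER: [Phase; 6] = [
    Phase::Init,
    Phase::LoadAndVerify,
    Phase::RestoreKeyData,
    Phase::RestoreObject,
    Phase::RestoreChunk,
    Phase::Complete,
];

/// Data phases that still have to run after key-data (and therefore the
/// identity files in etc/desc) has been written.
pub fn phases_after_identity(order: &[Phase]) -> Vec<Phase> {
    match order.iter().position(|p| *p == Phase::RestoreKeyData) {
        Some(i) => order[i + 1..]
            .iter()
            .copied()
            .filter(|p| matches!(p, Phase::RestoreObject | Phase::RestoreChunk))
            .collect(),
        None => Vec::new(),
    }
}
```

With the original order, `phases_after_identity(&ORIGINAL_ORDER)` contains both `RestoreObject` and `RestoreChunk`, which is the window in which the identity files exist but the data is incomplete.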

Binding detection mechanism of ood-daemon

The binding detection logic of ood-daemon works as follows: if the identity files (device.desc/device.sec) are missing from the {cyfs}/etc/desc directory, a monitor process is started to periodically check whether the identity files have been created. Once they appear, it pulls up the other services, including gateway.
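A minimal sketch of this kind of polling check, using only the standard library. The function names, the polling interval and the bounded number of checks are assumptions for illustration; the actual ood-daemon monitor is not shown in this issue and may work differently.

```rust
use std::path::Path;
use std::thread;
use std::time::Duration;

/// Returns true only when both identity files exist in the desc directory.
/// The file names come from the issue text; everything else is hypothetical.
pub fn identity_present(desc_dir: &Path) -> bool {
    desc_dir.join("device.desc").is_file() && desc_dir.join("device.sec").is_file()
}

/// Polls the desc directory up to `max_checks` times, sleeping `interval`
/// between checks. Returns true as soon as the identity files are found,
/// which is the point where the other services would be started.
pub fn wait_for_identity(desc_dir: &Path, interval: Duration, max_checks: usize) -> bool {
    for _ in 0..max_checks {
        if identity_present(desc_dir) {
            return true;
        }
        thread::sleep(interval);
    }
    false
}
```

Note that a check like this only looks at the identity files. It says nothing about whether objects and chunks have been restored yet.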

This is where the problem arises. When gateway is started, the remote-restore process is still running, so the local objects and chunks are still incomplete. When gateway loads the root-state, it can hit the case where the object meta exists but the corresponding object blob is missing. The object is then treated as nonexistent and get_object returns None:

CYFS/src/component/cyfs-lib/src/prelude/named_object_cache.rs

Lines 142 to 148 in 89110cc

    
    #[derive(Clone)]
    pub struct NamedObjectCacheObjectRawData {
        // object maybe missing while meta info is still here
        pub object: Option<NONObjectInfo>,
        pub meta: NamedObjectMetaData,
    }

CYFS/src/component/cyfs-lib/src/prelude/named_object_cache.rs

Lines 261 to 281 in 89110cc

    
    async fn get_object(
        &self,
        req: &NamedObjectCacheGetObjectRequest,
    ) -> BuckyResult<Option<NamedObjectCacheObjectData>> {
        match self.get_object_raw(req).await? {
            Some(ret) => match ret.object {
                Some(object) => Ok(Some(NamedObjectCacheObjectData {
                    object,
                    meta: ret.meta,
                })),
                None => {
                    warn!(
                        "get object meta from noc but object missing! {}",
                        req.object_id
                    );
                    Ok(None)
                }
            },
            None => Ok(None),
        }
    }

As a result, gateway initializes a new global-state whose content is completely empty, which causes the problem described above.
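The failure mode can be reproduced in a toy model. The sketch below is an assumption-laden simplification: `Noc`, `RawEntry`, `RootState` and `load_root_state` are hypothetical stand-ins, not the cyfs-lib types, and the real root-state loader is not shown in this issue.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a cache entry: the meta row is represented by
/// the entry itself, while the object blob may still be missing.
pub struct RawEntry {
    pub object: Option<Vec<u8>>,
}

/// Hypothetical in-memory stand-in for the named object cache.
pub struct Noc {
    pub entries: HashMap<String, RawEntry>,
}

impl Noc {
    /// Mirrors the behaviour quoted above: meta without blob yields None.
    pub fn get_object(&self, id: &str) -> Option<&[u8]> {
        self.entries.get(id).and_then(|e| e.object.as_deref())
    }
}

#[derive(Debug, PartialEq)]
pub enum RootState {
    Loaded(Vec<u8>),
    NewEmpty,
}

/// A loader that cannot tell "not restored yet" from "never existed"
/// falls back to a fresh, empty state in both cases.
pub fn load_root_state(noc: &Noc, root_id: &str) -> RootState {
    match noc.get_object(root_id) {
        Some(blob) => RootState::Loaded(blob.to_vec()),
        None => RootState::NewEmpty,
    }
}
```

In this model, an entry with `object: None` makes `load_root_state` return `RootState::NewEmpty`, even though the blob would have arrived later in the restore.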

lurenpluto · 2023-05-15T11:13:54Z

This problem can be fixed from the following two sides:

1. Adjust the steps of uni restore

Restore objects and chunks first and key-data last, so that {cyfs}/etc/desc is written only after the objects and chunks have been restored, which avoids the problem above.

Within the key-data restore itself, the order of the restored files also needs to improve: the db files should be restored first and the etc directory last, so that by the time ood-daemon detects the identity files in the etc/desc dir, all the data has already been restored.

2. Improve the binding detection logic of the ood-daemon service

Since remote-restore is carried out inside ood-daemon, ood-daemon can stop detecting bindings while a remote-restore task is running, wait until the restore task is completed, and then continue with the binding monitor logic.

However, this improvement is limited and can only play a supplementary role. If an external, independent process is used for the restore operation, ood-daemon has no way to know the progress of the restore.
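The second option could look roughly like the sketch below. It is purely illustrative: `BindMonitor` and its methods are hypothetical names, and the flag is assumed to be set and cleared by the in-process restore task.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical gate between the restore task and the binding monitor.
pub struct BindMonitor {
    restore_in_progress: AtomicBool,
}

impl BindMonitor {
    pub fn new() -> Self {
        Self {
            restore_in_progress: AtomicBool::new(false),
        }
    }

    /// Called by the in-process restore task when it starts and finishes.
    pub fn set_restore_in_progress(&self, in_progress: bool) {
        self.restore_in_progress.store(in_progress, Ordering::SeqCst);
    }

    /// Services may start only when the identity files exist and no
    /// in-process restore task is currently running.
    pub fn should_start_services(&self, identity_present: bool) -> bool {
        identity_present && !self.restore_in_progress.load(Ordering::SeqCst)
    }
}
```

The limitation described above is visible here: the flag lives inside the ood-daemon process, so a restore driven by an external process never sets it.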


lurenpluto · 2023-05-16T06:41:47Z

The first option above has been adopted, with the following two major changes:

1. In the uni restore logic, the restore order is adjusted to objects->chunks->key-data.
2. In the key-data restore logic, the etc/desc directory is restored last.

These two changes ensure that, even with the existing binding detection logic of ood-daemon, a restore operation can restore and bind the OOD at the same time.

See commit 3cd354a.
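A minimal sketch of the file ordering inside the key-data restore, under stated assumptions: `restore_priority` and `order_key_data` are hypothetical helpers, paths are taken to be relative to {cyfs}, and db files are recognized by a `.db` extension. The referenced commit may implement the ordering differently.

```rust
/// Restore priority of a key-data path: lower values are restored earlier.
/// Assumption: db files are identified by a ".db" extension.
pub fn restore_priority(path: &str) -> u8 {
    let normalized = path.replace('\\', "/");
    if normalized == "etc/desc" || normalized.starts_with("etc/desc/") {
        2 // identity files go last, so binding detection fires at the very end
    } else if normalized.ends_with(".db") {
        0 // db files first
    } else {
        1 // everything else in between
    }
}

/// Stable sort: files with the same priority keep their original order.
pub fn order_key_data(mut paths: Vec<String>) -> Vec<String> {
    paths.sort_by_key(|p| restore_priority(p));
    paths
}
```

With an ordering like this, the identity files are the last thing written, so the existing binding check in ood-daemon only succeeds once everything else is already in place.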

lurenpluto added the labels bug (Something isn't working), Backup & Restore (The OOD backup and restore related), and OOD-daemon (The OOD-daemon basic service) May 15, 2023

lurenpluto added this to CYFS-Stack & Services May 15, 2023

lurenpluto moved this to 🐞Discovered Bugs in CYFS-Stack & Services May 15, 2023

lurenpluto added this to the Group-supported Release milestone May 15, 2023

lurenpluto moved this from 🐞Discovered Bugs to 📝Todo in CYFS-Stack & Services May 15, 2023

lurenpluto moved this from 📝Todo to 🚧In Progress in CYFS-Stack & Services May 15, 2023

lurenpluto added a commit that referenced this issue May 15, 2023

Issue #269: Merge remote-tracking branch 'origin/269-problems-with-oo…

3cd354a

…d-daemon-restore-backup-data' into main

lurenpluto moved this from 🚧In Progress to 🧪To Test in CYFS-Stack & Services May 16, 2023

lurenpluto self-assigned this May 16, 2023
