Architecture: backup

May be out of date

hackage-server has the ability to perform periodic backups (snapshots) of the server state. Data is exported and imported per feature, so a file "spammers.csv" from a spam feature would end up at "export/spam/spammers.csv" when the backup tarball is unpacked. Backups are distinct from happstack-state's persistent storage: they only represent the server state at the time the backup was created, and instead of serializing data in binary, they store it in simple text formats, such as CSV, which humans can view and edit. This makes the data easy to recover and repair, which is harder with binary data.

Backup functions are used in all server modes. The backup mode creates a backup tarball from the enabled features and then shuts down. The restore mode takes a backup tarball and reads it entry by entry, writing the parsed data structures to persistent storage. The new mode behaves like restore, except that it finalizes the restore process immediately without loading any tar entries. The convert mode creates an export tarball in an ad-hoc manner from legacy files. The run mode can also produce backups while the server is running.

Import and export

The type synonym Hackage uses for backup is BackupEntry, which is short for ([FilePath], ByteString). The first part of the pair is the path split on slashes. So the core feature, exporting a file at "package/reify-0.1.1/reify.cabal", should provide ["package", "reify-0.1.1", "reify.cabal"] along with the bytestring for the cabal file, to store it at "export/core/package/reify-0.1.1/reify.cabal" in the backup tarball. Feature names act as a primitive namespace for backup files, and a feature has full control over naming its own backup files, so long as they are valid filenames less than 155 characters long (a limitation of the tar format). On import, this same file is handed back as ["package", "reify-0.1.1", "reify.cabal"] along with the ByteString contents.
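
For example, a minimal sketch of the entry the core feature would produce for that cabal file (cabalEntry is a hypothetical helper, not part of the server):

-- Hypothetical helper: builds the BackupEntry described above. The tar
-- writer places it under "export/core/" because the core feature owns it.
cabalEntry :: ByteString -> BackupEntry
cabalEntry contents = (["package", "reify-0.1.1", "reify.cabal"], contents)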

Features should use as many files as they need, but should avoid clutter where possible. If a feature's state is a single map from PackageId to Maybe String, one CSV file suffices, with lines like:

base-4.2.0.0,nothing
containers-0.3.0.0,just,specialstring

Export

The conventions for export are fairly simple. If a feature doesn't need to export anything, it can put Nothing in its dumpBackup field (which has type Maybe (BlobStorage -> IO [BackupEntry])). Features whose data is redundant with other features' data should not store backup entries, even if they have persistent state, so long as this data can be reconstructed on restore (in the last stage of import; see below). This makes it easier to adjust the backup tarball if necessary.
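
In code, opting out is simply:

dumpBackup = Nothing  -- this feature's data is reconstructed from other features on restore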

Otherwise, there are two approaches. If blob storage isn't needed, query the state all at once, create a list of BackupEntry values, and return them. For example:

dumpBackup = Just $ \_ -> do
    posts <- fmap (Map.toList . blogPosts) $ query BlogState   
    let authorBackup :: [[String]]
        authorBackup = map (\(pid, post) -> [show pid, show $ postAuthor post]) posts
        postBackup :: [BackupEntry]
        postBackup = map (\(pid, post) -> (["post", show pid], BS.pack $ postText post)) posts
    return $ [csvToBackup ["authors.csv"] authorBackup] ++ postBackup

csvToBackup :: [String] -> CSV -> BackupEntry is a utility function provided by Distribution.Server.Backup.Export to make a backup entry from a CSV type, where type CSV = [[String]], from the csv package. It automatically escapes the fields to be a proper unambiguous CSV file.

If blob storage does need to be used, create a list of ExportEntry instead of BackupEntry. It's a very similar type: type ExportEntry = ([FilePath], Either ByteString BlobId). What's special about ExportEntry is that it postpones loading a blob into memory until the ByteString itself is needed, using unsafeInterleaveIO magic in readExportBlobs. These functions from Distribution.Server.Backup.Export are useful:

csvToExport :: [String] -> CSV -> ExportEntry
blobToExport :: [String] -> BlobId -> ExportEntry
readExportBlobs :: BlobStorage -> [ExportEntry] -> IO [BackupEntry]

An example of its usage:

dumpBackup = Just $ \store -> do
    doc <- query GetDocumentation
    let exportFunc (pkgid, (blob, _)) = blobToExport [display pkgid, "documentation.tar"] blob
    readExportBlobs store . map exportFunc . Map.toList $ documentation doc

Note that exportFunc throws away some data. This is fine, because it can be reconstructed on import from the blob itself. Whatever your export requirements, try to keep as much of the backup-file-creating code as pure as possible, so happstack-state isn't required to format complicated entries.

Import

Import is a bit more complicated. There's no guaranteed order for import entries, so hackage-server takes a three-stage approach:

  1. Import entry by entry, routing each BackupEntry to the proper feature.
  2. Organize any partial results from the first stage.
  3. Store the results:
     a. Write any data to happstack-state.
     b. Read happstack-state data for features this feature depends on, if necessary.

Stages 1 and 2 are allowed to fail. Stage 1 can complain about improperly formatted entries, and stage 2 can fail if the result is inconsistent or bits are missing (it should not fail merely because no entries were imported). Stage 3, which modifies the persistent state, should not fail, since that could leave the server data in an inconsistent state.

This is accomplished with the RestoreBackup type, which emulates OOP in that all of the feature-specific data is hidden, but every RestoreBackup object must implement the same interface.

data RestoreBackup = RestoreBackup {
    restoreEntry :: BackupEntry -> IO (Either String RestoreBackup),
    restoreFinalize :: IO (Either String RestoreBackup),
    restoreComplete :: IO ()
}

For examples of using this object, see the PackagesBackup and UserBackup modules. If the feature stores a data component FooBar, then backup would look something like:

fooBarBackup :: RestoreBackup
fooBarBackup = doRestoreFooBar emptyFooBar

doRestoreFooBar :: FooBar -> RestoreBackup
doRestoreFooBar foo = fix $ \r -> RestoreBackup
  { restoreEntry = \entry -> do
        res <- importFooBackup foo entry
        case res of
            Left str   -> return $ Left str
            Right foo' -> return $ Right (doRestoreFooBar foo')
  , restoreFinalize = return (Right r) -- no special finalization
  , restoreComplete = update $ ReplaceFooBar foo -- write the imported data to happstack-state (ReplaceFooBar is a hypothetical update transaction)
  }

importFooBackup :: FooBar -> BackupEntry -> IO (Either String FooBar)
importFooBackup foo (["foo.csv"], bs) = ...
...
importFooBackup foo _ = return $ Right foo  -- ignore unknown entries

RestoreBackup is an instance of Monoid, so the RestoreBackups for different data structures can be combined into one.
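
For instance, a feature with two independent stores could combine their restore objects like this (fooBarBackup is the example above; bazBackup is a hypothetical second one, and neither needs blob storage here):

restoreBackup = Just $ \_ -> mconcat [fooBarBackup, bazBackup]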

The Import monad is helpful for incrementally modifying the state of a feature and signalling failure: fail str produces a Left str. It is also an instance of MonadIO and MonadState. These functions go with it:

-- Run the Import monad.
runImport :: s -> Import s a -> IO (Either String s)

-- Try to read a string (second argument). In the case of failure, the label
-- (first argument) is used to indicate what sort of result was expected
-- (e.g. "user id", "package name").
parseRead :: Read a => String -> String -> Import s a

-- Try to parse a string using its Distribution.Text instance. The first argument is the label.
parseText :: Text a => String -> String -> Import s a

-- Read a time in the standard export format (in Distribution.Server.Backup.Utils)
parseTime :: String -> Import s UTCTime

-- A combinator to read a CSV file. The first argument is the file name; the
-- second is the CSV file itself, and the third is a function to be called if
-- the file was parsed correctly.
importCSV :: String -> ByteString -> (CSV -> Import s a) -> Import s a
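
As a rough sketch of how these fit together, here is how the authors.csv file from the blog export example might be read back in (BlogId and UserId are illustrative types, not part of the server):

import Control.Monad (forM_)
import Control.Monad.State (modify)
import Data.Map (Map)
import qualified Data.Map as Map

-- Parse authors.csv back into a map, failing with a descriptive message
-- on any malformed row.
importAuthors :: ByteString -> IO (Either String (Map BlogId UserId))
importAuthors bs = runImport Map.empty $ importCSV "authors.csv" bs $ \csv ->
    forM_ csv $ \row -> case row of
        [pidStr, authorStr] -> do
            pid    <- parseRead "blog post id" pidStr
            author <- parseRead "user id" authorStr
            modify $ Map.insert pid author
        _ -> fail $ "unexpected row in authors.csv: " ++ show row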

As for re-importing blobs, use add :: BlobStorage -> ByteString -> IO BlobId from Distribution.Server.Util.BlobStorage.
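
A sketch of the other direction for the documentation example above: the blob is written back to storage in restoreEntry, and the rebuilt map is written to state in restoreComplete. Here simpleParse comes from Distribution.Text, while doRestoreDocs and ReplaceDocumentation are illustrative names:

doRestoreDocs :: BlobStorage -> Map PackageId BlobId -> RestoreBackup
doRestoreDocs store docs = fix $ \r -> RestoreBackup
  { restoreEntry = \entry -> case entry of
        ([pkgStr, "documentation.tar"], bs)
          | Just pkgid <- simpleParse pkgStr -> do
              blob <- add store bs  -- put the tarball back into blob storage
              return . Right $ doRestoreDocs store (Map.insert pkgid blob docs)
        _ -> return (Right r)  -- ignore unknown entries
  , restoreFinalize = return (Right r)
  , restoreComplete = update $ ReplaceDocumentation docs  -- hypothetical update transaction
  }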

This structure of the import process makes it possible, in theory, to import features individually rather than in bulk. The main issue is making sure states are consistent: a feature can't have a user id which doesn't exist in the user database.

One last twist on the import scheme is that it can be used to rebuild data that is redundant with other features' data, so long as that data is retrieved in restoreComplete and the retrieval cannot fail. Features complete their restores in the same order they are initialized, so if feature A depends on feature B's data, feature B's import will complete before feature A's. This is used for reverse dependencies:

restoreBackup = Just $ \_ -> fix $ \r -> RestoreBackup
  { restoreEntry    = \_ -> return $ Right r
  , restoreFinalize = return $ Right r
  , restoreComplete = do
        putStrLn "Calculating reverse dependencies"
        index <- fmap packageList $ query GetPackagesState
        let revs = constructReverseIndex index
        update $ ReplaceReverseIndex revs
  }