
Binary cache for journal files #2325

Open
adept opened this issue Feb 13, 2025 · 2 comments
Labels
A-WISH Some kind of improvement request or proposal. performance Anything performance-related (run time, memory usage, disk space..)

Comments


adept commented Feb 13, 2025

This is a summary of discussion that happened on Matrix.

It would be nice to write down .file.journal.bin whenever file.journal is read, and then use .file.journal.bin as a fast binary cache for as long as file.journal does not change. This would avoid the considerable cost of re-parsing the journal
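As a sketch of the freshness check (cacheIsFresh is a hypothetical helper, not an existing hledger function), the cache would be considered usable only while it is at least as new as its source file:

```haskell
import System.Directory (doesFileExist, getModificationTime)

-- Hypothetical helper: the cached .file.journal.bin is usable only if it
-- exists and its modification time is not older than file.journal's.
-- (A real implementation would also need to consider included files.)
cacheIsFresh :: FilePath -> FilePath -> IO Bool
cacheIsFresh journalPath binPath = do
  haveCache <- doesFileExist binPath
  if not haveCache
    then return False
    else do
      srcTime <- getModificationTime journalPath
      binTime <- getModificationTime binPath
      return (binTime >= srcTime)
```

Comparing only the top-level file's mtime would not be sufficient in practice, since an included file can change without touching the main journal.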

A small speedbump here is that some datatypes used by hledger do not derive Generic and so can't be readily used with serialization libraries (like binary or cereal) that support the Generic typeclass. Those types are: Regexp, LocalTime, POSIXTime, Day, SourcePos, Decimal.

The hledger Journal needs to be "sanitized" a bit before it can be serialized; otherwise circular references in postings would cause serialization to loop forever. Specifically, the cycles are introduced by:

  • ptransaction in Posting
  • ptransaction inside the original posting (poriginal) in Posting
  • jfiles in journal, while not being circular, is certainly wasteful, as it would contain the original text of all journal files

A minimal inefficient test with binary that serializes all "inconvenient" types as strings using their Read/Show instances:

{-# LANGUAGE DeriveGeneric #-}

import GHC.Generics ()
import Data.Binary
import Data.Text (Text)
import Data.Time.LocalTime (LocalTime)
import Data.Time.Clock.POSIX (POSIXTime)
import Data.Time.Calendar (Day)
import Data.Decimal
import Hledger
import Hledger.Data.Types
import System.Environment (getArgs)

instance Binary Regexp where
  put rex = put ((reString rex)::Text)
  get = do rs <- get
           return (toRegex' rs)

instance Binary LocalTime where
  put lt = put ((show lt)::String)
  get = do s <- get
           return ((read s)::LocalTime)

instance Binary POSIXTime where
  put lt = put ((show lt)::String)
  get = do s <- get
           return ((read s)::POSIXTime)

instance Binary Day where
  put lt = put ((show lt)::String)
  get = do s <- get
           return ((read s)::Day)

instance Binary SourcePos where
  put (SourcePos a b c) = do
    put (a::String)
    put ((unPos b)::Int)
    put ((unPos c)::Int)
  
  get = do
    a<-get
    b<-get
    c<-get
    return (SourcePos a (mkPos b) (mkPos c))

instance Binary Decimal where
  put x = do
    put ((decimalPlaces x)::Word8)
    put ((decimalMantissa x)::Integer)
    
  get = do
    p<-get
    m<-get
    return (Decimal {decimalPlaces=p, decimalMantissa=m})

-- The remaining hledger types derive Generic, so binary's generic
-- default methods work without hand-written put/get:
instance Binary MixedAmountKey
instance Binary BalanceAssertion
instance Binary PostingType
instance Binary AmountCost
instance Binary MixedAmount
instance Binary Status
instance Binary Amount
instance Binary Posting
instance Binary EFDay
instance Binary DateSpan
instance Binary TMPostingRule
instance Binary Interval
instance Binary Transaction
instance Binary PeriodicTransaction
instance Binary TransactionModifier
instance Binary MarketPrice
instance Binary PriceDirective
instance Binary Commodity
instance Binary TimeclockCode
instance Binary AccountType
instance Binary AccountDeclarationInfo
instance Binary Rounding
instance Binary TagDeclarationInfo
instance Binary AmountPrecision
instance Binary DigitGroupStyle
instance Binary PayeeDeclarationInfo
instance Binary Side
instance Binary TimeclockEntry
instance Binary AccountAlias
instance Binary AmountStyle
instance Binary Journal

-- Break the posting -> parent transaction back-references (in both the
-- posting and its preserved original) so that serialization terminates.
txnUntie :: Transaction -> Transaction
txnUntie t@Transaction{tpostings=ps} =
  t{tpostings=map (\p ->
     case poriginal p of
       Just orig -> p{ptransaction=Nothing,poriginal=Just orig{ptransaction=Nothing}}
       Nothing    -> p{ptransaction=Nothing}) ps}


-- tie :: Journal -> Journal
-- tie j@Journal{jtxns=ts} = j{jtxns=map txnTieKnot ts}

-- Untie all transactions and drop the stored source text (jfiles).
untie :: Journal -> Journal
untie j@Journal{jtxns=ts} = j{jtxns=map txnUntie ts, jfiles=[]}

main :: IO ()
main = do
  args <- getArgs
  j <- readJournalFiles' args
  let untied = untie j
  encodeFile "test.bin" untied
  j' <- decodeFile "test.bin"
  print (untied == j')
simonmichael added the A-WISH and performance labels Feb 13, 2025

simonmichael commented Feb 13, 2025

More notes and context from chat log starting here:

I was just thinking again about caching state on hard drive, like Ledger used to

I don't think making the journal more explicit has much noticeable impact on speed, but I could be wrong
I would guess caching as json or sqlite might help more.. but of course only trying it would tell for sure

"starting balances cache"
vs
a fast-enough cache of everything ... that would probably be superior
though, explicitly declared boundaries and checkpoints are still valuable

D: I've pushed "read journal once, process many times" script as #2323

I woke up with a desire to de/serialise and have done a quick research sweep. We currently don't have any complete de/serialising of hledger data, other than print/read. We use "exotic" types like Day, Decimal, Regex, that most libs don't handle easily - tried protobuf, cereal, binary, serialise, aeson. Hledger.Data.JSON is closest, but not there yet.

it sounds straightforward to automatically cache parsed data from foo.journal as .cache.foo.journal

D: I've also decided to experiment. packman failed to build for me...
I've made minimal-implementation-that-compiles with binary

very nice. It looked like cereal is a newer, more popular binary; maybe it's similar

D: There's 460 packages using cereal and 1122 using binary according to https://hackage.haskell.org/packages/reverse , so I'd say binary is more popular :)

To help keep the UX impact minimal I suggested hidden files (consistent with our .latest files), one per input file (or just journal files if easier). These would be disposable, regenerated automatically as needed. I suggested a standard prefix like .cache so they could be seen and erased easily in a directory listing.

The goal would be to speed up hledger by a factor of 2 on large files, since past measurements show parsing and report calculation take about half the time each, and we assume that reading parsed data from a cached file could be essentially free compared to parsing. Testing could show otherwise, eg parsing and calculating phases might not always be easy to distinguish.

The simplest serialisation of haskell values is with Show and Read. This would be verbose and might not be the fastest.
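For illustration, the Show/Read round-trip can be sketched with a stand-in record (FakePosting is invented here, not an hledger type):

```haskell
-- Derived Show/Read give textual serialisation for free; the output is
-- verbose, and parsing it back with read is comparatively slow.
data FakePosting = FakePosting { fpAccount :: String, fpAmount :: Integer }
  deriving (Show, Read, Eq)

roundTrips :: FakePosting -> Bool
roundTrips p = read (show p) == p
```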

JSON is the most widespread serialisation format and probably the most human readable and editable option discussed here. Also we already use it for hledger report output and for hledger-web's HTTP API. It needs more work in hledger to make it a viable input format. Like Show/Read it probably (?) wouldn't be the fastest to read.

One of the binary formats (binary's, cereal's, serialise's, CBOR etc.) probably would be the fastest to read at runtime. Binary formats are also less human-readable and more opaque, which can be a disadvantage or an advantage. CBOR has generic tools for inspecting it. Hidden cache files increase the risk of forgetting and leaking private data.

Ledger used to have this feature but later dropped it, because it added complexity/bugs/maintenance cost and/or because Ledger was fast enough without it. Speeding up hledger generally in other ways, if feasible, reduces the need for caching.


adept commented Feb 13, 2025

Quick measurement using the code in the top message:

  • it takes 7.5 seconds to parse my largest journal with 40K transactions and ~400 included files, and save it to binary file
  • it takes 1.7 seconds to read it back from binary (and print the number of transactions)
  • it takes 4.5 seconds to do hledger print | head on it, so serialization (at the moment) has rather high cost
