Binary cache for journal files #2325
More notes and context from chat log starting here:
To help keep the UX impact minimal, I suggested hidden files (consistent with our `.latest` files), one per input file (or just journal files, if that is easier). These would be disposable and regenerated automatically as needed. I suggested a standard prefix like `.cache` so they could be seen and erased easily in a directory listing.

The goal would be to speed up hledger by a factor of 2 on large files, since past measurements show that parsing and report calculation each take about half the time, and we assume that reading parsed data from a cached file could be essentially free compared to parsing. Testing could show otherwise, e.g. the parsing and calculating phases might not always be easy to distinguish.

The simplest serialisation of Haskell values is with Show and Read. This would be verbose and might not be the fastest.

JSON is the most widespread serialisation format and probably the most human-readable and editable option discussed here. We also already use it for hledger report output and for hledger-web's HTTP API. It needs more work in hledger to make it a viable input format. Like Show/Read, it probably (?) wouldn't be the fastest to read.

One of the binary formats (

Ledger used to have this feature but later dropped it, because it added complexity/bugs/maintenance cost and/or because Ledger was fast enough without it. Speeding up hledger generally in other ways, if feasible, reduces the need for caching.
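The "disposable, regenerated as needed" idea could be sketched roughly as follows. This is only an illustration: the function names `cachePathFor` and `cacheIsFresh` and the exact `.cache.` naming are hypothetical, and it assumes the `directory` and `filepath` packages. It also checks only a single source file, so a change in an included file would go unnoticed.

```haskell
import System.Directory (doesFileExist, getModificationTime)
import System.FilePath (replaceFileName, takeFileName)

-- | Hypothetical naming scheme: a hidden sibling file with a ".cache." prefix,
-- e.g. "dir/file.journal" -> "dir/.cache.file.journal".
cachePathFor :: FilePath -> FilePath
cachePathFor src = replaceFileName src (".cache." ++ takeFileName src)

-- | The cache is considered usable only if it exists and is strictly newer
-- than the source file. Equal timestamps count as stale, which is the safe
-- choice on file systems with coarse modification-time resolution.
cacheIsFresh :: FilePath -> IO Bool
cacheIsFresh src = do
  let cache = cachePathFor src
  exists <- doesFileExist cache
  if not exists
    then pure False
    else do
      srcTime   <- getModificationTime src
      cacheTime <- getModificationTime cache
      pure (cacheTime > srcTime)
```

A caller would parse the journal and rewrite the cache whenever `cacheIsFresh` returns `False`, so deleting the cache files is always safe.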
Quick measurement using the code in the top message:
This is a summary of discussion that happened on Matrix.

It would be nice to write down `.file.journal.bin` whenever `file.journal` is read, and then use `.file.journal.bin` as a fast binary cache for as long as `file.journal` does not change. This would avoid the considerable cost of re-parsing the journal.

A small speedbump here is that some datatypes used by hledger do not derive `Generic` and so can't be readily used with serialization libraries (like `binary` or `cereal`) that support the `Generic` typeclass. Those types are: `Regexp`, `LocalTime`, `POSIXTime`, `Day`, `SourcePos`, `Decimal`.

The hledger journal needs to be "sanitized" a bit before it can be serialized, otherwise circular references in postings would cause serialization to go on forever. Specifically, circular loops are introduced by:

- `ptransaction` in `Posting`
- `ptransaction` inside the original posting (`poriginal`) in `Posting`
- `jfiles` in the journal, while not circular, is certainly wasteful, as it would contain the original text of all journal files

A minimal, inefficient test with `binary` that serializes all "inconvenient" types as strings using their Read/Show instances:
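A sketch of how such a test might look, using simplified stand-in types rather than hledger's real ones. `Txn`, `Posting`, `ViaShow` and `tieKnot` are hypothetical names for illustration; only the field names `ptransaction`, `tdate`, `tdescription`, `tpostings`, `paccount` and `pamount` are borrowed from hledger, and their types are reduced here. It assumes the `binary` and `time` packages.

```haskell
import Data.Binary (Binary (..), decode, encode)
import Data.Time (Day, fromGregorian)

-- | Wrapper that serializes any Show/Read type as its String rendering.
-- Slow and verbose, but needs no Generic instance.
newtype ViaShow a = ViaShow { unViaShow :: a }

instance (Show a, Read a) => Binary (ViaShow a) where
  put = put . show . unViaShow
  get = ViaShow . read <$> get

-- Simplified stand-ins with the same cyclic shape as hledger's types.
data Txn = Txn
  { tdate        :: Day
  , tdescription :: String
  , tpostings    :: [Posting]
  }

data Posting = Posting
  { paccount     :: String
  , pamount      :: Integer
  , ptransaction :: Maybe Txn  -- back-reference to the parent: a cycle
  }

-- "Sanitizing": the back-reference is never written, so encoding terminates.
instance Binary Posting where
  put p = put (paccount p) >> put (pamount p)
  get = do
    acct <- get
    amt  <- get
    pure (Posting acct amt Nothing)

instance Binary Txn where
  put t = put (ViaShow (tdate t)) >> put (tdescription t) >> put (tpostings t)
  get = do
    ViaShow d <- get
    desc      <- get
    ps        <- get
    pure (tieKnot (Txn d desc ps))

-- | Restore the back-references lazily after decoding.
tieKnot :: Txn -> Txn
tieKnot t = t'
  where
    t' = t { tpostings = [p { ptransaction = Just t' } | p <- tpostings t] }
```

With these instances, `decode (encode txn) :: Txn` round-trips a transaction even when its postings point back at it, because `put` never follows `ptransaction`.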