Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement encoding in .rules #2319

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

jokesper
Copy link

@jokesper jokesper commented Jan 21, 2025

An implementation for encoding in rules file:
The cases are:

  • -f *.rules
  • -f *.csv
  • -f *.csv --rules *.rules

The second and third case are effectively the same. The problem is that for them we call a Readers rReadFn value taking the contents rather then the filepath / handle.
So we either would have to change Reader to take a FilePath instead of Text, which would break how we do Tests,
or we add an extra field rGetContents :: FilePath -> IO Text.
I would rather get some feedback on this before I implement a potentially bigger change.

If implemented this PR should:

@jokesper jokesper force-pushed the feat-rules-encoding branch from e30d9ad to 5966c8a Compare January 21, 2025 22:43
@jokesper jokesper changed the title Implement encoding for -f *.rules Implement encoding in .rules Jan 21, 2025
@simonmichael
Copy link
Owner

Thanks for exploring this. It's probably a good idea. But what if rules files made it easier to run a preprocessor, eg

source !iconv -f CP1250 FILE.csv

That requires a little more expertise from the user, I guess ?

@simonmichael simonmichael added A-WISH Some kind of improvement request, hare-brained proposal, or plea. i18n Internationalisation/localisation-related. csv The csv file format, csv output format, or generally CSV-related. labels Jan 22, 2025
@simonmichael
Copy link
Owner

The case of reading CSV from stdin might also be worth thinking about.

@jokesper
Copy link
Author

But what if rules files made it easier to run a preprocessor

I like the idea, though implementing it with the other cases is a bit wierd, since we ignore source with -f.

We could split it into two options with the other being preprocessor iconv -f CP1250 {}.
This has the drawback of either reserving a replacement or allowing one to specify it.

The case of reading CSV from stdin might also be worth thinking about.

Is it even possible to read a csv from stdin?

  • -f - --rules *.rules doesn't work, since we first figure out whether it's a csv based on the file ending.
  • source - doesn't work, since it searches for - as a file and can't find it.
  • source /dev/stdin works, but is a bit hacky.

Though if they would work, they would be handled the same as the cases mentioned above.
Opening the handle and reading from it are indpendent.

@simonmichael
Copy link
Owner

simonmichael commented Jan 22, 2025 via email

@simonmichael
Copy link
Owner

Hi @jokesper, some more thoughts.
Docs and tests will help motivate this, when the time comes.
But I see this adds text-icu, which might be a heavy dependency. I have been testing a lttle.

On mac, it doesn't install cleanly for me:

$ stack install text-icu
text-icu> configure
text-icu> Configuring text-icu-0.8.0.5...
text-icu> Error: [Cabal-7123]
text-icu> The pkg-config package 'icu-i18n' version >=62.1 is required but it could not be found.
text-icu> 

Error: [S-7282]
       Stack failed to execute the build plan.
       
       While executing the build plan, Stack encountered the error:
       
       [S-7011]
       While building package text-icu-0.8.0.5 (scroll up to its section to see the error) using:
       /Users/simon/.stack/setup-exe-cache/aarch64-osx/Cabal-simple_w2MFVN35_3.12.0.0_ghc-9.10.1 --verbose=1 --builddir=.stack-work/dist/aarch64-osx/ghc-9.10.1 configure --with-ghc=/Users/simon/.ghcup/ghc/9.10.1/bin/ghc-9.10.1 --with-ghc-pkg=/Users/simon/.ghcup/ghc/9.10.1/bin/ghc-pkg-9.10.1 --user --package-db=clear --package-db=global --package-db=/Users/simon/.stack/snapshots/aarch64-osx/ddce8e417d90ad6b1b4786c8f51cb8baeb9394e58ea7e0e4b14902fda6581e02/9.10.1/pkgdb --libdir=/Users/simon/.stack/snapshots/aarch64-osx/ddce8e417d90ad6b1b4786c8f51cb8baeb9394e58ea7e0e4b14902fda6581e02/9.10.1/lib --bindir=/Users/simon/.stack/snapshots/aarch64-osx/ddce8e417d90ad6b1b4786c8f51cb8baeb9394e58ea7e0e4b14902fda6581e02/9.10.1/bin --datadir=/Users/simon/.stack/snapshots/aarch64-osx/ddce8e417d90ad6b1b4786c8f51cb8baeb9394e58ea7e0e4b14902fda6581e02/9.10.1/share --libexecdir=/Users/simon/.stack/snapshots/aarch64-osx/ddce8e417d90ad6b1b4786c8f51cb8baeb9394e58ea7e0e4b14902fda6581e02/9.10.1/libexec --sysconfdir=/Users/simon/.stack/snapshots/aarch64-osx/ddce8e417d90ad6b1b4786c8f51cb8baeb9394e58ea7e0e4b14902fda6581e02/9.10.1/etc --docdir=/Users/simon/.stack/snapshots/aarch64-osx/ddce8e417d90ad6b1b4786c8f51cb8baeb9394e58ea7e0e4b14902fda6581e02/9.10.1/doc/text-icu-0.8.0.5 --htmldir=/Users/simon/.stack/snapshots/aarch64-osx/ddce8e417d90ad6b1b4786c8f51cb8baeb9394e58ea7e0e4b14902fda6581e02/9.10.1/doc/text-icu-0.8.0.5 --haddockdir=/Users/simon/.stack/snapshots/aarch64-osx/ddce8e417d90ad6b1b4786c8f51cb8baeb9394e58ea7e0e4b14902fda6581e02/9.10.1/doc/text-icu-0.8.0.5 --dependency=base=base-4.20.0.0-380b --dependency=bytestring=bytestring-0.12.1.0-5f32 --dependency=deepseq=deepseq-1.5.0.0-c88a --dependency=text=text-2.1.1-05f2 --dependency=time=time-1.12.2-c82d -fhomebrew --exact-configuration --ghc-option=-fhide-source-paths
       Process exited with code: ExitFailure 1 

I have brew's icu4c installed:

$ brew info icu4c
==> icu4c@76: stable 76.1 (bottled) [keg-only]
C/C++ and Java libraries for Unicode and globalization
https://icu.unicode.org/home
Installed

I haven't been able to test on windows (I have only an arm vm, which GHC doesn't support yet).

@simonmichael
Copy link
Owner

But it can run x86_64 ghc. And yes, it's similar on Windows: text-icu isn't an easy thing to install.

C:\Users\Simon>stack install text-icu
[1 of 3] Compiling Main             ( C:\sr\setup-exe-src\setup-O_vy6YIf.hs, C:\sr\setup-exe-src\setup-O_vy6YIf.o )
[2 of 3] Compiling StackSetupShim   ( C:\sr\setup-exe-src\setup-shim-O_vy6YIf.hs, C:\sr\setup-exe-src\setup-shim-O_vy6YIf.o )
[3 of 3] Linking C:\sr\setup-exe-cache\x86_64-windows\tmp-Cabal-simple_O_vy6YIf_3.10.3.0_ghc-9.8.4.exe
text-icu> configure
text-icu> Configuring text-icu-0.8.0.5...
text-icu> Error: Cabal-simple_O_vy6YIf_3.10.3.0_ghc-9.8.4.exe: The program 'pkg-config'
text-icu> version >=0.9.0 is required but it could not be found.
text-icu>

Error: [S-7282]
       Stack failed to execute the build plan.

       While executing the build plan, Stack encountered the error:

       [S-7011]
       While building package text-icu-0.8.0.5 (scroll up to its section to see the error) using:
       C:\sr\setup-exe-cache\x86_64-windows\Cabal-simple_O_vy6YIf_3.10.3.0_ghc-9.8.4.exe --verbose=1 --builddir=.stack-work\dist\f24b2e15 configure --with-ghc=C:\Users\Simon\AppData\Local\Programs\stack\x86_64-windows\ghc-9.8.4\bin\ghc-9.8.4.exe --with-ghc-pkg=C:\Users\Simon\AppData\Local\Programs\stack\x86_64-windows\ghc-9.8.4\bin\ghc-pkg-9.8.4.exe --user --package-db=clear --package-db=global --package-db=C:\sr\snapshots\ce0f12f1\pkgdb --libdir=C:\sr\snapshots\ce0f12f1\lib --bindir=C:\sr\snapshots\ce0f12f1\bin --datadir=C:\sr\snapshots\ce0f12f1\share --libexecdir=C:\sr\snapshots\ce0f12f1\libexec --sysconfdir=C:\sr\snapshots\ce0f12f1\etc --docdir=C:\sr\snapshots\ce0f12f1\doc\text-icu-0.8.0.5 --htmldir=C:\sr\snapshots\ce0f12f1\doc\text-icu-0.8.0.5 --haddockdir=C:\sr\snapshots\ce0f12f1\doc\text-icu-0.8.0.5 --dependency=base=base-4.19.2.0-943b --dependency=bytestring=bytestring-0.12.1.0-290b --dependency=deepseq=deepseq-1.5.1.0-f918 --dependency=text=text-2.1.1-3967 --dependency=time=time-1.12.2-0f79 -fhomebrew --extra-include-dirs=C:\Users\Simon\AppData\Local\Programs\stack\x86_64-windows\msys2-20240727\mingw64\include --extra-lib-dirs=C:\Users\Simon\AppData\Local\Programs\stack\x86_64-windows\msys2-20240727\mingw64\lib --extra-lib-dirs=C:\Users\Simon\AppData\Local\Programs\stack\x86_64-windows\msys2-20240727\mingw64\bin --exact-configuration --ghc-option=-fhide-source-paths
       Process exited with code: ExitFailure 1

@jokesper
Copy link
Author

Should I instead switch to something like encoding?
It would be easier to install, as it's installed entirely by cabal / stack and would be statically build into hledger.
The drawback is: It works on String not Text. I.e. we would need a possibly costly pack (not tested).

@simonmichael
Copy link
Owner

simonmichael commented Jan 25, 2025 via email

@jokesper jokesper force-pushed the feat-rules-encoding branch from 5966c8a to 88749fd Compare January 26, 2025 14:15
@jokesper
Copy link
Author

I just switched the part I already implemented to use encoding instead of text-icu.

I noticed encoding uses a custom setup which is currently affected by this bug. For some developers (on Arch btw.) this leads to an extra option they should pass: --ghc-options=-dynamic when using cabal.

Should I document this anywhere, as it's already documented in the Arch wiki.

@jokesper jokesper force-pushed the feat-rules-encoding branch from 88749fd to 3fd57bb Compare February 1, 2025 15:36
@jokesper
Copy link
Author

jokesper commented Feb 1, 2025

I implemented the remaining two cases. To do this I had to change the arguments to rReadFn. The details on why are in the commit message.

I think this is ready for review.

@jokesper jokesper marked this pull request as ready for review February 1, 2025 15:39
@jokesper jokesper force-pushed the feat-rules-encoding branch from 3fd57bb to 0512800 Compare February 1, 2025 16:13
@jokesper
Copy link
Author

jokesper commented Feb 1, 2025

Docs and tests will help motivate this, when the time comes.

Writing tests for this would require quite a bit of rework in how they are treated.
Currently the readers for csv and rules only look at the filepath and not at their passed contents.
Instead they get them themselves.

@simonmichael
Copy link
Owner

Hi @jokesper. Sorry for the slow reply. If this would make hledger installers need to use special build flags, that would be a problem unfortunately.

@simonmichael
Copy link
Owner

I have inquired at dmwit/encoding#23

Instead of `text-icu` as [recommended by `text`](https://hackage.haskell.org/package/text-2.1.2/docs/Data-Text-Encoding.html):

> To gain access to a much larger family of encodings, use the `text-icu` package.

we use `encoding`, since `text-icu` requires an external library.
`encoding` does require a custom setup, which is currently
affected by a [bug in Cabal](haskell/cabal#6505)
but it doesn't require an external runtime dependency, which makes it
easier to distribute on macos and windows.

I also changed the `rReadFn` to get a `Handle` rather than a `Text`.
Such that we can use non standard functions for reading it.
This is kind of sideways, since the one place we care about it,
we don't use it at all. Though it still caused issues, since the input
wasn't parseable as text.
I think not using the handle has something to do with how `-` is treated
as stdin but I didn't do any further tests on that, since this is out of
the scope of this PR.
@jokesper
Copy link
Author

jokesper commented Feb 2, 2025

While trying to fix the CI, I noticed you already needed to supply --ghc-options=-dynamic to build hledger-web, since it depends on xml-conduit which has a custom setup.
Furthermore this is only a problem with cabal (stack should work fine) and on the first build (since then the dependencies are build).

Though I would still prefer if we didn't need a flag just to build the core hledger-lib, which would also affect every project depending on hledger. A fix would also (probably) have the added bonus of potentially reducing dependencies required (indirectly) by hledger.

@jokesper jokesper force-pushed the feat-rules-encoding branch from 2c19779 to 0b6ce73 Compare February 2, 2025 19:04
This makes sure we don't have to build a `build-type: Custom` package
@jokesper
Copy link
Author

jokesper commented Feb 10, 2025

Since encoding now dropped the custom build type, the associated problems have now been fixed*

Though the new encoding version is not yet in stackage.

@simonmichael
Copy link
Owner

Thanks @jokesper . Adding encoding-0.9 to stack*.yaml should fix the current CI failure.

@simonmichael
Copy link
Owner

(Or in the case of stack.yaml, bumping it to the latest nightly-2025-02-11.)

@jokesper jokesper force-pushed the feat-rules-encoding branch from 85deea2 to d775b01 Compare February 11, 2025 18:27
@jokesper
Copy link
Author

Thanks for the info. It's fixed now

@simonmichael
Copy link
Owner

simonmichael commented Feb 11, 2025

Thanks @jokesper.
Now it's time for more review, and I'd welcome help from other reviewers.
Here's some relevant info:

26 files changed

encoding >0.9 dependency added

Main commit description

feat: Implement encoding in rules file

Instead of `text-icu` as [recommended by `text`](https://hackage.haskell.org/package/text-2.1.2/docs/Data-Text-Encoding.html):

> To gain access to a much larger family of encodings, use the `text-icu` package.

we use `encoding`, since `text-icu` requires an external library.
`encoding` does require a custom setup, which is currently
affected by a [bug in Cabal](https://github.com/haskell/cabal/issues/6505)
but it doesn't require an external runtime dependency, which makes it
easier to distribute on macos and windows.

I also changed the `rReadFn` to get a `Handle` rather than a `Text`.
Such that we can use non standard functions for reading it.
This is kind of sideways, since the one place we care about it,
we don't use it at all. Though it still caused issues, since the input
wasn't parseable as text.
I think not using the handle has something to do with how `-` is treated
as stdin but I didn't do any further tests on that, since this is out of
the scope of this PR.

Github AI summary

This pull request includes several changes to the hledger-lib library, primarily focusing on modifying the journal reading functions to use Handle instead of Text and adding support for handling different text encodings. The changes span multiple files and involve updates to function signatures, imports, and helper functions.

Updates to journal reading functions:

  • Changed readJournal and related functions to accept a Handle instead of Text for reading journal data (hledger-lib/Hledger/Read.hs). [1] [2]
  • Added a new function readJournal'' that allows reading from Text by converting it to a Handle (hledger-lib/Hledger/Read.hs).

Import and helper function updates:

  • Updated imports to include Handle and other necessary modules (hledger-lib/Hledger/Read.hs, hledger-lib/Hledger/Read/Common.hs, hledger-lib/Hledger/Read/CsvReader.hs, hledger-lib/Hledger/Read/RulesReader.hs). [1] [2] [3] [4]
  • Added helper functions such as handleReadFnToTextReadFn to facilitate the transition from Text to Handle (hledger-lib/Hledger/Read/Common.hs).

Changes in specific readers:

  • Updated CsvReader, JournalReader, RulesReader, TimeclockReader, and TimedotReader to use the new Handle-based functions (hledger-lib/Hledger/Read/CsvReader.hs, hledger-lib/Hledger/Read/JournalReader.hs, hledger-lib/Hledger/Read/RulesReader.hs, hledger-lib/Hledger/Read/TimeclockReader.hs, hledger-lib/Hledger/Read/TimedotReader.hs). [1] [2] [3] [4] [5]

Test updates:

  • Modified tests in PostingsReport to use the new readJournal'' function (hledger-lib/Hledger/Reports/PostingsReport.hs). [1] [2] [3]

Miscellaneous:

  • Added new language pragmas and updated the module exports to include new functions (hledger-lib/Hledger/Utils/IO.hs). [1] [2]

These changes improve the flexibility and robustness of the journal reading functions by allowing them to handle different input sources and text encodings.

SM summary/doc draft

Previously, hledger could read CSV files containing non-ascii characters only if they are UTF8-encoded.
Now there is a new CSV rule, encoding ENCODING, which allows reading CSV files with other encodings.
The following encodings are supported:
...

Actions

  • Let's squash these commits into one when you get a chance

  • Let's update the message. I usually don't bother with markdown hyperlinks in these - Github doesn't render them.
    And let's clarify a bit the user-facing changes as well as the developer-facing ones.
    It could be something like

    feat: Add an `encoding` CSV rule, allowing non-UTF8-encoded CSV files to be read.
    
    This uses the encoding library, which supports fewer encodings than text-icu
    but is more portable, not requiring a C library.
    

    This seems more temporary PR discussion than informative permanent commit message -
    if it should stay, maybe we can clarify it a bit more:

    I also changed the `rReadFn` to get a `Handle` rather than a `Text`.
    Such that we can use non standard functions for reading it.
    This is kind of sideways, since the one place we care about it,
    we don't use it at all. Though it still caused issues, since the input
    wasn't parseable as text.
    I think not using the handle has something to do with how `-` is treated
    as stdin but I didn't do any further tests on that, since this is out of
    the scope of this PR.
    

@simonmichael
Copy link
Owner

I see ImplicitParams is added, was that intended ?

@jokesper
Copy link
Author

Yes, this is due to the interface of hGetContents from encoding using an implicit parameter.

@simonmichael
Copy link
Owner

simonmichael commented Feb 11, 2025

Edited my recent comment a little (re commit message).

I see. I have used ImplicitParams in the past and don't like it much, it always confuses me. But it seems a requirement here.. and maybe a benefit. Do you have any insight into what it does for us here ?

@simonmichael
Copy link
Owner

simonmichael commented Feb 11, 2025

I see:

https://hackage.haskell.org/package/encoding-0.9/docs/System-IO-Encoding.html

By using implicit parameters, it can be used almost as a drop-in replacement

An encoding can be assigned to ?enc, which works like a global variable for the System.IO.Encoding functions, instead of needing to provide it as an argument to each of them.

@jokesper
Copy link
Author

Alternatively we could inline the body of hGetContents. It's just reading a lazy bytestring and decoding it.

@simonmichael
Copy link
Owner

simonmichael commented Feb 11, 2025

I kicked off the multi-platform CI builds. encoding is not building on Windows unfortunately, because of:

encoding                     > Configuring encoding-0.9...
encoding                     > Error: [Cabal-4345]
encoding                     > Missing dependency on a foreign library:
encoding                     > * Missing (or bad) header file: system_encoding.h
encoding                     > If the header file does exist, it may contain errors that are caught by the C compiler at the preprocessing stage. In this case you can re-run configure with the verbosity flag -v3 to see the error messages.

Here's related discussion. It sounds like this file wasn't and perhaps still isn't present on a stock Windows.

And dmwit/encoding#14 mentions a build flag that can work around it.

@simonmichael
Copy link
Owner

I opened dmwit/encoding#26

@jokesper
Copy link
Author

I applied the workaround. Though I don't know whether this fixes the issue for cabal on windows. Is this (cabal+windows) even part of our target developer base?

@simonmichael
Copy link
Owner

Yes, being easy to build with stack or cabal on any platform is a hledger requirement. Better if we can fix this upstream if possible.

Here's another note for the doc:

The following encodings are supported (case-insensitive, can also be written with inner spaces or hyphens):

ASCII
UTF8
UTF16
UTF32
ISO88591
ISO88592
ISO88593
ISO88594
ISO88595
ISO88596
ISO88597
ISO88598
ISO88599
ISO885910
ISO885911
ISO885913
ISO885914
ISO885915
ISO885916
CP1250
CP1251
CP1252
CP1253
CP1254
CP1255
CP1256
CP1257
CP1258
KOI8R
KOI8U
GB18030
MacOSRoman
JISX0201
JISX0208
ISO2022JP
ShiftJIS
CP437
CP737
CP775
CP850
CP852
CP855
CP857
CP860
CP861
CP862
CP863
CP864
CP865
CP866
CP869
CP874
CP932

@jokesper
Copy link
Author

I added the list to the manual. With a link to the cases for the alternative spellings.
though I don't know whether the either is necessary, as the intention of this function is to allow any spelling for an encoding one may have choosen to be work.

@simonmichael
Copy link
Owner

@jokesper, beyond the scope of this PR but do you have any thoughts about wider support for text encodings - would we ideally support this for all input files ? Output files ? Or is it best left to external tools, with CSV reading being a special case because we are more often dealing with CSV from diverse sources ?

@jokesper
Copy link
Author

I think having it just for CSVs is fine, since this is the common use case. It's unlikely to get a journal file in a different encoding, though what would be possible, since we changed from Text to Handle one could maybe implement parsing electronic invoicing. Though this is way out of the scope of HLedger.

Expanding it to other files wouldn't be that hard, except for the design choice of where to document the encoding (command line parameter?). Though in my opinion it's really not needed at all for any other cases.
It doesn't help anyone now, since we effectively already enforced an encoding, and the journal format isn't any common import / export format so no benefit there.

What could be useful is in an csv export, though I don't quite know if this is really relevent. If anyone wants to use it in an CSV export please open an issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-WISH Some kind of improvement request, hare-brained proposal, or plea. csv The csv file format, csv output format, or generally CSV-related. i18n Internationalisation/localisation-related.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CSV import rules, encoding
2 participants