[Draft] Feature: automatic document translation #6386

Draft
wants to merge 5 commits into
base: dev

@bjesus bjesus commented Apr 13, 2024

Proposed change

This PR adds support for automatically translating imported documents into whatever destination language is configured. The translated content can then be used when searching, just like the normal content field. I think it is very useful when you have a lot of content in a language you aren't fluent in. For example, I live in the Netherlands, and searching for "tax" is much easier for me than searching for "belastingdienst". The PR currently uses DeepL for translation. It is at a POC stage: I wanted to check with you that you'd be interested in such a feature to begin with, as well as ask some questions regarding your preferred implementation method.

Closes #269

Type of change

  • Bug fix: non-breaking change which fixes an issue.
  • New feature / Enhancement: non-breaking change which adds functionality. Please read the important note above.
  • Breaking change: fix or feature that would cause existing functionality to not work as expected.
  • Documentation only.
  • Other. Please explain:

Checklist:

The PR is a WIP so I haven't added tests yet.

  • I have read & agree with the contributing guidelines.
  • If applicable, I have included testing coverage for new code in this PR, for backend and / or front-end changes.
  • If applicable, I have tested my code for new features & regressions on both mobile & desktop devices, using the latest version of major browsers.
  • If applicable, I have checked that all tests pass, see documentation.
  • I have run all pre-commit hooks, see documentation.
  • I have made corresponding changes to the documentation as needed.
  • I have checked my modifications for any breaking changes.

My questions

  • Should we go with DeepL? Do you have another preference? Some users recommended LibreTranslate. DeepL has a free API as well as a paid one, and LibreTranslate seems to be able to run locally. Another alternative, I assume, is asking some LLM to do the translation. I think it would be nice to allow users to choose a translation backend.
  • Is it okay that I've added a new translation field to the Document model?
  • At what point should the translation happen? For now it runs as part of update_document_archive_file, but that probably isn't the best place.
  • Should translate_content live inside documents/tasks.py, or should I separate this even further?
  • Currently I'm presenting the translation right below the "Content" text area, in the "Content" tab. Is that okay? Do you have a better idea?

Thank you very much for Paperless-ngx!

@bjesus bjesus requested review from a team as code owners April 13, 2024 07:45
@paperless-ngx-secretary

Hello @bjesus,

Thank you very much for submitting this PR to us!

This is what will happen next:

  1. Once enabled by a maintainer, our ci tests will run against your PR to ensure quality and consistency.
  2. Next, human contributors from paperless-ngx review your changes. Since this is a non-trivial change, a review from at least two contributors is required.
  3. Please address any issues that come up during the review as soon as you are able to.
  4. If accepted, your pull request will be merged into the dev branch and changes there will be tested further.
  5. Eventually, changes from you and other contributors will be merged into main and a new release will be made.

You'll be hearing from us soon, and thank you again for contributing to our project.

@paperless-ngx-secretary paperless-ngx-secretary bot added backend frontend non-trivial Requires approval by several team members labels Apr 13, 2024
@shamoon shamoon marked this pull request as draft April 13, 2024 13:26
@shamoon
Member

shamoon commented Apr 13, 2024

Thanks for the PR!

Before we get to the implementation questions, I think the first question really needs to be answered. That said, I’m not sure what the definitive answer is (or whether there is a definitive answer).

I think there are two philosophical issues with using a commercial service for translation. The first is that I have come to realize that pngx users are extremely privacy-focused (for example, go read the thread about integrating ChatGPT). The second is that it would be supporting a closed-source commercial service, which is sort of against the ethos of the project. On the other hand, of course, if DeepL is head-and-shoulders above the rest then maybe the question is moot. I have almost no knowledge of the translation ecosystem (FOSS or otherwise).

Of course this is opt-in, which is good, but my gut tells me if there’s a viable FOSS alternative out there we should use it, or at least have it as an option.

@stumpylog
Member

At the very least, I think this should be "pluggable", with initial support for at least one local FOSS service. That makes it easier to expand to new providers if users want them.

@bjesus
Author

bjesus commented Apr 13, 2024

To be honest, adding support for multiple providers isn't difficult at all; in the end it's just a question of changing the request scheme a little. LibreTranslate, or even an LLM (which could be open source), all work over HTTP.

I'll update the PR to include support for more providers and update here!
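
To illustrate the point that providers differ mainly in request shape, here is a minimal sketch that builds (but does not send) a LibreTranslate request using only the standard library. The /translate path, the q/source/target/format fields and the translatedText response key follow LibreTranslate's public API as far as I know; the function names and the base URL are hypothetical and are not part of this PR.

```python
import json
import urllib.request


def build_libretranslate_request(
    base_url: str, text: str, source: str, target: str
) -> urllib.request.Request:
    """Build a POST request for a LibreTranslate-style /translate endpoint.

    Hypothetical helper: nothing is sent over the network here.
    """
    payload = {"q": text, "source": source, "target": target, "format": "text"}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_libretranslate_response(body: bytes) -> str:
    """Extract the translated text from a LibreTranslate-style JSON response."""
    return json.loads(body)["translatedText"]
```

Supporting another HTTP provider would then mostly mean writing a different pair of build/parse functions; the calling code could stay the same.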

@stumpylog
Member

It's also not currently opt-in. It will always run when updating the archive file, which is undesirable. It should be controlled by some setting(s) and only run if the user has configured it.

@shamoon
Member

shamoon commented Apr 13, 2024

Oh yeah, I hadn't looked that closely; I thought it would be optional in the sense of requiring the settings (settings.DEEPL_TOKEN, etc). Agreed, it should absolutely be opt-in. I do have some thoughts about the implementation part when we get there (I'm mostly the frontend guy).

@bjesus
Author

bjesus commented Apr 13, 2024

Yes will definitely make it an opt-in!

@shamoon shamoon changed the title draft: add automatic translation support [Draft] Feature: automatic document translation Apr 14, 2024
@bjesus
Author

bjesus commented Apr 19, 2024

hey, I made two important changes now:

  • dropped the DeepL dependency and switched to https://github.com/browsermt/bergamot-translator instead. Bergamot (https://browser.mt/) is the project powering Firefox Translations. It runs 100% locally, it's open source, and hopefully the fact that Firefox uses it means it will be kept somewhat maintained. It also has some Python bindings.
  • made the feature opt-in: unless you set PAPERLESS_TRANSLATION_MODEL with a model name, no translation will be done. There would still be a database column for translations, but I'm not sure if it's possible to avoid this and how.

Overall I think that since this runs locally, the risk of users objecting to this feature is quite a bit lower now.

There are still two things I'd love to add:

  • Support for using multiple models: now it's only possible to choose one (from language X to language Y). It would be nice to be able to use more.
  • Support for pre-detecting the language, so that we know whether to run the translation function or skip it (either because the document is already in the target language, or because there's no model that translates it)

Both of these features generally require some understanding of what language the document is in. I'm trying to figure out what the best way to do that is.
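
The two wishes above (several models, and skipping translation when it isn't needed) can be sketched as one small decision function. This is only an illustration under stated assumptions: pick_model, the models mapping keyed by (source, target) language codes, and the injected detect callable are all hypothetical names, and the detector is passed in so that any language-detection library could be plugged in.

```python
from typing import Callable, Mapping, Optional, Tuple


def pick_model(
    text: str,
    target_language: str,
    models: Mapping[Tuple[str, str], str],
    detect: Callable[[str], str],
) -> Optional[str]:
    """Return the model name to use, or None if translation should be skipped.

    Hypothetical helper: `models` maps (source, target) codes to model names,
    and `detect` is any callable that returns a language code for a text.
    """
    if not text.strip():
        return None  # nothing to translate
    try:
        source_language = detect(text)
    except Exception:
        return None  # detection failed, so skip rather than guess
    if source_language == target_language:
        return None  # already in the target language
    # None here means no configured model covers this language pair
    return models.get((source_language, target_language))
```

For example, with models = {("nl", "en"): "nl-en-model"} and a detector that returns "nl", the function returns "nl-en-model"; a document detected as "en" with target "en" is skipped.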

@bjesus
Author

bjesus commented Apr 19, 2024

Auto-detecting the language seems to be working pretty well with https://github.com/pemistahl/lingua-py. I'll add that soon, and then we can support multiple languages and not run the translation if it isn't needed. Please let me know what other concerns or ideas you might have!

@hendrik1120 I noticed you voted the feature down - I'd be happy to hear what objections you have to it 🙏

@shamoon
Member

shamoon commented Apr 19, 2024

One note in the meantime, I’d be careful about package bloat for what is an optional feature at the moment. I haven’t looked at either of the two mentioned ones yet, might be fine. Also I don’t think bergamot is properly added atm, eg not in pipfile/lock

Finally, we already have langdetect as a dep

@bjesus
Author

bjesus commented Apr 20, 2024

Great, I'll try using langdetect. Indeed I haven't edited the Pipfile yet - I will add bergamot as an optional dependency.

@hendrik1120

@bjesus sure, my response is almost identical to the one from shamoon.
The whole reason why I am using paperless is to have complete control over my data. Shipping the OCR data of highly personal documents to some random translation service is completely against that. I also thought of some possible GDPR issues, as this would be enabled for all users of the paperless instance, who possibly didn't consent to their data being externally processed. (No expert on that, though.)
I definitely see your use case for it; personally, I have no issue searching for documents in either German or English.

From the recent changes, I see that all of my concerns seem to have been mitigated already.

@stumpylog
Member

I still have some things I think need to be addressed before really reviewing:

  • Although Bergamot is interesting, it still has to download the models at run time. Some users are entirely offline. For Docker users, these downloads would have to be redone whenever the container is recreated. For something I feel is a niche feature, those are not great.
    • I would prefer this uses something with a REST API, as that can be mocked for testing, swapped out, etc
  • This doesn't really match my idea of being "pluggable" or swappable, to allow swapping from LibreTranslate to DeepL to Google to .... with only some configuration changes. Ideally, some class is instantiated, does some setup, does the translation, then does cleanup, with the user easily able to choose which service. Bergamot has https://github.com/mozilla/translation-service, though updates appear to be stalled
  • There's a lack of error handling. Translation shouldn't fail everything if it fails
  • Opt in is good, thanks for that
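
One possible shape for the "pluggable" design and error handling described in the list above, as a minimal sketch rather than the project's actual design. Every name here (TranslationBackend, BACKENDS, translate_or_none, and the two stand-in backends) is hypothetical; a real backend would call a service such as LibreTranslate inside translate().

```python
import logging
from abc import ABC, abstractmethod
from typing import Dict, Optional, Type

logger = logging.getLogger(__name__)


class TranslationBackend(ABC):
    """Hypothetical base class: set up, translate, clean up."""

    def setup(self) -> None:
        """Optional hook, e.g. open a connection or load a model."""

    @abstractmethod
    def translate(self, text: str, target: str) -> str:
        """Return `text` translated into the `target` language."""

    def cleanup(self) -> None:
        """Optional hook, e.g. close connections."""


class UppercaseBackend(TranslationBackend):
    """Stand-in backend for illustration; it does not really translate."""

    def translate(self, text: str, target: str) -> str:
        return text.upper()


class FailingBackend(TranslationBackend):
    """Stand-in backend that simulates an unreachable service."""

    def translate(self, text: str, target: str) -> str:
        raise ConnectionError("translation service unreachable")


# The user picks a backend by name through configuration.
BACKENDS: Dict[str, Type[TranslationBackend]] = {
    "uppercase": UppercaseBackend,
    "failing": FailingBackend,
}


def translate_or_none(
    backend_name: Optional[str], text: str, target: str
) -> Optional[str]:
    """Translate if a backend is configured; return None otherwise or on failure."""
    if not backend_name:
        return None  # opt-in: nothing configured, nothing runs
    backend_cls = BACKENDS.get(backend_name)
    if backend_cls is None:
        logger.warning("Unknown translation backend %r, skipping", backend_name)
        return None
    backend = backend_cls()
    try:
        backend.setup()
        return backend.translate(text, target)
    except Exception:
        # A failed setup or translation is logged and must not fail the caller.
        logger.exception("Translation failed, continuing without it")
        return None
    finally:
        backend.cleanup()
```

With this shape, document consumption would store the translation only when translate_or_none returns a string, and tests could register a fake backend instead of mocking network calls.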

@sevketcaba

I think this would be an excellent feature, and I'd definitely use it if it runs locally. About the workflow: paperless-ngx already has the PAPERLESS_OCR_LANGUAGES and PAPERLESS_OCR_LANGUAGE environment variables at hand. The former can be used as the default source languages, and the latter as the target language. Of course, these would only be the defaults: I think you could introduce PAPERLESS_TRA_LANGUAGES, PAPERLESS_TRA_LANGUAGE and PAPERLESS_TRA_BACKEND environment variables, of which the first two would default to their OCR_ equivalents, and the last would default to None or to a translation backend, depending on what the community wants. I'd also like to see a new tab in the document edit page, named after '$PAPERLESS_TRA_LANGUAGE', to show the translated content. Great idea, I'd really like to see it happen.

TLDR:
New ENV VARS:
PAPERLESS_TRA_LANGUAGES=$PAPERLESS_OCR_LANGUAGES # Defines the source languages to translate from, defaults to OCR equivalent
PAPERLESS_TRA_LANGUAGE=$PAPERLESS_OCR_LANGUAGE # Defines the target language to translate to, defaults to OCR equivalent
PAPERLESS_TRA_BACKEND=None|str # Defines the state of the feature, defaults to community desire

I guess additional variables per backend should be defined as well
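
The fallback behaviour proposed above could look roughly like the following sketch. The PAPERLESS_TRA_* variables are the commenter's proposal and do not exist in paperless-ngx; load_translation_settings and TranslationSettings are hypothetical names, and treating the language list as space-separated is an assumption made for illustration.

```python
from typing import List, Mapping, NamedTuple, Optional


class TranslationSettings(NamedTuple):
    source_languages: List[str]
    target_language: Optional[str]
    backend: Optional[str]


def load_translation_settings(env: Mapping[str, str]) -> TranslationSettings:
    """Read the proposed PAPERLESS_TRA_* variables, falling back to OCR ones.

    Hypothetical helper: pass `os.environ` in real use, or a dict in tests.
    """
    # Assumption: languages are given as a space-separated list of codes.
    sources = env.get(
        "PAPERLESS_TRA_LANGUAGES", env.get("PAPERLESS_OCR_LANGUAGES", "")
    )
    target = env.get("PAPERLESS_TRA_LANGUAGE", env.get("PAPERLESS_OCR_LANGUAGE"))
    # No backend configured means the feature stays off (opt-in).
    backend = env.get("PAPERLESS_TRA_BACKEND") or None
    return TranslationSettings(sources.split(), target or None, backend)
```

Leaving PAPERLESS_TRA_BACKEND unset keeps backend as None, which matches the opt-in behaviour requested earlier in the thread.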

Labels
backend frontend non-trivial Requires approval by several team members
Projects
Status: Todo

5 participants