Support for Office-Formats with Apache-Tika #600

Tooa · 2020-01-09T21:49:22Z

Motivation and Description

This pull request adds support for office formats (such as odt, ods, and docx) to paperless. Apache Tika is used to process these files and extract their content:

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

I started implementing this feature because I have office files stored on an NFS share that are not part of my paperless instance. My goal is to have my personal documents in one place, accessible via full-text search. This place should be paperless!

This is a work-in-progress PR, as I would like to receive initial feedback from the community before I start polishing it for merging. In particular, I would like to know:

What do you think about the current way of including Apache-Tika into paperless?
Do you have any high-level remarks regarding this first working prototype?
Is this feature something people might find useful?
Is there a chance of getting this feature merged?
Any other comments?

This is a first working prototype - the feature is not complete and still under development (see open tasks).

In the short term, I would like to handle ms-office and open-office formats with this parser. In the long term, however, Apache-Tika might replace even the PDF/OCR parser since Tika also comes with support for tesseract. I haven't decided on the latter yet and the change will definitely not be part of this PR.
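For readers unfamiliar with Tika, here is a minimal sketch of how text extraction against a running Tika server could look. It is not necessarily how this PR talks to Tika. It assumes a Tika server listening on its default port 9998; the names `TIKA_URL`, `build_tika_request`, and `extract_text` are made up for illustration.

```python
import urllib.request

# Assumption: a Tika server is reachable here (9998 is Tika's default port).
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika to return the document as plain text."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send the file at `path` to the Tika server and return the extracted text."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running, `extract_text("/consume/test.odt")` would return the document's text, which a parser could then hand over to paperless for indexing.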

Open Questions

I had to make two changes to get the original master working. I wonder why nobody else has problems with the Docker setup of the current master.

1. `CORS_ORIGIN_WHITELIST` issue

I had to add the protocol to the value in CORS_ORIGIN_WHITELIST:

CORS_ORIGIN_WHITELIST = tuple(os.getenv("PAPERLESS_CORS_ALLOWED_HOSTS", "http://localhost:8080").split(","))

Otherwise, I get the following error:

SystemCheckError: System check identified some issues:
consumer_1     |
consumer_1     | ERRORS:
consumer_1     | ?: (corsheaders.E013) Origin 'localhost:8080' in CORS_ORIGIN_WHITELIST is missing  scheme or netloc
consumer_1     | 	HINT: Add a scheme (e.g. https://) or netloc (e.g. example.com).

Is this an issue for someone else too?
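An alternative to changing the default would be to tolerate values without a scheme. The following is only a sketch of that idea, not part of this PR; `normalize_origins` is a hypothetical helper, and defaulting to `http` is an assumption that may not suit every deployment.

```python
import os


def normalize_origins(raw: str, default_scheme: str = "http") -> tuple:
    """Split a comma-separated origin list and prefix entries that lack a scheme."""
    origins = []
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue  # skip empty items, e.g. from a trailing comma
        if "://" not in entry:
            # Assumption: entries without a scheme are meant as plain HTTP.
            entry = f"{default_scheme}://{entry}"
        origins.append(entry)
    return tuple(origins)


CORS_ORIGIN_WHITELIST = normalize_origins(
    os.getenv("PAPERLESS_CORS_ALLOWED_HOSTS", "localhost:8080")
)
```

This way both `localhost:8080` and `http://localhost:8080` in an existing configuration would pass the corsheaders.E013 check.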

2. Missing Fonts Issue

For some reason, I had to add msttcorefonts-installer and fontconfig to the Dockerfile. Otherwise, I get the following error:

Parsers available: TikaDocumentParser
consumer_1     | Consuming /consume/test.odt
consumer_1     | convert: unable to read font `helvetica' @ error/annotate.c/RenderFreetype/1384.
consumer_1     | convert: no images defined `/tmp/paperless/paperless-5dfrf94n/tx.png' @ error/convert.c/ConvertImageCommand/3273.
consumer_1     | PARSE FAILURE for /consume/test.odt: Convert failed at ('convert', '-background none', '-fill', 'black', '-pointsize', '12', '-border 4 -bordercolor none', '-size ', '492x639', ' caption:"', '\n\n\n\n\n\n\n\nThis is a document\n', '" ', '/tmp/paperless/paperless-5dfrf94n/tx.png')
consumer_1     | Parsers available: TikaDocumentParser
consumer_1     | Consuming /consume/test.odt
consumer_1     | convert: unable to read font `helvetica' @ error/annotate.c/RenderFreetype/1384.
consumer_1     | convert: no images defined `/tmp/paperless/paperless-ys5cdlyn/tx.png' @ error/convert.c/ConvertImageCommand/3273.
consumer_1     | PARSE FAILURE for /consume/test.odt: Convert failed at ('convert', '-background none', '-fill', 'black', '-pointsize', '12', '-border 4 -bordercolor none', '-size ', '492x639', ' caption:"', '\n\n\n\n\n\n\n\nThis is a document\n', '" ', '/tmp/paperless/paperless-ys5cdlyn/tx.png')

Does anybody else get this error on master? If not, I'm fine with dropping the change, as it seems to be a problem specific to my local machine setup.
Otherwise, I can separate the change into its own commit and combine it with the other RUN command.

Open Tasks

Add ms-office file formats to TikaDocumentParser
Add Tika to the non-Docker installation (see requirements and setup)
- I probably have to make the tika URL configurable
Add Unit-Tests
Add Exception Handling
Fix issue: the UI text field contains a lot of newlines after odt import
Add documentation
...
Comply with PEP 8 and additional style guides (see guidelines)
Rebase and Squash
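Two of the tasks above (the configurable Tika URL and the excess newlines after odt import) could be approached roughly as follows. This is a sketch under assumptions, not the code in this PR: the environment variable name `PAPERLESS_TIKA_ENDPOINT` and both helper names are hypothetical.

```python
import os
import re


def tika_endpoint(environ=os.environ) -> str:
    """Read the Tika base URL from the environment, falling back to a local server."""
    # Assumption: PAPERLESS_TIKA_ENDPOINT is a made-up variable name.
    return environ.get("PAPERLESS_TIKA_ENDPOINT", "http://localhost:9998").rstrip("/")


def collapse_blank_lines(text: str) -> str:
    """Trim surrounding whitespace and reduce runs of blank lines to a single one."""
    return re.sub(r"\n\s*\n+", "\n\n", text).strip()
```

Applied to the text from the log above, `collapse_blank_lines("\n\n\n\n\n\n\n\nThis is a document\n")` yields `"This is a document"`.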

bauerj · 2020-01-14T14:43:56Z

Hey,

I wonder if it would make sense to convert those files to PDF in the consumer.

Having all documents stored as a PDF file has some advantages IMO:

Documents are more portable, more systems can display a PDF document than a Word document
All fonts etc. are embedded, so your document looks the same even if you use a different system to view it
Makes it easier to switch to a different system in X years

What do you think?

Tooa · 2020-01-15T08:53:57Z

I wonder if it would make sense to convert those files to PDF in the consumer.

I see your point. Let me describe my use-case in more detail:

Say I have an office document for cancelling an insurance policy. I convert it to PDF, archive it in paperless, and send the PDF via e-mail to the insurance company. The next time I have a similar matter, I want to grab the original office document from paperless, change the details, and send it as a PDF to the next insurance company. So the office documents effectively function as templates. At the moment, I store these templates on a separate NFS share.

The use-case is probably quite specific to me, and the NFS share solution might be sufficient. However, I don't like having different ways of accessing my documents. Maybe it's not worth the effort. I should have asked beforehand.

All fonts etc. are embedded, so your document looks the same even if you use a different system to view it

Ah! That explains the problems with office documents and the font in the container. It was not necessary to provide them until now.

Whisprin · 2020-01-22T18:30:10Z

Just commenting on "1. CORS_ORIGIN_WHITELIST issue"
I set PAPERLESS_CORS_ALLOWED_HOSTS="http://localhost:8000" directly in paperless.conf and it works.

MasterofJOKers · 2020-05-14T20:09:15Z

While I see your use-case, I don't see tika being a part of paperless. It's one thing to run a couple of binaries to parse a file, but it's imho another to run a Java-based web server next to a small app like paperless for this purpose. I'd rather see this implemented in another project, as a django app people can include if they want to - maybe even under the hood of https://github.com/the-paperless-project.

That being said, some of your already proposed changes - like supporting more document types in the model - seem to be necessary. Others, like the tika dependency and the additional django module for paperless_tika, should imho be moved. I'd be happy to have that project referenced in the README and docs.

That's my 2 cents.

jonaswinkler · 2020-12-10T14:13:53Z

@Tooa Hi there. I've been working on a fork of paperless for a while now. If you're still around and want to make this happen, maybe we can work something out. Contact me if you're interested in contributing and we can discuss the details.

Tooa force-pushed the feat_add_tika_parser branch 3 times, most recently from 06c73d4 to 8544d8f · January 11, 2020 12:44

Tooa added 3 commits January 11, 2020 13:46

fix highlighting skipped warning

3ffd149

feat(): add tika parser

e1f2fa6

feat(): add ms-office formats

5f677e1

Tooa force-pushed the feat_add_tika_parser branch from 8544d8f to 5f677e1 · January 11, 2020 12:47

Tooa mentioned this pull request Feb 23, 2020

Upgrade Docker image to Alpine 3.11 #612

Merged
