This repository has been archived by the owner on Feb 19, 2021. It is now read-only.
Support for Office-Formats with Apache-Tika #600
Draft
Motivation and Description
This pull request adds support for office formats (such as odt, ods, docx, etc.) in paperless. In order to process these files and extract their content, Apache-Tika is applied.
I started implementing this feature because I have office files stored on NFS that are not part of my paperless instance. My goal is to have my personal documents in one place, accessible via full-text search. This place should be paperless!
This is a work-in-progress PR, as I would like to receive initial feedback from the community before I start polishing it towards merging. In particular, I would like to know:
paperless?
This is a first working prototype - the feature is not complete and still under development (see open tasks).
In the short term, I would like to handle ms-office and open-office formats with this parser. In the long term, however, Apache-Tika might even replace the PDF/OCR parser, since Tika also comes with support for tesseract. I haven't decided on the latter yet, and the change will definitely not be part of this PR.
Open Questions
I had to make two changes in order to get the original master working. I wonder why nobody else has problems with the docker setup of the current master.
1. CORS_ORIGIN_WHITELIST issue
CORS_ORIGIN_WHITELIST:
Otherwise, I get the following error:
2. Missing Fonts Issue
msttcorefonts-installer and fontconfig to the Dockerfile. Otherwise, I get the following error:
master? If not, I'm fine to drop the change as it seems that it is a particular problem with my local machine setup.
RUN command.
Open Tasks
TikaDocumentParser
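For readers unfamiliar with Tika, the extraction step described under Motivation and Description could look roughly like the sketch below. It is not the code from this PR: it assumes a standalone Tika server reachable over HTTP (9998 is the Tika server's default port), and the names TIKA_URL, build_tika_request and extract_text are hypothetical. The actual TikaDocumentParser may talk to Tika in a different way.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that asks the Tika server for plain text."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send the file at `path` to Tika and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), url)
    # Requires a reachable Tika server; raises URLError otherwise.
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace").strip()
```

With a server running, extract_text("letter.odt") would return the document body as a string, which a parser could then hand to the full-text index. Keeping the request construction separate from the network call makes the former easy to check without a server.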