Docsplit.extract_text generates a String with a null byte #152

cedricpim · 2019-07-09T15:40:27Z

Hello,

First of all, thank you for the gem.

Second, I have a PDF that, when passed through Docsplit.extract_text, produces a file containing a null byte character. Shouldn't this be handled by TextCleaner#clean? Or do you think the issue lies within pdftotext/tesseract?

Unfortunately, the PDF I am using is from a client, so I can't provide it. I also haven't been able to manually create one that reproduces the problem.
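
As a possible stopgap, here is a minimal sketch of stripping null bytes from the text after extraction. It assumes the output has already been read into a Ruby String. `strip_null_bytes` is a hypothetical helper written for illustration, not part of Docsplit, and the sample string stands in for the real extracted text.

```ruby
# Hypothetical helper (not part of Docsplit): removes every NUL
# character from a string and returns a new string.
def strip_null_bytes(text)
  text.gsub("\u0000", "")
end

# Stand-in for text read from an extracted file.
sample = "first line\u0000 continues here"
cleaned = strip_null_bytes(sample)

# Report whether any NUL characters remain after cleaning.
puts cleaned.include?("\u0000")
```

This only hides the symptom. It does not show whether the null byte comes from pdftotext, tesseract, or the cleaning step.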
