Indexing: Index PDF files #70

m-i-l · 2022-09-10T09:51:50Z

There are a couple of sites that people have mentioned which consist primarily of PDF files:

https://icann-hamster.nl/
http://radar.oreilly.com/r2/release1-0 (main index appears to be gone now, but individual PDF files still seem to be available e.g. via http://cdn.oreillystatic.com/radar/r1/01-83.pdf)

Unfortunately searchmysite.net doesn't currently index PDF files. In fact, parse_item in search_my_site_script.py does an isinstance(response, TextResponse) check so that only text content is indexed and all binary files are excluded (i.e. images etc., not just PDF files). This check would have to be updated, and of course there would need to be a way to extract text from PDF files for indexing.
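As a rough idea of what the updated check might look like, here is a minimal sketch that decides how to treat a fetched resource from its Content-Type header and first bytes, rather than from the response class alone. The function name classify_content and the "text" / "pdf" / "skip" labels are made up for illustration and are not in search_my_site_script.py; the only fact relied on is that PDF files start with the bytes %PDF-.

```python
from typing import Optional

# PDF files begin with this signature, so it can back up a missing or wrong header.
PDF_MAGIC = b"%PDF-"


def classify_content(content_type: Optional[str], body: bytes) -> str:
    """Return 'text', 'pdf' or 'skip' for a fetched resource (hypothetical helper)."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = (content_type or "").split(";", 1)[0].strip().lower()
    if mime == "application/pdf" or body.startswith(PDF_MAGIC):
        return "pdf"
    if mime.startswith("text/") or mime == "application/xhtml+xml":
        return "text"
    # Images and any other binary content are still excluded.
    return "skip"
```

In parse_item the existing TextResponse branch could stay as it is for "text", with a new branch for "pdf" that hands the body to a text extractor, and everything else ignored as before.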

Might want to have a look at the Python bindings in https://cwiki.apache.org/confluence/display/TIKA/API+Bindings+for+Tika or something like that.
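For reference, a minimal sketch of what extraction via the Python Tika bindings could look like, assuming the third-party tika package (pip install tika) and its parser.from_buffer call, which as far as I know returns a dict with "content" and "metadata" keys and needs a Java runtime to run the Tika server behind the scenes. The wrapper name extract_pdf_text and the injectable parse argument are hypothetical, added only so the function can be exercised without Tika installed.

```python
import re
from typing import Callable, Optional


def extract_pdf_text(
    pdf_bytes: bytes,
    parse: Optional[Callable[[bytes], dict]] = None,
) -> str:
    """Return whitespace-normalised text from a PDF body (hypothetical wrapper)."""
    if parse is None:
        # Imported lazily so this module loads even where tika is not installed.
        from tika import parser  # assumption: third-party tika-python package
        parse = parser.from_buffer
    parsed = parse(pdf_bytes)
    # "content" can be missing or None, e.g. for scanned PDFs with no text layer.
    content = (parsed or {}).get("content") or ""
    # Collapse the runs of newlines and spaces that extracted PDF text tends to have.
    return re.sub(r"\s+", " ", content).strip()
```

The returned string could then be indexed in the same field as the page text from HTML responses, although whether that is the right field for PDF content is a separate decision.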

m-i-l added the enhancement New feature or request label Sep 10, 2022

This was referenced May 7, 2024

Don't return CSS files as results #147

Closed

Indexing: Restrict indexing to a known list of content types #149

Open
