There are a couple of sites that people have mentioned which consist primarily of PDF files:
Unfortunately, searchmysite.net doesn't currently index PDF files. In fact, parse_item in search_my_site_script.py does an isinstance(response, TextResponse) check so that it only indexes text content and excludes all binary files (e.g. images, not just PDF files). This check would have to be updated, and of course there would need to be a way to extract text from PDF files for indexing.
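As a rough illustration of what relaxing that check might look like, here is a standalone sketch that decides whether a fetched document is indexable. It is not the code in search_my_site_script.py: it branches on the Content-Type value rather than on isinstance(response, TextResponse), and the names extract_indexable_text and pdf_extractor are hypothetical, invented for this example.

```python
from typing import Callable, Optional

# Hypothetical type: any callable that turns raw PDF bytes into plain text.
PdfExtractor = Callable[[bytes], str]


def extract_indexable_text(
    content_type: str,
    body: bytes,
    pdf_extractor: Optional[PdfExtractor] = None,
) -> Optional[str]:
    """Return text to index, or None if the content should be skipped."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()

    # Text content is indexed as before.
    if mime.startswith("text/") or mime == "application/xhtml+xml":
        return body.decode("utf-8", errors="replace")

    # PDFs are only indexed when an extractor has been supplied.
    if mime == "application/pdf" and pdf_extractor is not None:
        return pdf_extractor(body)

    # Everything else (images etc.) is still excluded.
    return None
```

Keeping the extractor as a parameter means the choice of PDF library stays open, and other binary content is still skipped exactly as it is today.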
Might want to have a look at the Python bindings in https://cwiki.apache.org/confluence/display/TIKA/API+Bindings+for+Tika or something like that.
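For reference, a minimal sketch of what text extraction via the tika Python package (one of the bindings on that page) could look like. This assumes the package is installed and that a Java runtime is available for the Tika server it starts; the helper names pdf_bytes_to_text and clean_extracted_text are made up for this example, and nothing here has been tested against searchmysite.net itself.

```python
from typing import Optional


def clean_extracted_text(content: Optional[str]) -> str:
    """Collapse whitespace; Tika may return None when no text is found."""
    return " ".join((content or "").split())


def pdf_bytes_to_text(body: bytes) -> str:
    """Extract plain text from PDF bytes using the tika package."""
    # Imported lazily so the module loads even without tika installed.
    from tika import parser

    # from_buffer sends the bytes to a Tika server and returns a dict
    # that includes "metadata" and "content" keys.
    parsed = parser.from_buffer(body)
    return clean_extracted_text(parsed.get("content"))
```

Something like pdf_bytes_to_text(response.body) could then supply the text for a PDF response, though start-up time and memory use of the Tika server would need checking first.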