This repository has been archived by the owner on May 5, 2020. It is now read-only.

Harvest PDFs as well #23

Open
jqnatividad opened this issue Feb 18, 2015 · 4 comments

Comments

@jqnatividad

PDF is the worst kind of open data - 1-star open data.
Still, it would be useful to have the option to harvest PDFs.

@waldoj
Member

waldoj commented Feb 20, 2015

That's not something that we'll be adding to our deployed version, but I can see how that would be useful for folks. It can be done trivially by modifying constants.py to add pdf to the list of file suffixes.

@knowtheory, is there any reason why this list of suffixes couldn't be moved to the config file? Would just up and adding pdf cause problems, such as with the code that tries to get the title of the file?

@knowtheory
Contributor

Nope, listing PDFs should work fine. The reason we'd left them out of the list by default is just that there are so many possible non-data PDFs up on sites (reports, forms, all manner of things) that it'd be a pretty noisy signal.

And as for the list of suffixes... you mean moving them into the settings.py? We could move them in there too, yep.

@waldoj
Member

waldoj commented Feb 22, 2015

> And as for the list of suffixes... you mean moving them into the settings.py?

Yeah, it seems like that'd be a good way to let people control the kinds of files that they want to look for.
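
For anyone wanting to try this locally, here is a minimal sketch of what a settings-driven suffix list might look like. It is not the project's actual code: the setting name `DATA_SUFFIXES`, the default list, and both helper functions are hypothetical, and the settings are modeled as a plain dict rather than the real settings.py module.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical default; the real list in constants.py is not shown in this thread.
DEFAULT_DATA_SUFFIXES = ("csv", "json", "xml", "xls", "xlsx")


def load_suffixes(settings):
    """Return the set of suffixes to harvest.

    `settings` is a dict standing in for settings.py (an assumption).
    Suffixes are normalized to lowercase without a leading dot.
    """
    raw = settings.get("DATA_SUFFIXES", DEFAULT_DATA_SUFFIXES)
    return frozenset(s.lower().lstrip(".") for s in raw)


def is_harvestable(url, suffixes):
    """True if the URL's path ends in one of the configured suffixes.

    Only the path is inspected, so query strings and fragments are ignored.
    """
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1]  # e.g. ".PDF", or "" if there is none
    return ext.lower().lstrip(".") in suffixes


# Opting in to PDFs is then a one-line settings change.
suffixes = load_suffixes({"DATA_SUFFIXES": ["csv", "json", "pdf"]})
print(is_harvestable("http://example.gov/reports/budget.pdf", suffixes))
```

Keeping pdf out of the default and letting a deployment opt in preserves the current low-noise behavior while covering the use case in this issue.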

@masinter

masinter commented Mar 3, 2015

I'd like to set up some way that PDF metadata (using XMP) could catalog embedded or linked data, with the possibility of using annotations, bookmarks, form-data, and attached files. With an explicit manifest, you'll get less "noisy signal".
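
As a rough illustration of how a harvester might look at XMP at all, the sketch below scans raw PDF bytes for XMP packets (the `<?xpacket begin=...?>` / `<?xpacket end=...?>` wrapper) and pulls out simple Dublin Core text fields. It is only a sketch under assumptions: it finds packets only when the metadata is stored uncompressed and unencrypted, it skips structured values such as a `dc:title` stored as a list of alternatives, and the function names are made up for this example. No manifest format is implied; none has been defined in this thread.

```python
import re
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# XMP packets are delimited by <?xpacket begin=...?> and <?xpacket end=...?>.
_XPACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)


def find_xmp_packets(pdf_bytes):
    """Return the XML body of every XMP packet found by a plain byte scan."""
    return [m.group(1).strip() for m in _XPACKET.finditer(pdf_bytes)]


def extract_dc_fields(pdf_bytes):
    """Return (name, text) pairs for Dublin Core elements with direct text."""
    fields = []
    for packet in find_xmp_packets(pdf_bytes):
        try:
            root = ET.fromstring(packet)
        except ET.ParseError:
            continue  # skip packets that are not well-formed XML
        for el in root.iter():
            if el.tag.startswith("{" + DC_NS + "}") and el.text and el.text.strip():
                fields.append((el.tag.split("}", 1)[1], el.text.strip()))
    return fields
```

A production harvester would more likely use a real PDF library to locate the document's metadata stream; the byte scan is just the simplest way to show the idea.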
