Harvest PDFs as well #23

jqnatividad · 2015-02-18T23:53:05Z

PDF is the worst kind of open data - 1-star open data.
Still, it would be useful to have the ability to optionally harvest PDFs.

waldoj · 2015-02-20T03:31:48Z

That's not something that we'll be adding to our deployed version, but I can see how that would be useful for folks. It can be done trivially by modifying constants.py to add pdf to the list of file suffixes.

@knowtheory, is there any reason why this list of suffixes couldn't be moved to the config file? Would simply adding pdf cause problems, such as with the code that tries to get the title of the file?

knowtheory · 2015-02-22T06:25:43Z

Nope, listing PDFs should function fine. The reason we'd left them out of the list by default is just that there are so many possible non-data PDFs up on sites (reports, forms, all manner of things) that it'd be a pretty noisy signal.

And as for the list of suffixes... you mean moving them into the settings.py? Could move them in there too yep.

waldoj · 2015-02-22T18:09:26Z

> And as for the list of suffixes... you mean moving them into the settings.py?

Yeah, it seems like that'd be a good way to let people control the kinds of files that they want to look for.
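To make the idea concrete, here is a minimal sketch of what a user-configurable suffix list might look like. Every name in it (`DEFAULT_SUFFIXES`, `build_suffixes`, `has_wanted_suffix`) is hypothetical, and the default list is illustrative only; the project's real list lives in constants.py and may differ.

```python
import posixpath
from urllib.parse import urlparse

# Illustrative defaults only; the project's actual list is in constants.py.
DEFAULT_SUFFIXES = ("csv", "json", "xml", "xls", "xlsx")


def build_suffixes(extra=(), base=DEFAULT_SUFFIXES):
    """Merge user-supplied suffixes (e.g. from settings.py) into the defaults.

    Suffixes are lowercased, stripped of a leading dot, and de-duplicated
    while keeping their original order.
    """
    merged = []
    for suffix in tuple(base) + tuple(extra):
        suffix = suffix.lower().lstrip(".")
        if suffix and suffix not in merged:
            merged.append(suffix)
    return tuple(merged)


def has_wanted_suffix(url, suffixes):
    """Return True if the URL's path ends in one of the wanted suffixes."""
    path = urlparse(url).path  # drops any query string or fragment
    ext = posixpath.splitext(path)[1].lower().lstrip(".")
    return ext in suffixes
```

With something like this, a deployment that wants PDFs could set `build_suffixes(extra=["pdf"])` in its own settings, while the default behavior stays PDF-free.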

masinter · 2015-03-03T00:10:00Z

I'd like to set up some way that PDF metadata (using XMP) could catalog embedded or linked data, with the possibility of using annotations, bookmarks, form-data, and attached files. With an explicit manifest, you'll get less "noisy signal".
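As a rough illustration of the first step such an approach would need, the sketch below locates the raw XMP packet in a PDF's bytes. It assumes the metadata stream is stored uncompressed and unencrypted, which is common but not guaranteed; a real harvester would more likely use a proper PDF library. The function name `find_xmp_packet` is hypothetical, and no manifest schema is implied, since the thread does not define one.

```python
import re
from typing import Optional

# Matches the <x:xmpmeta> ... </x:xmpmeta> element of an XMP packet.
# This only works when the metadata stream is not compressed or encrypted.
_XMP_RE = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def find_xmp_packet(pdf_bytes: bytes) -> Optional[str]:
    """Return the first XMP packet found in the raw PDF bytes, or None."""
    match = _XMP_RE.search(pdf_bytes)
    if match is None:
        return None
    # XMP in PDF is normally UTF-8; replace undecodable bytes rather than fail.
    return match.group(0).decode("utf-8", errors="replace")
```

The returned string is RDF/XML, so it could then be parsed with an XML parser to look for whatever properties a manifest ends up using.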

waldoj added the enhancement label Feb 21, 2015
