Validation should fail on unrecognized file type #109

konklone · 2014-08-04T03:46:31Z

Right now, validation fails if no file_type was detected (the URL has no file extension), but it does not fail if the detected file_type is one we don't recognize.

Since we only have text processors for HTML and PDF files, the file_type should be either auto-detected, or set by a scraper, to html or pdf. If it's not, it should choke and force the scraper to pick one -- and if we come across a report format that isn't HTML or PDF, then it's time to extend the system to process text from that format.
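A minimal sketch of the check described above, not the project's actual validation code. The names `SUPPORTED_FILE_TYPES`, `detect_file_type`, and `validate_file_type` are hypothetical, and it assumes a report is a dict with `url` and an optional `file_type` key.

```python
import os
from urllib.parse import urlparse

# Assumption: only these two types have text processors, per the issue text.
SUPPORTED_FILE_TYPES = {"html", "pdf"}


def detect_file_type(url):
    """Return the lowercased extension of the URL path, or None if it has none."""
    path = urlparse(url).path
    extension = os.path.splitext(path)[1]
    return extension.lstrip(".").lower() or None


def validate_file_type(report):
    """Return an error message, or None when the report's file type is usable."""
    # A file_type set by the scraper wins over auto-detection from the URL.
    file_type = report.get("file_type") or detect_file_type(report.get("url", ""))
    if file_type is None:
        return "file_type could not be detected and was not set by the scraper"
    if file_type not in SUPPORTED_FILE_TYPES:
        return "unrecognized file_type %r; expected one of %s" % (
            file_type, sorted(SUPPORTED_FILE_TYPES))
    return None
```

With this shape, a URL ending in .aspx fails validation until the scraper sets file_type explicitly, which is the "choke and force the scraper to pick one" behavior.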

konklone · 2014-08-04T03:52:19Z

This can build on @divergentdave's work in 1fa8f5d, but that only patches the problem -- for a report whose URL ends in .aspx, the file_type field should be html, and the saved file should be report.html.
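To illustrate the intended outcome, here is a small sketch under stated assumptions: `saved_filename` is a hypothetical helper, the example URL is made up, and the report is assumed to be a dict whose `file_type` the scraper has set explicitly.

```python
def saved_filename(report):
    """Name the saved file after the declared file_type, not the URL extension."""
    return "report.%s" % report["file_type"]


# Hypothetical report: the URL ends in .aspx, but the scraper declares it as HTML.
report = {
    "url": "http://example.gov/reports/audit.aspx",
    "file_type": "html",
}
```

Here `saved_filename(report)` yields report.html, so the HTML text processor is the one that applies regardless of the .aspx extension in the URL.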

audiodude · 2014-08-04T04:53:15Z

👍 I agree this is the more correct way to do it.

konklone mentioned this issue Aug 4, 2014

Add EEOC #107

Merged

konklone added the Data Improvement label Aug 24, 2014
