Load all files into Overview #9

jpmckinney · 2015-11-23T06:58:21Z

For ca_bc and ca_nl.

Explore OCR options

Determine the price of ABBYY FineReader
- $6,400 for the development kit, and then an additional fee for the runtime license, which varies at least according to whether you’re doing full-page or zonal OCR

Google has put time into Tesseract. Other open-source options are OCRopus, Cuneiform, Ocrad and GOCR.

Create the CSV

Get pdfium to install using Homebrew
Use the pdfshaver branch of docsplit
- Actually, the branch still uses GraphicsMagick to generate the images for Tesseract: see documentcloud/docsplit#123 (comment), "PDFtk dependency issues with CentOS-7/RHEL-7 | Build Fails | Dependencies libgc Unavailable"
Generate a CSV with docsplit (using docs2csv as reference)

Check whether the CSV is under 2GB and whether each document is under 640,000 characters. Truncating each document to 640,000 characters should keep the CSV under 2GB.
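
The check above can be sketched in Python. This is a minimal sketch, not the script actually used: the `id` and `text` column names, the helper names `build_csv` and `csv_size_ok`, and reading "2GB" as 2 GiB of UTF-8 bytes are all assumptions. Note that the per-document limit counts characters while the file limit counts bytes, so non-ASCII text takes more room than its character count suggests.

```python
import csv
import io

MAX_CHARS = 640_000        # per-document character limit from the note above
MAX_BYTES = 2 * 1024 ** 3  # "2GB" read as 2 GiB (assumption)


def build_csv(documents, out, max_chars=MAX_CHARS):
    """Write (doc_id, text) pairs to `out` as CSV, truncating long texts.

    The "id" and "text" headers are assumed column names.
    Returns the number of documents that had to be truncated.
    """
    writer = csv.writer(out)
    writer.writerow(["id", "text"])
    truncated = 0
    for doc_id, text in documents:
        if len(text) > max_chars:
            text = text[:max_chars]
            truncated += 1
        writer.writerow([doc_id, text])
    return truncated


def csv_size_ok(csv_text, max_bytes=MAX_BYTES):
    """Return True if the CSV, encoded as UTF-8, is under `max_bytes`."""
    return len(csv_text.encode("utf-8")) < max_bytes


if __name__ == "__main__":
    # Hypothetical sample documents, for illustration only.
    docs = [("a", "short text"), ("b", "x" * 700_000)]
    buf = io.StringIO()
    count = build_csv(docs, buf)
    print("truncated:", count, "size ok:", csv_size_ok(buf.getvalue()))
```

For a real run at this scale, the CSV would be written to a file on disk and its size checked with `os.path.getsize` rather than held in memory.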

Upload to Overview

Upload the CSV to Overview [1] [2]
- ca_bc: Uploaded 2011 to test

jpmckinney · 2015-12-28T19:48:55Z

40GB seems like a lot to upload from a local computer. Upload to S3 first (#6).

jpmckinney · 2016-06-06T18:04:33Z

Using OCR'd text generated by @StormTide.

jpmckinney added code data exploration icebox and removed code labels Nov 23, 2015

jpmckinney changed the title ~~Load all BC files into Overview~~ Load all files into Overview Nov 26, 2015

jpmckinney added icebox and removed icebox labels Dec 18, 2015

jpmckinney modified the milestones: 3, 2 Dec 21, 2015

jpmckinney removed the icebox label Dec 28, 2015

jpmckinney mentioned this issue Dec 29, 2015

Get access to DocumentCloud #16

Closed

jpmckinney removed the ca_bc label Jan 9, 2016
