This repository has been archived by the owner on Jan 30, 2019. It is now read-only.

Load all files into Overview #9

Open
4 of 5 tasks
jpmckinney opened this issue Nov 23, 2015 · 2 comments

@jpmckinney
Owner

jpmckinney commented Nov 23, 2015

For ca_bc and ca_nl.

Explore OCR options

  • Determine the price of ABBYY FineReader
    • $6,400 for the development kit, and then an additional fee for the runtime license, which varies at least according to whether you’re doing full-page or zonal OCR

Google has put time into Tesseract. There are also OCRopus, Cuneiform, Ocrad and GOCR.

Create the CSV

Check whether the CSV is under 2GB and whether each document is under 640,000 characters. Truncating each document to 640,000 characters should keep the CSV under 2GB.
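A minimal sketch of the truncate-and-check step, not the code actually used here. The `id` and `text` column names, the helper names, and reading 2GB as 2 GiB are all assumptions. The total size also depends on the number of documents and on the encoding, so the size check is still worth running after truncation.

```python
import csv
import io

MAX_CHARS = 640_000          # per-document limit from the note above
MAX_CSV_BYTES = 2 * 1024**3  # 2GB, read here as 2 GiB (an assumption)


def write_csv(documents, out, limit=MAX_CHARS):
    """Write (doc_id, text) pairs as CSV rows, truncating each text to `limit` characters.

    Returns the number of documents that had to be truncated.
    """
    writer = csv.writer(out)
    writer.writerow(["id", "text"])  # assumed column names
    truncated = 0
    for doc_id, text in documents:
        if len(text) > limit:
            truncated += 1
        writer.writerow([doc_id, text[:limit]])
    return truncated


def encoded_size(csv_text, encoding="utf-8"):
    """Return the size in bytes of the CSV text once encoded."""
    return len(csv_text.encode(encoding))


if __name__ == "__main__":
    # Hypothetical in-memory documents; real input would be the OCR'd text.
    docs = [("doc-1", "a" * 700_000), ("doc-2", "short text")]
    buffer = io.StringIO()
    count = write_csv(docs, buffer)
    size = encoded_size(buffer.getvalue())
    print(f"{count} document(s) truncated, {size} bytes, under limit: {size < MAX_CSV_BYTES}")
```

For a CSV of real size, write to a file opened with `newline=""` and check it with `os.path.getsize` instead of holding the text in memory.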

Upload to Overview

  • Upload the CSV to Overview [1] [2]
    • ca_bc: Uploaded 2011 to test
@jpmckinney jpmckinney changed the title from "Load all BC files into Overview" to "Load all files into Overview" Nov 26, 2015
@jpmckinney jpmckinney added icebox and removed icebox labels Dec 18, 2015
@jpmckinney jpmckinney modified the milestones: 3, 2 Dec 21, 2015
@jpmckinney jpmckinney removed the icebox label Dec 28, 2015
@jpmckinney
Owner Author

40GB seems like a lot to upload from a local computer. Upload to S3 first (#6).

@jpmckinney
Owner Author

Using OCR'd text generated by @StormTide.
