Upload to the Internet Archive #63

Open · 3 of 5 tasks
konklone opened this issue Jul 3, 2014 · 10 comments
Comments

konklone (Member) commented Jul 3, 2014

Using their S3-compatible API:
http://archive.org/help/abouts3.txt

I have an Archive account, under [email protected], and I generated my S3(-like) credentials. I'm not actually sure whether the code to do this upload belongs in this repository -- it could just as easily be a script in a public repo on my own account that runs as a cron on the same box -- but I'm including it here to solicit discussion, and to publicize that I want to get this stuff into the Archive.
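
For concreteness, an upload with those credentials is just an HTTP PUT with a "LOW" authorization header, per abouts3.txt. A minimal sketch in Python (the bucket name, file path, and metadata value here are placeholders, not settled choices):

# Minimal sketch of a raw PUT against the Archive's S3-compatible endpoint,
# following http://archive.org/help/abouts3.txt. Keys come from
# https://archive.org/account/s3.php; bucket and key names are placeholders.
import requests

ACCESS_KEY = "..."
SECRET_KEY = "..."

headers = {
    # IA's S3 variant takes "LOW <access>:<secret>" instead of AWS signatures
    "authorization": "LOW %s:%s" % (ACCESS_KEY, SECRET_KEY),
    # create the bucket (item) on first PUT if it doesn't already exist
    "x-amz-auto-make-bucket": "1",
    # any x-archive-meta-* header becomes item metadata in _meta.xml
    "x-archive-meta-title": "unitedstates-data",
}

with open("report.pdf", "rb") as f:
    response = requests.put(
        "http://s3.us.archive.org/unitedstates-data/report.pdf",
        data=f,
        headers=headers,
    )
response.raise_for_status()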

I'll also be contacting the Archive directly to see if they have any above-and-beyond interest in this collection.

/cc @waldoj @spulec

Todos:

  • Make a bucket, unitedstates-data
  • Add a _meta.xml to the bucket (done automatically, actually)
  • Write to IA about their S3 support (mail [email protected], with s3help in the subject)
  • Write a backup script that can reliably back up all the reports to the IA (see the sketch after this list).
  • Upload the full archive, after re-running all scrapers with --archive.
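
To make that backup-script todo concrete, here's a rough sketch of the shape it might take, reusing the PUT approach above. The data/<agency>/<year>/<report-id>/ layout matches the scraper output, but the upload_file helper and retry policy are just assumptions:

# Rough sketch of the backup-script todo: mirror every scraped file into the
# bucket under the same relative path, retrying each PUT a few times.
# upload_file() is assumed to be something like the requests.put() call above.
import os
import time

DATA_DIR = "data"
REMOTE_PREFIX = "unitedstates-data/inspectors-general/data"

def backup_all(upload_file, retries=3):
    for root, _dirs, files in os.walk(DATA_DIR):
        for name in files:
            local_path = os.path.join(root, name)
            rel_path = os.path.relpath(local_path, DATA_DIR)
            remote_key = "%s/%s" % (REMOTE_PREFIX, rel_path)
            for attempt in range(retries):
                try:
                    upload_file(local_path, remote_key)
                    break
                except Exception:
                    if attempt == retries - 1:
                        raise
                    time.sleep(2 ** attempt)  # simple backoff before retrying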

konklone (Member, Author) commented Sep 1, 2014

I've gotten a unitedstates-data bucket going, which made a predictable URL, and auto-created a bunch of metadata files:

https://archive.org/download/unitedstates-data/

To test it out, I uploaded the big VA report from earlier this year.

$ s3cmd put data/va/2014/14-02603-267/* s3://unitedstates-data/inspectors-general/data/va/2014/14-02603-267/

WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
14-02603-267/report.json -> s3://unitedstates-data/inspectors-general/data/va/2014/14-02603-267/report.json  [1 of 3]
 6800 of 6800   100% in    2s     2.69 kB/s  done
14-02603-267/report.pdf -> s3://unitedstates-data/inspectors-general/data/va/2014/14-02603-267/report.pdf  [2 of 3]
 1574313 of 1574313   100% in    3s   470.92 kB/s  done
14-02603-267/report.txt -> s3://unitedstates-data/inspectors-general/data/va/2014/14-02603-267/report.txt  [3 of 3]
 474149 of 474149   100% in    2s   203.02 kB/s  done

Which made this:

https://archive.org/download/unitedstates-data/inspectors-general/data/va/2014/14-02603-267/

Interestingly, 10 minutes after upload, the Internet Archive auto-produced a report_jp2.zip (~70MB) containing JPEG 2000 images of each page of the original PDF.

I've sent an email to the Archive asking for guidance or documentation on how we can best structure the collection. In the meantime, I may just upload everything now, once, and worry about creating a sophisticated script for managing cost-effective sync and re-uploading of metadata later.

konklone (Member, Author) commented Sep 2, 2014

Also, the Internet Archive has absolutely insane public logging for all this.

konklone (Member, Author) commented Sep 2, 2014

Some more pages relevant to our collection:

  • manager
  • editor
  • okay

The Internet Archive is extremely cool.

waldoj (Member) commented Sep 2, 2014

I'm glad that storing this stuff on the Archive is going so well. It's really the perfect home for this stuff.

Well, I mean, a .gov site is the perfect home for this. But, short of that, the Archive is the best home for this.

konklone (Member, Author) commented Sep 4, 2014

The plan now, after talking with IA, is to store each report as an "item" in the collection, rather than putting them in one bucket. An "item" (bucket) is supposed to have the same metadata for everything. Currently, the unitedstates-data bucket is considered one "item":

https://archive.org/details/unitedstates-data

The Archive is willing to make a Collection for the items, but needs at least 50 "items" uploaded.

So each item would be its own bucket, with an ID something like unitedstates-inspector-general-EPA-2004-[report-id]? And then I would ask the Archive to categorize those items into the collection. The IDs seem unwieldy, but I think that's the only way, at least to start.

FWIW, the unresolved issue from @divergentdave's work on finding duplicate IDs across years wouldn't come into play here if I put the year in the ID. The year is a more brittle piece of data than I'd prefer to put in the ID of the item, though.
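
A hypothetical helper along those lines (the prefix and fields are still up for discussion, not a settled scheme):

# Hypothetical identifier scheme matching the pattern above; nothing here
# is final, and report_id formats vary by agency.
def item_identifier(agency, year, report_id):
    # e.g. item_identifier("EPA", 2004, "2004-P-00021")
    #   -> "unitedstates-inspector-general-EPA-2004-2004-P-00021"
    return "unitedstates-inspector-general-%s-%s-%s" % (agency, year, report_id)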

divergentdave (Contributor) commented

It seems like Internet Archive item identifiers can't easily be changed once uploaded (or deleted, naturally). If we use our report_id in the item identifier, we'll need to make sure we've fixed all our outstanding QA issues before we start uploading (particularly same-year duplicate IDs and bad 404 pages, and maybe duplicate files).

It might also be a good idea to manually review new reports going forward before sending them to IA, in case one of the scrapers starts emitting spurious reports.

konklone (Member, Author) commented Oct 3, 2014

I agree that we should get our ID QA in order before submitting everything... and you've basically done that, which is outstanding. I do need to regenerate my archive.

But, I think the downside of uploading duplicate or wrongly ID'd content to the Archive is low. It'll happen, we'll make a good faith effort to keep it in order (and automating the running of the qa script will make that possible), but it's ultimately just not a big deal to have dupe reports under different IDs. Everything else is overwritable, I think.

konklone (Member, Author) commented

For anyone watching this thread, I'm doing some work that will build into a general-purpose Internet Archive uploader, at https://github.com/konklone/bit-voyage.

konklone (Member, Author) commented

So Harvard's Perma.cc automatically uploads to the Internet Archive, using ia-wrapper.

def upload_to_internet_archive(self, link_guid):
    # setup
    asset = Asset.objects.get(link_id=link_guid)
    link = asset.link
    identifier = settings.INTERNET_ARCHIVE_IDENTIFIER_PREFIX+link_guid
    warc_path = os.path.join(asset.base_storage_path, asset.warc_capture)

    # create IA item for this capture
    item = internetarchive.get_item(identifier)
    metadata = {
        'collection':settings.INTERNET_ARCHIVE_COLLECTION,
        'mediatype':'web',
        'date':link.creation_timestamp,
        'title':'Perma Capture %s' % link_guid,
        'creator':'Perma.cc',

        # custom metadata
        'submitted_url':link.submitted_url,
        'perma_url':"http://%s/%s" % (settings.HOST, link_guid)
    }

    # upload
    with default_storage.open(warc_path, 'rb') as warc_file:
        success = item.upload(warc_file,
                              metadata=metadata,
                              access_key=settings.INTERNET_ARCHIVE_ACCESS_KEY,
                              secret_key=settings.INTERNET_ARCHIVE_SECRET_KEY,
                              verbose=True,
                              debug=True)
    if success:
        print "Succeeded."
    else:
        print "Failed."
        self.retry(exc=Exception("Internet Archive reported upload failure."))
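
For our reports, a hedged adaptation of those same internetarchive calls might look like this. The collection name, metadata fields, and the item_identifier helper sketched earlier are assumptions for illustration, not a settled design:

# Hypothetical adaptation of the Perma.cc approach for one of our reports,
# using the same internetarchive (ia-wrapper) library. Metadata values are
# placeholders pending the Archive's guidance on collection structure.
import internetarchive

def upload_report(identifier, files, title, agency, year,
                  access_key, secret_key):
    item = internetarchive.get_item(identifier)
    metadata = {
        'collection': 'us-inspectors-general',  # placeholder collection name
        'mediatype': 'texts',
        'title': title,
        'creator': '%s Office of Inspector General' % agency,
        'date': str(year),
    }
    return item.upload(files,
                       metadata=metadata,
                       access_key=access_key,
                       secret_key=secret_key,
                       verbose=True)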

konklone (Member, Author) commented Nov 2, 2014

A zip viewer for the contents of the bulk data zip I just uploaded: https://ia902205.us.archive.org/zipview.php?zip=/25/items/us-inspectors-general.bulk/us-inspectors-general.bulk.zip

Intended landing page for the bulk data file: https://archive.org/details/us-inspectors-general.bulk

There's no automatic download link for an entire collection, so I'll plan to upload every item in the collection individually, and then upload a bulk file separately.

I have an individual report uploaded and successfully rendering in the Archive's book viewer here: https://archive.org/details/us-inspectors-general.treasury-2014-OIG-14-023
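
Tying the sketches in this thread together, a hypothetical driver for the item-per-report uploads might look like the following. The data/<agency>/<year>/<report-id>/ layout matches the s3cmd example above, but the report.json field names and the item_identifier/upload_report helpers are assumptions carried over from the earlier sketches:

# Hypothetical driver: one Archive item per scraped report directory,
# reusing the item_identifier() and upload_report() sketches above.
import glob
import json
import os

def iter_reports(data_dir="data"):
    # assumed layout: data/<agency>/<year>/<report-id>/report.json + files
    pattern = os.path.join(data_dir, "*", "*", "*", "report.json")
    for report_json in glob.glob(pattern):
        report_dir = os.path.dirname(report_json)
        with open(report_json) as f:
            report = json.load(f)  # field names below are assumptions
        yield report, glob.glob(os.path.join(report_dir, "*"))

def upload_everything(access_key, secret_key):
    for report, files in iter_reports():
        identifier = item_identifier(report["agency"], report["year"],
                                     report["report_id"])
        upload_report(identifier, files, report["title"],
                      report["agency"], report["year"],
                      access_key, secret_key)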
