Upload to the Internet Archive #63

Open · 3 of 5 tasks
konklone opened this issue Jul 3, 2014 · 10 comments
Comments

konklone (Member) commented Jul 3, 2014

Using their S3-compatible API:
http://archive.org/help/abouts3.txt

I have an Archive account, under [email protected], and I generated my S3(-like) credentials. I'm not actually sure whether the code to do this upload belongs in this repository -- it could just as easily be a script in a public repo on my own account that runs as a cron on the same box -- but I'm including it here to solicit discussion, and to publicize that I want to get this stuff into the Archive.
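
For concreteness, an upload with those credentials is just an HTTP PUT with a "LOW" authorization header, per abouts3.txt. A minimal sketch in Python (the bucket name, file path, and metadata value here are placeholders, not settled choices):

# Minimal sketch of a raw PUT against the Archive's S3-compatible endpoint,
# following http://archive.org/help/abouts3.txt. Keys come from
# https://archive.org/account/s3.php; bucket and key names are placeholders.
import requests

ACCESS_KEY = "..."
SECRET_KEY = "..."

headers = {
    # IA's S3 variant takes "LOW <access>:<secret>" instead of AWS signatures
    "authorization": "LOW %s:%s" % (ACCESS_KEY, SECRET_KEY),
    # create the bucket (item) on first PUT if it doesn't already exist
    "x-amz-auto-make-bucket": "1",
    # any x-archive-meta-* header becomes item metadata in _meta.xml
    "x-archive-meta-title": "unitedstates-data",
}

with open("report.pdf", "rb") as f:
    response = requests.put(
        "http://s3.us.archive.org/unitedstates-data/report.pdf",
        data=f,
        headers=headers,
    )
response.raise_for_status()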

I'll also be contacting the Archive directly to see if they have any above-and-beyond interest in this collection.

/cc @waldoj @spulec

Todos:

  • Make a bucket, unitedstates-data
  • Add a _meta.xml to the bucket (done automatically, actually)
  • Write to IA about their S3 support (mail [email protected], with s3help in the subject)
  • Write a backup script that can reliably back up all the reports to the IA (see the sketch after this list).
  • Upload the full archive, after re-running all scrapers with --archive.
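
To make that backup-script todo concrete, here's a rough sketch of the shape it might take, reusing the PUT approach above. The data/<agency>/<year>/<report-id>/ layout matches the scraper output, but the upload_file helper and retry policy are just assumptions:

# Rough sketch of the backup-script todo: mirror every scraped file into the
# bucket under the same relative path, retrying each PUT a few times.
# upload_file() is assumed to be something like the requests.put() call above.
import os
import time

DATA_DIR = "data"
REMOTE_PREFIX = "unitedstates-data/inspectors-general/data"

def backup_all(upload_file, retries=3):
    for root, _dirs, files in os.walk(DATA_DIR):
        for name in files:
            local_path = os.path.join(root, name)
            rel_path = os.path.relpath(local_path, DATA_DIR)
            remote_key = "%s/%s" % (REMOTE_PREFIX, rel_path)
            for attempt in range(retries):
                try:
                    upload_file(local_path, remote_key)
                    break
                except Exception:
                    if attempt == retries - 1:
                        raise
                    time.sleep(2 ** attempt)  # simple backoff before retrying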

konklone (Member, Author) commented Sep 1, 2014

I've gotten a unitedstates-data bucket going, which made a predictable URL, and auto-created a bunch of metadata files:

https://archive.org/download/unitedstates-data/

To test it out, I uploaded the big VA report from earlier this year.

$ s3cmd put data/va/2014/14-02603-267/* s3://unitedstates-data/inspectors-general/data/va/2014/14-02603-267/

WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
14-02603-267/report.json -> s3://unitedstates-data/inspectors-general/data/va/2014/14-02603-267/report.json  [1 of 3]
 6800 of 6800   100% in    2s     2.69 kB/s  done
14-02603-267/report.pdf -> s3://unitedstates-data/inspectors-general/data/va/2014/14-02603-267/report.pdf  [2 of 3]
 1574313 of 1574313   100% in    3s   470.92 kB/s  done
14-02603-267/report.txt -> s3://unitedstates-data/inspectors-general/data/va/2014/14-02603-267/report.txt  [3 of 3]
 474149 of 474149   100% in    2s   203.02 kB/s  done

Which made this:

https://archive.org/download/unitedstates-data/inspectors-general/data/va/2014/14-02603-267/

Interestingly, 10 minutes after upload, the Internet Archive auto-produced a report_jp2.zip (~70MB) containing JPEG 2000 images of each page of the original PDF.

I've sent an email to the Archive asking for guidance or documentation on how we can best structure the collection. In the meantime, I may just upload everything now, once, and worry about creating a sophisticated script for managing cost-effective sync and re-uploading of metadata later.

konklone (Member, Author) commented Sep 2, 2014

Also, the Internet Archive has absolutely insane public logging for all this.

konklone (Member, Author) commented Sep 2, 2014

Some more pages relevant to our collection:

  • manager
  • editor
  • okay

The Internet Archive is extremely cool.

waldoj (Member) commented Sep 2, 2014

I'm glad that storing this stuff on the Archive is going so well. It's really the perfect home for this stuff.

Well, I mean, a .gov site is the perfect home for this. But, short of that, the Archive is the best home for this.

konklone (Member, Author) commented Sep 4, 2014

The plan now, after talking with IA, is to store each report as an "item" in the collection, rather than putting them in one bucket. An "item" (bucket) is supposed to have the same metadata for everything. Currently, the unitedstates-data bucket is considered one "item":

https://archive.org/details/unitedstates-data

The Archive is willing to make a Collection for the items, but needs at least 50 "items" uploaded.

So each item would be its own bucket, with an ID something like unitedstates-inspector-general-EPA-2004-[report-id]? And then I would ask the Archive to categorize those items into the collection. The IDs seem unwieldy, but I think that's the only way, at least to start.

FWIW, the unresolved issue from @divergentdave's work on finding duplicate IDs across years wouldn't come into play here if I put the year in the ID. The year is a more brittle piece of data than I'd prefer to put in the ID of the item, though.
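
A hypothetical helper along those lines (the prefix and fields are still up for discussion, not a settled scheme):

# Hypothetical identifier scheme matching the pattern above; nothing here
# is final, and report_id formats vary by agency.
def item_identifier(agency, year, report_id):
    # e.g. item_identifier("EPA", 2004, "2004-P-00021")
    #   -> "unitedstates-inspector-general-EPA-2004-2004-P-00021"
    return "unitedstates-inspector-general-%s-%s-%s" % (agency, year, report_id)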

divergentdave (Contributor) commented

It seems like Internet Archive item identifiers can't easily be changed once uploaded (or deleted, naturally). If we use our report_id in the item identifier, we'll need to make sure we've fixed all our outstanding QA issues before we start uploading (particularly same-year duplicate IDs and bad 404 pages, and maybe duplicate files).

It might also be a good idea to manually review new reports going forward before sending them to IA, in case one of the scrapers starts emitting spurious reports.

konklone (Member, Author) commented Oct 3, 2014

I agree that we should get our ID QA in order before submitting everything... and you've basically done that, which is outstanding. I do need to regenerate my archive.

But, I think the downside of uploading duplicate or wrongly ID'd content to the Archive is low. It'll happen, we'll make a good faith effort to keep it in order (and automating the running of the qa script will make that possible), but it's ultimately just not a big deal to have dupe reports under different IDs. Everything else is overwritable, I think.

konklone (Member, Author) commented

For anyone watching this thread, I'm doing some work that will build into a general-purpose Internet Archive uploader, at https://github.com/konklone/bit-voyage.

konklone (Member, Author) commented

So Harvard's Perma.cc automatically uploads to the Internet Archive, using ia-wrapper.

def upload_to_internet_archive(self, link_guid):
    # setup
    asset = Asset.objects.get(link_id=link_guid)
    link = asset.link
    identifier = settings.INTERNET_ARCHIVE_IDENTIFIER_PREFIX+link_guid
    warc_path = os.path.join(asset.base_storage_path, asset.warc_capture)

    # create IA item for this capture
    item = internetarchive.get_item(identifier)
    metadata = {
        'collection':settings.INTERNET_ARCHIVE_COLLECTION,
        'mediatype':'web',
        'date':link.creation_timestamp,
        'title':'Perma Capture %s' % link_guid,
        'creator':'Perma.cc',

        # custom metadata
        'submitted_url':link.submitted_url,
        'perma_url':"http://%s/%s" % (settings.HOST, link_guid)
    }

    # upload
    with default_storage.open(warc_path, 'rb') as warc_file:
        success = item.upload(warc_file,
                              metadata=metadata,
                              access_key=settings.INTERNET_ARCHIVE_ACCESS_KEY,
                              secret_key=settings.INTERNET_ARCHIVE_SECRET_KEY,
                              verbose=True,
                              debug=True)
    if success:
        print "Succeeded."
    else:
        print "Failed."
        self.retry(exc=Exception("Internet Archive reported upload failure."))
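
For our reports, a hedged adaptation of those same internetarchive calls might look like this. The collection name, metadata fields, and the item_identifier helper sketched earlier are assumptions for illustration, not a settled design:

# Hypothetical adaptation of the Perma.cc approach for one of our reports,
# using the same internetarchive (ia-wrapper) library. Metadata values are
# placeholders pending the Archive's guidance on collection structure.
import internetarchive

def upload_report(identifier, files, title, agency, year,
                  access_key, secret_key):
    item = internetarchive.get_item(identifier)
    metadata = {
        'collection': 'us-inspectors-general',  # placeholder collection name
        'mediatype': 'texts',
        'title': title,
        'creator': '%s Office of Inspector General' % agency,
        'date': str(year),
    }
    return item.upload(files,
                       metadata=metadata,
                       access_key=access_key,
                       secret_key=secret_key,
                       verbose=True)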

konklone (Member, Author) commented Nov 2, 2014

A zip viewer for the contents of the bulk data zip I just uploaded: https://ia902205.us.archive.org/zipview.php?zip=/25/items/us-inspectors-general.bulk/us-inspectors-general.bulk.zip

Intended landing page for the bulk data file: https://archive.org/details/us-inspectors-general.bulk

There's no automatic download link for an entire collection, so I'll plan to upload every item in the collection individually, and then upload a bulk file separately.

I have an individual report uploaded and successfully rendering in the Archive's book viewer here: https://archive.org/details/us-inspectors-general.treasury-2014-OIG-14-023
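
Tying the sketches in this thread together, a hypothetical driver for the item-per-report uploads might look like the following. The data/<agency>/<year>/<report-id>/ layout matches the s3cmd example above, but the report.json field names and the item_identifier/upload_report helpers are assumptions carried over from the earlier sketches:

# Hypothetical driver: one Archive item per scraped report directory,
# reusing the item_identifier() and upload_report() sketches above.
import glob
import json
import os

def iter_reports(data_dir="data"):
    # assumed layout: data/<agency>/<year>/<report-id>/report.json + files
    pattern = os.path.join(data_dir, "*", "*", "*", "report.json")
    for report_json in glob.glob(pattern):
        report_dir = os.path.dirname(report_json)
        with open(report_json) as f:
            report = json.load(f)  # field names below are assumptions
        yield report, glob.glob(os.path.join(report_dir, "*"))

def upload_everything(access_key, secret_key):
    for report, files in iter_reports():
        identifier = item_identifier(report["agency"], report["year"],
                                     report["report_id"])
        upload_report(identifier, files, report["title"],
                      report["agency"], report["year"],
                      access_key, secret_key)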
