getm: Fast reads with integrity for data URLs

getm provides fast binary reads for HTTP URLs using multiprocessing and shared memory.

Data is downloaded in background processes and made available as references to shared memory. There are no buffer copies, but the caller must release each memory reference, which makes working with getm a bit different from typical Python IO streams. It is still easy, and fast. During part iteration, memoryview objects are released for you.
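
The release requirement follows from Python's buffer protocol: a block of shared memory generally cannot be closed while a memoryview still points into it. The following is a minimal sketch of that behavior using only the standard library's multiprocessing.shared_memory (Python 3.8+). It does not use getm, and the helper name read_and_release is hypothetical.

```python
from multiprocessing import shared_memory


def read_and_release(payload: bytes) -> bytes:
    """Round-trip a non-empty payload through shared memory, releasing the view."""
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        # The OS may round the block size up, so slice to the payload length.
        shm.buf[:len(payload)] = payload
        # A zero-copy reference into the shared block.
        view = shm.buf[:len(payload)]
        try:
            # Copy only because this sketch returns the data after cleanup.
            result = bytes(view)
        finally:
            # Closing the block while this view is alive typically raises
            # BufferError, so release it first.
            view.release()
    finally:
        shm.close()
        shm.unlink()
    return result
```

The same ordering applies to data returned by a getm read: finish with the data, release it, and only then let the stream close.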

Python API methods accept a parameter, concurrency, which controls the mode of operation of getm:

Default concurrency == 1: Download data in a single background process, using a single HTTP request that is kept alive during the course of the download.
concurrency > 1: Up to concurrency HTTP range requests will be made concurrently, each in a separate background process.
concurrency == None: Data is read on the main process. In this mode, getm is a wrapper for requests.
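
To illustrate what range requests look like when concurrency > 1, here is a rough sketch of splitting an object into byte ranges and building the matching HTTP Range headers. The helpers byte_ranges and range_header are hypothetical and are not part of getm; how getm actually sizes and schedules its requests is not described here.

```python
from typing import Dict, List, Tuple


def byte_ranges(total_size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split [0, total_size) into inclusive (start, end) byte ranges."""
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]


def range_header(start: int, end: int) -> Dict[str, str]:
    """Build an HTTP Range header; both ends are inclusive."""
    return {"Range": f"bytes={start}-{end}"}
```

Each range could then be fetched by a separate worker, with at most concurrency requests in flight at once.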

Python API

import getm

# Readable stream:
with getm.urlopen(url) as fh:
    data = fh.read(size)
    data.release()

# Process data in parts:
for part in getm.iter_content(url, chunk_size=1024 * 1024):
    my_chunk_processor(part)
    # Note that 'part.release()' is not needed in an iterator context
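
As one example of a chunk processor, the sketch below computes a SHA-256 checksum incrementally over parts. The function name sha256_of_parts is hypothetical, and the sketch accepts any iterable of bytes-like objects so that it can run without getm or a network connection.

```python
import hashlib
from typing import Iterable, Union

BytesLike = Union[bytes, bytearray, memoryview]


def sha256_of_parts(parts: Iterable[BytesLike]) -> str:
    """Hash parts one at a time, without joining them into one buffer."""
    digest = hashlib.sha256()
    for part in parts:
        # hashlib accepts memoryview objects directly, so no copy is needed.
        digest.update(part)
    return digest.hexdigest()
```

With getm, the parts would come from getm.iter_content(url, chunk_size=1024 * 1024), as shown above.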

CLI

getm https://my-cool-url my-local-file

Testing

During tests, signed URLs are generated that point to data in S3 and GS buckets. Data is repopulated during each test. You must have credentials available to read and write to the test buckets, and to generate signed URLs.

Set the following environment variables to the GS and S3 test bucket names, respectively:

GETM_GS_TEST_BUCKET
GETM_S3_TEST_BUCKET

GCP Credentials

Generating signed URLs during tests requires service account credentials, which are made available to the test suite by setting the following environment variable:

export GETM_GOOGLE_APPLICATION_CREDENTIALS=my-creds.json
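
Before running the tests, it can help to confirm that the three variables named above are set. This is a small sketch, not part of the getm test suite; the helper missing_test_vars is hypothetical.

```python
import os
from typing import List, Mapping

# Variable names taken from the Testing section of this README.
REQUIRED_TEST_VARS = (
    "GETM_GS_TEST_BUCKET",
    "GETM_S3_TEST_BUCKET",
    "GETM_GOOGLE_APPLICATION_CREDENTIALS",
)


def missing_test_vars(environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_TEST_VARS if not environ.get(name)]
```

Calling missing_test_vars() with no arguments checks the current process environment; an empty list means all three variables are present.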

AWS Credentials

Follow these instructions for configuring the AWS CLI.

Installation

pip install getm

Shared Memory Size Tests

Before release, tests should be run on systems with various amounts of shared memory. Good choices are 64M and 8G. Such testing is also highly encouraged during development work on getm's shared memory algorithms and configurations.

Shared memory can be resized on Ubuntu systems, and likely other Linux systems, with the bundled convenience script dev_scripts/resize_shm.sh. Either sudo or root access is required:

sudo dev_scripts/resize_shm.sh 64M
sudo dev_scripts/resize_shm.sh 8G
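
After resizing, the new capacity can be checked from Python. The sketch below assumes that shared memory is mounted at /dev/shm, which is typical on Linux but not guaranteed, and that sizes such as 64M use binary units. Both helpers are hypothetical and are not part of getm.

```python
import os
import shutil
from typing import Optional

_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a size such as '64M' or '8G' to bytes (binary units assumed)."""
    text = text.strip().upper()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


def shm_capacity(path: str = "/dev/shm") -> Optional[int]:
    """Return the total size in bytes of the filesystem at path, or None if absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total
```

Comparing shm_capacity() with parse_size("64M") gives a quick sanity check that the resize took effect. The reported total may differ slightly from the requested size.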

sharedmemory backport to Python 3.7

getm relies on the multiprocessing.shared_memory module, which was introduced in Python 3.8. Since a large portion of getm's audience relies on Python 3.7, a C extension backport of this module is included.

The backport adds significant complexity to getm's code base, and maintaining it requires knowledge of C/C++ as well as CPython. It will be removed when enough getm users have migrated to Python 3.8 or greater.

Links

Project home page GitHub
Package distribution PyPI

Bugs

Please report bugs, issues, feature requests, etc. on GitHub.

Credits

getm was created by Brian Hannafious at the UCSC Genomics Institute.

Special thanks to Michael Baumann and Lon Blauvelt for critical input and testing.
