This client provides an easy way to interact with an AIS cluster and create TensorFlow datasets from its data.
```console
$ ./setup.sh
$ source venv/bin/activate
$ ais create bucket $BUCKET
```

Put small tars from `gsutil ls gs://lpr-gtc2020` into `$BUCKET` and adjust `imagenet.py` with your `$BUCKET` and objects template.

```console
$ python examples/imagenet_in_memory.py
```
`def Dataset(bucket_name, proxy_url, conversions, selections, remote_exec)`

Creates a `Dataset` object.

`bucket_name`
- string
- name of an AIS bucket

`proxy_url`
- string
- URL of the AIS cluster proxy

`conversions`
- (optional) list of Conversions from `tar2tf.ops`
- describes transformations made on a tar record. See the `tar2tf.ops` section for more.

`selections`
- (optional) list of length 2 of Selections from `tar2tf.ops`
- describes how to transform a tar-record entry into a datapoint. See the `tar2tf.ops` section for more.

`remote_exec`
- (optional) bool
- specifies whether conversions and selections should be executed in the cluster. If `remote_exec == True` but remote execution of one of the conversions is not supported, `remote_exec` becomes disabled. If `remote_exec` is not provided, the client automatically detects whether remote execution is possible. A minimal construction sketch follows this list.
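For illustration, a minimal construction sketch. `BUCKET_NAME` and `PROXY_URL` are placeholders, and the import paths are assumed from this README's references to `tar2tf.ops`:

```python
from tar2tf import Dataset             # import path assumed
from tar2tf.ops import Decode, Resize  # import path assumed

BUCKET_NAME = "my-bucket"              # placeholder
PROXY_URL = "http://localhost:8080"    # placeholder

# remote_exec omitted: the client auto-detects whether the given
# conversions and selections can be executed in the cluster.
dataset = Dataset(
    BUCKET_NAME,
    PROXY_URL,
    [Decode("jpg"), Resize("jpg", (32, 32))],  # conversions
    ["jpg", "cls"],                            # selections: [value, label]
)
```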
`def load(template, **kwargs)`

Transforms tars of images from AIS into a TensorFlow-compatible format.
`template`
- string
- object names of the tars. Bash range syntax like `{0..10}` is supported.

`output_shapes`
- list of `tf.TensorShape`
- shapes of the resulting objects

`output_types`
- list of `tf.DType`
- types of the resulting objects

`num_workers`
- number
- number of workers concurrently downloading objects from the AIS cluster
`path`
- string or string generator
- destination where the TFRecord file or multiple files should be saved. If `path` is provided, remote execution is not enabled. Accepted: a string, a string with a `"{}"` format template, or a generator.
- If `max_shard_size` is specified, multiple file destinations might be needed.
- If `path` is a plain string, default path indexing is applied.
- If `path` is a string with `"{}"`, consecutive numbers starting from 1 are substituted into the path.
- If `path` is a generator, consecutive yielded values are used (see the generator sketch after this parameter list).
- The paths of the generated TFRecord files are returned from `load`.
- If empty or `None`, all operations are done in memory or executed remotely, and a `tf.data.Dataset` is returned.
`record_to_example`
- (optional) function
- specifies how to translate a tar record. The argument of this function is the representation of a single tar record: a Python `dict`. A tar record is an abstraction for multiple files with exactly the same path but different extensions. The argument will have a `__key__` entry whose value is the path of the record without an extension; for each extension `e`, the dict will have an entry `e` whose value is the contents of the corresponding file. If the default `record_to_example` was used, the `default_record_parser` function should be used to parse the resulting TFRecord into the `tf.data.Dataset` interface. A custom `record_to_example` sketch follows this list.
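As an illustration of the generator form of `path`, here is a minimal sketch; the generator itself is hypothetical, while the `load` arguments mirror the examples later in this README:

```python
# Hypothetical generator: yields "train-1.record", "train-2.record", ...
def record_paths():
    i = 1
    while True:
        yield "train-{}.record".format(i)
        i += 1

# Equivalent in effect to path="train-{}.record"; load returns the
# paths of the TFRecord shards it actually wrote.
filenames = dataset.load(
    "train-{0..3}.tar.gz",
    path=record_paths(),
    max_shard_size="100MB",
)
```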
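And a sketch of a custom `record_to_example`, built on the tar-record dict described above; the exact `tf.train.Example` layout is an assumption for illustration, not a documented contract:

```python
import tensorflow as tf

# Assumption: the function receives the tar-record dict and returns a
# tf.train.Example to be serialized into the TFRecord file.
def my_record_to_example(record):
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[record["jpg"]])),
        "label": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[record["cls"]])),
    }))

records = dataset.load(
    "train-{0..3}.tar.gz",
    path="train.record",
    record_to_example=my_record_to_example,  # a matching parser is then needed on read
)
```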
The `ops` module is used to describe the tar-record to datapoint transformation.
Conversions are transformations applied to each tar record.
`tar2tf.ops.Convert(ext_name, dst_type)`

Converts the underlying type of the `ext_name` image entry to `dst_type`.
Remote execution supported.
`tar2tf.ops.Decode(ext_name)`

Decodes the `ext_name` image from BMP, JPEG, or PNG format; fails for other formats.
Remote execution supported.
`tar2tf.ops.Resize(ext_name, dst_size)`

Resizes the `ext_name` image to the new size `dst_size`.
Remote execution supported.
`tar2tf.ops.Rotate(ext_name, [angle])`

Rotates the `ext_name` image by `angle` degrees clockwise. If `angle == 0` or `angle` is not provided, a random rotation is applied.
Remote execution supported.
`tar2tf.ops.Func(f)`

The most versatile of the `tar2tf.ops` operations: takes a function `f` and calls it with the tar record.
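A minimal `Func` sketch; the assumption that `f` returns the (possibly modified) tar record is illustrative and not spelled out in this README:

```python
from tar2tf.ops import Decode, Func  # import path assumed

# Hypothetical per-record transformation; assumes f returns the
# (possibly modified) tar-record dict.
def strip_label_whitespace(tar_record):
    tar_record["cls"] = tar_record["cls"].strip()
    return tar_record

conversions = [Decode("jpg"), Func(strip_label_whitespace)]
```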
Selections select entries from the tar record to be used as either values or labels in the dataset.
`tar2tf.ops.Select(ext_name)`

The simplest of `tar2tf.ops`. Returns the value stored in the tar record under the `ext_name` key.
`tar2tf.ops.SelectJSON(ext_name, nested_path)`

Similar to `Select`, but able to extract a deeply nested value from JSON. `nested_path` can be either a string/int (for first-level values) or a list of strings/ints (for deeply nested values). Reads the value under `ext_name`, treats it as JSON, and returns the value under `nested_path`.
`tar2tf.ops.SelectList(list of Selection)`

Returns an object which is a list of the provided Selections.

`tar2tf.ops.SelectDict(dict of Selection)`

Returns an object which is a dict of the provided Selections. A combined sketch of the selection ops follows.
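A combined sketch; the `"json"` entry and its `["annotations", "label"]` layout are hypothetical:

```python
from tar2tf.ops import Select, SelectJSON, SelectDict  # import path assumed

# Value/label pair: image bytes plus a label dug out of a JSON sidecar.
selections = [
    Select("jpg"),
    SelectJSON("json", ["annotations", "label"]),  # hypothetical layout
]

# Alternatively, name the outputs by wrapping Selections in a dict:
selection = SelectDict({
    "image": Select("jpg"),
    "label": Select("cls"),
})
```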
```python
from tar2tf import Dataset             # import path assumed
from tar2tf.ops import Decode, Resize  # import path assumed

# Create in-memory TensorFlow dataset, with conversions and selections
# executed remotely in the cluster.
dataset = Dataset(BUCKET_NAME, PROXY_URL, [Decode("jpg"), Resize("jpg", (32, 32))], ["jpg", "cls"])
train_dataset = dataset.load(
    "train-{0..3}.tar.gz",
    remote_exec=True,
).shuffle(buffer_size=1024).batch(BATCH_SIZE)
test_dataset = dataset.load("train-{4..7}.tar.gz").batch(BATCH_SIZE)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```
```python
# Create in-memory TensorFlow dataset; whether conversions and selections
# run remotely is auto-detected.
dataset = Dataset(BUCKET_NAME, PROXY_URL, [Decode("jpg"), Resize("jpg", (32, 32))], ["jpg", "cls"])
train_dataset = dataset.load(
    "train-{0..3}.tar.gz",
).shuffle(buffer_size=1024).batch(BATCH_SIZE)
test_dataset = dataset.load("train-{4..7}.tar.gz").batch(BATCH_SIZE)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```
```python
# Create in-memory TensorFlow dataset without conversions or selections.
dataset = Dataset(BUCKET_NAME, PROXY_URL)
train_dataset = dataset.load("train-{0..3}.tar.gz").shuffle(buffer_size=1024).batch(BATCH_SIZE)
test_dataset = dataset.load("train-{4..7}.tar.gz").batch(BATCH_SIZE)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```
```python
# Create in-memory TensorFlow dataset, downloading objects with 4
# concurrent workers and remote execution disabled.
dataset = Dataset(BUCKET_NAME, PROXY_URL)
train_dataset = dataset.load(
    "train-{0..3}.tar.gz",
    num_workers=4,
    remote_exec=False,
).shuffle(buffer_size=1024).batch(BATCH_SIZE)
test_dataset = dataset.load(
    "train-{4..7}.tar.gz",
    num_workers=4,
    remote_exec=False,
).batch(BATCH_SIZE)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```
```python
import tensorflow as tf
from tar2tf import default_record_parser  # import path assumed

# Create TensorFlow dataset with an intermediate TFRecord file
# stored in the filesystem.
dataset = Dataset(BUCKET_NAME, PROXY_URL)
records = dataset.load(
    "train-{0..3}.tar.gz",
    path="train.record",
)
train_dataset = (
    tf.data.TFRecordDataset(filenames=records)
    .map(default_record_parser)
    .shuffle(buffer_size=1024)
    .batch(BATCH_SIZE)
)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```
Create a TensorFlow dataset with intermediate TFRecord files stored in the filesystem, with a limited size per TFRecord shard:

```python
dataset = Dataset(BUCKET_NAME, PROXY_URL)
filenames = dataset.load(
    "train-{0..3}.tar.gz",
    path="train-{}.record",
    max_shard_size="100MB",
)
train_dataset = (
    tf.data.TFRecordDataset(filenames=filenames)
    .map(default_record_parser)
    .shuffle(buffer_size=1024)
    .batch(BATCH_SIZE)
)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```
```python
from tar2tf.ops import Func  # import path assumed

# Create in-memory TensorFlow dataset:
# "jpg" is decoded and resized, then the function f is applied;
# the datapoint value comes from "jpg", the label from "cls".
dataset = Dataset(
    BUCKET_NAME,
    PROXY_URL,
    [Decode("jpg"), Resize("jpg", (32, 32)), Func(f)],  # f: user-defined, see tar2tf.ops.Func
    ["jpg", "cls"],
)
train_dataset = dataset.load("train-{0..3}.tar.gz").shuffle(buffer_size=1024).batch(BATCH_SIZE)
test_dataset = dataset.load("train-{4..7}.tar.gz").batch(BATCH_SIZE)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```