
Commit 022b836

PR tensorflow#2043: added tf.SequenceExample to ExampleGen
Please approve this CL. It will be submitted automatically, and its GitHub pull request will be marked as merged.

Imported from GitHub PR tensorflow#2043

Add tf.SequenceExample to ExampleGen for base_example_gen and import_example_gen.

Copybara import of the project:

- 6177aad updated example gen proto by tinally <[email protected]>
- ff07909 align with tfx repo by tinally <[email protected]>
- 4b039b2 updated example gen proto by tinally <[email protected]>
- bfe70bd align with tfx repo by tinally <[email protected]>
- 610e95c align with tfx by tinally <[email protected]>
- 992320c added tf.SequenceExample to ExampleGen proto and partition by tinally <[email protected]>
- e734273 passed unit tests by tinally <[email protected]>
- 5850987 fixed lint issues by tinally <[email protected]>
- 95b25c0 fixed reserved number in proto by tinally <[email protected]>
- 7227b6c updated import example gen by tinally <[email protected]>
- a57c728 added unit test for base_example_gen with sequence example by tinally <[email protected]>
- aef1eb7 clean up code by tinally <[email protected]>
- 7d15cdf small fix by tinally <[email protected]>
- af7d8f4 created a dataset for tf.SequenceExample by tinally <[email protected]>
- 2b1310a added full doc comment by tinally <[email protected]>
- 4535c81 tested import_example_gen with testdata by tinally <[email protected]>
- 6dc4e62 added testImportSequenceExample to unit test by tinally <[email protected]>
- bc6f171 added feature_list to sequence example testdata by tinally <[email protected]>
- ab6d089 updated README by tinally <[email protected]>
- a2afd8c updated format of testdata by tinally <[email protected]>
- bead04a fixed most recently addressed issues by tinally <[email protected]>
- 2a0c4f2 clean up duplicate code by tinally <[email protected]>
- 384f7f3 more clean up by tinally <[email protected]>
- 7e5c8b4 updated readme format by tinally <[email protected]>
- 1e2aeb9 fixed some comments by tinally <[email protected]>
- 61ac9dc Merge branch 'master' of https://github.com/tinally/tfx by tinally <[email protected]>
- 4a4d204 added test for feature based partition with sequence exam... by tinally <[email protected]>
- e74e417 refactored tests for feature based partition by tinally <[email protected]>
- 2b148ab fixed comments by tinally <[email protected]>
- 3c3bdc7 small fix by tinally <[email protected]>
- 613e59a Merge 3c3bdc7 into 9cd81... by tinally <[email protected]>

COPYBARA_INTEGRATE_REVIEW=tensorflow#2043 from tinally:master 3c3bdc7

PiperOrigin-RevId: 319263311
1 parent bb3f0b4 commit 022b836

File tree

8 files changed: +220 −211 lines changed


RELEASE.md

Lines changed: 4 additions & 0 deletions
@@ -8,6 +8,10 @@
     Deprecated ExampleGen input (external) artifact.
 *   Added ModelRun artifact for Trainer for storing training related files,
     e.g., Tensorboard logs.
+*   Added support for `tf.train.SequenceExample` in ExampleGen:
+    *   ImportExampleGen now supports `tf.train.SequenceExample` importing.
+    *   base_example_gen_executor now supports `tf.train.SequenceExample` as
+        output payload format, which can be utilized by custom ExampleGen.
 
 ## Bug fixes and other changes
 *   Added Tuner component, which is still work in progress.
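For context on the new payload format, here is a minimal sketch of the kind of record ImportExampleGen can now import; the feature names ('id', 'clicks') are hypothetical and only illustrate the standard TensorFlow proto API:

import tensorflow as tf

# A SequenceExample holds per-sequence "context" features plus ordered
# "feature_lists"; both names below are made up for illustration.
seq = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        'id': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'user_1'])),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        'clicks': tf.train.FeatureList(feature=[
            tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
            tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
        ]),
    }))
serialized = seq.SerializeToString()  # what ends up in the gzipped TFRecord files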

tfx/components/example_gen/base_example_gen_executor.py

Lines changed: 53 additions & 58 deletions
@@ -40,18 +40,30 @@
 DEFAULT_FILE_NAME = 'data_tfrecord'
 
 
-def _ExamplePartitionKey(record: tf.train.Example,
-                         split_config: example_gen_pb2.SplitConfig) -> bytes:
-  """Generates key for partition for tf.train.Example."""
+def _GeneratePartitionKey(record: Union[tf.train.Example,
+                                        tf.train.SequenceExample, bytes],
+                          split_config: example_gen_pb2.SplitConfig) -> bytes:
+  """Generates key for partition."""
 
   if not split_config.HasField('partition_feature_name'):
+    if isinstance(record, bytes):
+      return record
     return record.SerializeToString(deterministic=True)
 
+  if isinstance(record, tf.train.Example):
+    features = record.features.feature  # pytype: disable=attribute-error
+  elif isinstance(record, tf.train.SequenceExample):
+    features = record.context.feature  # pytype: disable=attribute-error
+  else:
+    raise RuntimeError('Split by `partition_feature_name` is only supported '
+                       'for FORMAT_TF_EXAMPLE and FORMAT_TF_SEQUENCE_EXAMPLE '
+                       'payload format.')
+
   # Use a feature for partitioning the examples.
   feature_name = split_config.partition_feature_name
-  if feature_name not in record.features.feature:
+  if feature_name not in features:
     raise RuntimeError('Feature name `{}` does not exist.'.format(feature_name))
-  feature = record.features.feature[feature_name]
+  feature = features[feature_name]
   if not feature.HasField('kind'):
     raise RuntimeError('Partition feature does not contain any value.')
   if (not feature.HasField('bytes_list') and
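A short sketch of how the partition feature lookup above differs between the two payload formats; the 'label' feature name and values are hypothetical:

import tensorflow as tf

# For a tf.train.Example the partition feature lives under features.feature;
# for a tf.train.SequenceExample it lives under context.feature.
example = tf.train.Example(features=tf.train.Features(feature={
    'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'cat']))}))
sequence_example = tf.train.SequenceExample(context=tf.train.Features(feature={
    'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'cat']))}))

# Both lookups yield an identical Feature proto for this record.
assert example.features.feature['label'] == sequence_example.context.feature['label']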
@@ -62,23 +74,15 @@ def _ExamplePartitionKey(record: tf.train.Example,
 
 
 def _PartitionFn(
-    record: Union[tf.train.Example, bytes],
+    record: Union[tf.train.Example, tf.train.SequenceExample, bytes],
     num_partitions: int,
     buckets: List[int],
     split_config: example_gen_pb2.SplitConfig,
 ) -> int:
   """Partition function for the ExampleGen's output splits."""
   assert num_partitions == len(
       buckets), 'Partitions do not match bucket number.'
-
-  if isinstance(record, tf.train.Example):
-    partition_str = _ExamplePartitionKey(record, split_config)
-  elif split_config.HasField('partition_feature_name'):
-    raise RuntimeError('Split by `partition_feature_name` is only supported '
-                       'for FORMAT_TF_EXAMPLE payload format.')
-  else:
-    partition_str = record
-
+  partition_str = _GeneratePartitionKey(record, split_config)
   bucket = int(hashlib.sha256(partition_str).hexdigest(), 16) % buckets[-1]
   # For example, if buckets is [10,50,80], there will be 3 splits:
   #   bucket >=0 && < 10, returns 0
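The bucket math referenced in the comment above, spelled out as a standalone sketch; the key and bucket values are illustrative, and using bisect for the final split lookup is an assumption rather than a claim about the executor's exact code:

import bisect
import hashlib

buckets = [10, 50, 80]                    # cumulative boundaries, e.g. hash_buckets of 10/40/30
partition_str = b'example-partition-key'  # illustrative partition key
bucket = int(hashlib.sha256(partition_str).hexdigest(), 16) % buckets[-1]
# bucket < 10 -> split 0, 10 <= bucket < 50 -> split 1, 50 <= bucket < 80 -> split 2.
split_index = bisect.bisect(buckets, bucket)
assert 0 <= split_index < len(buckets)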
@@ -88,14 +92,17 @@ def _PartitionFn(
 
 
 @beam.ptransform_fn
-@beam.typehints.with_input_types(Union[tf.train.Example, bytes])
+@beam.typehints.with_input_types(Union[tf.train.Example,
+                                       tf.train.SequenceExample, bytes])
 @beam.typehints.with_output_types(beam.pvalue.PDone)
 def _WriteSplit(example_split: beam.pvalue.PCollection,
                 output_split_path: Text) -> beam.pvalue.PDone:
   """Shuffles and writes output split as serialized records in TFRecord."""
 
   def _MaybeSerialize(x):
-    return x.SerializeToString() if isinstance(x, tf.train.Example) else x
+    if isinstance(x, (tf.train.Example, tf.train.SequenceExample)):
+      return x.SerializeToString()
+    return x
 
   return (example_split
           # TODO(jyzhao): make shuffle optional.
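A rough standalone sketch of the write path that `_WriteSplit` wraps: serialized records go into gzipped TFRecord files. The output path and the Reshuffle stand-in for the shuffle step are assumptions; only the gzipped sink mirrors the code above:

import apache_beam as beam
import tensorflow as tf

with beam.Pipeline() as p:
  _ = (p
       | 'CreateRecords' >> beam.Create(
           [tf.train.SequenceExample().SerializeToString()])
       | 'Shuffle' >> beam.Reshuffle()  # stand-in for the executor's shuffle step
       | 'WriteTFRecord' >> beam.io.WriteToTFRecord(
           '/tmp/out/data_tfrecord',     # hypothetical output_split_path
           file_name_suffix='.gz'))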
@@ -107,28 +114,13 @@ def _MaybeSerialize(x):
           file_name_suffix='.gz'))
 
 
-@beam.ptransform_fn
-@beam.typehints.with_input_types(beam.Pipeline)
-@beam.typehints.with_output_types(Union[tf.train.Example, bytes])
-def _InputToExampleOrBytes(
-    pipeline: beam.Pipeline,
-    input_to_example: beam.PTransform,
-    exec_properties: Dict[Text, Any],
-    split_pattern: Text,
-) -> beam.pvalue.PCollection:
-  """Converts input into a tf.train.Example, or a bytes (serialized proto)."""
-  return (pipeline
-          | 'InputSourceToExampleOrBytes' >> input_to_example(
-              exec_properties, split_pattern))
-
-
 class BaseExampleGenExecutor(
     with_metaclass(abc.ABCMeta, base_executor.BaseExecutor)):
   """Generic TFX example gen base executor.
 
   The base ExampleGen executor takes a configuration and converts external data
-  sources to TensorFlow Examples (tf.train.Example), or any other protocol
-  buffer as subclass defines.
+  sources to TensorFlow Examples (tf.train.Example, tf.train.SequenceExample),
+  or any other protocol buffer as subclass defines.
 
   The common configuration (defined in
   https://github.com/tensorflow/tfx/blob/master/tfx/proto/example_gen.proto#L44.)
@@ -137,12 +129,14 @@ class BaseExampleGenExecutor(
 
   The conversion is done in `GenerateExamplesByBeam` as a Beam pipeline, which
   validates the configuration, reads the external data sources, converts the
-  record in the input source to tf.Example if needed, and splits the examples if
-  the output split config is given. Then the executor's `Do` writes the results
-  in splits to the output path.
+  record in the input source to any supported output payload formats
+  (e.g., tf.Example or tf.SequenceExample) if needed, and splits the examples
+  if the output split config is given. Then the executor's `Do` writes the
+  results in splits to the output path.
 
   For simple custom ExampleGens, the details of transforming input data
-  record(s) to a tf.Example is expected to be given in
+  record(s) to a specific output payload format (e.g., tf.Example or
+  tf.SequenceExample) is expected to be given in
   `GetInputSourceToExamplePTransform`, which returns a Beam PTransform with the
   actual implementation. For complex use cases, such as joining multiple data
   sources and different interpretations of the configurations, the custom
@@ -163,7 +157,9 @@ def GetInputSourceToExamplePTransform(self) -> beam.PTransform:
     Here is an example PTransform:
       @beam.ptransform_fn
      @beam.typehints.with_input_types(beam.Pipeline)
-      @beam.typehints.with_output_types(Union[tf.train.Example, bytes])
+      @beam.typehints.with_output_types(Union[tf.train.Example,
+                                              tf.train.SequenceExample,
+                                              bytes])
       def ExamplePTransform(
           pipeline: beam.Pipeline,
           exec_properties: Dict[Text, Any],
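For a custom ExampleGen targeting the new payload format, a PTransform matching the contract documented above might look like the following sketch; the function name and converter body are hypothetical:

from typing import Any, Dict, Text, Union

import apache_beam as beam
import tensorflow as tf


@beam.ptransform_fn
@beam.typehints.with_input_types(beam.Pipeline)
@beam.typehints.with_output_types(Union[tf.train.Example,
                                        tf.train.SequenceExample, bytes])
def _MyInputToSequenceExample(
    pipeline: beam.Pipeline,
    exec_properties: Dict[Text, Any],
    split_pattern: Text) -> beam.pvalue.PCollection:
  """Hypothetical converter that emits tf.train.SequenceExample records."""
  del exec_properties, split_pattern  # a real implementation would read these
  return pipeline | 'Create' >> beam.Create([tf.train.SequenceExample()])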
@@ -176,15 +172,15 @@ def GenerateExamplesByBeam(
       pipeline: beam.Pipeline,
       exec_properties: Dict[Text, Any],
   ) -> Dict[Text, beam.pvalue.PCollection]:
-    """Converts input source to TF example splits based on configs.
+    """Converts input source to serialized record splits based on configs.
 
     Custom ExampleGen executor should provide GetInputSourceToExamplePTransform
-    for converting input split to TF Examples. Overriding this
+    for converting input split to serialized records. Overriding this
     'GenerateExamplesByBeam' method instead if complex logic is need, e.g.,
     custom spliting logic.
 
     Args:
-      pipeline: beam pipeline.
+      pipeline: Beam pipeline.
       exec_properties: A dict of execution properties. Depends on detailed
         example gen implementation.
         - input_base: an external directory containing the data files.
@@ -197,7 +193,7 @@ def GenerateExamplesByBeam(
 
     Returns:
       Dict of beam PCollection with split name as key, each PCollection is a
-        single output split that contains serialized TF Examples.
+        single output split that contains serialized records.
     """
     # Get input split information.
     input_config = example_gen_pb2.Input()
@@ -214,7 +210,7 @@ def GenerateExamplesByBeam(
     exec_properties['_beam_pipeline_args'] = self._beam_pipeline_args or []
 
     example_splits = []
-    input_to_example = self.GetInputSourceToExamplePTransform()
+    input_to_record = self.GetInputSourceToExamplePTransform()
     if output_config.split_config.splits:
       # Use output splits, input must have only one split.
       assert len(
@@ -228,21 +224,19 @@ def GenerateExamplesByBeam(
         buckets.append(total_buckets)
       example_splits = (
          pipeline
-          | 'InputToExampleOrBytes' >>
+          | 'InputToRecord' >>
          # pylint: disable=no-value-for-parameter
-          _InputToExampleOrBytes(input_to_example, exec_properties,
-                                 input_config.splits[0].pattern)
+          input_to_record(exec_properties, input_config.splits[0].pattern)
          | 'SplitData' >> beam.Partition(_PartitionFn, len(buckets), buckets,
                                          output_config.split_config))
     else:
       # Use input splits.
       for split in input_config.splits:
         examples = (
            pipeline
-            | 'InputToExampleOrBytes[{}]'.format(split.name) >>
+            | 'InputToRecord[{}]'.format(split.name) >>
            # pylint: disable=no-value-for-parameter
-            _InputToExampleOrBytes(input_to_example, exec_properties,
-                                   split.pattern))
+            input_to_record(exec_properties, split.pattern))
         example_splits.append(examples)
 
     result = {}
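The configurations the two branches above consume can be constructed roughly as follows; the split names and pattern are illustrative, not taken from this change:

from tfx.proto import example_gen_pb2

# Single input split fanned out into hash-partitioned output splits.
input_config = example_gen_pb2.Input(splits=[
    example_gen_pb2.Input.Split(name='single_split', pattern='*'),
])
output_config = example_gen_pb2.Output(
    split_config=example_gen_pb2.SplitConfig(splits=[
        example_gen_pb2.SplitConfig.Split(name='train', hash_buckets=2),
        example_gen_pb2.SplitConfig.Split(name='eval', hash_buckets=1),
    ]))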
@@ -258,22 +252,23 @@ def Do(
   ) -> None:
     """Take input data source and generates serialized data splits.
 
-    The output is intended to be serialized tf.train.Examples protocol buffer
-    in gzipped TFRecord format, but subclasses can choose to override to write
-    to any serialized records payload into gzipped TFRecord as specified,
-    so long as downstream component can consume it. The format of payload is
-    added to `payload_format` custom property of the output Example artifact.
+    The output is intended to be serialized tf.train.Examples or
+    tf.train.SequenceExamples protocol buffer in gzipped TFRecord format,
+    but subclasses can choose to override to write to any serialized records
+    payload into gzipped TFRecord as specified, so long as downstream
+    component can consume it. The format of payload is added to
+    `payload_format` custom property of the output Example artifact.
 
     Args:
       input_dict: Input dict from input key to a list of Artifacts. Depends on
         detailed example gen implementation.
       output_dict: Output dict from output key to a list of Artifacts.
-        - examples: splits of tf examples.
+        - examples: splits of serialized records.
       exec_properties: A dict of execution properties. Depends on detailed
         example gen implementation.
         - input_base: an external directory containing the data files.
-        - input_config: JSON string of example_gen_pb2.Input instance, providing
-          input configuration.
+        - input_config: JSON string of example_gen_pb2.Input instance,
+          providing input configuration.
         - output_config: JSON string of example_gen_pb2.Output instance,
           providing output configuration.
         - output_data_format: Payload format of generated data in output
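Since `input_config` and `output_config` are passed as JSON strings of the protos described above, an exec_properties dict might be assembled roughly as follows; the path is hypothetical and only the key names from the docstring are taken from the source:

from google.protobuf import json_format
from tfx.proto import example_gen_pb2

exec_properties = {
    'input_base': '/tmp/external_data',  # hypothetical external data directory
    'input_config': json_format.MessageToJson(
        example_gen_pb2.Input(splits=[
            example_gen_pb2.Input.Split(name='single_split', pattern='*')])),
    'output_config': json_format.MessageToJson(example_gen_pb2.Output()),
}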
