docs/src/docs/ApiReference/task.md (5 additions & 0 deletions)

@@ -233,6 +233,11 @@ An asynchronous task cannot set `multiprocessing` as `True`
See some [considerations](../UserGuide/AdvancedConcepts#cpu-bound-work) for when to set this parameter.
+ Note, also, that normal Python multiprocessing restrictions apply:
+
+ * Only [picklable](https://docs.python.org/3/library/pickle.html#module-pickle) functions can be multiprocessed, which excludes certain types of functions like lambdas and closures.
+ * Arguments and return values of multiprocessed tasks must also be picklable, which excludes objects like file handles, connections, and (on Windows) generators.
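As a minimal sketch of these restrictions (the function, field names, and values below are illustrative, not from the docs):

```python
from pyper import task

def double_value(record: dict) -> dict:
    # A module-level function is picklable, so it can be sent to another process
    return {**record, "value": record["value"] * 2}

cpu_task = task(double_value, multiprocess=True)  # OK: picklable function and arguments

# By contrast, task(lambda x: x * 2, multiprocess=True) would fail,
# because lambdas cannot be pickled and so cannot cross the process boundary.
```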

docs/src/docs/UserGuide/AdvancedConcepts.md (93 additions & 7 deletions)

@@ -39,7 +39,7 @@ IO-bound tasks benefit from both concurrent and parallel execution.
However, to avoid the overhead costs of creating processes, it is generally preferable to use either threading or async code.
{: .info}
- Threads incur a higher overhead cost compared to async coroutines, but are suitable if your application prefers or requires a synchronous implementation
+ Threads incur a higher overhead cost compared to async coroutines, but are suitable if the function / application prefers or requires a synchronous implementation
Note that asynchronous functions need to `await` or `yield` something in order to benefit from concurrency.
Any long-running call in an async task which does not yield execution will prevent other tasks from making progress:
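As a hypothetical illustration of the difference (these functions are made up for illustration, not the docs' own snippet):

```python
import asyncio

async def bad_async_step(n: int) -> int:
    # Never awaits: this CPU-heavy loop blocks the event loop,
    # so no other async task can make progress until it finishes
    return sum(i * i for i in range(n))

async def good_async_step(n: int) -> int:
    # Awaits (simulated here with a sleep), yielding control so that
    # other tasks can run in the meantime
    await asyncio.sleep(1)
    return n
```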
- Note, however, that processes incur a very high overhead cost (performance in creation and memory in maintaining inter-process communication). Specific cases should be benchmarked to fine-tune the task parameters for your program / your machine.
+ Note, however, that processes incur a very high overhead cost (performance cost in creation and memory cost in inter-process communication). Specific cases should be benchmarked to fine-tune the task parameters for your program / your machine.
### Summary
@@ -101,7 +101,7 @@ Note, however, that processes incur a very high overhead cost (performance in cr
{: .text-green-200}
**Key Considerations:**
- * If a task is doing extremely expensive CPU-bound work, define it synchronously and set `multiprocess=True`
+ * If a task is doing expensive CPU-bound work, define it synchronously and set `multiprocess=True`
* If a task is doing expensive IO-bound work, consider implementing it asynchronously, or use threads
* Do _not_ put expensive, blocking work in an async task, as this clogs up the async event loop
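Taken together, these guidelines suggest pipelines shaped roughly like the following sketch (the function names, workloads, and worker counts are illustrative assumptions):

```python
import time
from pyper import task

def fetch_report(url: str) -> bytes:
    # Expensive IO-bound work: a synchronous implementation suits threads (workers > 1)
    time.sleep(1)  # stand-in for a network call
    return b"report-data"

def parse_report(raw: bytes) -> int:
    # Expensive CPU-bound work: synchronous, with multiprocess=True
    return sum(raw)

pipeline = task(fetch_report, workers=10) | task(parse_report, multiprocess=True)
```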
@@ -111,7 +111,7 @@ Note, however, that processes incur a very high overhead cost (performance in cr
Writing clean code is partly about defining functions with single, clear responsibilities.
- In Pyper specifically, it is especially important to separate out different types of work into different tasks if we want to optimize their performance. For example, consider a task which performs an IO-bound network request along with a CPU-bound function to parse the data.
+ In Pyper, it is especially important to separate out different types of work into different tasks if we want to optimize their performance. For example, consider a task which performs an IO-bound network request along with a CPU-bound function to parse the data.
```python
# Bad -- functions not separated
@@ -165,10 +165,10 @@ When defining a pipeline, these additional arguments are plugged into tasks usin
async def main():
    async with ClientSession("http://localhost:8000/api") as session:

Generators in Python are a mechanism for _lazy execution_, whereby results in an iterable are returned one by one (via underlying calls to `__next__`) instead of within a data structure, like a `list`, which requires all of its elements to be allocated in memory.

+ Using generators is an indispensable approach for processing large volumes of data in a memory-friendly way. We can define generator functions by using the `yield` keyword within a normal `def` block:
# Subsequent tasks also cannot start executing until the entire list is created
+ @task(branch=True)
+ def create_values_in_list() -> typing.List[dict]:
+     return [{"data": i} for i in range(10_000_000)]
+ ```

+ {: .info}
+ Generator functions return immediately. They return `generator` objects, which are iterable
+
+ Using the `branch` task parameter in Pyper allows generators to generate multiple outputs, which get picked up by subsequent tasks as soon as the data is available.
+
+ Using a generator function without `branch=True` is also possible; this just means the task submits `generator` objects as output, instead of each generated value.

+ ```python
+ from pyper import task
+
+ def get_data():
+     yield 1
+     yield 2
+     yield 3
+
+ if __name__ == "__main__":
+     branched_pipeline = task(get_data, branch=True)
+     for output in branched_pipeline():
+         print(output)
+     # Prints:
+     # 1
+     # 2
+     # 3
+
+     non_branched_pipeline = task(get_data)
+     for output in non_branched_pipeline():
+         print(output)
+     # Prints:
+     # <generator object get_data at ...>
+ ```

+ ### Limitations
+
+ Implementing generator objects in a pipeline can also come with some caveats that are important to keep in mind.
+
+ {: .text-green-200}
+ **Synchronous Generators with Asynchronous Code**
+
+ Synchronous generators in an `AsyncPipeline` do not benefit from threading or multiprocessing.
+
+ This is because, in order to be scheduled in an async event loop, each synchronous task is run by a thread/process, and then wrapped in an `asyncio.Task`.
+
+ Generator functions, which return _immediately_, do most of their work outside of the thread/process, and this synchronous work will therefore not benefit from multiple workers in an async context.
+
+ The alternatives are to:
+
+ 1. Use a synchronous generator anyway (if its performance is unlikely to be a bottleneck)
+ 2. Use a normal synchronous function, and return an iterable data structure (if memory is unlikely to be a bottleneck)
+ 3. Use an async generator (if an async implementation of the function is appropriate)
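As a sketch of option 3, assuming an async generator function can be wrapped in `task` in the same way as a synchronous one (the names and values here are illustrative):

```python
import asyncio
from pyper import task

async def get_data_async():
    # An async generator: values are produced from within the event loop,
    # so generating them cooperates with other async tasks instead of blocking
    for i in range(3):
        await asyncio.sleep(0.1)  # stand-in for awaiting real IO
        yield {"data": i}

async_pipeline = task(get_data_async, branch=True)
```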

+ {: .text-green-200}
+ **Multiprocessing and Pickling**
+
+ In Python, anything that goes into and comes out of a process must be picklable.
+
+ On Windows, generator objects cannot be pickled, so cannot be passed as inputs and outputs when multiprocessing.
+
+ Note that, for example, using `branch=True` to pass individual outputs from a generator into a multiprocessed task is still fine, because the task input would not be a `generator` object.
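A rough sketch of that safe pattern (hypothetical function names and data):

```python
from pyper import task

def generate_records():
    # A generator task: with branch=True, each yielded dict is passed
    # downstream on its own, rather than as a generator object
    for i in range(1_000):
        yield {"value": i}

def crunch(record: dict) -> int:
    # Each input is a plain, picklable dict, so it can safely cross
    # the process boundary
    return record["value"] ** 2

pipeline = task(generate_records, branch=True) | task(crunch, multiprocess=True, workers=4)
```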

docs/src/docs/UserGuide/ComposingPipelines.md (10 additions & 10 deletions)

@@ -78,7 +78,7 @@ if __name__ == "__main__":
    writer(pipeline(limit=10))  # Run
```
- The `>` operator (again inspired by UNIX syntax) is used to pipe a `Pipeline` into a consumer function (any callable that takes a data stream) returning simply a function that handles the 'run' operation. This is syntactic sugar for the `Pipeline.consume` method.
+ The `>` operator (again inspired by UNIX syntax) is used to pipe a `Pipeline` into a consumer function (any callable that takes an `Iterable` of inputs), simply returning a function that handles the 'run' operation. This is syntactic sugar for the `Pipeline.consume` method.
```python
if __name__ == "__main__":
    run = step1 | step2 > JsonFileWriter("data.json")
```
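For illustration, a consumer can be any callable that accepts an `Iterable` of inputs; a minimal made-up example (separate from the `JsonFileWriter` used above) might look like:

```python
import json
from typing import Iterable

def write_json(results: Iterable[dict]) -> None:
    # Iterates over the pipeline's output stream and writes it out as JSON
    with open("output.json", "w", encoding="utf-8") as f:
        json.dump(list(results), f, indent=4)

# Usage, following the `>` syntax above: run = step1 | step2 > write_json
```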
@@ -105,26 +105,26 @@ For example, let's say we have a theoretical pipeline which takes `(source: str)
+     task(list_files, branch=True)  # Return a list of file info
+     | task(download_file, workers=20)  # Return a filepath
+     | task(decrypt_file, workers=5, multiprocess=True)  # Return a filepath
)
```
This is a function which generates multiple outputs per source. But we may wish to process _batches of filepaths_ downstream, after waiting for a single source to finish downloading. This means a piping approach, where we pass each _individual_ filepath along to subsequent tasks, won't work.
- Instead, we can define a function to create a list of filepaths as `download_files_from_source > list`. This is now a composable function which can be used in an outer pipeline.
+ Instead, we can define `download_files_from_source` as a task within an outer pipeline, which is as simple as wrapping it in `task` like we would with any other function.
```python
download_and_merge_files = (
-     task(get_sources, branch=True)
-     | task(download_files_from_source > list)
-     | task(merge_files, workers=5, multiprocess=True)
+     task(get_sources, branch=True)  # Return a list of sources
+     | task(download_files_from_source)  # Return a batch of filepaths (as a generator)
+     | task(sync_files, workers=5)  # Do something with each batch
)
```
- * `download_files_from_source > list` takes a source as input, downloads all files, and creates a list of filepaths as output.
- * `merge_files` takes a list of filepaths as input.
+ * `download_files_from_source` takes a source as input, and returns a generator of filepaths (note that we are _not_ setting `branch=True`; a batch of filepaths is being passed along per source)
+ * `sync_files` takes each batch of filepaths as input, and works on them concurrently
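For concreteness, a hypothetical `sync_files` (its real behavior is application-specific; this sketch only shows the expected shape) needs to accept one batch of filepaths per call:

```python
import typing

def sync_files(filepaths: typing.Iterable[str]) -> int:
    # Receives one batch of filepaths (the generator produced per source).
    # With workers=5 in the outer pipeline, up to five batches can be
    # handled concurrently.
    synced = 0
    for path in filepaths:
        # ... upload or copy the file somewhere (application-specific) ...
        synced += 1
    return synced
```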

docs/src/index.md (1 addition & 1 deletion)

@@ -52,7 +52,7 @@ It is designed with the following goals in mind:
* **Error Handling**: Data flows fail fast, even in long-running threads, and propagate their errors cleanly
* **Complex Data Flows**: Data pipelines support branching/joining data flows, as well as sharing contexts/resources between tasks
- In addition, Pyper provides an extensible way to write code that can be integrated with other frameworks like those aforementioned.
+ In addition, Pyper enables developers to write code in an extensible way that can be integrated naturally with other frameworks like those aforementioned.