Commit 0a127bb

update some docs, comments, docstrings
1 parent 8585e5c commit 0a127bb

7 files changed: +119 -22 lines changed


docs/src/docs/ApiReference/task.md

Lines changed: 5 additions & 0 deletions

@@ -233,6 +233,11 @@ An asynchronous task cannot set `multiprocessing` as `True`

See some [considerations](../UserGuide/AdvancedConcepts#cpu-bound-work) for when to set this parameter.

+Note, also, that normal Python multiprocessing restrictions apply:
+
+* Only [picklable](https://docs.python.org/3/library/pickle.html#module-pickle) functions can be multiprocessed, which excludes certain types of functions like lambdas and closures.
+* Arguments and return values of multiprocessed tasks must also be picklable, which excludes objects like file handles, connections, and (on Windows) generators.
+
{: .text-beta}
### `bind`
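
To illustrate the pickling restriction documented above, here is a minimal standalone sketch (not part of this commit) using Python's own `multiprocessing` module; `square` is an invented example function:

```python
import multiprocessing

def square(x: int) -> int:
    # A module-level function is picklable, so it can be sent to worker processes
    return x * x

if __name__ == "__main__":
    with multiprocessing.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
        # A lambda, by contrast, cannot be pickled, so the following would raise
        # a PicklingError:
        # pool.map(lambda x: x * x, [1, 2, 3])
```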

docs/src/docs/UserGuide/AdvancedConcepts.md

Lines changed: 93 additions & 7 deletions

@@ -39,7 +39,7 @@ IO-bound tasks benefit from both concurrent and parallel execution.
However, to avoid the overhead costs of creating processes, it is generally preferable to use either threading or async code.

{: .info}
-Threads incur a higher overhead cost compared to async coroutines, but are suitable if your application prefers or requires a synchronous implementation
+Threads incur a higher overhead cost compared to async coroutines, but are suitable if the function / application prefers or requires a synchronous implementation

Note that asynchronous functions need to `await` or `yield` something in order to benefit from concurrency.
Any long-running call in an async task which does not yield execution will prevent other tasks from making progress:
@@ -87,7 +87,7 @@ def long_computation(data: int):
    return data
```

-Note, however, that processes incur a very high overhead cost (performance in creation and memory in maintaining inter-process communication). Specific cases should be benchmarked to fine-tune the task parameters for your program / your machine.
+Note, however, that processes incur a very high overhead cost (performance cost in creation and memory cost in inter-process communication). Specific cases should be benchmarked to fine-tune the task parameters for your program / your machine.

### Summary

@@ -101,7 +101,7 @@ Note, however, that processes incur a very high overhead cost (performance in cr
{: .text-green-200}
**Key Considerations:**

-* If a task is doing extremely expensive CPU-bound work, define it synchronously and set `multiprocess=True`
+* If a task is doing expensive CPU-bound work, define it synchronously and set `multiprocess=True`
* If a task is doing expensive IO-bound work, consider implementing it asynchronously, or use threads
* Do _not_ put expensive, blocking work in an async task, as this clogs up the async event loop
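
As a rough illustration of these considerations (a sketch only, not taken from the repository; the function names and workload are invented), a pipeline might keep IO-bound work in async tasks and push CPU-bound work into a multiprocessed task:

```python
import asyncio
from pyper import task

def list_urls():
    # Invented source of work
    return ["https://example.com/a", "https://example.com/b"]

async def fetch_page(url: str) -> str:
    # IO-bound: an async implementation lets other tasks make progress while waiting
    await asyncio.sleep(0.1)  # stand-in for a network request
    return f"<html>{url}</html>"

def parse_page(html: str) -> int:
    # CPU-bound: defined synchronously so it can be multiprocessed
    return sum(ord(c) for c in html)

async def main():
    pipeline = (
        task(list_urls, branch=True)
        | task(fetch_page, workers=10)
        | task(parse_page, multiprocess=True)
    )
    async for result in pipeline():
        print(result)

if __name__ == "__main__":
    asyncio.run(main())
```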

@@ -111,7 +111,7 @@ Note, however, that processes incur a very high overhead cost (performance in cr

Writing clean code is partly about defining functions with single, clear responsibilities.

-In Pyper specifically, it is especially important to separate out different types of work into different tasks if we want to optimize their performance. For example, consider a task which performs an IO-bound network request along with a CPU-bound function to parse the data.
+In Pyper, it is especially important to separate out different types of work into different tasks if we want to optimize their performance. For example, consider a task which performs an IO-bound network request along with a CPU-bound function to parse the data.

```python
# Bad -- functions not separated
@@ -165,10 +165,10 @@ When defining a pipeline, these additional arguments are plugged into tasks usin
async def main():
    async with ClientSession("http://localhost:8000/api") as session:
        user_data_pipeline = (
-            task(list_user_ids, branch=True, bind=task.bind(session=session))
+            task(list_user_ids, branch=True)
            | task(fetch_user_data, workers=10, bind=task.bind(session=session))
        )
-        async for output in user_data_pipeline():
+        async for output in user_data_pipeline(session):
            print(output)
```

@@ -208,4 +208,90 @@ async def main():
        > copy_to_db
    )
    await run()
-```
+```
+
+## Generators
+
+### Usage
+
+Generators in Python are a mechanism for _lazy execution_, whereby results in an iterable are returned one by one (via underlying calls to `__next__`) instead of within a data structure, like a `list`, which requires all of its elements to be allocated in memory.
+
+Using generators is an indispensible approach for processing large volumes of data in a memory-friendly way. We can define generator functions by using the `yield` keyword within a normal `def` block:
+
+```python
+import typing
+from pyper import task
+
+# Okay
+@task(branch=True)
+def generate_values_lazily() -> typing.Iterable[dict]:
+    for i in range(10_000_000):
+        yield {"data": i}
+
+# Bad -- this creates 10 million values in memory
+# Subsequent tasks also cannot start executing until the entire list is created
+@task(branch=True)
+def create_values_in_list() -> typing.List[dict]:
+    return [{"data": i} for i in range(10_000_000)]
+```
+
+{: .info}
+Generator `functions` return immediately. They return `generator` objects, which are iterable
+
+Using the `branch` task parameter in Pyper allows generators to generate multiple outputs, which get picked up by subsequent tasks as soon as the data is available.
+
+Using a generator function without `branch=True` is also possible; this just means the task submits `generator` objects as output, instead of each generated value.
+
+```python
+from pyper import task
+
+def get_data():
+    yield 1
+    yield 2
+    yield 3
+
+if __name__ == "__main__":
+    branched_pipeline = task(get_data, branch=True)
+    for output in branched_pipeline():
+        print(output)
+    # Prints:
+    # 1
+    # 2
+    # 3
+
+    non_branched_pipeline = task(get_data)
+    for output in non_branched_pipeline():
+        print(output)
+    # Prints:
+    # <generator object get_data at ...>
+```
+
+### Limitations
+
+Implementing generator objects in a pipeline can also come with some caveats that are important to keep in mind.
+
+{: .text-green-200}
+**Synchronous Generators with Asynchronous Code**
+
+Synchronous generators in an `AsyncPipeline` do not benefit from threading or multiprocessing.
+
+This is because, in order to be scheduled in an async event loop, each synchronous task is run by a thread/process, and then wrapped in an `asyncio.Task`.
+
+Generator functions, which return _immediately_, do most of their work outside of the thread/process and this synchronous work will therefore not benefit from multiple workers in an async context.
+
+The alternatives are to:
+
+1. Use a synchronous generator anyway (if its performance is unlikely to be a bottleneck)
+
+2. Use a normal synchronous function, and return an iterable data structure (if memory is unlikely to be a bottleneck)
+
+3. Use an async generator (if an async implementation of the function is appropriate)
+
+{: .text-green-200}
+**Multiprocessing and Pickling**
+
+In Python, anything that goes into and comes out of a process must be picklable.
+
+On Windows, generator objects cannot be pickled, so cannot be passed as inputs and outputs when multiprocessing.
+
+Note that, for example, using `branch=True` to pass individual outputs from a generator into a multiprocessed task is still fine, because the task input would not be a `generator` object.
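
A short sketch of that last point (the function names are invented; this example is not part of the commit): the generator object stays within the branching task, and only its individual, picklable outputs are passed into the multiprocessed task:

```python
from pyper import task

def generate_records():
    # The generator object never crosses a process boundary
    for i in range(5):
        yield {"data": i}

def heavy_transform(record: dict) -> int:
    # Each input here is a plain dict, which pickles without issue
    return record["data"] ** 2

if __name__ == "__main__":
    pipeline = (
        task(generate_records, branch=True)  # emits individual dicts
        | task(heavy_transform, workers=2, multiprocess=True)
    )
    for output in pipeline():
        print(output)
```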

docs/src/docs/UserGuide/ComposingPipelines.md

Lines changed: 10 additions & 10 deletions

@@ -78,7 +78,7 @@ if __name__ == "__main__":
    writer(pipeline(limit=10)) # Run
```

-The `>` operator (again inspired by UNIX syntax) is used to pipe a `Pipeline` into a consumer function (any callable that takes a data stream) returning simply a function that handles the 'run' operation. This is syntactic sugar for the `Pipeline.consume` method.
+The `>` operator (again inspired by UNIX syntax) is used to pipe a `Pipeline` into a consumer function (any callable that takes an `Iterable` of inputs) returning simply a function that handles the 'run' operation. This is syntactic sugar for the `Pipeline.consume` method.
```python
if __name__ == "__main__":
    run = step1 | step2 > JsonFileWriter("data.json")
@@ -105,26 +105,26 @@ For example, let's say we have a theoretical pipeline which takes `(source: str)

```python
download_files_from_source = (
-    task(list_files, branch=True)
-    | task(download_file, workers=20)
-    | task(decrypt_file, workers=5, multiprocess=True)
+    task(list_files, branch=True) # Return a list of file info
+    | task(download_file, workers=20) # Return a filepath
+    | task(decrypt_file, workers=5, multiprocess=True) # Return a filepath
)
```

This is a function which generates multiple outputs per source. But we may wish to process _batches of filepaths_ downstream, after waiting for a single source to finish downloading. This means a piping approach, where we pass each _individual_ filepath along to subsequent tasks, won't work.

-Instead, we can define a function to create a list of filepaths as `download_files_from_source > list`. This is now a composable function which can be used in an outer pipeline.
+Instead, we can define `download_files_from_source` as a task within an outer pipeline, which is as simple as wrapping it in `task` like we would with any other function.

```python
download_and_merge_files = (
-    task(get_sources, branch=True)
-    | task(download_files_from_source > list)
-    | task(merge_files, workers=5, multiprocess=True)
+    task(get_sources, branch=True) # Return a list of sources
+    | task(download_files_from_source) # Return a batch of filepaths (as a generator)
+    | task(sync_files, workers=5) # Do something with each batch
)
```

-* `download_files_from source > list` takes a source as input, downloads all files, and creates a list of filepaths as output.
-* `merge_files` takes a list of filepaths as input.
+* `download_files_from_source` takes a source as input, and returns a generator of filepaths (note that we are _not_ setting `branch=True`; a batch of filepaths is being passed along per source)
+* `sync_files` takes each batch of filepaths as input, and works on them concurrently

## Asynchronous Code
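
For reference, the consumer side of the `>` operator changed above can be any callable that accepts the pipeline's output iterable; a minimal sketch follows (the `JsonFileWriter` implementation here is assumed for illustration, not taken from the repository):

```python
import json
from pyper import task

def get_numbers(limit: int):
    return range(limit)

def double(x: int) -> int:
    return x * 2

class JsonFileWriter:
    def __init__(self, filepath: str):
        self.filepath = filepath

    def __call__(self, data):
        # A consumer is any callable that takes the pipeline's output iterable
        with open(self.filepath, "w", encoding="utf-8") as f:
            json.dump(list(data), f, indent=4)

if __name__ == "__main__":
    # Syntactic sugar for JsonFileWriter("data.json")(pipeline(limit=10))
    run = task(get_numbers, branch=True) | task(double) > JsonFileWriter("data.json")
    run(limit=10)
```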

docs/src/index.md

Lines changed: 1 addition & 1 deletion

@@ -52,7 +52,7 @@ It is designed with the following goals in mind:
* **Error Handling**: Data flows fail fast, even in long-running threads, and propagate their errors cleanly
* **Complex Data Flows**: Data pipelines support branching/joining data flows, as well as sharing contexts/resources between tasks

-In addition, Pyper provides an extensible way to write code that can be integrated with other frameworks like those aforementioned.
+In addition, Pyper enables developers to write code in an extensible way that can be integrated naturally with other frameworks like those aforementioned.

## Installation

pyproject.toml

Lines changed: 1 addition & 0 deletions

@@ -36,6 +36,7 @@ dependencies = [
]

[project.urls]
+Homepage = "https://pyper-dev.github.io/pyper/"
Documentation = "https://pyper-dev.github.io/pyper/"
Repository = "https://github.com/pyper-dev/pyper"
Issues = "https://github.com/pyper-dev/pyper/issues"

src/pyper/_core/async_helper/queue_io.py

Lines changed: 5 additions & 4 deletions

@@ -1,6 +1,6 @@
from __future__ import annotations

-from collections.abc import Iterable
+from collections.abc import AsyncIterable, Iterable
from typing import TYPE_CHECKING

from ..util.sentinel import StopSentinel

@@ -61,10 +61,11 @@ async def __call__(self, *args, **kwargs):

class _BranchingAsyncEnqueue(_AsyncEnqueue):
    async def __call__(self, *args, **kwargs):
-        if self.task.is_gen:
-            async for output in self.task.func(*args, **kwargs):
+        result = self.task.func(*args, **kwargs)
+        if isinstance(result, AsyncIterable):
+            async for output in result:
                await self.q_out.put(output)
-        elif isinstance(result := await self.task.func(*args, **kwargs), Iterable):
+        elif isinstance(result := await result, Iterable):
            for output in result:
                await self.q_out.put(output)
        else:
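
The distinction this change relies on is that calling an async generator function returns an async generator object immediately (no `await` needed), whereas a coroutine function returns a coroutine that must be awaited first. A standalone illustration of the check (not pyper code):

```python
import asyncio
from collections.abc import AsyncIterable, Iterable

async def agen():
    yield 1
    yield 2

async def coro():
    return [1, 2]

async def main():
    result = agen()  # async generator object, returned immediately
    print(isinstance(result, AsyncIterable))  # True
    print([x async for x in result])          # [1, 2]

    result = coro()  # coroutine object: not AsyncIterable, must be awaited
    print(isinstance(result, AsyncIterable))   # False
    print(isinstance(await result, Iterable))  # True

asyncio.run(main())
```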

src/pyper/_core/util/asynchronize.py

Lines changed: 4 additions & 0 deletions

@@ -18,10 +18,14 @@ def ascynchronize(task: Task, tp: ThreadPoolExecutor, pp: ProcessPoolExecutor) -
        return task

    if task.is_gen and task.branch:
+        # Small optimization to convert sync generators to async generators
+        # This saves from having to use a thread/process just to get the generator object
+        # We also add asyncio.sleep(0) to unblock long synchronous generators
        @functools.wraps(task.func)
        async def wrapper(*args, **kwargs):
            for output in task.func(*args, **kwargs):
                yield output
+                await asyncio.sleep(0)
    else:
        executor = pp if task.multiprocess else tp
        @functools.wraps(task.func)
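
The pattern added here, wrapping a synchronous generator as an async generator and yielding control back to the event loop after each item, can be sketched standalone as follows (illustrative only; this is not the pyper wrapper itself):

```python
import asyncio
import functools

def sync_gen():
    for i in range(3):
        yield i

def asynchronize_gen(func):
    # Wrap a synchronous generator function as an async generator,
    # sleeping for 0 seconds after each item so other scheduled tasks can run
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        for output in func(*args, **kwargs):
            yield output
            await asyncio.sleep(0)
    return wrapper

async def main():
    async for value in asynchronize_gen(sync_gen)():
        print(value)

asyncio.run(main())
```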
