Restructure the import pipeline as a state machine, and adopt asyncio for performance #5736
Replies: 1 comment
-
On further consideration, I've convinced myself this isn't a feasible or desirable direction. See peterjdolan@234e41a for some details. My draft change was getting far too large, so instead of replacing the existing Pipeline entirely, I tried using the AsyncStateMachine to implement the Pipeline's internal parallel processing loop. That led to a number of issues, including some that I suspect are related to the library module's initialization in multiple threads, which I couldn't easily diagnose. Ultimately I had enough trouble that I couldn't even get a baseline performance measurement. So: a lot of extra complexity, with questionable practical gains.
-
Hi all,
This is a bit of a radical idea, but I want to put it out there and see if there's any interest.
I was poking around the import command trying to figure out why it was running so slowly, and I spotted a few things I think could be improved.
I've written up a draft PR demonstrating these ideas, here: #5735. I'm currently using this branch to import and re-tag items in my personal library.
Here's a quick example (adapted from the state machine's unit test) that illustrates the suggestions below:
beets/test/util/test_async_state_machine.py, lines 117–145 in f15beda
Rewrite the import pipeline as a state machine
The import pipeline is currently structured as a sequence of stages, each performing some significant piece of work. There are two problems with this: the import logic is not actually sequential, and it is not possible to skip stages.
By non-sequential operations I specifically mean the cases where the user selects "import as tracks" or "group albums." In both cases, the pipeline splits the ImportTask into singletons, or into one or more ImportTasks whose items share the same artist-album combination. Right now this is handled by creating a new Pipeline within the "user choice" stage and running it forward until the user makes a new set of choices.
beets/beets/importer.py, lines 1595–1614 in 3c177b5
This also requires communication mechanisms between child and parent pipelines to support early-exiting the sub-pipelines. Whew!
beets/beets/importer.py, lines 1587–1588 in 3c177b5
The inflexibility of the import pipeline also means that it is not possible to skip stages. As a result, every stage must be aware that the user may have chosen to skip importing an album, and most stages begin by checking whether the task is skipped and, if so, passing it to the next stage unmodified.
beets/beets/importer.py, lines 1549–1558 in 3c177b5
I'm sure that a sequential import model made perfect sense when it was first developed, but the code is now quite complex, difficult to understand, and, I would suggest, quite brittle. When I tried to make some small changes, I easily broke everything.
I suggest that a better model for the import command would be a state machine, in which individual import operations are represented as states, with explicit transitions between them. This would directly model the cycles in the import flow, and would also support skipping an import task straight to the conclusion of the pipeline.
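To make the idea concrete, here's a minimal sketch of what such a transition table might look like. The state names and the `task` dict are purely illustrative, not taken from the draft PR:

```python
import enum

# Illustrative states only -- not the names used in the draft PR.
class State(enum.Enum):
    LOOKUP = "lookup"            # fetch metadata candidates
    USER_CHOICE = "user_choice"  # wait for the user's decision
    APPLY = "apply"              # write changes to the library
    DONE = "done"

def next_state(state, task):
    """Pure transition function: (state, task) -> next state."""
    if state is State.LOOKUP:
        return State.USER_CHOICE
    if state is State.USER_CHOICE:
        if task.get("skip"):
            # Skip jumps straight to DONE -- no per-stage skip checks.
            return State.DONE
        if task.get("regroup"):
            # "Group albums" / "import as tracks" is just a cycle
            # back to LOOKUP, modeled directly instead of via a
            # nested child Pipeline.
            return State.LOOKUP
        return State.APPLY
    return State.DONE

# Drive one task through the machine.
state, task = State.LOOKUP, {"skip": True}
while state is not State.DONE:
    state = next_state(state, task)
```

The point is that both problem cases (cycles and skipping) become ordinary transitions rather than special control-flow machinery.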
Adopt Python's asyncio concurrency
I admit that asyncio can be a bit abstruse for people unfamiliar with it, but I've found it's absolutely the right tool for the job when the job involves lots of I/O-bound work.
Essentially, with asyncio, you can write code that gains all the benefits of performing I/O-intensive work in a separate thread without blocking the main program, but without the overhead of actual threads or the complexity of maintaining thread safety.
If the import process adopted asyncio, it could perform work concurrently while waiting for the user to decide what to do with an import item, without the challenge of maintaining multiple threads to divide up the processing stages. It would also make all I/O operations naturally non-blocking and concurrent, including, for example, queries to MusicBrainz or other metadata backends while searching for the right album match; currently, these queries run sequentially in the thread assigned to interacting with the user. Put another way, the concurrency would be much more fine-grained, and we wouldn't need to manage it ourselves.
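For instance, concurrent backend lookups are a one-liner with `asyncio.gather`. This is illustrative only: `fetch_candidates` is an invented stand-in that simulates a network round trip, not a real beets function:

```python
import asyncio

# Hypothetical helper: simulates one metadata-backend query.
async def fetch_candidates(backend):
    await asyncio.sleep(0.01)  # stands in for an HTTP round trip
    return f"{backend}: candidates"

async def search_all():
    # Both queries run concurrently; total wall time is roughly one
    # round trip instead of the sum of all of them.
    return await asyncio.gather(
        fetch_candidates("musicbrainz"),
        fetch_candidates("discogs"),
    )

results = asyncio.run(search_all())
```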
Fully adopting asyncio is probably infeasible, as many dependencies do not support it (Mutagen in particular, via MediaFile). For those operations, it's straightforward to schedule the work on a ThreadPoolExecutor, and the code to do so is no more complicated than if they supported asyncio natively:
MediaFile as-is:
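A self-contained sketch of today's blocking pattern. A stub class stands in for the real `mediafile.MediaFile` so the example runs without an audio file; the key point is that the constructor itself performs the file I/O:

```python
# Stub standing in for `from mediafile import MediaFile`.
class MediaFile:
    def __init__(self, path):
        # In the real library, __init__ opens and parses the file,
        # so constructing the object blocks on disk I/O.
        self.path = path
        self.title = "stub title"

mf = MediaFile("song.mp3")  # blocks until the file is fully read
```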
MediaFile if it supported asyncio:
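A hypothetical sketch: `MediaFile.create` is an invented async factory, not a real mediafile API. The stub below makes it runnable:

```python
import asyncio

# Hypothetical async variant of MediaFile (not the real library API).
class MediaFile:
    def __init__(self, path, title):
        self.path = path
        self.title = title

    @classmethod
    async def create(cls, path):
        # A real implementation would await non-blocking reads here,
        # keeping I/O out of __init__ entirely.
        await asyncio.sleep(0)
        return cls(path, "stub title")

async def main():
    mf = await MediaFile.create("song.mp3")
    return mf.title

title = asyncio.run(main())
```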
(note that this is a good example of why it's often not a good idea to perform I/O operations in class constructors)
MediaFile use in an async function using a ThreadPoolExecutor:
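A runnable sketch of the executor pattern; `read_tags` is a stand-in for the blocking `mediafile.MediaFile(path)` call:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def read_tags(path):
    # Blocking work; in beets this would be mediafile.MediaFile(path).
    return {"path": path, "title": "stub title"}

async def read_tags_async(pool, path):
    # Run the blocking call on a worker thread; the event loop stays
    # free to do other work (e.g. talk to the user) in the meantime.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, read_tags, path)

async def main():
    with ThreadPoolExecutor(max_workers=4) as pool:
        return await asyncio.gather(
            read_tags_async(pool, "a.mp3"),
            read_tags_async(pool, "b.mp3"),
        )

results = asyncio.run(main())
```

As claimed above, the call site is a single `await`, the same shape it would have if the library were natively async.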
Note also that asyncio has gained some significant improvements in recent releases, particularly task groups in Python 3.11, but all the functionality we need is already present in Python 3.9.
Thanks!
Thanks for reading this far! I'm looking forward to hearing your thoughts and suggestions.