Restructure the import pipeline as a state machine, and adopt asyncio for performance #5736
Replies: 1 comment
-
On further consideration, I've convinced myself this isn't a feasible or desirable direction. See peterjdolan@234e41a for some details. My draft change was getting far too large, so instead of replacing the existing Pipeline entirely, I tried using the AsyncStateMachine to implement the Pipeline's internal parallel processing loop. That led to a number of issues, including some that I suspect are related to the library module's initialization in multiple threads, which I couldn't easily diagnose. Ultimately I had enough trouble that I couldn't even get a baseline performance measurement. So: a lot of extra complexity, with questionable practical gains.
-
Hi all,
This is a bit of a radical idea, but I want to put it out there and see if there's any interest.
I was poking around the import command trying to figure out why it was running so slowly, and I spotted a few things I think could be improved.
I've written up a draft PR demonstrating these ideas, here: #5735. I'm currently using this branch to import and re-tag items in my personal library.
Here's a quick example (adapted from the state machine's unit test) that illustrates the suggestions below:
beets/test/util/test_async_state_machine.py, lines 117–145 in f15beda
Rewrite the import pipeline as a state machine
The import pipeline is currently structured as a sequence of stages, each performing some significant piece of work. There are two problems with this: the import logic is not actually sequential, and it is not possible to skip stages.
By non-sequential operations I specifically mean the cases where the user selects "import as tracks" or "group albums." In both cases, the pipeline splits the ImportTask into singletons, or into one or more ImportTasks whose items share the same artist-album combination. Right now this is handled by creating a new Pipeline within the "user choice" stage and running it forward until the user makes a new set of choices.
beets/beets/importer.py, lines 1595–1614 in 3c177b5
This also requires communication mechanisms between child and parent pipelines to support early-exiting the sub-pipelines. Whew!
beets/beets/importer.py, lines 1587–1588 in 3c177b5
The inflexibility of the import pipeline also means that it is not possible to skip stages. As a result, every stage must be aware that the user may have chosen to skip importing an album, and most stages begin by checking whether the task is skipped and, if so, passing it to the next stage unmodified.
beets/beets/importer.py, lines 1549–1558 in 3c177b5
I'm sure that a sequential import model made perfect sense when it was first developed, but the code is now quite complex, difficult to understand, and, I would suggest, quite brittle. When I tried to make some small changes, I easily broke everything.
I suggest that a better model for the import command would be a state machine, in which individual import operations are represented as states, with explicit transitions between them. This would directly model the cycles in the import flow, and would also support skipping an import task straight to the conclusion of the pipeline.
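To make the idea concrete, here's a minimal sketch of what such a transition table might look like. The state names and the `task` dict are purely illustrative, not taken from the draft PR:

```python
import enum

# Illustrative states only -- not the names used in the draft PR.
class State(enum.Enum):
    LOOKUP = "lookup"            # fetch metadata candidates
    USER_CHOICE = "user_choice"  # wait for the user's decision
    APPLY = "apply"              # write changes to the library
    DONE = "done"

def next_state(state, task):
    """Pure transition function: (state, task) -> next state."""
    if state is State.LOOKUP:
        return State.USER_CHOICE
    if state is State.USER_CHOICE:
        if task.get("skip"):
            # Skip jumps straight to DONE -- no per-stage skip checks.
            return State.DONE
        if task.get("regroup"):
            # "Group albums" / "import as tracks" is just a cycle
            # back to LOOKUP, modeled directly instead of via a
            # nested child Pipeline.
            return State.LOOKUP
        return State.APPLY
    return State.DONE

# Drive one task through the machine.
state, task = State.LOOKUP, {"skip": True}
while state is not State.DONE:
    state = next_state(state, task)
```

The point is that both problem cases (cycles and skipping) become ordinary transitions rather than special control-flow machinery.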
Adopt Python's asyncio concurrency
I admit that asyncio can be a bit abstruse for people unfamiliar with it, but I've found it's absolutely the right tool for the job when the job involves lots of I/O-bound work.
Essentially, with asyncio, you can write code that gains all the benefits of performing I/O-intensive work in a separate thread without blocking the main program, but without the overhead of actual threads or the complexity of maintaining thread safety.
If the import process adopted asyncio, it could perform work concurrently while waiting for the user to decide what to do with an import item, without the challenge of maintaining multiple threads to divide up the processing stages. It would also make all I/O operations naturally non-blocking and concurrent, including, for example, queries to MusicBrainz or other metadata backends while searching for the right album match; currently, these queries run sequentially in the thread assigned to interacting with the user. Put another way, the concurrency would be much more fine-grained, and we wouldn't need to manage it ourselves.
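For instance, concurrent backend lookups are a one-liner with `asyncio.gather`. This is illustrative only: `fetch_candidates` is an invented stand-in that simulates a network round trip, not a real beets function:

```python
import asyncio

# Hypothetical helper: simulates one metadata-backend query.
async def fetch_candidates(backend):
    await asyncio.sleep(0.01)  # stands in for an HTTP round trip
    return f"{backend}: candidates"

async def search_all():
    # Both queries run concurrently; total wall time is roughly one
    # round trip instead of the sum of all of them.
    return await asyncio.gather(
        fetch_candidates("musicbrainz"),
        fetch_candidates("discogs"),
    )

results = asyncio.run(search_all())
```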
Fully adopting asyncio is probably infeasible, as many dependencies do not support it (Mutagen in particular, via MediaFile). For those operations, it's straightforward to schedule the work on a ThreadPoolExecutor, and the code to do so is no more complicated than if they supported asyncio natively:
MediaFile as-is:
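A self-contained sketch of today's blocking pattern. A stub class stands in for the real `mediafile.MediaFile` so the example runs without an audio file; the key point is that the constructor itself performs the file I/O:

```python
# Stub standing in for `from mediafile import MediaFile`.
class MediaFile:
    def __init__(self, path):
        # In the real library, __init__ opens and parses the file,
        # so constructing the object blocks on disk I/O.
        self.path = path
        self.title = "stub title"

mf = MediaFile("song.mp3")  # blocks until the file is fully read
```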
MediaFile if it supported asyncio:
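A hypothetical sketch: `MediaFile.create` is an invented async factory, not a real mediafile API. The stub below makes it runnable:

```python
import asyncio

# Hypothetical async variant of MediaFile (not the real library API).
class MediaFile:
    def __init__(self, path, title):
        self.path = path
        self.title = title

    @classmethod
    async def create(cls, path):
        # A real implementation would await non-blocking reads here,
        # keeping I/O out of __init__ entirely.
        await asyncio.sleep(0)
        return cls(path, "stub title")

async def main():
    mf = await MediaFile.create("song.mp3")
    return mf.title

title = asyncio.run(main())
```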
(note that this is a good example of why it's often not a good idea to perform I/O operations in class constructors)
MediaFile use in an async function using a ThreadPoolExecutor:
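A runnable sketch of the executor pattern; `read_tags` is a stand-in for the blocking `mediafile.MediaFile(path)` call:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def read_tags(path):
    # Blocking work; in beets this would be mediafile.MediaFile(path).
    return {"path": path, "title": "stub title"}

async def read_tags_async(pool, path):
    # Run the blocking call on a worker thread; the event loop stays
    # free to do other work (e.g. talk to the user) in the meantime.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, read_tags, path)

async def main():
    with ThreadPoolExecutor(max_workers=4) as pool:
        return await asyncio.gather(
            read_tags_async(pool, "a.mp3"),
            read_tags_async(pool, "b.mp3"),
        )

results = asyncio.run(main())
```

As claimed above, the call site is a single `await`, the same shape it would have if the library were natively async.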
Note also that asyncio has gained some significant improvements in recent releases, particularly task groups in Python 3.11, but all the functionality we need is already present in Python 3.9.
Thanks!
Thanks for reading this far! I'm looking forward to hearing your thoughts and suggestions.