Derivative sync plugin problem #60

p-zach · 2022-04-24T04:10:01Z

I'm writing a minimum viable product for derivative sync functions in order to understand them better and share here for other people to look at. It works (as in, it fetches data and creates a properly formed child data pipe) but it still produces an error/warning that I'm not sure how to fix.
The error is:
Failed to sync pipe 'plugin_test_derivative_a_derived_pipe_1' with exception: ''NoneType' object has no attribute 'to_sql''
For context, the plugin is called test_derivative, it's named "a", and the derived pipe is derived_pipe_1.

Here is the code:

# Minimum working example of derivative sync plugins

# More samples of register, fetch, and sync can be found in the mrsm docs

required = ['pandas', 'random', 'datetime']

# Minimum register function
def register(pipe, **kw):
    # Indicates which column acts as the datetime column
    return {
        'columns': {
            # Tells mrsm that the timestamp column is the datetime column
            'datetime': 'timestamp',
        },
    }

# Get the base data that you will do operations on
# This function gets random data.
def fetch(pipe, **kw):
    # Import statements go in function body
    import pandas as pd
    import random
    import datetime

    now = datetime.datetime.now()

    # Initialize dataframe
    df = pd.DataFrame()

    # Generate random data
    for i in range(3):
        data = {
            'timestamp': now + datetime.timedelta(seconds=i),
            'random1': random.randint(1, 100),
            'random2': random.randint(101, 200)
        }
        # This uses just one method of populating the dataframe (concatenating successive dictionaries)
        # Another method is retrieving a whole DF from some API
        # Good examples are in the mrsm docs
        df_new = pd.DataFrame([data])
        df = pd.concat([df, df_new], ignore_index=True)

    # Return fetched data
    return df

# The sync function is what creates derivative pipes.
def sync(pipe, **kw):
    import meerschaum as mrsm

    # Get data using fetch function above
    pipe.sync(fetch(pipe, **kw), **kw)

    # Create child pipe
    child_pipe = mrsm.Pipe(
        # Carry over the original pipe's connector and metric keys
        pipe.connector_keys,
        pipe.metric_key,
        # Name the child pipe (required; child pipe will not be created if it does not have a unique name)
        'derived_pipe_1',
        # Add new derived columns
        # Initially set to empty lists; they are populated in the remainder of this function
        columns = pipe.columns.update({'deriv_random1': [], 'deriv_random2': []})
    )

    # Get the data from the pipe
    # alternatively: use pipe.get_backtrack_data(X) to only get data from the past X minutes
    # get_backtrack_data(X) is, of course, more efficient for large datasets.
    fetched_data = pipe.get_data()

    # copy fetched data to child data and add 2 derived columns
    # df.assign copies data and adds new specified columns based on existing data
    child_data = fetched_data.assign(
        # These new column names need to match the ones added above
        # The parameter handed to these functions is the original dataframe (fetched_data)

        # Example derived column 1: delegating the data derivation to a function defined elsewhere
        deriv_random1=derive1,
        # Example derived column 2: lambda function
        deriv_random2=lambda row: row.random2 + 0.5
        # also works: deriv_random2=df['random2'] + 0.5
    )

    # Add the fetched and additional data to the child pipe
    return child_pipe.sync(child_data, **kw)

# Example function for creating derivative data
# Must be a function with 1 argument that represents the original dataframe, called "row" here
def derive1(row):
    return row.random1 * 2
    # also works: return row['random1'] * 2

Edit: Also, any suggestions for improving conciseness/readability so other people can use this as a resource?

Answered by bmeares

bmeares · 2022-04-24T08:26:52Z

bmeares
Apr 24, 2022
Maintainer

Hey @p-zach, thanks for opening a question! I just played around with your plugin, and the issue stems from sync pipes selecting both the parent and child pipes, so the plugin is executed twice at the same time. The first sync works as expected because the child doesn't exist yet, but because the child has different columns from the parent (hence the warning about the DataFrames' shapes before the error), syncing the child pipe directly fails on line 51.

Because syncing the parent updates the child as well, you need to exit the function if the child is synced directly. To avoid the exception, add a quick check at the top of sync(pipe) to ensure that pipe is actually the parent:

import meerschaum as mrsm

def register(pipe: mrsm.Pipe, **kw):
    return {'columns': {'datetime': 'timestamp'}}

def fetch(pipe: mrsm.Pipe, **kw):
    ...

def get_child_data(parent_df):
    ...

def sync(pipe: mrsm.Pipe, **kw):
    """
    Sync the parent and child in the same process.
    """
    ### Only continue if we're dealing with the parent pipe.
    if pipe.location_key is not None:
        return True, "Success"
    
    parent_df = fetch(pipe, **kw)
    pipe.sync(parent_df, **kw)
    child_pipe = mrsm.Pipe(
        pipe.connector_keys, pipe.metric_key, 'child',
        columns=pipe.columns, ### Optional
    )
    child_data = get_child_data(parent_df)
    return child_pipe.sync(child_data, **kw)

Also, you only need to specify the columns for the child if the datetime or id column is different; when the child is registered during its first sync, it gets its column names from register() unless you pass a dictionary to columns. On line 62, you pass pipe.columns.update() as the columns, but that doesn't have any effect because update() on a dictionary mutates it in place and returns None, so columns ends up being None.
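To make the update() pitfall concrete, here is a minimal sketch that uses a plain dictionary as a stand-in for pipe.columns. The parent_columns name and the 'id': 'station' entry are illustrative only, not taken from the plugin.

```python
# Stand-in for pipe.columns (assumption: a real pipe may hold other keys).
parent_columns = {'datetime': 'timestamp'}

# dict.update() mutates the dictionary in place and returns None,
# so the value passed along here is None, not a dictionary.
result = dict(parent_columns).update({'id': 'station'})
print(result)  # None

# Building a new dictionary yields the merged columns
# and leaves the parent's dictionary untouched.
child_columns = {**parent_columns, 'id': 'station'}
print(child_columns)
```

Unpacking into a new dictionary also avoids accidentally changing the parent's columns, which update() on pipe.columns itself would do.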

About the in-function imports: in most cases, it's the convention to import at the top of the module, but because all of the plugins are imported each time mrsm is called, it's better to only load heavy libraries like pandas when they're needed. If you import sub-modules into your plugin after sync is called, then it's ok to import at the module-level in those files. One interesting way to import all of your packages at the top of the file without a performance penalty is to use lazy_import() or attempt_import():

>>> from meerschaum.utils.packages import lazy_import, attempt_import
>>> ### One at a time
>>> pd = lazy_import('pandas')
>>> np = lazy_import('numpy')
>>> 
>>> ### One or more at once
>>> pd, np = attempt_import('pandas', 'numpy', venv='derivative_test', deactivate=False)
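The snippet above uses Meerschaum's own helpers. For anyone curious about the general idea of deferring a heavy import, the following is a minimal sketch built on the standard library's importlib.util.LazyLoader. It is not how lazy_import() is implemented, and the stdlib_lazy_import name is hypothetical; colorsys is used only because it is a small standard-library module.

```python
import importlib.util
import sys

def stdlib_lazy_import(name):
    """Return a module whose body only runs on first attribute access."""
    # Reuse the module if something else has already imported it.
    if name in sys.modules:
        return sys.modules[name]
    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ImportError(f"No module named {name!r}")
    # Wrap the real loader so execution is postponed.
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)
    return module

# Cheap at this point: the module body has not run yet.
colorsys = stdlib_lazy_import('colorsys')

# The first attribute access triggers the actual import.
hsv = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)
```

The trade-off is that an import error surfaces at first use instead of at the top of the file.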

Finally, for code styling, I recommend reading some of the PEP8 guidelines, and for readability, I recommend getting into the habit of splitting blocks of code into as many small functions as possible. Over time, you'll get a feeling for which level of abstraction you're working in, like how the entry point function (sync() in this case) usually has the highest level concepts and almost reads like poetry with well-written function names. Here's an interesting series of talks from the speaker Robert "Uncle Bob" Martin called Clean Code where he gives tips for writing readable code.

I hope this cleared some things up! I'm cleaning up a lot of the documentation for the v0.6.0 release (soon to be v0.6.1) which should make things easier! For example, I'm exposing the Plugin class so that cross-pollinating between plugins should be easier.

Edit: The location_key is what differentiates parent from child in this case, so I hard-coded the location as 'child' in this example.
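As a rough illustration of that check, here is a sketch that runs without Meerschaum installed. FakePipe is a hypothetical stand-in for mrsm.Pipe that carries only the three keys, is_parent is an illustrative helper name, and the connector keys are assumed from the pipe name in the error message.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FakePipe:
    """Hypothetical stand-in for mrsm.Pipe holding only the three keys."""
    connector_keys: str
    metric_key: str
    location_key: Optional[str] = None

def is_parent(pipe: FakePipe) -> bool:
    """Treat the pipe without a location as the parent."""
    return pipe.location_key is None

# The parent has no location; the child reuses the parent's
# connector and metric keys and adds the hard-coded location.
parent = FakePipe('plugin:test_derivative', 'a')
child = FakePipe(parent.connector_keys, parent.metric_key, 'child')

print(is_parent(parent))  # True
print(is_parent(child))   # False
```

In the real plugin, sync() would return early whenever this check is False, so the child is only ever updated as part of the parent's sync.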

2 replies

p-zach Apr 24, 2022
Author

Thanks! This mostly fixed it, one issue though--when you feed pipe.location_key to child_pipe, that's giving the child_pipe a location of None as well, which prevents the child pipe from being created and therefore no derivative data is created. I believe I fixed that by replacing pipe.location_key with 'deriv_1', effectively "naming" it 'deriv_1'. Is this valid, or should I do something else?

bmeares Apr 24, 2022
Maintainer

That's exactly right: passing the same three keys with the same instance creates a copy of the parent, not a distinct child. It was pretty late when I wrote up that code snippet! Good catch, I'll edit my original answer.

After answering your question last night, I published v0.6.1 which refactors the Python API and cleans up the package docs. I'm working on updating the changelog and adding more documentation about derivative pipes, the Plugin class, etc.
