
[BUG]: Incidence-to-Prevalence instruction set needs a bit of refinement #168

Open
liunelson opened this issue Nov 21, 2024 · 5 comments
Labels: bug (Something isn't working)

@liunelson (Member)
There are some minor bugs in the code generated from the incidence-to-prevalence prompt.

I had to edit the prompt and code a bit to get the right behaviour:

I have three dataframes d1, d2, d3. Do not edit them. Create new dataframes. d1 is incident case counts. d2 is incident hospitalization counts. d3 is cumulative death counts. Let's assume that average time to recover is 14 days and average time to exit the hospital is 10 days. Can you convert this data into prevalence data? Ideally please map it to SIRHD. Assume a population of 320 million. Make sure that every value of I, R, H, D, S is at least 0.0, not negative.

The code is:

import pandas as pd

# Assuming d1 is the dataset for new cases (I), d2 for hospitalizations (H), and d3 for deaths (D)
# Convert 'date' columns in all dataframes to datetime type
d1['date'] = pd.to_datetime(d1['date'], format='%Y-%m-%d')
d2['date'] = pd.to_datetime(d2['date'], format='%Y-%m-%d')
d3['date'] = pd.to_datetime(d3['date'], format='%Y-%m-%d')

# Sort dataframes by date and reset the index so the later
# index-based alignment across d1, d2, d3 stays consistent
d1 = d1.sort_values('date').reset_index(drop=True)
d2 = d2.sort_values('date').reset_index(drop=True)
d3 = d3.sort_values('date').reset_index(drop=True)

# Calculate active cases (I) - sum of new cases over the last 14 days
I = d1['value'].rolling(window=14, min_periods=1).sum()

# Calculate hospitalizations (H) - sum of new hospitalizations over the last 10 days
H = d2['value'].rolling(window=10, min_periods=1).sum()

# Deaths (D) are cumulative, so we can use the value from d3 directly
D = d3['value']

# Calculate recoveries (R) - cumulative sum of cases up to (current date - 14 days) minus current deaths
R = d1['value'].cumsum().shift(periods=14, fill_value=0.0) - D

# Create a new dataframe for I, R, H, D
df_SIRHD = pd.DataFrame({
    'date': d1['date'],
    'I': I,
    'R': R,
    'H': H,
    'D': D
})

# Calculate the susceptible population (S)
total_population = 320e6  # Total population
df_SIRHD['S'] = total_population - df_SIRHD[['I', 'R', 'H', 'D']].sum(axis=1)

# Replace negative values with 0.0 in specified columns
df_SIRHD[['I', 'R', 'H', 'D', 'S']] = df_SIRHD[['I', 'R', 'H', 'D', 'S']].clip(lower=0)

# Remove rows where the 'date' column is NaT
df_SIRHD.dropna(subset=['date'], inplace=True)
df_SIRHD
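As a sanity check, the same transformation can be exercised on synthetic inputs. This is a hypothetical constant-incidence scenario; the `date`/`value` column schema is assumed from the thread, and the numbers are made up for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical synthetic inputs with the same schema as d1, d2, d3
dates = pd.date_range("2021-01-01", periods=30, freq="D")
d1 = pd.DataFrame({"date": dates, "value": 100.0})                # new cases/day
d2 = pd.DataFrame({"date": dates, "value": 10.0})                 # new hospitalizations/day
d3 = pd.DataFrame({"date": dates, "value": np.arange(30) * 2.0})  # cumulative deaths

# Same logic as the corrected code above
I = d1["value"].rolling(window=14, min_periods=1).sum()
H = d2["value"].rolling(window=10, min_periods=1).sum()
D = d3["value"]
R = d1["value"].cumsum().shift(periods=14, fill_value=0.0) - D

df = pd.DataFrame({"date": d1["date"], "I": I, "R": R, "H": H, "D": D})
df["S"] = 320e6 - df[["I", "R", "H", "D"]].sum(axis=1)
df[["I", "R", "H", "D", "S"]] = df[["I", "R", "H", "D", "S"]].clip(lower=0)
```

With constant incidence of 100 cases/day, I plateaus at 1400 (14 days of backlog) and H at 100, with no NaN anywhere thanks to `min_periods=1` and `fill_value=0.0`.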

The particular edits are:

  1. Tell the agent to avoid editing the input variables (to work around the "Reset kernel" and "Rerun all cells" issue).
  2. Remove any code that changes the index to the date.
  3. Add `min_periods=1` to the rolling sums to avoid NaN values showing up in the I and H variables.
  4. Add `fill_value=0.0` to the cumulative sum for R.
  5. Replace negative values with 0.0 and remove rows with a NaN date.
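Edits 3 and 4 address pandas defaults: `rolling(...).sum()` yields NaN for the first `window - 1` rows unless `min_periods` is lowered, and `shift()` pads with NaN unless `fill_value` is given. A toy illustration:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Without min_periods, the first window-1 entries are NaN
default_roll = s.rolling(window=3).sum()
fixed_roll = s.rolling(window=3, min_periods=1).sum()

# Without fill_value, shift() introduces NaN at the start
default_shift = s.cumsum().shift(periods=2)
fixed_shift = s.cumsum().shift(periods=2, fill_value=0.0)
```

The fixed variants are NaN-free, so downstream arithmetic (like subtracting D from R) never silently propagates NaN.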
@liunelson liunelson added the bug Something isn't working label Nov 21, 2024
@liunelson (Member Author)

Unfortunately, re-running the prompt above gave me code with incorrect logic:

import pandas as pd

# Convert 'date' columns in all dataframes to datetime for proper manipulation
d1['date'] = pd.to_datetime(d1['date'])
d2['date'] = pd.to_datetime(d2['date'])
d3['date'] = pd.to_datetime(d3['date'])

# Sort dataframes by date to ensure correct rolling calculations
d1 = d1.sort_values('date')
d2 = d2.sort_values('date')
d3 = d3.sort_values('date')

# Calculate 'Infected' (I) category
d1['I'] = d1['value'] - d1['value'].shift(14)
d1['I'] = d1['I'].fillna(0).apply(lambda x: max(x, 0.0))

# Calculate 'Recovered' (R) category
d1['R'] = d1['value'].shift(14)
d1['R'] = d1['R'].fillna(0).apply(lambda x: max(x, 0.0))

# Calculate 'Hospitalized' (H) category
d2['H'] = d2['value'] - d2['value'].shift(10)
d2['H'] = d2['H'].fillna(0).apply(lambda x: max(x, 0.0))

# 'Dead' (D) category directly comes from d3
d3['D'] = d3['value'].apply(lambda x: max(x, 0.0))

# Calculate 'Susceptible' (S) category
# First, ensure we have a common date range for all calculations
common_dates = set(d1['date']) & set(d2['date']) & set(d3['date'])
df_I = d1[d1['date'].isin(common_dates)][['date', 'I']]
df_R = d1[d1['date'].isin(common_dates)][['date', 'R']]
df_H = d2[d2['date'].isin(common_dates)][['date', 'H']]
df_D = d3[d3['date'].isin(common_dates)][['date', 'D']]

# Merge dataframes on 'date' to align all calculations
df_merged = pd.merge(pd.merge(pd.merge(df_I, df_R, on='date'), df_H, on='date'), df_D, on='date')

# Calculate 'Susceptible' (S)
total_population = 320e6
df_merged['S'] = total_population - df_merged['I'] - df_merged['R'] - df_merged['H'] - df_merged['D']
df_merged['S'] = df_merged['S'].apply(lambda x: max(x, 0.0))

# Selecting the final dataframe to display the compartmentalized data
df_final = df_merged[['date', 'S', 'I', 'R', 'H', 'D']]

# Display the head of the final dataframe to verify the calculations
print(df_final.head())
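One way to see the incorrect logic: assuming d1 holds daily incident counts, `d1['value'] - d1['value'].shift(14)` differences daily counts rather than summing them, so under constant incidence it collapses to zero, while the intended 14-day rolling sum gives the actual active-case backlog. A minimal sketch:

```python
import pandas as pd

# Hypothetical daily incident counts: a constant 100 new cases/day
d1 = pd.DataFrame({"value": [100.0] * 30})

# Rerun-generated logic: differences daily counts, which is not prevalence
wrong_I = (d1["value"] - d1["value"].shift(14)).fillna(0).clip(lower=0.0)

# Intended logic: active cases = sum of new cases over the last 14 days
right_I = d1["value"].rolling(window=14, min_periods=1).sum()
```

Under constant incidence, the rerun's I goes to 0 while the correct I plateaus at 1400.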

@liunelson (Member Author)

A picture = a thousand words

The left operator shows the correct result, while the right one shows the results computed by the code from rerunning the prompt above.

[Image: side-by-side comparison of the two operator outputs]

@brandomr (Contributor)

@liunelson Looking into these... In the first comment, is that code correct? If not, do you have the corrected code by any chance?

@brandomr (Contributor)

@liunelson are you testing with these datasets:

{"d1":"2fb95ea8-8e71-47c6-b6a9-9c2d6d31484e",
"d2": "06a36516-f4ad-4233-a5fe-dd4a2e3343cd",
"d3":"561b00ad-331f-4d0c-8eb2-0dd813290eef"}

The issue is that they contain many locations, so they need to be filtered first (e.g. to just the United States). In the plot on the right in your screenshot, I think we're seeing many values plotted for the same date (one per location on that date). Still poking at it to confirm, though.
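A minimal sketch of that filtering step, assuming (hypothetically) the staging datasets carry a `location` column alongside `date` and `value`:

```python
import pandas as pd

# Hypothetical multi-location data in the assumed schema
d1 = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"],
    "location": ["United States", "Canada", "United States", "Canada"],
    "value": [100.0, 20.0, 110.0, 25.0],
})

# Filter to a single location before converting, so each date appears exactly once
d1_us = d1[d1["location"] == "United States"].reset_index(drop=True)
```

After filtering, rolling sums and cumulative sums operate on one time series per date instead of interleaved series from multiple locations.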

@brandomr (Contributor)

@liunelson Testing with the datasets I referenced above (staging), everything works fine for me once I filter for location.
