Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate records in soi_from_puf_tmd_2021.csv, the data that underlie some of our diagnoses of data quality and target-hitting #107

Open
donboyd5 opened this issue Jun 24, 2024 · 0 comments

Comments

@donboyd5
Copy link
Collaborator

@nikhilwoodruff , could you please look at this?

I downloaded soi_from_puf_tmd_2021.csv today, a summary file with data that is used for some of our diagnoses of file quality and target hitting, as it allows us to compare, for 2021, tmd output to IRS published aggregates.

The screenshot below shows several records from this file with total AGI, by AGI range, for taxable returns, for the sum of all filing statuses, for "not the full population" (which I take to be data_source==1 records, but I am not 100% certain). For readability, I have removed the filtered columns that have no variation in them. All columns that can vary are included in the table. The R code that generated the table in the screenshot is below the table, simply to show my work.

As you can see, we have duplicate records. That might not, on its own, cause any problems. However, note that the total for AGI is $15.8 trillion compared to $13.880 trillion in published IRS data. That would be worrying if correct. However, also note that my second table in issue #106 has my summary of our data file. It shows a total for our data of $13.725 trillion, which is quite close to the IRS total, so I think this is probably an issue with how soi_from_puf_tmd_2021.csv summarizes the microdata, and not with the microdata itself. (However, issue #106 suggests we have other things we need to attend to.)

image

dir <- r"(C:\Users\donbo\Downloads)"
fn <- "soi_from_puf_tmd_2021.csv"
df <- read_csv(path(dir, fn))

df |> 
  filter(Variable == "adjusted_gross_income",
         `Filing status` == "All",
         !Count,
         `Taxable only`,
         !`Full population`) |> 
  select(-c(`Filing status`, Count, `Taxable only`, `Full population`)) |> # for readability, drop columns with only one value
  arrange(`AGI lower bound`, desc(`AGI upper bound`), OK) |> 
  gt() |> 
  fmt_number(columns = c(`AGI lower bound`, `AGI upper bound`),
             scale=1,
             decimals = 0) |>  
  fmt_number(columns = Value:`Absolute error`,
             scale=1e-9,
             decimals = 2) |> 
  fmt_percent(columns=c(`Relative error`, `Absolute relative error`),
              decimals=1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant