Duplicate records in soi_from_puf_tmd_2021.csv, the data that underlie some of our diagnoses of data quality and target-hitting #107

donboyd5 · 2024-06-24T23:58:49Z

@nikhilwoodruff , could you please look at this?

I downloaded soi_from_puf_tmd_2021.csv today, a summary file with data that is used for some of our diagnoses of file quality and target hitting, as it allows us to compare, for 2021, tmd output to IRS published aggregates.

The screenshot below shows several records from this file with total AGI, by AGI range, for taxable returns, for the sum of all filing statuses, for "not the full population" (which I take to be data_source==1 records, but I am not 100% certain). For readability, I have removed the filtered columns that have no variation in them. All columns that can vary are included in the table. The R code that generated the table in the screenshot is below the table, simply to show my work.

As you can see, we have duplicate records. That might not, on its own, cause any problems. However, note that the total for AGI is $15.8 trillion compared to $13.880 trillion in published IRS data. That would be worrying if correct. However, also note that my second table in issue #106 has my summary of our data file. It shows a total for our data of $13.725 trillion, which is quite close to the IRS total, so I think this is probably an issue with how soi_from_puf_tmd_2021.csv summarizes the microdata, and not with the microdata itself. (However, issue #106 suggests we have other things we need to attend to.)

dir <- r"(C:\Users\donbo\Downloads)"
fn <- "soi_from_puf_tmd_2021.csv"
df <- read_csv(path(dir, fn))

df |> 
  filter(Variable == "adjusted_gross_income",
         `Filing status` == "All",
         !Count,
         `Taxable only`,
         !`Full population`) |> 
  select(-c(`Filing status`, Count, `Taxable only`, `Full population`)) |> # for readability, drop columns with only one value
  arrange(`AGI lower bound`, desc(`AGI upper bound`), OK) |> 
  gt() |> 
  fmt_number(columns = c(`AGI lower bound`, `AGI upper bound`),
             scale=1,
             decimals = 0) |>  
  fmt_number(columns = Value:`Absolute error`,
             scale=1e-9,
             decimals = 2) |> 
  fmt_percent(columns=c(`Relative error`, `Absolute relative error`),
              decimals=1)

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Duplicate records in soi_from_puf_tmd_2021.csv, the data that underlie some of our diagnoses of data quality and target-hitting #107

Duplicate records in soi_from_puf_tmd_2021.csv, the data that underlie some of our diagnoses of data quality and target-hitting #107

donboyd5 commented Jun 24, 2024

Duplicate records in soi_from_puf_tmd_2021.csv, the data that underlie some of our diagnoses of data quality and target-hitting #107

Duplicate records in soi_from_puf_tmd_2021.csv, the data that underlie some of our diagnoses of data quality and target-hitting #107

Comments

donboyd5 commented Jun 24, 2024