I downloaded soi_from_puf_tmd_2021.csv today, a summary file with data that is used for some of our diagnoses of file quality and target hitting, as it allows us to compare, for 2021, tmd output to IRS published aggregates.
The screenshot below shows several records from this file with total AGI, by AGI range, for taxable returns, for the sum of all filing statuses, for "not the full population" (which I take to be data_source==1 records, but I am not 100% certain). For readability, I have removed the filtered columns that have no variation in them. All columns that can vary are included in the table. The R code that generated the table in the screenshot is below the table, simply to show my work.
As you can see, we have duplicate records. That might not, on its own, cause any problems, but note that the total for AGI is $15.8 trillion, compared with $13.880 trillion in published IRS data. That would be worrying if correct. By contrast, the second table in issue #106, which summarizes our data file, shows a total of $13.725 trillion, quite close to the IRS total. So I think this is probably an issue with how soi_from_puf_tmd_2021.csv summarizes the microdata, and not with the microdata itself. (Issue #106 does suggest we have other things we need to attend to, though.)
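One way to test whether the duplicate records account for the inflated total is to sum the values with and without exact duplicate rows. The following is a minimal Python sketch of that check, not the R code used for the table above. The column names (`variable`, `agi_lower`, `agi_upper`, `taxable_only`, `value`) and the numbers are made up for illustration and are not taken from soi_from_puf_tmd_2021.csv.

```python
import csv
import io
from collections import Counter

# Toy stand-in for the summary file. Column names and values are
# hypothetical, chosen only to illustrate the check.
SAMPLE = """variable,agi_lower,agi_upper,taxable_only,value
agi,0,50000,True,100
agi,0,50000,True,100
agi,50000,100000,True,250
agi,100000,inf,True,400
agi,100000,inf,True,400
"""


def summarize(rows, value_col="value"):
    """Return (total incl. duplicates, total after dropping exact
    duplicate rows, dict of duplicated rows -> occurrence count)."""
    # A row's identity is the full set of its (column, value) pairs.
    keys = [tuple(sorted(r.items())) for r in rows]
    counts = Counter(keys)
    raw_total = sum(float(r[value_col]) for r in rows)
    dedup_total = sum(float(dict(k)[value_col]) for k in counts)
    dupes = {k: n for k, n in counts.items() if n > 1}
    return raw_total, dedup_total, dupes


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
raw_total, dedup_total, dupes = summarize(rows)
print(f"total with duplicates:    {raw_total}")
print(f"total without duplicates: {dedup_total}")
print(f"distinct duplicated rows: {len(dupes)}")
```

If the deduplicated total for the real file lands near the published IRS figure, the duplicates would be the likely cause. If it does not, the discrepancy probably comes from somewhere else in how the file is built.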
@nikhilwoodruff, could you please look at this?