Review reweighting targets #5

donboyd5 · 2024-02-08T14:05:39Z

Here is a link to @donboyd5's notes about IRS spreadsheets as source of historical target data, from a prior project.

Should I add them below, or keep them as a separate resource doc and start a conversation here about what to do and how?

nikhilwoodruff · 2024-02-08T14:25:45Z

Thanks @donboyd5 - I think we can just keep them in the doc, since we don't need to add excess work here. Discussion on this thread would be good though!

donboyd5 · 2024-02-08T14:32:21Z

Suggested principles and general approach

I suggest we make this discussion primarily about setting targets. If we find ourselves talking about technical methods for hitting/approximating targets, or evaluating goodness of fit, I can open a separate issue for that.

Comments below are about the long-run approach, without regard to easier/shorter methods we may need to use in the near term. Clearly we may implement only some (or possibly even none) of this in Phase 1, and may not get to all of it even in Phases 1-3. But it's good to think about where we want to end up. We absolutely should include in this issue a discussion of how much targeting to do in Phase 1, and how. My preference is to waste no work: let's not construct anything elaborate in Phase 1 that we'll have to discard or tear down later. Let's make progress on methods for setting and hitting targets in Phase 1, even if we don't use anything sophisticated then and instead just apply some simple growth rates.

We have to target the distribution of key components of income, key deductions, and numbers of filing units by type, as well as we can if we want to have accurate revenue and distributional estimates.
We generally will have good targeting information (published data on distribution of #s of units, income, and deductions) for historical years.
We generally will not have good targeting information for forecast years. Rather, we will rely primarily on simple extrapolative forecasts of the last-best historical year -- forecasts of growth in number of returns (and of population more generally), growth in per-filing-unit wages, interest income, medical expenses, etc. This is the way taxdata does it now, and the way I believe JCT and TPC do it also (eventually I'll confirm).

This approach, in its simplest form, does NOT forecast the distribution of income. Rather, it relies on hitting future targets in aggregate (total # returns, total agi, total capital gains, etc.) but not necessarily their distribution. This is probably the way we'll need to do it in Phases 1-3 (to the extent we even do targeting in Phase 1) unless we are well ahead of goals. There are ways, however, to forecast simple points on the distribution of income (e.g., weighted # units by agi range) that we could consider in future work.
The last-best historical year becomes a critical year in this approach. We want it to be as good as possible because after that we run out of detailed targeting information and we are flying free (no distributional targets). If we can't get the last-best year right, we certainly won't get the future right.
The last-best historical year won't necessarily be the same year for everything. For example, when targeting potential (not actual) itemized deductions, we will have far more information about 2017 -- the last pre-TCJA year -- than we will for 2018-2021. We may have detailed itemized-deduction distributional targets for 2017, and then extrapolative aggregate targets for later historical years.
We also need to pay close attention to the last data year -- e.g., 2015 in the case of the PUF. We almost certainly will do a better job of hitting targets plausibly for the last-best historical year (e.g., 2021) if our initial weights for the last data year (e.g., 2015) hit that year's targets.
Unless we really, really care about policy analysis for years between the data year(s) (mostly 2015) and the last-best historical year (e.g., 2021), we should not worry about those years and should fill them in with the simplest possible approach.
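
A minimal sketch of the simple extrapolative approach described above, to make the arithmetic concrete. All names, target values, and growth rates here are hypothetical illustrations, not data from IRS, taxdata, JCT, or TPC. It grows last-best-year aggregate targets forward using growth in the number of returns and growth in per-return amounts, and does not forecast the distribution.

```python
# Hypothetical last-best-year (2021) aggregate targets; values are made up.
base_targets = {"n_returns": 160.0e6, "wages": 9.0e12, "capital_gains": 2.0e12}

# Hypothetical annual growth rates (assumptions, not forecasts).
returns_growth = {2022: 0.010, 2023: 0.008}
per_return_growth = {
    "wages": {2022: 0.05, 2023: 0.04},
    "capital_gains": {2022: -0.20, 2023: 0.03},
}


def extrapolate_targets(base, returns_growth, per_return_growth, base_year, end_year):
    """Return {year: {item: aggregate target}} for base_year..end_year."""
    out = {base_year: dict(base)}
    for year in range(base_year + 1, end_year + 1):
        prev = out[year - 1]
        g_n = 1.0 + returns_growth[year]
        cur = {"n_returns": prev["n_returns"] * g_n}
        for item, growth in per_return_growth.items():
            # Aggregate growth = growth in # of returns x growth in per-return amount.
            cur[item] = prev[item] * g_n * (1.0 + growth[year])
        out[year] = cur
    return out


targets = extrapolate_targets(base_targets, returns_growth, per_return_growth, 2021, 2023)
for year in sorted(targets):
    print(year, {k: round(v) for k, v in targets[year].items()})
```

Because only aggregates are grown, any error in the 2021 base carries through unchanged to every later year, which is why the last-best historical year matters so much.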

donboyd5 · 2024-02-08T14:50:15Z

Filers and nonfilers approach

IRS data will be useful primarily as targets for filers. Much of their published aggregate data will be available by AGI range and by filing status, for filers only of course.
To make best use of these data, we need to be able to:

(1) calculate agi for each record in our unweighted (or weighted but untargeted) file for any year for which we are going to target the distribution of income, so that we can determine that year's agi range for each record

(2) determine (as best we can) whether the record is a filer or not; later I will link to a past discussion on this topic and code I have used for this purpose

(3) less critical and less difficult, but still important: determine filing status (married, single, etc.) from the demographic data on the record so that we can make use of targets by filing status; ordinarily, I suspect this will be frozen - unchanging from year to year - so I mention it only for completeness
We will be able to use IRS-based targets for the filer records for many tax-related variables.
For the nonfiler records and for certain non-tax-related variables, we may consider constructing distributional targets from the CPS or ACS (e.g., suppose our last nonfiler data year is 2015, but we can construct nonfiler or universe distributions and totals from the available 2021 CPS or ACS, and use those as 2021 targets).
Whether to target these latter variables for the universe (in addition to separate targets for filers) or, in some cases, to have separate filer and nonfiler targets that sum to the universe is an important implementation detail, but one we don't need to worry about just yet.

All of this is important for getting a file that can represent the last-best historical year well -- essential to representing future years plausibly. (If you extrapolate from the wrong base using the right growth, you'll still get the wrong future.)
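
As a rough illustration of steps (1)-(3), the sketch below assumes each record already carries a calculated agi, a filer flag, and a filing status, then assigns an agi range and sums weighted filer values by agi range and filing status so they can be compared with IRS-based targets. The field names, the range cut points, and the sample records are all hypothetical; actual SOI tables use finer ranges, and the filer determination itself is not shown.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical AGI range boundaries (dollars); real SOI tables use finer ranges.
AGI_CUTS = [0, 25_000, 50_000, 100_000, 200_000]
AGI_LABELS = ["<0", "0-25k", "25k-50k", "50k-100k", "100k-200k", "200k+"]


def agi_range(agi):
    """Map an AGI amount to its range label (lower bound inclusive)."""
    return AGI_LABELS[bisect_right(AGI_CUTS, agi)]


def weighted_filer_totals(records, items=("agi", "wages")):
    """Sum weighted returns and dollar items for filers by (agi range, filing status)."""
    totals = defaultdict(lambda: defaultdict(float))
    for rec in records:
        if not rec["is_filer"]:
            continue  # IRS-based targets cover filers only
        cell = (agi_range(rec["agi"]), rec["filing_status"])
        totals[cell]["n_returns"] += rec["weight"]
        for item in items:
            totals[cell][item] += rec["weight"] * rec[item]
    return {cell: dict(vals) for cell, vals in totals.items()}


# Made-up records for illustration only.
records = [
    {"weight": 100.0, "agi": 30_000.0, "wages": 28_000.0, "filing_status": "single", "is_filer": True},
    {"weight": 50.0, "agi": 40_000.0, "wages": 40_000.0, "filing_status": "single", "is_filer": True},
    {"weight": 80.0, "agi": 120_000.0, "wages": 110_000.0, "filing_status": "married", "is_filer": True},
    {"weight": 200.0, "agi": 8_000.0, "wages": 8_000.0, "filing_status": "single", "is_filer": False},
]

totals = weighted_filer_totals(records)
for cell in sorted(totals):
    print(cell, totals[cell])
```

The nonfiler record is skipped here; under the approach above it would instead be compared against separate nonfiler or universe targets built from the CPS or ACS.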

donboyd5 · 2024-02-08T15:20:34Z

Principles for establishing targets -- which items to target?

IRS aggregates create the opportunity to target many hundreds of variables, as do the CPS and the ACS. Certainly in the short run we won't be able to try to hit them all, and in the long run we may not want to. We need principles for what to target and, because our technical methods will not allow us to hit all targets well, principles for which targets matter most. We'd also like good ways to operationalize those principles.

Possible principles -- we should target variables that are:

Important for accurately calculating tax liability by agi range by filing status -- e.g., total agi, wages, capital gains, interest income, dividends, pension income, potential SALT deductions, taxable income, # dependents, plus, for many of the preceding items, # of weighted records that have nonzero and/or positive values and/or negative values of the item, all by agi range and filing status.
Important for evaluating major current policies, even if not critical to calculating liability -- e.g., # of young children by agi range and filing status.
Important for evaluating major proposed policies that we anticipate. (Which variables?)

Are there other important principles? Does filer age come into play under any of the above? (TBD)
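
To make the first principle concrete, here is a minimal sketch of the per-item statistics it mentions: the weighted amount plus the weighted number of records with nonzero, positive, and negative values. The function name and sample values are hypothetical; in practice this would be computed within each agi range and filing status cell.

```python
def target_stats(values, weights):
    """Weighted amount and weighted counts of nonzero/positive/negative values."""
    stats = {"amount": 0.0, "n_nonzero": 0.0, "n_positive": 0.0, "n_negative": 0.0}
    for value, weight in zip(values, weights):
        stats["amount"] += weight * value
        if value > 0:
            stats["n_positive"] += weight
        elif value < 0:
            stats["n_negative"] += weight
    stats["n_nonzero"] = stats["n_positive"] + stats["n_negative"]
    return stats


# Made-up capital gains values and weights for one cell.
capital_gains = [0.0, 5_000.0, -3_000.0, 12_000.0]
weights = [100.0, 20.0, 10.0, 5.0]

cg_stats = target_stats(capital_gains, weights)
print(cg_stats)
```

Counts matter alongside amounts because a file can hit a dollar total with too few or too many records carrying the item, which would distort distributional results.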

martinholmer · 2024-02-08T16:28:32Z

@donboyd5 and @nikhilwoodruff, Thanks for all the work so far. My understanding of our Feb work plan is to create a flat-file version of a PolicyEngine-US (PEUS) hierarchical input dataset. @nikhilwoodruff has already started that work in PR #4.

My responsibility is validation. I have been able to download TSY and JCT tax expenditure estimates for FY2023, which I will need to compare against estimates generated by our Feb dataset, which looks like it will be for CY2023 (see PR #4).

But when I look at the IRS SOI aggregate tables pointed to by @donboyd5 in issue #5, the latest available information is for CY2021. Maybe I'm missing something obvious, but I don't see that we can use any IRS-SOI data in the Feb work given the lag in IRS-SOI publication.

donboyd5 · 2024-02-09T15:55:09Z

@martinholmer @nikhilwoodruff The agreed Phase 1 plan only requires that we "construct a flattened version of the PolicyEngine file suitable for input into Tax-Calculator", so we do not need to do this specific kind of targeting in Phase 1. If the basic PE flat file will be an already-targeted CY 2023 file, which is how I interpret the screenshot in PR #4 (if not correct, please say so @nikhilwoodruff), then we don't need to do any specific targeting in Phase 1.

This thread is focused primarily on longer-term approaches.

That said, for completeness and not for work in Phase 1, there are ways to use CY 2021 target information when producing a CY 2023 file. One approach would be to forecast key targets forward 2 years, allowing us to have targets for aggregates in 2023, as well as distributional targets. Then reweight the file to hit/approximate those targets for filers. But we don't need to do that in Phase 1.
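
A minimal sketch of that idea, for illustration only and not for Phase 1. It grows hypothetical CY 2021 targets for the number of returns by agi range forward two years at an assumed common growth rate, then ratio-adjusts filer weights within each range to hit them. This is simple one-dimensional ratio adjustment, swapped in here for brevity; it is not a method the project has adopted, and all names and numbers are made up.

```python
def grow(target, annual_growth, years):
    """Compound a target forward by a constant annual growth rate."""
    return target * (1.0 + annual_growth) ** years


def reweight_by_range(records, targets_by_range):
    """Scale filer weights within each agi range so weighted returns hit the target."""
    current = {}
    for rec in records:
        if rec["is_filer"]:
            current[rec["agi_range"]] = current.get(rec["agi_range"], 0.0) + rec["weight"]
    reweighted = []
    for rec in records:
        factor = 1.0  # nonfilers and untargeted ranges keep their weights
        if rec["is_filer"] and rec["agi_range"] in targets_by_range:
            factor = targets_by_range[rec["agi_range"]] / current[rec["agi_range"]]
        reweighted.append({**rec, "weight": rec["weight"] * factor})
    return reweighted


# Hypothetical CY 2021 targets (# of returns by agi range) and growth assumption.
targets_2021 = {"low": 1_000.0, "high": 200.0}
targets_2023 = {rng: grow(t, 0.01, years=2) for rng, t in targets_2021.items()}

# Made-up records with agi range and filer status already assigned.
records = [
    {"weight": 600.0, "agi_range": "low", "is_filer": True},
    {"weight": 300.0, "agi_range": "low", "is_filer": True},
    {"weight": 250.0, "agi_range": "high", "is_filer": True},
    {"weight": 500.0, "agi_range": "low", "is_filer": False},
]

new_records = reweight_by_range(records, targets_2023)
for rec in new_records:
    print(rec)
```

With many targets across many variables, a single ratio per cell is no longer enough, which is where the separate discussion of technical methods for hitting targets would come in.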

martinholmer · 2024-07-16T22:24:39Z

@donboyd5, What is the status of issue #5 (which you raised in early February)?
If it has been resolved, you should close it.
If not, then add a comment about what else needs to be done.

donboyd5 · 2024-09-02T16:26:06Z

Thanks, @martinholmer. I missed your earlier request for a status update. The thread was about longer-term issues that we discussed extensively with @nikhilwoodruff during Phases 1-3, although not all were addressed. Fine to close. We can revisit if / when appropriate in the future.

nikhilwoodruff added the targeting label (Tasks related to gathering targets) Feb 8, 2024

martinholmer mentioned this issue Feb 8, 2024

Generate initial flat file from PolicyEngine Enhanced CPS #2

Closed

nikhilwoodruff changed the title ~~Establish targets for selected variables~~ Review reweighting targets Jun 19, 2024

martinholmer closed this as completed Sep 1, 2024
