Evaluating the performance of IPF
========================================================
font-family: 'Helvetica'
transition: rotate
<large> **new tests for an old technique** </large>
Robin Lovelace
[RSAI-BIS](http://www.rsai-bis.org/) August 2013, Cambridge
Wednesday 21^st 11 - 13, early careers session
Introduction
========================================================
- Iterative proportional fitting (IPF) is an established statistical technique [(Deming and Stephan, 1940)](http://www.jstor.org/stable/10.2307/2235722)
- It estimates the values of internal cells, based on marginal totals:

- Used in spatial microsimulation for allocating individuals to zones [(Lovelace and Ballas, 2013)](http://dx.doi.org/10.1016/j.compenvurbsys.2013.03.004)
Other applications of IPF
=======================================================
How IPF works 1 - visually
========================================================
Visually, this can be seen as follows:
- Selection of variables used to sample individuals
- People most representative of the target area selected
- (Optional) process of integerisation converts weights into whole individuals
------------------------
<img src="figure/schematic.jpg" height="800px" width="400px" />
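
The optional integerisation step can be sketched in a few lines of R. This is only an illustration in the spirit of the 'truncate, replicate, sample' idea named in the references, not the code from the repository: the function name `int_trs` and the weights `w_frac` are made up for this example, and it assumes the weights sum to (roughly) a whole number of people.

```r
# Hypothetical fractional weights for 5 individuals in one zone (sum = 12)
w_frac <- c(4.44, 3.56, 1.54, 1.23, 1.23)

# Sketch of integerisation: keep the whole part of each weight, replicate
# individuals that many times, then fill the remaining places by sampling
# individuals with probability proportional to the leftover fractions.
int_trs <- function(w) {
  whole <- floor(w)                       # truncate
  ids <- rep(seq_along(w), whole)         # replicate
  deficit <- round(sum(w)) - sum(whole)   # people still to allocate
  if (deficit > 0) {
    frac <- w - whole
    # sample without replacement, so nobody gains more than one extra copy
    ids <- c(ids, sample(seq_along(w), size = deficit, prob = frac))
  }
  sort(ids)
}

set.seed(42)                # assumed seed, only to make the run repeatable
ids <- int_trs(w_frac)
table(ids)                  # number of copies of each individual
```

The result is a vector of row numbers, so `ind[ids, ]` would give the integerised population if `ind` were the individual-level dataset.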
How IPF works 2 - in maths
========================================================
$$
w(n+1)_{ij} = \frac{w(n)_{ij} \times sT_{i}}{mT(n)_{i}}
$$
- where $w(n+1)_{ij}$ is the individual's new weight,
- $w(n)_{ij}$ is the current weight (before the update),
- $sT_{i}$ is the marginal total of the small area constraint for category $i$,
- $mT(n)_{i}$ is the corresponding aggregate result of the weighted microdataset

Apply this algorithm, one constraint at a time, to every area in the case study.
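
As a worked illustration with made-up numbers (not taken from the case study): suppose an individual currently has weight $w(n)_{ij} = 1$ and belongs to a category whose small area total is $sT_{i} = 12$, while the weighted microdata currently contain $mT(n)_{i} = 8$ people in that category. Then

$$
w(n+1)_{ij} = \frac{1 \times 12}{8} = 1.5
$$

Every individual in that category is scaled by the same factor, so the simulated count for the category becomes $8 \times 1.5 = 12$, matching the constraint.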
How IPF works 3 - in code
=================================================
Main algorithm:
```{r, eval=F}
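# For each zone j and category i: weight = constraint count / simulated count
for (j in seq_len(nrow(all.msim))) {
  for (i in seq_len(ncol(con1))) {
    weights[which(ind.cat[, i] == 1), j, 1] <- con1[j, i] / ind.agg[j, i, 1]
  }
}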
```
Or in English:
- for each zone, set the new weight of individuals in each category equal to the census count for that category divided by its current count in the simulation
- May need a worked example to 'get it' (took me 3 months!)
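
A small worked example may help here. The sketch below is not the code from the repository: it is a self-contained toy version with invented data (5 individuals, 2 zones, 2 constraints), and the names `ipf`, `cat_age`, `cat_sex`, `con_age` and `con_sex` are hypothetical. It uses a multiplicative update, so each pass rescales the current weights rather than storing one set of weights per constraint.

```r
# Hypothetical survey microdata: 5 individuals
ind <- data.frame(
  age = c("young", "young", "old", "old", "old"),
  sex = c("m", "f", "m", "f", "f")
)

# Hypothetical census counts: rows = zones, columns = categories
con_age <- matrix(c(8, 4,
                    2, 10), nrow = 2, byrow = TRUE,
                  dimnames = list(NULL, c("young", "old")))
con_sex <- matrix(c(6, 6,
                    5, 7), nrow = 2, byrow = TRUE,
                  dimnames = list(NULL, c("m", "f")))

# Binary membership matrices: rows = individuals, columns = categories
cat_age <- sapply(colnames(con_age), function(k) as.numeric(ind$age == k))
cat_sex <- sapply(colnames(con_sex), function(k) as.numeric(ind$sex == k))

ipf <- function(cats, cons, n_iter = 5) {
  n_zone <- nrow(cons[[1]])
  weights <- matrix(1, nrow = nrow(cats[[1]]), ncol = n_zone)  # start at 1
  for (it in seq_len(n_iter)) {
    for (k in seq_along(cons)) {        # one constraint at a time
      for (j in seq_len(n_zone)) {      # every zone
        agg <- colSums(cats[[k]] * weights[, j])  # current simulated counts
        ratio <- cons[[k]][j, ] / agg             # census / simulated
        # each individual is rescaled by the ratio of their own category
        weights[, j] <- weights[, j] * as.vector(cats[[k]] %*% ratio)
      }
    }
  }
  weights
}

w <- ipf(list(cat_age, cat_sex), list(con_age, con_sex), n_iter = 5)
round(t(cat_age) %*% w, 3)   # simulated age counts, categories x zones
round(t(cat_sex) %*% w, 3)   # simulated sex counts, categories x zones
```

Comparing the two printed tables with `t(con_age)` and `t(con_sex)` shows how closely the weighted sample reproduces the constraints. The toy data have no empty cells, so no category total is ever zero.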
The need for testing
============================================
It is well-known that IPF works:
- converges towards a single result
- robust and computationally efficient
- has been used in many spatial microsimulation studies
Much less is known about the factors influencing its performance.
Are there ways IPF should or should not be set up?
Baseline scenarios
===================================
Three baseline scenarios were used:
- the simplest possible case, with 5 areas, 10 individuals and 2 constraints
- 'small area' constraints: 24 '[OA](http://www.ons.gov.uk/ons/guide-method/geography/beginner-s-guide/census/output-area--oas-/index.html)' zones, ~1,000 individuals and 3 constraints
- 'Sheffield', containing the 71 '[MSOA](http://www.ons.gov.uk/ons/guide-method/geography/beginner-s-guide/census/super-output-areas--soas-/index.html)' zones, ~5,000 individuals and 4 constraints
Most tests were done on the 'small area' scenario
Baseline result
====================================
- Correlation rapidly approaches 1
- Beyond 5 iterations, result is indistinguishable from 1
- perfect convergence (no empty cells)
- But are we using the right metric of model fit?
Baseline result - visual
===================================

-------------------

Evaluating model fit
====================================
Commonly used options include:
- Pearson's coefficient of correlation (r)
- Total and Standardised Absolute Error (TAE and SAE)
- Root mean squared error (RMSE)
- Z-scores
- Standard Error Around Identity (SEI)
- other metrics do exist!
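
Several of these metrics take only a line of R each. The sketch below is illustrative: `fit_metrics` is a hypothetical helper, the `obs` and `sim` vectors are invented, and SAE is assumed here to be TAE divided by the sum of the observed counts (definitions of the denominator vary between studies). Z-scores and SEI are left out.

```r
# obs: census (observed) counts; sim: simulated counts for the same cells
fit_metrics <- function(obs, sim) {
  obs <- as.vector(obs)
  sim <- as.vector(sim)
  tae <- sum(abs(obs - sim))             # total absolute error
  c(r    = cor(obs, sim),                # Pearson's correlation
    TAE  = tae,
    SAE  = tae / sum(obs),               # assumed denominator: total observed
    RMSE = sqrt(mean((obs - sim)^2)))    # root mean squared error
}

# Hypothetical counts for four cells
obs <- c(8, 4, 2, 10)
sim <- c(8.1, 3.9, 2.1, 9.9)
fit_metrics(obs, sim)
```

Because r measures only linear association, it can sit very close to 1 while TAE and RMSE still show how far the simulated counts are from the census counts in absolute terms.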
Model experiments
=====================================
The impact of the following changes was tested:
- number of iterations
- number/order of constraints
- initial weights
- ratio of survey size:zone areas
- empty cells
- integerisation
Results - Iterations and constraints
======================================
- After 4 iterations all models had near-perfect fit
- The order of constraints had some impact, but not a lot
- Fewer constraints > faster convergence (duh!)
Results - Initial weights I
=======================================
<img src="models/small-area-weights/weight-1-5-its.png" height="600px" width="1000px" />
Doubling an individual's initial weight has some impact after 1 iteration, but the impact tends rapidly to 0
Results - Initial weights II
=======================================

Effects most pronounced within each iteration
Knock-on effects on other individuals
Summary of findings
=======================================
- The number of lines of code to perform IPF has been reduced
- IPF converges rapidly to a single result, if set up correctly
- Supports previous work suggesting convergence after 10 iterations ([Ballas et al. 2005](http://www.jrf.org.uk/sites/files/jrf/1859352669.pdf))
- Five is probably sufficient for 4 or fewer constraints
- Initial weights seem to have very little impact on the results: their effect tends rapidly to 0 with further iterations
- Integerisation has a slight negative impact on fit
Conclusions and further work
======================================
- IPF is a useful procedure for various applications, but its utility can be extended and enhanced in various ways ([Pritchard and Miller, 2012](http://www.springerlink.com/index/10.1007/s11116-011-9367-4))
- Before 'trying to run', however, researchers should master walking
- Therefore these basic tests on the performance of IPF should be useful in informing future work
- Reproducible code and example data should be useful to others ('fork me' on [GitHub](https://github.com/Robinlovelace/IPF-performance-testing)!)
- More model experiments: missing cells and interactions between variables
- Is it possible for IPF to be made even faster?
- Methods for grouping individuals (e.g. into families)
Key references (see links to others)
========================================
<small>
Deming, W. E., & Stephan, F. F. (1940). On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. [*The Annals of Mathematical Statistics*, 11(4), 427–444](http://www.jstor.org/stable/10.2307/2235722)
Lovelace, R., & Ballas, D. (2013). “Truncate, replicate, sample”: A method for creating integer weights for spatial microsimulation. [*CEUS*, 41, 1–11](http://dx.doi.org/10.1016/j.compenvurbsys.2013.03.004)
"[IPF-performance-testing](https://github.com/Robinlovelace/IPF-performance-testing)" GitHub repository - please 'clone' this and contribute! This presentation is also at [www.rpubs.com/RobinLovelace](http://rpubs.com/RobinLovelace/7598)
Pritchard, D. R., & Miller, E. J. (2012). Advances in population synthesis: fitting many attributes per agent and fitting to household and person margins simultaneously. [Transportation, 39(3)](http://www.springerlink.com/index/10.1007/s11116-011-9367-4)
Thanks for listening `r.lovelace at leeds.ac.uk`
</small>