
Evaluating the performance of IPF 
========================================================
font-family: 'Helvetica'
transition: rotate
<large> **new tests for an old technique** </large>

Robin Lovelace   
[RSAI-BIS](http://www.rsai-bis.org/) August 2013, Cambridge  
Wednesday 21^st  11 - 13, early careers session


Introduction
========================================================


- Iterative proportional fitting (IPF) is an established statistical technique [(Deming and Stephan, 1940)](http://www.jstor.org/stable/10.2307/2235722)
- It estimates the values of internal cells, based on marginal totals:    
![Marginal totals table](figure/t2.png)
- Used in spatial microsimulation for allocating individuals to zones [(Lovelace and Ballas, 2013)](http://dx.doi.org/10.1016/j.compenvurbsys.2013.03.004)
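
A minimal sketch of the idea in base R, assuming a made-up 2 x 3 seed table and made-up marginal totals (none of these numbers come from the study; `seed`, `row_totals` and `col_totals` are illustrative names):

```r
# Hypothetical seed table: the internal cells we want to estimate
seed <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
row_totals <- c(40, 60)      # assumed known row margins
col_totals <- c(20, 30, 50)  # assumed known column margins

fitted <- seed
for (it in 1:20) {
  # scale each row so that row sums match the row margins
  fitted <- fitted * (row_totals / rowSums(fitted))
  # scale each column so that column sums match the column margins
  fitted <- sweep(fitted, 2, col_totals / colSums(fitted), "*")
}
round(fitted, 2)
```

After a few passes both sets of margins should be reproduced, while the cell values keep the pattern of the seed table.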

Other applications of IPF
=======================================================


How IPF works 1 - visually
========================================================
Visually, this can be seen as follows:

Selection of variables used to sample individuals

People most representative of the target area selected

(Optional) process of integerisation converts weights into whole individuals
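
A rough sketch of one way to integerise, loosely following the 'truncate, replicate, sample' idea of Lovelace and Ballas (2013). The weights are made up and this is an illustration under those assumptions, not the published implementation:

```r
set.seed(1)
w <- c(0.3, 1.6, 2.4, 0.7)   # hypothetical fractional weights for 4 individuals

int_part  <- floor(w)        # truncate: whole-number part of each weight
frac_part <- w - int_part    # leftover decimal part

# replicate: repeat each individual's ID according to the integer part
selected <- rep(seq_along(w), int_part)

# sample: top up to the target population using the decimal parts as probabilities
deficit  <- round(sum(w)) - length(selected)
topup    <- sample(seq_along(w), size = deficit, prob = frac_part)
selected <- c(selected, topup)

table(factor(selected, levels = seq_along(w)))  # whole individuals per survey ID
```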

------------------------
  <img src="figure/schematic.jpg" height="800px" width="400px" />

How IPF works 2 - in maths
========================================================
$$
w(n+1)_{ij} = \frac{w(n)_{ij} \times sT_{i}}{mT(n)_{i}}
$$

- where $w(n+1)_{ij}$ is the individual's new weight,
- $w(n)_{ij}$ is the original weight,
- $sT_{i}$ is the marginal total of the small area constraint,
- $mT(n)_{i}$ is the aggregate result of the weighted microdataset

Apply this algorithm, one constraint at a time, to every area in the case study
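
A single application of the formula above, for one zone and one constraint, might look like this in R. The data and the names `category`, `sT` and `mT` are assumptions chosen to mirror the symbols on this slide:

```r
# Hypothetical microdata: 5 individuals, one constraint with categories "a" and "b"
category <- c("a", "a", "b", "b", "b")
w  <- rep(1, 5)                  # w(n): current weights
sT <- c(a = 12, b = 8)           # sT: assumed small-area totals for one zone

# mT(n): totals of the weighted microdata, per category
mT <- sapply(split(w, category), sum)

# w(n+1) = w(n) * sT / mT(n), looked up by each individual's category
w_new <- w * sT[category] / mT[category]

sapply(split(w_new, category), sum)  # weighted totals after the update
```

The weighted totals after the update equal `sT` for this constraint; fitting the next constraint will generally disturb them again, which is why the procedure is iterated.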

How IPF works 3 - in code
=================================================
Main algorithm:  
```{r, eval=F}
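for (j in 1:nrow(all.msim)) {   # for each zone
  for (i in 1:ncol(con1)) {     # for each category of the first constraint
    # new weight = census count / count currently simulated
    weights[which(ind.cat[, i] == 1), j, 1] <- con1[j, i] / ind.agg[j, i, 1]
  }
}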
```
Or in English:
- for each zone, set the new weight of individuals in each category equal to their true number divided by their current number in the simulation
- May need worked example to 'get it' (took me 3 months!)
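
One possible worked example, as a self-contained sketch rather than the full script. The data are made up, and `reweight`, `ind.cat1`, `ind.cat2` and `con2` are hypothetical names. Unlike the snippet above, it multiplies the current weights by the ratio, which is equivalent when all weights start at 1:

```r
# Hypothetical survey of 4 individuals as 0/1 category matrices
ind.cat1 <- cbind(young = c(1, 1, 0, 0), old = c(0, 0, 1, 1))  # constraint 1: age
ind.cat2 <- cbind(m = c(1, 0, 1, 0), f = c(0, 1, 0, 1))        # constraint 2: sex

# Assumed census counts: rows are zones, columns are categories
con1 <- matrix(c(6, 4,
                 3, 7), nrow = 2, byrow = TRUE)
con2 <- matrix(c(5, 5,
                 4, 6), nrow = 2, byrow = TRUE)

weights <- matrix(1, nrow = 4, ncol = 2)  # individuals x zones, all start at 1

reweight <- function(weights, ind.cat, con) {
  for (j in 1:nrow(con)) {                      # for each zone
    ind.agg <- colSums(ind.cat * weights[, j])  # current simulated totals
    for (i in 1:ncol(con)) {                    # for each category
      sel <- which(ind.cat[, i] == 1)
      weights[sel, j] <- weights[sel, j] * con[j, i] / ind.agg[i]
    }
  }
  weights
}

for (it in 1:5) {  # fit one constraint at a time, then repeat
  weights <- reweight(weights, ind.cat1, con1)
  weights <- reweight(weights, ind.cat2, con2)
}

round(t(ind.cat1) %*% weights, 2)  # simulated age totals (categories x zones)
round(t(ind.cat2) %*% weights, 2)  # simulated sex totals (categories x zones)
```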

The need for testing
============================================
It is well-known that IPF works:
- converges towards a single result
- robust and computationally efficient
- has been used in many spatial microsimulation studies

Much less is known about the factors influencing its performance.  
Are there ways IPF should or should not be set up?

Baseline scenarios
===================================
Three baseline scenarios were used:
- a simplest possible case, with 5 areas, 10 individuals and 2 constraints
- 'small area' constraints: 24 '[OA](http://www.ons.gov.uk/ons/guide-method/geography/beginner-s-guide/census/output-area--oas-/index.html)' zones, ~1,000 individuals and 3 constraints
- 'Sheffield': 71 '[MSOA](http://www.ons.gov.uk/ons/guide-method/geography/beginner-s-guide/census/super-output-areas--soas-/index.html)' zones, ~5,000 individuals and 4 constraints

Most tests were done on the 'small area' scenario

Baseline result
====================================
- Correlation rapidly approaches 1
- Beyond 5 iterations, the correlation is indistinguishable from 1
- Perfect convergence (no empty cells)

- But are we using the right metric of model fit?
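
To see how correlation can be tracked across iterations, here is a small sketch with made-up data for a single zone. It is not the baseline scenario itself, and `agg`, `age_target` and `sex_target` are hypothetical names:

```r
# Hypothetical survey of 6 individuals and assumed census totals for one zone
age <- c("young", "young", "young", "old", "old", "old")
sex <- c("m", "m", "f", "m", "f", "f")
age_target <- c(young = 7, old = 5)
sex_target <- c(m = 4, f = 8)

agg <- function(w, g) sapply(split(w, g), sum)  # weighted totals per category

w <- rep(1, 6)
r <- numeric(5)
for (it in 1:5) {
  w <- w * age_target[age] / agg(w, age)[age]   # fit the age constraint
  w <- w * sex_target[sex] / agg(w, sex)[sex]   # fit the sex constraint
  sim <- c(agg(w, age)[names(age_target)], agg(w, sex)[names(sex_target)])
  r[it] <- cor(sim, c(age_target, sex_target))  # simulated vs census totals
}
round(r, 4)  # one correlation value per iteration
```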

Baseline result - visual
===================================
![Convergence between constraints](figure/analysis1.png)

-------------------

![Correlation over time - baseline](figure/corr-baseline.png)

Evaluating model fit
====================================
Commonly used options include:
- Pearson's coefficient of correlation (r)
- Total and Standardised Absolute Error (TAE and SAE)
- Root mean squared error (RMSE)
- Z-scores 
- Standard Error Around Identity (SEI)
- other metrics do exist!
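
Several of these can be computed in a line or two of base R. The sketch below uses made-up census (`obs`) and simulated (`sim`) counts, and takes SAE to be TAE divided by the total census count, which is an assumption about the definition. Z-scores and SEI are left out:

```r
# Hypothetical census and simulated counts for the same five cells
obs <- c(12, 8, 5, 15, 10)
sim <- c(11.5, 8.5, 5.2, 14.8, 10)

r    <- cor(obs, sim)              # Pearson's coefficient of correlation
tae  <- sum(abs(obs - sim))        # Total Absolute Error
sae  <- tae / sum(obs)             # Standardised Absolute Error (assumed definition)
rmse <- sqrt(mean((obs - sim)^2))  # Root mean squared error

round(c(r = r, TAE = tae, SAE = sae, RMSE = rmse), 3)
```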

Model experiments
=====================================
The impact of the following changes was tested:
- number of iterations
- number/order of constraints
- initial weights
- ratio of survey size:zone areas
- empty cells
- integerisation

Results - Iterations and constraints
======================================
- After 4 iterations all models had near-perfect fit
- The order of constraints had some impact, but not a lot
- Fewer constraints lead to faster convergence (duh!)

Results - Initial weights I
=======================================
  <img src="models/small-area-weights/weight-1-5-its.png" height="600px" width="1000px" />
  
  
  Doubling an individual's initial weight has some impact after 1 iteration, but the impact tends rapidly to 0

Results - Initial weights II
=======================================
![Influence of weights](models/small-area-weights/weights-exp-5.4.nice2.png)

Effects most pronounced within each iteration

Knock-on effects on other individuals

Summary of findings
=======================================
- The number of lines of code to perform IPF has been reduced
- IPF converges rapidly to a single result, if set-up correctly
- Supports previous work suggesting convergence after 10 iterations ([Ballas et al. 2005](http://www.jrf.org.uk/sites/files/jrf/1859352669.pdf))
- Five iterations are probably sufficient for 4 or fewer constraints
- Initial weights seem to have very little impact on the results, so they are unlikely to affect the final model
- Integerisation has a slight negative impact on fit

Conclusions and further work
======================================
- IPF is a useful procedure for various applications, but its utility can be extended and enhanced in various ways ([Pritchard and Miller, 2012](http://www.springerlink.com/index/10.1007/s11116-011-9367-4))
- Before 'trying to run', however, researchers should master walking 
- Therefore these basic tests on the performance of IPF should be useful in informing future work
- Reproducible code and example data should be useful to others ('fork me' on [Github](https://github.com/Robinlovelace/IPF-performance-testing) !)
- More model experiments: missing cells and interactions between variables
- Is it possible for IPF to be made even faster?
- Methods for grouping individuals (e.g. into families)

Key references (see links to others)
========================================
<small>
Deming, W. E., & Stephan, F. F. (1940). On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. [The Annals of Mathematical Statistics, 11(4)](http://www.jstor.org/stable/10.2307/2235722) 

Lovelace, R., & Ballas, D. (2013). “Truncate, replicate, sample”: A method for creating integer weights for spatial microsimulation. [*CEUS*, 41, 1–11](http://dx.doi.org/10.1016/j.compenvurbsys.2013.03.004)

"[IPF-performance-testing](https://github.com/Robinlovelace/IPF-performance-testing)" github repository - please 'clone' this and contribute! + this presentation at [www.rpubs.com/RobinLovelace](http://rpubs.com/RobinLovelace/7598)

Pritchard, D. R., & Miller, E. J. (2012). Advances in population synthesis: fitting many attributes per agent and fitting to household and person margins simultaneously. [Transportation, 39(3)](http://www.springerlink.com/index/10.1007/s11116-011-9367-4)

Thanks for listening  `r.lovelace at leeds.ac.uk`
</small>