Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Program 12.7: in R, SW^{AC} summaries do not match those in the book (nor in Stata or Python) #15

Open
rdiaz02 opened this issue May 19, 2021 · 0 comments

Comments

@rdiaz02
Copy link

rdiaz02 commented May 19, 2021

A minor issue that had me confused for a while. This concerns the R code for section 12.7.

  • In the calculations of SW^C and SW^{AC}, summaries of SW^C and SW^{AC} do not
    match those of Python or Stata, nor does the summary of SW^{AC} matches that
    given in the book (p. 159). In R we use all observations; in contrast, Python
    sets the SW^C of the censored to 1 by decree (and I think Sata does not use the
    censored when computing the SW^{C}).

  • Does it matter? Not for the final model fit. Since the censored observations
    are not used (of course: they have NA for the dependent variable) they are
    dropped when fitting the model in R. However, I find it confusing that this is
    implicit, not explicit and, moreover, the intermediate summaries of SWs are not
    correct.

  • What would I suggest?

    • Changing the computation of the SW^C, maybe as Python does or, even better
      (?), setting the censored to NA. E.g. nhefs$sw.c[nhefs$cens == 1] <- NA
    • In the model fit for msm.sw explicitly using only the subset of individuals
      that are not censored.

Details follow for SW^C and SW^{AC}. I am using this for the Python code:
https://github.com/jrfiedler/causal_inference_python_code/blob/master/chapter12.ipynb

############### SW^C
nhefs$sw.c <- pn.cens / pd.cens
## Does not match Python output (chunk 77). Compare max and mean
summary(nhefs$sw.c)


## The Python code (chunk 76) does the equivalent of
nhefs$sw.c2 <- nhefs$sw.c 
nhefs$sw.c2[nhefs$cens == 1] <- 1

## Compare to Python output: matches
summary(nhefs$sw.c2)
## (Though that is irrelevant, as Python will exclude the censored ones in the
## final fit)

## Stata also creates the sw_c variable, but only assigns for those uncensored 
## (I am Stata ignorant; guessing the meaning of the expression).


############# SW^{AC}

nhefs$sw <- nhefs$sw.c * nhefs$sw.a

## Summary of this variable does not match the book, nor the stata output nor the
##  Python output: max is 12.86, not 4.10 (same max in stata and Python)

summary(nhefs$sw)

## If we use only the uncensored, it matches Stata output and Python output (chunk
## 79)

summary(nhefs$sw[nhefs$cens == 0])


################ Why is the fit not affected?

## Because in this call
msm.sw <- geeglm(
  wt82_71 ~ qsmk,
  data = nhefs,
  weights = sw,
  id = seqn,
  corstr = "independence"
)

## wt82_71 has NA, precisely those censored. And the summary shows it is fitted
## for 1566.


##  Number of observations match Stata: see table of Stata
##  output at the bottom of the output

## In Python, the final model is not using the censored ones:
##     Chunk 80, expression sw_AC[nhefs_all.censored == 0]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant