|
| 1 | +--- |
| 2 | +title: "Homework 11" |
| 3 | +output: word_document |
| 4 | +--- |
| 5 | + |
| 6 | +```{r setup, include=FALSE} |
| 7 | +knitr::opts_chunk$set(echo = TRUE) |
| 8 | +``` |
| 9 | + |
| 10 | +# Assignment for Biostatistics Week 11: |
| 11 | + |
| 12 | +Download and open the “apoptosis” data from the class webpage. Answer the following questions referring to this dataset. |
| 13 | + |
| 14 | +This data summarizes an in vitro study of patients with HIV taking IL-7. The main measurement of interest was the percent of cells undergoing apoptosis (cell death), over time. That measurement has been summarized in two ways: betahat6, a measure of the trend of the change over time for each person, and avdiff6, a measure of the average change over time. |
| 15 | + |
| 16 | +```{r} |
| 17 | +library(readr) |
| 18 | +apoptosis <- read_csv("Papers/Biostatistics JHU 2021/apoptosis.csv") |
| 19 | +``` |
| 20 | + |
| 21 | +1. Make a scatterplot of betahat6 against CD4 using plot( ) |
| 22 | + |
| 23 | +```{r} |
| 24 | +
|
| 25 | +plot(apoptosis$betahat6 ~ apoptosis$CD4) |
| 26 | +
|
| 27 | +``` |
| 28 | + |
| 29 | +a. Compute the value of the Pearson Correlation Coefficient and test whether this is =0 using cor.test( ). What is your p-value? What can you say? |
| 30 | + |
| 31 | +```{r} |
| 32 | +
|
| 33 | +cor.test(apoptosis$CD4, apoptosis$betahat6) |
| 34 | +
|
| 35 | +
|
| 36 | +``` |
| 37 | +*The p-value is above .05, so the correlation is not statistically significant. We cannot reject the null hypothesis that ρ = 0.* |
| 38 | + |
| 39 | +b. Now find the Spearman Correlation Coefficient based on the ranks, using the same function but specifying method= “spearman”. What is the value of Spearman’s ρ? What is your p-value? Using this, are these variables significantly correlated? |
| 40 | + |
| 41 | +```{r} |
| 42 | +#Enter code here |
| 43 | +
|
| 44 | +cor.test(apoptosis$CD4, apoptosis$betahat6, method = "spearman") |
| 45 | +
|
| 46 | +
|
| 47 | +``` |
| 48 | +*The p-value is statistically significant. Spearman is the more appropriate test here because it is based on ranks and is therefore less sensitive to the outlier.* |
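The contrast between the two coefficients can be sketched on simulated data. This is only an illustration: the vectors `x` and `y` below are made up and are not the apoptosis measurements. One extreme point is placed against an otherwise rising trend to show how differently the two methods react.

```r
# Simulated data (NOT the apoptosis data): a rising trend plus one extreme point.
set.seed(1)
x <- c(1:15, 60)                        # the last x value lies far from the rest
y <- c(1:15 + rnorm(15, sd = 2), -20)   # its y value goes against the trend

pearson  <- cor(x, y, method = "pearson")   # computed from the raw values
spearman <- cor(x, y, method = "spearman")  # computed from the ranks only

print(c(pearson = pearson, spearman = spearman))
```

In this sketch the single extreme point dominates the Pearson coefficient, while the rank-based Spearman coefficient still reflects the trend of the other fifteen points.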
| 49 | + |
| 50 | +c. Use lm( ) to fit a regression line to betahat6 predicted by CD4. Write down the regression equation for this fit line: |
| 51 | + |
| 52 | +```{r} |
| 53 | +#Enter code here |
| 54 | +summary(lm(formula = apoptosis$betahat6 ~ apoptosis$CD4)) |
| 55 | +
|
| 56 | +
|
| 57 | +``` |
| 58 | + |
| 59 | +d. Is the slope of the line significantly different than 0? What is your p-value? |
| 60 | + |
| 61 | +*No, because the p-value is greater than .05. It is the same p-value as the Pearson test, which is always the case in a simple linear regression with a single predictor.* |
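A quick check of why the two p-values agree, using simulated data rather than the apoptosis file (the variables `x` and `y` are assumptions made for illustration). Both tests use the same t statistic with n - 2 degrees of freedom, so the values should match up to rounding.

```r
# Simulated data, for illustration only.
set.seed(42)
x <- rnorm(30)
y <- 0.3 * x + rnorm(30)

# p-value from the Pearson correlation test
p_pearson <- cor.test(x, y)$p.value

# p-value for the slope in a simple linear regression of y on x
p_slope <- summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]

print(c(pearson = p_pearson, slope = p_slope))
print(all.equal(p_pearson, p_slope))
```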
| 62 | + |
| 63 | +e. Plot the residuals versus fitted values. Describe what you see. Does the line seem to fit well, based on this? |
| 64 | + |
| 65 | +```{r} |
| 66 | +fit1 <- lm(apoptosis$betahat6 ~ apoptosis$CD4) |
| 67 | +
|
| 68 | +names(fit1) |
| 69 | +plot(fit1$fitted.values, fit1$residuals) |
| 70 | +abline(h=0) |
| 71 | +plot(apoptosis$betahat6 ~ apoptosis$CD4) |
| 72 | +abline(fit1) |
| 73 | +
|
| 74 | +
|
| 75 | +``` |
| 76 | + |
| 77 | +f. Find the R2 using the output of the linear regression. If you square root this, what number do you get? |
| 78 | + |
| 79 | +*Multiple R-squared: 0.08073; sqrt(.08073) = 0.2841303, which is the absolute value of the Pearson correlation coefficient from part a.* |
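The link between R² and the Pearson coefficient can be verified on simulated data (again, `x` and `y` are made up for this sketch). In a regression with one predictor, the square root of R² equals the absolute value of r, so the sign of the correlation is lost.

```r
# Simulated data with a negative relationship, for illustration only.
set.seed(7)
x <- rnorm(25)
y <- -0.5 * x + rnorm(25)

r         <- cor(x, y)                     # Pearson correlation, keeps its sign
r_squared <- summary(lm(y ~ x))$r.squared  # R-squared from the regression output

print(c(r = r, sqrt_r_squared = sqrt(r_squared)))
```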
| 80 | + |
| 81 | + |
| 82 | +2. Notice that there is one large outlier, with the highest CD4 count in the data. Exclude that one value, and recompute: |
| 83 | +a. The Pearson Correlation coefficient and p-value. |
| 84 | + |
| 85 | +```{r} |
| 86 | +#Enter code here |
| 87 | +#use subset() or square brackets - data point. Look at past homework. |
| 88 | +no_outlier <- subset(apoptosis, CD4<=800, select = CD4) |
| 89 | +no_outlier2 <- subset(apoptosis, CD4<=800, select = betahat6) # same row filter as no_outlier so the two vectors stay paired |
| 90 | +cor.test(no_outlier$CD4, no_outlier2$betahat6) |
| 91 | +
|
| 92 | +``` |
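One way to exclude a point while keeping the two variables paired is to filter the rows of the data frame once and then use columns of that single result. The sketch below uses a small made-up data frame called `toy` as a stand-in; its values are hypothetical and are not taken from the apoptosis file.

```r
# Hypothetical stand-in for the apoptosis data; the values are made up.
toy <- data.frame(
  CD4      = c(120, 250, 310, 400, 520, 1500),
  betahat6 = c(0.5, 1.1, 0.9, 1.8, 2.2, 0.1)
)

# Drop the row with the largest CD4; every column is filtered together,
# so CD4 and betahat6 stay matched row by row.
trimmed <- subset(toy, CD4 < max(CD4))
# Equivalent with square brackets: toy[toy$CD4 < max(toy$CD4), ]

print(nrow(trimmed))
print(cor.test(trimmed$CD4, trimmed$betahat6))
print(lm(betahat6 ~ CD4, data = trimmed))
```

Filtering each column with a different condition can leave vectors of different lengths, or vectors whose rows no longer correspond to the same patient.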
| 93 | + |
| 94 | +b. The Spearman Correlation coefficient and p-value. |
| 95 | + |
| 96 | +```{r} |
| 97 | +
|
| 98 | +cor.test(no_outlier$CD4, no_outlier2$betahat6, method = "spearman") |
| 99 | +
|
| 100 | +``` |
| 101 | + |
| 102 | +c. The regression equation and p-value. |
| 103 | + |
| 104 | +```{r} |
| 105 | +#Enter code here |
| 106 | +
|
| 107 | +summary(lm(formula = no_outlier2$betahat6 ~ no_outlier$CD4)) |
| 108 | +
|
| 109 | +``` |
| 110 | + |
| 111 | +d. The plot of residuals versus fitted values |
| 112 | + |
| 113 | +```{r} |
| 114 | +fit2 <- lm(no_outlier2$betahat6 ~ no_outlier$CD4) |
| 115 | +names(fit2) |
| 116 | +plot(fit2$fitted.values, fit2$residuals) |
| 117 | +abline(h=0) |
| 118 | +
|
| 119 | +plot(no_outlier2$betahat6 ~ no_outlier$CD4) |
| 120 | +abline(fit2) |
| 121 | +
|
| 122 | +``` |
| 123 | + |
| 124 | + |
| 125 | +Which of these changed substantially? Why? |
| 126 | + |
| 127 | +*The Pearson and Spearman results changed substantially, because excluding the outlier removed the one point that was pulling the fit in one direction.* |
| 128 | + |
| 129 | + |
| 130 | +Other Homework: Chapter 17 #1, #3, #7; Chapter 18 #4, #9 |
| 131 | + |
| 132 | +17.1 When you are investigating the relationship between two continuous random variables, why is it important to create a scatter plot of the data? |
| 133 | +*A scatter plot lets you see the individual points, judge whether the relationship looks linear, and determine whether there are any outliers.* |
| 134 | + |
| 135 | +17.3 How does Spearman's rank correlation differ from the Pearson correlation? |
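| 136 | +*The Pearson correlation is computed from the observed values and measures linear association, so it is most appropriate for measurements taken from an interval scale. The Spearman correlation is computed from the ranks, so it is more appropriate for measurements taken from ordinal scales and is less sensitive to outliers.* |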
| 137 | + |
| 138 | +17.7. The data set lowbwt contains information collected for a sample of 100 low birth weight infants born in two teaching hospitals in Boston, Massachusetts [5] (Appendix B, Table B.7). Measurements of systolic blood pressure are saved under the variable name sbp, and values of the Apgar score recorded five minutes after birth, an index of neonatal asphyxia or oxygen deprivation, are saved under the name apgar5. The Apgar score is an ordinal random variable that takes values between 0 and 10. |
| 139 | + |
| 140 | +(a) Estimate the correlation of the random variables systolic blood pressure and five-minute Apgar score for this population of low birth weight infants. |
| 141 | +```{r} |
| 142 | +library(readr) |
| 143 | +lowbwt <- read_csv("Papers/Biostatistics JHU 2021/lowbwt.csv") |
| 144 | +plot(lowbwt$sbp ~ lowbwt$apgar5) |
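| 145 | +cor.test(lowbwt$sbp, lowbwt$apgar5, method = "spearman", exact = FALSE) # correlation test rather than a t-test; Spearman because Apgar is ordinal |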
| 146 | +
|
| 147 | +``` |
| 148 | + |
| 149 | +(b) Does Apgar score tend to increase or decrease as systolic blood pressure increases? |
| 150 | +*Apgar score appears to increase as systolic blood pressure increases.* |
| 151 | + |
| 152 | +(c) Test the null hypothesis H0 : ρ = 0. What do you conclude? |
| 153 | +*We can reject the null hypothesis, because the p-value is less than .05.* |
| 154 | + |
| 155 | +18.4 Why is it dangerous to extrapolate an estimated linear regression line outside the range of the observed data values? |
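| 156 | +*The linear relationship has only been observed within the range of the data. Outside that range the relationship may no longer be linear, so predictions made there are unreliable.* |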
| 157 | + |
| 158 | +18.9. The data set lowbwt contains information for the sample of 100 low birth weight infants born in Boston, Massachusetts. Measurements of systolic blood pressure are saved under the variable name sbp, and values of gestational age under the name *gestage*. |
| 159 | + |
| 160 | +(a) Construct a two-way scatter plot of systolic blood pressure versus gestational age. Does the graph suggest anything about the nature of the relationship between these variables? |
| 161 | +```{r} |
| 162 | +plot(lowbwt$sbp ~ lowbwt$gestage) |
| 163 | +
|
| 164 | +summary(lm(formula = lowbwt$sbp ~ lowbwt$gestage)) |
| 165 | +fit3 <- lm(lowbwt$sbp ~ lowbwt$gestage) |
| 166 | +names(fit3) |
| 167 | +plot(fit3$fitted.values, fit3$residuals) |
| 168 | +abline(h=0) |
| 169 | +plot(lowbwt$sbp ~ lowbwt$gestage) |
| 170 | +abline(fit3) |
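| 171 | +predict(lm(sbp ~ gestage, data = lowbwt), newdata = data.frame(gestage = 31), interval = "confidence") # estimated mean sbp at 31 weeks with its 95% CI |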
| 172 | +
|
| 173 | +``` |
| 174 | +*The graph suggests that infants with gestational ages between 29 and 31 weeks may tend to have higher sbp.* |
| 175 | + |
| 176 | +(b) Using systolic blood pressure as the response and gestational age as the explanatory variable, compute the least-squares regression line. Interpret the estimated slope and y-intercept of the line; what do they mean in words? |
| 177 | +*As gestational age increases, the infants' sbp seems to stay around 40, but the points are scattered around the line.* |
| 178 | + |
| 179 | +(c) At the 0.05 level of significance, test the null hypothesis that the true population slope β is equal to 0. What do you conclude? |
| 180 | +*We can reject the null hypothesis because the p-value is below .05.* |
| 181 | + |
| 182 | +(d) What is the estimated mean systolic blood pressure for the population of low birth weight infants whose gestational age is 31 weeks? |
| 183 | +*The mean is 28.89* |
| 184 | + |
| 185 | +(e) Construct a 95% confidence interval for the true mean value of systolic blood pressure when x = 31 weeks. |
| 186 | +*The confidence interval is between 15.87472 and 20.50528.* |
| 187 | + |
| 188 | +(f) Suppose that you randomly select a new child from the population of low birth weight infants with gestational age 31 weeks. What is the predicted systolic blood pressure for this child? |
| 189 | +*The predicted sbp for this infant will be close to 40.* |
| 190 | + |
| 191 | +(g) Construct a 95% prediction interval for this new value of systolic blood pressure. |
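A minimal sketch of how parts (d) through (g) could be computed with `predict()`. The data frame `toy` is simulated and only borrows the column names `sbp` and `gestage`; its numbers are made up and say nothing about the real lowbwt values. The sketch assumes the model is fitted with column names and a `data =` argument, because `predict()` can only match `newdata` to the predictor when the formula is written that way.

```r
# Simulated stand-in for lowbwt; the relationship below is invented.
set.seed(3)
toy <- data.frame(gestage = sample(23:35, 100, replace = TRUE))
toy$sbp <- 10 + 1.3 * toy$gestage + rnorm(100, sd = 11)

# Fit with column names and data =, so that newdata works in predict()
fit <- lm(sbp ~ gestage, data = toy)
new_infant <- data.frame(gestage = 31)

# 95% confidence interval for the MEAN sbp at 31 weeks
mean_ci <- predict(fit, newdata = new_infant, interval = "confidence", level = 0.95)

# 95% prediction interval for ONE new infant at 31 weeks
pred_int <- predict(fit, newdata = new_infant, interval = "prediction", level = 0.95)

print(mean_ci)
print(pred_int)
```

Both calls return the same fitted value in the `fit` column. The prediction interval is wider than the confidence interval because it also accounts for the variability of an individual infant around the mean.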
| 192 | + |
| 193 | +(h) Does the least-squares regression model seem to fit the observed data? Comment on the coefficient of determination and a plot of the residuals versus the fitted values of systolic blood pressure. |
| 194 | +*The least-squares regression model seems to fit the observed data, based on the residuals we calculated and a comparison of them with the sbp values in the graph.* |
| 195 | + |