ProblemSet2-template.Rmd

---
title: "Analysis on key factors affecting life satisfaction"
author: "Woolim Kim, Yena Joo, Guemin Kim"
date: "Oct 19, 2020"
output:
  pdf_document: default
  html_document:
    df_print: paged
---

```{r setup, include=FALSE}
library(dplyr)
library(tidyverse)
library("gridExtra")
library(knitr)
library("kableExtra")
# the data set was obtained from the CHASS website
# please download the submitted gss.csv in Quercus to the directory
data <- read_csv("gss.csv")
```

# Analysis on key factors affecting life satisfaction
*Code and data supporting this analysis is available at: https://github.com/Guemin/Problem_Set_2.git *

## Abstract

We use the 2017 General Social Survey(GSS) data obtained from the CHASS website to study and analyze some potential factors affecting Canadians' life satisfactions, as well as to observe the most significant factor among them. Linear model and adjusted $R^2$ values^[*adjusted $R^2$ is explained in "Model" section] are used to determine the significant factors, and plots are drawn to show the linear trend or to explain some weaknesses of the analysis. Through the analysis, We find positive linear relationships between a dependent variable, life_satisfaction rate and two independent variables self_rated_mental_health and self_rated_health, which means people who rate themselves as "mentally and/or physically healthy" are relatively more satisfied with their lives; on the other hand, there is no overall effects of hours worked on life satisfaction level. 
Together, these results suggest that no matter how much the income or working hours are, the most important factor that decides people's life satisfaction level is their mental health. 

## Introduction

There are many factors that determine one's well being and satisfaction of life. It could be health condition, economic status, relationships, religion, or any other element one would value or prioritize.   
Particularly, in these uncertain and unprecedented times caused by the COVID-19 pandemic, many people across the world are feeling more stressed due to the changes that the pandemic has brought into their lives and those concerns negatively influence their own well being.
Therefore, as the ongoing COVID-19 crisis reminds us of the importance of well being, our group decided to investigate and determine some factors that influence one's well being and satisfaction of life.

Throughout this report, we are going to determine and analyze some potential factors that affect one's life satisfaction, as well as to identify the most significant factor among them.  
To be more specific, in the following sections, we will use statistical methods to build a regression model of life satisfaction score by potential factors, and interpret the regression output to find relationships between the life satisfaction score and potential factors.  
This process includes cleaning the given data into a simpler, but efficient version, linear regression modeling, graphical visualizations and interpretations of the outputs.

## Data

The data set we chose for this assignment contains responses of the General Social Survey conducted in 2017. The contents of the survey include some characteristics of diverse families in Canada, their socio-economic status, as well as other subjective information such as the respondent's life satisfaction and health conditions. 

The target population in the GSS data includes all non-institutionalized persons 15 years of age and older, living in the 10 provinces of Canada. The frame population is everyone who is registered combining both landline and cellular with Statistics Canada’s address registers, and the sampled population is whoever is reached via telephone. The target population was divided into 27 strata by geographic areas, and simple random sampling without replacement of records was performed in each stratum (which means, from each stratum/group, everyone has an equal probability of being chosen).

Since we want to identify some key factors affecting one's life satisfaction as mentioned previously, the focus of our analysis will be "Health and subjective well-being". 

The reason for choosing the 2017 GSS is because it is the most recent^[As it is stated in the documentation of the GSS, one of the primary objectives of the General Social Survey is to monitor the well being of Canadians over time.
As a result, every survey conducted so far contain the responses related to the questions asking for the respondents' well being, and the 2017 GSS is the most recent survey with such responses.] survey that includes  the "Health and subjective well being" topic.

One of the characteristics of our data set is that the majority of the variables are categorical. However, some drawbacks of using such data is that there is a limit to the kinds of statistical analysis that we can use with our data, as well as numerical operations or quantitative analysis cannot be performed on such data.  
Not only this, but there are numbers of columns in our data set with too many NAs in them, and this indicates that many of the observations in certain variables are not available or missing. If we want to use such variables in our analysis, we would first need to exclude those 'NA' observations from our data; however, any results drawn from the data could possibly be biased or misleading due to the small number of observations available. 

Since the original data set contains too many variables that are not necessary, we are going to clean the data set prior to analysis by removing them. Also, we are going to look for non-responses in our variables of interest and simply remove them from our data; removing the non-responses would not influence the overall performance of the regression model since we still have 13007 observations left in the new data.

```{r include = F, echo = F, warning = F, message = F}
# life satisfaction data with some potential factors
life_satisfaction_data <- data %>% 
  select(caseid, feelings_life, self_rated_health, self_rated_mental_health, 
         income_family, average_hours_worked) %>% 
  # mutate categorical values in self_rated_health into numerical scores
  mutate(self_rated_health = case_when(self_rated_health == "Excellent" ~ 5, 
                                       self_rated_health == "Very good" ~ 4, 
                                       self_rated_health == "Good" ~ 3, 
                                       self_rated_health == "Fair" ~ 2, 
                                       self_rated_health == "Poor" ~1, 
                                       self_rated_health == "Don't know" ~ 0)) %>%
  # mutate categorical values in self_rated_mental_health into numerical scores
  mutate(self_rated_mental_health = case_when(self_rated_mental_health == "Excellent" ~ 5, 
                                              self_rated_mental_health == "Very good" ~ 4, 
                                              self_rated_mental_health == "Good" ~ 3, 
                                              self_rated_mental_health == "Fair" ~ 2, 
                                              self_rated_mental_health == "Poor" ~1, 
                                              self_rated_mental_health == "Don't know" ~ 0)) %>%
  # remove not-responded observations from the data
  filter(self_rated_health != 0, self_rated_mental_health != 0, 
         average_hours_worked != "Don't know") %>%
  # rename some varialbes
  rename(ID = caseid, satisfaction_score = feelings_life, family_income = income_family, work_hours = average_hours_worked)
```

These are the first six observations in our newly created data:
```{r echo = F, warning = F, message = F}
# preview of what our newly created data looks like
kable(head(life_satisfaction_data), "latex", booktabs = T, align = "c") %>%
column_spec(3:3, width = "2.58cm")
```

Our data contains 6 variables: ID, satisfaction_score, self_rated_health, self_rated_mental_health, family_income, and work_hours.  
Detailed descriptions on variables are provided in the footnote^[* satisfaction_score indicates the life satisfaction score on a scale of 0(very dissatisfied) to 10(very satisfied).  
\    \    \ * self_rated_health and self_rated_mental health are the physical and mental health ratings, respectively,  
\    \    \ on a scale of 1(poor) to 5(Excellent) given by the respondent.  
\    \    \ * work_hours indicates the average number of hours worked per week.].

Since we want to observe how the life satisfaction score is related to potential factors such as health or financial conditions, the response variable of our analysis will be satisfaction_score and the predictors will be the potential factors: self_rated_health, self_rated_mental_health, family_income, and work_hours.  

(Note: the scatter plot of the raw data is eliminated since the explanatory variables are categorical, which do not show a good visualization of the data)

\newpage
## Model

Now, we are going to fit a multiple linear regression model in order to find linear associations between satisfaction_score and other predictor variables: self_rated_health, self_rated_mental_health, family_income, and work_hours, using R software.  

Note, the equation for our regression line looks like this:  

satisfaction_score = $\hat{B_0}\ +\ \hat{B_1}*x_{health2}\ +\ \hat{B_2}*x_{health3}\ +\ \hat{B_3}*x_{health4}\ +\ \hat{B_4}*x_{health5}\ +\ \hat{B_5}*x_{mental2}\ +$  
$\hat{B_6}*x_{mental3}\ +\ \hat{B_7}*x_{mental4}\ +\ \hat{B_8}*x_{mental5}\ +\ \hat{B_9}*x_{income2}\ +\ \hat{B_{10}}*x_{income3}\ +\ \hat{B_{11}}*x_{income4}\ +$   
$\hat{B_{12}}*x_{income5}\ +\ \hat{B_{13}}*x_{income6}\ +\ \hat{B_{14}}*x_{work2}\ +\ \hat{B_{15}}*x_{work3}\ +\ \hat{B_{16}}*x_{work4}\ +\ \hat{B_{17}}*x_{work5}$


(The detailed descriptions on the x-variables are found in the footnote^[* $x_{health_{i}}$ is a physical health rating indicator for i from 2 to 5  
\   \   \ (i.e. $x_{health_{5}} = 1$ if the respondent's self_rated_health = 5, and $x_{health_{5}} = 0$ otherwise).  
\   \   \ * $x_{mental_{i}}$ is a mental health rating indicator for i from 2 to 5.  
\   \   \ * $x_{income_{i}}$ is an average income range indicator for i from 2 to 6  
\   \   \ (i.e. $x_{income_{2}} = 1$ if the family income is in the second category "\$25,000 to \$49,999", and $x_{income_{2}}\ =\ 0$ otherwise).  
\   \   \ * $x_{work_{i}}$ is a working hours range indicator for i from 2 to 5  
\   \   \ (i.e. $x_{work_{2}} = 1$ if the average hours of work is in the second category "0.1 to 29.9 hours", and $x_{work_{2}}\ =\ 0$ otherwise).].)

As it is shown above, we are going to have a very long equation for our regression line; however, this is inevitable since each of our predictor variables has several levels in them.

One thing we should notice in our data is that both the income and work_hours variables are categorical. To be more specific, we are given a range of values instead of an exact amount as income or average work hours in each variable. 
However, since our response variable is numerical, and we want to determine if each predictor has a linear relationship with it, we are going to treat our observations in both income and work_hours as numbers by replacing each category in income and work_hours with the midpoint. 

This process will allow us to investigate the linear relationships between the life satisfaction and the two predictors numerically; however, there is a chance where the true values for income or work_hours could be very different from the midpoint. Therefore, we need to take into account when interpreting the regression model, that the two predictor variables could be biased and so does the result.  

```{r include = F, echo = F, warning = F, message= F}
# replace each category in income with its midpoint
model_data <- life_satisfaction_data %>% 
  mutate(family_income = case_when(family_income == "$125,000 and more" ~ 125000.00, 
                                   family_income == "$100,000 to $ 124,999" ~ 112499.50, 
                                   family_income == "$75,000 to $99,999" ~ 87499.50, 
                                   family_income == "$50,000 to $74,999" ~ 62499.50, 
                                   family_income == "$25,000 to $49,999" ~37499.50, 
                                   family_income == "Less than $25,000" ~ 12500.00)) %>%
# replace each category in work_hours with its midpoint
  mutate(work_hours = case_when(work_hours == "0 hour" ~ 0,
                                work_hours == "0.1 to 29.9 hours" ~ 15.0,
                                work_hours == "30.0 to 40.0 hours" ~ 35.0,
                                work_hours == "40.1 to 50.0 hours" ~ 45.05,
                                work_hours == "50.1 hours and more" ~ 50.1))

# General Linear Regression Model
# for satisfaction_score by all predictor variables
satisfaction_lm <- lm(satisfaction_score ~ as.factor(self_rated_health) 
                      + as.factor(self_rated_mental_health) 
                      + as.factor(family_income)
                      + as.factor(work_hours), data = model_data)
```

After finding some relationships between our response and predictor variables using the regression model, we are going to identify which predictor is the most significant factor in explaining the variability in the response variable.
In other words, we are going to find out which of the potential factors (among physical health condition, mental health condition, income and average work hours) can explain the variability in life satisfaction the most.

Diagnostic issues in our model will be discussed in the "Weaknesses" section.

\newpage
## Results

Here is the summary of the multiple linear regression model:  

__Summary 1:__
```{r echo = F, warning = F, message = F}
# summary output of the regression model
result <- summary(satisfaction_lm)
# create a vector of estimates
estimates <- c(result$coefficients["(Intercept)","Estimate"], 
               result$coefficients["as.factor(self_rated_health)2", "Estimate"],
               result$coefficients["as.factor(self_rated_health)3", "Estimate"],
               result$coefficients["as.factor(self_rated_health)4", "Estimate"],
               result$coefficients["as.factor(self_rated_health)5", "Estimate"],
               result$coefficients["as.factor(self_rated_mental_health)2", "Estimate"], 
               result$coefficients["as.factor(self_rated_mental_health)3", "Estimate"], 
               result$coefficients["as.factor(self_rated_mental_health)4", "Estimate"], 
               result$coefficients["as.factor(self_rated_mental_health)5", "Estimate"],
               result$coefficients["as.factor(family_income)37499.5", "Estimate"], 
               result$coefficients["as.factor(family_income)62499.5", "Estimate"], 
               result$coefficients["as.factor(family_income)87499.5", "Estimate"], 
               result$coefficients["as.factor(family_income)112499.5", "Estimate"], 
               result$coefficients["as.factor(family_income)125000", "Estimate"], 
               result$coefficients["as.factor(work_hours)15", "Estimate"],
               result$coefficients["as.factor(work_hours)35", "Estimate"],
               result$coefficients["as.factor(work_hours)45.05", "Estimate"],
               result$coefficients["as.factor(work_hours)50.1", "Estimate"])
# create a vector of p-values
p_values <- c(result$coefficients["(Intercept)","Pr(>|t|)"], 
              result$coefficients["as.factor(self_rated_health)2","Pr(>|t|)"], 
              result$coefficients["as.factor(self_rated_health)3","Pr(>|t|)"],
              result$coefficients["as.factor(self_rated_health)4","Pr(>|t|)"],
              result$coefficients["as.factor(self_rated_health)5","Pr(>|t|)"],
              result$coefficients["as.factor(self_rated_mental_health)2","Pr(>|t|)"],
              result$coefficients["as.factor(self_rated_mental_health)3","Pr(>|t|)"],
              result$coefficients["as.factor(self_rated_mental_health)4","Pr(>|t|)"],
              result$coefficients["as.factor(self_rated_mental_health)5","Pr(>|t|)"],
              result$coefficients["as.factor(family_income)37499.5","Pr(>|t|)"],
              result$coefficients["as.factor(family_income)62499.5","Pr(>|t|)"],
              result$coefficients["as.factor(family_income)87499.5","Pr(>|t|)"],
              result$coefficients["as.factor(family_income)112499.5","Pr(>|t|)"],
              result$coefficients["as.factor(family_income)125000","Pr(>|t|)"],
              result$coefficients["as.factor(work_hours)15","Pr(>|t|)"],
              result$coefficients["as.factor(work_hours)35","Pr(>|t|)"],
              result$coefficients["as.factor(work_hours)45.05","Pr(>|t|)"],
              result$coefficients["as.factor(work_hours)50.1","Pr(>|t|)"])
# create a table of estimates and corresponding p-values of the test with null hypothesis that the slope is 0
kable(data.frame(coefficients = c("(Intercept)", "as.factor(self_rated_health)2", "as.factor(self_rated_health)3", 
                   "as.factor(self_rated_health)4", "as.factor(self_rated_health)5", "as.factor(self_rated_mental_health)2", 
                   "as.factor(self_rated_mental_health)3", "as.factor(self_rated_mental_health)4", 
                   "as.factor(self_rated_mental_health)5", "as.factor(family_income)37499.5", 
                   "as.factor(family_income)62499.5", "as.factor(family_income)87499.5", "as.factor(family_income)112499.5",
                   "as.factor(family_income)125000", "as.factor(work_hours)15", "as.factor(work_hours)35",
                   "as.factor(work_hours)45.05", "as.factor(work_hours)50.1"), estimates, p_values))
```

With the estimates from the regression output, we know that our regression line has a following equation:  

Satisfaction Score $= 3.44552\ +\ 0.51471*x_{health2}\ +\ 0.83475*x_{health3}\ +$   $0.98584*x_{health4}\ +\ 1.17522*x_{health5}\ +\ 1.63274*x_{mental2} +$   $2.69951*x_{mental3}\ +\ 3.24499*x_{mental4}\ +\ 3.72145*x_{mental5}\ +$   $0.11351*x_{income2}\ +\ 0.33408*x_{income3}\ +\ 0.35903*x_{income4}\ +$   $0.47951*x_{income5}\ +\ 0.52816*x_{income6}\ +\ 0.29608*x_{work1}\ +$   $0.23252*x_{work2}\ +\ 0.32056*x_{work3}\ +\ 0.39476*x_{work4}$  

(*Note, you can find the descriptions for  x-variables from the footnotes in the previous page.) 

As we can observe from the output, estimated satisfaction_score increases as each of self_rated_health, self_rated_mental_health and family_income increases; however, the slope estimates for work_hours are found to be quite inconsistent, because there is a decrease in slope estimates from 0.29608 to 0.23252 in the first two categories of work_hours, but they increase again in the third and fourth categories.  
Furthermore, unlike self_rated_health, self_rated_mental_health, and family_income where the p-values are much less than the significance level of 0.05, p-values of work_hours are greater than 0.05; this finding provides us with more evidence that there is no linear relationship between the average working hours and the life satisfaction score. 
Hence, the regression model suggests that only the physical health, mental health and financial conditions have positive linear relationships with the life satisfaction score.  

Now that we've found that the life satisfaction score has linear relationships with physical health, mental health, and financial conditions, we want to ask ourselves: are those factors equally important in terms of explaining the variability in satisfaction_score?  

The answer is 'No'.  
Although they all have positive linear relationships with the response variable, not all variables may contribute significantly in explaining the variability of the response variable.  
Hence, we would now like to identify the most important predictor variable in this regression model. 

There are two ways to do this^[One way is to compare the standardized regression coefficients, and the other way is to compare the increases in adjusted $R^2$ when each predictor is added to the regression model. For this analysis, specifically, we cannot use the first method, because our model contains categorical predictor variable which cannot be standardized.].
In this analysis, specifically, we are going to use the method where we compute and compare the changes in adjusted $R^2$ for the last variable added to the model^[$R^2_{adjusted}\ =\ 1-(1-R^2)*(\frac{n-1}{n-p-1})$ where n is the total sample size and p is the number of additional predictor variables. Hence, unlike $R^2$ that will be inflated as a new predictor is added to the model, regardless of its significance, if the newly added predictor does not explain variation in the response variable well, $R^2_{adjusted}$ will go down.]. This is a valid method for identifying which predictor explains the most variability in the response variable, because by the definition, $R^2$ gives the percentage of variation in the response variable explained by the regression line, and also, if the newly added predictor variable is the only difference between the two models, the associated change in 'adjusted $R^2$' will represent the 'goodness-of-fit'.  

We are going to begin with fitting a linear model with only one predictor, and then add another predictor to the model at a time to see how much $R^2_{adjusted}$ changes when each variable is added.

Key point in this method is to identify the predictor variable with the largest increase in $R^2_{adjusted}$ when it is the last variable added to the model.  

```{r include = F, echo = F, warning = F, message = F}
# Simple Linear Regression Model with predictor self_rated_health
first <- lm(satisfaction_score ~ self_rated_health, data = life_satisfaction_data)
# adjusted R-squared of the first model
first_R2 <- summary(first)$adj.r.squared

# self_rated_mental_health added to the first model
second <- lm(satisfaction_score ~ self_rated_health + self_rated_mental_health, data = life_satisfaction_data)
# adjusted R-squared of the second model
second_R2 <-summary(second)$adj.r.squared

# family_income added to the second model
third <- lm(satisfaction_score ~ self_rated_health + self_rated_mental_health + family_income, data = life_satisfaction_data)
# adjusted R-squared of the last model
third_R2 <- summary(third)$adj.r.squared
```

This is the table with $R^2_{adjusted}$ values in each model, and the change in $R^2_{adjusted}$ as more variables are added:  

__Table 1:__
```{r echo = F, warning = F, message = F}
# create a table with adjusted r-squared and its changes as more predictors added to the model
kable(data.frame(Predictors = c("self_rated_health", "self_rated_health + self_rated_mental_health", 
                      "self_rated_health + self_rated_mental_health + family_income"), 
       R_squared_adjusted = c(first_R2, second_R2, third_R2), 
       Change = c(first_R2, second_R2 - first_R2, third_R2 - second_R2)))
```

As it is shown in the table, there is a greatest increase in $R^2_{adjusted}$ when the second predictor, self_rated_mental_health is added to the model. This increase is quite close the $R^2_{adjusted}$ of our initial simple linear regression model with a predictor, self_rated_health.

On the other hand, the change in $R^2_{adjusted}$ is relatively small when family_income is added to the regression model compared to the previous changes, and therefore, we know that the family income does not contribute significantly in explaining the variation in the satisfaction score.
\newpage

## Discussion


The goal of this process is to find some factors that affect one's satisfaction on life. Data cleaning of the 2017 General Social Survey(GSS), obtained from the CHASS website, is done in the "Data" part, but there may be some biases since many responses are not available. Therefore, We focused more on the variables with less NA responses to do the analysis as accurate as possible. Further discussions on the biases that our data may contain can be found in the "Weaknesses" section.

Here is some graphical visualization of positive correlation of the explanatory variables used in the regression models and life satisfaction score:

```{r echo = F, warning = F, message = F, fig.width=12, fig.height=10.5}

#Create boxplots of the relationship between four factors(self_rated_mental_health, work_hours, self_rated_health, family_income) and the response variable(satisfaction_score).

coefs<- coef(lm(satisfaction_score ~ self_rated_mental_health, data = life_satisfaction_data)) 
plot1<-life_satisfaction_data %>% 
  ggplot(aes(x = as.factor(self_rated_mental_health), y = (satisfaction_score))) + geom_boxplot() +
  geom_abline(intercept = coefs[1], slope = coefs[2], color = "red") + 
  ggtitle("Figure 1. Boxplot of Satisfaction score vs. Self rated Mental health") +
  ylab("Satisfaction Score") +
  xlab("Self Rated Mental Health")

plot2<-life_satisfaction_data %>%
  ggplot(aes(x = as.factor(work_hours), y = (satisfaction_score))) + geom_boxplot() +
  ggtitle("Figure 2. Boxplot of Satisfaction score vs. Work hours ") +
  ylab("Satisfaction Score") +
  xlab("Work Hours") + scale_x_discrete(labels = c("0 hour" = "0", "0.1 to 29.9 hours" = "0.1-29.9", "30.0 to 40.0 hours" = "30-40", "40.1 to 50.0 hours" = "40.1-50", "50.1 hours and more" = "50.1~"))

coefs2<- coef(lm(satisfaction_score ~ self_rated_health, data = life_satisfaction_data))
plot3 <- life_satisfaction_data %>%
  ggplot(aes(x = as.factor(self_rated_health), y = (satisfaction_score))) + geom_boxplot() +
  geom_abline(intercept = coefs2[1], slope = coefs2[2], color = "red") + 
  ggtitle("Figure 3. Boxplot of Satisfaction score vs. Self rated Health ") +
  xlab("Self Rated Health") + 
  ylab("Satisfaction Score")

plot4 <- model_data %>%
  ggplot(aes(x = as.factor(family_income), y = (satisfaction_score))) + geom_boxplot() +
  ggtitle("Figure 4. Boxplot of Satisfaction score vs. Family income ") +
  xlab("Family Income") + 
  ylab("Satisfaction Score")

#Put boxplots in one place  
grid.arrange(grobs = list(plot1, plot2, plot3, plot4), ncol = 2, nrow = 2)   

```
  
In Figure 1, The boxplot has a positive linear relationship between the two variables, mental health and life satisfaction. It signifies that average of life satisfaction score increases as the self rated mental health increases.

Figure 2 shown above is the boxplot of satisfaction_score and work_hours. Work_hours variable includes 0 hour to 50 hours and more, which means that both unemployed and employed are included in this category. As it was mentioned previously, employed hours show no positive or negative relation with the satisfaction score signifying that there is no significant relationship with satisfaction score.

Figure 3 is a boxplot of satisfaction_score and self_rated_health, and it shows a positive linear relationship between the two variables. The slope of the red line shows weaker change in 1 unit of health rate than what shows in figure 1, which is mentioned previously that self_rated_health is not as well related as self_rated_mental_health.

Figure 4 shows a consistent median of satisfaction score over all income ranges, which shows family_income and satisfaction_score do not have a significant linear relationship. The boxplot suggests that family income does not contribute much in explaining the variation in satisfaction score. 

From Table 1 and Summary 1 in the "Result" section, we find that physical health also has a significant positive linear relationship (using estimates and p-value, and r-squared). Which explains, that physical and mental health are the two major factors that affect life satisfaction. However there is no strong relationship shown between income, hours worked and life satisfaction(from summary 1 using p-values and estimates). Hence, the result suggests that both physical and mental health are significant predictors. Between the two variables, mental health is a more important factor than physical health in terms of explaining the variability of the life satisfaction score.


# Weaknesses

Every statistic data study and analysis includes some biases and weaknesses. One of our weakness is that the predictor variables are categorical variable, which made it challenging to visualize the linear regressions model we found and show the significance between the life satisfaction score and other factors. 
In order to overcome these problems, we replaced the ranges given in the categorical predictor variables with their midpoints.  
This allowed us to perform quantitative analysis on our data; however, we still need to take potential biases into account when interpreting the result, because the true values for the categorical predictor variables could be far away from their newly assigned midpoints.  

Another possible weakness is that we do not know if there is any omitted variable bias. Omitted variable should be correlated with the dependent variable, and correlated with the explanatory variables included in the model. There might be an important variable that would affect the model, but it is hard to figure out since the variable might be missing in our data set, or might be impossible to measure.

Also, the units of self_rated_health and self_rated_mental_health are based on different standards of the participants and might not be a reliable measurement to analyze. There might be some errors in the measurement of the variables since it is not a measurable unit.
  

```{r echo = FALSE, message= FALSE, warning= FALSE}
# create a linear model of satisfaction score by mental health rating
mod<- lm(satisfaction_score ~ self_rated_mental_health, data = life_satisfaction_data)

# create a normal qqplot of the residuals in our model
X <- resid(mod)
qqnorm(X, main = "Figure 5. Normal Q-Q Plot")
qqline(X, col = 'red')
```
The normal QQ plot shown above graphically analyze the residuals in our regression model.  
The x-axis represents the quantiles for the standard normal distribution and the y-axis are the data points for the residuals.  
As we can see in the plot, the data points do not trend the theoretical line, and the points at each tail of the data seem to fall off the line, revealing that the distribution of residuals may have long tails.  
Normality is one of the assumptions in the linear regression model. However, the normal QQ plot above suggests that our model does not satisfy the normality assumption on the error terms. Therefore, we need to take into account that the result drawn from the regression model could be misleading or biased.


# Next Steps

For the next steps, we could do a follow up survey on the related topic (life satisfaction vs mental and physical health), and compare the data of the prior and post COVID-19. We could see how people's mental health changed due to COVID-19, the life satisfaction rate is expected to decrease in this case. Also, we could seek for interaction effects between few independent variables and do ANOVA tests to figure out if there is any independent variable that is dependent to another independent variable. (ie) how age and mental health rate could be interactive).

Furthermore, since we have found in the previous section that the normality assumption was violated in our model, we could fit other models that satisfy all of the required assumptions on linear regression model, and find the best model among them.

\newpage

## References

1. 2017 GSS Data: 
General Social Survey, Cycle 31: 2017: Family. (n.d.). Retrieved October 16, 2020, from https://sda-artsci-utoronto-ca.myaccess.library.utoronto.ca/cgi-bin/sda/hsda?harcsda4+gss31

2. 2017 GSS Data Documentation:
General Social Survey Cycle 31 : Families - Public Use Microdata File Documentation and User’s Guide. (2017). Retrieved October 17, 2020, from https://sda-artsci-utoronto-ca.myaccess.library.utoronto.ca/sdaweb/dli2/gss/gss31/gss31/more_doc/GSS31_User_Guide.pdf

3. Data Cleaning Code: 
Alexander, Rohan, and Sam Caetano.( 2019, Sept 16). "gss_cleaning.R”. Retrieved Oct 10. 2020, from
https://www.tellingstorieswithdata.com/01-03-r_essentials.html

4. Identifying the most significant predictor: 
Frost, J., Adusei, C., Peeyush, Siyabonga, Sreeja, Narayanan, J., . . . Sachin. (2019, June 13). Identifying the Most Important Independent Variables in Regression Models. Retrieved October 16, 2020, from https://statisticsbyjim.com/regression/identifying-important-independent-variables/

5. QQ normal plot code:
Sheather, S. J. (2010). A Modern approach to regression with R. New York: Springer.

6. Plotting multiple plots on one page:
R Draw Multiple ggplot2 Plots Side-by-Side (Example): Plot on One Page. (2020, September 30). Retrieved October 17, 2020, from https://statisticsglobe.com/draw-multiple-ggplot-plots-side-by-side

7. Knitr package:
Xie, Y. (n.d.). Knitr v1.30. Retrieved October 19, 2020, from https://www.rdocumentation.org/packages/knitr/versions/1.30

8. Wrapping column name in pdf table using package "kableExtra": 
Feder, A. (2018, April 10). Wrap column name in pdf table, from knitr::kable. Retrieved October 19, 2020, from https://community.rstudio.com/t/wrap-column-name-in-pdf-table-from-knitr-kable/3278/4  (apa6 )