Skip to content

Latest commit

 

History

History
1648 lines (1401 loc) · 113 KB

logit.md

File metadata and controls

1648 lines (1401 loc) · 113 KB

Logistic Regression: will people born today will have a better life than their parents?

course: Data Analysis in Sociology
author: Ekaterina Skriptsova
date: 2020-02-15


introduction

Do developers think that people born today will have a better life than their parents?

Well, this work examines exactly this question using Stack Overflow Developer Survey 2019, which covers various aspects of developers' life. In this report, I focus on 6 variables: Age, Gender, Region, SocialMedia, BlockchainIs, and BetterLife -- the outcome.

why those?

Age and Gender may seem as obvious ones, as representatives of different genders and generations sometimes have different attitudes to life.

Social media may be used to develop a sense of belonging and connectedness, which can improve wellbeing and perhaps make people more optimistic about the future of children. I expected that the belief in happy future will differ depending on the social media that is used the most due to different "nature" of the services — some of them are just messengers allowing people to chat with friends, some also give an opportunity to read news, while social networking sites do all of that — show news, happy friends, some with perfect lives. Some also are the so-called superapps, e.g. WeChat, that include a whole set of different features in one application. Unfortunately, in this dataset we do not know the original reason of why and for what people use the social media (news, chat, etc.), so let's just see whether it differs. In sum, the news feeds here are different, and maybe sometimes the less you know — the better?

Region -- originally created from the Country variable, was added because both countries and regions do differ economically.

BlockchainIs was the last variable added to the analysis, suggested by my colleague A. Nikulin. This is a developing technology, which probably may inspire the developers and make them believe in a brighter future. New tech - better life?

Does the belief about better life of youth differ depending on the selected predictors?

variables

The outcome variable is BetterLife: "Will People Born Today Have a Better Life Than Their Parents?".

As for the predictors, the possible ones are:

  • Region (constructed from Country)
  • Gender
  • SocialMedia
  • BlockchainIs
  • Age

Final equation in general form: $$log(\frac{p(BetterLife_Yes)}{p(1 - BetterLife_Yes)}) = \beta_0 + \beta_1Region + \beta_2BlockchainIs + \beta_3Gender + \beta_4Age$$

df = res %>% dplyr::select(Country, SocialMedia, Gender, Age, BlockchainIs, BetterLife)
df = remove_attributes(df, "spec") 
df = df %>% dplyr::filter(!is.na(BetterLife))
factor_vars = c("Country", "SocialMedia", "Gender", "BlockchainIs",
                "BetterLife")
df[factor_vars] = data.frame(lapply(df[factor_vars], to_factor))
df$Region = countrycode::countrycode(df$Country, origin = "country.name", destination =  "region")

exploratory data analysis

In this section, exploratory data analysis is performed in order to see how the selected variables are distributed, which should be further put into the model and may be important overall. Firstly, I make sure that variables of interest do not have too much missing data, and as a result - drop several ideas due to this fact. Then proceed with plotting the variables to check within which ones the percentage of those who believe that life for children will be better differs.

missing data

At the data loading stage, I've deleted observations that did not have label for the outcome variable (BetterLife). Now let's see whether other variables have missing values, and if their amount is > 15% - try to remove them. Otherwise - think what can be done about that.

As you can see from the following plot, BlockchainIs has ~30% of missings (similar to other predictors in full dataset), but without this one the models will be worse and less interesting, so I'll keep it. Uh.

Another possible predictor that has many missing values is Age. This variable plays an important role in deciding whether people born today will have a better life than parents, according to publications on the Internet [1]. In total, approximately 10% of 86269 values are missing here. Thus, let's keep the variable but drop guys with unknown age.

predictors

After removal of missing values in the previously mentioned variables, the dataset contains 50 841 observations. Now the data has the following variables, which were used to create several models:

feature description
BetterLife Do you think people born today will have a better life than their parents?
Gender Which of the following do you currently identify as?
Age What is your age (in years)?
SocialMedia What social media site do you use the most?
Region In which region (orig. - country) do you currently reside?
BlockchainIs Blockchain / cryptocurrency technology is primarily:

And basic descriptive statistics:

## The data contains 50841 observations of the following variables:
##   - SocialMedia: 14 levels: Facebook (n = 7402); Hello (n = 4); I don't use social media (n = 3099); Instagram (n = 3627); LinkedIn (n = 2456); Reddit (n = 10283); Snapchat (n = 416); Twitter (n = 7657); VK ВКонта́кте (n = 326); WeChat 微信 (n = 337); Weibo 新浪微博 (n = 32); WhatsApp (n = 7533); Youku Tudou 优酷 (n = 5) and YouTube (n = 7664)
##   - Gender: 7 levels: Man (n = 47179); Man;Non-binary, genderqueer, or gender non-conforming (n = 117); Non-binary, genderqueer, or gender non-conforming (n = 330); Woman (n = 3055); Woman;Man (n = 42); Woman;Man;Non-binary, genderqueer, or gender non-conforming (n = 27) and Woman;Non-binary, genderqueer, or gender non-conforming (n = 91)
##   - Age: Mean = 30.43, SD = 8.92, range = [1, 99], 0 missing
##   - BlockchainIs: 5 levels: A passing fad (n = 8596); An irresponsible use of resources (n = 7822); Useful across many domains and could change many aspects of our lives (n = 14805); Useful for decentralized currency (i.e., Bitcoin) (n = 6086) and Useful for immutable record keeping outside of currency (n = 13532)
##   - BetterLife: 2 levels: No (n = 18205) and Yes (n = 32636)
##   - Region: 20 entries: Northern America, n = 15330; Western Europe, n = 7697; Northern Europe, n = 6314 and 17 others

Let's proceed & see how these are distributed.

outcome

As stated in the task, the target variable is BetterLife - "Will People Born Today Have a Better Life Than Their Parents?". The users are quite optimistic about that: 64% think that life will improve.
Maybe they are working on new innovative technologies that will make the world a better place? Who knows! 😎

gender

In the full dataset, there are 7 unique options for Gender, however, in the Stack Overflow report they are split into:

(1) Man,
(2) Woman,
(3) Non-binary, genderqueer, or gender non-conforming.

Seems reasonable to join the levels into these three. The majority of respondents are male - approximately 93%; females constitute 6%, and non-binary, genderqueer, or gender non-conforming - 1%.

##                                                              BetterLife
## Gender                                                           No   Yes
##   Man                                                         16504 30675
##   Man;Non-binary, genderqueer, or gender non-conforming          52    65
##   Non-binary, genderqueer, or gender non-conforming             169   161
##   Woman                                                        1406  1649
##   Woman;Man                                                       7    35
##   Woman;Man;Non-binary, genderqueer, or gender non-conforming    12    15
##   Woman;Non-binary, genderqueer, or gender non-conforming        55    36

The following figure shows that the highest percentage of those thinking that life will be better is among males (65%), followed by females (54%) which is 11% less compared to representatives of the opposite gender. The latter - Non-binary, genderqueer, or gender non-conforming (51.4%).

age

As mentioned in the introduction, people of different age groups have different opinion towards the life improvement. Does the relationship differ in Stack Overflow data?

First of all, Age variable contains several outliers: some respondents are just 1 year old & are already coding - definitely a mistake. Better to drop these guys and leave the developers aged from 14 to 80. Now there are 50 699 observations.

##    vars     n  mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 50841 30.43 8.92     29   29.52 7.41   1  99    98 1.19     2.57 0.04

region

Caribbean,Central Asia,Middle Africa and Western Africa have relatively small number of observations in several cells.
Based on the map & Google, I united Middle Africa with Western Africa, and Central Asia - with Western Asia because sometimes it is represented as "Western & Central Asia". Melanesia was also united with Australia and New Zealand. Caribbean were left as is because it differs from the nearby regions and is located somewhere between.

##                                        BetterLife
## Region                                    No  Yes
##   Caribbean                               20   80
##   Central America                        158  398
##   Eastern Africa                          63  233
##   Eastern Asia                           220  632
##   Eastern Europe                        1033 3070
##   Melanesia & Australia and New Zealand  562 1002
##   Middle & Western Africa                 63  293
##   Northern Africa                         72  219
##   Northern America                      5865 9406
##   Northern Europe                       2477 3818
##   South America                          522 1569
##   South-Eastern Asia                     199  733
##   Southern Africa                        151  275
##   Southern Asia                         1533 3755
##   Southern Europe                       1200 1955
##   Western & Central Asia                 471  965
##   Western Europe                        3558 4129

The most optimistic people are in Middle & Western Africa (82%), Caribbean (80%), Eastern Africa (78.7%) and South-Eastern Asia (78.5%). The least - among the developers in Southern (62%), Northern (61.6%) and Western Europe (53.7%). Wow. In comparison, 74.8% of those in Eastern Europe, including Russia, believe that youth will have a better life than their parents. Developers in Northern America (e.g. USA) are also not sure that life will improve: only 61.6% of them agreed.

social media

SocialMedia shows the respondent's most frequently used social media. The most popular here are: Facebook, WhatsApp, Twitter, Reddit, YouTube; 3082 users do not use social media at all.

Overall, Facebook & VK are social networking sites, that have both quite rich newsfeeds and communication tools. LinkedIn - job-related stuff. Reddit, Snapchat, Instagram, Youtube & Youku Tudou can be grouped into entertainment services, however, some of them also may act as a place for discussions (1st one), and the format slightly differs - the latter two are video-hosting services. WhatsApp - a messenger owned by Facebook; WeChat is a Chinese super-app, which includes a lot of features from messaging, social media to mobile payment app. Twitter - mostly about microblogging (and posting sad stuff), Weibo - also microblogging website but popular in China.

Hello is also a real social network, however, only 4 people frequently use it. Maybe they are the developers of this site? Spoiler: this level was not significant, but should we keep it or drop it? Joining it with Weibo and Youku Tudou did not make much difference, so it's better to delete these several observations.

##                           BetterLife
## SocialMedia                  No  Yes
##   Facebook                 2533 4862
##   Hello                       0    4
##   I don't use social media 1457 1625
##   Instagram                1249 2365
##   LinkedIn                  896 1558
##   Reddit                   3580 6682
##   Snapchat                  125  288
##   Twitter                  3080 4571
##   VK ВКонта́кте               56  268
##   WeChat 微信                71  263
##   Weibo 新浪微博              6   26
##   WhatsApp                 2714 4813
##   Youku Tudou 优酷            1    3
##   YouTube                  2399 5204

The highest percentage of those thinking that people born today will have a better life than their parents is among users of VK 💙 - 82.7% selected "Yes", which is quite understandable given that most of them come from Eastern Europe (e.g. Russia), where situation was not the best one and now has improved. This is followed by the users of WeChat: 78.7% of them are sure. This application is widely popular in China, so this result may be related to the country's growth [2].

Those who frequently use worldwide SNs and entertainment services, such as Snapchat, YouTube, Instagram, Facebook and Reddit - have from 69 to 63% of believers in better life. In particular, for FB users figure is 65.75%, which is by ~17% lower compared to VK. Perhaps, this is due to Facebook being used by people all around the world, and the attitude may differ depending on the country.

Twitter, as I expected, has one of the lowest percentages - only 59.74% of those who use it frequently have optimistic thoughts about the future. And approximately half of those who do not use social media have a similar attitude - 52.73%. I suppose that this may happen due to several reasons: maybe some users do not browse social media because they are disappointed with what they saw at some point, or just think that social media makes life sadder, or social media may make someone feel rather isolated and not more connected, etc.

In sum, those who frequently use worldwide entertainment services, e.g. Youtube or Snapchat, have higher percantage of those thinking that youth will live better than their parents, while those who use Twitter (news + tweets about life) and LinkedIn (job-related) - are least optimistic about the future, followed by those who do not use social media at all.

blockchainIs ⭐

And a bonus/additional variable suggested in a chat with A. Nikulin: BlockchainIs (Blockchain / cryptocurrency technology is primarily:..).

One may think how can this one be related to the BetterLife? Well, turned out that BlockchainIs is connected to the outcome. The variable has 5 levels:

  • A passing fad (n = 8569);
  • An irresponsible use of resources (n = 7794);
  • Useful across many domains and could change many aspects of our lives (n = 14751);
  • Useful for decentralized currency (i.e., Bitcoin) (n = 6035);
  • Useful for immutable record keeping outside of currency (n = 13510)

Developers who consider blockchain as useful in some way are positively thinking about the youth having a better life - from 71.3% to 64.5%! Those who have rather negative attitude (a passing fad or an irresponsible use of resources) have lower value.

## The data contains 50659 observations of the following variables:
##   - SocialMedia: 11 levels: Facebook (n = 7395); I don't use social media (n = 3082); Instagram (n = 3614); LinkedIn (n = 2454); Reddit (n = 10262); Snapchat (n = 413); Twitter (n = 7651); VK ВКонта́кте (n = 324); WeChat 微信 (n = 334); WhatsApp (n = 7527) and YouTube (n = 7603)
##   - Gender: 3 levels: Man (n = 47019); Non-binary, genderqueer, or gender non-conforming (n = 600) and Woman (n = 3040)
##   - Age: Mean = 30.46, SD = 8.81, range = [14, 79], 0 missing
##   - BlockchainIs: 5 levels: A passing fad (n = 8569); An irresponsible use of resources (n = 7794); Useful across many domains and could change many aspects of our lives (n = 14751); Useful for decentralized currency (i.e., Bitcoin) (n = 6035) and Useful for immutable record keeping outside of currency (n = 13510)
##   - BetterLife: 2 levels: No (n = 18160) and Yes (n = 32499)
##   - Region: 17 levels: Northern America (n = 15266); Caribbean (n = 100); Central America (n = 556); Eastern Africa (n = 296); Eastern Asia (n = 823); Eastern Europe (n = 4103); Melanesia & Australia and New Zealand (n = 1563); Middle & Western Africa (n = 355); Northern Africa (n = 291); Northern Europe (n = 6294); South America (n = 2091); South-Eastern Asia (n = 929); Southern Africa (n = 426); Southern Asia (n = 5288); Southern Europe (n = 3155); Western & Central Asia (n = 1436) and Western Europe (n = 7687)


logistic regression

models

logit_sm = readRDS("logit_sm.rds")
logit_age = readRDS("logit_age.rds")
logit_reg = readRDS("logit_reg.rds")
logit_block = readRDS("logit_block.rds")

single predictor

The first models contain only one predictor. Among the presented, the best one, according to the R2 Tjur, AIC, and log-Likelihood is the one with Region, where the reference level is Northern America. Second-best - BlockchainIs. Let's proceed with more complex models and see what happens.

Models with single predictors
  BetterLife BetterLife BetterLife
Predictors Odds Ratios std. Error CI p Odds Ratios std. Error CI p Odds Ratios std. Error CI p
(Intercept) 1.92 0.02 1.83 – 2.01 <0.001 1.60 0.02 1.55 – 1.66 <0.001 1.51 0.02 1.44 – 1.57 <0.001
Facebook Reference Reference Reference
I don't use social media 0.58 0.04 0.53 – 0.63 <0.001
Instagram 0.99 0.04 0.91 – 1.07 0.750
LinkedIn 0.91 0.05 0.82 – 1.00 0.042
Reddit 0.97 0.03 0.91 – 1.04 0.383
Snapchat 1.20 0.11 0.97 – 1.49 0.097
Twitter 0.77 0.03 0.72 – 0.83 <0.001
VK ВКонта́кте 2.49 0.15 1.88 – 3.37 <0.001
WeChat 微信 1.93 0.14 1.49 – 2.54 <0.001
WhatsApp 0.92 0.03 0.86 – 0.99 0.021
YouTube 1.13 0.03 1.06 – 1.21 <0.001
Northern America Reference Reference Reference
Caribbean 2.50 0.25 1.56 – 4.19 <0.001
Central America 1.57 0.10 1.31 – 1.90 <0.001
Eastern Africa 2.31 0.14 1.76 – 3.08 <0.001
Eastern Asia 1.78 0.08 1.52 – 2.09 <0.001
Eastern Europe 1.85 0.04 1.72 – 2.00 <0.001
Melanesia & Australia and
New Zealand
1.11 0.06 1.00 – 1.24 0.050
Middle & Western Africa 2.89 0.14 2.21 – 3.84 <0.001
Northern Africa 1.90 0.14 1.46 – 2.50 <0.001
Northern Europe 0.96 0.03 0.91 – 1.02 0.199
South America 1.88 0.05 1.69 – 2.08 <0.001
South-Eastern Asia 2.29 0.08 1.95 – 2.69 <0.001
Southern Africa 1.14 0.10 0.93 – 1.39 0.214
Southern Asia 1.53 0.03 1.43 – 1.64 <0.001
Southern Europe 1.02 0.04 0.94 – 1.10 0.686
Western & Central Asia 1.28 0.06 1.14 – 1.43 <0.001
Western Europe 0.72 0.03 0.68 – 0.77 <0.001
A passing fad Reference Reference Reference
An irresponsible use of
resources
0.69 0.03 0.65 – 0.73 <0.001
Useful across many
domains and could change
many aspects of our lives
1.65 0.03 1.56 – 1.74 <0.001
Useful for decentralized
currency (i.e., Bitcoin)
1.47 0.04 1.37 – 1.57 <0.001
Useful for immutable
record keeping outside of
currency
1.21 0.03 1.14 – 1.28 <0.001
Observations 50659 50659 50659
R2 Tjur 0.008 0.022 0.021
AIC 65731.651 65009.810 65091.232
log-Likelihood -32854.825 -32487.905 -32540.616

Model comparison using anova shows that model with Region is significantly better than model with BlockchainIs, while model with blockchain is not better than social media, and social media - not significantly better than age.

## Analysis of Deviance Table
## 
## Model 1: BetterLife ~ Age
## Model 2: BetterLife ~ SocialMedia
## Model 3: BetterLife ~ BlockchainIs
## Model 4: BetterLife ~ Region
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1     50657      65275                          
## 2     50648      65710  9  -434.56              
## 3     50654      65081 -6   628.42              
## 4     50642      64976 12   105.42 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

multiple predictors

The results obtained using one predictor may be quite different from those received using multiple predictors, thus, I built several models with multiple predictors to see whether including them will make the final model better, and to decide whether I should include both SocialMedia and Region as they may be related.

The predictors include: Region, SocialMedia, Gender, Age, BlockchainIs.

goodness of fit

Based on the anova output, two models (with 4 predictors each) look better than the other ones. However, they do not differ significantly.

## Analysis of Deviance Table
## 
## Model 1: BetterLife ~ SocialMedia + Gender + Age
## Model 2: BetterLife ~ Region + Gender + Age
## Model 3: BetterLife ~ SocialMedia + Region + Gender + Age
## Model 4: BetterLife ~ BlockchainIs + Region + Gender + Age
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1     50645      64788                          
## 2     50639      64140  6   647.85 < 2.2e-16 ***
## 3     50629      63973 10   167.35 < 2.2e-16 ***
## 4     50635      63553 -6   419.82              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The likelihood ratio test shows the best (closest to 0) value is achieved by the last model, which included BlockchainIs, Region, Gender, Age. It would also be nice to look at the model with interactions between SocialMedia and Region, however, there would be too many coefficients (& NAs) to interpret, so let's skip that. As SocialMedia and Region may be related, I'll proceed with model 4 - BetterLife ~ BlockchainIs + Region + Gender + Age.

## Likelihood ratio test
## 
## Model 1: BetterLife ~ SocialMedia + Gender + Age
## Model 2: BetterLife ~ Region + Gender + Age
## Model 3: BetterLife ~ SocialMedia + Region + Gender + Age
## Model 4: BetterLife ~ BlockchainIs + Region + Gender + Age
##   #Df LogLik Df  Chisq Pr(>Chisq)    
## 1  14 -32394                         
## 2  20 -32070  6 647.85  < 2.2e-16 ***
## 3  30 -31987 10 167.35  < 2.2e-16 ***
## 4  24 -31777 -6 419.82  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Other metrics are summarized in the following outputs.

The first llh column shows the log-likelihood for the fitted models, and by comparing it to the llhNull intercept-model, all models are better than null model. The highest McFadden pseudo r-squared is achieved by model with all predictors (3.87%), followed by the one with Region, SocialMedia, Gender and Age - 3.24%. Other measures also show that the last two models are better than the others.

##           llh       llhNull            G2      McFadden          r2ML 
## -3.239416e+04 -3.305682e+04  1.325309e+03  2.004592e-02  2.582213e-02 
##          r2CU 
##  3.542874e-02
##           llh       llhNull            G2      McFadden          r2ML 
## -3.207024e+04 -3.305682e+04  1.973158e+03  2.984495e-02  3.820102e-02 
##          r2CU 
##  5.241295e-02
##           llh       llhNull            G2      McFadden          r2ML 
## -3.198656e+04 -3.305682e+04  2.140505e+03  3.237615e-02  4.137298e-02 
##          r2CU 
##  5.676497e-02
##           llh       llhNull            G2      McFadden          r2ML 
## -3.177665e+04 -3.305682e+04  2.560326e+03  3.872614e-02  4.928448e-02 
##          r2CU 
##  6.761979e-02
##   Rank Df.res   AIC  AICc   BIC McFadden Cox.and.Snell Nagelkerke    p.value
## 1   14  50640 64820 64820 64950  0.02005       0.02582    0.03543 9.326e-276
## 2   20  50640 64180 64180 64370  0.02984       0.03820    0.05241  0.000e+00
## 3   30  50630 64040 64040 64310  0.03238       0.04137    0.05676  0.000e+00
## 4   24  50640 63600 63600 63820  0.03873       0.04928    0.06762  0.000e+00

The model output shows that all final predictors are significant, there is no need to drop any.

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: BetterLife
## 
## Terms added sequentially (first to last)
## 
## 
##              Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                         50658      66114              
## BlockchainIs  4  1032.40     50654      65081 < 2.2e-16 ***
## Region       16   803.72     50638      64278 < 2.2e-16 ***
## Gender        2   164.30     50636      64113 < 2.2e-16 ***
## Age           1   559.91     50635      63553 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hosmer-Lemeshow Test: search for evidence of poor fit

The Hosmer-Lemeshow Test is used to find evidence of poor fit. The p-value here is > 0.05, meaning that there is no evidence that the selected model is bad, nice!

## 
## 	Hosmer and Lemeshow test (binary model)
## 
## data:  df$BetterLife, fitted(logit_AgeGenRegBlock)
## X-squared = 19.842, df = 24, p-value = 0.7057

final model

Let's go deeper and use cross-validation to fit the best model, finally examine the coefficients, after that - look at the performance metrics.

cross-validation

Here the data is split into train (80%) and test (20%); cross validation: 5-fold, metric to select the optimal model -- ROC.

X = df 
set.seed(17)
split = createDataPartition(X$BetterLife, p=0.8, list = FALSE)
X_train = X[split, ]
X_valid = X[-split, ]

The output below shows the raw coefficients and the respective confidence intervals. The model equation is:

$$log(\frac{p(BetterLife_Yes)}{p(1 - BetterLife_Yes)}) = 1.26 + 0.65Region_Caribbean + 0.29Region_Central_America + \0.48Region_Eastern_Africa + 0.46Region_Eastern_Asia + \0.48Region_Eastern_Europe + 0.14Region_Melanesia_Australia_And_New_Zealand + \0.62Region_Middle_And_Western_Africa + 0.39Region_Nothern_Africa - \0.04Region_Northern_Europe + 0.45Region_South_America + \0.65Region_South_Eastern_Asia - 0.03Region_Southern_Africa + \0.1Region_Southern_Asia - 0.12Region_Southern_Europe + 0.07Region_Western_Central_Asia - \0.4Region_Western_Europe - 0.3Blockchain_Is_Anirresponsible_Use_Of_Resources + \0.39Blockchain_Is_Useful_Across_Many_Domains_And_Could_Change_Many_Aspects_Of_Our_Lives + \0.22Blockchain_Is_Useful_For_Decentralized_Currency + \0.18Blockchain_Is_Useful_For_Immutable_Record_Keeping_Outside_Of_Currency - \0.45Gender_NonBinary_Genderqueer_Or_Gender_NonConforming - \0.5Gender_Woman - 0.03*Age$$

  .outcome
Predictors Log-Odds CI p
(Intercept) 1.26 1.17 – 1.36 <0.001
RegionCaribbean 0.65 0.12 – 1.24 0.022
RegionCentral America 0.29 0.07 – 0.50 0.010
RegionEastern Africa 0.48 0.17 – 0.79 0.003
RegionEastern Asia 0.46 0.28 – 0.65 <0.001
RegionEastern Europe 0.48 0.39 – 0.57 <0.001
RegionMelanesia &
Australia and New Zealand
0.14 0.02 – 0.27 0.025
RegionMiddle & Western
Africa
0.62 0.33 – 0.93 <0.001
RegionNorthern Africa 0.39 0.10 – 0.70 0.011
RegionNorthern Europe -0.04 -0.11 – 0.03 0.276
RegionSouth America 0.45 0.33 – 0.57 <0.001
RegionSouth-Eastern Asia 0.65 0.47 – 0.83 <0.001
RegionSouthern Africa -0.03 -0.25 – 0.20 0.817
RegionSouthern Asia 0.10 0.02 – 0.18 0.017
RegionSouthern Europe -0.12 -0.21 – -0.03 0.011
RegionWestern & Central
Asia
0.07 -0.06 – 0.20 0.290
RegionWestern Europe -0.40 -0.46 – -0.33 <0.001
BlockchainIsAn
irresponsible use of
resources
-0.30 -0.37 – -0.23 <0.001
BlockchainIsUseful across
many domains and could
change many aspects of
our lives
0.39 0.33 – 0.46 <0.001
BlockchainIsUseful for
decentralized currency
(i.e., Bitcoin)
0.22 0.14 – 0.30 <0.001
BlockchainIsUseful for
immutable record keeping
outside of currency
0.18 0.12 – 0.24 <0.001
GenderNon-binary,
genderqueer, or gender
non-conforming
-0.45 -0.64 – -0.27 <0.001
GenderWoman -0.50 -0.58 – -0.41 <0.001
Age -0.03 -0.03 – -0.02 <0.001
Observations 40528
R2 Tjur 0.050

coefficients

Now, table shows the odds and confidence intervals for the final model.

OR 2.5 % 97.5 %
(Intercept) 3.5348173 3.2075017 3.8964751
RegionCaribbean 1.9169631 1.1272932 3.4566079
RegionCentral America 1.3299153 1.0751122 1.6552968
RegionEastern Africa 1.6082431 1.1903216 2.2102080
RegionEastern Asia 1.5889651 1.3281241 1.9100725
RegionEastern Europe 1.6168138 1.4817926 1.7654813
RegionMelanesia & Australia and New Zealand 1.1523770 1.0182183 1.3057894
RegionMiddle & Western Africa 1.8601818 1.3874075 2.5382697
RegionNorthern Africa 1.4804500 1.1009642 2.0208588
RegionNorthern Europe 0.9625680 0.8987819 1.0310131
RegionSouth America 1.5645756 1.3899310 1.7643605
RegionSouth-Eastern Asia 1.9094037 1.5965758 2.2962740
RegionSouthern Africa 0.9737107 0.7787505 1.2225291
RegionSouthern Asia 1.1010281 1.0173118 1.1920596
RegionSouthern Europe 0.8899249 0.8134082 0.9739942
RegionWestern & Central Asia 1.0731250 0.9423467 1.2238861
RegionWestern Europe 0.6729670 0.6313961 0.7172657
BlockchainIsAn irresponsible use of resources 0.7405123 0.6898891 0.7948075
BlockchainIsUseful across many domains and could change many aspects of our lives 1.4777415 1.3849138 1.5767700
BlockchainIsUseful for decentralized currency (i.e., Bitcoin) 1.2509784 1.1552887 1.3548357
BlockchainIsUseful for immutable record keeping outside of currency 1.1955734 1.1219330 1.2740050
GenderNon-binary, genderqueer, or gender non-conforming 0.6365716 0.5298527 0.7652612
GenderWoman 0.6081737 0.5585500 0.6622957
Age 0.9742796 0.9719264 0.9766346

Interpretation of several coefficients:

  • Intercept -- male developer in Northern America, believing that blockchain is a passing fad
  • Being in Caribbean region increases the odds by a factor of 1.92 compared to dev. in Nothern America. Other regions also seem to be positive thinking
  • Odds for developers in Southern Europe are by 11% lower compared to Northern America
  • Being a female or non-binary decreases the odds by the factor of 0.61 and 0.64 respectively compared to male developers
  • One unit increase in age is associated with 3% decrease in odds of positive belief
  • Developers who consider blockchain useful across many domains and could change many aspects of our lives have odds higher by 48% compared to those believing that blockchain is a passing fad; An irresponsible use of resources - lower by 26%

predictions

To begin with, let's look at the confusion matrix for the test data. Overall, the model fit is statistically significantly better than no model at all (Null Information Rate).

Accuracy is 65.35%, which is slightly better than the original 64% of our positive class. Recall of 0.92 means that the model was able to identify 92% of those who believe in better life of children, and precision -- 66% of those identified as "Yes" actually responded "Yes".

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No   635  513
##        Yes 2997 5986
##                                           
##                Accuracy : 0.6535          
##                  95% CI : (0.6442, 0.6628)
##     No Information Rate : 0.6415          
##     P-Value [Acc > NIR] : 0.005826        
##                                           
##                   Kappa : 0.1129          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##               Precision : 0.6664          
##                  Recall : 0.9211          
##                      F1 : 0.7733          
##              Prevalence : 0.6415          
##          Detection Rate : 0.5909          
##    Detection Prevalence : 0.8867          
##       Balanced Accuracy : 0.5479          
##                                           
##        'Positive' Class : Yes             
## 

roc-auc

Higher the AUC, the better the model is at predicting the classes. Perfect classifier has an AUC of 1, while 0.5 indicates no class separation capacity. In this case it equals 62.97%, which is quite ok.

## Area under the curve: 0.6297

marginal effects

Marginal effects show how a changes in the independent variable impact the outcome variable, taking all other independent variables into account.

Interpreting everything probably does not make much sense, so let's selected several ones. For example,

  • increase of Age by one year would decrease the p(y=1) by approximately -0.0057
  • for two hypothetical male developers from Northern America, the predicted probability of optimistic thoughts about better life of children is 0.07 lower for a user who considers blockchain rather an irresponsible use of resources compared to one having "a passing fad" attitude to blockchain
##                                                                             factor
##                                                                                Age
##                                      BlockchainIsAn irresponsible use of resources
##  BlockchainIsUseful across many domains and could change many aspects of our lives
##                      BlockchainIsUseful for decentralized currency (i.e., Bitcoin)
##                BlockchainIsUseful for immutable record keeping outside of currency
##                            GenderNon-binary, genderqueer, or gender non-conforming
##                                                                        GenderWoman
##                                                                    RegionCaribbean
##                                                              RegionCentral America
##                                                               RegionEastern Africa
##                                                                 RegionEastern Asia
##                                                               RegionEastern Europe
##                                        RegionMelanesia & Australia and New Zealand
##                                                      RegionMiddle & Western Africa
##                                                              RegionNorthern Africa
##                                                              RegionNorthern Europe
##                                                                RegionSouth America
##                                                           RegionSouth-Eastern Asia
##                                                              RegionSouthern Africa
##                                                                RegionSouthern Asia
##                                                              RegionSouthern Europe
##                                                       RegionWestern & Central Asia
##                                                               RegionWestern Europe
##      AME     SE        z      p   lower   upper
##  -0.0057 0.0002 -24.0455 0.0000 -0.0061 -0.0052
##  -0.0700 0.0076  -9.2172 0.0000 -0.0849 -0.0552
##   0.0881 0.0065  13.5283 0.0000  0.0754  0.1009
##   0.0547 0.0081   6.7561 0.0000  0.0388  0.0706
##   0.0409 0.0066   6.2458 0.0000  0.0281  0.0538
##  -0.1218 0.0201  -6.0740 0.0000 -0.1611 -0.0825
##  -0.1146 0.0091 -12.5786 0.0000 -0.1325 -0.0968
##   0.1356 0.0433   3.1286 0.0018  0.0506  0.2205
##   0.0547 0.0201   2.7258 0.0064  0.0154  0.0940
##   0.1032 0.0270   3.8212 0.0001  0.0503  0.1561
##   0.0888 0.0160   5.5457 0.0000  0.0574  0.1202
##   0.0996 0.0079  12.6064 0.0000  0.0841  0.1151
##   0.0232 0.0122   1.8952 0.0581 -0.0008  0.0472
##   0.1412 0.0238   5.9217 0.0000  0.0945  0.1880
##   0.0683 0.0280   2.4416 0.0146  0.0135  0.1231
##  -0.0101 0.0071  -1.4251 0.1541 -0.0239  0.0038
##   0.0871 0.0107   8.1465 0.0000  0.0661  0.1080
##   0.1258 0.0148   8.5204 0.0000  0.0969  0.1548
##   0.0028 0.0232   0.1194 0.9050 -0.0428  0.0483
##   0.0191 0.0080   2.3856 0.0170  0.0034  0.0347
##  -0.0243 0.0094  -2.5919 0.0095 -0.0427 -0.0059
##   0.0182 0.0131   1.3901 0.1645 -0.0074  0.0438
##  -0.0893 0.0068 -13.1500 0.0000 -0.1026 -0.0760

The values are plotted below. Those crossing the line are not significant.

Let's visualize marginal effect for each level of BlockchainIs in combination with Age (as it also did not cause several warning messages to appear :)) )

As you can see, the overall relationship is similar for all levels of BlockchainIs: the probability decreases with Age. The highest probabilities are among those who consider blockchain as useful across many domains, followed by another two groups of developers who think that this technology is good for decentralized currency and immutable record keeping outside of currency. The lowest probabilities occur among those who think that blockchain is an irresponsible use of resources, and this is far below the reference level -- a passing fad.


diagnostics

A brief check whether the model is ok or something should be changed.

residuals

Overall, almost all observations fall into 2 standard-error bounds, which is good!

The next plot shows the residuals plotted against individual predictors, i.e. show whether there is some non-linearity of the logit over the covariates. Some suspicious activity is present in Region, others look somewhat okay. Also, Age has a significant statistic, meaning that model should include its quadratic term.

##              Test stat Pr(>|Test stat|)    
## BlockchainIs                               
## Region                                     
## Gender                                     
## Age             31.393        2.107e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let's test. Well, it did not change the metrics, so I guess there is no need to dive deeper into manipulations with age.

## Area under the curve: 0.6297

multicollinearity check

No GVIF higher than 10, meaning that there is no multicollinearity.

##                  GVIF Df GVIF^(1/(2*Df))
## BlockchainIs 1.097852  4        1.011738
## Region       1.144075 16        1.004215
## Gender       1.016291  2        1.004048
## Age          1.072891  1        1.035805

outliers, leverage, influential

The outlier test shows that there is only one observation that may be considered as outlier. Luckily, not significant. Observations 6487 and 37945 also have the largest Cook's distances, according to the plot, so they may be influential.

## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##       rstudent unadjusted p-value Bonferroni p
## 4713 -1.992909           0.046271           NA

All the bad guys:

##         StudRes         Hat        CookD
## 4448  -1.991659 0.002390724 0.0006218848
## 4713  -1.992909 0.002350924 0.0006133267
## 6487  -1.941347 0.008385106 0.0019304206
## 11956 -1.545731 0.015109600 0.0014577058
## 37945 -1.975414 0.007946093 0.0019757676
## 43163  1.062364 0.015858794 0.0005100992

To check the effect of the outlier, I fit a model without these observations. No coefficients have changed significantly, so everything is okay. No need to change anything.

## Calls:
## 1: glm(formula = BetterLife ~ BlockchainIs + Region + Gender + Age, family =
##    "binomial", data = df)
## 2: glm(formula = BetterLife ~ BlockchainIs + Region + Gender + Age, family =
##    "binomial", data = df, subset = c(-4448, -4713, -6487, -11956, -37945, 
##   -43163))
## 
##                                                                                   Model 1
## (Intercept)                                                                        1.2570
## SE                                                                                 0.0444
##                                                                                          
## BlockchainIsAn irresponsible use of resources                                     -0.2971
## SE                                                                                 0.0323
##                                                                                          
## BlockchainIsUseful across many domains and could change many aspects of our lives  0.4037
## SE                                                                                 0.0296
##                                                                                          
## BlockchainIsUseful for decentralized currency (i.e., Bitcoin)                      0.2449
## SE                                                                                 0.0365
##                                                                                          
## BlockchainIsUseful for immutable record keeping outside of currency                 0.182
## SE                                                                                  0.029
##                                                                                          
## RegionCaribbean                                                                     0.681
## SE                                                                                  0.253
##                                                                                          
## RegionCentral America                                                              0.2535
## SE                                                                                 0.0967
##                                                                                          
## RegionEastern Africa                                                                0.500
## SE                                                                                  0.144
##                                                                                          
## RegionEastern Asia                                                                 0.4240
## SE                                                                                 0.0822
##                                                                                          
## RegionEastern Europe                                                               0.4807
## SE                                                                                 0.0404
##                                                                                          
## RegionMelanesia & Australia and New Zealand                                        0.1052
## SE                                                                                 0.0562
##                                                                                          
## RegionMiddle & Western Africa                                                       0.714
## SE                                                                                  0.141
##                                                                                          
## RegionNorthern Africa                                                               0.320
## SE                                                                                  0.138
##                                                                                          
## RegionNorthern Europe                                                             -0.0447
## SE                                                                                 0.0313
##                                                                                          
## RegionSouth America                                                                0.4153
## SE                                                                                 0.0541
##                                                                                          
## RegionSouth-Eastern Asia                                                           0.6246
## SE                                                                                 0.0827
##                                                                                          
## RegionSouthern Africa                                                              0.0124
## SE                                                                                 0.1042
##                                                                                          
## RegionSouthern Asia                                                                0.0862
## SE                                                                                 0.0363
##                                                                                          
## RegionSouthern Europe                                                              -0.107
## SE                                                                                  0.041
##                                                                                          
## RegionWestern & Central Asia                                                       0.0820
## SE                                                                                 0.0596
##                                                                                          
## RegionWestern Europe                                                               -0.384
## SE                                                                                  0.029
##                                                                                          
## GenderNon-binary, genderqueer, or gender non-conforming                           -0.5315
## SE                                                                                 0.0847
##                                                                                          
## GenderWoman                                                                       -0.5013
## SE                                                                                 0.0387
##                                                                                          
## Age                                                                               -0.0260
## SE                                                                                 0.0011
##                                                                                          
##                                                                                   Model 2
## (Intercept)                                                                        1.2581
## SE                                                                                 0.0444
##                                                                                          
## BlockchainIsAn irresponsible use of resources                                     -0.2973
## SE                                                                                 0.0323
##                                                                                          
## BlockchainIsUseful across many domains and could change many aspects of our lives  0.4042
## SE                                                                                 0.0296
##                                                                                          
## BlockchainIsUseful for decentralized currency (i.e., Bitcoin)                      0.2451
## SE                                                                                 0.0365
##                                                                                          
## BlockchainIsUseful for immutable record keeping outside of currency                 0.182
## SE                                                                                  0.029
##                                                                                          
## RegionCaribbean                                                                     0.817
## SE                                                                                  0.270
##                                                                                          
## RegionCentral America                                                              0.2533
## SE                                                                                 0.0967
##                                                                                          
## RegionEastern Africa                                                                0.500
## SE                                                                                  0.144
##                                                                                          
## RegionEastern Asia                                                                 0.4238
## SE                                                                                 0.0822
##                                                                                          
## RegionEastern Europe                                                               0.4806
## SE                                                                                 0.0404
##                                                                                          
## RegionMelanesia & Australia and New Zealand                                        0.1052
## SE                                                                                 0.0562
##                                                                                          
## RegionMiddle & Western Africa                                                       0.748
## SE                                                                                  0.143
##                                                                                          
## RegionNorthern Africa                                                               0.320
## SE                                                                                  0.138
##                                                                                          
## RegionNorthern Europe                                                             -0.0447
## SE                                                                                 0.0313
##                                                                                          
## RegionSouth America                                                                0.4151
## SE                                                                                 0.0541
##                                                                                          
## RegionSouth-Eastern Asia                                                           0.6243
## SE                                                                                 0.0827
##                                                                                          
## RegionSouthern Africa                                                              0.0123
## SE                                                                                 0.1042
##                                                                                          
## RegionSouthern Asia                                                                0.0859
## SE                                                                                 0.0363
##                                                                                          
## RegionSouthern Europe                                                              -0.107
## SE                                                                                  0.041
##                                                                                          
## RegionWestern & Central Asia                                                       0.0819
## SE                                                                                 0.0596
##                                                                                          
## RegionWestern Europe                                                               -0.384
## SE                                                                                  0.029
##                                                                                          
## GenderNon-binary, genderqueer, or gender non-conforming                           -0.5267
## SE                                                                                 0.0848
##                                                                                          
## GenderWoman                                                                       -0.5017
## SE                                                                                 0.0387
##                                                                                          
## Age                                                                               -0.0260
## SE                                                                                 0.0011
## 

conclusion

This work examined the relationship between belief in whether youth will have better life than their parents and several variables. The final model included Region, BlockchainIs, Gender and Age. All of them were significant.

Overall, developers who consider blockchain useful have more optimistic thoughts compared to those with negative attitude. Region matters as well: most regions had higher odds of belief in better life compared to developers in Northern America. As for gender and age: females and non-binary are more suspicious about the future, as well as people with higher age.


references

  1. https://www.ipsos.com/ipsos-mori/en-uk/only-third-generation-y-think-their-generation-will-have-better-quality-life-their-parents
  2. https://www.ft.com/content/cfccb29a-40a7-40a1-bd03-877c779e7ac7