
Commit f6ddde6

Miscellaneous (#149)

* Common sampling distributions
* Replace the aml dataset with the lung dataset
* Use the spacyr package to tag parts of speech and lemmatize according to POS
* Move where the Python environment is configured
* Code formatting and polishing
1 parent 16c0ead commit f6ddde6

File tree

8 files changed: +326 −50 lines


.github/workflows/quarto-book-macos.yaml

Lines changed: 16 additions & 0 deletions
```diff
@@ -37,6 +37,7 @@ jobs:
       with:
         pandoc-version: ${{ env.PANDOC_VERSION }}
 
+
       - name: Install TinyTeX
         uses: r-lib/actions/setup-tinytex@v2
         env:
@@ -51,6 +52,17 @@ jobs:
           brew install --cask font-noto-sans-cjk-sc font-noto-serif-cjk-sc
           fc-list | sort
 
+      - name: Setup Python
+        run: |
+          sudo mkdir -p /opt/.virtualenvs/r-tensorflow
+          sudo chown -R $(whoami):staff /opt/.virtualenvs/r-tensorflow
+          virtualenv -p /usr/bin/python3 $RETICULATE_PYTHON_ENV
+          source $RETICULATE_PYTHON_ENV/bin/activate
+          pip3 install -r docker/requirements.txt
+          python -m spacy download en_core_web_sm
+          deactivate
+
       - name: Install LaTeX packages
         run: |
           if(!require('tinytex')) install.packages('tinytex')
@@ -62,9 +74,13 @@ jobs:
 
       - name: Render Book
         run: |
+          source $RETICULATE_PYTHON_ENV/bin/activate
           quarto check
           quarto render
         shell: bash
+        env:
+          RETICULATE_PYTHON_ENV: /opt/.virtualenvs/r-tensorflow
+          RETICULATE_PYTHON: /opt/.virtualenvs/r-tensorflow/bin/python
 
       - name: Deploy book to bookdown.org
         if: github.event_name == 'pull_request'
```
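The Setup Python step above can be rehearsed locally before pushing. A minimal sketch, assuming the stdlib `venv` module is an acceptable stand-in for `virtualenv`, and with the repo-specific installs (`docker/requirements.txt`, the spaCy model) commented out so it runs anywhere:

```shell
# Local dry-run of the workflow's "Setup Python" step (sketch only).
ENV_DIR="${RETICULATE_PYTHON_ENV:-$HOME/.virtualenvs/r-tensorflow}"
mkdir -p "$(dirname "$ENV_DIR")"
python3 -m venv "$ENV_DIR"              # workflow uses virtualenv instead
. "$ENV_DIR/bin/activate"
python -c 'import sys; print(sys.prefix)'  # prints the venv path
# pip install -r docker/requirements.txt
# python -m spacy download en_core_web_sm
deactivate
```

reticulate then picks this interpreter up through the `RETICULATE_PYTHON` environment variable that the Render Book step exports.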

.github/workflows/quarto-book-ubuntu.yaml

Lines changed: 14 additions & 0 deletions
```diff
@@ -38,6 +38,16 @@ jobs:
       with:
         pandoc-version: ${{ env.PANDOC_VERSION }}
 
+      - name: Setup Python
+        run: |
+          pip3 install virtualenv
+          mkdir -p /opt/.virtualenvs/r-tensorflow
+          virtualenv -p /usr/bin/python3 $RETICULATE_PYTHON_ENV
+          source $RETICULATE_PYTHON_ENV/bin/activate
+          pip3 install -r docker/requirements.txt
+          python -m spacy download en_core_web_sm
+          deactivate
+
       - name: Install TinyTeX
         uses: r-lib/actions/setup-tinytex@v2
         env:
@@ -68,9 +78,13 @@ jobs:
 
       - name: Render Book
         run: |
+          source $RETICULATE_PYTHON_ENV/bin/activate
           quarto check
           quarto render --to html
         shell: bash
+        env:
+          RETICULATE_PYTHON_ENV: /opt/.virtualenvs/r-tensorflow
+          RETICULATE_PYTHON: /opt/.virtualenvs/r-tensorflow/bin/python
 
       - name: Deploy to Github Pages
         uses: JamesIves/github-pages-deploy-action@v4
```

DESCRIPTION

Lines changed: 1 addition & 0 deletions
```diff
@@ -78,6 +78,7 @@ Imports:
     RSQLite,
     scatterplot3d,
     scs,
+    SemNetCleaner,
    sf,
    showtext,
    shiny,
```

_common.R

Lines changed: 4 additions & 0 deletions
```diff
@@ -94,3 +94,7 @@ to_png <- function(fig_path) {
   )
   return(png_path)
 }
+
+# Set up Python
+Sys.setenv(RETICULATE_PYTHON = "/opt/.virtualenvs/r-tensorflow/bin/python")
+Sys.setenv(RETICULATE_PYTHON_ENV = "/opt/.virtualenvs/r-tensorflow")
```

_quarto.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -46,6 +46,7 @@ book:
       - documents-office.qmd
     - part: "Statistical Analysis"
       chapters:
+        - sampling-distributions.qmd
        - common-statistical-tests.qmd
        - regression-and-correlation.qmd
        - categorical-data-analysis.qmd
```

analyze-survival-data.qmd

Lines changed: 40 additions & 45 deletions
````diff
@@ -11,15 +11,17 @@
 
 library(survival) # survfit
 library(ggplot2)
-library(ggfortify) # autoplot
 library(glmnet) # Cox Models
-library(VGAM) # R >= 4.4.0
-library(INLA)
+library(ggsurvfit)
+# library(VGAM) # R >= 4.4.0
+# library(INLA)
 ```
 
-Survival analysis applies to user-churn analysis: registration, activation, and activity; next-day, 7-day, and 15-day retention; how long until a student who attends a trial class pays for regular lessons; how long after an outpatient visit a patient is admitted to hospital. Historically, survival analysis began with studying how many aircraft returned from their sorties. Go back to the original literature to understand the basic concepts, and the thinking and applications behind them
+Survival analysis applies to user-churn analysis: registration, activation, and activity; next-day, 7-day, and 15-day retention; how long until a student who attends a trial class pays for regular lessons; how long after an outpatient visit a patient is admitted to hospital. Historically, survival analysis began with studying how many aircraft could still return from their sorties. To learn survival analysis, go back to the original literature to understand the basic concepts, and the thinking and applications behind them.
+
+Open the chapter with a survival problem, present a dataset, explore and analyze the data first, then build and fit models, and interpret the results.
 
-Open the chapter with a question, present a dataset, build a model, fit it, and interpret the results.
+The lung dataset, with the **survival** package (modeling) and the **ggsurvfit** package (visualization).
 
 ## Problem Background {#sec-aml}
 
@@ -28,7 +30,7 @@ library(INLA)
 ```{r}
 library(survival)
 data(cancer, package = "survival")
-str(aml)
+str(lung)
 ```
 
 The distribution of the data is shown below.
@@ -40,7 +42,7 @@ str(aml)
 #| fig-width: 4.5
 #| fig-height: 3
 
-ggplot(data = aml, aes(x = time, y = status, color = x)) +
+ggplot(data = lung, aes(x = time, y = status, color = factor(sex))) +
   geom_jitter(height = 0.2) +
   theme_minimal()
 ```
@@ -55,46 +57,46 @@ Cox proportional hazards regression and the Box-Cox transformation [@Box1964]
 - `MASS::boxcox()` Box-Cox transformation
 - `glmnet::glmnet(family = "cox")`
 - The INLA functions `inla()` and `inla.surv()` fit survival models together, [link](https://becarioprecario.bitbucket.io/inla-gitbook/ch-survival.html)
-- [survstan](https://github.com/fndemarqui/survstan) Stan and survival analysis
-- The rstanarm function `stan_jm()`, documented in Estimating Joint Models for Longitudinal and Time-to-Event Data with rstanarm, [link](https://cran.r-project.org/web/packages/rstanarm/vignettes/jm.html)
-- The rstanarm [survival-analysis branch](https://github.com/stan-dev/rstanarm/pull/323)
 
 ### survival
 
 R ships with the [survival](https://github.com/therneau/survival) package, the core R package for survival analysis [@Terry2000]; its function `survfit()` fits models.
 
 ```{r}
-aml_survival <- survfit(Surv(time, status) ~ x, data = aml)
-summary(aml_survival)
+lung_surv <- survfit(Surv(time, status) ~ sex, data = lung)
+lung_surv
 ```
 
 Fit a Cox proportional hazards regression model.
 
 ```{r}
-aml_coxph <- coxph(Surv(time, status) ~ 1 + x, data = aml)
-summary(aml_coxph)
+lung_coxph <- coxph(Surv(time, status) ~ 1 + sex, data = lung)
+summary(lung_coxph)
 ```
 
-Display the fitted results. Many R packages draw survival plots, e.g. the ggfortify, [ggsurvfit](https://github.com/ddsjoberg/ggsurvfit/), and [survminer](https://github.com/kassambara/survminer) packages. ggfortify can plot the object returned by `survfit()` directly; [ggsurvfit](https://github.com/ddsjoberg/ggsurvfit/) provides a new function `survfit2()` to fit the model and `ggsurvfit()` to draw richer figures; [survminer](https://github.com/kassambara/survminer) has many dependencies.
+Display the fitted results. Many R packages draw survival plots, e.g. the ggfortify, [ggsurvfit](https://github.com/pharmaverse/ggsurvfit), and [survminer](https://github.com/kassambara/survminer) packages. ggfortify can plot the object returned by `survfit()` directly; ggsurvfit provides a new function `survfit2()` to fit the model and `ggsurvfit()` to draw richer figures; survminer has many dependencies.
 
 ```{r}
 #| label: fig-leukemia-surv
 #| fig-cap: "Acute myeloid leukemia survival data"
 #| fig-showtext: true
 #| fig-width: 6
-#| fig-height: 3
-
-library(ggplot2)
-library(ggfortify)
-autoplot(aml_survival, data = aml) +
-  theme_minimal()
-```
-
-Parametric survival models (parametric, as opposed to nonparametric)
-
-```{r}
-aml_surv_reg <- survreg(Surv(time, status) ~ x, data = aml, dist = "weibull")
-summary(aml_surv_reg)
+#| fig-height: 5
+
+p <- survfit2(Surv(time, status) ~ sex, data = lung) |>
+  ggsurvfit(linewidth = 1) +
+  add_confidence_interval() +
+  add_risktable() +
+  add_quantile(y_value = 0.6, color = "gray50", linewidth = 0.75) +
+  scale_ggsurvfit()
+p +
+  # limit plot to show 8 years and less
+  coord_cartesian(xlim = c(0, 1000)) +
+  # update figure labels/titles
+  labs(
+    y = "Percentage Survival",
+    title = "Recurrence by Time From Surgery to Randomization",
+  )
 ```
 
 ### glmnet
@@ -106,19 +108,22 @@ The glmnet package fits the Cox proportional hazards model [@simon2011], suited to settings with multiple variables
 
 library(glmnet)
 # alpha = 1 lasso
-aml_glmnet <- glmnet(x = aml$x, y = Surv(aml$time, aml$status), family = "cox", alpha = 1)
-aml_glmnet_cv <- cv.glmnet(x = aml$x, y = Surv(aml$time, aml$status), family = "cox", alpha = 1)
+lung_glmnet <- glmnet(x = lung$sex, y = Surv(lung$time, lung$status), family = "cox", alpha = 1)
+lung_glmnet_cv <- cv.glmnet(x = lung$sex, y = Surv(lung$time, lung$status), family = "cox", alpha = 1)
 ```
 
 ### INLA
 
 The INLA package fits the Cox proportional hazards model [@Virgilio2020] using approximate Bayesian inference.
 
 ```{r}
+#| eval: false
+
 library(INLA)
 inla.setOption(short.summary = TRUE)
-aml_inla <- inla(inla.surv(time, status) ~ x, data = aml, family = "exponential.surv", num.threads = "1:1")
-summary(aml_inla)
+lung_inla <- inla(inla.surv(time, status) ~ sex, data = lung,
+  family = "exponential.surv", num.threads = "1:1")
+summary(lung_inla)
 ```
 
 ## Tobit Regression {#sec-tobit-regression}
@@ -135,17 +140,7 @@ Tobit (Tobin's Probit) regression originates from the Tobit model in econometrics,
 
 library(VGAM) # Vector Generalized Linear and Additive Models
 # VGAM::vglm(family = tobit(Upper = 800)) # Tobit regression
-```
-
-```{r}
-library(VGAM)
-with(aml, SurvS4(time, status))
-```
-
-```{r}
-#| eval: false
-#| echo: false
-
-aml_vglm <- vglm(SurvS4(time, status) ~ x, data = aml, family = cens.poisson)
-summary(aml_vglm)
+with(lung, SurvS4(time, status))
+lung_vglm <- vglm(SurvS4(time, status) ~ sex, data = lung, family = cens.poisson)
+summary(lung_vglm)
 ```
````
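The `survfit()` call in this file computes Kaplan-Meier survival estimates for each `sex` stratum. As a first-principles sketch of the product-limit estimator it implements (illustrative data, not the lung dataset; this mirrors the math, not the survival package's internals):

```python
# Kaplan-Meier product-limit estimator, sketched from first principles.
# event = 1 is an observed death, 0 is right-censoring. At each distinct
# event time t: S(t) <- S(t-) * (1 - d_t / n_t), where d_t is the number
# of deaths at t and n_t the number still at risk just before t.

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        n_tied = sum(1 for tt, e in data if tt == t)
        if deaths > 0:
            surv *= 1 - deaths / n_at_risk  # product-limit step
            curve.append((t, surv))
        n_at_risk -= n_tied  # both deaths and censorings leave the risk set
        i += n_tied
    return curve

# Ten made-up subjects: follow-up time and event indicator.
times = [5, 8, 8, 12, 16, 23, 27, 30, 33, 43]
events = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Censored subjects drop out of the risk set without stepping the curve down, which is exactly why the Kaplan-Meier estimate differs from a naive empirical survival fraction.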

analyze-text-data.qmd

Lines changed: 51 additions & 5 deletions
````diff
@@ -17,6 +17,8 @@ library(quanteda.textstats) # queries, statistics
 library(quanteda.textmodels) # LSA
 library(ggplot2) # plotting
 library(text2vec) # LDA algorithm
+library(spacyr)
+library(data.table)
 ```
 
 Next, call the `tools` function `CRAN_package_db()` to fetch R package metadata, and save it locally for convenient reuse.
@@ -45,15 +47,57 @@ pdb$Title <- gsub(pattern = '"', replacement = "", x = pdb$Title, fixed = T)
 pdb$Title <- tolower(pdb$Title)
 ```
 
-- Stemming and lemmatization. This step is tricky, requiring different rules for different parts of speech. To singularize nouns, call `singularize()` from the **SemNetCleaner** package, e.g. models / modeling reduce to model, methods to method, and so on.
+- Stemming and lemmatization. This step is tricky: first parse out the parts of speech with the **spacyr** package, then apply different rules by part of speech. To singularize nouns, call `singularize()` from the **SemNetCleaner** package, e.g. models / modeling reduce to model, methods to method, and so on.
 
 ```{r}
+#| message: false
+
+# Vectorize singularize for use inside fcase
+vec_singularize <- function(word, ...){
+  unlist(lapply(word, SemNetCleaner::singularize, ...))
+}
+vec_singularize(word = c("methods", "models", "data"))
+library(spacyr)
+# OpenMP
+Sys.setenv(KMP_DUPLICATE_LIB_OK = TRUE)
+# Initialize; named-entity recognition is not needed
+spacy_initialize(model = "en_core_web_sm", entity = F)
+
+# Prepare the text vector to parse
+title_desc <- pdb$Title
+names(title_desc) <- pdb$Package
+# Parsing the text takes about a minute
+title_token <- spacy_parse(x = title_desc, entity = F)
+
+# Use data.table to speed up the data manipulation
+library(data.table)
+title_token <- as.data.table(title_token)
+# Create a new column to hold the lemma
+title_token$lemma2 <- title_token$lemma
+# Handle verbs and nouns
+title_token$lemma2 <- fcase(
+  title_token$pos %in% c("VERB", "AUX"), title_token$lemma,
+  title_token$pos %in% c("NOUN", "PROPN", "PRON"), vec_singularize(title_token$token),
+  !title_token$pos %in% c("VERB", "AUX", "NOUN", "PROPN", "PRON"), title_token$token
+)
+# Collapse back into one string per document
+pdb <- aggregate(title_token, lemma2 ~ doc_id, paste, collapse = " ")
+colnames(pdb) <- c("Package", "Title")
+# Clean up intermediate variables
+rm(title_token, title_desc)
+```
+
+```{r}
+#| eval: false
+#| code-fold: true
+
 # Operations after tokenization
-# Install the tidytext package
 # Nouns
-SnowballC::wordStem(words = c("methods", "models"))
+SnowballC::wordStem(words = c("methods", "models", "data"))
 # pdb$Title_stem <- SnowballC::wordStem(pdb$Title)
-tokenizers::tokenize_word_stems(x = c("methods", "models"))
+tokenizers::tokenize_word_stems(x = c("methods", "models", "data"))
+# Stem with the SnowballC package
+quanteda::tokens_wordstem(tokens(x = c("methods models data")))
 ```
 
 Length distribution of R package titles
@@ -119,7 +163,7 @@ R, as a language mainly used for data acquisition, analysis, processing, modeling, and visualization
 #| code-fold: true
 # Word cloud
 set.seed(20252025)
-textplot_wordcloud(word1, min_size = 1, max_size = 5)
+textplot_wordcloud(word1, min_size = 0.9, max_size = 5)
 ```
 
 ## Collocations and Phrases {#sec-multi-word-expressions}
@@ -155,6 +199,8 @@
   (\(x) x[order(x$count, decreasing = T), ])()
 ```
 
+Two of the phrases, `via windsor.ai api` and `amazon web service`, look odd at first glance; in fact they come from the series of R packages released by these two companies.
+
 ```{r}
 #| label: fig-frequency-phrase
 #| fig-cap: High-frequency phrases
````
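The POS-conditional lemmatization rule added in this file (verbs keep the tagger's lemma, nouns get singularized, everything else passes through) can be sketched without R. The singularizer below is a toy stand-in for `SemNetCleaner::singularize()` covering only regular English plural patterns, and the hand-tagged tuples stand in for `spacy_parse()` output; both are illustrative assumptions, not the book's actual pipeline:

```python
# Toy POS-conditional lemmatizer mirroring the fcase() logic in the diff.

def singularize(word):
    """Rule-based singularization; a toy stand-in, regular plurals only."""
    irregular = {"data": "data", "analyses": "analysis", "indices": "index"}
    if word in irregular:
        return irregular[word]
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"          # libraries -> library
    if word.endswith(("ses", "xes", "ches", "shes")):
        return word[:-2]                # classes -> class
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                # methods -> method
    return word

def normalize(token, lemma, pos):
    """Verbs keep the lemma, nouns are singularized, the rest pass through."""
    if pos in ("VERB", "AUX"):
        return lemma
    if pos in ("NOUN", "PROPN", "PRON"):
        return singularize(token)
    return token

# Hand-tagged (token, lemma, pos) tuples standing in for spacy_parse() output.
tagged = [("models", "model", "NOUN"), ("fitted", "fit", "VERB"),
          ("bayesian", "bayesian", "ADJ"), ("methods", "method", "NOUN")]
print(" ".join(normalize(t, l, p) for t, l, p in tagged))
# → model fit bayesian method
```

The point of branching on POS is visible in "fitted": a noun-style suffix stripper would leave it untouched, while the verb branch recovers the lemma "fit".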
