wrangling-manipulation.qmd

# 数据操作 {#sec-data-manipulation}

目前， R 语言在数据操作方面陆续出现三套工具，最早的是 Base R（1997 年 4月），之后是 **data.table**（2006年4月） 和 **dplyr**（2014 年1月）。下面将从世界银行下载的原始数据开始，以各种数据操作及其组合串联起来介绍，完成数据探查的工作。

## 操作工具 {#sec-tools}

本节所用数据来自世界银行，介绍 Base R、**data.table**、**dplyr** 的简介、特点、对比

### Base R {#sec-base-r}

在 data.frame 的基础上，提供一系列辅助函数实现各类数据操作。

```{r}
aggregate(iris, Sepal.Length ~ Species, FUN = length)
```

### data.table {#sec-datatable}

**data.table** 包在 Base R 的基础上，扩展和加强了原有函数的功能，提供一套完整的链式操作语法。

```{r}
library(data.table)
iris_dt <- as.data.table(iris)
iris_dt[ ,.(cnt = length(Sepal.Length)) , by = "Species"]
```

### dplyr {#sec-dplyr}

**dplyr** 包提供一套全新的数据操作语法，与 **purrr** 包和 **tidyr** 包一起形成完备的数据操作功能。在 R 环境下，**dplyr** 包提供一套等价的表示，代码如下：

```{r}
iris |> 
  dplyr::group_by(Species) |> 
  dplyr::count()
```

### SQL {#sec-sql}

实际工作中，SQL （结构化查询语言）是必不可少的基础性工具，比如 [SQLite](https://sqlite.org/)、 [Hive](https://hive.apache.org/) 和 [Spark](https://spark.apache.org/sql/) 等都提供基于 SQL 的数据查询引擎，没有重点介绍 SQL 操作是因为本书以 R 语言为数据分析的主要工具，而不是它不重要。以 **dplyr** 来说吧，它的诸多语义动词就是对标 SQL 的。

```{r}
library(DBI)
conn <- DBI::dbConnect(RSQLite::SQLite(),
  dbname = system.file("db", "datasets.sqlite", package = "RSQLite")
)
```

按 Species 分组统计数据条数， SQL 查询语句如下：

```{sql sql-query, connection=conn, output.var="iris_preview"}
SELECT COUNT(1) AS cnt, Species
FROM iris
GROUP BY Species;
```

SQL 代码执行的结果如下：

```{r}
iris_preview
```

**dplyr** 包能连接数据库，以上 SQL 代码也可以翻译成等价的 **dplyr** 语句。

```{r}
dplyr::tbl(conn, "iris") |> 
  dplyr::group_by(Species) |> 
  dplyr::count()
```

**dplyr** 包的函数 `show_query()` 可以将 **dplyr** 语句转化为查询语句，这有助于排错。

```{r}
dplyr::tbl(conn, "iris") |> 
  dplyr::group_by(Species) |> 
  dplyr::count() |> 
  dplyr::show_query()
```

**glue** 包可以使用 R 环境中的变量，相比于 `sprintf()` 函数，可以组合更大型的 SQL 语句，这在生产环境中广泛使用。

```{r}
# R 环境中的变量
group <- "Species"
# 组合 SQL
query <- glue::glue("
  SELECT COUNT(1) AS cnt, Species
  FROM iris
  GROUP BY ({group})
")
# 将 SQL 语句传递给数据库，执行 SQL 语句
DBI::dbGetQuery(conn, query)
```

用完后，关闭连接通道。

```{r}
dbDisconnect(conn = conn)
```

更多关于 SQL 语句的使用介绍见书籍[《Become a SELECT star》](https://wizardzines.com/zines/sql/)。

## Base R 操作 {#sec-basic-operator}

介绍最核心的 Base R 数据操作，如筛选、排序、变换、聚合、重塑等

### 筛选 {#sec-select}

筛选操作可以用函数 `subset()` 或 `[` 实现

```{r}
subset(iris, subset = Species == "setosa" & Sepal.Length > 5.5, select = c("Sepal.Length", "Sepal.Width"))
```

```{r}
iris[iris$Species == "setosa" & iris$Sepal.Length > 5.5, c("Sepal.Length", "Sepal.Width")]
```

### 变换 {#sec-transform}

变换操作可以用函数 `within()`/`transform()` 实现。最常见的变换操作是类型转化，比如从字符串型转为因子型、整型或日期型等。

```{r}
# iris2 <- transform(iris, Species_N = as.integer(Species))[1:3, ]
iris2 <- within(iris, {
  Species_N <- as.integer(Species)
})
str(iris2)
```

### 排序 {#sec-order}

排序操作可以用函数 `order()` 实现

```{r}
iris[order(iris$Sepal.Length, decreasing = FALSE)[1:3], ]
```

### 聚合 {#sec-aggregation}

聚合操作可以用函数 `aggregate()` 实现

```{r}
aggregate(iris, Sepal.Length ~ Species, mean)
```

### 合并 {#sec-merge}

两个数据框的合并操作可以用函数 `merge()` 实现

```{r}
df1 <- data.frame(a1 = c(1, 2, 3), a2 = c("A", "B", "C"))
df2 <- data.frame(b1 = c(2, 3, 4), b2 = c("A", "B", "D"))
# LEFT JOIN
merge(x = df1, y = df2, by.x = "a2", by.y = "b2", all.x = TRUE)
# RIGHT JOIN
merge(x = df1, y = df2, by.x = "a2", by.y = "b2", all.y = TRUE)
# INNER JOIN
merge(x = df1, y = df2, by.x = "a2", by.y = "b2", all = FALSE)
# FULL JOIN
merge(x = df1, y = df2, by.x = "a2", by.y = "b2", all = TRUE)
```

### 重塑 {#sec-reshape}

将数据集从宽格式转为长格式，可以用函数 `reshape()` 实现，反之，亦然。

```{r}
# 长格式
df3 <- data.frame(
  extra = c(0.7, -1.6, -0.2, -1.2, -0.1, 3.4),
  group = c("A", "A", "A", "B", "B", "B"),
  id = c(1, 2, 3, 1, 2, 3)
)
# 长转宽
reshape(df3, direction = "wide", timevar = "group", idvar = "id")
# 也可以指定组合变量的列名
reshape(df3, direction = "wide", timevar = "group", idvar = "id",
        v.names = "extra", sep = "_")
```

提取并整理分组线性回归系数。函数 `split()` 将数据集 iris 按分类变量 Species 拆分成列表， 函数 `lapply()` 将线性回归操作 `lm()` 应用于列表的每一个元素上，再次用函数 `lapply()` 将函数 `coef()` 应用于线性回归后的列表上，提取回归系数，用函数 `do.call()` 将系数合并成矩阵，最后，用函数`as.data.frame()` 转化成数据框。

```{r}
s1 <- split(iris, ~Species)
s2 <- lapply(s1, lm, formula = Sepal.Length ~ Sepal.Width)
s3 <- lapply(s2, coef)
s4 <- do.call("rbind", s3)
s5 <- as.data.frame(s4)
s5
do.call(
  "rbind",
  lapply(
    lapply(
      split(iris, ~Species), lm,
      formula = Sepal.Length ~ Sepal.Width
    ),
    coef
  )
)
```

## data.table 操作 {#sec-data-table}

掌握此等基础性的工具，再去了解新工具也不难，更重要的是，只要将一种工具掌握的足够好，也就足以应付绝大多数的情况。

1.  介绍 **data.table** 基础语法，对标 Base R，介绍基础操作，同时给出等价的 **dplyr** 实现，但不运行代码。

2.  **data.table** 扩展 Base R 数据操作，介绍常用的操作 8 个，讲清楚出现的具体场景，同时给出等价的 dplyr 实现，但不运行代码。

3.  **data.table** 特有的高级数据操作 `on`、`.SD` 、`.I` 、`.J` 等。

### 筛选 {#sec-dt-select}

data.table 扩展了函数 `[` 功能，简化 `iris$Species == "setosa"` 代码 `Species == "setosa"`

```{r}
iris_dt[Species == "setosa" & Sepal.Length > 5.5, c("Sepal.Length", "Sepal.Width")]
```

### 变换 {#sec-dt-transform}

变换操作可以用函数 `:=`

```{r}
iris_dt[, Species_N := as.integer(Species)]
str(iris_dt)
```

### 排序 {#sec-dt-order}

排序操作可以用函数 `order()`

```{r}
iris_dt[order(Sepal.Length, decreasing = FALSE)[1:3], ]
```

### 聚合 {#sec-dt-aggregate}

聚合操作用函数 `.()` 和 `by` 组合

```{r}
iris_dt[, .(mean = mean(Sepal.Length)), by = "Species"]
```

### 合并 {#sec-dt-merge}

合并操作也是用函数 `merge()` 来实现。

```{r}
dt1 <- data.table(a1 = c(1, 2, 3), a2 = c("A", "B", "C"))
dt2 <- data.table(b1 = c(2, 3, 4), b2 = c("A", "B", "D"))
# LEFT JOIN
merge(x = dt1, y = dt2, by.x = "a2", by.y = "b2", all.x = TRUE)
# RIGHT JOIN
merge(x = dt1, y = dt2, by.x = "a2", by.y = "b2", all.y = TRUE)
# INNER JOIN
merge(x = dt1, y = dt2, by.x = "a2", by.y = "b2", all = FALSE)
# FULL JOIN
merge(x = dt1, y = dt2, by.x = "a2", by.y = "b2", all = TRUE)
```

### 重塑 {#sec-dt-reshape}

将数据集从宽格式转为长格式，可以用函数 `dcast()` 实现，反之，可以用函数 `melt()` 实现。

```{r}
# 长格式
dt3 <- data.table(
  extra = c(0.7, -1.6, -0.2, -1.2, -0.1, 3.4),
  group = c("A", "A", "A", "B", "B", "B"),
  id = c(1, 2, 3, 1, 2, 3)
)
# 长转宽
dcast(dt3, id ~ group, value.var = "extra")
```

类似 Base R，也用 **data.table** 来实现 iris 分组线性回归

```{r}
iris_dt[, as.list(coef(lm(Sepal.Length ~ Sepal.Width))), by = "Species"]
```