-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
210 lines (141 loc) · 6.22 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
---
output: github_document
---
```{r, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(
echo = TRUE,
collapse = TRUE,
comment = "##",
fig.path = "README-"
)
library(dplyr)
library(xplor)
```
# xplor
A package of lightweight utility functions for assisting with interactive data exploration, particularly geared toward understanding record layouts of relational data (e.g., the grain of a table, mappings between primary and affiliate dimensions, extent of normalization/denormalization, etc.). NSE and SE versions of core functions are provided to facilitate interactive and programming use. Functions are designed to work well with `dplyr`.
## Installation
Install **xplor** from github with:
```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("rebelrebel04/xplor")
```
## Core Functions
### cpf
Create **c**umulative **p**ercentile/**f**requency tables with `cpf`. The most common use case is interactive crosstabs (though note that the output frequency table is always in long format).
```{r}
data(mtcars)
cpf(mtcars, gear)
```
You can also request a lightly-formatted `kable` when running `cpf` within an Rmarkdown file:
```{r}
cpf(mtcars, gear, kable = TRUE)
```
A nice feature is that you can pass `cpf` computed variables using bare variable names; optionally provide descriptive names for the table columns:
```{r}
cpf(mtcars, carb >= 4)
cpf(mtcars, has_cyl = !is.na(cyl))
```
Optionally specify a weighting variable for frequencies:
```{r}
cpf(mtcars, carb, wt = "cyl")
```
A chi-square test can be requested for frequency tables that have two or more dimensions:
```{r}
cpf(mtcars, cyl > 4, hp > 123, chi_square = TRUE)
```
`cpf` returns a dataframe containing the frequency table, which can be useful for downstream processing or plotting:
```{r}
cpf(mtcars, gear, cyl, margin = FALSE) %>%
ggplot2::ggplot(ggplot2::aes(x = gear, y = cyl, fill = pct)) +
ggplot2::geom_tile() +
ggplot2::scale_fill_gradient(label = scales::percent)
```
The SE version of `cpf_` is provided in the event you have variable names built up through some other means:
```{r}
vars <- names(mtcars)[grepl("^c", names(mtcars))]
cpf_(mtcars, .dots = vars)
```
### has
`has` is a wrapper around the common pattern of using `cpf` to cross-tab missing value (`NA`) counts across multiple dimensions:
```{r}
mtcars.na <- mtcars
set.seed(1234)
mtcars.na[runif(10, 1, nrow(mtcars.na)), "gear"] <- NA
mtcars.na[runif(5, 1, nrow(mtcars.na)), "cyl"] <- NA
has(mtcars.na, gear, cyl, carb, kable = TRUE)
# This is equivalent to:
# cpf(mtcars.na, has_gear = is.na(gear), has_cyl = is.na(cyl), has_carb = is.na(carb), kable = TRUE)
```
### dup
Check for duplicate keys in a dataframe with `dup`:
```{r}
dup(mtcars, gear, cyl)
```
`dup` divides cases into those with unique keys (which appear only once in the dataset) and those with non-unique keys. This varies from the `duplicated` function in base R, which only counts cases as duplicates *after the first (or last) occurrence*. This is why `dup` reports the number of *unique vs. non-unique* cases by the specified keys.
`dup` invisibly returns the original dataframe filtered to rows with non-unique keys, with a variable `.n` attached indicating the total number of cases with each key combination:
```{r}
print(dup(mtcars, disp))
```
A weight can be supplied to generate weighted duplication rates. This can be useful to determine if duplication is limited to low-frequency cases when weighting by a separate frequency variable:
```{r}
dup(mtcars, disp, wt = "mpg")
```
### ss
Produce a quick table of **s**ummary **s**tatistics with `ss`. The function accepts a named list of functions (specified as formulas) via the `funs` argument, with the default list covering the basics. If no variables are specified the table will summarize all numeric columns in the dataset. If `plot = TRUE` a facet-wrapped plot of histograms will be produced as a side-effect.
```{r}
ss(mtcars, plot = TRUE, kable = TRUE)
```
A `dplyr`-grouped dataframe can be passed to produce a dimensional summary table:
```{r}
library(dplyr)
mtcars %>%
group_by(cyl) %>%
ss(disp, mpg, kable = TRUE)
```
Pass a custom list of summary functions as formulas on `x`:
```{r}
mtcars %>%
ss(
funs = list(
"variance" = ~var(x, na.rm = TRUE),
"skewness" = ~moments::skewness(x, na.rm = TRUE),
"kurtosis" = ~moments::kurtosis(x, na.rm = TRUE)
),
kable = TRUE
)
```
### mapping
For dimensions `x` and `y` in a dataset, there may be only one unique value of `y` associated with each unique value of `x`, or there may be multiple unique values of `y` associated with each unique value of `x`. `mapping` summarizes the association between `x` and `y` in a dataset, specified as a formula `x ~ y`:
```{r}
mapping(cyl ~ gear, mtcars)
```
In this example, there is one value of the "x" variable `cyl` that has a single unique "y" (`gear`) value associated with it, and two values of `cyl` that each have two unique `gear` values associated with them.
### findGrain
Often with a new dataset, you may not know the grain of the table (i.e., the combination of dimensions at which cases are unique). This function searches through candidate dimensions and ranks the top combinations in descending degree of de-duplication (aka normalization). In theory, the true grain of a table would be defined by the smallest combination of dimensions that yields 0 duplicates.
```{r}
data(sleep)
findGrain(sleep, group, ID)
```
### vjoin
Verbose join wrapper around `dplyr` join functions to provide diagnostic output. Joined result is returned invisibly.
```{r}
a <- mtcars[1:10, c("hp", "mpg", "disp")]
b <- mtcars[11:15, c("hp", "cyl")]
vjoin(a, b, "hp")
```
Match rates can be weighted:
```{r}
vjoin(a, b, "hp", wt = "disp")
```
### %&%
Why does `paste(NA)` actually print "NA"?
```{r}
paste(NA)
```
I don't know, but this is one of those places where base R is kind of dumb. Besides just making concatenation easier to read, `% %` ignores NAs (silently replacing them with `""`), which is probably what you wanted R to do in the first place 99.9% of the time:
```{r}
first_names <- c("Abe", "George", "Thomas")
middle_names <- c("J.", NA, "J.")
last_names <- c("Lincoln", "Washington", "Jefferson")
first_names % % middle_names % % last_names
```