Commit ac54ae9

Merge pull request #78 from NIEHS/dariusmb_0219
Add Analysis of hcup/amadeus data
2 parents 34ad8a3 + 7c0bdac

File tree

1 file changed (+153, -2)

chapters/05-01-hcup-amadeus-usecase.Rmd

Lines changed: 153 additions & 2 deletions
```diff
@@ -123,7 +123,7 @@ The `fwf_positions()` function is utilizing column start and end positions found
 df2 <- readr::read_fwf(
   data_file,
   readr::fwf_positions(start = df$X6, end = df$X7, col_names = df$X5),
-  skip = 20,
+  skip = 2,
   na = missing_values
 )
```
````diff
@@ -235,7 +235,7 @@ temp_covar <- calculate_hms(
 )

 # Save processed data
-saveRDS(temp_covar, "smoke_plume2021_covar.R")
+saveRDS(temp_covar, "smoke_plume_covar.R")
 ```
````
In preparation for the next section, we are going to make two new dataframes from our `temp_covar` object. The first collapses our ZIP codes, taking the average number of light, medium, and heavy smoke days.
````diff
@@ -250,6 +250,17 @@ avg_smoke_density <- temp_covar %>%
   )
 print(avg_smoke_density)
 saveRDS(avg_smoke_density, "smoke_density_avg_byZip.R")
+
+# > head(avg_smoke_density)
+# # A tibble: 6 × 4
+#   ZCTA5CE10 avg_light avg_medium avg_heavy
+#   <fct>         <dbl>      <dbl>     <dbl>
+# 1 97833         0.129     0.194      0.419
+# 2 97840         0.161     0.226      0.387
+# 3 97330         0.290     0.129      0.0323
+# 4 97004         0.258     0.0968     0.0323
+# 5 97023         0.194     0.0968     0.0323
+# 6 97042         0.258     0.129      0.0323
 ```
````
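Only the tail of this chunk appears in the hunk; for orientation, a hypothetical reconstruction of the grouping step, consistent with the output shown above (the per-day indicator columns `light`, `medium`, and `heavy` are assumed names, not taken from the source), could look like:

```{r eval=FALSE}
# Hypothetical sketch: collapse daily plume indicators to ZIP-level means
# (the light/medium/heavy column names are assumptions)
avg_smoke_density <- temp_covar %>%
  dplyr::group_by(ZCTA5CE10) %>%
  dplyr::summarise(
    avg_light  = mean(light, na.rm = TRUE),
    avg_medium = mean(medium, na.rm = TRUE),
    avg_heavy  = mean(heavy, na.rm = TRUE)
  )
```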
The second dataframe also groups by ZIP code but takes the sum of the smoke plume days instead of the average.
````diff
@@ -265,6 +276,146 @@ total_smoke_density <- temp_covar %>%
   )
 print(total_smoke_density)
 saveRDS(total_smoke_density, "smoke_density_total_byZip.R")
+
+# > head(total_smoke_density)
+# # A tibble: 6 × 5
+#   ZCTA5CE10 sum_light sum_medium sum_heavy                           geometry
+#   <fct>         <int>      <int>     <int>                     <GEOMETRY [°]>
+# 1 97833             4          6        13 MULTIPOLYGON (((-118.1571 44.9990,…
+# 2 97840             5          7        12 POLYGON ((-116.9899 44.88256, -116…
+# 3 97330             9          4         1 POLYGON ((-123.1829 44.64559, -123…
+# 4 97004             8          3         1 POLYGON ((-122.4867 45.22209, -122…
+# 5 97023             6          3         1 POLYGON ((-122.0758 45.10881, -122…
+# 6 97042             8          4         1 POLYGON ((-122.5842 45.20546, -122…
 ```
````

The remainder of the hunk is new content, filling in the section below.

## Data Analysis using HCUP and Amadeus data sources
First we will load in the HCUP data file we processed earlier and subset it to a set of variables that makes the data easier to work with (from 702 columns down to 39) while remaining interesting for analysis. These include ZIP code, age at admission, admission month, race identifier, sex, and ICD-10 diagnosis codes.
```{r eval=FALSE}
or_sedd_2021 <- fread("OR_SEDD_2021_CORE.csv")
subset_data <- or_sedd_2021 %>%
  select(FEMALE, ZIP, PSTCO, AGE, RACE, AMONTH, starts_with("I10_"))

# Inspect the selected columns (the output below lists all 39 names)
colnames(subset_data)

#  [1] "FEMALE"               "ZIP"                  "PSTCO"
#  [4] "AGE"                  "RACE"                 "AMONTH"
#  [7] "I10_DX_Visit_Reason1" "I10_DX_Visit_Reason2" "I10_DX_Visit_Reason3"
# [10] "I10_DX1"              "I10_DX2"              "I10_DX3"
# [13] "I10_DX4"              "I10_DX5"              "I10_DX6"
# [16] "I10_DX7"              "I10_DX8"              "I10_DX9"
# [19] "I10_DX10"             "I10_DX11"             "I10_DX12"
# [22] "I10_DX13"             "I10_DX14"             "I10_DX15"
# [25] "I10_DX16"             "I10_DX17"             "I10_DX18"
# [28] "I10_DX19"             "I10_DX20"             "I10_DX21"
# [31] "I10_DX22"             "I10_DX23"             "I10_DX24"
# [34] "I10_DX25"             "I10_DX26"             "I10_DX27"
# [37] "I10_DX28"             "I10_NDX"              "I10_PROCTYPE"
```
Next we will subset to July, our month of interest, to further reduce the size of the data and to focus on a time frame when we know fires took place in Oregon. We will also load in the environmental data files we created above with amadeus.
```{r eval=FALSE}
# subset data to July
july_subset_hcup_data <- subset_data[subset_data$AMONTH == 7, ]

# load in amadeus files we made previously
avg_smoke_density <- readRDS("smoke_density_avg_byZip.R")
total_smoke_density <- readRDS("smoke_density_total_byZip.R")
```
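As a quick sanity check (a sketch using the objects above), we can confirm that only July records remain and inspect the grain of the smoke tables:

```{r eval=FALSE}
# The subset should contain only admission month 7, and the smoke
# tables should contain one row per ZCTA
table(july_subset_hcup_data$AMONTH)
nrow(avg_smoke_density)
nrow(total_smoke_density)
```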
### Merging Environmental Data with Hospital Data

We will now merge our environmental data into our hospital discharge (HCUP) data using an inner join on ZIP codes present in both datasets.
```{r eval=FALSE}
# Perform an inner join to merge `july_subset_hcup_data` with
# `avg_smoke_density` based on the ZIP code (`ZIP` in HCUP data and
# `ZCTA5CE10` in smoke density data)
merged_data <- inner_join(july_subset_hcup_data, avg_smoke_density,
                          by = c("ZIP" = "ZCTA5CE10"))

# Perform another inner join to add `total_smoke_density` to the existing
# `merged_data`
merged_data <- inner_join(merged_data, total_smoke_density,
                          by = c("ZIP" = "ZCTA5CE10"))
```
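One caveat worth checking before these joins: `ZCTA5CE10` is a factor (see the `<fct>` type in the printed tibbles above), while `ZIP` in the HCUP extract may have been read as integer or character. `inner_join()` errors on incompatible key types (e.g., integer vs. factor), so a defensive sketch, assuming the objects above, casts both sides to character first:

```{r eval=FALSE}
# Cast both join keys to character so the key types are compatible
# (ZCTA5CE10 is a factor; ZIP may be integer or character)
july_subset_hcup_data <- july_subset_hcup_data %>%
  mutate(ZIP = as.character(ZIP))
avg_smoke_density <- avg_smoke_density %>%
  mutate(ZCTA5CE10 = as.character(ZCTA5CE10))
total_smoke_density <- total_smoke_density %>%
  mutate(ZCTA5CE10 = as.character(ZCTA5CE10))
```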
### Identifying Asthma Cases

Next, we will identify individuals diagnosed with asthma. This involves searching for the ICD-10 code "J45" within the diagnosis columns of our dataset.
```{r eval=FALSE}
# Identify the columns containing diagnosis codes (prefix "I10_")
diag_columns <- grep("^I10_", colnames(merged_data), value = TRUE)

# Create a new column `has_asthma` that checks if any diagnosis contains "J45"
smoke_summary <- merged_data %>%
  mutate(has_asthma = apply(select(., all_of(diag_columns)), 1, function(x) {
    any(grepl("J45", x))
  }))

# Count total number of individuals in the dataset
total_individuals <- nrow(smoke_summary)

# Count the number of individuals with an asthma diagnosis
asthma_cases <- sum(smoke_summary$has_asthma, na.rm = TRUE)

# Calculate the proportion of individuals diagnosed with asthma
asthma_rate <- asthma_cases / total_individuals
```
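The row-wise `apply()` above works but can be slow on a table of this size; an equivalent vectorized sketch using dplyr's `if_any()` (available in dplyr >= 1.0.4), with the same `diag_columns`, would be:

```{r eval=FALSE}
# Vectorized alternative: TRUE if any diagnosis column contains "J45"
smoke_summary <- merged_data %>%
  mutate(has_asthma = if_any(all_of(diag_columns), ~ grepl("J45", .x)))
```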
### Visualizing the Relationship Between Heavy Smoke Exposure and Asthma

We will now generate a boxplot to visualize the distribution of average heavy smoke days across individuals with and without an asthma diagnosis.
```{r eval=FALSE}
ggplot(smoke_summary, aes(x = factor(has_asthma), y = avg_heavy,
                          fill = factor(has_asthma))) +
  geom_boxplot() +
  labs(
    x = "Has Asthma (J45)",
    y = "Average Heavy Smoke Days",
    fill = "Has Asthma",
    title = "Average Heavy Smoke Days vs Asthma Diagnosis"
  ) +
  theme_minimal()
```
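To pair a number with the plot, a short summary of mean heavy-smoke exposure by diagnosis group (a sketch reusing `smoke_summary`) can be computed as:

```{r eval=FALSE}
# Mean of avg_heavy for individuals with and without an asthma diagnosis
smoke_summary %>%
  group_by(has_asthma) %>%
  summarise(mean_avg_heavy = mean(avg_heavy, na.rm = TRUE))
```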
### Logistic Regression Analysis

Finally, we fit a logistic regression model to examine the relationship between asthma diagnoses and exposure to different levels of smoke density.
```{r eval=FALSE}
# Fit a logistic regression model with asthma diagnosis as the outcome variable
# and different smoke exposure levels as predictors
model <- glm(has_asthma ~ avg_light + avg_medium + avg_heavy,
             data = smoke_summary, family = binomial)

# Display model summary
summary(model)
# Call:
# glm(formula = has_asthma ~ avg_light + avg_medium + avg_heavy,
#     family = binomial, data = smoke_summary)
#
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept) -3.38823    0.09077 -37.329  < 2e-16 ***
# avg_light   -0.21258    0.30322  -0.701    0.483
# avg_medium   1.74996    0.32456   5.392 6.98e-08 ***
# avg_heavy    1.82572    0.16826  10.850  < 2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 42004  on 111124  degrees of freedom
# Residual deviance: 41674  on 111121  degrees of freedom
# AIC: 41682
#
# Number of Fisher Scoring iterations: 6
```
The output provides estimates for each predictor, helping us assess the impact of light, medium, and heavy smoke exposure on asthma prevalence. In this model, medium and heavy smoke exposure are positively and significantly associated with an asthma diagnosis, while light smoke exposure shows no significant association (p = 0.483).
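Because these coefficients are on the log-odds scale, a short follow-up step (a sketch, reusing the fitted `model` object) converts them to odds ratios for easier interpretation:

```{r eval=FALSE}
# Exponentiate coefficients to read them as odds ratios
# (e.g., exp(1.82572) is roughly 6.2 for avg_heavy)
exp(coef(model))

# Wald 95% confidence intervals, also on the odds-ratio scale
exp(confint.default(model))
```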
