Provide solution section for data modeling exercise.
mbjones committed Jan 28, 2025
1 parent f4fcccd commit e935cf3
Showing 1 changed file with 101 additions and 42 deletions: materials/sections/data-modeling-socialsci.qmd

## Introduction

::: callout-note
Slides for this lesson available [here](files/intro-tidy-data-slides.pdf).
:::

In this lesson we are going to learn what relational data models are, and how they can be used to manage and analyze data.

A great paper called 'Some Simple Guidelines for Effective Data Management' [@borer_simple_2009] lays out exactly that: guidelines that make your data management, and your reproducible research, more effective.

- **Use a scripted program (like R!)**

A scripted program helps to make sure your work is reproducible. Typically, point-and-click actions, such as clicking on a cell in a spreadsheet program and modifying the value, are not reproducible or easily explained. Programming allows you to both reproduce what you did and explain it if you use a tool like R Markdown.

- **Non-proprietary file formats are preferred (eg: csv, txt)**

Using a file that can be opened using free and open software greatly increases the longevity and accessibility of your data, since opening the data file does not depend on any particular software license.

- **Keep a raw version of data**

In conjunction with using a scripted language, keeping a raw version of your data is essential for generating a reproducible workflow. When you keep your raw data, your scripts can read from that raw data and create as many derived data products as you need, and you will always be able to re-run your scripts and know that you will get the same output.

- **Use descriptive file and variable names (without spaces!)**

When you use a scripted language, you will be using file and variable names as arguments to various functions. Programming languages are quite particular about what they can interpret as values, and they are especially sensitive to spaces. So, if you are building reproducible workflows around scripting, or plan to in the future, saving your files without spaces or special characters will help you read those files and variables more easily. Additionally, making file and variable names descriptive will help your future self and others more quickly understand what type of data they contain.

- **Include a header line in your tabular data files**

Using a single header line of column names as the first row of your data table is the most common and easiest way to achieve consistency among files.

- **Use plain ASCII text**

ASCII (sometimes just called plain text) is a very commonly used standard for character encoding, and is far more likely to remain readable in the future than proprietary binary formats such as Excel.
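The header-line and plain-text guidelines work together: a plain CSV whose first row is a single header line can be read by virtually any tool. Here is a minimal sketch in Python (the lesson itself uses R, where `read.csv()` does the same job; the data are made up):

```python
import csv
import io

# A plain-text CSV whose first line is a single header row (made-up data).
raw = "site,date,temp_c\nA,2024-06-01,12.5\nA,2024-06-02,13.1\nB,2024-06-01,9.8\n"

# DictReader maps each header name to the values below it.
rows = list(csv.DictReader(io.StringIO(raw)))
temps_at_a = [float(r["temp_c"]) for r in rows if r["site"] == "A"]
```

Because the column names come straight from the header line, no hand-written positional bookkeeping is needed.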

The next three are a little more complex, but all are characteristics of the relational data model:

- Design tables to add rows, not columns
- Each column should contain only one type of information
- Record a single piece of data only once; separate information collected at different scales into different tables.

### File and folder organization {.unnumbered}

Before moving on to discuss the last 3 rules, here is an example of how you might organize the files themselves following the simple rules above. Note that we have all open formats, plain text formats for data, sortable file names without special characters, scripts, and a special folder for raw files.


Before we learn how to create a relational data model, let's look at how to recognize data that does not conform to the model.

### Data Organization {.unnumbered}

This is a screenshot of an actual dataset that came to NCEAS. We have all seen spreadsheets that look like this - and it is fairly obvious that whatever this is, it isn't very tidy. Let's dive deeper into exactly **why** we wouldn't consider it tidy.

![](images/tidy-data-images/tidy_data/excel-org-01.png)

### Multiple tables {.unnumbered}

Your human brain can see from the way this sheet is laid out that it has three tables within it. Although it is easy for us to see and interpret this, it is extremely difficult to get a computer to see it this way, which will create headaches down the road should you try to read this information into R or another programming language.

![](images/tidy-data-images/tidy_data/excel-org-02.png)

### Inconsistent observations {.unnumbered}

Rows correspond to **observations**. If you look across a single row, and you notice that there are clearly multiple observations in one row, the data are likely not tidy.

![](images/tidy-data-images/tidy_data/excel-org-03.png)

### Inconsistent variables {.unnumbered}

Columns correspond to **variables**. If you look down a column, and see that multiple variables exist in the table, the data are not tidy. A good test for this can be to see if you think the column consists of only one unit type.

![](images/tidy-data-images/tidy_data/excel-org-04.png)

### Marginal sums and statistics {.unnumbered}

Marginal sums and statistics are also not considered tidy: they are not the same type of observation as the other rows, but rather a combination of observations.

![](images/tidy-data-images/tidy_data/excel-org-05.png)

## Good enough data modeling

### Denormalized data {.unnumbered}

When data are "denormalized" it means that observations about different entities are combined.

In the above example, each row has measurements about both the community in which an individual lives and about the individual themselves.

People often refer to this as *wide* format, because the observations are spread across a wide number of columns. Note that, should one survey another individual in either community, we would have to add new columns to the table. This is difficult to analyze, understand, and maintain.
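To make the maintenance problem concrete, here is a minimal sketch (Python for illustration; the values are made up): in a wide layout, surveying a new individual forces a new column, while in a long layout it is just a new row.

```python
# Wide layout: one income column per individual (made-up data).
# Surveying a new person means adding a *column* to every row.
wide = [
    {"community": "Elwha", "income_alice": 42000, "income_bob": None},
    {"community": "Nisqually", "income_alice": None, "income_bob": 38000},
]

# Long layout: one row per observation.
# Surveying a new person just appends a row; no structural change.
long_rows = [
    {"community": "Elwha", "name": "alice", "income": 42000},
    {"community": "Nisqually", "name": "bob", "income": 38000},
]
long_rows.append({"community": "Elwha", "name": "carol", "income": 51000})
```

Note that the wide layout also fills up with missing values (`None`) wherever a person was not surveyed in a community, while the long layout simply has no row for that combination.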

### Tabular data {.unnumbered}

**Observations**. A better way to model data is to organize the observations about each type of entity in its own table. This results in:

- All columns pertain to the same observed entity (e.g., row)
- Each column represents either an identifying variable or a measured variable

::: callout-note
### Challenge

Try to answer the following questions:

What are the observed entities in the example above?

What are the measured variables associated with those observations?

:::

::: {.callout-note collapse="true"}
### Answer

![](images/table-denorm-entity-var-ss.png)
:::

If we use these questions to tidy our data, we should end up with:

- one table for each entity observed
- one column for each measured variable
- additional columns for identifying variables (such as community)

Here is what our tidy data look like:

![](images/tables-norm-ss.png)

Note that this normalized version of the data meets the three guidelines set by [@borer_simple_2009]:

- Design tables to add rows, not columns
- Each column should contain only one type of information
- Record a single piece of data only once; separate information collected at different scales into different tables.
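The splitting step itself can be sketched in a few lines of code (Python here for illustration; the column names and values are stand-ins, not the dataset in the figures): each community fact is recorded once in its own table, and individuals keep `community` as an identifying variable.

```python
# Denormalized rows: community attributes repeat on every individual row
# (made-up columns and values).
denormalized = [
    {"community": "Elwha", "region": "WA", "name": "Alice", "resident_years": 12},
    {"community": "Elwha", "region": "WA", "name": "Bob", "resident_years": 3},
    {"community": "Nisqually", "region": "WA", "name": "Carol", "resident_years": 8},
]

# One table per observed entity: each community fact is recorded once ...
communities = {}
for r in denormalized:
    communities[r["community"]] = {"community": r["community"], "region": r["region"]}
community_table = list(communities.values())

# ... and individuals keep `community` as an identifying variable (a foreign key).
individual_table = [
    {"community": r["community"], "name": r["name"], "resident_years": r["resident_years"]}
    for r in denormalized
]
```

In R you would typically do the same with `dplyr::distinct()` and `dplyr::select()`, but the logic is identical: deduplicate the community-level columns into their own table, and keep only individual-level columns plus the identifier in the other.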

## Using normalized data

When one has normalized data, we often use unique identifiers to reference particular observations:
- Primary Key: unique identifier for each observed entity, one per row
- Foreign Key: reference to a primary key in another table (linkage)

::: callout-note
### Challenge

![](images/tables-norm-ss.png)

In our normalized tables above, identify the following:

- the primary key for each table
- any foreign keys that exist
:::

::: {.callout-note collapse="true"}
### Answer

The primary key of the top table is `community`. The primary key of the bottom table is `id`.

The `community` column is the *primary key* of that table because it uniquely identifies each row of the table as a unique observation of a community. In the second table, however, the `community` column is a *foreign key* that references the primary key from the first table.

![](images/tables-keys-ss.png)
:::
Sometimes people represent these as Venn diagrams showing which parts of the left and right tables are included in the results.

In the figure above, the blue regions show the set of rows that are included in the result. For the INNER join, the rows returned are all rows in A that have a matching row in B.

## Exercise: Data modeling for course surveys

::: callout-note

## Exercise

- Break into groups

Our funding agency requires that we take surveys of individuals who complete our training courses so that we can report on the demographics of our trainees and how effective they find our courses to be. In your small groups, design a set of tables that will capture information collected in a participant survey that would apply to many courses.

Don't focus on designing a comprehensive set of questions for the survey; one or two simple stand-ins (e.g., "Did the course meet your expectations?", "What could be improved?", "To what degree did your knowledge increase?") would be sufficient.

Include as variables (columns) a basic set of information not only from the surveys (such as survey question responses), but also about the courses, such as the date and name of each course. Try to account for the same person participating in multiple courses, for multiple courses being held each year, and for the same survey questions being asked of participants across different courses.

Draw your entity-relationship model for your tables.

:::

::: {.callout-note collapse="false"}

## Solution

We can start by creating one box for each type of entity that we want to collect data about. Each box represents a data table in our design. As we are collecting survey responses, we might start with a table for `Response` that would contain one observation for each survey response that we want to store. That Response is about a particular course, so we can add another table for information about that `Course`, and link those two tables.


```{mermaid}
erDiagram
Response ||--|| Course : about
```

Next, we can change the cardinality of the relationship to indicate that each course can contain multiple responses. We can also add a new table to hold the details for each `Participant` that `takes` each `Course`; each `Participant` provides a `Response` when filling out a survey. And lastly, because questions might be reused across surveys, we create a linkage from the `Response` table to a new `Question` table that has one row for each unique question.

```{mermaid}
erDiagram
    Response }|--|| Course : about
    Participant }|--|{ Course : takes
    Participant ||--|| Response : provides
    Response }|--|| Question : for
```

Finally, we can add the attributes that we would have for each table, indicating which are primary keys and which are foreign keys.

```{mermaid}
erDiagram
    Response }|--|| Course : about
    Participant }|--|{ Course : takes
    Participant ||--|| Response : provides
    Response }|--|| Question : for
    Response {
        string response_id PK
        string participant_id FK
        string course_id FK
        string question_id FK
        string response_value
    }
    Participant {
        string participant_id PK
        string name_first
        string name_last
        string email
    }
    Course {
        string course_id PK
        string course_name
        date date_start
        date date_end
    }
    Question {
        string question_id PK
        string question_text
    }
```
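As a sanity check, one possible translation of this design into SQL tables (a sketch using SQLite via Python; column types are simplified, and SQLite stores dates as ISO-8601 text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Participant (
        participant_id TEXT PRIMARY KEY,
        name_first     TEXT,
        name_last      TEXT,
        email          TEXT
    );
    CREATE TABLE Course (
        course_id   TEXT PRIMARY KEY,
        course_name TEXT,
        date_start  TEXT,
        date_end    TEXT
    );
    CREATE TABLE Question (
        question_id   TEXT PRIMARY KEY,
        question_text TEXT
    );
    -- Response links a participant, a course, and a question via foreign keys.
    CREATE TABLE Response (
        response_id    TEXT PRIMARY KEY,
        participant_id TEXT REFERENCES Participant(participant_id),
        course_id      TEXT REFERENCES Course(course_id),
        question_id    TEXT REFERENCES Question(question_id),
        response_value TEXT
    );
""")
tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
```

Each `REFERENCES` clause corresponds to one relationship line in the entity-relationship diagram.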

:::

## Resources

- [Borer et al. 2009. **Some Simple Guidelines for Effective Data Management.** Bulletin of the Ecological Society of America.](http://matt.magisa.org/pubs/borer-esa-2009.pdf)
- [White et al. 2013. **Nine simple ways to make it easier to (re)use your data.** Ideas in Ecology and Evolution 6.](https://doi.org/10.4033/iee.2013.6b.6.f)
- [Software Carpentry SQL tutorial](https://swcarpentry.github.io/sql-novice-survey/)
- [Tidy Data](http://vita.had.co.nz/papers/tidy-data.pdf)
- [Intro to Tidy Data slides](files/intro-tidy-data-slides.pdf) (this lesson module)
