---
title: "Day One"
author: "Jody Holland"
date: "2023-03-20"
output:
  prettydoc::html_pretty:
    theme: architect
    highlight: github
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(error = FALSE,
                      warning = FALSE,
                      message = FALSE)
```
<style>
body {
  text-align: justify;
}
</style>
# R and Statistics Crash Course
*Day One (Spring Equinox!)*
# Why Stats, Why R?
- Stats are painful, but good R skills make them easier
- You don't have to know much about how the statistical models work internally, just how to use them
- Let us focus on stats models that are intuitive but carry a lot of conceptual weight:
  - You don't have to be a mechanic to drive a car; we need to find that "Goldilocks Zone" between the maths and the operationalisation
- It is important that your work is repeatable:
  - Value of open data (most papers these days want to see your code and your data)
- The community for R is amazing! Lots of packages and community-driven innovation!
  - Thousands of packages, maybe too many
- Learning R is tough. It will feel weird at first, but you will hit the steep upward part of the learning curve.
  - Though you gotta work, bitch. R is hard and requires self-teaching and discipline
  - Use YouTube (and TikTok)
# What are we doing?
- We are going to look at several families of stats models (a quick preview in code follows this list):
  - General Linear Model:
    - ANOVA
    - ANCOVA
    - MANOVA
    - Straight-line regression (OLS)
  - Generalised Linear Model:
    - Poisson
    - Logistic
    - Multivariate GLM
  - Mixed Models
- Model Selection:
  - AIC
  - BIC
  - Tests of Fit/Robustness
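As a quick preview, here is a minimal sketch (my own illustration, using the built-in `mtcars` dataset) of how some of these families map onto base R functions:
```{r model-preview}
# General Linear Model: straight-line regression (OLS)
ols_fit <- lm(mpg ~ wt, data = mtcars)

# General Linear Model: one-way ANOVA with a categorical predictor
anova_fit <- aov(mpg ~ factor(cyl), data = mtcars)

# Generalised Linear Model: Poisson regression for a count response
pois_fit <- glm(carb ~ wt, family = poisson, data = mtcars)

# Generalised Linear Model: logistic regression for a binary response
logit_fit <- glm(am ~ wt, family = binomial, data = mtcars)

# Model selection: compare two candidate OLS models by AIC (lower is better)
AIC(ols_fit, lm(mpg ~ wt + hp, data = mtcars))
```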
# Process of Quantitative Analysis
1. Go outside, do some research, make some inferences.
2. Build a thesis about how one force influences another
3. Write this as a hypothesis, wherein x leads to y. Each hypothesis has its antithesis, the null hypothesis (H~0~), the statement that the predicted relationship does not occur
4. Collect data through quantitative measurements
5. Store the data in a database
6. Apply statistical models to test the hypothesis
7. Based on the result, reject or fail to reject the null hypothesis (sketched in code below)
*When you start a new study you have to ask yourself many questions with regard to the collection of data, thinking about the space, time span, and manner in which you make your observations. This is all included and justified in the "experimental design" section.*
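For example, a minimal sketch of steps 4 to 7 using simulated data (the variable names and numbers here are invented purely for illustration):
```{r analysis-process}
set.seed(1)

# steps 4-5: "collect" and store the data (simulated here: y depends on x plus noise)
dat <- data.frame(x = runif(50, 0, 10))
dat$y <- 2 * dat$x + rnorm(50, sd = 3)

# step 6: fit a model that tests the hypothesis that x predicts y
fit <- lm(y ~ x, data = dat)

# step 7: check the p-value on the slope; if it is below alpha, reject H0
summary(fit)$coefficients
```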
# Basic Concepts
Below are some basic concepts of stats.
## Statistical Inference
We need stats because we cannot take a universal sample of every possible data point; we can only sample a subset and extrapolate our results.
## Random Sampling
In theory, every data point should be independent and selected at random, free of researcher bias. In practice this is very hard.
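A minimal sketch of drawing a random sample in R (the population here is just made-up unit IDs):
```{r random-sample}
set.seed(42)                  # makes the "random" draw reproducible
population <- 1:1000          # an invented population of unit IDs
sample(population, size = 25) # draw 25 units at random, without replacement
```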
## Replicate
The lowest level in your design. I would call this the "unit of analysis".
## Variables
Any type of measurement that you use in your data:
- Dependent Variable (Response Variable): Y
- Independent Variable (Predictor or Explanatory Variable): X
- These variables can be categories; there are ways to make these pseudo-numeric, such as using dummy variables (assigning each data point either a 1 or a 0 to signify whether it is a member of a given group), as shown in the sketch below
- Categorical variables can be organised into "treatments", which denote different groupings of the categorical variables
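A minimal sketch of how dummy variables look in R (the treatment levels are invented for illustration): factors are expanded into 0/1 indicator columns, and `model.matrix()` shows the coding that `lm()`/`glm()` would use.
```{r dummy-variables}
# a small made-up categorical variable with three levels
treatment <- factor(c("control", "drug_a", "drug_b", "control", "drug_a"))

# each non-reference level becomes a 0/1 dummy column
model.matrix(~ treatment)
```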
## Error
The degree to which our predictors are unable to determine the dependent variable.
## Experimental Unit
Different areas or time spans in which your data are collected. Gathering data in multiple areas demonstrates replicability in your experiment.
## Confounding relationships
It may sometimes appear that a relationship is present in the results. However, it might be that both your dependent and independent variables are actually caused by a third variable that is not included in the model. Confounding variables are hard to find; you have to do a lot of research and accept criticism wholeheartedly.
## Sampling
Sampling is working out what proportion of the population you have to measure in order to achieve a given level of certainty that your results can be extrapolated to the wider population.
In general, the smaller your sample size, the greater the standard error.
However, the relationship between sample size and standard error is not linear; there are diminishing returns when increasing the sample size to improve your accuracy.
We must also consider the scale of the differences in our findings. Say our sample is small but the differences we find between groups are large; then maybe having a smaller sample is OK. However, if the differences in the results are more nuanced, we need a large sample to truly conclude that the slight differences that do exist can be extrapolated to the population.
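A minimal sketch of those diminishing returns: the standard error of a mean is the standard deviation divided by the square root of the sample size, so quadrupling the sample size only halves the error (the numbers below are invented for illustration).
```{r se-vs-n}
n <- c(10, 40, 160, 640)  # increasing sample sizes
s <- 5                    # an assumed standard deviation
data.frame(sample_size = n, standard_error = s / sqrt(n))
```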
## Confidence
Sampling plays an important role in determining the confidence of our results and subsequent inferences. Confidence is usually expressed through the significance level, alpha (α): an α of 0.05 corresponds to 95% confidence, an α of 0.01 to 99% confidence, and so on.
## P Values
Be careful about overusing the p-value (roughly, the probability of observing data at least as extreme as yours if the null hypothesis were true). Use it in the context of the effect size. A low p-value plus a high effect size... then you're cooking. A low p-value with a small effect size, however, could signify p-hacking or overestimated results driven by a very large sample.
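A minimal sketch of that trap (all numbers invented): with a very large simulated sample, even a trivial difference between two groups produces a tiny p-value, so the effect size has to be judged alongside it.
```{r p-vs-effect}
set.seed(123)
n <- 100000
group_a <- rnorm(n, mean = 0.00, sd = 1)
group_b <- rnorm(n, mean = 0.02, sd = 1)  # a tiny, practically meaningless shift

t.test(group_a, group_b)$p.value  # "significant" purely because n is huge

# standardised effect size (difference in means over pooled SD) is tiny
(mean(group_b) - mean(group_a)) / sd(c(group_a, group_b))
```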