February 2019 Chris Cameron
I am auditing the Microsoft Professional Program for Artificial Intelligence track.
The fifth of 10 courses is Data Science Research Methods: Python Edition.
The Jupyter Notebooks for the labs are here
- Question Development, with Analytics in mind
- Collecting data for the question you want to answer
- in straight analytics, you already have the data sources, but you might not like the variables
- in research, you get to find the data you need to answer your question
- Dynamic, iterative problem-solving process
Process
- frame the question
- form a theory
- form a hypothesis
- design an experiment / study and test
- draw conclusions; repeat as necessary
- final conclusion
Goals for you (me) in the course
- when given data out of context, you can be more effective
- prevent data misuse; not all data are created equal
- know when to recommend data collection / can design your own research
Basic Research
- describe / explain / predict / control (these are the subtypes)
- goal: understand X
- how does x work
- why do customers do x
- advantage: planning next moves
- about understanding your topic
- doesn't allow us to "make new moves"
Applied Research
- evaluation of practice, product, idea, assumption
- choices often made based on assumptions or intuition
- be right. stand on firm ground!
- a way to test assumptions
- could be smaller-scale, shorter tests
- Theory: detailed explanation of how something works
- Hypothesis: what i should expect to see if theory is true
- Data Collection: use appropriate research methods
- Descriptive Statistics: is my hypothesis supported in my sample?
- Inferential Statistics: If so, can I reject 'chance' as an explanation and generalize to the population?
- Draw Conclusions: implications for Theory and Methods
- back to 1!
Common Problem: overstuffing a survey
- we want to know everything
Problem: depth (quality) vs breadth
Clarifying Interview
- uncover the 'question behind the question'
- re-imagining the project around that
goal
- 1-2 learning objectives
- several lines of attack for those objectives
- tight focus on key objectives
Research Foxtrot
- Look at proposed design. Step back. What are underlying assumptions and questions?
- Formalize that (theory)
- Identify several focused lines of attack (hypotheses): ways of testing that theory
- Develop study / survey around those
Participants are people.
- not robots giving info
- biased!
- influenced by design decisions
People want to manage the impression they make on you, and the one they have of themselves.
Rule 1: Be NEUTRAL. If you say "how often do you recycle?" they'll say "ALL THE TIME!" because they want to feel good
Rule 2: Sound Objective, because people are really good at knowing what answer you're looking for. Some want to sabotage you, others want to co-operate.
People do not know why they do things and will make up answers when asked.
People are biased by the order of things.
Especially with "why?" questions, people make things up.
More generally: avoid overly complex questions.
goal for survey questions: you want Simple and Knowable questions. If people can't "gut" know the answer to it, don't ask them.
Placebo Effects
- Beliefs matter. e.g., products are more effective when people think they are.
- manage expectations. if something is more expensive, they assume it's better
Observer bias: see what you want to see.
Observer effect: act differently when observed.
Solution: Blinding
Neither participants nor the people collecting the data should know what to expect.
Example: when people were exposed to words about the elderly, they appeared to walk away more slowly; when the observers weren't told what the study was about, the difference went away.
double-blind is best
not available
samples are error-prone guesses. we almost never have population data in data science
problem is Sampling Error.
- sample may not be representative.
- the larger the sample, the more trustworthy it is
Standard Error lets you know how trustworthy your sample is
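A quick sketch (mine, not the lab's) of how the standard error shrinks as the sample grows; the population here is simulated just for illustration:

```python
import numpy as np

# Hypothetical population: we almost never have this in practice,
# but simulating one shows how sample size affects trustworthiness.
rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=100_000)

for n in (25, 100, 400):
    sample = rng.choice(population, size=n, replace=False)
    # Standard error of the mean: sample std dev / sqrt(n)
    sem = sample.std(ddof=1) / np.sqrt(n)
    print(f"n={n:4d}  sample mean={sample.mean():.2f}  standard error={sem:.2f}")
```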
A lot of this is repeating stuff from the Essential Math course. I'll only take notes that add to what is already learned.
large p value - result easily produced by random chance
small p value - not easily produced by random chance
"significant", but chance has not been disproved!
Confidence interval can also tell you a range for where the true value is. larger sample sizes will give you a better interval too
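A minimal sketch of both ideas with scipy (my own example data, not the lab's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical sample; in the labs the data comes from CSV files instead.
sample = rng.normal(loc=52, scale=10, size=100)

# p value: how often we'd see a mean this far from 50 if the true mean were 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

# 95% confidence interval for the true mean, based on the t distribution
sem = stats.sem(sample)
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1,
                                   loc=sample.mean(), scale=sem)

print(f"p = {p_value:.4f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```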
How do you know you have a big enough sample?
Power/Statistical Power: % of the time an effect will be detected
Power depends on
- effect size (bigger effect is easier to detect)
- sample size (bigger sample has more power)
Effect Size
Cohen's d (difference)
Numbers here are in standard deviations. Jacob Cohen (statistician) came up with these ranges (see the sketch after the list):
- 0 - 0.2 = trivial
- 0.2 - 0.5 = small
- 0.5 - 0.8 = medium
- > 0.8 = large
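A quick sketch of computing Cohen's d from two groups using the pooled standard deviation (my own made-up groups, not course code):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: difference in means, in units of the pooled standard deviation."""
    a, b = np.asarray(a), np.asarray(b)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) \
                 / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(2)
group_a = rng.normal(100, 15, size=60)   # made-up groups for illustration
group_b = rng.normal(92, 15, size=60)
print(f"Cohen's d = {cohens_d(group_a, group_b):.2f}")  # around 0.5, a 'medium' effect
```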
r-value (correlation coefficient) is another way to measure.
You can convert between r and d:
r = d / sqrt(d**2 + 4)
d = (2*r) / sqrt(1 - r**2)
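These conversions are easy to wrap in little helpers; a quick sketch:

```python
import math

def d_to_r(d):
    # r = d / sqrt(d^2 + 4)
    return d / math.sqrt(d**2 + 4)

def r_to_d(r):
    # d = 2r / sqrt(1 - r^2)
    return (2 * r) / math.sqrt(1 - r**2)

print(d_to_r(0.8))          # a 'large' d maps to r of about 0.37
print(r_to_d(d_to_r(0.8)))  # round-trips back to 0.8
```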
Basically this video is about making sure you have a large enough sample based on your effect size.
effect size to sample size is not a linear relationship
He shows a power chart: sample size on the x axis, power on the y axis, with each curve representing a different Cohen's d value.
Need samples > 400-500 for smaller effects
When comparing groups, size is per group
Better to just use the power calculation tools available in statistical packages.
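For example, a minimal sketch with statsmodels, assuming a two-group t-test design (the numbers are just illustrative):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# How many participants per group to detect a 'small' effect (d = 0.2)
# with 80% power at alpha = 0.05?
n_per_group = analysis.solve_power(effect_size=0.2, power=0.8, alpha=0.05)
print(f"n per group: {n_per_group:.0f}")   # roughly 400, matching the note above

# Or flip it: what power do I get from 100 per group for a 'medium' effect (d = 0.5)?
power = analysis.solve_power(effect_size=0.5, nobs1=100, alpha=0.05)
print(f"power: {power:.2f}")
```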
What happens when you have weak Power?
False Positive is a Type 1 Error
- Find effect when absent.
- 5% of the time when no effect is present, with alpha = .05 (95% confidence)
False Negative is a Type 2 Error
- Find no effect/relationship when actually present
- 20% of time when actually present with 80% power
Note: This video reminds me of the Replicability crisis in several fields right now.
Filtering for Significance
- what: selecting only p < 0.05 results to share
- why it's bad: only the context of all results makes false positives clear
P-hacking
- what: trying many analyses, data subsets, etc. to get p < 0.05
- why it's bad: guaranteed to find false positives
- really common when someone is motivated to find a significant result
HARKing
- what: hypothesizing after results are known
- why it's bad: overfitting. sample results are estimates only
Optional Stopping
- what: stopping your test during data collection as soon as p < 0.05
- why it's bad: false positives guaranteed (see the simulation sketch below)
- HUGE no-no. Some people keep checking the data as every point comes in, hoping for p < 0.05. BAD
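A rough simulation (mine, not the course's) of why optional stopping is so bad: the null is true in every run, yet peeking after each batch of data pushes the false positive rate well above 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
runs, false_positives = 1000, 0

for _ in range(runs):
    a, b = [], []
    significant = False
    for _ in range(20):                      # keep collecting in batches of 10...
        a.extend(rng.normal(0, 1, size=10))  # ...even though both groups come
        b.extend(rng.normal(0, 1, size=10))  # from the same distribution
        if stats.ttest_ind(a, b).pvalue < 0.05:
            significant = True               # stop as soon as p < 0.05
            break
    false_positives += significant

print(f"False positive rate with optional stopping: {false_positives / runs:.1%}")
# With a single fixed-size test this would be about 5%; here it's much higher.
```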
Unavailable
Five labs for module 2
- Sampling
- P Values
- CIs
- Power (part 1)
- Power (part 2)
I copied this verbatim from the P Values lab because even the guy's slides went against some of this advice
There is a lot of confusion about p-values, so let's review:
- p-values represent how often you could get a result as big as you did if the null were true
- p-values therefore represent how easy/hard it would be to get a result by chance
- p-values do not tell you the probability that the result is due to chance; only the probability of seeing your result if the null were true
- If the p-value for a result is small, it would be rare to get that result by chance (i.e., if the null were true)
- If the p-value for a result is large, it would be common to get that result by chance (i.e., if the null were true)
- Conclusion: the p-value is a measure of "incompatibility" between your result and the null. If the p-value is small, one of the two (the data, or the null) is likely wrong. We opt to trust our data and reject the null.
To be clear: the p-value is a backwards way of testing the null hypothesis. We would love to know the probability that the null hypothesis is true--the probability that the results are due to chance--but we cannot know that. You will often hear the p-value described this way, but that is very wrong.
So, to repeat, the p-value states the probability of getting your result if the null is true. It is essentially a statement of incompatibility between your data and the null. A small p-value (typically, less than 5% or "< .05") tells you that the data and null are highly incompatible. Since you did in fact observe the data, you conclude the null hypothesis is false. This is the only use for the p-value.
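To make that definition concrete, here's a small simulation sketch of my own (not from the lab): the p-value is just how often a result this extreme shows up in a world where the null is true.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical observed data: two groups of 30 with a modest real difference.
group_a = rng.normal(0.3, 1, size=30)
group_b = rng.normal(0.0, 1, size=30)
result = stats.ttest_ind(group_a, group_b)

# Simulate a world where the null is true (both groups from the same
# distribution) and count how often a t statistic at least this extreme
# shows up by chance alone.
sims, count = 10_000, 0
for _ in range(sims):
    x = rng.normal(0, 1, size=30)
    y = rng.normal(0, 1, size=30)
    if abs(stats.ttest_ind(x, y).statistic) >= abs(result.statistic):
        count += 1

print(f"t-test p-value:             {result.pvalue:.3f}")
print(f"simulated chance of result: {count / sims:.3f}")  # should roughly agree
```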
Note: The lab about Confidence Intervals needs access to a file I cloned outside of the docker container. I got lazy, and instead of mounting the labs in docker I just curled them from Github like the first course did, e.g.,
!curl https://raw.githubusercontent.com/MicrosoftLearning/Research-Methods-for-Data-Science-with-Python/master/Module2/datasets/attitude.csv -o attitude.csv