Slides, notebooks, references etc for Data Science for Beginners sessions
This was a weekly onsite/remote course, designed to get people who aren’t coders up to speed on data science techniques, and to let them try some of those techniques out for themselves.
Course structure is one topic per week, with:
- Weekly reading list: blogposts, book chapters etc on that week’s topic, to be read before the session
- Weekly tool setups: tools that need to be installed on PCs before each session
- In session: Introduction to techniques on that topic
- In session: Lab time, trying Python/R code related to those techniques
- Post-session: “further reading” list of blogposts, books, courses etc, for anyone wanting to dive further into that topic
This gives an overview of 5-7 concepts per week, adding up to a better understanding of how data scientists work, and hopefully also a desire to explore this topic further. Examples used will be taken from social data science and free/open-source tools; these should be supplemented by guest talks from company data scientists about their projects, work and toolsets. The sessions are based on a Spring 2016 Columbia University course.
- People form: this isn’t for taking notes on who attends; it gives you somewhere to tell other people what you’re interested in, and to find other people in your office/area
- Tool install instructions
- Course reading list
- Places to look for development datasets
- Places to get more help (communities, courses etc)
- 1: Designing and Scoping a Data Science Project
- 2: Python basics
- 3: Acquiring Data
- 4: Communicating Results
- 5: Cleaning and Exploring Data
- 6: Machine Learning
- 7: Handling Text Data
- 8: Handling Geospatial Data
- 9: Learning Relationships from Data
- 10: Handling Big Data
This session:
- Introduces students to the content and supporting materials needed for data scientists to work from a problem specification. Students will also comment on existing data science problem specifications.
- Outcome: students will understand some of the needs and pitfalls in problem specifications, and will have started their own data science project specification.
- Preparing for this session: Look at the problem statements on Kaggle.com, Drivendata.com and Datakind.org, and think about the types of questions being asked, the datasets being used and who benefits from each problem solution.
This session:
- Introduces one of the most-used data science languages: Python.
- Outcome: students will have set up Python and R on their personal machines, and be able to run basic commands in Python.
- Preparing for this session: Install instructions are in the reference folder. Get familiar with your terminal window, and install IPython (if not already on your machine) and Git.
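
As a rough idea of what "basic commands in Python" can look like, here is a minimal sketch. The variable names, values and the `describe` function are made up for illustration and are not taken from the course notebooks.

```python
# Basic Python commands of the kind tried in a first lab.
# All names and values below are illustrative assumptions.

# Variables and arithmetic
sessions = 10
concepts_per_session = 6
total_concepts = sessions * concepts_per_session

# Lists and list comprehensions
topics = ["scoping", "python basics", "acquiring data"]
topic_lengths = [len(topic) for topic in topics]

# Dictionaries map keys to values
tools = {"python": "writing code", "git": "tracking changes"}


def describe(tool_name):
    """Return a one-line description of a tool in the `tools` dictionary."""
    return f"{tool_name} is used for {tools[tool_name]}"


print(total_concepts)
print(topic_lengths)
print(describe("git"))
```

Typing these lines one at a time into an IPython prompt is a reasonable way to check that the install worked.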
This session:
- Introduces students to the art of finding development data, and the idea that almost anything can be a dataset if you look hard enough at it. Also introduces the basic concepts of APIs, web scraping tools (including the Google Spreadsheets webpage scraping tool) and PDF conversion tools (e.g. Cometdocs).
- Preparing for this session: Download the Tabula tool, and think about data relevant to your projects that isn’t in machine-readable form (e.g. xls, pdf, images, maps etc).
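
Many APIs return data as JSON text. The sketch below shows one way to turn such a response into rows, assuming a hypothetical response body; the field names (`results`, `country`, `clinics`) and the numbers are invented for illustration, and no real API is called.

```python
import json

# Hypothetical API response body; a real one would come from a web request.
response_body = """
{"results": [
    {"country": "Kenya", "clinics": 12},
    {"country": "Uganda", "clinics": 7}
]}
"""

# Parse the JSON text into Python dictionaries and lists
data = json.loads(response_body)

# Flatten the records into (country, clinics) rows, ready for a table
rows = [(record["country"], record["clinics"]) for record in data["results"]]
total_clinics = sum(clinics for _, clinics in rows)

print(rows)
print(total_clinics)
```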
This session:
- Introduces communication and visualisation ideas and tools (Tableau, Highcharts/D3 etc). Students will also pitch their project ideas to the rest of the class. Before this lab, students will be asked to install Tableau, and download the Highcharts and D3 libraries.
- Outcome: students will have a basic knowledge of persuasion through data visualisation, and have set up and know basic commands in Tableau.
- Preparing for this session: Download the Tableau tool.
This session:
- Introduces students to data munging and to manually exploring patterns in data before using algorithms on it. Introduces the tools used for this: OpenRefine, R, Matplotlib etc. Before this lab, students will be asked to install OpenRefine (formerly Google Refine).
- Outcome: students will have cleaned a ‘dirty’ dataset with OpenRefine, and explored its contents with R.
- Preparing for this session:
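
The session itself uses OpenRefine and R. As a plain-Python illustration of what cleaning a 'dirty' column involves, here is a minimal sketch; the place names and the `clean` function are assumptions made up for this example.

```python
from collections import Counter

# Hypothetical 'dirty' column: the same place names typed inconsistently,
# plus one empty entry.
raw_values = ["Nairobi", " nairobi", "NAIROBI ", "Kampala", "kampala", "", "Dar es Salaam"]


def clean(value):
    """Trim whitespace, collapse repeated spaces and standardise capitalisation."""
    return " ".join(value.split()).title()


# Drop empty entries, then clean what is left
cleaned = [clean(value) for value in raw_values if value.strip()]

# Counting values is a quick way to explore a column by hand
counts = Counter(cleaned)
print(counts.most_common())
```

Counting the cleaned values shows how many apparently different entries were really the same place.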
This session:
- Introduces students to machine learning, and the regression and classification algorithms used in machine learning.
- Outcomes: students will have run a regression algorithm on a dataset using both Python and R. Students will have run a classification algorithm on a dataset using both Python and R.
- Preparing for this session:
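
To make "regression" concrete, here is a minimal sketch of fitting a straight line by ordinary least squares in plain Python. The data and the function names `fit_line` and `predict` are made up for illustration; in practice a library would usually do this step.

```python
# Made-up data: hours of study (x) and test score (y).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [52.0, 54.0, 56.0, 58.0, 60.0]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict y for a new x using the fitted line."""
    return slope * x + intercept


slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(predict(6.0, slope, intercept))
```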
This session:
- Introduces students to the idea of text as data, to methods and tools for obtaining text (Twitter API etc), and to methods for finding patterns in text (the NLTK library, Overview etc).
- Outcome: students will understand the basic concepts of text analysis and language understanding, including issues specific to development data science (multiple languages, missing stopword lists etc).
- Preparing for this session:
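
A first step in treating text as data is often counting words after removing stopwords. The sketch below does this with the standard library only rather than NLTK; the sample text and the tiny stopword list are assumptions for illustration. When no stopword list exists for a language, a hand-built set like this one is one possible starting point.

```python
import re
from collections import Counter

# Made-up sample text
text = "The clinic reported the outbreak. The outbreak spread to the next village."

# A tiny hand-built stopword list (an assumption; real lists are much longer)
stopwords = {"the", "to", "a", "of", "and"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


# Keep only the tokens that are not stopwords, then count them
tokens = [token for token in tokenize(text) if token not in stopwords]
word_counts = Counter(tokens)

print(word_counts.most_common(3))
```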
This session:
- Introduces students to the idea of maps as data, and to visualising and reasoning about data with spatial components. Introduces techniques and tools commonly used in these processes (GDAL, Shapely, QGIS, CartoDB etc).
- Outcome: students will understand basic concepts of spatial data, including issues specific to development data science (missing maps, satellite datasets etc).
- Preparing for this session:
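
One basic piece of spatial reasoning is measuring the distance between two latitude/longitude points. The sketch below uses the haversine formula in plain Python rather than any of the tools named above; the function name `haversine_km` and the coordinates are illustrative assumptions, and the Earth is approximated as a sphere.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical points: one degree of longitude apart on the equator
distance = haversine_km(0.0, 0.0, 0.0, 1.0)
print(round(distance, 1))
```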
This session:
- Introduces students to the network theory used in machine learning, and often used to understand social relationships. Also introduces some common network visualisation tools (e.g. Gephi, NetworkX).
- Outcome: students will understand basic network analysis concepts and will have run Python network analysis algorithms and viewed a social dataset in Gephi.
- Preparing for this session:
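
A simple network analysis measure is degree centrality: the share of other people each person is directly connected to. The sketch below computes it in plain Python rather than with NetworkX; the names in the edge list and the helper functions are made up for illustration.

```python
from collections import defaultdict

# Hypothetical social network: each pair is an undirected "knows" link.
edges = [("Ana", "Ben"), ("Ana", "Chi"), ("Ana", "Dee"), ("Ben", "Chi"), ("Dee", "Eli")]


def build_graph(edges):
    """Build a mapping from each node to the set of its neighbours."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return neighbours


def degree_centrality(neighbours):
    """Degree of each node divided by the number of other nodes in the graph."""
    n = len(neighbours)
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbours.items()}


graph = build_graph(edges)
centrality = degree_centrality(graph)
most_connected = max(centrality, key=centrality.get)

print(centrality)
print(most_connected)
```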
This session:
- Introduces students to big data concepts (the three Vs, the other three Vs etc) and commonly used tools (Hadoop etc). Introduces students to the analysis of streaming data. If needed, class will also spend time talking about any outstanding issues participants ran into during their projects, and potential ways to work around them.
- Outcome: students will understand basic mechanisms for handling large volume and velocity data (variety is already covered above).
- Preparing for this session: Download Hadoop.
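
A core idea in streaming analysis is updating a result one value at a time instead of storing the whole dataset. The sketch below keeps a running mean over a stream in plain Python; it does not use Hadoop, and the `running_mean` function and the readings are illustrative assumptions.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, without storing them."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: move the mean towards the new value
        mean += (value - mean) / count
        yield mean


# Hypothetical stream of sensor readings, consumed one at a time
readings = iter([10, 20, 30, 40])
means = list(running_mean(readings))

print(means)
```

Because only `count` and `mean` are kept in memory, the same approach works however long the stream runs.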
This session:
- Covers some of the enterprise data science tools out there (IBM Watson, Palantir, Ayasdi, Teradata etc).
- Preparing for this session:
This session:
- Continues further into machine learning techniques.
- Preparing for this session: