No prior experience of R is assumed, though students may find it useful to work through the CIM introductory workshop before attending. Those with a technical or statistical background may also find my qstep statistics workshops useful.
The software we will use is open-source and cross-platform. We use R for the programmatic component and draw upon Python for natural language processing. Chrome with the Data Miner and Autoscroll extensions is optional.
Installing Chrome, R and RStudio should be straightforward on Windows, macOS and Linux. Links for these are provided in the sections below.
Installing Python and spaCy (to provide natural language processing) differs slightly between operating systems. Detailed instructions for each platform are given below.
Note: The R version used to create this workshop is 3.5.2 'Eggshell Igloo'.
R can be downloaded and installed from the R project page. There are installers for both Windows and macOS.
R can be installed on Ubuntu by entering the following in the terminal:
sudo apt-get update
sudo apt-get install r-base
However, the version in the Ubuntu repository may be slightly older. If you want the latest version, you can add the R project repository as detailed here.
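To see which version you ended up with, something like the following can be run in the R console. It is only a minimal sketch: the variable names are illustrative, and the comparison target is simply the 3.5.2 release mentioned in the note above.

```r
# Version of R used to build the workshop materials (from the note above).
workshop_version <- "3.5.2"

# getRversion() returns the running R version as a comparable object.
installed_version <- getRversion()
message("Installed R version: ", as.character(installed_version))

if (installed_version < workshop_version) {
  message("This is older than ", workshop_version,
          "; consider adding the R project repository to get a newer release.")
} else {
  message("This is at least as new as the version used for the workshop.")
}
```

A newer version of R than 3.5.2 should normally be fine for the workshop; the check only flags older installations.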
RStudio is an integrated development environment (IDE) which makes working with R easier. RStudio can be downloaded from here.
We use several packages for R to help with our data collection, analysis and visualisation. Enter the following in R to download and install these packages:
install.packages(c('tidyverse', 'tidytext', 'RedditExtractoR', 'rvest', 'httr'))
The installation will take a while. Once it is complete, you can load a package into R by calling library() with the package name.
library('tidyverse')
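If you are unsure whether every package installed correctly, a check along these lines may help. It is a sketch rather than part of the workshop materials; the names `workshop_packages`, `is_installed` and `missing_packages` are illustrative, and the package list is copied from the install.packages() call above.

```r
# Packages named in the install.packages() call above.
workshop_packages <- c("tidyverse", "tidytext", "RedditExtractoR", "rvest", "httr")

# requireNamespace() reports whether a package is installed without attaching it.
is_installed <- vapply(workshop_packages, requireNamespace,
                       logical(1), quietly = TRUE)
print(is_installed)

# List anything that still needs installing.
missing_packages <- workshop_packages[!is_installed]
if (length(missing_packages) > 0) {
  message("Still to install: ", paste(missing_packages, collapse = ", "))
} else {
  message("All workshop packages are installed.")
}
```

Any package reported as missing can be installed on its own by passing its name to install.packages().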
Chrome can be downloaded here. In the Chrome store, you can find the Data Miner, Recipe Creator and Autoscroll extensions.
We use Chrome and these extensions to download social media data. The social media page is loaded in the browser. Autoscroll allows us to jump to the bottom of the page and is useful for pages which offer infinite scroll. Recipe Creator then enables us to filter the data in our browser and download a CSV file. The file can then be loaded into R, where we can carry out an analysis.
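The last step, loading the downloaded CSV file into R, might look roughly like this. The sketch writes a small stand-in file first so that it runs on its own; the column names (`author`, `text`) and the rows are invented for illustration, and a real export will have whatever columns you selected in the browser.

```r
# Stand-in for a file exported from the browser: the column names and rows
# here are invented for illustration, not taken from a real export.
example_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "author,text",
  "user_a,First example post",
  "user_b,Second example post"
), example_csv)

# read.csv() is part of base R. stringsAsFactors = FALSE keeps text columns
# as character vectors (this was not the default in R 3.5.2).
posts <- read.csv(example_csv, stringsAsFactors = FALSE)

str(posts)   # show the column names and types
nrow(posts)  # number of rows read
```

With a real download, replace `example_csv` with the path to the file saved from the browser.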
The simplest way to install Python for data science purposes is to install the Anaconda distribution. The installer is quite large and the process may take a while.
Anaconda includes lots of cutting-edge Python packages. Python is a very widely used programming language, but we are not going to write any Python ourselves in this workshop. There are lots of tutorials which can help you use Python for data science, such as The Python Data Science Handbook and the Real Python Python Data Science Tutorials.
spaCy is a Python module used for natural language processing. We are going to use it from R in order to classify the words downloaded from social media.
After installing Anaconda, open Anaconda Navigator. Click on Environments, select All from the drop-down box and type spacy into the search box (as shown below).
Click on the checkbox and then on Apply. Anaconda will work out which packages need to be installed. Click on Apply again to install them (as shown below).
Anaconda will now install spaCy. Once this is complete, we can download the machine learning model we will use to classify our words in the last part of the workshop. The instructions for each operating system differ.
On Windows:
- Go to the Anaconda install location
- Right click on Anaconda Prompt and select 'Run as administrator' (see screenshot below)
- Type the following into the prompt:
python -m spacy download en
On macOS:
- Open the Terminal app. You can find it by pressing CMD + Space, typing terminal and then pressing Enter.
- Type the following into the terminal:
python -m spacy download en
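Once the download has finished, one rough way to confirm from R that spaCy is visible is to call Python through base R's system2(). This is a sketch under assumptions: it supposes that a `python` executable is on the PATH seen by R, which may not be the case for an Anaconda installation (particularly on Windows), so a failure here does not necessarily mean the installation went wrong.

```r
# Look for a Python interpreter on the PATH; Sys.which() returns "" if none is found.
python_path <- unname(Sys.which("python"))

if (!nzchar(python_path)) {
  message("No 'python' found on the PATH seen by R; ",
          "try 'python -m spacy info' in the Anaconda Prompt or terminal instead.")
} else {
  # 'python -m spacy info' prints details of the spaCy installation.
  # system2() returns the exit status; 0 means the command succeeded.
  status <- system2(python_path, c("-m", "spacy", "info"))
  if (status == 0) {
    message("spaCy appears to be installed for: ", python_path)
  } else {
    message("spaCy was not found for: ", python_path)
  }
}
```

If R picks up a different Python from the one Anaconda installed, running the same `python -m spacy info` command in the Anaconda Prompt or terminal used above is the more reliable check.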