Rocio Ng edited this page Aug 18, 2016 · 24 revisions

Data Science Tools

NOTE: If you would like to suggest a tool to be added to this page, please send a DM to Rocio on Slack with a link to the tool and a short description.

  • Please limit suggestions to tools you have had some experience using

R

IDEs

  • RStudio Just don't use R without this. Just don't.

Packages

  • Hadley Wickham Anything that this guy has made. Over the past 10 years, he's made a bunch of tools that have made R a much less clunky language.
    • ggplot2 The best plotting package.
    • ggvis An upcoming alternative to ggplot2; offers some nice features at the moment for web displays (including interactivity).
    • dplyr An incredibly useful data.frame manipulation package. Supports aggregation and grouping, and even lets you lazily evaluate manipulations against connections to SQL databases (or BigQuery!).
    • tidyr For making your data tidy. A successor to reshape2.
    • httr Simple manipulation of HTTP.
    • rvest Simple web scraping.
  • bigrquery A decent interface to BigQuery.
  • magrittr Understand this as soon as possible. It will make your life much easier.
  • pipeR A competing version of the magrittr package. Do the tutorial.
  • rlist Like dplyr but for lists.
  • data.table Offers an alternative to data.frames, is very fast and incorporates some of the features of dplyr in its DF manipulation syntax. Do the tutorial.
  • purrr Functional programming additions for R. Lets you do a lot of useful function composition/application easily.
  • sparkTable Makes Tufte-style spark-* charts or tables. Compatible with shiny.
  • shinyjs Great for incorporating interactive JavaScript into Shiny apps and R Markdown documents via R code.
  • caret Functional and easy package for prototyping and comparing different machine learning models. Streamlines pre-processing, cross-validation, hyper-parameter tuning, etc. with minimal code.

Python

IDEs

  • Jupyter notebooks Interactive workbooks for data analysis in Python
  • Rodeo Promising IDE similar to RStudio for data analysis in Python. Still a fairly new project, so it may be buggy
  • PyCharm Python IDE with integrated terminal and neat features such as smart autocomplete and SQL database interfaces

Packages

  • pandas Data wrangling/manipulation
  • numpy and scipy Data analysis and statistics tools
  • matplotlib Most commonly used library for data visualization and plotting
  • seaborn For creating 'prettier' data visualizations
  • scikit-learn Commonly used machine learning library
  • psycopg PostgreSQL adapter for Python. Easy to use and reliable
  • nltk Extensive library for natural language processing (NLP). To use its corpora, download them separately with nltk.download()
  • itertools Extremely useful standard-library module for fast, memory-efficient looping in Python. Not the easiest for beginners, but read this and give it a shot
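
For anyone new to itertools, here is a minimal sketch of three commonly used functions (groupby, chain, and islice). The sample data and the helper names totals_by_key and first_n are made up for illustration and are not part of the library.

```python
from itertools import chain, groupby, islice

# Hypothetical sample data: (city, reading) pairs, already sorted by city.
readings = [("Oakland", 3), ("Oakland", 5), ("SF", 2), ("SF", 7), ("SF", 1)]


def totals_by_key(pairs):
    """Sum the values for each key.

    groupby only merges *adjacent* items with equal keys,
    so the input must already be sorted by key.
    """
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(pairs, key=lambda pair: pair[0])
    }


def first_n(iterable, n):
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(iterable, n))


# Aggregate per city in a single pass over the sorted pairs.
city_totals = totals_by_key(readings)

# chain walks several iterables as if they were one, without copying them first.
merged = list(chain([1, 2], [3], [4, 5]))

# islice stops early, so the huge generator below is never fully evaluated.
preview = first_n((x * x for x in range(10**9)), 3)
```

The main thing to remember is that these functions return lazy iterators: nothing is computed until you loop over the result or wrap it in list().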

Spark

APIs and Misc Resources

  • GeoNames REST API that returns ZIP Codes, and other properties, from Lat/Lon points
  • Data Science Toolkit Variety of tools including Street Address to Coordinates, Coordinates to Political Areas, Coordinates to Statistics, and several Text parsing/sentiment tools.
  • FCC Census Block Conversions API for getting the census tract from a Lat/Lon
  • SoQL Socrata, which hosts SF Open Data, has API access for every data set and a variety of SQL-like functions that make queries powerful
  • CitySDK The Census Bureau has an SDK package and API. You'll have to sign up for a key, but it's free.