We analyze data from nearly 43,000 first-class men's cricket matches, a near census of the relevant population. We make a series of discoveries that upend some conventional wisdom and some understanding based on analyses of much smaller datasets. In fact, one prominent previous study (pdf) bases its analysis on only about 1% of the data we have.
- Match Level Data: We got our data from espncricinfo.com. We downloaded and parsed the data in two different ways. Gaurav scraped and parsed the HTML pages. Derek, clearly the sharper of the two, realized that espncricinfo also provides a nice JSON API and developed a Python module for it.
  Aware of the duplication of work, in this repository we only provide scripts and data that aren't available elsewhere (except for the final dataset we use). These include a script to download match ids, match ids by match type (json), a script for making the requests and parsing the responses using the JSON data, and output for ODI matches based on that script. However, the final dataset we use is the same as the one posted on Gaurav's repository.
- Rankings Data: parse_rankings gets monthly rankings for ODIs from 1981-2013 and for Tests from 1952-2013. ICC changed its site in 2014 so that it only shows the most recent rankings. The script outputs ODI rankings and Test rankings.
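The request-and-parse flow for match data can be sketched roughly as follows. This is a minimal illustration, not the script in this repository: the function names, the `fetch` callable, and the payload keys (`team1`, `team2`, `date`, `winner`) are all hypothetical, and the real espncricinfo JSON has its own URL pattern and schema that are not described here. The sketch assumes the match ids are stored as a mapping from match type to a list of ids.

```python
import csv
import io
import json

# Hypothetical output columns; the real script's columns may differ.
FIELDS = ["match_id", "team1", "team2", "date", "winner"]


def flatten_match(match_id, payload):
    """Reduce one decoded JSON payload to a flat row.

    The payload keys read here are assumptions made for illustration.
    """
    row = {"match_id": match_id}
    for key in FIELDS[1:]:
        row[key] = payload.get(key)
    return row


def collect_matches(ids_by_type, match_type, fetch):
    """Fetch and parse every match of one type.

    `fetch` is any callable that takes a match id and returns JSON text,
    so the network layer can be swapped out or stubbed.
    """
    rows = []
    for match_id in ids_by_type.get(match_type, []):
        try:
            payload = json.loads(fetch(match_id))
        except (OSError, ValueError):
            # Skip matches that fail to download or decode.
            continue
        rows.append(flatten_match(match_id, payload))
    return rows


def rows_to_csv(rows):
    """Serialize the parsed rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Passing the fetcher in as an argument keeps the parsing logic testable without hitting the site; a real run would supply a function that requests the JSON for each match id.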
We began by merging the rankings and the match data, and then analyzed the merged data. The analysis script produces these figures. The tex and pdf files for the final write-up can be found here.
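As a rough illustration of the merge step, the sketch below attaches each team's ranking for the month of the match to the match record. It assumes, hypothetically, that matches carry ISO-format dates (`YYYY-MM-DD`) and that rankings are keyed by team and `YYYY-MM` month; the field names are illustrative and may not match the actual datasets or the merge logic used for the paper.

```python
def month_key(date_string):
    """Return the 'YYYY-MM' prefix of an ISO 'YYYY-MM-DD' date string."""
    return date_string[:7]


def index_rankings(rankings):
    """Build a lookup from (team, month) to rank."""
    return {(r["team"], r["month"]): r["rank"] for r in rankings}


def merge_rankings(matches, rankings):
    """Add team1_rank and team2_rank to each match.

    A team with no ranking for that month gets None rather than
    being dropped, so unranked matches stay in the data.
    """
    lookup = index_rankings(rankings)
    merged = []
    for match in matches:
        month = month_key(match["date"])
        row = dict(match)  # copy so the input records are not mutated
        row["team1_rank"] = lookup.get((match["team1"], month))
        row["team2_rank"] = lookup.get((match["team2"], month))
        merged.append(row)
    return merged
```

Keeping unranked matches with a missing rank, rather than discarding them, is a design choice of this sketch; whether the analysis keeps or drops such rows is not stated here.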
Gaurav Sood and Derek Willis
Scripts, figures, and writing are released under CC BY 2.0.