Proposed Statistics features

Alp Mestanogullari edited this page May 20, 2014 · 2 revisions

The following text outlines my plan to get some pretty plots for Hackage, both global and per-package/per-version. This plan should be executed at ZuriHac (beginning of June).

I tried to come up with a solution that would actually encourage building more things on top of it, instead of just solving the problem in the most direct way possible.

Current status

Currently, the download count is as far as statistics go on Hackage. It is exposed through the download feature. I certainly plan to reuse the code from there, but I aim to provide various kinds of statistics, many (but not all) of them already computable from some CountMap, like:

  • for a specific package, package downloads per day, over time
  • package downloads/uploads per day, over time
  • a pie chart of package licenses
  • user-agent statistics (cabal version, OS, arch), provided by cabal-install starting from 1.20.x
  • some metrics that could let us see how fast some "Categories" grow compared to the others
  • if the reverse dep code is usable (it's disabled by default, right? especially on the main server), we could even display some statistics from there (most depended-on packages, which is a metric we use to judge trustworthiness)
  • people might come up with crazy ideas of uses for these statistics, and we want a way for them to easily access that data, provided they don't abuse it.
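
To make one of the items above concrete, here is a minimal sketch of the tally behind a license pie chart. The input shape (a list of package name and license name pairs) and the function names `licenseCounts` and `licenseShares` are assumptions made for illustration, not existing Hackage code.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical input: (package name, license name) pairs.
-- Count how many packages use each license.
licenseCounts :: [(String, String)] -> Map.Map String Int
licenseCounts pkgs = Map.fromListWith (+) [ (lic, 1) | (_, lic) <- pkgs ]

-- Turn the counts into fractions of the total, i.e. the pie slices.
-- An empty input yields an empty map, so no division by zero occurs.
licenseShares :: Map.Map String Int -> Map.Map String Double
licenseShares counts = Map.map (\n -> fromIntegral n / total) counts
  where
    total = fromIntegral (sum (Map.elems counts)) :: Double

-- Made-up sample data, for illustration only.
samplePackages :: [(String, String)]
samplePackages =
  [ ("pkg-a", "BSD3")
  , ("pkg-b", "BSD3")
  , ("pkg-c", "MIT")
  , ("pkg-d", "BSD3")
  ]
```

Evaluating `licenseShares (licenseCounts samplePackages)` in GHCi gives one entry per license, ready to be dumped as CSV for a plotting tool.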

And since Ian Ross found the words to get me to hack on hackage during ZuriHac, I came up with the following plan, after discussing it with Ian and Duncan.

New feature: Statistics

This feature (in Hackage terminology) would have access to the DownloadCount's state and to whatever else is necessary for the statistics we want to provide. A lot of these metrics aren't accessible right away, so maybe the feature should simply run a more or less complex fold over the on-disk data as it reads it, instead of maintaining a lot of state.
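
As a rough illustration of the "fold over the data" idea, the sketch below computes downloads per day for one package with a strict left fold. The `DownloadEvent` record and the function names are hypothetical; the real on-disk format and the DownloadCount types may well look different.

```haskell
import qualified Data.Map.Strict as Map
import Data.List (foldl')
import Data.Time.Calendar (Day, fromGregorian)

-- Hypothetical record: one download event as read from disk.
data DownloadEvent = DownloadEvent
  { evPackage :: String
  , evDay     :: Day
  } deriving (Show, Eq)

-- Downloads per day for a single package.
type DailyCounts = Map.Map Day Int

-- One step of the fold: count the event only if it concerns the
-- package we are interested in.
step :: String -> DailyCounts -> DownloadEvent -> DailyCounts
step pkg acc ev
  | evPackage ev == pkg = Map.insertWith (+) (evDay ev) 1 acc
  | otherwise           = acc

-- Strict left fold: the only state kept is the map of daily counts.
downloadsPerDay :: String -> [DownloadEvent] -> DailyCounts
downloadsPerDay pkg = foldl' (step pkg) Map.empty

-- Made-up sample data, for illustration only.
sampleEvents :: [DownloadEvent]
sampleEvents =
  [ DownloadEvent "pkg-a" (fromGregorian 2014 6 1)
  , DownloadEvent "pkg-b" (fromGregorian 2014 6 1)
  , DownloadEvent "pkg-a" (fromGregorian 2014 6 1)
  , DownloadEvent "pkg-a" (fromGregorian 2014 6 2)
  ]
```

Here `downloadsPerDay "pkg-a" sampleEvents` maps 2014-06-01 to 2 and 2014-06-02 to 1. Other metrics would swap in a different accumulator and `step` function while keeping the same single pass over the data.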

The API for this feature would let the user tweak parameters like the time window we're interested in, a precise version of a package, etc. Then the feature would do the hard work of fetching the right data and presenting it to the user in CSV format, so that the user can run some fun scripts on it, or just stick a d3.js page on top. I anticipate that compiling the stats will take a non-negligible amount of time, so this obviously shouldn't be used for a live interface. I'm not totally sure whether we should let users access this feature at all (in terms of workload for the server) or only in a restricted way, but if we can make it happen, that would be awesome.
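
A possible shape for such a query, sketched under assumptions: the `StatsQuery` and `Row` types, their fields, and the CSV column names are all invented here for illustration. The CSV rendering does no escaping, on the assumption that package names and versions never contain commas or quotes.

```haskell
import Data.List (intercalate)
import Data.Time.Calendar (Day, fromGregorian, showGregorian)

-- Hypothetical query parameters a user could tweak.
data StatsQuery = StatsQuery
  { qPackage :: String
  , qVersion :: Maybe String  -- Nothing means "all versions"
  , qFrom    :: Day           -- start of the time window (inclusive)
  , qTo      :: Day           -- end of the time window (inclusive)
  }

-- Hypothetical pre-aggregated row: downloads of one version on one day.
data Row = Row
  { rPackage   :: String
  , rVersion   :: String
  , rDay       :: Day
  , rDownloads :: Int
  }

-- Does a row fall within the query's package, version and time window?
matches :: StatsQuery -> Row -> Bool
matches q r =
     rPackage r == qPackage q
  && maybe True (== rVersion r) (qVersion q)
  && rDay r >= qFrom q
  && rDay r <= qTo q

-- Render rows as CSV with a header line (no escaping, see above).
toCsv :: [Row] -> String
toCsv rows = unlines (header : map line rows)
  where
    header = "package,version,day,downloads"
    line r = intercalate ","
      [ rPackage r, rVersion r, showGregorian (rDay r), show (rDownloads r) ]

-- Select the matching rows and present them as CSV.
runQuery :: StatsQuery -> [Row] -> String
runQuery q = toCsv . filter (matches q)

-- Made-up sample data, for illustration only.
sampleRows :: [Row]
sampleRows =
  [ Row "pkg-a" "1.0" (fromGregorian 2014 6 1) 10
  , Row "pkg-a" "1.1" (fromGregorian 2014 6 1) 4
  , Row "pkg-a" "1.0" (fromGregorian 2014 7 1) 7
  , Row "pkg-b" "0.3" (fromGregorian 2014 6 1) 2
  ]
```

Keeping the query a plain record would make it easy to reuse the same code path for both user requests and a fixed list of predefined queries.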

New page: Hackage Statistics

We would have a new page on Hackage, or maybe a few, that would display some daily-generated static plots using the first feature, with a list of predefined queries. Some stats would be package-specific, some global; the page wouldn't (and shouldn't) necessarily have to cover the full spectrum of the data the first feature could provide. This lets us display decent and useful plots for little cost and without any JavaScript (which could be a problem for some people). It doesn't keep us from building fancier visualizations in the future, but I think they shouldn't be hosted on the Hackage server. Even a GitHub page could do it, provided it comes with a CSV file.

Feedback, questions, advice?

I'm alpounet on IRC (#hackage, #haskell, #ghc), alpmestan at gmail if I'm not around.