github-repo-stats

A GitHub Action (in Marketplace) built to overcome the 14-day limitation of GitHub's built-in traffic statistics.

Start collecting data with this action today!

Data that you don't collect today will be gone two weeks from now.

High-level method description:

This GitHub Action runs once per day. Each run yields a "snapshot" of repository traffic statistics (influenced by the past 14 days). Snapshots are persisted via git.
Each run performs data analysis on all individual snapshots and generates a report from the aggregate — covering an arbitrarily long time frame.
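
The aggregation idea can be sketched as follows. This is a minimal illustration, not the Action's actual analysis code: the function name `merge_snapshots`, the sample data, and the merge rule (keep the largest count seen for a day, assuming a later snapshot only ever adds to a day's count) are assumptions made for this example.

```python
from datetime import date


def merge_snapshots(snapshots):
    """Merge overlapping daily-count snapshots into one time series.

    Each snapshot maps a day to the count reported for that day. Where
    snapshots overlap, keep the largest count seen for a day (assumption:
    a later snapshot can only add to a day's count, never reduce it).
    """
    merged = {}
    for snapshot in snapshots:
        for day, count in snapshot.items():
            merged[day] = max(count, merged.get(day, 0))
    # Return the series ordered by day.
    return dict(sorted(merged.items()))


# Hypothetical data: two snapshots taken one day apart. The first one saw
# only part of January 2nd, the second one saw the complete day.
snapshot_1 = {date(2024, 1, 1): 12, date(2024, 1, 2): 3}
snapshot_2 = {date(2024, 1, 2): 9, date(2024, 1, 3): 4}

series = merge_snapshots([snapshot_1, snapshot_2])
```

Because every snapshot only needs to overlap with its neighbors, the merged series can grow far beyond the 14-day window of any single snapshot.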

Demo

Report:
- HTML report
- PDF report
Action setup (how the above report is generated):
- Workflow file
- Data branch

Highlights

The report is generated in two document formats: HTML and PDF.
The HTML report resembles how GitHub renders Markdown and is meant to be exposed via GitHub pages.
Charts are based on Altair/Vega.
The PDF report contains vector graphics.
Data updates, aggregation results, and report files are stored in the git repository that you install this Action in: this Action commits changes to a special branch. No cloud storage or database needed. As a result, you have a complete and transparent history of data updates and reports, with clear commit messages, in a single place.
The observed repository (the one to build the report for) can be different from the repository you install this Action in.
The HTML report can be served right away via GitHub pages (that is how the demo above works).
Careful data analysis: there are a number of traps (example) when aggregating data based on what the GitHub Traffic API returns. This project tries to not fall for them. One goal of this project is to perform advanced analysis where possible.
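
To illustrate the kind of trap meant here, consider unique-visitor counts. This example is chosen for illustration and is not necessarily the case the link above refers to. Daily unique counts cannot simply be summed to obtain the number of unique visitors over a longer period, because the same visitor may show up on several days. The visitor names below are hypothetical; the Traffic API only exposes the counts.

```python
# Hypothetical visitor identities per day. In reality only the per-day
# counts are available, which is exactly why the naive sum is tempting.
visitors_by_day = {
    "2024-01-01": {"ann", "bo"},
    "2024-01-02": {"bo", "cy"},
}

# What the API would report: one unique-visitor count per day.
daily_uniques = {day: len(names) for day, names in visitors_by_day.items()}

# Naive aggregation: adding up the daily counts counts "bo" twice.
naive_total = sum(daily_uniques.values())

# Actual number of distinct visitors across both days.
true_total = len(set().union(*visitors_by_day.values()))
```

Here the naive sum is 4 while only 3 distinct visitors exist, so a summed figure should be read as an upper bound at best.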

Report content

Traffic stats:
- Unique and total views per day
- Unique and total clones per day
- Top referrers (where people come from when they land in your repository)
- Top paths (what people like to look at in your repository)
Evolution of stargazers
Evolution of forks

Credits

This walks on the shoulders of giants. Shoutout to

Pandoc for rendering HTML from Markdown.
Altair and Vega-Lite for visualization.
Pandas for data analysis.
The CPython ecosystem which has always been fun for me to build software in.

Documentation

Terminology: stats repository and data repository

Naming is hard :-). Let's define two concepts and their names:

The stats repository is the repository to fetch stats for and to generate the report for.
The data repository is the repository to store data and report files in.

Let me know if you can think of better names.

These two repositories can be the same. But they don't have to be :-).

That is, you can for example set up this Action in a private repository but have it observe a public repository.

Setup

Example scenario:

stats repository: bob/nice-project
data repository: bob/private-ghrs-data-repo

Create a GitHub Actions workflow file in the data repository (in the example this is the repo bob/private-ghrs-data-repo). Example path: .github/workflows/repostats-for-nice-project.yml.

Example workflow file content with code comments:

on:
  schedule:
    # Run this once per day, towards the end of the day for keeping the most
    # recent data point most meaningful (hours are interpreted in UTC).
    - cron: "0 23 * * *"
  workflow_dispatch: # Allow for running this manually.

jobs:
  j1:
    name: repostats-for-nice-project
    runs-on: ubuntu-latest
    steps:
      - name: run-ghrs
        # Replace <release-tag> with a published release tag of this Action.
        uses: jgehrcke/github-repo-stats@<release-tag>
        with:
          # Define the stats repository (the repo to fetch
          # stats for and to generate the report for).
          # Remove the parameter when the stats repository
          # and the data repository are the same.
          repository: bob/nice-project
          # Set a GitHub API token that can read the stats
          # repository, and that can push to the data
          # repository (which this workflow file lives in),
          # to store data and the report files.
          ghtoken: ${{ secrets.ghrs_github_api_token }}

Note: the recommended way to run this Action is on a schedule, once per day. Really.
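
As a quick way to reason about the schedule above, the following sketch computes when a daily cron entry such as "0 23 * * *" fires next, in UTC. The helper name `next_daily_run` is made up for this example, and it only models the nominal time: GitHub may start scheduled workflows later than that.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(now, hour=23, minute=0):
    """Return the next nominal firing time (UTC) of a daily cron entry.

    `now` must be a timezone-aware datetime; it is converted to UTC first
    because GitHub Actions interprets cron schedules in UTC.
    """
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed: the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate


# 23:30 UTC is past today's slot, so the next run is on the following day.
late_evening = next_daily_run(datetime(2024, 1, 1, 23, 30, tzinfo=timezone.utc))

# Noon at UTC-5 is 17:00 UTC, so the next run is still on the same UTC day.
utc_minus_5 = timezone(timedelta(hours=-5))
same_day = next_daily_run(datetime(2024, 1, 1, 12, 0, tzinfo=utc_minus_5))
```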

Note: if you set ghtoken: ${{ secrets.ghrs_github_api_token }} as above, then the data repository (where the Action is executed) must define a secret named GHRS_GITHUB_API_TOKEN (you can of course change the name, as long as you change it in both places). The content of the secret needs to be an API token that has the repo scope for accessing the stats repository. You can create such a personal access token under github.com/settings/tokens.
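
Before storing a token as a secret, it can help to check that it is able to read traffic data for the stats repository. The following is a minimal sketch using only the Python standard library and the GitHub REST API traffic endpoints; the helper names `build_traffic_request` and `token_can_read_traffic` are made up for this example and are not part of this Action.

```python
import urllib.error
import urllib.request

API_ROOT = "https://api.github.com"


def build_traffic_request(repo, token, kind="views"):
    """Build a GET request for the traffic endpoint of `repo` (owner/name)."""
    if kind not in ("views", "clones"):
        raise ValueError(f"unsupported traffic kind: {kind!r}")
    url = f"{API_ROOT}/repos/{repo}/traffic/{kind}"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(url, headers=headers)


def token_can_read_traffic(repo, token):
    """Return True if the API accepts the token for `repo` (network I/O)."""
    try:
        with urllib.request.urlopen(build_traffic_request(repo, token)) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        # For example 401, 403 or 404: the token lacks the required access.
        return False


# Building the request does not touch the network.
request = build_traffic_request("bob/nice-project", "dummy-token")
```

Calling `token_can_read_traffic("bob/nice-project", token)` with a real token performs the actual check. Note that this only verifies read access to the stats repository, not push access to the data repository.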

Input parameter reference

Extract from action.yml:

  repository:
    description: >
      Repository spec (<owner-or-org>/<reponame>) for the repository to fetch
      statistics for.
    default: ${{ github.repository }}
  ghtoken:
    description: >
      GitHub API token for reading repo stats and for interacting with the data
      repo (must be set if repo to fetch stats for is not the data repo).
    default: ${{ github.token }}
  databranch:
    description: >
      Data branch: Branch to push data to (in the data repo).
    default: github-repo-stats
  ghpagesprefix:
    description: >
      Set this if the data branch in the data repo is exposed via GitHub pages.
      Must not end with a slash. Example: https://jgehrcke.github.io/ghrs-test
    default: none
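
The constraints stated in the descriptions above can be checked before committing a workflow file. The following sketch is an illustration only: `validate_inputs` is a hypothetical helper, not something this Action provides, and the character set accepted for owner and repository names is an assumption.

```python
import re

# Assumed shape of a repository spec: <owner-or-org>/<reponame>.
REPO_SPEC = re.compile(r"[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+")


def validate_inputs(repository, ghpagesprefix="none"):
    """Return a list of problems found in the given input values."""
    problems = []
    if not REPO_SPEC.fullmatch(repository):
        problems.append("repository must look like <owner-or-org>/<reponame>")
    # "none" is the documented default and means: no GitHub pages prefix.
    if ghpagesprefix != "none" and ghpagesprefix.endswith("/"):
        problems.append("ghpagesprefix must not end with a slash")
    return problems


ok = validate_inputs("bob/nice-project", "https://jgehrcke.github.io/ghrs-test")
bad = validate_inputs("nice-project", "https://jgehrcke.github.io/ghrs-test/")
```

An empty list means that no problem was found; in the second call both checks report a problem.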

It is recommended that you create the data branch and delete all files from that branch before setting this Action up in your repository, so that the data branch starts out as a tidy environment. You can of course do that later, too.

Tracking multiple repositories via `matrix`

The GitHub Actions workflow specification language allows for defining a matrix of different job configurations through the jobs.<job_id>.strategy.matrix directive. This can be used for efficiently tracking multiple stats repositories from within the same data repository.

Example workflow file:

name: fetch-repository-stats
concurrency: fetch-repository-stats

on:
  schedule:
    - cron: "0 23 * * *"
  workflow_dispatch:

jobs:
  run-ghrs-with-matrix:
    name: repostats-for-nice-projects
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # The repositories to generate reports for.
        statsRepo: ['bob/nice-project', 'alice/also-nice-project']
      # Do not cancel&fail all remaining jobs upon first job failure.
      fail-fast: false
      # Help avoid commit conflicts. Note(JP): this should not be
      # necessary anymore, feedback appreciated
      max-parallel: 1
    steps:
      - name: run-ghrs
        # Replace <release-tag> with a published release tag of this Action.
        uses: jgehrcke/github-repo-stats@<release-tag>
        with:
          # Repo to fetch stats for and to generate the report for.
          repository: ${{ matrix.statsRepo }}
          # Token that can read the stats repository and that
          # can push to the data repository.
          ghtoken: ${{ secrets.ghrs_github_api_token }}
          # Data branch: Branch to push data to (in the data repo).
          databranch: main

Developer instructions

Here is how to run some sanity checks from within a fresh checkout:

$ git clone https://github.com/jgehrcke/github-repo-stats
$ cd github-repo-stats/

$ make clitests
...
1..5
ok 1 analyze.py: snapshots: some, vcagg: yes, stars: some, forks: none
ok 2 analyze.py: snapshots: some, vcagg: yes, stars: none, forks: some
ok 3 analyze.py: snapshots: some, vcagg: yes, stars: some, forks: some
ok 4 analyze.py: snapshots: some, vcagg: no, stars: some, forks: some
ok 5 analyze.py + pdf.py: snapshots: some, vcagg: no, stars: some, forks: some

$ make lint
...
All done! ✨ 🍰 ✨
...
