
Ongoing data tracking to evidence the value of Hacktoberfest contributions #5

Open
Gadgetoid opened this issue Oct 1, 2020 · 3 comments

@Gadgetoid

Ahoy!

First off, well done! This is an amazing project!

However - the data is missing several key metrics that would make it useful as a gauge for Hacktoberfest's value to the wider OSS community. Given the extreme visibility of the spam PRs that have appeared as a consequence of HF this October, I would suggest some of the following information could be added for transparency and balance:

  1. The percentage of PRs in this data that have not been merged, or were closed without merging, since November 2019
  2. The percentage of contributors that have gone on, between November 2019 and September 2020, to raise additional PRs (merged ones, perhaps?)
  3. The percentage of contributors that have gone on, over the same period, to:
    • Create their own repositories
    • Participate in other repositories by raising or commenting on issues
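
Assuming each record mirrors a standard GitHub REST API pull-request object (which exposes `state` and `merged_at`), metric 1 could be sketched roughly as below. The field names are the real API ones, but the helper itself is hypothetical, not code from this repository:

```javascript
// Sketch: share of PRs that were closed without being merged.
// Assumes each entry mirrors a GitHub REST API pull request object,
// where `merged_at` is null unless the PR was merged.
const closedUnmergedRate = (prs) => {
  const closedUnmerged = prs.filter(
    (pr) => pr.state === 'closed' && pr.merged_at === null
  ).length;
  return prs.length === 0 ? 0 : (closedUnmerged / prs.length) * 100;
};

// Example with three hypothetical PRs:
const samplePrs = [
  { state: 'closed', merged_at: '2019-10-05T12:00:00Z' }, // merged
  { state: 'closed', merged_at: null },                   // closed, unmerged
  { state: 'open',   merged_at: null },                   // still open
];
console.log(closedUnmergedRate(samplePrs).toFixed(2)); // "33.33"
```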

Additionally, the statistics claim a spam rate of 4.82% but clarify that this includes only PRs that were labelled "invalid". This requires maintainers to take a positive action, labelling and closing a PR, and to have some awareness of Hacktoberfest's spam-reporting mechanisms. I venture that spam is considerably underreported for this reason, and the suggested closed-unmerged-PRs metric would be useful for getting the bigger picture here.

Finally, your summary and documentation mention that:

the Hacktoberfest 2019 raw data isn't public currently.

This implies that the raw data could and might be made available. Is there any progress on this front?

@MattIPv4
Owner

MattIPv4 commented Oct 1, 2020

👋 1, 2 & 3 would all need logic added to talk to GitHub and fetch updated data. These stats were generated from a static data set that was pulled at the end of Hacktoberfest 2019. Fetching this volume of updated data would likely take a rather long time with a single account auth'ed with GitHub's API, but I am more than happy to run things if folks contribute the code to do these lookups.
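
For scale: GitHub's authenticated REST API allows 5,000 requests per hour per token, so a single-account refresh is easy to estimate. The PR count below is a placeholder for illustration, not the actual 2019 figure:

```javascript
// Back-of-envelope: how long a single-token refresh would take.
// GitHub's authenticated REST API permits 5,000 requests per hour;
// the PR count passed in is a placeholder, not the real dataset size.
const RATE_LIMIT_PER_HOUR = 5000;
const hoursToRefresh = (prCount, requestsPerPr = 1) =>
  (prCount * requestsPerPr) / RATE_LIMIT_PER_HOUR;

console.log(hoursToRefresh(400000)); // 80 hours at one request per PR
```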

I venture that spam is considerably underreported due to this fact and the suggested closed, unmerged PRs metric would be useful to get the bigger picture here.

I do agree, but this is a flaw in the operations of Hacktoberfest itself rather than in the reporting here per se. The report here accurately states the number of PRs that the app considered spam for 2019. That said, adding a new data point to the export that counts total closed PRs would be quite interesting; I'd welcome that as a simple contribution to update the relevant code.

This implies that the raw data could and might be made available. Is there any progress on this front?

Unfortunately not, sharing the dataset would be illegal as it contains PII, including revealing the set of users that signed up for Hacktoberfest 2019.

@Gadgetoid
Author

Gadgetoid commented Oct 1, 2020

Unfortunately not, sharing the dataset would be illegal as it contains PII, including revealing the set of users that signed up for Hacktoberfest 2019.

This is unfortunate and makes it difficult for third parties to contribute any meaningful expansion to the scripts.

Is there a reduced, sanitised or sample dataset available that could take its place? My pessimistic take is that nobody will have the time or inclination to make these changes. If they did, I would suggest a representative random sample might reduce the processing overhead and work, somewhat, around GitHub's excruciating API rate limit.
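
A representative sample could be drawn with a plain Fisher-Yates shuffle so a refresh script only re-fetches a subset. This is a sketch of the idea only, not code from the repository:

```javascript
// Sketch: draw a uniform random sample of n records from the dataset,
// so API lookups only need to cover a representative subset.
const randomSample = (records, n) => {
  const pool = [...records];
  // Fisher-Yates shuffle, then take the first n entries.
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, Math.min(n, pool.length));
};

// Example: sample 10 of 100 hypothetical record IDs.
const ids = Array.from({ length: 100 }, (_, i) => i);
const subset = randomSample(ids, 10);
console.log(subset.length); // 10
```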

adding a new data point to the export that gets total closed PRs would be quite interesting

I'm struggling to locate the code path that fetches data from GitHub, or even any reference to an API client library for GitHub. I'm no NodeJS wizard, though. What am I missing?

Edit: Aha! I think I've got it. These scripts deal purely with the data "as it stands" and fetching updated information about the state of PRs would require - as you suggest - "logic added to talk to GitHub." For some reason it didn't register that this would apply also to the new data point.

@MattIPv4
Owner

Sorry for the delayed response here -- I don't think there'd be an easy way to provide a sanitised data set, though if I have time I can look at exporting a schema of the data used so that folks could generate their own seed data (it's just GitHub REST API user/repository/PR objects).

As you alluded to in your edit: yeah, this code doesn't talk to GitHub at all; it just uses data already fetched from GitHub and stored in a Mongo database. The code that generates this data export should be somewhere in digitalocean/hacktoberfest, though I'm not sure where off the top of my head.
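
For anyone wanting to generate their own seed data ahead of a schema export, a stored record presumably resembles a trimmed GitHub REST API pull-request object, something like the example below. The exact fields kept in Mongo are an assumption here; only the field names themselves are standard API ones:

```json
{
  "id": 123456789,
  "state": "closed",
  "merged_at": "2019-10-12T09:30:00Z",
  "user": { "login": "example-user", "id": 42 },
  "base": { "repo": { "full_name": "example-org/example-repo" } },
  "labels": [{ "name": "hacktoberfest" }]
}
```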
