Ongoing data tracking to evidence the value of Hacktoberfest contributions #5
👋 1, 2 & 3 would all need logic added to talk to GitHub and fetch updated data. These stats were generated from a static data set that was pulled at the end of Hacktoberfest 2019. Fetching this volume of updated data would likely take a rather long time with a single account auth'ed with GitHub's API, but I am more than happy to run things if folks contribute the code to do these lookups.
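To illustrate why refreshing the full dataset with one token would be slow, here is a rough back-of-envelope sketch. The 5,000 requests/hour figure is GitHub's documented authenticated REST rate limit; the PR count and requests-per-PR values below are hypothetical placeholders, not the real 2019 numbers:

```javascript
// GitHub's REST API allows 5,000 authenticated requests per hour per token.
const RATE_LIMIT_PER_HOUR = 5000;

// Estimate wall-clock hours to refresh data for `totalPRs` pull requests,
// assuming `requestsPerPR` API calls each (e.g. one for the PR, one for labels).
function estimateFetchHours(totalPRs, requestsPerPR = 1) {
  const totalRequests = totalPRs * requestsPerPR;
  return totalRequests / RATE_LIMIT_PER_HOUR;
}

// Hypothetical: 400,000 PRs at 2 calls each → 160 hours with a single token
console.log(estimateFetchHours(400000, 2));
```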
I do agree, but this is a flaw in the operations of Hacktoberfest itself rather than of the reporting here per se. The report accurately states the number of PRs that the app considered spam for 2019. However, adding a new data point to the export that counts total closed PRs would be quite interesting; I welcome that as a simple contribution to update the relevant code.
Unfortunately not; sharing the dataset would be illegal as it contains PII, including revealing the set of users that signed up for Hacktoberfest 2019.
This is unfortunate and makes it difficult for third parties to contribute any meaningful expansion to the scripts. Is there a reduced, sanitised or sample dataset available that could take its place? My pessimistic take is that nobody will have the time or inclination to make these changes. If they did, I would suggest a representative random sample might reduce the processing overhead and work, somewhat, around GitHub's excruciating API rate limit.
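The representative-sample idea above could be sketched as a uniform random draw over the PR records. This is a hypothetical helper, not part of the existing scripts; `samplePRs` and its injectable `rng` parameter are assumptions for illustration:

```javascript
// Sketch: draw a uniform random sample of `sampleSize` PR records so the
// downstream scripts (and any API lookups) only process a subset.
// Uses a partial Fisher–Yates shuffle; `rng` is injectable for testing.
function samplePRs(prs, sampleSize, rng = Math.random) {
  const pool = [...prs];
  const n = Math.min(sampleSize, pool.length);
  for (let i = 0; i < n; i++) {
    // Pick a random index from the not-yet-chosen tail and swap it forward.
    const j = i + Math.floor(rng() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, n);
}
```

Every record has an equal chance of selection, so aggregate rates computed on the sample should approximate the full dataset's.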
I'm struggling to locate the code path that fetches data from GitHub, or even any reference to an API client library for GitHub. I'm no NodeJS wizard, though. What am I missing? Edit: Aha! I think I've got it. These scripts deal purely with the data "as it stands", and fetching updated information about the state of PRs would require, as you suggest, "logic added to talk to GitHub." For some reason it didn't register that this would also apply to the new data point.
Sorry for the delayed response here. I don't think there'd be an easy way to provide a sanitised dataset, though if I have time I can look at exporting a schema of the data used so that folks could generate their own seed data (it's just GitHub REST API user/repository/PR objects). As you alluded to in your edit, this code does not talk to GitHub at all; it just uses already-fetched GitHub data stored in a Mongo database. The code that generates this data export should be in digitalocean/hacktoberfest somewhere, though I'm not sure where off the top of my head.
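If such a schema were exported, generating synthetic seed data could look roughly like this. The field names below mirror the public GitHub REST API PR object, but the exact schema the export scripts expect is an assumption, and `fakePullRequest` is a hypothetical helper:

```javascript
// Sketch: generate fake records shaped like GitHub REST API pull request
// objects, so contributors can exercise the scripts without the real
// (PII-laden) dataset. All values are synthetic.
function fakePullRequest(id) {
  return {
    id,
    state: id % 3 === 0 ? 'closed' : 'open',
    merged_at: id % 4 === 0 ? '2019-10-15T00:00:00Z' : null,
    labels: id % 10 === 0 ? [{ name: 'invalid' }] : [],
    user: { login: `user-${id}`, id: 1000 + id },
  };
}

// 50 synthetic PRs, with every 10th one labelled "invalid"
const seed = Array.from({ length: 50 }, (_, i) => fakePullRequest(i + 1));
```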
Ahoy!
First off, well done! This is an amazing project!
However, the data is missing several key metrics that would make it useful as a gauge of Hacktoberfest's value to the wider OSS community. Given the extreme visibility of the spam PRs that have appeared as a consequence of HF this October, I would suggest some of the following information could be added for transparency and balance:
Additionally, the statistics claim a spam rate of 4.82%, but elaborate that this counts only PRs labelled as "invalid". This requires maintainers to take a positive action (labelling and closing a PR) and to have some awareness of Hacktoberfest's spam-reporting mechanisms. I venture that spam is considerably underreported as a result, and the suggested closed, unmerged PRs metric would help give the bigger picture here.
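To make the distinction concrete, the two rates could be computed side by side. This is a sketch, not the project's actual reporting code; it assumes PR records shaped like GitHub REST API objects (`state`, `merged_at`, `labels`):

```javascript
// Sketch: compare the reported spam rate (PRs labelled "invalid") with a
// broader "closed without merge" rate, which needs no maintainer action.
function spamRates(prs) {
  const total = prs.length;
  const labelledInvalid = prs.filter(pr =>
    pr.labels.some(l => l.name === 'invalid')).length;
  const closedUnmerged = prs.filter(pr =>
    pr.state === 'closed' && pr.merged_at === null).length;
  return {
    invalidRate: labelledInvalid / total,
    closedUnmergedRate: closedUnmerged / total,
  };
}
```

The closed-unmerged rate will overcount (PRs are closed for many legitimate reasons), but it bounds the problem from the other direction than the invalid-label count, which undercounts.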
Finally, your summary and documentation mention that:
This implies that the raw data could and might be made available. Is there any progress on this front?