KubeCon 2022 --> KubeCon 2023. Basic reports.
Matt Young authored and halcyondude committed Apr 6, 2024
1 parent 7c9513b commit 25ab887
Showing 30 changed files with 110,112 additions and 0 deletions.
69 changes: 69 additions & 0 deletions db/scm/sgm-gharchive/cncf-consolidate.log
@@ -0,0 +1,69 @@
====================================
GZ File Consolidation Script
====================================

Source: /Users/matt/gharchive-cncf/cncf.all
Target: /Users/matt/gharchive-cncf/cncf.byrepo
Dry Run: 0
Verbose: 1
Processing directory: /Users/matt/gharchive-cncf/cncf.all/CommitCommentEvent
dirName: CommitCommentEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/CommitCommentEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/CommitCommentEvent into /Users/matt/gharchive-cncf/cncf.byrepo/CommitCommentEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/CreateEvent
dirName: CreateEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/CreateEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/CreateEvent into /Users/matt/gharchive-cncf/cncf.byrepo/CreateEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/DeleteEvent
dirName: DeleteEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/DeleteEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/DeleteEvent into /Users/matt/gharchive-cncf/cncf.byrepo/DeleteEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/ForkEvent
dirName: ForkEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/ForkEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/ForkEvent into /Users/matt/gharchive-cncf/cncf.byrepo/ForkEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/GollumEvent
dirName: GollumEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/GollumEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/GollumEvent into /Users/matt/gharchive-cncf/cncf.byrepo/GollumEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/IssueCommentEvent
dirName: IssueCommentEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/IssueCommentEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/IssueCommentEvent into /Users/matt/gharchive-cncf/cncf.byrepo/IssueCommentEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/IssuesEvent
dirName: IssuesEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/IssuesEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/IssuesEvent into /Users/matt/gharchive-cncf/cncf.byrepo/IssuesEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/MemberEvent
dirName: MemberEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/MemberEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/MemberEvent into /Users/matt/gharchive-cncf/cncf.byrepo/MemberEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/PublicEvent
dirName: PublicEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/PublicEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/PublicEvent into /Users/matt/gharchive-cncf/cncf.byrepo/PublicEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/PullRequestEvent
dirName: PullRequestEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/PullRequestEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/PullRequestEvent into /Users/matt/gharchive-cncf/cncf.byrepo/PullRequestEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/PullRequestReviewCommentEvent
dirName: PullRequestReviewCommentEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/PullRequestReviewCommentEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/PullRequestReviewCommentEvent into /Users/matt/gharchive-cncf/cncf.byrepo/PullRequestReviewCommentEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/PullRequestReviewEvent
dirName: PullRequestReviewEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/PullRequestReviewEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/PullRequestReviewEvent into /Users/matt/gharchive-cncf/cncf.byrepo/PullRequestReviewEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/PushEvent
dirName: PushEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/PushEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/PushEvent into /Users/matt/gharchive-cncf/cncf.byrepo/PushEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/ReleaseEvent
dirName: ReleaseEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/ReleaseEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/ReleaseEvent into /Users/matt/gharchive-cncf/cncf.byrepo/ReleaseEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/WatchEvent
dirName: WatchEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/WatchEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/WatchEvent into /Users/matt/gharchive-cncf/cncf.byrepo/WatchEvent-consolidated.gz...
Concatenation complete.
36 changes: 36 additions & 0 deletions db/scm/sgm-gharchive/cncf-gharchive-concat-daily.sh
@@ -0,0 +1,36 @@
#!/bin/bash

handle_sigint() {
    echo "Caught Ctrl+C, stopping..."
    # Perform any necessary cleanup here
    exit 1
}

# Trap SIGINT and call handle_sigint when it's received
trap 'handle_sigint' SIGINT

set -euxo pipefail

# ᐅ ./gharchive-concat-daily.sh --help
# Usage: ./gharchive-concat-daily.sh [options]

# Options:
# -s, --source <dir> Source directory (required)
# -t, --target <dir> Target directory (required)
# -d, --dry-run Perform a dry run without creating files
# -v, --verbose Enable verbose output
# -f, --fast-mode Use the faster concatenation (cat) method; less resilient to mismatched compression
# -p, --use-pigz Use pigz instead of gzip for compression
# -r, --report Generate a report with line counts
# -h, --help Display this help text


# ./gharchive-concat-daily.sh --source ~/gharchive-cncf/debug.cncf.all \
# --target ~/gharchive-cncf/debug.cncf.byrepo \
# --verbose \
# --fast-mode > gharchive-concat-daily.log

./gharchive-concat-daily.sh --source ~/gharchive-cncf/debug.cncf.all \
--target ~/gharchive-cncf/debug.cncf.byrepo \
--verbose \
--fast-mode
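The per-event-type loop recorded in cncf-consolidate.log above can be sketched as follows. This is a minimal illustration, not the actual script: the function name and directory layout (one subdirectory per event type, each holding .gz files) are assumptions, and it relies on the fact that concatenated gzip members form a valid gzip stream, which is what makes the `--fast-mode` cat method possible.

```shell
#!/bin/bash
# Minimal sketch of the fast-mode consolidation loop (hypothetical layout:
# one subdirectory per event type, each holding daily .gz files).
# Concatenated gzip members form a valid gzip stream, so plain `cat` works.
consolidate() {
    local source_dir="$1" target_dir="$2"
    mkdir -p "$target_dir"
    for dir in "$source_dir"/*/; do
        local dirName outputFile
        dirName="$(basename "$dir")"                         # e.g. PushEvent
        outputFile="$target_dir/${dirName}-consolidated.gz"
        echo "Concatenating files from ${dir%/} into $outputFile..."
        cat "$dir"*.gz > "$outputFile"
    done
    echo "Concatenation complete."
}
```

The real script's `--dry-run`, `--verbose`, and `--use-pigz` options would wrap or replace the `cat` call accordingly.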
17 changes: 17 additions & 0 deletions db/scm/sgm-gharchive/consolidate-gz.debug.log
@@ -0,0 +1,17 @@
====================================
GZ File Consolidation Script
====================================

====================================
GZ File Consolidation Script
====================================

Source: /Users/matt/gharchive-cncf/debug.cncf.all
Target: /Users/matt/gharchive-cncf/debug.cncf.byrepo
Dry Run: 0
Verbose: 1
Processing directory: /Users/matt/gharchive-cncf/debug.cncf.all/CommitCommentEvent
dirName: CommitCommentEvent
outputFile: /Users/matt/gharchive-cncf/debug.cncf.byrepo/CommitCommentEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/debug.cncf.all/CommitCommentEvent into /Users/matt/gharchive-cncf/debug.cncf.byrepo/CommitCommentEvent-consolidated.gz...
Concatenation complete.
5 changes: 5 additions & 0 deletions db/scm/sgm-gharchive/gharchive-concat-daily.log
@@ -0,0 +1,5 @@
=============================================================
GitHub Archive: combine daily archives into per repo archives
=============================================================

Creating target directory: /Users/matt/gharchive-cncf/cncf.byrepo
Binary file added docs/data-action-cover.jpeg
122 changes: 122 additions & 0 deletions docs/on-classifying-projects-and-communities.md
@@ -0,0 +1,122 @@
# On Classifying Projects and Communities

<!-- TOC tocDepth:2..3 chapterDepth:2..6 -->

- [Types of Projects, Contributors, and Communities](#types-of-projects-contributors-and-communities)
- [Project Types](#project-types)
- [Project member types](#project-member-types)
- [Contributor Cohorts (Segmentation)](#contributor-cohorts-segmentation)
- [Project Metrics, Measures, and Attributes](#project-metrics-measures-and-attributes)

<!-- /TOC -->

_apart from the diagram, what's below is reproduced from "Working in Public," by Nadia Eghbal ([https://press.stripe.com/working-in-public](https://press.stripe.com/working-in-public))_

## Types of Projects, Contributors, and Communities

### Project Types

The upper-right quadrant (Federations) has the **highest user and contributor growth**, while the lower-left quadrant (Toys) has the lowest on both measures.

![image](./project-types-quadrants.jpg)

#### Federations

- rare, impactful, ubiquitous
- ~ < 3% of OSS projects
- outsized impact and adoption
- growth pattern: shard
- complex governance, large scale

#### Stadiums

- Very low maintainer-to-user ratio.
- Unlike Federations and Clubs, which exhibit *decentralized communities*, Stadiums typically have a *centralized community topology*.
- Often enjoy large, sometimes federated user communities and groups, oftentimes replicated and segmented by geography.

#### Clubs

- Contributors are often users.
- Niche languages, frameworks, and libraries.
- Domain-specific solutions.
- Analogous to meetups or hobby groups: self-selected users, often aligned around a singular axis/dimension of common needs or interests.
- Passionate, dedicated cadre of contributors. High Net Promoter Score (NPS).

#### Toys

- Side Projects, Hackathon outcomes, experiments, personal growth/learning projects.

### Project member types

#### Maintainers

Maintainers are those who are responsible for the future of a project's repository (or repositories), whose decisions affect the project laterally. Maintainers can be thought of as "trustees" or stewards of the project.

#### Contributors

Contributors are those who make contributions to a project's repository, ranging from casual to significant, but who aren't responsible for its overall success.

##### Active Contributors

Active contributors (aka "regular" or "long-term" contributors) are considered members of the project, based on their reputation, the consistency of their contributions, or, in many cases, by explicit declaration from the project's governance mechanism(s) or via fiat.

##### Casual Contributors

Also known as drive-by, reactive, or passive contributors. Often motivated by interests of self or employer, commonly presenting with a transactional engagement style.

#### Users

Users are those whose primary relationship to a project's repository is to consume or use its code [and/or artifacts].

##### Active Users

Frequently self-identified in ADOPTERS.md or via other declarative mechanisms, and captured in case studies and whitepapers as part of project collateral. Historically (and more generally), a project's maintainers have had no way to identify users, an expectation shared by users themselves.

##### Passive Users

#### On project member type mobility and fluidity

TODO: Contributor Ladder, and its utility as a signal type.

TODO: Reference and/or link to tag-contributor-strategy docs

### Contributor Cohorts (Segmentation)

#### What's "Cohort Analysis?"

> **Cohort analysis** is a kind of [behavioral analytics](https://en.wikipedia.org/wiki/Behavioral_analytics) that breaks the data in a [data set](https://en.wikipedia.org/wiki/Data_set) into related groups before analysis. These groups, or [cohorts](https://en.wikipedia.org/wiki/Cohort_(statistics)), usually share common characteristics or experiences within a defined time-span. Cohort analysis allows a company to "see patterns clearly across the life-cycle of a customer (or user), rather than slicing across all customers blindly without accounting for the natural cycle that a customer undergoes." By seeing these patterns of time, a company can adapt and tailor its service to those specific cohorts. While cohort analysis is sometimes associated with a [cohort study](https://en.wikipedia.org/wiki/Cohort_study), they are different and should not be viewed as one and the same. Cohort analysis is specifically the analysis of cohorts in regards to [big data](https://en.wikipedia.org/wiki/Big_data) and [business analytics](https://en.wikipedia.org/wiki/Business_analytics), while in cohort study, data is broken down into similar groups.

_(source: [https://en.wikipedia.org/wiki/Cohort_analysis](https://en.wikipedia.org/wiki/Cohort_analysis))_

#### n-th Time Contributors

- First Time Contributors
- Second Time Contributors
- Third Time Contributors
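One way to make these cohorts concrete: given a time-ordered stream of contribution events, bucket each contributor by the ordinal of their contribution. A minimal sketch in awk, where the input format (one "contributor timestamp" line per event, sorted by time) is an assumption for illustration:

```shell
# Bucket contributors into 1st/2nd/3rd-time cohorts from a time-ordered
# stream of "contributor timestamp" lines on stdin (hypothetical format).
nth_time_cohorts() {
    awk '{
        count[$1]++                    # running contribution count per login
        n = count[$1]
        if (n <= 3) cohort[n] = cohort[n] " " $1
    }
    END { for (n = 1; n <= 3; n++) printf "cohort%d:%s\n", n, cohort[n] }'
}
```

For example, `printf 'alice 1\nbob 2\nalice 3\nalice 4\n' | nth_time_cohorts` prints `cohort1: alice bob`, `cohort2: alice`, and `cohort3: alice`.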

#### Reputation Index

This is problematic if not done transparently. We might consider generating a number of indices and considering their utility in practice.

### Project Metrics, Measures, and Attributes

These form part of a picture, but taken alone, in isolation, or without local, project-specific context, they are often misunderstood in practice.

For all given points in time, aggregated by cohort(s) or other dimensions (never by individual):

- OSS Scorecard Metrics
- Active vs Passive Contributors
- Active vs Passive Users
- Number of open Issues
- Number of open Pull Requests (PR)
- Average time to close an Issue
- Average time to close a PR
- Average time to First Response to Issue
- Average time to First Response to PR
- Granularity of code/issue/pr churn (index over time)
- Patterns of Project Activity over time
- Bus Factor (low number of contributors working on the same areas of code/project over time).
- Popularity (Stars, @mentions, #hashtags)
- Depended Upon (aka PageRank) by other OSS projects
- Depended Upon by Apple Services as correlated via SBOM data.
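As a toy illustration of how one of the measures above might be computed, average time-to-close can be derived from pairs of open/close timestamps. The CSV input format (`opened_at,closed_at` as epoch seconds) is an assumption for the sketch:

```shell
# Average time to close, from "opened_at,closed_at" epoch-second pairs
# on stdin (hypothetical export format). Prints the mean in seconds.
avg_close_seconds() {
    awk -F, 'NF == 2 { total += $2 - $1; n++ }
             END { if (n) printf "%.1f\n", total / n }'
}
```

For example, `printf '100,200\n100,400\n' | avg_close_seconds` prints `200.0`. The same pattern applies to time-to-first-response, substituting the first-comment timestamp for the close timestamp.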
80 changes: 80 additions & 0 deletions docs/on-contributor-privacy.md
@@ -0,0 +1,80 @@
<!-- TOC tocDepth:2..3 chapterDepth:2..6 -->

- [Values: How we use data](#values-how-we-use-data)
- [Sharing Data Creates Transparency, Public Participation, and Collaboration](#sharing-data-creates-transparency-public-participation-and-collaboration)
- [It's how we work with data that really matters](#its-how-we-work-with-data-that-really-matters)
- [How to responsibly use data](#how-to-responsibly-use-data)
- [Axioms](#axioms)

<!-- /TOC -->

# On Contributor Privacy

Open Source Projects are created, managed, and sustained by communities of contributors, maintainers, users, & vendors. As we seek to better understand the size, composition, topology, and shape of open source communities, we must exercise care and caution. The evolution and prevalence of open source software has made it increasingly easy to inadvertently, accidentally and unknowingly violate the privacy of contributors, causing harm and potentially putting people at risk, and in some cases danger.

We have guidelines and rubrics for understanding how to build secure systems, and defined controls to ensure that once built, our systems remain secure. Analogous guidelines and controls for privacy in this project are informed by ideas presented in [Data Action: Using data for public good](https://direct.mit.edu/books/book/4983/Data-ActionUsing-Data-for-Public-Good), subtitled _"How to use data as a tool for empowerment rather than oppression"_.

![data-action-cover](./data-action-cover.jpeg)
_<https://direct.mit.edu/books/book/4983/Data-ActionUsing-Data-for-Public-Good>_


#### Additional Resources

* Sophia Vargas's insightful talk, "**Design Metric Programs to Respect Contributor Expectations and Promote Safety**" ([video](https://www.youtube.com/watch?v=b3KuTUc_mw0), [sched](https://sched.co/1R2qL)).

## Values: How we use data

*Excerpts and quotes taken from “Data Action: Using Data For Public Good” unless otherwise noted.*

<https://mitpress.mit.edu/9780262545310/data-action>

### Sharing Data Creates Transparency, Public Participation, and Collaboration

"Sharing Data does so much more than provide access to information. It **creates trusting relationships, changes power dynamics, teaches us about policies, fosters debate, and helps to generate collaborative knowledge sharing,** all of which are essential to building strong, deliberative communities." *S. Williams, Data Action: Using data for public good, p. 137*

"Data visualizations help create a narrative around an idea, and **it's the narrative that ultimately has the ability to change people's hearts and minds**. When using data for action, **we must focus on the story we want to tell** with the data." *S. Williams, Data Action: Using data for public good, p. 141*

### It's how we work with data that really matters

…big data in its raw form cannot perform on its own; rather how data is transformed and operationalized can change the way we see the world. More specifically, **data can be used for civic action and policy change by communicating with the data clearly and responsibly to expose the hidden patterns and ideologies** to audiences inside and outside the policy arena. Communicating with data in this way requires the ability to **ask the right questions,** find or **collect the appropriate data, analyze and interpret that data, and visualize the results in a way that can be understood by broad audiences.**

**Combining these methods transforms data** from a simple point on a map **to a narrative that has meaning.** Data is not often processed in this way because data analysts are often not familiar with the techniques that can be used to tell stories with the data ethically and responsibly.

## How to responsibly use data

1. We must interrogate the reasons we want to use data and determine the potential for our work to do more harm than good.

2. Building teams to create narratives around data for action is essential for communicating the results effectively, but team collaboration also helps to make sure no harm is done to the people represented in the data itself.

3. Building data helps change the power dynamics inherent in controlling and using data, while also having numerous side benefits, such as teaching data literacy.

4. Coming up with unique ways to acquire, quantify, and model data can expose messages previously hidden from the public eye; however, we must expose ideas ethically, going back to the first principle above.

5. We must validate the work we do with data by literally observing the phenomenon on the ground and asking those it [affects] to interpret the results.

6. Sharing data is essential for communicating the need for policy change and generating a debate essential for that work. Data visualizations are effective at doing that.

7. We must remember that data are people, and we must do them no harm. Regulations help provide standards of practice for the use of data, but they often are not developed in line with technological change; therefore, we must seek to develop our own standards and call upon others to do the same.

### Axioms

#### The Purpose for Using Data Analytics Must Be Interrogated

…analysts must begin by asking policy questions of people with on-the-ground expertise, those who know the issue the best - and by believing ultimately that this collaboration will create smarter models. *p.215*

#### Building Expert Teams is Essential to Making Data Work for Policy Change

…working collaboratively with policy experts, communities, and designers is essential to reduce the potential for analytics to guide us toward misleading, unethical, or inaccurate conclusions. But more importantly, building expert teams helps *communicate* the work.

#### Building Data Changes Power Dynamics and Shapes Communities

…building data has other benefits: it teaches data literacy, builds communities around shared ideas, and creates media buzz around topics by placing them on the policy agenda.

#### Quantify Ingeniously, but Remember Data Is Biased by Its Creator

#### Data Brings Insights to the Public in Dynamic Ways

#### Data Are People, and We Must Do Them No Harm

Binary file added docs/project-types-quadrants.jpg
