
Commit f064d4e

committed
W3Techs analysis.
- Added Gini coefficient tools.
- Fixed the way we compute Gini.
- Documented exactly how we compute Gini in w3techs/README.md.
1 parent 473b14c commit f064d4e

9 files changed

+359
-680
lines changed

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ When you're done making changes and you'd like to propose them for review, open
3939
### Your PR is merged!
4040
Congratulations! :sparkles:
4141

42-
Once your PR is merged, you will be proudly listed as a contributor in the contributor chart.
42+
Once your PR is merged, you will be listed as a contributor in the contributor chart.
4343

4444

4545
## Attribution

README.md

Lines changed: 7 additions & 7 deletions
@@ -1,8 +1,8 @@
11
# taaraxtak
22

33
taaraxtak is a platform for collecting, storing, and visualizing data about
4-
political dimensions of the Internet. See [this blog post for
5-
context on its motivation](https://nickmerrill.substack.com/p/the-story-so-far).
4+
political dimensions of the Internet. See [some context on its
5+
motivation](https://nickmerrill.substack.com/p/the-story-so-far).
66

77
See a live version at XXX
88

@@ -37,12 +37,12 @@ See [CONTRIBUTING.MD](CONTRIBUTING.md).
3737
taaraxtak (IPA: taːɾaxtak) means "the sky" in the [Chochenyo
3838
Ohlone](https://sogoreate-landtrust.org/lisjan-history-and-territory/) language.
3939

40-
Looking up at the sky, we see only what's above us. Similarly, the Internet is
41-
something we can only know partially. When we measure the Internet we are, in
40+
Looking up at the sky, I can only see what's above me. Similarly, the Internet
41+
is something I can only know partially; when we measure the Internet we are, in
4242
more senses than one, looking at the shape of clouds. This name is meant to
43-
remind us to look at what we /can/ see, acknowledging its partiality, and being
44-
thankful for whatever we can take from it. Everything we learned about the
45-
atmosphere started by looking up at the sky. The message is, all we need is a
43+
remind us to look at what we *can* see, acknowledging its partiality, and being
44+
thankful for whatever we can take from it. Remember, everything we learned about
45+
the atmosphere started by looking up at the sky. All we've ever needed is a
4646
partial perspective and an open system of thought.
4747

4848
# License

src/w3techs/README.md

Lines changed: 101 additions & 9 deletions
@@ -5,21 +5,113 @@ centralization of core Internet services among particular providers and
55
jurisdictions. Our primary tool is the Gini coefficient.
66

77
[See this blogpost for
8-
context](https://nickmerrill.substack.com/p/measuring-internet-decentralization),
9-
though mind the particulars may have changed.
8+
context](https://nickmerrill.substack.com/p/measuring-internet-decentralization)
9+
on our motivation and the Gini coefficient.
1010

11+
# Scraping W3Techs data
1112

12-
## Method
13+
`collect.py` scrapes published data on technology usage from `w3techs.com`.
14+
These data are typically updated daily. We label the jurisdiction for each
15+
service provider using `analysis/providers_labeled.csv`. `types.py` defines the
16+
Postgres tables for these data.
1317

14-
`collect.py` scrapes data on technology usage from `w3techs.com`. We combine
15-
these with some data with some external data sources (like the population of
16-
countries and the jurisdiction of providers), maintained more manually and kept
17-
in `analysis`.
18+
# Computing Gini coefficients
1819

19-
`types.py` defines the Postgres tables for these data.
20+
## Included markets
2021

22+
The markets we include in our Gini metric are (listed here by their W3Techs names):
23+
'data-centers',
24+
'web-hosting',
25+
'dns-server',
26+
'proxy',
27+
'ssl-certificate',
28+
'server-location',
29+
'top-level-domain'.
2130

22-
## Caveats
31+
Explanations of each, and rationale for including them, are as follows:
32+
33+
### 1. Data centers
34+
Data center providers supply hardware and software infrastructure to serve
35+
websites on the internet.
36+
37+
If data centers were overly centralized, providers could effectively take down
38+
content. If data centers were overly centralized in particular jurisdictions,
39+
jurisdictions could take down that content by legal decree.
40+
41+
### 2. Web hosts
42+
A web hosting service provides hardware and software infrastructure to enable
43+
webmasters to make their website accessible via the internet. Web hosts are
44+
distinct from data centers: for example, for WordPress sites, WP Engine is a
45+
hosting provider, because customers buy hosting services from them. However, WP
46+
Engine uses Google to run its physical servers. Therefore, Google is the data
47+
center provider and WP Engine is the web host.
48+
49+
If web hosting were overly centralized, providers or jurisdictions could make
50+
content inaccessible.
51+
52+
### 3. DNS servers
53+
DNS (domain name system) servers manage internet domain names and their
54+
associated records such as IP addresses.
55+
56+
If DNS servers were overly centralized, providers or jurisdictions could make
57+
content inaccessible by severing users' path to that content.
58+
59+
### 4. Reverse proxies
60+
A reverse proxy service is an intermediary for a website which handles requests
61+
from web clients on behalf of the website's server. Common uses for reverse
62+
proxies are content delivery networks (CDNs, typically located in different
63+
geographical regions) and DDoS (distributed denial of service) protection
64+
services.
65+
66+
If reverse proxies were overly centralized, providers could make content
67+
inaccessible by refusing to serve key content.
68+
69+
### 5. Certificate authorities
70+
SSL certificate authorities are institutions that issue SSL certificates.
71+
72+
If certificate authorities were overly centralized, they could make content more
73+
difficult to access by revoking or denying TLS certificates.
74+
75+
### 6. Server locations
76+
77+
Servers must be positioned in the world. We assume countries have jurisdiction
78+
over the servers located within their borders (meaning those countries can
79+
"legitimately" (by Weber's definition) seize those servers or cut Internet
80+
access to them).
81+
82+
If server locations were overly centralized, particular jurisdictions could make content impossible to access by seizing or blocking access to servers in their jurisdiction.
83+
84+
### 7. Top-level domains
85+
86+
The Domain Name System supports top-level domains (e.g., .com, .net, .ar). Those top-level domains are administered by registrars with clear national jurisdiction.
87+
88+
If top-level domains were overly centralized, particular jurisdictions could
89+
block access to content by compelling the registrars that administer them to drop or reroute
90+
DNS requests.
91+
92+
## Collecting data for Gini
93+
94+
A Gini is computed for each market. When computing a Gini coefficient, we pass
95+
in the current time `date` and select every record with a timestamp between
96+
`timestamp '{date}' - interval '12 hour'` and `'{date}'`.
97+
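The 12-hour window described above can be sketched in plain Python. This is only an illustration: the names `gini_window` and `in_window` are hypothetical and do not come from the repository, and the bounds are assumed to be inclusive, as they are with SQL `BETWEEN`.

```python
from datetime import datetime, timedelta


def gini_window(date, hours=12):
    """Return the (start, end) bounds of the window ending at `date`."""
    return date - timedelta(hours=hours), date


def in_window(timestamp, date, hours=12):
    """True if `timestamp` falls inside the window (inclusive bounds assumed)."""
    start, end = gini_window(date, hours)
    return start <= timestamp <= end


# Example: a record scraped at 06:00 counts toward a Gini computed at noon.
noon = datetime(2021, 4, 21, 12, 0)
print(gini_window(noon))
print(in_window(datetime(2021, 4, 21, 6, 0), noon))
```

In a real query the same bounds would more likely be passed to Postgres as parameters than interpolated into the SQL string.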
98+
## Weighting Gini
99+
100+
Finally, we weight each country's marketshare by its proportion of global
101+
Internet users (`marketshare / proportion of Internet users`). In other words,
102+
if everyone's proportion of core Internet services were equal to their
103+
proportion of the global population of Internet users, the Gini coefficient
104+
would be 0, because every country's weighted value would be 1. (We get data on the proportion of the world's Internet users from
105+
World Bank data. We keep that data in `analysis/`.)
106+
107+
An example of where this matters: Indonesia and Russia have a comparable share of
108+
the world's Internet users: 3.2% vs 3.1% (as of April 21, 2021). But Indonesia
109+
has jurisdiction over only 0.1% of the world's core Internet services, whereas
110+
Russia has 0.5%. So Indonesia's weighted value ends up being 0.030380 vs
111+
Russia's 1.643119, reflecting the population-weighted disparity in service
112+
provision.
113+
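As a rough illustration of the weighting step, the sketch below divides each country's marketshare by its share of Internet users and then computes a Gini coefficient over the weighted values. It is not the repository's implementation: the function names and the three-country input are hypothetical, and the Gini formula is the standard one over sorted values.

```python
def weight_by_internet_users(marketshare, user_share):
    """Divide each country's marketshare by its share of Internet users.

    Both arguments are dicts keyed by country. Countries with a missing or
    zero user share are skipped (an assumption made for this sketch).
    """
    return {
        country: share / user_share[country]
        for country, share in marketshare.items()
        if user_share.get(country, 0) > 0
    }


def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, for i = 1..n
    ranked_sum = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * ranked_sum / (n * total) - (n + 1) / n


# Hypothetical inputs, not real data.
marketshare = {"A": 0.70, "B": 0.25, "C": 0.05}
user_share = {"A": 0.20, "B": 0.30, "C": 0.50}
weighted = weight_by_internet_users(marketshare, user_share)
print(weighted)
print(gini(weighted.values()))
```

If every country's marketshare equals its share of Internet users, every weighted value is 1 and `gini` returns 0.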
114+
# Caveats
23115

24116
W3Techs provides monthly data to paying subscribers. Relative to the data we scrape
25117
from the webpage, that data lists slightly more providers, but does not change
