`collect.py` scrapes published data on technology usage from `w3techs.com`.
14
+
These data are typically updated daily. We label the jurisdiction for each
15
+
service provider using `analysis/providers_labeled.csv`. `types.py` defines the
16
+
Postgres tables for these data.
13
17
14
-
`collect.py` scrapes data on technology usage from `w3techs.com`. We combine
15
-
these with some data with some external data sources (like the population of
16
-
countries and the jurisdiction of providers), maintained more manually and kept
17
-
in `analysis`.
18
+
# Computing Gini coefficients
18
19
19
-
`types.py` defines the Postgres tables for these data.
20
+
## Included markets
20
21
22
+
The markets we include in our Gini metric are (listed here by their W3Techs names):
23
+
'data-centers',
24
+
'web-hosting',
25
+
'dns-server',
26
+
'proxy',
27
+
'ssl-certificate',
28
+
'server-location',
29
+
'top-level-domain'.
21
30
22
-
## Caveats
31
+
Explanations of each, and rationale for including them, are as follows:
32
+
33
+
### 1. Data centers
34
+
Data center providers supply hardware and software infrastructure to serve
35
+
websites on the internet.
36
+
37
+
If data centers were overly centralized, providers could effectively take down
38
+
content. If data centers were overly centralized in particular jurisdictions,
39
+
jurisdictions could take down that content by legal decree.
40
+
41
+
### 2. Web hosts
42
+
A web hosting service provides hardware and software infrastructure to enable
43
+
webmasters to make their website accessible via the internet. Web hosts are
44
+
distinct from data centers: for example, for WordPress sites, WP Engine is a
45
+
hosting provider, because customers buy hosting services from them. However, WP
46
+
Engine uses Google to run its physical servers. Therefore, Google is the data
47
+
center provider and WP Engine is the web host.
48
+
49
+
If web hosting were overly centralized, providers or jurisdictions could make
50
+
content inaccessible.
51
+
52
+
### 3. DNS servers
53
+
DNS (domain name system) servers manage internet domain names and their
54
+
associated records such as IP addresses.
55
+
56
+
If DNS servers were overly centralized, providers or jurisdictions could make
57
+
content inaccessible by severing users' path to that content.
58
+
59
+
### 4. Reverse proxies
60
+
A reverse proxy service is an intermediary for a website which handles requests
61
+
from web clients on behalf of the website's server. Common uses for reverse
62
+
proxies are content delivery networks (CDNs, typically located in different
63
+
geographical regions) and DDoS (distributed denial of service) protection
64
+
services.
65
+
66
+
If reverse proxies were overly centralized, providers could make content
67
+
inaccessible by refusing to serve key content.
68
+
69
+
### 5. Certificate authorities
70
+
SSL certificate authorities are institutions that issue SSL certificates.
71
+
72
+
If certificate authorities were overly centralized, they could make content more
73
+
difficult to access by revoking or denying TLS certificates.
74
+
75
+
### 6. Server locations
76
+
77
+
Servers must be positioned in the world. We assume countries have jurisdiction
78
+
over the servers located in their countries (meaning those countries can
79
+
"legitimately" (by Weber's definition) seize those servers or cut Internet
80
+
access to them).
81
+
82
+
If server locations were overly centralized, particular jurisdictions could make content impossible to access by seizing or blocking access to servers in their jurisdiction.
83
+
84
+
### 7. Top-level domains
85
+
86
+
The Domain Name System supports top-level domains (e.g., .com, .net, .ar). Those top-level domains are administered by registrars with clear national jurisdiction.
87
+
88
+
If top-level domains were overly centralized, particular jurisdictions could
89
+
block access to content by compelling registrars to drop or reroute
90
+
DNS requests.
91
+
92
+
## Collecting data for Gini
93
+
94
+
A Gini is computed for each market. When computing a Gini coefficient, we pass
95
+
in the current time `date`, and get everything between timestamp '{date}' -
96
+
interval '12 hour' AND '{date}'.
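As a rough sketch of the 12-hour window described above, the bounds could be computed in Python and passed to a parameterized query. The table name `provider_marketshare`, its columns, and the helper names are hypothetical and are not taken from `types.py`; the `%(name)s` placeholders assume a psycopg2-style driver.

```python
from datetime import datetime, timedelta

# Width of the collection window described in the text.
WINDOW = timedelta(hours=12)


def window_bounds(date):
    """Return the (start, end) timestamps of the 12-hour window ending at `date`."""
    return date - WINDOW, date


# Hypothetical table and column names, for illustration only.
QUERY = """
SELECT name, percent
FROM provider_marketshare
WHERE market = %(market)s
  AND time BETWEEN %(start)s AND %(end)s
"""


def query_params(market, date):
    """Build the parameter dict for QUERY for one market and one reference time."""
    start, end = window_bounds(date)
    return {"market": market, "start": start, "end": end}
```

Passing the bounds as parameters, rather than formatting `{date}` into the SQL string, leaves quoting to the database driver.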
97
+
98
+
## Weighting Gini
99
+
100
+
Finally, we weight each country's marketshare by its proportion of global
101
+
Internet users (`marketshare / proportion of Internet users`). In other words,
102
+
if everyone's proportion of core Internet services were equal to their
103
+
proportion of the global population of Internet users, the Gini coefficient
104
+
would be 0, because every country's weighted value would equal 1. (We get data on the proportion of the world's Internet users from
105
+
World Bank data. We keep that data in `analysis/`.)
106
+
107
+
An example of where this matters: Indonesia and Russia have a comparable share of
108
+
the world's Internet users: 3.2% vs 3.1% (as of April 21, 2021). But Indonesia
109
+
has jurisdiction over only 0.1% of the world's core Internet services, whereas
110
+
Russia has 0.5%. So Indonesia's weighted value ends up being 0.030380 vs
111
+
Russia's 1.643119, reflecting the population-weighted disparity in service
112
+
provision.
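A minimal sketch of the weighting step followed by a Gini computation, assuming the standard Gini formula over sorted values. The function names and the two-country inputs are hypothetical, and the repository's own computation may differ in detail.

```python
def weighted_shares(market_share, internet_user_share):
    """Divide each country's market share by its share of global Internet users.

    Countries with a missing or zero Internet-user share are skipped
    (an assumption of this sketch).
    """
    return {
        country: share / internet_user_share[country]
        for country, share in market_share.items()
        if internet_user_share.get(country)
    }


def gini(values):
    """Standard Gini coefficient of non-negative values (0 means perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i starting at 1
    ranked_sum = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * ranked_sum / (n * total) - (n + 1) / n


# Hypothetical inputs: each country's market share equals its user share.
weights = weighted_shares({"A": 0.5, "B": 0.5}, {"A": 0.5, "B": 0.5})
coefficient = gini(weights.values())
```

With these inputs every weighted value is 1, so the values are perfectly equal under this formula.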
113
+
114
+
# Caveats
23
115
24
116
W3Techs provides monthly data to paying subscribers. Relative to the data we scrape
25
117
from the webpage, that data lists slightly more providers, but does not change