Commit ab5fac2

Add optional tranco list to demo script (#1016)
* Add tranco option to demo script
* Fix formatting
* Add missing dependencies
* Bump number of browsers and number of test sites in demo
1 parent abf10d7 commit ab5fac2

5 files changed: +53 -31 lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -1,3 +1,6 @@
+# Tranco list cache directory
+.tranco/
+
 # Docker volume
 docker-volume/

README.md

Lines changed: 5 additions & 0 deletions
@@ -88,6 +88,11 @@ Once installed, it is very easy to run a quick test of OpenWPM. Check out
 `openwpm/config.py::BrowserParams`, with the exception of the changes
 specified in `demo.py`.
 
+The demo script also includes a sample of how to use the
+[Tranco](https://tranco-list.eu/) top sites list via the optional command line
+flag `demo.py --tranco`. Note that since this is a real top sites list it will
+include NSFW websites, some of which will be highly ranked.
+
 More information on the instrumentation and configuration parameters is given
 below.
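
The README note warns that the Tranco list contains NSFW sites. The commit itself does not filter anything; the following is only a minimal sketch of how a crawl list could be screened before use. The function name `filter_domains`, the blocklist, and the `.example` domains are all hypothetical and not part of OpenWPM or the `tranco` package.

```python
def filter_domains(domains, blocklist):
    """Drop every domain that is blocked or is a subdomain of a blocked entry.

    Hypothetical helper for illustration; not part of OpenWPM or tranco.
    """
    blocked = {d.lower().strip(".") for d in blocklist}
    kept = []
    for domain in domains:
        parts = domain.lower().strip(".").split(".")
        # The domain itself plus each parent suffix,
        # e.g. cdn.blocked.example -> blocked.example -> example
        suffixes = {".".join(parts[i:]) for i in range(len(parts))}
        if suffixes.isdisjoint(blocked):
            kept.append(domain)
    return kept


# Made-up ranked list standing in for the output of a top sites list.
ranked = ["example.com", "blocked.example", "cdn.blocked.example", "example.org"]
print(filter_domains(ranked, {"blocked.example"}))
# prints ['example.com', 'example.org']
```

The filter preserves the original ranking order, so it can be applied before the `http://` prefix is added to each domain.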

demo.py

Lines changed: 20 additions & 8 deletions
@@ -1,23 +1,35 @@
+import argparse
 from pathlib import Path
 
+import tranco
+
 from custom_command import LinkCountingCommand
 from openwpm.command_sequence import CommandSequence
 from openwpm.commands.browser_commands import GetCommand
 from openwpm.config import BrowserParams, ManagerParams
 from openwpm.storage.sql_provider import SQLiteStorageProvider
 from openwpm.task_manager import TaskManager
 
-# The list of sites that we wish to crawl
-NUM_BROWSERS = 1
-sites = [
-    "http://www.example.com",
-    "http://www.princeton.edu",
-    "http://citp.princeton.edu/",
-]
+parser = argparse.ArgumentParser()
+parser.add_argument("--tranco", action="store_true", default=False),
+args = parser.parse_args()
+
+if args.tranco:
+    # Load the latest tranco list. See https://tranco-list.eu/
+    print("Loading tranco top sites list...")
+    t = tranco.Tranco(cache=True, cache_dir=".tranco")
+    latest_list = t.list()
+    sites = ["http://" + x for x in latest_list.top(10)]
+else:
+    sites = [
+        "http://www.example.com",
+        "http://www.princeton.edu",
+        "http://citp.princeton.edu/",
+    ]
 
 # Loads the default ManagerParams
 # and NUM_BROWSERS copies of the default BrowserParams
-
+NUM_BROWSERS = 2
 manager_params = ManagerParams(num_browsers=NUM_BROWSERS)
 browser_params = [BrowserParams(display_mode="native") for _ in range(NUM_BROWSERS)]
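
The site-selection logic added to `demo.py` can be read in isolation as the sketch below. It is an illustration, not the script as committed. The names `DEFAULT_SITES`, `parse_args`, `build_site_list`, and `top_n` are assumptions introduced here. The `tranco` import is deferred so that the default path runs without the package or network access. The sketch also omits the trailing comma after `add_argument(...)` in the diff, which is harmless but turns that statement into a one-element tuple expression.

```python
import argparse

# Fallback sites, copied from the diff above.
DEFAULT_SITES = [
    "http://www.example.com",
    "http://www.princeton.edu",
    "http://citp.princeton.edu/",
]


def parse_args(argv=None):
    """Parse the command line; argv=None makes argparse read sys.argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--tranco", action="store_true", default=False)
    return parser.parse_args(argv)


def build_site_list(use_tranco, top_n=10):
    """Return the URLs to crawl (hypothetical helper, not in demo.py)."""
    if not use_tranco:
        return list(DEFAULT_SITES)
    # Imported lazily so the default path works without the tranco package.
    import tranco

    t = tranco.Tranco(cache=True, cache_dir=".tranco")
    latest_list = t.list()  # fetches the latest list, or reads it from the cache
    # Tranco entries are bare domains, so a scheme is prepended.
    return ["http://" + domain for domain in latest_list.top(top_n)]


# An explicit empty argv keeps the example independent of the real command line.
args = parse_args([])
sites = build_site_list(args.tranco)
print(sites)
```

Calling `parse_args(["--tranco"])` instead selects the Tranco branch, which needs the `tranco` package installed and, on first use, network access to populate the `.tranco` cache directory.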

environment.yaml

Lines changed: 24 additions & 23 deletions
@@ -3,46 +3,47 @@ channels:
   - main
 dependencies:
   - beautifulsoup4=4.11.1
-  - black=22.8.0
+  - black=22.10.0
   - click=8.1.3
-  - codecov=2.1.11
-  - dill=0.3.5.1
+  - codecov=2.1.12
+  - dill=0.3.6
   - easyprocess=1.1
-  - gcsfs=2022.8.2
-  - geckodriver=0.30.0
-  - ipython=8.5.0
+  - gcsfs=2022.11.0
+  - geckodriver=0.32.0
+  - ipython=8.6.0
   - isort=5.10.1
   - leveldb=1.23
-  - multiprocess=0.70.13
-  - mypy=0.982
-  - nodejs=18.10.0
-  - pandas=1.5.0
+  - multiprocess=0.70.14
+  - mypy=0.991
+  - nodejs=18.12.1
+  - pandas=1.5.1
   - pillow=9.2.0
-  - pip=22.2.2
+  - pip=22.3.1
   - pre-commit=2.20.0
-  - psutil=5.9.2
+  - psutil=5.9.4
   - pyarrow=9.0.0
-  - pytest-asyncio=0.19.0
+  - pytest-asyncio=0.20.2
   - pytest-cov=4.0.0
-  - pytest=7.1.3
-  - python=3.10.6
+  - pytest=7.2.0
+  - python=3.11.0
   - pyvirtualdisplay=3.0
   - recommonmark=0.7.1
   - redis-py=4.3.4
-  - s3fs=2022.8.2
-  - selenium=4.5.0
-  - sentry-sdk=1.9.10
+  - s3fs=2022.11.0
+  - selenium=4.6.0
+  - sentry-sdk=1.11.0
   - sphinx-markdown-tables=0.0.17
-  - sphinx=5.2.3
+  - sphinx=5.3.0
   - tabulate=0.9.0
   - tblib=1.7.0
   - wget=1.20.3
   - pip:
     - dataclasses-json==0.5.7
     - domain-utils==0.7.1
-    - jsonschema==4.16.0
-    - plyvel==1.4.0
-    - types-pyyaml==6.0.12
-    - types-redis==4.3.21.1
+    - jsonschema==4.17.0
+    - plyvel==1.5.0
+    - tranco==0.6
+    - types-pyyaml==6.0.12.2
+    - types-redis==4.3.21.4
     - types-tabulate==0.9.0.0
 name: openwpm

scripts/environment-unpinned.yaml

Lines changed: 1 addition & 0 deletions
@@ -32,3 +32,4 @@ dependencies:
     - plyvel
     - domain-utils
     - dataclasses-json
+    - tranco
