This repository serves two purposes: it provides additional data for our publication (branch tma17) and hosts the ongoing development of our tool (branch master).
Our architecture approach:
The codes parsing process collects location codes from the selected sources, merges them if desired, and stores them in the database.
The pre-processing step parses IP/DNS files and classifies/filters the domains into multiple groups. The domains with their group information are then stored in the database. We cannot deliver the IP/DNS files due to their size.
The find step does:
- Create a trie out of all location information
- Match the domain label stored in the pre-processing step against this trie
- Store the resulting location hints in the database
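The find step above can be illustrated with a minimal sketch. It is not the code from hloc.scripts.find; the function names `build_trie` and `find_hints`, the nested-dict trie layout, and the example codes are assumptions made for illustration only.

```python
# Hypothetical sketch of trie-based matching; not the actual HLOC implementation.
END = None  # sentinel key marking the end of a code; never equal to a character


def build_trie(codes):
    """Build a nested-dict trie from an iterable of (code, location_id) pairs."""
    root = {}
    for code, location_id in codes:
        node = root
        for char in code.lower():
            node = node.setdefault(char, {})
        # Several locations may share the same code, so keep a list.
        node.setdefault(END, []).append(location_id)
    return root


def find_hints(label, trie):
    """Return (code, location_id) for every code that occurs anywhere in label."""
    label = label.lower()
    hints = []
    for start in range(len(label)):
        node = trie
        for end in range(start, len(label)):
            node = node.get(label[end])
            if node is None:
                break  # no code continues with this character
            for location_id in node.get(END, []):
                hints.append((label[start:end + 1], location_id))
    return hints


# Example codes and ids are made up for this sketch.
trie = build_trie([("muc", 1), ("fra", 2), ("munich", 1)])
print(find_hints("core1-muc", trie))
```

Walking the trie once per start position checks every substring of a label against all codes at the same time, instead of testing each code separately.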
The measure step can:
- Read all domains in random order
- Conduct measurements with various frameworks for all location hints of a domain
- Store all measurement results in the database
We produce a daily export with all location hints for all domains, together with the minimal RTT measurement to the corresponding IP address. The file contains the following columns (the CSV column title is in brackets):
- Domain id (domain_id): The id of the domain entry in the domains table of the database.
- Domain name (domain_name): The full domain name.
- IP address (ip_address): The IP address of the domain name from the time of the rDNS export.
- Location hint id (location_hints_id): The id of the location hint in the location_hints table of the database.
- Location code (hint_location_code): The location code which has been found in the domain name and which is checked.
- Location code type (location_hint_type): The type of the location code (e.g. iata, geonames, ...).
- Hint Location id (hint_location_id): The id of the location, corresponding to the location code, in the locations table of the database.
- Hint location latitude (hint_location_lat): The latitude of the hint location.
- Hint location longitude (hint_location_lon): The longitude of the hint location.
- Probe id (probe_id): The id of the measurement probe in the probes table of the database. The probe information is for the measurement with the global minimum RTT to the IP address.
- Probe location latitude (probe_location_lat): The latitude of the probe location.
- Probe location longitude (probe_location_lon): The longitude of the probe location.
- Measurement result id (measurement_results_id): The id of the measurement result in the measurement_results table of the database. This measurement contains the current global minimum RTT to the destination IP address.
- RIPE Atlas measurement id (ripe_measurement_id): If the fastest measurement was from RIPE Atlas, this column contains the RIPE Atlas measurement id; otherwise it is empty.
- Measurement timestamp (measurement_timestamp): The timestamp of the measurement as a UNIX timestamp in UTC.
- Measurement type (measurement_type): The source of the measurement, e.g. RIPE Atlas, Caida, ...
- Is from traceroute (from_traceroute): A boolean value indicating if this measurement result was extracted from a traceroute measurement.
- Minimal RTT (min_rtt_ms): The RTT of the measurement in milliseconds.
- Distance (distance_km): The distance between the probe location and the hint location (the suspected location). This is relevant to determine the maximal error and whether a hint can be considered valid.
- Is the hint possible (possible): A boolean value indicating if the location hint is theoretically still possible considering this global minimal RTT.
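The relation between the min_rtt_ms, distance_km and possible columns can be sketched as follows. This is only an illustration under stated assumptions: the great-circle (haversine) distance and the propagation speed of about 200 km per millisecond (roughly two thirds of the speed of light, a common approximation for fibre) are assumptions, and the threshold HLOC actually applies may differ. The names `distance_km` and `is_possible` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0
# Assumption: signals travel at most about 200 km per millisecond in fibre.
MAX_SPEED_KM_PER_MS = 200.0


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_possible(min_rtt_ms, dist_km, speed=MAX_SPEED_KM_PER_MS):
    """A hint is possible if the signal could cover the distance in half the RTT."""
    # The RTT covers the way to the target and back, so only half of it
    # is available for the one-way distance.
    return dist_km <= (min_rtt_ms / 2.0) * speed


# Probe in Munich, hint location in Frankfurt (approximate coordinates).
d = distance_km(48.1351, 11.5820, 50.1109, 8.6821)
print(is_possible(5.0, d), is_possible(2.0, d))
```

A hint that fails this check can be discarded regardless of later measurements, because a lower RTT only shrinks the radius in which the target can lie.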
- Postgres v10.0 or newer: HLOC 2.0 uses a Postgres database to store the collected information
- Our current code often assumes a user named "hloc" with the password "hloc2017". Most scripts have parameters to set these but our current recommendation is to use the hloc user.
- To use parallel queries (these speed up the data export significantly), execute
ALTER SYSTEM SET max_parallel_workers_per_gather TO #numCPUs;
in your Postgres console, replacing #numCPUs with the number of CPUs
- Python v3.4.2: We tested everything on 3.4.2, but newer versions should also work
- Install all Python dependencies using
pip install -r requirements.txt
- All shell scripts were only tested on a standard Debian bash
- First you need to download several location code sources (we used the location-data directory as the collection point for these):
- location-data already contains our self-created list of IATA metropolitan codes.
- Unpack the offline-pages.tar.xz archive. It contains a scraped list of pages from www.world-airport-codes.com. Our HTML parser is outdated for their current data format. Therefore, we use this stored version.
- Get the LOCODE files from UNECE. We only need the three CodeListPart files.
Unfortunately, we could not find a publicly available CLLI list. Thanks to @WesWrench (#7) there is now a source available, which needs some tweaking. This list contains the full CLLI codes, but our framework needs only the location part, i.e. the first 6 characters. Therefore, you first need to preprocess this list and convert it to the following format:
CLLI code<tab (\t)>latitude<tab (\t)>longitude
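A minimal sketch of this conversion is shown below. The layout of the source list is not described here, so the sketch assumes the entries have already been read into (full CLLI code, latitude, longitude) tuples. The function name `to_hloc_clli_lines`, the choice to keep only the first entry per location part, and the example values are all assumptions.

```python
def to_hloc_clli_lines(entries):
    """Convert (full_clli, latitude, longitude) tuples to tab-separated lines.

    Only the first 6 characters (the location part) of each code are kept.
    Assumption: if several full codes share a location part, the first
    entry wins and later ones are skipped.
    """
    seen = set()
    lines = []
    for full_code, latitude, longitude in entries:
        location_part = full_code[:6]
        if len(location_part) < 6 or location_part in seen:
            continue  # too short to contain a location part, or a duplicate
        seen.add(location_part)
        lines.append("{}\t{}\t{}".format(location_part, latitude, longitude))
    return lines


# Made-up placeholder entries, not real CLLI codes or coordinates.
example = [
    ("AAAABBCC123", 10.0, 20.0),
    ("AAAABBDD456", 11.0, 21.0),
    ("XXXXYY00000", -5.5, 30.25),
]
print("\n".join(to_hloc_clli_lines(example)))
```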
Finally, you can execute the codes parsing script. An example of what this could look like can be seen in the shell script example-initial-db-setup.
For more information on the different parameters, please read the help output of the script.
- To preprocess the list of domains to geolocate you only need two files:
- The valid TLDs file (get it from IANA)
- The domain list file in the format: IP,DOMAIN (without a space in between)
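How the two input files fit together can be sketched as follows. This is not the preprocessing script itself: the helper names `load_tlds` and `parse_domain_line` are hypothetical, and the sketch assumes that the TLD file lists one TLD per line with comment lines starting with `#`.

```python
def load_tlds(lines):
    """Collect TLDs from an iterable of lines, skipping comments and blanks."""
    tlds = set()
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            tlds.add(entry.lower())
    return tlds


def parse_domain_line(line, valid_tlds):
    """Split an 'IP,DOMAIN' line; return (ip, domain) or None if unusable."""
    ip_address, separator, domain = line.strip().partition(",")
    if not separator or not ip_address or not domain:
        return None  # the line does not follow the IP,DOMAIN format
    domain = domain.rstrip(".").lower()
    tld = domain.rsplit(".", 1)[-1]
    if tld not in valid_tlds:
        return None  # unknown TLD, so the domain is filtered out
    return ip_address, domain


# Example values use documentation addresses and reserved names.
valid = load_tlds(["# example comment line", "COM", "NET", ""])
print(parse_domain_line("192.0.2.1,host.example.com\n", valid))
```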
ATTENTION
This script assumes the tables domains and domain_labels are empty! This is due to a drawback in our current implementation.
The second command executed in example-initial-db-setup shows how to preprocess the domains.
When these two steps are finished you need to load our SQL functions from db-functions.sql with:
psql -d <database-name> -f db-functions.sql
- No additional sources are needed here. Our blacklists can be found in the blacklists directory.
Try to execute python -m hloc.scripts.find -p <nr_cores> -c blacklists/code.blacklist.txt -f blacklists/word.blacklist.txt -s blacklists/special.blacklist.txt -dbn <database_name> -l <log_file_name>
Before executing the validate script you need to create the folder /var/cache/hloc if you do not want to run the script as root.
If you want to perform active measurements on the RIPE Atlas platform you need an account and credits to do that.
Use option -o to validate against the currently available measurements for the IP address.
The example-validation.sh script provides easier access to the script with useful prefilled parameters.
Check and adapt these accordingly.