simple-osmWd2csv

Simplest and stupid algoritm (v0.0.2) and some evolution to convert big OSM files into simple CSV file for Wikidata-tag analysis.

This project have two distinct parts:

XML parser: converts a big OSM XML file into a lean tabular format, the intermediary CSV "raw" format. There are other tolls (eg. OPL), this one is suitable for OSM beginners, for Unix agile pipeline lovers... And for those who do not trust anyone: to fact checking.
The pipeline of step1-osmWd2csv_pre.sh and step2-osmWd2csv.php.
Final parser and SQL database. It is not so simple, have some mix of libraries and medium-complexity SQL processing. Produce all final products, including the exportation to the proposed data interchange format, .wdDump.csv.

The target of the "OSM Wikidata-elements CSV file" is to feed PostgreSQL (or SQLite) with complete and reliable data. See eg. Semantic-bridge OSM-Wikidata project.

Basic results and aims

This project also define two simple data-interchange formats for tests and benchmark of OSM-Wikidata tools.

.wdDump.csv format: is the best way to dump, analyse or interchange OSM-Wikidata "bigdata". Is a standard CSV file with columns
<osm_type,osm_id,wd_ids,wd_member_ids>.

The first field, osm_type, is the OSM element type, abbreviated as a letter ("n" for node, "w" for way and "r" for relation); the second its real ID, in the date of the capture; the third its Wikidata ID (a qid without the "Q" prefix), sometimes no ID, sometimes more tham one ID; and the fourth, wd_member_ids, is a set of space-separed Wikidata IDs of member-elements, that eventually can be assumed as self (parent element). The .wdDump.csv is the final format of the parsing process described also in this repo.
Consumes ~0.01% of the XML (.osm) format, of the wikidata-filtered file, and its zipped file ~0.4% of the .osm.pbf format — see the summary of file sizes. In CPU time to process or analyse is also a big gain... And for data-analists is a standard source of the truth at any SQL tool. Example:

osm_type	osm_id	wd_id	geohash	wd_member_ids
n	32011242	Q49660	`u0qu3jyt`
w	28712148	Q1792561	`u0qu0tmz`
w	610491098	Q18482699	`6vjwdr`	Q18482699:2
r	1988261		`u1j1`	Q315548:49
r	51701	Q39	`u0m`	Q11925:1 Q12746:1
r	3366718	Q386331	`6ur`	Q386331:15

The same table in SQL can be converted in JSON or JSONb with the following structure:

TABLE  wdOsm.raw (
   osm_type char NOT NULL, -- reduced to n/w/r
   osm_id bigint NOT NULL,
   wd_id bigint,  -- "Q" removed
   geohash bigint, -- must convert to base32
   member_wd_ids JSONb,  -- e.g. {"315548":49}
  -- option bigint[[key,value]] = array[array[315548,49],array[392600,2]]
  -- and bigint2d_find(a,needle)
   UNIQUE(osm_type,osm_id)
);

.wdDump.raw.csv format: a standard CSV file with columns <osm_type,osm_id,otherIDs>.
Is a intermediary format, with a lot of redundant lines and IDs. Is easy to interchange or feed SQL for final parsing.
The first field, osm_type, is the OSM element type, the second its ID, and otherIDs a set of space-separed IDs (node-IDs, way-IDs or Wikidata-IDs).
Consumes ~10% of the XML (.osm) format.

So, the following algoritms are also reference specifications of these file formats:

XML pre-parser, osmWd2csv_pre.sh: a simple Unix pipeline where you can easily change or debug the intermediary XML format.
XML parser, osmWd2csv.php: the simplest and stupid algoritm to convert pre-processed OSM into *_wdDump.raw.csv format. Good for fact checking (auting process) or statrting point for adaptations.
The ideal algoritm will be C or C++ doing the job of both, pre-parser and parser. Will be easy also to include in the parse the "SQL processing", because it only checks IDs and refs.
SQL processing osmWd_raw-transform.sql: transforms *_wdDump.raw.csv into *_wdDump.csv. Is not necessary to be a SQL algoritm, but SQL is the simplest and most popular standard way to specify procedures and test.

Presentation

There are a lot of tools, ranging from the simplest grep to counting tags (see also example), to complicated (but very useful) ones:

Direct tools:
- OPL File Format conversors;
- Osmconvert and its Writing CSV tools.
Indirect tools:
- OSM to GeoJSON complex convertions used by taginfo and others;
- osm2pgsql to feed PostgreSQL;
- Osmosis also to feed PostgreSQL.

But the problem to be solved here was to track back only one tag (in this case the wikidata=* tag)... And by a "dummy user", that never used OPL, Osmconvert, etc. but knows something about XML and CSV formats. The only external OSM-tool here will be Osmium, to convert PBF format.

The first step in the OSM's Bigdata analysis is to get the file in .osm (that is a XML) from "the planet"... For beginners, using Linux:

Install. apt install osmium-tool; osmium --version, will show "osmium version 1.3.0" for UBUNTU16 and "osmium version 1.7.1 / libosmium version 2.13.1" for UBUNTU18.
Select here the region, eg. wget https://download.geofabrik.de/europe/liechtenstein-latest.osm.pbf, showing its with MD5 sum, so compare it with md5sum liechtenstein-latest.osm.pbf.
... Or big-data as Europe wget http://download.geofabrik.de/europe-latest.osm.pbf
See tests and instructions at Osmium-tool/manual. The main is osmium fileinfo -e liechtenstein-latest.osm.pbf to see all details.

Prepare OSM file

We need a filter (only elements with Wikidata tag) and a convertion (PBF to OSM).

osmium tags-filter liechtenstein-latest.osm.pbf nwr/wikidata -o wikidata-liechtenstein.osm.pbf
osmium cat -f osm wikidata-liechtenstein.osm.pbf -o wikidata-liechtenstein.osm

So the last output is our XML working-file, in this case wikidata-liechtenstein.osm.

Use

To run samples in your database, the simplest is to sh src/make.sh.

There are some examples at example.md, starting with the .osm.pbf files.

cat wikidata-liechtenstein.osm | php src/distincTags.php is a fact-check for osmium fileinfo -e and to confirm the DTD of the OSM file, as well its IDs.
sh src/osmWd2csv_pre.sh < wikidata-liechtenstein.osm | more will show the pre-parsed XML.
sh src/osmWd2csv_pre.sh < wikidata-liechtenstein.osm | php src/osmWd2csv.php > LI_wdDump.csv the useful thing!
... for the complete process, with SQL outputing LI_wdDump.csv, see "example 3 - parsing".

License

Contents, data and software of this project are dedicated to

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
assets		assets
data		data
samples		samples
src		src
.gitignore		.gitignore
README.md		README.md
example.md		example.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

simple-osmWd2csv

Basic results and aims

Presentation

Prepare OSM file

Use

License

About

Releases 2

Packages

Languages

OSMBrasil/simple-osmWd2csv

Folders and files

Latest commit

History

Repository files navigation

simple-osmWd2csv

Basic results and aims

Presentation

Prepare OSM file

Use

License

About

Topics

Resources

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

Packages