Skip to content

j-rausch/wtf_wikipedia

 
 

Repository files navigation

wikipedia markup parser
by Spencer Kelly and contributors

wtf_wikipedia turns wikipedia's markup language into JSON,
so getting data from wikipedia is easier.

🏠 Try to have a good time. 🛀

seriously,
this is among the most-curious data formats you can find.
(then we buried our human-record in it)

Consider:

wtf_wikipedia supports many recursive shenanigans, depreciated and obscure template variants, and illicit 'wiki-esque' shorthands.

image

It will try it's best, and fail in reasonable ways.

→ building your own parser is never a good idea →

← but this library aims to be a straight-forward way to get data out of wikipedia

... so don't be mad at me, be mad at this.

well ok then,

npm install wtf_wikipedia

var wtf = require('wtf_wikipedia');

wtf.fetch('Whistling').then(doc => {

  doc.categories();
  //['Oral communication', 'Vocal music', 'Vocal skills']

  doc.sections('As communication').text();
  // 'A traditional whistled language named Silbo Gomero..'

  doc.images(0).thumb();
  // 'https://upload.wikimedia.org..../300px-Duveneck_Whistling_Boy.jpg'

  doc.sections('See Also').links().map(l => l.page)
  //['Slide whistle', 'Hand flute', 'Bird vocalization'...]
});

on the client-side:

<script src="https://unpkg.com/wtf_wikipedia@latest/builds/wtf_wikipedia.min.js"></script>
<script>
  //(follows redirect)
  wtf.fetch('On a Friday', 'en', function(err, doc) {
    var data = doc.infobox(0).data
    data['current_members'].links().map(l => l.page);
    //['Thom Yorke', 'Jonny Greenwood', 'Colin Greenwood'...]
  });
</script>

What it does:

  • Detects and parses redirects and disambiguation pages
  • Parse infoboxes into a formatted key-value object
  • Handles recursive templates and links- like [[.. [[...]] ]]
  • Per-sentence plaintext and link resolution
  • Parse and format internal links
  • creates image thumbnail urls from File:XYZ.png filenames
  • Properly resolve {{CURRENTMONTH}} and {{CONVERT ..}} type templates
  • Parse images, headings, and categories
  • converts 'DMS-formatted' (59°12'7.7"N) geo-coordinates to lat/lng
  • parses citation metadata
  • Eliminate xml, latex, css, and table-sorting cruft

But what about...

Parsoid:

Wikimedia's Parsoid javascript parser is the official wikiscript parser. It reliably turns wikiscript into HTML, but not valid XML.

To use it for data-mining, you'll need to:

parsoid(wikiText) -> [headless/pretend-DOM] -> screen-scraping

which is fine,

but getting structured data this way (say, sentences or infobox values), is still a complex + weird process. Arguably, you're not any closer than you were with wikitext. This library has lovingly ❤️ borrowed a lot of code and data from the parsoid project, and thanks its contributors.

Full data-dumps:

wtf_wikipedia was built to work with dumpster-dive, which lets you parse a whole wikipedia dump on a laptop in a couple hours. It's definitely the way to go, instead of fetching many pages off the api.

API

const wtf = require('wtf_wikipedia')
//parse a page
var doc = wtf(wikiText, [options])

//fetch & parse a page - wtf.fetch(title, [lang_or_wikiid], [options], [callback])
(async () => {
  var doc = await wtf.fetch('Toronto');
  console.log(doc.text())
})();

//(callback format works too)
wtf.fetch(64646, 'en', (err, doc) => {
  console.log(doc.categories());
});

Main parts:

  • .sections()       -   ==these things==
  • .sentences()
  • .links()
  • .tables()
  • .lists()
  • .images()
  • .templates()     -  {{these|things}}
  • .categories()
  • .citations()     -   <ref>these guys</ref>
  • .infoboxes()
  • .coordinates()

outputs:

  • .json()   -     handy, workable data
  • .text()   -     reader-focused plaintext
  • .html()
  • .markdown()
  • .latex()   -     (ftw)
fancy-times:
  • .isRedirect()     -   boolean
  • .isDisambiguation()     -   boolean
  • .title()       -      guess the title of this page

Examples

wtf(wikiText)

flip your wikimedia markup into a Document object

import wtf from 'wtf_wikipedia'
wtf("==In Popular Culture==\n*harry potter's wand\n* the simpsons fence");
// Document {plaintext(), html(), latex()...}

wtf.fetch(title, [lang_or_wikiid], [options], [callback])

retrieves raw contents of a mediawiki article from the wikipedia action API.

This method supports the errback callback form, or returns a Promise if one is missing.

to call non-english wikipedia apis, add it's language-name as the second parameter

wtf.fetch('Toronto', 'de', function(err, doc) {
  doc.plaintext();
  //Toronto ist mit 2,6 Millionen Einwohnern..
});

you may also pass the wikipedia page id as parameter instead of the page title:

wtf.fetch(64646, 'de').then(console.log).catch(console.log)

the fetch method follows redirects.

doc.plaintext()

returns only nice text of the article

var wiki =
  "[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall.<ref>{{cite web|blah}}</ref>";
var text = wtf(wiki).text();
//"Boston's baseball field has a 37ft wall."

Section traversal:

wtf(page).sections(1).children()
wtf(page).sections('see also').remove()

Sentence data:

s = wtf(page).sentences(4)
s.links()
s.bolds()
s.italics()
s.dates() //structured date templates

Images

img = wtf(page).images(0)
img.url()     // the full-size wikimedia-hosted url
img.thumnail() // 300px, by default
img.format()  // jpg, png, ..
img.exists()  // HEAD req to see if the file is alive

CLI

if you're scripting this from the shell, or from another language, install with a -g, and then run:

$ wtf_wikipedia George Clooney --plaintext
# George Timothy Clooney (born May 6, 1961) is an American actor ...

$ wtf_wikipedia Toronto Blue Jays --json
# {text:[...], infobox:{}, categories:[...], images:[] }

Good practice:

The wikipedia api is pretty welcoming though recommends three things, if you're going to hit it heavily -

  • 1️⃣ pass a Api-User-Agent as something so they can use to easily throttle bad scripts
  • 2️⃣ bundle multiple pages into one request as an array
  • 3️⃣ run it serially, or at least, slowly.
wtf.fetch(['Royal Cinema', 'Aldous Huxley'], 'en', {
  'Api-User-Agent': '[email protected]'
}).then((docList) => {
  let allLinks = docList.map(doc => doc.links());
  console.log(allLinks);
});

Contributing

Join in! - projects like these are only done with many-hands, and we try to be friendly and easy.

See also:

Thank you to the cross-fetch and jshashes libraries.

MIT

Packages

No packages published

Languages

  • JavaScript 100.0%