Wikipedia to ElasticSearch

This project generates an ElasticSearch index or a JSON file index from Wikipedia XML dumps. The process analyzes, extracts, and stores Wikipedia article text along with several distinct Wikipedia attributes and relations (detailed below).

Project Features:

*Relation integrity has been tested only for English. Other languages might require some adjustments.



Introduction

In addition to exporting the clean article text from Wikipedia to Elastic or JSON files, this project offers the ability to extract several distinct Wikipedia attributes and relations (listed below).

Special Wikipedia Resources and Attributes

Three different types of Wikipedia pages are used {Redirect/Disambiguation/Title} in order to extract six distinct semantic features for tasks such as identifying semantic relations, entity linking, cross-document coreference, knowledge graphs, summarization, and others.

Supported Relations Types

Listed below are the Wikidata properties that can extend the above attributes by running the Wikidata post-process described below.



Prerequisites

  • Java 11
  • Wikipedia xml.bz2 dump file in the required language (for example, the latest en XML dump)
  • Optional: ElasticSearch 7.17.4 (needed when exporting to an elastic index)
  • Optional: Wikidata json.bz2 dump file (latest JSON dump)

Configuration

Main Configuration File

conf.json is the main process configuration file (an illustrative example appears after the list below):

  • exportMethod - Whether to export to an Elastic index (set to elastic) or to JSON files (set to json_files)
  • extractRelationFields - When set to true, the relation fields (listed in relationTypes) will be extracted while processing the data (supported only for English Wikipedia)
  • wikipediaDump - Location of the downloaded Wikipedia .bz2 dump file
  • lang - Supported languages: {en (English), fr (French), es (Spanish), de (German), zh (Chinese)}
  • includeRawText - When set to true, the original Wikipedia article text (in the original wiki markup) will be included
  • includeParsedParagraphs - When set to true, a list of parsed Wikipedia article paragraphs, clean of any markup or HTML tags, will be included
  • relationTypes - ["Category", "Infobox", "Parenthesis", "PartName"]. To export these relations, the extractRelationFields configuration needs to be set to true (the full list of available relations is in /src/main/java/wiki/data/relations/RelationType.java)
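
For illustration only, a conf.json assembled from the fields above might look like this (all values are placeholders, not repository defaults; adjust paths and flags to your own setup):

{
  "exportMethod": "elastic",
  "extractRelationFields": true,
  "wikipediaDump": "dumps/enwiki-latest-pages-articles.xml.bz2",
  "lang": "en",
  "includeRawText": true,
  "includeParsedParagraphs": true,
  "relationTypes": ["Category", "Infobox", "Parenthesis", "PartName"]
}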

Json Export Configuration File

config/json_file_conf.json contains the configuration needed only when the exportMethod is set to json_files (an illustrative example appears after the list below):

  • outIndexDirectory - The folder where the exported files will be saved
  • pagesPerFile - How many pages to save per file (100,000 pages ≈ 0.5 GB)
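
As an illustration (the folder name and page count below are placeholders, not repository defaults), config/json_file_conf.json might look like this:

{
  "outIndexDirectory": "output/wiki_json",
  "pagesPerFile": 100000
}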

Elastic Configuration Files

Main Elastic Configuration File

config/elastic_conf.json - These configurations are needed only when the exportMethod is set to elastic (an illustrative example appears after the list below):

  • indexName - Set your desired ElasticSearch index name
  • docType - Set your desired ElasticSearch document type
  • insertBulkSize - Number of pages to bulk-insert to ElasticSearch on every iteration (1000 was found to give the best performance)
  • mapping - Elastic mapping file; should point to src/main/resources/mapping.json
  • setting - Elastic settings file; currently supports {en, fr, es, de, zh}
  • host - Elastic host
  • port - Elastic port
  • scheme - Elastic host scheme (default: http)
  • shards - Number of Elastic shards
  • replicas - Number of Elastic replicas
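
For illustration, a config/elastic_conf.json covering the fields above might look like this (values are placeholders; adjust host, port, and index settings to your own Elastic setup):

{
  "indexName": "enwiki_v3",
  "docType": "wikipage",
  "insertBulkSize": 1000,
  "mapping": "src/main/resources/mapping.json",
  "setting": "src/main/resources/en_map_settings.json",
  "host": "localhost",
  "port": 9200,
  "scheme": "http",
  "shards": 1,
  "replicas": 0
}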

Elastic Mapping File

src/main/resources/mapping.json - Elastic wiki index mapping (Should probably stay unchanged)
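
As a rough sketch of the kind of structure this mapping defines (the actual mapping.json in the repository is the source of truth; the field types below are assumptions for illustration), the title field carries sub-fields such as plain, keyword, and near_match, which the query examples in this document rely on:

{
  "properties": {
    "title": {
      "type": "text",
      "fields": {
        "plain": { "type": "text" },
        "keyword": { "type": "keyword" },
        "near_match": { "type": "text" }
      }
    }
  }
}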

Elastic Index Files

  • src/main/resources/{en,es,fr,de,zh}_map_settings.json - Elastic index settings (Should probably stay unchanged)
  • src/main/resources/lang/{en,es,fr,de,zh}.json - language-specific configuration for relation keyword translations
  • src/main/resources/stop_words/{en,es,fr,de,zh}.txt - language-specific stop-words list

Build, Run and Test

  • Make sure the Elastic process is running and active on your host (if running Elastic locally, the default address is http://localhost:9200/)

  • Checkout/Clone the repository

  • From command line navigate to project root directory and run:
    ./gradlew clean build -x test
    You should get a message saying: BUILD SUCCESSFUL in 7s

  • Extract the build zip file created at build/distributions/WikipediaToElastic-1.0.zip

  • Put the wiki xml.bz2 dump file (no need to extract the bz2 file!) in the dumps folder
    Recommendation: start with a small wiki dump and make sure you like what you get (or modify the configurations to meet your needs) before moving to a full-blown 15GB dump export.

  • Make sure conf.json configurations are set as expected

  • Make sure config folder configurations are set as expected

  • Run the process from command line:
    java -Xmx6000m -DentityExpansionLimit=2147480000 -DtotalEntitySizeLimit=2147480000 -Djdk.xml.totalEntitySizeLimit=2147480000 -jar build/distributions/WikipediaToElastic-1.0/WikipediaToElastic-1.0.jar

  • To test/query, you can run from terminal:
    curl -XGET 'http://localhost:9200/enwiki_v3/_search?pretty=true' -H 'Content-Type: application/json' -d '{"size": 5, "query": {"match_phrase": { "title.near_match": "Alan Turing"}}}'

  • This should return the Wikipedia page on Alan Turing


Integrating Wikidata Attributes

Running this process requires a Wikipedia index (generated by the above process).

Wikidata Main Configuration File (config/wikidata_conf.json)

Main configuration file for the Wikidata export process; currently supported only if Wikipedia was exported to an ElasticSearch index (an illustrative example appears after the list below).

  • indexName - Elasticsearch index to enhance with wikidata attributes
  • docType - Set your desired document type
  • insertBulkSize - Number of pages to bulk insert to elastic search every iteration
  • host - Elastic host
  • port - Elastic port
  • wikidataDump - Wikidata .bz2 downloaded dump file location
  • scheme - Elastic host schema
  • lang - Should match the Wikipedia index language
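
For illustration, a config/wikidata_conf.json covering the fields above might look like this (values are placeholders; they should match your Wikipedia index and local Elastic setup):

{
  "indexName": "enwiki_v3",
  "docType": "wikipage",
  "insertBulkSize": 1000,
  "host": "localhost",
  "port": 9200,
  "wikidataDump": "dumps/wikidata-latest-all.json.bz2",
  "scheme": "http",
  "lang": "en"
}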

Wikidata Running and Testing

  • Make sure the Elastic process is running and active on your host (if running Elastic locally, the default address is http://localhost:9200/)

  • Make sure the wikidata_conf.json configurations are set as expected

  • Run the process from command line:
    java -cp WikipediaToElastic-1.0.jar wiki.wikidata.WikiDataFeatToFile
    The process will read the full Wikidata dump, parse it, extract the relations, and merge them with the corresponding Wikipedia data in the search index. The process might take a while to finish.

  • To test/query, you can run from terminal:
    curl -XGET 'http://localhost:9200/enwiki_v3/_search?pretty=true' -H 'Content-Type: application/json' -d '{"size": 5, "query": {"match_phrase": { "title.near_match": "Alan Turing"}}}'

This should return the Wikipedia page on Alan Turing, including the new Wikidata relations.


Usage

Elastic Page Query

Once the process is complete, two main query options are available (for more details and title query options, see mapping.json); illustrative query bodies follow the list below:

  • title.plain - fuzzy search (sorted)
  • title.keyword - exact match
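
For illustration, assuming the index is named enwiki_v3, the two options translate to query bodies such as the following (standard ElasticSearch match and term queries; the exact behaviour depends on the analyzers defined in mapping.json):

Fuzzy title search (title.plain):

{
  "size": 5,
  "query": { "match": { "title.plain": "alan turing" } }
}

Exact title match (title.keyword):

{
  "size": 5,
  "query": { "term": { "title.keyword": "Alan Turing" } }
}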

Generated Elastic Page Example

Pages are created with the following structures (see also "Fields & Attributes" below for more details):

Page Example (Extracted from Wikipedia disambiguation page):

{
  "_index": "enwiki_v3",
  "_type": "wikipage",
  "_id": "40573",
  "_version": 1,
  "_score": 20.925367,
  "_source": {
    "title": "NLP",
    "text": "{{wiktionary|NLP}}\n\n'''NLP''' may refer to:\n\n; .....",
    "relations": {
      "isPartName": false,
      "isDisambiguation": true,
      "disambiguationLinks": [
        "Natural language programming",
        "New Labour",
        "National Library of the Philippines",
        "Neuro linguistic programming",
        "Natural language processing",
        "National Liberal Party",
        "Natural Law Party",
        "National Labour Party",
        "Normal link pulses",
        "New Labour Party"
      ],
      "categories": [
        "disambiguation"
      ],
      "infobox": "",
      "titleParenthesis": [],
      "partOf": [],
      "aliases": [
        "LmxM36.1060"
      ],
      "hasPart": [],
      "hasEffect": [],
      "hasCause": [],
      "hasImmediateCause": []
    }
  }
}

Page Example (Extracted from Wikipedia redirect page):

{
  "_index": "enwiki_v3",
  "_type": "wikipage",
  "_id": "2577248",
  "_version": 1,
  "_score": 20.925367,
  "_source": {
    "title": "Nlp",
    "text": "#REDIRECT",
    "redirectTitle": "NLP",
    "relations": {
      "isPartName": false,
      "isDisambiguation": false
    }
  }
}

Fields & Attributes

JSON field | Value | Comment
_id | Text | Wikipedia page id
_source.title | Text | Wikipedia page title
_source.text | Text (optional) | Wikipedia page text
_source.parsedParagraphs | List (optional) | Wikipedia article text, clean of HTML/markup, split into passages
_source.redirectTitle | Text (optional) | Wikipedia page redirect title
_source.relations.infobox | Text (optional) | The article infobox element
_source.relations.categories | List (optional) | Categories relation list
_source.relations.isDisambiguation | Bool (optional) | Is a Wikipedia disambiguation page
_source.relations.isPartName | Bool (optional) | Is a Wikipedia page-name description
_source.relations.titleParenthesis | List (optional) | List of disambiguation secondary links
_source.relations.aliases | List (optional) | Wikidata relation
_source.relations.partOf | List (optional) | Wikidata relation
_source.relations.hasPart | List (optional) | Wikidata relation
_source.relations.hasEffect | List (optional) | Wikidata relation
_source.relations.hasCause | List (optional) | Wikidata relation
_source.relations.hasImmediateCause | List (optional) | Wikidata relation