ProVe is a system designed to automatically verify claims and references in Wikidata. It extracts claims from Wikidata entities, fetches the referenced URLs, processes the HTML content, and uses NLP models to determine whether the claims are supported by the referenced content.
The RQV system consists of several key components:
- Data Collection and Processing:
  - WikidataParser: Extracts claims and URLs from Wikidata based on a QID (item identifier)
  - HTMLFetcher: Collects HTML content from reference URLs
  - HTMLSentenceProcessor: Converts HTML to sentences for analysis
- Evidence Selection and Verification:
  - EvidenceSelector: Selects relevant sentences as evidence
  - ClaimEntailmentChecker: Verifies the entailment relationship between claims and evidence
- NLP Models:
  - TextualEntailmentModule: Checks textual entailment relationships
  - SentenceRetrievalModule: Retrieves relevant sentences
  - VerbModule: Handles verbalization processing
- Data Storage:
  - MongoDB: Stores HTML content, entailment results, parser statistics, and status information
  - SQLite: Stores verification results for API access
- Service Structure:
  - ProVe_main_service.py: Main service logic
  - ProVe_main_process.py: Entity processing logic
  - background_processing.py: Background processing tasks
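To make the role of the SQLite store more concrete, here is a minimal sketch of how verification results could be written and read back for an API. The table name, column names, and the label values ("SUPPORTS", "REFUTES") are assumptions chosen for illustration; they are not taken from ProVe's actual schema.

```python
import sqlite3


def create_results_table(conn):
    # Hypothetical schema: one row per (item, property, reference URL)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS verification_results (
            qid TEXT NOT NULL,
            property_id TEXT NOT NULL,
            reference_url TEXT NOT NULL,
            label TEXT NOT NULL,
            PRIMARY KEY (qid, property_id, reference_url)
        )
        """
    )


def save_result(conn, qid, property_id, reference_url, label):
    # INSERT OR REPLACE keeps only the latest verdict for each claim/reference pair
    conn.execute(
        "INSERT OR REPLACE INTO verification_results VALUES (?, ?, ?, ?)",
        (qid, property_id, reference_url, label),
    )
    conn.commit()


def results_for(conn, qid):
    # Return every stored verdict for one Wikidata item
    rows = conn.execute(
        "SELECT property_id, reference_url, label "
        "FROM verification_results WHERE qid = ? ORDER BY property_id",
        (qid,),
    )
    return rows.fetchall()


# An in-memory database keeps the example self-contained
conn = sqlite3.connect(":memory:")
create_results_table(conn)
save_result(conn, "Q76", "P39", "https://example.org/a", "SUPPORTS")
save_result(conn, "Q76", "P39", "https://example.org/a", "REFUTES")  # re-run overwrites
print(results_for(conn, "Q76"))
```

Keying rows on the item, property, and reference URL means that re-processing an entity replaces stale verdicts rather than accumulating duplicates, which is one plausible design for a read-mostly API store.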
Install the required dependencies:
pip install -r requirements.txt
The 'base' folder contains essential NLP models for the RQV tool, including pre-trained & fine-tuned BERT, T5, and related parsers and NLP models.
Download from:
https://emckclac-my.sharepoint.com/:f:/r/personal/k2369089_kcl_ac_uk/Documents/base?csf=1&web=1&e=TBo3nE
Place the downloaded 'base' folder in the project root directory.
Review and modify the config.yaml file to adjust database settings, HTML fetching parameters, and evidence selection thresholds.
To process a single entity from Python:
from ProVe_main_process import initialize_models, process_entity
# Initialize models
models = initialize_models()
# Process entity by QID
qid = 'Q76'  # Example: Barack Obama
html_df, entailment_results, parser_stats = process_entity(qid, models)
The main service can be started by running:
python ProVe_main_service.py
This will start the MongoDB handler and schedule background processing tasks.
The system can automatically process:
- Top viewed Wikidata items
- Items from a pagepile list
- Random QIDs
The config.yaml file contains important settings:
- Database configurations
- Algorithm version
- HTML fetching parameters (batch size, delay, timeout)
- Text processing settings
- Evidence selection parameters
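As a rough illustration of how such settings might be consumed in code, the sketch below maps a nested dictionary (the shape a YAML parser would typically return) onto typed configuration objects. Every key name and default value here is hypothetical; consult the actual config.yaml for the real keys.

```python
from dataclasses import dataclass, field


@dataclass
class HtmlFetchingConfig:
    # Hypothetical keys mirroring "batch size, delay, timeout"
    batch_size: int = 20
    delay_seconds: float = 1.0
    timeout_seconds: float = 10.0


@dataclass
class AppConfig:
    algorithm_version: str = "1.0"  # placeholder value, not ProVe's real version
    html_fetching: HtmlFetchingConfig = field(default_factory=HtmlFetchingConfig)


def config_from_dict(raw):
    """Build an AppConfig from a nested dict such as a parsed YAML file."""
    fetch_raw = raw.get("html_fetching", {})
    return AppConfig(
        algorithm_version=str(raw.get("algorithm_version", "1.0")),
        # Unknown keys raise TypeError, which surfaces typos in the config early
        html_fetching=HtmlFetchingConfig(**fetch_raw),
    )


cfg = config_from_dict({"html_fetching": {"batch_size": 5, "timeout_seconds": 30}})
print(cfg)
```

Settings that are omitted fall back to the dataclass defaults, so a partial configuration file stays valid.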
The processing workflow runs as follows:
- A Wikidata QID is provided to the system
- The system extracts claims and reference URLs from the entity
- HTML content is fetched from the reference URLs
- The HTML is processed into sentences
- Relevant sentences are selected as evidence
- NLP models verify if the evidence supports the claims
- Results are stored in the database
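The middle steps of this flow (HTML to sentences, evidence selection, verification) can be sketched with the standard library alone. This is a toy stand-in, not ProVe's implementation: token overlap replaces the sentence retrieval model, and the overlap_ratio_entails placeholder replaces the textual entailment model. All function names below are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_sentences(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Naive split on sentence-ending punctuation followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_evidence(claim, sentences, top_k=2):
    # Rank sentences by how many tokens they share with the claim
    claim_tokens = _tokens(claim)
    scored = []
    for sentence in sentences:
        overlap = len(claim_tokens & _tokens(sentence))
        if overlap:
            scored.append((overlap, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:top_k]]


def overlap_ratio_entails(evidence, claim, threshold=0.5):
    # Placeholder only: a real system would call a textual entailment model here
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return False
    return len(claim_tokens & _tokens(evidence)) / len(claim_tokens) >= threshold


def verify(claim, evidence, entails):
    # A claim counts as supported if any selected sentence entails it
    return any(entails(sentence, claim) for sentence in evidence)


html = (
    "<html><head><style>p{}</style></head><body>"
    "<p>Barack Obama was born in Honolulu. He served as president.</p>"
    "<script>var x = 1;</script></body></html>"
)
claim = "Barack Obama place of birth Honolulu"
sentences = html_to_sentences(html)
evidence = select_evidence(claim, sentences)
print(evidence, verify(claim, evidence, overlap_ratio_entails))
```

Passing the entailment check in as a callable keeps the pipeline shape independent of the model behind it, so the placeholder could be swapped for a real model without touching the other steps.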