
A data processing pipeline to disambiguate speakers in the 19th-century British Parliamentary debates.

stephbuon/hansard-speakers

Hansard Speaker Name Disambiguation

hansard-speakers is a data processing pipeline for disambiguating speaker names in the 19th-century British Parliamentary debates, also known as Hansard. The final dataset produced by this pipeline can be downloaded here (coming soon). An article describing our disambiguation efforts can be read here (coming soon). You can view the code to scrape and format our version of the Hansard corpus from the original XML files hosted by Historic Hansard.

Steps:

  1. Clone the repo and cd into hansard-speakers

  2. Start the disambiguation process.

    Over terminal: run cythonize -3 -i util/*.pyx to compile the Cython modules in place, then run python3 run.py --cores <n>, where "n" is the number of cores to use and must be at least three.

    Over SLURM: sbatch job.sbatch
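
The terminal commands in step 2 can also be assembled from a script. The following is a minimal sketch, not part of the repository: the helper name build_commands and its pyx_pattern parameter are hypothetical, and the only details taken from this README are the two commands and the three-core minimum. Because a shell normally expands util/*.pyx, the sketch expands the pattern itself.

```python
import glob
import os


def build_commands(cores, pyx_pattern="util/*.pyx"):
    """Return the two terminal commands from step 2 as argument lists.

    Hypothetical helper for illustration; it only builds the commands.
    """
    if cores < 3:
        raise ValueError("--cores must be at least 3")
    # A shell would expand util/*.pyx; without a shell we expand it here.
    pyx_files = sorted(glob.glob(pyx_pattern))
    compile_cmd = ["cythonize", "-3", "-i", *pyx_files]
    run_cmd = ["python3", "run.py", "--cores", str(cores)]
    return compile_cmd, run_cmd


if __name__ == "__main__":
    # Assumption: request every reported core, but never fewer than three.
    cores = max(3, os.cpu_count() or 3)
    for cmd in build_commands(cores):
        print(" ".join(cmd))
```

Run from the root of the cloned repository, each returned list could be passed to subprocess.run(cmd, check=True) so that a failed compile stops the run before run.py starts.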

Requirements:

Our disambiguation process uses lower-level (Cython-compiled) processing for computational speed and efficiency. To run hansard-speakers, users must have both Python 3 and Cython installed.
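
A quick way to confirm these requirements before starting is a small preflight check. This is a hedged sketch rather than anything shipped with hansard-speakers: the function name preflight and the constant MIN_CORES are assumptions introduced here, and the three-core figure comes from the run instructions above.

```python
import importlib.util
import os

MIN_CORES = 3  # minimum stated in the run instructions


def preflight(available_cores=None):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # find_spec returns None when the Cython package cannot be imported.
    if importlib.util.find_spec("Cython") is None:
        problems.append("Cython is not installed (try: pip install Cython)")
    # os.cpu_count() may return None, so fall back to 1 in that case.
    if available_cores is None:
        available_cores = os.cpu_count() or 1
    if available_cores < MIN_CORES:
        problems.append(
            f"only {available_cores} core(s) available; "
            f"at least {MIN_CORES} required"
        )
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("Problem:", problem)
```

On a SLURM cluster the core count reported by os.cpu_count() may differ from what the job is allocated, so passing the allocated number explicitly as available_cores is the safer choice there.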
