hansard-speakers is a data processing pipeline for disambiguating speaker names in the 19th-century British Parliamentary debates, also known as Hansard. The final dataset produced by this pipeline can be downloaded here (coming soon). An article describing our disambiguation efforts can be read here (coming soon). You can also view the code that scrapes and formats our version of the Hansard corpus from the original XML files hosted by Historic Hansard.
- Clone the repo and cd into hansard-speakers.
- Start the disambiguation process.
  Over terminal:
      cythonize -3 -i util/*.pyx
      python3 run.py --cores <n>
  where <n> is the number of cores to use and must be at least three.

  Over SLURM:
      sbatch job.sbatch
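
The three-core minimum for --cores can be checked up front, before any work is dispatched. The following is a minimal sketch of one way to do that with argparse. It is not taken from run.py; the names MIN_CORES, cores_type, and build_parser are hypothetical and used only for illustration.

```python
import argparse

# Assumption: the minimum mirrors the "at least three cores" rule stated above.
MIN_CORES = 3


def cores_type(value):
    """Parse a --cores value and reject anything below MIN_CORES."""
    n = int(value)  # a non-integer raises ValueError, which argparse reports
    if n < MIN_CORES:
        raise argparse.ArgumentTypeError(
            f"--cores must be at least {MIN_CORES}, got {n}"
        )
    return n


def build_parser():
    """Build a parser with a single required --cores option (illustrative only)."""
    parser = argparse.ArgumentParser(description="Illustrative --cores validation")
    parser.add_argument("--cores", type=cores_type, required=True)
    return parser
```

With this sketch, build_parser().parse_args(["--cores", "4"]) yields a namespace whose cores attribute is 4, while a value below three makes argparse print an error and exit.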
Our disambiguation process uses lower-level, Cython-compiled code for computational speed and efficiency. To run hansard-speakers, users must have Cython installed as well as Python.
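
To confirm that Cython is importable in the current environment before running the cythonize step, a short check such as the one below may help. This is a sketch for convenience and not part of the repository; the function name missing_prerequisites is hypothetical.

```python
import importlib.util


def missing_prerequisites(modules=("Cython",)):
    """Return the names in `modules` that cannot be found in this environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in modules if importlib.util.find_spec(name) is None]
```

Calling missing_prerequisites() returns an empty list when Cython is installed, and a list containing "Cython" otherwise.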