Hansard Speaker Name Disambiguation

hansard-speakers is a data processing pipeline for disambiguating speaker names in the 19th-century British Parliamentary debates, also known as Hansard. The final dataset produced by this pipeline can be downloaded here (coming soon). An article describing our disambiguation efforts can be read here (coming soon). You can view the code to scrape and format our version of the Hansard corpus from the original XML files hosted by Historic Hansard.

Steps:

1. Clone the repo and cd into hansard-speakers.
2. Start the disambiguation process, either over terminal or over SLURM.

Over terminal:

    cythonize -3 -i util/*.pyx
    python3 run.py --cores <n>

where <n> is the number of cores to use and must be at least three.

Over SLURM: sbatch job.sbatch
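The three-core minimum could be enforced when the command line is parsed. The following is a minimal sketch and not the actual contents of run.py: only the `--cores` flag comes from the command above, while the names `MIN_CORES`, `min_three_cores`, and `build_parser` are hypothetical and chosen for illustration.

```python
import argparse

MIN_CORES = 3  # minimum stated in the steps above


def min_three_cores(value: str) -> int:
    """argparse type callback: parse an integer and reject values below MIN_CORES."""
    cores = int(value)  # a non-integer raises ValueError, which argparse reports
    if cores < MIN_CORES:
        raise argparse.ArgumentTypeError(
            f"--cores must be at least {MIN_CORES}, got {cores}"
        )
    return cores


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real run.py may define its arguments differently.
    parser = argparse.ArgumentParser(description="Sketch of a --cores check")
    parser.add_argument("--cores", type=min_three_cores, required=True)
    return parser
```

With a parser like this, `build_parser().parse_args(["--cores", "2"])` would exit with a usage error before any processing starts, rather than failing partway through a run.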

Requirements:

Our disambiguation process uses compiled, lower-level code for computational speed and efficiency. To run hansard-speakers, users must have both Python and Cython installed.
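Before starting the pipeline, it can help to confirm that Cython is importable in the active Python environment. This is a minimal sketch using only the standard library; the function name `missing_modules` is illustrative and is not part of hansard-speakers.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Cython is importable as the top-level package "Cython".
missing = missing_modules(["Cython"])
if missing:
    print("Missing required modules:", ", ".join(missing))
else:
    print("Cython is available.")
```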
