Home
Welcome to the git-lit wiki!
We'll add more structure to this later, but for now it's just a single page to record various interesting/useful bits of information.
Quickstart
git clone git@github.com:Git-Lit/git-lit.git
cd git-lit
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
If you use MacPorts and get a complaint about an old version of the libxml2 dynamic library when importing lxml, the following command may help:
export DYLD_LIBRARY_PATH=/opt/local/lib
This, however, can break other things (like git), so a better solution is to prefix only the commands that need it with the definition, e.g.
DYLD_LIBRARY_PATH=/opt/local/lib python stats.py -r data
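The same per-command scoping can be done from a Python wrapper script. The following is a minimal sketch, not part of git-lit: the helper names `dyld_env` and `run_with_dyld` are hypothetical, and the default path is simply the MacPorts location mentioned above.

```python
import os
import subprocess


def dyld_env(lib_path="/opt/local/lib"):
    """Return a copy of the current environment with DYLD_LIBRARY_PATH set.

    os.environ itself is left untouched, so other tools (like git) started
    from the same process are not affected.
    """
    return dict(os.environ, DYLD_LIBRARY_PATH=lib_path)


def run_with_dyld(cmd, lib_path="/opt/local/lib"):
    """Run one command with the modified environment (hypothetical helper)."""
    return subprocess.run(cmd, env=dyld_env(lib_path))
```

For example, `run_with_dyld(["python", "stats.py", "-r", "data"])` would mirror the shell command above.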
The zip files, as delivered from the British Library, live in the data directory (samples only) and the structure looks like:
data/000000037/000000037_0_1-42pgs__944211_dat.zip
data/000000196/000000196_0_1-164pgs__1031646_dat.zip
data/000000206/000000206_0_1-256pgs__594984_dat.zip
data/000000216/000000216_1_1-318pgs__632698_dat.zip
The file name format is:
{book id}_{volume}_{version?}-{page count}pgs__{?unknown?}_dat.zip
- book id is the Aleph system number (sysnum) of the catalog record for the original. This is different from the sysnum associated with the catalog record for the electronic resource created by the scanning.
- volume is 0 for single-volume editions or 1-N for N-volume editions.
- version is always 1. My guess is that it's to allow for rescans, but this should be confirmed with the BL.
- page count is per volume.
- unknown is ... ??? It doesn't appear to be a length or a date.
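The name format above can be sketched as a regular expression. This is a rough illustration rather than code from the repository: the names `ZipName` and `parse_zip_name` are hypothetical, and the pattern assumes every field is purely numeric, which holds for the four sample names but has not been checked against the full delivery.

```python
import re
from typing import NamedTuple, Optional

# {book id}_{volume}_{version?}-{page count}pgs__{?unknown?}_dat.zip
# Assumption: all fields are digits only, as in the sample file names.
ZIP_NAME = re.compile(
    r"^(?P<book_id>\d+)_(?P<volume>\d+)_(?P<version>\d+)"
    r"-(?P<page_count>\d+)pgs__(?P<unknown>\d+)_dat\.zip$"
)


class ZipName(NamedTuple):
    book_id: str     # kept as a string to preserve leading zeros
    volume: int      # 0 for single-volume editions, 1-N otherwise
    version: int     # always 1 in the samples
    page_count: int  # per volume
    unknown: str     # meaning not yet known, so left uninterpreted


def parse_zip_name(path: str) -> Optional[ZipName]:
    """Parse a delivered zip path; return None if the name does not match."""
    name = path.rsplit("/", 1)[-1]  # drop any leading directories
    m = ZIP_NAME.match(name)
    if m is None:
        return None
    return ZipName(
        m["book_id"],
        int(m["volume"]),
        int(m["version"]),
        int(m["page_count"]),
        m["unknown"],
    )
```

Returning None for a non-matching name, instead of raising, makes it easy to spot files that break the assumed format when scanning the whole data directory.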
The zip file contains a {book id}_metadata.xml file at the top level, which contains limited metadata in MODS format. The OCR text is in the ALTO subdirectory, using one of the following naming schemes:
ALTO/000000216_01_000001.xml
ALTO/000000216_01_000002.xml
ALTO/000000206_000001.xml
ALTO/000000206_000002.xml
ALTO/000000206_000003.xml
The first example is volume 1 of a multi-volume scan and the second is a single volume scan.