This repository contains code and data for the paper "An Evaluation of DNN Architectures for Page Segmentation of Historical Newspapers":
- `00_demo_data` provides sample data that can be used to run the script in `02_preprocessing`. Our full annotated data that was used in the paper can be found on Dropbox.
- `01_selection` contains a random page selection script.
- `02_preprocessing` contains the full pipeline used to postprocess the ground truth (before DNN training).
- `03_training` contains the code used to train the DNNs. Note that `train.py` contains AdamW optimizer code copied from https://github.com/OverLordGoldDragon/keras-adamw.
- `04_evaluation` contains various scripts for evaluating performance, as well as our raw data (as Sacred runs, see `04_evaluation/data`).
- `05_prediction` provides scripts for running our final models for prediction (see the graphics below for the demo result). To run it yourself on this or other document images, first download the models from Dropbox and move them to `05_prediction/data/models`. Then run `05_prediction/src/main.py` to predict the files in `05_prediction/data/pages`. Note that you need to have numpy, tensorflow and segmentation_models installed.
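Before running the prediction script, it can help to verify that the prerequisites listed above are in place. The following is a minimal sketch, not part of this repository: the helper names `missing_packages` and `prediction_problems` are hypothetical, and the check only assumes the directory layout and package names mentioned above.

```python
import importlib.util
from pathlib import Path

# Packages named in this README as required for prediction.
REQUIRED_PACKAGES = ("numpy", "tensorflow", "segmentation_models")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the names that cannot be located for import (nothing is imported)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def prediction_problems(repo_root="."):
    """Return a list of human-readable problems with the prediction layout.

    Hypothetical helper: it only checks the paths mentioned in this README.
    """
    base = Path(repo_root) / "05_prediction"
    problems = []

    # Both data directories must exist and contain at least one entry.
    for sub in ("models", "pages"):
        folder = base / "data" / sub
        if not folder.is_dir() or not any(folder.iterdir()):
            problems.append(f"{folder} is missing or empty")

    script = base / "src" / "main.py"
    if not script.is_file():
        problems.append(f"{script} not found")

    return problems


if __name__ == "__main__":
    for name in missing_packages():
        print(f"missing package: {name}")
    for problem in prediction_problems("."):
        print(problem)
```

Run it from the repository root. If it prints nothing, the required packages can be found and the expected files are present, so `05_prediction/src/main.py` can be started.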
Legend. Red: Background, Orange: Horizontal Separators, Green: Vertical Separators, Blue: Table Column Separators.
Legend. Red: Background, Blue: Text Region, Orange: Table Region, Green: Illustrations/Borders.