Image classification using fine-tuned ViT - for historical document sorting

Goal: sort archive page images into categories for their further content-based processing

Scope: image processing, training / evaluation of the ViT model, input file / directory processing, top-N class πŸͺ§ (category) prediction output, summarizing predictions into a tabular format, HF 😊 hub 1 πŸ”— support for the model, and multiplatform (Win/Lin) data preparation scripts for PDF-to-PNG conversion



Versions 🏁

There are currently several versions of the model available for download; all of them share the same set of categories, but they differ in data annotations and base models. The latest approved v2.1 is considered the default and can be found in the main branch of the HF 😊 hub 1 πŸ”—

Version  Base                   Pages  PDFs  Description
v2.0     vit-base-patch16-224   10073  3896  annotations with mistakes, more heterogeneous data
v2.1     vit-base-patch16-224   11940  5002  main: more diverse pages in each category, fewer annotation mistakes
v2.2     vit-base-patch16-224   15855  5730  same data as v2.1 + some restored pages from v2.0
v3.2     vit-base-patch16-384   15855  5730  same data as v2.2, but a slightly larger model base with higher resolution
v5.2     vit-large-patch16-384  15855  5730  same data as v2.2, but the largest model base with higher resolution
Base model - size πŸ‘€
Version Disk space
vit-base-patch16-224   344 MB
vit-base-patch16-384   345 MB
vit-large-patch16-384  1.2 GB

Model description πŸ“‡

πŸ”² Fine-tuned model repository: UFAL's vit-historical-page 1 πŸ”—

πŸ”³ Base model repository: Google's vit-base-patch16-224, vit-base-patch16-384, vit-large-patch16-384 2 3 4 πŸ”—

The model was trained on a manually ✍️ annotated dataset of historical documents, in particular, images of pages from archival paper documents scanned into digital form.

The images contain various combinations of text οΈπŸ“„, tables πŸ“, drawings πŸ“ˆ, and photos πŸŒ„ - the categories πŸͺ§ described below were formed based on those archival documents. Page examples can be found in the category_samples πŸ“ directory.

The key use case of the provided model and data processing pipeline is to classify an input PNG image (a page from a scanned paper PDF source) into one of the categories, each of which triggers its own content-specific processing pipeline downstream.

In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the input data as machine-typed (old-style fonts), handwritten ✏️, plain printed οΈπŸ“„ text, or text structured in a tabular πŸ“ format, and to mark the presence of printed πŸŒ„ or drawn πŸ“ˆ graphic materials yet to be extracted from the page images.
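
For orientation, here is a minimal routing sketch (plain Python, NOT part of this project) showing how the top-1 predicted label could dispatch a page to a downstream pipeline; the label names come from the Categories πŸͺ§ table below, while the mapping itself and the returned pipeline names are purely illustrative placeholders:

# illustrative routing sketch - the mapping and pipeline names are placeholders
def route_page(top1_label: str) -> str:
    """Map a predicted category label to a hypothetical downstream processing step."""
    if top1_label in ("TEXT_HW", "LINE_HW"):
        return "handwritten text recognition pipeline"
    if top1_label in ("TEXT", "TEXT_P", "TEXT_T", "LINE_P", "LINE_T"):
        return "printed / typed text OCR pipeline (plus table extraction for LINE_*)"
    if top1_label in ("DRAW", "DRAW_L", "PHOTO", "PHOTO_L"):
        return "graphics extraction pipeline"
    return "manual review"

print(route_page("LINE_T"))  # -> printed / typed text OCR pipeline (plus table extraction for LINE_*)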

Data πŸ“œ

Training πŸ’ͺ set of the model: 8950 images for v2.0

Training πŸ’ͺ set of the model: 10745 images for v2.1

Training πŸ’ͺ set of the model: 14565 images for v2.2, v3.2 and v5.2

90% of all data - the per-category πŸͺ§ proportions are tabulated below

Evaluation πŸ† set: 1290 images (taken from v2.2 annotations)

10% of all data - the same per-category πŸͺ§ proportions as below, demonstrated in model_EVAL.csv πŸ“Ž

Manual ✍️ annotation was performed beforehand and took some time βŒ›; the categories πŸͺ§ were formed from various archival document sources originating in the 1920-2020 span.

Note

The disproportion of the categories πŸͺ§ in both the training data and the provided category_samples πŸ“ is NOT intentional; it is a result of the source data's nature.

In total, several thousand separate PDF files were selected and split into PNG pages: ~4k of the scanned documents were one page long, covering around a third of all data, while ~2k of them were much longer (dozens to hundreds of pages), covering the rest (more than 60% of all annotated data).

The specific content and language of the source data are irrelevant given the model's vision resolution. However, all of the data samples come from archaeological reports, which may somewhat affect drawing detection preferences, since the commonly drawn objects are ceramic pieces, arrowheads, and rocks, formerly drawn by hand and later illustrated with digital tools (examples can be found in category_samples/DRAW πŸ“).

Categories πŸͺ§

Label️ Description
DRAW πŸ“ˆ - drawings, maps, paintings, schematics, or graphics, potentially containing some text labels or captions
DRAW_L πŸ“ˆπŸ“ - drawings, etc., presented within a table-like layout or including a legend formatted as a table
LINE_HW βœοΈπŸ“ - handwritten text organized in a tabular or form-like structure
LINE_P πŸ“ - printed text organized in a tabular or form-like structure
LINE_T πŸ“ - machine-typed text organized in a tabular or form-like structure
PHOTO πŸŒ„ - photographs or photographic cutouts, potentially with text captions
PHOTO_L πŸŒ„πŸ“ - photos presented within a table-like layout or accompanied by tabular annotations
TEXT πŸ“° - mixtures of printed, handwritten, and/or typed text, potentially with minor graphical elements
TEXT_HW βœοΈπŸ“„ - only handwritten text in paragraph or block form (non-tabular)
TEXT_P πŸ“„ - only printed text in paragraph or block form (non-tabular)
TEXT_T πŸ“„ - only machine-typed text in paragraph or block form (non-tabular)

The categories were chosen to sort the pages by the following criteria:

  • presence of graphical elements (drawings πŸ“ˆ OR photos πŸŒ„)
  • type of text πŸ“„ (handwritten ✏️️ OR printed OR typed OR mixed πŸ“°)
  • presence of tabular layout / forms πŸ“

The reason for this distinction is that different types of pages go through different processing pipelines after classification, as mentioned above.

Examples of pages sorted by category πŸͺ§ can be found in the category_samples πŸ“ directory, which also serves as a testing subset of the training data (it can be used to run evaluation and prediction, which requires the --inner flag).


How to install πŸ”§

Step-by-step instructions for installing this program are provided here. The easiest way to obtain the model is through the HF 😊 hub repository 1 πŸ”—, which can be easily accessed via this project.

Hardware requirements πŸ‘€

Minimal machine πŸ–₯️ requirements for slow prediction run (and very slow training / evaluation):

  β€’ CPU with a decent (above-average) amount of operational memory (RAM)

Ideal machine πŸ–₯️ requirements for fast prediction (and relatively fast training / evaluation):

  β€’ any CPU and a reasonable amount of RAM
  β€’ GPU (for actual CUDA 5 support - only NVIDIA cards)

Warning

Make sure you have Python 3.10+ installed on your machine πŸ’» and that it meets the hardware requirements listed above. Then create a separate virtual environment for this project.

How to πŸ‘€

Clone this project to your local machine πŸ–₯️️ via:

cd /local/folder/for/this/project
git clone https://github.com/ufal/atrium-page-classification.git

OR, to update an already cloned project that has local changes, go to the folder containing the (hidden) .git subdirectory and pull, which will merge the incoming files with your local changes:

cd /local/folder/for/this/project/atrium-page-classification
git add <changed_file>
git commit -m 'local changes'
git pull -X theirs

Alternatively, if you do NOT care about local changes and just want the latest project files, remove those files (all .py, .txt, and README files) and pull the latest version from the repository:

cd /local/folder/for/this/project/atrium-page-classification
rm *.py
rm *.txt
rm README*
git pull

The next step is creating the virtual environment. Follow the Unix / Windows-specific instructions in the venv docs 6 πŸ‘€πŸ”— if you don't know how.

After creating the venv folder, activate the environment via:

source <your_venv_dir>/bin/activate

and then, inside your virtual environment, install the Python libraries (takes time βŒ›).

Caution

Up to 1 GB of space is needed for model files and checkpoints, and up to 7 GB for the Python libraries (PyTorch and its dependencies, etc.).

Installation of Python dependencies can be done via:

pip install -r requirements.txt

Note

CUDA 5 support for the PyTorch library is supposed to be installed automatically at this point - the presence of a GPU on your machine πŸ–₯️ is checked here for the first time; it is checked again before every model initialization (for training, evaluation, or prediction runs).
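
For reference, the underlying device check is the standard PyTorch CUDA query; a minimal sketch (not the project's exact code) looks like this:

import torch

# falls back to CPU when no CUDA-capable GPU (or no CUDA-enabled PyTorch build) is found
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")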

After the dependencies installation is finished successfully, in the same virtual environment, you can run the Python program.

To test that everything works and to see the descriptions of all flags, call --help ❓:

python3 run.py -h

You should see a (hopefully) helpful message about all available command line flags. Your next step would be to pull the model from the HF 😊 hub repository 1 πŸ”— via:

python3 run.py --hf

OR, for a specific model version (e.g. main, v2.0, or vX.2), use the --revision flag:

python3 run.py --hf -rev v2.0

OR, for a specific base model (e.g. google/vit-large-patch16-384), use the --base flag (only when the trained model version requires that base model, as described above):

python3 run.py --hf -rev v5.2 -b google/vit-large-patch16-384

Important

If you already have the model files in the model/model_<revision> directory next to this file, you do NOT have to use the --hf flag to download them from the HF 😊 repo 1 πŸ”— (use it only to update the model version).

You should see a message about loading the model from the hub and then saving it locally on your machine πŸ–₯️.

Only after you have obtained the trained model files (which takes less time βŒ› than installing the dependencies) can you use the commands provided below.

After the model is downloaded, you should see a similar file structure:

Initial project tree 🌳 files structure πŸ‘€
/local/folder/for/this/project/atrium-page-classification
β”œβ”€β”€ model
    └── model_<revision> 
        β”œβ”€β”€ config.json
        β”œβ”€β”€ model.safetensors
        └── preprocessor_config.json
β”œβ”€β”€ checkpoint
    β”œβ”€β”€ models--google--vit-base-patch16-224
        β”œβ”€β”€ blobs
        β”œβ”€β”€ snapshots
        └── refs
    └── .locks
        └── models--google--vit-base-patch16-224
β”œβ”€β”€ data_scripts
    β”œβ”€β”€ windows
        β”œβ”€β”€ move_single.bat
        β”œβ”€β”€ pdf2png.bat
        └── sort.bat
    └── unix
        β”œβ”€β”€ move_single.sh
        β”œβ”€β”€ pdf2png.sh
        └── sort.sh
β”œβ”€β”€ result
    β”œβ”€β”€ plots
        β”œβ”€β”€ date-time_conf_mat.png
        └── ...
    └── tables
        β”œβ”€β”€ date-time_TOP-N.csv
        β”œβ”€β”€ date-time_TOP-N_EVAL.csv
        β”œβ”€β”€ date-time_EVAL_RAW.csv
        └── ...
β”œβ”€β”€ category_samples
    β”œβ”€β”€ DRAW
        β”œβ”€β”€ CTX193200994-24.png
        └── ...
    β”œβ”€β”€ DRAW_L
    └── ...
β”œβ”€β”€ run.py
β”œβ”€β”€ classifier.py
β”œβ”€β”€ utils.py
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ config.txt
β”œβ”€β”€ README.md
└── ...

Some of the folders may be missing, such as the model_output folder mentioned later, which is created automatically only after launching model training.


How to run prediction πŸͺ„ modes

There are two main ways to run the program:

  • Single PNG file classification πŸ“„
  • Directory with PNG files classification πŸ“

To begin with, open config.txt βš™ and change the folder path in the [INPUT] section, then optionally change top_N and batch in the [SETUP] section.
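
For illustration, the relevant part of config.txt βš™ might look roughly like this - top_N and batch are the names used in this README, while the exact key name for the input folder path is an assumption and may differ in your copy of the file:

[INPUT]
folder = /full/path/to/directory/with/png/files

[SETUP]
top_N = 3
batch = 8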

Note

Top-3 is enough to cover most of the images; setting Top-5 will help with a small number of difficult-to-classify samples.

The batch variable value depends on your machine's πŸ–₯️ memory size.

Rough estimations of memory usage per batch size πŸ‘€
Batch size CPU / GPU memory usage
4     2 GB
8     3 GB
16    5 GB
32    9 GB
64    17 GB

It is safe to use a batch size below 12 on a regular office desktop computer; lower it to 4 if it is an old device. For training on a High-Performance Computing cluster, you may use values above 20 for the batch variable in the [SETUP] section.

Caution

Do NOT try to change base_model and other section contents unless you know what you are doing

Rough estimations of disk space needed for trained model in relation to the base model πŸ‘€
Version Disk space
vit-base-patch16-224   344 MB
vit-base-patch16-384   345 MB
vit-large-patch16-384  1.2 GB

Make sure the virtual environment with all the installed libraries is activated and that you are in the project directory containing the Python files; only then proceed.

How to πŸ‘€
cd /local/folder/for/this/project/
source <your_venv_dir>/bin/activate
cd atrium-page-classification

Important

All the commands listed below for running the Python scripts are written for Unix consoles; Windows users must use python instead of python3.

Page processing πŸ“„

This prediction mode is run using the -f or --file flag with a path argument. Optionally, you can use the -tn or --topn flag with the number of guesses you want to get, and the -m or --model flag with a path to the model folder.

How to πŸ‘€

Run the program from its starting point run.py πŸ“Ž with optional flags:

python3 run.py -tn 3 -f '/full/path/to/file.png' -m '/full/path/to/model/folder'

for exactly TOP-3 guesses with a console output.

OR if you are sure about default variables set in the config.txt βš™:

python3 run.py -f '/full/path/to/file.png'

to run a single PNG file classification - the output will be in the console.

Note

Console output and all result tables contain normalized scores for the top N classes πŸͺ§.

Directory processing πŸ“

This prediction mode does NOT require explicitly setting the directory path with -d or --directory, since its default value is set in the config.txt βš™ file and is used when the --dir flag is given. The same flags for the number of guesses and the model folder path as for single-page processing can be used. In addition, two directory-specific flags, --inner and --raw, are available.

Caution

You must either explicitly set the -d flag's argument or use the --dir flag (which falls back to the default input directory preset in the [INPUT] section) to process PNG files at the directory level; otherwise, nothing will happen.

Note that directory-level πŸ“ processing is performed in batches, so refer to the memory requirements for different batch sizes tabulated above.

How to πŸ‘€
python3 run.py -tn 3 -d '/full/path/to/directory' -m '/full/path/to/model/folder'

for exactly TOP-3 guesses in tabular format from all images found in the given directory.

OR if you are really sure about default variables set in the config.txt βš™:

python3 run.py --dir 

python3 run.py -rev v3.2 -b google/vit-base-patch16-384 --inner --dir

The classification results for the PNG pages collected from the directory will be saved πŸ’Ύ to the result πŸ“ folders defined in the [OUTPUT] section of the config.txt βš™ file.

Tip

To additionally get raw class πŸͺ§ probabilities from the model along with the TOP-N results, use the --raw flag when processing the directory (NOT available for single-file processing).

Tip

To process all PNG files in the directory AND its subdirectories use the --inner flag when processing the directory, or switch its default value to True in the [SETUP] section

Naturally, processing a large number of PNG pages takes time βŒ›; progress is reported in the console via messages like Processed <BΓ—N> images, where B is the batch size set in the [SETUP] section of the config.txt βš™ file and N is the current iteration of the dataloader processing loop.

Only after all images from the input directory have been processed is the output table saved πŸ’Ύ to the result/tables folder.


Results πŸ“Š

Accuracy measurements and confusion matrix plots are provided for the evaluation dataset (10% of the data provided in the [TRAIN] folder). Both the plots and the result tables can be found in the result πŸ“ folder.

v2.0 Evaluation set's accuracy (Top-3): 95.58% πŸ†

Confusion matrix πŸ“Š TOP-3 πŸ‘€

TOP-3 confusion matrix

v2.1 Evaluation set's accuracy (Top-3): 99.84% πŸ†

Confusion matrix πŸ“Š TOP-3 πŸ‘€

TOP-3 confusion matrix

v2.2 Evaluation set's accuracy (Top-3): 100.00% πŸ†

Confusion matrix πŸ“Š TOP-3 πŸ‘€

TOP-3 confusion matrix

v2.0 Evaluation set's accuracy (Top-1): 84.96% πŸ†

Confusion matrix πŸ“Š TOP-1 πŸ‘€

TOP-1 confusion matrix

v2.1 Evaluation set's accuracy (Top-1): 96.36% πŸ†

Confusion matrix πŸ“Š TOP-1 πŸ‘€

TOP-1 confusion matrix

v2.2 Evaluation set's accuracy (Top-1): 99.61% πŸ†

Confusion matrix πŸ“Š TOP-1 πŸ‘€

TOP-1 confusion matrix

The confusion matrices above show matching gold and predicted categories πŸͺ§ on the diagonal, while their off-diagonal elements show inter-class errors. From those plots you can judge what type of mistakes to expect from the model.

By running tests on the evaluation dataset after training you can generate the following output files:

  • date-time_model_TOP-N_EVAL.csv - (by default) results of the evaluation dataset with TOP-N guesses
  • date-time_model_conf_mat_TOP-N.png - (by default) confusion matrix plot for the evaluation dataset also with TOP-N guesses
  • date-time_model_EVAL_RAW.csv - (by flag --raw) raw probabilities for all classes of the evaluation dataset

Note

Generated tables will be sorted by FILE and PAGE number columns in ascending order.

Additionally, results of prediction (inference) runs at the directory level, without ground-truth labels, are included.

Result tables and their columns πŸ“πŸ“‹

General result tables πŸ‘€

Demo files v2.0:

Demo files v2.1:

Demo files v2.2:

With the following columns πŸ“‹:

  • FILE - name of the file
  • PAGE - number of the page
  • CLASS-N - label of the category πŸͺ§, guess TOP-N
  • SCORE-N - score of the category πŸͺ§, guess TOP-N

and optionally

  • TRUE - actual label of the category πŸͺ§
Raw result tables πŸ‘€

Demo files v2.0:

Demo files v2.1:

With the following columns πŸ“‹:

  • FILE - name of the file
  • PAGE - number of the page
  • <CATEGORY_LABEL> - separate columns for each of the defined classes πŸͺ§
  • TRUE - actual label of the category πŸͺ§

The reason to use the --raw flag is convenience of results review: the rows are roughly sorted by category, and the most ambiguous pages have more small non-zero probabilities spread across classes than the pages that are obvious (for the model) πŸͺ§.
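
If you prefer to inspect the saved πŸ’Ύ tables programmatically, a minimal pandas sketch (NOT part of this project; the file name is a placeholder and the columns are assumed to follow the FILE / PAGE / CLASS-N / SCORE-N scheme described above) could look like this:

import pandas as pd

# placeholder name - real tables are timestamped, e.g. <date-time>_<model>_TOP-3.csv
df = pd.read_csv("result/tables/date-time_model_TOP-3.csv")
# tables are already sorted by FILE and PAGE; re-sorting keeps the review order predictable
df = df.sort_values(["FILE", "PAGE"])
print(df[["FILE", "PAGE", "CLASS-1", "SCORE-1"]].head())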


Data preparation πŸ“¦

You can use this section as a guide for creating your own dataset of pages, which will be suitable for further model processing.

There are useful multiplatform scripts in the data_scripts πŸ“ folder for the whole process of data preparation.

Note

The .sh scripts are adapted for Unix OS and .bat scripts are adapted for Windows OS, yet their functionality remains the same

On Windows you must also install the following software before converting PDF documents to PNG images:

  • ImageMagick 7 πŸ”— - download and install the latest version
  β€’ Ghostscript 8 πŸ”— - download and install the latest version (32- or 64-bit) released under the AGPL license

PDF to PNG πŸ“š

The source set of PDF documents must be converted into page-specific PNG images before processing. The following steps describe how to convert PDF documents into PNG images suitable for training, evaluation, or prediction inference.

Firstly, copy the PDF-to-PNG converter script to the directory with PDF documents.

How to πŸ‘€

Windows:

move \local\folder\for\this\project\data_scripts\pdf2png.bat \full\path\to\your\folder\with\pdf\files

Unix:

cp /local/folder/for/this/project/data_scripts/pdf2png.sh /full/path/to/your/folder/with/pdf/files

Now check the content and comments in pdf2png.sh πŸ“Ž or pdf2png.bat πŸ“Ž script, and run it.

Important

You can optionally comment out the removal of processed PDF files in the script, but this is NOT recommended if you are going to launch the script several times from the same location.

How to πŸ‘€

Windows:

cd \full\path\to\your\folder\with\pdf\files
pdf2png.bat

Unix:

cd /full/path/to/your/folder/with/pdf/files
pdf2png.sh

After the program is done, you will have a directory full of document-specific subdirectories containing page-specific images, with a structure similar to this:

Unix folder tree 🌳 structure πŸ‘€
/full/path/to/your/folder/with/pdf/files
β”œβ”€β”€ PdfFile1Name
    β”œβ”€β”€ PdfFile1Name-001.png
    β”œβ”€β”€ PdfFile1Name-002.png
    └── ...
β”œβ”€β”€ PdfFile2Name
    β”œβ”€β”€ PdfFile2Name-01.png
    β”œβ”€β”€ PDFFile2Name-02.png
    └── ...
β”œβ”€β”€ PdfFile3Name
    └── PdfFile3Name-1.png 
β”œβ”€β”€ PdfFile4Name
└── ...

Note

The page numbers are zero-padded (on the left) to match the length of the last page number in each PDF file; this is done automatically by the pdftoppm command used on Unix. ImageMagick's 7 πŸ”— convert command used on Windows does NOT pad the page numbers.

Windows folder tree 🌳 structure πŸ‘€
\full\path\to\your\folder\with\pdf\files
β”œβ”€β”€ PdfFile1Name
    β”œβ”€β”€ PdfFile1Name-1.png
    β”œβ”€β”€ PdfFile1Name-2.png
    └── ...
β”œβ”€β”€ PdfFile2Name
    β”œβ”€β”€ PdfFile2Name-1.png
    β”œβ”€β”€ PDFFile2Name-2.png
    └── ...
β”œβ”€β”€ PdfFile3Name
    └── PdfFile3Name-1.png 
β”œβ”€β”€ PdfFile4Name
└── ...

Optionally, you can use the move_single.sh πŸ“Ž or move_single.bat πŸ“Ž script to move all PNG files from directories containing a single PNG file into a common directory of one-pagers.

By default, the scripts assume that onepagers is the fall-back directory for PDF documents that have no corresponding separate directory of PNG pages in the PDF files directory (which has already been converted into subdirectories of pages).

How to πŸ‘€

Windows:

move \local\folder\for\this\project\atrium-page-classification\data_scripts\move_single.bat \full\path\to\your\folder\with\pdf\files
cd \full\path\to\your\folder\with\pdf\files
move_single.bat

Unix:

cp /local/folder/for/this//project/atrium-page-classification/data_scripts/move_single.sh /full/path/to/your/folder/with/pdf/files
cd /full/path/to/your/folder/with/pdf/files 
move_single.sh 

The reason for this move is simply convenience in the annotation process described below. These changes are also taken into account by the sort.sh πŸ“Ž and sort.bat πŸ“Ž scripts.

PNG pages annotation πŸ”Ž

The generated PNG images of document pages are used to form the annotated gold data.

Note

It takes a lot of time βŒ› to collect at least several hundred examples per category.

Prepare a CSV table with exactly 3 columns:

  • FILE - name of the PDF document which was the source of this page
  • PAGE - number of the page (NOT padded with 0s)
  • CLASS - label of the category πŸͺ§

Tip

Prepare equal-in-size categories πŸͺ§ if possible, so that the model will not be biased towards the over-represented labels πŸͺ§

For Windows users, MS Excel is NOT recommended for writing CSV tables; a free alternative is Apache OpenOffice 9 πŸ”—. For Unix users, the default LibreOffice Calc should be enough to correctly write a comma-separated CSV table.

Table in .csv format example πŸ‘€
FILE,PAGE,CLASS
PdfFile1Name,1,Label1
PdfFile2Name,9,Label1
PdfFile1Name,11,Label3
...

PNG pages sorting for training πŸ“¬

Cluster the annotated data into separate folders using the sort.sh πŸ“Ž or sort.bat πŸ“Ž script to copy data from the source folder to the training folder where each category πŸͺ§ has its own subdirectory. This division of PNG images will be used as gold data in training and evaluation.

Warning

It does NOT matter from which directory you launch the sorting script, but you must check the top of the script for (1) the path to the previously described CSV table with annotations, (2) the path to the previously described directory containing document-specific subdirectories of page-specific PNG pages, and (3) the path to the directory where you want to store the training data of label-specific directories with annotated page images.

How to πŸ‘€

Windows:

sort.bat

Unix:

sort.sh

After the program is done, you will have a directory full of label-specific subdirectories containing document-specific pages with a similar structure:

Unix folder tree 🌳 structure πŸ‘€
/full/path/to/your/folder/with/train/pages
β”œβ”€β”€ Label1
    β”œβ”€β”€ PdfFileAName-00N.png
    β”œβ”€β”€ PdfFileBName-0M.png
    └── ...
β”œβ”€β”€ Label2
β”œβ”€β”€ Label3
β”œβ”€β”€ Label4
└── ...
Windows folder tree 🌳 structure πŸ‘€
\full\path\to\your\folder\with\train\pages
β”œβ”€β”€ Label1
    β”œβ”€β”€ PdfFileAName-N.png
    β”œβ”€β”€ PdfFileBName-M.png
    └── ...
β”œβ”€β”€ Label2
β”œβ”€β”€ Label3
β”œβ”€β”€ Label4
└── ...

The sorting script can help you moderate mislabeled samples before training. Accurate data annotation directly affects model performance.

Before running the training, check the [TRAIN] section variables in the config.txt βš™οΈ file, where you should set the path to the data folder. Make sure the label directory names do NOT contain special characters such as spaces, tabs, or line breaks.

Tip

In the config.txt βš™οΈ file, tweak the max_categ parameter, which caps the number of samples per category πŸͺ§, in case you have over-represented labels significantly dominating in size. Set max_categ higher than the number of samples in the largest category πŸͺ§ to use all data samples.
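
For illustration, the [TRAIN] section might then look roughly like this - max_categ, epochs, and log_step are named in this README, while the key name for the training data path and the concrete values are assumptions:

[TRAIN]
folder = /full/path/to/your/folder/with/train/pages
max_categ = 1000
epochs = 3
log_step = 10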

From this point, you can start the model training or evaluation process.


For developers πŸͺ›

You can use this project code as a base for your own image classification tasks. The detailed guide on the key phases of the whole process (settings, training, evaluation) is provided here.

Project files description πŸ“‹πŸ‘€
File Name Description
classifier.py Model-specific classes and related functions including predefined values for training arguments
utils.py Task-related algorithms
run.py Starting point of the program with its main function - can be edited for flags and function argument extensions
config.txt Changeable variables for the program - should be edited

Most of the changeable variables are in the config.txt βš™ file, specifically, in the [TRAIN], [HF], and [SETUP] sections.

In the dev sections of the configuration βš™ file, you will find many boolean variables that can be changed from the default False to True; however, it is recommended to enable them only through the specific command-line flags implemented for each of these variables.

For more detailed training adjustments, refer to the related functions in the classifier.py πŸ“Ž file, where you will find some predefined values that are not used in the run.py πŸ“Ž file.

Important

For both training and evaluation, make sure that the training pages directory is set correctly in config.txt βš™ and that it contains category πŸͺ§ subdirectories with images inside. The names of the category πŸͺ§ subdirectories, sorted in alphabetical order, become the actual label names and replace the default categories πŸͺ§ list.
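
A minimal sketch of how such label names can be derived from the subdirectory names (an illustration, not necessarily the project's exact code):

import os

train_dir = "/full/path/to/your/folder/with/train/pages"  # placeholder path
# category subdirectory names, sorted alphabetically, become the label list
labels = sorted(d for d in os.listdir(train_dir) if os.path.isdir(os.path.join(train_dir, d)))
print(labels)  # e.g. ['DRAW', 'DRAW_L', 'LINE_HW', ...]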

Device πŸ–₯️ requirements for training / evaluation:

  β€’ any CPU and a reasonable amount of RAM
  β€’ GPU (for actual CUDA 5 support - preferably one of NVIDIA's cards)

Note that efficient training is possible only with a CUDA-compatible GPU.

Rough estimations of memory usage πŸ‘€
Batch size CPU / GPU memory usage
4     2 GB
8     3 GB
16    5 GB
32    9 GB
64    17 GB

For test launches on a CPU-only device πŸ–₯️, set the batch size lower than 4; even then, an above-average amount of RAM is a must to avoid a total system crash.

Training πŸ’ͺ

To train the model run:

python3 run.py --train

The training process automatically logs progress to the console and should take approximately 5-12 hours, depending on your machine's πŸ–₯️ CPU / GPU memory size and the size of the prepared dataset.

Tip

Run the training with the default hyperparameters if you have at least ~10,000 and fewer than 50,000 page samples very similar to the initial source data - meaning no further changes are required to fine-tune the model for the same task on an expanded (or new) dataset of document pages; even the number of categories πŸͺ§ does NOT matter as long as it stays under 20.

Training hyperparameters πŸ‘€
  • eval_strategy "epoch"
  • save_strategy "epoch"
  • learning_rate 5e-5
  • per_device_train_batch_size 8
  • per_device_eval_batch_size 8
  • num_train_epochs 3
  • warmup_ratio 0.1
  • logging_steps 10
  • load_best_model_at_end True
  • metric_for_best_model "accuracy"

Above are the default hyperparameters (TrainingArguments 10) used in the training process; only epoch and log_step can be changed in the [TRAIN] section, plus batch in the [SETUP] section, of the config.txt βš™ file.
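
For reference, a minimal sketch of how these defaults map onto transformers' TrainingArguments 10 (output_dir is a placeholder; the eval_strategy spelling follows the list above and is called evaluation_strategy in older transformers releases):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="model_output",           # placeholder for the checkpoint directory
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)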

You are free to adjust the learning rate directly in the training function arguments called in the run.py πŸ“Ž file, while the warmup ratio and other hyperparameters are accessible only through the classifier.py πŸ“Ž file.

Playing with the training hyperparameters is recommended only if the training πŸ’ͺ loss (error rate) descends too slowly to reach values around 0.001 by the end of the 3rd (last by default) epoch.

If the evaluation πŸ† loss starts to rise steadily after its earlier descent, you have reached the limit of useful epochs; next time, set epochs to the number of the epoch that completed successfully before you noticed the evaluation loss growing.

During training, image transformations 11 are applied sequentially, each with a 50% chance.

Note

No rotation, reshaping, or flipping was applied to the images; mainly color manipulations were used. The reasons for this are pages containing specific form types, the general text orientation on the pages, and the default resizing of the model input to square 224x224 images.

Image preprocessing steps πŸ‘€
  β€’ transforms.ColorJitter(brightness=0.5)
  β€’ transforms.ColorJitter(contrast=0.5)
  β€’ transforms.ColorJitter(saturation=0.5)
  β€’ transforms.ColorJitter(hue=0.5)
  • transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
  • transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))

More about selecting image transformations and the available options can be found in the PyTorch torchvision docs 11.
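
A minimal sketch of applying such transforms sequentially, each with a 50% chance (an assumption about the implementation, not necessarily the project's exact code):

import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

train_augmentations = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])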

After training is complete, the model is saved πŸ’Ύ to its own subdirectory in the model directory. By default, the model folder name is derived from the length of its training dataloader and the number of epochs - for example model_<S/B>_E, where E is the number of epochs, B is the batch size, and S is the size of your training dataset (by default, 90% of the data provided in the [TRAIN] folder).
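
To make the naming rule concrete, a tiny sketch with illustrative numbers (assuming the dataloader length is the training set size divided by the batch size, rounded up):

from math import ceil

S, B, E = 14565, 8, 3                   # training set size, batch size, epochs (illustrative)
model_dir = f"model_{ceil(S / B)}_{E}"  # -> "model_1821_3"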

Full project tree 🌳 files structure πŸ‘€
/local/folder/for/this/project/atrium-page-classification
β”œβ”€β”€ model
    β”œβ”€β”€ movel_v<HFrevision1> 
        β”œβ”€β”€ config.json
        β”œβ”€β”€ model.safetensors
        └── preprocessor_config.json
    β”œβ”€β”€ movel_v<HFrevision2>
    └── ...
β”œβ”€β”€ checkpoint
    β”œβ”€β”€ models--google--vit-base-patch16-224
        β”œβ”€β”€ blobs
        β”œβ”€β”€ snapshots
        └── refs
    └── .locks
        └── models--google--vit-base-patch16-224
β”œβ”€β”€ model_output
    β”œβ”€β”€ checkpoint-version1
        β”œβ”€β”€ config.json
        β”œβ”€β”€ model.safetensors
        β”œβ”€β”€ trainer_state.json
        β”œβ”€β”€ optimizer.pt
        β”œβ”€β”€ scheduler.pt
        β”œβ”€β”€ rng_state.pth
        └── training_args.bin
    β”œβ”€β”€ checkpoint-version2
    └── ...
β”œβ”€β”€ data_scripts
    β”œβ”€β”€ windows
    └── unix
β”œβ”€β”€ result
    β”œβ”€β”€ plots
    └── tables
β”œβ”€β”€ category_samples
    β”œβ”€β”€ DRAW
    β”œβ”€β”€ DRAW_L
    └── ...
β”œβ”€β”€ run.py
β”œβ”€β”€ classifier.py
β”œβ”€β”€ utils.py
└── ...

Important

The model_<revision> folder naming is generated from the HF 😊 repo 1 πŸ”— revision value and does NOT affect the trained model's naming; other training parameters do, since the length of the dataloader depends not only on the dataset size but also on the preset batch size and the test subset ratio.

You can slightly change the test_size and / or the batch variable value in the config.txt βš™ file to train a differently named model on the same dataset. Alternatively, adjust the model naming generation in the classifier.py's πŸ“Ž training function.

Evaluation πŸ†

After the fine-tuned model is saved πŸ’Ύ, you can explicitly call for evaluation of the model to get a table of TOP-N classes for a randomly composed subset (10% of the data by default) of the training page folder.

There is an option of setting test_size to 0.8 and using the category-sorted pages provided in the [TRAIN] folder for evaluation; however, do NOT run evaluation on the same data that was actually used to train the evaluated model.

To do this with an unchanged configuration βš™, automatically create a confusion matrix plot πŸ“Š, and additionally get a raw class probabilities table, run:

python3 run.py --eval --raw

OR when you don't remember the specific [SETUP] and [TRAIN] variables' values for the trained model, you can use:

python3 run.py --eval -m './model/model_<your_model_number_code>'

Finally, when your model is trained and you are happy with its performance tests, you can uncomment a code line in the run.py πŸ“Ž file for pushing the model to the HF 😊 hub. This functionality has already been implemented and can be accessed through the --hf flag using the values set in the [HF] section for the token and repo_name variables.

In this case, you must rename the trained model folder according to the revision value (dots in the name are dropped, e.g. revision v1.9.22 becomes the model_v1922 folder), and only then run the repo push.

Caution

Set repo_name to your own empty repository on the HF 😊 hub, then in the Settings of your HF 😊 account find the Access Tokens section and generate a new token; copy and paste its value into the token variable. Before committing those config.txt βš™ file changes via git, replace the full token value with a shortened version for security reasons.
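
For orientation only, the push step might look roughly like the standard transformers push_to_hub call sketched below; the variable values are placeholders for the [HF] section settings, and the actual line in run.py πŸ“Ž may differ:

from transformers import ViTForImageClassification, ViTImageProcessor

repo_name = "your-username/your-empty-repo"  # placeholder for the [HF] repo_name value
token = "<your-HF-access-token>"             # placeholder for the [HF] token value

model = ViTForImageClassification.from_pretrained("./model/model_v1922")  # renamed folder
processor = ViTImageProcessor.from_pretrained("./model/model_v1922")
model.push_to_hub(repo_name, token=token)
processor.push_to_hub(repo_name, token=token)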


Contacts πŸ“§

For support, write to: [email protected], responsible for this GitHub repository 12 πŸ”—

Information about the authors of this project, including their names and ORCIDs, can be found in the CITATION.cff πŸ“Ž file.

Acknowledgements πŸ™

  • Developed by UFAL 13 πŸ‘₯
  • Funded by ATRIUM 14 πŸ’°
  • Shared by ATRIUM 14 & UFAL 13 πŸ”—
  β€’ Model type: fine-tuned ViT with 224x224 2 πŸ”— or 384x384 3 4 πŸ”— input resolution

©️ 2022 UFAL & ATRIUM


Appendix πŸ€“

README emoji codes πŸ‘€
  • πŸ–₯ - your computer
  • πŸͺ§ - label/category/class
  • πŸ“„ - page/file
  • πŸ“ - folder/directory
  • πŸ“Š - generated diagrams or plots
  • 🌳 - tree of file structure
  • βŒ› - time-consuming process
  • ✍️ - manual action
  • πŸ† - performance measurement
  • 😊 - Hugging Face (HF)
  • πŸ“§ - contacts
  • πŸ‘€ - click to see
  • βš™οΈ - configuration/settings
  • πŸ“Ž - link to the internal file
  • πŸ”— - link to the external website
Content specific emoji codes πŸ‘€
  • πŸ“ - table content
  • πŸ“ˆ - drawings/paintings/diagrams
  • πŸŒ„ - photos
  • ✏️ - handwritten content
  • πŸ“„ - text content
  • πŸ“° - mixed types of text content, maybe with graphics
Decorative emojis πŸ‘€
  • πŸ“‡πŸ“œπŸ”§β–ΆπŸͺ„πŸͺ›οΈπŸ“¦πŸ”ŽπŸ“šπŸ™πŸ‘₯πŸ“¬πŸ€“ - decorative purpose only

Tip

Alternative version of this README file is available in README.html πŸ“Ž webpage

Footnotes

  1. https://huggingface.co/ufal/vit-historical-page ↩ ↩2 ↩3 ↩4 ↩5 ↩6 ↩7

  2. https://huggingface.co/google/vit-base-patch16-224 ↩ ↩2

  3. https://huggingface.co/google/vit-base-patch16-384 ↩ ↩2

  4. https://huggingface.co/google/vit-large-patch16-384 ↩ ↩2

  5. https://developer.nvidia.com/cuda-python ↩ ↩2 ↩3

  6. https://docs.python.org/3/library/venv.html ↩

  7. https://imagemagick.org/script/download.php#windows ↩ ↩2

  8. https://www.ghostscript.com/releases/gsdnld.html ↩

  9. https://www.openoffice.org/download/ ↩

  10. https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments ↩

  11. https://pytorch.org/vision/0.20/transforms.html ↩ ↩2

  12. https://github.com/ufal/atrium-page-classification ↩

  13. https://ufal.mff.cuni.cz/home-page ↩ ↩2

  14. https://atrium-research.eu/ ↩ ↩2
