Scope: processing of images, training and evaluation of the ViT model, input file/directory processing, class πͺ§ (category) output of the top N predictions, summarizing predictions into a tabular format, HF π hub 1 π support for the model, and multiplatform (Windows/Linux) data preparation scripts for PDF-to-PNG conversion
- Versions π
- Model description π
- How to install π§
- How to run prediction πͺ modes
- Results π
- Data preparation π¦
- For developers πͺ
- Contacts π§
- Acknowledgements π
- Appendix π€
There are currently five versions of the model available for download. All of them share the same set of categories πͺ§, but they differ in data annotations and base model. The latest approved version, v2.1, is considered the default and can be found in the main branch of the HF π hub 1 π.
| Version | Base | Pages | PDFs | Description |
|---|---|---|---|---|
| v2.0 | vit-base-patch16-224 | 10073 | 3896 | annotations with mistakes, more heterogeneous data |
| v2.1 | vit-base-patch16-224 | 11940 | 5002 | main: more diverse pages in each category, fewer annotation mistakes |
| v2.2 | vit-base-patch16-224 | 15855 | 5730 | same data as v2.1 plus some restored pages from v2.0 |
| v3.2 | vit-base-patch16-384 | 15855 | 5730 | same data as v2.2, but a slightly larger model base with higher resolution |
| v5.2 | vit-large-patch16-384 | 15855 | 5730 | same data as v2.2, but the largest model base with higher resolution |
Base model - size π

| Base model | Disk space |
|---|---|
| vit-base-patch16-224 | 344 MB |
| vit-base-patch16-384 | 345 MB |
| vit-large-patch16-384 | 1.2 GB |
π² Fine-tuned model repository: UFAL's vit-historical-page 1 π
π³ Base model repository: Google's vit-base-patch16-224, vit-base-patch16-384, vit-large-patch16-384 2 3 4 π
The model was trained on a manually βοΈ annotated dataset of historical documents, in particular on images of pages from archival documents whose paper sources were scanned into digital form.
The images contain various combinations of texts οΈπ, tables π, drawings π, and photos π - the categories πͺ§ described below were formed based on those archival documents. Page examples can be found in the category_samples π directory.
The key use case of the provided model and data processing pipeline is to classify an input PNG image, obtained from a scanned PDF paper source, into one of the categories - each of which triggers a different content-specific data processing pipeline.
In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the input data as machine-typed (old-style fonts), handwritten βοΈ, plain printed οΈπ text, or text structured in a tabular π format, as well as to mark the presence of printed π or drawn π graphic materials yet to be extracted from the page images.
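For readers who want to try the classifier outside of run.py, a minimal sketch using the plain transformers API could look like this (assuming the HF π hub repository 1 π exposes a standard image-classification config; the project's own flags and config.txt are not involved here):

```python
# Minimal sketch: classify one page image with the fine-tuned model pulled from the HF hub.
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch

repo = "ufal/vit-historical-page"  # fine-tuned model repository (footnote 1)
processor = AutoImageProcessor.from_pretrained(repo)
model = AutoModelForImageClassification.from_pretrained(repo)

image = Image.open("/full/path/to/file.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1)[0]

# TOP-3 guesses with their normalized scores
top = torch.topk(probs, k=3)
for score, idx in zip(top.values, top.indices):
    print(model.config.id2label[idx.item()], round(score.item(), 3))
```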
Training πͺ set of the model:
- 8950 images for v2.0
- 10745 images for v2.1
- 14565 images for v2.2, v3.2 and v5.2

(90% of all data; the proportion of categories πͺ§ is tabulated below)

Evaluation π set: 1290 images (taken from the v2.2 annotations)

(10% of all data; same proportion of categories πͺ§ as below, demonstrated in model_EVAL.csv π)
Manual βοΈ annotation was performed beforehand and took some time β. The categories πͺ§ were formed from different sources of archival documents originating in the 1920-2020 time span.
Note
Disproportion of the categories πͺ§ in both training data and provided evaluation category_samples π is NOT intentional, but rather a result of the source data nature.
In total, several thousand separate PDF files were selected and split into PNG pages. Around 4k of the scanned documents were one page long, covering roughly a third of all data, while around 2k of them were much longer (dozens or hundreds of pages), covering the rest (more than 60% of all annotated data).
The specific content and language of the source data is irrelevant considering the model's vision resolution; however, all of the data samples came from archaeological reports, which may somehow affect the drawing detection preferences, since the commonly depicted objects are ceramic pieces, arrowheads, and rocks, formerly drawn by hand and later illustrated with digital tools (examples can be found in category_samples/DRAW π).
| Label | Description |
|---|---|
| DRAW | π - drawings, maps, paintings, schematics, or graphics, potentially containing some text labels or captions |
| DRAW_L | ππ - drawings, etc., but presented within a table-like layout or including a legend formatted as a table |
| LINE_HW | βοΈπ - handwritten text organized in a tabular or form-like structure |
| LINE_P | π - printed text organized in a tabular or form-like structure |
| LINE_T | π - machine-typed text organized in a tabular or form-like structure |
| PHOTO | π - photographs or photographic cutouts, potentially with text captions |
| PHOTO_L | ππ - photos presented within a table-like layout or accompanied by tabular annotations |
| TEXT | π° - mixtures of printed, handwritten, and/or typed text, potentially with minor graphical elements |
| TEXT_HW | βοΈπ - only handwritten text in paragraph or block form (non-tabular) |
| TEXT_P | π - only printed text in paragraph or block form (non-tabular) |
| TEXT_T | π - only machine-typed text in paragraph or block form (non-tabular) |
The categories were chosen to sort the pages by the following criteria:
- presence of graphical elements (drawings π OR photos π)
- type of text π (handwritten βοΈοΈ OR printed OR typed OR mixed π°)
- presence of tabular layout / forms π
The reasons for such distinction are different processing pipelines for different types of pages, which would be applied after the classification as mentioned above.
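Purely as an illustration of such routing (the downstream pipeline names below are hypothetical placeholders, not part of this project):

```python
# Illustration only: route a predicted label to a downstream, content-specific pipeline.
ROUTES = {
    "TEXT_HW": "handwritten-text OCR",
    "TEXT_P":  "printed-text OCR",
    "TEXT_T":  "typewritten-text OCR",
    "TEXT":    "mixed-text OCR",
    "LINE_HW": "table/form extraction + handwritten-text OCR",
    "LINE_P":  "table/form extraction + printed-text OCR",
    "LINE_T":  "table/form extraction + typewritten-text OCR",
    "DRAW":    "graphics extraction",
    "DRAW_L":  "graphics extraction + table/legend parsing",
    "PHOTO":   "photo extraction",
    "PHOTO_L": "photo extraction + table/caption parsing",
}

def route(label: str) -> str:
    """Return the name of the content-specific pipeline for a predicted label."""
    return ROUTES.get(label, "manual review")

print(route("DRAW_L"))  # -> graphics extraction + table/legend parsing
```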
Examples of pages sorted by category πͺ§ can be found in the category_samples π directory, which also serves as a testing subset of the training data (it can be used to run evaluation and prediction; the --inner flag is then necessary).
Step-by-step instructions on installing this program are provided here. The easiest way to obtain the model is to use the HF π hub repository 1 π, which can be easily accessed via this project.
Hardware requirements π
Minimal machine π₯οΈ requirements for a slow prediction run (and very slow training / evaluation):
- CPU with a decent (above-average) amount of operational memory

Ideal machine π₯οΈ requirements for fast prediction (and relatively fast training / evaluation):
- any CPU and a reasonable amount of memory
- GPU (for actual CUDA 5 support, only one of NVIDIA's cards)
Warning
Make sure you have Python version 3.10+ installed on your machine π» and check the hardware requirements for correct program running provided above. Then create a separate virtual environment for this project.
How to π
Clone this project to your local machine π₯οΈοΈ via:
cd /local/folder/for/this/project
git init
git clone https://github.com/ufal/atrium-page-classification.git
OR, to update an already cloned project that has local changes, go to the folder containing the (hidden) .git subdirectory and pull, which will merge the incoming files with your local changes:
cd /local/folder/for/this/project/atrium-page-classification
git add <changed_file>
git commit -m 'local changes'
git pull -X theirs
Alternatively, if you do NOT care about local changes OR you just want to get the latest project files, remove those files (all .py, .txt and README files) and pull the latest version from the repository:
cd /local/folder/for/this/project/atrium-page-classification
rm *.py
rm *.txt
rm README*
git pull
The next step is to create a virtual environment. Follow the Unix / Windows-specific instructions in the venv docs 6 ππ if you don't know how.
After creating the venv folder, activate the environment via:
source <your_venv_dir>/bin/activate
and then inside your virtual environment, you should install Python libraries (takes time β)
Caution
Up to 1 GB of space is needed for the model files and checkpoints, and up to 7 GB for the Python libraries (PyTorch and its dependencies, etc.)
Installation of Python dependencies can be done via:
pip install -r requirements.txt
Note
CUDA 5 support for Python's PyTorch library is supposed to be installed automatically at this point - the presence of a GPU on your machine π₯οΈ is checked here for the first time, and it is checked again every time before model initialization (for a training, evaluation, or prediction run).
After the dependencies installation is finished successfully, in the same virtual environment, you can run the Python program.
To test that everything works okay and to see the flag descriptions, call for --help β:
python3 run.py -h
You should see a (hopefully) helpful message about all available command line flags. Your next step would be to pull the model from the HF π hub repository 1 π via:
python3 run.py --hf
OR, for a specific model version (e.g. main, v2.0 or vX.2), use the --revision flag:
python3 run.py --hf -rev v2.0
OR, for a specific base model version (e.g. google/vit-large-patch16-384), use the --base flag (only when the trained model version demands such a base model, as described above):
python3 run.py --hf -rev v5.2 -b google/vit-large-patch16-384
Important
If you already have the model files in the model/model_<revision> directory next to this file, you do NOT have to use the --hf flag to download the model files from the HF π repo 1 π (use it only for a model version update).
You should see a message about loading the model from the hub and then saving it locally on your machine π₯οΈ.
Only after you have obtained the trained model files (this takes less time β than installing the dependencies) can you play with any of the commands provided below.
After the model is downloaded, you should see a similar file structure:
Initial project tree π³ files structure π
```
/local/folder/for/this/project/atrium-page-classification
├── model
│   └── model_<revision>
│       ├── config.json
│       ├── model.safetensors
│       └── preprocessor_config.json
├── checkpoint
│   ├── models--google--vit-base-patch16-224
│   │   ├── blobs
│   │   ├── snapshots
│   │   └── refs
│   └── .locks
│       └── models--google--vit-base-patch16-224
├── data_scripts
│   ├── windows
│   │   ├── move_single.bat
│   │   ├── pdf2png.bat
│   │   └── sort.bat
│   └── unix
│       ├── move_single.sh
│       ├── pdf2png.sh
│       └── sort.sh
├── result
│   ├── plots
│   │   ├── date-time_conf_mat.png
│   │   └── ...
│   └── tables
│       ├── date-time_TOP-N.csv
│       ├── date-time_TOP-N_EVAL.csv
│       ├── date-time_EVAL_RAW.csv
│       └── ...
├── category_samples
│   ├── DRAW
│   │   ├── CTX193200994-24.png
│   │   └── ...
│   └── DRAW_L
│       └── ...
├── run.py
├── classifier.py
├── utils.py
├── requirements.txt
├── config.txt
├── README.md
└── ...
```
Some of the folders may be missing, such as the later-mentioned model_output, which is created automatically only after the model is launched.
There are two main ways to run the program:
- Single PNG file classification π
- Directory with PNG files classification π
To begin with, open config.txt β and change the folder path in the [INPUT] section, then optionally change top_N and batch in the [SETUP] section.
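For orientation, the relevant config.txt β entries could look roughly like the sketch below; the section names and the top_N / batch variables are taken from this README, but the exact key names and defaults should be checked in the shipped file itself:

```
[INPUT]
directory = /full/path/to/directory/with/png/files

[SETUP]
top_N = 3
batch = 8
```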
Note
Top-3 is enough to cover most of the images; setting Top-5 will help with a small number of difficult-to-classify samples. The batch variable value depends on your machine's π₯οΈ memory size.
Rough estimations of memory usage per batch size π
| Batch size | CPU / GPU memory usage |
|---|---|
| 4 | 2 GB |
| 8 | 3 GB |
| 16 | 5 GB |
| 32 | 9 GB |
| 64 | 17 GB |
It is safe to use a batch size below 12 on a regular office desktop computer, and to lower it to 4 on an old device. For training on a High Performance Computing cluster, you may use values above 20 for the batch variable in the [SETUP] section.
Caution
Do NOT try to change base_model and other section contents unless you know what you are doing
Rough estimations of disk space needed for trained model in relation to the base model π
| Base model | Disk space |
|---|---|
| vit-base-patch16-224 | 344 MB |
| vit-base-patch16-384 | 345 MB |
| vit-large-patch16-384 | 1.2 GB |
Make sure that the virtual environment with all the installed libraries is activated and that you are in the project directory containing the Python files; only then proceed.
How to π
cd /local/folder/for/this/project/
source <your_venv_dir>/bin/activate
cd atrium-page-classification
Important
All the commands listed below for running the Python scripts are written for Unix consoles; Windows users must use python instead of the python3 syntax.
The following prediction should be run using the -f or --file flag with a path argument. Optionally, you can use the -tn or --topn flag with the number of guesses you want to get, and the -m or --model flag with a path to the model folder as its argument.
How to π
Run the program from its starting point run.py π with optional flags:
python3 run.py -tn 3 -f '/full/path/to/file.png' -m '/full/path/to/model/folder'
for exactly TOP-3 guesses with a console output.
OR if you are sure about default variables set in the config.txt β:
python3 run.py -f '/full/path/to/file.png'
to run a single PNG file classification - the output will be in the console.
Note
Console output and all result tables contain normalized scores for the top N classes πͺ§
The following prediction type does NOT require explicitly setting the directory path with -d or --directory, since its default value is set in the config.txt β file and is used when the --dir flag is given. The same flags for the number of guesses and the model folder path as for single-page processing can be used. In addition, two directory-specific flags, --inner and --raw, are available.
Caution
You must either explicitly set the -d flag's argument or use the --dir flag (which falls back to the default input directory preset in the [INPUT] section) to process PNG files on the directory level; otherwise, nothing will happen.
Note that directory-level π processing is performed in batches; therefore, refer to the hardware memory requirements for different batch sizes tabulated above.
How to π
python3 run.py -tn 3 -d '/full/path/to/directory' -m '/full/path/to/model/folder'
for exactly TOP-3 guesses in tabular format from all images found in the given directory.
OR if you are really sure about default variables set in the config.txt β:
python3 run.py --dir
python3 run.py -rev v3.2 -b google/vit-base-patch16-384 --inner --dir
The classification results of PNG pages collected from the directory will be saved πΎ to the related result π folders defined in the [OUTPUT] section of the config.txt β file.
Tip
To additionally get raw class πͺ§ probabilities from the model along with the TOP-N results, use the --raw flag when processing the directory (NOT available for single-file processing)
Tip
To process all PNG files in the directory AND its subdirectories, use the --inner flag when processing the directory, or switch its default value to True in the [SETUP] section
Naturally, processing a large number of PNG pages takes time β, and the progress of this process is recorded in the console via messages like Processed <B×N> images, where B is the batch size set in the [SETUP] section of the config.txt β file and N is the iteration of the current dataloader processing loop. Only after all images from the input directory are processed is the output table saved πΎ in the result/tables folder.
There are accuracy performance measurements and confusion matrix plots for the evaluation dataset (10% of the data provided in the [TRAIN] folder). Both the graphic plots and the result tables can be found in the result π folder.
Evaluation set's accuracy (Top-3) π:
- v2.0: 95.58%
- v2.1: 99.84%
- v2.2: 100.00%

Evaluation set's accuracy (Top-1) π:
- v2.0: 84.96%
- v2.1: 96.36%
- v2.2: 99.61%
The confusion matrices provided above show matching gold and predicted categories πͺ§ on the diagonal, while their off-diagonal elements show inter-class errors. From these plots you can judge what types of mistakes to expect from your model.
By running tests on the evaluation dataset after training you can generate the following output files:
- date-time_model_TOP-N_EVAL.csv - (by default) results of the evaluation dataset with TOP-N guesses
- date-time_model_conf_mat_TOP-N.png - (by default) confusion matrix plot for the evaluation dataset also with TOP-N guesses
- date-time_model_EVAL_RAW.csv - (by the --raw flag) raw probabilities for all classes of the evaluation dataset
Note
Generated tables will be sorted by FILE and PAGE number columns in ascending order.
Additionally, results of prediction inference runs on the directory level, which were not manually checked, are included.
General result tables π
Demo files v2.0:

- Manually βοΈ checked (small): model_TOP-5.csv π
- Manually βοΈ checked evaluation dataset (TOP-3): model_TOP-3_EVAL.csv π
- Manually βοΈ checked evaluation dataset (TOP-1): model_TOP-1_EVAL.csv π
- Unchecked with TRUE values: model_TOP-5.csv π
- Unchecked with TRUE values (small): model_TOP-3.csv π

Demo files v2.1:

- Manually βοΈ checked evaluation dataset (TOP-3): model_TOP-3_EVAL.csv π
- Manually βοΈ checked evaluation dataset (TOP-1): model_TOP-1_EVAL.csv π
- Unchecked with TRUE values: model_TOP-3.csv π
- Unchecked with TRUE values (small): model_TOP-3.csv π

Demo files v2.2:

- Manually βοΈ checked evaluation dataset (TOP-3): model_TOP-3_EVAL.csv π
- Manually βοΈ checked evaluation dataset (TOP-1): model_TOP-1_EVAL.csv π
With the following columns π:
- FILE - name of the file
- PAGE - number of the page
- CLASS-N - label of the category πͺ§, guess TOP-N
- SCORE-N - score of the category πͺ§, guess TOP-N
and optionally
- TRUE - actual label of the category πͺ§
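As a quick, optional way to inspect such a table (not part of the project's scripts), a small pandas sketch could be used, assuming the columns listed above with N spelled out as CLASS-1, SCORE-1, etc., and an example file path:

```python
# Sketch: inspect a generated TOP-N table with pandas.
import pandas as pd

df = pd.read_csv("result/tables/date-time_TOP-N.csv")

# distribution of the best (TOP-1) guesses
print(df["CLASS-1"].value_counts())

# if the TRUE column is present, compute a simple TOP-1 accuracy
if "TRUE" in df.columns:
    print("Top-1 accuracy:", (df["CLASS-1"] == df["TRUE"]).mean())
```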
Raw result tables π
Demo files v2.0:

- Manually βοΈ checked evaluation dataset RAW: model_RAW_EVAL.csv π
- Unchecked with TRUE values RAW: model_RAW.csv π
- Unchecked with TRUE values (small) RAW: model_RAW.csv π

Demo files v2.1:

- Manually βοΈ checked evaluation dataset RAW: model_RAW_EVAL.csv π
- Unchecked with TRUE values RAW: model_RAW.csv π
- Unchecked with TRUE values (small) RAW: model_RAW.csv π

Demo files v2.2:

- Manually βοΈ checked evaluation dataset RAW: model_RAW_EVAL.csv π
With the following columns π:
- FILE - name of the file
- PAGE - number of the page
- <CATEGORY_LABEL> - separate columns for each of the defined classes πͺ§
- TRUE - actual label of the category πͺ§
The reason to use the --raw flag is convenience during results review: the rows will be roughly sorted by category, and the most ambiguous pages will have many small non-zero probabilities, unlike the pages that are obvious (for the model), whose probabilities are concentrated in a single category πͺ§.
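A similar optional pandas sketch can pick the most probable category πͺ§ from a RAW table, assuming one probability column per label plus the FILE / PAGE (and optionally TRUE) columns described above, and an example file path:

```python
# Sketch: pick the most probable category from a RAW probabilities table.
import pandas as pd

raw = pd.read_csv("result/tables/date-time_EVAL_RAW.csv")
meta_cols = [c for c in ("FILE", "PAGE", "TRUE") if c in raw.columns]
prob_cols = [c for c in raw.columns if c not in meta_cols]

raw["BEST"] = raw[prob_cols].idxmax(axis=1)     # label with the highest probability
raw["BEST_SCORE"] = raw[prob_cols].max(axis=1)  # its probability
print(raw[meta_cols + ["BEST", "BEST_SCORE"]].head())
```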
You can use this section as a guide for creating your own dataset of pages, which will be suitable for further model processing.
There are useful multiplatform scripts in the data_scripts π folder for the whole process of data preparation.
Note
The .sh scripts are adapted for Unix OS and the .bat scripts for Windows OS, yet their functionality remains the same.
On Windows you must also install the following software before converting PDF documents to PNG images:
- ImageMagick 7 π - download and install the latest version
- Ghostscript 8 π - download and install the latest AGPL release (32- or 64-bit)
The source set of PDF documents must be converted to page-specific PNG images before processing. The following steps describe the procedure of converting PDF documents to PNG images suitable for training, evaluation, or prediction inference.
Firstly, copy the PDF-to-PNG converter script to the directory with PDF documents.
How to π
Windows:
move \local\folder\for\this\project\data_scripts\pdf2png.bat \full\path\to\your\folder\with\pdf\files
Unix:
cp /local/folder/for/this/project/data_scripts/pdf2png.sh /full/path/to/your/folder/with/pdf/files
Now check the content and comments in pdf2png.sh π or pdf2png.bat π script, and run it.
Important
You can optionally comment out the removal of processed PDF files from the script, yet this is NOT recommended if you are going to launch the script several times from the same location.
How to π
Windows:
cd \full\path\to\your\folder\with\pdf\files
pdf2png.bat
Unix:
cd /full/path/to/your/folder/with/pdf/files
pdf2png.sh
After the program is done, you will have a directory full of document-specific subdirectories containing page-specific images with a similar structure:
Unix folder tree π³ structure π
```
/full/path/to/your/folder/with/pdf/files
├── PdfFile1Name
│   ├── PdfFile1Name-001.png
│   ├── PdfFile1Name-002.png
│   └── ...
├── PdfFile2Name
│   ├── PdfFile2Name-01.png
│   ├── PdfFile2Name-02.png
│   └── ...
├── PdfFile3Name
│   └── PdfFile3Name-1.png
├── PdfFile4Name
└── ...
```
Note
The page numbers are padded with zeros (on the left) to match the length of the last page number in each PDF file; this is done automatically by the pdftoppm command used on Unix, while ImageMagick's 7 π convert command used on Windows does NOT pad the page numbers.
Windows folder tree π³ structure π
```
\full\path\to\your\folder\with\pdf\files
├── PdfFile1Name
│   ├── PdfFile1Name-1.png
│   ├── PdfFile1Name-2.png
│   └── ...
├── PdfFile2Name
│   ├── PdfFile2Name-1.png
│   ├── PdfFile2Name-2.png
│   └── ...
├── PdfFile3Name
│   └── PdfFile3Name-1.png
├── PdfFile4Name
└── ...
```
Optionally you can use the move_single.sh π or move_single.bat π script to move all PNG files from directories with a single PNG file inside to the common directory of one-pagers.
By default, the scripts assume that onepagers is the back-off directory for PDF documents that end up without a corresponding separate subdirectory of PNG pages in the PDF files directory (already converted to subdirectories of pages).
How to π
Windows:
move \local\folder\for\this\project\atrium-page-classification\data_scripts\move_single.bat \full\path\to\your\folder\with\pdf\files
cd \full\path\to\your\folder\with\pdf\files
move_single.bat
Unix:
cp /local/folder/for/this/project/atrium-page-classification/data_scripts/move_single.sh /full/path/to/your/folder/with/pdf/files
cd /full/path/to/your/folder/with/pdf/files
move_single.sh
The reason for this movement is simply convenience during the annotation process described below. These changes are also taken into account by the sort.sh π and sort.bat π scripts.
The generated PNG images of document pages are used to form the annotated gold data.
Note
It takes a lot of time β to collect at least several hundred examples per category.
Prepare a CSV table with exactly 3 columns:
- FILE - name of the PDF document which was the source of this page
- PAGE - number of the page (NOT padded with 0s)
- CLASS - label of the category πͺ§
Tip
Prepare equal-in-size categories πͺ§ if possible, so that the model will not be biased towards the over-represented labels πͺ§
For Windows users, it is NOT recommended to use MS Excel for writing CSV tables; a free alternative is Apache OpenOffice 9 π. For Unix users, the default LibreOffice Calc should be enough to correctly write a comma-separated CSV table.
Table in .csv format example π
FILE,PAGE,CLASS
PdfFile1Name,1,Label1
PdfFile2Name,9,Label1
PdfFile1Name,11,Label3
...
Cluster the annotated data into separate folders using the sort.sh π or sort.bat π script to copy data from the source folder to the training folder where each category πͺ§ has its own subdirectory. This division of PNG images will be used as gold data in training and evaluation.
Warning
It does NOT matter from which directory you launch the sorting script, but you must check the top of the script for (1) the path to the previously described CSV table with annotations, (2) the path to the previously described directory containing document-specific subdirectories of page-specific PNG pages, and (3) the path to the directory where you want to store the training data of label-specific directories with annotated page images.
How to π
Windows:
sort.bat
Unix:
sort.sh
After the program is done, you will have a directory full of label-specific subdirectories containing document-specific pages with a similar structure:
Unix folder tree π³ structure π
```
/full/path/to/your/folder/with/train/pages
├── Label1
│   ├── PdfFileAName-00N.png
│   ├── PdfFileBName-0M.png
│   └── ...
├── Label2
├── Label3
├── Label4
└── ...
```
Windows folder tree π³ structure π
```
\full\path\to\your\folder\with\train\pages
├── Label1
│   ├── PdfFileAName-N.png
│   ├── PdfFileBName-M.png
│   └── ...
├── Label2
├── Label3
├── Label4
└── ...
```
The sorting script can also help you moderate mislabeled samples before training. Accurate data annotation directly affects the model's performance.
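For illustration only, a minimal Python equivalent of what the sorting step does might look like the sketch below (the shipped sort.sh π / sort.bat π scripts remain the reference implementation; all paths are placeholders, and both padded and unpadded page numbers are handled):

```python
# Sketch: copy annotated pages into label-specific training subdirectories.
import csv, shutil
from pathlib import Path

annotations = Path("/full/path/to/annotations.csv")             # FILE,PAGE,CLASS table
pages_root = Path("/full/path/to/your/folder/with/pdf/files")   # document subdirectories
train_root = Path("/full/path/to/your/folder/with/train/pages")

with annotations.open(newline="") as fh:
    for row in csv.DictReader(fh):
        doc, page, label = row["FILE"], int(row["PAGE"]), row["CLASS"]
        # page numbers may be zero-padded (Unix pdftoppm) or not (Windows convert)
        matches = [p for p in (pages_root / doc).glob(f"{doc}-*.png")
                   if p.stem.rsplit("-", 1)[-1].isdigit()
                   and int(p.stem.rsplit("-", 1)[-1]) == page]
        if not matches:
            print(f"missing page {page} of {doc}")
            continue
        target = train_root / label
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(matches[0], target / matches[0].name)
```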
Before running the training, make sure to check the config.txt βοΈ file for the [TRAIN] section variables, where you should set a path to the data folder. Make sure the label directory names do NOT contain special characters such as spaces, tabs, or newlines.
Tip
In the config.txt βοΈ file, tweak the max_categ parameter for the maximum number of samples per category πͺ§, in case you have over-represented labels πͺ§ significantly dominating in size. Set max_categ higher than the number of samples in the largest category πͺ§ to use all data samples.
From this point, you can start model training or evaluation process.
You can use this project code as a base for your own image classification tasks. The detailed guide on the key phases of the whole process (settings, training, evaluation) is provided here.
Project files description ππ
| File name | Description |
|---|---|
| classifier.py | Model-specific classes and related functions, including predefined values for training arguments |
| utils.py | Task-related algorithms |
| run.py | Starting point of the program with its main function; can be edited to extend flags and function arguments |
| config.txt | Changeable variables for the program; should be edited |
Most of the changeable variables are in the config.txt β file, specifically in the [TRAIN], [HF], and [SETUP] sections. In the dev sections of the configuration β file, you will find many boolean variables that can be changed from the default False to True; however, it is recommended to enable those variables solely through the specific command-line flags implemented for each of them.
For more detailed training process adjustments refer to the related functions in classifier.py π file, where you will find some predefined values not used in the run.py π file.
Important
For both training and evaluation, you must make sure that the training pages directory is set correctly in the config.txt β and that it contains category πͺ§ subdirectories with images inside. The names of the category πͺ§ subdirectories are sorted in alphabetical order, become the actual label names, and replace the default categories πͺ§ list.
Device π₯οΈ requirements for training / evaluation:
- any CPU and a reasonable amount of memory
- GPU (for actual CUDA 5 support, preferably one of NVIDIA's cards)

Note that efficient training is possible only with a CUDA-compatible GPU card.
Rough estimations of memory usage π
| Batch size | CPU / GPU memory usage |
|---|---|
| 4 | 2 GB |
| 8 | 3 GB |
| 16 | 5 GB |
| 32 | 9 GB |
| 64 | 17 GB |
For test launches on a CPU-only device π₯οΈ, you should set the batch size lower than 4, and even then an above-average CPU memory capacity is a must-have to avoid a total system crash.
To train the model run:
python3 run.py --train
The training process automatically logs progress to the console and should take approximately 5-12 hours, depending on your machine's π₯οΈ CPU / GPU memory size and the size of the prepared dataset.
Tip
Run the training with the default hyperparameters if you have at least ~10,000 and fewer than 50,000 page samples that are very similar to the initial source data. In that case, no further changes are required to fine-tune the model for the same task on an expanded (or new) dataset of document pages; even the number of categories πͺ§ does NOT matter as long as it stays under 20.
Training hyperparameters π
- eval_strategy "epoch"
- save_strategy "epoch"
- learning_rate 5e-5
- per_device_train_batch_size 8
- per_device_eval_batch_size 8
- num_train_epochs 3
- warmup_ratio 0.1
- logging_steps 10
- load_best_model_at_end True
- metric_for_best_model "accuracy"
Above are the default hyperparameters, or TrainingArguments 10, used in the training process. Of these, only epoch and log_step can be changed in the [TRAIN] section, plus batch in the [SETUP] section, of the config.txt β file.
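For reference, the defaults listed above roughly correspond to a TrainingArguments 10 construction like the following sketch (the actual wiring lives in classifier.py π and may differ in detail; a recent transformers version is assumed):

```python
# Sketch: the default hyperparameters listed above as a transformers TrainingArguments object.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="model_output",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```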
You are free to play with the learning rate right in the training function arguments called in the run.py π file, yet warmup ratio and other hyperparameters are accessible only through the classifier.py π file.
Playing with the training hyperparameters is recommended only if the training πͺ loss (error rate) descends too slowly to reach values around 0.001 by the end of the 3rd (by default, the last) epoch.
In case the evaluation π loss starts to go up steadily after a previous descent, you have reached the limit of useful epochs, and next time you should set epochs to the number of the epoch that ended successfully before you noticed the evaluation loss growing.
During training, image transformations 11 are applied sequentially, each with a 50% chance.
Note
No rotation, reshaping, or flipping was applied to the images; mainly color manipulations were used. The reasons behind this are pages containing specific form types, the general text orientation on the pages, and the default reshaping of the model input to square 224x224 resolution images.
Image preprocessing steps π
- transforms.ColorJitter(brightness=0.5)
- transforms.ColorJitter(contrast=0.5)
- transforms.ColorJitter(saturation=0.5)
- transforms.ColorJitter(hue=0.5)
- transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
- transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
You can read more about selecting image transformations and the available options in the PyTorch torchvision docs 11.
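Combined, the augmentations listed above could be composed, for example, as in the following sketch, with each transform applied independently with a 50% chance (an illustration, not the project's exact code):

```python
# Sketch: composing the listed augmentations so that each is applied with a 50% chance.
import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(brightness=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.5)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(hue=0.5)], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))], p=0.5),
    transforms.RandomApply([transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))], p=0.5),
])
```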
After training is complete, the model will be saved πΎ to its own subdirectory in the model directory. By default, the naming of the model folder corresponds to the length of its training batch dataloader and the number of epochs: for example, model_<S/B>_E, where E is the number of epochs, B is the batch size, and S is the size of your training dataset (by default, 90% of the data provided in the [TRAIN] folder).
Full project tree π³ files structure π
```
/local/folder/for/this/project/atrium-page-classification
├── model
│   ├── model_v<HFrevision1>
│   │   ├── config.json
│   │   ├── model.safetensors
│   │   └── preprocessor_config.json
│   ├── model_v<HFrevision2>
│   └── ...
├── checkpoint
│   ├── models--google--vit-base-patch16-224
│   │   ├── blobs
│   │   ├── snapshots
│   │   └── refs
│   └── .locks
│       └── models--google--vit-base-patch16-224
├── model_output
│   ├── checkpoint-version1
│   │   ├── config.json
│   │   ├── model.safetensors
│   │   ├── trainer_state.json
│   │   ├── optimizer.pt
│   │   ├── scheduler.pt
│   │   ├── rng_state.pth
│   │   └── training_args.bin
│   ├── checkpoint-version2
│   └── ...
├── data_scripts
│   ├── windows
│   └── unix
├── result
│   ├── plots
│   └── tables
├── category_samples
│   ├── DRAW
│   ├── DRAW_L
│   └── ...
├── run.py
├── classifier.py
├── utils.py
└── ...
```
Important
The model_<revision> folder naming is generated from the HF π repo 1 π revision value and does NOT affect the trained model naming; other training parameters do.
The length of the dataloader depends not only on the size of the dataset but also on the preset batch size and the test subset ratio. You can slightly change the test_size and/or the batch variable value in the config.txt β file to train a differently named model on the same dataset. Alternatively, adjust the model naming generation in classifier.py's π training function.
After the fine-tuned model is saved πΎ, you can explicitly call for evaluation of the model to get a table of TOP-N classes for the randomly composed subset (10% in size by default) of the training page folder.
There is an option of setting test_size to 0.8 and using all the pages sorted by category provided in the [TRAIN] folder for evaluation, but do NOT launch it on the whole training data that was actually used for training the evaluated model.
To do this in the unchanged configuration β, automatically create a confusion matrix plot π, and additionally get a raw class probabilities table, run:
python3 run.py --eval --raw
OR, when you don't remember the specific [SETUP] and [TRAIN] variable values used for the trained model, you can use:
python3 run.py --eval -m './model/model_<your_model_number_code>'
Finally, when your model is trained and you are happy with its performance tests, you can uncomment a code line in the run.py π file for the HF π hub model push. This functionality has already been implemented and can be accessed through the --hf flag, using the values set in the [HF] section for the token and repo_name variables.

In this case, you must rename the trained model folder according to the revision value (dots in the naming are skipped, e.g. revision v1.9.22 turns into the model_v1922 model folder), and only then run the repo push.
Caution
Set your own repo_name to an empty repository of yours on the HF π hub, then in the Settings of your HF π account find the Access Tokens section and generate a new token; copy and paste its value into the token variable. Before committing those config.txt β file changes via git, replace the full token value with its shortened version for security reasons.
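For orientation, a push step along these lines could look like the sketch below; the project's own implementation is the commented line in run.py π behind the --hf flag, and the repository name and token here are placeholders:

```python
# Sketch: pushing a locally saved model folder to the HF hub (outline only).
from transformers import AutoImageProcessor, AutoModelForImageClassification

local_dir = "./model/model_v1922"             # renamed per the revision value, as described above
repo_name = "your-username/your-empty-repo"   # placeholder, see repo_name in [HF]
token = "hf_..."                              # placeholder, see token in [HF]

model = AutoModelForImageClassification.from_pretrained(local_dir)
processor = AutoImageProcessor.from_pretrained(local_dir)
model.push_to_hub(repo_name, token=token, revision="v1.9.22")
processor.push_to_hub(repo_name, token=token, revision="v1.9.22")
```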
For support, write to [email protected], the contact responsible for this GitHub repository 12 π.
Information about the authors of this project, including their names and ORCIDs, can be found in the CITATION.cff π file.
- Developed by UFAL 13 π₯
- Funded by ATRIUM 14 π°
- Shared by ATRIUM 14 & UFAL 13 π
- Model type: fine-tuned ViT with 224x224 2 π or 384x384 3 4 π input resolution
Β©οΈ 2022 UFAL & ATRIUM
README emoji codes π
- π₯ - your computer
- πͺ§ - label/category/class
- π - page/file
- π - folder/directory
- π - generated diagrams or plots
- π³ - tree of file structure
- β - time-consuming process
- βοΈ - manual action
- π - performance measurement
- π - Hugging Face (HF)
- π§ - contacts
- π - click to see
- βοΈ - configuration/settings
- π - link to the internal file
- π - link to the external website
Content specific emoji codes π
- π - table content
- π - drawings/paintings/diagrams
- π - photos
- βοΈ - handwritten content
- π - text content
- π° - mixed types of text content, maybe with graphics
Decorative emojis π
- πππ§βΆπͺπͺοΈπ¦ππππ₯π¬π€ - decorative purpose only
Tip
Alternative version of this README file is available in README.html π webpage
Footnotes
1. https://huggingface.co/ufal/vit-historical-page
4. https://huggingface.co/google/vit-large-patch16-384
7. https://imagemagick.org/script/download.php#windows
10. https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments