
Textual Entailment Recognition in Multilingual Text using Transfer Learning and Data Augmentation

Kaggle launched a competition, Contradictory, My Dear Watson, challenging machine learning practitioners to build a system that automatically classifies how pairs of sentences are related, using text in 15 diverse and under-represented languages. The aim of this capstone project is to create a multi-class classification system that detects entailment and contradiction in multilingual text using transfer learning and data augmentation.

The final model achieved 94% accuracy on the test dataset and ranked in the top 3% of the leaderboard at the time of the competition.

The final report with model visualizations and validation plots can be accessed here.

For a beginner's tutorial on implementing a baseline model for textual entailment recognition, use this notebook.
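As a rough sketch of what such a baseline involves, a pretrained multilingual transformer can be fine-tuned on premise-hypothesis pairs with three output labels. The checkpoint, label mapping, and hyperparameters below are illustrative assumptions, not the notebook's actual settings:

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

MODEL_NAME = "bert-base-multilingual-cased"  # illustrative checkpoint, not the project's final model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

# Toy premise-hypothesis pairs; 0 = entailment, 1 = neutral, 2 = contradiction.
premises = ["He is sleeping.", "The cat sat on the mat."]
hypotheses = ["He is awake.", "An animal is on the mat."]
labels = tf.constant([2, 0])

# Each pair is encoded as a single sequence separated by the model's special tokens.
encodings = tokenizer(premises, hypotheses, padding=True, truncation=True,
                      max_length=128, return_tensors="tf")

# Following the Hugging Face Keras fine-tuning pattern: no loss is passed,
# so the model computes its classification loss internally from the labels.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5))
model.fit(dict(encodings), labels, epochs=1, batch_size=2)
```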

Dependencies

The project requires Python 3.6 and the latest versions of the following libraries:

The models were trained on Tensor Processing Units (TPUs) with 8 cores. TPUs are hardware accelerators specialized for deep learning tasks and are available for free on Kaggle. All implementations were written in Python using both the TensorFlow and PyTorch frameworks.
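For reference, a minimal sketch of the standard TPU initialization used in Kaggle TensorFlow notebooks (the training scripts may set this up differently):

```python
import tensorflow as tf

# Detect and initialize the TPU attached to the Kaggle session;
# fall back to the default CPU/GPU strategy if none is available.
try:
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)
except (ValueError, tf.errors.NotFoundError):
    strategy = tf.distribute.get_strategy()

print("Replicas in sync:", strategy.num_replicas_in_sync)  # 8 on an 8-core TPU

# Model construction and compilation should then happen inside the scope:
# with strategy.scope():
#     model = build_model()
```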

Data

The data files can be accessed from the Data folder. The following figure displays the data augmentation workflow used to improve model performance across all the languages and language families.
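As an illustration of how the augmented data can be combined with the original training set (a sketch only; the file names are taken from the Run section below, and the column layout is assumed to match):

```python
import pandas as pd

# Original premise-hypothesis pairs with labels.
train = pd.read_csv("data/train.csv")

# Back-translated copies of the training pairs (file name from the Run section);
# the columns are assumed to match train.csv.
back_translated = pd.read_csv("data/back_translation_all.csv")

# Concatenate the original and augmented rows and shuffle before training.
augmented_train = (
    pd.concat([train, back_translated], ignore_index=True)
      .sample(frac=1.0, random_state=42)
      .reset_index(drop=True)
)
```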

Code

The notebooks folder contains all the Jupyter notebooks for the baselines, data augmentation, and fine-tuning. The scripts folder contains the code to train the models.

Run

Run the Python files from the scripts directory using the following command, supplying the necessary argument values:

```
python run.py --train-file data/train.csv --test-file data/test.csv --bt-file data/back_translation_all.csv
```
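For reference, a hypothetical sketch of how run.py might parse these flags (the actual script may differ):

```python
import argparse

def parse_args():
    # Flag names mirror the command above; help strings are illustrative.
    parser = argparse.ArgumentParser(description="Train the entailment model.")
    parser.add_argument("--train-file", required=True, help="Path to the training CSV.")
    parser.add_argument("--test-file", required=True, help="Path to the test CSV.")
    parser.add_argument("--bt-file", help="Optional back-translation CSV used for augmentation.")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args.train_file, args.test_file, args.bt_file)
```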

To open the .ipynb files in your browser and inspect the output of the completed cells, run the following command in your terminal after changing the working directory to textual-entailment-recognition/notebooks:

```
jupyter notebook <file_name>.ipynb
```

Results

The performance of the trained models was evaluated across all output classes, languages, and language families using a held-out subset. You can also check the report, which contains an in-depth analysis.
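As a small sketch of how such a per-language breakdown can be computed (the column names here are assumptions for illustration, not the project's actual schema):

```python
import pandas as pd

# Held-out predictions; "language", "label", and "prediction" are assumed column names.
results = pd.DataFrame({
    "language":   ["English", "English", "Hindi", "Hindi"],
    "label":      [0, 1, 2, 0],
    "prediction": [0, 1, 2, 1],
})

results["correct"] = results["label"] == results["prediction"]

# Accuracy broken down by language; the same groupby works for output
# classes or language families given an appropriate column.
per_language_accuracy = results.groupby("language")["correct"].mean()
print(per_language_accuracy)
```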