
Language-Identification-Tun: a project to predict the language of a text (Arabic/English/French/Tunizi/code-switching)

General Introduction: What is language identification in NLP?

In natural language processing, language identification or language guessing is the problem of determining which natural language given content is in. Computational approaches to this problem view it as a special case of text categorization, solved with various statistical methods.


Project Workflow

We will be using the following workflow:

[figure: project workflow]

1. Data collection

[figure: data collection]

a. Scrape comments from YouTube

This task was done locally in a Python environment, using Selenium with a browser WebDriver.
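A minimal sketch of this step, assuming Chrome and the common "#content-text" comment selector (both are assumptions; YouTube's DOM changes often, and this is not the project's actual script):

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder video

driver = webdriver.Chrome()  # needs Chrome + chromedriver installed
driver.get(URL)
time.sleep(3)  # let the page load

# Scroll a few times so YouTube lazily loads more comments.
for _ in range(10):
    driver.execute_script(
        "window.scrollTo(0, document.documentElement.scrollHeight);")
    time.sleep(2)

# "#content-text" matched comment bodies at the time of writing;
# verify the selector before running.
comments = [el.text
            for el in driver.find_elements(By.CSS_SELECTOR, "#content-text")]
driver.quit()
print(len(comments), "comments scraped")
```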

b. Collecting Public Dataset

We use 2 different public datasets:

  1. TUNIZI: (dataset in Tunisian Arabizi) https://github.com/chaymafourati/TUNIZI-Sentiment-Analysis-Tunisian-Arabizi-Dataset
  2. TSAC: (mix of Arabic, French, and Arabizi) https://github.com/fbougares/TSAC

c. Annotation

This task was completed in class.

Attribute to each sentence a label from these 5 classes:

  • Arabic: all the letters/words are in Arabic
  • French: all the letters/words are in French
  • English: all the letters/words are in English
  • Tunizi: words are written in Tunisian Arabizi (Latin characters with numerals)
  • Code-switching: the sentence mixes two or more languages.

2. Data preparation

[figure: data preparation]

  • We will merge all the data files into one data frame and ensure the type of each column (see the sketch below)
  • The final data will contain just 2 columns: "text" and "label"
  • The "text" column should be string, the "label" column should be integer

3. Data cleaning

[figure: data cleaning]

In this step, we clean the text in our data file (a sketch follows the list):

  1. Delete duplicate rows and NaN values in the label column.
  2. Fix the column types (the "text" column must be string, the "label" column must be integer).
  3. Clean the text of URLs, emojis, punctuation (?,:!..), symbols, newlines and tabs. Example: To know more about this website: https://Hamza.example.com
  4. Remove accented characters: é, à, ...
  5. Reduce repeated characters: eyyyyyy (meaning "yes") ==> ey
  6. Remove extra whitespace: "How are you doing ?"
  7. Case conversion: str.lower()
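A sketch of these cleaning steps with standard-library regexes; the patterns below are one plausible implementation, not the project's exact code:

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)     # 3. URLs
    # 4. Strip accents (é -> e) while keeping non-Latin letters (Arabic).
    text = "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))
    # 3. Punctuation, symbols and emojis: anything that is not a word
    # character or whitespace. Digits stay: Tunizi needs them (3, 7, 9...).
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"(.)\1{2,}", r"\1", text)               # 5. eyyyyyy -> ey
    text = re.sub(r"\s+", " ", text).strip()               # 3./6. whitespace
    return text.lower()                                    # 7. lowercase

print(clean_text("To know more: https://Hamza.example.com eyyyyyy!"))
# -> "to know more ey"
```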

4. Data visualization

By writing some Python code, we can extract useful insights about our data.

[figures: graph1, graph2, graph3, graph4]

  • The data is not balanced: there is a major difference between the English/French classes and Tunizi in terms of the number of comments and the number of unique words.
  • Code-switching/Tunizi differ from the other classes: they have the largest number of unique words, and their max/mean sentence lengths are high.
  • The data must be balanced in terms of the number of words and of unique words (vocabulary) across all the text.
  • To estimate the number of comments we should add to each class, we will use a new per-class feature, alpha (computed in the sketch below):
  • alpha = (num of unique words in the class) / (mean_all_comments)
  • num of unique words: the number of unique words in each class
  • mean_all_comments: the mean length of all comments
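A sketch of how alpha could be computed per class with pandas, reusing df from the data-preparation step (the formula is the one just defined):

```python
import pandas as pd

def unique_words(comments: pd.Series) -> int:
    # Number of distinct words over all comments of one class.
    return len(set(" ".join(comments).split()))

# Mean length (in words) of all comments, over the whole dataset.
mean_all_comments = df["text"].str.split().str.len().mean()

# One alpha value per class.
alpha = df.groupby("label")["text"].apply(unique_words) / mean_all_comments
print(alpha)
```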

[figure: graph5]

We need data augmentation for the English/French/Arabic classes (+ ~2000 comments for Arabic, ~3500 comments each for English and French) and for the code-switching class (+ ~2000 comments).

5. Data augmentation

[figure: data augmentation]

We plan to balance the data distribution by adding more labeled data to the code-switching/English/French classes. There are many methods:

  1. Using public datasets
  2. Generate text using OpenAI or other tools
  3. Scrape more data from social media, blogs, journals or magazines
  4. Back translation / synonym replacement / random insertion / random swap / random deletion / sentence shuffling using the NLPAug library (https://neptune.ai/blog/data-augmentation-nlp)
  • Our French/English data does not contain a sufficient amount of text to apply the 4th technique
  • Scraping more text requires extra time to annotate and verify the data

The best solution is to use a public dataset for the English/French/Arabic text and text generation for the code-switching text.


a. Collecting a public dataset: from Hugging Face 🙂

Link: https://huggingface.co/datasets/papluca/language-identification/blob/main/train.csv

  • The data distribution: [figure]
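Loading it is straightforward with the `datasets` library; the "labels"/"text" column names and the ISO codes below are assumed to follow the dataset card, so verify them before filtering:

```python
from datasets import load_dataset

ds = load_dataset("papluca/language-identification", split="train")

# Keep only the languages we need to augment.
wanted = {"en", "fr", "ar"}
subset = ds.filter(lambda row: row["labels"] in wanted)
print(subset.num_rows, "comments kept")
```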

b. Generating data: code-switching text. We will discuss several tools ⚡

  1. OpenAI (https://openai.com/api/)
  2. Using a custom LSTM model to generate text: the training data will be our own Arabic/French/English text (https://www.analyticsvidhya.com/blog/2018/03/text-generation-using-python-nlp/)

We will try the OpenAI API.
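A sketch using the legacy (pre-1.0) `openai` Python package; the model name, prompt, and parameters are all assumptions to adapt:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = (
    "Write 5 short social-media comments that mix French and English "
    "in the same sentence (code-switching), one comment per line."
)

# Legacy completion call; model and parameters are assumptions.
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=200,
    temperature=0.9,  # higher temperature -> more varied comments
)
for comment in response["choices"][0]["text"].strip().splitlines():
    print(comment)
```

Generated comments still need a manual pass before they are labeled as code-switching and added to the training data.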

6. Data validation


We will validate our dataset before modeling.

  1. Deep data cleaning:

a. Clean text: URLs, emojis, punctuation (?,:!..), symbols, newlines and tabs ✊

b. Clean languages: validate each language's letters and convert numeric patterns to letters 🛑

c. Stop words: remove or keep? ❎

  • What are stop words? 🤔 The words which are generally filtered out before processing a natural language are called stop words. They are the most common words in any language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text.

  • As we mentioned, the stop words form a characteristic bag of words for each language, and that makes them an interesting bag of words!

  • So we can't remove them if we want to predict the language of a text.

  • Let's take an example: text = "Hamza is a clever person, mais he is stupid!"

  • This text is code-switched between English and French; if we remove the stop words ('mais' is a French stop word), the text becomes purely English!
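We can sanity-check this with NLTK's stop-word lists ('mais' really is in the French list):

```python
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

print("mais" in stopwords.words("french"))   # True: French stop word
print("is" in stopwords.words("english"))    # True: English stop word
# Removing both lists from "Hamza is a clever person, mais he is stupid!"
# would delete the very words that signal the English/French mix.
```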

  2. Data visualization 🎨


7. Data modeling: build the classifier


We will now apply machine learning to our data to predict the language of the text.

Our task is multi-class text classification; there are several methods:

  • the old-fashioned bag-of-words (with TF-IDF or CountVectorizer)
  • the cutting-edge language models (with BERT).


A. Using the old-fashioned bag-of-words (with TF-IDF or CountVectorizer) 🧯

  • The text feature extraction methods will be TF-IDF and count vectorization
  • The classifiers will be Stochastic Gradient Descent and Naive Bayes (see the pipeline sketch below)
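As a sketch, here is one of the four combinations (TF-IDF + SGD) as a scikit-learn pipeline, reusing df from the earlier steps; swap in CountVectorizer or MultinomialNB for the others. Hyperparameters are defaults, not tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Stratified split so every language keeps its proportion in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"],
    test_size=0.2, random_state=42, stratify=df["label"],
)

# TF-IDF features + a linear classifier trained with SGD.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=42)),
])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```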

a. Count vectorizer

i) Naive Bayes


The F1-score of the 4th class (code-switching) is too low: 0.61, while the other classes are predicted almost perfectly, so our problem is the code-switching label! Let's look at the other combinations.

ii) Stochastic Gradient Descent


b. TF-IDF

i) Naive Bayes


ii) Stochastic Gradient Descent


Interpretation:

Bag-of-words gives us a macro-averaged F1-score of 0.89, but poor results on the 4th label (recall = 0.77 for this class).

  • Best combination: TF-IDF + Stochastic Gradient Descent

B. Using the cutting edge Language models (with BERT) 🙂

  • In order to complete a text classification task, you can use BERT in 3 different ways:
  1. Train it from scratch and use it as a classifier.
  2. Extract the word embeddings and use them in an embedding layer (like I did with Word2Vec).
  3. Fine-tune the pre-trained model (transfer learning).

We will go with the latter and do transfer learning from a pre-trained, lighter version of BERT called DistilBERT (66 million parameters instead of 110 million!).

  • Then we will build the deep learning model with transfer learning from the pre-trained BERT. Basically, we will summarize the output of BERT into one vector with average pooling and then add two final dense layers to predict the probability of each language.
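A sketch of that architecture with the `transformers` library and Keras; the checkpoint name (a multilingual DistilBERT, which suits our mixed-language text), sequence length, and layer sizes are assumptions, not the project's exact configuration.

```python
import tensorflow as tf
from transformers import DistilBertTokenizer, TFDistilBertModel

MODEL = "distilbert-base-multilingual-cased"  # assumed checkpoint
MAX_LEN = 64                                   # assumed max sequence length

tokenizer = DistilBertTokenizer.from_pretrained(MODEL)
bert = TFDistilBertModel.from_pretrained(MODEL)

input_ids = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32)
attention_mask = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32)

# Last hidden states (batch, MAX_LEN, 768) -> one vector via average pooling.
hidden = bert(input_ids, attention_mask=attention_mask)[0]
pooled = tf.keras.layers.GlobalAveragePooling1D()(hidden)

# Two final dense layers, as described above; 5 output classes.
x = tf.keras.layers.Dense(64, activation="relu")(pooled)
outputs = tf.keras.layers.Dense(5, activation="softmax")(x)

model = tf.keras.Model([input_ids, attention_mask], outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# The tokenizer turns raw text into the two model inputs.
enc = tokenizer(["3ajbetni barcha, c'est top!"], padding="max_length",
                truncation=True, max_length=MAX_LEN, return_tensors="tf")
probs = model([enc["input_ids"], enc["attention_mask"]])
```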


Interpretation:

BERT gives us a macro-averaged F1-score of 0.91, and also a better result on the 4th class (recall = 0.86 for this class).

8. Conclusion

BERT reached an accuracy of 0.93 and a macro-averaged F1-score of 0.91.

We could go further, for example by using TunBERT (a BERT model pre-trained on Tunisian language data) to get more accurate results, especially between the code-switching and Tunizi classes.