These projects cover NLP & Machine Learning for text in English and Persian, carried out under the supervision of Dr. Hamidreza Baradaran Kashani at the University of Isfahan.
Text Preprocessing
Language models
RNN, GRU & LSTM
In this part, we prepare data to train a model.
- Text cleaning & emoji removal.
- Removing English characters and signs.
- Text normalization.
- Word-level tokenization.
- Stopword removal.
- Lemmatization.
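The cleaning and tokenization steps above can be sketched with plain regular expressions. This is a simplified, library-free illustration; the actual project relies on hazm/nltk for normalization, stopword removal, and lemmatization, and the sample sentence is only an example.

```python
import re

def clean_text(text):
    """A simplified sketch of the cleaning steps (the project itself uses hazm/nltk)."""
    text = re.sub(r"[A-Za-z0-9]", " ", text)  # remove English characters and digits
    text = re.sub(r"[^\w\s]", " ", text)      # remove signs, punctuation, and emojis (non-word symbols)
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace as light normalization

# word-level tokenization after cleaning
tokens = clean_text("سلام Hello! 😀 دنیا").split()  # → ['سلام', 'دنیا']
```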
- Text cleaning & removing web addresses and signs.
- Removing numbers and emojis.
- Word-level tokenization.
- Stopword removal.
- Displaying a word cloud.
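A minimal sketch of this pipeline, from cleaning to the word frequencies that a word cloud is drawn from. The stopword set and the sample sentence are illustrative only (the project uses nltk's full stopword list), and the frequencies here would be passed to something like wordcloud's `WordCloud.generate_from_frequencies`.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "of"}  # illustrative subset; the project uses nltk's full list

def preprocess(text):
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip web addresses
    text = re.sub(r"\d+", " ", text)                    # strip numbers
    text = re.sub(r"[^\w\s]", " ", text)                # strip signs, punctuation, emojis
    tokens = text.lower().split()                       # word-level tokenization
    return [t for t in tokens if t not in STOPWORDS]    # stopword removal

# word frequencies feed the word cloud
freq = Counter(preprocess("Visit https://example.com for 10 free samples of the free dataset!"))
```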
To run this project, install the required packages:
pip install numpy
pip install pandas
pip install matplotlib
pip install hazm
pip install nltk
pip install wordcloud
In this part, we implement elementary n-gram language models based on token probabilities.
- Text cleaning & preprocessing.
- Implementation of the unigram model.
- Implementation of the bigram model.
- Implementation of the trigram model.
- Showing the most probable unigram, bigram, and trigram combinations based on data.
- Calculating the probability & perplexity of test examples with unigram, bigram, and trigram models.
- Text completion with the unigram, bigram, and trigram models.
- POS tagging on the data.
- Counting occurrences of all tokens by POS tag.
- Showing the most frequent nouns.
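The bigram case of the steps above can be sketched in a few lines: estimate P(w2 | w1) from counts, score a sentence by perplexity, and complete text greedily. The toy corpus is made up for illustration, and real models would also need start/end tokens and smoothing for unseen bigrams.

```python
import math
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]  # toy data

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))

def bigram_prob(w1, w2):
    # maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

def perplexity(sentence):
    # inverse geometric mean of the bigram transition probabilities
    log_p = sum(math.log(bigram_prob(a, b)) for a, b in zip(sentence, sentence[1:]))
    return math.exp(-log_p / (len(sentence) - 1))

def complete(w1):
    # greedy text completion: pick the most probable next word
    candidates = {w2: c for (a, w2), c in bigrams.items() if a == w1}
    return max(candidates, key=candidates.get)

complete("the")                       # → 'cat'
perplexity(["the", "cat", "sat"])     # → sqrt(3) ≈ 1.732
```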
To run this project, install the required packages:
pip install numpy
pip install pandas
pip install hazm
In this part, we implement models for text classification.
- Text cleaning & preprocessing.
- Creating vocabulary.
- Encoding (mapping each word to its index in the vocabulary).
- Applying zero padding.
- Creating the train, validation, and test sets.
- Loading pretrained Word2Vec embeddings.
- Training an RNN and evaluating it on the test set.
- Training a GRU and evaluating it on the test set.
- Training an LSTM and evaluating it on the test set.
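The vocabulary, encoding, and zero-padding steps can be sketched as follows. The reserved indices and the toy sentences are assumptions for illustration; the real project builds its vocabulary from the cleaned corpus.

```python
PAD, UNK = 0, 1  # assumed convention: 0 for padding, 1 for out-of-vocabulary words

def build_vocab(sentences):
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for sent in sentences:
        for w in sent:
            vocab.setdefault(w, len(vocab))  # assign the next free index
    return vocab

def encode(sent, vocab, max_len):
    # map each word to its vocabulary index, truncate, then zero-pad to a fixed length
    ids = [vocab.get(w, UNK) for w in sent][:max_len]
    return ids + [PAD] * (max_len - len(ids))

train = [["good", "movie"], ["bad", "plot", "bad", "acting"]]  # toy sentences
vocab = build_vocab(train)
encoded = [encode(s, vocab, 5) for s in train]  # → [[2, 3, 0, 0, 0], [4, 5, 4, 6, 0]]
```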
Tips
- For word embedding, we use Word2Vec with 300 dimensions.
- The embedding layer of all three models is frozen.
- We move all models to the GPU and train them there, so CUDA must be available.
- The criterion for all models is the cross-entropy loss.
- The optimizer for all models is Adam.
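The LSTM variant of this setup can be sketched in PyTorch as below. All dimensions and the random batch are illustrative assumptions, not the project's actual configuration; the same skeleton applies to the RNN and GRU models by swapping the recurrent layer.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.embedding.weight.requires_grad = False  # frozen embedding layer
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        emb = self.embedding(x)          # (batch, seq_len, embed_dim)
        _, (h, _) = self.lstm(emb)       # h: (num_layers, batch, hidden_dim)
        return self.fc(h[-1])            # logits from the last hidden state

# train on GPU when CUDA is available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMClassifier(vocab_size=1000, embed_dim=300, hidden_dim=128, num_classes=2).to(device)

criterion = nn.CrossEntropyLoss()  # cross-entropy criterion
optimizer = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)  # Adam, skipping frozen weights

# one illustrative training step on a random batch (4 sequences of length 10)
batch = torch.randint(1, 1000, (4, 10), device=device)
labels = torch.randint(0, 2, (4,), device=device)
logits = model(batch)                       # shape (4, 2)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```

In practice the frozen embedding weights would be initialized from the pretrained 300-dimensional Word2Vec vectors (e.g. via `nn.Embedding.from_pretrained`).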
To run this project, install the required packages:
pip install numpy
pip install pandas
pip install nltk
pip install torch
pip install gensim