These projects cover NLP & Machine Learning for text in English and Persian, carried out under the supervision of Dr. Hamidreza Baradaran Kashani at the University of Isfahan.
Text Preprocessing
Language models
RNN, GRU & LSTM
In this part, we prepare data to train a model.
- Text cleaning & emoji removal.
- Removing English characters and signs.
- Text normalization.
- Word-level tokenization.
- Stopword removal.
- Lemmatization.
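The cleaning and tokenization steps above can be sketched with plain regular expressions. This is a simplified, library-free illustration; the actual project relies on hazm/nltk for normalization, stopword removal, and lemmatization, and the sample sentence is only an example.

```python
import re

def clean_text(text):
    """A simplified sketch of the cleaning steps (the project itself uses hazm/nltk)."""
    text = re.sub(r"[A-Za-z0-9]", " ", text)  # remove English characters and digits
    text = re.sub(r"[^\w\s]", " ", text)      # remove signs, punctuation, and emojis (non-word symbols)
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace as light normalization

# word-level tokenization after cleaning
tokens = clean_text("سلام Hello! 😀 دنیا").split()  # → ['سلام', 'دنیا']
```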
- Text cleaning & removing web addresses and signs.
- Removing numbers and emojis.
- Word-level tokenization.
- Stopword removal.
- Displaying a word cloud.
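A minimal sketch of this pipeline, from cleaning to the word frequencies that a word cloud is drawn from. The stopword set and the sample sentence are illustrative only (the project uses nltk's full stopword list), and the frequencies here would be passed to something like wordcloud's `WordCloud.generate_from_frequencies`.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "of"}  # illustrative subset; the project uses nltk's full list

def preprocess(text):
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip web addresses
    text = re.sub(r"\d+", " ", text)                    # strip numbers
    text = re.sub(r"[^\w\s]", " ", text)                # strip signs, punctuation, emojis
    tokens = text.lower().split()                       # word-level tokenization
    return [t for t in tokens if t not in STOPWORDS]    # stopword removal

# word frequencies feed the word cloud
freq = Counter(preprocess("Visit https://example.com for 10 free samples of the free dataset!"))
```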
To run this project, install the required packages:
pip install numpy
pip install pandas
pip install matplotlib
pip install hazm
pip install nltk
pip install wordcloud
In this part, we implement elementary n-gram language models based on token probabilities.
- Text cleaning & preprocessing.
- Implementation of the unigram model.
- Implementation of the bigram model.
- Implementation of the trigram model.
- Showing the most probable unigram, bigram, and trigram combinations based on data.
- Calculating the probability & perplexity of test examples with unigram, bigram, and trigram models.
- Text completion with the unigram, bigram, and trigram models.
- POS tagging on the data.
- Counting occurrences of all tokens by POS tag.
- Showing the most frequent nouns.
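The bigram case of the steps above can be sketched in a few lines: estimate P(w2 | w1) from counts, score a sentence by perplexity, and complete text greedily. The toy corpus is made up for illustration, and real models would also need start/end tokens and smoothing for unseen bigrams.

```python
import math
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]  # toy data

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))

def bigram_prob(w1, w2):
    # maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

def perplexity(sentence):
    # inverse geometric mean of the bigram transition probabilities
    log_p = sum(math.log(bigram_prob(a, b)) for a, b in zip(sentence, sentence[1:]))
    return math.exp(-log_p / (len(sentence) - 1))

def complete(w1):
    # greedy text completion: pick the most probable next word
    candidates = {w2: c for (a, w2), c in bigrams.items() if a == w1}
    return max(candidates, key=candidates.get)

complete("the")                       # → 'cat'
perplexity(["the", "cat", "sat"])     # → sqrt(3) ≈ 1.732
```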
To run this project, install the required packages:
pip install numpy
pip install pandas
pip install hazm
In this part, we implement models for text classification.
- Text cleaning & preprocessing.
- Creating vocabulary.
- Encoding (mapping each word to its index in the vocabulary).
- Applying zero padding.
- Creating the train, validation, and test sets.
- Loading pretrained Word2Vec embeddings.
- Training an RNN and evaluating it on the test set.
- Training a GRU and evaluating it on the test set.
- Training an LSTM and evaluating it on the test set.
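The vocabulary, encoding, and zero-padding steps can be sketched as follows. The reserved indices and the toy sentences are assumptions for illustration; the real project builds its vocabulary from the cleaned corpus.

```python
PAD, UNK = 0, 1  # assumed convention: 0 for padding, 1 for out-of-vocabulary words

def build_vocab(sentences):
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for sent in sentences:
        for w in sent:
            vocab.setdefault(w, len(vocab))  # assign the next free index
    return vocab

def encode(sent, vocab, max_len):
    # map each word to its vocabulary index, truncate, then zero-pad to a fixed length
    ids = [vocab.get(w, UNK) for w in sent][:max_len]
    return ids + [PAD] * (max_len - len(ids))

train = [["good", "movie"], ["bad", "plot", "bad", "acting"]]  # toy sentences
vocab = build_vocab(train)
encoded = [encode(s, vocab, 5) for s in train]  # → [[2, 3, 0, 0, 0], [4, 5, 4, 6, 0]]
```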
Tips
- For word embedding, we use Word2Vec with 300 dimensions.
- The embedding layer of all three models is frozen.
- We move all models to the GPU and train them there, so CUDA must be available.
- The criterion for all models is the cross-entropy loss.
- The optimizer for all models is Adam.
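The LSTM variant of this setup can be sketched in PyTorch as below. All dimensions and the random batch are illustrative assumptions, not the project's actual configuration; the same skeleton applies to the RNN and GRU models by swapping the recurrent layer.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.embedding.weight.requires_grad = False  # frozen embedding layer
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        emb = self.embedding(x)          # (batch, seq_len, embed_dim)
        _, (h, _) = self.lstm(emb)       # h: (num_layers, batch, hidden_dim)
        return self.fc(h[-1])            # logits from the last hidden state

# train on GPU when CUDA is available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMClassifier(vocab_size=1000, embed_dim=300, hidden_dim=128, num_classes=2).to(device)

criterion = nn.CrossEntropyLoss()  # cross-entropy criterion
optimizer = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)  # Adam, skipping frozen weights

# one illustrative training step on a random batch (4 sequences of length 10)
batch = torch.randint(1, 1000, (4, 10), device=device)
labels = torch.randint(0, 2, (4,), device=device)
logits = model(batch)                       # shape (4, 2)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```

In practice the frozen embedding weights would be initialized from the pretrained 300-dimensional Word2Vec vectors (e.g. via `nn.Embedding.from_pretrained`).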
To run this project, install the required packages:
pip install numpy
pip install pandas
pip install nltk
pip install torch
pip install gensim