This project aims to develop machine learning models for Arabic content using advanced techniques such as LSTM, BERT, and BiLSTM. The primary objective is to explore and implement state-of-the-art methods for text classification and analysis in the Arabic language domain.
The dataset used in this project was collected by web scraping various student repositories on GitHub. The aim was to compile a comprehensive dataset for further analysis and processing. The data collection process involved gathering information from multiple sources, merging them, and performing the preprocessing steps needed to ensure data consistency and quality.
The data was primarily sourced from student repositories on GitHub. Each repository contained Arabic-language articles and news pieces covering a variety of topics.
The dataset comprises the following features:
- Title: The title of the article.
- Label: The label assigned to the article, indicating its category (real | fake). Labels may vary in language, punctuation, and synonyms across sources.
- Topic: The topic of the article, which may also vary in language, punctuation, and synonyms; these were normalized to make merging easier.
- Origine: The origin of the article, likely referring to the GitHub repository from which it was sourced.
- Article_date: The publication date of the article (if available).
- Article_content: The main content of the article (if available).
- Article_correction: Any correction to the article content when the article was fake (if available).
Before using the dataset, several preprocessing steps were applied to ensure consistency and ease of use. This included translating topic names and labels, as well as standardizing punctuation and synonyms across different instances of the dataset. Additionally, some instances had to be dropped to keep the final data reliable.
The data preprocessing step involves cleaning and preparing the dataset for training. This includes handling missing values, filtering out short texts, and balancing the class distribution if necessary.
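A minimal sketch of what this step could look like with pandas, assuming the merged dataset sits in a CSV with the columns listed above; the file name, the 20-word cutoff, and downsampling as the balancing strategy are illustrative choices rather than the project's exact ones:

```python
import pandas as pd

# Assumed file name and column names; adapt to the actual merged dataset.
df = pd.read_csv("arabic_news_dataset.csv")

# Handle missing values: drop rows without content or label.
df = df.dropna(subset=["Article_content", "Label"])

# Filter out very short texts (20 words is an illustrative threshold).
df = df[df["Article_content"].str.split().str.len() >= 20]

# Balance the class distribution by downsampling the majority class.
min_count = df["Label"].value_counts().min()
df = (
    df.groupby("Label", group_keys=False)
      .apply(lambda g: g.sample(min_count, random_state=42))
      .reset_index(drop=True)
)
```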
The Arabic BERT tokenizer is used to tokenize the text data, and the dataset is prepared for training and testing. The texts are tokenized, padded, and converted into PyTorch tensors for input into the BERT model.
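Continuing the sketch above, tokenization with the Hugging Face transformers library might look like this; the checkpoint name asafaya/bert-base-arabic, the real/fake label encoding, the 80/20 split, and the 256-token limit are assumptions for illustration:

```python
import torch
from torch.utils.data import TensorDataset
from transformers import AutoTokenizer
from sklearn.model_selection import train_test_split

# Assumed Arabic BERT checkpoint; substitute the one actually used.
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")

texts = df["Article_content"].tolist()
labels = df["Label"].map({"real": 0, "fake": 1}).tolist()  # assumed encoding

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

def encode(batch_texts, batch_labels, max_len=256):
    """Tokenize, pad/truncate to max_len, and wrap as PyTorch tensors."""
    enc = tokenizer(batch_texts, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="pt")
    return TensorDataset(enc["input_ids"], enc["attention_mask"],
                         torch.tensor(batch_labels))

train_dataset = encode(X_train, y_train)
test_dataset = encode(X_test, y_test)
```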
The pre-trained Arabic BERT model for sequence classification is loaded, and an optimizer (e.g., AdamW) is set up for training. Additionally, a learning rate scheduler is configured for adjusting the learning rate during training.
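A hedged example of this setup; the checkpoint, batch size, learning rate, and epoch count below are placeholders rather than the project's actual values:

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained Arabic BERT with a 2-class classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    "asafaya/bert-base-arabic", num_labels=2
).to(device)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# AdamW optimizer and a linear learning-rate schedule over all training steps.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
epochs = 3  # illustrative
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0,
    num_training_steps=len(train_loader) * epochs,
)
```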
The model is trained using the training dataset for a specified number of epochs. During each epoch, the model parameters are updated using backpropagation, and the loss is calculated. Training progress, including loss and accuracy, is monitored and printed.
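A compact training loop consistent with that description, reusing the loader, optimizer, and scheduler from the sketch above:

```python
for epoch in range(epochs):
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for input_ids, attention_mask, batch_labels in train_loader:
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        batch_labels = batch_labels.to(device)

        optimizer.zero_grad()
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        labels=batch_labels)
        outputs.loss.backward()   # backpropagation
        optimizer.step()
        scheduler.step()

        total_loss += outputs.loss.item()
        preds = outputs.logits.argmax(dim=1)
        correct += (preds == batch_labels).sum().item()
        seen += batch_labels.size(0)

    print(f"epoch {epoch + 1}: "
          f"loss={total_loss / len(train_loader):.4f}, "
          f"accuracy={correct / seen:.4f}")
```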
After training, the model is evaluated using the test dataset. Evaluation metrics such as accuracy, precision, recall, and F1-score are calculated to assess the model's performance in classifying fake and real news articles.
A detailed classification report is generated, including precision, recall, F1-score, and support for each class. Additionally, a confusion matrix is computed to visualize the model's performance in classifying instances.
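One way to produce these metrics with scikit-learn, continuing the earlier sketch (class order real=0, fake=1 is the assumed encoding):

```python
import torch
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

test_loader = DataLoader(test_dataset, batch_size=32)

model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
    for input_ids, attention_mask, batch_labels in test_loader:
        logits = model(input_ids=input_ids.to(device),
                       attention_mask=attention_mask.to(device)).logits
        all_preds.extend(logits.argmax(dim=1).cpu().tolist())
        all_labels.extend(batch_labels.tolist())

print("accuracy:", accuracy_score(all_labels, all_preds))
print(classification_report(all_labels, all_preds, target_names=["real", "fake"]))
print(confusion_matrix(all_labels, all_preds))
```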
Once trained, the BERT or ARABERT model is saved for future use and deployment. The model is serialized using joblib and saved to a specified file path.
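For example (the file path is illustrative):

```python
import joblib

# Serialize the fine-tuned model to disk for later reuse.
joblib.dump(model, "arabic_bert_fake_news.joblib")

# Reload it when deploying.
model = joblib.load("arabic_bert_fake_news.joblib")
```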
The data preprocessing step involves several tasks such as reading the data frames and cleaning the text by removing punctuation, links, emojis, HTML tags, and irregular patterns. Additionally, Arabic stop words are removed to improve text quality.
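A sketch of how this cleaning might be implemented, assuming NLTK's Arabic stop-word list and a DataFrame like the one loaded earlier; the regular expressions are illustrative, not the exact patterns used:

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
arabic_stopwords = set(stopwords.words("arabic"))

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)               # HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)       # links
    text = re.sub(r"[^\w\s\u0600-\u06FF]", " ", text)   # punctuation and emojis
    text = re.sub(r"\s+", " ", text).strip()            # irregular whitespace
    # Remove Arabic stop words.
    return " ".join(w for w in text.split() if w not in arabic_stopwords)

df["Article_content"] = df["Article_content"].astype(str).apply(clean_text)
```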
The initial step includes setting up hyperparameters required for the model. These parameters include the maximum vocabulary size, tokenizer, and maximum sequence length.
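A sketch of this setup with the Keras Tokenizer; the vocabulary size, sequence length, and label encoding below are assumed values:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_VOCAB_SIZE = 50_000       # illustrative maximum vocabulary size
MAX_SEQUENCE_LENGTH = 300     # illustrative maximum sequence length

tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE, oov_token="<OOV>")
tokenizer.fit_on_texts(df["Article_content"])

sequences = tokenizer.texts_to_sequences(df["Article_content"])
X = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
y = df["Label"].map({"real": 0, "fake": 1}).values  # assumed encoding
```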
The BiLSTM (Bidirectional Long Short-Term Memory) model architecture is defined here. The model includes embedding layers, spatial dropout, multiple bidirectional LSTM layers, and dense layers, each of which helps process the input text effectively for classification.
The model is compiled using the sparse categorical cross-entropy loss function and the Adam optimizer. The model is also configured to track accuracy as a metric during training.
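A minimal Keras definition matching the layers described above; the layer widths, dropout rate, and embedding dimension are illustrative:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Embedding, SpatialDropout1D,
                                     Bidirectional, LSTM, Dense)

EMBEDDING_DIM = 128  # illustrative

model = Sequential([
    Input(shape=(MAX_SEQUENCE_LENGTH,)),
    Embedding(MAX_VOCAB_SIZE, EMBEDDING_DIM),
    SpatialDropout1D(0.2),
    Bidirectional(LSTM(64, return_sequences=True)),
    Bidirectional(LSTM(32)),
    Dense(64, activation="relu"),
    Dense(2, activation="softmax"),
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
model.summary()
```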
The BiLSTM model is trained on the dataset, with the training and validation loss plotted at the end of each epoch. These plots provide insight into the model's performance and help monitor convergence.
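For example, splitting off a test set and training with a validation split (split sizes, epoch count, and batch size are placeholders); the returned history object is what the loss and accuracy plots are drawn from:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

history = model.fit(
    X_train, y_train,
    validation_split=0.1,  # held-out portion for the validation curves
    epochs=10,             # illustrative
    batch_size=64,
)
```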
The trained model is evaluated on the test dataset to assess its performance. Accuracy score and other classification metrics such as precision, recall, and F1-score are calculated to measure the model's effectiveness in classifying fake and real news articles.
A detailed classification report is provided along with a confusion matrix to further analyze the model's performance. The confusion matrix visually represents the model's ability to correctly classify instances and identify any misclassifications.
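A possible evaluation step with scikit-learn, continuing from the split above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = np.argmax(model.predict(X_test), axis=1)

print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["real", "fake"]))
print(confusion_matrix(y_test, y_pred))
```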
Once trained, the BiLSTM model is saved in .h5 format for future use and deployment in applications requiring fake news detection.
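For example (the file name is illustrative):

```python
from tensorflow.keras.models import load_model

model.save("bilstm_fake_news.h5")          # save in HDF5 (.h5) format

model = load_model("bilstm_fake_news.h5")  # reload for deployment
```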
Finally, plots showing training and validation accuracy are generated to visualize the model's learning process over epochs.
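A short matplotlib sketch of that plot, using the history object returned by model.fit:

```python
import matplotlib.pyplot as plt

plt.plot(history.history["accuracy"], label="training accuracy")
plt.plot(history.history["val_accuracy"], label="validation accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```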