As individuals passionate about understanding language and social media trends, we initiated a project aimed at identifying cyberbullying using Bengali language data. Cyberbullying, a prevalent problem on social platforms, can be addressed through machine learning techniques. By automating the detection process, we strive to enhance online safety.
Cyberbullying detection uses machine learning to analyze text and identify abusive or harmful behavior. In this project, we focused on classifying social media posts written in Bengali. By training models on annotated datasets, we aimed to develop accurate classifiers capable of detecting cyberbullying in real time.
- NumPy: A fundamental package for numerical computing in Python, essential for handling arrays and mathematical operations.
- Pandas: A powerful data manipulation library used for data preprocessing, analysis, and manipulation, offering data structures like DataFrames that simplify data handling.
- scikit-learn (sklearn): A versatile machine learning library providing a wide range of algorithms for classification, regression, clustering, and more, along with tools for model selection and evaluation.
- Matplotlib: A comprehensive plotting library for creating static, animated, and interactive visualizations in Python, essential for data visualization and result analysis.
- Seaborn: Built on top of Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.
- NLTK (Natural Language Toolkit): A leading platform for building Python programs to work with human language data, used for text preprocessing tasks such as tokenization, stemming, and stopwords removal.
- Regular Expressions (re): A module in Python providing support for regular expressions, used for pattern-matching and text manipulation tasks.
- TfidfVectorizer: Part of scikit-learn, TfidfVectorizer is used to convert text data into numerical features based on term frequency-inverse document frequency (TF-IDF) for machine learning models.
- PorterStemmer: A stemming algorithm used for reducing words to their base or root form, helping to normalize text data.
- Stopwords Corpus: A collection of common words like "the," "is," and "and" that are filtered out during text preprocessing as they typically don't carry significant meaning in natural language processing tasks.
- GitHub: Used for version control and collaboration, providing a centralized repository for project files and code.
- Jupyter Notebook: Employed for interactive development and experimentation with data and models.
- Data Collection: Gather a sufficient amount of Bengali text data from social media platforms, ensuring it is annotated with labels indicating instances of cyberbullying.
- Preprocessing: Clean the text data by removing punctuation, stopwords, and digits, then apply stemming to reduce the dimensionality of the feature space.
- Model Development: Train machine learning models such as logistic regression, decision tree, random forest, and XGBoost on the preprocessed text data to classify instances of cyberbullying.
- Evaluation: Evaluate the performance of each model using cross-validation and metrics such as accuracy, precision, recall, and F1-score.
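The preprocessing step above can be sketched with the standard library alone. This is an illustrative outline, not the project's actual code: the `STOPWORDS` set and the `preprocess` helper are hypothetical names, and the three-word stopword list is a placeholder for a real one. Stemming is left out of the sketch because NLTK's PorterStemmer implements an English algorithm, and a Bengali stemmer would be a further assumption.

```python
import re
import string

# Hypothetical, tiny stopword list; a real project would load a fuller one.
STOPWORDS = {"এবং", "ও", "এই"}

# ASCII punctuation plus the Bengali danda (।).
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "।" + "]")
# ASCII digits and Bengali digits (০-৯).
DIGIT_RE = re.compile(r"[0-9০-৯]")


def preprocess(text: str) -> str:
    """Remove punctuation, digits, and stopwords from one comment."""
    text = PUNCT_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


cleaned = preprocess("এই পোস্ট ১২৩ বার, দেখা হলো!")
print(cleaned)
```

Replacing punctuation and digits with a space, rather than deleting them, keeps neighbouring words from being merged when they were separated only by a comma or a number.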
By following these steps and leveraging the mentioned technologies, we aimed to develop an effective cyberbullying detection system capable of enhancing online safety on social media platforms for Bengali-speaking users.
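As a rough end-to-end sketch of the model development and evaluation steps, the snippet below chains TF-IDF features with logistic regression (one of the models listed above) and scores it with cross-validation. The eight comments and their labels are invented for illustration, the whitespace `token_pattern` is our assumption, and `cv=2` is chosen only because the toy set is tiny. Scores on such a small sample say nothing about real performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Invented toy data: 1 = bullying, 0 = not bullying (not project data).
texts = [
    "তুমি খুব খারাপ", "তুমি বোকা", "তুমি খারাপ মানুষ", "সে খুব বোকা",
    "তুমি খুব ভালো", "আজ দিন ভালো", "সে ভালো মানুষ", "আজ খুব সুন্দর দিন",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline refits the vectorizer inside each fold, so no test-fold
# vocabulary leaks into training.
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"\S+"),  # assumption: whitespace tokens
    LogisticRegression(max_iter=1000),
)

metrics = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(model, texts, labels, cv=2, scoring=metrics)

for name in metrics:
    print(name, scores[f"test_{name}"].mean())
```

Swapping `LogisticRegression` for a decision tree, random forest, or XGBoost classifier leaves the rest of the pipeline unchanged, which makes it straightforward to compare models on the same folds.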
Below are the references that we consulted to study and gain insights into implementing our project:
[1] Cyberbullying. (n.d.).
[2] Cyberbullying – Law and Legal Definitions. US Legal.
[3] An Educator's Guide to Cyberbullying. Brown Senate.gov.
[4] E. J. K. S. T. P. R. A. Eri Eli lavindi, "Cyber-bullying Detection based on Machine Learning Method (Case Study: Instagram Comment Section)," Journal of Applied Information and Communication Technologies, vol. 8, pp. 223-226, 2023.
[5] N. B. a. M. H. H. Rahman, "Toxicity Detection on Bengali Social Media Comments using Supervised Models," in 2019 2nd International Conference on Innovation in Engineering and Technology (ICIET), Dhaka, Bangladesh, 23-24 December 2019.
[6] M. R. M. H. N. a. S. M. M. H. C. S. Ahammed, "Implementation of Machine Learning to Detect Hate Speech in Bangla Language," 8th International Conference System Modeling and Advancement in Research Trends (SMART), India, doi: 10.1109/SMART46866.2019.9117214, pp. 317-
[7] K. A. U. a. M. M. A. P. A. Akhter, "Cyber Bullying Detection and Classification using Multinomial Naive Bayes and Fuzzy Logic," Int. J. Math. Sci. Comput, vol. 5, no. doi: 10.5815/ijmsc.2019.04.01, p. 1–12, 2019.
[8] S. B. C. a. A. H. R. R. Dalvi, "Detecting A Twitter Cyberbullying Using Machine Learning," 2020 4th International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India no. doi: 10.1109/ICICCS48265.2020.pp. 297-301, 2020.
Contributions are welcome! If you have any suggestions, bug reports, or feature requests, feel free to open an issue or submit a pull request on GitHub.
Team name: High Five
Members
Tanzila Akhter (343)
Nurun Nahar Fiha (361)
Md. Parvej Hoque Palash (378)
Sakib Mollah (387)
Serajum Monira (2142)