Heyyassinesedjari/Twitter-Community-Detection

Unveiling the 🐦 Twittersphere (Now 𝕏): Community Detection Analysis

This project groups Twitter users into communities with shared interests by applying unsupervised learning techniques to graph and tabular data.

This project was completed as part of my Unsupervised Learning and Social Network Analysis (UL & SNA) course, under the guidance of Professor M. Lazaar at ENSIAS, Mohammed V University.

In choosing a project for this course, I opted to concentrate on clustering communities within Twitter. Coming from a traditional machine learning background involving tabular data, I was particularly intrigued by the challenge of handling graph data and constructing machine learning models that could uncover patterns without human guidance. While Facebook and Google+ were available data sources, Twitter stood out due to its simplicity and engaging nature.

This repository contains sample code demonstrating the following steps:

  • Identifying Twitter communities in the Stanford Network Analysis Project (SNAP) Twitter graph data with two distinct approaches: edge-based and feature-based.
  • Preprocessing and visualizing the data: building the graph and computing its adjacency matrix with networkx, scipy, and matplotlib.
  • Edge-based approach:
    • Training a Spectral Clustering model on the adjacency matrix and evaluating it with the Silhouette score via Scikit-Learn.
  • Feature-based approach:
    • Converting the graph data to a tabular format enriched with graph centrality metrics: degree, closeness, and betweenness centrality.
    • Training several clustering algorithms (KMeans, SpectralClustering, and AgglomerativeClustering) and evaluating them with Silhouette scores.
  • Labeling the clusters produced by the best-performing approach with the hashtags most commonly used by their members. These hashtags summarize the prevalent interests within each cluster, giving labels such as 'Social Media Cluster', 'Gaming Cluster', and 'Music Cluster'.
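
The cluster-labeling step can be illustrated with a minimal sketch. It is not the repository's code: the function name `top_hashtags_per_cluster`, the two input dictionaries, and the toy data are all hypothetical, and counting each hashtag once per user is an assumption made here so that a single prolific account cannot dominate a cluster.

```python
from collections import Counter


def top_hashtags_per_cluster(cluster_of, hashtags_of, k=3):
    """Return the k most common hashtags for each cluster.

    cluster_of:  dict mapping user id -> cluster id (hypothetical input format)
    hashtags_of: dict mapping user id -> iterable of hashtags used by that user
    """
    counts = {}
    for user, cluster in cluster_of.items():
        counter = counts.setdefault(cluster, Counter())
        # Assumption: count each hashtag at most once per user.
        counter.update(set(hashtags_of.get(user, ())))
    return {
        cluster: [tag for tag, _ in counter.most_common(k)]
        for cluster, counter in counts.items()
    }


# Toy data, invented for illustration only.
cluster_of = {1: 0, 2: 0, 3: 1, 4: 1}
hashtags_of = {
    1: ["#music", "#nowplaying"],
    2: ["#music"],
    3: ["#gaming", "#xbox"],
    4: ["#gaming"],
}
labels = top_hashtags_per_cluster(cluster_of, hashtags_of, k=1)
print(labels)
```

A human-readable name such as 'Music Cluster' would then be chosen by inspecting the top hashtags of each cluster.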

Visual Project Walkthrough

Dataset Statistics

Files Hierarchy

Graph Visualization

Feature Extraction using Edge Based Approach
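
As a rough sketch of the edge-based approach, the snippet below clusters a graph directly from its adjacency matrix. It makes several assumptions: a small barbell graph stands in for the SNAP Twitter edges, the number of clusters is fixed at 2, and the Silhouette score is computed on the adjacency rows treated as feature vectors, which may differ from how the project evaluates its model.

```python
import networkx as nx
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

# Toy stand-in for the SNAP edge list: two 5-node cliques joined by one edge.
G = nx.barbell_graph(5, 0)

# Dense adjacency matrix; rows and columns follow the order of `nodes`.
nodes = sorted(G.nodes())
A = nx.to_numpy_array(G, nodelist=nodes)

# affinity="precomputed" makes scikit-learn use A itself as the similarity matrix.
model = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = model.fit_predict(A)

# Assumption: evaluate on adjacency rows used as feature vectors.
score = silhouette_score(A, labels)
print(dict(zip(nodes, labels)), round(float(score), 3))
```

For a graph of realistic size, a sparse matrix (for example from `nx.to_scipy_sparse_array`) would be a more memory-friendly input than the dense array used here.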

Feature Extraction using Feature Based Approach
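
The feature-based approach can be sketched in the same spirit. Each node becomes a row of centrality features, and several clustering algorithms are compared by Silhouette score. The karate club graph, the choice of 3 clusters, and the standardization step are assumptions made for illustration and are not taken from the project.

```python
import networkx as nx
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans, SpectralClustering
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the Twitter graph.
G = nx.karate_club_graph()
nodes = sorted(G.nodes())

# Centrality metrics named in the project description.
degree = nx.degree_centrality(G)
closeness = nx.closeness_centrality(G)
betweenness = nx.betweenness_centrality(G)

# One row per node, one column per metric.
X = np.array([[degree[n], closeness[n], betweenness[n]] for n in nodes])
# Assumption: standardize so that no single metric dominates the distances.
X = StandardScaler().fit_transform(X)

models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "SpectralClustering": SpectralClustering(n_clusters=3, random_state=0),
    "AgglomerativeClustering": AgglomerativeClustering(n_clusters=3),
}
scores = {name: silhouette_score(X, m.fit_predict(X)) for name, m in models.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Which algorithm scores highest depends on the data and on the number of clusters, so the ranking on this toy graph says nothing about the ranking on the Twitter data.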

Final Dataframe using Feature Based Approach

Experimental Results

Hashtag Distribution Across Clusters Generated by the Optimal Method (e.g., Feature-Based Approach with KMeans):

Music Community

Social Media Community

Gaming Community

For a more comprehensive explanation, please consult the project report, review the code, and refer to the presentation.
