- Setup environment

```bash
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```
- Download the data

```bash
mkdir data
mkdir images
```

Download the file into the `data` folder.
Then run the program:

```bash
python run.py
```
The output will be the number of clusters found at each second, and an event (JSON format) at every change.
Example:

```python
{
    "type": "decrease",
    "date": 2,
    "centroid": (1.0, 1.0),
    "people_out": set([17]),
    "previous_cluster_id": 15,
    "cluster_id": 16,
    "population": set([16, 15]),
    "density": 0.0
}
```
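For illustration, here is a minimal sketch of how such a "decrease" event could be built by diffing a cluster's population between two consecutive seconds. The helper name and signature are assumptions for the example, not the repo's actual code:

```python
# Hypothetical helper (not the actual code of run.py): build a "decrease"
# event when people have left a cluster between two consecutive seconds.
def make_decrease_event(date, centroid, density,
                        previous_cluster_id, previous_population,
                        cluster_id, population):
    people_out = previous_population - population  # members who left
    if not people_out:
        return None  # nobody left: no "decrease" event to emit
    return {
        "type": "decrease",
        "date": date,
        "centroid": centroid,
        "people_out": people_out,
        "previous_cluster_id": previous_cluster_id,
        "cluster_id": cluster_id,
        "population": population,
        "density": density,
    }
```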
If you change the value of the variable `PLOT` in `run.py`, you will get a plot of each step (with colored clusters) in the `images` folder.
To run the tests:

```bash
nosetests --rednose --force-color tests/
```
- IPython Notebook

You can browse the file `exploration.ipynb`, which explains the choice of some variables. (PS: the first graph doesn't seem to work.)
The main goal of this project is to detect group creation/modification/deletion on temporal and geo-spatial data.

Each position comes with an accuracy which can be very low (a high value in meters of imprecision). We then face multiple problems:

- Distance metric
- Shape computation
- Centroid localization
The series of points for a user is not regularly sampled, and, like the accuracy, there can be long gaps (many hours) between two points. We have to:

- Handle time series with big gaps (cut them into multiple parts; see the sketch after this list)
- Predict the short-term position when the gap is reasonable (we can use the information of the past trajectory)
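Here is a minimal sketch of the gap-splitting step, assuming points are (timestamp, position) pairs sorted by time and a hypothetical `max_gap` threshold:

```python
# Minimal sketch (assumed helper, not the repo's actual code): cut a user's
# time series into sub-series whenever two consecutive points are more than
# `max_gap` seconds apart.
def split_on_gaps(points, max_gap=3600):
    """`points` is a list of (timestamp, position) pairs sorted by timestamp."""
    parts, current = [], []
    for point in points:
        if current and point[0] - current[-1][0] > max_gap:
            parts.append(current)  # gap too big: close the current sub-series
            current = []
        current.append(point)
    if current:
        parts.append(current)
    return parts
```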
The main problem is to group people without knowing the number of groups. We can only use the distribution of the points (and its evolution) to extract this information.
- Discard the outliers
  - Using the `exploration.ipynb`, we chose 60 meters as a minimum viable accuracy
  - We also removed series with an elapsed time > 1 minute
- Stationary interpolation of the position if the last position was less than 1 minute before

With this interpolation, we can re-use the series we removed earlier: if no point was detected for 1 minute, the algorithm will not consider it.
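Here is a minimal sketch of that rule; the helper and its signature are assumptions for illustration:

```python
# Minimal sketch (assumption, not the repo's actual code): stationary
# interpolation, i.e. keep re-using the last known position while it is
# less than one minute old.
def position_at(last_timestamp, last_position, now, max_age=60):
    """Return the interpolated position at time `now`, or None when the
    last point is older than `max_age` seconds."""
    if now - last_timestamp <= max_age:
        return last_position  # assume the user has not moved
    return None  # gap too big: the algorithm will not consider this user
```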
Unsupervised algorithm to cluster people based on their closeness:

- We do not want to put everyone in a cluster
- The clusters must be very dense
- The number `k` of clusters can change every second
We used the DBSCAN algorithm for multiple reasons, and mainly for its weaknesses:

- It handles only regularly dense clusters
- It comes with the parameter `epsilon`, which defines the maximum distance between two samples
- It is less effective on high-dimensional problems
See *A Review: Comparative Study of Various Clustering Techniques in Data Mining* (short and precise).
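For illustration, here is a minimal clustering sketch using scikit-learn's `DBSCAN` (an assumption: the repo may implement this step differently). The `haversine` metric works in radians, hence the conversion of `eps` from meters:

```python
# Minimal sketch assuming scikit-learn (the repo may cluster differently).
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6371000  # mean Earth radius, to convert meters to radians

def cluster_positions(latlon_degrees, eps_meters=4.0, min_samples=2):
    """Cluster (lat, lon) positions; a label of -1 marks noise, i.e.
    people who do not belong to any group."""
    coords = np.radians(latlon_degrees)  # the haversine metric expects radians
    return DBSCAN(eps=eps_meters / EARTH_RADIUS_M,
                  min_samples=min_samples,
                  metric="haversine",
                  algorithm="ball_tree").fit_predict(coords)
```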
Distance metric:

We chose a distance metric that is easy to interpret, to be able to choose the best `epsilon` for DBSCAN.

- First we used the Haversine formula between two points
- Then, if the accuracy was very low, we weighted it using the maximum and minimum distances possible between the points (see the sketch below)
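Here is a sketch of such a metric. The Haversine part is the standard formula; the accuracy weighting (averaging the minimum and maximum possible distances given each point's accuracy radius) is only one plausible reading of the scheme, not necessarily the exact formula used in the repo:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in meters."""
    a = (sin(radians(lat2 - lat1) / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2))
         * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def weighted_distance(lat1, lon1, acc1, lat2, lon2, acc2):
    """Hypothetical accuracy weighting: average the minimum and maximum
    distances allowed by each point's accuracy radius (meters)."""
    d = haversine_m(lat1, lon1, lat2, lon2)
    d_min = max(0.0, d - acc1 - acc2)  # accuracy circles may overlap
    d_max = d + acc1 + acc2
    return (d_min + d_max) / 2
```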
The only parameters are the accuracy filter and the `epsilon` of DBSCAN. The latter is easy to choose when there are no accuracy problems: because we used the Haversine distance, we can reasonably select a minimum distance between two people in the same group (for instance 4 meters). With the accuracy, we chose 50 meters empirically.

Moreover, the sparse positions in the dataset allow us to be less strict on the minimum distance between dots.
Here is an example of group evolutions:

There is also a video of the first evolutions:
- Validate a group modification only if a certain amount of time has passed (so a sample just passing through doesn't count)
- Enhance accuracy handling in the distance metric
- Better graphical output
- Bug fixing / refactoring
- The density of the accuracy circle could be higher in the point's 'global' direction, which would change the distance computation. We could use the Ramer-Douglas-Peucker algorithm to extract the general path of the point (a sketch follows), and it could also be used to get a better interpolation.
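For reference, a self-contained sketch of Ramer-Douglas-Peucker on planar points (this is the textbook algorithm, not code from the repo, which lists it only as an idea):

```python
from math import hypot

def rdp(points, epsilon):
    """Simplify a polyline: keep only points deviating more than `epsilon`
    from the chord between the first and last points."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    norm = hypot(x2 - x1, y2 - y1) or 1e-12
    # perpendicular distance of each intermediate point to the chord
    dists = [abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1) / norm
             for x, y in points[1:-1]]
    i, dmax = max(enumerate(dists), key=lambda t: t[1])
    if dmax <= epsilon:
        return [points[0], points[-1]]
    split = i + 1  # index in `points` of the farthest point
    # recurse on both halves; drop the duplicated split point once
    return rdp(points[:split + 1], epsilon)[:-1] + rdp(points[split:], epsilon)
```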
- Density-based Place Clustering in Geo-Social Networks
- Ad-Hoc Group Formation/Detection for Better Shopping Experience
- Jasmine: A Real-time Local-event Detection System based on Geolocation Information Propagated to Microblogs
- Social Event Detection in Massive Mobile Phone Data Using Probabilistic Location Inference
- Add a circle the size of the accuracy with a corresponding `alpha`
- Add the map of Paris