- Setup environment

```bash
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```
- Download the data

```bash
mkdir data
mkdir images
```

Download the file into the `data` folder.
Then run the program:

```bash
python run.py
```
The output will be the number of clusters found at each second, and an event (JSON format) at every change.
Example:

```python
{
    "type": "decrease",
    "date": 2,
    "centroid": (1.0, 1.0),
    "people_out": set([17]),
    "previous_cluster_id": 15,
    "cluster_id": 16,
    "population": set([16, 15]),
    "density": 0.0
}
```
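For illustration, here is a minimal sketch of how such a "decrease" event could be built by diffing a cluster's population between two consecutive seconds. The helper name and signature are assumptions for the example, not the repo's actual code:

```python
# Hypothetical helper (not the actual code of run.py): build a "decrease"
# event when people have left a cluster between two consecutive seconds.
def make_decrease_event(date, centroid, density,
                        previous_cluster_id, previous_population,
                        cluster_id, population):
    people_out = previous_population - population  # members who left
    if not people_out:
        return None  # nobody left: no "decrease" event to emit
    return {
        "type": "decrease",
        "date": date,
        "centroid": centroid,
        "people_out": people_out,
        "previous_cluster_id": previous_cluster_id,
        "cluster_id": cluster_id,
        "population": population,
        "density": density,
    }
```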
If you change the value of the variable `PLOT` in `run.py`, you will get a plot of each step (with colored clusters) in the `images` folder.
To run the tests:

```bash
nosetests --rednose --force-color tests/
```
- IPython Notebook

You can browse the file `exploration.ipynb`, which explains the choice of some variables. (PS: the first graph doesn't seem to work.)
The main goal of this project is to detect group creation/modification/deletion on temporal and geo-spatial data.

Each position comes with an accuracy which can be very low (a high value in meters of imprecision). We then face multiple problems:

- Distance metric
- Shape computation
- Centroid localization
The series of points for a user is not regularly sampled, and, like the accuracy, there can be long gaps (many hours) between two points. We have to:

- Handle time series with big gaps (cut them into multiple parts; see the sketch after this list)
- Predict the short-term position when the gap is reasonable (we can use the information of the past trajectory)
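Here is a minimal sketch of the gap-splitting step, assuming points are (timestamp, position) pairs sorted by time and a hypothetical `max_gap` threshold:

```python
# Minimal sketch (assumed helper, not the repo's actual code): cut a user's
# time series into sub-series whenever two consecutive points are more than
# `max_gap` seconds apart.
def split_on_gaps(points, max_gap=3600):
    """`points` is a list of (timestamp, position) pairs sorted by timestamp."""
    parts, current = [], []
    for point in points:
        if current and point[0] - current[-1][0] > max_gap:
            parts.append(current)  # gap too big: close the current sub-series
            current = []
        current.append(point)
    if current:
        parts.append(current)
    return parts
```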
The main problem is to group people without knowing the number of groups. We can only use the distribution of the points (and its evolution) to extract this information.
- Discard the outliers
  - Using the `exploration.ipynb`, we chose 60 meters as a minimum viable accuracy
  - We also removed series with an elapsed time > 1 minute
- Stationary interpolation of the position if the last position was less than 1 minute before

With this interpolation, we can re-use the series we removed earlier: if no point was detected for 1 minute, the algorithm will not consider it.
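Here is a minimal sketch of that rule; the helper and its signature are assumptions for illustration:

```python
# Minimal sketch (assumption, not the repo's actual code): stationary
# interpolation, i.e. keep re-using the last known position while it is
# less than one minute old.
def position_at(last_timestamp, last_position, now, max_age=60):
    """Return the interpolated position at time `now`, or None when the
    last point is older than `max_age` seconds."""
    if now - last_timestamp <= max_age:
        return last_position  # assume the user has not moved
    return None  # gap too big: the algorithm will not consider this user
```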
Unsupervised algorithm to cluster people based on their closeness:

- We do not want to put everyone in a cluster
- The clusters must be very dense
- The number `k` of clusters can change every second
We used the DBSCAN algorithm for multiple reasons, and mainly for its weaknesses:

- It handles only regularly dense clusters
- It comes with the parameter `epsilon`, which defines the maximum distance between two samples
- It is less effective on high-dimensional problems
See *A Review: Comparative Study of Various Clustering Techniques in Data Mining* (short and precise).
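For illustration, here is a minimal clustering sketch using scikit-learn's `DBSCAN` (an assumption: the repo may implement this step differently). The `haversine` metric works in radians, hence the conversion of `eps` from meters:

```python
# Minimal sketch assuming scikit-learn (the repo may cluster differently).
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6371000  # mean Earth radius, to convert meters to radians

def cluster_positions(latlon_degrees, eps_meters=4.0, min_samples=2):
    """Cluster (lat, lon) positions; a label of -1 marks noise, i.e.
    people who do not belong to any group."""
    coords = np.radians(latlon_degrees)  # the haversine metric expects radians
    return DBSCAN(eps=eps_meters / EARTH_RADIUS_M,
                  min_samples=min_samples,
                  metric="haversine",
                  algorithm="ball_tree").fit_predict(coords)
```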
Distance metric:

We chose a distance metric that is easy to interpret, to be able to choose the best `epsilon` for DBSCAN.

- First we used the Haversine formula between two points
- Then, if the accuracy was very low, we weighted it using the maximum and minimum distances possible between the points (see the sketch below)
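Here is a sketch of such a metric. The Haversine part is the standard formula; the accuracy weighting (averaging the minimum and maximum possible distances given each point's accuracy radius) is only one plausible reading of the scheme, not necessarily the exact formula used in the repo:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in meters."""
    a = (sin(radians(lat2 - lat1) / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2))
         * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def weighted_distance(lat1, lon1, acc1, lat2, lon2, acc2):
    """Hypothetical accuracy weighting: average the minimum and maximum
    distances allowed by each point's accuracy radius (meters)."""
    d = haversine_m(lat1, lon1, lat2, lon2)
    d_min = max(0.0, d - acc1 - acc2)  # accuracy circles may overlap
    d_max = d + acc1 + acc2
    return (d_min + d_max) / 2
```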
The only parameters are the accuracy filter and the `epsilon` of DBSCAN. The latter is easy to choose when there are no accuracy problems: because we used the Haversine distance, we can reasonably select a minimum distance between two people in the same group (for instance 4 meters). With the accuracy, we chose 50 meters empirically.

Moreover, the sparse positions in the dataset allow us to be less strict on the minimum distance between dots.
Here is an example of group evolutions:

There is also a video of the first evolutions:
- Validate a group modification only if a certain amount of time has passed (so a sample just passing through doesn't count)
- Enhance accuracy handling in the distance metric
- Better graphical output
- Bug fixing / refactoring
- The density of the accuracy circle could be higher in the point's 'global' direction, which would change the distance computation. We could use the Ramer-Douglas-Peucker algorithm to extract the general path of the point (a sketch follows), and it could also be used to get a better interpolation.
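For reference, a self-contained sketch of Ramer-Douglas-Peucker on planar points (this is the textbook algorithm, not code from the repo, which lists it only as an idea):

```python
from math import hypot

def rdp(points, epsilon):
    """Simplify a polyline: keep only points deviating more than `epsilon`
    from the chord between the first and last points."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    norm = hypot(x2 - x1, y2 - y1) or 1e-12
    # perpendicular distance of each intermediate point to the chord
    dists = [abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1) / norm
             for x, y in points[1:-1]]
    i, dmax = max(enumerate(dists), key=lambda t: t[1])
    if dmax <= epsilon:
        return [points[0], points[-1]]
    split = i + 1  # index in `points` of the farthest point
    # recurse on both halves; drop the duplicated split point once
    return rdp(points[:split + 1], epsilon)[:-1] + rdp(points[split:], epsilon)
```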
- Density-based Place Clustering in Geo-Social Networks
- Ad-Hoc Group Formation/Detection for Better Shopping Experience
- Jasmine: A Real-time Local-event Detection System based on Geolocation Information Propagated to Microblogs
- Social Event Detection in Massive Mobile Phone Data Using Probabilistic Location Inference
- Add a circle the size of the accuracy with a corresponding `alpha`
- Add the map of Paris