Skip to content

tillwf/spatial-group-detection

Repository files navigation

Installation

  • Setup environment
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
  • Download the data
mkdir data
mkdir images

Download the file into the data folder

Then run the program :

python run.py

The output will be the number of cluster found at each second, and the event (json format) at every change.

Example

{
    "type": "decrease",
    "date": 2,
    "centroid": (1.0, 1.0),
    "people_out": set([17]),
    "previous_cluster_id": 15,
    "cluster_id": 16,
    "population": set([16, 15]),
    "density": 0.0
}

If you change the value of the variable PLOT in run.py, you will have the plot of each step (with colored clusters) in the folder images.

To run the tests :

nosetests --rednose --force-color tests/
  • IPython Notebook

You can browse the file exploration.ipynb which explain the choice of some variables. (PS: The first graph doesn't seem to work)

Main goal

The main goal of this project is to detect groups creation/modification/deletion on temporal and geo-spatial data.

Problematics

The Accuracy of the data

Each position comes with an accuracy which can be very low (high value in meters of imprecision). We then face multiple problems :

  • Distance metric
  • Shape computation
  • Centroid localization

The sampling of the data

The series of points for a user is not regularly sampled and like the accuracy there can be long moment (many hours) between two points. We have to :

  • Handle time series with big gaps (cut in multiple parts)
  • Predict the short term position when the gap is reasonable (we can use the information of the past trajectory)

The clustering

The main problem is to group people without knowing the number of group. We can only use the repartition of the points (and its evolutions) to extract this information.

Approach

Sampling and accuracy

  • Discard the outliers

    • Using the exploration.ipynb, we chose 60 meters as a minimum viable accuracy
    • We also removed series with elapsed time > 1 minute
  • Stationary interpolation of the position if the last position was less than 1 minute before

With this interpolation, we can re-use the series we removed earlier : if no point was detected for 1 minute, the algorithm will not consider it.

Clustering

Unsupervised algorithm to cluster people based on their closeness :

  • We do not want to put everyone in a cluster
  • The cluster must be very dense
  • The number k of clusters can change every second

We used the DBSCAN algorithm for multiple reasons and mainly for its weaknesses :

  • Handle only regularly dense clusters
  • Comes with the parameter epsilon which defines the maximum distance between two samples
  • Is less effective with high dimensional problems

See A Review: Comparative Study of Various Clustering Techniques in Data Mining (short and precise)

Distance metric :

We chose a distance metric that could be easily understandable, to be able to choose the best epsilon for DBSCAN.

  • First we used the Haversine formula between two points
  • Then if the accuracy was very low, we weighted it by using the max and min distances possible between points

Results

Parameters

The only parameters would be the accuracy filter and the epsilon of DBSCAN. The latter is easy to choose when there are no accuracy problems : because we used the Haversine distance, we can reasonably select a minimum distance between two people in the same groups (for instance 4 meters). With the accuracy, we chose 50 meters empirically.

Moreover the sparse positions in the dataset allow us to be less strict on the minimum distance between dots.

Screenshot

Here is an example group evolutions :

Group Evolution

There is also a video of the first evolutions :

Video of the 15241 first seconds

To be done (short term)

  • Validate group modification only if a certain amount of time has passed (sample not just passing through)
  • Enhance accuracy handling in distance metric
  • Better graphical output
  • Bug fixing / Refacto

Future work

Trajectories

  • The density of the accuracy circle could be higher in the point 'global' direction. It would change the distance computation. We could use the Ramer-Douglas-Peucker algorithm to extract the general path of the point.
  • Could be use have a better interpolation.

Contextual informations

Graphical

  • Add circle of the size of the accuracy with a corresponding alpha
  • Add the map of Paris

Handle mobility

Bibliography / Sources / To read

About

Based on temporal / spatial data, identify groups

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published