In this example we'll build a model to predict the duration of taxi trips in the city of New York. To make things realistic, we'll run a simulation where the taxis depart and arrive in the order given in the dataset. In other words, we'll replay a live workload from a historical dataset, which produces an environment very close to what happens in a production setting.
Predicting the duration of a trip is a regression task. We can therefore set chantilly's flavor to regression via the CLI:
> chantilly init regression
Before running the simulation, let's start the chantilly server. For the purpose of this example, we'll assume that the server runs in one command-line session, while the rest of the commands are issued from another session.
> chantilly run
Let's now create a model using river. Simply run the following snippet in a Python interpreter:
from river import compose
from river import linear_model
from river import preprocessing


def parse(trip):
    # Convert the pickup datetime from an ISO 8601 string into a datetime object
    import datetime as dt
    trip['pickup_datetime'] = dt.datetime.fromisoformat(trip['pickup_datetime'])
    return trip


def distances(trip):
    # Derive two distance features from the pickup and dropoff coordinates
    import math
    lat_dist = trip['dropoff_latitude'] - trip['pickup_latitude']
    lon_dist = trip['dropoff_longitude'] - trip['pickup_longitude']
    return {
        'manhattan_distance': abs(lat_dist) + abs(lon_dist),
        'euclidean_distance': math.sqrt(lat_dist ** 2 + lon_dist ** 2)
    }


def datetime_info(trip):
    # Extract the pickup hour and one-hot encode the day of the week
    import calendar
    day_no = trip['pickup_datetime'].weekday()
    return {
        'hour': trip['pickup_datetime'].hour,
        **{day: i == day_no for i, day in enumerate(calendar.day_name)}
    }


model = compose.FuncTransformer(parse)
model |= compose.FuncTransformer(distances) + compose.FuncTransformer(datetime_info)
model |= preprocessing.StandardScaler()
model |= linear_model.LinearRegression()
The required modules are imported within each function, so that each function can be serialized together with the imports it depends on.
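Before uploading the model, we can sanity-check the pipeline by asking it for a single prediction. The trip values below are made up for the sake of illustration; since the model hasn't learned from any data yet, the prediction should simply be 0.0.
trip = {
    'pickup_datetime': '2016-01-01 00:00:17',
    'pickup_latitude': 40.767937,
    'pickup_longitude': -73.982155,
    'dropoff_latitude': 40.765602,
    'dropoff_longitude': -73.964630
}
print(model.predict_one(trip))  # 0.0, as the model hasn't seen any data yet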
We can now upload the model to the chantilly instance with an API call, for instance via the requests library. Again, in this example we assume that chantilly is running locally, which means it is accessible at http://localhost:5000.
import dill
import requests
requests.post('http://localhost:5000/api/model', data=dill.dumps(model))
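If you keep a reference to the response, you can check that the server accepted the model. This is just a small sketch using the requests API; the server will answer with an error status code if something went wrong.
r = requests.post('http://localhost:5000/api/model', data=dill.dumps(model))
r.raise_for_status()  # raises an exception if the upload was rejected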
Note that we use dill to serialize the model, rather than pickle from Python's standard library. This is because dill is able to serialize a whole session, and therefore copes with custom functions and module imports.
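The difference is easy to demonstrate. In the following sketch, pickle refuses to serialize a lambda function, whereas dill round-trips it just fine:
import pickle

import dill

double = lambda x: x * 2

try:
    pickle.dumps(double)
except pickle.PicklingError:
    print('pickle cannot serialize a lambda')

restored = dill.loads(dill.dumps(double))
print(restored(21))  # 42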
We are now all set to run the simulation.
> python simulate.py
This will produce the following output:
#0000000 departs at 2016-01-01 00:00:17
#0000001 departs at 2016-01-01 00:00:53
#0000002 departs at 2016-01-01 00:01:01
#0000003 departs at 2016-01-01 00:01:14
#0000004 departs at 2016-01-01 00:01:20
#0000005 departs at 2016-01-01 00:01:33
#0000006 departs at 2016-01-01 00:01:37
#0000007 departs at 2016-01-01 00:01:47
#0000008 departs at 2016-01-01 00:02:06
#0000009 departs at 2016-01-01 00:02:45
#0000010 departs at 2016-01-01 00:03:02
#0000006 arrives at 2016-01-01 00:03:31 - average error: 0:01:54
#0000011 departs at 2016-01-01 00:03:31
#0000012 departs at 2016-01-01 00:03:35
#0000013 departs at 2016-01-01 00:04:42
#0000014 departs at 2016-01-01 00:04:57
#0000015 departs at 2016-01-01 00:05:07
#0000016 departs at 2016-01-01 00:05:08
#0000017 departs at 2016-01-01 00:05:18
#0000018 departs at 2016-01-01 00:05:35
#0000019 departs at 2016-01-01 00:05:39
#0000003 arrives at 2016-01-01 00:05:54 - average error: 0:03:17
#0000020 departs at 2016-01-01 00:06:04
#0000021 departs at 2016-01-01 00:06:12
#0000022 departs at 2016-01-01 00:06:22
By default, the simulate.py script takes around 6 months to terminate, because the events are replayed at their original pace and that's the time span of the dataset. You can speed things up by running python simulate.py SPEED_UP, where SPEED_UP is the speed-up factor, which defaults to 1.
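Under the hood, simulate.py essentially replays the dataset as a stream of events: it asks chantilly for a prediction whenever a taxi departs, and sends the observed trip duration back whenever a taxi arrives. The following is a simplified sketch of that loop, not the actual script; it assumes the /api/predict and /api/learn endpoints with an id/features and id/ground_truth payload, and uses two made-up events:
import requests

# Two hypothetical events for a single trip: a departure and an arrival
events = [
    {'kind': 'departure', 'trip_id': 0, 'features': {
        'pickup_datetime': '2016-01-01 00:00:17',
        'pickup_latitude': 40.767937,
        'pickup_longitude': -73.982155,
        'dropoff_latitude': 40.765602,
        'dropoff_longitude': -73.964630
    }},
    {'kind': 'arrival', 'trip_id': 0, 'duration': 455}
]

for event in events:
    if event['kind'] == 'departure':
        # Ask for a prediction the moment the taxi departs
        requests.post('http://localhost:5000/api/predict',
                      json={'id': event['trip_id'], 'features': event['features']})
    else:
        # Send back the true duration so that the model can learn from it
        requests.post('http://localhost:5000/api/learn',
                      json={'id': event['trip_id'], 'ground_truth': event['duration']})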