
Predicting Churn for A Ride-Sharing Company

Research Problem

A ride-sharing company (Company X) is interested in predicting rider retention. Using rider activity data, we developed a model that identifies which factors best predict retention. We also offer suggestions for operationalizing these insights to help Company X.

Data

The data include a mix of rider demographics, rider behavior, ride characteristics, and rider/driver ratings of each other, spanning a 7-month period.

| Variable | Description |
| --- | --- |
| city | City this user signed up in |
| phone | Primary device for this user |
| signup_date | Date of account registration |
| last_trip_date | Last time user completed a trip |
| avg_dist | Average distance (in miles) per trip taken in first 30 days after signup |
| avg_rating_by_driver | Rider’s average rating over all trips |
| avg_rating_of_driver | Rider’s average rating of their drivers over all trips |
| surge_pct | Percent of trips taken with surge multiplier > 1 |
| avg_surge | Average surge multiplier over all of user’s trips |
| trips_in_first_30_days | Number of trips user took in first 30 days after signing up |
| luxury_car_user | TRUE if user took luxury car in first 30 days |
| weekday_pct | Percent of user’s trips occurring during a weekday |

Defining Churn

We converted the date strings into datetime objects to calculate the churn outcome variable. Users were labeled as having churned if they had not used the ride-share service in the thirty days preceding the data pull (July 1, 2014):

from datetime import datetime, timedelta

import numpy as np


def convert_dates(df):
    # Parse the raw date strings into datetime objects
    df['last_trip_date'] = df['last_trip_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
    df['signup_date'] = df['signup_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
    # The data were pulled on 2014-07-01; anyone with no trip in the
    # preceding 30 days is labeled as churned (1), otherwise active (0)
    current_date = datetime.strptime('2014-07-01', '%Y-%m-%d')
    active_date = current_date - timedelta(days=30)
    y = np.array([0 if last_trip_date > active_date else 1 for last_trip_date in df['last_trip_date']])
    return y

Categorical variables where classes were represented with strings were encoded as numerical classes:

from sklearn import preprocessing


def label_encode(df, encode_list):
    # Map string-valued categories to integer codes, adding a new
    # '_enc' column for each encoded variable
    le = preprocessing.LabelEncoder()
    for col in encode_list:
        le.fit(df[col])
        df[col + '_enc'] = le.transform(df[col])
    return df
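
A minimal usage sketch of the preprocessing steps above (the file name churn.csv and the column list are illustrative assumptions, not the actual ones):

import pandas as pd

# Hypothetical file name; substitute the actual data file
df = pd.read_csv('churn.csv')
df = label_encode(df, ['city', 'phone'])  # assumed string-valued columns
y = convert_dates(df)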

Exploratory Data Analysis and Feature Engineering

We discovered that some of the predictor variables (e.g., average distance, number of trips in first 30 days) were markedly positively skewed. These variables also included zero values, so simple corrections for skew, such as a log transform, were not applicable.

Skewed data were normalized using an inverse hyperbolic sine transformation:

def normalize_inv_hyperbol_sine(df, col):
    # arcsinh behaves like log for large values but is defined at zero,
    # so it can be applied to columns that contain zeros
    x_arr = np.array(df[col])
    df[col + '_normalized'] = np.arcsinh(x_arr)
    return df

This worked well to normalize the data.
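
As a quick sanity check, one can compare skewness before and after the transform; this sketch assumes it is applied to avg_dist and uses scipy purely for illustration:

from scipy.stats import skew

df = normalize_inv_hyperbol_sine(df, 'avg_dist')
print(skew(df['avg_dist']))             # strongly positive before the transform
print(skew(df['avg_dist_normalized']))  # much closer to zero afterward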

While examining distributions of the variables, we noticed that the percent of users' trips occurring during a weekday had an interesting distribution, with definite spikes at 0% and 100% and a more Gaussian-looking distribution between those extremes.

We decided to create dummy variables to split this variable into three categories:

  1. All rides on weekdays
  2. All rides on weekends
  3. Mix of weekdays and weekends

def categorize_weekday_pct(df):
    # Indicator variables for the three weekday-percentage groups
    df['all_weekday'] = (df.weekday_pct == 100).astype('int')
    df['all_weekend'] = (df.weekday_pct == 0).astype('int')
    df['mix_weekday_weekend'] = ((df.weekday_pct < 100) & (df.weekday_pct > 0)).astype('int')
    return df

Classification/Predictive Analytics

Random Forest is a natural starting point for a classification problem like this: it is fast, easy to use, and reasonably accurate out of the box. Our Random Forest classifier produced an F1 score of 77% on unseen data.
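
A minimal sketch of this step, where X is the preprocessed feature matrix built above; the split proportions and hyperparameters are illustrative assumptions, not the settings actually used:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hold out a test set so the model is evaluated on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(f1_score(y_test, rf.predict(X_test)))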

To improve on this fit, we next tried boosted classification models. While boosted models require more tuning (and therefore take longer to get working), they are usually more accurate than Random Forest.

  1. Gradient boost
  • Using scikit-learn's GridSearchCV, we first performed a grid search to determine the best model parameters for a GradientBoostingClassifier. The resultant classifier performed well, with an F1 score of 83% on unseen data. A sketch of the grid search follows this list.
  2. XGBoost
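
A minimal sketch of the grid search; the parameter grid below is an illustrative assumption, as the grid actually searched is not recorded here (f1_score and the train/test split come from the Random Forest sketch above):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the parameters actually searched may have differed
param_grid = {
    'learning_rate': [0.01, 0.1],
    'n_estimators': [100, 500],
    'max_depth': [2, 4],
}

gb_grid = GridSearchCV(GradientBoostingClassifier(), param_grid, scoring='f1', cv=5)
gb_grid.fit(X_train, y_train)
print(gb_grid.best_params_)
print(f1_score(y_test, gb_grid.best_estimator_.predict(X_test)))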

Results

Coming soon!

Recommendations for Company X

  • Use the best-fitting model (above) to obtain predicted churn probabilities for individual users. Target those whose probability of churning exceeds a cutoff chosen by considering a profit curve based on the confusion matrix (see the sketch after this list).

  • Offer discounts or free rides to at-risk users to try to retain them; users below the probability threshold need not be targeted.
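
As a hedged illustration of how that cutoff might be chosen, the cost and benefit figures below are placeholders, not estimates from Company X, and gb_grid comes from the grid-search sketch above:

import numpy as np

# Placeholder economics, not figures from Company X
benefit_per_retained_churner = 20.0  # profit from successfully retaining a churner
cost_per_offer = 5.0                 # cost of the discount or free ride

# Predicted churn probabilities from the best model above
probs = gb_grid.best_estimator_.predict_proba(X_test)[:, 1]

def expected_profit(threshold):
    targeted = probs >= threshold
    # Optimistic simplification: every targeted true churner is retained
    retained = ((y_test == 1) & targeted).sum()
    return benefit_per_retained_churner * retained - cost_per_offer * targeted.sum()

# Sweep candidate cutoffs and keep the most profitable one
best_cutoff = max(np.linspace(0.1, 0.9, 17), key=expected_profit)
print(best_cutoff)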

What We Learned

How useful is feature engineering and normalizing skewed data?

Classifiers like random forest and boosted trees are quite robust to skewed and non-normally distributed data. We probably did not need to spend time transforming our data or creating dummy variables for percent of weekday rides.

Contributors

Our team included Micah Shanks (github.com/Jomonsugi), Stuart King (github.com/Stuart-D-King), Jennifer Waller (github.com/jw15), and Ian

Tech Stack