Lyft Demand Surge Predictor

Andrew Lee
10 min read · Apr 28, 2021

Some questions you will be able to answer from this article:

  • Does weather affect the demand for my Lyft ride?
  • What type of Lyft ride should I choose to avoid surge pricing?
  • How does my pickup location affect the demand for my ride?
  • Which day of the week has the highest demand?
  • Which hour of the day has the highest demand?

Introduction

The datasets I have chosen to analyze in conjunction are available on Kaggle here. The GitHub repository holding all my code can be found here.

The first dataset contains data on Boston’s Lyft rides from November to December 2018; the second contains data on Boston’s weather over the same period. I want to analyze the relationship between weather and the demand for Lyft rides. Predicting surges in demand for rides can be very useful for Lyft in creating new features and products that connect drivers and riders more efficiently.

I chose these datasets because I find it fascinating to uncover patterns in real-world data at scale that can bring insight into human behavior. The cab rides dataset contains ~700k observations and 10 features, and the weather dataset contains ~6k observations and 8 features.

I will take you through the Machine Learning Workflow, as I did through the project.

ML Workflow

  1. Wrangle Data
  2. Split Data
  3. Establish Baseline
  4. Build Model
  5. Check Metrics
  6. Tune Model
  7. Communicate Results

Wrangle Data

Wrangling the data consumed ~80% of the total project time and was not an easy task. Below are the first five observations of each dataset before cleaning.

Raw Lyft dataset

Contains 693071 observations & 10 features

Raw Weather dataset

Contains 6276 observations & 8 features

Feature Engineering

In order to conduct a cleaner analysis, I created new features from the existing features. Some of these new features include ‘Year’, ‘Month’, ‘Day’, and ‘Hour’ columns engineered from a datetime object. This was extremely helpful in merging the DataFrames and ordering the data in a time series.
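A minimal sketch of this step in pandas, assuming a timezone-aware ‘datetime’ column like the one created in the Time Conversion step below (the DataFrame and column names here are illustrative):

```python
import pandas as pd

# Break the datetime down into parts that are easy to group and join on.
rides['Year'] = rides['datetime'].dt.year
rides['Month'] = rides['datetime'].dt.month
rides['Day'] = rides['datetime'].dt.day
rides['Hour'] = rides['datetime'].dt.hour
```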

My original target variable was ‘surge_multiplier’, which was a categorical feature spanning from 1.0 to 3.0 in increments of 0.25 excluding 2.25 and 2.75. This target variable represented degrees of demand surge (1.25, 1.5, … 3.0), or lack of demand surge (1.0).

I decided to engineer a new binary feature called ‘demand_surge’ and set it as my new target variable. I assigned ‘demand_surge’ to 0 if ‘surge_multiplier’ equaled 1.0 (Had no demand surge), and assigned ‘demand_surge’ to 1 otherwise (Had demand surge).
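Assuming the column names above, the new target can be derived in one line:

```python
# 1 = ride had any surge multiplier above 1.0, 0 = no surge
rides['demand_surge'] = (rides['surge_multiplier'] > 1.0).astype(int)
```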

Time Conversion

I had the opportunity to write functions to handle the conversion of Epoch (Unix) timestamps in seconds (Weather DataFrame) and milliseconds (Lyft DataFrame) into Datetime objects. This helped me make sense of the timestamps in the data.

I also figured out how to convert the Datetime object that’s in Coordinated Universal Time (UTC) into Boston’s Eastern time zone.
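A sketch of both conversions with pandas (the ‘time_stamp’ column name is an assumption based on the raw datasets):

```python
import pandas as pd

# Lyft timestamps are in milliseconds; weather timestamps are in seconds.
rides['datetime'] = pd.to_datetime(rides['time_stamp'], unit='ms', utc=True)
weather['datetime'] = pd.to_datetime(weather['time_stamp'], unit='s', utc=True)

# Shift from UTC to Boston's Eastern time zone.
rides['datetime'] = rides['datetime'].dt.tz_convert('US/Eastern')
weather['datetime'] = weather['datetime'].dt.tz_convert('US/Eastern')
```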

Drop columns

I had the pleasure of dropping many columns for a cleaner analysis. The columns I dropped were high-cardinality, redundant, or prone to data leakage (‘price’, ‘surge_multiplier’, etc.).

Fill NaN Values

The ‘Rain’ column in the weather dataset was the only column containing null values in either DataFrame. After careful inspection of the ‘Rain’ column’s unique values, I concluded that null values mean 0 inches of rain and therefore replaced these null values with 0.
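In pandas this is a one-liner (assuming the column is named ‘rain’):

```python
# Missing 'rain' values indicate no recorded rainfall in that hour.
weather['rain'] = weather['rain'].fillna(0)
```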

Merging DataFrames

In order to analyze the demand surge of Lyft rides alongside the weather, I had the privilege of merging the Lyft rides dataset and the weather dataset. I performed an inner join, keeping only records that exist in both tables.

I joined on the criteria of time (by year, month, day, and hour) and location. This ensured that the timing and location matched between Lyft rides and weather patterns. For this analysis, I took the liberty of assuming that the weather is consistent throughout a given hour.
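A sketch of the join, assuming the engineered time columns from earlier and that the Lyft ‘source’ column and the weather ‘location’ column hold the same Boston neighborhood names:

```python
# Inner join: keep only hours/locations present in both DataFrames.
df = rides.merge(
    weather,
    how='inner',
    left_on=['Year', 'Month', 'Day', 'Hour', 'source'],
    right_on=['Year', 'Month', 'Day', 'Hour', 'location'],
)
```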

Resampling

During the wrangle step, I discovered a severe imbalance in the target variable (demand_surge). The majority class, 0 (No demand surge), made up 96.97% of the total observations. This is problematic because a model could score ~97% accuracy simply by always predicting “no surge” while learning nothing useful about when surges occur. In order to combat this issue, I applied a resampling technique called undersampling, which removes random observations from the majority class until the classes contain equal numbers of observations. The resulting DataFrame contained ~76k observations.
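A minimal undersampling sketch with pandas, building on the hypothetical merged DataFrame from the previous sketch (a fixed random_state keeps the random sampling reproducible):

```python
# Randomly drop majority-class rows until both classes are the same size.
surge = df[df['demand_surge'] == 1]
no_surge = df[df['demand_surge'] == 0].sample(n=len(surge), random_state=42)
df_balanced = pd.concat([surge, no_surge]).sort_values('datetime')
```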

Cleaned DataFrame

Wrangled DataFrame with Lyft & Weather data (76804 observations, 17 columns)

Description of features

Split Data

I split the data into Train, Validation, and Test sets that take on a 70–15–15% distribution, respectively. Because I considered this data a time series, I performed the split chronologically.

The target variable is ‘demand_surge’, which will be referred to as ‘demand surge’ or ‘surge in demand’ throughout this article, and takes on values of 0 (No demand surge) or 1 (Experiences demand surge).
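Because the split is chronological, it comes down to sorting by time and slicing, as sketched below (continuing with the hypothetical df_balanced from the resampling step):

```python
# 70/15/15 split with no shuffling: earlier rides train the model,
# later rides validate and test it.
df_balanced = df_balanced.sort_values('datetime')
n = len(df_balanced)
train = df_balanced.iloc[:int(n * 0.70)]
val = df_balanced.iloc[int(n * 0.70):int(n * 0.85)]
test = df_balanced.iloc[int(n * 0.85):]

target = 'demand_surge'
X_train, y_train = train.drop(columns=target), train[target]
X_val, y_val = val.drop(columns=target), val[target]
X_test, y_test = test.drop(columns=target), test[target]
```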

Establish Baseline

Because we are dealing with a classification problem, I established the baseline score for my model as the proportion of the majority class in my target variable (demand_surge). After undersampling, our baseline accuracy for this model is 0.5009.
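In code, the baseline is simply the frequency of the most common class in the training target:

```python
# Accuracy you'd get by always predicting the majority class.
baseline = y_train.value_counts(normalize=True).max()
print(f'Baseline accuracy: {baseline:.4f}')  # ~0.5009 after undersampling
```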

Build Model

In order to create the most accurate predictor possible, I tried both linear and ensemble methods on my data. The linear method I used was Logistic Regression, and the ensemble methods were a Random Forest Classifier and a Gradient Boosting Classifier.

In order to build these models in a scalable way, I built each one inside a pipeline that transforms the data. Each pipeline uses an encoder for the categorical features (OneHotEncoder or OrdinalEncoder), and the Logistic Regression pipeline also includes a StandardScaler.
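Here is a rough sketch of what these pipelines can look like, assuming the category_encoders library for the encoders; the settings shown are illustrative defaults rather than the project’s exact configuration:

```python
import category_encoders as ce
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Logistic Regression: one-hot encode categoricals, then scale everything.
model_lr = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

# Tree ensembles: ordinal encoding is enough and no scaling is needed.
model_rf = make_pipeline(
    ce.OrdinalEncoder(),
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
)
model_gb = make_pipeline(
    ce.OrdinalEncoder(),
    GradientBoostingClassifier(random_state=42),
)

for model in (model_lr, model_rf, model_gb):
    model.fit(X_train, y_train)
```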

Check Metrics

In order to pick the best performing model, I printed the accuracy scores of each model with my training and validation data.

Accuracy Scores for each model on training and validation data.

The Gradient Boosting Classifier performed the best on my validation data with a score of 61.9%, so I decided to move forward with this model.
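For reference, the comparison above boils down to a short loop over the fitted pipelines from the earlier sketch:

```python
for name, model in [('Logistic Regression', model_lr),
                    ('Random Forest', model_rf),
                    ('Gradient Boosting', model_gb)]:
    print(f'{name}: '
          f'train={model.score(X_train, y_train):.3f}, '
          f'val={model.score(X_val, y_val):.3f}')
```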

Tune Model

In order to find the optimal hyperparameters for my Gradient Boosting Classifier, I used RandomizedSearchCV to iterate through different values of the hyperparameters n_estimators, criterion, and loss. These were the optimal values:

With these optimal hyperparameters, my Gradient Boosting Classifier performed better than my previous accuracy score on the validation data, with a score of 64.0%.
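A hedged sketch of the search: the parameter ranges below are assumptions, the option names for criterion and loss vary across scikit-learn versions, and the parameter prefixes come from make_pipeline’s automatic step naming.

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# Pipeline parameters are addressed as '<step name>__<parameter>'.
param_distributions = {
    'gradientboostingclassifier__n_estimators': randint(50, 500),
    'gradientboostingclassifier__criterion': ['friedman_mse', 'squared_error'],
    'gradientboostingclassifier__loss': ['log_loss', 'exponential'],
}

search = RandomizedSearchCV(
    model_gb,
    param_distributions=param_distributions,
    n_iter=10,
    scoring='accuracy',
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)

print(search.best_params_)
print('Validation accuracy:', search.best_estimator_.score(X_val, y_val))
```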

Communicate Results

Permutation Importance

In order to figure out which features had the biggest influence on my target variable (demand_surge), I used permutation importance, which shuffles the values of each feature in turn and ranks features by how much that shuffling changes model accuracy. Below are my findings:
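A sketch of how such a ranking can be computed with scikit-learn’s permutation_importance, using the tuned pipeline and validation set from the earlier sketches:

```python
import pandas as pd
from sklearn.inspection import permutation_importance

best_model = search.best_estimator_
result = permutation_importance(
    best_model, X_val, y_val,
    scoring='accuracy', n_repeats=5, random_state=42,
)

# Mean drop in validation accuracy when each feature is shuffled.
importances = pd.Series(result.importances_mean, index=X_val.columns)
print(importances.sort_values(ascending=False))
```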

Partial Dependence Plot (One Variable)

I utilized a Partial Dependence Plot (PDP) to visualize the effect of a feature’s values on Demand Surge. I decided to plot Source and Name individually against Demand Surge because they were the two most important features according to the Permutation Importance. Below is the PDP for the ‘Source’ variable.

How ‘Source’ values affect the predicted probability of ‘Demand Surge’

From the graph, you can see that the Source of the ride has an effect on the probability of experiencing a surge in demand for that ride. Fenway and Back Bay have the highest probability, whereas North Station has the lowest probability.

Next, I examined the ‘Name’ variable:

The graph illustrates that the probability of experiencing a surge in demand for a Lyft ride is unaffected by the type of ride unless a ‘Shared’ ride is taken. If you call a Lyft and choose the ‘Shared’ option, the probability of experiencing a demand surge decreases.

Partial Dependence Plot (Two Variables)

Next, I looked at how Source and Name (together) affect the predicted Demand Surge probability. Below is a graph illustrating this relationship.

The PDP illustrates that the probability of a surge in demand increases the most when the encoded Source of the ride is between 2 and 6 and the encoded Name of the ride is not 6.

In practical terms, this means that if you call a Lyft ride and get picked up from the Theatre District, Northeastern University, Fenway, Back Bay, or Boston University and the ride is not a ‘Shared’ Lyft ride, you should expect the greatest probability of a surge in demand, therefore higher than normal pricing.
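One way to produce plots like these is scikit-learn’s PartialDependenceDisplay; the step and column names below are assumptions tied to the earlier pipeline sketch, and the integer codes on the axes come from the ordinal encoder.

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Plot on an ordinal-encoded copy of the data so the categorical features
# appear as the integer codes seen on the axes above.
encoder = best_model.named_steps['ordinalencoder']
classifier = best_model.named_steps['gradientboostingclassifier']
X_val_encoded = encoder.transform(X_val)

# One-variable PDPs for 'source' and 'name', plus their two-way interaction.
PartialDependenceDisplay.from_estimator(
    classifier, X_val_encoded,
    features=['source', 'name', ('source', 'name')],
)
plt.show()
```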

Seaborn Heat Map

I restricted my data to only observations with a demand surge and created a Seaborn heat map to illustrate on which days of the week and at which hours of the day demand surges occur most frequently. Here is the resulting heat map:

Gradient illustrates frequency of observations. White illustrates absence of data.

The heat map illustrates that Wednesday has the most surges in demand of any day of the week, and that surges occur most frequently between the hours of 11 and 17 (11 a.m. and 5 p.m. Eastern).

The days of the week with the most surges in demand, in descending order, are Wednesday (32.9% of observations), Monday (18.2%), Tuesday (13.9%), Sunday (12.8%), Friday (7.6%), Thursday (7.4%), and Saturday (7.1%).

The top 5 hours of the day with the most surges in demand are 11 (7.5%), 12 (6.7%), 15 (6.1%), 17 (5.9%), and 13 (5.8%).
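A minimal sketch of how such a heat map can be built with pandas and Seaborn, again using the hypothetical merged DataFrame and column names from the earlier sketches:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Keep only rides that experienced a surge, then count them by
# day of week (rows) and hour of day (columns).
surges = df_balanced[df_balanced['demand_surge'] == 1]
counts = pd.crosstab(surges['datetime'].dt.day_name(),
                     surges['datetime'].dt.hour)

sns.heatmap(counts, cmap='viridis')
plt.show()
```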

Test Data

I ran my final model on my test data, which I set aside to simulate a real-world implementation of the model. My final score was 65.78%, which is 1.78 percentage points above my tuned model’s performance on the validation set, and a significant improvement over our 50.09% baseline.

Accuracy Score of 65.78%

Conclusion

In conclusion, weather has a negligible effect on the demand for Lyft rides in Boston during this time period. The characteristics that mattered most were Source (where a rider is picked up), Name (type of Lyft ride), and Distance (length of the ride); all other features are negligible in predicting the demand for a Lyft ride and contribute noise to our analysis.

Source

Some pickup locations like Fenway or Back Bay increase the probability of experiencing a demand surge, while a location like North Station has the lowest probability of experiencing a demand surge.

Name

Choosing a ‘Shared’ Lyft ride lowers the probability of experiencing a demand surge, while all other types of Lyft rides (e.g. Lux Black XL, Lyft XL, etc.) have no meaningful effect on the probability of a demand surge.

Ultimately, lowering the probability of experiencing a demand surge can save you money because Lyft uses surge pricing: the higher the demand for rides, the higher the price you pay to be matched with a driver.

In addition, understanding what causes a surge in demand can help Lyft create new products or features for its app to connect riders and drivers more efficiently.

My Takeaways

This project took me two weeks to complete. I’m most proud of my data wrangling and data visualization because these were the areas where I learned the most and feel most fulfilled.

Some tangible skills that I picked up from this project include:

  1. Epoch (Unix) to Datetime conversion and wrangling
  2. Strategic feature engineering and merging to get a cleaned DataFrame that makes sense given our business question
  3. Resampling (under-sampling) to solve imbalanced classification
  4. Visualization techniques for PDP plots and a Seaborn Heatmap
  5. A greater understanding of the command line, virtual environments and importing packages

Opportunities for Improvement

As mentioned earlier, this project took me two weeks to complete and could be improved given more time. Some things I would consider to improve my results would be:

  1. Add regularization techniques — my models overfit my training data, and I think simplifying my model would improve its results drastically. After observing my Permutation Importance results, I would drop all columns that contribute noise to the model.
  2. Standardize data — before training, I would standardize the training data to add uniformity. Rescaling this data would allow my models to train much better on the training set.
  3. More visualizations — it’s critical to communicate your results to stakeholders, and I think I could have made additional visualizations to communicate my results. Additionally, I would have worked on cleaning up my current visualizations and look into making them even simpler so a stakeholder can gain insights from them quickly.
