HDSC Stage F OSP: Food Delivery Time Prediction

Technology has transformed the world, and everyday activities become more automated each year. The food industry is no exception: food delivery has become much easier and faster, and getting food delivered to your doorstep now takes only an order placed through a mobile or web application.

As a result, predicting how long a food delivery will take has become very valuable to customers, delivery companies and restaurants.

Problem Statement

The value of predicting delivery time is hard to dispute. It applies widely to mobile and web applications for ordering food: accurate delivery time estimates across cuisines and locations could be a key factor in improving the user experience of these applications.

Our objective was to train and test a model that predicts the food delivery time for every record in the test set.

Dataset Description

The dataset was obtained from a hackathon organized by IMS Proschool. It is a collection of 13,868 delivery records from 8,661 restaurants in India. The training set consists of 11,094 records and the test set consists of 2,774 records. There are 9 features in total:

Restaurant ID
Location Address
Cuisines
Average Cost
Minimum Order
Rating
Votes
Reviews
Delivery Time (Target Variable)

Importing the data and viewing the first 5 data records
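
The original code for this step is not reproduced in this post. As a minimal sketch, the snippet below reads a CSV into a list of records and prints the first five. The column names, the restaurant IDs and the two sample rows are assumptions made up for illustration, and the standard `csv` module is used only so the example runs without extra packages.

```python
import csv
import io

# Made-up sample standing in for the real training file; the column names
# are assumptions based on the features mentioned in this post.
SAMPLE_CSV = (
    "Restaurant,Location,Cuisines,Average_Cost,Minimum_Order,"
    "Rating,Votes,Reviews,Delivery_Time\n"
    'ID_1,"Sector 3, Noida","Fast Food, Rolls",\u20b9200,\u20b950,3.5,12,4,30 minutes\n'
    'ID_2,"Main Road, Pune",North Indian,\u20b9100,\u20b950,Opening Soon,-,-,45 minutes\n'
)


def load_records(handle):
    """Read delivery records from an open CSV file into a list of dicts."""
    return list(csv.DictReader(handle))


# With a real file this would be something like:
#   with open("train.csv", newline="", encoding="utf-8") as f:  # hypothetical name
#       records = load_records(f)
records = load_records(io.StringIO(SAMPLE_CSV))

# View the first 5 records (the sample only contains 2).
for row in records[:5]:
    print(row)
```

Every value is still a string at this point, which is why the cleaning step is needed before any modelling.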


Data Cleaning

The data cleaning consisted of two main steps:

  1. Removing unwanted symbols from the data: characters such as currency symbols had to be removed from the Average Cost and Minimum Order columns.
  2. Dealing with non-numeric values in numerical columns:
    The word “for” occurred in the Average Cost and Minimum Order columns.
    “Opening Soon” and “Temporarily Closed” occurred in the Rating column.
    Hyphens (“-”) occurred in the Votes and Reviews columns. These were replaced with -999.
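
The cleaning steps above can be sketched as one small helper. This is an illustration rather than our exact code: the function name `clean_number` is hypothetical, and treating every unparseable value (words as well as hyphens) with the same -999 sentinel is an assumption.

```python
import re

MISSING = -999  # sentinel value used for entries that hold no usable number


def clean_number(raw):
    """Convert a raw cell to a float, or return MISSING if it has no number.

    Currency symbols, thousands separators and stray words are stripped
    before conversion.
    """
    # Keep only digits and the decimal point.
    stripped = re.sub(r"[^0-9.]", "", str(raw))
    try:
        return float(stripped)
    except ValueError:
        # Nothing numeric was left, e.g. "for", "-" or "Opening Soon".
        return MISSING
```

For example, `clean_number("\u20b9200")` gives `200.0`, while `clean_number("Opening Soon")` and `clean_number("-")` both give `-999`.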

Exploratory Data Analysis

The “Location” feature was explored to find out the cities with the highest food delivery count.

We also explored the target column (Delivery Time) to find the most frequent food delivery time.

We explored the “Cuisines” feature to find the most frequent cuisine delivered.

From the visualizations, we noticed that Noida is the city with the most food deliveries, 30 minutes is the most frequent delivery time and Fast Food is the most frequently ordered cuisine.
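
The counts behind these charts come down to frequency tables. A minimal sketch, assuming each record is a dict with `City`, `Cuisines` and `Delivery_Time` fields (assumed names) and using three made-up rows rather than the real data:

```python
from collections import Counter

# Made-up rows for illustration only; the field names are assumptions.
rows = [
    {"City": "Noida", "Cuisines": "Fast Food, Rolls", "Delivery_Time": "30 minutes"},
    {"City": "Noida", "Cuisines": "Fast Food", "Delivery_Time": "30 minutes"},
    {"City": "Pune", "Cuisines": "North Indian, Fast Food", "Delivery_Time": "45 minutes"},
]


def most_common_value(rows, field):
    """Return the (value, count) pair that occurs most often in a field."""
    return Counter(row[field] for row in rows).most_common(1)[0]


def most_common_cuisine(rows):
    """Count each cuisine separately, since one record can list several."""
    counts = Counter(
        cuisine.strip()
        for row in rows
        for cuisine in row["Cuisines"].split(",")
    )
    return counts.most_common(1)[0]


print(most_common_value(rows, "City"))
print(most_common_value(rows, "Delivery_Time"))
print(most_common_cuisine(rows))
```

Splitting the cuisine string before counting matters because a single record can list several cuisines.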

Feature Engineering

The following new features were added:
Minimum_Order_Zero: a column of 0s and 1s, where 1 means the minimum order for that sample was 0 and 0 means otherwise.
Minimum_Order_to_Cost: the minimum order divided by the average cost for each sample.
Reviews by Votes: the number of reviews divided by the number of votes for each sample.
Num_of_Restaurants_City: the number of restaurants in the city of each sample.
Restaurants_Branch_Count: the number of branches of the restaurant in each sample.
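
A minimal sketch of how these features could be computed, assuming the numeric columns are already cleaned. Several details are assumptions rather than our exact implementation: the input field names, returning 0.0 when a denominator is zero or a negative sentinel, counting distinct restaurant IDs per city, and counting records per restaurant ID as its branch count.

```python
from collections import Counter, defaultdict


def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero or negative."""
    return numerator / denominator if denominator > 0 else 0.0


def add_features(rows):
    """Return copies of the rows with the engineered columns added."""
    restaurants_in_city = defaultdict(set)
    branches = Counter()
    for row in rows:
        restaurants_in_city[row["City"]].add(row["Restaurant"])
        branches[row["Restaurant"]] += 1

    enriched = []
    for row in rows:
        new = dict(row)
        new["Minimum_Order_Zero"] = 1 if row["Minimum_Order"] == 0 else 0
        new["Minimum_Order_to_Cost"] = safe_ratio(
            row["Minimum_Order"], row["Average_Cost"]
        )
        new["Reviews_by_Votes"] = safe_ratio(row["Reviews"], row["Votes"])
        new["Num_of_Restaurants_City"] = len(restaurants_in_city[row["City"]])
        new["Restaurants_Branch_Count"] = branches[row["Restaurant"]]
        enriched.append(new)
    return enriched


# Made-up rows for illustration only.
rows = [
    {"Restaurant": "ID_1", "City": "Noida", "Average_Cost": 200,
     "Minimum_Order": 50, "Votes": 12, "Reviews": 4},
    {"Restaurant": "ID_1", "City": "Pune", "Average_Cost": 200,
     "Minimum_Order": 0, "Votes": 10, "Reviews": 5},
    {"Restaurant": "ID_2", "City": "Noida", "Average_Cost": 100,
     "Minimum_Order": 50, "Votes": -999, "Reviews": -999},
]
featured = add_features(rows)
```

The guard in `safe_ratio` keeps the -999 sentinel from producing misleading ratios, but whether 0.0 is the right fallback depends on the model.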

We also reduced the number of distinct delivery locations, because many of them were different parts of the same city. In such cases we replaced each locality with the city it is located in.
This required a lot of research on Indian cities to make sure that the right city, cuisine categories and other details were entered.
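
Collapsing localities into cities can be sketched as a keyword lookup. The table below is a tiny hypothetical stand-in for the manually researched mapping, and the function name `to_city` and the "Other" fallback are assumptions.

```python
# Hypothetical lookup table: keyword found in the address -> city.
CITY_KEYWORDS = {
    "noida": "Noida",
    "pune": "Pune",
    "whitefield": "Bangalore",  # a neighbourhood mapped to its city
}


def to_city(location, keywords=CITY_KEYWORDS, default="Other"):
    """Map a free-text location address to a city name."""
    text = location.lower()
    for keyword, city in keywords.items():
        if keyword in text:
            return city
    # No known keyword matched the address.
    return default
```

With this table, `to_city("Sector 3, Noida")` returns `"Noida"` and `to_city("Main Road, Whitefield")` returns `"Bangalore"`.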


In developing our model, we used two machine learning methods: random forest and LightGBM.

We applied one-hot encoding to some features, particularly the “Cuisines” feature, and label encoding to others.

This gave us a total of 11,274 columns, which we converted to a sparse matrix for training our model.
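
To show why a sparse representation helps here, the sketch below one-hot encodes a multi-valued column and stores only the positions of the 1s. It is a simplified illustration in plain Python, not the code we ran; the function name `one_hot_sparse` and the sample cuisines are made up.

```python
def one_hot_sparse(values_per_row):
    """One-hot encode multi-valued rows as sparse (row, column) coordinates.

    Returns the sorted vocabulary and a list of (row, col) pairs whose
    value is 1; every other cell is implicitly 0.
    """
    vocabulary = sorted({v for values in values_per_row for v in values})
    column_of = {v: i for i, v in enumerate(vocabulary)}
    coords = [
        (r, column_of[v])
        for r, values in enumerate(values_per_row)
        for v in sorted(set(values))
    ]
    return vocabulary, coords


# Made-up cuisine lists for three records.
cuisines = [["Fast Food", "Rolls"], ["Fast Food"], ["North Indian", "Fast Food"]]
vocabulary, coords = one_hot_sparse(cuisines)
print(vocabulary)
print(coords)
```

Only 5 of the 9 cells in this toy matrix are stored. With thousands of columns and a handful of non-zero entries per row, the saving is much larger. If SciPy is available, the coordinates can be handed to `scipy.sparse.coo_matrix` to build an actual sparse matrix.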

After fitting the models, we made predictions on our test data and found that the LightGBM model gave the highest accuracy.
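
Comparing models by accuracy can be sketched as follows, assuming delivery time is treated as a class label and accuracy is the share of records predicted exactly right. The labels and the model names `model_a` and `model_b` are placeholders, not our actual predictions or results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def best_model(y_true, predictions_by_model):
    """Return the name of the most accurate model and all the scores."""
    scores = {
        name: accuracy(y_true, preds)
        for name, preds in predictions_by_model.items()
    }
    return max(scores, key=scores.get), scores


# Placeholder labels for illustration only.
y_true = ["30", "45", "30", "65"]
predictions = {
    "model_a": ["30", "30", "30", "65"],
    "model_b": ["30", "45", "30", "65"],
}
winner, scores = best_model(y_true, predictions)
print(winner, scores)
```

In practice the comparison should be made on held-out data that neither model saw during training.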


Our data engineer successfully built and tested a pipeline for our model using Kubeflow.


Data cleaning and feature engineering were the most crucial parts of building a good model. We experimented with several features to get the best accuracy.
