Walmart Sales Forecasting

Use historical data to predict store sales

Every observation starts with a question.

What is forecasting? Forecasting is the process of making predictions about the future based on past and present data, most commonly by analysing trends. In this blog post, we are going to predict the future sales of Walmart stores using traditional machine learning algorithms.

Let’s look at the problem we are trying to solve

You are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department-wide sales for each store. In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modelling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.

What is the dataset schema?

Let’s discuss the evaluation metric

Our predicted sales should be as close as possible to the actual sales, especially during the holidays, because the sales made during holidays are much higher than on other days.

What are the business objectives and constraints?

1.) Stationarity may not be well preserved.

2.) The evaluation metric used here is the weighted mean absolute error (WMAE) between the predicted values and the ground-truth values (the exact formula is given after this list).

3.) For holidays, the predictions are given more weight. We need to focus relatively more on the observations recorded during holidays.

4.) If we model the observations correctly, this could help Walmart plan its future budget, materials, etc., and could increase its revenue substantially.
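
For reference, the weighted mean absolute error used by the competition can be written as

$$\text{WMAE} = \frac{1}{\sum_i w_i} \sum_i w_i \left| y_i - \hat{y}_i \right|, \qquad w_i = \begin{cases} 5 & \text{if week } i \text{ is a holiday week} \\ 1 & \text{otherwise,} \end{cases}$$

where $y_i$ is the actual weekly sales and $\hat{y}_i$ is our prediction.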

Some useful links

We need to install some libraries in Python to get started

# pip is a package manager for installing Python packages
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install scikit-learn
!pip install xgboost
!pip install seaborn

Let’s import the libraries we have installed

# Importing the necessary libraries..
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
from collections import Counter
import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import seaborn as sns

Let’s download the dataset from the Kaggle website.

The dataset is hosted on Kaggle. There are many ways to download data from Kaggle; I will describe one method which is easy and relatively painless.

Steps to download the data from Kaggle :

1.) Login to your Kaggle account

2.) Install the CurlWget extension in the Google Chrome browser.

3.) Search for the data in the Kaggle search bar.

4.) Join the competition. Click the Download All button.

5.) Then cancel the download.

6.) Open your CurlWget extension which is present in the top right corner of your browser.

7.) Copy the link provided in it.

Disclaimer: the above method may violate Kaggle's terms of service.

We will use the wget command to download the data using the link we copied earlier.

# The wget command is used to download the file from the copied link
!wget --header="Host: storage.googleapis.com" --header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 Edg/85.0.564.70" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" --header="Accept-Language: en-US,en;q=0.9" --header="Referer: https://www.kaggle.com/" "https://storage.googleapis.com/kaggle-competitions-data/kaggle-v2/3816/32105/bundle/archive.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1602507515&Signature=J%2B9Esap36xIIDR0M0dqTLteFcX1zQf5IpmV0We%2Bk0ygL0aAYERuo8kcL%2F5ZSct%2BDNkXWi3M5B21qV2vv54Sjex1NXAH7vP4FAxdPazbR32A02Rez0fKwLcdvXkjY8TxBxCOfifHt6ZwbpkS0e8gszEYdA5HdJbSWHvpYM21xnicAqCoWxz3t%2Bj87ivr4CJF7gjmrBlf%2B67ciNFi0eEMx%2BQek5XOjXN1Hai5BrKeSIZSN83Pvl%2FBxwglarT0WyDIefBm6ekm9nDLH7KaZ5TjAqV%2BwQ0cEiI%2Bqz1Lv5AjAdJaS02dnu736L8ILgjZGtf7jllBjQ3iJKWM53X4C40R%2FlA%3D%3D&response-content-disposition=attachment%3B+filename%3Dwalmart-recruiting-store-sales-forecasting.zip" -c -O 'walmart-recruiting-store-sales-forecasting.zip'
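
As a cleaner alternative that stays within Kaggle's rules, the official Kaggle API can download the competition data directly. This is a minimal sketch assuming the kaggle package is installed and your API token is placed in ~/.kaggle/kaggle.json:

# Downloading the competition data with the official Kaggle CLI (requires an API token).
!pip install kaggle
!kaggle competitions download -c walmart-recruiting-store-sales-forecasting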

The downloaded files are compressed, so we need to extract them.

# Unzipping the files downloaded.
!unzip 'walmart-recruiting-store-sales-forecasting.zip'

After we execute the above command, the following files appear

1.) stores.csv

2.) test.csv.zip

3.) train.csv.zip

4.) features.csv.zip

We need to extract the compressed files again.

# Unzipping the files downloaded.
!unzip 'test.csv.zip'
!unzip 'train.csv.zip'
!unzip 'features.csv.zip'

We are given some past observations about the Walmart stores, and our task is to predict the weekly sales from these observations. The observations are indexed by the time at which they were recorded, so this is a time-series problem. There are many statistical algorithms for modelling time-series data, but our goal here is to model the same observations using traditional machine learning algorithms.

Exploratory Data Analysis + Feature Engineering

First, we need to load the data from disk. Then, we will view the data in a tabular format.

# Reading the csv files from the disk using the read_csv method from the pandas library.
# It converts each file into a pandas dataframe object.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
features = pd.read_csv('features.csv')
stores = pd.read_csv('stores.csv')

We are basically reading the .csv files using read_csv function from pandas library.

# Using the head method, we display the dataframe
train.head(5)
The train dataframe contains ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday'] as its attributes.
For each store and department, the Weekly_Sales attribute contains the total sales for a week.
Each successive date is approximately 7 days ahead of the previous date.

#Using the head method, we display the dataframe
features.head(5)

This dataframe contains some additional information for each store and department in the train dataframe.
The details about these attributes are given in the beginning of this notebook.

# Using the head method, we display the dataframe
stores.head(5)
For each store, we have an attribute named Type associated with it.
For each store, we have an attribute named Size associated with it.

Since the attributes contained in the features and stores dataframes are related to the attributes in the train dataframe, we will merge these dataframes together.
Whatever feature is added to our training data should also appear in the test data, so we concatenate the train and test data to make our job easier.
# The merge operation is supported by the merge method.

# We are concatenating the train and test dataframes. Any additional feature added to the
# training data will then automatically be added to the test data as well.

train_df = pd.concat([train,test],axis=0,ignore_index=True)

# We are merging the combined dataframe with the features dataframe on their common keys.

data = pd.merge(train_df,features,how='inner',on=['Store','Date','IsHoliday'])

# We are merging the above resultant dataframe with the stores dataframe where they intersect each other.

walmart_df = pd.merge(data,stores,how='inner',on=['Store'])

# Printing the dataframe.
# Now every attribute about the store and department is in a single dataframe, which makes it easier to analyze.
walmart_df.head(5)

#Printing the number of records. The len function returns the number of records.
print("Number of records ",len(walmart_df))

Number of records 536634

Here, the number of records includes both training and the test data.

# Printing the number of records.
print("Number of records present in the training data ",len(train))
print("Number of records present in the test data ",len(test))

Number of records present in the training data 421570

Number of records present in the test data 115064

Let’s see how many records we have for each store

# For each store, we are counting the number of records
# Creating a counter object on store attribute.
features_store = Counter(walmart_df['Store'].values)

# Matplotlib style(comics)
plt.xkcd()

# Storing the store numbers in x
x = list(features_store.keys())
# Storing the record count for each store in y
y = list(features_store.values())



# Plotting the barplot with x and y
plt.bar(x,y,label='#_records',color='r')

# For displaying purpose
plt.legend(prop={'size':15})
plt.xlabel('Store Number')
plt.ylabel('# Records')
plt.title('Store number vs Number of records')

# Displaying the plot
plt.tight_layout()
plt.show()
  • We have at least 7500 records for each store.
  • The observations were taken from 2010-02-05 to 2013-07-26.

Now, we will look at some simple statistics about our data.

#The describe method returns a dataframe containing some common statistical properties of the data.
walmart_df.describe()
  • For the Weekly_Sales attribute, we have a total of 421570 valid values.
  • The values for the Weekly_Sales attribute is not present in the test data. We need to predict the values. Since we concatenated the train and test data , the values are marked as NaN (missing).
  • We observe some missing values in the CPI and Unemployment attribute.

Machine learning algorithms can’t deal with missing values directly: we either need to fill them with some statistic or remove them. We will fill the missing values of an attribute with the mean of that attribute.

# A boolean array denoting the presence of nan values
missing_cpi = walmart_df['CPI'].isna()
missing_unemp = walmart_df['Unemployment'].isna()

mean_c = walmart_df['CPI'].mean()
mean_u = walmart_df['Unemployment'].mean()

# Calculating the average CPI and Unemployment based on grouping the Store,Department.
mean_CPI = walmart_df.groupby(['Store','Dept'])['CPI'].mean()
mean_Unemployment = walmart_df.groupby(['Store','Dept'])['Unemployment'].mean()

# Iterating through the records
for i in range(len(missing_cpi)):
    # If the value is missing in the CPI attribute
    if missing_cpi[i]:
        # Building the (Store, Dept) key to retrieve the group average
        index = (walmart_df['Store'][i],walmart_df['Dept'][i])
        # Checking whether the key is present or not.
        try:
            mean_cpi = mean_CPI[index]
        # If not, fall back to the global mean.
        except KeyError:
            mean_cpi = mean_c
        walmart_df.at[i,'CPI'] = mean_cpi
    # If the value is missing in the Unemployment attribute
    if missing_unemp[i]:
        # Building the (Store, Dept) key to retrieve the group average.
        index = (walmart_df['Store'][i],walmart_df['Dept'][i])
        # Checking whether the key is present or not.
        try:
            mean_unemp = mean_Unemployment[index]
        # If the key is not present, fall back to the global mean.
        except KeyError:
            mean_unemp = mean_u
        walmart_df.at[i,'Unemployment'] = mean_unemp

# For some store/department combinations the CPI and Unemployment have no observed values at all,
# so no group mean exists. We fill the remaining gaps with the global mean.
walmart_df['CPI'] = walmart_df['CPI'].fillna(mean_c)
walmart_df['Unemployment'] = walmart_df['Unemployment'].fillna(mean_u)
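
For reference, the same imputation can be written without an explicit loop using groupby together with transform. This is a minimal sketch of an equivalent, more idiomatic approach (assuming the same walmart_df):

# Fill each missing CPI/Unemployment value with the mean of its (Store, Dept) group,
# falling back to the global mean where the whole group is missing.
for col in ['CPI', 'Unemployment']:
    global_mean = walmart_df[col].mean()
    group_mean = walmart_df.groupby(['Store', 'Dept'])[col].transform('mean')
    walmart_df[col] = walmart_df[col].fillna(group_mean).fillna(global_mean)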

There are also missing values in the MarkDown attributes, and our task is to fill them. There are many ways to do this; since these attributes are anonymized in the competition, it is simplest to fill them with zero. A value of zero indicates that no markdown was recorded for that week.

# Filling the missing values using the fillna method.

for i in range(1,6):
    walmart_df['MarkDown{0}'.format(i)] = walmart_df['MarkDown{0}'.format(i)].fillna(0)

We observe some negative values in the Weekly_Sales attribute. These values may have been recorded incorrectly. Since sales can’t be negative, we either need to remove them or reverse their sign. We don’t want to lose data, so we reverse the sign of any negative value.

# Calculating the number of records in the combined dataframe
num_records = len(walmart_df)
# Calculating the number of records with negative values in the Weekly_Sales attribute.
negative_records = len(walmart_df[walmart_df['Weekly_Sales']<0])

print("Percentage of records with negative values in the Weekly_Sales attribute ",(negative_records/num_records)*100)

#Filling the missing values. It is not necessary, since we are not going to use it.
walmart_df['Weekly_Sales'] = walmart_df['Weekly_Sales'].fillna(0)

#We are converting the values in the Weekly_Sales attribute to positive if it is negative.
walmart_df['Weekly_Sales'] = walmart_df['Weekly_Sales'].map(lambda x:x if x>=0 else -x)

Percentage of records with negative values in the Weekly_Sales attribute 0.2394555693452148
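
As a side note, the same sign flip can be written more compactly with the pandas abs method:

# Equivalent to the map above: take the absolute value of every weekly sale.
walmart_df['Weekly_Sales'] = walmart_df['Weekly_Sales'].abs()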

That’s enough feature engineering for now; let’s focus on data visualization.

# We will sort our dataframe based on the date. This is achieved using the sort_values method.
walmart_df.sort_values(by=['Date','Store','Dept'],inplace=True,ignore_index=True)

We sorted our records based on the date, store and the department.

We will now visualize the values of the Weekly_Sales attribute using a box plot.

# Plotting the boxplot for the Weekly_Sales attribute
f, ax = plt.subplots(figsize=(9, 6))
plt.title('Boxplot of Weekly_Sales')
fig = sns.boxplot(y='Weekly_Sales', data=walmart_df)

Let us also look at the histogram.

# Plotting the histogram of the Weekly_Sales attribute
plt.hist(walmart_df['Weekly_Sales'],bins=10)
plt.xlabel('Weekly_Sales')
plt.ylabel('Frequency')
plt.title('Histogram of Weekly_Sales')
plt.show()
  • There are many values which are greater than Q3 + 1.5*(IQR) . These could be outliers.
  • Since the notion of outliers is subjective, we need to further investigate on it.

Let’s look at the attributes that contribute to higher sales.

# Selecting those records which have higher sales.
slices = walmart_df[walmart_df['Weekly_Sales']>20206]

# Plotting Fuel_Price against Weekly_Sales.
plt.plot(slices['Fuel_Price'],slices['Weekly_Sales'])
plt.title('Fuel_Price vs Weekly_Sales')
plt.xlabel('Fuel_Price')
plt.ylabel('Weekly_Sales')
plt.show()
  • There is no clear trend in the plot.
  • The sales are higher when the values of the fuel price are closer to their mean.

# Plotting Temperature against Weekly_Sales.
plt.xkcd()
plt.plot(slices['Temperature'],slices['Weekly_Sales'])
plt.title('Temperature vs Weekly_Sales')
plt.xlabel('Temperature')
plt.ylabel('Weekly_Sales')
plt.show()
  • The sales are higher when the mean temperature is achieved.
  • It is self-evident. People tend to go for shopping when the temperature is moderate.

# Plotting CPI against Weekly_Sales.
plt.plot(slices['CPI'],slices['Weekly_Sales'])
plt.title('CPI vs Weekly_Sales')
plt.xlabel('CPI')
plt.ylabel('Weekly_Sales')
plt.show()

Given the definition of CPI, it is not surprising that sales are higher when the CPI is low.

# Plotting Unemployment against Weekly_Sales.
plt.plot(slices['Unemployment'],slices['Weekly_Sales'])
plt.title('Unemployment vs Weekly_Sales')
plt.xlabel('Unemployment')
plt.ylabel('Weekly_Sales')
plt.show()
  • There are more sales when the unemployment rate is low.
  • Still the unemployment rate doesn’t decide the sales much.

# Plotting Size against Weekly_Sales.
plt.plot(slices['Size'],slices['Weekly_Sales'])
plt.title('Size vs Weekly_Sales')
plt.xlabel('Size')
plt.ylabel('Weekly_Sales')
plt.show()
  • There is no linear relationship between sales and size. It is because, the sales also depends on the date.
  • Typically one would expect a linear relationship.

Now we will look at the dates where the sales are much higher.

#Selecting those records where the sales are much higher
slices = walmart_df[walmart_df['Weekly_Sales']>250000] 
print(np.unique(slices['Date'])) # It returns the unique values.

['2010-02-05' '2010-11-26' '2010-12-17' '2010-12-24' '2011-11-25' '2011-12-23']

This information is provided on the competition website.
These dates exactly match the dates given on the website where the sales are much higher, and they are very important for our modelling.
These dates largely decide the range of values the Weekly_Sales attribute takes.
To add this information, we define flag variables to denote the presence of these dates.
# Converting the date passed as a string to pandas datetime object
def date(dates):
    return pd.to_datetime(dates)

# Converting the Date attribute in the dataframe to pandas datetime object.
walmart_df['Date'] = pd.to_datetime(walmart_df['Date'])

# We are defining a variable with two values ('Y','N'). It is set to 'Y', if the date is present and vice-versa.
# np.where function sets the second argument as the value if the condition holds true and the third argument as the value if it doesn't.
Super_bowl = ['2010-02-05','2010-02-12','2011-02-11','2012-02-10','2013-02-08']
Labor_Day = ['2010-09-10','2011-09-09','2012-09-07','2013-09-06']
Thanksgiving = ['2010-11-26','2011-11-23','2011-11-25','2012-11-23','2013-11-29']
Christmas = ['2010-12-24','2010-12-31','2011-12-30','2012-12-28','2013-12-27']

walmart_df['Super_Bowl'] = np.where((walmart_df['Date']==date(Super_bowl[0])) | (walmart_df['Date']==date(Super_bowl[1])) | (walmart_df['Date']==date(Super_bowl[2])) | (walmart_df['Date']==date(Super_bowl[3])) | (walmart_df['Date']==date(Super_bowl[4])),1,0)
walmart_df['Labor_Day'] = np.where((walmart_df['Date']==date(Labor_Day[0])) | (walmart_df['Date']==date(Labor_Day[1])) | (walmart_df['Date']==date(Labor_Day[2])) | (walmart_df['Date']==date(Labor_Day[3])),1,0)
walmart_df['Thanksgiving'] = np.where((walmart_df['Date']==date(Thanksgiving[0])) | (walmart_df['Date']==date(Thanksgiving[1])) | (walmart_df['Date']==date(Thanksgiving[2])) | (walmart_df['Date']==date(Thanksgiving[3])) | (walmart_df['Date']==date(Thanksgiving[4])),1,0)                                                                              
walmart_df['Christmas'] = np.where((walmart_df['Date']==date(Christmas[0])) | (walmart_df['Date']==date(Christmas[1])) | (walmart_df['Date']==date(Christmas[2])) | (walmart_df['Date']==date(Christmas[3])) | (walmart_df['Date']==date(Christmas[4])),1,0)                                                                                                                         

Using the np.where function, we define indicator variables that denote the presence of any of these important days. These features are very useful: if a flag is off, we know the sales are unlikely to spike.
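
A more compact way to build the same flags is pd.Series.isin; here is a minimal equivalent sketch using the date lists defined above:

# Equivalent flags built with isin: 1 if the week's date is in the list, 0 otherwise.
walmart_df['Super_Bowl'] = walmart_df['Date'].isin(pd.to_datetime(Super_bowl)).astype(int)
walmart_df['Labor_Day'] = walmart_df['Date'].isin(pd.to_datetime(Labor_Day)).astype(int)
walmart_df['Thanksgiving'] = walmart_df['Date'].isin(pd.to_datetime(Thanksgiving)).astype(int)
walmart_df['Christmas'] = walmart_df['Date'].isin(pd.to_datetime(Christmas)).astype(int)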

The next variable that tells us something about the sales is temperature. People prefer a moderate temperature for shopping. Let’s observe the sales while the temperature is increasing or decreasing.

# This routine calculates the length of the longest subarray with increasing values.
def maximum_length_increasing_subarray(array):
    # Initializing the length variable to 1 (assumption: the array contains at least one element).
    length = 1
    # Initializing the starting position (if the array contains a single element, then start = 0).
    start=0
    # Initializing m, which keeps track of the current run length. The value 1 means we have seen at least one element.
    m = 1
    # For each element in the array
    for i in range(len(array)-2,-1,-1):
        # It denotes the condition for increasing.
        if array[i]<array[i+1]:
            m = m + 1
            #Updating the maximum_length so far.
            if length<m:
                length=m
                start=i
        else:
            m=1
    # Returning the length and starting position.
    return length, start

The above routine returns the length and starting position of the longest strictly increasing subarray. In our case the array holds the temperature values; we will look at the corresponding sales.

# Calling the method maximum_length_increasing_subarray
length, start=maximum_length_increasing_subarray(walmart_df['Temperature'])
# Printing the starting position and the length of the longest increasing subarray
print("Starting position of the maximum increasing subarray ",start)
print("Length of the maximum increasing subarray ",length)

Starting position of the maximum increasing subarray 536515

Length of the maximum increasing subarray 2

# 2 is the length of the longest increasing subarray.
x = np.arange(1,length+1,1)
#Getting the temperature values
temp = walmart_df['Temperature'][start:start+length]
#Getting the sales values
sales = walmart_df['Weekly_Sales'][start:start+length]


#Plotting the temperature and the weekly sales simultaneously.
plt.plot(x,temp,label='temperature')
plt.plot(x,sales,label='Weekly Sales')
plt.xlabel('Units')
plt.title('Increasing temperature vs Weekly Sales')
plt.legend()
plt.plot()
plt.show()
As the temperature increases, the sales tend to increase.

What happens when the temperature is decreasing?

# This routine calculates the length of the longest subarray with decreasing values.
def maximum_length_decreasing_subarray(array):
    # Initializing the length variable to 1 (assumption: the array contains at least one element).
    length = 1
    # Initializing the starting position (if the array contains a single element, then start = 0).
    start=0
    # Initializing m, which keeps track of the current run length. The value 1 means we have seen at least one element.
    m = 1
    # For each element in the array
    for i in range(len(array)-2,-1,-1):
        # It denotes the condition for decreasing.
        if array[i]>array[i+1]:
            m = m + 1
            #Updating the maximum_length so far.
            if length<m:
                length=m
                start=i
        else:
            m=1
    # Returning the length and starting position.
    return length, start

# Calling the method maximum_length_decreasing_subarray
length, start = maximum_length_decreasing_subarray(walmart_df['Temperature'])
# Printing the starting position and the length of the longest decreasing subarray.
print("Starting position of the maximum decreasing subarray ",start)
print("Length of the maximum decreasing subarray ",length)

Starting position of the maximum decreasing subarray 536567

Length of the maximum decreasing subarray 2

# 2 is the length of the longest decreasing subarray.
x = np.arange(1,length+1,1)
# Getting the temperature values
temp = walmart_df['Temperature'][start:start+length]
# Getting the sales values
sales = walmart_df['Weekly_Sales'][start:start+length]


# Plotting the temperature and the weekly sales simultaneously.
plt.plot(x,temp,label='temperature')
plt.plot(x,sales,label='Weekly Sales')
plt.xlabel('Units')
plt.title('Decreasing temperature vs Weekly Sales')
plt.legend()
plt.plot()
plt.show()
As the temperature decreases, the sales tend to go down.

We will draw some more box plots to learn more about the sales.

The sales are higher during holidays.

The two distributions look similar apart from the maximum value.

The median sale is slightly higher during the holidays.

# Plotting the boxplot for weekly_sales in Type attribute
walmart_sale = pd.concat([walmart_df['Type'], walmart_df['Weekly_Sales']], axis=1)
f, ax = plt.subplots(figsize=(9, 6))
fig = sns.boxplot(x='Type', y='Weekly_Sales', data=walmart_sale, showfliers=False)

The weekly sales are very low for Type C relative to the other types.

This feature discriminates quite well between the stores.

It largely determines the range of values the weekly sales take.

Type A stores have more sales than any other type.

Type A stores are also larger, so size affects the sales.

Since we noticed that the Type attribute discriminates the range of values the weekly sales take, we will plot pie charts describing the count and size of each type of store.

# For each type of store, we are counting the number of stores.
stores_count = stores.groupby('Type')['Store'].count()


# Using a pie chart, we are displaying the relative percentage of the observations for each type of stores.
plt.style.use("fivethirtyeight")
slices=stores_count.values
colors=['orange','white','green']
labels=['A','B','C']
plt.pie(slices,labels=labels,wedgeprops={'edgecolor':'black'},shadow=True,startangle=90,autopct='%1.1f%%')
plt.title("Number of Stores of Type '*'")
plt.tight_layout()
plt.show()

Type A stores account for the largest share, and they are also much larger than the other types of stores.

Therefore, the size attribute also discriminates the range of values the sales take.

# For each type of store, we are extracting the maximum size value.
stores_count = stores.groupby('Type')['Size'].max()

plt.style.use("fivethirtyeight")
slices=stores_count.values
colors=['orange','white','green']
labels=['A','B','C']
plt.pie(slices,labels=labels,wedgeprops={'edgecolor':'black'},shadow=True,startangle=90,autopct='%1.1f%%')
plt.title("Maximum size of store type '*'")
plt.tight_layout()
plt.show()

The maximum size of the Type C stores is very low compared to the other types.

The sales in Type C stores may therefore be lower than in the other types.

The sales in Type A stores may be relatively high.

# For each type of stores , we are extracting the minimum size value.
stores_count = stores.groupby('Type')['Size'].min()

plt.style.use("fivethirtyeight")
slices=stores_count.values
colors=['orange','white','green']
labels=['A','B','C']
plt.pie(slices,labels=labels,wedgeprops={'edgecolor':'black'},shadow=True,startangle=90,autopct='%1.1f%%')
plt.title("Minimum size of store type '*'")
plt.tight_layout()
plt.show()

Here, types A and C look similar.

Since we are dealing with time-series data, future observations depend on past observations. To incorporate this, we will add past sales as attributes: the sales of the previous one, two and three weeks.

# Adding an additional feature which holds the previous week's sales
one_week_before = list(walmart_df['Weekly_Sales'])[:-1]
one_week_before.insert(0,walmart_df['Weekly_Sales'][0])
walmart_df['delta_1'] = one_week_before

# Adding an additional feature which holds the sales from two weeks before
two_week_before = list(walmart_df['Weekly_Sales'])[:-2]
two_week_before.insert(0,walmart_df['Weekly_Sales'][1])
two_week_before.insert(0,walmart_df['Weekly_Sales'][0])
walmart_df['delta_2'] = two_week_before

# Adding an additional feature which holds the sales from three weeks before
three_week_before = list(walmart_df['Weekly_Sales'])[:-3]
three_week_before.insert(0,walmart_df['Weekly_Sales'][2])
three_week_before.insert(0,walmart_df['Weekly_Sales'][1])
three_week_before.insert(0,walmart_df['Weekly_Sales'][0])
walmart_df['delta_3'] = three_week_before
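
Note that because the dataframe is sorted by date, store and department, shifting the whole column as above mixes rows from different stores and departments. If we instead wanted the lag within each store/department series, a hedged sketch using groupby().shift() could look like this (the lag_* column names are only illustrative):

# Hypothetical alternative: lag the sales within each (Store, Dept) series.
for lag in (1, 2, 3):
    walmart_df['lag_{0}'.format(lag)] = (
        walmart_df.groupby(['Store', 'Dept'])['Weekly_Sales']
                  .shift(lag)
                  .fillna(walmart_df['Weekly_Sales'])
    )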

The attribute Date is not atomic. It can be further subdivided. We will extract the month, year and day from the Date attribute.

# Using the map function we are transforming each date to year, month and day respectively.
walmart_df['Year'] = walmart_df['Date'].map(lambda x:x.year)
walmart_df['Month'] = walmart_df['Date'].map(lambda x:x.month)
walmart_df['Day'] = walmart_df['Date'].map(lambda x:x.day)

The values of the Date attribute are pandas datetime objects, which expose several methods and properties. Using these properties, we extracted the year, month and day.
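
An equivalent, vectorized way to do the same extraction is the .dt accessor:

# Same result as the map calls above, using the datetime accessor.
walmart_df['Year'] = walmart_df['Date'].dt.year
walmart_df['Month'] = walmart_df['Date'].dt.month
walmart_df['Day'] = walmart_df['Date'].dt.day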

Now, we will add the median sales as an additional feature. This varies for each store and department and helps discriminate the sales among stores and departments. We will compute the median by grouping on the Store, Dept, Type, IsHoliday and Month attributes.

# Calculating the median by groupby operation.
median_df = pd.DataFrame({'Median_Sales':walmart_df.groupby(['Store','Dept','Type','IsHoliday','Month'])['Weekly_Sales'].median()})
# Merging the two dataframes.
walmart_df = pd.merge(walmart_df,median_df,on=['Store','Dept','Type','IsHoliday','Month'],how = 'inner')

We will add another attribute that denotes the difference of the sales from the median sales. This attribute captures the direction of the sales.

# We are taking the difference between the median sales and weekly sales
walmart_df['Difference_Median'] = walmart_df['Weekly_Sales'] - walmart_df['Median_Sales']

  • In the competition website, it was given that MarkDown are present only after November 2011.
  • It is also given that, MarkDown are available in special occasions.
  • We will add an indicator variable to denote the presence of MarkDown attribute.
# Adding indicator variables that flag the absence (zero value) of each MarkDown attribute
walmart_df['MarkDown1_indicator'] = np.where(walmart_df['MarkDown1']==0,1,0)
walmart_df['MarkDown2_indicator'] = np.where(walmart_df['MarkDown2']==0,1,0)
walmart_df['MarkDown3_indicator'] = np.where(walmart_df['MarkDown3']==0,1,0)
walmart_df['MarkDown4_indicator'] = np.where(walmart_df['MarkDown4']==0,1,0)
walmart_df['MarkDown5_indicator'] = np.where(walmart_df['MarkDown5']==0,1,0)

Now, we will try to visualize the median sales and other attribute related to the sales.

# Routine for plotting.
def plots(x,x_label,y_label,title):
    plt.plot(x)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(title)
    plt.show()

# Plotting the weekly sales
plots(walmart_df['Weekly_Sales'],'Units','Weekly_Sales','Weekly_Sales')
The sales are much higher only in some weeks.
# Plotting the median sales.
plots(walmart_df['Median_Sales'],'Units','Median_Sales','Median_Sales')
It loosely resembles the weekly sales

# Plotting the sales from the median.
plots(walmart_df['Difference_Median'],'Units','Difference_Median','Difference in the sales from the median')
Mostly the range is between -2K and +2K.

# Plotting the difference of the previous week's sales from the median sales.
plots(walmart_df['delta_1']-walmart_df['Median_Sales'],'Units','delta_1 - Median_Sales','Difference of the previous weeks sales from the median sales')

# Plotting the difference of the sales from two weeks before from the median sales.
plots(walmart_df['delta_2']-walmart_df['Median_Sales'],'Units','delta_2 - Median_Sales','Difference of the past 2 weeks sales from the median sales')

# Plotting the difference of the sales from three weeks before from the median sales.
plots(walmart_df['delta_3']-walmart_df['Median_Sales'],'Units','delta_3 - Median_Sales','Difference of the past 3 weeks sales from the median sales')

Modelling the weekly sales directly is quite hard.

We will instead model the difference of the sales from the median.

The differences have a nicer distribution, and hence are easier to model.

While modelling the differences, we mostly need to focus on the direction rather than the magnitude.

Since the Type attribute contains string values, we need to encode it.

# We are encoding the Type attribute using LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

walmart_df['Type'] = le.fit_transform(walmart_df['Type'])

Splitting the dataset into train and test.

# Selecting the records for test data
test = walmart_df[walmart_df['Date']>=pd.to_datetime('2012-11-02')]

# Selecting the records for train data
train = walmart_df[walmart_df['Date']<pd.to_datetime('2012-11-02')]

# Printing the length of the dataset (train/test)
print("Length of the training dataset ",len(train))
print("Length of the test dataset ",len(test))

Length of the training dataset 421570

Length of the test dataset 115064

# Making the Difference_Median attribute as the dependent variable
y_train = train['Difference_Median']

# Making the Difference_Median attribute as the dependent variable
y_test = test['Difference_Median']

# Dropping the irrelevant attributes.
train.drop(['Weekly_Sales','Difference_Median','Date'],axis=1,inplace=True)
test.drop(['Weekly_Sales','Difference_Median'],axis=1,inplace=True)

# Printing the dataframe.
train.head()

Modelling

Hyperparameter tuning

Machine learning algorithms are governed by many parameters, and there is no complete theoretical recipe for choosing the right set. It is mostly an educated guess, though theory sometimes helps. Many methods have been proposed for hyperparameter tuning, but not all of them are useful. One that is intuitive and correct is grid search: we define a set of candidate values for each parameter that needs to be tuned.

Suppose p and q are the hyperparameters that need to be tuned.

p = [ 1, 2, 3 ] and q = [ 4, 3 ]

If we want to fit a model with these parameters, we try every possible combination and fit a model for each. This is called a brute-force (grid) search. At each step we also evaluate the model and keep the best one. This process is expensive, and if the parameter ranges are large, the time it takes grows substantially. One method that reduces both the time and the space cost is randomized search.

Randomized search drives the search probabilistically: at each step, a value for each parameter is chosen at random (with equal probability), the model is fitted and evaluated, and the best one is kept. This takes less time and space. There are theoretical arguments for the convergence of this process, but they are beyond the scope of this post.
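
A toy illustration of the difference, using the p and q values from the example above (these numbers only serve to count combinations; they are not real hyperparameters):

# Grid search tries every combination; randomized search samples a fixed budget of them.
import itertools
import random

p = [1, 2, 3]
q = [4, 3]

grid = list(itertools.product(p, q))          # 3 * 2 = 6 model fits
print(len(grid), grid)

random.seed(42)
samples = [(random.choice(p), random.choice(q)) for _ in range(3)]   # only 3 fits
print(samples)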

Linear Regression

In statistics, linear regression is an approach for modelling a scalar dependent variable from one or more independent explanatory variables, when the relationship between them is linear or when we want to approximate it with a linear function. In a nutshell, it fits a straight line (or hyperplane) to the data. Linear regression often works well in high dimensions, an effect related to the curse of dimensionality.

To use linear regression, we need to apply some transformations to the existing data: we encode the categorical variables as one-hot vectors and normalize the numerical variables to help the optimization.

Encoding the categorical variables

# Storing the categories
categories = ['Store','Dept','IsHoliday','Type','Super_Bowl','Labor_Day','Thanksgiving','Christmas','MarkDown1_indicator','MarkDown2_indicator','MarkDown3_indicator','MarkDown4_indicator','MarkDown5_indicator']
columns = list(walmart_df.columns)

# Storing the names of the numerical variables
numeric = []
for i in columns:
    if i not in categories and i!='Date' and i!='Difference_Median' and i!='Median_Sales':
        numeric.append(i)

# Creating a dictionary to store the one-hot encoding of the categories
temp_dict={}
for i in categories:
    temp_dict[i] = pd.get_dummies(walmart_df[i])

# Storing it for future purpose.
temp_dict['numeric'] = walmart_df[numeric]
temp_dict['Date'] = walmart_df['Date']
temp_dict['IsHoliday'] = walmart_df['IsHoliday']
temp_dict['Difference_Median'] = walmart_df['Difference_Median']
temp_dict['Median_Sales'] = walmart_df['Median_Sales']
temp_dict['Store'] = walmart_df['Store']
temp_dict['Dept'] = walmart_df['Dept']

# Sorting and concatenating the dataframes.
lin_reg_walmart_df = pd.concat([temp_dict[i] for i in temp_dict.keys()],axis=1)
lin_reg_walmart_df.sort_values('Date',inplace=True,ignore_index=True)

The pd.get_dummies function is used to transform the categorical variable to a one-hot vector. Once we have transformed it, we normally concatenate it with the existing dataframe and remove the categorical variable from it.

Normalizing the numerical variables

# Normalizing the numerical variables.
for i in numeric:
    mean = lin_reg_walmart_df[i].mean()
    std = lin_reg_walmart_df[i].std()
    lin_reg_walmart_df[i]-=mean
    lin_reg_walmart_df[i]/=std

There are many techniques for normalizing numerical variables. We used one that shifts the mean to zero and scales the standard deviation to one. This is done purely for better convergence of the optimization.
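
The StandardScaler imported earlier does essentially the same thing (it uses the population standard deviation rather than the sample one, a negligible difference here); a minimal equivalent sketch, fitted on the full frame only to mirror the loop above (in practice the scaler is usually fitted on the training split alone):

# Equivalent standardization using scikit-learn's StandardScaler.
scaler = StandardScaler()
lin_reg_walmart_df[numeric] = scaler.fit_transform(lin_reg_walmart_df[numeric])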

Splitting the dataset into train and test

# Selecting the records for test data
test = lin_reg_walmart_df[lin_reg_walmart_df['Date']>=pd.to_datetime('2012-11-02')]

# Selecting the records for train data
train = lin_reg_walmart_df[lin_reg_walmart_df['Date']<pd.to_datetime('2012-11-02')]

For the training data, we select all records recorded before 2012-11-02.

For the test data, we select all records recorded on or after 2012-11-02.

# Dependent variable
y_train = train['Difference_Median']
y_test = test['Difference_Median']
train_isholiday = train['IsHoliday']

# Dropping the irrelevant attributes.
train.drop(['Weekly_Sales','Difference_Median','Date','IsHoliday','Store','Dept'],axis=1,inplace=True)
test.drop(['Weekly_Sales','Difference_Median'],axis=1,inplace=True)

Rather than using the sales directly as the dependent variable, we use the difference of the sales from the median sales. We already discussed the reason for this.

Randomized Search

# Routine to visualize the hyperparameter tuning results.
def visualize(random):
    dictionary = random.cv_results_
    dictionary.pop('params')
    return pd.DataFrame(dictionary)

The above function returns a dataframe containing the results of the hyperparameter tuning. The results are stored in the cv_results_ attribute.

# Importing the RandomizedSearchCV from the sklearn module
from sklearn.model_selection import RandomizedSearchCV
# alpha : Regularization parameter
alpha = [x for x in np.linspace(start = 0, stop = 1, num = 30)]

random_grid = {'alpha': alpha}

# Printing the random_grid
print(random_grid)

The parameter alpha needs to be tuned. It is a constant that regulates the learning process: the larger alpha is, the simpler the resulting solution.
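
For reference, scikit-learn's Ridge model minimizes the following objective, where alpha controls the strength of the L2 penalty on the weights:

$$\min_{w} \; \lVert X w - y \rVert_2^2 + \alpha \, \lVert w \rVert_2^2$$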

# Use the random grid to search for best hyperparameters
# First create the base model to tune
from sklearn.linear_model import Ridge
ridge_random = Ridge()
# Random search of parameters, using 2-fold cross validation,
# trying 75 parameter settings and 3 parallel jobs
ridge_random = RandomizedSearchCV(estimator = ridge_random, param_distributions = random_grid, n_iter = 75 , cv = 2, verbose= 2, random_state= 42, n_jobs = 3)
# Fit the random search model
ridge_random.fit(train,y_train)

Once we have defined the hyperparameters, we create an object of type RandomizedSearchCV and start the process with the fit method. We also use cross-validation, which derives an additional held-out dataset from the training data and uses it for evaluation.

# Calling the visualize routine
visualize(ridge_random)

This is a snapshot of our results. We observe a poor score, which indicates that the relationship is not linear.

# Printing the best parameters
print("The best parameters are ", ridge_random.best_params_)

We printed the alpha value that gave the lowest mean squared loss.

# Printing the best estimator and fitting on the train data
best_estimator = ridge_random.best_estimator_
best_estimator.fit(train,y_train)

The best_estimator_ attribute holds the object of the Ridge class with the best alpha. Then, we start the learning process by invoking the fit method.

# Saving the estimator using the dump method from joblib
joblib.dump(best_estimator, "./models/best_linreg.joblib")

We saved the model using joblib library.

# Loading the saved estimator from the drive.
best_estimator = joblib.load("./models/best_linreg.joblib")

We loaded the model from disk.

# This routine calculates the custom metric given in the Kaggle website
def custom_metric(train_isholiday,x,y,estimator):
    a=np.ones(shape=(x.shape[0]))
    a=a*(train_isholiday==True)*4+1 #Gives a weight of 5 if holiday and 1 if not
    y_hat = estimator.predict(x)+x['Median_Sales']
    # Plotting our predicted weekly sales against the ground truth.
    plt.plot(y_hat,color='r',label='Predicted')
    plt.plot(y,color='g',label='Truth')
    plt.xlabel('Weeks')
    plt.ylabel('Predicted weekly sales')
    plt.title('Estimated Weekly Sales')
    plt.legend()
    plt.show()
    # Calculating the weighted mean absolute deviation error.
    diff = np.abs(y-y_hat)
    diff = diff*a
    diff = np.sum(diff)
    normalize = 1/(np.sum(a))
    diff = diff*normalize
    return diff

The above routine predicts the sales, plots them against the ground truth, and computes the metric used in the competition.

# Printing the score on the train set
print("The score on the train set ",custom_metric(train_isholiday,train,y_train+train['Median_Sales'],best_estimator))

It’s a good score to get started.

We have almost finished. Next, we need to predict the sales on the private test data from the competition website using our best model.

Inference

def inference(estimator,test,filename):
    # Reading the test.csv from the disk.
    t = pd.read_csv('test.csv')
    # We are reordering the test data with the submission file.
    test.sort_values(by=['Date','Store','Dept'],inplace=True,ignore_index=True)
    t['Date'] = pd.to_datetime(t['Date'])
    v = pd.merge(t,test,on=['Store','Dept','Date','IsHoliday'],how='inner')
    v.drop(['Date','Store','Dept','IsHoliday'],axis=1,inplace=True)
    # Predicting the weekly sales for the test data.
    y_test = estimator.predict(v)
    # Adding the Median_Sales to the dependent variable.
    Y = np.array(y_test)
    Y += np.array(v['Median_Sales'])
    # Building the Id column in the format expected by the competition.
    t['Id'] = t['Store'].astype(str)+'_'+t['Dept'].astype(str)+'_'+t['Date'].astype(str)
    # Adding the predicted weekly sales.
    t['Weekly_Sales'] = Y
    # Dropping the unnecessary columns.
    t.drop(['Store','Dept','Date','IsHoliday'],axis=1,inplace=True)
    # Writing it to a csv file.
    t.to_csv('submission/'+filename,columns=['Id','Weekly_Sales'],index=False)
    # Plotting our predicted weekly sales.
    plt.plot(Y)
    plt.xlabel('Weeks')
    plt.ylabel('Predicted weekly sales')
    plt.title('Estimated Weekly Sales')
    plt.show()

Most of the things in this routine have already been discussed earlier.

The to_csv function writes the dataframe to a csv file. Once the csv file is created, we upload it to Kaggle; after some processing on the server, the final score is displayed.

inference(best_estimator,test,'walmart_df_lin.csv')

Score

We are within the top 350!

Random Forest Regression

Here, we are going to train a random forest model on our dataset and make predictions.

WHY RANDOM FOREST? The random forest algorithm works well when the dependent variable we are trying to model is the outcome of several decisions, and sales are indeed determined by several variables. The algorithm can’t extrapolate, but we don’t actually want to extrapolate here: the data changes every year and we need to re-fit our model iteratively. With all this in mind, we are going to model the sales using the random forest algorithm.

Encoding the categorical variables

# We are encoding the Type attribute using LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

walmart_df['Type'] = le.fit_transform(walmart_df['Type'])

Previously we used one-hot encoding to transform the categorical variables. Here, we use a label encoder, which simply assigns an integer to each unique value. Since decision trees make a decision at each split, there is no need for one-hot encoding.
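
A tiny example of what the label encoder does (using the LabelEncoder imported above):

# LabelEncoder maps each unique category to an integer.
le_demo = LabelEncoder()
print(le_demo.fit_transform(['A', 'B', 'C', 'A']))   # [0 1 2 0]
print(le_demo.classes_)                              # ['A' 'B' 'C']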

Splitting the dataset into train and test

# Selecting the records for test data
test = walmart_df[walmart_df['Date']>=pd.to_datetime('2012-11-02')]

# Selecting the records for train data
train = walmart_df[walmart_df['Date']<pd.to_datetime('2012-11-02')]
# Making the Difference_Median attribute as the dependent variable
y_train = train['Difference_Median']

# Making the Difference_Median attribute as the dependent variable
y_test = test['Difference_Median']
# Dropping the irrelevant attributes.
train.drop(['Weekly_Sales','Difference_Median','Date'],axis=1,inplace=True)
test.drop(['Weekly_Sales','Difference_Median'],axis=1,inplace=True)

Hyperparameter tuning

# This routine creates a dataframe from the results of the hyperparameter tuning
def visualize(random_object):
    dictionary = random_object.cv_results_
    dictionary.pop('params')
    return pd.DataFrame(dictionary)
# Importing the RandomizedSearchCV from the sklearn module
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 100, num = 4)]
n_estimators.insert(0,200)
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 55, num = 4)]
# Minimum number of samples required to split a node
min_samples_split = [2,5]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'min_samples_split':[2]}
# Printing the random_grid
print(random_grid)

n_estimators : the number of decision trees used to build the forest.

max_depth : the maximum depth of each tree, i.e. the maximum number of decisions along a path.

min_samples_split : the minimum number of samples a node must contain before it can be split.

# Use the random grid to search for best hyperparameters
# First create the base model to tune
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
# Random search of parameters, using 2-fold cross validation,
# trying 6 parameter settings and 3 parallel jobs
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 6, cv = 2, verbose= 2, random_state= 42, n_jobs = 3)
# Fit the random search model
rf_random.fit(train,y_train)
# Calling the visualize routine
visualize(rf_random)
# Printing the best parameters
print("The best parameters are ",rf_random.best_params_)

The best parameters are {'n_estimators': 200, 'min_samples_split': 2, 'max_depth': 55}

# Printing the best estimator.
print('The best estimator ',rf_random.best_estimator_)

The best estimator RandomForestRegressor(max_depth=55, n_estimators=200)

# Printing the best estimator and fitting on the train data
best_estimator = rf_random.best_estimator_
best_estimator.fit(train,y_train)
# Saving the estimator using the dump method from joblib
joblib.dump(best_estimator, "./models/best_rf.joblib")
# Loading the saved estimator from the drive.
best_estimator = joblib.load("./best_rf.joblib")
# This routine calculates the custom metric given in the Kaggle website
def custom_metric(x,y,estimator):
    a=np.ones(shape=(x.shape[0]))
    a=a*(x['IsHoliday']==True)*4+1 #Gives a weight of 5 if holiday and 1 if not
    y_hat = estimator.predict(x)+x['Median_Sales']
    # Plotting our predicted weekly sales against the ground truth.
    plt.plot(y_hat,color='r',label='Predicted')
    plt.plot(y,color='g',label='Truth')
    plt.xlabel('Weeks')
    plt.ylabel('Predicted weekly sales')
    plt.title('Estimated Weekly Sales')
    plt.legend()
    plt.show()
    # Calculating the weighted mean absolute deviation
    diff = np.abs(y-y_hat)
    diff = diff*a
    diff = np.sum(diff)
    normalize = 1/(np.sum(a))
    diff = diff*normalize
    return diff
# Printing the score on the train set
print("The score on the train set ",custom_metric(train,y_train+train['Median_Sales'],best_estimator))

Inference

def inference(estimator,test,filename):
    # Reading the test.csv from the disk.
    t = pd.read_csv('test.csv')
    # We are reordering the test data with the submission file.
    test.sort_values(by=['Date','Store','Dept'],inplace=True,ignore_index=True)
    t['Date'] = pd.to_datetime(t['Date'])
    v = pd.merge(t,test,on=['Store','Dept','Date','IsHoliday'],how='inner')
    v.drop('Date',inplace=True,axis=1)
    v['IsHoliday'] = v['IsHoliday'].astype(bool)
    # Predicting the weekly sales for the test data.
    y_test = estimator.predict(v)
    # Adding the Median_Sales to the dependent variable.
    Y = np.array(y_test)
    Y += np.array(v['Median_Sales'])
    # Building the Id column in the format expected by the competition.
    t['Id'] = t['Store'].astype(str)+'_'+t['Dept'].astype(str)+'_'+t['Date'].astype(str)
    # Adding the predicted weekly sales.
    t['Weekly_Sales'] = Y
    # Dropping the unnecessary columns.
    t.drop(['Store','Dept','Date','IsHoliday'],axis=1,inplace=True)
    # Writing it to a csv file.
    t.to_csv('submission/'+filename,columns=['Id','Weekly_Sales'],index=False)
    # Plotting our predicted weekly sales.
    plt.plot(Y)
    plt.xlabel('Weeks')
    plt.ylabel('Predicted weekly sales')
    plt.title('Estimated Weekly Sales')
    plt.show()
inference(best_estimator,test,'walmart_df_rf.csv')

Score

We are within the top 375!

Xgboost Regression

XGBoost performs gradient descent in function space. The idea is simple and intuitive: first-order optimization techniques move the parameters in the direction opposite to the gradient of the objective function. Since decision trees can’t be expressed as an algebraic function, it is not obvious how to compute gradients for them; XGBoost treats the ensemble of trees as an ordinary function and fits each new tree to the (negative) gradient of the loss with respect to the current predictions, so the usual optimization machinery carries over.
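
In gradient boosting terms, each new tree $h_m$ is fitted to the negative gradient of the loss evaluated at the current prediction, and the ensemble is updated with a learning rate $\eta$:

$$F_m(x) = F_{m-1}(x) + \eta \, h_m(x), \qquad h_m \approx -\left.\frac{\partial L\bigl(y, F(x)\bigr)}{\partial F(x)}\right|_{F = F_{m-1}}$$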

Splitting the dataset into train and test

# Selecting the records for test data
test = walmart_df[walmart_df['Date']>=pd.to_datetime('2012-11-02')]
# Selecting the records for train data
train = walmart_df[walmart_df['Date']<pd.to_datetime('2012-11-02')]
# Xgboost expects it to be of boolean.
train['IsHoliday'] = train['IsHoliday'].astype(bool)
# Dropping the irrelevant attributes.
train.drop(['Weekly_Sales','Difference_Median','Date'],axis=1,inplace=True)
test.drop(['Weekly_Sales','Difference_Median'],axis=1,inplace=True)

Hyperparameter tuning

# Importing the RandomizedSearchCV from the sklearn module
from sklearn.model_selection import RandomizedSearchCV
# Number of boosting rounds (trees)
n_estimators = [40,100,140,180,200]
# Maximum number of levels in tree
max_depth = [None,10,25,55,70]
# Minimum number of samples required to split a node
min_samples_split = [2, 5]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'learning_rate':[0.1,0.01,0.001]}
#Printing the random_grid
print(random_grid)

learning_rate : a constant that scales the contribution of each new tree, i.e. the step size of the updates.

# This routine creates a dataframe from the results of the hyperparameter tuning
def visualize_xgb(random_object):
    dictionary = random_object.cv_results_ #Using the cv_results_ attribute to retrieve the results in a dictionary.
    columns = ['mean_fit_time','std_fit_time','mean_score_time','std_score_time','n_estimators','max_depth','learning_rate','split0_test_score','split1_test_score','mean_test_score','std_test_score','rank_test_score']
    d={}
    negate = [4,5,6] # These columns are filled from the 'params' entries below instead.
    for i in range(0,12):
        if i not in negate:
            d[columns[i]] = dictionary[columns[i]] #We are extracting the results from the cv_results_ dictionary.
    d[columns[4]]=np.array([i['n_estimators'] for i in dictionary['params']])
    d[columns[5]]=np.array([i['max_depth'] for i in dictionary['params']])
    d[columns[6]]=np.array([i['learning_rate'] for i in dictionary['params']])
   
    return pd.DataFrame(d) # Converting it into a dataframe for nicer visualization.
# Use the random grid to search for best hyperparameters
# First create the base model to tune
from xgboost import XGBRegressor
xgb = XGBRegressor()
# Random search of parameters, using 2-fold cross validation,
# trying 8 parameter settings and 3 parallel jobs
xgb_random = RandomizedSearchCV(estimator = xgb, param_distributions = random_grid, n_iter = 8, cv = 2, verbose= 2, n_jobs = 3)
# Fit the random search model
xgb_random.fit(train,y_train)
# Calling the visualize routine
visualize_xgb(xgb_random)
# Printing the best parameters
print("The best parameters are ",xgb_random.best_params_)

The best parameters are {'n_estimators': 100, 'max_depth': None, 'learning_rate': 0.1}

# Printing the best estimator.
print('The best estimator ',xgb_random.best_estimator_)
# Printing the best estimator and fitting on the train data
best_estimator = xgb_random.best_estimator_
best_estimator.fit(train,y_train)
# Printing the score on the train set
print("The score on the train set ",custom_metric(train,y_train+train['Median_Sales'],best_estimator))
# Saving the estimator using the dump method from joblib
joblib.dump(best_estimator, "./models/best_xgb.joblib")
# Loading the saved estimator from the drive.
best_estimator = joblib.load("./models/best_xgb.joblib")

Inference

inference(best_estimator,test,'walmart_df_xgb.csv')

Score

We are within the top 280!

Future Work

Since we don’t have powerful hardware, extensive hyperparameter tuning is difficult, which limits our results. Because this is a time-series problem, there are many state-of-the-art deep learning models for the task, and we could also try statistical algorithms designed specifically for time series. But since our goal here was to model the data with traditional machine learning algorithms, we limited ourselves to them.

Visualizing our Toy Neural Network

In my previous post, we implemented our own neural network from scratch in Python. In this post, we are going to visualize that network and gain some insights about it. As a reminder, a neural network is nothing but a bunch of neurons connected together to identify or approximate a function that maps input to output. Before moving on, please revisit my old post to refresh the concepts.

This is the code used in my previous post to implement the neural network. The main essence of the neural network lies in the for loop. Let’s visualize what it does.

The input is plotted along the x-axis and the output along the y-axis. At first, the neural network guesses the weight and computes a function (y = weight * input). After plotting the function, we observe that the line is not close to the data point (1, 4), so we update the weight in order to fit the data point with the line. The line is nothing but the equation (y = weight * input). If we find the best line, we can feed any arbitrary input value to the network and get the correct output. In our example, the correct weight is 4: if we substitute the input value 2, we get 8 as the output (y = 4 * 2).

Simple Neural Network from Scratch in Python

In my previous post, we briefly discussed the neural network and its inner mechanics. In this post, we are going to code our own neural network in Python.

Python is a programming language used for communicating with the computer. To move forward with this post, please refer to my previous post for a clear understanding.

First, let us see the code and decode it step by step.
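
Here is a minimal sketch reconstructed from the line-by-line walkthrough that follows; treat it as a hypothetical reconstruction rather than the original snippet (an absolute tolerance replaces the exact error == 0 comparison to sidestep floating-point issues):

# Hypothetical reconstruction of the 12-line network described below.
input1 = 1        # step 1: the input
y = 7.8           # step 2: the target output
weight = 1        # step 3: initial guess for the weight
for i in range(100):                  # step 4: iterate
    yh = input1 * weight              # step 5: the network's output
    error = y - yh                    # step 6: difference from the target
    if abs(error) < 1e-6:             # step 7: correct weight found (zero up to float precision)
        print(weight)
        break
    dedw = -input1                    # step 8: derivative of the error w.r.t. the weight
    step = (dedw / abs(dedw)) * 0.1   # step 9: keep only the direction, scaled by the 0.1 learning rate
    weight = weight - step            # step 10: update the weight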

That’s it: we implemented a simple neural network from scratch in 12 lines of code. The code and the algorithm discussed in the previous post are identical, apart from the Python syntax. Let’s talk about the code line by line.

1.) The input is set to 1 (input1 = 1)

2.) The original y value is 7.8 (y = 7.8).Our task is to find the weight which when multiplied by input1 gives y

3.) At the initial stage, weight is initialized to 1 (Some random guess).

4.) for loop runs from 0 till 100.Why? .In our algorithm, we used to find the weights by iteration i.e., we will guess some random number and then compute the error if the error is high then we will update the weights accordingly.This forms a loop.So we used for loop.

5.) yh = input1 * weight . In this , we compute our network output(not original y) and we use it to compare with original y

6.) error = y – yh .We compute the error term between the original output and our network output.

7.) If the error is 0, then we found the correct weight and we print the weight.

8.) Else , we compute the derivative of error with respect to the weight. If you compute ,it comes out to be the input1 term.How?

The negative term of the derivative indicates that, when we increase the weight the error decreases.

9.) The derivative term is divided by itself and multiplied by 0.1.Why?.Derivative only tells us whether to increase(If the derivative is negative) or decrease(If the derivative is positive) the weight , but doesn’t tell how much to increase.So to increase it by step by step, the term is divided and multiplied by 0.1.The 0.1 used here is the learning rate of our network.It allows us to update the weight step by step so that we arrive at the proper weight.Try yourselves adjusting the learning rate and see the results.To understand this , let us say our input is 1 and the y value is 4. Let us say if we update the weight by 5, then it exceeds the original weight (4) and we may not get the correct weight.So, to avoid this we update the weight by a tiny amount.

10.) Finally, the weight is updated using dedw.

To understand it better, try running the code on your own system and add print statements wherever you feel unsure. Every researcher doing deep learning does the same, so why not you?

Happy Coding !!!!

Gradient Descent using Backpropagation

In my first post, we talked about neural networks and their uses. In this post, we are going to talk about how they work and what their advantages are. Before moving on to the discussion, we need to understand a few things. A neural network is nothing but a bunch of neurons interconnected together.

x -> Input neuron (takes in an input)

w -> Weight of the neuron (in the case of y = 2 * x, the weight would be 2)

y -> Output neuron (w * x)

Let's consider a small problem and understand what the gradient descent algorithm means.

We are given x as 1 and y as 4. We need to find the relationship between them, i.e., the weight which, when multiplied by x, gives y. The answer is 4 (y = 4 * 1). It's easy to find the weight here because the function is simple. But if x were 2 and y were 7.9, we could not spot the weight that maps x to y just by inspection, because it isn't as simple. So we need a general algorithm to find the weight for any arbitrary values of x and y. That algorithm for finding the optimal weights is known as gradient descent using backpropagation.

Let us consider x to be 1 and y to be 3. We need to find the weight between x and y. At first we guess some random weight and compute y'. Then we compare it with the original y and find the difference. If the difference is 0, we have found the correct weight; otherwise we need to change the weight. The algorithm is quite easy, but we don't yet know whether to increase or decrease the weight when the difference isn't 0. Let's understand it with a picture.

It is clear from the picture above that we used a trial-and-error method: we guessed some random weight and computed the difference between the actual output and our output. If the error is 0, we have found the correct weight; otherwise we readjust it. But we still need to decide whether to increase or decrease the weight when the error is not zero. How do we decide? To answer that question, let us first plot the error against the weights we used. We will call our friend Mr. Graph.
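The graph itself is not reproduced here; a minimal sketch along these lines (my own illustration, not the post's original figure code) plots the error y - w * x against the weight for the example x = 1, y = 3:

# Plot the error as a function of the weight (x = 1, y = 3).
import numpy as np
from matplotlib import pyplot as plt

x, y = 1, 3
weights = np.linspace(0, 4, 50)
errors = y - weights * x                       # error = y - y', with y' = w * x
plt.plot(weights, errors)
plt.axhline(0, color="gray", linewidth=0.5)    # the error we are aiming for
plt.xlabel("weight (w)")
plt.ylabel("error (y - w * x)")
plt.show()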

From the graph we can conclude that as the weights increase, the error tends to decrease. What does this signify? Yes, the derivative of a function. To restate: the derivative of a function is nothing but its rate of change; it measures how the output changes when the input changes. So, to answer whether to increase or decrease the weight when the error isn't zero, we can use the derivative. How? The derivative tells us whether the function is increasing or decreasing. If the derivative is positive, the function increases as the input increases, and vice versa for a negative derivative. We want the error to be small or zero, so we want the direction in which the error is decreasing. In the graph above, if we compute the derivative it comes out to be -1. This says that if we increase the weight, the error decreases, which answers our question of whether to increase or decrease the weight when the error is not zero. What if the derivative were positive? That would mean that increasing the weight also increases the error, so in that situation we need to decrease the weight to decrease the error. Please spend some time on this paragraph, as it is the key to understanding most concepts in deep learning. Okay, we now know how to find the weight that maps x to y. Let us formalize the steps as an algorithm (gradient descent using backpropagation) so it will be easier to follow.

Gradient Descent algorithm using backpropagation:

1.) Initialise with some random weight

2.) Compute y' (our network output)

3.) Find the error between the actual output and our output: E = (y - y')

4.) Find the derivative of the error with respect to the weight

5.) Update the weight as: w = w - ((dE/dw) / |dE/dw|)

6.) Then go to step 2 until the error is zero.

Why does step 5 look so weird? Let's talk about it. In the example with x = 1 and y = 4, if the weight were 1, the error E = y - y' would be 3, and the derivative of the error with respect to the weight works out to be -x, which is -1 here (try it yourself). The negative sign of the derivative tells us that the error will decrease if we increase the weight. So if step 5 were w = w + ((dE/dw)/|dE/dw|), we would be adding a negative number and decreasing the weight; we don't want that to happen, so we flip the sign. Why are we dividing by |dE/dw|? Because the derivative only gives us a direction, not a sensible step size: adding its raw value directly can make the update too large and jump straight past the true weight (for instance, a jump of 3 taken at w = 3 would overshoot the true weight of 4 and land at 6). Dividing the derivative by its own magnitude keeps only its sign, so the weight is incremented or decremented by exactly 1 (or by a small fraction of that once a learning rate is added) and we step towards the optimal weight. That's all the algorithm is. Why the name backpropagation? Because in the algorithm we first compute y' using some random weight, compare it with the actual output, and then go back and update the weight to minimise the error. Since we move backwards to update the weight, it is called backpropagation. In the next post, we will implement this algorithm from scratch in Python and write our own neural network!

Continuation of Basics of Calculus

In my previous post, we were discussing how to find the derivative of a linear function. The derivative of the function y = 2 * x was 2, and it remained constant no matter which values of x and y we chose to substitute into the formula for dy/dx. But there are functions whose derivative does not stay constant; it depends on the point at which you evaluate it. To understand this statement, we need a little trigonometry. Let us work through it with pictures.

In the picture above, a right triangle with a hypotenuse of length 1 is considered. By definition, tan(theta) is the ratio of the opposite side to the adjacent side (opposite / adjacent). When we compute tan(theta), it comes out to be the ratio of y to x, or (y2 - y1)/(x2 - x1). If we recall our previous post, that is nothing but the derivative of a function. Tan(theta) is also called the slope of the line, determined by the angle the line makes with the x-axis. If theta is increased, the line tilts towards the y-axis and hence the derivative of the function increases (why?). There is a lot to digest here, so pause for a while and ponder over it. So far we have discussed finding the derivative of a linear function. Let's take a step further and talk about non-linear functions. Before moving on, what is a non-linear function? It is a function whose points, when plotted and joined, do not form a straight line (they may form a curve). Let's call our friend Mr. Graph.

y = x * x

The function above is not linear, because when its points are joined they form a curve. Try finding its derivative yourself using the earlier method; you will not get the correct answer. Why? Let's understand it visually.

If you look at the image above, I used the same method for calculating the derivative as in the previous example. If you watch closely, you can see that the hypotenuse of the triangle lies quite far from the curve. What does this tell us? If the function were linear, the hypotenuse of the triangle would be nothing but the function's own line. To make this simpler, consider a small example.

X: 0 1 2 3
Y: 0 1 4 9   (Y = X * X -> non-linear function)

X: 0 1 2 3
Y: 0 2 4 6   (Y = 2 * X -> linear function)

In the linear function, if I change x from 1 to 2, the y value correspondingly changes from 2 to 4, so the ratio is (4 - 2)/(2 - 1), which is 2. Whichever pair of x and y values you pick, you get the same ratio. But in the first function (Y = X * X), if I change x from 1 to 2 the y value changes from 1 to 4, and if I change x from 2 to 3 the y value changes from 4 to 9. If you compute these ratios, you will find that they are not the same. So the derivative of this function does not stay constant, and we need a new strategy to find it. Before moving on to that, let us summarise what we have so far: the derivative of a function is nothing but the ratio of the change in output to the change in input, i.e., a measure of how much the output changes when the input changes. That is all a derivative is.
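A tiny sketch (my own illustration, not from the original post) that computes the ratios described above and shows they stay constant for Y = 2 * X but not for Y = X * X:

# Compare the change-in-output / change-in-input ratios of the two functions.
xs = [0, 1, 2, 3]

def ratios(f):
    # (f(x2) - f(x1)) / (x2 - x1) for consecutive points
    return [(f(x2) - f(x1)) / (x2 - x1) for x1, x2 in zip(xs, xs[1:])]

print("Y = 2 * X :", ratios(lambda x: 2 * x))   # [2.0, 2.0, 2.0] -> constant
print("Y = X * X :", ratios(lambda x: x * x))   # [1.0, 3.0, 5.0] -> not constant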

To gain more intuition about the derivative of a function, let us now consider a real-world situation where it is used. Let us plot the time a car has been travelling on the x-axis and the distance it has covered on the y-axis. Let's call our friend Mr. Graph.

Y = 2 * X (Y -> distance, X -> time)

Average speed is calculated by dividing the total distance travelled by the time taken, and that comes out to be the derivative of this function. So the derivative of a function tells us its rate of change, i.e., how much the output changes when the input changes. In the example above, the speed stays constant at 2 and never changes. But that will not always be the case: we can't drive a car at a constant speed all the time. We vary our speed at some points and hold it steady at others. In that case we cannot simply use the average speed, because the speed is different at different points, and hence the derivative will not stay constant. We noticed the same thing earlier in this post with the non-linear function: the ratios were different in each case. To overcome this, we are going to find the speed (rate of change) at a single point rather than between two points. Since the speed changes at each point, what if we found a formula for the speed at each point rather than between any two points? Sounds good? Let us look at the graph of y = x * x again.

In the picture above, when we change x from 1 to 2, the y value changes from 1 to 4. But we also observed that the hypotenuse does not fall on the curve; it sits some distance away from it. That is because the speed is not constant: it changes at every point, so we can't draw one straight line that connects those points along the curve. We already know how to compute the derivative of a linear function, so if we could make this function behave linearly, our life would be easier. How? It's easy if we watch the triangle closely: if we make dx smaller, the line moves towards the curve and approximates it better. Once the line effectively coincides with the curve, it looks like a straight line (a linear function) at that point, and we can easily find its derivative. So how small should dx be to approximate the curve? What if dx were 0? It can't be. Why? Because the rate of change is defined as dy/dx, and if dx is 0 then dy/dx is undefined. So we need a number that is as close to 0 as we like but not zero. How do we express that statement mathematically? Limits.

Does it make sense? If we make dx as close to zero as possible, the hypotenuse approximates the curve better and better, and the function is effectively linear at that point. So we know how to find the derivative in that case: f(x0 + dx) - f(x0) = y2 - y1. Why? Let's find it in the graph.

I think the image is clear enough. If we make dx closer to 0, the tangent line comes closer to the curve, which is effectively linear at that point. So, to calculate the derivative at that point, we substitute x0, x0 + dx, f(x0) and f(x0 + dx) into the limit equation above.
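The limit equation referred to above is not reproduced here; written out for y = x * x (my reconstruction), the substitution goes like this:

\frac{dy}{dx}
  = \lim_{dx \to 0} \frac{f(x_0 + dx) - f(x_0)}{dx}
  = \lim_{dx \to 0} \frac{(x_0 + dx)^2 - x_0^2}{dx}
  = \lim_{dx \to 0} \frac{2 x_0 \, dx + dx^2}{dx}
  = \lim_{dx \to 0} (2 x_0 + dx)
  = 2 x_0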

Wow, we found the derivative of the function at a point x. Comparing this with the linear equation, we find that the derivative of this function (y = x^2) depends on x; it is not constant, unlike in y = m * x (where m might be 0, 1, 2, ...). As an exercise: the rate of change at the point 2 comes out to be 2 * 2 = 4 (f'(x) = 2x), and at the point 3 it is 6 (f'(x) = 2x). So the speed changes from point 2 to point 3; it is not constant, and thus the function is non-linear. With these details in hand, we can talk about the gradient descent algorithm using backpropagation in the next post.

Basics Of Calculus

Before getting started, we need to revise a few basic statements about neural networks. A neural network, when presented with a set of input and output pairs, learns the function which maps input to output. The learned function may or may not be a good approximation of the true underlying function. Sounds weird, right? Let's decompose it so we can build some intuition. In my previous blog post, we discussed the problem of finding a missing value. To solve it, we found the pattern between the data points, i.e., Y = 1 * X, and used this pattern to predict the output Y to be 6 given X is 6. The task was easy because the relationship was easy to spot. But what if the relationship were complex? That is where neural networks come in. The idea behind solving the problem is to find the correct weight which maps X to Y; in the problem above, the weight is 1. So, if there were some algorithm that could find the weights so that X is correctly mapped to Y, we could solve the problem. The algorithm we are going to discuss here is gradient descent using backpropagation. Before moving on to the algorithm, we need to understand a few topics in calculus. Don't worry, I will present them with plenty of intuition.

Basics Of Calculus

In calculus, we are going to discuss the topic of differentiation. What does the derivative of a function tell us? How are we going to apply it in gradient descent using backpropagation? To answer these questions, we first need to understand what the derivative of a function represents.

Let us consider a simple function, y = 2 * x. For every value of x, y is defined as 2 times x. For example, if x were 2, then y would be two times x, which is 4. The derivative of the function tells us how the output y will change if we increase x by a small amount. To understand this statement, we can use our friend Mr. Graph.

Let us plot the above function on a graph.

The function above is also linear, since we get a straight line after connecting the points. So, if we differentiate the function y = 2 * x, we get 2 as the output. Why? Let's find out. In the graph above, the value of y when x is 1 is 2 (y = 2 * x), and the value of y when x is 2 is 4 (y = 2 * x). The derivative of a function represents how much the output changes with respect to the input. Let's understand this visually; consider the graph given below.

We can observe from the graph that for every unit (1) increase in x, the value of y increases by 2. Also, the derivative of the function is nothing but the slope of the line (the blue line). You can plug in different values of x and y and verify this yourself. For practice, try finding the derivative of the function y = 3 * x in the same way. So how does this concept help us find the optimal weights for our network? If you look closely at the value we get after calculating the derivative, it is nothing but the true weight (2) which maps x to y (y = 2 * x). With this detail, we will drive towards gradient descent using backpropagation.
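As a small check (my own sketch, not from the original post), you can compute the slope numerically for y = 2 * x and for the suggested exercise y = 3 * x:

# Numerically check the slope (derivative) of a linear function.
def slope(f, x1, x2):
    # change in output divided by change in input
    return (f(x2) - f(x1)) / (x2 - x1)

print(slope(lambda x: 2 * x, 1, 2))   # 2.0 -> derivative of y = 2 * x
print(slope(lambda x: 3 * x, 1, 2))   # 3.0 -> derivative of y = 3 * x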