Use historical data to predict store sales
Every observation starts with a question
What is forecasting? Forecasting is the process of making predictions about the future based on past and present data, most commonly by analysing trends. In this blog post, we are going to predict the future sales of Walmart stores using traditional machine learning algorithms.
Let's look at the problem we are trying to solve
You are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department-wide sales for each store. In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modelling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
What is the dataset schema?

Let's discuss the evaluation metric

Our predicted sales should be as close as possible to the actual sales during the holidays, because the sales made in holiday weeks are much higher than on other days.
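Concretely, the competition uses the weighted mean absolute error (WMAE), where holiday weeks get a weight of 5 and all other weeks a weight of 1. Here is a minimal sketch of that computation (the function and variable names are my own, purely for illustration):
# Minimal sketch of the weighted mean absolute error (WMAE) used by the competition.
# y_true and y_pred are arrays of actual and predicted weekly sales,
# is_holiday is a boolean array marking holiday weeks.
import numpy as np
def wmae(y_true, y_pred, is_holiday):
    weights = np.where(is_holiday, 5, 1)  # holiday weeks get weight 5, all others weight 1
    return np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)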
What are the business objectives and constraints?
1.) Stationarity may not be well preserved.
2.) The evaluation metric used here is the weighted mean absolute error (WMAE) of the predicted values from the ground-truth values.
3.) For holidays, the predictions are given more weight, so we need to focus relatively more on observations recorded during holiday weeks.
4.) If we model the observations correctly, this could help Walmart decide on its future budget, materials, etc. This could increase its revenue substantially.
Some useful links
We need to install some libraries in Python to get started
# pip is a package manager for installing Python packages
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install scikit-learn
!pip install xgboost
!pip install seaborn
Let's import the libraries we have installed
# Importing the necessary libraries..
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
from collections import Counter
import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import seaborn as sns
Let's download the dataset from the Kaggle website.
The dataset is hosted on Kaggle. There are many ways to download it; I will describe one method that is easy and less messy.
Steps to download the data from Kaggle:
1.) Log in to your Kaggle account.
2.) Install the CurlWget extension in the Google Chrome browser.
3.) Search for the data in the Kaggle search bar.
4.) Join the competition and click the Download All button.
5.) Then cancel the download.
6.) Open the CurlWget extension in the top-right corner of your browser.
7.) Copy the generated link.
Disclaimer: The above method may violate Kaggle's terms of service.
We will use the wget command to download the data using the link we copied earlier.
#wget command is used to download files
!wget --header="Host: storage.googleapis.com" --header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 Edg/85.0.564.70" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" --header="Accept-Language: en-US,en;q=0.9" --header="Referer: https://www.kaggle.com/" "https://storage.googleapis.com/kaggle-competitions-data/kaggle-v2/3816/32105/bundle/archive.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1602507515&Signature=J%2B9Esap36xIIDR0M0dqTLteFcX1zQf5IpmV0We%2Bk0ygL0aAYERuo8kcL%2F5ZSct%2BDNkXWi3M5B21qV2vv54Sjex1NXAH7vP4FAxdPazbR32A02Rez0fKwLcdvXkjY8TxBxCOfifHt6ZwbpkS0e8gszEYdA5HdJbSWHvpYM21xnicAqCoWxz3t%2Bj87ivr4CJF7gjmrBlf%2B67ciNFi0eEMx%2BQek5XOjXN1Hai5BrKeSIZSN83Pvl%2FBxwglarT0WyDIefBm6ekm9nDLH7KaZ5TjAqV%2BwQ0cEiI%2Bqz1Lv5AjAdJaS02dnu736L8ILgjZGtf7jllBjQ3iJKWM53X4C40R%2FlA%3D%3D&response-content-disposition=attachment%3B+filename%3Dwalmart-recruiting-store-sales-forecasting.zip" -c -O 'walmart-recruiting-store-sales-forecasting.zip'
The downloaded files are in a compressed format. We need to extract them.
# Unzipping the files downloaded.
!unzip 'walmart-recruiting-store-sales-forecasting.zip'
After we execute the above command, the following files appear:
1.) stores.csv
2.) test.csv.zip
3.) train.csv.zip
4.) features.csv.zip
We need to extract the compressed files again.
# Unzipping the files downloaded.
!unzip 'test.csv.zip'
!unzip 'train.csv.zip'
!unzip 'features.csv.zip'
We are given past observations about the Walmart stores, and our task is to predict the weekly sales using these observations. The observations are indexed by the time at which they were recorded, so this is a time-series problem. There are many statistical algorithms for modelling time-series data, but our goal here is to model the same observations using traditional machine learning algorithms.
Exploratory Data Analysis + Feature Engineering
First, we need to load the data from disk. Then we will visualize it in tabular format.
# Reading the csv files from disk using the read_csv method from the pandas library.
# It converts each file into a pandas DataFrame object.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
features = pd.read_csv('features.csv')
stores = pd.read_csv('stores.csv')
We are reading the .csv files using the read_csv function from the pandas library.
# Using the head method, we display the dataframe
train.head(5)

The train dataframe contains ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday'] as its attributes.
For each store and department, the Weekly_Sales attribute contains the total sales for a week.
Each successive date is 7 days after the previous one, since the data is weekly.
#Using the head method, we display the dataframe
features.head(5)

This dataframe contains some additional information for each store and department in the train dataframe.
The details of these attributes are given at the beginning of this post.
# Using the head method , we display the dataframe
stores.head(5)

For each store, we have an attribute named Type associated with it.
For each store, we have an attribute named Size associated with it.
Since the attributes contained in the features and stores dataframe are related to the attributes in the train dataframe, we will try to merge these dataframes together.
Whatever feature we add to the training data should also be reflected in the test data, so we concatenate the train and test data to make our job easier.
# The merge operation is supported by the merge method.
# We are concatenating the train and the test dataframes: if we add any additional feature to the
# training data, it will simultaneously be added to the test data as well.
train_df = pd.concat([train,test],axis=0,ignore_index=True)
# We are merging the combined dataframe with the features dataframe where they intersect (inner join).
data = pd.merge(train_df,features,how='inner',on=['Store','Date','IsHoliday'])
# We are merging the above resultant dataframe with the stores dataframe where they intersect each other.
walmart_df = pd.merge(data,stores,how='inner',on=['Store'])
# Printing the dataframe.
# Now every attribute about the store and department is in a single dataframe,
# which makes it easier to analyze.
walmart_df.head(5)

#Printing the number of records. The len function returns the number of records.
print("Number of records ",len(walmart_df))
Number of records 536634
Here, the number of records includes both training and the test data.
# Printing the number of records.
print("Number of records present in the training data ",len(train))
print("Number of records present in the test data ",len(test))
Number of records present in the training data 421570
Number of records present in the test data 115064
Let’s see how many records we have for each store
# For each store, we are counting the number of records.
# Creating a Counter object on the Store attribute.
features_store = Counter(walmart_df['Store'].values)
# Matplotlib style (xkcd/comic style)
plt.xkcd()
# Storing the store numbers in x
x = list(features_store.keys())
# Storing the record count for each store in y
y = list(features_store.values())
# Plotting the barplot with x and y
plt.bar(x,y,label='#_records',color='r')
# For displaying purpose
plt.legend(prop={'size':15})
plt.xlabel('Store Number')
plt.ylabel('# Records')
plt.title('Store number vs Number of records')
# Displaying the plot
plt.tight_layout()
plt.show()

- We have at least 7500 records for each store.
- The observations were taken from 2010-02-05 to 2013-07-26.
Now, we will look at some simple statistics about our data.
#The describe method returns a dataframe containing some common statistical properties of the data.
walmart_df.describe()

- For the Weekly_Sales attribute, we have a total of 421570 valid values.
- The Weekly_Sales values are not present in the test data; they are what we need to predict. Since we concatenated the train and test data, these values are marked as NaN (missing).
- We observe some missing values in the CPI and Unemployment attributes.
Most machine learning algorithms can't deal with missing values; we either need to fill them with some statistic or remove them. We will fill the missing values of an attribute with the mean of that attribute, computed per store and department where possible.
# A boolean array denoting the presence of nan values
missing_cpi = walmart_df['CPI'].isna()
missing_unemp = walmart_df['Unemployment'].isna()
mean_c = walmart_df['CPI'].mean()
mean_u = walmart_df['Unemployment'].mean()
# Calculating the average CPI and Unemployment based on grouping the Store,Department.
mean_CPI = walmart_df.groupby(['Store','Dept'])['CPI'].mean()
mean_Unemployment = walmart_df.groupby(['Store','Dept'])['Unemployment'].mean()
# Iterating through the records
for i in range(len(missing_cpi)):
    # If the value is missing in the CPI attribute
    if missing_cpi[i]:
        # Building the (Store, Dept) key to retrieve the group average
        index = (walmart_df['Store'][i],walmart_df['Dept'][i])
        # Checking whether the key is present or not
        try:
            mean_cpi = mean_CPI.get(index)
        # If not, fall back to the overall mean
        except KeyError:
            mean_cpi = mean_c
        walmart_df['CPI'][i] = mean_cpi
    # If the value is missing in the Unemployment attribute
    if missing_unemp[i]:
        # Building the (Store, Dept) key to retrieve the group average
        index = (walmart_df['Store'][i],walmart_df['Dept'][i])
        # Checking whether the key is present or not
        try:
            mean_unemp = mean_Unemployment.get(index)
        # If the key is not present, fall back to the overall mean
        except KeyError:
            mean_unemp = mean_u
        walmart_df['Unemployment'][i] = mean_unemp
# For some store and department combinations the group-level CPI and Unemployment means could not be
# computed, so any remaining missing values are filled with the overall mean.
walmart_df['CPI'] = walmart_df['CPI'].fillna(mean_c)
walmart_df['Unemployment'] = walmart_df['Unemployment'].fillna(mean_u)
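As an aside, the same group-wise mean imputation can be written more compactly with pandas' groupby and transform. This is only an alternative sketch of the idea, not the code used in the rest of the post:
# Alternative sketch: vectorized mean imputation per (Store, Dept) group,
# falling back to the overall column mean where a group has no observed values.
for col in ['CPI', 'Unemployment']:
    overall_mean = walmart_df[col].mean()  # NaN values are ignored by mean()
    group_mean = walmart_df.groupby(['Store', 'Dept'])[col].transform('mean')
    walmart_df[col] = walmart_df[col].fillna(group_mean).fillna(overall_mean)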
There are also missing values in the MarkDown attributes. There are many ways to fill them, but since these attributes are described as anonymized in the competition, it is simplest to fill them with zero. A value of zero then indicates the absence of a markdown for that record.
# Filling the missing values using fillna method.
for i in range(1,6):
walmart_df['MarkDown{0}'.format(i)]=walmart_df['MarkDown{0}'.format(i)].fillna(0)
We observe some negative values in the Weekly_Sales attribute. These values may have been recorded incorrectly. Since sales can't be negative, we either need to remove these records or reverse the sign. We don't want to lose data, so we reverse the sign of any negative value.
# Calculating the number of records in the combined dataframe
num_records = len(walmart_df)
# Calculating the number of records with negative values in the Weekly_Sales attribute.
negative_records = len(walmart_df[walmart_df['Weekly_Sales']<0])
print("Percentage of records with negative values in the Weekly_Sales attribute ",(negative_records/num_records)*100)
# Filling the missing Weekly_Sales values (the test rows) with 0. These values are never used for training.
walmart_df['Weekly_Sales'] = walmart_df['Weekly_Sales'].fillna(0)
# Reversing the sign of any negative value in the Weekly_Sales attribute.
walmart_df['Weekly_Sales'] = walmart_df['Weekly_Sales'].map(lambda x:x if x>=0 else -x)
Percentage of records with negative values in the Weekly_Sales attribute 0.2394555693452148
That's enough data cleaning for now. Let's focus on data visualization.
# We will sort our dataframe based on the date. This is achieved using the sort_values method.
walmart_df.sort_values(by=['Date','Store','Dept'],inplace=True,ignore_index=True)
We sorted our records based on the date, store and the department.
We will now visualize the values of the Weekly_Sales attribute using a box plot.
# Plotting the boxplot for the Weekly_Sales attribute
f, ax = plt.subplots(figsize=(9, 6))
plt.title('Boxplot of Weekly_Sales')
fig = sns.boxplot(y='Weekly_Sales', data=walmart_df)

Let us also look at the histogram.
# Plotting the histogram of the Weekly_Sales attribute
plt.hist(walmart_df['Weekly_Sales'],bins=10)
plt.xlabel('Weekly_Sales')
plt.ylabel('Frequency')
plt.title('Histogram of Weekly_Sales')
plt.show()

- There are many values greater than Q3 + 1.5 × IQR. These could be outliers.
- Since the notion of an outlier is subjective, we need to investigate further.
Let's look at how the other attributes behave in the weeks with higher sales.
# Selecting those records which have higher sales.
slices = walmart_df[walmart_df['Weekly_Sales']>20206]
# Plotting Fuel_Price against Weekly_Sales.
plt.plot(slices['Fuel_Price'],slices['Weekly_Sales'])
plt.title('Fuel_Price vs Weekly_Sales')
plt.xlabel('Fuel_Price')
plt.ylabel('Weekly_Sales')
plt.show()

- There is no clear trend in the plot.
- The sales are higher when the fuel price is close to its mean value.
# Plotting Temperature against Weekly_Sales.
plt.xkcd()
plt.plot(slices['Temperature'],slices['Weekly_Sales'])
plt.title('Temperature vs Weekly_Sales')
plt.xlabel('Temperature')
plt.ylabel('Weekly_Sales')
plt.show()

- The sales are higher when the temperature is around its mean (moderate) value.
- This makes intuitive sense: people tend to go shopping when the temperature is moderate.
# Plotting CPI against Weekly_Sales.
plt.plot(slices['CPI'],slices['Weekly_Sales'])
plt.title('CPI vs Weekly_Sales')
plt.xlabel('CPI')
plt.ylabel('Weekly_Sales')
plt.show()


Given the definition of CPI, sales tend to be higher when the CPI is low.
# Plotting Unemployment against Weekly_Sales.
plt.plot(slices['Unemployment'],slices['Weekly_Sales'])
plt.title('Unemployment vs Weekly_Sales')
plt.xlabel('Unemployment')
plt.ylabel('Weekly_Sales')
plt.show()

- There are more sales when the unemployment rate is low.
- Still, the unemployment rate does not influence the sales much.
# Plotting Size against Weekly_Sales.
plt.plot(slices['Size'],slices['Weekly_Sales'])
plt.title('Size vs Weekly_Sales')
plt.xlabel('Size')
plt.ylabel('Weekly_Sales')
plt.show()

- There is no clear linear relationship between sales and size, because the sales also depend on the date.
- Typically one would expect a linear relationship.
Now we will look at the dates where the sales are much higher
#Selecting those records where the sales are much higher
slices = walmart_df[walmart_df['Weekly_Sales']>250000]
print(np.unique(slices['Date'])) # It returns the unique values.
[ ‘2010-02-05’ ‘2010-11-26’ ‘2010-12-17’ ‘2010-12-24’ ‘2011-11-25’ ‘2011-12-23’ ]

The above information is provided on the competition website.
These dates exactly match the holiday dates given on the website, where sales are much higher. They are crucial for our modelling.
These dates largely determine the range of values the Weekly_Sales attribute takes.
To add this information, we define a flag variable to denote the presence of these dates.
# Converting a list of date strings to pandas datetime objects
def date(dates):
    return pd.to_datetime(dates)
# Converting the Date attribute in the dataframe to pandas datetime objects
walmart_df['Date'] = pd.to_datetime(walmart_df['Date'])
# For each holiday we define an indicator variable: it is set to 1 if the date is one of the holiday dates and 0 otherwise.
# np.where sets the second argument as the value where the condition holds and the third argument where it doesn't.
Super_bowl = ['2010-02-05','2010-02-12','2011-02-11','2012-02-10','2013-02-08']
Labor_Day = ['2010-09-10','2011-09-09','2012-09-07','2013-09-06']
Thanksgiving = ['2010-11-26','2011-11-23','2011-11-25','2012-11-23','2013-11-29']
Christmas = ['2010-12-24','2010-12-31','2011-12-30','2012-12-28','2013-12-27']
walmart_df['Super_Bowl'] = np.where(walmart_df['Date'].isin(date(Super_bowl)),1,0)
walmart_df['Labor_Day'] = np.where(walmart_df['Date'].isin(date(Labor_Day)),1,0)
walmart_df['Thanksgiving'] = np.where(walmart_df['Date'].isin(date(Thanksgiving)),1,0)
walmart_df['Christmas'] = np.where(walmart_df['Date'].isin(date(Christmas)),1,0)
Using the np.where function, we define an indicator variable for each of these important days. These features are very useful: if a flag is off, we don't expect unusually high sales for that week.
The next variable related to the sales is temperature. People prefer moderate temperatures for shopping. Let's observe the sales when the temperature is increasing or decreasing.
# This routine calculates the length of the longest subarray with increasing values.
def maximum_length_increasing_subarray(array):
    # Initializing the length variable with 1 (assumption: the array contains at least one element).
    length = 1
    # Initializing the starting position (if the array contains a single element, then start = 0).
    start = 0
    # m keeps track of the length of the current run; 1 means at least one element has been seen.
    m = 1
    # Scanning the array from right to left
    for i in range(len(array)-2,-1,-1):
        # The condition for an increasing run
        if array[i]<array[i+1]:
            m = m + 1
            # Updating the maximum length so far
            if length<m:
                length=m
                start=i
        else:
            m=1
    # Returning the length and starting position
    return length, start
The above routine returns the length and starting position of the longest strictly increasing subarray. In our case, the array holds the temperature values; we will look at the corresponding sales.
# Calling the method maximum_length_increasing_subarray
length, start=maximum_length_increasing_subarray(walmart_df['Temperature'])
# Printing the starting position and the length of the longest increasing subarray
print("Starting position of the maximum increasing subarray ",start)
print("Length of the maximum increasing subarray ",length)
Starting position of the maximum increasing subarray 536515
Length of the maximum increasing subarray 2
# 2 is the length of the longest increasing subarray.
x = np.arange(1,length+1,1)
#Getting the temperature values
temp = walmart_df['Temperature'][start:start+length]
#Getting the sales values
sales = walmart_df['Weekly_Sales'][start:start+length]
#Plotting the temperature and the weekly sales simultaneously.
plt.plot(x,temp,label='temperature')
plt.plot(x,sales,label='Weekly Sales')
plt.xlabel('Units')
plt.title('Increasing temperature vs Weekly Sales')
plt.legend()
plt.plot()
plt.show()

What happens when the temperature is decreasing ?
# This routine calculates the length of the longest subarray with decreasing values.
def maximum_length_decreasing_subarray(array):
    # Initializing the length variable with 1 (assumption: the array contains at least one element).
    length = 1
    # Initializing the starting position (if the array contains a single element, then start = 0).
    start = 0
    # m keeps track of the length of the current run; 1 means at least one element has been seen.
    m = 1
    # Scanning the array from right to left
    for i in range(len(array)-2,-1,-1):
        # The condition for a decreasing run
        if array[i]>array[i+1]:
            m = m + 1
            # Updating the maximum length so far
            if length<m:
                length=m
                start=i
        else:
            m=1
    # Returning the length and starting position
    return length, start
# Calling the method maximum_length_decreasing_subarray
length, start = maximum_length_decreasing_subarray(walmart_df['Temperature'])
# Printing the starting position and the length of the longest decreasing subarray.
print("Starting position of the maximum decreasing subarray ",start)
print("Length of the maximum decreasing subarray ",length)
Starting position of the maximum decreasing subarray 536567
Length of the maximum decreasing subarray 2
# 2 is the length of the longest decreasing subarray.
x = np.arange(1,length+1,1)
# Getting the temperature values
temp = walmart_df['Temperature'][start:start+length]
# Getting the sales values
sales = walmart_df['Weekly_Sales'][start:start+length]
# Plotting the temperature and the weekly sales simultaneously.
plt.plot(x,temp,label='temperature')
plt.plot(x,sales,label='Weekly Sales')
plt.xlabel('Units')
plt.title('Decreasing temperature vs Weekly Sales')
plt.legend()
plt.plot()
plt.show()

We will render a few more box plots to deepen our understanding of the sales.

The sales are higher during holidays.
The two distributions look similar apart from the maximum value.
The median sale is slightly higher during holidays.
# Plotting the boxplot of Weekly_Sales for each store Type
walmart_sale = pd.concat([walmart_df['Type'], walmart_df['Weekly_Sales']], axis=1)
f, ax = plt.subplots(figsize=(9, 6))
fig = sns.boxplot(x='Type', y='Weekly_Sales', data=walmart_sale, showfliers=False)

The weekly sales are very low for Type C relative to the other types.
This feature discriminates quite well compared to the other features.
It indicates the range of values the weekly sales take.
Type A stores have higher sales than any other type.
Type A stores are also the largest, so size appears to affect the sales.
Since the Type attribute discriminates the range of values the weekly sales take, we will plot pie charts describing the number and size of each type of store.
# For each store type, we are counting the number of stores.
stores_count = stores.groupby('Type')['Store'].count()
# Using a pie chart, we are displaying the relative percentage of stores of each type.
plt.style.use("fivethirtyeight")
slices=stores_count.values
colors=['orange','white','green']
labels=['A','B','C']
plt.pie(slices,labels=labels,wedgeprops={'edgecolor':'black'},shadow=True,startangle=90,autopct='%1.1f%%')
plt.title("Number of Stores of Type '*'")
plt.tight_layout()
plt.show()

Type A stores are the most numerous.
The next charts check whether the Size attribute also discriminates the range of values the sales take.
# For each store type, we are extracting the maximum size value.
stores_count = stores.groupby('Type')['Size'].max()
plt.style.use("fivethirtyeight")
slices=stores_count.values
colors=['orange','white','green']
labels=['A','B','C']
plt.pie(slices,labels=labels,wedgeprops={'edgecolor':'black'},shadow=True,startangle=90,autopct='%1.1f%%')
plt.title("Maximum size of store type '*'")
plt.tight_layout()
plt.show()

The maximum size of Type C stores is very low compared to the other types.
The sales in Type C stores may therefore be lower than in the other types.
The sales in Type A stores may be relatively high.
# For each store type, we are extracting the minimum size value.
stores_count = stores.groupby('Type')['Size'].min()
plt.style.use("fivethirtyeight")
slices=stores_count.values
colors=['orange','white','green']
labels=['A','B','C']
plt.pie(slices,labels=labels,wedgeprops={'edgecolor':'black'},shadow=True,startangle=90,autopct='%1.1f%%')
plt.title("Minimum size of store type '*'")
plt.tight_layout()
plt.show()

Here, Types A and C look similar.
Since we are dealing with time-series data, future observations depend on past observations. To incorporate this, we add past sales as attributes: the sales of each of the previous three weeks become additional features.
# Adding an additional feature which holds the previous weeks sales
one_week_before = list(walmart_df['Weekly_Sales'])[:-1]
one_week_before.insert(0,walmart_df['Weekly_Sales'][0])
walmart_df['delta_1'] = one_week_before
# Adding an additional feature which holds the previous 2 weeks sales
two_week_before = list(walmart_df['Weekly_Sales'])[:-2]
two_week_before.insert(0,walmart_df['Weekly_Sales'][1])
two_week_before.insert(0,walmart_df['Weekly_Sales'][0])
walmart_df['delta_2'] = two_week_before
# Adding an additional feature which holds the previous 3 weeks sales
three_week_before = list(walmart_df['Weekly_Sales'])[:-3]
three_week_before.insert(0,walmart_df['Weekly_Sales'][2])
three_week_before.insert(0,walmart_df['Weekly_Sales'][1])
three_week_before.insert(0,walmart_df['Weekly_Sales'][0])
walmart_df['delta_3'] = three_week_before
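The same three lag features can also be built with pandas' shift method. A sketch of that alternative (it pads the first rows, which have no history, with the earliest observed sales value, roughly mirroring the inserts above):
# Alternative sketch: lag features via shift(); rows without enough history are
# filled with the earliest observed sales value.
for lag in (1, 2, 3):
    walmart_df['delta_{0}'.format(lag)] = (
        walmart_df['Weekly_Sales'].shift(lag).fillna(walmart_df['Weekly_Sales'][0])
    )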
The attribute Date is not atomic. It can be further subdivided. We will extract the month, year and day from the Date attribute.
# Using the map function we are transforming each date to year, month and day respectively.
walmart_df['Year'] = walmart_df['Date'].map(lambda x:x.year)
walmart_df['Month'] = walmart_df['Date'].map(lambda x:x.month)
walmart_df['Day'] = walmart_df['Date'].map(lambda x:x.day)
The values of the Date attribute are pandas datetime objects, which expose several methods and properties. Using these properties, we extracted the year, month and day.
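Equivalently, the vectorized .dt accessor on the whole column gives the same result; a small alternative sketch:
# Alternative sketch: extracting the date parts with the vectorized .dt accessor.
walmart_df['Year'] = walmart_df['Date'].dt.year
walmart_df['Month'] = walmart_df['Date'].dt.month
walmart_df['Day'] = walmart_df['Date'].dt.day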
Now, we will add the median sales as an additional feature. The median varies for each store and department, so this feature discriminates the sales among stores and departments. We compute the median by grouping on the Store, Dept, Type, IsHoliday and Month attributes.
# Calculating the median by groupby operation.
median_df = pd.DataFrame({'Median_Sales':walmart_df.groupby(['Store','Dept','Type','IsHoliday','Month'])['Weekly_Sales'].median()})
# Merging the two dataframes.
walmart_df = pd.merge(walmart_df,median_df,on=['Store','Dept','Type','IsHoliday','Month'],how = 'inner')
We will add another attribute that holds the difference between the weekly sales and the median sales. This attribute captures the direction of the sales relative to their typical level.
# We are taking the difference between the median sales and weekly sales
walmart_df['Difference_Median'] = walmart_df['Weekly_Sales'] - walmart_df['Median_Sales']
- On the competition website, it is stated that MarkDown data are only available after November 2011.
- It is also stated that markdowns are run around special occasions.
- We will add an indicator variable for each MarkDown attribute that flags records where the markdown value is missing (i.e. was filled with zero).
# Adding an indicator variable that is 1 where a MarkDown value is missing (filled with zero) and 0 otherwise
walmart_df['MarkDown1_indicator'] = np.where(walmart_df['MarkDown1']==0,1,0)
walmart_df['MarkDown2_indicator'] = np.where(walmart_df['MarkDown2']==0,1,0)
walmart_df['MarkDown3_indicator'] = np.where(walmart_df['MarkDown3']==0,1,0)
walmart_df['MarkDown4_indicator'] = np.where(walmart_df['MarkDown4']==0,1,0)
walmart_df['MarkDown5_indicator'] = np.where(walmart_df['MarkDown5']==0,1,0)
Now, we will visualize the median sales and the other attributes derived from the sales.
# Routine for plotting a single series.
def plots(x,x_label,y_label,title):
    plt.plot(x)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(title)
    plt.show()
# Plotting the weekly sales
plots(walmart_df['Weekly_Sales'],'Units','Weekly_Sales','Weekly_Sales')

# Plotting the median sales.
plots(walmart_df['Median_Sales'],'Units','Median_Sales','Median_Sales')

# Plotting the sales from the median.
plots(walmart_df['Difference_Median'],'Units','Difference_Median','Difference in the sales from the median')

# Plotting the difference of the previous week's sales from the median sales.
plots(walmart_df['delta_1']-walmart_df['Median_Sales'],'Units','delta_1 - Median_Sales',"Difference of the previous week's sales from the median sales")

# Plotting the difference of the sales from two weeks earlier from the median sales.
plots(walmart_df['delta_2']-walmart_df['Median_Sales'],'Units','delta_2 - Median_Sales',"Difference of the sales two weeks earlier from the median sales")

# Plotting the difference of the sales from three weeks earlier from the median sales.
plots(walmart_df['delta_3']-walmart_df['Median_Sales'],'Units','delta_3 - Median_Sales',"Difference of the sales three weeks earlier from the median sales")

Modelling the weekly sales directly is quite hard.
We will instead model the difference of the sales from the median.
The differences have a nicer distribution and are therefore easier to model.
While modelling the differences, we mostly need to focus on the direction rather than the magnitude.
Since the Type attribute contains string values, we need to encode it.
# We are encoding the Type attribute using LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
walmart_df['Type'] = le.fit_transform(walmart_df['Type'])
Splitting the dataset into train and test.
# Selecting the records for test data
test = walmart_df[walmart_df['Date']>=pd.to_datetime('2012-11-02')]
# Selecting the records for train data
train = walmart_df[walmart_df['Date']<pd.to_datetime('2012-11-02')]
# Printing the length of the dataset (train/test)
print("Length of the training dataset ",len(train))
print("Length of the test dataset ",len(test))
Length of the training dataset 421570
Length of the test dataset 115064
# Using the Difference_Median attribute as the dependent variable for training
y_train = train['Difference_Median']
# Using the Difference_Median attribute as the dependent variable for testing
y_test = test['Difference_Median']
# Dropping the irrelevant attributes.
train.drop(['Weekly_Sales','Difference_Median','Date'],axis=1,inplace=True)
test.drop(['Weekly_Sales','Difference_Median'],axis=1,inplace=True)
# Printing the dataframe.
train.head()

Modelling
Hyperparameter tuning
Machine learning algorithms are governed by many hyperparameters. There is no complete theoretical recipe for choosing the right set of values; it is mostly an educated guess, though theory sometimes helps. Many methods have been proposed for hyperparameter tuning, and not all of them are useful. One that is intuitive and sound is grid search: we define a set of candidate values for each parameter that needs to be tuned.
Suppose, p and q are the hyperparameters that need to be tuned.
p = [ 1, 2, 3 ] and q = [ 4, 3 ]
Grid search fits a model for every possible combination of these values; this is a brute-force search. At each stage we also evaluate the model and keep the best one. The process is costly: if the parameter ranges are large, the time it takes grows substantially. One method that reduces both the time and space requirements is randomized search.
Randomized search drives the process probabilistically: at each stage, a value for each parameter is chosen uniformly at random, the model is fitted and evaluated, and the best configuration is kept. This takes less time and space. There are theoretical arguments for why this process converges to good configurations, but the details are beyond the scope of this post.
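To make the difference concrete, here is a tiny illustrative sketch using the hypothetical parameters p and q from above: grid search visits all 3 × 2 = 6 combinations, while randomized search samples only a few of them.
# Illustrative sketch only: exhaustive grid search vs. random sampling of combinations.
import random
from itertools import product
p = [1, 2, 3]
q = [4, 3]
grid = list(product(p, q))                # all 6 (p, q) combinations -> brute-force grid search
random_subset = random.sample(grid, 3)    # randomized search tries only a random subset
print(grid)
print(random_subset)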
Linear Regression
In statistics, linear regression is an approach for modelling a scalar dependent variable as a function of one or more independent explanatory variables, when the relationship between them is linear or when we want to approximate it with a linear function. In a nutshell, it fits a straight line (hyperplane) to the data. Linear regression often holds up well in high dimensions, where more flexible models struggle with the curse of dimensionality.
To use linear regression, we need to transform the existing data: encode the categorical variables as one-hot vectors and normalize the numerical variables to help the optimization.
Encoding the categorical variables
# Storing the categorical attributes
categories = ['Store','Dept','IsHoliday','Type','Super_Bowl','Labor_Day','Thanksgiving','Christmas','MarkDown1_indicator','MarkDown2_indicator','MarkDown3_indicator','MarkDown4_indicator','MarkDown5_indicator']
columns = list(walmart_df.columns)
# Storing the names of the numerical attributes
numeric = []
for i in columns:
    if i not in categories and i!='Date' and i!='Difference_Median' and i!='Median_Sales':
        numeric.append(i)
# Creating a dictionary to store the one-hot encoding of the categorical attributes
temp_dict = {}
for i in categories:
    temp_dict[i] = pd.get_dummies(walmart_df[i])
# Keeping the numerical attributes and a few raw columns (needed later for merging and evaluation)
temp_dict['numeric'] = walmart_df[numeric]
temp_dict['Date'] = walmart_df['Date']
temp_dict['IsHoliday'] = walmart_df['IsHoliday']
temp_dict['Difference_Median'] = walmart_df['Difference_Median']
temp_dict['Median_Sales'] = walmart_df['Median_Sales']
temp_dict['Store'] = walmart_df['Store']
temp_dict['Dept'] = walmart_df['Dept']
# Sorting and concatenating the dataframes
lin_reg_walmart_df = pd.concat([temp_dict[i] for i in temp_dict.keys()],axis=1)
lin_reg_walmart_df.sort_values('Date',inplace=True,ignore_index=True)
The pd.get_dummies function is used to transform the categorical variable to a one-hot vector. Once we have transformed it, we normally concatenate it with the existing dataframe and remove the categorical variable from it.
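For instance, one-hot encoding a small toy column behaves as follows (this is only an illustration, not part of the pipeline):
# Toy example of one-hot encoding with pd.get_dummies.
example = pd.Series(['A', 'B', 'A', 'C'])
print(pd.get_dummies(example))  # one indicator column per category (A, B and C)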
Normalizing the numerical variables
# Normalizing (standardizing) the numerical variables.
for i in numeric:
    mean = lin_reg_walmart_df[i].mean()
    std = lin_reg_walmart_df[i].std()
    lin_reg_walmart_df[i] -= mean
    lin_reg_walmart_df[i] /= std
There are many techniques for normalizing numerical variables. We used standardization, which shifts the mean to zero and scales the standard deviation to one. This is done purely for better convergence during optimization.
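The same standardization can be done with scikit-learn's StandardScaler, which we imported earlier but never used. Here is a sketch of the equivalent transformation as an alternative; note that in a stricter setup the scaler should be fitted on the training split only:
# Alternative sketch: standardizing the numeric columns with StandardScaler.
scaler = StandardScaler()
lin_reg_walmart_df[numeric] = scaler.fit_transform(lin_reg_walmart_df[numeric])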
Splitting the dataset into train and test
# Selecting the records for test data
test = lin_reg_walmart_df[lin_reg_walmart_df['Date']>=pd.to_datetime('2012-11-02')]
# Selecting the records for train data
train = lin_reg_walmart_df[lin_reg_walmart_df['Date']<pd.to_datetime('2012-11-02')]
For the training data, we select all records recorded before 2012-11-02.
For the test data, we select all records recorded on or after 2012-11-02.
# Dependent variable
y_train = train['Difference_Median']
y_test = test['Difference_Median']
train_isholiday = train['IsHoliday']
# Dropping the irrelevant attributes.
train.drop(['Weekly_Sales','Difference_Median','Date','IsHoliday','Store','Dept'],axis=1,inplace=True)
test.drop(['Weekly_Sales','Difference_Median'],axis=1,inplace=True)
Rather than treating sales as the dependent variable, we treat the difference in sales from the median sales. We already discussed the reason behind it.
Randomized Search
# Routine to visualize the hyperparameter tuning results.
def visualize(random):
dictionary = random.cv_results_
dictionary.pop('params')
return pd.DataFrame(dictionary)
The above function returns a dataframe containing the results of the hyperparameter tuning. The results are stored in the cv_results_ attribute.
# Importing the RandomizedSearchCV from the sklearn module
from sklearn.model_selection import RandomizedSearchCV
# alpha : Regularization parameter
alpha = [x for x in np.linspace(start = 0, stop = 1, num = 30)]
random_grid = {'alpha': alpha}
# Printing the random_grid
print(random_grid)

The parameter alpha needs to be tuned. It is a constant that regulates the strength of the regularization: a larger alpha penalizes large coefficients more heavily and produces a simpler solution.
# Use the random grid to search for best hyperparameters
# First create the base model to tune
from sklearn.linear_model import Ridge
ridge_random = Ridge()
# Random search of parameters, using 2-fold cross-validation,
# trying 75 different alpha values, and using 3 cores
ridge_random = RandomizedSearchCV(estimator = ridge_random, param_distributions = random_grid, n_iter = 75 , cv = 2, verbose= 2, random_state= 42, n_jobs = 3)
# Fit the random search model
ridge_random.fit(train,y_train)

Once we have defined the hyperparameter grid, we create an object of type RandomizedSearchCV and start the process using the fit method. We also use cross-validation, which holds out an unseen portion of the training data and uses it for evaluation.
# Calling the visualize routine
visualize(ridge_random)

This is a snapshot of our results. We observe a poor score, which indicates that the relationship is not linear.
# Printing the best parameters
print("The best parameters are ", ridge_random.best_params_)

We printed the value of alpha that achieved the best cross-validation score.
# Printing the best estimator and fitting on the train data
best_estimator = ridge_random.best_estimator_
best_estimator.fit(train,y_train)
The best_estimator_ attribute holds the object of the Ridge class with the best alpha. Then, we start the learning process by invoking the fit method.
# Saving the estimator using the dump method from joblib
joblib.dump(best_estimator, "./models/best_linreg.joblib")
We saved the model using joblib library.
# Loading the saved estimator from the drive.
best_estimator = joblib.load("./models/best_linreg.joblib")
We loaded the model from disk.
# This routine calculates the custom metric given on the Kaggle website
def custom_metric(train_isholiday,x,y,estimator):
    a = np.ones(shape=(x.shape[0]))
    a = a*(train_isholiday==True)*4+1  # Gives a weight of 5 if holiday and 1 if not
    # The model predicts the difference from the median, so we add the median back to get the sales.
    y_hat = estimator.predict(x)+x['Median_Sales']
    # Plotting our predicted weekly sales against the actual sales.
    plt.plot(y_hat,color='r',label='Predicted')
    plt.plot(y,color='g',label='Truth')
    plt.xlabel('Weeks')
    plt.ylabel('Weekly sales')
    plt.title('Estimated Weekly Sales')
    plt.legend()
    plt.show()
    # Calculating the weighted mean absolute error.
    diff = np.abs(y-y_hat)
    diff = diff*a
    diff = np.sum(diff)
    normalize = 1/(np.sum(a))
    diff = diff*normalize
    return diff
The above routine predicts the sales for the given data, plots them against the actual values, and calculates the metric used in the competition.
# Printing the score on the train set
print("The score on the train set ",custom_metric(train_isholiday,train,y_train+train['Median_Sales'],best_estimator))

It’s a good score to get started.
We have almost finished our task. Next, we need to predict the sales on the private test data from the competition website using our best model.
Inference
def inference(estimator,test,filename):
    # Reading the test.csv from the disk.
    t = pd.read_csv('test.csv')
    # We are reordering the test data to match the submission file.
    test.sort_values(by=['Date','Store','Dept'],inplace=True,ignore_index=True)
    t['Date'] = pd.to_datetime(t['Date'])
    v = pd.merge(t,test,on=['Store','Dept','Date','IsHoliday'],how='inner')
    v.drop(['Date','Store','Dept','IsHoliday'],axis=1,inplace=True)
    # Predicting the (difference from median) weekly sales for the test data.
    y_test = estimator.predict(v)
    # Adding the Median_Sales back to the predicted differences.
    Y = np.array(y_test)
    Y += np.array(v['Median_Sales'])
    # Building the Id column. This is the format expected by the competition.
    t['Id'] = t['Store'].astype(str)+'_'+t['Dept'].astype(str)+'_'+t['Date'].astype(str)
    # Adding the predicted weekly sales.
    t['Weekly_Sales'] = Y
    # Dropping the unnecessary columns.
    t.drop(['Store','Dept','Date','IsHoliday'],axis=1,inplace=True)
    # Writing it to a csv file.
    t.to_csv('submission/'+filename,columns=['Id','Weekly_Sales'],index=False)
    # Plotting our predicted weekly sales.
    plt.plot(Y)
    plt.xlabel('Weeks')
    plt.ylabel('Predicted weekly sales')
    plt.title('Estimated Weekly Sales')
    plt.show()
Most of the steps in this routine have already been discussed.
The to_csv function writes the dataframe to a csv file. Once the csv file is created, we upload it to Kaggle; after some processing on the server, the final score is displayed.
inference(best_estimator,test,'walmart_df_lin.csv')

Score

We are within the top 350!
Random Forest Regression
Here, we are going to train a random forest model on our dataset and make predictions.
WHY RANDOM FOREST? The random forest algorithm works well when the dependent variable is the outcome of several decisions, and the sales are indeed determined by several variables. The algorithm can't extrapolate, but we don't actually want to extrapolate: the data changes each year, and we need to re-model our observations iteratively. With all this in mind, we are going to model the sales using the random forest algorithm.
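A quick toy demonstration of the no-extrapolation property (purely illustrative, hypothetical data): a random forest trained on the line y = x cannot predict targets beyond the range it has seen.
# Toy illustration: random forests cannot extrapolate beyond the training targets.
from sklearn.ensemble import RandomForestRegressor
X_toy = np.arange(0, 10).reshape(-1, 1)
y_toy = np.arange(0, 10)  # y = x on the training range [0, 9]
toy_rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)
print(toy_rf.predict([[20]]))  # stays close to 9, the largest target seen during training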
Encoding the categorical variables
# We are encoding the Type attribute using LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
walmart_df['Type'] = le.fit_transform(walmart_df['Type'])
Previously we used one-hot encoding for the categorical variables. Here, we use a label encoder, which simply assigns an integer to each unique value. Since decision trees split on one feature at a time, there is no strict need for one-hot encoding.
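A toy example of what the label encoder does (illustrative only):
# Toy example of label encoding: each unique value gets an integer code.
toy_le = LabelEncoder()
print(toy_le.fit_transform(['A', 'C', 'B', 'A']))  # -> [0 2 1 0]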
Splitting the dataset into train and test
# Selecting the records for test data
test = walmart_df[walmart_df['Date']>=pd.to_datetime('2012-11-02')]
# Selecting the records for train data
train = walmart_df[walmart_df['Date']<pd.to_datetime('2012-11-02')]
# Making the Difference_Median attribute as the dependent variable
y_train = train['Difference_Median']
# Making the Difference_Median attribute as the dependent variable
y_test = test['Difference_Median']
# Dropping the irrelevant attributes.
train.drop(['Weekly_Sales','Difference_Median','Date'],axis=1,inplace=True)
test.drop(['Weekly_Sales','Difference_Median'],axis=1,inplace=True)
Hyperparameter tuning
# This routine creates a dataframe from the results of the hyperparameter tuning
def visualize(random_object):
    dictionary = random_object.cv_results_
    dictionary.pop('params')
    return pd.DataFrame(dictionary)
# Importing the RandomizedSearchCV from the sklearn module
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 100, num = 4)]
n_estimators.insert(0,200)
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 55, num = 4)]
# Minimum number of samples required to split a node
min_samples_split = [2,5]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_depth': max_depth,
'min_samples_split':[2]}
# Printing the random_grid
print(random_grid)
n_estimators: the number of decision trees used to build the forest.
max_depth: the maximum depth of each tree, i.e. the maximum number of successive decisions.
min_samples_split: the minimum number of samples a node must contain before it can be split.

# Use the random grid to search for best hyperparameters
# First create the base model to tune
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
# Random search of parameters, using 2-fold cross-validation,
# trying 6 different combinations, and using 3 cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 6, cv = 2, verbose= 2, random_state= 42, n_jobs = 3)
# Fit the random search model
rf_random.fit(train,y_train)

# Calling the visualize routine
visualize(rf_random)

# Printing the best parameters
print("The best parameters are ",rf_random.best_params_)
The best parameters are {'n_estimators': 200, 'min_samples_split': 2, 'max_depth': 55}
# Printing the best estimator.
print('The best estimator ',rf_random.best_estimator_)
The best estimator RandomForestRegressor(max_depth=55, n_estimators=200)
# Printing the best estimator and fitting on the train data
best_estimator = rf_random.best_estimator_
best_estimator.fit(train,y_train)
# Saving the estimator using the dump method from joblib
joblib.dump(best_estimator, "./models/best_rf.joblib")
# Loading the saved estimator from the drive.
best_estimator = joblib.load("./best_rf.joblib")
# This routine calculates the custom metric given on the Kaggle website
def custom_metric(x,y,estimator):
    a = np.ones(shape=(x.shape[0]))
    a = a*(x['IsHoliday']==True)*4+1  # Gives a weight of 5 if holiday and 1 if not
    # The model predicts the difference from the median, so we add the median back to get the sales.
    y_hat = estimator.predict(x)+x['Median_Sales']
    # Plotting our predicted weekly sales against the actual sales.
    plt.plot(y_hat,color='r',label='Predicted')
    plt.plot(y,color='g',label='Truth')
    plt.xlabel('Weeks')
    plt.ylabel('Weekly sales')
    plt.title('Estimated Weekly Sales')
    plt.legend()
    plt.show()
    # Calculating the weighted mean absolute error
    diff = np.abs(y-y_hat)
    diff = diff*a
    diff = np.sum(diff)
    normalize = 1/(np.sum(a))
    diff = diff*normalize
    return diff
# Printing the score on the train set
print("The score on the train set ",custom_metric(train,y_train+train['Median_Sales'],best_estimator))

Inference
def inference(estimator,test,filename):
    # Reading the test.csv from the disk.
    t = pd.read_csv('test.csv')
    # We are reordering the test data to match the submission file.
    test.sort_values(by=['Date','Store','Dept'],inplace=True,ignore_index=True)
    t['Date'] = pd.to_datetime(t['Date'])
    v = pd.merge(t,test,on=['Store','Dept','Date','IsHoliday'],how='inner')
    v.drop('Date',inplace=True,axis=1)
    v['IsHoliday'] = v['IsHoliday'].astype(bool)
    # Predicting the (difference from median) weekly sales for the test data.
    y_test = estimator.predict(v)
    # Adding the Median_Sales back to the predicted differences.
    Y = np.array(y_test)
    Y += np.array(v['Median_Sales'])
    # Building the Id column. This is the format expected by the competition.
    t['Id'] = t['Store'].astype(str)+'_'+t['Dept'].astype(str)+'_'+t['Date'].astype(str)
    # Adding the predicted weekly sales.
    t['Weekly_Sales'] = Y
    # Dropping the unnecessary columns.
    t.drop(['Store','Dept','Date','IsHoliday'],axis=1,inplace=True)
    # Writing it to a csv file.
    t.to_csv('submission/'+filename,columns=['Id','Weekly_Sales'],index=False)
    # Plotting our predicted weekly sales.
    plt.plot(Y)
    plt.xlabel('Weeks')
    plt.ylabel('Predicted weekly sales')
    plt.title('Estimated Weekly Sales')
    plt.show()
inference(best_estimator,test,'walmart_df_rf.csv')

Score

We are within the top 375!
XGBoost Regression
XGBoost performs gradient descent in function space. The idea is simple and intuitive: first-order optimization techniques move the parameters in the direction opposite to the gradient of the objective function. Since a decision tree can't be expressed as an algebraic function, it is not obvious how to compute gradients for it. XGBoost treats the ensemble of trees as an ordinary function, so the standard optimization techniques carry over: each new tree is fitted to the negative gradient of the loss with respect to the current predictions.
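Here is a minimal sketch of that idea for squared loss on hypothetical toy data: each new tree is fitted to the residuals of the current ensemble, which are exactly the negative gradient of the squared-error objective, and its predictions are added with a small learning rate. This only illustrates the principle; it is not how we call the xgboost library below.
# Minimal gradient-boosting sketch for squared loss on toy data (illustration only).
from sklearn.tree import DecisionTreeRegressor
X_toy = np.random.rand(200, 3)
y_toy = X_toy[:, 0] * 5 + np.sin(X_toy[:, 1] * 6)
prediction = np.zeros_like(y_toy)  # start from the zero model
learning_rate = 0.1
trees = []
for _ in range(50):
    residuals = y_toy - prediction                      # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=3).fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)   # take a small step in function space
    trees.append(tree)
print(np.mean((y_toy - prediction) ** 2))  # the training error shrinks as trees are added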
Splitting the dataset into train and test
# Selecting the records for test data
test = walmart_df[walmart_df['Date']>=pd.to_datetime('2012-11-02')]
# Selecting the records for train data
train = walmart_df[walmart_df['Date']<pd.to_datetime('2012-11-02')]
# Xgboost expects it to be of boolean.
train['IsHoliday'] = train['IsHoliday'].astype(bool)
# Dropping the irrelevant attributes.
train.drop(['Weekly_Sales','Difference_Median','Date'],axis=1,inplace=True)
test.drop(['Weekly_Sales','Difference_Median'],axis=1,inplace=True)
Hyperparameter tuning
# Importing the RandomizedSearchCV from the sklearn module
from sklearn.model_selection import RandomizedSearchCV
# Number of boosting rounds (trees)
n_estimators = [40,100,140,180,200]
# Maximum depth of each tree
max_depth = [None,10,25,55,70]
# Shrinkage applied to each boosting step
learning_rate = [0.1,0.01,0.001]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_depth': max_depth,
'learning_rate': learning_rate}
# Printing the random_grid
print(random_grid)

learning_rate: a constant that scales each boosting step (the shrinkage applied to each new tree's contribution).
# This routine creates a dataframe from the results of the hyperparameter tuning
def visualize_xgb(random_object):
    dictionary = random_object.cv_results_  # Using the cv_results_ attribute to retrieve the results as a dictionary.
    columns = ['mean_fit_time','std_fit_time','mean_score_time','std_score_time','n_estimators','max_depth','learning_rate','split0_test_score','split1_test_score','mean_test_score','std_test_score','rank_test_score']
    d = {}
    negate = [4,5,6]  # These three columns are filled from the 'params' dictionary instead.
    for i in range(0,12):
        if i not in negate:
            d[columns[i]] = dictionary[columns[i]]  # Extracting the results from the cv_results_ dictionary.
    d[columns[4]] = np.array([i['n_estimators'] for i in dictionary['params']])
    d[columns[5]] = np.array([i['max_depth'] for i in dictionary['params']])
    d[columns[6]] = np.array([i['learning_rate'] for i in dictionary['params']])
    return pd.DataFrame(d)  # Converting it into a dataframe for nicer visualization.
# Use the random grid to search for best hyperparameters
# First create the base model to tune
from xgboost import XGBRegressor
xgb = XGBRegressor()
# Random search of parameters, using 2-fold cross-validation,
# trying 8 different combinations, and using 3 cores
xgb_random = RandomizedSearchCV(estimator = xgb, param_distributions = random_grid, n_iter = 8, cv = 2, verbose= 2, n_jobs = 3)
# Fit the random search model
xgb_random.fit(train,y_train)

# Calling the visualize routine
visualize_xgb(xgb_random)

# Printing the best parameters
print("The best parameters are ",xgb_random.best_params_)
The best parameters are {'n_estimators': 100, 'max_depth': None, 'learning_rate': 0.1}
# Printing the best estimator.
print('The best estimator ',xgb_random.best_estimator_)

# Printing the best estimator and fitting on the train data
best_estimator = xgb_random.best_estimator_
best_estimator.fit(train,y_train)

# Printing the score on the train set
print("The score on the train set ",custom_metric(train,y_train+train['Median_Sales'],best_estimator))

# Saving the estimator using the dump method from joblib
joblib.dump(best_estimator, "./models/best_xgb.joblib")
# Loading the saved estimator from the drive.
best_estimator = joblib.load("./models/best_xgb.joblib")
Inference
inference(best_estimator,test,'walmart_df_xgb.csv')

Score

We are within the top 280!
Future Work
Without powerful hardware, extensive hyperparameter tuning is hard, which limited our results. Since this is a time-series problem, there are many state-of-the-art deep learning models for the task, and we could also try statistical algorithms designed specifically for time series. But since our goal here was to model the data using traditional machine learning algorithms, we limited ourselves to them.