Question: What types of customers purchase chips, and what is their purchasing behaviour within the region? To answer this question, we need to complete the following tasks:
# data analysis and wrangling
import pandas as pd
import numpy as np
import re
import xlrd
from scipy import stats
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
%matplotlib inline
# Filter all warnings.
import warnings
warnings.filterwarnings('ignore')
We start by acquiring the transaction and customer datasets into Pandas dataframes.
# Open the local files
from google.colab import files
uploaded = files.upload()
# Load local files as pandas dataframe
transaction = pd.read_csv(r'QVI_transaction_data.csv')
customer = pd.read_csv(r'QVI_purchase_behaviour.csv')
Pandas also helps describe the datasets, answering the following questions early in the project.
Which features are available in the dataset?
We note the feature names so that we can manipulate or analyze them directly. The feature names are as follows:
print('Feature names in transaction dataset are',transaction.columns.values)
print('Feature names in customer dataset are',customer.columns.values)
Let's preview the data.
transaction.head()
customer.head()
What are the data types for various features?
This helps us with the converting goal later.
transaction.info()
print('-'*40) # Draw a separator line
customer.info()
Which features are categorical?
This helps us select the appropriate plot for visualization.
Which features are numerical?
This helps us select the appropriate plots for visualization.
Which features are mixed data types?
These are candidates for the correcting goal.
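As a quick sketch of how to answer the categorical-versus-numerical questions programmatically, `select_dtypes` can split features by dtype (the toy columns below are modeled on the QVI datasets, for illustration only):

```python
import pandas as pd

# Toy frame mirroring the QVI datasets' column types (illustrative only)
df = pd.DataFrame({
    'STORE_NBR': [1, 2],          # integer code (categorical in meaning)
    'PROD_NAME': ['a', 'b'],      # object dtype -> categorical
    'TOT_SALES': [6.0, 3.8],      # float -> numerical
})
categorical = df.select_dtypes(include=['object']).columns.tolist()
numerical = df.select_dtypes(include=['number']).columns.tolist()
print(categorical)  # ['PROD_NAME']
print(numerical)    # ['STORE_NBR', 'TOT_SALES']
```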
What are the distributions of numerical features?
This helps us understand the important statistics for each numerical feature.
transaction.describe()
transaction['PROD_QTY'].describe(percentiles=[.8,.99])
transaction['TOT_SALES'].describe(percentiles=[.8,.99])
What are the distributions of categorical features?
transaction.describe(include=['O'])
customer.describe(include=['O'])
plt.figure(figsize=(10,3))
plt.subplot(1,2,1)
sns.countplot(x='LIFESTAGE',data=customer,palette='Blues_d')
plt.xticks(rotation=90)
plt.subplot(1,2,2)
sns.countplot(x='PREMIUM_CUSTOMER',data=customer,palette='Blues_d')
plt.xticks(rotation=0)
STORE_NBR is also a categorical feature, but we will convert it into discrete numbers in the DATA WRANGLE section and then check its distribution.
Which features may contain errors or typos?
The PROD_NAME feature may contain errors or typos, as product names are written in several different ways.
transaction['PROD_NAME'].head(20)
Which features contain blank, null or empty values?
There are no missing values in either the transaction or the customer dataset.
# Check for nulls for the transaction dataset
null_check_transaction = transaction.isnull().sum()
null_check_transaction
# Check for nulls for the customer dataset
null_check_customer = customer.isnull().sum()
null_check_customer
However, there are only 364 unique dates, which indicates one missing date.
# Check the number of unique transaction dates
transaction['DATE'].nunique()
We arrive at the following decisions based on the data analysis so far.
Correcting
We want to retain only chip products in the transaction dataset.
Creating
We want to create new PackSize and Brand features from the PROD_NAME feature.
Converting
We want to convert the format of DATE into a human readable format.
Correlating
We want to know how well each feature correlates with TOT_SALES, especially LIFESTAGE and PREMIUM_CUSTOMER features.
So far we have not changed a single feature or value to arrive at these decisions. Let's now execute them.
We are only interested in the chips category, so let's remove salsa products in the transaction dataset.
transaction.shape
transaction = transaction.drop(transaction[transaction['PROD_NAME'].str.contains('Salsa')].index)
transaction[transaction['PROD_NAME'].str.contains('Salsa')]
transaction.shape
In the following code, we extract the PackSize feature using regular expressions. The RegEx pattern \d+
matches one or more consecutive digits; we take the first such match within each PROD_NAME as the pack size.
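For instance, applied to a single product name (the name format here is assumed from the dataset preview), the pattern behaves as follows:

```python
import re

# A sample product name; the '175g' suffix encodes the pack size in grams
name = 'Natural Chip Compny SeaSalt175g'
digits = re.findall(r'\d+', name)  # all runs of consecutive digits
print(digits)          # ['175']
print(int(digits[0]))  # 175 -> the pack size
```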
productname = transaction['PROD_NAME'].tolist()
# Take the first run of digits in each product name as the pack size
packsize = [int(re.findall(r'\d+', name)[0]) for name in productname]
transaction['PackSize'] = packsize
sns.catplot(x='PackSize',kind='count', data=transaction,palette="Blues_d").set(title="The Number of Each Pack Size of Chips")
plt.xticks(rotation=45)
plt.show()
In the following code, we extract the Brand feature by taking the first word of each product name.
# The brand is the first word of each product name
transaction['Brand'] = transaction['PROD_NAME'].str.split().str[0]
sns.catplot(x='Brand',kind='count', data=transaction,palette="Blues_d").set(title="The Number of Different Brands of Chips")
plt.xticks(rotation=90)
plt.show()
Some of the brand names refer to the same brand, such as RED and RRD, which are both Red Rock Deli chips. Let's combine them.
transaction['Brand']=transaction['Brand'].replace('RED','RRD')
transaction['Brand']=transaction['Brand'].replace('Red','RRD')
transaction['Brand']=transaction['Brand'].replace('Natural','NNC')
transaction['Brand']=transaction['Brand'].replace('Dorito','Doritos')
transaction['Brand']=transaction['Brand'].replace('Infuzions','Infzns')
transaction['Brand']=transaction['Brand'].replace('Smith','Smiths')
transaction['Brand']=transaction['Brand'].replace('Sunbites','Snbts')
transaction['Brand']=transaction['Brand'].replace('Woolworths','WW')
transaction['Brand']=transaction['Brand'].replace('Grain','GrnWves')
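The chained replacements above can be written more compactly as a single `replace()` call with a mapping; a sketch on toy data:

```python
import pandas as pd

# The same mapping as the chained replace() calls above
brand_map = {'RED': 'RRD', 'Red': 'RRD', 'Natural': 'NNC', 'Dorito': 'Doritos',
             'Infuzions': 'Infzns', 'Smith': 'Smiths', 'Sunbites': 'Snbts',
             'Woolworths': 'WW', 'Grain': 'GrnWves'}
brands = pd.Series(['RED', 'Smith', 'Doritos', 'Grain'])
cleaned = brands.replace(brand_map).tolist()
print(cleaned)  # ['RRD', 'Smiths', 'Doritos', 'GrnWves']
```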
sns.catplot(x='Brand',kind='count', data=transaction,palette="Blues_d").set(title="The Number of Different Brands of Chips after Combining Names")
plt.xticks(rotation=90)
plt.show()
Let's convert the Excel date feature into human readable format and then find the missing date in the transaction year.
# Convert the DATE feature from Excel date format to human readable format
dates = []
for date in transaction['DATE']:
dates.append(xlrd.xldate_as_datetime(date,0).date().isoformat())
transaction['DATE'] = dates
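As an alternative sketch, pandas alone can convert these Excel serial dates without xlrd, assuming the 1900 date system (whose effective origin for modern serials is 1899-12-30); the serial values below are chosen to match the dataset's date range:

```python
import pandas as pd

# Excel's 1900 date system counts days from an origin of 1899-12-30
# (valid for serials >= 61, i.e. after Excel's phantom 1900-02-29)
serials = pd.Series([43282, 43646])
dates = pd.to_datetime(serials, unit='D', origin='1899-12-30')
print(dates.dt.strftime('%Y-%m-%d').tolist())  # ['2018-07-01', '2019-06-30']
```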
We can see that the increase in sales occurs in the lead-up to Christmas and that there were zero sales on Christmas Day itself. This is due to shops being closed on Christmas Day.
# Check the missing date
df = transaction[['DATE','TOT_SALES']]
df = df.set_index('DATE')
df.index = pd.to_datetime(df.index)
pd.date_range(start='2018-07-01',end='2019-06-30').difference(df.index)
# Create a new data frame which contains the counts for each date
df_date1 = pd.DataFrame(pd.date_range(start='2018-07-01',end='2019-06-30'),columns=['Date1'])
df_date2 = transaction['DATE'].groupby(transaction['DATE']).count().to_frame()
df_date2['Date2'] = pd.to_datetime(df_date2.index)
df_date2 = df_date2.rename(columns={'DATE':'Count'})
df_date = df_date1.merge(df_date2,how='left',left_on='Date1',right_on='Date2')
plt.figure(figsize=(10,3))
plt.plot(df_date['Date1'],df_date['Count'])
plt.title('Number of Transactions over Time')
plt.xlabel('Date')
plt.ylabel('Number of Transactions')
plt.show()
Now we can continue confirming some of our assumptions by visualizing and analyzing the data.
# Create a new dataframe, combine, to contain all information in both datasets
combine = transaction.merge(customer,how='left',left_on='LYLTY_CARD_NBR',right_on='LYLTY_CARD_NBR')
combine.info()
Correlating categorical features
Let's start by understanding correlations between categorical features and our goal (TOT_SALES).
Sales come mainly from budget older families, mainstream young singles and couples, and mainstream retirees.
df_pls = combine[['TOT_SALES','PREMIUM_CUSTOMER','LIFESTAGE']]
df_pls_grouped = df_pls.groupby(['PREMIUM_CUSTOMER','LIFESTAGE']).agg({'TOT_SALES':'sum'}).sort_values('TOT_SALES',ascending=False)
df_pls_grouped.plot(kind='bar')
There are more mainstream young singles and couples and mainstream retirees who buy chips. Higher sales may also be driven by more units of chips being bought per customer. Let's have a look at this next.
df_plcs = combine[['LYLTY_CARD_NBR','PREMIUM_CUSTOMER','LIFESTAGE']]
df_plcs_grouped = df_plcs.groupby(['PREMIUM_CUSTOMER','LIFESTAGE']).agg({'LYLTY_CARD_NBR':'nunique'}).sort_values('LYLTY_CARD_NBR',ascending=False)
df_plcs_grouped.plot(kind='bar')
Older families and young families in general buy more chips per customer.
# Average number of units per customer = total quantity bought / number of unique customers
df_qty_grouped = combine.groupby(['PREMIUM_CUSTOMER','LIFESTAGE']).agg({'PROD_QTY':'sum'})
df_plc_grouped = df_qty_grouped.merge(df_plcs_grouped,how='left',left_index=True,right_index=True)
df_plc_grouped['AVG_NUM_UNIT_PER_CUSTOMER']=df_plc_grouped['PROD_QTY']/df_plc_grouped['LYLTY_CARD_NBR']
df_plc_grouped = df_plc_grouped.drop(['PROD_QTY','LYLTY_CARD_NBR'],axis=1).sort_values('AVG_NUM_UNIT_PER_CUSTOMER',ascending=False)
df_plc_grouped.plot(kind='bar')
Let's also investigate the average price per unit chips bought for each customer segment as this is also a driver of total sales.
Mainstream young and midage singles and couples are more willing to pay more per packet of chips compared to their budget and premium counterparts. This may be due to premium shoppers being more likely to buy healthy snacks and when they buy chips, this is mainly for entertainment purposes rather than their own consumption.
This is also supported by there being fewer premium young and midage singles and couples buying chips compared to their mainstream counterparts.
# Use .copy() so adding a column does not trigger a SettingWithCopyWarning
df_avg_price_per_unit = combine[['PREMIUM_CUSTOMER','LIFESTAGE','TOT_SALES','PROD_QTY']].copy()
df_avg_price_per_unit['AVG_PRICE_PER_UNIT']=df_avg_price_per_unit['TOT_SALES']/df_avg_price_per_unit['PROD_QTY']
df_avg_price_per_unit_grouped=df_avg_price_per_unit.groupby(['LIFESTAGE','PREMIUM_CUSTOMER']).agg({'AVG_PRICE_PER_UNIT':'mean'}).sort_values('AVG_PRICE_PER_UNIT',ascending=False)
df_avg_price_per_unit_grouped.plot(kind='bar')
As the difference in average price per unit isn't large, we can check whether this difference is statistically significant.
# Perform an independent t-test between mainstream vs premium and budget young and midage singles and couples
test1 = df_avg_price_per_unit[(df_avg_price_per_unit['LIFESTAGE'].isin(['YOUNG SINGLES/COUPLES','MIDAGE SINGLES/COUPLES'])) & (df_avg_price_per_unit['PREMIUM_CUSTOMER']=='Mainstream')]
test2 = df_avg_price_per_unit[(df_avg_price_per_unit['LIFESTAGE'].isin(['YOUNG SINGLES/COUPLES','MIDAGE SINGLES/COUPLES'])) & (df_avg_price_per_unit['PREMIUM_CUSTOMER']!='Mainstream')]
stats.ttest_ind(test1['AVG_PRICE_PER_UNIT'],test2['AVG_PRICE_PER_UNIT'])
The t-test results in a p-value close to 0, i.e. the unit price for mainstream young and midage singles and couples is significantly higher than that of budget or premium young and midage singles and couples.
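For intuition about how `scipy.stats.ttest_ind` behaves here, a minimal sketch on toy data (the group means and spread below are hypothetical, not the actual unit prices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mainstream = rng.normal(4.0, 0.5, 500)  # hypothetical unit prices, group 1
others = rng.normal(3.7, 0.5, 500)      # hypothetical unit prices, group 2
t_stat, p_value = stats.ttest_ind(mainstream, others)
# With a 0.30 mean difference and n=500 per group, the test easily rejects
print(p_value < 0.05)  # True
```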
More Findings on the Target Customer
We have found quite a few interesting insights that we can dive deeper into. We might want to target the customer segment that contributes the most to sales, in order to retain them and further increase sales. Let's look at mainstream young singles and couples and find out whether they tend to buy a particular brand of chips.
segment1 = combine[(combine['LIFESTAGE'] =='YOUNG SINGLES/COUPLES') & (combine['PREMIUM_CUSTOMER']=='Mainstream')]
# The rest of the population: everyone NOT in the target segment
other = combine[~((combine['LIFESTAGE']=='YOUNG SINGLES/COUPLES') & (combine['PREMIUM_CUSTOMER']=='Mainstream'))]
# Brand affinity compared to the rest of the population
quantity_segment1=segment1.agg({'PROD_QTY':'sum'})
quantity_other=other.agg({'PROD_QTY':'sum'})
quantity_segment1_by_brand = segment1.groupby(['Brand']).agg({'PROD_QTY':'sum'})/quantity_segment1
quantity_other_by_brand = other.groupby(['Brand']).agg({'PROD_QTY':'sum'})/quantity_other
brand_prop=quantity_segment1_by_brand.merge(quantity_other_by_brand,how='left',left_index=True,right_index=True)
brand_prop = brand_prop.rename(columns={'PROD_QTY_x':'targetSegment','PROD_QTY_y':'other'})
brand_prop['affinityBrand']=brand_prop['targetSegment']/brand_prop['other']
brand_prop = brand_prop.sort_values('affinityBrand',ascending=False)
brand_prop['affinityBrand'].plot(kind='bar')
We can see the brands with the highest and lowest affinity for the target segment:
print(brand_prop.head(1))
print(brand_prop.tail(1))
Finally, let's find out whether our target segment tends to buy larger packs of chips.
# Preferred pack size compared to the rest of the population
quantity_segment1_by_size = segment1.groupby(['PackSize']).agg({'PROD_QTY':'sum'})/quantity_segment1
quantity_other_by_size = other.groupby(['PackSize']).agg({'PROD_QTY':'sum'})/quantity_other
size_prop=quantity_segment1_by_size.merge(quantity_other_by_size,how='left',left_index=True,right_index=True)
size_prop = size_prop.rename(columns={'PROD_QTY_x':'targetSegment','PROD_QTY_y':'other'})
size_prop['affinitySize']=size_prop['targetSegment']/size_prop['other']
size_prop = size_prop.sort_values('affinitySize',ascending=False)
size_prop['affinitySize'].plot(kind='bar')
It looks like Mainstream young singles and couples are 27% more likely to purchase a 270g pack of chips compared to the rest of the population.
print(size_prop.head(1))
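For intuition, the affinity ratio compares the share of a pack size within the segment's purchases to its share in the rest of the population; a ratio of 1.27 means "27% more likely". A sketch with made-up proportions:

```python
# Hypothetical shares, purely to illustrate the affinity arithmetic
p_segment = 0.0381  # share of 270g packs among target-segment purchases
p_other = 0.0300    # share of 270g packs among everyone else
affinity = p_segment / p_other
print(round(affinity, 2))  # 1.27 -> about 27% more likely
```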
Let's check what brands sell this pack size.
combine[combine['PackSize']==270]['PROD_NAME'].unique()
Twisties is the only brand offering 270g packs, so this may reflect a higher likelihood of purchasing Twisties among the target customers rather than a preference for larger packs per se.