Today, retailers like Amazon, Target, Costco, and Walmart are investing heavily in virtualizing their customers’ shopping experiences. Almost anything, from a new smartphone to tomorrow’s dinner, can now be purchased online; in fact, an increasing number of people are buying groceries online. This leaves small mom-and-pop shops and convenience stores with physical locations struggling to retain customers and pay the bills. In this project, we will explore a small part of how these physical stores, or even aspiring new store owners, can strategize their operations to maximize profits, starting with liquor sales in the midwestern state of Iowa.
The question I hope to answer in this project is: what factors influence Iowans to spend more money on liquor at a store? These factors could include a store’s location, the time of the year, or simply the store’s catalog.
The dataset I use is "Iowa Liquor Sales - 12 million alcoholic beverage sales in the Midwest", a CSV that can be found on Kaggle.
Because this dataset contains over 12 million rows of data, around 3.4 GB, I decided to take a random 0.5% sample of the original CSV to analyze. This whittles the original dataset down to 62,955 rows, or around 17 MB of data, which lets my code run a lot faster.
To sample from the original dataset to create a smaller one, I wrote a small Python script, using the Pandas library. The script is detailed below:
import pandas as pd
# read in the humongous CSV into a Pandas dataframe
full_data = pd.read_csv('Iowa_Liquor_Sales.csv')
# a random 0.5% sample of full_data without replacement
sampled_data = full_data.sample(frac=0.005)
# export sampled_data into another CSV
sampled_data.to_csv('iowa_liquor_sales_sample.csv', index=False)
After creating iowa_liquor_sales_sample.csv, I uploaded the file to GitHub so that it can be accessed online rather than only locally.
Imports
Here, we will be using the pandas data analysis library so that we can read in our CSV dataset and save it as a dataframe, allowing us to perform various forms of data parsing and analysis.
Additionally, we will also be using the numpy library. The main purpose of numpy in this section is to help us identify which rows in our pandas dataframe have missing attributes, or column values; these are marked with np.nan.
Both of these libraries, along with most of the other libraries used in this tutorial, can be installed with the Anaconda package manager.
import pandas as pd
import numpy as np
As mentioned earlier, the dataset I use in this tutorial is iowa_liquor_sales_sample.csv, which I saved on GitHub.
# save unmodified csv into all_data
all_data = pd.read_csv('https://raw.githubusercontent.com/RYLiang18/Richard-CMSC-320-Final-Project-F2020/main/iowa_liquor_sales_sample.csv')
# create a working copy called data that we will tidy up
data = all_data.copy()
Here, we can see what the raw dataframe looks like, and we can already notice some inconsistencies. For example, on rows 2 and 3 the city attribute is "DES MOINES" in all caps, while on row 62954 the city attribute is "Des Moines", written normally. These clearly refer to the same city, but our code will treat them as different values, so we need to tidy the data, which is done in section 1.3.
all_data
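One quick way to quantify this kind of casing inconsistency is to compare the number of distinct city strings before and after lowercasing. This is just a small sanity check on the raw dataframe, assuming the City column shown above:
# count distinct city strings before and after lowercasing
raw_cities = all_data['City'].dropna()
print('Distinct city strings as-is:      ', raw_cities.nunique())
print('Distinct city strings lowercased: ', raw_cities.str.lower().nunique())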
The next step is to tidy up our data and make sure that all columns in our dataframe that we plan on using are in the right format.
For this, we will drop the columns we don't plan on using, remove inconsistencies in the remaining columns, convert the columns we will use into the right formats, and remove outliers.
Because we will primarily be analyzing the sales of different types of liquor and the locations of liquor stores, some of the attributes that we can definitely drop are Invoice/Item Number, Store Name, Store Number, Address, Volume Sold (Gallons), and State Bottle Cost. We may end up using some of the other attributes, so we won't delete them.
In the introduction, I mentioned that store location could be valuable. You may be wondering: if we care about the location of these liquor sales, why are we deleting the address? This is because the addresses are already stored in the Store Location attribute. Later on, when we plot a heatmap of all of our liquor sales, we will care mostly about the longitude and latitude of the stores where sales occurred, and these coordinates are also included in the Store Location attribute.
# Let's drop unused columns
data = data.drop(
[
'Invoice/Item Number',
'Store Name',
'Store Number',
'Address',
'Volume Sold (Gallons)',
'State Bottle Cost'
],
axis = 1
)
To minimize inconsistencies, we have to make sure that observations sharing the same attribute value are labeled identically.
Removing Inconsistencies in String (dtype: object) Columns by Converting to Lowercase
One of the first things that we will do to remove inconsistencies is to convert all string-valued (dtype: object) columns to lowercase. Case sensitivity is then no longer a source of inconsistent attribute values.
for col_name in data.columns:
    # object-dtype columns hold strings in this dataset
    if data[col_name].dtype == object:
        data[col_name] = data[col_name].str.lower()
Removing Inconsistencies in County Names
To remove inconsistencies in county names, we will cross reference county names with their respective county numbers.
For example, Polk County is County Number 77; therefore, we will make sure that every liquor sale that takes place in County 77 is labeled "polk".
# Remove entries that are missing the 'County Number' and 'County' Attribute
data = data.dropna(subset=['County Number', 'County'])
# convert County to lowercase
data["County"] = data["County"].str.lower()
# use a dict to cross-reference county numbers with county names
county_dict = dict()
# pre-seed county 78 so every sale there is labeled 'pottawattamie'
county_dict[78.0] = 'pottawattamie'
for i, row in data.iterrows():
    county_number = row['County Number']
    if county_number in county_dict:
        data.at[i, 'County'] = county_dict[county_number]
    else:
        county_dict[county_number] = row['County']
Removing Inconsistencies in Category Names
Since we also have a column called Category, which stores category IDs, we will also remove inconsistencies in the Category Name column by cross-referencing those IDs.
data = data.dropna(subset=['Category', 'Category Name'])
data['Category Name'] = data['Category Name'].str.lower()
# use a dict to cross-reference category IDs with category names
category_dict = dict()
for i, row in data.iterrows():
    category_number = row['Category']
    if category_number in category_dict:
        data.at[i, 'Category Name'] = category_dict[category_number]
    else:
        category_dict[category_number] = row['Category Name']
Another part of tidying up data is making sure that the columns we plan on using contain values in the right format. Since later in the project we will be plotting Sale (Dollars) against different Iowa counties and against the time of the year, we need to convert each value in the Sale (Dollars) column from the $x.xx string format into a float that we can plot and compare. The same goes for State Bottle Retail, since we will be plotting that attribute later as well.
Finally, we will need to convert the values in the Date column, which are originally date strings in the "MM/DD/YYYY" format, into pandas datetime objects.
Converting Sale (Dollars) into type float
# strip the dollar sign, then convert the remaining string to a float
data['Sale (Dollars)'] = data['Sale (Dollars)'].str.replace('$', '', regex=False)
data['Sale (Dollars)'] = pd.to_numeric(data['Sale (Dollars)'], downcast="float")
Converting State Bottle Retail into type float
# strip the dollar sign, then convert the remaining string to a float
data['State Bottle Retail'] = data['State Bottle Retail'].str.replace('$', '', regex=False)
data["State Bottle Retail"] = pd.to_numeric(data["State Bottle Retail"], downcast="float")
Converting Date into type datetime
data['Date']= pd.to_datetime(data['Date'])
We all know that alcohol, especially liquor, can be quite expensive, so it is important for us to remove outliers. We don’t want that one bottle of 300-year-old cognac that was sold for $50k to bias our data analysis. Nor do we want a group of college students buying 50 handles of vodka for yet another insane price to also skew our data.
While it might seem obvious that we should remove outliers from the Sale (Dollars) column, since that is what we will be analyzing the most, notice that Sale (Dollars) is obtained by multiplying State Bottle Retail by either Pack or Bottles Sold. Because Sale (Dollars) is based on these other columns, we should remove outliers from State Bottle Retail, Pack, and Bottles Sold.
Let's create a new dataframe called data_nofliers to store a version of data without rows containing outlier values for State Bottle Retail, Pack, or Bottles Sold, while keeping data itself untouched in case we would like to perform some analysis where outliers do not matter.
data_nofliers = data.copy(deep = True)
A commonly used rule says that a data point is an outlier if it is more than $1.5 \cdot IQR$ above the third quartile or $1.5 \cdot IQR$ below the first quartile. The function below removes rows with outlier values for a specific attribute name.
def remove_outliers(attribute_name):
    global data_nofliers
    q1 = data_nofliers[attribute_name].quantile(0.25)
    q3 = data_nofliers[attribute_name].quantile(0.75)
    iqr = q3 - q1
    # keep only rows whose value lies inside [q1 - 1.5*iqr, q3 + 1.5*iqr]
    data_nofliers = data_nofliers[
        data_nofliers[attribute_name].between(q1 - 1.5*iqr, q3 + 1.5*iqr)
    ]
remove_outliers('State Bottle Retail')
remove_outliers('Pack')
remove_outliers('Bottles Sold')
data_nofliers
In this section, we will take a deeper look at the dataset that we curated to try to reveal trends and create visualizations for our data. If we put ourselves in the shoes of a mom and pop shop or a person trying to open a liquor business in Iowa, it would be very helpful, for example, to know statistics like where liquor sale transactions occur the most and what time of the year liquor sales are most popular.
Imports
For creating visually appealing graphs and charts, the main libraries we will use are seaborn
and matplotlib.pyplot
. seaborn
is a library based on top of matplotlib that helps make graphs more visually appealing.
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
The first thing that we will want to look at is how liquor sales are distributed across Iowa, and the best way to do this would be through an interactive heatmap over the state. This will show us which locations in Iowa have the highest density of liquor sales.
Also, note that since numerical outliers do not matter when generating a heatmap, we will not be using data_nofliers but rather data, since it has more observations.
Imports
Because we aren’t making a graph or chart, we won’t be using seaborn
or matplotlib
; instead, we will be using folium
and folium.plugin
, which are libraries that make it easy to visualize data on a Leaflet
map.
import folium
from folium import plugins
The easiest way to plot each of our liquor sales in data is through geographic coordinates - longitude and latitude. After looking at data, we notice that these coordinates are actually embedded in the Store Location attribute of each liquor sale. To easily plot the location of each sale, let's parse the coordinates out of Store Location and save them in a new column.
# Remove entries that are missing the 'Store Location' Attribute
data = data.dropna(subset=['Store Location'])
# adding a new column to store liquor store coordinates
data['store_coords'] = None
# we will use regex to parse out liquor store coordinates
import re
for i, row in data.iterrows():
    # get store location string
    store_loc_str = row['Store Location']
    # extract the "(lat, lng)" part of the string
    store_coord_str = re.search(r'\((.*?)\)', store_loc_str)
    if store_coord_str is None:
        continue
    store_coord_str = store_coord_str.group(1)
    # split into [lat, lng] and convert to floats
    store_coords = store_coord_str.split(',')
    store_coords = [float(numeric_string) for numeric_string in store_coords]
    # save store_coords into data
    if len(store_coords) == 2:
        data.at[i, 'store_coords'] = store_coords
# now that we have coordinates, we can drop the Store Location column
data = data.drop(['Store Location'], axis = 1)
Now that we have extracted the coordinates into their own column, we have also dropped Store Location, since the full addresses no longer matter.
Since there are bound to be rows in data that do not have coordinates, we have to ignore those when plotting in folium, which can easily be done with a list comprehension.
store_coords_lst = data['store_coords'].tolist()
store_coords_lst = [i for i in store_coords_lst if i]
liquor_map = folium.Map(
location = [42.229799658046346, -93.59540792864269],
zoom_start = 8
)
liquor_map.add_child(
plugins.HeatMap(
data = store_coords_lst,
radius = 15
)
)
Based on our sample data, we can see from the red spots on our heatmap that liquor transactions happen most frequently in 9 cities: Des Moines, Council Bluffs, Sioux City, Ames, Waterloo, Cedar Rapids, Iowa City, Davenport, and Dubuque.
From this heatmap alone, we might say that these nine cities would be great places to open liquor shops due to the high traffic. However, if we zoom in on our map, we can see that there is also much more competition in those locations.
From now on, the counties containing one or more of these 9 cities will be referred to as hotspot counties, and all other counties will be labelled as regular counties.
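As a rough cross-check of what the heatmap shows, we can count transactions per city in our sample. This is only a quick sanity check, and it relies on the City values having been lowercased in section 1.3:
# count transactions per city and show the top 10
print(data['City'].value_counts().head(10))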
Knowing now that there are certain areas in Iowa that are liquor sale hotspots, we would now like to know how these hotspot counties affect liquor sales. Specifically, we would like to know how they affect:
State Bottle Retail
Sale (Dollars)
Bottles Sold
The first step in knowing this is to distinguish the hotspot counties.
target_cities = {
'des moines',
'council bluffs',
'sioux city',
'ames',
'waterloo',
'cedar rapids',
'iowa city',
'davenport',
'dubuque'
}
# collect target_counties
target_counties = set()
for _, row in data.iterrows():
    if row['City'] in target_cities:
        target_counties.add(row['County'])
# output target_counties
print('Our target counties are: ')
target_counties
Liquor hotspots reside in the 10 counties shown in target_counties. Since we would like to compare these 10 counties against more typical Iowa counties without major liquor sale hotspots, let's randomly select 30 Iowa counties that are not among the 10 target_counties to represent the rest of Iowa.
# now let's get 30 random counties that aren't in target_counties
thirty_more = None
# simulate a do while loop
while True:
    thirty_more = np.random.choice(
        data.County.unique(),
        30,
        replace = False
    )
    has_duplicates = np.any(np.in1d(
        thirty_more,
        np.array(list(target_counties))
    ))
    if not has_duplicates:
        break
print('Our thirty regular counties are: ')
thirty_more
Now that we have our 10 "hotspot counties" and 30 "regular counties", we'll use seaborn to plot graphs comparing how each of these groups relates to the three variables listed above: State Bottle Retail, Sale (Dollars), and Bottles Sold.
Violin plots are the best option here since they are like box plots but also show the distribution of the data.
# new dataframe with just observations from target counties
target_counties_data = data.loc[data['County'].isin(target_counties)]
target_counties_data = target_counties_data[
target_counties_data.index.isin(data_nofliers.index)
].copy()
# new dataframe with just observations from regular counties
thirty_more_data = data.loc[data['County'].isin(thirty_more)]
thirty_more_data = thirty_more_data[
thirty_more_data.index.isin(data_nofliers.index)
].copy()
from matplotlib import gridspec
# function to plot out data from target_counties_data
# and thirty_more_data
def plot(attribute_name, attribute_title):
    fig = plt.figure(figsize=(30, 10))
    # 10 target counties, 30 regular counties => 10:30 width ratio
    gs = gridspec.GridSpec(1, 2, width_ratios=[10, 30])
    # target counties chart
    ax1 = plt.subplot(gs[0])
    graph = sns.violinplot(
        x = 'County',
        y = attribute_name,
        data = target_counties_data,
        ax = ax1
    )
    graph.set_xticklabels(graph.get_xticklabels(), rotation=45)
    graph.set_title("Popular Counties vs " + attribute_title)
    # regular counties chart
    ax2 = plt.subplot(gs[1])
    graph2 = sns.violinplot(
        x = 'County',
        y = attribute_name,
        data = thirty_more_data,
        ax = ax2
    )
    graph2.set_xticklabels(graph2.get_xticklabels(), rotation=90)
    graph2.set_title("Regular Counties vs " + attribute_title)
    plt.show()
Right off the bat, we see that the graphs for our hotspot counties have thicker and wider violins for each county, meaning that more alcohol sales happen in those counties. This is expected given our heatmap, and it simply reflects that we have more data for those counties.
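The width difference largely reflects sample size; a quick count of observations per county in each group (using the two dataframes built above) makes this concrete:
# number of observations per county in each group
print('Observations per hotspot county:')
print(target_counties_data['County'].value_counts())
print('Observations per regular county:')
print(thirty_more_data['County'].value_counts())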
# %%%%%%%%%%%%%%%%%%%%% COUNTIES VS LIQUOR BOTTLE PRICES %%%%%%%%%%%%%%%%%%%%%
plot('State Bottle Retail', 'Liquor Bottle Prices')
In this first row of graphs, we can observe that across the hotspot counties, a bottle of liquor retails on average for around $12, with most bottle prices ranging from $5 to $20.
Across our "regular counties", there was slightly more variation in liquor prices, but the average bottle also seemed to cost around $10 to $20, not much different from liquor in our hotspot counties.
This tells us that liquor prices do not seem to be significantly marked up or down in hotspot counties compared to regular counties, or vice versa.
# %%%%%%%%%%%%%%%% COUNTIES VS AMOUNT SPENT PER TRANSACTION %%%%%%%%%%%%%%%%
plot('Sale (Dollars)', 'Amount Spent Per Transaction')
When comparing hotspot and regular counties on Sale (Dollars), we can see again that on average, across all counties, most people spend around $50 to $100 each time they visit the liquor store.
We also notice that there is a lot more consistency in the graph for the hotspot counties than for the regular counties: the peaks of the violins for the hotspot counties sit relatively close to one another, whereas the peaks of the violins for the regular counties vary considerably.
# %%%%%%%%%%%%%%%% COUNTIES VS BOTTLES SOLD PER TRANSACTION %%%%%%%%%%%%%%%%
plot('Bottles Sold', 'Bottles Sold Per Transaction')
Again, for this data, it is difficult to tell if liquor stores in hotspot counties see an increase or decrease in bottles sold per transaction compared to liquor stores in other counties.
The violins in each of these graphs have multiple modes, probably because vendors typically sell liquor in packs of 1, 6, or 12 bottles.
Up until now, we have been focusing on how different counties in Iowa affect Liquor sales. One thing that we did not yet take into consideration is time, specifically, time of the year. Could it be possible that people tend to spend more on liquor or visit liquor stores more often during certain times of the year?
To find out, we plan on making an overlay graph, plotting the amount spent per transaction and the total number of liquor sales against the month of the transaction.
To do this, we need two things: a month column in data_nofliers, so that we can categorize sales by month, and a second dataframe with columns month and total number of sales, which we will derive from data.
data_nofliers['month'] = pd.DatetimeIndex(data_nofliers['Date']).month
print('month column added to end of data_nofliers')
data_nofliers
# making the dataframe storing
# month and number of sales per month
from collections import defaultdict
month_dict = defaultdict(lambda: 0)
for date in data['Date']:
    month_dict[date.month - 1] += 1
month_dict_df = pd.DataFrame(month_dict.items(), columns=["month", "total number of sales"])
print('The other dataframe with columns: month, total number of sales')
month_dict_df
fig, ax = plt.subplots(figsize=(20, 10))
# strip plot of individual transaction amounts by month
sns.stripplot(
    x = 'month',
    y = 'Sale (Dollars)',
    data = data_nofliers,
    ax = ax
)
# second y-axis for the total-sales line
ax2 = ax.twinx()
sns.lineplot(
    x = 'month',
    y = 'total number of sales',
    data = month_dict_df,
    color = 'r',
    linewidth = 4,
    ax = ax2
)
plt.title("Amount Spent per Transaction and Total Number of Sales vs. Month of Year")
plt.show()
From this graph, we can see that across all months, the distribution of money spent buying liquor stays relatively constant. While there are several high rollers every month, most people spend under $350 when going to the liquor store regardless of the month.
However, we can see that the number of liquor sales per month does change over the course of the year; the trend is easiest to read off the red line, and we can also rank the months directly, as shown below.
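A quick way to rank the months numerically is to sort the counts in the month_dict_df dataframe built above (recall that month 0 corresponds to January):
# rank months by total number of sales (month 0 = January, 11 = December)
print(month_dict_df.sort_values('total number of sales', ascending=False))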
Imports
The only package that we will be using here is statsmodels.stats.weightstats, which we will use to perform a two-sample hypothesis test.
from statsmodels.stats import weightstats as stests
Based on the graphs that we generated above that plot liquor sales by county, we noticed that for the most part, the amount of money spent buying liquor did not change much across both hotspot counties and regular counties.
We can use a two-sample hypothesis test to check whether this is actually the case, i.e., whether the true average amount of money spent on liquor per transaction is the same across hotspot counties and regular counties.
Here are our null and alternative hypotheses:
$ \begin{align*} H_0 & : \mu_\mathrm{hotspot\,counties} = \mu_\mathrm{regular\,counties} \\ H_a & : \mu_\mathrm{hotspot\,counties} \neq \mu_\mathrm{regular\,counties} \end{align*} $
First, let's split data_nofliers into two dataframes: one for hotspot counties called target_counties_data and one for regular counties called regular_counties_data.
target_counties_data = data_nofliers.loc[data_nofliers['County'].isin(target_counties)]
regular_counties_data = data_nofliers.loc[~data_nofliers['County'].isin(target_counties)]
Here’s the code for our two-sample hypothesis test. We use a z-test since our sample sizes for both hotspot counties and regular counties are over 30.
ztest, pval = stests.ztest(
x1 = target_counties_data['Sale (Dollars)'],
x2 = regular_counties_data['Sale (Dollars)'],
value = 0,
alternative = 'two-sided'
)
print("p-value = " + str(pval))
Surprisingly, the p-value comes out to 2.8020461450061577e-58, which is far below alpha = 0.05, so based on our sample data we have enough evidence to reject the null hypothesis at a significance level of 0.05.
So although our graphs show that the amounts spent on liquor per transaction look roughly equal across Iowa counties, hotspot or not, we have enough evidence to conclude that the true average amount spent per transaction in hotspot counties is different from the true average amount spent per transaction in regular counties.
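To put this in perspective, it helps to look at the sample means themselves (a quick check on the two dataframes defined above); with samples this large, even a modest difference in means can yield a tiny p-value:
# compare the sample means of Sale (Dollars) in the two groups
print('Mean sale, hotspot counties: $%.2f' % target_counties_data['Sale (Dollars)'].mean())
print('Mean sale, regular counties: $%.2f' % regular_counties_data['Sale (Dollars)'].mean())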
Imports
For data preprocessing, we will use sklearn.preprocessing.LabelEncoder, and for multiple Ordinary Least Squares (OLS) linear regression, we will use statsmodels.api.
from sklearn.preprocessing import LabelEncoder
import statsmodels.api as sm
After conducting hypothesis testing, we now know that the amounts spent on liquor per transaction are not the same across Iowa. We would like to analyze this further and use multiple linear regression to see which other factors influence how much is spent on alcohol. The features we consider the most likely potential influences are:
Feature Name | Column Name |
---|---|
City | City |
Zip Code | Zip Code |
County Number | County Number |
County | County |
Liquor Category Name | Category Name |
Liquor Category ID | Category |
Month | month (to be created) |
Date (epoch) | epoch (to be created) |
The target (dependent variable) that we wish to predict is how much people spend on alcohol each time they go and buy it, which is Sale (Dollars). However, since how much is spent on alcohol depends on other quantities like the volume of alcohol bought and the number of bottles sold, we will run separate linear regressions to see if we can identify a linear relationship with those as well. To clarify, here are the targets that we will try to run linear regression on:
Target Name | Column Name |
---|---|
Amount Spent on Liquor per Transaction | Sale (Dollars) |
Volume of Alcohol Bought | Volume Sold (Liters) |
Number of Bottles Bought | Bottles Sold |
Because all of our features except Date are categorical, we need to encode them so that they can be interpreted quantitatively before we can use them in linear regression. First, we will convert our Date column into an epoch, or Unix timestamp: the number of seconds that have elapsed since January 1, 1970. Second, for the rest of the features, we will use the LabelEncoder from sklearn.preprocessing for categorical encoding, which turns each distinct value in a column into a number.
ml_df = data[data.index.isin(data_nofliers.index)].copy()
# creating the month index
ml_df['month'] = pd.DatetimeIndex(ml_df['Date']).month
ml_df['epoch'] = ml_df.Date.values.astype(np.int64) // 10 ** 9
ml_df2 = ml_df.copy()
label_encoder = LabelEncoder()
cat_features = [
'City',
'Zip Code',
'County Number',
'County',
'Category Name',
'Category',
'month',
]
for feature in cat_features:
    ml_df2[feature + " encoded"] = label_encoder.fit_transform(ml_df2[feature])
To conduct multiple linear regression, we have to define X and y, our independent variables and dependent variable respectively. Since this is multiple linear regression, our X will be the dataframe slice containing our feature columns, and our y will be either Sale (Dollars), Volume Sold (Liters), or Bottles Sold.
features = ml_df2.loc[:, [
'City encoded',
'Zip Code encoded',
'County Number encoded',
'County encoded',
'Category Name encoded',
'Category encoded',
'month encoded',
'epoch',
]]
# linear regression function
def linear_regression(target):
    global features
    global ml_df2
    X = features
    y = ml_df2[target]
    ols_model = sm.OLS(y, X).fit()
    print(ols_model.summary())
# function to plot scatter plots after linear regression
def regplot(x1, x2, y):
    global ml_df2
    fig = plt.figure(figsize=(30, 10))
    gs = gridspec.GridSpec(1, 2, width_ratios=[1, 1])
    ax1 = plt.subplot(gs[0])
    graph = sns.regplot(
        x = x1,
        y = y,
        data = ml_df2,
        ax = ax1,
        line_kws = {"color": "red"}
    )
    graph.set_xticklabels(graph.get_xticklabels(), rotation=45)
    graph.set_title(x1 + ' vs ' + y)
    ax2 = plt.subplot(gs[1])
    graph2 = sns.regplot(
        x = x2,
        y = y,
        data = ml_df2,
        ax = ax2,
        color = 'g',
        line_kws = {"color": "red"}
    )
    graph2.set_xticklabels(graph2.get_xticklabels(), rotation=90)
    graph2.set_title(x2 + ' vs ' + y)
    plt.show()
Sale (Dollars)
linear_regression('Sale (Dollars)')
regplot('Category Name encoded', 'County encoded', 'Sale (Dollars)')
From the readout above, we can see that the R-squared value is 0.542, meaning the model explains 54.2% of the variability of Sale (Dollars) around its mean.
Although our R-squared value is fairly low, we can see from the P > |t| column that the only features with p-values greater than our significance level of 0.05 are Zip Code and month, which means that we identified a linear correlation between every other feature and Sale (Dollars). Despite this, most of our coefficients are close to 0, so it is hard to tell whether any of the features we selected really affect Sale (Dollars).
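If you would rather pull these numbers out programmatically than read them off the printed summary, statsmodels exposes them on the fitted results object. Below is a minimal sketch that refits the same model, since linear_regression does not return its fitted model:
# refit the OLS model for Sale (Dollars) and inspect a few key statistics
ols_sale = sm.OLS(ml_df2['Sale (Dollars)'], features).fit()
print('R-squared:', ols_sale.rsquared)
# features whose coefficients are not significant at the 0.05 level
print(ols_sale.pvalues[ols_sale.pvalues > 0.05])
# coefficients, sorted by absolute magnitude (largest first)
print(ols_sale.params.reindex(ols_sale.params.abs().sort_values(ascending=False).index))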
Even so, out of all our features, the variables that seem to have the greatest impact on how much people spend on liquor are the two plotted above: liquor category and county.
Volume Sold (Liters)
linear_regression('Volume Sold (Liters)')
regplot('Category Name encoded', 'month encoded', 'Volume Sold (Liters)')
With Volume Sold (Liters) as the dependent variable, we performed a little worse than with Sale (Dollars), with an R-squared value of 0.509.
The only feature here with a p-value greater than the 0.05 significance level was City. All the others were essentially 0, meaning there is enough evidence to conclude that a linear relationship exists between them and Volume Sold (Liters). However, just as in our last linear regression model, all of the coefficients are quite close to zero, so once again it is difficult to discern any legitimate correlations.
The variables that affected Volume Sold (Liters) the most were the two plotted above: liquor category and month.
Bottles Sold
linear_regression('Bottles Sold')
target = ml_df2['Bottles Sold']
regplot('Category Name encoded', 'County encoded', 'Bottles Sold')
We see that history repeats itself with Bottles Sold as our dependent variable. Surprisingly, this model performs the best, with an R-squared value of 0.561. All of the p-values associated with our feature coefficients were under the 0.05 significance level, with the exception of City and Category ID.
Also identical to the last two linear regression models, our feature coefficients are all very close to 0.
The variables that affected Bottles Sold the most were, once again, the two plotted above: liquor category and county.
After performing multiple linear regression, we noticed that the features that most influenced the amount people spend on liquor, the volume of liquor bought, and the number of bottles bought were the category of liquor being bought, the county that a store is in, and the month of the sale.
However, even though this was our conclusion, we have to address that the coefficients for these variables were all close to 0 and that all of our R-squared values were at best a little higher than 0.5. From this, we can say that our linear regression models for all 3 dependent variables did not fit very well, and liquor category, county, and month only affected liquor sales slightly.
Even so, the fact that the coefficients for these 3 properties were the most positive suggests that these are the more important factors for current and aspiring store owners to consider when selling liquor.
In this project, we analyzed liquor sales in the midwestern state of Iowa and went through each of the three critical steps in the data science pipeline: data curation, parsing, and management; exploratory data analysis; and finally, hypothesis testing and machine learning.
Exploratory Data Analysis
During exploratory data analysis, we generated a heatmap that showed us which locations in Iowa are hotspots for liquor sales. Through this, we then drew up several charts comparing liquor sales in hotspot counties and regular counties in Iowa. Finally, we created an overlay plot seeing how the time of the year affected liquor sales.
Hypothesis Testing and Machine Learning
In hypothesis testing and machine learning, we learned that, on average, location does have an effect on how much Iowans spend buying liquor. Using multiple linear regression, we tried to analyze these sales further to determine what factors most influence how much Iowans spend on liquor. Our Ordinary Least Squares linear regression models told us that these factors were the type of liquor sold at a store as well as the county a store is located in. However, in the end, it was hard to say whether these two properties have a legitimate impact on liquor sales, since all of our coefficient values were near 0.
Even though we were unsuccessful in determining what affects liquor sales in Iowa with our particular CSV, it does not mean that there are no factors at all that influence sales. We simply need to continue digging.
Closing
Overall, through its ability to extract meaning and patterns out of seemingly random data, data science with Python is an extremely powerful tool that can change the way the world works in the digital era. While today, we analyzed data about liquor sales in Iowa, the three main steps in the data science pipeline can be adapted to any type of data.