본문 바로가기

머신러닝

파이썬_쇼핑몰 매출 분석2 (월별,요일별,시간별 매출 분석, top 판매 제품, top매출 아이템 판매량 추이 분석)

[실습]Ch02. 매출, 가장 많이 팔린 아이템 확인하기

목표

  1. 아이템별 지표 확인하기
  2. 시간별 지역별 판매 지표 확인하기
In [1]:
import numpy as np
import pandas as pd
# seaborn
import seaborn as sns
COLORS = sns.color_palette()

%matplotlib inline

데이터 로딩

  1. 정제된 데이터 사용(retail.csv)
In [2]:
dtypes = {
    'UnitPrice': np.float32,
    'CustomerID': np.int32,
    'Quantity': np.int32
}
retail = pd.read_csv('./OnlineRetailClean.csv', dtype=dtypes)
retail.head()
Out[2]:
Unnamed: 0 InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country CheckoutPrice
0 0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850 United Kingdom 15.30
1 1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850 United Kingdom 20.34
2 2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850 United Kingdom 22.00
3 3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850 United Kingdom 20.34
4 4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850 United Kingdom 20.34

날짜 타입 데이터 변환

  • 문자열로 로딩하는 것보다 date/datetime 타입으로 로딩하는 것이 분석에 용이
In [3]:
retail.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397884 entries, 0 to 397883
Data columns (total 10 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Unnamed: 0     397884 non-null  int64  
 1   InvoiceNo      397884 non-null  int64  
 2   StockCode      397884 non-null  object 
 3   Description    397884 non-null  object 
 4   Quantity       397884 non-null  int32  
 5   InvoiceDate    397884 non-null  object 
 6   UnitPrice      397884 non-null  float32
 7   CustomerID     397884 non-null  int32  
 8   Country        397884 non-null  object 
 9   CheckoutPrice  397884 non-null  float64
dtypes: float32(1), float64(1), int32(2), int64(2), object(4)
memory usage: 25.8+ MB
In [4]:
retail['InvoiceDate']=pd.to_datetime(retail['InvoiceDate'],infer_datetime_format=True)
retail.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397884 entries, 0 to 397883
Data columns (total 10 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   Unnamed: 0     397884 non-null  int64         
 1   InvoiceNo      397884 non-null  int64         
 2   StockCode      397884 non-null  object        
 3   Description    397884 non-null  object        
 4   Quantity       397884 non-null  int32         
 5   InvoiceDate    397884 non-null  datetime64[ns]
 6   UnitPrice      397884 non-null  float32       
 7   CustomerID     397884 non-null  int32         
 8   Country        397884 non-null  object        
 9   CheckoutPrice  397884 non-null  float64       
dtypes: datetime64[ns](1), float32(1), float64(1), int32(2), int64(2), object(3)
memory usage: 25.8+ MB

해당 기간 동안의 매출

  • 전체 매출
  • 국가별 매출
  • 월별 매출
  • 요일별 매출
  • 시간별 매출

전체 매출

In [5]:
retail['CheckoutPrice'].sum()
Out[5]:
8911407.904

국가별 매출

In [6]:
retail.groupby('Country').sum()['CheckoutPrice'].sort_values()
Out[6]:
Country
Saudi Arabia            1.459200e+02
Bahrain                 5.484000e+02
Czech Republic          8.267400e+02
RSA                     1.002310e+03
Brazil                  1.143600e+03
European Community      1.300250e+03
Lithuania               1.661060e+03
Lebanon                 1.693880e+03
United Arab Emirates    1.902280e+03
Unspecified             2.667070e+03
Malta                   2.725590e+03
USA                     3.580390e+03
Canada                  3.666380e+03
Iceland                 4.310000e+03
Greece                  4.760520e+03
Israel                  7.221690e+03
Poland                  7.334650e+03
Austria                 1.019868e+04
Cyprus                  1.359038e+04
Italy                   1.748324e+04
Denmark                 1.895534e+04
Channel Islands         2.045044e+04
Singapore               2.127929e+04
Finland                 2.254608e+04
Portugal                3.343989e+04
Norway                  3.616544e+04
Japan                   3.741637e+04
Sweden                  3.837833e+04
Belgium                 4.119634e+04
Switzerland             5.644395e+04
Spain                   6.157711e+04
Australia               1.385213e+05
France                  2.090240e+05
Germany                 2.288671e+05
EIRE                    2.655459e+05
Netherlands             2.854463e+05
United Kingdom          7.308392e+06
Name: CheckoutPrice, dtype: float64

월별 매출

In [7]:
retail.head()
Out[7]:
Unnamed: 0 InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country CheckoutPrice
0 0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom 15.30
1 1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 20.34
2 2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom 22.00
3 3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 20.34
4 4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 20.34
In [8]:
retail.set_index('InvoiceDate') 
Out[8]:
Unnamed: 0 InvoiceNo StockCode Description Quantity UnitPrice CustomerID Country CheckoutPrice
InvoiceDate
2010-12-01 08:26:00 0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2.55 17850 United Kingdom 15.30
2010-12-01 08:26:00 1 536365 71053 WHITE METAL LANTERN 6 3.39 17850 United Kingdom 20.34
2010-12-01 08:26:00 2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2.75 17850 United Kingdom 22.00
2010-12-01 08:26:00 3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 3.39 17850 United Kingdom 20.34
2010-12-01 08:26:00 4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 3.39 17850 United Kingdom 20.34
... ... ... ... ... ... ... ... ... ...
2011-12-09 12:50:00 541904 581587 22613 PACK OF 20 SPACEBOY NAPKINS 12 0.85 12680 France 10.20
2011-12-09 12:50:00 541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 2.10 12680 France 12.60
2011-12-09 12:50:00 541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 4.15 12680 France 16.60
2011-12-09 12:50:00 541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 4.15 12680 France 16.60
2011-12-09 12:50:00 541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 4.95 12680 France 14.85

397884 rows × 9 columns

In [9]:
#def extract_month(date):
#    month = str(date.month)
#    if date.month <10:
#        month='0'+month
#    return str(date.year)+month

def extract_month(date):
    return str(date.year)+str(date.month).rjust(2,'0')

# rjust사용하기 위해서는 string으로 변환 먼저 해줘야 함
In [10]:
#grouping하기 위해서 해당 변수를 우선 index로 만들기
rev_by_month=retail.set_index('InvoiceDate').groupby(extract_month).sum()['CheckoutPrice']
rev_by_month
Out[10]:
201012     572713.890
201101     569445.040
201102     447137.350
201103     595500.760
201104     469200.361
201105     678594.560
201106     661213.690
201107     600091.011
201108     645343.900
201109     952838.382
201110    1039318.790
201111    1161817.380
201112     518192.790
Name: CheckoutPrice, dtype: float64
In [11]:
def plot_bar(df, xlabel,ylabel,title,color=COLORS[0], figsize=(30,10), rotation=45):
    plot=df.plot(kind='bar')
    plot.set_xlabel(xlabel,fontsize=11)
    plot.set_ylabel(ylabel,fontsize=11)
    plot.set_title(title, fontsize=13)
    plot.set_xticklabels(labels=df.index, rotation=rotation)
    
plot_bar(rev_by_month,'Month','Revenue','Revenue by Month')

요일별 매출

In [12]:
rev_by_dow=retail.set_index('InvoiceDate').groupby(lambda date:date.dayofweek).sum()['CheckoutPrice']
rev_by_dow #0이 월요일, 5가 토요일, 6이 일요일
Out[12]:
0    1367146.411
1    1700634.631
2    1588336.170
3    1976859.070
4    1485917.401
6     792514.221
Name: CheckoutPrice, dtype: float64
In [13]:
plot_bar(rev_by_dow,'DOW','Revenue','Revenue by Day of Week')
In [14]:
Day_week=np.array(['Mon','Tue','Wed','Thu','Fri','Sat','Sun'])
Day_week
Out[14]:
array(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], dtype='<U3')
In [15]:
rev_by_dow.index=Day_week[rev_by_dow.index]
plot_bar(rev_by_dow,'DOW','Revenue','Revenue by Day of Week')

시간별 매출

In [16]:
rev_by_hour=retail.set_index('InvoiceDate').groupby(lambda date:date.hour).sum()['CheckoutPrice']
plot_bar(rev_by_hour,'Hour','Revenue','Revenue by Hour')

매출 데이터로부터 insight

  • 전체 매출의 82%가 UK에서 발생
  • 11년도의 가장 많은 주문이 발생한 달 11월(12월의 전체 데이터가 반영이 되진 않았음)
    • 11, 12월의 판매량이 압도(블랙프라이데이, 사이버먼데이, 크리스마스 휴일)
  • 일주일중 목요일까지는 성장세를 보이다가, 이후로 하락(토요일에는 주문X)
  • 7시를 시작으로 주문이 시작되어 12시까지 증가세, 15시까지 하락을, 15시 이후 부터 급락)

제품별 metrics

  • Top 10 판매 제품
  • Top 10 매출 제품
In [17]:
#Top 3 판매 제품
top_selling=retail.groupby('StockCode').sum()['Quantity'].sort_values(ascending=False)[:3]
top_selling
Out[17]:
StockCode
23843    80995
23166    77916
84077    54415
Name: Quantity, dtype: int32
In [18]:
#Top 10 매출 제품
top_revenue=retail.groupby('StockCode').sum()['CheckoutPrice'].sort_values(ascending=False)[:10]
top_revenue
Out[18]:
StockCode
23843     168469.60
22423     142592.95
85123A    100603.50
85099B     85220.78
23166      81416.73
POST       77803.96
47566      68844.33
84879      56580.34
M          53779.93
23084      51346.20
Name: CheckoutPrice, dtype: float64

top 3 아이템의 월별 판매량 추이

In [19]:
#전체 아이템에 대한 월별 판매량 추이
retail.set_index('InvoiceDate').groupby(['StockCode',extract_month]).sum()[['Quantity','CheckoutPrice']]
Out[19]:
Quantity CheckoutPrice
StockCode
10002 201012 224 190.40
201101 337 286.45
201102 50 42.50
201103 23 19.55
201104 189 160.65
... ... ... ...
POST 201108 237 5185.95
201109 279 5894.50
201110 338 7993.50
201111 460 10349.95
201112 123 2760.00

30407 rows × 2 columns

In [20]:
#top3 아이템에 대한 월별 판매량 추이
monthly_top3=retail.set_index('InvoiceDate').groupby(['StockCode',extract_month]).sum()[['Quantity','CheckoutPrice']].loc[top_selling.index]
monthly_top3
Out[20]:
Quantity CheckoutPrice
StockCode
23166 201101 74215 77183.60
201105 792 869.04
201106 391 458.51
201107 718 826.94
201108 405 486.09
201109 342 397.26
201110 235 283.67
201111 631 708.11
201112 187 203.51
23843 201112 80995 168469.60
84077 201012 5139 1150.47
201101 1488 385.44
201102 3457 795.17
201103 3888 943.20
201104 10224 2281.44
201105 4944 1249.44
201106 1920 533.76
201107 3600 982.56
201108 2256 654.24
201109 3462 985.70
201110 8174 1953.98
201111 4500 1294.20
201112 1363 376.65
In [21]:
plot_bar(monthly_top3['CheckoutPrice'],'Product/Month','Revenue','top3 items')
In [ ]: