Machine Learning

Python: Titanic Survival Prediction

titaninc_0907
In [2]:
pip install plotly
Requirement already satisfied: plotly in c:\users\flyto\anaconda3\lib\site-packages (4.9.0)
Requirement already satisfied: six in c:\users\flyto\anaconda3\lib\site-packages (from plotly) (1.15.0)
Requirement already satisfied: retrying>=1.3.3 in c:\users\flyto\anaconda3\lib\site-packages (from plotly) (1.3.3)
Note: you may need to restart the kernel to use updated packages.
In [3]:
pip install cufflinks
Requirement already satisfied: cufflinks in c:\users\flyto\anaconda3\lib\site-packages (0.17.3)
Requirement already satisfied: ipython>=5.3.0 in c:\users\flyto\anaconda3\lib\site-packages (from cufflinks) (7.16.1)
Requirement already satisfied: colorlover>=0.2.1 in c:\users\flyto\anaconda3\lib\site-packages (from cufflinks) (0.3.0)
Requirement already satisfied: numpy>=1.9.2 in c:\users\flyto\anaconda3\lib\site-packages (from cufflinks) (1.18.5)
Requirement already satisfied: six>=1.9.0 in c:\users\flyto\anaconda3\lib\site-packages (from cufflinks) (1.15.0)
Requirement already satisfied: setuptools>=34.4.1 in c:\users\flyto\anaconda3\lib\site-packages (from cufflinks) (49.2.0.post20200714)
Requirement already satisfied: ipywidgets>=7.0.0 in c:\users\flyto\anaconda3\lib\site-packages (from cufflinks) (7.5.1)
Requirement already satisfied: plotly>=4.1.1 in c:\users\flyto\anaconda3\lib\site-packages (from cufflinks) (4.9.0)
Requirement already satisfied: pandas>=0.19.2 in c:\users\flyto\anaconda3\lib\site-packages (from cufflinks) (1.0.5)
Requirement already satisfied: jedi>=0.10 in c:\users\flyto\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (0.17.1)
Requirement already satisfied: decorator in c:\users\flyto\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (4.4.2)
Requirement already satisfied: backcall in c:\users\flyto\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (0.2.0)
Requirement already satisfied: pygments in c:\users\flyto\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (2.6.1)
Requirement already satisfied: pickleshare in c:\users\flyto\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (0.7.5)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in c:\users\flyto\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (3.0.5)
Requirement already satisfied: colorama; sys_platform == "win32" in c:\users\flyto\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (0.4.3)
Requirement already satisfied: traitlets>=4.2 in c:\users\flyto\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (4.3.3)
Requirement already satisfied: ipykernel>=4.5.1 in c:\users\flyto\anaconda3\lib\site-packages (from ipywidgets>=7.0.0->cufflinks) (5.3.2)
Requirement already satisfied: widgetsnbextension~=3.5.0 in c:\users\flyto\anaconda3\lib\site-packages (from ipywidgets>=7.0.0->cufflinks) (3.5.1)
Requirement already satisfied: nbformat>=4.2.0 in c:\users\flyto\anaconda3\lib\site-packages (from ipywidgets>=7.0.0->cufflinks) (5.0.7)
Requirement already satisfied: retrying>=1.3.3 in c:\users\flyto\anaconda3\lib\site-packages (from plotly>=4.1.1->cufflinks) (1.3.3)
Requirement already satisfied: pytz>=2017.2 in c:\users\flyto\anaconda3\lib\site-packages (from pandas>=0.19.2->cufflinks) (2020.1)
Requirement already satisfied: python-dateutil>=2.6.1 in c:\users\flyto\anaconda3\lib\site-packages (from pandas>=0.19.2->cufflinks) (2.8.1)
Requirement already satisfied: parso<0.8.0,>=0.7.0 in c:\users\flyto\anaconda3\lib\site-packages (from jedi>=0.10->ipython>=5.3.0->cufflinks) (0.7.0)
Requirement already satisfied: wcwidth in c:\users\flyto\anaconda3\lib\site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=5.3.0->cufflinks) (0.2.5)
Requirement already satisfied: ipython-genutils in c:\users\flyto\anaconda3\lib\site-packages (from traitlets>=4.2->ipython>=5.3.0->cufflinks) (0.2.0)
Requirement already satisfied: tornado>=4.2 in c:\users\flyto\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (6.0.4)
Requirement already satisfied: jupyter-client in c:\users\flyto\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (6.1.6)
Requirement already satisfied: notebook>=4.4.1 in c:\users\flyto\anaconda3\lib\site-packages (from widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (6.0.3)
Requirement already satisfied: jupyter-core in c:\users\flyto\anaconda3\lib\site-packages (from nbformat>=4.2.0->ipywidgets>=7.0.0->cufflinks) (4.6.3)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in c:\users\flyto\anaconda3\lib\site-packages (from nbformat>=4.2.0->ipywidgets>=7.0.0->cufflinks) (3.2.0)
Requirement already satisfied: pyzmq>=13 in c:\users\flyto\anaconda3\lib\site-packages (from jupyter-client->ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (19.0.1)
Requirement already satisfied: prometheus-client in c:\users\flyto\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.8.0)
Requirement already satisfied: Send2Trash in c:\users\flyto\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (1.5.0)
Requirement already satisfied: jinja2 in c:\users\flyto\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (2.11.2)
Requirement already satisfied: nbconvert in c:\users\flyto\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (5.6.1)
Requirement already satisfied: terminado>=0.8.1 in c:\users\flyto\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.8.3)
Requirement already satisfied: pywin32>=1.0; sys_platform == "win32" in c:\users\flyto\anaconda3\lib\site-packages (from jupyter-core->nbformat>=4.2.0->ipywidgets>=7.0.0->cufflinks) (227)
Requirement already satisfied: attrs>=17.4.0 in c:\users\flyto\anaconda3\lib\site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets>=7.0.0->cufflinks) (19.3.0)
Requirement already satisfied: pyrsistent>=0.14.0 in c:\users\flyto\anaconda3\lib\site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets>=7.0.0->cufflinks) (0.16.0)
Requirement already satisfied: MarkupSafe>=0.23 in c:\users\flyto\anaconda3\lib\site-packages (from jinja2->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (1.1.1)
Requirement already satisfied: bleach in c:\users\flyto\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (3.1.5)
Requirement already satisfied: defusedxml in c:\users\flyto\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.6.0)
Requirement already satisfied: mistune<2,>=0.8.1 in c:\users\flyto\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.8.4)
Requirement already satisfied: pandocfilters>=1.4.1 in c:\users\flyto\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (1.4.2)
Requirement already satisfied: testpath in c:\users\flyto\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.4.4)
Requirement already satisfied: entrypoints>=0.2.2 in c:\users\flyto\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.3)
Requirement already satisfied: webencodings in c:\users\flyto\anaconda3\lib\site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (0.5.1)
Requirement already satisfied: packaging in c:\users\flyto\anaconda3\lib\site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (20.4)
Requirement already satisfied: pyparsing>=2.0.2 in c:\users\flyto\anaconda3\lib\site-packages (from packaging->bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.0.0->cufflinks) (2.4.7)
Note: you may need to restart the kernel to use updated packages.
In [4]:
pip install chart_studio
Collecting chart_studio
  Downloading chart_studio-1.1.0-py3-none-any.whl (64 kB)
Requirement already satisfied: retrying>=1.3.3 in c:\users\flyto\anaconda3\lib\site-packages (from chart_studio) (1.3.3)
Requirement already satisfied: requests in c:\users\flyto\anaconda3\lib\site-packages (from chart_studio) (2.24.0)
Requirement already satisfied: plotly in c:\users\flyto\anaconda3\lib\site-packages (from chart_studio) (4.9.0)
Requirement already satisfied: six in c:\users\flyto\anaconda3\lib\site-packages (from chart_studio) (1.15.0)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\flyto\anaconda3\lib\site-packages (from requests->chart_studio) (2020.6.20)
Requirement already satisfied: chardet<4,>=3.0.2 in c:\users\flyto\anaconda3\lib\site-packages (from requests->chart_studio) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in c:\users\flyto\anaconda3\lib\site-packages (from requests->chart_studio) (2.10)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in c:\users\flyto\anaconda3\lib\site-packages (from requests->chart_studio) (1.25.9)
Note: you may need to restart the kernel to use updated packages.
Installing collected packages: chart-studio
Successfully installed chart-studio-1.1.0
In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import chart_studio.plotly as py
import cufflinks as cf
cf.go_offline(connected=True)
In [6]:
#train=pd.read_csv('D:\\me\\mine\\python\\titanic\\train.csv',dtype={'Pclass':str})
train=pd.read_csv('C:\\Users\\flyto\\Documents\\me\\kaggle\\titanic\\train.csv',dtype={'Pclass':str})
In [7]:
train.head()
Out[7]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
In [8]:
#test=pd.read_csv('D:\\me\\mine\\python\\titanic\\test.csv',dtype={'Pclass':str})
test=pd.read_csv('C:\\Users\\flyto\\Documents\\me\\kaggle\\titanic\\test.csv',dtype={'Pclass':str})
In [9]:
test.head()
Out[9]:
   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked
0  892          3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   NaN    Q
1  893          3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7.0000   NaN    S
2  894          2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   NaN    Q
3  895          3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   NaN    S
4  896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  NaN    S
In [10]:
train.isnull().sum()
Out[10]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
In [11]:
import seaborn as sns
# histplot replaces sns.distplot, which is deprecated since seaborn 0.11
sns.histplot(train['Age'],bins=20,kde=True,stat='density')
plt.show()
In [12]:
train.dtypes
Out[12]:
PassengerId      int64
Survived         int64
Pclass          object
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
In [13]:
train['Age']=train['Age'].fillna(train['Age'].mean())
train.isnull().sum()
Out[13]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
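The mean fill above is applied to the training frame only. The following is a minimal sketch, on toy data rather than the Kaggle CSV, of computing the fill value once on the training set and reusing it for the test set so that both frames are filled consistently. The names `toy_train`, `toy_test`, and `age_fill` are hypothetical and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the Kaggle frames (made-up values, not the real data).
toy_train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 47.0]})

# Compute the fill value from the training data only...
age_fill = toy_train["Age"].mean()

# ...and reuse it for both frames, so the test set never influences it.
toy_train["Age"] = toy_train["Age"].fillna(age_fill)
toy_test["Age"] = toy_test["Age"].fillna(age_fill)

print(toy_train["Age"].isnull().sum(), toy_test["Age"].isnull().sum())
```

Whether the mean or the median is the better fill value depends on how skewed the age distribution is; the sketch keeps the mean to match the cell above.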
In [14]:
train['Age10']=(np.ceil(train['Age']/ 10) * 10).astype(int)
train.head()
Out[14]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  Age10
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         30
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         40
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         30
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S         40
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S         40
In [15]:
# histplot replaces sns.distplot, which is deprecated since seaborn 0.11
sns.histplot(train['Age10'],bins=10,kde=True,stat='density')
plt.show()
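The `Age10` column above rounds each age up to the next multiple of ten, so a 22-year-old lands in the 30 bin. The sketch below, on a handful of made-up ages, compares that ceiling rule with a floor rule and with `pd.cut`, which attaches an explicit interval to each row. The variable names and the bin edges are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, 26.0, 35.0, 0.42, 80.0], name="Age")  # toy values

# Ceiling-based bin, as in the notebook cell: 22 -> 30 (upper edge of the decade).
age10_ceil = (np.ceil(ages / 10) * 10).astype(int)

# Floor-based bin: 22 -> 20 (the decade the passenger is actually in).
age10_floor = (ages // 10 * 10).astype(int)

# pd.cut with explicit edges yields left-closed intervals such as [20, 30).
edges = list(range(0, 100, 10))
age_band = pd.cut(ages, bins=edges, right=False)

print(pd.DataFrame({"Age": ages, "ceil": age10_ceil,
                    "floor": age10_floor, "band": age_band}))
```

Either integer rule works as a model feature; the interval form is mainly easier to read in group-by summaries and plots.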
In [16]:
train.dtypes
Out[16]:
PassengerId      int64
Survived         int64
Pclass          object
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
Age10            int32
dtype: object
In [17]:
train['Cabin']
Out[17]:
0       NaN
1       C85
2       NaN
3      C123
4       NaN
       ... 
886     NaN
887     B42
888     NaN
889    C148
890     NaN
Name: Cabin, Length: 891, dtype: object
In [18]:
# numeric_only=True keeps only the numeric columns, matching the output below
train.groupby('Cabin').sum(numeric_only=True)
Out[18]:
       PassengerId  Survived  Age        SibSp  Parch  Fare     Age10
Cabin
A10    584          0         36.000000  0      0      40.1250  40
A14    476          0         29.699118  0      0      52.0000  30
A16    557          1         48.000000  1      0      39.6000  50
A19    285          0         29.699118  0      0      26.0000  30
A20    600          1         49.000000  1      0      56.9292  50
...    ...          ...       ...        ...    ...    ...      ...
F33    930          3         87.000000  0      0      34.0000  100
F38    777          0         29.699118  0      0      7.7500   30
F4     803          2         5.000000   4      2      78.0000  20
G6     864          2         59.000000  2      5      54.3250  80
T      340          0         45.000000  0      0      35.5000  50

147 rows × 7 columns

In [19]:
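# numeric_only=True is required on recent pandas, where mean() raises on text columns
train.groupby('Cabin').mean(numeric_only=True)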
Out[19]:
       PassengerId  Survived  Age        SibSp  Parch  Fare       Age10
Cabin
A10    584.0        0.0       36.000000  0.0    0.00   40.125000  40.000000
A14    476.0        0.0       29.699118  0.0    0.00   52.000000  30.000000
A16    557.0        1.0       48.000000  1.0    0.00   39.600000  50.000000
A19    285.0        0.0       29.699118  0.0    0.00   26.000000  30.000000
A20    600.0        1.0       49.000000  1.0    0.00   56.929200  50.000000
...    ...          ...       ...        ...    ...    ...        ...
F33    310.0        1.0       29.000000  0.0    0.00   11.333333  33.333333
F38    777.0        0.0       29.699118  0.0    0.00   7.750000   30.000000
F4     401.5        1.0       2.500000   2.0    1.00   39.000000  10.000000
G6     216.0        0.5       14.750000  0.5    1.25   13.581250  20.000000
T      340.0        0.0       45.000000  0.0    0.00   35.500000  50.000000

147 rows × 7 columns

In [20]:
train.groupby('Cabin').mean(numeric_only=True).sort_values(by='Survived',ascending=True)
Out[20]:
         PassengerId  Survived  Age        SibSp  Parch  Fare      Age10
Cabin
A10      584.0        0.0       36.000000  0.0    0.0    40.1250   40.0
B86      140.0        0.0       24.000000  0.0    0.0    79.2000   30.0
B94      264.0        0.0       40.000000  0.0    0.0    0.0000    40.0
C110     111.0        0.0       47.000000  0.0    0.0    52.0000   50.0
C111     453.0        0.0       30.000000  0.0    0.0    27.7500   30.0
...      ...          ...       ...        ...    ...    ...       ...
C92      652.0        1.0       39.349559  1.0    0.0    89.1042   40.0
B18      427.0        1.0       30.000000  0.0    1.0    57.9792   35.0
C90      711.0        1.0       24.000000  0.0    0.0    49.5042   30.0
D11      766.0        1.0       51.000000  1.0    0.0    77.9583   60.0
C62 C64  701.0        1.0       18.000000  1.0    0.0    227.5250  20.0

147 rows × 7 columns

In [21]:
train.groupby('Cabin').mean(numeric_only=True).sort_values(by='Fare',ascending=True).iplot(kind='line')
In [22]:
# pass x and y as keywords; positional data arguments are rejected by seaborn >= 0.12
sns.scatterplot(x=train['Fare'],y=train['Pclass'])
plt.show()
In [23]:
# pass x and y as keywords; positional data arguments are rejected by seaborn >= 0.12
sns.scatterplot(x=train['Cabin'],y=train['Survived'])
plt.show()
In [24]:
traindf=train.drop('Cabin',axis=1)
traindf.count()
Out[24]:
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            891
SibSp          891
Parch          891
Ticket         891
Fare           891
Embarked       889
Age10          891
dtype: int64
In [41]:
#test
traindf3=train.drop('Cabin',axis=1)
traindf3.count()
Out[41]:
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            891
SibSp          891
Parch          891
Ticket         891
Fare           891
Embarked       889
Age10          891
dtype: int64
In [42]:
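# .copy() makes traindf4 an independent frame, so the later .loc assignments
# do not raise SettingWithCopyWarning
traindf4=traindf3[traindf3['Embarked'].notna()].copy()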
traindf4.count()
Out[42]:
PassengerId    889
Survived       889
Pclass         889
Name           889
Sex            889
Age            889
SibSp          889
Parch          889
Ticket         889
Fare           889
Embarked       889
Age10          889
dtype: int64
In [56]:
traindf4['Sex']=='male'
Out[56]:
0       True
1      False
2      False
3      False
4       True
       ...  
886     True
887    False
888    False
889     True
890     True
Name: Sex, Length: 889, dtype: bool
In [64]:
traindf4.loc[traindf4['Sex']=='male', 'Sex'] = 1
traindf4.loc[traindf4['Sex']=='female', 'Sex'] = 0
traindf4.head()
Out[64]:
   PassengerId  Survived  Pclass  Name                                               Sex  Age   SibSp  Parch  Ticket            Fare     Embarked  Age10
0  1            0         3       Braund, Mr. Owen Harris                            1    22.0  1      0      A/5 21171         7.2500   S         30
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  0    38.0  1      0      PC 17599          71.2833  C         40
2  3            1         3       Heikkinen, Miss. Laina                             0    26.0  0      0      STON/O2. 3101282  7.9250   S         30
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       0    35.0  1      0      113803            53.1000  S         40
4  5            0         3       Allen, Mr. William Henry                           1    35.0  0      0      373450            8.0500   S         40
In [71]:
traindf4.loc[traindf4['Embarked']=='S', 'Embarked'] = 1
traindf4.loc[traindf4['Embarked']=='C', 'Embarked'] = 2
traindf4.loc[traindf4['Embarked']=='Q', 'Embarked'] = 3
traindf4.head()
Out[71]:
   PassengerId  Survived  Pclass  Name                                               Sex  Age   SibSp  Parch  Ticket            Fare     Embarked  Age10
0  1            0         3       Braund, Mr. Owen Harris                            1    22.0  1      0      A/5 21171         7.2500   1         30
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  0    38.0  1      0      PC 17599          71.2833  2         40
2  3            1         3       Heikkinen, Miss. Laina                             0    26.0  0      0      STON/O2. 3101282  7.9250   1         30
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       0    35.0  1      0      113803            53.1000  1         40
4  5            0         3       Allen, Mr. William Henry                           1    35.0  0      0      373450            8.0500   1         40
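The cells above encode `Sex` and `Embarked` by assigning integers through `.loc`. Coding the ports as 1, 2, 3 gives them an order that a linear model will treat as meaningful, although the ports are only labels. The sketch below shows one common alternative on a toy frame: a dictionary `map` for the binary column and `pd.get_dummies` for the port. The frame `toy` and its values are hypothetical, and this is a possible variation rather than the notebook's method.

```python
import pandas as pd

# Toy frame with the two categorical columns (made-up rows).
toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "Embarked": ["S", "C", "Q"]})

# Binary column: a dict map produces an integer dtype directly.
toy["Sex"] = toy["Sex"].map({"male": 1, "female": 0})

# Nominal column: one indicator column per port, so no order S < C < Q is implied.
toy = pd.get_dummies(toy, columns=["Embarked"], prefix="Embarked", dtype=int)

print(toy)
print(toy.dtypes)
```

With `.loc` assignment into a string column the column keeps its `object` dtype, whereas both calls here return numeric columns that scikit-learn can consume without conversion.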
In [76]:
train=traindf4[['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','Survived']]
X_train=traindf4[['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']]
y_train=traindf4[['Survived']]
In [80]:
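from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Caution: data_scaled is never used below, and `train` still contains the
# Survived target, so this scaling has no effect on the model.
scaler=StandardScaler()
data_scaled=scaler.fit_transform(train)

# Hold out 30% of the rows for evaluation.
X_train,X_test,y_train,y_test=train_test_split(X_train, y_train, test_size=0.3, random_state=0)

# The model is fit on the unscaled features with the default max_iter, and
# y_train is a one-column DataFrame rather than a 1-D array; both trigger the
# warnings printed below.
lr_clf=LogisticRegression()
lr_clf.fit(X_train,y_train)
lr_preds=lr_clf.predict(X_test)

accuracy_score(y_test,lr_preds)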
C:\Users\flyto\anaconda3\lib\site-packages\sklearn\utils\validation.py:73: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

C:\Users\flyto\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Out[80]:
1.0
In [81]:
roc_auc_score(y_test,lr_preds)
Out[81]:
1.0
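A held-out accuracy and AUC of exactly 1.0 are unusual and usually worth a check for target leakage or stale notebook state. As a hedged sketch of how the modelling step could be arranged to address the two warnings above, the block below puts the scaler inside a scikit-learn pipeline, passes a 1-D target, and computes AUC from predicted probabilities rather than hard labels. It runs on synthetic data, not the Titanic frame, and names such as `X_tr`, `clf`, and `probs` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (not the Titanic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) * [1.0, 10.0, 100.0, 0.1]
y = (X[:, 0] + X[:, 1] / 10.0 + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# The scaler lives inside the pipeline, so it is fit on the training split only
# and sees feature columns only, never the target.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)  # y_tr is 1-D, so no DataConversionWarning is raised

preds = clf.predict(X_te)
probs = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

acc = accuracy_score(y_te, preds)
auc = roc_auc_score(y_te, probs)  # AUC from scores rather than hard labels
print(f"accuracy={acc:.3f}  roc_auc={auc:.3f}")
```

For the notebook's own data, the equivalent change would be to pass `traindf4['Survived']` (a Series) as the target and to fit the pipeline on the feature columns only.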