This is an example of using a decision tree model as a data mining exercise.
First, the packages used are as follows.
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------- tidyverse 1.3.0 --
## √ ggplot2 3.3.1 √ purrr 0.3.4
## √ tibble 3.0.1 √ dplyr 1.0.0
## √ tidyr 1.1.0 √ stringr 1.4.0
## √ readr 1.3.1 √ forcats 0.5.0
## -- Conflicts ------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(rpart)
library(rpart.plot)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Next is the data we will use. It was downloaded from https://www.openml.org/d/29.
raw_hw_tb = read_csv("data/dataset_29_credit-a.csv",na="?")
## Parsed with column specification:
## cols(
## A1 = col_character(),
## A2 = col_double(),
## A3 = col_double(),
## A4 = col_character(),
## A5 = col_character(),
## A6 = col_character(),
## A7 = col_character(),
## A8 = col_double(),
## A9 = col_logical(),
## A10 = col_logical(),
## A11 = col_character(),
## A12 = col_logical(),
## A13 = col_character(),
## A14 = col_character(),
## A15 = col_double(),
## class = col_character()
## )
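The dataset is Credit Approval, which concerns credit card applications. It comes from Ross Quinlan and was released to UCI in 1987. All attribute names and values were replaced with meaningless symbols to protect confidentiality. The exact meaning of each variable is therefore unknown, but the characteristics of each variable can be checked on the dataset's web page. A1, A4, A5, A6, A7, A9, A10, A12, and A13 are nominal variables. Several of them have only two or three categories, while A6, for example, has 14. A2, A3, A8, A11, A14, and A15 are numeric variables. read_csv did not import every column with the matching type (for example, A11 and A14 were read as character, and A9, A10, and A12 as logical), so the variables are converted accordingly. The data also contain some NA values, but they are left as they are because rpart can handle missing values in the predictors (it uses surrogate splits).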
hw_tb = raw_hw_tb %>% mutate(
A1=as.factor(A1), A4=as.factor(A4), A5=as.factor(A5), A6=as.factor(A6), A7=as.factor(A7),
A9=as.factor(A9), A10=as.factor(A10), A12=as.factor(A12), A13=as.factor(A13),
A2=as.numeric(A2), A3=as.numeric(A3), A8=as.numeric(A8), A11=as.numeric(A11),
A14=as.numeric(A14), A15=as.numeric(A15),
class=as.factor(class)
)
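The same kind of conversion can also be written by grouping column names, which makes it easy to see which variables are treated as nominal and which as numeric. The following is only a base R sketch on a made-up toy data frame (`toy`, `factor_cols`, and `numeric_cols` are hypothetical names, not part of the original analysis); it also counts the missing values that are left in place.

```r
# Toy data frame standing in for raw_hw_tb (values are made up for illustration).
toy <- data.frame(
  A1  = c("a", "b", NA, "b"),
  A9  = c(TRUE, FALSE, TRUE, NA),
  A11 = c("1", "6", "0", NA),
  stringsAsFactors = FALSE
)

factor_cols  <- c("A1", "A9")  # nominal variables
numeric_cols <- c("A11")       # numeric variable that was read as text

# Convert each group of columns in one step.
toy[factor_cols]  <- lapply(toy[factor_cols], as.factor)
toy[numeric_cols] <- lapply(toy[numeric_cols], as.numeric)

# Count missing values per column; they are not imputed or dropped.
na_counts <- colSums(is.na(toy))
print(sapply(toy, class))
print(na_counts)
```

Keeping the column groups in two vectors means the nominal/numeric assignment is stated once and can be checked against the dataset description.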
The structure and the number of rows of the data are as follows.
str(hw_tb)
## tibble [690 x 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ A1 : Factor w/ 2 levels "a","b": 2 1 1 2 2 2 2 1 2 2 ...
## $ A2 : num [1:690] 30.8 58.7 24.5 27.8 20.2 ...
## $ A3 : num [1:690] 0 4.46 0.5 1.54 5.62 ...
## $ A4 : Factor w/ 3 levels "l","u","y": 2 2 2 2 2 2 2 2 3 3 ...
## $ A5 : Factor w/ 3 levels "g","gg","p": 1 1 1 1 1 1 1 1 3 3 ...
## $ A6 : Factor w/ 14 levels "aa","c","cc",..: 13 11 11 13 13 10 12 3 9 13 ...
## $ A7 : Factor w/ 9 levels "bb","dd","ff",..: 8 4 4 8 8 8 4 8 4 8 ...
## $ A8 : num [1:690] 1.25 3.04 1.5 3.75 1.71 ...
## $ A9 : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
## $ A10 : Factor w/ 2 levels "FALSE","TRUE": 2 2 1 2 1 1 1 1 1 1 ...
## $ A11 : num [1:690] 1 6 0 5 0 0 0 0 0 0 ...
## $ A12 : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 2 1 2 2 1 1 2 ...
## $ A13 : Factor w/ 3 levels "g","p","s": 1 1 1 1 3 1 1 1 1 1 ...
## $ A14 : num [1:690] 202 43 280 100 120 360 164 80 180 52 ...
## $ A15 : num [1:690] 0 560 824 3 0 ...
## $ class: Factor w/ 2 levels "-","+": 2 2 2 2 2 2 2 2 2 2 ...
## - attr(*, "spec")=
## .. cols(
## .. A1 = col_character(),
## .. A2 = col_double(),
## .. A3 = col_double(),
## .. A4 = col_character(),
## .. A5 = col_character(),
## .. A6 = col_character(),
## .. A7 = col_character(),
## .. A8 = col_double(),
## .. A9 = col_logical(),
## .. A10 = col_logical(),
## .. A11 = col_character(),
## .. A12 = col_logical(),
## .. A13 = col_character(),
## .. A14 = col_character(),
## .. A15 = col_double(),
## .. class = col_character()
## .. )
nrow(hw_tb)
## [1] 690
#1. Split the data into training and test sets at a 50:50 ratio.
set.seed(1234)
n = sample(1:nrow(hw_tb), round(nrow(hw_tb)/2))
train = hw_tb[n,]
test = hw_tb[-n,]
c(nrow(train), nrow(test))
## [1] 345 345
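A simple random split does not guarantee that both halves have the same class proportions. As a quick sanity check, the proportions can be compared after splitting. The sketch below uses made-up labels (`toy` is a hypothetical stand-in for hw_tb) and is not part of the original analysis.

```r
set.seed(1234)
# Toy labels standing in for hw_tb$class (made-up data: 60 "-" and 40 "+").
toy <- data.frame(class = factor(rep(c("-", "+"), times = c(60, 40))))

# Same splitting scheme as above: sample half of the row indices.
idx   <- sample(seq_len(nrow(toy)), round(nrow(toy) / 2))
train <- toy[idx, , drop = FALSE]
test  <- toy[-idx, , drop = FALSE]

# Class proportions in each half; a large gap would suggest a stratified split.
train_prop <- prop.table(table(train$class))
test_prop  <- prop.table(table(test$class))
print(rbind(train = train_prop, test = test_prop))
```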
#2. Fit decision trees using the ‘rpart’ command in R, with the hyper-parameters set as follows.
A. minsplit = 1 to 46 (in steps of 5)
B. cp = 0.001 to 0.01 (in steps of 0.001)
C. xval = 0, fixed (no cross-validation, hence no pruning)
D. All other parameters keep their default values
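The grid described in A and B can be enumerated explicitly to see how many models have to be fitted. This is a small sketch; the names `minsplit_grid`, `cp_grid`, and `grid` are introduced here for illustration only.

```r
# Candidate values for each hyper-parameter.
minsplit_grid <- seq(1, 46, by = 5)  # 1, 6, ..., 46
cp_grid       <- 0.001 * (1:10)      # 0.001, 0.002, ..., 0.010

# Every combination of the two: one row per model to fit.
grid <- expand.grid(minsplit = minsplit_grid, cp = cp_grid)
print(length(minsplit_grid))
print(length(cp_grid))
print(nrow(grid))
```

Each parameter takes 10 values, so the grid has 10 × 10 = 100 combinations.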
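#3. Build the decision trees that satisfy the conditions in question 2 using the training data, and compute the prediction accuracy on the test data. AUROC is used as the measure of prediction accuracy. This gives a total of 100 AUROC values (10 minsplit values × 10 cp values). Using these values, draw a 3D surface plot of AUROC over the combinations of minsplit and cp.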
# Rows index minsplit (i1), columns index cp (i2)
auc = matrix(NA, nrow=10, ncol=10)
for(i1 in 1:10){
  for(i2 in 1:10){
    tree_control = rpart.control(minsplit = 5*i1-4, cp = 0.001*i2, xval = 0)
    # Fit on the training data and keep each object under a name such as tree_16_4
    temp_tree = assign(paste('tree',5*i1-4,i2,sep='_'), rpart(class ~ . , data=train, method='class', control=tree_control) )
    # Predicted class probabilities on the test data (column 2 is class "+")
    temp_prob = assign(paste('prob',5*i1-4,i2,sep='_'), predict(temp_tree, newdata=test, type="prob") )
    temp_roc = assign(paste('roc',5*i1-4,i2,sep='_'), roc(test$class ~ temp_prob[,2]) )
    auc[i1,i2] = as.numeric(temp_roc$auc)
  }
}
## Setting levels: control = -, case = +
## Setting direction: controls < cases
The results of running the code are available through the objects below. Details are omitted.
Each model: tree_(minsplit value)_(cp value*1000)
Each prediction: prob_(minsplit value)_(cp value*1000)
Each ROC (question 4): roc_(minsplit value)_(cp value*1000)
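As a reminder of what the AUC stored in each roc object measures: AUROC is the probability that a randomly chosen "+" case receives a higher score than a randomly chosen "-" case, with ties counted as one half. The sketch below computes it by hand with the rank-sum (Mann-Whitney) identity on made-up scores. `auroc_by_rank` is a hypothetical helper written for illustration, not a pROC function, although its result should agree with pROC's AUC when controls are expected to score lower than cases.

```r
# Hypothetical helper: AUROC via the rank-sum (Mann-Whitney) identity.
auroc_by_rank <- function(score, is_case) {
  r      <- rank(score)  # average ranks, so tied scores count as one half
  n_case <- sum(is_case)
  n_ctrl <- sum(!is_case)
  (sum(r[is_case]) - n_case * (n_case + 1) / 2) / (n_case * n_ctrl)
}

# Made-up scores: predicted probability of class "+" for six applicants.
score   <- c(0.9, 0.8, 0.8, 0.3, 0.2, 0.1)
is_case <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

auc_value <- auroc_by_rank(score, is_case)
print(auc_value)
```

Here 7.5 of the 9 case/control pairs are ordered correctly (one tie counts as 0.5), so the value is 7.5/9.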
Drawing a 3D surface plot from these AUROC values gives the following.
cp = 0.001*1:10
minsplit = 1:10*5-4
# auc[i1, i2] was filled with i1 = minsplit and i2 = cp, so rows are minsplit and columns are cp
rownames(auc)=minsplit
colnames(auc)=cp
# For a surface trace, columns of z map to x and rows of z map to y
plot_ly(z = auc,x=cp,y=minsplit) %>%
  add_surface() %>%
  layout(scene = list(
    xaxis = list(title = 'cp',tickvals=cp),
    yaxis = list(title = 'minsplit',tickvals=minsplit),
    zaxis = list(title = 'auc'))
  )
The plot is drawn with plotly, so it can be explored interactively in the HTML output.
#4. State which hyper-parameter combination gives the highest prediction accuracy in the results above.
First, the location of the maximum in the auc matrix is as follows (rows are minsplit, columns are cp).
auc == max(auc)
## 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.01
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 11 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 16 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## 21 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 26 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 31 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 36 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 41 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 46 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
That is, the AUROC is highest when minsplit is 16 and cp is 0.001, 0.002, 0.003, 0.004, 0.005, or 0.006. These combinations are therefore the optimal hyper-parameters.
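Reading the maximum off a TRUE/FALSE table by eye is error-prone, so the winning combinations can also be extracted programmatically. The sketch below uses a small made-up matrix (`toy_auc`, with rows as minsplit and columns as cp) rather than the real auc matrix; `best` and `best_params` are hypothetical names.

```r
minsplit <- c(1, 6, 11)
cp       <- c(0.001, 0.002, 0.003)

# Made-up AUROC values: rows = minsplit, columns = cp.
toy_auc <- matrix(0.85, nrow = 3, ncol = 3, dimnames = list(minsplit, cp))
toy_auc[2, 1:2] <- 0.90  # pretend minsplit = 6 with cp = 0.001 or 0.002 is best

# arr.ind = TRUE returns the (row, col) index of every cell equal to the maximum.
best <- which(toy_auc == max(toy_auc), arr.ind = TRUE)

# Map the indices back to the hyper-parameter values.
best_params <- data.frame(
  minsplit = minsplit[best[, "row"]],
  cp       = cp[best[, "col"]]
)
print(best_params)
```

Mapping indices back through the `minsplit` and `cp` vectors avoids depending on how the matrix rows and columns happen to be labelled.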