Data Mining - Decision Tree

This is a data mining exercise that uses a decision tree model.

First, the packages used:

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------- tidyverse 1.3.0 --
## √ ggplot2 3.3.1     √ purrr   0.3.4
## √ tibble  3.0.1     √ dplyr   1.0.0
## √ tidyr   1.1.0     √ stringr 1.4.0
## √ readr   1.3.1     √ forcats 0.5.0
## -- Conflicts ------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(rpart)
library(rpart.plot)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Next is the data we will use, taken from https://www.openml.org/d/29.

raw_hw_tb = read_csv("data/dataset_29_credit-a.csv",na="?")
## Parsed with column specification:
## cols(
##   A1 = col_character(),
##   A2 = col_double(),
##   A3 = col_double(),
##   A4 = col_character(),
##   A5 = col_character(),
##   A6 = col_character(),
##   A7 = col_character(),
##   A8 = col_double(),
##   A9 = col_logical(),
##   A10 = col_logical(),
##   A11 = col_character(),
##   A12 = col_logical(),
##   A13 = col_character(),
##   A14 = col_character(),
##   A15 = col_double(),
##   class = col_character()
## )

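The dataset is Credit Approval, which concerns credit card applications. It comes from Ross Quinlan and was released through UCI in 1987. All attribute names and values were replaced with meaningless symbols to protect confidentiality, so the exact meaning of each variable is unknown, but the dataset page describes the type of each variable. A1, A4, A5, A6, A7, A9, A10, A12, and A13 are nominal. Several of them (A1, A9, A10, A12) have only two categories, while A6 has 14. A2, A3, A8, A11, A14, and A15 are numeric. read_csv parsed some columns with types that do not match this description (for example, A11 and A14 were read as character, and A9, A10, and A12 as logical), so the variables are converted accordingly. The data also contain some NA values, but they are left as is because a tree model is used; rpart can handle missing predictor values through surrogate splits.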

hw_tb = raw_hw_tb %>% mutate(
    A1=as.factor(A1), A4=as.factor(A4), A5=as.factor(A5), A6=as.factor(A6), A7=as.factor(A7),
    A9=as.factor(A9), A10=as.factor(A10), A12=as.factor(A12), A13=as.factor(A13),
    
    A2=as.numeric(A2), A3=as.numeric(A3), A8=as.numeric(A8), A11=as.numeric(A11),
    A14=as.numeric(A14), A15=as.numeric(A15),
    
    class=as.factor(class)
  )
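The same conversion can also be written by listing the column names once per target type. The following is a minimal base-R sketch on a small made-up data frame; the helper name `convert_types` and the toy values are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical helper: convert the named columns to factor / numeric in one pass.
convert_types <- function(df, factor_cols, numeric_cols) {
  df[factor_cols]  <- lapply(df[factor_cols], as.factor)
  df[numeric_cols] <- lapply(df[numeric_cols], as.numeric)
  df
}

# Toy data imitating how some columns were parsed (values are made up).
toy <- data.frame(
  A1  = c("a", "b", NA),        # character, should become a factor
  A9  = c(TRUE, FALSE, TRUE),   # logical, should become a factor
  A11 = c("01", "6", "0"),      # character, should become numeric
  stringsAsFactors = FALSE
)

toy_conv <- convert_types(toy, factor_cols = c("A1", "A9"), numeric_cols = "A11")
str(toy_conv)
```

Missing values stay NA after the conversion, so nothing needs to be imputed before the columns are handed to a tree model.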

The structure of the data and the total number of rows are as follows.

str(hw_tb)
## tibble [690 x 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ A1   : Factor w/ 2 levels "a","b": 2 1 1 2 2 2 2 1 2 2 ...
##  $ A2   : num [1:690] 30.8 58.7 24.5 27.8 20.2 ...
##  $ A3   : num [1:690] 0 4.46 0.5 1.54 5.62 ...
##  $ A4   : Factor w/ 3 levels "l","u","y": 2 2 2 2 2 2 2 2 3 3 ...
##  $ A5   : Factor w/ 3 levels "g","gg","p": 1 1 1 1 1 1 1 1 3 3 ...
##  $ A6   : Factor w/ 14 levels "aa","c","cc",..: 13 11 11 13 13 10 12 3 9 13 ...
##  $ A7   : Factor w/ 9 levels "bb","dd","ff",..: 8 4 4 8 8 8 4 8 4 8 ...
##  $ A8   : num [1:690] 1.25 3.04 1.5 3.75 1.71 ...
##  $ A9   : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
##  $ A10  : Factor w/ 2 levels "FALSE","TRUE": 2 2 1 2 1 1 1 1 1 1 ...
##  $ A11  : num [1:690] 1 6 0 5 0 0 0 0 0 0 ...
##  $ A12  : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 2 1 2 2 1 1 2 ...
##  $ A13  : Factor w/ 3 levels "g","p","s": 1 1 1 1 3 1 1 1 1 1 ...
##  $ A14  : num [1:690] 202 43 280 100 120 360 164 80 180 52 ...
##  $ A15  : num [1:690] 0 560 824 3 0 ...
##  $ class: Factor w/ 2 levels "-","+": 2 2 2 2 2 2 2 2 2 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   A1 = col_character(),
##   ..   A2 = col_double(),
##   ..   A3 = col_double(),
##   ..   A4 = col_character(),
##   ..   A5 = col_character(),
##   ..   A6 = col_character(),
##   ..   A7 = col_character(),
##   ..   A8 = col_double(),
##   ..   A9 = col_logical(),
##   ..   A10 = col_logical(),
##   ..   A11 = col_character(),
##   ..   A12 = col_logical(),
##   ..   A13 = col_character(),
##   ..   A14 = col_character(),
##   ..   A15 = col_double(),
##   ..   class = col_character()
##   .. )
nrow(hw_tb)
## [1] 690

#1. Split the data into training and test sets at a 50:50 ratio.

set.seed(1234)
n = sample(1:nrow(hw_tb), round(nrow(hw_tb)/2)) 
train = hw_tb[n,] 
test = hw_tb[-n,]
c(nrow(train), nrow(test))
## [1] 345 345
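Negative indexing with the sampled row numbers means the two halves are disjoint and together cover every row. A minimal sketch that checks this on row indices alone, without the data; the helper name `split_half` is hypothetical.

```r
# Hypothetical helper: return train/test row indices for a 50:50 split.
split_half <- function(n_rows, seed = 1234) {
  set.seed(seed)
  idx <- sample(seq_len(n_rows), round(n_rows / 2))
  list(train = idx, test = setdiff(seq_len(n_rows), idx))
}

parts <- split_half(690)
lengths(parts)                                # sizes of the two halves
length(intersect(parts$train, parts$test))   # 0 when the halves are disjoint
```

Fixing the seed makes the split reproducible; a different seed gives a different partition and, in general, different AUROC values later on.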

#2. Fit decision trees with R's 'rpart' function, tuning the hyper-parameters as follows.

  • A. minsplit = 1 to 46 (in steps of 5)

  • B. cp = 0.001 to 0.01 (in steps of 0.001)

  • C. xval fixed at 0 (no cross-validation, so no pruning)

  • D. All other parameters keep their default values
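The grid described in A and B can be enumerated explicitly to see how many models will be fitted. This is a small base-R sketch; the variable names `minsplit_grid`, `cp_grid`, and `grid` are chosen here for illustration.

```r
# Candidate values for the two tuned hyper-parameters.
minsplit_grid <- seq(1, 46, by = 5)             # 1, 6, ..., 46
cp_grid       <- seq(0.001, 0.01, by = 0.001)   # 0.001, 0.002, ..., 0.01

# Every (minsplit, cp) pair; xval stays fixed at 0 for all of them.
grid <- expand.grid(minsplit = minsplit_grid, cp = cp_grid, xval = 0)
nrow(grid)   # 10 * 10 = 100 combinations
```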

#3. Build the decision trees that satisfy the conditions in #2 on the training data, and measure prediction accuracy on the test data using AUROC. This yields 100 AUROC values in total (10 minsplit values × 10 cp values). Draw a 3D surface plot of AUROC over the combinations of minsplit and cp.
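AUROC can be read as the probability that a randomly chosen "+" case receives a higher predicted score than a randomly chosen "-" case, with ties counted as one half. The sketch below computes that quantity from ranks in base R on made-up scores. The helper name `auroc_rank` is hypothetical, and this is not a description of how pROC works internally; it only illustrates the quantity being measured.

```r
# Hypothetical helper: AUROC via the Mann-Whitney rank-sum identity.
auroc_rank <- function(scores, is_positive) {
  r     <- rank(scores)          # average ranks, so ties count as one half
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up predicted probabilities of class "+" and the true labels.
scores <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
labels <- c("+", "+", "+", "-", "-", "-")
auroc_rank(scores, labels == "+")   # 8 of the 9 (+, -) pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1 to perfect separation, which is why AUROC is a convenient single number for comparing trees across the grid.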

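# Rows index minsplit (i1) and columns index cp (i2).
auc = matrix(NA,nrow=10,ncol=10)

for(i1 in 1:10){
  for(i2 in 1:10){
    # minsplit = 1, 6, ..., 46 and cp = 0.001, 0.002, ..., 0.01; no cross-validation
    tree_control = rpart.control(minsplit = 5*i1-4, cp =0.001*i2, xval=0)
    # Fit on the training data; assign() also keeps each object under a name such as tree_16_4
    temp_tree = assign(paste('tree',5*i1-4,i2,sep='_'), rpart(class ~ . , data=train, method='class', control=tree_control) )
    # Predicted class probabilities on the test data
    temp_prob = assign(paste('prob',5*i1-4,i2,sep='_'), predict(temp_tree, newdata=test, type="prob")  )
    # ROC curve based on the predicted probability of class "+"
    temp_roc = assign(paste('roc',5*i1-4,i2,sep='_'), roc(test$class ~ temp_prob[,2])  )
    auc[i1,i2]=as.numeric(temp_roc$auc)
  }
}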
## Setting levels: control = -, case = +
## Setting direction: controls < cases

The results of the loop are available through the objects below. Details are omitted.

  • Each model : tree_(minsplit value)_(cp value*1000)

  • Each prediction : prob_(minsplit value)_(cp value*1000)

  • Each ROC (for #4) : roc_(minsplit value)_(cp value*1000)
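An alternative to creating many separately named objects is to keep the fits in one named list, using the same naming scheme as keys. The sketch below assumes the `kyphosis` data set that ships with rpart as a stand-in for the credit data, and fits only four trees; the list name `fits` is chosen here for illustration.

```r
library(rpart)  # the kyphosis data set is included with rpart

fits <- list()
for (ms in c(6, 11)) {
  for (cp in c(0.001, 0.002)) {
    # Key follows the pattern tree_(minsplit value)_(cp value*1000)
    key <- sprintf("tree_%d_%d", ms, round(cp * 1000))
    fits[[key]] <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                         method = "class",
                         control = rpart.control(minsplit = ms, cp = cp, xval = 0))
  }
}
names(fits)
```

A list keeps the workspace tidy and lets all fits be processed with `lapply()` or looked up by key, for example `fits[["tree_6_1"]]`.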

Using these AUROC values, the 3D surface plot is drawn as follows.

cp = 0.001*1:10
minsplit = 1:10*5-4
# auc[i1,i2] was filled with i1 indexing minsplit and i2 indexing cp
rownames(auc)=minsplit
colnames(auc)=cp

plot_ly(z = auc,x=cp,y=minsplit) %>%
        add_surface() %>%
        layout(scene = list(
                xaxis = list(title = 'cp',tickvals=cp),
                yaxis = list(title = 'minsplit',tickvals=minsplit),
                zaxis = list(title = 'auc'))
               )

Because the plot is drawn with plotly, it can be explored interactively in the HTML output.

#4. From the results above, identify the hyper-parameter combination with the highest prediction accuracy.

First, the auc matrix is used to locate the maximum. Rows are minsplit values and columns are cp values.

auc == max(auc)
##    0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009  0.01
## 1  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 11 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 16  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
## 21 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 26 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 31 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 36 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 41 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 46 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

The maximum sits in the fourth row and the first six columns. Because the loop stored each value as auc[i1,i2] with minsplit = 5*i1-4 and cp = 0.001*i2, the best AUROC is obtained when minsplit is 16 and cp is 0.001, 0.002, 0.003, 0.004, 0.005, or 0.006. These six combinations tie as the optimal hyper-parameters.
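Reading TRUE cells off a logical matrix by eye is error-prone, so the positions of the maximum can instead be extracted with `which(..., arr.ind = TRUE)` and mapped back to parameter values. The sketch below uses a small made-up matrix whose rows stand for minsplit and whose columns stand for cp; the names `toy_auc` and `best_params` are hypothetical.

```r
# Toy AUROC matrix (made-up values): rows = minsplit, columns = cp.
minsplit <- c(1, 6, 11)
cp       <- c(0.001, 0.002)
toy_auc  <- matrix(c(0.90, 0.92, 0.91,    # column for cp = 0.001
                     0.90, 0.92, 0.89),   # column for cp = 0.002
                   nrow = 3, dimnames = list(minsplit, cp))

# Row/column positions of every cell that attains the maximum.
best <- which(toy_auc == max(toy_auc), arr.ind = TRUE)

# Translate positions back into hyper-parameter values.
best_params <- data.frame(minsplit = minsplit[best[, "row"]],
                          cp       = cp[best[, "col"]])
best_params
```

When several cells tie, as they do here, each tied combination appears as its own row, which makes it explicit that the optimum is not unique.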

Updated 2020-08-20