Data Mining with R: Summary Notes
- Comparison criteria for predictive regression models
- Comparison criteria for predictive classification models
Regression
Logistic Regression
- Model interpretation
- Code
Neural Network
Decision Tree
Ensemble Methods
Bagging : Bootstrap AGGregatING
- Procedure
- Code
Boosting
- AdaBoost
  - Code
- Gradient Boost
Random Forest
Clustering : Unsupervised Learning
Hierarchical clustering
K-means clustering
- Procedure
- Choosing K
- Elbow Point
- silhouette
  - Code
K-medoids clustering
- Code
Density-based clustering
- Two parameters
  - Code

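Data Mining with R: Summary Notes

Explanation-oriented models : statistical models

Predictive power is lower, but the model is easy to understand
Comparison criterion : R²
- How well the model explains the causal relationship between X and Y
- Because the model is linear, the risk of overfitting is comparatively small, so R² is the main criterion

Prediction-oriented models : predictive power is high, but the model is hard to understand and interpret

The criterion is how well the model predicts future values of Y
Machine learning models are nonlinear and therefore at risk of overfitting
- Split the data into a training set and a test set to guard against overfitting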

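Comparison criteria for predictive regression models

All of the criteria below are computed on the test data

Predictive coefficient of determination R²

R² = corr(y, ŷ)²

Mean absolute error MAE : the average of the absolute errors

$$MAE = {1 \over n} \sum |y - \hat y |$$

Mean absolute percentage error MAPE : expresses, as a percentage, how far the predictions are from the actual values

$$MAPE = {100\% \over n} \sum {|y - \hat y | \over |y|}$$

Mean squared error MSE : uses squared errors instead of absolute errors

$$MSE = {1 \over n} \sum (y - \hat y )^2$$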

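Comparison criteria for predictive classification models

Accuracy family

Drawback : somewhat arbitrary (the ranking of models can change depending on the measure and on the cut-off)

When building a model : sensitivity (TPR), specificity (TNR)

When applying it in the field : PPV, NPV

Accuracy : $a+d \over a+b+c+d$, where a and d are the correctly classified cells of the 2x2 confusion matrix
Sensitivity : TPR
Specificity : TNR
Precision : PPV
F-1 score : harmonic mean of TPR and PPV
BCR : geometric mean of TPR and TNR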
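
The measures listed above can be computed directly from the four cells of a confusion matrix. The sketch below is only an illustration: the function name `classification_metrics`, the cell labels (tn, fp, fn, tp) and the counts are assumptions made for the example, not values from these notes.

```r
# Minimal sketch: accuracy-family measures from confusion-matrix counts.
# tn/fp/fn/tp are illustrative labels; the counts below are made up.
classification_metrics <- function(tn, fp, fn, tp) {
  tpr <- tp / (tp + fn)   # sensitivity
  tnr <- tn / (tn + fp)   # specificity
  ppv <- tp / (tp + fp)   # precision
  npv <- tn / (tn + fn)
  c(accuracy    = (tp + tn) / (tn + fp + fn + tp),
    sensitivity = tpr,
    specificity = tnr,
    precision   = ppv,
    npv         = npv,
    f1          = 2 * tpr * ppv / (tpr + ppv),  # harmonic mean of TPR and PPV
    bcr         = sqrt(tpr * tnr))              # geometric mean of TPR and TNR
}

m <- classification_metrics(tn = 850, fp = 50, fn = 30, tp = 70)
print(round(m, 3))
```

With an imbalanced example like this one, accuracy looks high while precision and F-1 are much lower, which is why several measures are usually reported together.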

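ROC family

ROC : a graph built on the trade-off between specificity and sensitivity for a given model (as the cut-off moves, one rises while the other falls)

- x-axis : 1 - specificity

- y-axis : sensitivity

AUROC : area under the ROC curve

- Commonly accepted as adequate : 75% or higher

- Minimum acceptable : 65% or higher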
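
AUROC can also be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A small base-R sketch of that rank-based calculation follows; the function name `auc_rank` and the toy data are assumptions for illustration, and in practice a package such as pROC would normally be used.

```r
# Minimal sketch: AUROC from ranks (Mann-Whitney form), base R only.
auc_rank <- function(y, score) {
  r  <- rank(score)          # average ranks, so ties are handled
  n1 <- sum(y == 1)          # number of positives
  n0 <- sum(y == 0)          # number of negatives
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data (made up): 3 positives, 3 negatives
y     <- c(0,   0,   1,    0,   1,   1)
score <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.9)
auc_rank(y, score)   # 6 of the 9 positive/negative pairs are ordered correctly
```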

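Lift Chart family

Whichever of the three types (Response, Captured Response, Lift) is used, the conclusion is the same

Lift Chart : sort the observations in descending order of predicted probability, then split them into bins

- x-axis : binned sample size

- y-axis : Response, Captured Response, or Lift

    - Response : good if it is high in the top bins and then drops sharply

    - Lift : good if it is above 1 in the top bins and close to 0 in the bottom bins

Cumulative Lift Chart : the non-cumulative chart can jump up and down inconsistently between bins, so the cumulative version is preferred

- Cumulative Response : a steep slope is good; a flat (uniform) curve is bad

- Cumulative Lift : decreases toward 1; the larger it is in the top bins, the better

- Cumulative Captured Response : the only one of the three that increases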
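
The three quantities plotted in a lift chart can be tabulated by hand. The sketch below assumes equal-sized bins, at least as many observations as bins, and simulated data; the function name `lift_table` and its column names are illustrative, not part of any package.

```r
# Minimal sketch: response, captured response and lift per bin (base R).
lift_table <- function(y, prob, groups = 10) {
  ord <- order(prob, decreasing = TRUE)        # highest predicted probability first
  y   <- y[ord]
  n   <- length(y)
  bin <- ceiling(seq_len(n) * groups / n)      # bin 1 = top-scored group
  resp <- as.numeric(tapply(y, bin, mean))     # response rate per bin
  capt <- as.numeric(tapply(y, bin, sum)) / sum(y)  # share of all responders per bin
  data.frame(bin          = seq_len(groups),
             response     = resp,
             captured     = capt,
             lift         = resp / mean(y),    # response relative to the overall rate
             cum_captured = cumsum(capt))
}

# Simulated data (an assumption): the score is informative about y
set.seed(1)
prob <- runif(1000)
y    <- rbinom(1000, 1, prob)
lt   <- lift_table(y, prob)
print(lt)
```

For a useful model the `lift` column starts well above 1 and falls below 1 in the last bins, while `cum_captured` rises to 1.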

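Cumulative accuracy profile CAP : based on the Gini coefficient idea

- Larger is better

- ( area under the model's cumulative captured response curve - area under the random curve ) / ( area under the perfect classifier's cumulative captured response curve - area under the random curve )

Profit chart : maximize profit = income - cost

K-S statistic

The largest gap between the cumulative distribution of the bad group and that of the good group

- A larger gap is better

cf. theoretical distribution function : cumulative distribution function; distribution function of the observed data : empirical distribution function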
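
As a rough illustration of the K-S statistic, the gap between the two empirical distribution functions can be computed with `ecdf`. The function name `ks_stat` and the simulated scores are assumptions for the example; the result should agree with the statistic reported by `ks.test`.

```r
# Minimal sketch: K-S statistic as the largest gap between two empirical CDFs.
ks_stat <- function(score_bad, score_good) {
  grid <- sort(unique(c(score_bad, score_good)))   # evaluate at every observed score
  max(abs(ecdf(score_bad)(grid) - ecdf(score_good)(grid)))
}

# Simulated scores (made up): the bad group tends to score higher
set.seed(1)
bad  <- rnorm(200, mean = 1)
good <- rnorm(300, mean = 0)

ks_stat(bad, good)
ks.test(bad, good)$statistic   # same quantity from the built-in test
```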

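Regression

Ways to remove confounding effects

Control : hold the variable constant -> keep it fixed
Data expansion : collect additional X variables -> bring the confounding effect into the open

Categorical independent variables

Use indicator variables : 0 or 1

Ways to increase R²

Find significant X variables

- X variables that are mutually independent and diverse are more helpful

Include interactions involving categorical X variables

Variable selection

Finding a small number of important predictors is what matters : AIC is the usual criterion

all subsets
backward elimination
forward selection
stepwise selection

Model interpretation

Slope β

A practical interpretation matters more than a purely mathematical one

The intercept is meaningless when it lies outside the range of the data, so it is not interpreted in that case.

p-value of the parameter β : significance test for the slope

If the p-value is below 0.05, reject H₀ : β = 0 : the slope is significant

Coefficient of determination R²

The share of the total variation in Y that is explained by X : the explanatory power of the regression model

- The closer $R^2$ is to 1, the more completely Y is explained

Adjusted R² : corrects for the overfitting problem that R² keeps improving as more X variables are added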
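
The size of this correction can be checked with the standard formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the sample size and p the number of predictors. The sketch below plugs in the figures reported in the regression output of the Code part of this section (R² = 0.7268, 4 predictors, 65 residual degrees of freedom, hence n = 70); the helper name `adj_r2` is made up for illustration.

```r
# Minimal sketch: adjusted R-squared from R-squared, n and p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj_r2(0.7268, n = 70, p = 4)   # about 0.71, as in the summary() output

# Cross-check against lm() on a built-in data set (mtcars is only a stand-in)
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)
c(manual = adj_r2(s$r.squared, n = nrow(mtcars), p = 2),
  lm     = s$adj.r.squared)
```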

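Model p-value : validity of the regression model

If the p-value is below 0.05, reject H₀ : all β_i = 0 : at least one explanatory variable is significant

cf. Relationship between R² and the p-value

- High R-squared, low p-value : the data are of high quality

- Low R-squared, low p-value : more X variables need to be found (typical of financial data)

- Low R-squared, high p-value : a regression made up of non-significant X variables

- High R-squared, high p-value : practically impossible

R² needs flexible, field-specific standards

- Experimental data, as in medicine and pharmacy, have simple causal structure : R-squared tends to be high

- Observational data, as in finance and economics, involve many variables and complex causal structure : R-squared tends to be low

Checking model validity

Basic assumptions of linear regression : needed because the p-values are computed from the F distribution

Normality : the error term is normally distributed with mean 0
- normal QQ plot
Homoscedasticity : the error term has constant variance
- The residual plot against ŷ should show no pattern such as a megaphone shape.
- If the spread of the residuals grows as ŷ increases, transform the Y variable : log (or sqrt) scaling
- If the variance is constant but the residuals show a trend as ŷ increases, additional X variables need to be found
Independence : the error terms are independent of one another

Residual analysis : after fitting the model, the assumptions above can be checked through the residuals, which are estimates of the errors

Additive and multiplicative models

Additive model : contains only sums of terms, with no products

Multiplicative model : includes interaction (product) terms

- If a higher-order term (interaction) is significant, keep the lower-order terms even when they are not significant : that is, if the interaction is significant, the main effects stay in the model even if they are not significant on their own

- Note that multicollinearity between $$X$$ and $$X^2$$ is usually limited (especially when X is centered first) : multicollinearity arises most strongly between linearly related variables

Code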

### Regression ###

# indicator variables
usedcar2$Ind1<-as.numeric(usedcar2$Color == 'white')
usedcar2$Ind2<-as.numeric(usedcar2$Color == 'silver')

# training and test data
set.seed(1234)
i = sample(1:nrow(usedcar2), round(nrow(usedcar2)*0.7)) #70% for training data, 30% for testdata
train = usedcar2[i,] 
test = usedcar2[-i,]



# regression fitting
lm_used<-lm(Price ~ Odometer + Ind1 + Ind2 + Ind1:Odometer, data=train)
summary(lm_used)

## 
## Call:
## lm(formula = Price ~ Odometer + Ind1 + Ind2 + Ind1:Odometer, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -680.08 -161.16    2.81  167.01  775.62 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.699e+04  2.639e+02  64.405  < 2e-16 ***
## Odometer      -6.410e-02  6.817e-03  -9.403 9.72e-14 ***
## Ind1          -9.388e+02  4.680e+02  -2.006 0.049018 *  
## Ind2           3.321e+02  8.983e+01   3.697 0.000451 ***
## Odometer:Ind1  2.689e-02  1.213e-02   2.217 0.030155 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 281.6 on 65 degrees of freedom
## Multiple R-squared:  0.7268, Adjusted R-squared:   0.71 
## F-statistic: 43.24 on 4 and 65 DF,  p-value: < 2.2e-16

# Trellis plot : one panel per category
library(lattice)
mypanel <- function(x, y) {
  panel.xyplot(x, y)
  panel.loess(x, y, col="red", lwd=2, lty=2)  # loess : local (nonparametric) regression smoother
  panel.lmline(x, y, col="black", lwd=2, lty=3)
}
xyplot(Price~Odometer|Color,data=usedcar2,panel=mypanel) # draws a separate panel for each level of Color

# regression diagnostics
plot(lm_used,which=2) ## QQ plot

plot(lm_used,which=1) ## fitted values vs. residuals

# stepwise
ls_st = step(lm_used, direction='both')

## Start:  AIC=794.46
## Price ~ Odometer + Ind1 + Ind2 + Ind1:Odometer
## 
##                 Df Sum of Sq     RSS    AIC
## <none>                       5152676 794.46
## - Odometer:Ind1  1    389490 5542166 797.56
## - Ind2           1   1083262 6235938 805.81

### Model Evaluation for Regression ###
# predicted values
pred1 = predict(ls_st, newdata=test, type='response')

# predictive R^2
cor(test$Price, pred1)^2

## [1] 0.674462

# MAE
mean(abs(test$Price - pred1))

## [1] 235.4435

# MAPE
mean(abs(test$Price - pred1)/abs(test$Price))*100

## [1] 1.57497

# RMSE
sqrt(mean((test$Price - pred1)^2))

## [1] 288.0425

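Logistic Regression

Explanation-oriented classification : the dependent variable is categorical

Linear model : the decision boundary (hyperplane) implied by the cut-off is itself linear

- Many misclassifications can occur

Cut-off : usually 0.5, but for imbalanced data the proportion P(y=1) is used as the cut-off

logit P(y = 1) = β₀ + β₁x

Advantages
- When X is categorical, it can be used as indicator variables via one-hot encoding
Disadvantages
- Cannot be used when there are many NAs
- Hard to interpret when interactions are present

Model interpretation

When X₁ increases by 1, the odds are multiplied by e^β₁ : the odds ratio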
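
The odds-ratio reading of a coefficient can be verified numerically. The sketch below fits a logistic regression to simulated data (the data, the true coefficients and the helper name `odds` are assumptions for illustration) and shows that raising x by one unit multiplies the fitted odds by exp(β₁), whatever the starting value of x.

```r
# Minimal sketch: exp(beta1) is the odds ratio for a one-unit increase in x.
set.seed(1)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(-1 + 0.7 * x))   # simulated response, true beta1 = 0.7

fit <- glm(y ~ x, family = binomial)
b   <- coef(fit)

# Fitted odds P(y=1)/P(y=0) at a given x
odds <- function(x0) exp(b[[1]] + b[[2]] * x0)

odds(1) / odds(0)   # ratio of odds for x = 1 versus x = 0
odds(3) / odds(2)   # same ratio at a different starting point
exp(b[[2]])         # the odds ratio read off the coefficient
```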

p-value of β
AUROC

There is no R², so AUROC is used instead

p-value of the deviance

Checks the significance of the model by comparing the deviance of the null model with that of the fitted model

Code

### Logistic Regression ###
complete=complete.cases(directmail)
table(complete)

## complete
## FALSE  TRUE 
##   273  9727

directmail1<-directmail[complete,]
dim(directmail1)

## [1] 9727    9

# training and test data
set.seed(1234)
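# NOTE: this line originally used ncol(), which draws only 6 training rows; the output printed below comes from that run
i = sample(1:nrow(directmail1), round(nrow(directmail1)*0.7)) #70% for training data, 30% for testdata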
train = directmail1[i,] 
test = directmail1[-i,]

# fitting full model
full_model = glm(RESPOND~. , family="binomial",data=train)
summary(full_model)

## 
## Call:
## glm(formula = RESPOND ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
## [1]  0  0  0  0  0  0
## 
## Coefficients: (3 not defined because of singularities)
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.457e+01  5.126e+06       0        1
## AGE          1.728e-16  2.443e+04       0        1
## BUY18       -3.387e-14  3.080e+05       0        1
## CLIMATE             NA         NA      NA       NA
## FICO        -3.081e-16  5.703e+03       0        1
## INCOME      -4.179e-15  1.511e+04       0        1
## MARRIED      5.936e-14  9.330e+05       0        1
## OWNHOME             NA         NA      NA       NA
## GENDERM             NA         NA      NA       NA
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.000e+00  on 5  degrees of freedom
## Residual deviance: 2.572e-10  on 0  degrees of freedom
## AIC: 12
## 
## Number of Fisher Scoring iterations: 23

# model's significance
empty_model = glm(RESPOND~1, family="binomial", data=train)
anova(full_model, empty_model, test="Chisq")

## Analysis of Deviance Table
## 
## Model 1: RESPOND ~ AGE + BUY18 + CLIMATE + FICO + INCOME + MARRIED + OWNHOME + 
##     GENDER
## Model 2: RESPOND ~ 1
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1         0  2.572e-10                     
## 2         5  2.572e-10 -5        0        1

# variable selection
step_model = step(full_model, direction='both') # direction='backward' is default

## Start:  AIC=12
## RESPOND ~ AGE + BUY18 + CLIMATE + FICO + INCOME + MARRIED + OWNHOME + 
##     GENDER
## 
## 
## Step:  AIC=12
## RESPOND ~ AGE + BUY18 + CLIMATE + FICO + INCOME + MARRIED + OWNHOME
## 
## 
## Step:  AIC=12
## RESPOND ~ AGE + BUY18 + CLIMATE + FICO + INCOME + MARRIED
## 
## 
## Step:  AIC=12
## RESPOND ~ AGE + BUY18 + FICO + INCOME + MARRIED
## 
##           Df  Deviance AIC
## - AGE      1 2.572e-10  10
## - BUY18    1 2.572e-10  10
## - FICO     1 2.572e-10  10
## - INCOME   1 2.572e-10  10
## - MARRIED  1 2.572e-10  10
## <none>       2.572e-10  12
## 
## Step:  AIC=10
## RESPOND ~ BUY18 + FICO + INCOME + MARRIED
## 
##           Df  Deviance AIC
## - BUY18    1 2.572e-10   8
## - FICO     1 2.572e-10   8
## - INCOME   1 2.572e-10   8
## - MARRIED  1 2.572e-10   8
## <none>       2.572e-10  10
## + AGE      1 2.572e-10  12
## + OWNHOME  1 2.572e-10  12
## + GENDER   1 2.572e-10  12
## 
## Step:  AIC=8
## RESPOND ~ FICO + INCOME + MARRIED
## 
##           Df  Deviance AIC
## - FICO     1 2.572e-10   6
## - INCOME   1 2.572e-10   6
## - MARRIED  1 2.572e-10   6
## <none>       2.572e-10   8
## + AGE      1 2.572e-10  10
## + BUY18    1 2.572e-10  10
## + OWNHOME  1 2.572e-10  10
## + GENDER   1 2.572e-10  10
## 
## Step:  AIC=6
## RESPOND ~ INCOME + MARRIED
## 
##           Df  Deviance AIC
## - INCOME   1 2.572e-10   4
## - MARRIED  1 2.572e-10   4
## <none>       2.572e-10   6
## + AGE      1 2.572e-10   8
## + BUY18    1 2.572e-10   8
## + FICO     1 2.572e-10   8
## + OWNHOME  1 2.572e-10   8
## + GENDER   1 2.572e-10   8
## 
## Step:  AIC=4
## RESPOND ~ MARRIED
## 
##           Df  Deviance AIC
## - MARRIED  1 2.572e-10   2
## <none>       2.572e-10   4
## + AGE      1 2.572e-10   6
## + BUY18    1 2.572e-10   6
## + FICO     1 2.572e-10   6
## + INCOME   1 2.572e-10   6
## + OWNHOME  1 2.572e-10   6
## + GENDER   1 2.572e-10   6
## 
## Step:  AIC=2
## RESPOND ~ 1
## 
##           Df  Deviance AIC
## <none>       2.572e-10   2
## + AGE      1 2.572e-10   4
## + BUY18    1 2.572e-10   4
## + FICO     1 2.572e-10   4
## + INCOME   1 2.572e-10   4
## + MARRIED  1 2.572e-10   4
## + OWNHOME  1 2.572e-10   4
## + GENDER   1 2.572e-10   4

# predicted probability
prob_pred1 = predict(step_model, newdata=test, type='response')


# odds ratio
exp(coef(step_model))

##  (Intercept) 
## 2.143345e-11

y_pred1 = as.numeric(prob_pred1 > 0.075)
tab1=table(test$RESPOND, y_pred1)
print(tab1)

##    y_pred1
##        0
##   0 8992
##   1  729

sum(diag(tab1))/sum(tab1)

## [1] 0.9250077

# ROC curve
library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

roccurve1 <- roc(test$RESPOND ~ prob_pred1)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roccurve1, col="red", print.auc=TRUE, print.auc.adj=c(1,-7), auc.polygon=TRUE)

# Scored data
scored_dat = cbind(prob_pred1,test$RESPOND)
head(scored_dat)

##      prob_pred1  
## 3  2.143345e-11 0
## 7  2.143345e-11 0
## 9  2.143345e-11 0
## 11 2.143345e-11 0
## 12 2.143345e-11 0
## 13 2.143345e-11 0

head(scored_dat[order(-prob_pred1),],30)

##      prob_pred1  
## 3  2.143345e-11 0
## 7  2.143345e-11 0
## 9  2.143345e-11 0
## 11 2.143345e-11 0
## 12 2.143345e-11 0
## 13 2.143345e-11 0
## 14 2.143345e-11 1
## 15 2.143345e-11 0
## 16 2.143345e-11 1
## 17 2.143345e-11 0
## 18 2.143345e-11 0
## 19 2.143345e-11 0
## 20 2.143345e-11 0
## 21 2.143345e-11 1
## 22 2.143345e-11 0
## 23 2.143345e-11 0
## 24 2.143345e-11 0
## 25 2.143345e-11 0
## 26 2.143345e-11 0
## 27 2.143345e-11 0
## 28 2.143345e-11 0
## 29 2.143345e-11 0
## 30 2.143345e-11 0
## 31 2.143345e-11 0
## 32 2.143345e-11 0
## 33 2.143345e-11 0
## 34 2.143345e-11 0
## 35 2.143345e-11 0
## 36 2.143345e-11 0
## 37 2.143345e-11 0

# Lift Chart 1
library(BCA)
layout(matrix(c(1,2), 2, 1))
test$RESPOND=factor(test$RESPOND)
lift.chart("step_model", data=test, targLevel="1", trueResp=0.07, type="incremental", sub="Test")
lift.chart("step_model", data=test, targLevel="1", trueResp=0.07, type="cumulative", sub="Test")

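The BCA package requires the Y variable to be a factor

targLevel="1" : the level 1 is the one counted as a response

type="incremental" : non-cumulative chart

type="cumulative" : cumulative chart

trueResp=0.07 : response rate used as the reference guideline (an arbitrary value here)

# Lift Chart 2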
library(ROCR)
pred1 = prediction(prob_pred1, test$RESPOND)
perf1 = performance(pred1,"lift","rpp")
plot(perf1, main="lift chart")
# plot(perf2, add=TRUE, lty=2, col="red")
legend(0.7, 3, legend=c("Model 1", "Model 2"),col=c("black", "red"), lty=1:2)


# K-S statistics
ks.test(prob_pred1[test$RESPOND==1], prob_pred1[test$RESPOND==0])

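Neural Network

A nonlinear statistical model that acts as a universal approximator

Advantages

Good predictive power
Works for both regression and classification

Disadvantages

Many values have to be estimated
- Getting good results takes many trial runs with different settings
Training takes a long time
black box : hard to interpret
High risk of overfitting
Input values must be numeric
Verification is difficult

sensitivity analysis

A way to interpret a NN, but not recommended because it treats a multivariate analysis as if it were univariate

Feed the mean of every input value into the NN model
Change one x variable (from its min to its max) and measure how the model output changes
- The x variable can also be increased in small steps to draw a sensitivity plot
Inputs to which the output is sensitive are judged to be important variables

hidden nodes : combination function + activation (squashing) function

The sigmoid (logistic) function and the hyperbolic tangent function are the usual activation functions

sigmoid function $sigmoid(x)={e^x\over {1+e^x}}$ : returns values in 0~1
hyperbolic tangent function $tanh(x)={e^x-e^{-x}\over {e^x+e^{-x}}}$ : returns values in -1~1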
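
A quick way to get a feel for the two activation functions is to evaluate them in R. The sketch below defines `sigmoid` by hand (the name is chosen for the example; base R offers the same function as `plogis`) and uses the built-in `tanh`.

```r
# Minimal sketch: the two usual activation functions and their ranges.
sigmoid <- function(x) 1 / (1 + exp(-x))   # same as e^x / (1 + e^x)

x <- c(-5, -1, 0, 1, 5)
round(sigmoid(x), 4)   # values between 0 and 1, sigmoid(0) = 0.5
round(tanh(x), 4)      # values between -1 and 1, tanh(0) = 0

# tanh is a rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
all.equal(tanh(x), 2 * sigmoid(2 * x) - 1)
```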

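output layer nodes

Regression : use only the identity activation function

Classification : use the sigmoid function

- For a categorical target, one output node per class, each giving a probability in 0~1

training the networks

Train so that the objective function, which is proportional to the error, decreases

Estimate the weights that minimize the cross entropy

Set initial weights and compute the error
Adjust the weights by gradient descent until the error no longer decreases

Practical tips

A good model is simple and still predicts well
Start by building a statistical model (no hidden layer)

Then add nodes one at a time and check performance
Keep adding and modifying until the generalization error (roughly, the error on unseen data) starts to increase

NN is very sensitive to its input data : well-behaved data are needed

Continuous variables with similar variances
A reasonable number of variables
Categorical variables should be coded as indicator variables, with similar counts for each indicator.

All variables scaled to 0~1 or -1~1

So min-max normalization is needed : $new\ x={x-min\over max-min}$
- Normalize : rescales values to lie between 0 and 1
- Standardize : rescales values to mean 0 and standard deviation 1
For categorical variables, use indicator variables
For ordinal variables, use equal spacing (evenly spaced values within 0~1 or -1~1)
Variable selection also helps : build a decision tree, compute variable importance, and select variables from it
- Selection should be based on a nonlinear model such as a decision tree, not on a linear model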
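
The difference between normalizing and standardizing is easy to see on a small vector. The function names `normalize` and `standardize` and the toy values below are assumptions for illustration; base R's `scale()` performs the same standardization.

```r
# Minimal sketch: min-max normalization versus z-score standardization.
normalize   <- function(x) (x - min(x)) / (max(x) - min(x))   # result lies in [0, 1]
standardize <- function(x) (x - mean(x)) / sd(x)              # mean 0, sd 1

x <- c(2, 4, 6, 8, 10)
normalize(x)     # 0.00 0.25 0.50 0.75 1.00
standardize(x)

# Indicator coding for a categorical input (toy factor)
color <- factor(c("white", "silver", "black", "white"))
as.numeric(color == "white")   # 1 0 0 1
```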

Outliers degrade performance

Code

### Neural Network ###
complete=complete.cases(directmail)
directmail1<-directmail[complete,]
nobs=nrow(directmail1)

nor = function(x) {(x-min(x))/(max(x)-min(x))}

directmail1$AGE <- nor(directmail1$AGE)
directmail1$CLIMATE <- nor(directmail1$CLIMATE)
directmail1$FICO <- nor(directmail1$FICO)
directmail1$INCOME <- nor(directmail1$INCOME)

directmail1$GENDER <- as.numeric(directmail1$GENDER == 'F')



# training and test data
set.seed(1234)
i = sample(1:nobs, round(nobs*0.7)) #70% for training data, 30% for testdata
train = directmail1[i,]  ; test = directmail1[-i,]


library(neuralnet)
set.seed(1234)
nn1<-neuralnet(RESPOND~AGE+BUY18+CLIMATE+FICO+INCOME+MARRIED+OWNHOME+GENDER, 
               data=train, hidden=c(2,2), stepmax = 1e+05, threshold = 0.1, 
               act.fct='logistic', linear.output=F) 
print(nn1$weights)

## [[1]]
## [[1]][[1]]
##             [,1]       [,2]
##  [1,] -1.5007954 -0.5678670
##  [2,]  3.7583278  6.3630250
##  [3,]  3.0686893 -6.9450319
##  [4,] -4.8835950  6.3408508
##  [5,] -0.7893066  1.0792173
##  [6,] -0.6800452  0.8605907
##  [7,] -1.5235982  0.7830834
##  [8,]  3.2645385 -4.8048407
##  [9,]  0.3456037 -0.5042786
## 
## [[1]][[2]]
##           [,1]       [,2]
## [1,] -31.02980  0.2934988
## [2,]  56.36981 -0.4801595
## [3,] 105.96162 -0.6858643
## 
## [[1]][[3]]
##           [,1]
## [1,] -1.343509
## [2,] -3.277906
## [3,]  4.915556

head(nn1$net.result[[1]])

##            [,1]
## 7663 0.06660508
## 8238 0.06654296
## 7362 0.09112853
## 8308 0.06659674
## 7475 0.06024468
## 9452 0.07474474

plot(nn1)

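A neural network starts from random initial weights, so fix them with set.seed

hidden = c(2,2) : number of hidden nodes in each hidden layer

stepmax = 1e+05 : maximum number of training steps (training stops at stepmax if it has not converged)

threshold = 0.1 : convergence criterion (training stops once the partial derivatives of the error function fall below this value)

act.fct = 'logistic' : activation function

linear.output = F : T if the Y variable is continuous (regression), F for classification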

# comparison
pred1 <- compute(nn1,covariate=test[,-1])
head(pred1$net.result,10)

##          [,1]
## 2  0.06170793
## 4  0.08481002
## 6  0.08420344
## 7  0.07952745
## 8  0.05923194
## 13 0.05210743
## 14 0.04533934
## 20 0.07912428
## 22 0.06195479
## 23 0.08710547

Unlike the other models, prediction here uses compute rather than predict

covariate = the test data with the response variable ("RESPOND") removed

library(pROC)
roccurve1 <- roc(test$RESPOND ~ as.vector(pred1$net.result))

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roccurve1, col="red", print.auc=TRUE, print.auc.adj=c(1,7), auc.polygon=TRUE)

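Decision Tree

A decision tree partitions the entire data set

Simple model (risk of underfitting) : easy to explain, lower predictive power.

Complex model (risk of overfitting) : harder to explain, higher predictive power up to the point where it overfits. -> A fully grown tree overfits easily

Advantages

- Works whether Y is categorical or continuous : classification tree, regression tree

- Faster to compute than a logistic model : allows quick decisions

- Univariate splits, one variable at a time

    - Can handle small n, large p data

- X variables

    - Less affected by the number of variables

    - Variable importance can be assessed

    - Minimal impact from outliers and missing values

    - Categorical X variables are easy to handle

        - No indicator variables needed

    - When X variables are categorical, decision trees are particularly good at finding interactions

    - Merges categories into larger groups, which works as X variable selection

        - Can be used as a preprocessing step before other methods

        - The selected variables can be passed to other analyses such as neural networks

- Ease of interpretation

- Also handles the case where one class of Y is a small minority

Disadvantages

- Misclassification is possible near the decision boundary (hyperplane)

- Classification (predictive) accuracy can be lower than with other methods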

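Characteristics of CART decision trees

Every node is split into exactly 2 children

Numeric X variables are split with an inequality; categorical X variables are split by membership in a subset of levels

Greedy search based on impurity

Each split uses only one X variable

How a classification tree splits : measuring impurity

Impurity functions

Gini impurity : $1-\sum^{K}_{j=1} P_j^2$
- K : number of classes of Y, P_j : probability of belonging to class j
- Maximum Gini impurity is 1 - 1/K (0.5 when K = 2) : reached under a perfectly uniform distribution (all P_j = 1/K)
- Minimum Gini impurity is 0 : lower impurity is better
Entropy : $-\sum^{K}_{j=1} P_j \log(P_j)$
Deviance : $-2\sum^{K}_{j=1} n_j \log(P_j)$

Measuring impurity : compute the impurity of the node before the split and of each child node after the split
Take the weighted average of the child-node impurities, weighted by sample size
Measure how much the impurity decreased from before the split to after it
Find the best split by greedy search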
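
The impurity calculation described above can be sketched in a few lines of base R. The function names (`gini`, `entropy`, `impurity_decrease`) and the toy data are assumptions for illustration; rpart performs an equivalent search internally over all candidate splits.

```r
# Minimal sketch: impurity of a node and the decrease achieved by a split.
gini <- function(y) {
  p <- prop.table(table(y))      # class proportions in the node
  1 - sum(p^2)
}
entropy <- function(y) {
  p <- prop.table(table(y))
  p <- p[p > 0]                  # avoid log(0)
  -sum(p * log(p))
}
# Parent impurity minus the size-weighted average of the child impurities
impurity_decrease <- function(y, left, impurity = gini) {
  n  <- length(y)
  nl <- sum(left)
  impurity(y) - (nl / n) * impurity(y[left]) - ((n - nl) / n) * impurity(y[!left])
}

# Toy data (made up): x separates the classes perfectly at 4.5
y <- c(0, 0, 0, 0, 1, 1, 1, 1)
x <- 1:8
gini(y)                           # 0.5, the maximum for two classes
impurity_decrease(y, x < 4.5)     # 0.5: both children are pure
impurity_decrease(y, x < 2.5)     # 1/6: a worse split
```

A greedy search simply evaluates `impurity_decrease` for every candidate cut point of every variable and keeps the largest value.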

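Feature selection criteria in CART

The best tree is found by tuning the parameters

maxdepth : limits the maximum depth of the leaf nodes
- Larger maxdepth, larger tree
minsplit : minimum number of observations a node must have before a split is attempted
- Smaller minsplit, larger tree
Minimum improvement cp : complexity parameter, the minimum amount by which a split must reduce the impurity
- Smaller cp, larger tree

Surrogate split

An alternative split rule used when the split variable has a missing value

- A split that behaves similarly to the main split is used as the surrogate split

- No imputation of NAs (replacing them with the mean or median) is needed

Pruning

Remove unnecessary branches to prevent overfitting

Split the data into three parts : training data, pruning data, test data

Use the pruning data
- The pruning data must be separate from the test data
Compute the prediction error by cross-validation
- Hold out part of the training data as pruning data, and repeat this several times with different parts
Choose the model with the smallest prediction error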
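
One common way to carry out this pruning with rpart is to read the cross-validated error (`xerror`) from the fitted tree's `cptable` and prune at the cp value that minimizes it. The sketch below uses the `kyphosis` data shipped with rpart as a stand-in, since the `hmeq` data used in these notes is not bundled with R; the control settings are illustrative choices, not recommendations.

```r
# Minimal sketch: grow a large tree, then prune at the cp with the lowest
# cross-validated error. kyphosis is a stand-in data set from the rpart package.
library(rpart)

set.seed(1234)   # cross-validation folds are random
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class",
             control = rpart.control(cp = 0.001, minsplit = 5))

cp_tab  <- fit$cptable                               # columns: CP, nsplit, rel error, xerror, xstd
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

printcp(fit)     # cross-validated error for each tree size
print(pruned)    # the pruned tree
```

The pruned tree is never larger than the original one; how much smaller it gets depends on the data and on the random folds.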

Code

# training and test data
set.seed(1234)
i = sample(1:nrow(hmeq), round(nrow(hmeq)*0.6)) #60% for training data, 40% for test data
train = hmeq[i,] 
test = hmeq[-i,]

library(rpart)
# default tree
tree0 <- rpart(BAD ~ ., data = train, method="class")

library(rpart.plot)
prp(tree0, type=4, extra=2, digits=3) # visualization

summary(tree0)

## Call:
## rpart(formula = BAD ~ ., data = train, method = "class")
##   n= 3576 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.03608480      0 1.0000000 1.0000000 0.03276489
## 2 0.02232747      4 0.8389716 0.8714479 0.03109437
## 3 0.01353180      6 0.7943166 0.8619756 0.03096181
## 4 0.01217862      9 0.7496617 0.8565629 0.03088544
## 5 0.01014885     13 0.7009472 0.8389716 0.03063408
## 6 0.01000000     15 0.6806495 0.8322057 0.03053610
## 
## Variable importance
##  DELINQ DEBTINC    CLNO    LOAN   VALUE    NINQ MORTDUE     YOJ     JOB   CLAGE 
##      36      16      10       9       9       6       6       3       2       2 
##  REASON 
##       2 
## 
## Node number 1: 3576 observations,    complexity param=0.0360848
##   predicted class=0  expected loss=0.2066555  P(node) =1
##     class counts:  2837   739
##    probabilities: 0.793 0.207 
##   left son=2 (2813 obs) right son=3 (763 obs)
##   Primary splits:
##       DELINQ  < 0.5      to the left,  improve=115.06830, (327 missing)
##       DEBTINC < 45.15491 to the left,  improve= 73.83211, (763 missing)
##       DEROG   < 0.5      to the left,  improve= 70.98588, (406 missing)
##       LOAN    < 5050     to the right, improve= 29.63802, (0 missing)
##       CLAGE   < 146.9731 to the right, improve= 27.65414, (181 missing)
##   Surrogate splits:
##       CLNO < 47.5     to the left,  agree=0.769, adj=0.011, (202 split)
## 
## Node number 2: 2813 observations,    complexity param=0.0360848
##   predicted class=0  expected loss=0.1397085  P(node) =0.7866331
##     class counts:  2420   393
##    probabilities: 0.860 0.140 
##   left son=4 (2776 obs) right son=5 (37 obs)
##   Primary splits:
##       DEBTINC < 44.67086 to the left,  improve=51.59042, (468 missing)
##       DEROG   < 1.5      to the left,  improve=21.94472, (303 missing)
##       CLNO    < 2.5      to the right, improve=18.44249, (125 missing)
##       CLAGE   < 174.7841 to the right, improve=16.86937, (180 missing)
##       LOAN    < 5050     to the right, improve=16.23187, (0 missing)
## 
## Node number 3: 763 observations,    complexity param=0.0360848
##   predicted class=0  expected loss=0.4534731  P(node) =0.2133669
##     class counts:   417   346
##    probabilities: 0.547 0.453 
##   left son=6 (700 obs) right son=7 (63 obs)
##   Primary splits:
##       DELINQ  < 4.5      to the left,  improve=27.68855, (4 missing)
##       DEBTINC < 43.64351 to the left,  improve=24.35699, (295 missing)
##       NINQ    < 2.5      to the left,  improve=23.61305, (25 missing)
##       DEROG   < 0.5      to the left,  improve=18.75757, (103 missing)
##       CLAGE   < 152.0833 to the right, improve=14.19029, (1 missing)
## 
## Node number 4: 2776 observations,    complexity param=0.01217862
##   predicted class=0  expected loss=0.129683  P(node) =0.7762864
##     class counts:  2416   360
##    probabilities: 0.870 0.130 
##   left son=8 (2729 obs) right son=9 (47 obs)
##   Primary splits:
##       CLNO  < 2.5      to the right, improve=19.34495, (125 missing)
##       LOAN  < 5050     to the right, improve=16.33724, (0 missing)
##       DEROG < 1.5      to the left,  improve=15.66174, (302 missing)
##       CLAGE < 91.55739 to the right, improve=13.27388, (180 missing)
##       VALUE < 21505    to the right, improve=12.14257, (25 missing)
## 
## Node number 5: 37 observations
##   predicted class=1  expected loss=0.1081081  P(node) =0.01034676
##     class counts:     4    33
##    probabilities: 0.108 0.892 
## 
## Node number 6: 700 observations,    complexity param=0.0360848
##   predicted class=0  expected loss=0.4128571  P(node) =0.1957494
##     class counts:   411   289
##    probabilities: 0.587 0.413 
##   left son=12 (637 obs) right son=13 (63 obs)
##   Primary splits:
##       NINQ    < 3.5      to the left,  improve=21.94178, (23 missing)
##       DEBTINC < 43.64351 to the left,  improve=20.35856, (253 missing)
##       CLAGE   < 152.0833 to the right, improve=16.58194, (1 missing)
##       DEROG   < 0.5      to the left,  improve=15.53125, (93 missing)
##       JOB     splits as  LRLRLRR,      improve=14.05560, (0 missing)
## 
## Node number 7: 63 observations
##   predicted class=1  expected loss=0.0952381  P(node) =0.01761745
##     class counts:     6    57
##    probabilities: 0.095 0.905 
## 
## Node number 8: 2729 observations,    complexity param=0.01014885
##   predicted class=0  expected loss=0.1220227  P(node) =0.7631432
##     class counts:  2396   333
##    probabilities: 0.878 0.122 
##   left son=16 (2652 obs) right son=17 (77 obs)
##   Primary splits:
##       LOAN  < 5050     to the right, improve=17.522390, (0 missing)
##       DEROG < 1.5      to the left,  improve=14.739050, (302 missing)
##       CLAGE < 172.5612 to the right, improve=12.230540, (145 missing)
##       VALUE < 21505    to the right, improve=10.899230, (25 missing)
##       NINQ  < 4.5      to the left,  improve= 7.890703, (260 missing)
## 
## Node number 9: 47 observations,    complexity param=0.01217862
##   predicted class=1  expected loss=0.4255319  P(node) =0.01314318
##     class counts:    20    27
##    probabilities: 0.426 0.574 
##   left son=18 (29 obs) right son=19 (18 obs)
##   Primary splits:
##       VALUE   < 75083.5  to the left,  improve=10.564930, (0 missing)
##       YOJ     < 1.5      to the left,  improve=10.563490, (11 missing)
##       JOB     splits as  LRRRR-R,      improve= 9.478723, (0 missing)
##       MORTDUE < 43381.5  to the left,  improve= 8.437500, (15 missing)
##       NINQ    < 4.5      to the right, improve= 7.347144, (0 missing)
##   Surrogate splits:
##       JOB     splits as  LRRLR-R,      agree=0.809, adj=0.500, (0 split)
##       REASON  splits as  LRL,          agree=0.702, adj=0.222, (0 split)
##       LOAN    < 15350    to the left,  agree=0.681, adj=0.167, (0 split)
##       MORTDUE < 50478.5  to the left,  agree=0.660, adj=0.111, (0 split)
##       NINQ    < 0.5      to the right, agree=0.660, adj=0.111, (0 split)
## 
## Node number 12: 637 observations,    complexity param=0.02232747
##   predicted class=0  expected loss=0.3736264  P(node) =0.178132
##     class counts:   399   238
##    probabilities: 0.626 0.374 
##   left son=24 (625 obs) right son=25 (12 obs)
##   Primary splits:
##       DEBTINC < 43.64351 to the left,  improve=17.02290, (213 missing)
##       LOAN    < 6050     to the right, improve=12.93077, (0 missing)
##       JOB     splits as  LRLRLLR,      improve=12.45810, (0 missing)
##       DELINQ  < 1.5      to the left,  improve=12.31275, (4 missing)
##       CLAGE   < 225.4314 to the right, improve=12.20299, (1 missing)
## 
## Node number 13: 63 observations
##   predicted class=1  expected loss=0.1904762  P(node) =0.01761745
##     class counts:    12    51
##    probabilities: 0.190 0.810 
## 
## Node number 16: 2652 observations
##   predicted class=0  expected loss=0.112368  P(node) =0.7416107
##     class counts:  2354   298
##    probabilities: 0.888 0.112 
## 
## Node number 17: 77 observations,    complexity param=0.01014885
##   predicted class=0  expected loss=0.4545455  P(node) =0.02153244
##     class counts:    42    35
##    probabilities: 0.545 0.455 
##   left son=34 (48 obs) right son=35 (29 obs)
##   Primary splits:
##       MORTDUE < 45250    to the right, improve=13.065200, (11 missing)
##       VALUE   < 76762    to the right, improve=10.079490, (2 missing)
##       YOJ     < 8        to the left,  improve= 6.400122, (8 missing)
##       CLAGE   < 77.9912  to the right, improve= 6.058741, (12 missing)
##       NINQ    < 0.5      to the left,  improve= 5.928205, (12 missing)
##   Surrogate splits:
##       VALUE < 57109    to the right, agree=0.955, adj=0.870, (11 split)
##       CLNO  < 12.5     to the right, agree=0.727, adj=0.217, (0 split)
##       LOAN  < 2700     to the right, agree=0.712, adj=0.174, (0 split)
##       JOB   splits as  LLLRLLR,      agree=0.712, adj=0.174, (0 split)
##       CLAGE < 77.9912  to the right, agree=0.682, adj=0.087, (0 split)
## 
## Node number 18: 29 observations,    complexity param=0.01217862
##   predicted class=0  expected loss=0.3103448  P(node) =0.00810962
##     class counts:    20     9
##    probabilities: 0.690 0.310 
##   left son=36 (20 obs) right son=37 (9 obs)
##   Primary splits:
##       YOJ    < 1.5      to the left,  improve=8.555556, (11 missing)
##       VALUE  < 48922    to the right, improve=5.517689, (0 missing)
##       CLNO   < 0.5      to the left,  improve=5.517689, (0 missing)
##       REASON splits as  LLR,          improve=5.198107, (0 missing)
##       LOAN   < 11600    to the right, improve=3.767328, (0 missing)
##   Surrogate splits:
##       VALUE  < 44400    to the right, agree=0.833, adj=0.571, (11 split)
##       REASON splits as  LRR,          agree=0.778, adj=0.429, (0 split)
##       NINQ   < 0.5      to the left,  agree=0.778, adj=0.429, (0 split)
##       CLNO   < 0.5      to the left,  agree=0.778, adj=0.429, (0 split)
##       LOAN   < 14700    to the left,  agree=0.722, adj=0.286, (0 split)
## 
## Node number 19: 18 observations
##   predicted class=1  expected loss=0  P(node) =0.005033557
##     class counts:     0    18
##    probabilities: 0.000 1.000 
## 
## Node number 24: 625 observations,    complexity param=0.02232747
##   predicted class=0  expected loss=0.3616  P(node) =0.1747763
##     class counts:   399   226
##    probabilities: 0.638 0.362 
##   left son=48 (574 obs) right son=49 (51 obs)
##   Primary splits:
##       LOAN   < 6050     to the right, improve=13.16430, (0 missing)
##       CLAGE  < 82.75537 to the right, improve=13.07259, (1 missing)
##       JOB    splits as  LRLRLLR,      improve=12.30388, (0 missing)
##       VALUE  < 60068    to the right, improve=11.87397, (30 missing)
##       DELINQ < 1.5      to the left,  improve=11.60892, (4 missing)
## 
## Node number 25: 12 observations
##   predicted class=1  expected loss=0  P(node) =0.003355705
##     class counts:     0    12
##    probabilities: 0.000 1.000 
## 
## Node number 34: 48 observations
##   predicted class=0  expected loss=0.2708333  P(node) =0.01342282
##     class counts:    35    13
##    probabilities: 0.729 0.271 
## 
## Node number 35: 29 observations
##   predicted class=1  expected loss=0.2413793  P(node) =0.00810962
##     class counts:     7    22
##    probabilities: 0.241 0.759 
## 
## Node number 36: 20 observations
##   predicted class=0  expected loss=0  P(node) =0.005592841
##     class counts:    20     0
##    probabilities: 1.000 0.000 
## 
## Node number 37: 9 observations
##   predicted class=1  expected loss=0  P(node) =0.002516779
##     class counts:     0     9
##    probabilities: 0.000 1.000 
## 
## Node number 48: 574 observations,    complexity param=0.0135318
##   predicted class=0  expected loss=0.3310105  P(node) =0.1605145
##     class counts:   384   190
##    probabilities: 0.669 0.331 
##   left son=96 (333 obs) right son=97 (241 obs)
##   Primary splits:
##       DELINQ < 1.5      to the left,  improve=11.324740, (4 missing)
##       DEROG  < 0.5      to the left,  improve= 9.756626, (72 missing)
##       CLAGE  < 82.75537 to the right, improve= 9.440944, (1 missing)
##       LOAN   < 35850    to the left,  improve= 9.396460, (0 missing)
##       JOB    splits as  LRLRLLR,      improve= 9.121343, (0 missing)
##   Surrogate splits:
##       CLNO   < 34.5     to the left,  agree=0.600, adj=0.038, (4 split)
##       REASON splits as  RLL,          agree=0.596, adj=0.030, (0 split)
##       JOB    splits as  LRLLLRL,      agree=0.596, adj=0.030, (0 split)
##       LOAN   < 6750     to the right, agree=0.589, adj=0.013, (0 split)
##       CLAGE  < 55.202   to the right, agree=0.588, adj=0.008, (0 split)
## 
## Node number 49: 51 observations,    complexity param=0.01217862
##   predicted class=1  expected loss=0.2941176  P(node) =0.01426174
##     class counts:    15    36
##    probabilities: 0.294 0.706 
##   left son=98 (17 obs) right son=99 (34 obs)
##   Primary splits:
##       CLNO    < 23.5     to the right, improve=11.294120, (0 missing)
##       MORTDUE < 42850    to the right, improve= 8.000000, (6 missing)
##       VALUE   < 54842.5  to the right, improve= 6.696429, (3 missing)
##       CLAGE   < 157.0398 to the right, improve= 6.352941, (0 missing)
##       YOJ     < 4.5      to the right, improve= 4.235294, (1 missing)
##   Surrogate splits:
##       CLAGE   < 151.0333 to the right, agree=0.824, adj=0.471, (0 split)
##       YOJ     < 6.5      to the right, agree=0.745, adj=0.235, (0 split)
##       DELINQ  < 2.5      to the right, agree=0.725, adj=0.176, (0 split)
##       VALUE   < 59018.5  to the right, agree=0.706, adj=0.118, (0 split)
##       MORTDUE < 52017    to the right, agree=0.686, adj=0.059, (0 split)
## 
## Node number 96: 333 observations
##   predicted class=0  expected loss=0.2492492  P(node) =0.09312081
##     class counts:   250    83
##    probabilities: 0.751 0.249 
## 
## Node number 97: 241 observations,    complexity param=0.0135318
##   predicted class=0  expected loss=0.4439834  P(node) =0.06739374
##     class counts:   134   107
##    probabilities: 0.556 0.444 
##   left son=194 (203 obs) right son=195 (38 obs)
##   Primary splits:
##       VALUE   < 126206.5 to the left,  improve=8.916962, (17 missing)
##       LOAN    < 35550    to the left,  improve=8.496324, (0 missing)
##       CLAGE   < 110.3866 to the right, improve=7.138055, (0 missing)
##       MORTDUE < 67970.5  to the left,  improve=4.980707, (11 missing)
##       JOB     splits as  -LLLLLR,      improve=4.457637, (0 missing)
##   Surrogate splits:
##       MORTDUE < 104486.5 to the left,  agree=0.915, adj=0.441, (12 split)
##       CLNO    < 54       to the left,  agree=0.875, adj=0.176, (5 split)
##       JOB     splits as  -LLLLLR,      agree=0.871, adj=0.147, (0 split)
##       REASON  splits as  RLL,          agree=0.866, adj=0.118, (0 split)
##       CLAGE   < 445.9928 to the left,  agree=0.866, adj=0.118, (0 split)
## 
## Node number 98: 17 observations
##   predicted class=0  expected loss=0.2352941  P(node) =0.004753915
##     class counts:    13     4
##    probabilities: 0.765 0.235 
## 
## Node number 99: 34 observations
##   predicted class=1  expected loss=0.05882353  P(node) =0.00950783
##     class counts:     2    32
##    probabilities: 0.059 0.941 
## 
## Node number 194: 203 observations,    complexity param=0.0135318
##   predicted class=0  expected loss=0.3842365  P(node) =0.05676734
##     class counts:   125    78
##    probabilities: 0.616 0.384 
##   left son=388 (162 obs) right son=389 (41 obs)
##   Primary splits:
##       MORTDUE < 48036.5  to the right, improve=6.814823, (11 missing)
##       CLAGE   < 110.3866 to the right, improve=5.615953, (0 missing)
##       VALUE   < 58870    to the right, improve=4.238080, (13 missing)
##       DELINQ  < 2.5      to the left,  improve=4.179322, (4 missing)
##       JOB     splits as  -LLRLLR,      improve=3.869381, (0 missing)
##   Surrogate splits:
##       VALUE < 61429    to the right, agree=0.818, adj=0.054, (6 split)
##       CLNO  < 8.5      to the right, agree=0.818, adj=0.054, (5 split)
##       LOAN  < 36050    to the left,  agree=0.812, adj=0.027, (0 split)
##       CLAGE < 313.9898 to the left,  agree=0.812, adj=0.027, (0 split)
## 
## Node number 195: 38 observations
##   predicted class=1  expected loss=0.2368421  P(node) =0.0106264
##     class counts:     9    29
##    probabilities: 0.237 0.763 
## 
## Node number 388: 162 observations
##   predicted class=0  expected loss=0.3148148  P(node) =0.04530201
##     class counts:   111    51
##    probabilities: 0.685 0.315 
## 
## Node number 389: 41 observations
##   predicted class=1  expected loss=0.3414634  P(node) =0.01146532
##     class counts:    14    27
##    probabilities: 0.341 0.659

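rpart captures interactions automatically through its successive splits, so do not add interaction terms to the formula.

method = "class" : classification tree

method = "anova" : regression tree (when Y is continuous)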

# maximal tree
set.seed(1234)
my.control <- rpart.control(xval=10, cp=0.001, minsplit=35)
tree1 <- rpart(BAD ~ ., data = train, method="class", control=my.control)
plot(tree1, uniform=T, compress=T, margin=0.05)

xval : number of folds used for cross-validation

printcp(tree1)

## 
## Classification tree:
## rpart(formula = BAD ~ ., data = train, method = "class", control = my.control)
## 
## Variables actually used in tree construction:
##  [1] CLAGE   CLNO    DEBTINC DELINQ  DEROG   JOB     LOAN    MORTDUE NINQ   
## [10] REASON  VALUE   YOJ    
## 
## Root node error: 739/3576 = 0.20666
## 
## n= 3576 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.0360848      0   1.00000 1.00000 0.032765
## 2  0.0223275      4   0.83897 0.86739 0.031038
## 3  0.0135318      6   0.79432 0.85521 0.030866
## 4  0.0121786      9   0.74966 0.85656 0.030885
## 5  0.0101488     12   0.71313 0.84303 0.030693
## 6  0.0060893     14   0.69283 0.77402 0.029662
## 7  0.0054127     16   0.68065 0.76184 0.029472
## 8  0.0040595     18   0.66982 0.74290 0.029171
## 9  0.0036085     22   0.65223 0.73613 0.029062
## 10 0.0033829     29   0.62111 0.74290 0.029171
## 11 0.0030447     39   0.57104 0.74425 0.029193
## 12 0.0020298     44   0.55480 0.74966 0.029279
## 13 0.0013532     46   0.55074 0.75372 0.029344
## 14 0.0010000     47   0.54939 0.76184 0.029472

cp : complexity parameter (cost-complexity)

nsplit : number of splits

rel error : relative error (on the training data)

xerror : cross-validation error

plotcp(tree1)

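Number of terminal nodes = number of splits + 1

1-SE rule : xerror ± 1 × xstd

Among the models whose error is effectively the same (those that fall inside the interval above), choose the simplest one

Errors below the dotted line in the plotcp() plot are not statistically different from one another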
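The 1-SE rule can also be applied to the CP table by hand. The sketch below retypes the columns printed by printcp(tree1) into a data frame (with a fitted model one would read tree1$cptable instead) and picks the smallest tree whose xerror lies within one xstd of the minimum. The names cptab, threshold, and chosen are illustrative only.

```r
# CP table copied from the printcp(tree1) output
cptab <- data.frame(
  CP     = c(0.0360848, 0.0223275, 0.0135318, 0.0121786, 0.0101488,
             0.0060893, 0.0054127, 0.0040595, 0.0036085, 0.0033829,
             0.0030447, 0.0020298, 0.0013532, 0.0010000),
  nsplit = c(0, 4, 6, 9, 12, 14, 16, 18, 22, 29, 39, 44, 46, 47),
  xerror = c(1.00000, 0.86739, 0.85521, 0.85656, 0.84303, 0.77402, 0.76184,
             0.74290, 0.73613, 0.74290, 0.74425, 0.74966, 0.75372, 0.76184),
  xstd   = c(0.032765, 0.031038, 0.030866, 0.030885, 0.030693, 0.029662,
             0.029472, 0.029171, 0.029062, 0.029171, 0.029193, 0.029279,
             0.029344, 0.029472)
)

# 1-SE threshold: minimum cross-validated error plus one standard error
best      <- which.min(cptab$xerror)
threshold <- cptab$xerror[best] + cptab$xstd[best]

# rows are ordered by tree size, so the first row at or below the
# threshold is the simplest tree that satisfies the rule
chosen <- cptab[which(cptab$xerror <= threshold)[1], ]
print(chosen)
```

For this table the rule selects the row with 16 splits (CP = 0.0054127). Any cp between that value and the next larger CP (0.0060893) prunes to the same tree, which is consistent with the cp = 0.006 passed to prune().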

# pruning
tree1.prun <- prune(tree1, cp = 0.006)
print(tree1.prun)

## n= 3576 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 3576 739 0 (0.79334452 0.20665548)  
##     2) DELINQ< 0.5 2813 393 0 (0.86029150 0.13970850)  
##       4) DEBTINC< 44.67086 2776 360 0 (0.87031700 0.12968300)  
##         8) CLNO>=2.5 2729 333 0 (0.87797728 0.12202272)  
##          16) LOAN>=5050 2652 298 0 (0.88763198 0.11236802)  
##            32) DEROG< 1.5 2575 265 0 (0.89708738 0.10291262) *
##            33) DEROG>=1.5 77  33 0 (0.57142857 0.42857143)  
##              66) JOB=Other,Sales 44  12 0 (0.72727273 0.27272727) *
##              67) JOB=,Mgr,Office,ProfExe,Self 33  12 1 (0.36363636 0.63636364) *
##          17) LOAN< 5050 77  35 0 (0.54545455 0.45454545)  
##            34) MORTDUE>=45250 48  13 0 (0.72916667 0.27083333) *
##            35) MORTDUE< 45250 29   7 1 (0.24137931 0.75862069) *
##         9) CLNO< 2.5 47  20 1 (0.42553191 0.57446809)  
##          18) VALUE< 75083.5 29   9 0 (0.68965517 0.31034483) *
##          19) VALUE>=75083.5 18   0 1 (0.00000000 1.00000000) *
##       5) DEBTINC>=44.67086 37   4 1 (0.10810811 0.89189189) *
##     3) DELINQ>=0.5 763 346 0 (0.54652687 0.45347313)  
##       6) DELINQ< 4.5 700 289 0 (0.58714286 0.41285714)  
##        12) NINQ< 3.5 637 238 0 (0.62637363 0.37362637)  
##          24) DEBTINC< 43.64351 625 226 0 (0.63840000 0.36160000)  
##            48) LOAN>=6050 574 190 0 (0.66898955 0.33101045)  
##              96) DELINQ< 1.5 333  83 0 (0.75075075 0.24924925) *
##              97) DELINQ>=1.5 241 107 0 (0.55601660 0.44398340)  
##               194) VALUE< 126206.5 203  78 0 (0.61576355 0.38423645)  
##                 388) MORTDUE>=48036.5 162  51 0 (0.68518519 0.31481481) *
##                 389) MORTDUE< 48036.5 41  14 1 (0.34146341 0.65853659) *
##               195) VALUE>=126206.5 38   9 1 (0.23684211 0.76315789) *
##            49) LOAN< 6050 51  15 1 (0.29411765 0.70588235)  
##              98) CLNO>=23.5 17   4 0 (0.76470588 0.23529412) *
##              99) CLNO< 23.5 34   2 1 (0.05882353 0.94117647) *
##          25) DEBTINC>=43.64351 12   0 1 (0.00000000 1.00000000) *
##        13) NINQ>=3.5 63  12 1 (0.19047619 0.80952381) *
##       7) DELINQ>=4.5 63   6 1 (0.09523810 0.90476190) *

prp(tree1.prun, type=4, extra=2, digits=3)

# comparison of trees
prob0 <- predict(tree0, newdata=test, type="prob") 
prob1 <- predict(tree1.prun, newdata=test, type="prob")

type="prob" : classification

type="vector" : regression

library(pROC)
roccurve0 <- roc(test$BAD ~ prob0[,2])

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

roccurve1 <- roc(test$BAD ~ prob1[,2])

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

plot(roccurve1, col="red", print.auc=TRUE, print.auc.adj=c(2.5,-8), auc.polygon=TRUE)
plot(roccurve0, col="green", add=TRUE,  print.auc=TRUE, print.auc.adj=c(-1,-5))

The regression tree example follows.

## Regression tree
library(rpart)
library(rpart.plot)

# default tree
tree_default = rpart(Price ~ Odometer + Color, data=usedcar2, method="anova")
prp(tree_default, type=4, extra=1, digits=3)

# Pruning
set.seed(1234)
my.control = rpart.control(xval=10, cp=0.001, minsplit = 5)
tree_max = rpart(Price ~ Odometer + Color, data=usedcar2, method="anova", control=my.control)
printcp(tree_max)

## 
## Regression tree:
## rpart(formula = Price ~ Odometer + Color, data = usedcar2, method = "anova", 
##     control = my.control)
## 
## Variables actually used in tree construction:
## [1] Color    Odometer
## 
## Root node error: 25739561/100 = 257396
## 
## n= 100 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.5562810      0  1.000000 1.02029 0.110236
## 2  0.1019455      1  0.443719 0.45653 0.069843
## 3  0.0333941      2  0.341773 0.41029 0.059533
## 4  0.0248981      3  0.308379 0.44397 0.057177
## 5  0.0208226      4  0.283481 0.42798 0.053579
## 6  0.0131915      6  0.241836 0.43228 0.055434
## 7  0.0121626      7  0.228645 0.43330 0.053882
## 8  0.0105416      8  0.216482 0.43131 0.054461
## 9  0.0094785      9  0.205940 0.43300 0.055745
## 10 0.0089991     10  0.196462 0.44330 0.058084
## 11 0.0089429     12  0.178464 0.43867 0.058210
## 12 0.0084675     14  0.160578 0.43745 0.058258
## 13 0.0064467     18  0.125873 0.43800 0.062779
## 14 0.0063479     20  0.112980 0.44632 0.064453
## 15 0.0037235     21  0.106632 0.44614 0.063252
## 16 0.0026928     22  0.102908 0.44974 0.063959
## 17 0.0022088     23  0.100215 0.43844 0.063390
## 18 0.0020300     24  0.098007 0.43734 0.063495
## 19 0.0015420     25  0.095977 0.43452 0.062900
## 20 0.0010353     26  0.094435 0.42770 0.061689
## 21 0.0010000     28  0.092364 0.42499 0.060838

plotcp(tree_max)

tree_prun = prune(tree_max, cp=0.04)
prp(tree_prun, type=4, extra=1, digits=3)

Ensemble Methods

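- Logistic regression : high bias but low variance

    - changing a single observation makes little difference

    - it depends on the likelihood function, so diversity is hard to obtain

- Decision trees and machine learning models : low bias but high variance

    - a single observation can change the fit substantially : instability, i.e. the decision boundary (hyperplane) varies a lot

    - diversity is easy to obtain

Ensemble : train an unstable model many times and combine the results, reducing bias and variance and therefore the error

- Error reduction through variance reduction : Bagging, Random Forest

- Error reduction through bias reduction : Boosting (also reduces variance, but reduces bias more)

- Error reduction through both variance and bias reduction : Mixture of Experts

The two key questions are how to obtain diversity and how to aggregate the individual models

-> Ideally each model performs well on its own while differing from the others

Bagging : builds trees by perturbing the data
Boosting : builds trees by perturbing the observation weights
Random Forest : bagging's bootstrap + random selection of variables

Bagging : Bootstrap AGGregatING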

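Classification by unweighted majority voting

Disadvantage

Loses interpretability

Advantage

Improves predictive power

Process

Create each training set by bootstrapping the full data set, i.e. sampling with replacement until it is the same size as the original data
- Bootstrap sample training sets :

$$T^{(1)}, \dots, T^{(B)}$$

Fit a decision tree to each bootstrap sample, repeating this for every sample
- Classifiers for each sample :

$$C_1(x, T^{(1)}), \dots, C_B(x, T^{(B)})$$

- Each decision tree is trained on a different training set

The final prediction aggregates the predictions of the individual trees by majority vote -> variance reduction
- Number of times x is classified as class j :

$$N_j=\sum ^B_{b=1}I[ C_b(x,T^{(b)})=y_j]$$

- Bagging :

$$C(x) = \arg\max_j N_j$$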
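The two mechanical pieces of this process, bootstrap sampling and majority voting, can be sketched in base R. The votes matrix below is made up for illustration (it is not produced by fitted trees), and boot_idx and bagged_class are hypothetical names.

```r
set.seed(1234)

# bootstrap: draw n row indices with replacement, once per classifier
n <- 10
B <- 5
boot_idx <- replicate(B, sample(n, n, replace = TRUE))  # n x B index matrix

# hypothetical class votes of the B classifiers for 3 new observations
votes <- matrix(c("good", "good", "bad",  "good", "bad",
                  "bad",  "bad",  "bad",  "good", "bad",
                  "good", "good", "good", "good", "bad"),
                nrow = 3, byrow = TRUE)

# N_j = number of classifiers voting for class j; C(x) = argmax_j N_j
bagged_class <- function(v) {
  N <- table(v)
  names(N)[which.max(N)]
}
final <- apply(votes, 1, bagged_class)
print(final)
```

An odd B avoids ties in the two-class case; with ties, which.max() simply returns the first class in alphabetical order.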

Code

library(rpart)
library(adabag)

## Loading required package: caret

## Loading required package: ggplot2

## Loading required package: foreach

## Loading required package: doParallel

## Loading required package: iterators

## Loading required package: parallel

set.seed(1234)
i = sample(1:nrow(german), round(nrow(german)*0.7)) #70% for training data, 30% for testdata
german.train = german[i,] 
german.test = german[-i,]


my.control <- rpart.control(xval=0, cp=0, minsplit=5, maxdepth=10)
bag.train.german <- bagging(y ~ ., data = german.train, mfinal=50, control=my.control)

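For bagging, grow large trees without pruning : each tree has higher variance and is less stable, but its bias is smaller

xval=0 : cross-validation is skipped, so the trees are not pruned

mfinal : number of trees; 100 to 200 is a common choice these days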

pred.bag.german <- predict.bagging(bag.train.german, newdata=german.test)
print(bag.train.german$importance)

##         age       check      credit     debtors    duration  employment 
##  6.98289316 19.74036811 12.53993235  1.73453022  9.14184507  6.64297839 
##     foreign     history     housing installment         job  numcredits 
##  0.09733153  7.43690856  1.29543971  1.79045071  2.29440866  0.69766383 
##      others    personal    property     purpose   residence residpeople 
##  1.84126119  2.48535083  5.06881899 10.61329877  3.17103304  0.63697943 
##     savings   telephone 
##  5.10899380  0.67951365

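These are variable importances; the numbers have no absolute meaning, only a relative one

Variable importance for a nonlinear model : the importances from the individual trees combined -> the result can also be used for a NN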

1-sum( diag(pred.bag.german$confusion) ) / sum(pred.bag.german$confusion)

## [1] 0.27

head(pred.bag.german$prob)

##      [,1] [,2]
## [1,] 0.44 0.56
## [2,] 0.54 0.46
## [3,] 0.00 1.00
## [4,] 0.34 0.66
## [5,] 0.46 0.54
## [6,] 0.06 0.94

library(pROC)
roccurve <- roc(german.test$y ~ pred.bag.german$prob[,1])

## Setting levels: control = bad, case = good

## Setting direction: controls > cases

plot(roccurve)

auc(roccurve)

## Area under the curve: 0.7418

Boosting

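Classification on unequally weighted training data

Repeat the classification while increasing the weights of misclassified observations, then combine the individual classifiers into the final classification

- Weights grow for observations near the decision boundary

- As a result both variance and bias decrease (bias decreases more)

For boosting, keep the trees small

A single 7:3 split into train and test sets leaves its own characteristics in the results, so it is better to repeat the split several times

AdaBoost

First assign the initial weights

Initial weights : $w_i = 1/N$

Repeat the following for m = 1 to M, classifying with the current weights and then updating them

Fit the classifier $C_m(x)$ using the weights $w_i$
Compute the error $err_m={\sum^N_{i=1}w_iI(y_i\not= C_m(X_i)) \over \sum^N_{i=1}w_i}$ : $I(y_i\not= C_m(X_i))=1$ for a misclassified observation and $I(y_i\not= C_m(X_i))=0$ for a correctly classified one
Compute the tree weight $\alpha_m = \log((1 - err_m)/err_m)$ : the smaller the error, the larger $\alpha_m$
Update the weights : $w_i \leftarrow w_i\ \text{exp}[\alpha_mI(y_i\not= C_m(X_i))]$
Rescale the new weights so that they sum to 1 : $\sum_i w_i = 1$

Combine the M classifiers, weighted by $\alpha_m$, into the final classifier $C_{AD}(X)=\text{sign}[\sum^M_{m=1}\alpha_mC_m(X)]$

That is, bagging gives every tree the same weight, whereas AdaBoost gives each tree its own weight $\alpha_m$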
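One round of the weight update can be traced by hand. The sketch below assumes five observations with labels coded -1/+1 and a made-up weak classifier that misclassifies only the last one; y, pred, and miss are illustrative names.

```r
# toy labels (-1/+1) and the predictions of one weak classifier C_m
y    <- c( 1,  1, -1, -1,  1)
pred <- c( 1,  1, -1, -1, -1)   # the last observation is misclassified

N <- length(y)
w <- rep(1 / N, N)              # initial weights w_i = 1/N

miss  <- as.numeric(y != pred)  # I(y_i != C_m(X_i))
err   <- sum(w * miss) / sum(w) # weighted error err_m
alpha <- log((1 - err) / err)   # tree weight alpha_m

w <- w * exp(alpha * miss)      # raise the weights of misclassified points
w <- w / sum(w)                 # rescale so the weights sum to 1

print(c(err = err, alpha = alpha))
print(w)
```

Here err_m = 0.2 and alpha_m = log(4), so after rescaling the single misclassified observation carries half of the total weight (0.5) and each of the others carries 0.125.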

Code

##### Boosting
library(rpart)
library(adabag)
set.seed(1234)
my.control <- rpart.control(xval=0, cp=0, maxdepth=1)
boo.german <- boosting(y ~ ., data = german, boos=T, mfinal=100, control=my.control)

summary(boo.german)

boo.german$trees

print(boo.german$importance)
importanceplot(boo.german)

pred.boo.german <- predict.boosting(boo.german, newdata=german) # ideally predict on a held-out test set
head(pred.boo.german$prob,10)

print(pred.boo.german$confusion)
1-sum(diag(pred.boo.german$confusion))/sum(pred.boo.german$confusion)
evol.german=errorevol(boo.german, newdata=german)
plot.errorevol(evol.german)


roccurve <- roc(german$y ~ pred.boo.german$prob[,1])
plot(roccurve)
auc(roccurve)

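xval=0 : no pruning

maxdepth=1 : a single split only (a stump) -> boosting generally works better with small trees (usually depth 1 to 4)

boos=T : each iteration draws a bootstrap sample in which heavily weighted observations can be drawn more often; this is how the AdaBoost weights are applied during boosting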

Gradient Boost

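Starting from $Y = h_1(X) + err_1$, the error is itself modeled as $err_1 = h_2(X) + err_2$, and so on : $F_m(X) = F_{m-1}(X) + w_mh_m(X)$

- That is, boosting continues by fitting the remaining error

The optimal weights are computed with the gradient descent algorithm

Finding the optimal weight for each function further improves predictive accuracy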
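A minimal base R sketch of this idea for squared-error loss, where the current residual plays the role of the error term (it equals the negative gradient of the loss). Everything here is an assumption made for illustration: the simulated data, the hand-written stump learner fit_stump(), the number of rounds M, and a fixed step size nu used in place of an optimized w_m.

```r
set.seed(1234)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)

# base learner h_m: a regression stump (one split, two leaf means)
fit_stump <- function(x, r) {
  u    <- sort(unique(x))
  cuts <- (head(u, -1) + tail(u, -1)) / 2        # candidate split points
  sse  <- sapply(cuts, function(s) {
    l <- r[x < s]
    g <- r[x >= s]
    sum((l - mean(l))^2) + sum((g - mean(g))^2)
  })
  s <- cuts[which.min(sse)]
  list(cut = s, left = mean(r[x < s]), right = mean(r[x >= s]))
}
predict_stump <- function(h, x) ifelse(x < h$cut, h$left, h$right)

M   <- 50                          # boosting rounds (assumed)
nu  <- 0.1                         # fixed step size standing in for w_m (assumed)
Fm  <- rep(mean(y), length(y))     # F_0: constant prediction
mse <- numeric(M)
for (m in seq_len(M)) {
  r  <- y - Fm                     # current error
  h  <- fit_stump(x, r)            # fit h_m to the error
  Fm <- Fm + nu * predict_stump(h, x)   # F_m = F_{m-1} + nu * h_m
  mse[m] <- mean((y - Fm)^2)
}
print(round(mse[c(1, 10, 50)], 4))
```

The training MSE shrinks with every round because each stump is a least-squares fit to what the previous rounds left unexplained. A held-out set would be needed to decide when to stop.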

Random Forest

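Diversity is obtained by randomly sampling variables

Bootstrapping and predictor subset selection are applied together to maximize the diversity of the individual trees
The aim is to make the trees as close to independent of one another as possible
At each node split, the search is over m (m<p) randomly chosen variables rather than all p variables
- if m=p, this is identical to bagging
A smaller m lowers the correlation between the trees -> but if m is too small, accuracy drops
Typical number of variables m per split : $m=\sqrt{p}$ for classification, $m={p\over 3}$ for regression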
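As a quick calculation of these rules of thumb, the sketch below assumes p = 20 predictors (the number of variables listed in the german importance output) and rounds down, as the randomForest package does for its default mtry.

```r
p <- 20                               # number of predictors (assumed)
m_class <- floor(sqrt(p))             # classification: about sqrt(p)
m_reg   <- max(floor(p / 3), 1)       # regression: about p/3, at least 1
print(c(classification = m_class, regression = m_reg))
```

These values are only starting points; mtry is usually tuned, for example by comparing OOB error across a few candidate values.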

Process

Apply the bootstrap : same as bagging
Obtain diversity by selecting a subset of variables : the greedy search covers only a random subset of the X variables
- that is, the X variables considered by the greedy search are chosen at random at every split of the decision tree
- if every variable were searched each time, the important variables would keep appearing near the root node no matter how many trees are grown

Ways of aggregating the results of the individual classifiers

majority voting : $\hat{Y}_{Ensemble}=\text{argmax}_i(\sum_{j=1}^n\delta({Y}_j=i),\ i\in\{0,1\})$

weighted voting (using each model's accuracy TrnAcc_j) : $\hat{Y}_{Ensemble}= \text{argmax}_i({\sum_{j=1}^n(TrnAcc_j)\delta(Y_j=i)\over\sum_{j=1}^n(TrnAcc_j)},\ i\in\{0,1\})$

weighted voting (using the predicted probability of each class) : $\hat{Y}_{Ensemble}= \text{argmax}_i({{1\over n}\sum_{j=1}^nP(Y_j=i)},\ i\in\{0,1\})$
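The three aggregation rules can give different answers for the same set of models. The sketch below uses made-up outputs from n = 3 models for a single observation; pred, trnacc, and p1 are hypothetical values chosen so that the rules disagree.

```r
# hypothetical outputs of n = 3 models for one observation, classes 0/1
pred    <- c(1, 0, 0)            # predicted class Y_j of each model
trnacc  <- c(0.95, 0.40, 0.40)   # accuracy TrnAcc_j of each model
p1      <- c(0.90, 0.45, 0.40)   # each model's predicted P(Y_j = 1)
classes <- c(0, 1)

# majority voting: count the votes for each class
counts <- sapply(classes, function(i) sum(pred == i))
maj    <- classes[which.max(counts)]

# accuracy-weighted voting: each vote counts in proportion to TrnAcc_j
wv <- sapply(classes, function(i) sum(trnacc * (pred == i)) / sum(trnacc))
acc_weighted <- classes[which.max(wv)]

# probability-weighted voting: average the predicted class probabilities
pv <- c(mean(1 - p1), mean(p1))
prob_weighted <- classes[which.max(pv)]

print(c(majority = maj, accuracy = acc_weighted, probability = prob_weighted))
```

Two of the three models vote for class 0, so majority voting returns 0, while both weighted rules return 1 because the single model voting for class 1 is far more accurate and far more confident than the other two.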

OOB Error : Out Of Bag error

Observations that are not drawn into a given bootstrap training set are used for validation

Treating these observations as test data gives an estimate of the test error
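How many observations end up out of bag can be checked directly. Each observation is missed by a bootstrap sample of size n with probability (1 - 1/n)^n, which approaches exp(-1), roughly 0.368, as n grows. The sketch below assumes n = 1000 and draws a single bootstrap sample.

```r
set.seed(1234)
n <- 1000
in_bag   <- sample(n, n, replace = TRUE)   # one bootstrap training set
oob      <- setdiff(seq_len(n), in_bag)    # rows never drawn: out-of-bag
oob_frac <- length(oob) / n
print(c(observed = oob_frac, limit = exp(-1)))
```

So each tree has roughly a third of the data available as its own validation set without any explicit train/test split.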

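Variable importance

Unlike regression, there is no information on how statistically significant each variable is (no p-value), so importance is estimated indirectly

- A relative notion, not an absolute one

- The same OOB-based idea of variable importance is often used with other models

Compute the OOB error e_j on the original OOB data
- j indexes the trees, j = 1, ..., m
Compute the permutation OOB error p_j on OOB data in which the values of one particular variable have been randomly shuffled
- only that single X variable is reordered; the other X variables and Y are left untouched
- the shuffled X variable effectively becomes a noise variable
Compute the mean and variance of the OOB error differences d_j = p_j − e_j
Compute the variable importance from the mean and standard deviation of the OOB error differences over all m trees
- importance of the i-th variable :

$$v_i={\bar{d}\over s_d}$$

with $\bar{d}=\sum_{j=1}^{m}d_j/m$ and $s_d^2=\sum_{j=1}^{m}(d_j-\bar{d})^2/(m-1)$

- the larger the increase in error rate, the more important the variable's role in the trees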
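The permutation procedure can be sketched without any tree package. The example below is an illustration under stated assumptions: the data are simulated, lm() fitted on bootstrap samples stands in for the individual trees, and the difference is taken as permuted error minus original error so that larger values mean a more important variable.

```r
set.seed(1234)
n   <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 3 * dat$x1 + rnorm(n)            # x1 matters, x2 is pure noise

m <- 50                                   # number of bootstrap models
importance <- sapply(c("x1", "x2"), function(v) {
  d <- numeric(m)
  for (j in seq_len(m)) {
    in_bag <- sample(n, n, replace = TRUE)
    oob    <- dat[-unique(in_bag), ]      # rows not drawn into the bag
    fit    <- lm(y ~ x1 + x2, data = dat[in_bag, ])
    e_j    <- mean((oob$y - predict(fit, oob))^2)     # OOB error
    perm      <- oob
    perm[[v]] <- sample(perm[[v]])        # shuffle only variable v
    p_j    <- mean((perm$y - predict(fit, perm))^2)   # permuted OOB error
    d[j]   <- p_j - e_j
  }
  mean(d) / sd(d)                         # v_i = d_bar / s_d
})
print(round(importance, 2))
```

Shuffling x1 destroys most of the model's predictive power, so its score is large, while shuffling the noise variable x2 barely changes the error and its score stays near zero.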

Code

library(randomForest)
set.seed(1234)
i = sample(1:nrow(german), round(nrow(german)*0.7)) #70% for training data, 30% for testdata
german.train = german[i,] 
german.test = german[-i,]

rf.train.german <- randomForest(y ~ ., data = german.train, ntree=100, mtry=5, 
                                importance=T, na.action=na.omit)
pred.rf.german <- predict(rf.train.german, newdata=german.test)

tab=table(german.test$y,pred.rf.german, dnn=c("Actual","Predicted"))
print(tab)
1-sum(diag(tab))/sum(tab)
prob.rf.german <- predict(rf.train.german, newdata=german.test, type="prob")
head(prob.rf.german)

Code : regression case

### Random Forest ###
## Regression
nobs=nrow(usedcar2)

# indicator variables
usedcar2$Ind1<-as.numeric(usedcar2$Color == 'white')
usedcar2$Ind2<-as.numeric(usedcar2$Color == 'silver')

# training and test data
set.seed(1234)
i = sample(1:nobs, round(nobs*0.7)) #70% for training data, 30% for testdata
train = usedcar2[i,] 
test = usedcar2[-i,]

# several models
tmpmodel = lm(Price ~ Odometer+Ind1+Ind2+Ind1:Odometer+Ind2:Odometer, data=train)
model1 = step(tmpmodel, direction = 'both')

## Start:  AIC=796.19
## Price ~ Odometer + Ind1 + Ind2 + Ind1:Odometer + Ind2:Odometer
## 
##                 Df Sum of Sq     RSS    AIC
## - Odometer:Ind2  1     19462 5152676 794.46
## <none>                       5133214 796.19
## - Odometer:Ind1  1    394444 5527658 799.37
## 
## Step:  AIC=794.46
## Price ~ Odometer + Ind1 + Ind2 + Odometer:Ind1
## 
##                 Df Sum of Sq     RSS    AIC
## <none>                       5152676 794.46
## + Odometer:Ind2  1     19462 5133214 796.19
## - Odometer:Ind1  1    389490 5542166 797.56
## - Ind2           1   1083262 6235938 805.81

summary(model1)

## 
## Call:
## lm(formula = Price ~ Odometer + Ind1 + Ind2 + Odometer:Ind1, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -680.08 -161.16    2.81  167.01  775.62 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.699e+04  2.639e+02  64.405  < 2e-16 ***
## Odometer      -6.410e-02  6.817e-03  -9.403 9.72e-14 ***
## Ind1          -9.388e+02  4.680e+02  -2.006 0.049018 *  
## Ind2           3.321e+02  8.983e+01   3.697 0.000451 ***
## Odometer:Ind1  2.689e-02  1.213e-02   2.217 0.030155 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 281.6 on 65 degrees of freedom
## Multiple R-squared:  0.7268, Adjusted R-squared:   0.71 
## F-statistic: 43.24 on 4 and 65 DF,  p-value: < 2.2e-16

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

model2 <- randomForest(Price ~ Odometer + Color, data = train, ntree=100, mtry=2,
                           importance=T, na.action=na.omit)

# predicted values
pred1 = predict(model1, newdata=test, type='response')
pred2 = predict(model2, newdata=test, type='response')

# predictive R^2
cor(test$Price, pred1)^2

## [1] 0.674462

cor(test$Price, pred2)^2

## [1] 0.6775222

# MAE
mean(abs(test$Price - pred1))

## [1] 235.4435

mean(abs(test$Price - pred2))

## [1] 209.5104

# MAPE
mean(abs(test$Price - pred1)/abs(test$Price))*100

## [1] 1.57497

mean(abs(test$Price - pred2)/abs(test$Price))*100

## [1] 1.408753

# variable importance of Random Forest
model2$importance

##           %IncMSE IncNodePurity
## Odometer 318591.9      16475559
## Color     22017.8       1372346

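Clustering : Unsupervised Learning

Most X variables should be continuous : clustering relies on distances, which cannot be computed for categorical variables (indicator variables are possible but not a good approach)

Cluster analysis uses the distances between observations, so the units of each variable strongly affect the result -> standardization is essential

Standardization

Standardizing observations
- unnecessary when the observations are people
- observations can be standardized through derived variables
  - when a derived variable is a ratio, it is customary to take its log
Standardizing variables
- subtract each variable's mean from its observed values and divide by that variable's standard deviation
- after standardization every variable has mean 0 and standard deviation 1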
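Variable standardization can be checked on the built-in USArrests data, both with scale() and written out by hand for one column. The names z and murder_z are illustrative.

```r
z <- scale(USArrests)            # subtract column means, divide by column SDs
print(round(colMeans(z), 10))    # column means after standardization
print(apply(z, 2, sd))           # column standard deviations after standardization

# the same computation by hand for a single variable
murder_z <- (USArrests$Murder - mean(USArrests$Murder)) / sd(USArrests$Murder)
print(all.equal(as.numeric(z[, "Murder"]), murder_z))
```

Without this step UrbanPop and Assault, which are measured on much larger scales than Murder, would dominate any distance computed on the raw data.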

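Clustering criterion : distance

Observations in the same cluster are similar on many attributes, while observations in different clusters are not
Distance is used because clustering is based on dissimilarity rather than similarity

distance measures

Euclidean distance : $d(x,y)=(\sum_{i=1}^p(x_i-y_i)^2)^{1/2}$ (the ordinary straight-line distance in the plane when p=2)
- a pairwise distance
Minkowski distance : $d(x,y)=(\sum_{i=1}^p|x_i-y_i|^m)^{1/m}$
- the general form; m = 2 gives Euclidean and m = 1 gives Manhattan
Mahalanobis distance : $d(x,y)=\sqrt{(x-y)^TS^{-1}(x-y)}$
- S is the covariance matrix, so correlation is taken into account; unlike the other distances it does not assume independent variables, which makes it the best in theory, but S is hard to estimate in practice
Manhattan distance : $d_{Manhattan}(x,y)=\sum^p_{i=1}|x_i-y_i|$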
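The four distances can be compared on a pair of points using base R. The points x = (0, 0) and y = (3, 4) and the identity covariance matrix S are chosen only to keep the arithmetic easy to check. Note that the argument p of dist() is the Minkowski power (m in the formula above), not the number of variables.

```r
x   <- c(0, 0)
y   <- c(3, 4)
pts <- rbind(x, y)

d_euclid    <- as.numeric(dist(pts, method = "euclidean"))
d_manhattan <- as.numeric(dist(pts, method = "manhattan"))
d_minkowski <- as.numeric(dist(pts, method = "minkowski", p = 3))

# mahalanobis() returns the squared distance; with S = I it equals Euclidean
S <- diag(2)
d_mahal <- sqrt(mahalanobis(x, center = y, cov = S))

print(c(euclidean = d_euclid, manhattan = d_manhattan,
        minkowski3 = d_minkowski, mahalanobis = d_mahal))
```

With a covariance matrix estimated from real data (for example cov(USArrests)) the Mahalanobis distance would differ from the Euclidean one, because correlated directions are down-weighted.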

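Hierarchical clustering

Advantages

- No need to know the number of clusters in advance -> it can be decided afterwards

- Easy to interpret

    - the dendrogram shows both the clustering process and its result

Disadvantages

- Slow to compute

    - the number of pairwise distances is nC2, so the cost grows in proportion to n^2

- Outliers must be checked beforehand

    - if outliers are present, an observation merged into the wrong cluster at an early step stays in that cluster until the analysis ends

    - the centroid method is less sensitive to outliers

Process

Start with N clusters, each containing a single observation
Build the N×N symmetric distance matrix D = {d_ik}
Find the closest pair of clusters U and V in D and merge them into a single cluster (UV)
Update the part of D that changes by computing the distance d_(UV)W between (UV) and every other cluster W
- single linkage :

d_(UV)W = min(d_UW, d_VW)

- complete linkage :

d_(UV)W = max(d_UW, d_VW)

- average linkage :

$$d_{(UV)W}={\sum^{n_{UV}}_{i=1}\sum^{n_{W}}_{j=1}d_{ij}\over n_{UV}n_W}$$

- centroid method :

d_(UV)W = distance between the centroids of cluster UV and W

- Ward's Method

Repeat the steps above N-1 times until all observations form a single cluster
Use the dendrogram to see which clusters were merged and at what level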
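The effect of the linkage rule on the merge distances can be seen on three points placed on a line at 0, 1, and 3, a toy configuration assumed here so that the pairwise distances are d12 = 1, d13 = 3, and d23 = 2.

```r
pts <- matrix(c(0, 1, 3), ncol = 1)   # three points on a line
D   <- dist(pts)                      # d12 = 1, d13 = 3, d23 = 2

# merge heights under three linkage rules
h_single   <- hclust(D, method = "single")$height
h_complete <- hclust(D, method = "complete")$height
h_average  <- hclust(D, method = "average")$height

print(rbind(single = h_single, complete = h_complete, average = h_average))
```

All three rules first merge points 1 and 2 at distance 1. The second merge then happens at min(3, 2) = 2 for single linkage, max(3, 2) = 3 for complete linkage, and (3 + 2)/2 = 2.5 for average linkage.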

dendrogram (icicle plot)

Only hierarchical clustering has a dendrogram : a graphic to illustrate the merges or divisions

Code

### Hierarchical Clustering
zUSArrests=scale(USArrests) # scale() : standardize the variables

dist(zUSArrests) # dist() : distance matrix

##                  Alabama    Alaska   Arizona  Arkansas California  Colorado
## Alaska         2.7037541                                                   
## Arizona        2.2935197 2.7006429                                         
## Arkansas       1.2898102 2.8260386 2.7177583                               
## California     3.2631104 3.0125415 1.3104842 3.7636409                     
## Colorado       2.6510673 2.3265187 1.3650307 2.8310512  1.2876185          
## Connecticut    3.2152975 4.7399125 3.2628575 2.6076395  4.0663898 3.3279920
## Delaware       2.0192927 3.6213633 1.9093696 1.8003239  3.0737852 2.5547456
## Florida        2.2981353 2.9967642 1.7493928 3.3721968  2.0250039 2.4458600
## Georgia        1.1314351 2.8194388 2.7871963 2.2117614  3.3780585 2.8649105
## Hawaii         3.3885300 4.5301340 3.2621208 2.9723097  3.6589083 2.8233524
## Idaho          2.9146623 4.0580555 3.5210071 1.7687255  4.4879436 3.4767685
## Illinois       1.8734993 3.2670626 1.0825512 2.4626424  1.9117469 1.7898322
## Indiana        2.0761411 3.3655952 2.6407486 1.4450503  3.4061273 2.3655622
## Iowa           3.4878952 4.7251910 4.1157513 2.4252661  4.9708591 3.9406898
## Kansas         2.2941096 3.6808173 2.7762838 1.5718411  3.6071725 2.6272281
## Kentucky       1.8475879 3.5440903 3.3567681 1.0598104  4.2463809 3.2274013
## Louisiana      0.7722224 2.9631431 2.2178519 2.0254276  3.0176625 2.6546743
## Maine          3.4851115 4.8322605 4.2961903 2.3621893  5.2699843 4.2713441
## Maryland       1.2896460 2.2777590 1.2117356 2.0582244  2.2312581 1.9667562
## Massachusetts  2.9874810 4.3729925 2.5162281 2.6881270  3.2156499 2.6522793
## Michigan       1.8814771 2.1154937 1.1940906 2.5895050  1.5146739 1.2363108
## Minnesota      3.2314338 4.4266606 3.5388450 2.3300992  4.3123134 3.3283853
## Mississippi    1.2831907 3.2554326 3.4551406 1.9318631  4.4200736 3.8491042
## Missouri       1.6309686 2.5360573 1.5958731 1.6717500  2.2891751 1.3127406
## Montana        2.3317271 3.6575988 3.3270869 1.2290066  4.2494176 3.1845338
## Nebraska       2.6625170 3.9136902 3.1641791 1.7240495  4.0197242 3.0034613
## Nevada         3.1024305 2.3443182 1.9260292 3.7086787  1.1968261 1.3988595
## New Hampshire  3.5619825 4.8650686 4.2430411 2.4949861  5.1270892 4.1126287
## New Jersey     2.6980230 4.1791832 2.1755787 2.7398478  2.7463023 2.3229870
## New Mexico     1.5993970 2.0580889 1.0376848 2.3183196  1.8010201 1.5467439
## New York       2.0723680 3.2903769 1.0725219 2.7478626  1.6787069 1.7363385
## North Carolina 1.6043662 3.2403071 3.1478947 2.0717938  4.2802569 3.8649275
## North Dakota   4.0614988 5.2110254 4.9319844 2.8756492  5.8660699 4.8014019
## Ohio           2.2698519 3.5903348 2.3585705 1.9617104  3.0133425 2.1188236
## Oklahoma       1.9570874 3.3416664 2.2648377 1.4224574  3.1488712 2.2263966
## Oregon         2.3705678 2.6990696 2.0008664 1.8477626  2.6574019 1.5331980
## Pennsylvania   2.5161340 4.1239537 2.9188907 1.9739986  3.7144562 2.8541709
## Rhode Island   3.3951297 5.0629572 3.0570151 3.0883430  3.8883995 3.4810739
## South Carolina 0.9157968 2.5640542 2.7992041 1.7074195  3.7546959 3.2131137
## South Dakota   3.0835587 4.2467198 4.1020099 1.8724822  5.0529153 3.9667318
## Tennessee      0.8407489 2.3362541 2.2989846 1.4254486  3.0119267 2.1972111
## Texas          1.6463225 3.1527905 1.6448574 2.3505545  2.1698156 1.7947199
## Utah           3.0906007 3.9480881 2.5244431 2.6049855  3.0701663 2.2461228
## Vermont        3.9791527 4.8707876 5.1003665 2.7442984  6.0323504 4.8924735
## Virginia       1.4859733 3.0492081 2.3106550 0.9971035  3.2159723 2.2622539
## Washington     2.6481824 3.2715253 2.1399117 2.1313402  2.7746720 1.7897920
## West Virginia  3.1243471 4.5004558 4.4974190 1.9951691  5.4883565 4.4210375
## Wisconsin      3.5047330 4.8711543 3.9425867 2.6102451  4.7354960 3.7846917
## Wyoming        1.8291027 3.4993456 2.6923028 0.9912639  3.7242766 2.8211492
##                Connecticut  Delaware   Florida   Georgia    Hawaii     Idaho
## Alaska                                                                      
## Arizona                                                                     
## Arkansas                                                                    
## California                                                                  
## Colorado                                                                    
## Connecticut                                                                 
## Delaware         1.7568475                                                  
## Florida          4.4700701 3.0614170                                        
## Georgia          3.9738227 2.9838715 2.1812958                              
## Hawaii           1.3843291 2.4748807 4.3596338 3.8105218                    
## Idaho            1.6354214 2.0382540 4.6999827 3.8005715 2.3658101          
## Illinois         2.7400560 1.5584719 1.7711863 2.3135778 2.7329756 3.2728945
## Indiana          1.6147898 1.6973340 3.6150778 2.6924143 1.5460727 1.4923351
## Iowa             1.5470089 2.6068606 5.2682765 4.2517889 2.1564575 0.8584962
## Kansas           1.2280424 1.5510864 3.8424558 3.0071474 1.4648766 1.2103118
## Kentucky         2.3346386 2.2514939 3.9474983 2.4408198 2.5203345 1.6565236
## Louisiana        3.5329409 2.3266996 1.7529677 0.8592544 3.5687157 3.5283772
## Maine            1.8792141 2.6560808 5.3946798 4.3334217 2.7160558 0.8486112
## Maryland         3.4968269 1.9624834 1.4355204 1.8388691 3.6148670 3.4014584
## Massachusetts    0.9468199 1.4382527 3.7753087 3.6706708 1.3276676 2.2201020
## Michigan         3.7037870 2.5165292 1.3357020 1.9185489 3.4123472 3.7775301
## Minnesota        0.9843793 2.1652930 4.7635252 3.9621842 1.4673850 1.0124936
## Mississippi      4.1762631 3.0510628 3.0886673 1.5828594 4.4777223 3.6002946
## Missouri         2.4383227 1.6723281 2.5182466 2.1021909 2.1832480 2.4697182
## Montana          1.8584328 2.0306850 4.2696476 3.0967288 2.2488801 0.8286936
## Nebraska         1.2116949 1.8113430 4.3082894 3.4295510 1.6628657 0.7515014
## Nevada           4.5868149 3.5920897 1.9500388 2.9023041 4.0281974 4.7300228
## New Hampshire    1.6169000 2.6744233 5.3778074 4.3427351 2.3112009 0.9249563
## New Jersey       1.6108823 1.5808719 3.1900596 3.1989350 1.5050500 2.7425260
## New Mexico       3.6233659 2.2271650 1.2965798 1.9015384 3.5506088 3.5883476
## New York         3.0239174 1.8992106 1.5730970 2.3634498 2.9055803 3.5910319
## North Carolina   4.1894604 2.7475286 2.9994188 2.3351307 4.7330517 3.5929592
## North Dakota     2.5099838 3.3615239 6.0356613 4.8596758 3.1974906 1.4144557
## Ohio             1.4443671 1.5838515 3.3897305 2.8043208 1.1494313 1.9647327
## Oklahoma         1.4510623 1.1802929 3.3553471 2.7121515 1.6585736 1.5168111
## Oregon           2.1756954 1.7742778 3.3399718 2.9998878 2.0031861 1.9757247
## Pennsylvania     0.8721491 1.5894850 3.9389869 3.1817981 1.2119256 1.5171866
## Rhode Island     1.0756115 1.6230495 4.2314871 4.1832075 2.0590981 2.4592705
## South Carolina   4.0127954 2.7039667 2.5295912 1.3970074 4.2531214 3.4549959
## South Dakota     2.2397424 2.6722813 5.1015141 3.8729745 2.8044891 0.8070290
## Tennessee        3.2302375 2.3195070 2.3992285 1.0122252 3.0747375 2.9234395
## Texas            2.8734475 2.0031365 1.8537984 1.7575559 2.5901696 3.3172180
## Utah             1.2825907 1.8080931 3.9274528 3.7183994 1.0709720 2.0268663
## Vermont          3.2066152 3.7144653 6.0766416 4.7091538 3.7208347 1.7797462
## Virginia         1.9277004 1.4088230 3.1515587 2.2249559 2.0479238 1.6999289
## Washington       1.6963486 1.6350170 3.5570666 3.3016469 1.5452901 1.8861921
## West Virginia    2.7117590 3.0381601 5.3004067 3.8545331 3.2831874 1.4398440
## Wisconsin        1.0354597 2.4410507 5.1085370 4.2281611 1.6666970 1.2105401
## Wyoming          1.6218573 1.2586225 3.6325811 2.7329062 2.1883414 1.1687896
##                 Illinois   Indiana      Iowa    Kansas  Kentucky Louisiana
## Alaska                                                                    
## Arizona                                                                   
## Arkansas                                                                  
## California                                                                
## Colorado                                                                  
## Connecticut                                                               
## Delaware                                                                  
## Florida                                                                   
## Georgia                                                                   
## Hawaii                                                                    
## Idaho                                                                     
## Illinois                                                                  
## Indiana        2.2027081                                                  
## Iowa           3.7380070 1.7786548                                        
## Kansas         2.3228505 0.4287712 1.4699265                              
## Kentucky       2.8478883 1.1790552 1.9426473 1.3020180                    
## Louisiana      1.6535178 2.4957547 4.0359614 2.7284126 2.4221964          
## Maine          3.9342034 2.1029158 0.6457158 1.7913753 1.9925855 4.0901924
## Maryland       1.3429997 2.5430878 4.0642448 2.7400943 2.8229479 1.2739137
## Massachusetts  2.0080982 1.6615695 2.3510287 1.4343401 2.6284451 3.1524549
## Michigan       1.3959090 2.6118471 4.3248636 2.9020920 3.1163494 1.6677999
## Minnesota      3.1558788 1.3184866 0.7644384 0.9745872 1.9333640 3.6905974
## Mississippi    3.0869477 3.0859068 4.1603272 3.2683740 2.3898884 1.6268879
## Missouri       1.3552973 1.2203931 2.9398546 1.5192717 1.9677184 1.8362172
## Montana        2.9659043 1.0033431 1.2403561 0.9170466 0.8523702 2.9444756
## Nebraska       2.7962196 0.8570429 0.9821819 0.5279092 1.4219429 3.1706333
## Nevada         2.3891753 3.5278633 5.2227312 3.8391728 4.1644286 2.8410670
## New Hampshire  3.8490624 1.9278736 0.2058539 1.6084091 2.0093558 4.1168122
## New Jersey     1.4562775 1.7638332 2.9122979 1.7071034 2.6914828 2.6826380
## New Mexico     1.3393276 2.5909993 4.2131394 2.8356373 3.0007332 1.4911656
## New York       0.3502188 2.4628527 4.0411586 2.6096016 3.1213366 1.7495096
## North Carolina 3.0124311 3.3437548 4.2973973 3.4387635 2.8798080 1.9868618
## North Dakota   4.6139615 2.6587932 1.0534375 2.3970805 2.4482563 4.6977846
## Ohio           1.8124981 0.6976320 2.1610242 0.7817000 1.7726720 2.4996969
## Oklahoma       1.8439860 0.5303259 1.9391446 0.5198728 1.4623483 2.3535566
## Oregon         2.0743434 1.1780815 2.4662295 1.3426890 2.1388677 2.7490592
## Pennsylvania   2.3134187 0.8412900 1.5708895 0.5456840 1.5944097 2.8440845
## Rhode Island   2.5057761 2.3335609 2.5453686 2.0087021 3.0457816 3.5648047
## South Carolina 2.6163680 2.8469842 4.1015324 3.0609333 2.4166385 1.3151908
## South Dakota   3.8004708 1.8411735 0.9886706 1.6701106 1.5114990 3.7457555
## Tennessee      1.9478353 1.8100316 3.4176329 2.1533060 1.7489942 1.1298534
## Texas          0.8241352 2.0035762 3.6962443 2.2378289 2.5297839 1.3325285
## Utah           2.2771632 1.4019666 2.1682069 1.2751603 2.5461745 3.3440990
## Vermont        4.8624402 2.8667983 1.7298425 2.7298377 2.3888326 4.6795933
## Virginia       1.8624960 0.6127246 2.1704984 0.8351949 1.0918624 1.9554079
## Washington     2.0612962 1.1405746 2.2502832 1.1579118 2.2630242 2.9705622
## West Virginia  4.1148082 2.2478563 1.5256890 2.1244674 1.5236299 3.7947215
## Wisconsin      3.4790637 1.6806129 0.6318069 1.3242947 2.0950212 3.9559184
## Wyoming        2.2643574 0.8898783 1.7194683 0.7588728 1.0694408 2.3837077
##                    Maine  Maryland Massachusetts  Michigan Minnesota
## Alaska                                                              
## Arizona                                                             
## Arkansas                                                            
## California                                                          
## Colorado                                                            
## Connecticut                                                         
## Delaware                                                            
## Florida                                                             
## Georgia                                                             
## Hawaii                                                              
## Idaho                                                               
## Illinois                                                            
## Indiana                                                             
## Iowa                                                                
## Kansas                                                              
## Kentucky                                                            
## Louisiana                                                           
## Maine                                                               
## Maryland       4.1259083                                            
## Massachusetts  2.6920282 2.9743193                                  
## Michigan       4.5333420 1.0800988     3.0576915                    
## Minnesota      1.2980362 3.6448929     1.6587245 3.7995101          
## Mississippi    4.0014591 2.2992240     4.1217248 2.9722824 4.1067600
## Missouri       3.2055955 1.5705755     1.9810531 1.4068840 2.4088795
## Montana        1.3271199 3.0249456     2.2919046 3.3348908 1.2662635
## Nebraska       1.3218907 3.1309065     1.6863806 3.3478988 0.6083415
## Nevada         5.5153139 2.2551337     3.8556049 1.2609417 4.6391114
## New Hampshire  0.4995971 4.1663744     2.4573524 4.4646172 0.9279247
## New Jersey     3.2532459 2.6263456     0.7977642 2.5678440 2.2254151
## New Mexico     4.3460538 0.5353893     3.0274701 0.5782474 3.7377675
## New York       4.2595904 1.4362170     2.2479437 1.2897453 3.4391596
## North Carolina 4.0631653 2.0542355     4.0773401 3.0232021 4.2219622
## North Dakota   0.7305609 4.7423030     3.3446903 5.1171939 1.8065731
## Ohio           2.5455752 2.5061694     1.1567960 2.4459855 1.5216293
## Oklahoma       2.1929825 2.2492942     1.3383233 2.4336743 1.4198434
## Oregon         2.7813372 2.2466329     1.8709252 2.1626274 1.9270100
## Pennsylvania   1.9197571 2.9585539     1.1337883 3.1048542 1.0106613
## Rhode Island   2.7331079 3.4379146     0.9440940 3.7320501 2.0310592
## South Carolina 4.0015575 1.6165582     3.8310425 2.3233363 3.9484630
## South Dakota   0.7812991 3.7991896     2.8925136 4.1744724 1.4990317
## Tennessee      3.5420469 1.5202431     2.9678843 1.5970196 3.1023238
## Texas          3.9386296 1.5431868     2.2593978 1.2888621 3.1438264
## Utah           2.6218087 3.0338001     0.9015809 2.9441421 1.4177147
## Vermont        1.4253680 4.7430576     3.9277625 5.1250778 2.4019924
## Virginia       2.3474650 2.0124420     1.8503795 2.2439957 1.7932233
## Washington     2.6292546 2.5434911     1.3472994 2.4715215 1.5955418
## West Virginia  1.1818120 4.0251562     3.3782752 4.4668346 2.0791705
## Wisconsin      1.1485830 4.0091486     1.8882704 4.2034334 0.4940832
## Wyoming        1.7665064 2.4041294     1.8201580 2.8324573 1.4845967
##                Mississippi  Missouri   Montana  Nebraska    Nevada
## Alaska                                                            
## Arizona                                                           
## Arkansas                                                          
## California                                                        
## Colorado                                                          
## Connecticut                                                       
## Delaware                                                          
## Florida                                                           
## Georgia                                                           
## Hawaii                                                            
## Idaho                                                             
## Illinois                                                          
## Indiana                                                           
## Iowa                                                              
## Kansas                                                            
## Kentucky                                                          
## Louisiana                                                         
## Maine                                                             
## Maryland                                                          
## Massachusetts                                                     
## Michigan                                                          
## Minnesota                                                         
## Mississippi                                                       
## Missouri         2.8692946                                        
## Montana          3.0015255 2.0313649                              
## Nebraska         3.5269565 1.9651798 0.7389936                    
## Nevada           4.1064793 2.3489003 4.3243112 4.2628916          
## New Hampshire    4.1895936 3.0885710 1.3329504 1.1300720 5.3871427
## New Jersey       3.8894324 1.7079555 2.5912431 2.1246377 3.3464214
## New Mexico       2.6557350 1.4579057 3.1915871 3.2494088 1.7234839
## New York         3.2655822 1.5284764 3.2662661 3.0925340 2.1674148
## North Carolina   1.1826891 3.0224849 3.2209267 3.6500186 4.1773437
## North Dakota     4.4753078 3.7811273 1.8291157 1.9038740 6.0519445
## Ohio             3.4148987 1.1327425 1.6436336 1.2654510 3.2930712
## Oklahoma         3.0466140 1.0927654 1.2225315 0.9674809 3.4108696
## Oregon           3.5033774 0.9974171 1.8044622 1.5727910 2.8581280
## Pennsylvania     3.4971746 1.7793568 1.3246445 0.8483058 4.0392694
## Rhode Island     4.3875001 2.7475003 2.6888576 2.1303973 4.6185330
## South Carolina   0.7865674 2.3846001 2.9024302 3.3517226 3.4427701
## South Dakota     3.5355186 2.8862448 0.8857149 1.2591419 5.1416772
## Tennessee        1.8269569 1.2413874 2.2494023 2.5526834 2.6666268
## Texas            2.8431727 1.1654171 2.8298991 2.7568751 2.2765693
## Utah             4.2571173 1.7478909 2.0956369 1.4573012 3.5868975
## Vermont          4.2046660 3.8803394 1.9261350 2.2952287 6.0437845
## Virginia         2.5383053 0.9787310 1.1556682 1.2472262 3.3001850
## Washington       3.8140404 1.2502752 1.8442691 1.3859985 3.1570805
## West Virginia    3.3281129 3.2538044 1.2758193 1.8117833 5.4963193
## Wisconsin        4.2987974 2.8171535 1.4916365 0.9719877 5.0751736
## Wyoming          2.6813279 1.6073860 0.8150071 0.9268202 3.9202716
##                New Hampshire New Jersey New Mexico  New York North Carolina
## Alaska                                                                     
## Arizona                                                                    
## Arkansas                                                                   
## California                                                                 
## Colorado                                                                   
## Connecticut                                                                
## Delaware                                                                   
## Florida                                                                    
## Georgia                                                                    
## Hawaii                                                                     
## Idaho                                                                      
## Illinois                                                                   
## Indiana                                                                    
## Iowa                                                                       
## Kansas                                                                     
## Kentucky                                                                   
## Louisiana                                                                  
## Maine                                                                      
## Maryland                                                                   
## Massachusetts                                                              
## Michigan                                                                   
## Minnesota                                                                  
## Mississippi                                                                
## Missouri                                                                   
## Montana                                                                    
## Nebraska                                                                   
## Nevada                                                                     
## New Hampshire                                                              
## New Jersey         3.0269198                                               
## New Mexico         4.3360809  2.6208087                                    
## New York           4.1586415  1.6344744  1.3324096                         
## North Carolina     4.3157112  3.9418824  2.5348334 3.2163998               
## North Dakota       0.9231894  3.9166205  4.9450519 4.9325292      4.5836787
## Ohio               2.3095495  1.1099823  2.4960904 2.0434995      3.6205693
## Oklahoma           2.0697098  1.4711183  2.3426252 2.1367108      3.1366639
## Oregon             2.6377191  1.9738854  2.1553130 2.2727718      3.5095191
## Pennsylvania       1.6822035  1.4216058  3.0619915 2.5949374      3.6803956
## Rhode Island       2.5813199  1.4668378  3.6032966 2.7682543      4.2185789
## South Carolina     4.1596914  3.5826726  1.9596343 2.7755634      1.0476313
## South Dakota       0.9874611  3.3318222  3.9969513 4.1124693      3.6955387
## Tennessee          3.5298430  2.6339707  1.5528304 2.0847931      2.3374653
## Texas              3.8178258  1.6226525  1.4418241 0.8457697      3.0857436
## Utah               2.3304873  1.3141843  2.9843796 2.4826984      4.2680823
## Vermont            1.6716127  4.4005416  4.9416825 5.1704762      4.3880034
## Virginia           2.2878085  1.8255601  2.1341562 2.1439207      2.7517523
## Washington         2.4214987  1.5759539  2.4796057 2.2747965      3.8055684
## West Virginia      1.4648924  3.7402121  4.2681325 4.4279608      3.5978058
## Wisconsin          0.7155628  2.4671212  4.1327758 3.7687073      4.4429456
## Wyoming            1.7950754  2.0372127  2.6286722 2.5890441      2.7501141
##                North Dakota      Ohio  Oklahoma    Oregon Pennsylvania
## Alaska                                                                
## Arizona                                                               
## Arkansas                                                              
## California                                                            
## Colorado                                                              
## Connecticut                                                           
## Delaware                                                              
## Florida                                                               
## Georgia                                                               
## Hawaii                                                                
## Idaho                                                                 
## Illinois                                                              
## Indiana                                                               
## Iowa                                                                  
## Kansas                                                                
## Kentucky                                                              
## Louisiana                                                             
## Maine                                                                 
## Maryland                                                              
## Massachusetts                                                         
## Michigan                                                              
## Minnesota                                                             
## Mississippi                                                           
## Missouri                                                              
## Montana                                                               
## Nebraska                                                              
## Nevada                                                                
## New Hampshire                                                         
## New Jersey                                                            
## New Mexico                                                            
## New York                                                              
## North Carolina                                                        
## North Dakota                                                          
## Ohio              3.1448279                                           
## Oklahoma          2.8246690 0.6483903                                 
## Oregon            3.2862071 1.2407607 1.0734082                       
## Pennsylvania      2.5555137 0.7781298 0.8180221 1.7293732             
## Rhode Island      3.4042300 1.9659747 1.9746699 2.6621371    1.6369255
## South Carolina    4.5104172 3.1289884 2.7470931 3.0134453    3.3429642
## South Dakota      1.0324944 2.4394250 2.0340486 2.4988870    1.9790714
## Tennessee         4.0623149 2.0167804 1.8500296 2.0306758    2.4343114
## Texas             4.5749422 1.6711510 1.8312655 2.1053000    2.2460705
## Utah              3.1738212 1.0154223 1.2372916 1.2825152    1.2529078
## Vermont           0.9824857 3.4825859 3.1010306 3.4262789    3.0270572
## Virginia          2.9443461 0.9774388 0.5646254 1.2664430    1.1769236
## Washington        3.1725909 0.9725013 0.9586525 0.5935343    1.3993323
## West Virginia     1.2716808 2.8650371 2.4631736 3.0349855    2.3799278
## Wisconsin         1.6216339 1.8649801 1.7916829 2.4088700    1.2204658
## Wyoming           2.4170757 1.3086480 0.7366465 1.6013015    1.0684605
##                Rhode Island South Carolina South Dakota Tennessee     Texas
## Alaska                                                                     
## Arizona                                                                    
## Arkansas                                                                   
## California                                                                 
## Colorado                                                                   
## Connecticut                                                                
## Delaware                                                                   
## Florida                                                                    
## Georgia                                                                    
## Hawaii                                                                     
## Idaho                                                                      
## Illinois                                                                   
## Indiana                                                                    
## Iowa                                                                       
## Kansas                                                                     
## Kentucky                                                                   
## Louisiana                                                                  
## Maine                                                                      
## Maryland                                                                   
## Massachusetts                                                              
## Michigan                                                                   
## Minnesota                                                                  
## Mississippi                                                                
## Missouri                                                                   
## Montana                                                                    
## Nebraska                                                                   
## Nevada                                                                     
## New Hampshire                                                              
## New Jersey                                                                 
## New Mexico                                                                 
## New York                                                                   
## North Carolina                                                             
## North Dakota                                                               
## Ohio                                                                       
## Oklahoma                                                                   
## Oregon                                                                     
## Pennsylvania                                                               
## Rhode Island                                                               
## South Carolina    4.1861320                                                
## South Dakota      3.1262712      3.5215978                                 
## Tennessee         3.5743861      1.4375120    3.0589938                    
## Texas             2.8757996      2.4532276    3.7101039 1.4712840          
## Utah              1.7565845      3.8912317    2.6823382 2.8678113 2.4039834
## Vermont           4.1104165      4.2668977    1.0856574 3.9356721 4.7444455
## Virginia          2.4330133      2.2636538    2.0316897 1.3514491 1.6921625
## Washington        2.1743525      3.3802314    2.5083824 2.3809584 2.1635337
## West Virginia     3.5400858      3.4651680    0.7108812 3.1707450 3.9586581
## Wisconsin         2.0779526      4.2190973    1.5437375 3.4257189 3.4539515
## Wyoming           2.1726807      2.5059056    1.5644785 1.9298669 2.2564704
##                     Utah   Vermont  Virginia Washington West Virginia Wisconsin
## Alaska                                                                         
## Arizona                                                                        
## Arkansas                                                                       
## California                                                                     
## Colorado                                                                       
## Connecticut                                                                    
## Delaware                                                                       
## Florida                                                                        
## Georgia                                                                        
## Hawaii                                                                         
## Idaho                                                                          
## Illinois                                                                       
## Indiana                                                                        
## Iowa                                                                           
## Kansas                                                                         
## Kentucky                                                                       
## Louisiana                                                                      
## Maine                                                                          
## Maryland                                                                       
## Massachusetts                                                                  
## Michigan                                                                       
## Minnesota                                                                      
## Mississippi                                                                    
## Missouri                                                                       
## Montana                                                                        
## Nebraska                                                                       
## Nevada                                                                         
## New Hampshire                                                                  
## New Jersey                                                                     
## New Mexico                                                                     
## New York                                                                       
## North Carolina                                                                 
## North Dakota                                                                   
## Ohio                                                                           
## Oklahoma                                                                       
## Oregon                                                                         
## Pennsylvania                                                                   
## Rhode Island                                                                   
## South Carolina                                                                 
## South Dakota                                                                   
## Tennessee                                                                      
## Texas                                                                          
## Utah                                                                           
## Vermont        3.6546040                                                       
## Virginia       1.7612066 3.0638337                                             
## Washington     0.6940667 3.4804319 1.3809295                                   
## West Virginia  3.2680139 1.0380554 2.3353210  3.0846553                        
## Wisconsin      1.8082282 2.3518637 2.1266497  2.0637823     2.0308890          
## Wyoming        1.8552036 2.6299335 0.7038309  1.5929546     1.8821600 1.7446366

hc1=hclust(dist(zUSArrests),method="average") # hclust() : hierarchical clustering
plot(hc1, hang = -1)
rect.hclust(hc1, k=5)

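cluster <- cutree(hc1, k=5) # cutree() : assigns a cluster number (1..k) to each observation

# cluster centers, computed on the original (unscaled) USArrests variables
cent <- NULL
for (k in 1:5){
  cent <- rbind(cent, colMeans(USArrests[cluster == k,]))
}
cent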

##         Murder   Assault UrbanPop      Rape
## [1,] 14.671429 251.28571 54.28571 21.685714
## [2,] 10.000000 263.00000 48.00000 44.500000
## [3,] 10.883333 256.91667 78.33333 32.250000
## [4,]  5.530435 129.43478 68.91304 17.786957
## [5,]  2.700000  65.14286 46.28571  9.885714

K-means clustering

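Partitions the whole data set into k relatively homogeneous clusters, where the number of clusters k is fixed in advance.

Advantages

- Fast to compute, so it suits large data sets

- Useful beyond clustering itself: as a preliminary step for classification and prediction, for outlier handling, and for other data cleaning and preprocessing work

- Rarely produces singleton clusters


- Computation is fast: the cost grows in proportion to n

    - Typically about 3 iterations are run -> roughly 3nk distance computations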

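Disadvantages

- The number of clusters K is hard to determine in advance: a subjective choice is required

- The resulting clusters may not be easy to interpret

- Sensitive to the initial centers: a poor choice can lead to a poor result

    - k-means generally produces clusters of similar size, which is why the initial centers have so much influence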

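Procedure

1. Decide the number of clusters k.
2. Choose the initial k cluster centers.
3. Assign each observation to the cluster whose center is nearest.
4. Recompute the center of each resulting cluster: k-means uses the mean.
5. Repeat steps 3-4 until the centers no longer change between iterations (reassignment).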
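The procedure above can be sketched in a few lines of base R. This is an illustration only, not the algorithm that `kmeans()` uses by default: the helper name `simple_kmeans`, the tolerance `1e-12`, and the choice of the first four states as initial centers are all assumptions made for the example.

```r
# Minimal k-means sketch; simple_kmeans is a hypothetical helper name.
simple_kmeans <- function(x, centers, max_iter = 100) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  k <- nrow(centers)
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 3: squared Euclidean distance from every row to every center (n x k)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    cluster <- max.col(-d, ties.method = "first")   # index of the nearest center
    # Step 4: recompute each center as the mean of its current members
    new_centers <- centers
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        new_centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
    # Step 5: stop once the centers no longer move
    if (all(abs(new_centers - centers) < 1e-12)) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers, iter = iter)
}

z <- scale(USArrests)
init <- z[1:4, ]                 # arbitrary initial centers (assumption)
fit <- simple_kmeans(z, init)
table(fit$cluster)               # cluster sizes
```

Starting from different rows of `z` may give a different partition, which is the sensitivity to initial values listed among the disadvantages.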

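Choosing K

The result of k-means clustering is sensitive to the choice of the number of clusters k

Run the clustering for several values of k and use the best one
- Choose K from the elbow point
- Choose K from a silhouette plot
Decide K from a plot of the data
- Plotting the data requires dimensionality reduction: use PCA
For big data, run hierarchical clustering on a sample and choose K from the result
- Hierarchical clustering is feasible only for small data sets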
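As a rough sketch of the plot-based approach, the standardized data can be projected onto its first two principal components with `prcomp()` and inspected by eye. The use of `USArrests` and of exactly two components is an assumption made for the illustration.

```r
z <- scale(USArrests)
pc <- prcomp(z)                          # PCA on the already standardized data
scores <- pc$x[, 1:2]                    # coordinates on PC1 and PC2

# share of total variance kept by the two plotted components
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
sum(var_explained[1:2])

plot(scores, pch = 16, xlab = "PC1", ylab = "PC2")
text(scores, labels = rownames(z), pos = 3, cex = 0.6)
```

If the two components keep most of the variance, visible groups in this scatter plot give a reasonable first guess for K.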

Elbow Point

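One way of choosing K; it fits k-means well

elbow point : the value of K at which the sum of squared distances between each cluster center and the observations in that cluster stops falling sharply and levels off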
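Written out, the quantity plotted against K is the total within-cluster sum of squares. The symbols $C_k$ (the set of observations in cluster k) and $\mu_k$ (the mean of cluster k) are notation introduced here for the illustration:

$$W(K) = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2$$

$W(K)$ can only decrease or stay flat as K grows for an optimal partition, so the minimum itself is uninformative and the bend in the curve is used instead. In R, `kmeans()` reports this value as `tot.withinss`.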

silhouette

$$s(i)= {b(i)-a(i)\over \text{max}\{a(i),b(i)\}}= \begin{cases} 1-a(i)/b(i)\ , & \mbox{if }a(i)<b(i) \\ 0\ , & \mbox{if }a(i)=b(i) \\ b(i)/a(i)-1\ , & \mbox{if }a(i)>b(i) \end{cases}$$

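- The max in the denominator : normalizes s(i) so that it always lies in [-1, 1]

- The closer s(i) is to 1, the better observation i is clustered


- a(i) : the average distance from object i to all other objects in the same cluster

- b(i) : for each other cluster, take the average distance from object i to the objects in that cluster; b(i) is the smallest of these averages -> larger is better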
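The definitions of a(i), b(i) and s(i) can be checked with a small hand-written computation in base R. The helper name `sil_width`, the range k = 2..6, and `nstart = 20` are assumptions for this sketch. Setting s(i) = 0 for an observation that is alone in its cluster is a common convention, not something stated in these notes.

```r
# Hand-computed silhouette widths; sil_width is a hypothetical helper name.
sil_width <- function(x, cluster) {
  d <- as.matrix(dist(x))                 # pairwise Euclidean distances
  n <- nrow(d)
  s <- numeric(n)
  for (i in seq_len(n)) {
    own <- cluster == cluster[i]
    own[i] <- FALSE                       # leave i itself out of a(i)
    if (!any(own)) { s[i] <- 0; next }    # singleton cluster: s(i) = 0 by convention
    a <- mean(d[i, own])
    # b(i): smallest average distance to the members of any other cluster
    others <- setdiff(unique(cluster), cluster[i])
    b <- min(sapply(others, function(k) mean(d[i, cluster == k])))
    s[i] <- (b - a) / max(a, b)
  }
  s
}

z <- scale(USArrests)
set.seed(1234)
avg_sil <- sapply(2:6, function(k) {
  fit <- kmeans(z, centers = k, nstart = 20)
  mean(sil_width(z, fit$cluster))         # average silhouette width for this k
})
names(avg_sil) <- 2:6
avg_sil
```

The k with the largest average silhouette width is the candidate to pick. The `cluster` package provides `silhouette()` for the same computation together with a plot method.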

Code

### K-means Clustering
set.seed(1234)
kmc1 = kmeans(zUSArrests,4)
kmc1

## K-means clustering with 4 clusters of sizes 13, 13, 8, 16
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1  0.6950701  1.0394414  0.7226370  1.27693964
## 2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 3  1.4118898  0.8743346 -0.8145211  0.01927104
## 4 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              3              1              1              3              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              1              4              4              1              3 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              4              2              1              4              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              4              2              3              2              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              4              1              2              3              1 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              1              2              4 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              3              2              4 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              4              4              4              4              3 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              2              3              1              4              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              4              4              2              2              4 
## 
## Within cluster sum of squares by cluster:
## [1] 19.922437 11.952463  8.316061 16.212213
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

pairs(zUSArrests, col=kmc1$cluster, pch=16)

## Elbow point
# total within-cluster sum of squares for k = 1..10
wss = numeric(10)
for (i in 1:10) wss[i] = sum(kmeans(zUSArrests, centers=i)$withinss)
plot(1:10, wss, type='b', xlab="Number of Clusters", ylab="Within group sum of squares")

# transformation
zdungratio = scale(dungratio)
summary(zdungratio)

##      fratio             lratio              sratio       
##  Min.   :-7.48199   Min.   :-2.868600   Min.   :-5.6639  
##  1st Qu.:-0.43604   1st Qu.:-0.670959   1st Qu.:-0.1740  
##  Median : 0.05938   Median : 0.003108   Median : 0.2042  
##  Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.0000  
##  3rd Qu.: 0.57609   3rd Qu.: 0.667111   3rd Qu.: 0.5313  
##  Max.   : 2.23705   Max.   : 2.939617   Max.   : 1.4491

# finding k : 3D plot
# library(rgl)
# plot3d(zdungratio[,1], zdungratio[,2], zdungratio[,3], col="blue", size=5) 

# finding k : hierarchical clustering
hc_dung=hclust(dist(zdungratio), method="average")
plot(hc_dung, hang = -1)
rect.hclust(hc_dung, k=3)

# initial points
tmp <- data.frame(zdungratio)
tmp$cluster <- cutree(hc_dung, k=3)
cent <- NULL
for (k in 1:3){  cent <- rbind(cent, colMeans(tmp[tmp$cluster == k,]))  }
cent

##           fratio       lratio     sratio cluster
## [1,]  0.07672644  0.001058849  0.2266506       1
## [2,]  0.13034493  0.135477070 -3.2094215       2
## [3,] -6.07115863 -0.767028418  0.4374348       3

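# each row of kcenters is one initial center, rounded from the values in cent
x1 <- c(0,0,-6) ; x2 <- c(0,0,0) ; x3 <- c(0,-3,0)
kcenters <- data.frame(x1,x2,x3)

# K-means clustering
kmean_dung = kmeans(zdungratio,centers=kcenters) # centers = : manually chosen initial centers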
kmean_dung

## K-means clustering with 3 clusters of sizes 635, 45, 9
## 
## Cluster means:
##        fratio       lratio     sratio
## 1  0.07848197  0.001229268  0.2239685
## 2  0.10676389  0.136059340 -3.2479316
## 3 -6.07115863 -0.767028418  0.4374348
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 2 1 1 3 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 3 1 1 1 1
##  [75] 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 3 1 2 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1
## [112] 1 1 1 1 1 1 1 2 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1
## [186] 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1
## [223] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1
## [260] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 2 1 3 1 1 1 1 1 1
## [297] 1 2 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1
## [334] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
## [371] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1
## [408] 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1
## [445] 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## [482] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2
## [519] 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [556] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [593] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
## [630] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [667] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 1084.96483  107.79811   20.67056
##  (between_SS / total_SS =  41.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

pairs(zdungratio, col=kmean_dung$cluster, pch=16)

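K-medoid clustering

Similar to K-means, but each cluster is centered on a medoid instead of a mean.

medoid : the actual observation in a cluster whose average dissimilarity to all the other objects in the cluster is minimal (a real data point, not a coordinate-wise median)

Because it minimizes the sum of dissimilarities rather than the sum of squared Euclidean distances, it is more robust to outliers than K-means.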
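The medoid definition can be checked by hand. The following is a minimal sketch, assuming Euclidean dissimilarities on the scaled USArrests data and k = 2 (the names `z`, `fit`, `d` and `manual_medoid` are illustrative, not from the original notes). It picks, within each cluster found by `pam()`, the member with the smallest average dissimilarity to the other members, so the result can be compared with the medoids that `pam()` reports.

```r
library(cluster)  # pam()

z   <- scale(USArrests)
fit <- pam(z, k = 2)          # k = 2 is an assumption for this illustration

# Euclidean dissimilarities between all pairs of states
d <- as.matrix(dist(z))

# For each cluster, find the member with the smallest average
# dissimilarity to the members of the same cluster
manual_medoid <- sapply(1:2, function(k) {
  members  <- names(fit$clustering)[fit$clustering == k]
  avg_diss <- rowMeans(d[members, members, drop = FALSE])
  names(which.min(avg_diss))
})

manual_medoid            # medoids computed by hand
rownames(fit$medoids)    # medoids reported by pam()
```

Because a medoid is always one of the observations, a single extreme state can shift a cluster mean noticeably but cannot pull the center outside the data.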

Code

### K-medoids clustering
library(fpc)
zUSArrests=scale(USArrests)
kmed <- pamk(zUSArrests) # pamk() : tries k = 2 to 10 and picks the k with the best average silhouette width
kmed

## $pamobject
## Medoids:
##            ID     Murder    Assault   UrbanPop       Rape
## New Mexico 31  0.8292944  1.3708088  0.3081225  1.1603196
## Nebraska   27 -0.8008247 -0.8250772 -0.2445636 -0.5052109
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              1              2              2              1              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              2              1              2              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              2              1              2              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              2              1              1 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              1              2              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              2              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              2              1              1              2              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              2              2              2 
## Objective function:
##    build     swap 
## 1.441358 1.368969 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"      
## 
## $nc
## [1] 2
## 
## $crit
##  [1] 0.0000000 0.4084890 0.3143656 0.3389904 0.3105170 0.2629987 0.2243815
##  [8] 0.2386072 0.2466113 0.2447023

pairs(USArrests, col=kmed$pamobject$clustering, pch=16)

par(mfrow=c(1,2))
plot(kmed$pamobject)

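explain ##.## % : the share of variability captured by the two principal components shown, i.e. the n-dimensional data is reduced to a 2-D plot (PCA plot)

average silhouette width : 0.## -> the mean silhouette value over all observations

The larger the value, the better the clustering.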
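The average silhouette width can also be read directly from the fitted object instead of from the plot. The sketch below assumes `cluster::pam()` is called directly for a small, arbitrarily chosen range of k (2 to 6); `avg_sil` and `best_k` are illustrative names. It mirrors the comparison that `pamk()` performs when it fills `$crit`.

```r
library(cluster)  # pam()

z <- scale(USArrests)

# Average silhouette width for each candidate k (range chosen for illustration)
k_range <- 2:6
avg_sil <- sapply(k_range, function(k) pam(z, k = k)$silinfo$avg.width)
names(avg_sil) <- k_range
avg_sil

# The k with the largest average silhouette width
best_k <- as.integer(names(which.max(avg_sil)))
best_k
```

The `$crit` values printed earlier point to k = 2 for this data, which is also the `$nc` that `pamk()` returned.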

Density-based clustering

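Cluster analysis based on density (densely populated areas)

Advantages

No need to choose K in advance
Robust to noise and outliers

Disadvantages

Because it relies only on density, the resulting clusters can be hard to interpret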

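2 parameters

Eps : size of the neighborhood (radius)
- Dense points inside the radius become the center of a new radius in turn -> points keep getting linked and the cluster grows
MinPts : minimum # of points required in a neighborhood

dense point (core point) : a point whose Eps-neighborhood contains at least MinPts points
- The neighborhood of a core point is assigned to one cluster
- Data that belong to no cluster are noise (outliers)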
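The role of the two parameters can be illustrated with base R alone. This is a minimal sketch, assuming Euclidean distance on the scaled USArrests data, Eps = 0.8 and MinPts = 5, and counting each point as a member of its own neighborhood. The names `n_neighbors` and `is_core` are illustrative, and the counts are not guaranteed to match the bookkeeping of any particular DBSCAN implementation.

```r
z       <- scale(USArrests)
eps     <- 0.8   # neighborhood radius (assumed value)
min_pts <- 5     # minimum number of points in a neighborhood (assumed value)

# Pairwise Euclidean distances
d <- as.matrix(dist(z))

# Number of points inside each point's Eps-neighborhood (the point itself included)
n_neighbors <- rowSums(d <= eps)

# A point is dense (core) when its neighborhood holds at least MinPts points
is_core <- n_neighbors >= min_pts

sum(is_core)                                  # number of core points
head(sort(n_neighbors, decreasing = TRUE))    # states in the densest regions
```

Points that are not core and do not fall inside the neighborhood of any core point are the ones DBSCAN labels as noise.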

Code

## DBSCAN
library(fpc)
dbscan(zUSArrests,eps=0.8)

## dbscan Pts=50 MinPts=5 eps=0.8
##         0  1 2
## border 33  6 4
## seed    0  5 2
## total  33 11 6

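MinPts = 5 is the default

The best eps has to be found by trial and error

Column 0 : points assigned to no cluster (noise)

border : number of points on the outer edge of each cluster

seed : number of core points of each cluster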
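The trial-and-error search for eps can be organized as a small loop. The sketch below assumes the `fpc` package used above and a hypothetical grid of candidate values (`eps_grid`). For each eps it records the number of clusters and the number of noise points (cluster label 0), which makes it easier to see where the solution becomes stable.

```r
library(fpc)  # dbscan()

z <- scale(USArrests)

# Hypothetical candidate radii; adjust to the scale of the data
eps_grid <- c(0.6, 0.8, 1.0, 1.2)

summary_tbl <- t(sapply(eps_grid, function(e) {
  cl <- fpc::dbscan(z, eps = e, MinPts = 5)$cluster
  c(eps        = e,
    n_clusters = length(setdiff(unique(cl), 0)),  # label 0 means noise
    n_noise    = sum(cl == 0))
}))
summary_tbl
```

A very small eps tends to label most points as noise, while a very large eps merges everything into a single cluster, so the useful values usually lie between those extremes.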
