Data Mining - Logistic Regression Analysis

This is a data mining exercise that uses logistic regression analysis.

Loading the data

library(readxl)
airbnb <- read_excel("data/airbnb.xlsx")
str(airbnb)
## tibble [74,111 x 29] (S3: tbl_df/tbl/data.frame)
##  $ id                    : num [1:74111] 6901257 6304928 7919400 13418779 3808709 ...
##  $ log_price             : num [1:74111] 5.01 5.13 4.98 6.62 4.74 ...
##  $ property_type         : chr [1:74111] "Apartment" "Apartment" "Apartment" "House" ...
##  $ room_type             : chr [1:74111] "Entire home/apt" "Entire home/apt" "Entire home/apt" "Entire home/apt" ...
##  $ amenities             : chr [1:74111] "{\"Wireless Internet\",\"Air conditioning\",Kitchen,Heating,\"Family/kid friendly\",Essentials,\"Hair dryer\",I"| __truncated__ "{\"Wireless Internet\",\"Air conditioning\",Kitchen,Heating,\"Family/kid friendly\",Washer,Dryer,\"Smoke detect"| __truncated__ "{TV,\"Cable TV\",\"Wireless Internet\",\"Air conditioning\",Kitchen,Breakfast,\"Buzzer/wireless intercom\",Heat"| __truncated__ "{TV,\"Cable TV\",Internet,\"Wireless Internet\",Kitchen,\"Indoor fireplace\",\"Buzzer/wireless intercom\",Heati"| __truncated__ ...
##  $ accommodates          : num [1:74111] 3 7 5 4 2 2 3 2 2 2 ...
##  $ bathrooms             : num [1:74111] 1 1 1 1 1 1 1 1 1 1 ...
##  $ bed_type              : chr [1:74111] "Real Bed" "Real Bed" "Real Bed" "Real Bed" ...
##  $ cancellation_policy   : chr [1:74111] "strict" "strict" "moderate" "flexible" ...
##  $ cleaning_fee          : logi [1:74111] TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ city                  : chr [1:74111] "NYC" "NYC" "NYC" "SF" ...
##  $ description           : chr [1:74111] "Beautiful, sunlit brownstone 1-bedroom in the loveliest neighborhood in Brooklyn. Blocks from the promenade and"| __truncated__ "Enjoy travelling during your stay in Manhattan. My place is centrally located near Times Square and Central Par"| __truncated__ "The Oasis comes complete with a full backyard with outdoor furniture to make the most of this summer vacation!!"| __truncated__ "This light-filled home-away-from-home is super clean and comes with all of the modern amenities travelers could"| __truncated__ ...
##  $ first_review          : POSIXct[1:74111], format: "2018-06-16" "2005-08-17" ...
##  $ host_has_profile_pic  : chr [1:74111] "t" "t" "t" "t" ...
##  $ host_identity_verified: chr [1:74111] "t" "f" "t" "t" ...
##  $ host_response_rate    : num [1:74111] NA 1 1 NA 1 1 1 1 1 1 ...
##  $ host_since            : POSIXct[1:74111], format: "2026-03-12" "2019-06-17" ...
##  $ instant_bookable      : chr [1:74111] "f" "t" "t" "f" ...
##  $ last_review           : POSIXct[1:74111], format: "2018-07-16" "2023-09-17" ...
##  $ latitude              : num [1:74111] 40.7 40.8 40.8 37.8 38.9 ...
##  $ longitude             : num [1:74111] -74 -74 -73.9 -122.4 -77 ...
##  $ name                  : chr [1:74111] "Beautiful brownstone 1-bedroom" "Superb 3BR Apt Located Near Times Square" "The Garden Oasis" "Beautiful Flat in the Heart of SF!" ...
##  $ neighbourhood         : chr [1:74111] "Brooklyn Heights" "Hell's Kitchen" "Harlem" "Lower Haight" ...
##  $ number_of_reviews     : num [1:74111] 2 6 10 0 4 3 15 9 159 2 ...
##  $ review_scores_rating  : num [1:74111] 100 93 92 NA 40 100 97 93 99 90 ...
##  $ thumbnail_url         : chr [1:74111] "https://a0.muscache.com/im/pictures/6d7cbbf7-c034-459c-bc82-6522c957627c.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/348a55fe-4b65-452a-b48a-bfecb3b58a66.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/6fae5362-9e3a-4fa9-aa54-bbd5ea26538d.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/72208dad-9c86-41ea-a735-43d933111063.jpg?aki_policy=small" ...
##  $ zipcode               : chr [1:74111] "11201" "10019" "10027" "94117" ...
##  $ bedrooms              : num [1:74111] 1 3 1 2 0 1 1 1 1 1 ...
##  $ beds                  : num [1:74111] 1 3 3 2 1 1 1 1 1 1 ...

Data preprocessing

Convert "property_type" into three categories: 'House', 'Apartment', and 'Other'.

attach(airbnb)
airbnb$property_type[!(property_type=='House'| property_type=='Apartment')] = "Other"
airbnb$property_type <- as.factor(airbnb$property_type)
levels(airbnb$property_type)
## [1] "Apartment" "House"     "Other"

Convert "bed_type" into two categories: 'Bed' and 'Other'.

Real Bed and Airbed were mapped to Bed; everything else was mapped to Other.

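# The conditions use the attached copy of bed_type, which still holds the raw
# values after the first assignment, so the two lines do not interfere.
airbnb$bed_type[(bed_type=='Airbed'| bed_type=='Real Bed')] = "Bed"
airbnb$bed_type[!(bed_type=='Airbed'| bed_type=='Real Bed')] = "Other"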
airbnb$bed_type <- as.factor(airbnb$bed_type)
levels(airbnb$bed_type)
## [1] "Bed"   "Other"
detach(airbnb)
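
The same kind of recoding can also be written without attach(). The following is a minimal sketch on a made-up vector (the values in `bed_type` and the name `bed_group` are illustrative, not taken from the data). Note that `%in%` treats NA as a non-match, so missing values would end up in "Other" here, unlike the `==` comparisons above.

```r
# Toy vector standing in for airbnb$bed_type (values are illustrative).
bed_type <- c("Real Bed", "Airbed", "Futon", "Couch", "Real Bed")

# %in% tests membership in one step, so no attach() is needed.
bed_group <- factor(ifelse(bed_type %in% c("Real Bed", "Airbed"), "Bed", "Other"))

levels(bed_group)
table(bed_group)
```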

Extract only the rows whose "number_of_reviews" is 11 or more and use them in the analysis.

airbnb <- airbnb[airbnb$number_of_reviews>=11 , 1:ncol(airbnb)]

Create the 'price ratio (price_ratio)' variable.

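# Back-transform to the original price scale (the column keeps the name log_price).
airbnb$log_price <- exp(airbnb$log_price)
# City-level mean price, merged back onto every listing.
mean_price <- aggregate(log_price ~ city, airbnb, mean)
names(mean_price) <- c("city", "mean_price")
f_airbnb <- merge(x = airbnb, y = mean_price, by = "city", all.x = TRUE)
# Price as a percentage of the city mean.
f_airbnb$price_ratio <- f_airbnb$log_price / f_airbnb$mean_price * 100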

Data analysis

1. Report the mean and standard deviation of the "price ratio (price_ratio)" variable.

mean(f_airbnb$price_ratio)
## [1] 100
sd(f_airbnb$price_ratio)
## [1] 83.72338
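
The mean of exactly 100 follows from the construction: each price is divided by its own city's mean, so the ratio averages 100 within every city and therefore overall. The following is a small sketch on made-up numbers (the data frame `toy` and its values are illustrative), using ave() as an alternative to aggregate() plus merge().

```r
# Illustrative data: two cities with different price levels.
toy <- data.frame(
  city  = c("NYC", "NYC", "SF", "SF", "SF"),
  price = c(100, 300, 50, 150, 400)
)

# ave() returns the group mean aligned with each row.
toy$mean_price  <- ave(toy$price, toy$city, FUN = mean)
toy$price_ratio <- toy$price / toy$mean_price * 100

tapply(toy$price_ratio, toy$city, mean)  # per-city mean of the ratio
mean(toy$price_ratio)                    # overall mean of the ratio
```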

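2. Run a linear regression with "price ratio (price_ratio)" as the dependent variable.

Before fitting, unnecessary variables are removed; they are listed below. For amenities and description, I tried converting them to their lengths, but this added little, so they were dropped. (The variables used to compute y, and uninformative ones such as URLs and free-text descriptions, are excluded.)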

  • description

  • amenities

  • name

  • neighbourhood

  • thumbnail_url

  • id

  • log_price

  • mean_price

f_df <- subset(f_airbnb, select = -c(description, amenities, name, neighbourhood, thumbnail_url, id, log_price, mean_price))

Character columns are converted to factors. cancellation_policy, which has a natural order, is made an ordered factor.

# Names of all character columns
char_col <- colnames(f_df)[sapply(f_df, is.character)]
f_df[char_col] <- lapply(f_df[char_col] , factor)

f_df$cleaning_fee <- as.factor(f_df$cleaning_fee)

Within cancellation_policy, "super_strict_60" has only 2 observations, so it was merged into "super_strict_30". (That is, super_strict_30 here means super_strict_30 or stricter.)

which(f_df$cancellation_policy == "super_strict_60")
## [1]  6558 13642
f_df$cancellation_policy[which(f_df$cancellation_policy == "super_strict_60")] <- "super_strict_30"
f_df$cancellation_policy <- factor(f_df$cancellation_policy, levels = c("flexible", "moderate", "strict", "super_strict_30"), ordered = TRUE)
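
Because cancellation_policy is an ordered factor, lm() codes it with orthogonal polynomial contrasts under R's default options. This is why the regression summaries report cancellation_policy.L, .Q and .C (linear, quadratic and cubic trends across the four levels) rather than one coefficient per level. The following is a minimal sketch; the vector `policy` is illustrative.

```r
# An ordered factor with the same four levels as in the analysis.
policy <- factor(
  c("flexible", "moderate", "strict", "super_strict_30"),
  levels = c("flexible", "moderate", "strict", "super_strict_30"),
  ordered = TRUE
)

# Contrast matrix used for ordered factors (contr.poly by default).
contrasts(policy)
```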

First, plot price_ratio against each variable. (Output omitted, eval=FALSE.)

for (i in 1:(ncol(f_df)-1)) {
plot(f_df$price_ratio~f_df[[i]],xlab=colnames(f_df)[i])
}

Judging roughly by eye, the following variables show a visible difference. (Entries in bold are the more pronounced ones.)

  • room_type
  • accommodates
  • bathrooms
  • bed_type
  • cancellation_policy
  • cleaning_fee
  • host_has_profile_pic
  • host_identity_verified
  • host_response_rate
  • instant_bookable
  • latitude
  • longitude
  • number_of_reviews
  • review_scores_rating
  • bedrooms
  • beds

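However, latitude and longitude were dropped, since the information they carry is largely captured by zipcode.

zipcode is also too fine-grained, so for convenience only its first two characters are used. (city is omitted because it is implied by this zipcode prefix.)

host_has_profile_pic is not used either, because only 28 rows have the value f.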

summary( lm( price_ratio ~ latitude, data=f_df))
## 
## Call:
## lm(formula = price_ratio ~ latitude, data = f_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -92.71  -47.41  -22.56   19.19 1323.76 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.996e+01  6.061e+00  16.492   <2e-16 ***
## latitude    9.344e-04  1.574e-01   0.006    0.995    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83.72 on 29014 degrees of freedom
## Multiple R-squared:  1.215e-09,  Adjusted R-squared:  -3.446e-05 
## F-statistic: 3.525e-05 on 1 and 29014 DF,  p-value: 0.9953
summary( lm( price_ratio ~ longitude, data=f_df))
## 
## Call:
## lm(formula = price_ratio ~ longitude, data = f_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -92.67  -47.36  -22.63   19.12 1323.69 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 99.765008   2.164951  46.082   <2e-16 ***
## longitude   -0.002513   0.022547  -0.111    0.911    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83.72 on 29014 degrees of freedom
## Multiple R-squared:  4.281e-07,  Adjusted R-squared:  -3.404e-05 
## F-statistic: 0.01242 on 1 and 29014 DF,  p-value: 0.9113
f_df$zipcode <- as.factor(substr(f_airbnb$zipcode,1,2))

table(f_df$host_has_profile_pic)
## 
##     f     t 
##    28 28924
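
The same table() check can be applied to the zipcode prefixes, since prefixes with very few rows give coefficients that are poorly estimated. The following is a minimal sketch on made-up zip codes; the vector `zip` and the threshold of 2 are assumptions chosen only for illustration.

```r
# Illustrative zip codes (not taken from the data).
zip <- c("11201", "10019", "10027", "94117", "94110", "20001")

# Keep the first two characters and count rows per prefix.
prefix <- substr(zip, 1, 2)
counts <- table(prefix)
counts

# Prefixes with fewer rows than an (arbitrary) threshold.
sparse <- names(counts)[counts < 2]
sparse
```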

With those excluded, the following are used as analysis variables.

  • room_type
  • property_type
  • accommodates
  • bathrooms
  • bed_type
  • cancellation_policy
  • cleaning_fee
  • host_identity_verified
  • host_response_rate
  • instant_bookable
  • first_review
  • number_of_reviews
  • host_since
  • review_scores_rating
  • last_review
  • bedrooms
  • beds
  • zipcode

To check more carefully, trellis plots were drawn.

library(lattice)
mypanel <- function(x, y) {
  panel.xyplot(x, y)
  panel.loess(x, y, col="red", lwd=2, lty=2) 
}

xyplot(price_ratio ~ room_type, data=f_df,panel=mypanel)

xyplot(price_ratio ~ accommodates, data=f_df,panel=mypanel)

xyplot(price_ratio ~ bathrooms, data=f_df,panel=mypanel)

xyplot(price_ratio ~ bed_type, data=f_df,panel=mypanel)

xyplot(price_ratio ~ cancellation_policy, data=f_df,panel=mypanel)

xyplot(price_ratio ~ cleaning_fee, data=f_df,panel=mypanel)

xyplot(price_ratio ~ host_identity_verified , data=f_df,panel=mypanel)

xyplot(price_ratio ~ host_response_rate, data=f_df,panel=mypanel)

xyplot(price_ratio ~ instant_bookable , data=f_df,panel=mypanel)

xyplot(price_ratio ~ number_of_reviews, data=f_df,panel=mypanel)

xyplot(price_ratio ~ review_scores_rating, data=f_df,panel=mypanel)

xyplot(price_ratio ~ bedrooms, data=f_df,panel=mypanel)

xyplot(price_ratio ~ beds, data=f_df,panel=mypanel)

xyplot(price_ratio ~ zipcode, data=f_df,panel=mypanel)

xyplot(price_ratio ~ last_review , data=f_df,panel=mypanel)

xyplot(price_ratio ~ host_since , data=f_df,panel=mypanel)

xyplot(price_ratio ~ first_review , data=f_df,panel=mypanel)

xyplot(price_ratio ~ property_type , data=f_df,panel=mypanel)

Now the linear regression is run with these 18 variables. Rows containing missing values (NA) are removed first.

f_df = na.omit(f_df)
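
na.omit() drops a row if any column is missing, including columns that are not in the model. Before calling it, it can help to see which columns are responsible for the loss. The following is a minimal sketch on a made-up data frame (`toy` and its columns are illustrative).

```r
# Illustrative data with missing values in two columns.
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "z"),
  z = 1:4
)

colSums(is.na(toy))        # missing values per column
sum(!complete.cases(toy))  # rows that na.omit() would drop
nrow(na.omit(toy))         # rows that remain
```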

lm1 = lm(price_ratio ~ room_type + property_type + accommodates + bathrooms + bed_type + cancellation_policy + cleaning_fee + host_identity_verified + host_response_rate + instant_bookable + first_review + number_of_reviews + host_since + review_scores_rating + last_review + bedrooms  + beds + zipcode , data = f_df)

summary(lm1)
## 
## Call:
## lm(formula = price_ratio ~ room_type + property_type + accommodates + 
##     bathrooms + bed_type + cancellation_policy + cleaning_fee + 
##     host_identity_verified + host_response_rate + instant_bookable + 
##     first_review + number_of_reviews + host_since + review_scores_rating + 
##     last_review + bedrooms + beds + zipcode, data = f_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -310.42  -25.81   -2.43   18.11  995.55 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -9.898e+01  8.209e+00 -12.058  < 2e-16 ***
## room_typePrivate room   -4.603e+01  8.669e-01 -53.089  < 2e-16 ***
## room_typeShared room    -7.641e+01  2.538e+00 -30.105  < 2e-16 ***
## property_typeHouse       4.183e+00  9.150e-01   4.572 4.85e-06 ***
## property_typeOther       6.771e+00  1.112e+00   6.087 1.17e-09 ***
## accommodates             8.450e+00  3.292e-01  25.671  < 2e-16 ***
## bathrooms                3.515e+01  7.626e-01  46.087  < 2e-16 ***
## bed_typeOther            1.908e+00  2.279e+00   0.837 0.402372    
## cancellation_policy.L    5.800e+01  6.962e+00   8.330  < 2e-16 ***
## cancellation_policy.Q    3.702e+01  5.201e+00   7.117 1.13e-12 ***
## cancellation_policy.C    1.434e+01  2.375e+00   6.039 1.57e-09 ***
## cleaning_feeTRUE        -3.377e+00  1.006e+00  -3.357 0.000789 ***
## host_identity_verifiedt  2.158e+00  8.533e-01   2.529 0.011431 *  
## host_response_rate      -1.935e+01  3.508e+00  -5.516 3.50e-08 ***
## instant_bookablet       -4.545e+00  7.645e-01  -5.944 2.81e-09 ***
## first_review            -3.548e-11  5.496e-10  -0.065 0.948532    
## number_of_reviews       -4.118e-02  7.218e-03  -5.705 1.17e-08 ***
## host_since               1.791e-10  5.406e-10   0.331 0.740412    
## review_scores_rating     2.022e+00  7.722e-02  26.187  < 2e-16 ***
## last_review              7.622e-10  4.587e-10   1.662 0.096593 .  
## bedrooms                 2.588e+01  6.572e-01  39.373  < 2e-16 ***
## beds                    -3.614e+00  4.786e-01  -7.551 4.46e-14 ***
## zipcode11               -4.252e+01  1.112e+00 -38.245  < 2e-16 ***
## zipcode1m               -3.478e+01  5.632e+01  -0.617 0.536912    
## zipcode20               -4.165e+01  1.547e+00 -26.924  < 2e-16 ***
## zipcode21               -3.428e+01  1.739e+00 -19.714  < 2e-16 ***
## zipcode22               -1.917e+01  8.438e+00  -2.272 0.023073 *  
## zipcode24               -4.208e+01  5.633e+01  -0.747 0.454991    
## zipcode60               -4.246e+01  1.562e+00 -27.184  < 2e-16 ***
## zipcode90               -3.486e+01  1.107e+00 -31.483  < 2e-16 ***
## zipcode91               -5.649e+01  1.742e+00 -32.428  < 2e-16 ***
## zipcode92               -6.254e+01  5.634e+01  -1.110 0.266986    
## zipcode93               -7.250e+01  1.567e+01  -4.628 3.71e-06 ***
## zipcode94               -3.287e+01  1.418e+00 -23.189  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.31 on 27002 degrees of freedom
## Multiple R-squared:  0.5462, Adjusted R-squared:  0.5456 
## F-statistic: 984.8 on 33 and 27002 DF,  p-value: < 2.2e-16

Several coefficients above have non-significant p-values. Stepwise selection is used next to remove such terms.
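
For reference, step() adds or drops one term at a time, keeps the change that lowers AIC the most, and stops when no single change lowers AIC. The following is a minimal sketch on the built-in mtcars data rather than the Airbnb data; the choice of predictors is arbitrary, and trace = 0 only suppresses the step-by-step log.

```r
# Full model on a built-in dataset (predictors chosen arbitrarily).
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Stepwise search in both directions, without printing each step.
reduced <- step(full, direction = "both", trace = 0)

extractAIC(full)     # equivalent degrees of freedom and AIC of the full model
extractAIC(reduced)  # same for the selected model
formula(reduced)     # terms that were kept
```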

lm2 = step(lm1, direction = 'both')
## Start:  AIC=217990.5
## price_ratio ~ room_type + property_type + accommodates + bathrooms + 
##     bed_type + cancellation_policy + cleaning_fee + host_identity_verified + 
##     host_response_rate + instant_bookable + first_review + number_of_reviews + 
##     host_since + review_scores_rating + last_review + bedrooms + 
##     beds + zipcode
## 
##                          Df Sum of Sq      RSS    AIC
## - first_review            1        13 85615864 217989
## - host_since              1       348 85616199 217989
## - bed_type                1      2223 85618074 217989
## <none>                                85615851 217990
## - last_review             1      8755 85624606 217991
## - host_identity_verified  1     20286 85636137 217995
## - cleaning_fee            1     35736 85651586 218000
## - host_response_rate      1     96468 85712319 218019
## - number_of_reviews       1    103211 85719062 218021
## - instant_bookable        1    112043 85727894 218024
## - property_type           2    144262 85760113 218032
## - beds                    1    180785 85796636 218046
## - cancellation_policy     3    503380 86119231 218143
## - accommodates            1   2089553 87705404 218640
## - review_scores_rating    1   2174363 87790214 218667
## - bedrooms                1   4915304 90531154 219498
## - zipcode                12   6500593 92116444 219945
## - bathrooms               1   6734757 92350607 220036
## - room_type               2  10116125 95731976 221006
## 
## Step:  AIC=217988.5
## price_ratio ~ room_type + property_type + accommodates + bathrooms + 
##     bed_type + cancellation_policy + cleaning_fee + host_identity_verified + 
##     host_response_rate + instant_bookable + number_of_reviews + 
##     host_since + review_scores_rating + last_review + bedrooms + 
##     beds + zipcode
## 
##                          Df Sum of Sq      RSS    AIC
## - host_since              1       349 85616213 217987
## - bed_type                1      2224 85618088 217987
## <none>                                85615864 217989
## - last_review             1      8763 85624627 217989
## + first_review            1        13 85615851 217990
## - host_identity_verified  1     20283 85636147 217993
## - cleaning_fee            1     35747 85651611 217998
## - host_response_rate      1     96497 85712361 218017
## - number_of_reviews       1    103228 85719092 218019
## - instant_bookable        1    112052 85727916 218022
## - property_type           2    144250 85760114 218030
## - beds                    1    180804 85796668 218044
## - cancellation_policy     3    503371 86119235 218141
## - accommodates            1   2089584 87705448 218638
## - review_scores_rating    1   2174743 87790607 218665
## - bedrooms                1   4915310 90531174 219496
## - zipcode                12   6500955 92116819 219943
## - bathrooms               1   6734768 92350632 220034
## - room_type               2  10116180 95732044 221004
## 
## Step:  AIC=217986.6
## price_ratio ~ room_type + property_type + accommodates + bathrooms + 
##     bed_type + cancellation_policy + cleaning_fee + host_identity_verified + 
##     host_response_rate + instant_bookable + number_of_reviews + 
##     review_scores_rating + last_review + bedrooms + beds + zipcode
## 
##                          Df Sum of Sq      RSS    AIC
## - bed_type                1      2238 85618451 217985
## <none>                                85616213 217987
## - last_review             1      8739 85624953 217987
## + host_since              1       349 85615864 217989
## + first_review            1        14 85616199 217989
## - host_identity_verified  1     20254 85636467 217991
## - cleaning_fee            1     35785 85651998 217996
## - host_response_rate      1     96522 85712735 218015
## - number_of_reviews       1    103088 85719301 218017
## - instant_bookable        1    112180 85728393 218020
## - property_type           2    144324 85760538 218028
## - beds                    1    180830 85797044 218042
## - cancellation_policy     3    503161 86119374 218139
## - accommodates            1   2089343 87705556 218636
## - review_scores_rating    1   2174719 87790932 218663
## - bedrooms                1   4916145 90532358 219494
## - zipcode                12   6500763 92116976 219941
## - bathrooms               1   6734710 92350923 220032
## - room_type               2  10120958 95737171 221003
## 
## Step:  AIC=217985.3
## price_ratio ~ room_type + property_type + accommodates + bathrooms + 
##     cancellation_policy + cleaning_fee + host_identity_verified + 
##     host_response_rate + instant_bookable + number_of_reviews + 
##     review_scores_rating + last_review + bedrooms + beds + zipcode
## 
##                          Df Sum of Sq      RSS    AIC
## <none>                                85618451 217985
## - last_review             1      8687 85627138 217986
## + bed_type                1      2238 85616213 217987
## + host_since              1       363 85618088 217987
## + first_review            1        15 85618435 217987
## - host_identity_verified  1     20365 85638815 217990
## - cleaning_fee            1     35751 85654202 217995
## - host_response_rate      1     96483 85714934 218014
## - number_of_reviews       1    102833 85721284 218016
## - instant_bookable        1    113036 85731487 218019
## - property_type           2    144046 85762497 218027
## - beds                    1    182147 85800598 218041
## - cancellation_policy     3    501866 86120317 218137
## - accommodates            1   2090219 87708670 218635
## - review_scores_rating    1   2175653 87794104 218662
## - bedrooms                1   4917159 90535610 219493
## - zipcode                12   6501272 92119723 219940
## - bathrooms               1   6732473 92350924 220030
## - room_type               2  10195031 95813482 221023
summary(lm2)
## 
## Call:
## lm(formula = price_ratio ~ room_type + property_type + accommodates + 
##     bathrooms + cancellation_policy + cleaning_fee + host_identity_verified + 
##     host_response_rate + instant_bookable + number_of_reviews + 
##     review_scores_rating + last_review + bedrooms + beds + zipcode, 
##     data = f_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -310.29  -25.79   -2.43   18.08  995.53 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -9.878e+01  8.141e+00 -12.134  < 2e-16 ***
## room_typePrivate room   -4.600e+01  8.659e-01 -53.122  < 2e-16 ***
## room_typeShared room    -7.604e+01  2.499e+00 -30.431  < 2e-16 ***
## property_typeHouse       4.171e+00  9.148e-01   4.560 5.13e-06 ***
## property_typeOther       6.772e+00  1.112e+00   6.088 1.16e-09 ***
## accommodates             8.451e+00  3.291e-01  25.676  < 2e-16 ***
## bathrooms                3.513e+01  7.624e-01  46.081  < 2e-16 ***
## cancellation_policy.L    5.795e+01  6.961e+00   8.324  < 2e-16 ***
## cancellation_policy.Q    3.698e+01  5.200e+00   7.112 1.17e-12 ***
## cancellation_policy.C    1.434e+01  2.374e+00   6.039 1.57e-09 ***
## cleaning_feeTRUE        -3.377e+00  1.006e+00  -3.358 0.000786 ***
## host_identity_verifiedt  2.162e+00  8.532e-01   2.534 0.011269 *  
## host_response_rate      -1.935e+01  3.507e+00  -5.516 3.49e-08 ***
## instant_bookablet       -4.563e+00  7.642e-01  -5.971 2.39e-09 ***
## number_of_reviews       -4.110e-02  7.216e-03  -5.695 1.25e-08 ***
## review_scores_rating     2.023e+00  7.721e-02  26.196  < 2e-16 ***
## last_review              7.592e-10  4.587e-10   1.655 0.097875 .  
## bedrooms                 2.588e+01  6.572e-01  39.382  < 2e-16 ***
## beds                    -3.626e+00  4.784e-01  -7.580 3.58e-14 ***
## zipcode11               -4.252e+01  1.112e+00 -38.246  < 2e-16 ***
## zipcode1m               -3.483e+01  5.632e+01  -0.618 0.536327    
## zipcode20               -4.165e+01  1.547e+00 -26.930  < 2e-16 ***
## zipcode21               -3.428e+01  1.739e+00 -19.714  < 2e-16 ***
## zipcode22               -1.922e+01  8.438e+00  -2.278 0.022721 *  
## zipcode24               -4.215e+01  5.632e+01  -0.748 0.454229    
## zipcode60               -4.248e+01  1.562e+00 -27.202  < 2e-16 ***
## zipcode90               -3.486e+01  1.107e+00 -31.487  < 2e-16 ***
## zipcode91               -5.649e+01  1.742e+00 -32.430  < 2e-16 ***
## zipcode92               -6.251e+01  5.634e+01  -1.110 0.267158    
## zipcode93               -7.254e+01  1.566e+01  -4.631 3.66e-06 ***
## zipcode94               -3.287e+01  1.417e+00 -23.192  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.31 on 27005 degrees of freedom
## Multiple R-squared:  0.5462, Adjusted R-squared:  0.5457 
## F-statistic:  1083 on 30 and 27005 DF,  p-value: < 2.2e-16

In the subsequent analysis, last_review, whose p-value is not significant, is dropped.
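
A term can also be removed from a fitted model with update() instead of retyping the formula. The following is a minimal sketch on the built-in mtcars data; the model and the dropped term are illustrative only.

```r
# Illustrative model on a built-in dataset.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# ". ~ . - qsec" keeps the response and all terms except qsec.
fit_small <- update(fit, . ~ . - qsec)

formula(fit_small)
```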


Regression diagnostics

plot(lm2,which=1)

plot(lm2,which=2)

The residuals show a megaphone (funnel) shape, and the Q-Q plot also indicates a problem. Since the response variable needs to be transformed, Y' = log Y was used.
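
A funnel shape is typical when the noise is multiplicative, and taking logs turns it into additive noise with roughly constant spread. The following is a minimal simulated sketch, not the Airbnb data. The helper `funnel()` is a crude indicator introduced here for illustration (the correlation between absolute residuals and fitted values), not a formal test.

```r
set.seed(1)

# Simulated data with multiplicative noise.
x <- runif(500, 1, 10)
y <- exp(0.3 * x + rnorm(500, sd = 0.4))

raw_fit <- lm(y ~ x)       # response on the original scale
log_fit <- lm(log(y) ~ x)  # log-transformed response

# Crude funnel indicator: do absolute residuals grow with the fitted value?
funnel <- function(fit) cor(abs(resid(fit)), fitted(fit))

funnel(raw_fit)
funnel(log_fit)
```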

f_df$log_price_ratio <- log(f_df$price_ratio)

lm3 = lm(log_price_ratio ~ room_type + property_type + accommodates + 
    bathrooms + cancellation_policy + cleaning_fee + host_identity_verified + 
    host_response_rate + instant_bookable + number_of_reviews + 
    review_scores_rating + bedrooms + beds + zipcode, 
    data = f_df)

lm4 = step(lm3, direction = 'both')
## Start:  AIC=-54314.95
## log_price_ratio ~ room_type + property_type + accommodates + 
##     bathrooms + cancellation_policy + cleaning_fee + host_identity_verified + 
##     host_response_rate + instant_bookable + number_of_reviews + 
##     review_scores_rating + bedrooms + beds + zipcode
## 
##                          Df Sum of Sq    RSS    AIC
## - cleaning_fee            1      0.00 3618.2 -54317
## - number_of_reviews       1      0.08 3618.2 -54316
## <none>                                3618.2 -54315
## - host_identity_verified  1      1.21 3619.4 -54308
## - host_response_rate      1      4.82 3623.0 -54281
## - instant_bookable        1      7.14 3625.3 -54264
## - property_type           2      7.92 3626.1 -54260
## - cancellation_policy     3     21.45 3639.6 -54161
## - beds                    1     21.10 3639.3 -54160
## - bathrooms               1     79.59 3697.7 -53729
## - accommodates            1    131.11 3749.3 -53355
## - review_scores_rating    1    188.21 3806.4 -52946
## - bedrooms                1    216.57 3834.7 -52745
## - zipcode                12    462.42 4080.6 -51087
## - room_type               2   1779.04 5397.2 -43507
## 
## Step:  AIC=-54316.91
## log_price_ratio ~ room_type + property_type + accommodates + 
##     bathrooms + cancellation_policy + host_identity_verified + 
##     host_response_rate + instant_bookable + number_of_reviews + 
##     review_scores_rating + bedrooms + beds + zipcode
## 
##                          Df Sum of Sq    RSS    AIC
## - number_of_reviews       1      0.08 3618.2 -54318
## <none>                                3618.2 -54317
## + cleaning_fee            1      0.00 3618.2 -54315
## - host_identity_verified  1      1.21 3619.4 -54310
## - host_response_rate      1      4.84 3623.0 -54283
## - instant_bookable        1      7.16 3625.3 -54265
## - property_type           2      7.92 3626.1 -54262
## - beds                    1     21.11 3639.3 -54162
## - cancellation_policy     3     21.72 3639.9 -54161
## - bathrooms               1     79.62 3697.8 -53730
## - accommodates            1    131.42 3749.6 -53354
## - review_scores_rating    1    188.30 3806.5 -52947
## - bedrooms                1    216.56 3834.7 -52747
## - zipcode                12    462.51 4080.7 -51089
## - room_type               2   1827.08 5445.2 -43269
## 
## Step:  AIC=-54318.33
## log_price_ratio ~ room_type + property_type + accommodates + 
##     bathrooms + cancellation_policy + host_identity_verified + 
##     host_response_rate + instant_bookable + review_scores_rating + 
##     bedrooms + beds + zipcode
## 
##                          Df Sum of Sq    RSS    AIC
## <none>                                3618.2 -54318
## + number_of_reviews       1      0.08 3618.2 -54317
## + cleaning_fee            1      0.00 3618.2 -54316
## - host_identity_verified  1      1.16 3619.4 -54312
## - host_response_rate      1      4.95 3623.2 -54283
## - instant_bookable        1      7.27 3625.5 -54266
## - property_type           2      7.97 3626.2 -54263
## - cancellation_policy     3     21.66 3639.9 -54163
## - beds                    1     21.15 3639.4 -54163
## - bathrooms               1     79.88 3698.1 -53730
## - accommodates            1    131.34 3749.6 -53356
## - review_scores_rating    1    188.57 3806.8 -52947
## - bedrooms                1    218.14 3836.4 -52738
## - zipcode                12    462.64 4080.9 -51089
## - room_type               2   1828.54 5446.8 -43264

The result is shown below.

summary(lm4)
## 
## Call:
## lm(formula = log_price_ratio ~ room_type + property_type + accommodates + 
##     bathrooms + cancellation_policy + host_identity_verified + 
##     host_response_rate + instant_bookable + review_scores_rating + 
##     bedrooms + beds + zipcode, data = f_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.51557 -0.22551 -0.00287  0.22588  2.52586 
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)              2.8112298  0.0525988   53.447  < 2e-16 ***
## room_typePrivate room   -0.5855993  0.0055699 -105.136  < 2e-16 ***
## room_typeShared room    -1.1375240  0.0161312  -70.517  < 2e-16 ***
## property_typeHouse      -0.0183360  0.0059407   -3.087  0.00203 ** 
## property_typeOther       0.0422432  0.0072288    5.844 5.16e-09 ***
## accommodates             0.0668736  0.0021358   31.311  < 2e-16 ***
## bathrooms                0.1209281  0.0049524   24.418  < 2e-16 ***
## cancellation_policy.L    0.2933262  0.0452494    6.482 9.18e-11 ***
## cancellation_policy.Q    0.1616865  0.0337823    4.786 1.71e-06 ***
## cancellation_policy.C    0.0664165  0.0154332    4.303 1.69e-05 ***
## host_identity_verifiedt  0.0162382  0.0055087    2.948  0.00320 ** 
## host_response_rate      -0.1381528  0.0227264   -6.079 1.23e-09 ***
## instant_bookablet       -0.0365163  0.0049570   -7.367 1.80e-13 ***
## review_scores_rating     0.0188152  0.0005015   37.518  < 2e-16 ***
## bedrooms                 0.1719812  0.0042620   40.352  < 2e-16 ***
## beds                    -0.0390361  0.0031066  -12.566  < 2e-16 ***
## zipcode11               -0.3585074  0.0072259  -49.614  < 2e-16 ***
## zipcode1m               -0.3950436  0.3661121   -1.079  0.28059    
## zipcode20               -0.2976181  0.0100374  -29.651  < 2e-16 ***
## zipcode21               -0.2358253  0.0112927  -20.883  < 2e-16 ***
## zipcode22               -0.1539659  0.0548451   -2.807  0.00500 ** 
## zipcode24               -0.6528374  0.3661251   -1.783  0.07458 .  
## zipcode60               -0.3329465  0.0101308  -32.865  < 2e-16 ***
## zipcode90               -0.2835116  0.0071819  -39.476  < 2e-16 ***
## zipcode91               -0.5014514  0.0112977  -44.385  < 2e-16 ***
## zipcode92               -0.8951543  0.3661830   -2.445  0.01451 *  
## zipcode93               -0.6850067  0.1017839   -6.730 1.73e-11 ***
## zipcode94               -0.2315125  0.0091763  -25.229  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.366 on 27008 degrees of freedom
## Multiple R-squared:  0.6588, Adjusted R-squared:  0.6584 
## F-statistic:  1931 on 27 and 27008 DF,  p-value: < 2.2e-16
plot(lm4,which=1)

plot(lm4,which=2)
## Warning: not plotting observations with leverage one:
##   1413, 8437, 20636

The pattern in the residuals is reduced to some extent.
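
One plausible reason for the "leverage one" warning, offered here as an assumption rather than something checked against the data, is a factor level that occurs in only one row: the model then fits that row exactly, so its hat value is 1. The following is a minimal sketch on made-up data (`toy`, `y` and `g` are illustrative).

```r
# Illustrative data: level "c" of g has a single observation.
toy <- data.frame(
  y = c(1.0, 1.2, 0.9, 2.1, 2.0, 5.0),
  g = factor(c("a", "a", "a", "b", "b", "c"))
)

fit <- lm(y ~ g, data = toy)

# With one factor predictor, each hat value is 1 / (rows in that level).
round(hatvalues(fit), 3)
```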

summary(lm3)
## 
## Call:
## lm(formula = log_price_ratio ~ room_type + property_type + accommodates + 
##     bathrooms + cancellation_policy + cleaning_fee + host_identity_verified + 
##     host_response_rate + instant_bookable + number_of_reviews + 
##     review_scores_rating + bedrooms + beds + zipcode, data = f_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.51722 -0.22532 -0.00299  0.22586  2.52411 
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)              2.813e+00  5.270e-02   53.368  < 2e-16 ***
## room_typePrivate room   -5.856e-01  5.629e-03 -104.034  < 2e-16 ***
## room_typeShared room    -1.138e+00  1.624e-02  -70.065  < 2e-16 ***
## property_typeHouse      -1.814e-02  5.946e-03   -3.051  0.00228 ** 
## property_typeOther       4.224e-02  7.230e-03    5.842 5.21e-09 ***
## accommodates             6.693e-02  2.140e-03   31.282  < 2e-16 ***
## bathrooms                1.208e-01  4.956e-03   24.374  < 2e-16 ***
## cancellation_policy.L    2.931e-01  4.525e-02    6.477 9.49e-11 ***
## cancellation_policy.Q    1.610e-01  3.380e-02    4.764 1.91e-06 ***
## cancellation_policy.C    6.631e-02  1.543e-02    4.296 1.74e-05 ***
## cleaning_feeTRUE        -1.157e-03  6.536e-03   -0.177  0.85946    
## host_identity_verifiedt  1.670e-02  5.546e-03    3.011  0.00261 ** 
## host_response_rate      -1.368e-01  2.280e-02   -5.998 2.02e-09 ***
## instant_bookablet       -3.627e-02  4.968e-03   -7.301 2.93e-13 ***
## number_of_reviews       -3.647e-05  4.691e-05   -0.778  0.43683    
## review_scores_rating     1.881e-02  5.018e-04   37.481  < 2e-16 ***
## bedrooms                 1.718e-01  4.272e-03   40.205  < 2e-16 ***
## beds                    -3.903e-02  3.110e-03  -12.549  < 2e-16 ***
## zipcode11               -3.585e-01  7.226e-03  -49.616  < 2e-16 ***
## zipcode1m               -3.950e-01  3.661e-01   -1.079  0.28064    
## zipcode20               -2.975e-01  1.004e-02  -29.637  < 2e-16 ***
## zipcode21               -2.355e-01  1.130e-02  -20.839  < 2e-16 ***
## zipcode22               -1.539e-01  5.485e-02   -2.805  0.00503 ** 
## zipcode24               -6.537e-01  3.661e-01   -1.785  0.07423 .  
## zipcode60               -3.330e-01  1.014e-02  -32.848  < 2e-16 ***
## zipcode90               -2.834e-01  7.186e-03  -39.434  < 2e-16 ***
## zipcode91               -5.019e-01  1.131e-02  -44.361  < 2e-16 ***
## zipcode92               -8.944e-01  3.662e-01   -2.442  0.01460 *  
## zipcode93               -6.861e-01  1.018e-01   -6.738 1.64e-11 ***
## zipcode94               -2.309e-01  9.213e-03  -25.062  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.366 on 27006 degrees of freedom
## Multiple R-squared:  0.6588, Adjusted R-squared:  0.6584 
## F-statistic:  1798 on 29 and 27006 DF,  p-value: < 2.2e-16

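Looking at the p-values, however, some variables are again not significant, so we run stepwise selection. Before doing so, note that number_of_reviews is reported as not significant. Its spread is much larger than that of the other numeric variables: compared with review_scores_rating, its standard deviation is about 10 times larger. To adjust for this we divide number_of_reviews by a constant (the code below uses the standard deviation of review_scores_rating), but the variable is still not significant. This is expected, because dividing a predictor by a constant rescales its coefficient and standard error by the same factor and leaves the t value unchanged.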

sd(f_df$review_scores_rating, na.rm=TRUE)
## [1] 4.655876
sd(f_df$number_of_reviews) 
## [1] 48.80326
f_df$standard_number_of_reviews <- f_df$number_of_reviews/sd(f_df$review_scores_rating, na.rm=TRUE)


temp_lm = lm(log_price_ratio ~ room_type + accommodates + bathrooms + 
    cancellation_policy + cleaning_fee + host_response_rate + 
    instant_bookable + standard_number_of_reviews + review_scores_rating + 
    bedrooms + beds, data = f_df)

summary(temp_lm)
## 
## Call:
## lm(formula = log_price_ratio ~ room_type + accommodates + bathrooms + 
##     cancellation_policy + cleaning_fee + host_response_rate + 
##     instant_bookable + standard_number_of_reviews + review_scores_rating + 
##     bedrooms + beds, data = f_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.59858 -0.25391 -0.01335  0.24299  2.54936 
## 
## Coefficients:
##                              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                 2.8699063  0.0556688   51.553  < 2e-16 ***
## room_typePrivate room      -0.6022161  0.0057959 -103.904  < 2e-16 ***
## room_typeShared room       -1.1298223  0.0173091  -65.273  < 2e-16 ***
## accommodates                0.0634022  0.0022744   27.876  < 2e-16 ***
## bathrooms                   0.1074901  0.0052103   20.630  < 2e-16 ***
## cancellation_policy.L       0.3217257  0.0481716    6.679 2.46e-11 ***
## cancellation_policy.Q       0.1491561  0.0359783    4.146 3.40e-05 ***
## cancellation_policy.C       0.0559756  0.0164257    3.408 0.000656 ***
## cleaning_feeTRUE            0.0046765  0.0069336    0.674 0.500018    
## host_response_rate         -0.2032507  0.0242706   -8.374  < 2e-16 ***
## instant_bookablet          -0.0380744  0.0052724   -7.221 5.28e-13 ***
## standard_number_of_reviews  0.0000952  0.0002299    0.414 0.678769    
## review_scores_rating        0.0166253  0.0005285   31.456  < 2e-16 ***
## bedrooms                    0.1667024  0.0045023   37.026  < 2e-16 ***
## beds                       -0.0384741  0.0033131  -11.613  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3907 on 27021 degrees of freedom
## Multiple R-squared:  0.6111, Adjusted R-squared:  0.6109 
## F-statistic:  3032 on 14 and 27021 DF,  p-value: < 2.2e-16
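
The claim that rescaling cannot rescue an insignificant predictor can be checked on a small example. The sketch below uses simulated data (the names x and y are hypothetical, not columns of f_df) and divides the predictor by a constant. In the document itself the t value moves from -0.778 to 0.414 only because temp_lm also drops property_type, host_identity_verified, and zipcode from the formula.

```r
# Simulated data; x and y are hypothetical, not the Airbnb columns.
set.seed(1)
n <- 200
x <- rnorm(n, mean = 20, sd = 50)
y <- 1 + 0.01 * x + rnorm(n)

fit_raw    <- lm(y ~ x)
x_scaled   <- x / 4.655876          # divide by an arbitrary positive constant
fit_scaled <- lm(y ~ x_scaled)

t_raw    <- coef(summary(fit_raw))["x", c("t value", "Pr(>|t|)")]
t_scaled <- coef(summary(fit_scaled))["x_scaled", c("t value", "Pr(>|t|)")]

# The slope is multiplied by the constant, but t and p are unchanged.
print(rbind(raw = t_raw, scaled = t_scaled))
print(coef(fit_scaled)["x_scaled"] / coef(fit_raw)["x"])
```

Only the units of the coefficient change, so scaling is useful for readability of the estimate but not for variable selection.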

Now we run stepwise selection on the existing variables.

lm4 = step(lm3, direction = 'both')
## Start:  AIC=-54314.95
## log_price_ratio ~ room_type + property_type + accommodates + 
##     bathrooms + cancellation_policy + cleaning_fee + host_identity_verified + 
##     host_response_rate + instant_bookable + number_of_reviews + 
##     review_scores_rating + bedrooms + beds + zipcode
## 
##                          Df Sum of Sq    RSS    AIC
## - cleaning_fee            1      0.00 3618.2 -54317
## - number_of_reviews       1      0.08 3618.2 -54316
## <none>                                3618.2 -54315
## - host_identity_verified  1      1.21 3619.4 -54308
## - host_response_rate      1      4.82 3623.0 -54281
## - instant_bookable        1      7.14 3625.3 -54264
## - property_type           2      7.92 3626.1 -54260
## - cancellation_policy     3     21.45 3639.6 -54161
## - beds                    1     21.10 3639.3 -54160
## - bathrooms               1     79.59 3697.7 -53729
## - accommodates            1    131.11 3749.3 -53355
## - review_scores_rating    1    188.21 3806.4 -52946
## - bedrooms                1    216.57 3834.7 -52745
## - zipcode                12    462.42 4080.6 -51087
## - room_type               2   1779.04 5397.2 -43507
## 
## Step:  AIC=-54316.91
## log_price_ratio ~ room_type + property_type + accommodates + 
##     bathrooms + cancellation_policy + host_identity_verified + 
##     host_response_rate + instant_bookable + number_of_reviews + 
##     review_scores_rating + bedrooms + beds + zipcode
## 
##                          Df Sum of Sq    RSS    AIC
## - number_of_reviews       1      0.08 3618.2 -54318
## <none>                                3618.2 -54317
## + cleaning_fee            1      0.00 3618.2 -54315
## - host_identity_verified  1      1.21 3619.4 -54310
## - host_response_rate      1      4.84 3623.0 -54283
## - instant_bookable        1      7.16 3625.3 -54265
## - property_type           2      7.92 3626.1 -54262
## - beds                    1     21.11 3639.3 -54162
## - cancellation_policy     3     21.72 3639.9 -54161
## - bathrooms               1     79.62 3697.8 -53730
## - accommodates            1    131.42 3749.6 -53354
## - review_scores_rating    1    188.30 3806.5 -52947
## - bedrooms                1    216.56 3834.7 -52747
## - zipcode                12    462.51 4080.7 -51089
## - room_type               2   1827.08 5445.2 -43269
## 
## Step:  AIC=-54318.33
## log_price_ratio ~ room_type + property_type + accommodates + 
##     bathrooms + cancellation_policy + host_identity_verified + 
##     host_response_rate + instant_bookable + review_scores_rating + 
##     bedrooms + beds + zipcode
## 
##                          Df Sum of Sq    RSS    AIC
## <none>                                3618.2 -54318
## + number_of_reviews       1      0.08 3618.2 -54317
## + cleaning_fee            1      0.00 3618.2 -54316
## - host_identity_verified  1      1.16 3619.4 -54312
## - host_response_rate      1      4.95 3623.2 -54283
## - instant_bookable        1      7.27 3625.5 -54266
## - property_type           2      7.97 3626.2 -54263
## - cancellation_policy     3     21.66 3639.9 -54163
## - beds                    1     21.15 3639.4 -54163
## - bathrooms               1     79.88 3698.1 -53730
## - accommodates            1    131.34 3749.6 -53356
## - review_scores_rating    1    188.57 3806.8 -52947
## - bedrooms                1    218.14 3836.4 -52738
## - zipcode                12    462.64 4080.9 -51089
## - room_type               2   1828.54 5446.8 -43264
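
The AIC column in the trace above is not the value returned by AIC(); for a linear model, step() uses extractAIC(), which is n*log(RSS/n) + 2*edf. Assuming n = 27036 rows (27006 residual degrees of freedom plus 30 coefficients) and the rounded RSS of 3618.2, this gives roughly -54315, in line with the starting AIC shown. The sketch below verifies the formula on the built-in mtcars data, which only stands in for f_df.

```r
# mtcars ships with R; it stands in for f_df here.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)
edf <- n - df.residual(fit)        # number of estimated coefficients

aic_by_hand <- n * log(rss / n) + 2 * edf
aic_step    <- extractAIC(fit)[2]  # the value step() prints as "AIC"

print(c(by_hand = aic_by_hand, extractAIC = aic_step))

# Removing a term changes RSS and edf, which is what each "- term" row reports.
print(drop1(fit))
```

Because every row is compared on the same data, only differences in this AIC matter; dropping cleaning_fee and then number_of_reviews each lowered it by about 2, which is why step() removed them.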

The result is as follows.

summary(lm4)
## 
## Call:
## lm(formula = log_price_ratio ~ room_type + property_type + accommodates + 
##     bathrooms + cancellation_policy + host_identity_verified + 
##     host_response_rate + instant_bookable + review_scores_rating + 
##     bedrooms + beds + zipcode, data = f_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.51557 -0.22551 -0.00287  0.22588  2.52586 
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)              2.8112298  0.0525988   53.447  < 2e-16 ***
## room_typePrivate room   -0.5855993  0.0055699 -105.136  < 2e-16 ***
## room_typeShared room    -1.1375240  0.0161312  -70.517  < 2e-16 ***
## property_typeHouse      -0.0183360  0.0059407   -3.087  0.00203 ** 
## property_typeOther       0.0422432  0.0072288    5.844 5.16e-09 ***
## accommodates             0.0668736  0.0021358   31.311  < 2e-16 ***
## bathrooms                0.1209281  0.0049524   24.418  < 2e-16 ***
## cancellation_policy.L    0.2933262  0.0452494    6.482 9.18e-11 ***
## cancellation_policy.Q    0.1616865  0.0337823    4.786 1.71e-06 ***
## cancellation_policy.C    0.0664165  0.0154332    4.303 1.69e-05 ***
## host_identity_verifiedt  0.0162382  0.0055087    2.948  0.00320 ** 
## host_response_rate      -0.1381528  0.0227264   -6.079 1.23e-09 ***
## instant_bookablet       -0.0365163  0.0049570   -7.367 1.80e-13 ***
## review_scores_rating     0.0188152  0.0005015   37.518  < 2e-16 ***
## bedrooms                 0.1719812  0.0042620   40.352  < 2e-16 ***
## beds                    -0.0390361  0.0031066  -12.566  < 2e-16 ***
## zipcode11               -0.3585074  0.0072259  -49.614  < 2e-16 ***
## zipcode1m               -0.3950436  0.3661121   -1.079  0.28059    
## zipcode20               -0.2976181  0.0100374  -29.651  < 2e-16 ***
## zipcode21               -0.2358253  0.0112927  -20.883  < 2e-16 ***
## zipcode22               -0.1539659  0.0548451   -2.807  0.00500 ** 
## zipcode24               -0.6528374  0.3661251   -1.783  0.07458 .  
## zipcode60               -0.3329465  0.0101308  -32.865  < 2e-16 ***
## zipcode90               -0.2835116  0.0071819  -39.476  < 2e-16 ***
## zipcode91               -0.5014514  0.0112977  -44.385  < 2e-16 ***
## zipcode92               -0.8951543  0.3661830   -2.445  0.01451 *  
## zipcode93               -0.6850067  0.1017839   -6.730 1.73e-11 ***
## zipcode94               -0.2315125  0.0091763  -25.229  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.366 on 27008 degrees of freedom
## Multiple R-squared:  0.6588, Adjusted R-squared:  0.6584 
## F-statistic:  1931 on 27 and 27008 DF,  p-value: < 2.2e-16
plot(lm4,which=1)

plot(lm4,which=2)
## Warning: not plotting observations with leverage one:
##   1413, 8437, 20636

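There is no problem with normality or homoscedasticity. Drawing trellis plots of the residuals, rather than of the Y variable, against each predictor also shows no major problem. (Output omitted; eval=FALSE.)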

xyplot(lm4$residuals ~ room_type, data=f_df,panel=mypanel)
xyplot(lm4$residuals ~ cancellation_policy, data=f_df,panel=mypanel)

xyplot(lm4$residuals ~ property_type, data=f_df,panel=mypanel)
xyplot(lm4$residuals ~ accommodates, data=f_df,panel=mypanel)
xyplot(lm4$residuals ~ bathrooms, data=f_df,panel=mypanel)
xyplot(lm4$residuals ~ host_identity_verified, data=f_df,panel=mypanel)
xyplot(lm4$residuals ~ host_response_rate, data=f_df,panel=mypanel)
xyplot(lm4$residuals ~ instant_bookable , data=f_df,panel=mypanel)
xyplot(lm4$residuals ~ review_scores_rating, data=f_df,panel=mypanel)
xyplot(lm4$residuals ~ bedrooms, data=f_df,panel=mypanel)
xyplot(lm4$residuals ~ beds, data=f_df,panel=mypanel)
xyplot(lm4$residuals ~ zipcode, data=f_df,panel=mypanel)

In short, we fit the regression with the following 12 variables and no interaction terms.

  • room_type
  • property_type
  • accommodates
  • bathrooms
  • cancellation_policy
  • host_identity_verified
  • host_response_rate
  • instant_bookable
  • review_scores_rating
  • bedrooms
  • beds
  • zipcode

Now, to guard against overfitting, we split the data into a train set and a test set, then build and evaluate the model.

nobs=nrow(f_df)

set.seed(1234)
i = sample(1:nobs, round(nobs*0.6)) # 60% for training data, 40% for test data
train_df = f_df[i,] 
test_df = f_df[-i,]

# NOTE: the summary below was produced with data = f_df (all rows), so lm5 equals lm4.
# Pass data = train_df here to fit on the training rows only.
lm5 = lm(log_price_ratio ~ room_type + property_type + accommodates + bathrooms 
          + cancellation_policy + host_identity_verified + host_response_rate + instant_bookable + 
            review_scores_rating + bedrooms + beds + zipcode, data = f_df)

summary(lm5)
## 
## Call:
## lm(formula = log_price_ratio ~ room_type + property_type + accommodates + 
##     bathrooms + cancellation_policy + host_identity_verified + 
##     host_response_rate + instant_bookable + review_scores_rating + 
##     bedrooms + beds + zipcode, data = f_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.51557 -0.22551 -0.00287  0.22588  2.52586 
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)              2.8112298  0.0525988   53.447  < 2e-16 ***
## room_typePrivate room   -0.5855993  0.0055699 -105.136  < 2e-16 ***
## room_typeShared room    -1.1375240  0.0161312  -70.517  < 2e-16 ***
## property_typeHouse      -0.0183360  0.0059407   -3.087  0.00203 ** 
## property_typeOther       0.0422432  0.0072288    5.844 5.16e-09 ***
## accommodates             0.0668736  0.0021358   31.311  < 2e-16 ***
## bathrooms                0.1209281  0.0049524   24.418  < 2e-16 ***
## cancellation_policy.L    0.2933262  0.0452494    6.482 9.18e-11 ***
## cancellation_policy.Q    0.1616865  0.0337823    4.786 1.71e-06 ***
## cancellation_policy.C    0.0664165  0.0154332    4.303 1.69e-05 ***
## host_identity_verifiedt  0.0162382  0.0055087    2.948  0.00320 ** 
## host_response_rate      -0.1381528  0.0227264   -6.079 1.23e-09 ***
## instant_bookablet       -0.0365163  0.0049570   -7.367 1.80e-13 ***
## review_scores_rating     0.0188152  0.0005015   37.518  < 2e-16 ***
## bedrooms                 0.1719812  0.0042620   40.352  < 2e-16 ***
## beds                    -0.0390361  0.0031066  -12.566  < 2e-16 ***
## zipcode11               -0.3585074  0.0072259  -49.614  < 2e-16 ***
## zipcode1m               -0.3950436  0.3661121   -1.079  0.28059    
## zipcode20               -0.2976181  0.0100374  -29.651  < 2e-16 ***
## zipcode21               -0.2358253  0.0112927  -20.883  < 2e-16 ***
## zipcode22               -0.1539659  0.0548451   -2.807  0.00500 ** 
## zipcode24               -0.6528374  0.3661251   -1.783  0.07458 .  
## zipcode60               -0.3329465  0.0101308  -32.865  < 2e-16 ***
## zipcode90               -0.2835116  0.0071819  -39.476  < 2e-16 ***
## zipcode91               -0.5014514  0.0112977  -44.385  < 2e-16 ***
## zipcode92               -0.8951543  0.3661830   -2.445  0.01451 *  
## zipcode93               -0.6850067  0.1017839   -6.730 1.73e-11 ***
## zipcode94               -0.2315125  0.0091763  -25.229  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.366 on 27008 degrees of freedom
## Multiple R-squared:  0.6588, Adjusted R-squared:  0.6584 
## F-statistic:  1931 on 27 and 27008 DF,  p-value: < 2.2e-16
plot(lm5,which=1)

plot(lm5,which=2)
## Warning: not plotting observations with leverage one:
##   1413, 8437, 20636

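The p-values are significant (a few zipcode levels are not, but most are, so we keep the variable as is; those zipcode levels should be interpreted with care), and the adjusted R-squared is 0.6584. The residuals show no problems either, so we compare the model against the test data. As the Call above shows, lm5 was fit with data = f_df, so it has already seen the test rows; the figures below are therefore not a strict out-of-sample check.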

The resulting predictive R-squared, mean absolute error (MAE), and MAPE are, in that order:

## predicted values
pred = predict(lm5, newdata=test_df, type='response')
# predictive R^2
cor(test_df$log_price_ratio, pred)^2
## [1] 0.6595498
# MAE
mean(abs(test_df$log_price_ratio - pred))
## [1] 0.280805
# MAPE
mean(abs(test_df$log_price_ratio - pred)/abs(test_df$log_price_ratio))*100
## [1] 6.538482
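
For comparison, a minimal sketch of a strict hold-out evaluation follows: the model is fit on the training rows only and scored on rows it has never seen. The data are simulated and the names sim_df, x1, x2, and y are hypothetical; with the real data the same pattern would use train_df in the lm() call.

```r
# Simulated stand-in for f_df; column names here are hypothetical.
set.seed(1234)
n      <- 1000
sim_df <- data.frame(x1 = rnorm(n), x2 = runif(n))
sim_df$y <- 4 + 0.5 * sim_df$x1 - 0.3 * sim_df$x2 + rnorm(n, sd = 0.4)

i        <- sample(seq_len(n), round(n * 0.6))  # 60% train, 40% test
train_df <- sim_df[i, ]
test_df  <- sim_df[-i, ]

fit  <- lm(y ~ x1 + x2, data = train_df)        # training rows only
pred <- predict(fit, newdata = test_df)

pred_r2 <- cor(test_df$y, pred)^2                               # predictive R^2
mae     <- mean(abs(test_df$y - pred))                          # MAE
mape    <- mean(abs(test_df$y - pred) / abs(test_df$y)) * 100   # MAPE (%)

print(c(pred_r2 = pred_r2, MAE = mae, MAPE = mape))
```

MAPE divides by the observed response, so it is only meaningful when that response stays away from zero, as it does in this simulation.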

Final results

Finally, we describe our final model.

summary(lm5)
## 
## Call:
## lm(formula = log_price_ratio ~ room_type + property_type + accommodates + 
##     bathrooms + cancellation_policy + host_identity_verified + 
##     host_response_rate + instant_bookable + review_scores_rating + 
##     bedrooms + beds + zipcode, data = f_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.51557 -0.22551 -0.00287  0.22588  2.52586 
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)              2.8112298  0.0525988   53.447  < 2e-16 ***
## room_typePrivate room   -0.5855993  0.0055699 -105.136  < 2e-16 ***
## room_typeShared room    -1.1375240  0.0161312  -70.517  < 2e-16 ***
## property_typeHouse      -0.0183360  0.0059407   -3.087  0.00203 ** 
## property_typeOther       0.0422432  0.0072288    5.844 5.16e-09 ***
## accommodates             0.0668736  0.0021358   31.311  < 2e-16 ***
## bathrooms                0.1209281  0.0049524   24.418  < 2e-16 ***
## cancellation_policy.L    0.2933262  0.0452494    6.482 9.18e-11 ***
## cancellation_policy.Q    0.1616865  0.0337823    4.786 1.71e-06 ***
## cancellation_policy.C    0.0664165  0.0154332    4.303 1.69e-05 ***
## host_identity_verifiedt  0.0162382  0.0055087    2.948  0.00320 ** 
## host_response_rate      -0.1381528  0.0227264   -6.079 1.23e-09 ***
## instant_bookablet       -0.0365163  0.0049570   -7.367 1.80e-13 ***
## review_scores_rating     0.0188152  0.0005015   37.518  < 2e-16 ***
## bedrooms                 0.1719812  0.0042620   40.352  < 2e-16 ***
## beds                    -0.0390361  0.0031066  -12.566  < 2e-16 ***
## zipcode11               -0.3585074  0.0072259  -49.614  < 2e-16 ***
## zipcode1m               -0.3950436  0.3661121   -1.079  0.28059    
## zipcode20               -0.2976181  0.0100374  -29.651  < 2e-16 ***
## zipcode21               -0.2358253  0.0112927  -20.883  < 2e-16 ***
## zipcode22               -0.1539659  0.0548451   -2.807  0.00500 ** 
## zipcode24               -0.6528374  0.3661251   -1.783  0.07458 .  
## zipcode60               -0.3329465  0.0101308  -32.865  < 2e-16 ***
## zipcode90               -0.2835116  0.0071819  -39.476  < 2e-16 ***
## zipcode91               -0.5014514  0.0112977  -44.385  < 2e-16 ***
## zipcode92               -0.8951543  0.3661830   -2.445  0.01451 *  
## zipcode93               -0.6850067  0.1017839   -6.730 1.73e-11 ***
## zipcode94               -0.2315125  0.0091763  -25.229  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.366 on 27008 degrees of freedom
## Multiple R-squared:  0.6588, Adjusted R-squared:  0.6584 
## F-statistic:  1931 on 27 and 27008 DF,  p-value: < 2.2e-16
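
The cancellation_policy.L, .Q, and .C rows in the summary above are the polynomial contrasts R uses by default for an ordered factor, so they are not differences between individual policy levels. Three contrast columns imply four ordered levels; under that assumption the sketch below converts the estimates back into per-level effects with contr.poly(4). The level names are placeholders because the actual level order is not shown in this section.

```r
# Estimates copied from the summary above.
beta <- c(.L = 0.2933262, .Q = 0.1616865, .C = 0.0664165)

# contr.poly(4) is the contrast matrix R uses for a 4-level ordered factor.
C <- contr.poly(4)
rownames(C) <- paste0("level", 1:4)   # placeholder names; real level order not shown here

# Each level's contribution to the linear predictor (these sum to zero).
level_effect <- drop(C %*% beta)

# Difference of every level from the lowest level in the ordering.
diff_from_first <- level_effect - level_effect[1]

print(round(level_effect, 4))
print(round(diff_from_first, 4))
```

Refitting with an unordered factor, for example factor(cancellation_policy, ordered = FALSE), would make summary() report such level-versus-baseline differences directly.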

Y variable: log_price_ratio (the log of the price ratio)

X variables: room_type + property_type + accommodates + bathrooms + cancellation_policy + host_identity_verified + host_response_rate + instant_bookable + review_scores_rating + bedrooms + beds + zipcode

The Estimate column of the Coefficients above gives the full result; briefly, holding the other variables fixed:

  • cancellation_policy is an ordered factor, so its three estimates are orthogonal polynomial contrasts rather than level-by-level differences: linear (.L) 0.2933262, quadratic (.Q) 0.1616865, and cubic (.C) 0.0664165. They describe the trend in log_price_ratio across the ordered policy levels and cannot be read directly as the change from “super_strict_30” to “flexible”, “moderate”, or “strict”.

  • When room_type changes from “Entire home/apt” to “Private room”, log_price_ratio decreases by 0.5855993.

  • When room_type changes from “Entire home/apt” to “Shared room”, log_price_ratio decreases by 1.1375240.

  • When property_type changes from “Apartment” to “House”, log_price_ratio decreases by 0.0183360.

  • When property_type changes from “Apartment” to “Other”, log_price_ratio increases by 0.0422432.

  • When host_identity_verified changes from “f” to “t”, log_price_ratio increases by 0.0162382.

  • Each additional bedroom increases log_price_ratio by 0.1719812.

  • Each additional bed decreases log_price_ratio by 0.0390361.

  • Each additional bathroom increases log_price_ratio by 0.1209281.

  • Each increase of 1 in accommodates increases log_price_ratio by 0.0668736.

  • Each increase of 1 in host_response_rate decreases log_price_ratio by 0.1381528.

  • When instant_bookable changes from “f” to “t”, log_price_ratio decreases by 0.0365163.

  • Each increase of 1 in review_scores_rating increases log_price_ratio by 0.0188152.

  • For zipcode, each coefficient is how much lower log_price_ratio is when the code changes from the baseline 10### to the given prefix. zipcode1m### and zipcode24### are not interpreted, since neither is significant at the 5% level.
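
Because the response is on a log scale, each estimate is easier to read after exponentiating: exp(estimate) is the factor by which the price ratio is multiplied. The sketch below does this for a few estimates copied from the summary, assuming log_price_ratio was built with the natural logarithm (R's default for log()), which this section does not show.

```r
# A few estimates copied from summary(lm5) above.
est <- c("room_typePrivate room" = -0.5855993,
         "room_typeShared room"  = -1.1375240,
         "bathrooms"             =  0.1209281,
         "bedrooms"              =  0.1719812,
         "beds"                  = -0.0390361)

ratio_multiplier <- exp(est)                   # multiplicative effect on the price ratio
pct_change       <- 100 * (ratio_multiplier - 1)

print(round(cbind(estimate   = est,
                  multiplier = ratio_multiplier,
                  pct_change = pct_change), 3))
```

For small estimates the percentage change is close to 100 times the estimate itself, but for large ones such as room_type the two differ noticeably.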

Updated 2020-08-20