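This is an R Markdown document. The steps below show how we used the data to predict who will buy the insurance. They appear in the order we settled on after reviewing our work, not the order in which we originally carried them out. A section discussing that review will be added later.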
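# First, we used Excel to manually delete the columns of train and test that had too many missing values, leaving 83 and 84 variables respectively
# We named the resulting data sets train and test
# We mapped the low/medium/high levels of the categorical variables to corresponding numbers
# We recoded Y/N in the raw data as 1/0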
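The cleaning steps above were done by hand in Excel. As a rough illustration, the same steps could be sketched in R as follows. The column names, the 50% missing-value threshold, and the 1/2/3 coding are assumptions made for this example and are not taken from the original data.

```r
# Toy data standing in for the raw train set (column names are hypothetical).
raw <- data.frame(
  RISK_LEVEL = c("low", "medium", "high", NA, "low"),
  HAS_POLICY = c("Y", "N", "N", "Y", NA),
  MOSTLY_NA  = c(NA, NA, NA, NA, 1),
  stringsAsFactors = FALSE
)

# Drop columns whose share of missing values exceeds a threshold (assumed 50%).
na_share <- colMeans(is.na(raw))
kept <- raw[, na_share <= 0.5, drop = FALSE]

# Map ordered categories to integers: low = 1, medium = 2, high = 3 (assumed coding).
level_map <- c(low = 1L, medium = 2L, high = 3L)
kept$RISK_LEVEL <- unname(level_map[kept$RISK_LEVEL])

# Recode Y/N flags as 1/0; NA stays NA so it can be imputed later.
kept$HAS_POLICY <- ifelse(kept$HAS_POLICY == "Y", 1L, 0L)

str(kept)
```

Doing this in code rather than in a spreadsheet makes it easier to apply exactly the same treatment to train and test.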
#mice::mice(train) Next, we used the mice package to impute the missing values and saved the results as new_train and new_test
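The text only shows the bare `mice::mice(train)` call. A minimal sketch of one way the imputation could look end to end is given below. It runs on simulated data, and the choices of `m = 5`, predictive mean matching (`"pmm"`), and taking the first completed data set are assumptions for illustration, not the settings we necessarily used.

```r
library(mice)

# Simulated stand-in for the cleaned train set, with some values knocked out.
set.seed(1)
n <- 200
toy <- data.frame(
  AGE   = sample(c(20, 35, 50, 65), n, replace = TRUE),
  RFM_R = sample(1:10, n, replace = TRUE),
  LEVEL = sample(1:5, n, replace = TRUE)
)
toy$AGE[sample(n, 20)]   <- NA
toy$RFM_R[sample(n, 20)] <- NA

# Build m imputed data sets with predictive mean matching (illustrative settings).
imp <- mice(toy, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Use the first completed data set as the filled-in data.
new_toy <- complete(imp, 1)

# Check whether any missing values remain.
anyNA(new_toy)
```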
summary(train_58_full_1) # no NA values remain
## X AGE LAST_A_CCONTACT_DT OCCUPATION_CLASS_CD
## Min. : 1 Min. :20.00 Min. :0.000 Min. :0.000
## 1st Qu.: 25001 1st Qu.:20.00 1st Qu.:0.000 1st Qu.:1.000
## Median : 50000 Median :35.00 Median :0.000 Median :1.000
## Mean : 50000 Mean :41.35 Mean :0.354 Mean :1.309
## 3rd Qu.: 75000 3rd Qu.:50.00 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :100000 Max. :65.00 Max. :1.000 Max. :6.000
## INSD_1ST_AGE RFM_R REBUY_TIMES_CNT LEVEL
## Min. : 0.0 Min. : 1.000 Min. : 5.000 Min. :1.000
## 1st Qu.: 0.0 1st Qu.: 4.000 1st Qu.: 5.000 1st Qu.:1.000
## Median :20.0 Median : 7.000 Median : 5.000 Median :1.000
## Mean :29.8 Mean : 5.799 Mean : 8.063 Mean :2.527
## 3rd Qu.:60.0 3rd Qu.: 7.000 3rd Qu.:10.000 3rd Qu.:5.000
## Max. :60.0 Max. :10.000 Max. :20.000 Max. :5.000
## LIFE_CNT IF_ISSUE_I_IND IF_ISSUE_J_IND IF_ISSUE_N_IND
## Min. : 5.0 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 5.0 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 5.0 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :11.2 Mean :0.05627 Mean :0.1097 Mean :0.1512
## 3rd Qu.:15.0 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :25.0 Max. :1.00000 Max. :1.0000 Max. :1.0000
## IF_ISSUE_P_IND IF_ISSUE_Q_IND IF_ADD_L_IND IF_ADD_Q_IND
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1502 Mean :0.1841 Mean :0.1846 Mean :0.1841
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## IF_ADD_R_IND IF_ADD_IND IF_ISSUE_INSD_A_IND IF_ISSUE_INSD_B_IND
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.1462 Mean :0.2476 Mean :0.02781 Mean :0.02801
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
## IF_ISSUE_INSD_C_IND IF_ISSUE_INSD_D_IND IF_ISSUE_INSD_E_IND
## Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.08147 Mean :0.1503 Mean :0.00196
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.00000
## IF_ISSUE_INSD_F_IND IF_ISSUE_INSD_G_IND IF_ISSUE_INSD_H_IND
## Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.05401 Mean :0.05055 Mean :0.00044
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000
## IF_ISSUE_INSD_I_IND IF_ISSUE_INSD_J_IND IF_ISSUE_INSD_K_IND
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.1057 Mean :0.2475 Mean :0.01051
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
## IF_ISSUE_INSD_L_IND IF_ISSUE_INSD_M_IND IF_ISSUE_INSD_N_IND
## Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.00995 Mean :0.00227 Mean :0.2898
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.00000 Max. :1.0000
## IF_ISSUE_INSD_O_IND IF_ISSUE_INSD_P_IND IF_ISSUE_INSD_Q_IND
## Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.01872 Mean :0.3308 Mean :0.4195
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000
## X_A_IND X_B_IND X_C_IND X_D_IND
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.01256 Mean :0.3227 Mean :0.3285 Mean :0.08392
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## X_E_IND X_F_IND X_G_IND X_H_IND
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.2013 Mean :0.0023 Mean :0.00202 Mean :0.2322
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## L1YR_B_ISSUE_CNT L1YR_PAYMENT_REMINDER_IND L1YR_LAPSE_IND
## Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.00333 Mean :0.03528 Mean :0.08977
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :4.00000 Max. :1.00000 Max. :1.00000
## L1YR_GROSS_PRE_AMT ANNUAL_INCOME_AMT INSD_CNT
## Min. :0.0000000 Min. :0.0000000 Min. : 0.0000
## 1st Qu.:0.0000000 1st Qu.:0.0002500 1st Qu.: 0.0000
## Median :0.0000000 Median :0.0004167 Median : 0.0000
## Mean :0.0005254 Mean :0.0006033 Mean : 0.2836
## 3rd Qu.:0.0002650 3rd Qu.:0.0006667 3rd Qu.: 0.0000
## Max. :0.3278146 Max. :0.2500000 Max. :19.0000
## IM_IS_A_IND IM_IS_B_IND IM_IS_C_IND IM_IS_D_IND
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.00634 Mean :0.1248 Mean :0.07316 Mean :0.1795
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## IM_CNT Y1 REBUY_TIMES_CNT2 AGE2
## Min. :0.0000 Min. :0.00 Min. : 25.00 Min. : 400
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.: 25.00 1st Qu.: 400
## Median :0.0000 Median :0.00 Median : 25.00 Median :1225
## Mean :0.3838 Mean :0.02 Mean : 90.42 Mean :1995
## 3rd Qu.:1.0000 3rd Qu.:0.00 3rd Qu.:100.00 3rd Qu.:2500
## Max. :4.0000 Max. :1.00 Max. :400.00 Max. :4225
## RFM_R2 AGE3 REBUY_TIMES_CNT3 RFM_R3
## Min. : 1.00 Min. : 8000 Min. : 125 Min. : 1.0
## 1st Qu.: 16.00 1st Qu.: 8000 1st Qu.: 125 1st Qu.: 64.0
## Median : 49.00 Median : 42875 Median : 125 Median : 343.0
## Mean : 43.13 Mean :106629 Mean :1326 Mean : 353.7
## 3rd Qu.: 49.00 3rd Qu.:125000 3rd Qu.:1000 3rd Qu.: 343.0
## Max. :100.00 Max. :274625 Max. :8000 Max. :1000.0
## IM_CNT2 IM_CNT3 TERMINATION_RATE IF_2ND_GEN_IND
## Min. : 0.0000 Min. : 0.000 Min. : 0.00 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.:0.0000
## Median : 0.0000 Median : 0.000 Median : 0.00 Median :1.0000
## Mean : 0.6262 Mean : 1.232 Mean : 32.99 Mean :0.5432
## 3rd Qu.: 1.0000 3rd Qu.: 1.000 3rd Qu.:100.00 3rd Qu.:1.0000
## Max. :16.0000 Max. :64.000 Max. :100.00 Max. :1.0000
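At this point we began selecting variables, using several different methods.
#### Descriptive statistics
# Using descriptive statistics, we picked the 17 variables most strongly associated with Y1.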
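The text does not say which descriptive statistics were used for this screening. One possible approach, shown here purely as an assumption, is to compare each predictor's mean between buyers (Y1 = 1) and non-buyers (Y1 = 0) and rank the predictors by the standardised difference. The data and the `SIGNAL` and `NOISE` columns are simulated for the example.

```r
set.seed(42)
n <- 1000
toy <- data.frame(
  Y1    = rbinom(n, 1, 0.2),
  AGE   = sample(c(20, 35, 50, 65), n, replace = TRUE),
  NOISE = rnorm(n)
)
# Make one predictor clearly related to Y1 so the ranking has something to find.
toy$SIGNAL <- toy$Y1 * 2 + rnorm(n)

predictors <- setdiff(names(toy), "Y1")

# Standardised mean difference between buyers (Y1 = 1) and non-buyers (Y1 = 0).
smd <- sapply(predictors, function(v) {
  x <- toy[[v]]
  (mean(x[toy$Y1 == 1]) - mean(x[toy$Y1 == 0])) / sd(x)
})

# Rank by absolute difference and keep the top k (17 in the text; 2 here).
k <- 2
top_vars <- names(sort(abs(smd), decreasing = TRUE))[seq_len(k)]
top_vars
```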
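#### Association rules
# First we filtered train down to the rows where Y1 is Y, then used the apriori algorithm to find the single items that occur most frequently.
# Our first pass with association rules gave us 58 variables; a second pass narrowed these down to 43 variables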
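The filtering step described above (keeping only the buyers before counting item frequencies) can be sketched as follows. The data are simulated, and the use of `itemFrequency()` to rank single items is an assumption about how the counting could be done.

```r
library(arules)

set.seed(7)
n <- 500
toy <- data.frame(
  Y1         = sample(c("Y", "N"), n, replace = TRUE, prob = c(0.2, 0.8)),
  IF_ADD_IND = sample(c(0, 1), n, replace = TRUE),
  X_B_IND    = sample(c(0, 1), n, replace = TRUE, prob = c(0.1, 0.9))
)

# Keep only the buyers, then drop Y1 since it is constant in this subset.
buyers <- toy[toy$Y1 == "Y", setdiff(names(toy), "Y1")]

# Every column must be a factor before coercion to "transactions".
buyers[] <- lapply(buyers, factor)
buyer_trans <- as(buyers, "transactions")

# Relative frequency of each single item (e.g. "X_B_IND=1"), highest first.
item_freq <- sort(itemFrequency(buyer_trans, type = "relative"), decreasing = TRUE)
head(item_freq)
```

Items that are frequent among buyers are not necessarily rare among non-buyers, so it may be worth computing the same frequencies on the full data for comparison.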
library(arules)
## Warning: package 'arules' was built under R version 3.5.2
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
## Warning: package 'arulesViz' was built under R version 3.5.2
## Loading required package: grid
re_train_58 = as(data.frame(lapply(train_58_full_1, as.character), stringsAsFactors=TRUE), "transactions")
# Convert the data into a "transactions" object on which association rule analysis can be run
itemFrequencyPlot(re_train_58, supp = 0.6 , cex.names = 0.8)
# Find the most frequently occurring items
rules <- apriori(re_train_58 , parameter = list(supp = 0.6 , confidence = 0.9) )
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.9 0.1 1 none FALSE TRUE 5 0.6 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 60000
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[130085 item(s), 100000 transaction(s)] done [1.81s].
## sorting and recoding items ... [52 item(s)] done [0.10s].
## creating transaction tree ... done [0.14s].
## checking subsets of size 1 2 3 4
## Warning in apriori(re_train_58, parameter = list(supp = 0.6, confidence =
## 0.9)): Mining stopped (time limit reached). Only patterns up to a length of
## 4 returned!
## done [9.54s].
## writing ... [364178 rule(s)] done [0.04s].
## creating S4 object ... done [0.20s].
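The warning in the log above shows that mining stopped at the default time limit (`maxtime` 5), so only patterns up to length 4 were returned, and more than 360,000 rules were still written. A sketch of one way to keep the search manageable is shown below on simulated data: cap the rule length with `maxlen`, lift the time limit with `maxtime = 0`, and restrict the right-hand side to the target item. The item label `"Y1=1"` and all thresholds are assumptions for this example.

```r
library(arules)

set.seed(11)
n <- 1000
x_b <- sample(c("0", "1"), n, replace = TRUE)
toy <- data.frame(
  X_B_IND    = factor(x_b),
  IF_ADD_IND = factor(sample(c("0", "1"), n, replace = TRUE)),
  # Y1 follows X_B_IND most of the time so that a rule can be found.
  Y1         = factor(ifelse(runif(n) < 0.9, x_b,
                             sample(c("0", "1"), n, replace = TRUE)))
)
toy_trans <- as(toy, "transactions")

rules_y1 <- apriori(
  toy_trans,
  # maxlen caps the rule length; maxtime = 0 disables the time limit.
  parameter  = list(supp = 0.1, conf = 0.8, maxlen = 3, maxtime = 0),
  # Only allow the target item on the right-hand side of each rule.
  appearance = list(rhs = "Y1=1", default = "lhs"),
  control    = list(verbose = FALSE)
)

# Show the strongest rules by lift.
inspect(head(sort(rules_y1, by = "lift"), 5))
```

Restricting the right-hand side keeps only rules that say something about the target, which is usually far fewer than the full rule set.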
# The apriori association rule algorithm builds the rules, which can then be visualized
plot(rules , measure = c("confidence", "lift"),shading = "support")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
# Identify the variables that are associated with one another
### Building the model
train_43.lm <- lm(Y1~AGE+LAST_A_CCONTACT_DT+OCCUPATION_CLASS_CD+INSD_1ST_AGE+RFM_R+REBUY_TIMES_CNT+LEVEL+LIFE_CNT+IF_ISSUE_J_IND+IF_ISSUE_N_IND+IF_ISSUE_P_IND+IF_ISSUE_Q_IND+IF_ADD_L_IND+IF_ADD_Q_IND+IF_ADD_R_IND+IF_ADD_IND+IF_ISSUE_INSD_A_IND+IF_ISSUE_INSD_B_IND+IF_ISSUE_INSD_C_IND+IF_ISSUE_INSD_D_IND+IF_ISSUE_INSD_E_IND+IF_ISSUE_INSD_F_IND+IF_ISSUE_INSD_G_IND+IF_ISSUE_INSD_H_IND+IF_ISSUE_INSD_I_IND+IF_ISSUE_INSD_J_IND+IF_ISSUE_INSD_K_IND+IF_ISSUE_INSD_L_IND+IF_ISSUE_INSD_M_IND+IF_ISSUE_INSD_N_IND+IF_ISSUE_INSD_O_IND+IF_ISSUE_INSD_P_IND+IF_ISSUE_INSD_Q_IND+X_A_IND+X_B_IND+X_C_IND+X_D_IND+X_E_IND+X_F_IND+X_G_IND+X_H_IND, data = train_58_full_1)
# Use linear regression to build the prediction model
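Because Y1 is a 0/1 outcome, fitting it with `lm` gives a linear probability model, whose fitted values are not guaranteed to stay between 0 and 1. Logistic regression is a common alternative. The sketch below is not the model we fitted above; it is a comparison on simulated data, and the effect size used to generate Y1 is an assumption.

```r
set.seed(3)
n <- 2000
toy <- data.frame(
  AGE        = sample(c(20, 35, 50, 65), n, replace = TRUE),
  IF_ADD_IND = rbinom(n, 1, 0.25)
)
# Simulated outcome: buying is more likely when IF_ADD_IND = 1 (assumed effect).
toy$Y1 <- rbinom(n, 1, plogis(-3 + 1.5 * toy$IF_ADD_IND))

# Linear model on the 0/1 outcome, as in the text.
fit_lm <- lm(Y1 ~ AGE + IF_ADD_IND, data = toy)

# Logistic regression: same formula, binomial family.
fit_glm <- glm(Y1 ~ AGE + IF_ADD_IND, data = toy, family = binomial)

# Predicted purchase probabilities; the logistic ones always lie in (0, 1).
p_lm  <- predict(fit_lm, newdata = toy)
p_glm <- predict(fit_glm, newdata = toy, type = "response")
range(p_lm)
range(p_glm)
```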
#step_43.lm <- step(train_43.lm)
# We used step to filter for the more relevant variables, but in practice the improvement was not particularly noticeable
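For reference, this is roughly how `step` behaves on a small simulated example where only one predictor carries signal. The variable names and the choice of backward elimination are assumptions for illustration; `step` compares models by AIC and may still retain a noise variable by chance.

```r
set.seed(5)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + rnorm(n)   # only x1 carries signal

# Start from the full model and drop terms while AIC improves.
full_fit <- lm(y ~ x1 + x2 + x3, data = toy)
step_fit <- step(full_fit, direction = "backward", trace = 0)

# Formula of the model that step settled on.
formula(step_fit)
```

On a data set with 100,000 rows, AIC penalises extra terms very lightly relative to the fit, which may be one reason the selection made little difference for us.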