Simpel Regresi Linier, Logistik dan K-Means Clustering Menggunakan R

Oleh nusa

Friday, 10 February 2017

Portal-Statistik | Hari ini kita akan berbagi bagaimana untuk melakukan analisis data dengan beberapa metode sekaligus, Analisis Regresi Linear Sederhana, Analisis Regresi Logistik dan K-Means Clustering dengan menggunakan aplikasi R.

Fungsi untuk visualisasi


add_title <- function(vis, ..., x_lab = "X units", title = "Plot Title") 
{
  add_axis(vis, "x", title = x_lab) %>% 
    add_axis("x", orient = "top", ticks = 0, title = title,
             properties = axis_props(
               axis = list(stroke = "white"),
               labels = list(fontSize = 0)
             ), ...)
}

Library yang diperlukan

library(ggplot2)
library(ggvis)
library(NbClust)#cluster
library(fpc)#analisis diskriminan
library(lmtest)
library(rattle)

1. Analisis Regresi

Pemodelan dengan dataset Trees
Langkah pertama cek distribusi data untuk setiap variabel Girth terhadap Volume:

trees %>% ggvis (~Girth,~Volume) %>% 
  layer_boxplots() %>% 
  add_title(title = "Indentifying Anomaly data", x_lab = "#Girth session")

dan Height terhadap Volume:

trees %>% ggvis (~Height,~Volume) %>% 
  layer_boxplots() %>% 
  add_title(title = "Indentifying Anomaly data", x_lab = "#Height")

Dari uji diagnostik terlihat ada pola linier (positif) untuk setiap variabel. Tampak juga ada beberapa variansi data yang besar dari “height” terhadap Volume.
Untuk melanjutkan analisis regresi agar menghasilkan model yang baik, maka beberapa yang terindentifikasi outlier (atau anomali) dapat dihilangkan dalam perhitungan model.
Metode yang digunakan untuk hal ini dengan menggunakan jarak Cook.

mod <- lm(data=trees, Volume~Girth+Height )
cooksd <- cooks.distance(mod)

#ploting cooksd
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance") 
abline(h = 4*mean(cooksd, na.rm=T), col="red")
text(x=1:length(cooksd)+1, y=cooksd, labels=ifelse(cooksd>4*mean(cooksd, na.rm=T),names(cooksd),""), col="red")

Kemudian, row data yang teridentifikasi sebagai outlier dapat dieliminasi dalam model

influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))])  
clean.data <- data.frame(trees[-influential, ])

summary(model <-lm(data=clean.data, Volume~Girth+Height))

## 
## Call:
## lm(formula = Volume ~ Girth + Height, data = clean.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6396 -1.2786 -0.4938  2.0262  6.4599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -52.2362     8.0390  -6.498 5.76e-07 ***
## Girth         4.4773     0.2518  17.781  < 2e-16 ***
## Height        0.2992     0.1179   2.538   0.0172 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.49 on 27 degrees of freedom
## Multiple R-squared:  0.9437, Adjusted R-squared:  0.9395 
## F-statistic: 226.3 on 2 and 27 DF,  p-value: < 2.2e-16

Kesimpulan yang didapat, bahwa Girth dan Height memiliki pengaruh yang kuat terhadap Volume pohon. Sehingga dapat dikatakan juga untuk menghitung Volume sebuah pohon 93% dapat digambarkan menggunakan nilai Girth dan Height nya.

Sebagai tambahan, dapat juga dilakukan uji kebaikan model dengan uji heteroskedastik error, model dikatakan baik jika H0 gagal ditolak (homoskedastik). Error yang homogen dapat mengidentifikasikan hasil estimasi stabil (selisih antara estimasi dan nilai asli tidak terlalu bervariasi)

bptest(model)

## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 1.8397, df = 2, p-value = 0.3986

Uji autokorelasi dapat diabaikan untuk kasus ini karena termasuk data cross section, bukan time series.
Normalitas error, jika error yang dihasilkan model berdistirbusi normal maka dapat dikatakan model baik

res=residuals(model,type="response")
shapiro.test(res)

## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.96008, p-value = 0.3113

Kemudian uji multikoliniearitas. Di anggap baik Jika nilai VIF dibawah 5.

library(fmsb)
VIF(model)

## [1] 17.76253

Karena model teridentifikasi multikolinieritas, maka model dapat disusun ulang sebagai berikut dengan mempertimbangkan nilai t statistik terbesar untuk variabel independen.

summary(model1 <-lm(data=clean.data, Volume~Girth))

## 
## Call:
## lm(formula = Volume ~ Girth, data = clean.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5036 -2.3834 -0.0489  2.3951  6.3726 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -33.3104     3.2784  -10.16 6.76e-11 ***
## Girth         4.7619     0.2464   19.33  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.813 on 28 degrees of freedom
## Multiple R-squared:  0.9303, Adjusted R-squared:  0.9278 
## F-statistic: 373.6 on 1 and 28 DF,  p-value: < 2.2e-16

Kesimpulan terkahir didapat, untuk memperkirakan Volume (atau bahkan height) dapat digambarkan 93% menggunakan Girth.

2. Logistic Regression

library(datasets)
 dim(Titanic)

## [1] 4 2 2 2

Mempersiapkan data

 classes <- dimnames(Titanic)[[1]]
 genders <- dimnames(Titanic)[[2]]
 ages    <- dimnames(Titanic)[[3]]

 class <- gender <- age <- N <- Y <- rep(0,4*2*2)
 count<-0
 for(i1 in 1:4){
   for(i2 in 1:2){
     for(i3 in 1:2){
       count <- count + 1
       class[count]  <- classes[i1]
       gender[count] <- genders[i2]
       age[count]    <- ages[i3]
       N[count]      <- sum(Titanic[i1,i2,i3,])
       Y[count]      <- Titanic[i1,i2,i3,2]
     }
   }
  }

   Xc1 <- ifelse(class=="1st",1,0)
   Xc2 <- ifelse(class=="2nd",1,0)
   Xc3 <- ifelse(class=="3rd",1,0)
   Xg  <- ifelse(gender=="Female",1,0)
   Xa  <- ifelse(age=="Adult",1,0)
   n   <- 16

   cbind(class,Xc1,Xc2,Xc3,gender,Xg,age,Xa,N,Y)

##       class  Xc1 Xc2 Xc3 gender   Xg  age     Xa  N     Y    
##  [1,] "1st"  "1" "0" "0" "Male"   "0" "Child" "0" "5"   "5"  
##  [2,] "1st"  "1" "0" "0" "Male"   "0" "Adult" "1" "175" "57" 
##  [3,] "1st"  "1" "0" "0" "Female" "1" "Child" "0" "1"   "1"  
##  [4,] "1st"  "1" "0" "0" "Female" "1" "Adult" "1" "144" "140"
##  [5,] "2nd"  "0" "1" "0" "Male"   "0" "Child" "0" "11"  "11" 
##  [6,] "2nd"  "0" "1" "0" "Male"   "0" "Adult" "1" "168" "14" 
##  [7,] "2nd"  "0" "1" "0" "Female" "1" "Child" "0" "13"  "13" 
##  [8,] "2nd"  "0" "1" "0" "Female" "1" "Adult" "1" "93"  "80" 
##  [9,] "3rd"  "0" "0" "1" "Male"   "0" "Child" "0" "48"  "13" 
## [10,] "3rd"  "0" "0" "1" "Male"   "0" "Adult" "1" "462" "75" 
## [11,] "3rd"  "0" "0" "1" "Female" "1" "Child" "0" "31"  "14" 
## [12,] "3rd"  "0" "0" "1" "Female" "1" "Adult" "1" "165" "76" 
## [13,] "Crew" "0" "0" "0" "Male"   "0" "Child" "0" "0"   "0"  
## [14,] "Crew" "0" "0" "0" "Male"   "0" "Adult" "1" "862" "192"
## [15,] "Crew" "0" "0" "0" "Female" "1" "Child" "0" "0"   "0"  
## [16,] "Crew" "0" "0" "0" "Female" "1" "Adult" "1" "23"  "20"

Membuat fungsi logistik dengan metode estimasi Likelihood

logistic_model <- "model{
   # Likelihood
   for(i in 1:n){
    Y[i] ~ dbin(q[i],N[i])
    logit(q[i]) <- beta[1] + beta[2]*Xc1[i] + beta[3]*Xc2[i] + 
                   beta[4]*Xc3[i] + beta[5]*Xg[i] + beta[6]*Xa[i]
   }

   #Priors
   for(j in 1:6){
    beta[j] ~ dnorm(0,0.01)
   }
  }"

Modelling menggunakan logistik model

library(rjags)

  dat    <- list(Y=Y,n=n,N=N,Xc1=Xc1,Xc2=Xc2,Xc3=Xc3,Xg=Xg,Xa=Xa)
  model1 <- jags.model(textConnection(logistic_model),data = dat,n.chains=3, quiet=TRUE)

Menghitung nilai DIC

  dic1  <- dic.samples(model1, 
           variable.names=c("beta"), 
           n.iter=20000, progress.bar="none")

Konvergensi Diagnosis

samp <- coda.samples(model1, 
        variable.names=c("beta"), 
        n.iter=20000, progress.bar="none")
autocorr.plot(samp)

gelman.plot(samp)

summary(samp)

## 
## Iterations = 21001:41000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 20000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##            Mean     SD  Naive SE Time-series SE
## beta[1] -0.1643 0.2613 0.0010670       0.008718
## beta[2]  0.8569 0.1583 0.0006461       0.001258
## beta[3] -0.1651 0.1739 0.0007099       0.001575
## beta[4] -0.9280 0.1486 0.0006068       0.001882
## beta[5]  2.4314 0.1401 0.0005722       0.001060
## beta[6] -1.0711 0.2485 0.0010145       0.008102
## 
## 2. Quantiles for each variable:
## 
##            2.5%     25%     50%       75%   97.5%
## beta[1] -0.6739 -0.3353 -0.1651  0.008272  0.3573
## beta[2]  0.5465  0.7511  0.8572  0.963007  1.1664
## beta[3] -0.5081 -0.2825 -0.1631 -0.045948  0.1722
## beta[4] -1.2229 -1.0272 -0.9267 -0.827322 -0.6376
## beta[5]  2.1599  2.3363  2.4310  2.525828  2.7076
## beta[6] -1.5624 -1.2349 -1.0703 -0.907697 -0.5834

Kesimpulan:

Perempuan (mean Beta 5 >0) bertahan hidup lebih tinggi (peluang) dibandingkan laki-laki (mean Beta 4 <0)
Crew Kelas kedua (Mean Beta 2 > 0) bertahan hidup lebih baik dibandingkan Crew kelas Pertama (Mean Beta 1 <0) dan Crew kelas ketiga (mean beta 3 <0)
Orang dewasa (adults) memiliki kemungkinan bertahan lebih buruk (mean beta 6 <0) dibandingkan anak-anak (child) (mean beta 5> 0)

3. Clusteing Analisis
Langkah pertama melihat struktur data

head(wine,3)

##   Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids
## 1    1   14.23  1.71 2.43       15.6       127    2.80       3.06
## 2    1   13.20  1.78 2.14       11.2       100    2.65       2.76
## 3    1   13.16  2.36 2.67       18.6       101    2.80       3.24
##   Nonflavanoids Proanthocyanins Color  Hue Dilution Proline
## 1          0.28            2.29  5.64 1.04     3.92    1065
## 2          0.26            1.28  4.38 1.05     3.40    1050
## 3          0.30            2.81  5.68 1.03     3.17    1185

str(wine)

## 'data.frame':    178 obs. of  14 variables:
##  $ Type           : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Alcohol        : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ Malic          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ Ash            : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ Alcalinity     : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ Magnesium      : int  127 100 101 113 118 112 96 121 97 98 ...
##  $ Phenols        : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ Flavanoids     : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ Nonflavanoids  : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ Proanthocyanins: num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ Color          : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ Hue            : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ Dilution       : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
##  $ Proline        : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

Melakukan standarisasi data, dan membuat kolom pertama (factor fields)

dataLatih <- scale(wine[-1])

Kemudian menentukan berapa jumlah cluster yang harus terbentuk hingga hasil optimal

nc <- NbClust(dataLatih,
              min.nc=2, max.nc=15,
              method="kmeans")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot.

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 15 proposed 3 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## * 1 proposed 12 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
## * According to the majority rule, the best number of clusters is  3 

## *******************************************************************

barplot(table(nc$Best.n[1,]),
        xlab="Jumlah Cluster",
        ylab="Jumlah Variabel",
        main="Jumlah Cluster berdasarkan 26 Variabel")

wss <- 0
for (i in 1:15){
  wss[i] <-
    sum(kmeans(dataLatih, centers=i)$withinss)
}
plot(1:15,
  wss,
  type="b",
  xlab="Jumlah Cluster",
  ylab="WSS")

Dari 2 metode diatas,ditetapkan 3 sebagai nilai Jumlah cluster yang akan dibentuk.
Kemudian membentuk cluster menggunakan kmeans

modelCluster <- kmeans(dataLatih,3)

kemudian divisualisasi menggunakan metode diskriminan untuk mereduksi dimensi, (menggambarkan banyak variabel menjadi 2 dimensi)

plotcluster(dataLatih, modelCluster$cluster)

Terakhir, mengevaluasi model dengan menggunakan tabel, dengan mencocokan variabel type pada dataset wine dengan hasil cluster yang terbentuk

table(wine$Type,modelCluster$cluster)

##      1  2  3
##   1 59  0  0
##   2  3  3 65
##   3  0 48  0

Dari Hasil didapatkan dengan model clustering yang digunakan, kemudian dibandingkan dengan data asli maka ditemukan terdapat 6 unit yang missclassfied. Sehingga didapat akurasi kebaikan hasil cluster dibandingkan dengan data asli maka:

(178-6)/178

## [1] 0.9662921

Thank You,
Have Fun.

Tag : DataMining, R

Simpel Regresi Linier, Logistik dan K-Means Clustering Menggunakan R

Next

1 Komentar untuk "Simpel Regresi Linier, Logistik dan K-Means Clustering Menggunakan R"