How to Scrape and Analyze Data from a Specific Website (Google Play Store)

Septianusa

This project explains how to scrape data from a specific website and analyze it using R. In some cases we can obtain data from a website through an API, or download it directly when that option is available. But not every website provides an API or a download facility, and this situation comes up often in data analysis work. So, in this project, I will show you one technique that can be used to solve this problem.

Before continuing the discussion, let me describe the project flow. Suppose we are interested in football manager games on the Google Play Store, and we want to know which is better: a game with a "Single Character Building" feature or one with "Multiple Character Building". The data will be obtained by scraping https://play.google.com/store?hl=en, and here is our outline:

  1. Data and package preparation
  2. Functions needed
  3. Scraping phase
  4. PCA and regression analysis
  5. Visualization 
1. Data and package preparation
First, we need each game's URL and the feature we want to compare. (In a real case we would need more URLs; five is too few for a reliable result.)

data <- read.csv("soccerURL.csv", header = TRUE)
urls <- data$URL
data

##                                                                                  URL
## 1      https://play.google.com/store/apps/details?id=com.bloodstone.fantasista&hl=en
## 2  https://play.google.com/store/apps/details?id=com.generamobile.soccerheroes&hl=en
## 3      https://play.google.com/store/apps/details?id=com.firsttouchgames.story&hl=en
## 4 https://play.google.com/store/apps/details?id=com.newstargames.newstarsoccer&hl=en
## 5       https://play.google.com/store/apps/details?id=com.firsttouchgames.dls3&hl=en
##   character
## 1       SCB
## 2       MCB
## 3       SCB
## 4       SCB
## 5       MCB
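
If you don't have soccerURL.csv on disk yet, here is a minimal sketch of how it could be built yourself, using the five URLs and character labels shown above:

data <- data.frame(
  URL = c("https://play.google.com/store/apps/details?id=com.bloodstone.fantasista&hl=en",
          "https://play.google.com/store/apps/details?id=com.generamobile.soccerheroes&hl=en",
          "https://play.google.com/store/apps/details?id=com.firsttouchgames.story&hl=en",
          "https://play.google.com/store/apps/details?id=com.newstargames.newstarsoccer&hl=en",
          "https://play.google.com/store/apps/details?id=com.firsttouchgames.dls3&hl=en"),
  character = c("SCB", "MCB", "SCB", "SCB", "MCB"),
  stringsAsFactors = FALSE
)
write.csv(data, "soccerURL.csv", row.names = FALSE)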

Packages needed

library(curl)    # connections for downloading pages
library(rvest)   # HTML parsing and node selection
library(RCurl)   # getURL() for the API calls later on
library(foreach) # looping utilities
library(psych)   # for statistics purposes (KMO, Bartlett, PCA)
library(fmsb)    # for creating radar charts

2. Functions needed
Here I've written a simple function for scraping. The function below will be used throughout our analysis.

#function for scraping game data from a Google Play Store page
ScrapPlaystore <- function(url) {
  #'@param: url (char): a game's URL on the Google Play Store
  #'@return: (list) the game's metadata and its reviews (including review titles)
  options(warn = -1)
  #read_html() downloads the page content
  htmlpage <- read_html(curl(url))
  #html_node() selects node(s) from the downloaded content of a page
  #html_text() extracts text from a previously selected node

  #basic scraping such as title, developer name, category, rating, etc.
  title <- html_text(html_node(htmlpage, ".id-app-title"))
  dev <- html_text(html_node(htmlpage, "#body-content > div.outer-container > div > div.main-content > div:nth-child(1) > div > div.details-info > div.info-container > div:nth-child(2) > a > span"))
  category <- html_text(html_node(htmlpage, ".category span"))
  score <- as.numeric(html_text(html_node(htmlpage, ".score")))
  ratingCount <- as.numeric(gsub(",", "", html_text(html_node(htmlpage, ".reviews-num"))))
  mstar <- matrix(gsub(",", "", html_text(html_nodes(htmlpage, ".bar-number"))))
  mstar <- as.data.frame(t(mstar))
  colnames(mstar) <- c("star5", "star4", "star3", "star2", "star1")

  #review title and review text scraping
  reviewTitle <- html_text(html_nodes(htmlpage, ".review-title"))
  review <- html_text(html_nodes(htmlpage, ".with-review-wrapper"))

  #return the reviews, review titles, and basic information as a list
  return(list(review = review, reviewTitle = reviewTitle,
              basic = cbind(title, dev, category, score, ratingCount, mstar)))
}
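
Before looping over every game in the next step, it is worth testing the function on a single URL; a quick check might look like this (assuming the CSV from step 1 has been loaded):

test <- ScrapPlaystore(urls[1])
test$basic        # one row of metadata for the first game
head(test$review) # the first few reviews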

3. Scraping phase
Now we can scrape the data for each game from the Google Play Store using its URL. We will loop over the URLs stored in urls (one URL per game).

dataBasic <- data.frame() # holds the basic data
gameReview <- list()      # holds the game reviews
for (aurl in urls) {
  basic <- ScrapPlaystore(aurl)
  dataBasic <- rbind(basic$basic, dataBasic)
  gameReview <- append(basic$review, gameReview)
}

We now have two objects: the basic metadata in the data frame dataBasic and the reviews in the list gameReview,

dataBasic

##                      title                       dev category score
## 1      Dream League Soccer               First Touch   Sports   4.5
## 2          New Star Soccer Five Aces Publishing Ltd.   Sports   4.6
## 3              Score! Hero               First Touch   Sports   4.6
## 4        Soccer Heroes RPG              Genera Games   Sports   4.1
## 5 Football Saga Fantasista               Agate Games   Sports   4.1
##   ratingCount   star5  star4  star3 star2  star1
## 1     3607278 2682568 511523 191308 66905 154974
## 2     1569517 1209877 209592  65548 21147  63352
## 3     3116860 2408648 470005 118994 34083  85130
## 4       32612   20362   4207   3044  1472   3527
## 5        3488    2171    402    382   155    378

tail(gameReview)

## [[1]]
## [1] " Update terbaru Dalam sehari update 2x -.- tamatlah yang ga pake wifi :v   Full Review   "
## 
## [[2]]
## [1] " Live my dream I want to be the best player yeahh   Full Review   "
## 
## [[3]]
## [1] " Ok Sejauh ini cukup menyenangkan   Full Review   "
## 
## [[4]]
## [1] " Best game Best game you can try if you want to experience being a professional footballer   Full Review   "
## 
## [[5]]
## [1] "  owsum!!   Full Review   "
## 
## [[6]]
## [1] " Loved it I will best player ever!   Full Review   "

4. PCA and regression analysis
To compare which is better between "single character building (SCB)" and "multiple character building (MCB)" we will use a linear regression model. But before doing that, we have to create a latent variable that can serve as an alternative measure of game performance, using PCA.

The data has been obtained and saved in dataBasic. We will now create a new latent variable, Performance, using dataBasic$score and dataBasic$ratingCount as its components (they will be stored in dat).

attach(dataBasic)
dat <- cbind(score,ratingCount)
head(dat)

##      score ratingCount
## [1,]   4.5     3607278
## [2,]   4.6     1569517
## [3,]   4.6     3116860
## [4,]   4.1       32612
## [5,]   4.1        3488

Before creating the latent variable with PCA, we need to check some assumptions: sampling adequacy with the KMO test and correlation of the matrix with Bartlett's test.

KMO(dat)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = dat)
## Overall MSA =  0.5
## MSA for each item = 
##       score ratingCount 
##         0.5         0.5

cortest.bartlett(dat)

## $chisq
## [1] 2.984755
## 
## $p.value
## [1] 0.08405203
## 
## $df
## [1] 1

We can continue the analysis (PCA for creating the latent variable) if the KMO value is at least 0.5 (the sample is adequate) and the p-value of Bartlett's test is below 0.05 (the variables are correlated). (Because this case is only an example, I will continue the analysis anyway.)
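
To make these two decision rules explicit, here is a small helper sketch (my own addition) that reads the statistics directly from the test objects:

#extract the overall KMO (MSA) and the Bartlett p-value programmatically
kmo.value <- KMO(dat)$MSA
bartlett.p <- cortest.bartlett(dat)$p.value
if (kmo.value >= 0.5 && bartlett.p < 0.05) {
  message("Assumptions met: continue with PCA")
} else {
  message("Assumptions not fully met: KMO = ", round(kmo.value, 2),
          ", Bartlett p-value = ", round(bartlett.p, 3))
}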

PCA analysis

pcadata <- principal(dat, nfactors = 2, rotate = "none")
pcadata

## Principal Components Analysis
## Call: principal(r = dat, nfactors = 2, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##              PC1   PC2 h2      u2 com
## score       0.96 -0.29  1 1.1e-16 1.2
## ratingCount 0.96  0.29  1 1.1e-16 1.2
## 
##                        PC1  PC2
## SS loadings           1.83 0.17
## Proportion Var        0.92 0.08
## Cumulative Var        0.92 1.00
## Proportion Explained  0.92 0.08
## Cumulative Proportion 0.92 1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0 
##  with the empirical chi square  0  with prob <  NA 
## 
## Fit based upon off diagonal values = 1

Look at the eigenvalues (the SS loadings row) for each PC (PC1, PC2, ..., PCn) to decide how many factors to keep. Because only PC1 has an eigenvalue greater than 1, we can create just one latent variable (if PC2 also had an eigenvalue greater than 1, we could create two latent variables, and so on).
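
If you want to verify the eigenvalues directly, they are simply the eigenvalues of the correlation matrix, and they match the SS loadings above:

eigen(cor(dat))$values # components with values > 1 are worth keeping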

In this phase, we will create the new latent variable (Performance) from PC1 to measure how well a game is liked by users. This variable will be used as the dependent variable.

The latent scores are saved in pcadata$scores,

data <- cbind(data, pcadata$scores)
attach(data)
#cbind() coerces the factor to its integer codes: MCB = 1, SCB = 2
cb.data <- data.frame(cbind(character = as.factor(character), PC1)) #character building data
summary(lm(PC1 ~ character, data = cb.data))

## 
## Call:
## lm(formula = PC1 ~ character, data = cb.data)
## 
## Residuals:
##       1       2       3       4       5 
##  0.6222  0.7472  0.6717 -1.2939 -0.7472 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.8892     1.6801  -0.529    0.633
## character     0.5558     1.0041   0.554    0.618
## 
## Residual standard error: 1.1 on 3 degrees of freedom
## Multiple R-squared:  0.09267,    Adjusted R-squared:  -0.2098 
## F-statistic: 0.3064 on 1 and 3 DF,  p-value: 0.6185

Linear model (regression), this time using the character factor directly (MCB is the baseline level):

summary(lm(PC1~data$character))

## 
## Call:
## lm(formula = PC1 ~ data$character)
## 
## Residuals:
##       1       2       3       4       5 
##  0.6222  0.7472  0.6717 -1.2939 -0.7472 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)        -0.3335     0.7777  -0.429    0.697
## data$characterSCB   0.5558     1.0041   0.554    0.618
## 
## Residual standard error: 1.1 on 3 degrees of freedom
## Multiple R-squared:  0.09267,    Adjusted R-squared:  -0.2098 
## F-statistic: 0.3064 on 1 and 3 DF,  p-value: 0.6185

Resampling
Because the number of samples is small, we can resample (draw observations with replacement) to get more out of the linear regression model.

N <- length(cb.data[, 1])
N.resample <- 30
idx <- sample(1:N, N.resample, replace = TRUE)
cb.data.resample <- data.frame(cb.data[idx, ])
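
A single resample is somewhat arbitrary; a more careful alternative (my own sketch) is to repeat the resampling many times and inspect the distribution of the slope:

#bootstrap-style sketch: refit the model on 1000 resamples and collect
#the character coefficient each time
set.seed(123)
boot.coefs <- replicate(1000, {
  idx <- sample(1:N, N.resample, replace = TRUE)
  coef(lm(PC1 ~ character, data = cb.data[idx, ]))[2]
})
quantile(boot.coefs, c(0.025, 0.975)) # rough interval for the slope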

Here is the new linear model:

summary(lm(PC1~character,data=cb.data.resample))

## 
## Call:
## lm(formula = PC1 ~ character, data = cb.data.resample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0215 -1.0215 -0.4076  0.9317  1.0869 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -1.2961     0.5812  -2.230   0.0339 *
## character     0.6230     0.3413   1.825   0.0786 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9009 on 28 degrees of freedom
## Multiple R-squared:  0.1063, Adjusted R-squared:  0.07443 
## F-statistic: 3.332 on 1 and 28 DF,  p-value: 0.07863

The coefficient of the character variable (2 = SCB; 1 = MCB) is positive, meaning SCB scores higher than MCB. So, statistically, football manager games with "single character building" perform better than those with "multiple character building" (though note the p-value of 0.079 is only significant at the 10% level).

5. Visualization
Create a radar chart comparing the five football games:

rescale <- function(x) (x - min(x)) / (max(x) - min(x)) * 20
rc.data <- data.frame(cbind(review      = rescale(dataBasic$ratingCount),
                            star5       = rescale(as.numeric(dataBasic$star5)),
                            star4       = rescale(as.numeric(dataBasic$star4)),
                            performance = rescale(data$PC1),
                            ratingScore = rescale(dataBasic$score)))
rownames(rc.data) <- dataBasic$title

colors_border <- c(rgb(0.2,0.5,0.5,0.9), rgb(0.8,0.2,0.5,0.9), rgb(0.7,0.5,0.1,0.9))
colors_in <- c(rgb(0.2,0.5,0.5,0.4), rgb(0.8,0.2,0.5,0.4), rgb(0.7,0.5,0.1,0.4))
radarchart(rc.data, axistype = 0, maxmin = FALSE,
           #custom polygon
           pcol = colors_border, pfcol = colors_in, plwd = 4, plty = 1,
           #custom grid
           cglcol = "grey", cglty = 1, axislabcol = "black", cglwd = 0.8,
           #custom labels
           vlcex = 0.8)
op <- par(cex = 0.5) # legend text size
legend(x = 0.84, y = 0.2, legend = rownames(rc.data), bty = "n", pch = 20,
       col = colors_in, text.col = "black", cex = 1.2, pt.cex = 2)


We can also do sentiment analysis on gameReview with the Datumbox API:

Keys <- "YOUR_KEY" #get your key here http://www.datumbox.com/machine-learning-api/

### local function: query four Datumbox classifiers for a piece of text
library(stringr)  # str_replace_all(), str_length()
library(RJSONIO)  # fromJSON()

getSentiment <- function(text, key) {
  #' @param: text (char): raw text to be classified
  #' @param: key (char): API key for Datumbox
  #' @return: (list) sentiment, subjectivity, topic, and gender classifications
  text <- URLencode(text)

  #save all the spaces, then get rid of the weird characters that break the API,
  #then convert back the URL-encoded spaces
  text <- str_replace_all(text, "%20", " ")
  text <- str_replace_all(text, "%\\d\\d", "")
  text <- str_replace_all(text, " ", "%20")

  #the API limits the text length, so truncate long reviews
  if (str_length(text) > 360) {
    text <- substr(text, 1, 359)
  }

  #sentiment classification
  data <- getURL(paste("http://api.datumbox.com/1.0/TwitterSentimentAnalysis.json?api_key=", key, "&text=", text, sep = ""))
  js <- fromJSON(data, asText = TRUE)
  sentiment <- js$output$result

  #subjectivity classification
  data <- getURL(paste("http://api.datumbox.com/1.0/SubjectivityAnalysis.json?api_key=", key, "&text=", text, sep = ""))
  js <- fromJSON(data, asText = TRUE)
  subject <- js$output$result

  #topic classification
  data <- getURL(paste("http://api.datumbox.com/1.0/TopicClassification.json?api_key=", key, "&text=", text, sep = ""))
  js <- fromJSON(data, asText = TRUE)
  topic <- js$output$result

  #gender detection
  data <- getURL(paste("http://api.datumbox.com/1.0/GenderDetection.json?api_key=", key, "&text=", text, sep = ""))
  js <- fromJSON(data, asText = TRUE)
  gender <- js$output$result

  return(list(sentiment = sentiment, subject = subject, topic = topic, gender = gender))
}

clean.text <- function(some_txt) {
  some_txt <- gsub("[[:punct:]]", "", some_txt) # remove punctuation
  some_txt <- gsub("[[:digit:]]", "", some_txt) # remove digits

  #define a "tolower with error handling" function
  try.tolower <- function(x) {
    y <- NA
    try_error <- tryCatch(tolower(x), error = function(e) e)
    if (!inherits(try_error, "error"))
      y <- tolower(x)
    return(y)
  }
  some_txt <- sapply(some_txt, try.tolower)
  some_txt <- some_txt[some_txt != ""]
  names(some_txt) <- NULL
  return(some_txt)
}
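
A usage sketch (it assumes a valid Datumbox key in Keys; classifying only the first five reviews is just to keep the example small):

reviews.clean <- clean.text(unlist(gameReview))
sentiments <- lapply(reviews.clean[1:5], getSentiment, key = Keys)
table(sapply(sentiments, `[[`, "sentiment")) # count positive/negative/neutral labels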

Further discussion
If you found this interesting, you can try other things related to data scraping, such as:

-Returning a geocode based on a place name

#Packages needed
library(RCurl)   # getURL()
library(RJSONIO) # fromJSON()
library(plyr)    # ldply(), used in the batch sketch below

url <- function(address, return.call = "json", sensor = "false") {
  root <- "http://maps.google.com/maps/api/geocode/"
  u <- paste(root, return.call, "?address=", address, "&sensor=", sensor, sep = "")
  return(URLencode(u))
}

geoCode <- function(address, verbose = FALSE) {
  if (verbose) cat(address, "\n")
  u <- url(address)
  doc <- getURL(u)
  x <- fromJSON(doc, simplify = FALSE)
  if (x$status == "OK") {
    lat <- x$results[[1]]$geometry$location$lat
    lng <- x$results[[1]]$geometry$location$lng
    location_type <- x$results[[1]]$geometry$location_type
    formatted_address <- x$results[[1]]$formatted_address
    Sys.sleep(0.5) # pause to stay under the API rate limit
    return(c(lat, lng, location_type, formatted_address))
  } else {
    return(c(NA, NA, NA, NA))
  }
}

address <- geoCode("Universitas islam indonesia")
address

## [1] "-7.7773117"                                                                                                                                   
## [2] "110.3929638"                                                                                                                                  
## [3] "ROOFTOP"                                                                                                                                      
## [4] "Universitas Islam Indonesia, Jl. Demangan Baru No.24, Caturtunggal, Kec. Depok, Kabupaten Sleman, Daerah Istimewa Yogyakarta 55281, Indonesia"
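
Since plyr is loaded, geocoding a whole vector of place names is a short step further. A sketch (the second place name here is just a made-up example):

places <- c("Universitas islam indonesia", "Monas Jakarta") # second entry is hypothetical
coords <- ldply(places, function(p) {
  res <- geoCode(p)
  data.frame(place = p, lat = as.numeric(res[1]), lng = as.numeric(res[2]),
             stringsAsFactors = FALSE)
})
coords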

-Looking for your competitors on the Play Store based on particular keywords. (For this function you need RSelenium and a browser driver.)

#' This function can be used for scraping competitor data based on a keyword
library(RSelenium)

getCompetitor <- function(keywords) {
  #'@param: keywords is the keyword whose competitors you are looking for
  #'@return: app list and number of competitors with that keyword in the title
  #generate the search URL from the keyword
  root <- "https://play.google.com/store/search?hl=en&c=apps&q=" #you can substitute "apps" with other categories such as "books" or "movies"
  u <- paste(root, keywords, sep = "")
  generatedURL <- URLencode(u)

  #RSelenium
  session <- checkForServer()
  session <- startServer(invisible = TRUE)
  remDr <- remoteDriver(browserName = "chrome")
  session <- remDr$open()

  #navigate to the page
  session <- remDr$navigate(generatedURL)

  #scroll down 5 times, waiting for the page to load each time
  for (i in 1:5) {
    remDr$executeScript(paste("scroll(0,", i * 10000, ");"))
    Sys.sleep(3)
  }

  #get the page HTML
  page_source <- remDr$getPageSource()
  options(warn = -1)
  #read_html() parses the downloaded page source
  htmlpage <- read_html(page_source[[1]])
  #scrape the competitors' names, developers, and ratings
  competitorDeveloper <- html_text(html_nodes(htmlpage, ".subtitle"))
  competitorAppName <- html_text(html_nodes(htmlpage, ".title"))
  competitorRating <- html_text(html_nodes(htmlpage, ".current-rating"))

  #close the webdriver
  remDr$closeall()
  #return the competitor data
  return(list(competitorAppName = competitorAppName,
              competitorDeveloper = competitorDeveloper,
              competitorRating = competitorRating))
}
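
A hypothetical call (it requires a running Selenium server and the Chrome driver):

competitors <- getCompetitor("football manager")
head(competitors$competitorAppName)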

-Scraping autofill keywords from the Play Store

getListAutoCompletePlayStore <- function(...) {
  arguments <- list(...)
  root <- "https://market.android.com/suggest/SuggRequest?json=1&c=0&hl=en&gl=US&query="
  generatedURLs <- c()
  for (keywords in arguments) {
    generatedURL <- URLencode(paste(root, keywords, sep = ""))
    generatedURLs <- c(generatedURL, generatedURLs)
  }
  listsSuggested <- list()
  for (generatedURL in generatedURLs) {
    suggested <- getURL(generatedURL)
    listsSuggested <- c(suggested, listsSuggested)
  }
  return(listsSuggested)
}
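
An example call with two seed keywords (my own sketch; each element of the result is the raw JSON string returned by the suggest endpoint):

suggestions <- getListAutoCompletePlayStore("soccer", "football manager")
suggestions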

Thank You,
Have Fun.
Tags: DataMining, R