How to Scrape and Analyze Data from a Specific Website (Google Play Store)


Septianusa,

This project explains how to scrape data from a specific website and analyze it using R. In some cases we can obtain data from a website through an API, or download it directly. But not every website provides an API or a download facility, a situation that comes up often in data analysis. In this project I will show one technique that can be used to solve this problem.

Before continuing, let me describe the project flow. Say we are looking at football manager games on the Google Play Store, and we want to know which game feature performs better: “Single Character Building” or “Multiple Character Building”. The data can be obtained by scraping https://play.google.com/store?hl=en. Here is our outline:

  1. Data and package preparation
  2. Functions needed
  3. Scraping phase
  4. PCA and regression analysis
  5. Visualization 
1. Data and package preparation
First we need each game's URL and the feature we want to compare. (In a real case we would need many more URLs than five to get a reliable result.)

data <- read.csv("soccerURL.csv",header=T)
urls <- data$URL
data

##                                                                                  URL
## 1 https://play.google.com/store/apps/details?id=com.bloodstone.fantasista&hl=en
## 2 https://play.google.com/store/apps/details?id=com.generamobile.soccerheroes&hl=en
## 3 https://play.google.com/store/apps/details?id=com.firsttouchgames.story&hl=en
## 4 https://play.google.com/store/apps/details?id=com.newstargames.newstarsoccer&hl=en
## 5 https://play.google.com/store/apps/details?id=com.firsttouchgames.dls3&hl=en
## character
## 1 SCB
## 2 MCB
## 3 SCB
## 4 SCB
## 5 MCB
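The input file itself is not shown. As a minimal sketch, assuming soccerURL.csv holds only the two columns printed above (URL and character), it could be created like this. The temporary file location is only for illustration.

```r
# Sketch: build the input table from the URLs and labels printed above
games <- data.frame(
  URL = c(
    "https://play.google.com/store/apps/details?id=com.bloodstone.fantasista&hl=en",
    "https://play.google.com/store/apps/details?id=com.generamobile.soccerheroes&hl=en",
    "https://play.google.com/store/apps/details?id=com.firsttouchgames.story&hl=en",
    "https://play.google.com/store/apps/details?id=com.newstargames.newstarsoccer&hl=en",
    "https://play.google.com/store/apps/details?id=com.firsttouchgames.dls3&hl=en"
  ),
  character = c("SCB", "MCB", "SCB", "SCB", "MCB"),
  stringsAsFactors = FALSE
)

csvPath <- file.path(tempdir(), "soccerURL.csv")  # illustrative location
write.csv(games, csvPath, row.names = FALSE)

# Read it back the same way the tutorial does
check <- read.csv(csvPath, header = TRUE, stringsAsFactors = FALSE)
table(check$character)
```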

Packages needed

library(curl)
library(rvest)
library(RCurl)
library(foreach)
library(psych) #for the KMO test, Bartlett test and PCA
library(fmsb) #for creating the radar chart

2. Functions needed
Here I have written a simple scraping function. We will use it throughout the analysis:

#function for data scraping from google playstore webpage
ScrapPlaystore <- function (url){
  #'@param: url (char): the game's URL on the Google Play Store
  #'@return: (list) the game's metadata and its reviews (including review titles)

  #suppress warnings inside this function only and restore the setting on exit
  oldOptions <- options(warn = -1)
  on.exit(options(oldOptions))

  #read_html() downloads and parses the page content
  htmlpage <- read_html(curl(url))
  #html_node() and html_nodes() select node(s) from the downloaded page
  #html_text() extracts the text from the selected node(s)
  #the CSS selectors below match the Play Store markup at the time of writing;
  #they need updating whenever the page layout changes

  #basic scraping such as title, developer name, category, rating, etc
  title <- html_text(html_node(htmlpage,".id-app-title"))
  dev <- html_text(html_node(htmlpage, "#body-content > div.outer-container > div > div.main-content > div:nth-child(1) > div > div.details-info > div.info-container > div:nth-child(2) > a > span"))
  category <- html_text(html_node(htmlpage,".category span"))
  score <- as.numeric(html_text(html_node(htmlpage,".score")))
  ratingCount <- as.numeric(gsub(",", "",html_text(html_node(htmlpage,".reviews-num")) ))
  #number of ratings per star level, from 5 stars down to 1 star
  mstar <- matrix(gsub(",", "",html_text(html_nodes(htmlpage,".bar-number"))))
  mstar <- as.data.frame(t(mstar))
  colnames(mstar) <- c("star5","star4","star3","star2","star1")

  #title and review scraping
  reviewTitle <- html_text(html_nodes(htmlpage,".review-title"))
  review <- html_text(html_nodes(htmlpage,".with-review-wrapper"))

  #return the reviews, the review titles and the basic information
  return(list(review=review, reviewTitle=reviewTitle,
              basic=cbind(title, dev, category, score, ratingCount,mstar)))
}
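To see how html_node(), html_nodes() and html_text() behave without touching the network, here is a small sketch on an inline page. The markup is hypothetical and heavily simplified. It only reuses a few class names from the function above, and the real Play Store page is far larger.

```r
library(rvest)

# Hypothetical, simplified markup reusing class names from ScrapPlaystore()
snippet <- '<html><body>
  <div class="id-app-title">Example Soccer Game</div>
  <div class="score">4.5</div>
  <span class="reviews-num">1,234</span>
  <span class="bar-number">900</span>
  <span class="bar-number">334</span>
</body></html>'

page <- read_html(snippet)

# html_node() returns the first match, html_nodes() returns every match
demoTitle <- html_text(html_node(page, ".id-app-title"))
demoScore <- as.numeric(html_text(html_node(page, ".score")))
demoCount <- as.numeric(gsub(",", "", html_text(html_node(page, ".reviews-num"))))
demoBars  <- html_text(html_nodes(page, ".bar-number"))
```

The thousands separator has to be removed with gsub() before as.numeric(), otherwise the count becomes NA.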

3. Scraping phase
Now we can scrape the data for each game from the Google Play Store using its URL. We loop over every URL in urls.

dataBasic <- data.frame() #basic data for every game
gameReview <- list() #reviews for every game
for (aurl in urls){
  basic <- ScrapPlaystore(aurl)
  #each new result is put in front of the earlier ones,
  #so both objects end up in reverse URL order
  dataBasic <- rbind(basic$basic,dataBasic)
  gameReview <- append(basic$review,gameReview)
}
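In practice a single failing page would stop this loop. The following is a sketch of a more defensive variant, under my own assumptions: scrapeAll() and fakeScrape() are hypothetical names, and fakeScrape() only stands in for ScrapPlaystore() so that the example runs offline. It skips pages that fail, pauses between requests, and keeps the results in URL order.

```r
# fakeScrape() is a hypothetical stand-in for ScrapPlaystore()
fakeScrape <- function(url) {
  if (grepl("broken", url)) stop("page could not be read")
  list(basic = data.frame(title = basename(url), score = 4.5,
                          stringsAsFactors = FALSE),
       review = c("good game", "fun"))
}

scrapeAll <- function(urls, scraper, pause = 0) {
  basics <- list()
  reviews <- list()
  for (aurl in urls) {
    # turn an error into a message and a NULL result instead of stopping
    res <- tryCatch(scraper(aurl), error = function(e) {
      message("skipping ", aurl, ": ", conditionMessage(e))
      NULL
    })
    if (!is.null(res)) {
      basics[[length(basics) + 1]] <- res$basic  # appended, so URL order is kept
      reviews[[aurl]] <- res$review
    }
    Sys.sleep(pause)  # wait between requests
  }
  list(basic = do.call(rbind, basics), review = reviews)
}

out <- scrapeAll(c("http://example.com/a", "http://example.com/broken",
                   "http://example.com/b"), fakeScrape)
out$basic$title
```

With the real function you would call scrapeAll(urls, ScrapPlaystore, pause = 1) or similar.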

We now have two objects: the basic data in the data frame dataBasic and the review text in the list gameReview.

dataBasic

##                      title                       dev category score
## 1      Dream League Soccer               First Touch   Sports   4.5
## 2          New Star Soccer Five Aces Publishing Ltd.   Sports   4.6
## 3              Score! Hero               First Touch   Sports   4.6
## 4        Soccer Heroes RPG              Genera Games   Sports   4.1
## 5 Football Saga Fantasista               Agate Games   Sports   4.1
##   ratingCount   star5  star4  star3 star2  star1
## 1     3607278 2682568 511523 191308 66905 154974
## 2     1569517 1209877 209592  65548 21147  63352
## 3     3116860 2408648 470005 118994 34083  85130
## 4       32612   20362   4207   3044  1472   3527
## 5        3488    2171    402    382   155    378

tail(gameReview)

## [[1]]
## [1] " Update terbaru Dalam sehari update 2x -.- tamatlah yang ga pake wifi :v Full Review "
##
## [[2]]
## [1] " Live my dream I want to be the best player yeahh Full Review "
##
## [[3]]
## [1] " Ok Sejauh ini cukup menyenangkan Full Review "
##
## [[4]]
## [1] " Best game Best game you can try if you want to experience being a professional footballer Full Review "
##
## [[5]]
## [1] " owsum!! Full Review "
##
## [[6]]
## [1] " Loved it I will best player ever! Full Review "

4. PCA and regression analysis
To compare “single character building (SCB)” with “multiple character building (MCB)” we will use a linear regression model. Before doing that, we have to create a latent variable, using PCA, that can serve as a measure of game performance.

The data has been obtained and saved in dataBasic. We will build the new latent variable Performance from dataBasic$score and dataBasic$ratingCount (both are collected in the matrix dat).

attach(dataBasic)
dat <- cbind(score,ratingCount)
head(dat)

##      score ratingCount
## [1,] 4.5 3607278
## [2,] 4.6 1569517
## [3,] 4.6 3116860
## [4,] 4.1 32612
## [5,] 4.1 3488

Before creating the latent variable with PCA, we need to check two assumptions: sampling adequacy with the KMO test, and correlation between the variables with Bartlett's test of sphericity.

KMO(dat)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = dat)
## Overall MSA = 0.5
## MSA for each item =
## score ratingCount
## 0.5 0.5

cortest.bartlett(dat)

## $chisq
## [1] 2.984755
##
## $p.value
## [1] 0.08405203
##
## $df
## [1] 1

We can continue with the PCA (to create the latent variable) if the KMO value is greater than 0.5 (the sample is adequate) and the p-value of the Bartlett test is lower than 0.05 (the variables are correlated). Here the KMO value is exactly 0.5 and the Bartlett p-value is 0.084, so neither condition is met. Because this case is only an example, I continue with the analysis anyway.
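To show where the Bartlett numbers come from, here is a minimal base R sketch of the textbook formula for Bartlett's test of sphericity, chi-square = -(n - 1 - (2p + 5)/6) * ln|R| with p(p - 1)/2 degrees of freedom. The data values are copied from the dat matrix printed earlier. This is my own re-computation, not the internals of cortest.bartlett(), but it should land close to the chi-square and p-value reported above.

```r
# Values copied from the dat matrix printed above
score <- c(4.5, 4.6, 4.6, 4.1, 4.1)
ratingCount <- c(3607278, 1569517, 3116860, 32612, 3488)
dat <- cbind(score, ratingCount)

n <- nrow(dat)  # number of games
p <- ncol(dat)  # number of variables
R <- cor(dat)   # correlation matrix

# Bartlett's test of sphericity: H0 says R is an identity matrix
chisq  <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
df     <- p * (p - 1) / 2
pValue <- pchisq(chisq, df, lower.tail = FALSE)

round(c(chisq = chisq, df = df, p.value = pValue), 3)
```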

PCA analysis

library(psych)
pcadata <- principal(dat, nfactors=2, rotate="none")
pcadata

## Principal Components Analysis
## Call: principal(r = dat, nfactors = 2, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 h2 u2 com
## score 0.96 -0.29 1 1.1e-16 1.2
## ratingCount 0.96 0.29 1 1.1e-16 1.2
##
## PC1 PC2
## SS loadings 1.83 0.17
## Proportion Var 0.92 0.08
## Cumulative Var 0.92 1.00
## Proportion Explained 0.92 0.08
## Cumulative Proportion 0.92 1.00
##
## Mean item complexity = 1.2
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0
## with the empirical chi square 0 with prob < NA
##
## Fit based upon off diagonal values = 1

Looking at the eigenvalue of each component (the “SS loadings” row for PC1, PC2, … PCn), we can choose how many factors to keep. Only PC1 has an eigenvalue greater than 1, so we can create just one latent variable. (If PC2 also had an eigenvalue greater than 1, we could create two latent variables, and so on.)
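The eigenvalue rule can be checked directly in base R. The sketch below recomputes the eigenvalues of the correlation matrix with eigen() from the values printed in dat, then counts how many exceed 1. It is an illustration alongside principal(), not a replacement for it.

```r
# Values copied from the dat matrix printed above
score <- c(4.5, 4.6, 4.6, 4.1, 4.1)
ratingCount <- c(3607278, 1569517, 3116860, 32612, 3488)
dat <- cbind(score, ratingCount)

ev <- eigen(cor(dat))$values  # eigenvalues of the correlation matrix
nKeep <- sum(ev > 1)          # keep only components with eigenvalue > 1

round(ev, 2)  # compare with the "SS loadings" row above
nKeep
```

With two standardized variables the eigenvalues always add up to 2, so at most one of them can exceed 1.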

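In this phase we create the new latent variable (Performance) from PC1 to measure how well each game is liked by users. This variable will be used as the dependent variable.

The latent scores are saved in pcadata$scores. Keep in mind that dataBasic was built in reverse URL order, so the rows of pcadata$scores are in the opposite order to the rows of data. In a real analysis, reverse one of them before binding (for example pcadata$scores[nrow(data):1, ]) so that each score sits next to the right character label. The code and output below use the rows as they are.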

data <- cbind(data,pcadata$scores)
attach(data)
cb.data <- data.frame(cbind(character=as.factor(character),PC1)) #Character building data
summary(lm(PC1~character,data=cb.data))

## 
## Call:
## lm(formula = PC1 ~ character, data = cb.data)
##
## Residuals:
## 1 2 3 4 5
## 0.6222 0.7472 0.6717 -1.2939 -0.7472
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.8892 1.6801 -0.529 0.633
## character 0.5558 1.0041 0.554 0.618
##
## Residual standard error: 1.1 on 3 degrees of freedom
## Multiple R-squared: 0.09267, Adjusted R-squared: -0.2098
## F-statistic: 0.3064 on 1 and 3 DF, p-value: 0.6185

Linear model (regression)

summary(lm(PC1~data$character))

## 
## Call:
## lm(formula = PC1 ~ data$character)
##
## Residuals:
## 1 2 3 4 5
## 0.6222 0.7472 0.6717 -1.2939 -0.7472
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3335 0.7777 -0.429 0.697
## data$characterSCB 0.5558 1.0041 0.554 0.618
##
## Residual standard error: 1.1 on 3 degrees of freedom
## Multiple R-squared: 0.09267, Adjusted R-squared: -0.2098
## F-statistic: 0.3064 on 1 and 3 DF, p-value: 0.6185

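Resampling
Because the sample is small, we can resample it with replacement to get a larger data set for the linear model. Keep in mind that drawing 30 rows from only 5 observations adds no new information; it mainly shrinks the standard errors, so the result should be read as an illustration.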

N <- length(cb.data[,1])
N.resample <- 30
idx = sample(1:N,N.resample,replace=TRUE)
cb.data.resample <- data.frame(cb.data[idx,])
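A more conventional way to use resampling on a small sample is the bootstrap: recompute the statistic on many resamples and look at the spread of the estimates, instead of fitting one model to a single enlarged sample. The sketch below is my own assumption about how that could look, not part of the original analysis. The PC1 values in toy are made up for illustration and meanDiff() is a hypothetical helper.

```r
set.seed(1)  # make the resampling reproducible

# Hypothetical data: labels as in the example, PC1 values invented for illustration
toy <- data.frame(
  character = c("SCB", "MCB", "SCB", "SCB", "MCB"),
  PC1 = c(0.8, 0.4, 0.9, -1.1, -1.0),
  stringsAsFactors = FALSE
)

# Difference in mean PC1 between SCB and MCB games for one resample
meanDiff <- function(d) {
  scb <- d$PC1[d$character == "SCB"]
  mcb <- d$PC1[d$character == "MCB"]
  if (length(scb) == 0 || length(mcb) == 0) {
    return(NA_real_)  # a resample can miss one group entirely
  }
  mean(scb) - mean(mcb)
}

bootDiff <- replicate(2000, meanDiff(toy[sample(nrow(toy), replace = TRUE), ]))
ci <- quantile(bootDiff, c(0.025, 0.975), na.rm = TRUE)  # percentile interval
ci
```

If the interval comfortably includes zero, the data do not separate the two groups.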

Here is the new linear model:

summary(lm(PC1~character,data=cb.data.resample))

## 
## Call:
## lm(formula = PC1 ~ character, data = cb.data.resample)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0215 -1.0215 -0.4076 0.9317 1.0869
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2961 0.5812 -2.230 0.0339 *
## character 0.6230 0.3413 1.825 0.0786 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9009 on 28 degrees of freedom
## Multiple R-squared: 0.1063, Adjusted R-squared: 0.07443
## F-statistic: 3.332 on 1 and 28 DF, p-value: 0.07863

The coefficient of the character variable (2 = SCB; 1 = MCB) is positive (0.62), which means SCB games score higher on Performance than MCB games, although with p = 0.079 the difference is significant only at the 10% level. So, on this evidence, a football manager game with “single character building” performs better than one with “multiple character building”.
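The coding 2 = SCB and 1 = MCB comes from how R stores factors: levels are sorted alphabetically and cbind() keeps only the integer codes. A small sketch with the labels from the example data:

```r
character <- c("SCB", "MCB", "SCB", "SCB", "MCB")  # labels from the example data
f <- as.factor(character)

levels(f)      # levels are sorted alphabetically, so MCB comes first
as.integer(f)  # integer codes behind the factor

# cbind() drops the factor class and keeps only the integer codes
codes <- cbind(character = f)[, 1]
codes
```

Because MCB is coded 1 and SCB is coded 2, a positive slope means higher values for SCB.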

5. Visualization
Create a radar chart to compare the 5 football games:

#rescale a numeric vector to the range 0 to 20
rescale <- function(x) (x-min(x))/(max(x) - min(x)) * 20
#as.character() first, so the star counts convert correctly even when they are stored as factors
rc.data <- data.frame(cbind(review=rescale(dataBasic$ratingCount),
  star5=rescale(as.numeric(as.character(dataBasic$star5))),
  star4=rescale(as.numeric(as.character(dataBasic$star4))),
  performance=rescale(data$PC1),
  ratingScore=rescale(dataBasic$score)
))
rownames(rc.data) <- dataBasic$title

#only three colors are defined for five games, so some games share a color
colors_border <- c( rgb(0.2,0.5,0.5,0.9), rgb(0.8,0.2,0.5,0.9) , rgb(0.7,0.5,0.1,0.9) )
colors_in <- c( rgb(0.2,0.5,0.5,0.4), rgb(0.8,0.2,0.5,0.4) , rgb(0.7,0.5,0.1,0.4) )
radarchart(rc.data , axistype=0 , maxmin=FALSE,
  #custom polygon
  pcol=colors_border , pfcol=colors_in , plwd=4 , plty=1,
  #custom the grid
  cglcol="grey", cglty=1, axislabcol="black", cglwd=0.8,
  #custom labels
  vlcex=0.8
)
op <- par(cex = 0.5) #legend text size
legend(x=0.84, y=.2, legend = rownames(rc.data), bty = "n", pch = 20 , col=colors_in , text.col = "black", text.font=1, cex=1.2, pt.cex=2)
par(op) #restore the previous graphics settings
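The rescale() helper is a plain min-max scaling to the range 0 to 20, which puts every axis of the radar chart on the same scale. A quick sketch with the rating counts printed earlier:

```r
rescale <- function(x) (x - min(x)) / (max(x) - min(x)) * 20

ratingCount <- c(3607278, 1569517, 3116860, 32612, 3488)  # from dataBasic above
scaled <- rescale(ratingCount)
round(scaled, 1)
```

The largest value maps to 20 and the smallest to 0. If all values in a column were identical, the denominator would be zero and the result would be NaN.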


We can also run sentiment analysis on gameReview with the Datumbox API:

library(RCurl) #getURL()
library(RJSONIO) #fromJSON()
library(stringr) #str_replace_all(), str_length()

Keys <- "YOUR_KEY" #get your key here http://www.datumbox.com/machine-learning-api/

### local function
getSentiment <- function (text, key){
  #' @param: text (char): raw text to be classified
  #' @param: key (char): API key for Datumbox
  #' @return: (list) with elements sentiment, subject, topic, gender
  text <- URLencode(text)

  #save all the spaces, then get rid of the other percent-encoded characters that break the API, then convert back the URL-encoded spaces
  text <- str_replace_all(text, "%20", " ")
  text <- str_replace_all(text, "%[0-9A-Fa-f]{2}", "")
  text <- str_replace_all(text, " ", "%20")

  #keep the request short
  if (str_length(text) > 360){
    text <- substr(text, 1, 359)
  }

  #sentiment
  data <- getURL(paste("http://api.datumbox.com/1.0/TwitterSentimentAnalysis.json?api_key=", key, "&text=",text, sep=""))
  js <- fromJSON(data, asText=TRUE)
  sentiment <- js$output$result

  #subjectivity
  data <- getURL(paste("http://api.datumbox.com/1.0/SubjectivityAnalysis.json?api_key=", key, "&text=",text, sep=""))
  js <- fromJSON(data, asText=TRUE)
  subject <- js$output$result

  #topic
  data <- getURL(paste("http://api.datumbox.com/1.0/TopicClassification.json?api_key=", key, "&text=",text, sep=""))
  js <- fromJSON(data, asText=TRUE)
  topic <- js$output$result

  #gender
  data <- getURL(paste("http://api.datumbox.com/1.0/GenderDetection.json?api_key=", key, "&text=",text, sep=""))
  js <- fromJSON(data, asText=TRUE)
  gender <- js$output$result

  return(list(sentiment=sentiment,subject=subject,topic=topic,gender=gender))
}
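As an alternative to stripping characters by hand, URLencode() can percent-encode every reserved character when reserved = TRUE. Here is a minimal sketch that only builds the request URL and makes no network call. buildQuery() is a hypothetical helper, the key is a placeholder, and whether the API copes with every encoded character is something I have not checked.

```r
buildQuery <- function(endpoint, key, text, maxChars = 360) {
  # truncate before encoding, so that no %XX escape is cut in half
  text <- substr(text, 1, maxChars)
  paste0("http://api.datumbox.com/1.0/", endpoint, ".json",
         "?api_key=", key,
         "&text=", URLencode(text, reserved = TRUE))
}

u <- buildQuery("TwitterSentimentAnalysis", "YOUR_KEY", "Best game & more!")
u
```

The "&" inside the review text is encoded, so it can no longer be mistaken for a parameter separator.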

clean.text <- function(some_txt) {
  #remove punctuation and digits
  some_txt = gsub("[[:punct:]]", "", some_txt)
  some_txt = gsub("[[:digit:]]", "", some_txt)

  # define "tolower error handling" function
  try.tolower = function(x)
  {
    y = NA
    try_error = tryCatch(tolower(x), error=function(e) e)
    if (!inherits(try_error, "error"))
      y = tolower(x)
    return(y)
  }
  some_txt = sapply(some_txt, try.tolower)
  #drop empty strings and the names added by sapply()
  some_txt = some_txt[some_txt != ""]
  names(some_txt) = NULL
  return(some_txt)
}
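The scraped reviews shown earlier all end with the link text "Full Review" and carry extra spaces, so they need a little more cleaning before any text analysis. A minimal sketch using three of those reviews; the regular expression is my own assumption about the pattern.

```r
# Three of the scraped reviews shown earlier
reviews <- c(" Ok Sejauh ini cukup menyenangkan Full Review ",
             " owsum!! Full Review ",
             " Loved it I will best player ever! Full Review ")

# drop the trailing link text, then trim, remove punctuation and digits, lowercase
stripped <- sub("\\s*Full Review\\s*$", "", reviews)
cleaned <- tolower(gsub("[[:punct:][:digit:]]", "", trimws(stripped)))
cleaned
```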

Further discussion
If you found this interesting, you can try other things related to data scraping, such as:

- Geocoding based on a place name

#Packages needed
library(RCurl)
library(RJSONIO)
library(plyr)

#build the request URL (note: this function masks base::url())
url <- function(address, return.call = "json", sensor = "false") {
  root <- "http://maps.google.com/maps/api/geocode/"
  u <- paste(root, return.call, "?address=", address, "&sensor=", sensor, sep = "")
  return(URLencode(u))
}

geoCode <- function(address,verbose=FALSE) {
  if(verbose) cat(address,"\n")
  u <- url(address)
  doc <- getURL(u)
  x <- fromJSON(doc,simplify = FALSE)
  Sys.sleep(0.5) #pause between requests; it has to come before return() to take effect
  if(x$status=="OK") {
    lat <- x$results[[1]]$geometry$location$lat
    lng <- x$results[[1]]$geometry$location$lng
    location_type <- x$results[[1]]$geometry$location_type
    formatted_address <- x$results[[1]]$formatted_address
    return(c(lat, lng, location_type, formatted_address))
  } else {
    return(c(NA,NA,NA, NA))
  }
}

address <- geoCode("Universitas islam indonesia")
address

## [1] "-7.7773117"                                                                                                                                   
## [2] "110.3929638"
## [3] "ROOFTOP"
## [4] "Universitas Islam Indonesia, Jl. Demangan Baru No.24, Caturtunggal, Kec. Depok, Kabupaten Sleman, Daerah Istimewa Yogyakarta 55281, Indonesia"
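Note that c() turns the latitude and longitude into character strings. The sketch below shows one way to keep them numeric by returning a one-row data frame instead. It parses a hypothetical, heavily trimmed response offline, and it uses jsonlite rather than RJSONIO, which is a swap on my part.

```r
library(jsonlite)

# Hypothetical, heavily trimmed response in the shape geoCode() expects
doc <- '{
  "status": "OK",
  "results": [{
    "formatted_address": "Universitas Islam Indonesia, Indonesia",
    "geometry": {
      "location": {"lat": -7.7773117, "lng": 110.3929638},
      "location_type": "ROOFTOP"
    }
  }]
}'

x <- fromJSON(doc, simplifyVector = FALSE)  # keep the nested list structure
if (x$status == "OK") {
  first <- x$results[[1]]
  loc <- data.frame(lat = first$geometry$location$lat,
                    lng = first$geometry$location$lng,
                    type = first$geometry$location_type,
                    stringsAsFactors = FALSE)
}
loc
```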

- Looking for your competitors on the Play Store based on particular keywords. (This function needs RSelenium and a browser driver.)

#' This function can be used for scraping competitor data based on keyword
library(RSelenium)
library(rvest)

getCompetitor <- function(keywords) {
  #'@param: keywords (char): the keywords whose competitors you are looking for
  #'@return: (list) app names, developers and ratings of the apps found for "keywords"

  #suppress warnings inside this function only and restore the setting on exit
  oldOptions <- options(warn = -1)
  on.exit(options(oldOptions))

  #generate the search URL from the keywords
  root <- "https://play.google.com/store/search?hl=en&c=apps&q=" #you can substitute "apps" with another category such as "books" or "movies"
  u <- paste(root,keywords,sep="")
  generatedURL <- URLencode(u)

  #RSelenium
  #checkForServer() and startServer() belong to older RSelenium releases;
  #newer releases provide rsDriver() for starting the server
  session <- checkForServer()
  session <- startServer(invisible = TRUE)
  remDr <- remoteDriver(browserName="chrome")
  session <- remDr$open()

  #navigate to page
  session <- remDr$navigate(generatedURL)

  #scroll down 5 times, waiting for the page to load at each time
  for(i in 1:5){
    remDr$executeScript(paste("scroll(0,",i*10000,");"))
    Sys.sleep(3)
  }

  #get the page html
  page_source <- remDr$getPageSource()
  #read_html() parses the page source
  htmlpage <- read_html(page_source[[1]])
  #scrape the competitors' developer names, app names and ratings
  competitorDeveloper <- html_text(html_nodes(htmlpage,".subtitle"))
  competitorAppName <- html_text(html_nodes(htmlpage,".title"))
  competitorRating <- html_text(html_nodes(htmlpage,".current-rating"))

  #close webdriver
  remDr$closeall()
  #return the competitor data
  return(list(competitorAppName=competitorAppName,
              competitorDeveloper=competitorDeveloper,
              competitorRating=competitorRating))
}
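Once the app names are scraped, counting the competitors that carry a keyword in their title is a one-liner with grepl(). In this sketch the titles from the dataBasic output stand in for a real competitorAppName vector, and the keyword is only an example.

```r
# App titles taken from the dataBasic output above, standing in for competitorAppName
competitorAppName <- c("Dream League Soccer", "New Star Soccer", "Score! Hero",
                       "Soccer Heroes RPG", "Football Saga Fantasista")
keyword <- "soccer"

# case-insensitive match of the keyword against every title
hasKeyword <- grepl(keyword, competitorAppName, ignore.case = TRUE)
nCompetitor <- sum(hasKeyword)

competitorAppName[hasKeyword]
nCompetitor
```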

- Scraping autocomplete keywords from the Play Store

getListAutoCompletePlayStore <- function(...) {
  #'@param: ... (char): one or more keywords
  #'@return: (list) the raw suggestion response for each keyword
  argument <- list(...)
  root <- "https://market.android.com/suggest/SuggRequest?json=1&c=0&hl=en&gl=US&query="
  #build one request URL per keyword
  generatedURLs <- c()
  for (keywords in argument) {
    generatedURL <- URLencode(paste(root,keywords,sep=""))
    generatedURLs <- c(generatedURL,generatedURLs)
  }
  #download the suggestions for each URL
  listsSuggested <- list()
  for (generatedURL in generatedURLs){
    suggestion <- getURL(generatedURL)
    listsSuggested <- c(suggestion,listsSuggested)
  }
  return(listsSuggested)
}
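The URL-building loop can also be written with vapply(), which keeps the URLs in the same order as the keywords. This sketch only builds the URLs and sends no request. The two keywords are examples of my own.

```r
root <- "https://market.android.com/suggest/SuggRequest?json=1&c=0&hl=en&gl=US&query="
keywords <- c("football manager", "soccer")

# One encoded request URL per keyword, in the same order as the input
generatedURLs <- vapply(keywords, function(k) URLencode(paste0(root, k)),
                        character(1), USE.NAMES = FALSE)
generatedURLs
```

URLencode() leaves the "?", "=" and "&" of the query string alone by default and only escapes characters such as the space.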

Thank You,
Have Fun.

