Hello readers, this post will be a good one, I promise. I’ve started using R Markdown, which should greatly increase the speed with which I can get analyses to you.

Diving Back In

We left off last time having downloaded a ton of html files containing drink recipe data. The first thing to do is to have a look inside one of those files; this step is critical for figuring out how to pull out the information we want.
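A quick way to peek at one of them from the R console (the file name below is just a stand-in; substitute any name returned by list.files()):

# Print the first handful of lines of one downloaded recipe page
head(readLines("73-bus-2.html"), 20)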

Excerpts

Found the drink name

<title> 73 Bus #2 recipe</title> 

Found some keywords

<meta content="73 bus #2, 73, bus, #2, gin,
  triple sec, lime juice, cranberry juice, drink recipe, drink, recipe,
  alcoholic drink recipe, cocktail recipe, cocktail, mixed drink, martini"
  name="keywords">

Found the hierarchy (drink class)

<div class="pm" style="margin-top:20px;"><a href="/cat/1/
  ">Cocktails</a>
  > <a href="/cat/14/">Short drinks</a> 
  > <a href="/cat/141/">by base-ingredient</a> 
  > <a href="/cat/40/">gin-based</a></div>

Pulling out the information

So at this point I’ve identified seven variables I’d like to track for each drink:

  1. Drink name
  2. Drink hierarchy (class)
  3. Ingredients
  4. Keywords
  5. Number of ratings
  6. Average rating (out of 10)
  7. URL

In order to get at these values I need to design a regular expression that captures only the tag of interest. I discussed regular expressions briefly in my previous post; we will rely heavily on the non-greedy quantifier “.+?” that I discussed here.
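As a quick illustration of the difference, here are the greedy and non-greedy versions side by side on the title line from the excerpt above (hard-coded just for the example):

line <- "<title> 73 Bus #2 recipe</title> <meta name=\"keywords\">"

# Greedy: .+ runs to the last ">" on the line and swallows both tags
regmatches(line, regexpr("<.+>", line, perl = TRUE))
# [1] "<title> 73 Bus #2 recipe</title> <meta name=\"keywords\">"

# Non-greedy: .+? stops at the first ">" it can
regmatches(line, regexpr("<.+?>", line, perl = TRUE))
# [1] "<title>"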

Data Extraction

In order to pull the data out I used R (surprise, surprise). I wrote two small accessory functions: one to make string manipulation easier, and one to remove html tags.

#Use either regexpr (default) or gregexpr to match elements of interest
#Extract and return them using regmatches
matchPull <- function(pattern, text, invert = FALSE, global = FALSE, ...){
  if(global){
    match <- gregexpr(pattern, text, ...)
  } else {
    match <- regexpr(pattern, text, ...)
  }
  
  pulled <- regmatches(text, match, invert = invert)
  if(length(pulled) == 0) pulled <- NA
  
  pulled
}

# Remove html tags. Note the use of the .*? quantifier, 
# a cousin of .+? that can match 0 characters
# where .+? matches 1 or more.
stripTags <- function(text){
  gsub("<.*?>", "", text, perl = TRUE)
}

First step: let’s bring in one of the many html files we downloaded and try to extract all of the important data. We’ll use the excerpts noted above for practice.

# siteName holds the path to one of the downloaded html files
fileCon <- file(siteName, blocking = FALSE)
site <- paste0(readLines(con = fileCon), collapse = "\n")
close(fileCon)

# Pull out the whole title, remove the tags,
# and remove the word "recipe" which trails each drink name
name <-  matchPull("<title>.*?</title>", site,
                   ignore.case = TRUE, perl = TRUE)
name <- stripTags(name)
name <- sub(" recipe$", "", name, perl = TRUE)
  
# Pull out just the meta tag with the name "keywords",
# then pull out its contents, and remove the quotes
keywords <- matchPull("<meta content=.*?name=\"keywords\"", site, 
                      ignore.case = TRUE, perl = TRUE)
keywords <- matchPull("\".*?\"", keywords, perl = TRUE)
keywords <- gsub("\"", "", keywords)

# Pull out the division of class "pm" style "yadda-yadda"
# and remove all tags
hierarchy <- matchPull("<div class=\"pm\" style=\"margin-top:20px;\">.*?</div>", 
                       site, ignore.case = TRUE, perl = TRUE)
hierarchy <- stripTags(hierarchy)

Once we’ve figured out how to get all the useful data out of one file, we can encase it in a function that returns one row of data and apply that function to all the files we downloaded (after testing it on a much smaller subset). After that we’ll have a data frame containing all the juicy data, which is much easier to work with. Suppose we’ve encased our processing in processSite <- function(siteName); we can then apply it to many sites all at once by wrapping it in another function.
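As a rough sketch, processSite can be built from exactly the extractions worked out above; the ingredient, rating, and URL fields are left as NA placeholders here (and the column names are just my own choices) since their tags aren’t reproduced in this post.

# A minimal sketch of processSite using only the extractions shown above.
# The remaining fields would be pulled out the same way once their tags
# are identified, so they are left as NA placeholders for now.
processSite <- function(siteName){
  fileCon <- file(siteName, blocking = FALSE)
  site <- paste0(readLines(con = fileCon), collapse = "\n")
  close(fileCon)
  
  name <- stripTags(matchPull("<title>.*?</title>", site,
                              ignore.case = TRUE, perl = TRUE))
  name <- sub(" recipe$", "", name, perl = TRUE)
  
  keywords <- matchPull("<meta content=.*?name=\"keywords\"", site,
                        ignore.case = TRUE, perl = TRUE)
  keywords <- gsub("\"", "", matchPull("\".*?\"", keywords, perl = TRUE))
  
  hierarchy <- stripTags(matchPull(
    "<div class=\"pm\" style=\"margin-top:20px;\">.*?</div>",
    site, ignore.case = TRUE, perl = TRUE))
  
  data.frame(name = name, hierarchy = hierarchy, keywords = keywords,
             ingredients = NA, numRatings = NA, avgRating = NA,
             url = siteName, # placeholder; the real URL comes from the download step
             stringsAsFactors = FALSE)
}

The wrapper then just seeds a data frame from the first result and fills in one row per file: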

processSites <- function(siteList){
  frameSeed <- processSite(siteList[1])
  drinkFrame <- frameSeed[rep(1, length(siteList)),]
  
  #This sapply structure is basically just a for loop
  sapply(1:length(siteList), function(i){
    drinkFrame[i,] <<- processSite(siteList[i]) 
  })
  
  drinkFrame
}

fileNames <- list.files() #Make sure you've set your working directory before this
fileNames <- fileNames[grepl("\\.html$", fileNames)] #grab just the .html files

#Try it out on the first 6 files
practiceNames <- head(fileNames)
practiceData <- processSites(practiceNames)

#After inspecting practice data for quality, process them all
drinkData <- processSites(fileNames)

Processing ~17,000 sites took R around 6 minutes (wow) on my computer, producing a 10 MB data frame, which I saved so that I never have to run this code again. Now that the data is in, we can begin with the fun parts. I’ll skip some of the quality control steps to get right to the meaty stuff.
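Saving the frame is a one-liner (and it matches the load() call in the next section):

# Save the processed frame so the six-minute scrape never has to be re-run
save(drinkData, file = "drinkData.rda")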

Exploratory Data Analysis (EDA)

The cornerstone of any data-related project is poking and prodding the data to figure out what’s in there. Make some histograms, correlation matrices, and any other simple data visualizations you think might be informative. This step is probably my favourite because it

  1. Helps recognize general patterns
  2. Identifies issues with data quality
  3. Lets you watch your data set begin to tell its first story

I’ll present one EDA exercise that I thought was fun. I got interested in what people like to name their drinks; surely there would be some cool patterns in that. I decided to try my hand at making a word cloud of some of the most common words in drink names. To make a word cloud in R I used the wordcloud package and the tm (text mining) package.

library(wordcloud)
library(tm)

load("drinkData.rda") #Bring in our drinkData from the last step

# Filter out drinks that are neither cocktails nor shots 
# by looking in their hierarchy
drinksFrame <- drinkData[grepl("(cocktails)|(shots)", 
                               drinkData$hierarchy, 
                               ignore.case = TRUE, 
                               perl = TRUE),]

# I had a few cases of multiple duplicates, 
# so this loop tacks "Alt" onto duplicated names,
# repeating until no duplicates remain
while(anyDuplicated(drinksFrame$name) != 0){
  dupeNamed <- duplicated(drinksFrame$name)
  drinksFrame$name[dupeNamed] <- 
    paste(drinksFrame$name[dupeNamed],"Alt",sep=" ")
}

#Clean up the dupeNamed vector, which is no longer needed
rm(dupeNamed)

#Convert from character vector to one long string of words
nameVector <- paste0(tolower(drinksFrame$name), collapse = " ") 

# Use tm's built-in functions to remove stopwords (see below for a note),
# remove "alt" (because I put it there), and strip punctuation.
# I removed numbers because the site names duplicates with sequential 
# numbers, so "2" was one of the most popular words.
nameVectorCleaned <- removeWords(nameVector, c(stopwords("english"), "alt"))
nameVectorCleaned <- removePunctuation(nameVectorCleaned)
nameVectorCleaned <- removeNumbers(nameVectorCleaned)

# Split the cleaned string back into a vector of words (tokens really)
# Separated by white space
nameVector <- unlist(strsplit(nameVectorCleaned, " "))

#Then use table to count instances of each word
#Drop the first entry, which is an empty string
#(an unfortunate consequence of our splitting approach)
nameFreqs <- table(nameVector)[-1]
nameWords <- as.character(names(nameFreqs))
nameFreqs <- as.numeric(nameFreqs)

freqOrder <- order(nameFreqs, decreasing = TRUE) #Create an ordering vector
top100 <- head(freqOrder, 100) #Indices of the 100 most popular words

#Make a 10 inch by 10 inch pdf to hold the wordcloud  
pdf("boozeNameCloud.pdf", width = 10, height = 10)

#Plot the words, ordered and sized by frequency
wordcloud(nameWords[top100], nameFreqs[top100], 
          scale = c(12,1), random.order = FALSE)

#Close up shop and admire our work
dev.off()

I re-ran the code to make the wordcloud pdf multiple times, because there is some randomness in the word placement. After a few tries the words aligned in a way that suited my aesthetic tastes. And so I give you

Booze Cloud!

Click to embignify