Here’s the generator!
I have been looking at a data set containing names and recipes for ~17000 cocktails. The original goal of the project was to use publicly available recipe data to inform a decision about what drinks to buy for a small home bar. The project has since spun out of control because I realized what a wealth of interesting trivia the data contains.
Conceptually it might be worthwhile to separate out drinks into several naming classes based on their popularity. There are popular drinks that can be found at nearly any bar worth its salt, there are specialty drinks which are often variants of a popular drink with ingredient tweaks, and there are unique drinks that are unlikely to be found at any bar and will require a recipe if you want one. Each class is generally named differently. Popular drinks have readily recognizable names. Specialty drinks often have parts of a popular drink name with additional words indicating the variations involved. The unique drinks do not follow a pattern as closely as the other two, reflecting the heterogeneity of people creating these unique drinks.
In order to learn a bit about how words are distributed within drink names I collated all the drink names in my data set and performed some simple analyses. After clean-up, I made a word cloud containing the top 100 words used in drink names. This analysis does not consider the popularity of the drink itself, but the data does contain some indication of popularity (e.g. more popular drinks are more likely to have multiple recipes with the same name).
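The word-counting step can be sketched in a few lines. This is only an illustration in Python (the actual analysis and app were done in R), and the clean-up rule here (lowercasing and keeping runs of letters and apostrophes) and the sample names are my assumptions, not the clean-up actually used.

```python
import re
from collections import Counter

def word_frequencies(names, top_n=100):
    """Count how often each word appears across drink names."""
    counts = Counter()
    for name in names:
        # Assumed clean-up: lowercase, keep letters and apostrophes only.
        counts.update(re.findall(r"[a-z']+", name.lower()))
    return counts.most_common(top_n)

# Hypothetical sample names, not taken from the real data set.
sample_names = ["Dry Martini", "Dirty Martini", "Rum Punch", "Planter's Punch"]
top_words = word_frequencies(sample_names, top_n=2)
```

The resulting list of (word, count) pairs is the kind of input a word cloud needs.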
An interesting follow-up to this analysis would be to re-weight the word frequencies by the popularity of the drinks they appear in. This would make the data look more like what you might see on a cocktail menu.
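One way the re-weighting might work, assuming the number of recipes sharing a name is used as the popularity signal: count each word once per recipe instead of once per unique name. The sketch below is in Python rather than R, and the function names and sample list are hypothetical.

```python
from collections import Counter

def unweighted_frequencies(recipe_names):
    """Each unique drink name contributes its words once."""
    counts = Counter()
    for name in {n.lower() for n in recipe_names}:
        counts.update(name.split())
    return counts

def popularity_weighted_frequencies(recipe_names):
    """Each word is weighted by how many recipes share the drink's name."""
    recipes_per_name = Counter(n.lower() for n in recipe_names)
    counts = Counter()
    for name, n_recipes in recipes_per_name.items():
        for word in name.split():
            counts[word] += n_recipes
    return counts

# Hypothetical data: one entry per recipe, so repeats stand in for popularity.
recipes = ["Margarita", "Margarita", "Margarita", "Blue Margarita", "Blue Lagoon"]
plain = unweighted_frequencies(recipes)
weighted = popularity_weighted_frequencies(recipes)
```

In this toy example "margarita" and "blue" are tied when every name counts once, but "margarita" pulls ahead once the three duplicate recipes are taken into account.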
As part of the data exploration, I calculated the frequencies of the top 500 words, as well as the distribution of the number of words in a cocktail name. The longest name in my data set was 8 words, with two-word names being the most frequent. I constrained my drink name generator to produce names between 2 and 5 words long. The generator draws a random number between 0 and 1 and compares it to a look-up table corresponding to the cumulative distribution function (cdf) of name lengths, which determines how many words the new name will have. That many words are then drawn using the same method, but from the cdf of word frequencies. The generator then returns your new drink name. The app was written using Shiny, a way to write small apps using R, which I will probably talk about in a future post.
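For the curious, the look-up step is inverse transform sampling, and a minimal sketch looks like the following. It is written in Python rather than the R used in the app, all the frequencies are made-up placeholders rather than values from my data, and the function names are hypothetical. It also allows the same word to be drawn twice, which may or may not match what the app does.

```python
import bisect
import itertools
import random

def build_cdf(frequencies):
    """Turn {item: count} into parallel lists of items and cumulative probabilities."""
    items = list(frequencies)
    total = sum(frequencies.values())
    cdf = list(itertools.accumulate(frequencies[i] / total for i in items))
    cdf[-1] = 1.0  # guard against floating-point rounding
    return items, cdf

def sample_from_cdf(items, cdf, rng):
    """Draw u in [0, 1) and return the first item whose cdf value exceeds u."""
    u = rng.random()
    return items[bisect.bisect_right(cdf, u)]

def generate_name(length_freqs, word_freqs, rng=None):
    rng = rng or random.Random()
    lengths, length_cdf = build_cdf(length_freqs)
    words, word_cdf = build_cdf(word_freqs)
    n_words = sample_from_cdf(lengths, length_cdf, rng)  # how many words
    picked = [sample_from_cdf(words, word_cdf, rng) for _ in range(n_words)]
    return " ".join(w.capitalize() for w in picked)

# Placeholder frequencies, NOT the real ones from the data set.
length_freqs = {2: 50, 3: 30, 4: 15, 5: 5}
word_freqs = {"martini": 9, "punch": 7, "blue": 6, "sour": 5, "lady": 3, "fizz": 2}

name = generate_name(length_freqs, word_freqs, rng=random.Random(0))
```

Restricting the length table to the keys 2 through 5 is what keeps generated names within the 2 to 5 word range.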