# Accio libraries
library(tidyverse)
library(here)
# Accio data
<- read.csv(file = here("materials", "data", "movies.csv"))
movies <- read.csv(file = here("materials", "data", "Characters.csv"))
characters <- read.csv(file = here("materials", "data", "quidditch.csv"))
quidditch <- read.csv(file = here("materials", "data", "quidditch_players.csv")) quidditch_players
Session 2: Spells 
Introduction
Well done on getting through the sorting stage, you should feel settled in now and ready to start learning some tidyverse spells. I also want to hard sell two messages:
- Don’t scroll down to the bottom of this script and go “No way, this looks really complicated - I’m out!”. Instead
trust that it will make sense if I approach it one step at a time
. I have to remind myself of this regularly and it applies just as much on day one as on day one thousand. - The examples in this script are all you need to get out there and start solving real world problems.
Once you get to the end of this script, you will have built a house with a foundation.
After that, it’s only wallpaper, flowerbeds and extensions from then on out.
Let’s load in the data we will need for this session up at the top of the quarto doc or script, like this:
Spells
OK let’s start learning some spells. I have chosen these ones as they are the ones I use almost daily. There are of course many more but you really can just figure out the rest as you go.
Reminder: we aren't asking our poor brains to remember where all the commas and squiggly lines are.
Instead, we will try and remember, for example, that there exists a function within the tidyverse called rename
, and trust that if we don’t remember where the commas and squiggly lines go, then we can just refresh our memories by looking it up on Google or by asking our favourite AI companion.
The rest of this demonstration will have the following simple structure: 8 categories of spell, and for each example there is also an example of how you might inspect/interrogate your output (Lumos!).
Spell 1: Rename
Tip: whenever you see this “|>
” I want you to say "and then"
in your head.
One of the great things about the tidyverse is that it is a very spoken way of writing code. Here I read the code below out in my head as:
Let's create an object called movies_spell_1. movies_spell_1 is what you get after you to take the movies dataframe and then rename the columns with the periods in the name
”.
Note as well that you could have just called the new object “movies” again. This would overwrite the old one. Don’t think that you always have to create a new object every time, I’m just doing it here for demonstrative purposes. I’ll leave you to form your own habits here.
<- movies |>
movies_spell_1 rename(
"id" = "Movie.ID",
"title" = "Movie.Title",
"year" = "Release.Year",
"box_office" = "Box.Office"
)
# Lumos!
names(movies_spell_1)
[1] "id" "title" "year" "Runtime" "Budget"
[6] "box_office"
While I’m here, it looks a bit funky that some of the names are in small case and some have capitals. Let’s use rename again here to fix this (note with much larger data sets you might look for a more dynamic, i.e., less manual way of renaming these):
<- movies_spell_1 |>
movies_spell_1 rename(
"runtime" = "Runtime",
"budget" = "Budget"
)
# Lumos!
names(movies_spell_1)
[1] "id" "title" "year" "runtime" "budget"
[6] "box_office"
Spell 2: Select
Next we have select(). When you think select(), you think let's get some columns.
I’ve shown a couple of examples of select() in action down below, but as with all of these spells, there is so much you can do with them, so if you run into a problem in R that is broadly to do with selecting columns, you can be confident that select() will have some role to play in the solution.
Spoken translation:
“Let's create a new object called characters_spell_2_example_1. What characters_spell_2_example_1 is is what you get after you take the characters dataframe and then select only the Character.Name, Gender, and House columns
”.
<- characters |>
characters_spell_2_example_1 select(Character.Name, Gender, House)
# Lumos!
names(characters_spell_2_example_1)
[1] "Character.Name" "Gender" "House"
Note that you can get the exact same results by using the column numbers:
<- characters |>
characters_spell_2_example_2 select(2, 4:5)
# Lumos!
names(characters_spell_2_example_2)
[1] "Character.Name" "Gender" "House"
Here is an example of using select() with another useful function inside it. There are lots of things that can be used within select(). Here you use starts_with() to select every column that starts with “Wand”. Other notable ones include ends_with() and contains().
<- characters |>
characters_spell_2_example_3 select(starts_with("Wand"))
# Lumos!
names(characters_spell_2_example_3)
[1] "Wand..Wood." "Wand..Core."
A final useful thing to point out is that you don’t always need to reduce the number of columns when using Select, you can also use it to do a bit of tidying, here by re-ordering the columns.
Spoken translation:
“Let's create a new object called movies_spell_2_example_4. What that object is is what you get after taking the movies dataframe and then first selecting the Movie.Title column, then the Release.Year column and then just selecting everything else as it was before
”.
<- movies |>
movies_spell_2_example_4 select(Movie.Title, Release.Year, everything())
# Lumos!
names(movies_spell_2_example_4)
[1] "Movie.Title" "Release.Year" "Movie.ID" "Runtime" "Budget"
[6] "Box.Office"
Spell 3: Filter
So far so good? Remember, one step at a time. The next spell in our tidyverse spell book is Filter(). All you need to remember is that where you have select() for columns, you have filter() for rows.
Spoken translation:
"Let's create an object called characters_spell_3_example_1. What that object is is what you get after taking the characters dataframe and then filtering it to contain only rows in the House column that exactly equal the value "Gryffindor
“.
<- characters |>
characters_spell_3_example_1 filter(House == "Gryffindor")
# Lumos!
head(characters_spell_3_example_1 |> select(House))
House
1 Gryffindor
2 Gryffindor
3 Gryffindor
4 Gryffindor
5 Gryffindor
6 Gryffindor
Spoken translation:
“Let's create an object called characters_spell_3_example_2. That object is what is created after taking the character dataframe and then filtering the rows so that the only rows left are those where the values in the Gender column are exactly equal to Female and the values in the Wand..Core. column are exactly equal to Dragon Heartstring
”.
<- characters |>
characters_spell_3_example_2 filter(Gender == "Female" & Wand..Core. == "Dragon Heartstring")
# Lumos!
head(characters_spell_3_example_2 |> select(Gender, Wand..Core.))
Gender Wand..Core.
1 Female Dragon Heartstring
2 Female Dragon Heartstring
3 Female Dragon Heartstring
4 Female Dragon Heartstring
For this example coming up, we are going to create a useful object called a vector
to help us write our next bit of code that uses filter. Think of this as being similar to creating a helper column in Excel. We know from watching the films that the cool houses are Gryffindor, Hufflepuff, and Ravenclaw, so let’s pop them in a vector for storage and call it cool_houses. The benefit of this is that whenever you want to refer to all of these at once, all we need to do is refer to cool_houses (imagine a vector with 500 names in it and you see why it becomes a more and more useful thing at scale).
Spoken translation:
“Let's create an object called cool_houses. What cool_houses is is a vector containing the three names of the cool houses. Now that I have these ready to go, let's create a thing called characters_spell_3_example_3. What characters_spell_3_example_3 is what you get after taking characters and filtering the rows so that only rows in the House column that match the values in cool_houses remain
”.
<- c("Gryffindor", "Hufflepuff", "Ravenclaw")
cool_houses <- characters |>
characters_spell_3_example_3 filter(House %in% cool_houses)
# Lumos!
unique(characters_spell_3_example_3$House)
[1] "Gryffindor" "Ravenclaw" "Hufflepuff"
We can also use filter() to look for things to exclude (as opposed to looking for things to include above with the cool_houses). Let’s follow on with that example and assign the value “Slytherin” to the object uncool_houses. From here we can use Filter with the != (“not equal to”) operator.
Spoken translation:
“Let's create an object called uncool_houses. Let's assign the value, a string, Slytherin, to it. With this at our disposal, let's create an object called characters_spell_3_example_4. characters_spell_3_example_4 is what you get by taking the characters dataframe and filtering so that only rows that don't equal the value assigned to uncool_houses remain
”.
<- "Slytherin"
uncool_houses <- characters |>
characters_spell_3_example_4 filter(House != uncool_houses)
# Lumos!
unique(characters_spell_3_example_4$House)
[1] "Gryffindor" ""
[3] "Ravenclaw" "Hufflepuff"
[5] "Beauxbatons Academy of Magic" "Durmstrang Institute"
Spell 4: Mutate
When you hear mutate() I want you to think about creating a new column in Excel which uses the data from the other columns. Think "let's create some new stuff from some old stuff
“.
Spoken translation:
“Let's create an object called movies_spell_4_example_1, which is what you get after taking the movies dataframe and then creating a new column called profit. Each row in that column is what you get when you subtract the Budget value from the Box.Office value for that row
”.
<- movies |>
movies_spell_4_example_1 mutate(profit = Box.Office - Budget)
# Lumos!
head(movies_spell_4_example_1 |> select(Budget, Box.Office, profit))
Budget Box.Office profit
1 125000000 1002000000 877000000
2 100000000 880300000 780300000
3 130000000 796700000 666700000
4 150000000 896400000 746400000
5 150000000 942000000 792000000
6 250000000 943200000 693200000
Mutate with case_when
Similarly to all of the other tidyverse spells, there’s a lot of things you can do within mutate(). Just knowing of and remembering that mutate() exists and that it creates new columns is half the battle - that’s your ticket to finding the resources you need whilst solving problems on the go because you can use it in searches and prompts. Here’s one very common example though that will come up a fair bit - using mutate() alongside case_when(). Just think of creating a new column in Excel and using an if-else of if-else-if-else etc to fill it.
Spoken translation:
“Let's store all of the main characters in vector called main_characters so that I can refer to it easily. Let's then create an object called characters_spell_4_example_2. characters_spell_4_example_2 is what you get by taking the characters dataframe and then creating a new column called main_character_yn. To create main_character_yn, let's go through each row and if that row has a value that's in our main_characters vector, let's fill that row with y. For anything else that could possibly ever come up, let's call that n
”.
<- c(
main_characters "Harry Potter", "Ron Weasley", "Hermione Granger",
"Albus Dumbledore", "Severus Snape", "Voldemort"
)
<- characters |>
characters_spell_4_example_2 mutate(main_character_yn = case_when(
%in% main_characters ~ "y",
Character.Name TRUE ~ "n"
))
# Lumos!
table(characters_spell_4_example_2$main_character_yn)
n y
160 6
Spell 5: Pivot
This will make sense to you if you have used pivot tables in Excel or Google Sheets. All you need to know is:
- You can arrange your data long or wide.
- Just like pivot tables in Excel.
- Long format by default is probably the tidiest, but you don’t need to think about this much unless the problem dictates it. A common example is data visualisation tools needing the data behind a plot to be in a specific format.
Pivot_longer
Spoken translation:
“Let's create a new object called quidditch_spell_5_example_1. This new object is what you get after taking the quidditch dataframe and then converting the four house columns from wide format to long format. Let's present the information on which house the points refer to in a column called house, and let's pop all of the values, i.e., quidditch points scored into a column called points
”.
<- quidditch |>
quidditch_spell_5_example_1 pivot_longer(
cols = gryffindor:slytherin,
names_to = "house",
values_to = "points"
)
# Lumos!
head(quidditch_spell_5_example_1, n = 10)
# A tibble: 10 × 6
id Movie.ID month top_scorer house points
<int> <int> <chr> <chr> <chr> <int>
1 1 1 jan Harry Potter gryffindor 100
2 1 1 jan Harry Potter hufflepuff 70
3 1 1 jan Harry Potter ravenclaw 100
4 1 1 jan Harry Potter slytherin 20
5 2 1 feb Cedric Diggory gryffindor 100
6 2 1 feb Cedric Diggory hufflepuff 100
7 2 1 feb Cedric Diggory ravenclaw 120
8 2 1 feb Cedric Diggory slytherin 60
9 3 1 mar Oliver Wood gryffindor 100
10 3 1 mar Oliver Wood hufflepuff 20
Pivot_wider
The easiest way to demonstrate pivot_wider() is just to undo what we did above with pivot_longer()!
Spoken translation:
“Let's create an object called quidditch_spell_5_example_2. This new object quidditch_spell_5_example_2 is what you get after you take quidditch_spell_5_example_1 and then pivot the house column wider. This means that all of the unique values in that column, the four houses, will have their own column. The values in that column will come from the points column
”.
<- quidditch_spell_5_example_1 |>
quidditch_spell_5_example_2 pivot_wider(names_from = house, values_from = points)
# Lumos!
head(quidditch_spell_5_example_2, n = 10)
# A tibble: 10 × 8
id Movie.ID month top_scorer gryffindor hufflepuff ravenclaw slytherin
<int> <int> <chr> <chr> <int> <int> <int> <int>
1 1 1 jan Harry Potter 100 70 100 20
2 2 1 feb Cedric Diggory 100 100 120 60
3 3 1 mar Oliver Wood 100 20 110 120
4 4 1 apr Walburga Black 60 40 140 60
5 5 1 may Oliver Wood 90 130 50 60
6 6 1 jun Seamus Finnig… 0 150 20 60
7 7 1 jul Draco Malfoy 120 30 100 70
8 8 1 aug Fred Weasley 90 140 90 90
9 9 1 sep Oliver Wood 80 20 10 110
10 10 1 oct Pomona Sprout 50 20 130 80
Spell 6: Join
If you have worked with databases before you will know the principle of joining very well. If you have used VLOOKUP and its many off-shots in Excel you will also have a good idea. We use joins when you have two sources of data and we want to use the content of one to expand upon the other. This usually means having one smaller look-up table and one larger main data set. In the example below we have two sources of data.
quidditch_players
: a small list of characters and a yn indicator as to whether or not they are quidditch players.characters
: a much larger data set with information on all of the characters.
Wouldn’t it be cool if we could make our character dataframe even more informative by also including whether or not a given character is a quidditch player? Well all we need to do that is some common source of information between the two data sources and we are ready to rock!
Since both sources of data contain the character names in a column we can use this information to join them together. It’s worth noting that there are several different types of joins that exist, but I almost never use them. For some reason I almost always use left joins, as in the example below.
Here’s the look-up table:
# Lumos!
head(quidditch_players)
Character.Name quidditch_player
1 Harry Potter y
2 Cedric Diggory y
3 Hermoine Granger n
4 Oliver Wood y
5 Percy Weasley n
6 Walburga Black n
And here’s the left join. Spoken translation:
“Let's create an object called characters_spell_6_example_1. This new object is what you get after you take the characters dataframe and join the quidditch_players dataframe to it. This in practice means creating a new column in the characters dataframe called quidditch_player as that is the only other column in the lookup table. As such, use the Character.Name column to join them as they both have this in common, and wherever a row in the Character.Name column in the characters dataframe matches a row in the quidditch_players lookup, fill the value in the quidditch_player column with whatever is in the lookup table. If there isn't a match, just return NA
”.
<- characters |>
characters_spell_6_example_1 left_join(quidditch_players, by = "Character.Name")
# Lumos!
head(characters_spell_6_example_1 |>
select(Character.Name, quidditch_player) |>
filter(!is.na(quidditch_player)), n = 10)
Character.Name quidditch_player
1 Harry Potter y
2 Draco Malfoy y
3 Fred Weasley y
4 Ginny Weasley y
5 Luna Lovegood n
6 Seamus Finnigan y
7 Cedric Diggory y
8 Percy Weasley n
9 Oliver Wood y
10 Lavender Brown y
Spell 7: Piping
Just like Dumbledore in the Deathly Hallows Part Two, I’ve been with-holding information from you all along (I’m sorry if these metaphors are too much but I’m really enjoying them). What I’ve been holding back is that sword of Gryffindor reveals itself not to anyone who needs it, but instead to anyone that presses “shift + \ + >
”. This shortcut reveals the sword of Gryffindor, otherwise known as the native R pipe |>.
Remember I asked you to say "and then
” in your head every time you seen it? Well you were subconsciously training yourself to call the sword all along. You can call on the sword to chain together as many other tidyverse spells as you want in one bit of code, unlocking the true power of the tidyverse.
Spoken translation:
“Let's create an object called characters_spell_7_example_1. This new object is what is created when you take the characters dataframe AND THEN filter the rows so that no rows where Species is Human or Species is a blank cell remain AND THEN create a new column called character_morals. To fill the rows of character_morals, go row-by-row and if the value for Character.Name is one of Kreacher, Fenrir Greyback, or Bogrod, then fill the value is 'baddie'. IF the value for Character.Name is one of Griphook, Aragog, or Bane, then the value is 'its_complicated'. For everything and anything else that could possibly come up, then the value is 'goodie'....AND THEN keep only columns in my new object that contain the string "character", irrespective or case
”.
<- characters |> # AND THEN
characters_spell_7_example_1 filter(Species != "Human" & Species != "") |> # AND THEN
mutate(character_morals = case_when(
%in% c("Kreacher", "Fenrir Greyback", "Bogrod") ~ "baddie",
Character.Name %in% c("Griphook", "Aragog", "Bane") ~ "its_complicated",
Character.Name TRUE ~ "goodie"
|> # AND THEN
)) select(contains("character"))
# Lumos!
head(characters_spell_7_example_1, n = 10)
Character.ID Character.Name character_morals
1 5 Rubeus Hagrid goodie
2 11 Remus Lupin goodie
3 27 Dobby goodie
4 30 Griphook its_complicated
5 39 Moaning Myrtle goodie
6 47 Nearly Headless Nick goodie
7 48 Kreacher baddie
8 54 Helena Ravenclaw goodie
9 59 Filius Flitwick goodie
10 64 Bogrod baddie
Spell 8: Summarise
Almost there! Imagine you are in an Excel workbook and you are wanting to generate descriptive statistics, means, medians, percentages, counts, and so on and so forth. In the tidyverse, you can do all of that under the summarise() umbrella. This is useful in two ways:
- It fits nicely in with all your other tidyverse code.
- It gives you an anchor to start any Googling or prompting you need to do to find whatever variation you need for a given problem.
Spoken translation:
“Let's create an object called movies_spell_8_example_1. This new object is what you get after taking the movies dataframe and calculating the mean and the median of the Box.Office column. When calculating these, if any NAs pop up, let's just ignore them
”.
<- movies |>
movies_spell_8_example_1 summarise(
mean_box_office = mean(Box.Office, na.rm = TRUE),
median_box_office = median(Box.Office, na.rm = TRUE)
)
# Lumos!
print(movies_spell_8_example_1)
mean_box_office median_box_office
1 972437500 942600000
I’m adding this one just to show that you can do lots of funky stuff within summarise(), to hammer home that you routinely start off with summarise() and then have a bit of faith that the solution will more often that not exist within it.
Spoken translation:
“Let's create an object called movies_spell_8_example_2. This is what you get after taking the movies dataframe and then calculating the percentage of films that meet the definition of being long (i.e., has Runtime greater than 140 minutes). In order to calculate that let's create a vector of values, one for every row in Runtime, with a 1 being assigned to the row if the statement Runtime > 140 is TRUE and a 0 if FALSE. After that, let's average the 1 and 0s in this vector and ignore any NAs as we go. And finally, let's multiply this by 100 so as to make it a percentage
”.
<- movies |>
movies_spell_8_example_2 # Let's call a "long" film one that is over 140 mins
summarise(percentage_long = mean(Runtime > 140, na.rm = TRUE) * 100)
# Lumos!
print(movies_spell_8_example_2)
percentage_long
1 75
A very common function to use alongside summarise is group_by(). Here we get to point all of that work that summarise() does for us to any number of groups that we specify in the group_by() expression. I’ve also added a couple of extra lines at the end to tidy up the results.
Spoken translation:
“Let's create an object called quidditch_spell_8_example_3. This new object is what you get when you take the quidditch dataframe AND THEN create a new column called total_score. total_score is filled by adding the points for all four houses for a given row...AND THEN we are going to get ready for averaging by first grouping our data by Movie.ID so that we can get an average of each...AND THEN let's go ahead and find the mean total score for each film, ignoring any NAs as we go along....AND THEN after we have our results let's round the values in any numeric column we come across to 0 decimal places AND THEN let's finish up by sorting these values from largest to smallest so that we can easily see which films have the highest average quidditch scores
”.
<- quidditch |>
quidditch_spell_8_example_3 mutate(total_score = gryffindor + hufflepuff + ravenclaw + slytherin) |>
group_by(Movie.ID) |>
summarise(mean_total_score = mean(total_score, na.rm = TRUE)) |>
mutate_if(is.numeric, round, 0) |>
arrange(desc(mean_total_score))
# Lumos!
print(quidditch_spell_8_example_3)
# A tibble: 8 × 2
Movie.ID mean_total_score
<dbl> <dbl>
1 2 342
2 4 324
3 8 304
4 7 299
5 1 295
6 6 293
7 3 280
8 5 276