install.packages(c("tidyr", "dplyr", "reshape2"))Shaping and Reshaping Data
Requirements for this session
Run the following code in your local system to install the packages required for this session. It is also recommended to run update.packages(ask = FALSE) to update all your packages.
We will also revisit the data set fuel prices.
download.file(
url = "https://kamapu.gitlab.io/coachingR/Topics/Data/fuel-prices.csv",
destfile = "fuel-prices.csv", method = "curl")Data structures
While we have previously treated the vectors as the basic data structure in R, the most common way to structure data sets in R is data frame (class data.frame). Therefore, this session is dedicated to data frames and similar structures.
List
A list in R is a list of objects. At the same time, lists are objects themselves (class list) and can contain lists within them, allowing hierarchical structures (e.g. trees). This type of object is versatile and is often used by functions to collect complex outputs (e.g. outputs from models such as linear models lm()).
# List with three
many_objects <- list(A = iris, B = co2, C = airquality)
summary(many_objects) Length Class Mode
A 5 data.frame list
B 468 ts numeric
C 6 data.frame list
Lists have a length attribute and can be indexed using square brackets ([]). Double square brackets ([[]]) can be used to extract individual items from a list.
# A subset of the previous list
two_objects <- many_objects[c("A", "C")]
summary(two_objects) Length Class Mode
A 5 data.frame list
C 6 data.frame list
# Extract a single item
item_B <- many_objects[["B"]]
summary(item_B) Min. 1st Qu. Median Mean 3rd Qu. Max.
313.2 323.5 335.2 337.1 350.3 366.8
Alternatively, individual items can be accessed using the dollar symbol ($).
summary(many_objects$A) Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
Data Frame
The data frame is probably the most common data structure used in R and represents a column-oriented table, aka. database list. Data frames are actually lists of vectors, all of the same length. Therefore, we can use the same ways to index columns in data frames as items in lists.
# Select some columns of a data frame
head(iris[c("Species", "Sepal.Length")]) Species Sepal.Length
1 setosa 5.1
2 setosa 4.9
3 setosa 4.7
4 setosa 4.6
5 setosa 5.0
6 setosa 5.4
# Extract the content of one column
summary(iris[["Species"]]) setosa versicolor virginica
50 50 50
# Alternative with dollar symbol
summary(iris$Species) setosa versicolor virginica
50 50 50
In data frames (as in matrices) it is possible to use two indexes in square brackets, separated by a comma. The first index refers to the rows, the second to the columns.
# Tow vectors as indices
iris[c(5, 75, 125), c("Species", "Petal.Width")] Species Petal.Width
5 setosa 0.2
75 versicolor 1.3
125 virginica 2.1
Although a data frame in R resembles the structure of a matrix, they are completely different objects. A matrix (class matrix) is actually a vector with two dimensions as an attribute (number of rows and number of columns) and all other properties shared with vectors.
Thus, while a data frame can contain columns of different vector classes, the matrix can have only one class assigned to the whole object.
Curiously, while vectors are special cases of matrices in mathematics, the reverse is true in R.
Arrays
Arrays in R are 3D matrices. You can also think of them as several matrices of the same dimension stacked on top of each other.
Arrays are indexed by three positions between square brackets (rows, columns and slices or layers) and only one data type, the vector class, is allowed for the whole object. A classic example of an array is the data set iris3, which contains the same units (cm) in all numerical records.
# Select rows in all columns and slices
iris3[1:5, , ], , Setosa
Sepal L. Sepal W. Petal L. Petal W.
[1,] 5.1 3.5 1.4 0.2
[2,] 4.9 3.0 1.4 0.2
[3,] 4.7 3.2 1.3 0.2
[4,] 4.6 3.1 1.5 0.2
[5,] 5.0 3.6 1.4 0.2
, , Versicolor
Sepal L. Sepal W. Petal L. Petal W.
[1,] 7.0 3.2 4.7 1.4
[2,] 6.4 3.2 4.5 1.5
[3,] 6.9 3.1 4.9 1.5
[4,] 5.5 2.3 4.0 1.3
[5,] 6.5 2.8 4.6 1.5
, , Virginica
Sepal L. Sepal W. Petal L. Petal W.
[1,] 6.3 3.3 6.0 2.5
[2,] 5.8 2.7 5.1 1.9
[3,] 7.1 3.0 5.9 2.1
[4,] 6.3 2.9 5.6 1.8
[5,] 6.5 3.0 5.8 2.2
# Select rows, columns and slices
iris3[1:5, c("Petal L.", "Sepal L."), "Virginica"] Petal L. Sepal L.
[1,] 6.0 6.3
[2,] 5.1 5.8
[3,] 5.9 7.1
[4,] 5.6 6.3
[5,] 5.8 6.5
Here is a code that converts the iris data frame into an array in case it is needed for similar data sets.
# Extract vector with species correspondence
species <- iris$Species
# Split the data frame excluding the column 'Species'
cm_dimensions <- split(iris[names(iris) != "Species"], species)
# Define empty array
iris_array <- array(dim = c(50, 4, 3))
dimnames(iris_array) <- list(
NULL, # No row names
names(iris[names(iris) != "Species"]), # Column names (without column 'Species')
names(cm_dimensions) # Names for slices
)
# Insert the slices in a loop
for (i in names(cm_dimensions))
iris_array[ , , i] <- as.matrix(cm_dimensions[[i]])
# See the result
iris_array[1:5, , ], , setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,] 5.1 3.5 1.4 0.2
[2,] 4.9 3.0 1.4 0.2
[3,] 4.7 3.2 1.3 0.2
[4,] 4.6 3.1 1.5 0.2
[5,] 5.0 3.6 1.4 0.2
, , versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,] 7.0 3.2 4.7 1.4
[2,] 6.4 3.2 4.5 1.5
[3,] 6.9 3.1 4.9 1.5
[4,] 5.5 2.3 4.0 1.3
[5,] 6.5 2.8 4.6 1.5
, , virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,] 6.3 3.3 6.0 2.5
[2,] 5.8 2.7 5.1 1.9
[3,] 7.1 3.0 5.9 2.1
[4,] 6.3 2.9 5.6 1.8
[5,] 6.5 3.0 5.8 2.2
Transposing Arrays
Since arrays can be considered as 3d matrices, it is obvious that they can be transposed along their three axes. To do this we can use the function aperm().
# Exchange columns with slices
iris3_tr <- aperm(iris3, c(1, 3, 2))
iris3_tr[1:5, , ], , Sepal L.
Setosa Versicolor Virginica
[1,] 5.1 7.0 6.3
[2,] 4.9 6.4 5.8
[3,] 4.7 6.9 7.1
[4,] 4.6 5.5 6.3
[5,] 5.0 6.5 6.5
, , Sepal W.
Setosa Versicolor Virginica
[1,] 3.5 3.2 3.3
[2,] 3.0 3.2 2.7
[3,] 3.2 3.1 3.0
[4,] 3.1 2.3 2.9
[5,] 3.6 2.8 3.0
, , Petal L.
Setosa Versicolor Virginica
[1,] 1.4 4.7 6.0
[2,] 1.4 4.5 5.1
[3,] 1.3 4.9 5.9
[4,] 1.5 4.0 5.6
[5,] 1.4 4.6 5.8
, , Petal W.
Setosa Versicolor Virginica
[1,] 0.2 1.4 2.5
[2,] 0.2 1.5 1.9
[3,] 0.2 1.5 2.1
[4,] 0.2 1.3 1.8
[5,] 0.2 1.5 2.2
# Exchange columns with rows
iris3_tr <- aperm(iris3, c(2, 1, 3))
iris3_tr[ , 1:5, ], , Setosa
[,1] [,2] [,3] [,4] [,5]
Sepal L. 5.1 4.9 4.7 4.6 5.0
Sepal W. 3.5 3.0 3.2 3.1 3.6
Petal L. 1.4 1.4 1.3 1.5 1.4
Petal W. 0.2 0.2 0.2 0.2 0.2
, , Versicolor
[,1] [,2] [,3] [,4] [,5]
Sepal L. 7.0 6.4 6.9 5.5 6.5
Sepal W. 3.2 3.2 3.1 2.3 2.8
Petal L. 4.7 4.5 4.9 4.0 4.6
Petal W. 1.4 1.5 1.5 1.3 1.5
, , Virginica
[,1] [,2] [,3] [,4] [,5]
Sepal L. 6.3 5.8 7.1 6.3 6.5
Sepal W. 3.3 2.7 3.0 2.9 3.0
Petal L. 6.0 5.1 5.9 5.6 5.8
Petal W. 2.5 1.9 2.1 1.8 2.2
# Exchange rows with slices
iris3_tr <- aperm(iris3, c(3, 2, 1))
iris3_tr[ , , 1:5], , 1
Sepal L. Sepal W. Petal L. Petal W.
Setosa 5.1 3.5 1.4 0.2
Versicolor 7.0 3.2 4.7 1.4
Virginica 6.3 3.3 6.0 2.5
, , 2
Sepal L. Sepal W. Petal L. Petal W.
Setosa 4.9 3.0 1.4 0.2
Versicolor 6.4 3.2 4.5 1.5
Virginica 5.8 2.7 5.1 1.9
, , 3
Sepal L. Sepal W. Petal L. Petal W.
Setosa 4.7 3.2 1.3 0.2
Versicolor 6.9 3.1 4.9 1.5
Virginica 7.1 3.0 5.9 2.1
, , 4
Sepal L. Sepal W. Petal L. Petal W.
Setosa 4.6 3.1 1.5 0.2
Versicolor 5.5 2.3 4.0 1.3
Virginica 6.3 2.9 5.6 1.8
, , 5
Sepal L. Sepal W. Petal L. Petal W.
Setosa 5.0 3.6 1.4 0.2
Versicolor 6.5 2.8 4.6 1.5
Virginica 6.5 3.0 5.8 2.2
Tibbles
Tibbles are an extended version of data frames and provide an even more informative summary. In addition, the tibble is the common structure in the context of Tidyverse.
library(tibble)
iris_tibble <- as_tibble(iris)
iris_tibble# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# i 140 more rows
Together with the dplyr package, you can add more features to working with tibbles, such as the use of the pipe operator (%>%), which passes arguments between functions.
library(dplyr)
iris_tibble %>% filter(Species == "virginica")# A tibble: 50 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 6.3 3.3 6 2.5 virginica
2 5.8 2.7 5.1 1.9 virginica
3 7.1 3 5.9 2.1 virginica
4 6.3 2.9 5.6 1.8 virginica
5 6.5 3 5.8 2.2 virginica
6 7.6 3 6.6 2.1 virginica
7 4.9 2.5 4.5 1.7 virginica
8 7.3 2.9 6.3 1.8 virginica
9 6.7 2.5 5.8 1.8 virginica
10 7.2 3.6 6.1 2.5 virginica
# i 40 more rows
This option is useful when running complex workflows at once.
sampled_data <- iris_tibble %>%
filter(Sepal.Length > 5) %>% # Select sepals longer than 5 cm
sample_n(size = 10) %>% # Random sample of 10 rows
select(Species, Sepal.Length, Sepal.Width) %>% # Select some columns
arrange(desc(Sepal.Width)) # Sepal widths in decreasing order
# See the result
sampled_data# A tibble: 10 x 3
Species Sepal.Length Sepal.Width
<fct> <dbl> <dbl>
1 virginica 6.7 3.3
2 virginica 6.9 3.2
3 versicolor 6.9 3.1
4 virginica 6.1 3
5 versicolor 5.6 3
6 versicolor 5.7 3
7 versicolor 6.1 2.9
8 versicolor 6.5 2.8
9 versicolor 5.8 2.7
10 versicolor 6.3 2.5
Finally, it is possible to nest tibbles using the nest() function from the tidyr package.
library(tidyr)
nested_data <- iris_tibble %>%
select(Species, Sepal.Length, Sepal.Width) %>% # Select two columns
nest(.by = Species) # Nest tibble by Species
# See the result
nested_data# A tibble: 3 x 2
Species data
<fct> <list>
1 setosa <tibble [50 x 2]>
2 versicolor <tibble [50 x 2]>
3 virginica <tibble [50 x 2]>
Reshaping data
Everything you need to know about reshaping data is in these cheat sheets:
Alternatively, you can use the package reshape2 as shown in this tutorial.