Shaping and Reshaping Data

How to exchange data across different data structures.

Requirements for this session

Run the following code in your local system to install the packages required for this session. It is also recommended to run update.packages(ask = FALSE) to update all your packages.

install.packages(c("tidyr", "dplyr", "reshape2"))

We will also revisit the data set fuel prices.

download.file(
    url = "https://kamapu.gitlab.io/coachingR/Topics/Data/fuel-prices.csv",
    destfile = "fuel-prices.csv", method = "curl")

Data structures

While we have previously treated the vectors as the basic data structure in R, the most common way to structure data sets in R is data frame (class data.frame). Therefore, this session is dedicated to data frames and similar structures.

List

A list in R is a list of objects. At the same time, lists are objects themselves (class list) and can contain lists within them, allowing hierarchical structures (e.g. trees). This type of object is versatile and is often used by functions to collect complex outputs (e.g. outputs from models such as linear models lm()).

# List with three
many_objects <- list(A = iris, B = co2, C = airquality)
summary(many_objects)

  Length Class      Mode   
A   5    data.frame list   
B 468    ts         numeric
C   6    data.frame list

Lists have a length attribute and can be indexed using square brackets ([]). Double square brackets ([[]]) can be used to extract individual items from a list.

# A subset of the previous list
two_objects <- many_objects[c("A", "C")]
summary(two_objects)

  Length Class      Mode
A 5      data.frame list
C 6      data.frame list

# Extract a single item
item_B <- many_objects[["B"]]
summary(item_B)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  313.2   323.5   335.2   337.1   350.3   366.8

Alternatively, individual items can be accessed using the dollar symbol ($).

summary(many_objects$A)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50

Data Frame

The data frame is probably the most common data structure used in R and represents a column-oriented table, aka. database list. Data frames are actually lists of vectors, all of the same length. Therefore, we can use the same ways to index columns in data frames as items in lists.

# Select some columns of a data frame
head(iris[c("Species", "Sepal.Length")])

  Species Sepal.Length
1  setosa          5.1
2  setosa          4.9
3  setosa          4.7
4  setosa          4.6
5  setosa          5.0
6  setosa          5.4

# Extract the content of one column
summary(iris[["Species"]])

    setosa versicolor  virginica 
        50         50         50

# Alternative with dollar symbol
summary(iris$Species)

    setosa versicolor  virginica 
        50         50         50

In data frames (as in matrices) it is possible to use two indexes in square brackets, separated by a comma. The first index refers to the rows, the second to the columns.

# Tow vectors as indices
iris[c(5, 75, 125), c("Species", "Petal.Width")]

       Species Petal.Width
5       setosa         0.2
75  versicolor         1.3
125  virginica         2.1

A data frame is not a matrix

Although a data frame in R resembles the structure of a matrix, they are completely different objects. A matrix (class matrix) is actually a vector with two dimensions as an attribute (number of rows and number of columns) and all other properties shared with vectors.

Thus, while a data frame can contain columns of different vector classes, the matrix can have only one class assigned to the whole object.

Curiously, while vectors are special cases of matrices in mathematics, the reverse is true in R.

Arrays

Arrays in R are 3D matrices. You can also think of them as several matrices of the same dimension stacked on top of each other.

Arrays are indexed by three positions between square brackets (rows, columns and slices or layers) and only one data type, the vector class, is allowed for the whole object. A classic example of an array is the data set iris3, which contains the same units (cm) in all numerical records.

# Select rows in all columns and slices
iris3[1:5, , ]

, , Setosa

     Sepal L. Sepal W. Petal L. Petal W.
[1,]      5.1      3.5      1.4      0.2
[2,]      4.9      3.0      1.4      0.2
[3,]      4.7      3.2      1.3      0.2
[4,]      4.6      3.1      1.5      0.2
[5,]      5.0      3.6      1.4      0.2

, , Versicolor

     Sepal L. Sepal W. Petal L. Petal W.
[1,]      7.0      3.2      4.7      1.4
[2,]      6.4      3.2      4.5      1.5
[3,]      6.9      3.1      4.9      1.5
[4,]      5.5      2.3      4.0      1.3
[5,]      6.5      2.8      4.6      1.5

, , Virginica

     Sepal L. Sepal W. Petal L. Petal W.
[1,]      6.3      3.3      6.0      2.5
[2,]      5.8      2.7      5.1      1.9
[3,]      7.1      3.0      5.9      2.1
[4,]      6.3      2.9      5.6      1.8
[5,]      6.5      3.0      5.8      2.2

# Select rows, columns and slices
iris3[1:5, c("Petal L.", "Sepal L."), "Virginica"]

     Petal L. Sepal L.
[1,]      6.0      6.3
[2,]      5.1      5.8
[3,]      5.9      7.1
[4,]      5.6      6.3
[5,]      5.8      6.5

From iris to iris3

Here is a code that converts the iris data frame into an array in case it is needed for similar data sets.

# Extract vector with species correspondence
species <- iris$Species

# Split the data frame excluding the column 'Species'
cm_dimensions <- split(iris[names(iris) != "Species"], species)

# Define empty array
iris_array <- array(dim = c(50, 4, 3))
dimnames(iris_array) <- list(
  NULL, # No row names
  names(iris[names(iris) != "Species"]), # Column names (without column 'Species')
  names(cm_dimensions) # Names for slices
)

# Insert the slices in a loop
for (i in names(cm_dimensions))
    iris_array[ , , i] <- as.matrix(cm_dimensions[[i]])

# See the result
iris_array[1:5, , ]

, , setosa

     Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]          5.1         3.5          1.4         0.2
[2,]          4.9         3.0          1.4         0.2
[3,]          4.7         3.2          1.3         0.2
[4,]          4.6         3.1          1.5         0.2
[5,]          5.0         3.6          1.4         0.2

, , versicolor

     Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]          7.0         3.2          4.7         1.4
[2,]          6.4         3.2          4.5         1.5
[3,]          6.9         3.1          4.9         1.5
[4,]          5.5         2.3          4.0         1.3
[5,]          6.5         2.8          4.6         1.5

, , virginica

     Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]          6.3         3.3          6.0         2.5
[2,]          5.8         2.7          5.1         1.9
[3,]          7.1         3.0          5.9         2.1
[4,]          6.3         2.9          5.6         1.8
[5,]          6.5         3.0          5.8         2.2

Transposing Arrays

Since arrays can be considered as 3d matrices, it is obvious that they can be transposed along their three axes. To do this we can use the function aperm().

# Exchange columns with slices
iris3_tr <- aperm(iris3, c(1, 3, 2))
iris3_tr[1:5, , ]

, , Sepal L.

     Setosa Versicolor Virginica
[1,]    5.1        7.0       6.3
[2,]    4.9        6.4       5.8
[3,]    4.7        6.9       7.1
[4,]    4.6        5.5       6.3
[5,]    5.0        6.5       6.5

, , Sepal W.

     Setosa Versicolor Virginica
[1,]    3.5        3.2       3.3
[2,]    3.0        3.2       2.7
[3,]    3.2        3.1       3.0
[4,]    3.1        2.3       2.9
[5,]    3.6        2.8       3.0

, , Petal L.

     Setosa Versicolor Virginica
[1,]    1.4        4.7       6.0
[2,]    1.4        4.5       5.1
[3,]    1.3        4.9       5.9
[4,]    1.5        4.0       5.6
[5,]    1.4        4.6       5.8

, , Petal W.

     Setosa Versicolor Virginica
[1,]    0.2        1.4       2.5
[2,]    0.2        1.5       1.9
[3,]    0.2        1.5       2.1
[4,]    0.2        1.3       1.8
[5,]    0.2        1.5       2.2

# Exchange columns with rows
iris3_tr <- aperm(iris3, c(2, 1, 3))
iris3_tr[ , 1:5, ]

, , Setosa

         [,1] [,2] [,3] [,4] [,5]
Sepal L.  5.1  4.9  4.7  4.6  5.0
Sepal W.  3.5  3.0  3.2  3.1  3.6
Petal L.  1.4  1.4  1.3  1.5  1.4
Petal W.  0.2  0.2  0.2  0.2  0.2

, , Versicolor

         [,1] [,2] [,3] [,4] [,5]
Sepal L.  7.0  6.4  6.9  5.5  6.5
Sepal W.  3.2  3.2  3.1  2.3  2.8
Petal L.  4.7  4.5  4.9  4.0  4.6
Petal W.  1.4  1.5  1.5  1.3  1.5

, , Virginica

         [,1] [,2] [,3] [,4] [,5]
Sepal L.  6.3  5.8  7.1  6.3  6.5
Sepal W.  3.3  2.7  3.0  2.9  3.0
Petal L.  6.0  5.1  5.9  5.6  5.8
Petal W.  2.5  1.9  2.1  1.8  2.2

# Exchange rows with slices
iris3_tr <- aperm(iris3, c(3, 2, 1))
iris3_tr[ ,  , 1:5]

, , 1

           Sepal L. Sepal W. Petal L. Petal W.
Setosa          5.1      3.5      1.4      0.2
Versicolor      7.0      3.2      4.7      1.4
Virginica       6.3      3.3      6.0      2.5

, , 2

           Sepal L. Sepal W. Petal L. Petal W.
Setosa          4.9      3.0      1.4      0.2
Versicolor      6.4      3.2      4.5      1.5
Virginica       5.8      2.7      5.1      1.9

, , 3

           Sepal L. Sepal W. Petal L. Petal W.
Setosa          4.7      3.2      1.3      0.2
Versicolor      6.9      3.1      4.9      1.5
Virginica       7.1      3.0      5.9      2.1

, , 4

           Sepal L. Sepal W. Petal L. Petal W.
Setosa          4.6      3.1      1.5      0.2
Versicolor      5.5      2.3      4.0      1.3
Virginica       6.3      2.9      5.6      1.8

, , 5

           Sepal L. Sepal W. Petal L. Petal W.
Setosa          5.0      3.6      1.4      0.2
Versicolor      6.5      2.8      4.6      1.5
Virginica       6.5      3.0      5.8      2.2

Tibbles

Tibbles are an extended version of data frames and provide an even more informative summary. In addition, the tibble is the common structure in the context of Tidyverse.

library(tibble)
iris_tibble <- as_tibble(iris)
iris_tibble

# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# i 140 more rows

Together with the dplyr package, you can add more features to working with tibbles, such as the use of the pipe operator (%>%), which passes arguments between functions.

library(dplyr)
iris_tibble %>% filter(Species == "virginica")

# A tibble: 50 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
          <dbl>       <dbl>        <dbl>       <dbl> <fct>    
 1          6.3         3.3          6           2.5 virginica
 2          5.8         2.7          5.1         1.9 virginica
 3          7.1         3            5.9         2.1 virginica
 4          6.3         2.9          5.6         1.8 virginica
 5          6.5         3            5.8         2.2 virginica
 6          7.6         3            6.6         2.1 virginica
 7          4.9         2.5          4.5         1.7 virginica
 8          7.3         2.9          6.3         1.8 virginica
 9          6.7         2.5          5.8         1.8 virginica
10          7.2         3.6          6.1         2.5 virginica
# i 40 more rows

This option is useful when running complex workflows at once.

sampled_data <- iris_tibble %>%
  filter(Sepal.Length > 5) %>% # Select sepals longer than 5 cm
  sample_n(size = 10) %>% # Random sample of 10 rows
  select(Species, Sepal.Length, Sepal.Width) %>% # Select some columns
  arrange(desc(Sepal.Width)) # Sepal widths in decreasing order

# See the result
sampled_data

# A tibble: 10 x 3
   Species    Sepal.Length Sepal.Width
   <fct>             <dbl>       <dbl>
 1 setosa              5.1         3.5
 2 virginica           6.7         3.3
 3 versicolor          6.4         3.2
 4 virginica           7.2         3.2
 5 virginica           6.5         3  
 6 virginica           6.1         3  
 7 versicolor          6.1         3  
 8 virginica           5.8         2.8
 9 versicolor          5.5         2.6
10 versicolor          6           2.2

Finally, it is possible to nest tibbles using the nest() function from the tidyr package.

library(tidyr)
nested_data <- iris_tibble %>%
  select(Species, Sepal.Length, Sepal.Width) %>% # Select two columns
  nest(.by = Species) # Nest tibble by Species

# See the result
nested_data

# A tibble: 3 x 2
  Species    data             
  <fct>      <list>           
1 setosa     <tibble [50 x 2]>
2 versicolor <tibble [50 x 2]>
3 virginica  <tibble [50 x 2]>

Reshaping data

Everything you need to know about reshaping data is in these cheat sheets:

Alternatively, you can use the package reshape2 as shown in this tutorial.