Introduction
As per tidyr definition unnesting (or rectangling):
... is the art and craft of taking a deeply nested list (often sourced
from wild caught JSON or XML) and taming it into a tidy data set of rows and
columns. There are three functions from tidyr that are particularly useful for
rectangling:
tidyr
core functions for unnesting are unnest_longer()
,
unnest_wider()
, hoist()
. This guide follows
the steps from tidyr
vignette and translates them into unnest
’s language.
With tidyr you
have to unnest lists in several steps by using one of the three core
functions. With unnest
you do all at once in one step.
unnest
doesn’t produce intermediate list columns.
We’ll use the repurrrsive
package as the source of our
nested lists:
GitHub repos
With tidyr you start by putting a list into a data.frame column. With unnest this is not necessary.
gh_repos
is a nested list with maximal depth of 4
“user”>“repo”>“owner”>“[xyz]”.
str(gh_repos[[1]][[1]][["owner"]])
Let’s say that we want a data.frame
with 3 columns,
“name”, “homepage” and “watchers_count”, from level 3 of repo
characteristics and one,“login”, from level 4 of owner characteristics.
This is how it’s done with tidyr:
repos <- tibble(repo = gh_repos)
repos <- unnest_longer(repos, repo)
hoist(repos, repo,
login = c("owner", "login"),
name = "name",
homepage = "homepage",
watchers = "watchers_count") %>%
select(-repo)
#> # A tibble: 176 × 4
#> login name homepage watchers
#> <chr> <chr> <chr> <int>
#> 1 gaborcsardi after NA 5
#> 2 gaborcsardi argufy NA 19
#> 3 gaborcsardi ask NA 5
#> 4 gaborcsardi baseimports NA 0
#> 5 gaborcsardi citest NA 0
#> 6 gaborcsardi clisymbols "" 18
#> 7 gaborcsardi cmaker NA 0
#> 8 gaborcsardi cmark NA 0
#> 9 gaborcsardi conditions NA 0
#> 10 gaborcsardi crayon NA 52
#> # ℹ 166 more rows
With unnest:
spec <- s(stack = TRUE,
s(stack = TRUE,
s("name"),
s("homepage"),
s("watchers_count", as = "watchers"),
s("owner",
s("login"))))
unnest(gh_repos, spec)
#> # A tibble: 176 × 4
#> homepage name owner.login watchers
#> <chr> <chr> <chr> <int>
#> 1 NA after gaborcsardi 5
#> 2 NA argufy gaborcsardi 19
#> 3 NA ask gaborcsardi 5
#> 4 NA baseimports gaborcsardi 0
#> 5 NA citest gaborcsardi 0
#> 6 "" clisymbols gaborcsardi 18
#> 7 NA cmaker gaborcsardi 0
#> 8 NA cmark gaborcsardi 0
#> 9 NA conditions gaborcsardi 0
#> 10 NA crayon gaborcsardi 52
#> # ℹ 166 more rows
unnest selectors
(s()
) apply to corresponding levels of the hierarchy and
describe which elements should be selected and how. The
stack = TRUE
says that the result of the extraction should
be stacked row-wise (aka rbind
ed).
stack = FALSE
, means spread it across multiple columns (aka
cbind
ed). The as
argument provides the name of
the output. By default it’s the entire path name to the selected
leaf.
Now assume that you want the 3 components of “repos” and all components of the owner at once:
tibble(repo = gh_repos) %>%
unnest_longer(repo) %>%
hoist(repo,
name = "name",
homepage = "homepage",
watchers = "watchers_count") %>%
hoist(repo, owner = "owner") %>%
unnest_wider(owner)
With unnest
spec <- s(stack = TRUE,
s(stack = TRUE,
s("name"),
s("homepage"),
s("watchers_count", as = "watchers"),
s("owner")))
unnest(gh_repos, spec) %>% tibble()
#> # A tibble: 176 × 20
#> homepage name owner.avatar_url owner.events_url owner.followers_url
#> <chr> <chr> <chr> <chr> <chr>
#> 1 NA after https://avatars.gi… https://api.git… https://api.github…
#> 2 NA argufy https://avatars.gi… https://api.git… https://api.github…
#> 3 NA ask https://avatars.gi… https://api.git… https://api.github…
#> 4 NA baseimports https://avatars.gi… https://api.git… https://api.github…
#> 5 NA citest https://avatars.gi… https://api.git… https://api.github…
#> 6 "" clisymbols https://avatars.gi… https://api.git… https://api.github…
#> 7 NA cmaker https://avatars.gi… https://api.git… https://api.github…
#> 8 NA cmark https://avatars.gi… https://api.git… https://api.github…
#> 9 NA conditions https://avatars.gi… https://api.git… https://api.github…
#> 10 NA crayon https://avatars.gi… https://api.git… https://api.github…
#> # ℹ 166 more rows
#> # ℹ 15 more variables: owner.following_url <chr>, owner.gists_url <chr>,
#> # owner.gravatar_id <chr>, owner.html_url <chr>, owner.id <int>,
#> # owner.login <chr>, owner.organizations_url <chr>,
#> # owner.received_events_url <chr>, owner.repos_url <chr>,
#> # owner.site_admin <lgl>, owner.starred_url <chr>,
#> # owner.subscriptions_url <chr>, owner.type <chr>, owner.url <chr>, …
Note that unnest
produces namespaced column names, while [tidyr’[s is not. This is a good
thing as you don’t have to worry about conflicting names. tidyr
provides a “fix” for duplicated names in the form of
names_repair
argument to its functions.
Game of Thrones characters
What do you do with non-singleton leafs? Those are normally stacked, spread or melted depending on the analysis. For example the Game of Thrones dataset contains non-singleton leafs “titles”, “aliases”, “books” etc.
str(got_chars[[1]])
Let’s have a look at some common scenarios.
Stacking
Assume that we want a row for every book and TV series that the character appears in. That is, we want a long table with all combinations (aka cross product) of books and TV series.
tibble(char = got_chars) %>%
unnest_wider(char) %>%
select(name, books, tvSeries) %>%
unnest_longer(books) %>%
unnest_longer(tvSeries)
#> # A tibble: 230 × 3
#> name books tvSeries
#> <chr> <chr> <chr>
#> 1 Theon Greyjoy A Game of Thrones Season 1
#> 2 Theon Greyjoy A Game of Thrones Season 2
#> 3 Theon Greyjoy A Game of Thrones Season 3
#> 4 Theon Greyjoy A Game of Thrones Season 4
#> 5 Theon Greyjoy A Game of Thrones Season 5
#> 6 Theon Greyjoy A Game of Thrones Season 6
#> 7 Theon Greyjoy A Storm of Swords Season 1
#> 8 Theon Greyjoy A Storm of Swords Season 2
#> 9 Theon Greyjoy A Storm of Swords Season 3
#> 10 Theon Greyjoy A Storm of Swords Season 4
#> # ℹ 220 more rows
unnest(got_chars,
s(stack = T,
s("name"),
s("books,tvSeries/", stack = T)))
#> # A tibble: 236 × 3
#> books name tvSeries
#> <chr> <chr> <chr>
#> 1 A Game of Thrones Theon Greyjoy Season 1
#> 2 A Storm of Swords Theon Greyjoy Season 2
#> 3 A Feast for Crows Theon Greyjoy Season 3
#> 4 A Game of Thrones Theon Greyjoy Season 4
#> 5 A Storm of Swords Theon Greyjoy Season 5
#> 6 A Feast for Crows Theon Greyjoy Season 6
#> 7 A Game of Thrones Theon Greyjoy Season 1
#> 8 A Storm of Swords Theon Greyjoy Season 2
#> 9 A Feast for Crows Theon Greyjoy Season 3
#> 10 A Game of Thrones Theon Greyjoy Season 4
#> # ℹ 226 more rows
Implementation aside, [tidyr’[s intermediary steps are generally
costly for two reasons. First, because intermediary data.frames are
created during the processing. Second, because intermediary objects
might contain columns that are not needed in the subsequent processing.
In the above examples unnest_wider()
produced man more
columns than we need. A better approach would be to replace it with a
bit more verbose hoist
call.
In contrast unnest doesn’t produce intermediary data structures. In fact, unnest follows a 0-intermediary-copy semantics. The input vectors are directly copied into the output, no matter how complex the nesting is.
Cross-product is commonly useful when only one non-singleton variable is extracted. For example, let’s match title to name:
tibble(char = got_chars) %>%
hoist(char, name = "name", title = "titles") %>%
select(-char) %>%
unnest_longer(title)
#> # A tibble: 59 × 2
#> name title
#> <chr> <chr>
#> 1 Theon Greyjoy "Prince of Winterfell"
#> 2 Theon Greyjoy "Lord of the Iron Islands (by law of the green lands)"
#> 3 Tyrion Lannister "Acting Hand of the King (former)"
#> 4 Tyrion Lannister "Master of Coin (former)"
#> 5 Victarion Greyjoy "Lord Captain of the Iron Fleet"
#> 6 Victarion Greyjoy "Master of the Iron Victory"
#> 7 Will ""
#> 8 Areo Hotah "Captain of the Guard at Sunspear"
#> 9 Chett ""
#> 10 Cressen "Maester"
#> # ℹ 49 more rows
unnest(got_chars,
s(stack = T,
s("name"),
s("titles/", stack = T)))
#> # A tibble: 59 × 2
#> name titles
#> <chr> <chr>
#> 1 Theon Greyjoy "Prince of Winterfell"
#> 2 Theon Greyjoy "Lord of the Iron Islands (by law of the green lands)"
#> 3 Tyrion Lannister "Acting Hand of the King (former)"
#> 4 Tyrion Lannister "Master of Coin (former)"
#> 5 Victarion Greyjoy "Lord Captain of the Iron Fleet"
#> 6 Victarion Greyjoy "Master of the Iron Victory"
#> 7 Will ""
#> 8 Areo Hotah "Captain of the Guard at Sunspear"
#> 9 Chett ""
#> 10 Cressen "Maester"
#> # ℹ 49 more rows
Id-value long tables (aka long pivoting, or melting)
A common scenario is to stack the non-scalar leafs and replicate id
labels in a separate “key” column. This is called “melting”
(reshape2
) or “long pivoting” (tidyr
).
tibble(char = got_chars) %>%
unnest_wider(char) %>%
select(name, books, tvSeries) %>%
pivot_longer(c(books, tvSeries), names_to = "media", values_to = "value") %>%
unnest_longer(value)
unnest(got_chars,
s(stack = T,
s("name"),
s("books,tvSeries", stack = "media", as = "value",
s(stack = T))))
Id-value wide tables
One might want to stack id vars (media) but spread the measures (books, tvSeries) horizontally such that each row would contain all measurement for each media.
Wide Tables (aka spreading)
This strategy is commonly used in machine learning scenarios when large sparse tables are plugged into black-box ML algorithms. This is the default behavior in unnest.
Sharla Gelfand’s discography
Finally, the most complex transformation from [tidyr’[s vignette can be achieved with unnest in a single step.
Typical entry of disog
collection looks like this
str(discog[[3]])
#> List of 5
#> $ instance_id : int 354091476
#> $ date_added : chr "2019-02-13T14:07:23-08:00"
#> $ basic_information:List of 11
#> ..$ labels :List of 1
#> .. ..$ :List of 6
#> .. .. ..$ name : chr "La Vida Es Un Mus"
#> .. .. ..$ entity_type : chr "1"
#> .. .. ..$ catno : chr "MUS118"
#> .. .. ..$ resource_url : chr "https://api.discogs.com/labels/38322"
#> .. .. ..$ id : int 38322
#> .. .. ..$ entity_type_name: chr "Label"
#> ..$ year : int 2017
#> ..$ master_url : chr "https://api.discogs.com/masters/1109943"
#> ..$ artists :List of 1
#> .. ..$ :List of 7
#> .. .. ..$ join : chr ""
#> .. .. ..$ name : chr "S.H.I.T. (3)"
#> .. .. ..$ anv : chr ""
#> .. .. ..$ tracks : chr ""
#> .. .. ..$ role : chr ""
#> .. .. ..$ resource_url: chr "https://api.discogs.com/artists/2769828"
#> .. .. ..$ id : int 2769828
#> ..$ id : int 9827276
#> ..$ thumb : chr "https://img.discogs.com/x6GUri3hXAcfzF2wz5jQloomOoY=/fit-in/150x150/filters:strip_icc():format(jpeg):mode_rgb()"| __truncated__
#> ..$ title : chr "I"
#> ..$ formats :List of 1
#> .. ..$ :List of 3
#> .. .. ..$ descriptions:List of 3
#> .. .. .. ..$ : chr "7\""
#> .. .. .. ..$ : chr "45 RPM"
#> .. .. .. ..$ : chr "EP"
#> .. .. ..$ name : chr "Vinyl"
#> .. .. ..$ qty : chr "1"
#> ..$ cover_image : chr "https://img.discogs.com/7aJPlo2phtFL-T2Kt6MTBc0uftY=/fit-in/600x600/filters:strip_icc():format(jpeg):mode_rgb()"| __truncated__
#> ..$ resource_url: chr "https://api.discogs.com/releases/9827276"
#> ..$ master_id : int 1109943
#> $ id : int 9827276
#> $ rating : int 0
We want to extract artists
metadata and
formats
into separate tables.
tibble(disc = discog) %>%
unnest_wider(disc) %>%
hoist(basic_information, artist = "artists") %>%
select(disc_id = id, artist) %>%
unnest_longer(artist) %>%
unnest_wider(artist)
#> # A tibble: 167 × 8
#> disc_id join name anv tracks role resource_url id
#> <int> <chr> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 7496378 "" Mollot "" "" "" https://api… 4.62e6
#> 2 4490852 "" Una Bèstia Incontrolab… "" "" "" https://api… 3.19e6
#> 3 9827276 "" S.H.I.T. (3) "" "" "" https://api… 2.77e6
#> 4 9769203 "" Rata Negra "" "" "" https://api… 4.28e6
#> 5 7237138 "" Ivy (18) "" "" "" https://api… 3.60e6
#> 6 13117042 "" Tashme "" "" "" https://api… 5.21e6
#> 7 7113575 "" Desgraciados "" "" "" https://api… 4.45e6
#> 8 10540713 "" Phantom Head "" "" "" https://api… 4.27e6
#> 9 11260950 "" Sub Space (2) "" "" "" https://api… 5.69e6
#> 10 11726853 "" Small Man (2) "" "" "" https://api… 6.37e6
#> # ℹ 157 more rows
tibble(disc = discog) %>%
unnest_wider(disc) %>%
hoist(basic_information, format = "formats") %>%
select(disc_id = id, format) %>%
unnest_longer(format) %>%
unnest_wider(format) %>%
unnest_longer(descriptions)
#> # A tibble: 258 × 5
#> disc_id descriptions text name qty
#> <int> <chr> <chr> <chr> <chr>
#> 1 7496378 "Numbered" Black Cassette 1
#> 2 4490852 "LP" NA Vinyl 1
#> 3 9827276 "7\"" NA Vinyl 1
#> 4 9827276 "45 RPM" NA Vinyl 1
#> 5 9827276 "EP" NA Vinyl 1
#> 6 9769203 "LP" NA Vinyl 1
#> 7 9769203 "Album" NA Vinyl 1
#> 8 7237138 "7\"" NA Vinyl 1
#> 9 7237138 "45 RPM" NA Vinyl 1
#> 10 13117042 "7\"" NA Vinyl 1
#> # ℹ 248 more rows
With unnest you can achieve this in two separate passes through the list, or in a single pass with a grouped children specification. The single pass extraction returns a list of data.frames, but scans the data only once.
Separate unnest calls:
unnest(discog,
s(stack = T,
s("id", as = "disc_id"),
s("basic_information/artists", as = "artist",
s(stack = T))))
unnest(discog,
s(stack = T,
s("id", as = "disc_id"),
s("basic_information/formats", as = "format",
s(stack = T,
s(exclude = "descriptions"),
s("descriptions/", stack = T)))))
Single unnest pass:
unnest(discog,
s(stack = T,
groups =
list(artists =
list(s("id", as = "disc_id"),
s("basic_information/artists", as = "artist",
s(stack = T))),
formats =
list(s("id", as = "disc_id"),
s("basic_information/formats", as = "format",
s(stack = T,
s(exclude = "descriptions"),
s("descriptions/", stack = T)))))))
The unnest specs inside groups
is the same as in the
separate-calls case. The groups
argument is just like
children
argument with the difference that the output of
the extraction is not cross-joined, but simply returned as list.1
The benefit is grouped extraction is twofold. First, it’s faster
because the list is traversed only once. Second, the de-duplication
works across groups. That is, when dedupe = TRUE
(not shown
in the above examples), the fields extracted by the preceding specs are
not extracted by the specs that follow.