• Home
  • About
    • Colin TB photo

      Colin TB

      Microeconomics, Industrial Organization, Behavioural Economics, Data Science

    • Learn More
    • LinkedIn
    • Github
    • StackOverflow
  • Posts
    • All Posts
    • All Tags

Overlapping Groups with Dplyr

07 Aug 2017

Reading time ~2 minutes

Overlapping Groups with Dplyr

This notebook shows some simple code to create and work with overlapping groups in a tidyverse framework.

This is based on a questions I originally asked here.


1. Example data

Lets use the famous iris data set as an example. This is data on flowers and their measurements.

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats
glimpse(iris)
## Observations: 150
## Variables: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
## $ Species      <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...

2. Simple Summary

It is easy to summarize across Species. For example the average sepal and petal length by species.

iris %>% 
  group_by(Species) %>% 
  summarize_at(vars(Sepal.Length, Petal.Length), mean)
## # A tibble: 3 x 3
##      Species Sepal.Length Petal.Length
##       <fctr>        <dbl>        <dbl>
## 1     setosa        5.006        1.462
## 2 versicolor        5.936        4.260
## 3  virginica        6.588        5.552

3. Overlapping group summaries

Now lets suppose we want to summarize with overlapping groups. We now want summary figures across all groups, and just Setosa and Versicolor.

iris %>% 
  mutate(`All Species` = T,
         `Setosa & Versicolour` = Species %in% c("setosa", "versicolor")) %>% 
  gather(group, keep, `All Species`, `Setosa & Versicolour`) %>% 
  filter(keep) %>% 
  group_by(group) %>% 
  summarize_at(vars(Sepal.Length, Petal.Length), mean)
## # A tibble: 2 x 3
##                  group Sepal.Length Petal.Length
##                  <chr>        <dbl>        <dbl>
## 1          All Species     5.843333        3.758
## 2 Setosa & Versicolour     5.471000        2.861

And that’s it. Done!



dplyrtidyverseR Share Tweet +1