16 dplyr

Link: https://dplyr.tidyverse.org/
Index of Functions: https://dplyr.tidyverse.org/reference/index.html
Cheat Sheet: https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf
Chapter of R for Data Science: https://r4ds.had.co.nz/transform.html
Unofficial Example: https://rpubs.com/williamsurles/293454

suppressPackageStartupMessages(
  library(dplyr)
)

16.0.1 Example Data

Import the example data. This data represents benthic macroinvertebrate data collected in the littoral zone of Onondaga, Otisco, and Cazenovia lakes

taxa.df <- file.path("data",
                     "zms_thesis-macro_2017-06-18.csv") %>% 
  read.csv(stringsAsFactors = FALSE)

DT::datatable(taxa.df, options = list(scrollX = TRUE))

env.df <- file.path("data",
                     "zms_thesis-env_2017-06-18.csv") %>% 
  read.csv(stringsAsFactors = FALSE)

DT::datatable(env.df, options = list(scrollX = TRUE))

16.0.2 Rename

Definition: give a new name to a specified column(s).
Link: https://dplyr.tidyverse.org/reference/select.html

In the example below, the columns lat and long are renamed to latitude and longitude, respectively. If names(taxa.df) is called, elements 5 and 6 are now represented by “latitude” and “longitude”, respectively.

taxa.df <- taxa.df %>% 
  rename(latitude = lat,
         longitude = long)

names(taxa.df)

##  [1] "unique_id"     "lake"          "station_id"    "sample_number"
##  [5] "latitude"      "longitude"     "agency_code"   "method"       
##  [9] "date"          "count"         "life_stage"    "final_id"     
## [13] "taxon_level"   "phylum"        "subphylum"     "class"        
## [17] "subclass"      "order"         "suborder"      "family"       
## [21] "subfamily"     "tribe"         "genus"         "species"      
## [25] "picked"        "squares"

16.0.3 Filter

Definition: subset rows using logical statements.
Link: https://dplyr.tidyverse.org/reference/filter.html

In the example below, taxa.df is subset to only include rows where the unique_id column matches the string “caz_1_1”.

filter.df <- taxa.df %>% 
  filter(unique_id == "caz_1_1")

DT::datatable(filter.df, options = list(scrollX = TRUE))

You can apply multiple filters separate by commas. The filters are applied from the top down. In the example below, two filters are applied within the filter() call:

unique_id == "caz_1_1"
order %in% c("ephemeroptera", "trichoptera")

The first logical statement is the same as the filter from the code chunk above, which keeps only rows associated with the sample “caz_1_1”. Then, the order column is subset the data to only include rows represented by ephemeroptera (mayfly) or trichoptera (caddisfly) taxa.

filter.df <- taxa.df %>% 
  filter(unique_id == "caz_1_1",
         order %in% c("ephemeroptera", "trichoptera"))

DT::datatable(filter.df, options = list(scrollX = TRUE))

16.0.4 Select

Definition: subset of columns.
Link: https://dplyr.tidyverse.org/reference/select.html

Often we are not interested in all columns in a data frame. select() can be used to subset the columns to only include columns of interest. In the example below, the filter.df data frame, created in the Filter section, is subset to only include three columns: 1) final_id, count, and unique_id.

select.df <- filter.df %>% 
  select(unique_id, final_id, count)

knitr::kable(select.df)

unique_id	final_id	count
caz_1_1	baetidae	1
caz_1_1	caenis	14
caz_1_1	agraylea	1
caz_1_1	oxyethira	2
caz_1_1	hydroptilidae	1

The same operation can be performed by chaining the functions together with the pipe operator. taxa.df is first filtered to only include rows that meet our specified logical statements and then only the columns unique_id, final_id, and count are retained.

select.df <- taxa.df %>%
  filter(unique_id == "caz_1_1",
         order %in% c("ephemeroptera", "trichoptera")) %>% 
  select(unique_id, final_id, count)

knitr::kable(select.df)

unique_id	final_id	count
caz_1_1	baetidae	1
caz_1_1	caenis	14
caz_1_1	agraylea	1
caz_1_1	oxyethira	2
caz_1_1	hydroptilidae	1

Reorder columns. Maybe you would prefer to see the columns in the following order:

final_id
unique_id
count

reorder.df <- taxa.df %>%
  filter(unique_id == "caz_1_1",
         order %in% c("ephemeroptera", "trichoptera")) %>% 
  select(final_id, unique_id, count)

knitr::kable(reorder.df)

final_id	unique_id	count
baetidae	caz_1_1	1
caenis	caz_1_1	14
agraylea	caz_1_1	1
oxyethira	caz_1_1	2
hydroptilidae	caz_1_1	1

16.0.4.1 everything

Definition: helper function for select. Specifies all columns not already specified in the select() call.
Link:
- https://dplyr.tidyverse.org/reference/select.html
- https://www.rdocumentation.org/packages/tidyselect/versions/0.2.5/topics/select_helpers

For example, maybe we want to see the first three columns reordered but the remaining columns to remain in the same order. Without everything(), we would need to specify each column in the select() call. With everything(), we can specify the first three columns to be lake, station_id, and sample_number followed by everything(). The columns will then be reordered accordingly (lake, station_id, sample_number, unique_id, etc.).

reorder.df <- taxa.df %>%
  filter(unique_id == "caz_1_1",
         order %in% c("ephemeroptera", "trichoptera")) %>% 
  select(lake, station_id, sample_number, everything())

DT::datatable(reorder.df, options = list(scrollX = TRUE))

16.0.5 distinct

Definition: remove duplicate rows.
Link: https://dplyr.tidyverse.org/reference/distinct.html

After performing a subsetting columns with select(), it may be beneficial to subsequently run distinct() to remove any duplicate rows.

Maybe we are just interested in viewing basic sample information such as lake, station_id, and sample_number. If select() is performed to subset the data frame to only represent these columns, then there will be many duplicate rows. The reason for this is in the original taxa.df data frame there are multiple taxa observed per sample. lake, station_id, and sample_number are associated accordingly with the taxa, and therefore lake, station_id, and sample_number are represented many times.

nondistinct.df <- taxa.df %>% 
  select(lake, station_id, sample_number)

DT::datatable(nondistinct.df, 
              options = list(columnDefs = list(list(className = 'dt-center', targets = 0:3))))

distinct() will remove the duplicate rows present in the output above. This is a very simple call, which requires no input.

distinct.df <- taxa.df %>% 
  select(lake, station_id, sample_number) %>% 
  distinct()

DT::datatable(distinct.df, 
              options = list(columnDefs = list(list(className = 'dt-center', targets = 0:3))))

16.0.6 mutate

Definition: create or overwrite columns.
Link: https://dplyr.tidyverse.org/reference/mutate.html

In the example below, I use the select() and distinct() combination from the distinct section and then filter the data frame to only include rows where lake equals “caz”. The distinct() call has insured that each row is unique but there is not a single column that could be used to uniquely identify a given sample. Using mutate() we can create a new column, called new_id, that concatenates the station_id and sample_number columns into a single value separated by an underscore (paste(station_id, sample_number, sep = "_")).

mutate.df <- taxa.df %>% 
  select(lake, station_id, sample_number) %>% 
  distinct() %>%
  filter(lake == "caz") %>% 
  mutate(new_id = paste(station_id, sample_number,
                        sep = "_"))

DT::datatable(mutate.df,
              options = list(columnDefs = list(list(className = 'dt-center', targets = 0:4))))

Maybe a second type of unique identifier is wanted for certain subsequent processes. mutate() allows for the creation of one or more columns within a single call. In the example below, new_id is created as it was in the example above but is followed by the creation of date_id. date_id is created by concatenating new_id and date columns. Therefore, a single mutate() call has created two new columns. Additionally, we can see that columns created downstream (date_id) can reference columns created upstream (new_id) within the mutate() call.

mutate.df <- taxa.df %>% 
  select(lake, station_id, sample_number, date) %>% 
  distinct() %>%
  filter(lake == "caz") %>% 
  mutate(new_id = paste(station_id, sample_number,
                        sep = "_"),
         date_id = paste(new_id, date,
                         sep = "_"))

DT::datatable(mutate.df,
              options = list(columnDefs = list(list(className = 'dt-center', targets = 0:6))))

16.0.7 group_by

Definition: aggregate by specified columns.
Link: https://dplyr.tidyverse.org/reference/group_by.html

Calling group_by() will aggregate the data by the column(s) specified but alone will not alter the content of the data frame.

group.df <- taxa.df %>%
  filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>% 
  select(unique_id, final_id, count) %>% 
  group_by(unique_id)

knitr::kable(group.df)

unique_id	final_id	count
caz_1_1	physa	1
caz_1_1	gyraulus	1
caz_1_1	bezzia	5
caz_1_1	mallochohelea	7
caz_1_1	chironomus	1
caz_1_1	dicrotendipes	3
caz_1_1	paratanytarsus	34
caz_1_1	cricotopus	1
caz_1_1	psectrocladius	5
caz_1_1	thienemanniella	4
caz_1_1	krenopelopia	9
caz_1_1	procladius	1
caz_1_1	hemerodromia	1
caz_1_1	baetidae	1
caz_1_1	caenis	14
caz_1_1	acentria	1
caz_1_1	enallagma	1
caz_1_1	agraylea	1
caz_1_1	oxyethira	2
caz_1_1	hydroptilidae	1
caz_1_1	gammarus	2
caz_1_1	hyalella	14
caz_1_1	caecidotea	2
caz_1_2	amnicola_grana	2
caz_1_2	physa	27
caz_1_2	gyraulus_parvus	1
caz_1_2	helisoma_anceps	2
caz_1_2	promenetus_exacuous	5
caz_1_2	valvata	1
caz_1_2	bezzia	9
caz_1_2	mallochohelea	2
caz_1_2	chironomus	5
caz_1_2	dicrotendipes	4
caz_1_2	microtendipes	1
caz_1_2	polypedilum	2
caz_1_2	pseudochironomus	1
caz_1_2	rheotanytarsus	2
caz_1_2	thienemanniella	1
caz_1_2	ablabesmyia	5
caz_1_2	callibaetis	1
caz_1_2	caenis	1
caz_1_2	caenidae	2
caz_1_2	acentria	1
caz_1_2	mystacides	2
caz_1_2	gammarus	1
caz_1_2	hyalella	13
caz_1_2	caecidotea	9

The image below shows “under the hood”" information about group.df from the Environment Tab before and after the group_by() call is made. We do not need to focus on the details associated with this “under the hood” information but this way we can view the change made by the group_by() call; unlike the table shown above, which does not provided any indication that the data has been aggregated.

We can follow group_by() by a mutate() call to calculate a single value per group. Each group variable is replicated for each row within a group. In the example below, we are aggregating by two unique_ids (group_by(unique_id); “caz_1_1” and “caz_1_2”) and finding the total number of organisms identified within each group (mutate(total = sum(count))). The sum() function returns one number per group. mutate() then replicates this one number for all rows within a group.

group.df <- taxa.df %>%
  filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>% 
  select(unique_id, final_id, count) %>% 
  group_by(unique_id) %>% 
  mutate(total = sum(count))

knitr::kable(group.df)

unique_id	final_id	count	total
caz_1_1	physa	1	112
caz_1_1	gyraulus	1	112
caz_1_1	bezzia	5	112
caz_1_1	mallochohelea	7	112
caz_1_1	chironomus	1	112
caz_1_1	dicrotendipes	3	112
caz_1_1	paratanytarsus	34	112
caz_1_1	cricotopus	1	112
caz_1_1	psectrocladius	5	112
caz_1_1	thienemanniella	4	112
caz_1_1	krenopelopia	9	112
caz_1_1	procladius	1	112
caz_1_1	hemerodromia	1	112
caz_1_1	baetidae	1	112
caz_1_1	caenis	14	112
caz_1_1	acentria	1	112
caz_1_1	enallagma	1	112
caz_1_1	agraylea	1	112
caz_1_1	oxyethira	2	112
caz_1_1	hydroptilidae	1	112
caz_1_1	gammarus	2	112
caz_1_1	hyalella	14	112
caz_1_1	caecidotea	2	112
caz_1_2	amnicola_grana	2	100
caz_1_2	physa	27	100
caz_1_2	gyraulus_parvus	1	100
caz_1_2	helisoma_anceps	2	100
caz_1_2	promenetus_exacuous	5	100
caz_1_2	valvata	1	100
caz_1_2	bezzia	9	100
caz_1_2	mallochohelea	2	100
caz_1_2	chironomus	5	100
caz_1_2	dicrotendipes	4	100
caz_1_2	microtendipes	1	100
caz_1_2	polypedilum	2	100
caz_1_2	pseudochironomus	1	100
caz_1_2	rheotanytarsus	2	100
caz_1_2	thienemanniella	1	100
caz_1_2	ablabesmyia	5	100
caz_1_2	callibaetis	1	100
caz_1_2	caenis	1	100
caz_1_2	caenidae	2	100
caz_1_2	acentria	1	100
caz_1_2	mystacides	2	100
caz_1_2	gammarus	1	100
caz_1_2	hyalella	13	100
caz_1_2	caecidotea	9	100

This example can be taken one step further to calculate relative abundance (i.e., the percentage of the sample represented by each taxon) by adding percent = count / total * 100 to the mutate() call.

group.df <- taxa.df %>%
  filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>% 
  select(unique_id, final_id, count) %>% 
  group_by(unique_id) %>% 
  mutate(total = sum(count),
         percent = count / total * 100)

knitr::kable(group.df)

unique_id	final_id	count	total	percent
caz_1_1	physa	1	112	0.8928571
caz_1_1	gyraulus	1	112	0.8928571
caz_1_1	bezzia	5	112	4.4642857
caz_1_1	mallochohelea	7	112	6.2500000
caz_1_1	chironomus	1	112	0.8928571
caz_1_1	dicrotendipes	3	112	2.6785714
caz_1_1	paratanytarsus	34	112	30.3571429
caz_1_1	cricotopus	1	112	0.8928571
caz_1_1	psectrocladius	5	112	4.4642857
caz_1_1	thienemanniella	4	112	3.5714286
caz_1_1	krenopelopia	9	112	8.0357143
caz_1_1	procladius	1	112	0.8928571
caz_1_1	hemerodromia	1	112	0.8928571
caz_1_1	baetidae	1	112	0.8928571
caz_1_1	caenis	14	112	12.5000000
caz_1_1	acentria	1	112	0.8928571
caz_1_1	enallagma	1	112	0.8928571
caz_1_1	agraylea	1	112	0.8928571
caz_1_1	oxyethira	2	112	1.7857143
caz_1_1	hydroptilidae	1	112	0.8928571
caz_1_1	gammarus	2	112	1.7857143
caz_1_1	hyalella	14	112	12.5000000
caz_1_1	caecidotea	2	112	1.7857143
caz_1_2	amnicola_grana	2	100	2.0000000
caz_1_2	physa	27	100	27.0000000
caz_1_2	gyraulus_parvus	1	100	1.0000000
caz_1_2	helisoma_anceps	2	100	2.0000000
caz_1_2	promenetus_exacuous	5	100	5.0000000
caz_1_2	valvata	1	100	1.0000000
caz_1_2	bezzia	9	100	9.0000000
caz_1_2	mallochohelea	2	100	2.0000000
caz_1_2	chironomus	5	100	5.0000000
caz_1_2	dicrotendipes	4	100	4.0000000
caz_1_2	microtendipes	1	100	1.0000000
caz_1_2	polypedilum	2	100	2.0000000
caz_1_2	pseudochironomus	1	100	1.0000000
caz_1_2	rheotanytarsus	2	100	2.0000000
caz_1_2	thienemanniella	1	100	1.0000000
caz_1_2	ablabesmyia	5	100	5.0000000
caz_1_2	callibaetis	1	100	1.0000000
caz_1_2	caenis	1	100	1.0000000
caz_1_2	caenidae	2	100	2.0000000
caz_1_2	acentria	1	100	1.0000000
caz_1_2	mystacides	2	100	2.0000000
caz_1_2	gammarus	1	100	1.0000000
caz_1_2	hyalella	13	100	13.0000000
caz_1_2	caecidotea	9	100	9.0000000

16.0.7.1 ungroup

Definition: remove the aggregation applied by group_by().
Link: https://dplyr.tidyverse.org/reference/group_by.html

Once the calculations that required aggregation (i.e., group_by()) are complete, make sure to remove the aggregation from the data frame using ungroup(). If you forget to do this, R will continue to try to apply the aggregation to future calculations, which will most likely be inappropriate. Generally, you will eventually get an error message if you forget to use ungroup().

ungroup.df <- taxa.df %>%
  filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>% 
  select(unique_id, final_id, count) %>% 
  group_by(unique_id) %>% 
  mutate(total = sum(count),
         percent = count / total * 100) %>% 
  ungroup()

knitr::kable(ungroup.df)

unique_id	final_id	count	total	percent
caz_1_1	physa	1	112	0.8928571
caz_1_1	gyraulus	1	112	0.8928571
caz_1_1	bezzia	5	112	4.4642857
caz_1_1	mallochohelea	7	112	6.2500000
caz_1_1	chironomus	1	112	0.8928571
caz_1_1	dicrotendipes	3	112	2.6785714
caz_1_1	paratanytarsus	34	112	30.3571429
caz_1_1	cricotopus	1	112	0.8928571
caz_1_1	psectrocladius	5	112	4.4642857
caz_1_1	thienemanniella	4	112	3.5714286
caz_1_1	krenopelopia	9	112	8.0357143
caz_1_1	procladius	1	112	0.8928571
caz_1_1	hemerodromia	1	112	0.8928571
caz_1_1	baetidae	1	112	0.8928571
caz_1_1	caenis	14	112	12.5000000
caz_1_1	acentria	1	112	0.8928571
caz_1_1	enallagma	1	112	0.8928571
caz_1_1	agraylea	1	112	0.8928571
caz_1_1	oxyethira	2	112	1.7857143
caz_1_1	hydroptilidae	1	112	0.8928571
caz_1_1	gammarus	2	112	1.7857143
caz_1_1	hyalella	14	112	12.5000000
caz_1_1	caecidotea	2	112	1.7857143
caz_1_2	amnicola_grana	2	100	2.0000000
caz_1_2	physa	27	100	27.0000000
caz_1_2	gyraulus_parvus	1	100	1.0000000
caz_1_2	helisoma_anceps	2	100	2.0000000
caz_1_2	promenetus_exacuous	5	100	5.0000000
caz_1_2	valvata	1	100	1.0000000
caz_1_2	bezzia	9	100	9.0000000
caz_1_2	mallochohelea	2	100	2.0000000
caz_1_2	chironomus	5	100	5.0000000
caz_1_2	dicrotendipes	4	100	4.0000000
caz_1_2	microtendipes	1	100	1.0000000
caz_1_2	polypedilum	2	100	2.0000000
caz_1_2	pseudochironomus	1	100	1.0000000
caz_1_2	rheotanytarsus	2	100	2.0000000
caz_1_2	thienemanniella	1	100	1.0000000
caz_1_2	ablabesmyia	5	100	5.0000000
caz_1_2	callibaetis	1	100	1.0000000
caz_1_2	caenis	1	100	1.0000000
caz_1_2	caenidae	2	100	2.0000000
caz_1_2	acentria	1	100	1.0000000
caz_1_2	mystacides	2	100	2.0000000
caz_1_2	gammarus	1	100	1.0000000
caz_1_2	hyalella	13	100	13.0000000
caz_1_2	caecidotea	9	100	9.0000000

The image below shows “under the hood”" information about ungroup.df from the Environment Tab before and after the ungroup() call is made. We do not need to focus on the details associated with this “under the hood” information but this way we can view the change made by the ungroup() call; unlike the table shown above, which does not provided any indication that the data has been unaggregated.

16.0.8 summarize

Definition: creates or overwrites columns but only retains columns specified in group_by() and the column(s) created in the summarize() call. This function requires group_by() to be called in advance.
Link: https://dplyr.tidyverse.org/reference/summarise.html

In the group_by example, we calculated the percentage of the sample comprised by each final_id taxon. final_id represents the highest taxonomic resolution for which these taxa were identified; therefore, each row represented a unique taxon.

What if we want to explore these samples at a lower resolution? In the example below, the taxa are represented at the order-level. The percent calculations are correct but, in many cases, the same taxon is represented more than once. This is a poor representation of the data. summarize() can help us solve this issue in the subsequent code chunks.

sub.df <- taxa.df %>%
  filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>% 
  select(unique_id, order, count) %>% 
  group_by(unique_id)  %>% 
  mutate(total = sum(count),
         percent = count / total * 100) %>% 
  ungroup()

knitr::kable(sub.df)

unique_id	order	count	total	percent
caz_1_1	basommatophora	1	112	0.8928571
caz_1_1	basommatophora	1	112	0.8928571
caz_1_1	diptera	5	112	4.4642857
caz_1_1	diptera	7	112	6.2500000
caz_1_1	diptera	1	112	0.8928571
caz_1_1	diptera	3	112	2.6785714
caz_1_1	diptera	34	112	30.3571429
caz_1_1	diptera	1	112	0.8928571
caz_1_1	diptera	5	112	4.4642857
caz_1_1	diptera	4	112	3.5714286
caz_1_1	diptera	9	112	8.0357143
caz_1_1	diptera	1	112	0.8928571
caz_1_1	diptera	1	112	0.8928571
caz_1_1	ephemeroptera	1	112	0.8928571
caz_1_1	ephemeroptera	14	112	12.5000000
caz_1_1	lepidoptera	1	112	0.8928571
caz_1_1	odonata	1	112	0.8928571
caz_1_1	trichoptera	1	112	0.8928571
caz_1_1	trichoptera	2	112	1.7857143
caz_1_1	trichoptera	1	112	0.8928571
caz_1_1	amphipoda	2	112	1.7857143
caz_1_1	amphipoda	14	112	12.5000000
caz_1_1	isopoda	2	112	1.7857143
caz_1_2	neotaenioglossa	2	100	2.0000000
caz_1_2	basommatophora	27	100	27.0000000
caz_1_2	basommatophora	1	100	1.0000000
caz_1_2	basommatophora	2	100	2.0000000
caz_1_2	basommatophora	5	100	5.0000000
caz_1_2	heterostropha	1	100	1.0000000
caz_1_2	diptera	9	100	9.0000000
caz_1_2	diptera	2	100	2.0000000
caz_1_2	diptera	5	100	5.0000000
caz_1_2	diptera	4	100	4.0000000
caz_1_2	diptera	1	100	1.0000000
caz_1_2	diptera	2	100	2.0000000
caz_1_2	diptera	1	100	1.0000000
caz_1_2	diptera	2	100	2.0000000
caz_1_2	diptera	1	100	1.0000000
caz_1_2	diptera	5	100	5.0000000
caz_1_2	ephemeroptera	1	100	1.0000000
caz_1_2	ephemeroptera	1	100	1.0000000
caz_1_2	ephemeroptera	2	100	2.0000000
caz_1_2	lepidoptera	1	100	1.0000000
caz_1_2	trichoptera	2	100	2.0000000
caz_1_2	amphipoda	1	100	1.0000000
caz_1_2	amphipoda	13	100	13.0000000
caz_1_2	isopoda	9	100	9.0000000

The most intuitive way to solve this problem, to sum the counts by unique and order columns before calculating the percentages, which can be down with a combination of group_by() and summarize().

Apply group_by() to the unique_id and order columns.
- group_by(unique_id, order)
Use summarize() in the same way mutate() was applied in the mutate section. In this case, the count will be overwritten by the sum of the counts aggregated by the unique_id and order columns.
- summarize(count = sum(count))
Use ungroup() to remove the groupings created with group_by().

In the table output, we can see that each taxon, order, is only represented once per sample, unique_id.

summarize.df <- taxa.df %>%
  filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>% 
  select(unique_id, order, count) %>% 
  group_by(unique_id, order) %>% 
  summarize(count = sum(count)) %>% 
  ungroup()

knitr::kable(summarize.df)

unique_id	order	count
caz_1_1	amphipoda	16
caz_1_1	basommatophora	2
caz_1_1	diptera	71
caz_1_1	ephemeroptera	15
caz_1_1	isopoda	2
caz_1_1	lepidoptera	1
caz_1_1	odonata	1
caz_1_1	trichoptera	4
caz_1_2	amphipoda	14
caz_1_2	basommatophora	35
caz_1_2	diptera	32
caz_1_2	ephemeroptera	4
caz_1_2	heterostropha	1
caz_1_2	isopoda	9
caz_1_2	lepidoptera	1
caz_1_2	neotaenioglossa	2
caz_1_2	trichoptera	2

To calculate the percentage of each order, apply the group_by(), mutate(), and ungroup() combination from the group_by section.

summarize.df <- taxa.df %>%
  filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>% 
  select(unique_id, order, count) %>% 
  group_by(unique_id, order) %>% 
  summarize(count = sum(count)) %>% 
  ungroup() %>% 
  group_by(unique_id) %>% 
  mutate(total = sum(count),
         percent = count / total * 100) %>% 
  ungroup()

knitr::kable(summarize.df)

unique_id	order	count	total	percent
caz_1_1	amphipoda	16	112	14.2857143
caz_1_1	basommatophora	2	112	1.7857143
caz_1_1	diptera	71	112	63.3928571
caz_1_1	ephemeroptera	15	112	13.3928571
caz_1_1	isopoda	2	112	1.7857143
caz_1_1	lepidoptera	1	112	0.8928571
caz_1_1	odonata	1	112	0.8928571
caz_1_1	trichoptera	4	112	3.5714286
caz_1_2	amphipoda	14	100	14.0000000
caz_1_2	basommatophora	35	100	35.0000000
caz_1_2	diptera	32	100	32.0000000
caz_1_2	ephemeroptera	4	100	4.0000000
caz_1_2	heterostropha	1	100	1.0000000
caz_1_2	isopoda	9	100	9.0000000
caz_1_2	lepidoptera	1	100	1.0000000
caz_1_2	neotaenioglossa	2	100	2.0000000
caz_1_2	trichoptera	2	100	2.0000000

The previous code chunk works but it is getting very long. Long code chunks make it harder for your future self or others to interpret and can make it harder to find/troubleshoot bugs. Better practice would be to break the previous code chunk into 2-3 sub-tasks.

The following code chunk into 3 sub-tasks, which should make it easier to follow process and allow us to observe the data after each sub-task is executed.

Explanation of sub-tasks below:

caz1.df: reduce the data frame down to only the rows and columns of interest. In this case, we are only interested in the two replicates from Cazenovia Lake site 1; therefore, I named the data frame caz1.df to describe the data represented within the data frame.
order.df: summarize the taxonomic counts by the sample, unique_id, and taxonomic rank, order.
percent.df: calculate the percentage of each sample, unique_id, represented by each taxon, order.

caz1.df <- taxa.df %>%
  filter(unique_id %in% c("caz_1_1", "caz_1_2")) %>% 
  select(unique_id, order, count)

order.df <- caz1.df %>% 
  group_by(unique_id, order) %>% 
  summarize(count = sum(count)) %>% 
  ungroup()

percent.df <- order.df %>% 
  group_by(unique_id) %>% 
  mutate(total = sum(count),
         percent = count / total * 100) %>% 
  ungroup()

16.0.9 Joins

Link: https://dplyr.tidyverse.org/reference/join.html
Definition: Combine two data sets by common features. “a” and “b” will be used in the descriptions below to represent two data sets.
left_join Join only rows in “b” that have a match in “a”
right_join Join only rows in “a” that have a match in “b”
full_join Join all rows from “a” with all rows from “b”
inner_join Join only rows that are present in both “a” and “b”
semi_join Keep only rows in “a” that are found in “b”
- “b” is not actually joined with “a”
- This is more like a filter
- Opposite of an anti_join()
anti_join Keep only rows in “a” that are NOT found in “b”
- “b” is not actually joined with “a”
- This is more like a filter
- Opposite of an semi_join()

16.0.9.1 left_join

left.df <- left_join(taxa.df, env.df, by = "unique_id")

DT::datatable(left.df, options = list(scrollX = TRUE))

16.0.9.2 right_join

right.df <- right_join(taxa.df, env.df, by = "unique_id")

DT::datatable(right.df, options = list(scrollX = TRUE))

16.0.9.3 full_join

full.df <- taxa.df %>% 
  filter(lake == "onon") %>% 
  full_join(env.df, by = "unique_id")

DT::datatable(full.df, options = list(scrollX = TRUE))

16.0.9.4 inner_join

full.df <- taxa.df %>% 
  filter(lake == "onon") %>% 
  inner_join(env.df, by = "unique_id")

DT::datatable(full.df, options = list(scrollX = TRUE))

16.0.9.5 semi_join

semi.df <- taxa.df %>% 
  filter(lake == "onon") %>% 
  semi_join(env.df, by = "unique_id")

DT::datatable(semi.df, options = list(scrollX = TRUE))

16.0.9.6 anti_join

sub.df <- taxa.df %>% 
  filter(lake == "onon")

anti.df <- anti_join(env.df, sub.df, by = "unique_id")

DT::datatable(anti.df, options = list(scrollX = TRUE))

16.0.9.7 nest_join

nest.df <- taxa.df %>% 
  filter(lake == "onon") %>% 
  nest_join(env.df, by = "unique_id")

DT::datatable(nest.df, options = list(scrollX = TRUE))

16.0.10 Bind

Link: https://dplyr.tidyverse.org/reference/bind.html
Definition: Append rows or columns on to a data frame.

16.0.10.1 bind_rows

To show how bind_rows() work, let’s first split taxa.df into three data frames representing the three lakes in the data set.

onon.df <- taxa.df %>% 
  filter(lake == "onon")

otis.df <- taxa.df %>% 
  filter(lake == "ot")

caz.df <- taxa.df %>% 
  filter(lake == "caz")

If we wanted to combine these data frames back into one data frame, we can call all of them within bind_rows(). Generally, you want to have matching column headers; however, bind_rows() will recognize columns that are unique to one data frame and add them into the appended data frame. The data frame(s) that did not contain this column will have NA as the values in this column.

bind.df <- bind_rows(onon.df, otis.df, caz.df)

DT::datatable(bind.df, options = list(scrollX = TRUE))

16.0.10.2 bind_cols

bind_cols() is similar to bind_rows() but instead appends columns. I rarely find use for this function becuase in general I would prefer a [Join].

Let’s extract the taxonomic hierarchy information from onon.df and store it as a data frame object called hier.df.

hier.df <- onon.df %>% 
  select(phylum:species)

DT::datatable(hier.df, options = list(scrollX = TRUE))

Let’s extract the taxonomic count information from onon.df and store it as a data frame object called count.df.

count.df <- onon.df %>% 
  select(unique_id, final_id, count)

DT::datatable(count.df, options = list(scrollX = TRUE))

hier.df and count.df have the same number of rows, and therefore the columns can be appended together with bind_cols()

bind.df <- bind_cols(count.df, hier.df)

DT::datatable(bind.df, options = list(scrollX = TRUE))