Chapter 1 Introduction

Goals

Tercen is inspired by the Tara Ocean Foundation. The aim of the Tara oceanic expeditions is to study and understand the impact of climate change and the ecological crisis facing the world’s oceans. Tara datasets and analyses are made open, simple and accessible to the community!

Using data from this project we will answer the following questions:

  • Which eukaryotic lineages are responsible for carbon export?

  • How are eukaryotic lineages and their association with carbon export distributed across space?

Data

The data come from the TARA Oceans project, as published in Guidi et al. (2016):

Guidi, L., Chaffron, S., Bittner, L., Eveillard, D., Larhlimi, A., Roux, S., … & Coelho, L. P. (2016). Plankton networks driving carbon export in the oligotrophic ocean. Nature, 532(7600), 465-470.

The TARA Oceans project collected:

7.2 terabases (Tb) of metagenomics data from 68 globally distributed sites, representing ~40 million non-redundant genes, around 35,000 operational taxonomic units (OTUs) of prokaryotes (Bacteria and Archaea) and numerous mainly uncharacterized viruses and picoeukaryotes.

A set of 2.3 million eukaryotic 18S rDNA ribotypes was generated from a subset of 47 sampling sites corresponding to approximately 130,000 OTUs. Finally, 5,476 viral ‘populations’ were identified at 43 sites from viral metagenomic contigs.

These genomics data combined across all domains of life and viruses together with carbon export estimates and other environmental parameters were used to explore the relationships between marine biogeochemistry and plankton communities in the top 150 m of the oligotrophic open ocean.

The Methods section of Guidi et al. (2016) describes the data collection process in detail.

Throughout this tutorial we will use two tables coming from this study:

  • The Supplementary Table 4 containing environmental data for each sample:
    • Sample: Sample ID
    • Latitude: Latitude
    • Longitude: Longitude
    • Salinity: Salinity
    • NO2 (umol/L): NO2 concentration
    • PO4 (umol/L): PO4 concentration
    • NO2NO3 (umol/L): NO2NO3 concentration
    • Mean Chloro HPLC adjusted (mg Chl/m3): Chlorophyll concentration
    • Mean Temperature deg C: Temperature in degree Celsius
    • Mean Oxygen adjusted (umol/Kg): Oxygen concentration
    • Mean Flux at 150m: Carbon flux at 150m
    • NPP 8d VGPM (mgC/m2/day): NPP weekly average (VGPM model)
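
As a rough illustration of how a table with these columns could be read outside Tercen, here is a minimal Python sketch. It assumes the table has been exported as CSV text with the column names listed above. Only a subset of the columns is shown, the function name `read_environment` is hypothetical, and the two data rows are made-up placeholders, not values from Guidi et al. (2016).

```python
import csv
import io

# Column names are copied from the list above (subset only).
# The two data rows are made-up placeholders, NOT values from the study.
SAMPLE_CSV = """Sample,Latitude,Longitude,Salinity,Mean Temperature deg C,Mean Flux at 150m
S1,10.0,-20.0,35.0,25.0,1.5
S2,-30.0,40.0,36.0,18.0,0.5
"""


def read_environment(text):
    """Parse CSV text into a list of dicts, converting measurements to float."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        # Keep the sample ID as text; every other column is numeric.
        row = {"Sample": record["Sample"]}
        for name, value in record.items():
            if name != "Sample":
                row[name] = float(value)
        rows.append(row)
    return rows


environment = read_environment(SAMPLE_CSV)
for row in environment:
    print(row["Sample"], row["Latitude"], row["Longitude"], row["Mean Flux at 150m"])
```

Keeping one dictionary per sample makes it easy to pick out any single parameter, for example the carbon flux at 150 m, alongside the sample coordinates.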

Guidi et al. (2016) describe the environmental data collection process as follows:

Environmental data were collected from 2009–2013 across all major oligotrophic oceanic provinces in the context of the Tara Oceans expeditions. Sampling stations were selected to represent distinct marine ecosystems at a global scale.

Environmental data were obtained from vertical profiles of a sampling package. It consisted of conductivity and temperature sensors, chlorophyll and CDOM fluorometers, light transmissometer (Wetlabs C-star 25 cm), a backscatter sensor (WetLabs ECO BB), a nitrate sensor (SATLANTIC ISUS) and an underwater vision profiler (Hydroptics UVP).

Nitrate and fluorescence to chlorophyll concentrations as well as salinity were calibrated with water samples collected with Niskin bottle. Net primary production (NPP) data were extracted from 8-day composites of the vertically generalized production model (VGPM) at the week of sampling. Carbon fluxes and carbon export, corresponding to the carbon flux at 150 m, were estimated based on particle concentration and size distributions obtained from the UVP and details are presented below.

From particle size distribution to carbon export estimation. Previous research has shown that the distribution of particle size follows a power law over the micrometre to the millimetre size range.
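
To make the power-law statement concrete, here is a minimal Python sketch. It assumes a size distribution of the form n(d) = a · d^(−k), where the symbols a and k and the function name `fit_power_law` are illustrative choices that do not come from the source. It recovers the two parameters with a straight-line fit in log-log space. The data are synthetic, and this is not the carbon export estimation procedure of Guidi et al. (2016), only an illustration of what following a power law means.

```python
import math


def fit_power_law(sizes, counts):
    """Least-squares fit of log(count) = log(a) - k * log(size).

    Returns the tuple (a, k) for the assumed model count = a * size ** -k.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic particle sizes and counts generated from a = 100, k = 3.
# These numbers are illustrative only, not measurements.
sizes = [0.1, 0.2, 0.5, 1.0, 2.0]
counts = [100.0 * d ** -3.0 for d in sizes]

a, k = fit_power_law(sizes, counts)
print(round(a, 3), round(k, 3))
```

On a log-log plot a power law appears as a straight line, which is why an ordinary linear fit on the logarithms is enough to estimate the exponent in this simplified setting.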

  • The Supplementary Table 5 containing different eukaryotic lineages and their correlations as computed using sparse Partial Least Squares regression (sPLS). This is a method for regression-based modelling of highly multidimensional data in biology, detailed in the following paper:

Lê Cao, K. A., Rossouw, D., Robert-Granié, C. & Besse, P. A sparse PLS for variable selection when integrating omics data. Stat. Appl. Genet. Mol. Biol. 7, 35 (2008).

In this study, it is used to directly associate eukaryotic lineages to carbon export and other environmental traits. It models a causal relationship between the lineages and the environmental traits, that is, PLS will predict environmental traits (for example, carbon export) from lineage abundances.

Methods

To answer these questions, we will:

1. Load data into Tercen

  • We will load the tables mentioned above into Tercen

  • We will then create a data projection suitable to answer our questions

2. Create a heatmap

  • We will represent the correlation between the presence of each eukaryotic lineage and the environmental parameters described above

  • We will use a Tercen operator that will apply a given computation to the projection we prepared before

  • This computation will produce a heatmap of the correlations between eukaryotic lineages and environmental parameters, after clustering them based on their similarities for environmental parameters

  • The heatmap then uses a color scale to depict the strength of the correlation between a given lineage and an environmental variable

  • The clustering of rows and columns makes the heatmap easier to read because it groups items by similarity: clusters correspond to lineages and environmental parameters that are similar, and their color indicates whether they are strongly correlated or not

  • As in the original study, we can identify high correlations between some lineages and carbon export

3. Look at the distribution of these variables on a world map

  • We will apply another Tercen operator to represent environmental parameters on a world map.

  • This map will give us an idea of how environmental parameters are distributed across space.
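
The clustering idea behind the heatmap in step 2 can be sketched in a few lines of Python. This is a naive average-linkage agglomerative clustering written for illustration. The source does not specify the algorithm used by the Tercen operator, so this is a stand-in, not its implementation. The lineage names and correlation values are made-up placeholders, and the function names `euclidean` and `cluster_order` are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cluster_order(vectors):
    """Return a row order from naive average-linkage agglomerative clustering."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find and merge the two closest clusters.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters[0]


# Placeholder correlations of four hypothetical lineages with two
# environmental parameters; these are NOT values from Supplementary Table 5.
parameters = ["Mean Flux at 150m", "Salinity"]
lineages = ["lineage_A", "lineage_B", "lineage_C", "lineage_D"]
correlations = [
    [0.9, -0.8],
    [-0.7, 0.6],
    [0.8, -0.9],
    [-0.5, 0.7],
]

order = cluster_order(correlations)
for i in order:
    print(lineages[i], correlations[i])
```

After reordering, rows with similar correlation profiles sit next to each other, which is what makes blocks of similar color visible in a clustered heatmap. The same function could be applied to the transposed matrix to order the columns.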