TARA Oceans data analysis using Tercen
Chapter 1 Introduction
Goals
Tercen is inspired by the Tara Ocean Foundation. Tara oceanic expeditions aim is to study and understand the impact of climate change and the ecological crisis facing the world’s oceans. Tara datasets and analysis are made open simple and accessible to the community!
Using data from this project we will answer the following questions:
Which eukaryotic lineages are reponsible for carbon export?
How are eukaryotic lineages and their association with carbon export distributed across space?
Data
The data comes from the TARA Oceans project, coming from this publication: Guidi et al. (2016):
Guidi, L., Chaffron, S., Bittner, L., Eveillard, D., Larhlimi, A., Roux, S., … & Coelho, L. P. (2016). Plankton networks driving carbon export in the oligotrophic ocean. Nature, 532(7600), 465-470.
The TARA Oceans project collected:
7.2 terabases (Tb) of metagenomics data from 68 globally distributed sites, representing ~40 million non-redundant genes, around 35,000 operational taxonomic units (OTUs) of prokaryotes (Bacteria and Archaea) and numerous mainly uncharacterized viruses and picoeukaryotes.
A set of 2.3 million eukaryotic 18S rDNA ribotypes was generated from a subset of 47 sampling sites corresponding to approximately 130,000 OTUs. Finally, 5,476 viral ‘populations’ were identified at 43 sites from viral metagenomic contigs.
These genomics data combined across all domains of life and viruses together with carbon export estimates and other environmental parameters were used to explore the relationships between marine biogeochemistry and plankton communities in the top 150 m of the oligotrophic open ocean.
The Methods section of Guidi et al. (2016) fordescribed in detail the data collection process.
Throughout this tutorial we will use two tables coming from this study:
- The Supplementary Table 4 containing environmental data for each sample:
Sample
: Sample IDLatitude
: LatitudeLongitude
: LongitudeSalinity
: SalinityNO2 (umol/L)
: NO2 concentrationPO4 (umol/L)
: PO4 concentrationNO2NO3 (umol/L)
: NO2NO3 concentrationMean Chloro HPLC adjusted (mg Chl/m3)
: Chlorophyll concentrationMean Temperature deg C
: Temperature in degree CelsiusMean Oxygen adjusted (umol/Kg)
: Oxygen concentrationMean Flux at 150m
: Carbon flux at 150mNPP 8d VGPM (mgC/m2/day)
: NPP weekly average (VGPM model)
Guidi et al. (2016) describe the environmental daya collection process as follow:
Environmental data were collected from 2009–2013 across all major oligotrophic oceanic provinces in the context of the Tara Oceans expeditions. Sampling stations were selected to represent distinct marine ecosystems at a global scale.
Environmental data were obtained from vertical profiles of a sampling package. It consisted of conductivity and temperature sensors, chlorophyll and CDOM fluorometers, light transmissometer (Wetlabs C-star 25 cm), a backscatter sensor (WetLabs ECO BB), a nitrate sensor (SATLANTIC ISUS) and an underwater vision profiler (Hydroptics UVP).
Nitrate and fluorescence to chlorophyll concentrations as well as salinity were calibrated with water samples collected with Niskin bottle. Net primary production (NPP) data were extracted from 8-day composites of the vertically generalized production model (VGPM) at the week of sampling. Carbon fluxes and carbon export, corresponding to the carbon flux at 150 m, were estimated based on particle concentration and size distributions obtained from the UVP and details are presented below. From particle size distribution to carbon export estimation. Previous research has shown that the distribution of particle size follows a power law over the micrometre to the millimetre size range.
- The Supplementary Table 5 containing different eukaryotic lineages and their correlations as computed using sparse Partial Least Squares regression (sPLS). This is a method for regression-based modelling of highly multidimensional data in biology, detailed in the following paper:
Lê Cao, K. A., Rossouw, D., Robert-Granié, C. & Besse, P. A sparse PLS for variable selection when integrating omics data. Stat. Appl. Genet. Mol. Biol. 7, 35 (2008).
In this study, it is used to directly associate eukaryotic lineages to carbon export and other environmental traits. It models a causal relationship between the lineages and the environmental traits, that is, PLS will predict environmental traits (for example, carbon export) from lineage abundances.
Methods
To answer these questions, we will:
1. Load data into Tercen
We will load the tables mentioned above into Tercen
We will then create a data projection suitable to answer our questions
2. Create a heatmap
We will represent the correlation between the presence of each eukaryotic lineages and environmental parameters described above
We will use a Tercen operator that will apply a given computation to the projection we prepared before
This computation will consist in making a heatmap of the correlation of eukaryotic lineages and environmental parameters, after clustering them based on their similarities for environmental parameters
The heatmap then depicts with a color scale the strength of the correlation of a given lineage to an environmental variable
The clustering of rows and columns facilitates the reading as it groups based on similarities: hence, clusters correspond to lineages and environmentalal parameters that are similar, and their color indicate whether they are strongly correlated or not
As in the original study, we can identify high correlations between some lineages and carbon export
3. Look at the distribution of these variables on a world map
We will apply another Tercen operator to represent environmental parameters on a world map.
This map will allow us to get an idea on how environmental parameters are distributed across space.