5 Walkthrough example: R operator

Here we will learn through a concrete example how to create an R operator for Tercen. Our goal is to create an operator performing a linear regression on our input data and returning the slope and intercept of the model.

Designing the operator

The first step is to define our input projection and output relation. In Tercen, each operator takes a table as input and returns a table. Remember:

Table in, table out!

Here we want to perform, per cell, the linear regression of the values projected on the y axis against the values projected on the x axis. In this example, we will output only the intercept and the slope of the model, per cell. The operator model can be seen as follows:
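In code terms, the same design can be sketched without Tercen at all. The snippet below is only a toy illustration in base R: the input table and its values are made up, and the column names (.ri, .ci, .x, .y) simply mirror those used later in this chapter. It shows the shape of the contract: many rows per cell go in, one row per cell comes out.

```r
# Toy input table: two cells, each identified by (.ri, .ci) and holding
# four (.x, .y) points. All values are invented for illustration.
input <- data.frame(
  .ri = rep(c(0, 1), each = 4),
  .ci = 0,
  .x  = rep(1:4, times = 2),
  .y  = c(3, 5, 7, 9,      # cell (0, 0): y = 1 + 2 * x
          2, 1, 0, -1)     # cell (1, 0): y = 3 - 1 * x
)

# One regression per cell, one output row per cell.
cells <- split(input, list(input$.ri, input$.ci), drop = TRUE)
output <- do.call(rbind, lapply(cells, function(df) {
  fit <- lm(.y ~ .x, data = df)
  data.frame(
    .ri       = df$.ri[1],
    .ci       = df$.ci[1],
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2])
  )
}))
rownames(output) <- NULL

print(output)  # two rows: one (intercept, slope) pair per cell
```

The output keeps the .ri and .ci indices so that each result can be attached to the cell it was computed from.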

Setting up the project

Make sure that tercen-studio is properly set up and that both Tercen and RStudio run locally (respectively on http://127.0.0.1:5402 and http://127.0.0.1:8787/). Otherwise, please refer to Chapter 2.

1. Create a GitHub repository from a template

Create a new GitHub repository with your own account based on the Tercen R operator template (https://github.com/tercen/templateR_operator). Click on the green button Use this template in the Tercen template repository.

Then you can create your own repository based on this template. Choose an explicit name (here, lm_operator).

Now that the repository has been created in your GitHub account, go back to RStudio Server (http://127.0.0.1:8787/). Create a new project by clicking on File > New project > Version control > Git.

You will be asked for the URL of the repository (use the newly created one) and a name for the project. Your local project should now include the following skeleton:

  • main.R: main operator script

  • workspace.R: local testing script

  • operator.json: operator metadata

  • README_template.md: operator documentation template

  • doc directory: includes a dev_commands.md file, which contains useful development command lines.

2. Set up the Tercen input projection

In this example, we will use the khan dataset (available on https://github.com/tercen/khan_data). First, we start Tercen locally (http://127.0.0.1:5402) and set up a pairwise projection of the measurement in different tissues. The data step of interest should look as follows:

Note that the data step URL includes this pattern: /w/WORKFLOWID/ds/DATASTEPID, where WORKFLOWID and DATASTEPID are unique workflow and data step identifiers, respectively. These identifiers will be used in the next step within RStudio to get data from this data step.
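If you prefer not to copy the two identifiers by hand, they can be pulled out of the URL with a regular expression. The following is a minimal sketch in base R; the URL below is a made-up placeholder (only the /w/.../ds/... pattern comes from the text above), and the variable names are illustrative.

```r
# Placeholder data step URL; replace it with the one shown in your browser.
# Only the /w/<id>/ds/<id> part of the pattern is taken from this chapter.
step_url <- "http://127.0.0.1:5402/someteam/w/WORKFLOWID/ds/DATASTEPID"

# Capture the path segments that follow /w/ and /ds/.
pattern <- "/w/([^/]+)/ds/([^/?#]+)"
parts <- regmatches(step_url, regexec(pattern, step_url))[[1]]

workflow_id <- parts[2]  # first capture group
step_id     <- parts[3]  # second capture group

# The same option names are set by hand in workspace.R in the next section.
options("tercen.workflowId" = workflow_id)
options("tercen.stepId"     = step_id)
```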

Develop the operator locally

Now that our RStudio project and Tercen projection are set up, we can code and test our operator locally as follows:

  • Open workspace.R

  • Replace the workflow and data step IDs in workspace.R with those taken from the Tercen data step URL:

library(tercen)
library(dplyr)

options("tercen.workflowId" = "WORKFLOWID")
options("tercen.stepId"     = "DATASTEPID")

  • Code your operator. Here, we implement a function do.lm() that performs a linear regression on the input data frame and returns the slope and intercept of the model.

do.lm <- function(df) {
  out <- data.frame(
    .ri = df$.ri[1],
    .ci = df$.ci[1],
    intercept = NaN,
    slope = NaN
  )
  
  mod <- lm(.y ~ .x, data = df)
  
  out$intercept <- mod$coefficients[1]
  out$slope <- mod$coefficients[2]
  
  return(out)
}

ctx <- tercenCtx()               # create the Tercen context for the data step

ctx %>%                          # get data from the data step
  select(.x, .y, .ri, .ci) %>%   # select variables of interest
  group_by(.ri, .ci) %>%         # group by row and column ("per cell")
  do(do.lm(.)) %>%               # do the linear model
  ctx$addNamespace() %>%         # add namespace
  ctx$save()                     # push results back to Tercen using the API

  • Execute the code and check the results in Tercen

Note that we recommend implementing the following sanity checks when creating an operator:

  • check the presence of expected inputs (here, x and y axes)

  • use the try() function to test the main function implemented (here, lm())
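Both checks can be sketched as a guarded variant of do.lm(). This is one possible approach, not the template's own code: the name do.lm.safe() is hypothetical, and checking for the axes through the column names of the data frame is an assumption made so that the example runs without a Tercen connection.

```r
# Hypothetical guarded variant of do.lm(): a failed fit leaves NaN
# coefficients for that cell instead of stopping the whole operator.
do.lm.safe <- function(df) {
  # Input check (assumption: the axes appear as .x and .y columns).
  if (!all(c(".x", ".y") %in% names(df))) {
    stop("x and y axes are required.")
  }

  out <- data.frame(
    .ri = df$.ri[1],
    .ci = df$.ci[1],
    intercept = NaN,
    slope = NaN
  )

  # try() returns an object of class "try-error" when lm() fails.
  mod <- try(lm(.y ~ .x, data = df), silent = TRUE)

  if (!inherits(mod, "try-error")) {
    out$intercept <- unname(coef(mod)[1])
    out$slope     <- unname(coef(mod)[2])
  }
  out
}

# Made-up cells: one that can be fitted, one with no usable y values.
good <- data.frame(.ri = 0, .ci = 0, .x = 1:4, .y = c(3, 5, 7, 9))
bad  <- data.frame(.ri = 1, .ci = 0, .x = 1:4, .y = NA_real_)

do.lm.safe(good)  # fitted coefficients
do.lm.safe(bad)   # NaN coefficients, no error raised
```

Keeping NaN for failed cells means the output table still has one row per cell, which makes failures easy to spot in Tercen.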