5 Walkthrough example: R operator
In this chapter, we walk through a concrete example of how to create an R operator for Tercen. Our goal is to create an operator that performs a linear regression on the input data and returns the slope and intercept of the model.
Designing the operator
The first step is to define our input projection and output relation. In Tercen, each operator takes a table as input and returns a table. Remember:
“Table in, table out!”
Here we want to regress the values projected on the y axis against the values projected on the x axis, per cell. In this example, we will output only the intercept and the slope of the model, per cell. The operator model can be summarized as follows: the input table holds the x and y values of each cell, and the output table holds one intercept and one slope per cell.
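To make the "table in, table out" idea concrete, here is a minimal sketch in base R that does not need a Tercen connection. The toy input table and its values are assumptions made up for illustration; only the column names (.x, .y, .ri, .ci) follow the walkthrough.

```r
# Toy input table (made-up values): two cells, three points each.
# .ri and .ci are the row and column indices that identify a cell.
input <- data.frame(
  .ri = c(0, 0, 0, 1, 1, 1),
  .ci = c(0, 0, 0, 0, 0, 0),
  .x  = c(1, 2, 3, 1, 2, 3),
  .y  = c(3, 5, 7, 0, -1, -2)
)

# Split the table by cell, fit one model per cell, and bind the results.
cells <- split(input, list(input$.ri, input$.ci), drop = TRUE)
output <- do.call(rbind, lapply(cells, function(cell) {
  fit <- lm(.y ~ .x, data = cell)
  data.frame(
    .ri = cell$.ri[1],
    .ci = cell$.ci[1],
    intercept = unname(coef(fit)[1]),
    slope = unname(coef(fit)[2])
  )
}))
rownames(output) <- NULL

# One row per cell, with the fitted intercept and slope.
print(output)
```

The output table has exactly one row per cell, which is the shape the operator is expected to return to Tercen.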
Setting up the project
Make sure that tercen-studio is properly set up and that both Tercen and RStudio run locally (on http://127.0.0.1:5402 and http://127.0.0.1:8787/, respectively). Otherwise, please refer to Chapter 2.
1. Create a GitHub repository from a template
Create a new GitHub repository under your own account, based on the Tercen R operator template (https://github.com/tercen/templateR_operator). To do so, click the green "Use this template" button in the Tercen template repository.
Then you can create your own repository based on this template. Choose an explicit name (here, lm_operator).
Now that the repository is initiated in your GitHub account, go back to RStudio Server (http://127.0.0.1:8787/). Create a new project by clicking on File > New project > Version control > Git.
You will be asked for the URL of the repository (use the newly created one) and a name for the project. Your local project should now include the following skeleton:
- main.R: main operator script
- workspace.R: local testing script
- operator.json: operator metadata
- README_template.md: operator documentation template
- doc directory: includes a dev_commands.md file, which contains useful development command lines.
2. Set up the Tercen input projection
In this example, we will use the khan dataset (available at https://github.com/tercen/khan_data). First, we start Tercen locally (http://127.0.0.1:5402) and set up a pairwise projection of the measurements in different tissues. The data step of interest should look as follows:
Note that the data step URL includes this pattern: /w/WORKFLOWID/ds/DATASTEPID, where WORKFLOWID and DATASTEPID are unique workflow and data step identifiers, respectively. These identifiers will be used in the next step within RStudio to get data from this data step.
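If you prefer not to copy the two identifiers by hand, they can be pulled out of the URL with a regular expression. The sketch below uses base R only; the URL prefix and the IDs abc123 and def456 are hypothetical placeholders, and the only assumption taken from the text is the /w/WORKFLOWID/ds/DATASTEPID pattern.

```r
# Hypothetical data step URL; replace it with the one shown in your browser.
url <- "http://127.0.0.1:5402/example/w/abc123/ds/def456"

# Capture the path segments that follow "/w/" and "/ds/".
pattern <- "/w/([^/]+)/ds/([^/?#]+)"
parts <- regmatches(url, regexec(pattern, url))[[1]]

# parts[1] is the full match; the captured groups follow it.
workflow_id <- parts[2]
step_id <- parts[3]

print(workflow_id)
print(step_id)
```

The two values can then be used wherever the WORKFLOWID and DATASTEPID placeholders appear in workspace.R.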
Develop the operator locally
Now that our RStudio project and Tercen projection are set up, we can code and test our operator locally as follows:
- Open workspace.R
- Replace the data step and workflow IDs in workspace.R with the ones taken from the Tercen data step URL:
library(tercen)
library(dplyr)
options("tercen.workflowId" = "WORKFLOWID")
options("tercen.stepId" = "DATASTEPID")
- Code your operator. Here, we implement a function do.lm() that performs a linear regression on the input data frame and returns the slope and intercept of the model.
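do.lm <- function(df) {
  # Output template: one row per cell, identified by .ri and .ci
  out <- data.frame(
    .ri = df$.ri[1],
    .ci = df$.ci[1],
    intercept = NaN,
    slope = NaN
  )
  # Fit the linear model of .y against .x for this cell
  mod <- lm(.y ~ .x, data = df)
  out$intercept <- mod$coefficients[1]
  out$slope <- mod$coefficients[2]
  return(out)
}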
ctx <- tercenCtx() # Get the context of the data step
ctx %>%
  select(.x, .y, .ri, .ci) %>% # select variables of interest
  group_by(.ri, .ci) %>% # group by row and column ("per cell")
  do(do.lm(.)) %>% # do the linear model
  ctx$addNamespace() %>% # add namespace
  ctx$save() # push results back to Tercen using the API
- Execute the code and check the results in Tercen
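Before pushing results to Tercen, do.lm() can also be exercised on its own with simulated data. The sketch below is one possible way to do this and assumes nothing beyond base R; the mock data frame, the chosen intercept (2) and slope (0.5), and the noise level are made up for illustration.

```r
# Same function as in the walkthrough above, repeated so the block runs alone.
do.lm <- function(df) {
  out <- data.frame(
    .ri = df$.ri[1],
    .ci = df$.ci[1],
    intercept = NaN,
    slope = NaN
  )
  mod <- lm(.y ~ .x, data = df)
  out$intercept <- mod$coefficients[1]
  out$slope <- mod$coefficients[2]
  return(out)
}

# Simulated single cell: y = 2 + 0.5 * x plus a little Gaussian noise.
set.seed(42)
mock <- data.frame(.ri = 0, .ci = 0, .x = 1:20)
mock$.y <- 2 + 0.5 * mock$.x + rnorm(20, sd = 0.1)

# The estimates should be close to the values used in the simulation.
res <- do.lm(mock)
print(res)
```

If the estimated intercept and slope are far from the simulated values, the problem is in the function itself rather than in the Tercen projection.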
Note that we recommend implementing the following sanity checks when creating an operator: