13 Data Input and Output Patterns

This chapter covers advanced patterns for handling data input and output in Tercen operators. Building on the basic concepts from previous chapters, it walks through per-cell, per-row, and per-column outputs, multiple output tables, file and relation outputs, and reading files stored in a project.

Prerequisites

Before proceeding, ensure you’ve completed:

- Development Environment Setup chapter for development environment
- Basic Implementation chapter for core operator concepts
- Understanding of Tercen’s projection system

13.1 Understanding Tercen’s Data Structure

Tercen organizes data using a projection system with specific index columns:

Column	Purpose	Usage
`.ri`	Row index	Identifies specific rows in the data projection
`.ci`	Column index	Identifies specific columns in the data projection
`.y`	Data values	The actual measurement or observation values
`.x`	X-axis values	Independent variable values (when applicable)

These special columns enable flexible data aggregation and output patterns.
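
To make the index columns concrete, the sketch below flattens a small matrix into the long format described in the table. It uses plain Python rather than the Tercen client; the matrix values and the helper name `to_long` are made up for illustration, and the 0-based numbering of `.ri` and `.ci` is an assumption.

```python
# Illustrative only: a 2 x 3 matrix of measurements (rows x columns).
matrix = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]


def to_long(matrix):
    """Flatten a matrix into one record per cell with .ri, .ci and .y."""
    return [
        {".ri": ri, ".ci": ci, ".y": y}
        for ri, row in enumerate(matrix)
        for ci, y in enumerate(row)
    ]


cells = to_long(matrix)
for cell in cells:
    print(cell)
```

Every aggregation pattern in this chapter is a grouping over one or both of these index columns.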

13.2 Basic Output Patterns

13.2.1 Per-Cell Output (Default)

Most operators output one result per cell, maintaining the original data structure:

library(tercen)
library(dplyr)

# Connect to Tercen context
ctx <- tercenCtx()

# Per-cell calculation (e.g., log transformation)
result <- ctx %>%
  select(.ri, .ci, .y) %>%
  mutate(log_value = log(.y + 1)) %>%
  select(.ri, .ci, log_value) %>%
  ctx$addNamespace()

ctx$save(result)

from tercen.client import context as ctx
import polars as pl

# Connect to Tercen context
tercenCtx = ctx.TercenContext()

# Per-cell calculation
df = (
    tercenCtx
    .select(['.ri', '.ci', '.y'], df_lib="polars")
    .with_columns([
        (pl.col('.y') + 1).log().alias('log_value')
    ])
    .select(['.ri', '.ci', 'log_value'])
)

df = tercenCtx.add_namespace(df)
tercenCtx.save(df)
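
The per-cell pattern produces exactly one output record for each input `(.ri, .ci)` pair. A minimal plain-Python sketch of the same log transformation, with made-up input values and no Tercen client, illustrates that shape. It uses `math.log1p`, which computes log(1 + y) and tends to be more accurate than `log(y + 1)` for values close to zero.

```python
import math

# Made-up input cells in the long format used by the projection.
cells = [
    {".ri": 0, ".ci": 0, ".y": 0.0},
    {".ri": 0, ".ci": 1, ".y": 9.0},
    {".ri": 1, ".ci": 0, ".y": 99.0},
]

# One output record per input cell; the index columns are carried through.
result = [
    {".ri": c[".ri"], ".ci": c[".ci"], "log_value": math.log1p(c[".y"])}
    for c in cells
]
```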

13.2.2 Per-Row Output

Aggregate data across columns for each row:

# Calculate statistics per row
row_stats <- ctx %>%
  select(.ri, .ci, .y) %>%
  group_by(.ri) %>%
  summarise(
    mean_value = mean(.y, na.rm = TRUE),
    sd_value = sd(.y, na.rm = TRUE),
    count = n(),
    .groups = "drop"
  ) %>%
  ctx$addNamespace()

ctx$save(row_stats)

# Calculate statistics per row
df = (
    tercenCtx
    .select(['.ri', '.ci', '.y'], df_lib="polars")
    .group_by(['.ri'])
    .agg([
        pl.col('.y').mean().alias('mean_value'),
        pl.col('.y').std().alias('sd_value'),
        pl.col('.y').count().alias('count')
    ])
)

df = tercenCtx.add_namespace(df)
tercenCtx.save(df)
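
As a cross-check on what the per-row aggregation computes, here is a plain-Python sketch with made-up cells. It assumes the sample standard deviation (divisor n - 1), which is what R's `sd()` and polars' `std()` use by default, and it returns `None` for a row with a single value, where that statistic is undefined.

```python
from collections import defaultdict
from statistics import mean, stdev

# Made-up cells: row 0 has three values, row 1 has two.
cells = [
    {".ri": 0, ".ci": 0, ".y": 2.0},
    {".ri": 0, ".ci": 1, ".y": 4.0},
    {".ri": 0, ".ci": 2, ".y": 6.0},
    {".ri": 1, ".ci": 0, ".y": 1.0},
    {".ri": 1, ".ci": 1, ".y": 3.0},
]

# Collect the .y values of each row, ignoring the column index.
values_by_row = defaultdict(list)
for c in cells:
    values_by_row[c[".ri"]].append(c[".y"])

row_stats = [
    {
        ".ri": ri,
        "mean_value": mean(ys),
        # Sample standard deviation (n - 1); needs at least two values.
        "sd_value": stdev(ys) if len(ys) > 1 else None,
        "count": len(ys),
    }
    for ri, ys in sorted(values_by_row.items())
]
```

The output keeps `.ri` but drops `.ci`, which is what marks it as a per-row result.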

13.2.3 Per-Column Output

Aggregate data across rows for each column:

# Calculate statistics per column
col_stats <- ctx %>%
  select(.ri, .ci, .y) %>%
  group_by(.ci) %>%
  summarise(
    median_value = median(.y, na.rm = TRUE),
    q25 = quantile(.y, 0.25, na.rm = TRUE),
    q75 = quantile(.y, 0.75, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  ctx$addNamespace()

ctx$save(col_stats)

# Calculate statistics per column
# interpolation='linear' matches the default of R's quantile()
df = (
    tercenCtx
    .select(['.ri', '.ci', '.y'], df_lib="polars")
    .group_by(['.ci'])
    .agg([
        pl.col('.y').median().alias('median_value'),
        pl.col('.y').quantile(0.25, interpolation='linear').alias('q25'),
        pl.col('.y').quantile(0.75, interpolation='linear').alias('q75')
    ])
)

df = tercenCtx.add_namespace(df)
tercenCtx.save(df)
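
Quartiles depend on the interpolation rule. R's `quantile()` interpolates linearly by default (type 7), while polars' `quantile()` defaults to `interpolation="nearest"` and accepts `interpolation="linear"`, so the choice matters when comparing results across the two languages. The sketch below spells out the linear rule in plain Python; the helper name `quantile_linear` and the values are made up for illustration, and the helper expects a non-empty list.

```python
import math


def quantile_linear(values, p):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    h = (len(ordered) - 1) * p  # fractional position in the sorted data
    lo = math.floor(h)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (h - lo) * (ordered[hi] - ordered[lo])


ys = [1.0, 2.0, 3.0, 4.0]  # made-up values for one column
q25 = quantile_linear(ys, 0.25)
median_value = quantile_linear(ys, 0.50)
q75 = quantile_linear(ys, 0.75)
```

With only four values, neither quartile coincides with an observed value, which is exactly the situation where interpolation rules disagree.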

13.3 Advanced Output Patterns

13.3.1 Multiple Output Tables

Some operators need to return multiple related datasets:

# Generate multiple output tables
summary_stats <- ctx %>%
  select(.ri, .ci, .y) %>%
  summarise(
    overall_mean = mean(.y, na.rm = TRUE),
    overall_sd = sd(.y, na.rm = TRUE)
  ) %>%
  ctx$addNamespace()

row_stats <- ctx %>%
  select(.ri, .y) %>%
  group_by(.ri) %>%
  summarise(row_mean = mean(.y, na.rm = TRUE)) %>%
  ctx$addNamespace()

col_stats <- ctx %>%
  select(.ci, .y) %>%
  group_by(.ci) %>%
  summarise(col_mean = mean(.y, na.rm = TRUE)) %>%
  ctx$addNamespace()

# Save multiple tables
ctx$save(list(summary_stats, row_stats, col_stats))

# Generate multiple output tables
summary_stats = (
    tercenCtx
    .select(['.y'], df_lib="polars")
    .select([
        pl.col('.y').mean().alias('overall_mean'),
        pl.col('.y').std().alias('overall_sd')
    ])
)

row_stats = (
    tercenCtx
    .select(['.ri', '.y'], df_lib="polars")
    .group_by(['.ri'])
    .agg([pl.col('.y').mean().alias('row_mean')])
)

col_stats = (
    tercenCtx
    .select(['.ci', '.y'], df_lib="polars")
    .group_by(['.ci'])
    .agg([pl.col('.y').mean().alias('col_mean')])
)

# Add namespaces and save
summary_stats = tercenCtx.add_namespace(summary_stats)
row_stats = tercenCtx.add_namespace(row_stats)
col_stats = tercenCtx.add_namespace(col_stats)

tercenCtx.save([summary_stats, row_stats, col_stats])
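
The three tables differ in granularity: one record overall, one per row, and one per column. The plain-Python sketch below reproduces that shape with made-up cells; the helper name `mean_by` is hypothetical, and how Tercen attaches each saved table to the projection is not modelled here.

```python
from collections import defaultdict
from statistics import mean

# Made-up 2 x 2 projection in long format.
cells = [
    {".ri": 0, ".ci": 0, ".y": 1.0},
    {".ri": 0, ".ci": 1, ".y": 3.0},
    {".ri": 1, ".ci": 0, ".y": 5.0},
    {".ri": 1, ".ci": 1, ".y": 7.0},
]


def mean_by(cells, key, out_name):
    """Mean of .y for each distinct value of the index column `key`."""
    groups = defaultdict(list)
    for c in cells:
        groups[c[key]].append(c[".y"])
    return [{key: k, out_name: mean(v)} for k, v in sorted(groups.items())]


summary_stats = [{"overall_mean": mean([c[".y"] for c in cells])}]
row_stats = mean_by(cells, ".ri", "row_mean")
col_stats = mean_by(cells, ".ci", "col_mean")

# Each table is keyed by a different index column (or by none at all).
tables = [summary_stats, row_stats, col_stats]
```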

13.3.2 Working with Factor Variables

When your projection includes factors (categorical variables), incorporate them into your analysis:

# Include row factors in the analysis
ctx <- tercenCtx()

# Get row factors: one record per projection row, in row order
row_factors <- ctx$rselect() %>%
  mutate(.ri = seq_len(n()) - 1L)  # add the 0-based row index to join on

result <- ctx %>%
  select(.ri, .ci, .y) %>%
  left_join(row_factors, by = ".ri") %>%
  group_by(.ri, .ci, factor_column) %>%  # factor_column is a placeholder for one of your row factors
  summarise(
    group_mean = mean(.y, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  ctx$addNamespace()

ctx$save(result)

# Include row factors in the analysis
# Assumes the row-factor table carries a .ri column; add a 0-based index first if it does not
factors_df = tercenCtx.rselect(df_lib="polars")

result = (
    tercenCtx
    .select(['.ri', '.ci', '.y'], df_lib="polars")
    .join(factors_df, on='.ri', how='left')
    .group_by(['.ri', '.ci', 'factor_column'])  # factor_column is a placeholder
    .agg([pl.col('.y').mean().alias('group_mean')])
)

result = tercenCtx.add_namespace(result)
tercenCtx.save(result)
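
The join between cells and row factors can be sketched in plain Python. The factor name `condition`, its values, and the cells are all hypothetical, and the sketch assumes that the position of each record in the row-factor table is its 0-based row index.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical row-factor table: one record per projection row.
row_factors = [
    {"condition": "control"},
    {"condition": "treated"},
    {"condition": "treated"},
]

# Assumption: the position of each record is its 0-based row index.
factor_by_ri = dict(enumerate(row_factors))

cells = [
    {".ri": 0, ".ci": 0, ".y": 1.0},
    {".ri": 1, ".ci": 0, ".y": 2.0},
    {".ri": 2, ".ci": 0, ".y": 4.0},
]

# Left join on .ri, then average .y within each (column, condition) group.
groups = defaultdict(list)
for c in cells:
    condition = factor_by_ri.get(c[".ri"], {}).get("condition")
    groups[(c[".ci"], condition)].append(c[".y"])

result = [
    {".ci": ci, "condition": cond, "group_mean": mean(ys)}
    for (ci, cond), ys in sorted(groups.items())
]
```

Grouping by the factor instead of by `.ri` collapses rows that share a factor level, which is usually the point of bringing factors into the calculation.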

13.4 Specialized Output Types

13.4.1 File Output

Tercen operators can generate and output files (plots, reports, data exports) that users can download or view directly in the interface. This is particularly useful for visualization operators, report generators, and data export tools.

File Output Use Cases

Plots and Visualizations: PNG, PDF, SVG graphics
Reports: HTML, PDF documents with analysis results
Data Exports: CSV, Excel files with processed data
Configuration Files: JSON, YAML files for downstream tools

See the Patterns for Plot Operators chapter for a detailed tutorial on how to output files in Tercen.

13.4.2 Relations Output

Relations in Tercen let an operator link and join several tables. They are useful when the results live at different granularities, for example one table per row and another per column of the same analysis, and the operator needs to describe how those tables connect.

When to Use Relations - Some examples

PCA analysis with loadings and scores
Clustering with cluster assignments and centroids
Complex statistical models with multiple output components

Key relation functions:

- as_relation(): Convert data frames to relations
- left_join_relation(): Join relations together
- save_relation(): Save relations to Tercen
- as_join_operator(): Create join operators for complex relationships

# Example: Simple relation output
library(tibble)

# Create a relation with results
result_relation <- tibble(
  component = c("PC1", "PC2", "PC3"),
  variance_explained = c(0.45, 0.32, 0.15),
  eigenvalue = c(4.5, 3.2, 1.5)
) %>%
  ctx$addNamespace() %>%
  as_relation()

# Save relation
ctx$save_relation(result_relation)

import polars as pl

# Create a relation with results
result_data = pl.DataFrame({
    'component': ['PC1', 'PC2', 'PC3'],
    'variance_explained': [0.45, 0.32, 0.15],
    'eigenvalue': [4.5, 3.2, 1.5]
})

result_relation = tercenCtx.add_namespace(result_data)
result_relation = tercenCtx.as_relation(result_relation)

# Save relation
tercenCtx.save_relation(result_relation)
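
Conceptually, a PCA produces tables at different granularities that share a key. The sketch below mimics that shape in plain Python with made-up numbers. It does not use the Tercen relation API, and the table names, column names, and the helper `attach_variance` are hypothetical.

```python
# Hypothetical PCA outputs, keyed differently:
# scores are per row (.ri), loadings are per column (.ci).
scores = [
    {".ri": 0, "component": "PC1", "score": 1.2},
    {".ri": 1, "component": "PC1", "score": -0.7},
]
loadings = [
    {".ci": 0, "component": "PC1", "loading": 0.8},
    {".ci": 1, "component": "PC1", "loading": 0.6},
]
# Per-component table, keyed by component name.
variance = {"PC1": 0.45}


def attach_variance(records, variance):
    """Link each record to the per-component table via the shared key."""
    return [dict(r, variance_explained=variance[r["component"]]) for r in records]


scores_linked = attach_variance(scores, variance)
loadings_linked = attach_variance(loadings, variance)
```

Neither table fits a single per-cell, per-row, or per-column output on its own, which is the situation relations are meant to address.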

13.5 Advanced Input Patterns

13.5.1 Reading Project Files

Sometimes operators need to access additional files stored in the same project:

# Get workflow and project information
workflow <- ctx$context$client$workflowService$get(ctx$context$workflowId)
project_id <- ctx$schema$projectId

# Find project files
project_files <- ctx$client$projectDocumentService$findProjectObjectsByFolderAndName(
  c(project_id, "\ufff0", "\ufff0"),
  c(project_id, "", ""),
  useFactory = FALSE,
  limit = 25000
)

# Find specific file
target_file <- "config.csv"
file_names <- sapply(project_files, function(f) f$name)
file_index <- which(grepl(target_file, file_names, fixed = TRUE))[1]

if (!is.na(file_index)) {
  pf <- project_files[[file_index]]
  
  # Download and read file
  response <- ctx$context$client$fileService$download(pf$id)
  file_content <- response$read()
  
  # Process as needed
  if (is.raw(file_content)) {
    file_content <- rawToChar(file_content)
  }
  
  # Use file content in analysis...
}

# Get project information
project_id = tercenCtx.schema.projectId

# Find project files
project_files = tercenCtx.client.projectDocumentService.findProjectObjectsByFolderAndName(
    [project_id, "\ufff0", "\ufff0"],
    [project_id, "", ""], 
    useFactory=False, 
    limit=25000
)

# Find specific file
target_file = 'config.csv'
fnames = [f.name for f in project_files]
matching_files = [i for i, name in enumerate(fnames) if target_file in name]

if matching_files:
    pf = project_files[matching_files[0]]
    
    # Download and read file
    resp = tercenCtx.context.client.fileService.download(pf.id)
    file_content = resp.read()
    
    # Process as needed
    if isinstance(file_content, bytes):
        file_content = file_content.decode('utf-8')
    
    # Use file content in analysis...
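
Once the content has been downloaded, it still has to be parsed. A minimal sketch for a CSV file, using only the standard library, might look like the following. The helper name `parse_csv_bytes`, the column names `parameter` and `value`, and the sample content are all made up for illustration.

```python
import csv
import io


def parse_csv_bytes(content, encoding="utf-8"):
    """Decode downloaded bytes (or pass text through) and parse CSV rows."""
    text = content.decode(encoding) if isinstance(content, bytes) else content
    return list(csv.DictReader(io.StringIO(text)))


# Made-up file content standing in for the downloaded config.csv.
file_content = b"parameter,value\nthreshold,0.05\nmethod,median\n"
config_rows = parse_csv_bytes(file_content)

# Values stay strings; convert them explicitly where a number is expected.
config = {row["parameter"]: row["value"] for row in config_rows}
threshold = float(config["threshold"])
```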

Best Practice

Avoid manual file retrieval when possible. Instead, include files directly in the workflow input projection for better reproducibility and user experience.

13.6 Next Steps

With these input and output patterns mastered, you can:

Create Complex Operators: Combine multiple patterns for sophisticated analyses
Handle Edge Cases: Build robust operators that gracefully handle data issues
Optimize Performance: Use efficient data processing techniques
Integrate with Workflows: Design operators that work seamlessly in Tercen pipelines

The next chapter covers continuous integration and deployment strategies for your operators.
