14  Data Input and Output Patterns

This chapter covers patterns for data input and output in Tercen operators. Building on the concepts from previous chapters, it walks through per-cell, per-row, and per-column outputs, multiple output tables, factor variables, file and relation outputs, and reading files stored in a project.

Prerequisites

Before proceeding, ensure you have:
  • Completed the Development Environment Setup chapter
  • Completed the Basic Implementation chapter, which covers core operator concepts
  • A working understanding of Tercen’s projection system

14.1 Understanding Tercen’s Data Structure

Tercen organizes data using a projection system with specific index columns:

Column  Purpose        Usage
.ri     Row index      Identifies specific rows in the data projection
.ci     Column index   Identifies specific columns in the data projection
.y      Data values    The actual measurement or observation values
.x      X-axis values  Independent variable values (when applicable)

These special columns enable flexible data aggregation and output patterns.
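To make the index columns concrete, the sketch below builds a small long-format table in plain Python and pivots it back into the row-by-column grid that .ri and .ci address. It does not connect to Tercen: the values are invented, `to_grid` is a hypothetical helper, and the indices are assumed to be zero-based.

```python
# Toy long-format records, one per cell, using Tercen's index column names.
# Values are made up; indices are assumed to be zero-based.
records = [
    {".ri": 0, ".ci": 0, ".y": 1.0},
    {".ri": 0, ".ci": 1, ".y": 2.0},
    {".ri": 1, ".ci": 0, ".y": 3.0},
    {".ri": 1, ".ci": 1, ".y": 4.0},
    {".ri": 2, ".ci": 1, ".y": 6.0},  # cell (2, 0) has no value
]


def to_grid(rows):
    """Pivot long-format records into a dense grid; missing cells stay None."""
    n_rows = max(r[".ri"] for r in rows) + 1
    n_cols = max(r[".ci"] for r in rows) + 1
    grid = [[None] * n_cols for _ in range(n_rows)]
    for r in rows:
        grid[r[".ri"]][r[".ci"]] = r[".y"]
    return grid


grid = to_grid(records)
for line in grid:
    print(line)
```

Each record is one cell of the grid, so grouping by .ri collapses the grid along its columns and grouping by .ci collapses it along its rows.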

14.2 Basic Output Patterns

14.2.1 Per-Cell Output (Default)

Most operators output one result per cell, maintaining the original data structure:

library(tercen)
library(dplyr)

# Connect to Tercen context
ctx <- tercenCtx()

# Per-cell calculation (e.g., log transformation)
result <- ctx %>%
  select(.ri, .ci, .y) %>%
  mutate(log_value = log(.y + 1)) %>%
  select(.ri, .ci, log_value) %>%
  ctx$addNamespace()

ctx$save(result)

Python equivalent:

from tercen.client import context as ctx
import polars as pl

# Connect to Tercen context
tercenCtx = ctx.TercenContext()

# Per-cell calculation
df = (
    tercenCtx
    .select(['.ri', '.ci', '.y'], df_lib="polars")
    .with_columns([
        (pl.col('.y') + 1).log().alias('log_value')
    ])
    .select(['.ri', '.ci', 'log_value'])
)

df = tercenCtx.add_namespace(df)
tercenCtx.save(df)
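The log(.y + 1) transform is only defined for values greater than -1. The following is a minimal sketch, in plain Python with invented values, of a per-cell transform that returns None for out-of-domain inputs so that the output still has exactly one entry per input cell. `safe_log1p` is a hypothetical helper, not part of the Tercen client.

```python
import math


def safe_log1p(y):
    """Return log(1 + y), or None when y is missing or not greater than -1."""
    if y is None or y <= -1:
        return None
    return math.log1p(y)


# Stand-in for the .ri/.ci/.y table an operator would receive.
cells = [
    {".ri": 0, ".ci": 0, ".y": 0.0},
    {".ri": 0, ".ci": 1, ".y": 9.0},
    {".ri": 1, ".ci": 0, ".y": -2.0},  # outside the domain of log(y + 1)
]

# Per-cell output: the .ri/.ci keys are carried through unchanged.
result = [
    {".ri": c[".ri"], ".ci": c[".ci"], "log_value": safe_log1p(c[".y"])}
    for c in cells
]
```

Keeping the .ri and .ci keys on every output record is what lets the result line up with the original cells.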

14.2.2 Per-Row Output

Aggregate data across columns for each row:

# Calculate statistics per row
row_stats <- ctx %>%
  select(.ri, .ci, .y) %>%
  group_by(.ri) %>%
  summarise(
    mean_value = mean(.y, na.rm = TRUE),
    sd_value = sd(.y, na.rm = TRUE),
    count = n(),
    .groups = "drop"
  ) %>%
  ctx$addNamespace()

ctx$save(row_stats)

Python equivalent:

# Calculate statistics per row
df = (
    tercenCtx
    .select(['.ri', '.ci', '.y'], df_lib="polars")
    .group_by(['.ri'])
    .agg([
        pl.col('.y').mean().alias('mean_value'),
        pl.col('.y').std().alias('sd_value'),
        pl.col('.y').count().alias('count')
    ])
)

df = tercenCtx.add_namespace(df)
tercenCtx.save(df)
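The two listings can disagree on the count when a row contains missing values: dplyr's n() counts every record in the group, while a column-level count() in polars counts only non-null values. The sketch below, in plain Python with invented data, reports both counts and drops missing values before computing the mean and the sample standard deviation. `per_row_stats` and the `n_valid` column are illustrative names, not part of the original operator.

```python
from collections import defaultdict
from statistics import mean, stdev

cells = [
    {".ri": 0, ".ci": 0, ".y": 1.0},
    {".ri": 0, ".ci": 1, ".y": 3.0},
    {".ri": 0, ".ci": 2, ".y": None},  # missing value
    {".ri": 1, ".ci": 0, ".y": 5.0},
]


def per_row_stats(rows):
    """Aggregate .y per .ri, ignoring missing values in mean and sd."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[".ri"]].append(r[".y"])
    out = []
    for ri in sorted(groups):
        values = [v for v in groups[ri] if v is not None]
        out.append({
            ".ri": ri,
            "mean_value": mean(values) if values else None,
            # Sample sd needs at least two values.
            "sd_value": stdev(values) if len(values) > 1 else None,
            "count": len(groups[ri]),  # every record in the group
            "n_valid": len(values),    # non-missing records only
        })
    return out


row_stats = per_row_stats(cells)
```

The output has one record per .ri and no .ci column, which is what marks it as a per-row result.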

14.2.3 Per-Column Output

Aggregate data across rows for each column:

# Calculate statistics per column
col_stats <- ctx %>%
  select(.ri, .ci, .y) %>%
  group_by(.ci) %>%
  summarise(
    median_value = median(.y, na.rm = TRUE),
    q25 = quantile(.y, 0.25, na.rm = TRUE),
    q75 = quantile(.y, 0.75, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  ctx$addNamespace()

ctx$save(col_stats)

Python equivalent:

# Calculate statistics per column
df = (
    tercenCtx
    .select(['.ri', '.ci', '.y'], df_lib="polars")
    .group_by(['.ci'])
    .agg([
        pl.col('.y').median().alias('median_value'),
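        # interpolation='linear' matches the default method of R's quantile()
        pl.col('.y').quantile(0.25, interpolation='linear').alias('q25'),
        pl.col('.y').quantile(0.75, interpolation='linear').alias('q75')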
    ])
)

df = tercenCtx.add_namespace(df)
tercenCtx.save(df)
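Quantiles are sensitive to the interpolation rule. R's quantile() defaults to linear interpolation between order statistics (type 7), whereas polars' quantile expression defaults to interpolation='nearest', so the two listings only agree when the same rule is selected. The sketch below implements the linear rule in plain Python on invented values; `quantile_linear` is a hypothetical helper.

```python
from statistics import quantiles


def quantile_linear(values, q):
    """Quantile by linear interpolation between order statistics (R type 7)."""
    data = sorted(values)
    if not data:
        raise ValueError("quantile of empty data")
    pos = q * (len(data) - 1)        # fractional position in the sorted data
    lo = int(pos)                    # index of the order statistic below
    hi = min(lo + 1, len(data) - 1)  # index of the order statistic above
    frac = pos - lo
    return data[lo] + frac * (data[hi] - data[lo])


column_values = [4.0, 1.0, 3.0, 2.0]
q25 = quantile_linear(column_values, 0.25)
q75 = quantile_linear(column_values, 0.75)

# Cross-check against the standard library's equivalent method.
stdlib_quartiles = quantiles(column_values, n=4, method="inclusive")
```

With four values the 25% position falls between the first and second order statistics, which is exactly the case where a nearest-rank rule and a linear rule give different answers.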

14.3 Advanced Output Patterns

14.3.1 Multiple Output Tables

Some operators need to return multiple related datasets:

# Generate multiple output tables
summary_stats <- ctx %>%
  select(.ri, .ci, .y) %>%
  summarise(
    overall_mean = mean(.y, na.rm = TRUE),
    overall_sd = sd(.y, na.rm = TRUE)
  ) %>%
  ctx$addNamespace()

row_stats <- ctx %>%
  select(.ri, .y) %>%
  group_by(.ri) %>%
  summarise(row_mean = mean(.y, na.rm = TRUE)) %>%
  ctx$addNamespace()

col_stats <- ctx %>%
  select(.ci, .y) %>%
  group_by(.ci) %>%
  summarise(col_mean = mean(.y, na.rm = TRUE)) %>%
  ctx$addNamespace()

# Save multiple tables
ctx$save(list(summary_stats, row_stats, col_stats))

Python equivalent:

# Generate multiple output tables
summary_stats = (
    tercenCtx
    .select(['.y'], df_lib="polars")
    .select([
        pl.col('.y').mean().alias('overall_mean'),
        pl.col('.y').std().alias('overall_sd')
    ])
)

row_stats = (
    tercenCtx
    .select(['.ri', '.y'], df_lib="polars")
    .group_by(['.ri'])
    .agg([pl.col('.y').mean().alias('row_mean')])
)

col_stats = (
    tercenCtx
    .select(['.ci', '.y'], df_lib="polars")
    .group_by(['.ci'])
    .agg([pl.col('.y').mean().alias('col_mean')])
)

# Add namespaces and save
summary_stats = tercenCtx.add_namespace(summary_stats)
row_stats = tercenCtx.add_namespace(row_stats)
col_stats = tercenCtx.add_namespace(col_stats)

tercenCtx.save([summary_stats, row_stats, col_stats])

14.3.2 Working with Factor Variables

When your projection includes factors (categorical variables), incorporate them into your analysis:

# Include factors in analysis
ctx <- tercenCtx()

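# Get the row factors. ctx$rselect() returns one row per .ri, in index order,
# and has no .ri or .ci column, so add a zero-based .ri to join on
factors <- ctx$rselect() %>%
  mutate(.ri = seq_len(n()) - 1L)

result <- ctx %>%
  select(.ri, .ci, .y) %>%
  left_join(factors, by = ".ri") %>%
  # factor_column is a placeholder for one of your row factors
  group_by(.ri, .ci, factor_column) %>%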
  summarise(
    group_mean = mean(.y, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  ctx$addNamespace()

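ctx$save(result)

Python equivalent:

# Include factors in analysis
# rselect() returns the row factors, one row per .ri in index order,
# so add a zero-based .ri column to join on. The Int32 cast assumes that
# select() returns .ri as Int32; adjust it if your client differs.
factors_df = tercenCtx.rselect(df_lib="polars").with_columns(
    pl.int_range(pl.len(), dtype=pl.Int32).alias('.ri')
)

result = (
    tercenCtx
    .select(['.ri', '.ci', '.y'], df_lib="polars")
    .join(factors_df, on='.ri', how='left')
    # 'factor_column' is a placeholder for one of your row factors
    .group_by(['.ri', '.ci', 'factor_column'])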
    .agg([pl.col('.y').mean().alias('group_mean')])
)

result = tercenCtx.add_namespace(result)
tercenCtx.save(result)
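Attaching factors to cells is a lookup by index. The sketch below shows the idea in plain Python with invented data, under the assumption that the row-factor table holds one entry per row index, in index order. The factor names `gene` and `pathway` are hypothetical.

```python
# Row-factor table: one entry per row index, in index order (assumed layout).
row_factors = [
    {"gene": "A", "pathway": "p1"},
    {"gene": "B", "pathway": "p1"},
    {"gene": "C", "pathway": "p2"},
]

cells = [
    {".ri": 0, ".ci": 0, ".y": 1.0},
    {".ri": 1, ".ci": 0, ".y": 3.0},
    {".ri": 2, ".ci": 0, ".y": 10.0},
    {".ri": 0, ".ci": 1, ".y": 2.0},
]

# Attach the factors by looking up each cell's .ri in the factor table.
annotated = [{**c, **row_factors[c[".ri"]]} for c in cells]

# Mean of .y per (.ci, pathway) group.
sums = {}
for c in annotated:
    key = (c[".ci"], c["pathway"])
    total, n = sums.get(key, (0.0, 0))
    sums[key] = (total + c[".y"], n + 1)
group_means = {key: total / n for key, (total, n) in sums.items()}
```

Grouping by a factor rather than by .ri lets several rows that share a factor level contribute to the same summary value.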

14.4 Specialized Output Types

14.4.1 File Output

Tercen operators can generate and output files (plots, reports, data exports) that users can download or view directly in the interface. This is particularly useful for visualization operators, report generators, and data export tools.

File Output Use Cases
  • Plots and Visualizations: PNG, PDF, SVG graphics
  • Reports: HTML, PDF documents with analysis results
  • Data Exports: CSV, Excel files with processed data
  • Configuration Files: JSON, YAML files for downstream tools

See the Patterns for Plot Operators chapter for a detailed tutorial on how to output files in Tercen.

14.4.2 Relations Output

Relations in Tercen describe how tables link and join to one another. They are useful for operators whose results span several tables of different granularity, for example one table keyed by column index and another keyed by a derived identifier.

When to Use Relations - Some examples
  • PCA analysis with loadings and scores
  • Clustering with cluster assignments and centroids
  • Complex statistical models with multiple output components

Key relation functions:
  • as_relation(): Convert data frames to relations
  • left_join_relation(): Join relations together
  • save_relation(): Save relations to Tercen
  • as_join_operator(): Create join operators for complex relationships

# Example: Simple relation output
library(tibble)

# Create a relation with results
result_relation <- tibble(
  component = c("PC1", "PC2", "PC3"),
  variance_explained = c(0.45, 0.32, 0.15),
  eigenvalue = c(4.5, 3.2, 1.5)
) %>%
  ctx$addNamespace() %>%
  as_relation()

# Save relation
ctx$save_relation(result_relation)

Python equivalent:

import polars as pl

# Create a relation with results
result_data = pl.DataFrame({
    'component': ['PC1', 'PC2', 'PC3'],
    'variance_explained': [0.45, 0.32, 0.15],
    'eigenvalue': [4.5, 3.2, 1.5]
})

result_relation = tercenCtx.add_namespace(result_data)
result_relation = tercenCtx.as_relation(result_relation)

# Save relation
tercenCtx.save_relation(result_relation)
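The clustering use case illustrates why linked tables help: assignments have one record per observation, while centroids have one record per cluster. The sketch below is a conceptual illustration in plain Python with invented data. It does not use the Tercen relation API, and `left_join` is a hypothetical helper that only shows how a shared key links the two tables.

```python
# Table 1: one cluster label per column index (one record per observation).
assignments = [
    {".ci": 0, "cluster": "c1"},
    {".ci": 1, "cluster": "c2"},
    {".ci": 2, "cluster": "c1"},
]

# Table 2: one centroid per cluster, a different granularity from Table 1.
centroids = [
    {"cluster": "c1", "centroid_x": 0.5, "centroid_y": 1.5},
    {"cluster": "c2", "centroid_x": 4.0, "centroid_y": 2.0},
]


def left_join(left, right, key):
    """Left-join two lists of dicts on a shared key column."""
    index = {row[key]: row for row in right}
    return [{**row, **index.get(row[key], {})} for row in left]


# Centroid values are repeated only when the two tables are joined.
joined = left_join(assignments, centroids, "cluster")
```

Storing the two tables separately and declaring the join key avoids repeating each centroid once per observation in the saved output.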

14.5 Advanced Input Patterns

14.5.1 Reading Project Files

Sometimes operators need to access additional files stored in the same project:

# Get workflow and project information
workflow <- ctx$context$client$workflowService$get(ctx$context$workflowId)
project_id <- ctx$schema$projectId

# Find project files
project_files <- ctx$client$projectDocumentService$findProjectObjectsByFolderAndName(
  c(project_id, "\ufff0", "\ufff0"),
  c(project_id, "", ""),
  useFactory = FALSE,
  limit = 25000
)

# Find specific file
target_file <- "config.csv"
file_names <- sapply(project_files, function(f) f$name)
# fixed = TRUE matches the name literally instead of as a regular expression
file_index <- which(grepl(target_file, file_names, fixed = TRUE))[1]

if (!is.na(file_index)) {
  pf <- project_files[[file_index]]
  
  # Download and read file
  response <- ctx$context$client$fileService$download(pf$id)
  file_content <- response$read()
  
  # Process as needed
  if (is.raw(file_content)) {
    file_content <- rawToChar(file_content)
  }
  
  # Use file content in analysis...
}

Python equivalent:

# Get project information
project_id = tercenCtx.schema.projectId

# Find project files
project_files = tercenCtx.client.projectDocumentService.findProjectObjectsByFolderAndName(
    [project_id, "\ufff0", "\ufff0"],
    [project_id, "", ""], 
    useFactory=False, 
    limit=25000
)

# Find specific file
target_file = 'config.csv'
fnames = [f.name for f in project_files]
matching_files = [i for i, name in enumerate(fnames) if target_file in name]

if matching_files:
    pf = project_files[matching_files[0]]
    
    # Download and read file
    resp = tercenCtx.context.client.fileService.download(pf.id)
    file_content = resp.read()
    
    # Process as needed
    if isinstance(file_content, bytes):
        file_content = file_content.decode('utf-8')
    
    # Use file content in analysis...

Best Practice

Avoid manual file retrieval when possible. Instead, include files directly in the workflow input projection for better reproducibility and user experience.
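When a project file does have to be read manually, the downloaded content still needs to be parsed before use. The sketch below assumes the file is a small UTF-8 CSV with a header row and `parameter` and `value` columns. The column names, the sample bytes, and the `parse_config` helper are all hypothetical.

```python
import csv
import io


def parse_config(file_content):
    """Decode downloaded bytes if needed and parse them as CSV with a header."""
    if isinstance(file_content, bytes):
        # utf-8-sig also strips a leading byte-order mark if one is present.
        file_content = file_content.decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(file_content))
    return list(reader)


# Stand-in for the bytes returned by a file download.
downloaded = b"parameter,value\nthreshold,0.05\nmethod,median\n"

config_rows = parse_config(downloaded)
settings = {row["parameter"]: row["value"] for row in config_rows}
```

Every value comes back as a string, so numeric settings such as the threshold need an explicit conversion before they are used in a calculation.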

14.6 Next Steps

With these input and output patterns mastered, you can:

  1. Create Complex Operators: Combine multiple patterns for sophisticated analyses
  2. Handle Edge Cases: Build robust operators that gracefully handle data issues
  3. Optimize Performance: Use efficient data processing techniques
  4. Integrate with Workflows: Design operators that work seamlessly in Tercen pipelines

The next chapter covers continuous integration and deployment strategies for your operators.