9 Developing a Python Operator
We will now see how to develop a simple operator in Tercen. Building an operator involves the following steps:
- Design the operator
- Set up the GitHub repository
- Set up the input projection
- Connect to Tercen
- Develop and test
- Manage input settings
- Manage dependencies
- Deployment
Development Workflow
[OPTIONAL] Step 1: Create a New Git Repository
Start by creating a new Git repository for your Python operator. You can use the template Python operator repository as a starting point. You can either fork the repository or create a new one based on the template.
Step 2: Open VS Code Server
Open your Tercen Studio development environment and access the VS Code Server by navigating to http://127.0.0.1:8443 in your web browser.
[OPTIONAL] Step 3: Clone the Repository
In VS Code Server, open the Command Palette (Ctrl+Shift+P or Cmd+Shift+P) and search for the “Clone from GitHub” command. Provide the URL of your newly created Git repository and choose a location to clone it into.
If you were not able to create a GitHub repository, you can clone the template repository directly. You will be able to experiment with the API and follow this tutorial, but you won’t be able to push changes or install the operator.
Step 4: Set Up Environment and Install Core Requirements
Open a terminal in VS Code Server by clicking on the terminal icon in the lower left corner. Navigate to your cloned repository directory using the cd command. Install the core requirements by running the following command:
Step 5: Develop Your Operator with a Real-Life Example
Start developing your operator by creating a Python script, for example main.py, in the cloned repository directory and pasting in the following code:
from tercen.client import context as ctx
import numpy as np

tercenCtx = ctx.TercenContext()

# Select relevant columns as a Polars DataFrame and compute the mean per cell
df = (
    tercenCtx
    .select(['.y', '.ci', '.ri'], df_lib="polars")
    .groupby(['.ci', '.ri'])
    .mean()
    .rename({".y": "mean"})
)

# Add namespace and save the computed mean per cell
df = tercenCtx.add_namespace(df)
tercenCtx.save(df)
Let’s break down the code step by step to understand its functionality:
The first two lines import the necessary modules. tercen.client.context provides the Tercen context for interacting with the environment, while numpy is a popular library for numerical computations in Python (this minimal example imports it but does not call it).
The line tercenCtx = ctx.TercenContext() creates an instance of the TercenContext class. This context facilitates interaction with the Tercen environment, including data access and operations.
df = (
    tercenCtx
    .select(['.y', '.ci', '.ri'], df_lib="polars")
    .groupby(['.ci', '.ri'])
    .mean()
    .rename({".y": "mean"})
)
This section performs a series of operations on the data:
- .select(['.y', '.ci', '.ri'], df_lib="polars"): Selects the columns '.y', '.ci', and '.ri' from the data. The df_lib parameter is set to "polars", so the data is returned as a Polars DataFrame.
- .groupby(['.ci', '.ri']): Groups the data by the columns '.ci' (column index) and '.ri' (row index). Recent Polars releases rename this method to group_by, so the spelling that works depends on the installed Polars version.
- .mean(): Computes the mean value for each group.
- .rename({".y": "mean"}): Renames the column '.y' to "mean" to reflect that it now contains the computed mean values.
The result is a Polars DataFrame named df containing the computed mean per cell.
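To make the per-cell mean concrete without a Tercen connection, the following sketch reproduces the same grouping on a small hand-made table using only the Python standard library. The sample rows and the function name mean_per_cell are illustrative assumptions and are not part of the Tercen API; the operator itself relies on Polars for this step.

```python
from collections import defaultdict
from statistics import mean


def mean_per_cell(rows):
    """Group rows by (.ci, .ri) and average .y within each group.

    Hypothetical helper for illustration only; it mimics what the
    select / groupby / mean / rename chain computes.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row[".ci"], row[".ri"])].append(row[".y"])
    # One output row per cell, with the averaged value stored under "mean".
    return [
        {".ci": ci, ".ri": ri, "mean": mean(values)}
        for (ci, ri), values in sorted(groups.items())
    ]


# Made-up input: two observations in cell (0, 0), one in cell (1, 0).
rows = [
    {".y": 1.0, ".ci": 0, ".ri": 0},
    {".y": 3.0, ".ci": 0, ".ri": 0},
    {".y": 10.0, ".ci": 1, ".ri": 0},
]
result = mean_per_cell(rows)
```

Each cell of the crosstab, identified by its column and row index, ends up with exactly one row in the output, which is the shape the operator saves back to Tercen.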
The line df = tercenCtx.add_namespace(df) adds a namespace to the DataFrame. This step ensures that a unique, data-step-specific prefix is added to new factors, which avoids duplicate factor names in a workflow.
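The effect of namespacing can be pictured with a small sketch. It assumes that the prefix is joined to the factor name with a dot and that system columns starting with a dot (such as .ci and .ri) are left unchanged; both points, the function name prefix_new_factors, and the namespace value "ds1" are assumptions made for illustration, and the real add_namespace method may behave differently.

```python
def prefix_new_factors(columns, namespace):
    """Prefix user-defined column names with a namespace.

    Hypothetical stand-in for the idea behind add_namespace: columns
    starting with "." are treated as system columns and kept as-is.
    """
    return [
        name if name.startswith(".") else f"{namespace}.{name}"
        for name in columns
    ]


# "ds1" is a made-up namespace value used only for this example.
renamed = prefix_new_factors([".ci", ".ri", "mean"], "ds1")
```

With a prefix of this kind, two data steps that both produce a factor called mean can coexist in one workflow without a name clash.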
Finally, the computed DataFrame is saved using the save method of the TercenContext. This action makes the calculated mean per cell available for use within the Tercen environment.
Step 6: Generate Requirements
If your operator requires additional Python packages, you can generate the requirements.txt file using the following command:
[OPTIONAL] Step 7: Push Changes to GitHub
Commit your changes to your local Git repository and push the changes to GitHub. This will trigger the Continuous Integration (CI) GitHub workflow, which performs automated tests on your operator.
[OPTIONAL] Step 8: Tag the Repository
Once you are satisfied with your operator’s development and testing, you can tag your repository. Tagging will trigger the Release GitHub workflow, which will create a release for your operator.
Conclusion
Congratulations! You have successfully developed and deployed a Python operator for Tercen. By following these steps, you can create custom data processing operators to extend the functionality of Tercen and streamline your data analysis workflows. Remember to consult the Tercen documentation for more details and advanced features. Happy coding!