Handling Errors and Cleanup in Temporal Workflows (TypeScript)

A guide to error handling and cleanup in Temporal workflows, covering the Saga pattern, retry policies, and cleanup triggers on failure.


https://github.com/robinbraemer

Developing robust Temporal workflows involves anticipating failures and ensuring that any necessary cleanup or compensating actions occur regardless of how a workflow or activity fails. This includes handling activity exceptions, workflow errors, timeouts, and cancellations. Below, we outline best practices for error handling in Temporal (TypeScript), including the Saga pattern for compensating transactions, use of retry policies, and how to trigger cleanup activities on failure.

Common Failure Scenarios in Temporal

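Temporal workflows can fail in several ways, each requiring proper handling: an activity can throw an exception or time out, workflow code can throw an error, the workflow itself can time out, and the workflow can be cancelled or terminated.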

Understanding these scenarios allows us to design workflows that catch failures and ensure a cleanup Activity or compensating transaction runs in all cases.

Compensating Transactions and the Saga Pattern

For workflows that perform multiple steps (especially across external systems) and need all-or-nothing semantics, use the Saga pattern (compensating transactions). In Temporal, you implement Sagas by pairing each forward operation with a corresponding compensation operation that can undo it (Saga Compensating Transactions | Temporal). If any step fails, previously completed steps are rolled back by invoking their compensations in reverse order, leaving the overall system in a consistent state (as if the workflow's side effects never happened).

Temporal's TypeScript SDK does not have a built-in Saga helper (unlike the Java SDK’s Saga class), but it's straightforward to implement:

  1. Perform each activity and record its compensation: After each successful activity, record a compensation function (e.g., an Activity that reverses that step) in a list. For example, if you create a record in one activity, record a compensating activity to delete that record.

  2. On failure, run compensations in reverse order: In the Workflow’s catch block, iterate through the recorded compensation functions and call them (typically in reverse order of the original operations). Each compensation should be designed to safely handle the case where the original operation might not have fully completed (idempotent or conditional undo logic).

  3. Optionally, handle compensation failures: If a compensation action itself fails, log or handle it appropriately (Temporal will by default retry activities, so a failed compensation activity can be retried as well). The workflow should still attempt all compensations even if one of them fails.

TypeScript Workflow example using Saga pattern:

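import { CancellationScope, proxyActivities } from '@temporalio/workflow';
import type * as activities from '../activities'; // assumed path to the activity implementations

// Activities must be called through a proxy so that Temporal schedules them.
// The timeout value here is an assumption; pick one that fits your activities.
const { createOrder, deleteOrder, reserveInventory, releaseInventory, chargePayment, refundPayment } =
  proxyActivities<typeof activities>({ startToCloseTimeout: '1 minute' });

type Compensation = () => Promise<void>;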

export async function OrderWorkflow(): Promise<void> {
  const compensations: Compensation[] = [];
  try {
    // Step 1: Perform operation and record its compensation
    await createOrder(); 
    compensations.unshift(async () => { await deleteOrder(); });  // compensation for step 1

    // Step 2: Perform next operation
    await reserveInventory();
    compensations.unshift(async () => { await releaseInventory(); }); // compensation for step 2

    // Step 3: Perform another operation
    await chargePayment();
    compensations.unshift(async () => { await refundPayment(); });    // compensation for step 3

    // ... more steps as needed ...

  } catch (err) {
    // If any step fails, execute compensating actions for completed steps
    await CancellationScope.nonCancellable(async () => {
      for (const compensate of compensations) {
        try {
          await compensate();
        } catch (compErr) {
          console.error("Compensation failed:", compErr);
          // continue to next compensation even if one fails
        }
      }
    });
    throw err;  // rethrow to mark workflow as failed after compensation
  }
}

In the above example, each successful step prepends a compensating function to the compensations list with unshift, so the list behaves like a stack: iterating it from the front runs the most recent compensation first. If a failure occurs, we execute all collected compensations. We wrap the compensation loop in a non-cancellable scope to ensure it runs to completion even if the workflow was cancelled (more on this below). This pattern ensures that resources created or actions taken in earlier steps are undone when a later step fails (Saga Compensating Transactions | Temporal). Temporal’s documentation provides a similar example of collecting compensation callbacks in TypeScript (Saga Compensating Transactions | Temporal).

Best practices for Saga compensations:

  • Order and Idempotency: Invoke compensations in the reverse order of the original actions (LIFO order) since the latest action should be undone first. Make each compensation action idempotent or safe to run even if the original step partially failed or was never performed. For example, a “delete resource” compensation should succeed (or do nothing) even if the resource didn’t exist, as shown by functions like putBowlAwayIfPresent in Temporal's saga example (Saga Compensating Transactions | Temporal).

  • Marking Non-Retryable Errors: If a failure at a certain step is not transient (e.g., a business logic validation), consider throwing it as a non-retryable error (using ApplicationFailure.nonRetryable) from the activity (Temporal Error Handling In Practice). This prevents Temporal from retrying the activity endlessly and instead fails fast, triggering the compensation logic.

  • Handling Compensation Failures: Design compensating activities with their own retry policies – you generally want to retry them on failure as well (since cleanup is crucial). Even if a compensation ultimately fails, the Workflow should catch that error (as in the example above) and continue attempting the remaining compensations. Log these failures for visibility. The Workflow can still be marked failed after compensation, or you might choose to swallow the original error if you consider the saga completion a “graceful” outcome.

  • Consistency Consideration: In rare cases, an original activity might complete after its compensation has run, due to timing issues (for example, a delayed activity attempt finishing after the Workflow already assumed it failed) (Support for Saga compensating transactions in Typescript - Community Support - Temporal). To guard against this, ensure that activities are designed to have no effect if they are cancelled or if a compensating action was performed. This might involve application-level checks (e.g., the activity writes data with a version or token that the compensation invalidates) (Support for Saga compensating transactions in Typescript - Community Support - Temporal). While such race conditions are uncommon, being aware of them is part of Temporal best practices for absolutely reliable transactions.
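The idempotency advice above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory `orders` map standing in for an external order service, and the names `createOrder` and `deleteOrderIfPresent` are illustrative only; a real compensation would be an Activity calling the actual service.

```typescript
// Hypothetical in-memory stand-in for an external order service (an assumption for illustration).
const orders = new Map<string, { status: string }>();

// Forward step: written as an upsert, so a retried attempt leaves the same state.
async function createOrder(orderId: string): Promise<void> {
  orders.set(orderId, { status: 'created' });
}

// Compensation: deleting an order that is already gone counts as success,
// so it is safe to retry and safe to run even if createOrder never took effect.
async function deleteOrderIfPresent(orderId: string): Promise<boolean> {
  // Map.delete returns true if an entry was removed and false if none existed.
  return orders.delete(orderId);
}
```

Because the second call simply reports that nothing was removed instead of throwing, a retried or duplicated compensation does not turn a successful rollback into a failure.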

Triggering Cleanup Activities on Failure

Not all failure scenarios require a full saga with multiple compensating steps. Often, you just need to run a single cleanup activity at the end of a workflow to release resources (delete temporary files, send a compensating notification, etc.) if the workflow fails. Temporal workflows can use standard try/catch/finally logic to ensure cleanup runs:

  • Try/Catch in the Workflow: Wrap your workflow logic in a try { ... } catch (err) { ... }. In the catch block, call a cleanup activity. This catch will execute for any unhandled exception in the try block, whether it's an activity failure or an error thrown by workflow code. For example:

      import { CancellationScope, isCancellation, proxyActivities } from '@temporalio/workflow';
      import type * as activities from '../activities';  // type-only import: workflow code must not load activity implementations
    
      // Proxy the activities, including the cleanup activity (the timeout value is an assumption)
      const { processFile, otherStep, cleanupTempFiles } = proxyActivities<typeof activities>({
        startToCloseTimeout: '1 minute',
      });
    
      export async function FileProcessingWorkflow(input: string): Promise<void> {
        try {
          await processFile(input);  // main activity that might fail
          await otherStep(input);
          // ... normal workflow logic ...
        } catch (err) {
          // Determine failure type and perform cleanup
          if (isCancellation(err)) {
            // If workflow was cancelled, ensure cleanup runs in a non-cancellable scope
            await CancellationScope.nonCancellable(async () => {
              await cleanupTempFiles(input);
            });
          } else {
            // Non-cancellation failure (activity throw or other error)
            await cleanupTempFiles(input);
          }
          throw err;  // rethrow to fail the workflow after cleanup
        }
      }
    

    In this example, if any activity throws an error or the workflow is cancelled, the catch block triggers the cleanupTempFiles activity. We use CancellationScope.nonCancellable when the error is a cancellation to shield the cleanup step from being cancelled (Interrupt a Workflow - TypeScript SDK | Temporal Platform Documentation). This is important because when a workflow cancellation is requested, the root cancellation scope of the workflow is cancelled, which would normally cancel all subsequent activity invocations. Running the cleanup in a non-cancellable scope ensures the cleanup activity is started and completed even if the workflow was cancelled mid-execution (How to define cleanup for workflow cancellations (typescript-sdk) - Community Support - Temporal). Temporal's documentation shows this pattern: detecting a cancellation with isCancellation(error) and then running the cleanup logic in a non-cancellable scope (Interrupt a Workflow - TypeScript SDK | Temporal Platform Documentation).

  • Using Finally: You can also use a finally block to schedule cleanup logic that should run regardless of success or failure. However, be mindful that if the workflow is cancelled, you still need the non-cancellable scope trick. Often, the catch approach with rethrow (as above) is sufficient, since you typically only want to clean up in failure scenarios. If you need to do something on both success and failure, you can call it in finally and still handle cancellation as shown.

  • Cleanup Activity Implementation: The cleanup itself is just another activity. It can be a simple call to delete files, rollback a database change, send a compensating event, etc. Ensure the cleanup activity is idempotent or safe to run multiple times, since in failure scenarios Temporal might retry it if it fails, or a workflow might be retried/restarted and attempt the cleanup again. For example, attempting to delete a file that’s already deleted should not error. This makes the cleanup robust.

  • Workflow Cancellation vs Termination: Always prefer cancellation over termination when you want workflows to do cleanup. Cancellation triggers the cancellation exception inside the workflow (which you can catch as shown above to run compensations/cleanup) (How to define cleanup for workflow cancellations (typescript-sdk) - Community Support - Temporal). In contrast, a termination (or an execution timeout expiring) will immediately stop the workflow without giving it a chance to handle the event (Recommended way for running cleanup activity on workflow timeout - Community Support - Temporal). Thus, for scenarios where you might externally stop a workflow and still need cleanup, use WorkflowHandle.cancel() rather than WorkflowHandle.terminate(). The Temporal team specifically advises using workflow cancellation (which is graceful) or internal timers for timeouts, instead of hard timeouts that the workflow cannot intercept (Recommended way for running cleanup activity on workflow timeout - Community Support - Temporal).
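As a rough sketch of "cancel first, terminate only as a last resort", the helper below requests cancellation, waits a grace period for the workflow to close, and falls back to termination only if it has not. `stopGracefully`, `StoppableHandle`, and the grace period are hypothetical names introduced here; the interface lists only the `cancel()`, `terminate()`, and `result()` methods that the Temporal client's WorkflowHandle provides, so a real handle should fit it structurally.

```typescript
// Minimal shape of the handle methods used below (an assumption for illustration).
interface StoppableHandle {
  cancel(): Promise<unknown>;
  terminate(reason?: string): Promise<unknown>;
  result(): Promise<unknown>;
}

type StopOutcome = 'closed-after-cancel' | 'terminated';

// Hypothetical helper: request cancellation so the workflow can run its cleanup,
// and terminate only if it has not closed within `graceMs` milliseconds.
async function stopGracefully(handle: StoppableHandle, graceMs: number): Promise<StopOutcome> {
  await handle.cancel();

  // result() rejects when the workflow ends as cancelled or failed; either way it has closed.
  // Note: this sketch does not distinguish transport errors from a closed workflow.
  const closed = handle.result().then(
    () => 'closed' as const,
    () => 'closed' as const,
  );

  const graceExpired = new Promise<'expired'>((resolve) => {
    const timer = setTimeout(() => resolve('expired'), graceMs);
    // Stop the timer as soon as the workflow closes.
    closed.then(() => clearTimeout(timer));
  });

  const winner = await Promise.race([closed, graceExpired]);
  if (winner === 'closed') return 'closed-after-cancel';

  // Last resort: termination skips all remaining workflow code, including cleanup.
  await handle.terminate('cleanup did not finish within the grace period');
  return 'terminated';
}
```

The grace period should comfortably exceed the time the cleanup activities need, including their retries; otherwise the fallback defeats the purpose of cancelling.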

Retry Policies and Timeout Handling

Temporal’s built-in retry policies and timeout mechanisms play a role in how you handle failures and trigger cleanups:

  • Activity Retry Policy: By default, Temporal will retry an activity that fails or times out, using an exponential backoff strategy. You can configure the retry policy on each activity (max attempts, interval, etc.) or disable retries. Best practice is to allow retries for transient errors so that the workflow can succeed without human intervention. Only bypass retries for errors that are definitively not recoverable (e.g., validation errors). In those cases, throw an ApplicationFailure.nonRetryable() from the activity (Temporal Error Handling In Practice) so the workflow catches it immediately and can perform compensation. For example, if the chargePayment activity in a saga gets a decline response (a business-rule failure), you might throw a non-retryable failure to trigger an immediate rollback instead of retrying the charge. Conversely, if an activity simply times out due to a network issue, letting Temporal retry it a few times is wise; only after the final failure would the workflow enter the compensation logic.

  • Workflow Retry Policy: You can also configure retries at the workflow level (for the whole workflow run). If a workflow run fails, Temporal Server can automatically start a new run of the workflow (often used for cron or recurring scenarios). However, if you are implementing compensation inside the workflow, you typically will not want the entire workflow retried from scratch on failure, as that could duplicate work. In most cases, leave Workflow retries disabled when using manual compensation logic, or handle idempotency carefully so that a retried workflow run doesn’t repeat side effects in an inconsistent way. (This is a more advanced scenario; often it's simpler to not rely on workflow retries for saga-style workflows and instead handle all cleanup in one run.)

  • Heartbeats for long-running Activities: If an activity performs a lengthy operation, use heartbeats (heartbeat() from @temporalio/activity, or Context.current().heartbeat(), in the activity code) together with a heartbeat timeout. This makes the Temporal server aware that the activity is alive and allows you to cancel it faster if needed. If you cancel a workflow while an activity is running, the activity will only receive the cancellation request on its next heartbeat (How to define cleanup for workflow cancellations (typescript-sdk) - Community Support - Temporal). A well-behaved activity should periodically heartbeat and also handle cancellation internally (e.g., by catching a CancelledFailure in activity code, or by observing Context.current().cancelled or Context.current().cancellationSignal). Proper heartbeat usage ensures timely cancellation and thus timely execution of your cleanup logic in the workflow.

  • Handling Workflow Timeouts Gracefully: As noted, a Workflow Execution Timeout is akin to a kill switch with no chance to run cleanup code. To ensure cleanup runs, do not rely solely on execution timeouts. Instead, enforce the deadline with a timer inside the workflow (e.g., sleep, condition with a timeout, or CancellationScope.withTimeout from @temporalio/workflow). If the timer fires, you can throw a controlled exception or cancel the scope that wraps the main logic, which surfaces as a cancellation error you can catch and handle. Another pattern is to use a parent workflow to start a child with a shorter timeout: if the child workflow times out and fails, the parent workflow catches that failure and can run a cleanup activity or compensation in the parent context (Recommended way for running cleanup activity on workflow timeout - Community Support - Temporal). This “parent-child” approach effectively turns an external timeout into a catchable error in the parent workflow.
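To make the exponential backoff mentioned under Activity Retry Policy concrete, here is a small arithmetic sketch. The `BackoffPolicy` shape and its field names are hypothetical (Temporal's own retry options are named differently and take durations), and the values in `assumedDefaults` (1 second initial interval, coefficient 2, cap at 100 times the initial interval) are stated as assumptions about the default policy; check them against the SDK documentation for your version.

```typescript
// Hypothetical policy shape for illustration; all intervals are in milliseconds.
interface BackoffPolicy {
  initialIntervalMs: number;
  backoffCoefficient: number;
  maximumIntervalMs: number;
}

// Assumed defaults: 1 s initial interval, coefficient 2, cap at 100x the initial interval.
const assumedDefaults: BackoffPolicy = {
  initialIntervalMs: 1000,
  backoffCoefficient: 2,
  maximumIntervalMs: 100000,
};

// Delay before retry number n, where n = 1 is the first retry after the first failed attempt.
function retryDelayMs(policy: BackoffPolicy, n: number): number {
  const uncapped = policy.initialIntervalMs * Math.pow(policy.backoffCoefficient, n - 1);
  return Math.min(uncapped, policy.maximumIntervalMs);
}

// Delays before the first eight retries under the assumed defaults.
const firstEight = Array.from({ length: 8 }, (_, i) => retryDelayMs(assumedDefaults, i + 1));
// firstEight is [1000, 2000, 4000, 8000, 16000, 32000, 64000, 100000]
```

The eighth delay would be 128 seconds uncapped, so it is clamped to the 100-second maximum. Summing these delays gives a feel for how long a failing activity can postpone the workflow's compensation logic when attempts are not limited.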

Putting It All Together: Best Practices

To ensure a cleanup activity runs in all failure cases, combine the above techniques:

  • Always catch errors in the Workflow: Surround your critical workflow steps with a try/catch. In the catch, invoke cleanup or compensations. This covers activity exceptions, activity timeouts (which surface as exceptions), and even cancellation (which throws a CancelledFailure). Use isCancellation(error) to detect cancellations and run cleanup in a non-cancellable scope if needed (Interrupt a Workflow - TypeScript SDK | Temporal Platform Documentation). This guarantees that even if the workflow is cancelled mid-way, your cleanup activity (or compensating actions) will be executed before the workflow truly terminates.

  • Use CancellationScope.nonCancellable for cleanup steps: This is a crucial Temporal API to prevent a cancellation from aborting the cleanup. Any Activities or workflow code inside a non-cancellable scope will ignore cancellation requests from parent scopes (Interrupt a Workflow - TypeScript SDK | Temporal Platform Documentation). In practice, wrap your cleanup activity call or compensation loop in CancellationScope.nonCancellable(...) whenever you call them from a catch/finally. Temporal samples demonstrate this pattern for running cleanup after a cancellation event (Interrupt a Workflow - TypeScript SDK | Temporal Platform Documentation).

  • Leverage Saga pattern for multi-step workflows: If your workflow does several distinct operations that need rollback, structure your code to collect compensations for each step. The example above and Temporal’s saga tutorial code show how to do this in TypeScript (Saga Compensating Transactions | Temporal). This ensures that any failure at any point triggers the appropriate cleanup of all prior successful steps.

  • Prefer workflow cancellation over termination/timeouts: Design your system to request cancellations when you need to stop a workflow early. This gives the workflow a chance to perform its cleanup. Avoid terminating workflows except in truly unrecoverable situations, as it will not run any workflow code thereafter (Recommended way for running cleanup activity on workflow timeout - Community Support - Temporal).

  • Configure retries thoughtfully: Let Temporal handle transient failures with retries, but mark permanent errors as non-retryable to fall out to your compensation logic. Also consider the retry policy on your cleanup activity – you may want it to retry on failure (so the cleanup itself is robust).

  • Consult Temporal documentation: Temporal’s docs and community resources have extensive discussions on failure handling. For example, the Temporal docs on cancellation and scopes provide guidance on using CancellationScope and checking for CancelledFailure (Interrupt a Workflow - TypeScript SDK | Temporal Platform Documentation), and the Temporal blog covers the Saga pattern with examples in multiple languages (Saga Compensating Transactions | Temporal). These can provide additional context and examples.
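The practices above can be combined into one small reusable helper. This is a sketch under stated assumptions, not an SDK feature: `runSaga`, `SagaStep`, and `Shield` are hypothetical names, and the `shield` parameter is a stand-in for whatever protects the rollback from cancellation. Keeping it as a parameter lets the helper be exercised outside a workflow.

```typescript
type Compensation = () => Promise<void>;
// Stand-in for a function that runs `fn` shielded from cancellation.
type Shield = (fn: () => Promise<void>) => Promise<void>;

interface SagaStep {
  name: string;
  run: () => Promise<void>;
  compensate: Compensation;
}

// Runs the steps in order. On failure, runs the compensations of the completed
// steps in reverse order inside `shield`, reports (but does not stop on)
// compensation errors, and finally rethrows the original error.
async function runSaga(
  steps: SagaStep[],
  shield: Shield,
  onCompensationError: (stepName: string, err: unknown) => void,
): Promise<void> {
  const completed: SagaStep[] = [];
  try {
    for (const step of steps) {
      await step.run();
      completed.push(step); // only steps that finished are compensated
    }
  } catch (err) {
    await shield(async () => {
      for (const step of [...completed].reverse()) {
        try {
          await step.compensate();
        } catch (compErr) {
          onCompensationError(step.name, compErr); // keep going with the remaining compensations
        }
      }
    });
    throw err; // the saga still counts as failed after rollback
  }
}
```

Inside a workflow, one would presumably pass `(fn) => CancellationScope.nonCancellable(fn)` as `shield` and have each `run` and `compensate` call a proxied activity; in a unit test, `(fn) => fn()` is enough.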

By following these practices, you can ensure that no matter how a workflow fails – whether an activity crashes, a third-party service call times out, or a cancellation request comes in – your Temporal workflow will reliably execute the necessary cleanup or rollback logic before completing. This leads to more resilient, correct applications that gracefully handle errors in complex long-running processes.

Sources:

  1. Temporal Community Forum – Handling workflow cancellation and ensuring cleanup (How to define cleanup for workflow cancellations (typescript-sdk) - Community Support - Temporal)

  2. Temporal Documentation – Workflow Cancellation and Scopes (TypeScript) (Interrupt a Workflow - TypeScript SDK | Temporal Platform Documentation)

  3. Temporal Blog – Saga Pattern (Compensating Transactions) in TypeScript (Saga Compensating Transactions | Temporal)

  4. Temporal Community Forum – Saga compensations and activity completion considerations (Support for Saga compensating transactions in Typescript - Community Support - Temporal)

  5. Flightcontrol Blog – Temporal Error Handling (error wrapping and non-retryable errors) (Temporal Error Handling In Practice)

  6. Temporal Community Forum – Workflow timeouts vs. cleanup best practices (Recommended way for running cleanup activity on workflow timeout - Community Support - Temporal)