Building Resilient Agents: Advanced Error Handling and Recovery Patterns

In the world of workflow automation, the rise of agentic systems promises a new era of autonomy and efficiency. These AI-driven agents can orchestrate complex tasks, interact with multiple systems, and make decisions to achieve a goal. But with great power comes a great challenge: What happens when things go wrong?

An agent booking a travel itinerary might successfully reserve a flight but fail to secure the hotel due to an API timeout. A data processing agent might update one database but fail on a second, leaving your system in a dangerously inconsistent state. Without a robust strategy for handling failures, your autonomous agents can quickly create more problems than they solve.

True resilience isn't about hoping errors won't happen; it's about designing a system that expects them and knows how to recover gracefully. This is where the concept of atomic actions becomes the critical foundation for building reliable agentic workflows.

The Bedrock of Reliability: action.do

Before we can build complex recovery patterns, we must start with a reliable building block. On the .do platform, this fundamental unit is the action.do.

An atomic action is the smallest, indivisible unit of work: it either completes successfully or fails entirely, leaving no trace. A single action therefore never leaves you stuck in a partial state. Think of it like a database transaction: the entire operation succeeds, or it's rolled back as if it never happened.

By encapsulating single tasks—like sending an email, updating a CRM record, or calling a third-party API—as atomic actions, you create a set of predictable and dependable tools for your agent.

This principle of atomicity, enforced by the platform, is the first and most important step in error handling.
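
To make the all-or-nothing idea concrete, here is a minimal in-memory sketch. It is only an illustration of the principle, not a description of how the .do platform enforces atomicity, and the runAtomically helper and crmRecord object are hypothetical names invented for this example. It assumes the state is plain JSON-serializable data.

```typescript
// Illustrative in-memory sketch only; not how the .do platform is implemented.
// Assumes `state` is plain JSON-serializable data.
function runAtomically<S>(state: S, work: (draft: S) => void): S {
  // Deep-copy first so a failure cannot leave the original half-updated.
  const draft = JSON.parse(JSON.stringify(state)) as S;
  work(draft);   // if this throws, `state` is untouched
  return draft;  // success: the caller swaps in the fully updated copy
}

// Example: a two-field update that fails halfway through.
const crmRecord = { status: 'trial', seats: 1 };
try {
  runAtomically(crmRecord, (draft) => {
    draft.status = 'paid';
    throw new Error('seat update failed'); // simulated mid-action failure
  });
} catch {
  // crmRecord still reads { status: 'trial', seats: 1 }
}
```

The work runs against a copy, so the caller only ever sees the old state or the fully updated one.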

Beyond Basic 'Try/Catch': State-Aware Recovery

A simple try/catch block around an API call is the first line of defense, and it's essential. Our SDKs make this straightforward:

import { Do } from '@do-platform/sdk';

// Initialize the .do client with your API key
const client = new Do({ apiKey: 'YOUR_API_KEY' });

// Execute a predefined atomic action by name
async function sendWelcomeEmail(userId: string) {
  try {
    const result = await client.action.execute({
      name: 'send-welcome-email',
      params: {
        recipientId: userId,
        template: 'new-user-welcome-v1'
      }
    });
    console.log('Action Executed Successfully:', result.id);
    return result;
  } catch (error) {
    console.error('Action Failed:', error);
    // Now what? The system state might be wrong.
    // Rethrow so the caller sees the failure instead of an undefined result.
    throw error;
  }
}

But in a multi-step agentic workflow, catching an error is just the beginning. The real challenge is knowing how to recover the system's state. This requires more advanced patterns built on top of your atomic actions.

Advanced Recovery Patterns for Agentic Workflows

Let's explore three powerful patterns you can implement using the building blocks provided by action.do.

1. The Compensating Action

Problem: A workflow involves several steps (A, B, C). Steps A and B succeed, but C fails. How do you undo the effects of A and B to return the system to its original state?

Solution: For every action that makes a meaningful change to a system, you can define a corresponding "compensating action." This is an atomic action whose sole purpose is to reverse the effect of another.

Example:

Action: reserve-rental-car
Compensating Action: cancel-rental-car-reservation

An agent tasked with "book full travel itinerary" might execute a series of actions:

book-flight (Succeeds)
reserve-rental-car (Succeeds)
book-hotel (Fails)

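Upon detecting the failure of book-hotel, the agent's recovery logic would invoke the compensating actions in reverse order: cancel-rental-car-reservation and then cancel-flight. Because each of these is an atomic action.do, each cancellation either applies fully or not at all, so the rollback itself cannot leave a half-cancelled booking. A compensating action can still fail, though, so the recovery logic should record any compensation that does not go through and escalate it rather than assume a consistent state.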
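
The itinerary flow above can be sketched roughly as follows. This is a minimal illustration, not the platform's own mechanism: the Step and WorkflowResult types and the runWithCompensation function are hypothetical, and the execute parameter stands in for whatever actually runs an action (for instance, a thin wrapper around the SDK call shown earlier). For brevity the compensation receives the same params as the original action; a real cancellation would usually need an identifier returned by the booking.

```typescript
// Hypothetical types for illustration; none of these come from the .do SDK.
type ExecuteAction = (name: string, params: Record<string, unknown>) => Promise<unknown>;

interface Step {
  action: string;                  // e.g. 'book-flight'
  compensation?: string;           // e.g. 'cancel-flight'
  params: Record<string, unknown>;
}

interface WorkflowResult {
  ok: boolean;
  failedAction?: string;
  compensated: string[];           // compensations that succeeded, in the order run
  compensationFailures: string[];  // compensations that themselves failed
}

async function runWithCompensation(
  steps: Step[],
  execute: ExecuteAction,
): Promise<WorkflowResult> {
  const completed: Step[] = [];
  for (const step of steps) {
    try {
      await execute(step.action, step.params);
      completed.push(step);
    } catch {
      // Undo the completed steps, most recent first.
      const compensated: string[] = [];
      const compensationFailures: string[] = [];
      for (const done of [...completed].reverse()) {
        if (!done.compensation) continue;
        try {
          await execute(done.compensation, done.params);
          compensated.push(done.compensation);
        } catch {
          // Keep going, but record the failure so it can be escalated.
          compensationFailures.push(done.compensation);
        }
      }
      return { ok: false, failedAction: step.action, compensated, compensationFailures };
    }
  }
  return { ok: true, compensated: [], compensationFailures: [] };
}
```

Given the three itinerary steps and an executor whose book-hotel call throws, the result has ok set to false, failedAction set to 'book-hotel', and compensated equal to ['cancel-rental-car-reservation', 'cancel-flight'].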

2. Retry with Exponential Backoff

Problem: A transient error occurs. The external service was temporarily unavailable, a network connection blipped, or a rate limit was momentarily hit. Trying again immediately might just lead to another failure and put more stress on the struggling system.

Solution: Configure a retry policy, ideally with exponential backoff. This pattern retries the failed action after a waiting period, doubling the wait time after each subsequent failure up to a maximum number of retries.

Instead of cluttering your agent's core logic with complex setTimeout loops, you can define this behavior as a policy on the action itself within the .do platform. This is a core tenet of Business-as-Code: the recovery policy is declared alongside the business logic, making it transparent and manageable.

Example: The send-welcome-email action might fail because the email provider's API is briefly down.

Attempt 1: Fails.
Wait: 2 seconds.
Attempt 2: Fails.
Wait: 4 seconds.
Attempt 3: Succeeds.

The agent experiences a slight delay, but the workflow ultimately completes successfully without manual intervention.
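
This article doesn't show the platform's declarative policy syntax, so the sketch below illustrates only the underlying algorithm in plain TypeScript. The retryWithBackoff function and its RetryOptions are hypothetical names for this example, not SDK APIs. The sleep function is injectable so the timing can be tested without real waits.

```typescript
// Hypothetical helper for illustration; not part of the .do SDK.
interface RetryOptions {
  maxAttempts: number;  // total attempts, including the first one
  baseDelayMs: number;  // wait before the second attempt
  sleep?: (ms: number) => Promise<void>;  // injectable for tests
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  task: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const { maxAttempts, baseDelayMs, sleep = defaultSleep } = options;
  if (maxAttempts < 1) throw new RangeError('maxAttempts must be at least 1');

  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts) break;  // out of retries
      // Wait baseDelayMs, then 2x, then 4x, and so on.
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
  throw lastError;
}
```

With baseDelayMs set to 2000 and a task that fails twice before succeeding, the waits are 2000 ms and then 4000 ms, matching the timeline above. Production implementations commonly add random jitter to each wait so that many clients do not retry in lockstep.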

3. The Circuit Breaker Pattern

Problem: An external service is down for an extended period. Continuously retrying the action will waste resources, flood logs with errors, and potentially prevent other, healthy workflows from running.

Solution: Implement a circuit breaker. Like the electrical circuit breaker in your house, it cuts off the flow when something is wrong. It moves between three states:

Closed: The circuit is closed, and requests (actions) flow normally.
Open: After a configured number of consecutive failures, the circuit "trips" and moves to the open state. All subsequent calls to that action fail immediately without even being attempted. This gives the downstream system time to recover.
Half-Open: After a timeout period, the circuit moves to a half-open state. It allows a single "probe" request to go through. If it succeeds, the circuit closes. If it fails, the circuit opens again, restarting the timeout.

This pattern is crucial for building robust agents that can survive external system outages without bringing your own platform down.
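
The three states described above can be sketched as a small class. This is a minimal, single-process illustration under stated assumptions: the CircuitBreaker class is a hypothetical name for this example rather than an SDK or platform feature, and the clock is injectable so the timeout can be tested without waiting.

```typescript
// Hypothetical class for illustration; not part of the .do SDK.
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private state: BreakerState = 'closed';
  private consecutiveFailures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number,
    resetTimeoutMs: number,
    now: () => number = () => Date.now(),  // injectable clock for tests
  ) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  async call<T>(task: () => Promise<T>): Promise<T> {
    if (this.state === 'half-open') {
      // Only one probe at a time is allowed through.
      throw new Error('Circuit half-open: probe already in flight');
    }
    let isProbe = false;
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('Circuit open: call rejected without being attempted');
      }
      this.state = 'half-open';  // timeout elapsed: let this call through as the probe
      isProbe = true;
    }
    try {
      const result = await task();
      this.state = 'closed';
      this.consecutiveFailures = 0;
      return result;
    } catch (error) {
      this.consecutiveFailures++;
      if (isProbe || this.consecutiveFailures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();  // (re)start the timeout
      }
      throw error;
    }
  }
}
```

A breaker created with new CircuitBreaker(2, 1000) opens after two consecutive failures, rejects calls for the next second without running them, and then lets a single probe through to decide whether to close again.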

From Building Blocks to Resilient Services

By defining your operations as atomic action.do units and applying these recovery patterns, you transform fragile scripts into a resilient system. These actions become the trusted building blocks for higher-level service.do workflows.

A service is a business outcome—like "Onboard New Customer"—composed of an orchestrated sequence of atomic actions, complete with its own logic for compensation, retries, and decision-making.

Building resilience isn't an afterthought; it's a design principle. By starting with atomic actions and layering compensation, retries, and circuit breakers on top of them, you can empower your agentic workflows to execute tasks with the precision, reliability, and intelligence required for modern business automation.

Frequently Asked Questions (FAQs)

What is an 'atomic action' in the context of .do?
An atomic action is the smallest, indivisible unit of work within a workflow. It represents a single, specific task—like sending an email or updating a database record—that either completes successfully or fails entirely, ensuring system reliability and preventing partial states.

How do actions differ from services?
An action (action.do) is a single, granular operation. A service (service.do) is a higher-level business capability composed of one or more actions orchestrated into a workflow. Actions are the building blocks; services are the valuable outcomes.

Can I create my own custom actions?
Yes. The .do platform empowers you to define your own custom actions using Business-as-Code. You can encapsulate any business logic, external API call, or script into a reusable, versioned, and callable action for your agentic workflows.

How are actions invoked?
Actions are invoked programmatically through the .do API or our language-specific SDKs. You simply call the action by its unique name and provide the necessary parameters, allowing for seamless integration into any application or system.

What happens when an action fails?
Because actions are atomic, a failed action does not leave a partial change behind. The platform provides detailed error logging and allows you to configure automated retries, notifications, or compensating actions.
