Designing Durable Agentic Workflows with Temporal

Most agentic systems start the same way. You have a trigger — a webhook, a schedule, an event — and you want an agent to run in response. The first version is usually a cron job or a webhook handler that calls an LLM and hopes for the best.

That's fine for demos. It falls apart the moment you need retries, observability, or anything that runs longer than thirty seconds.

Why cron is not enough

Cron is stateless. It fires, it forgets. If your agent times out mid-run, cron doesn't know. If the downstream API is rate-limited, cron will hammer it again on the next tick. If you want to know what your agent was doing three days ago when something went wrong, your logs better be perfect.

The problems compound fast once you add:

Long-running tasks. Agents that call APIs, wait for approvals, or loop until a condition is met.
Branching. One trigger kicks off multiple sub-agents with different timeouts.
Human-in-the-loop. Pause, wait for a user to approve, resume. This can take hours.
Retry logic. Exponential backoff with jitter, max attempts, dead-letter queues.

Cron handles none of these gracefully. You end up re-implementing Temporal by hand, badly.

What Temporal gives you

Temporal turns your workflow into a persistent, resumable function. The workflow can:

Sleep for days without holding a thread.
Call activities (side effects — API calls, DB writes) with automatic retry.
Signal-pause and wait for an external event.
Fork child workflows and wait on their completion.
Time out gracefully with a compensating action.

The key insight is that Temporal's event history is the source of truth. If your worker crashes mid-workflow, Temporal replays the event history on the next healthy worker and picks up exactly where it left off. No state machine you maintain manually, no hand-rolled saga pattern.
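
To make the replay idea concrete, here is a toy model in TypeScript. It is not Temporal's implementation or API: ToyRuntime, step, and toyWorkflow are hypothetical names invented for illustration. The sketch assumes each side effect's result is appended to a history array, and that a restarted run reads recorded results back instead of executing the effect again.

```typescript
// Toy model of history-based replay. Hypothetical names throughout;
// this is not how the Temporal SDK is implemented or called.
type HistoryEvent = { step: string; result: number };

class ToyRuntime {
  history: HistoryEvent[];
  executed: string[] = []; // steps whose side effect actually ran in this process
  cursor = 0;

  constructor(history: HistoryEvent[]) {
    this.history = history;
  }

  // Return the recorded result if one exists; otherwise run the effect and record it.
  step(name: string, effect: () => number): number {
    if (this.cursor < this.history.length) {
      const recorded = this.history[this.cursor];
      this.cursor += 1;
      return recorded.result;
    }
    const result = effect();
    this.history.push({ step: name, result });
    this.cursor += 1;
    this.executed.push(name);
    return result;
  }
}

function toyWorkflow(rt: ToyRuntime): number {
  const fetched = rt.step("fetch", () => 20);
  const enriched = rt.step("enrich", () => fetched + 1);
  return rt.step("double", () => enriched * 2);
}

// Pretend a previous worker crashed after completing two steps.
const savedHistory: HistoryEvent[] = [
  { step: "fetch", result: 20 },
  { step: "enrich", result: 21 },
];

const runtime = new ToyRuntime(savedHistory);
const replayOutput = toyWorkflow(runtime);
console.log(replayOutput, runtime.executed);
```

In this model only the third step runs its side effect on the new worker; the first two are answered from history, which is the sense in which a workflow "picks up where it left off".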

How we use it in Lasius

In Lasius, every connector trigger maps to a Temporal workflow. Here's the pattern:

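// Simplified connector workflow (Temporal TypeScript SDK)
import * as workflow from "@temporalio/workflow";
// Assumed module layout: activity implementations and input types live alongside this file
import type * as connectorActivities from "./activities";
import type { ConnectorInput } from "./types";

// Activities are called through a proxy; the timeout value is illustrative
const activities = workflow.proxyActivities<typeof connectorActivities>({
  startToCloseTimeout: "5 minutes",
});

// Signal sent by an external caller once a human approves
export const approvalSignal = workflow.defineSignal("approval");

export async function connectorWorkflow(input: ConnectorInput) {
  // Flipped by the signal handler; workflow.condition re-checks it after each signal
  let approved = false;
  workflow.setHandler(approvalSignal, () => {
    approved = true;
  });

  // Step 1: Validate and enrich the trigger payload
  const context = await activities.buildContext(input);

  // Step 2: Call the MCP gateway; this is the agent's tool call
  const result = await activities.callMcpGateway({
    agentId: context.agentId,
    tool: context.tool,
    args: context.args,
  });

  // Step 3: If the result requires human approval, pause until the signal arrives.
  // workflow.condition takes the timeout as its second argument and resolves to
  // false if 72 hours pass first; handling that case is omitted here.
  if (result.requiresApproval) {
    await workflow.condition(() => approved, "72h");
  }

  // Step 4: Persist the outcome
  await activities.persistResult(result);
}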

The callMcpGateway activity is retried automatically with exponential backoff under Temporal's default retry policy. The workflow.condition pause can wait up to 72 hours for a human approval with zero threads blocked. If a worker crashes between steps, Temporal replays the history and resumes.
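
To see what "exponential backoff" means in terms of actual wait times, here is a minimal sketch that computes the delay before each retry. The field names mirror the options on Temporal's retry policy (initialInterval, backoffCoefficient, maximumInterval, maximumAttempts), but RetryPolicySketch and retryDelaysMs are hypothetical helpers, intervals are expressed in milliseconds for simplicity, and the values are illustrative rather than taken from Lasius.

```typescript
// Sketch of the delay schedule an exponential backoff policy produces.
// Hypothetical helper; values are illustrative.
interface RetryPolicySketch {
  initialIntervalMs: number;  // wait before the first retry
  backoffCoefficient: number; // multiplier applied after each failed attempt
  maximumIntervalMs: number;  // cap on any single wait
  maximumAttempts: number;    // total attempts, including the first try
}

function retryDelaysMs(policy: RetryPolicySketch): number[] {
  const delays: number[] = [];
  // maximumAttempts counts the initial try, so there are maximumAttempts - 1 retries.
  for (let retry = 0; retry < policy.maximumAttempts - 1; retry++) {
    const uncapped = policy.initialIntervalMs * Math.pow(policy.backoffCoefficient, retry);
    delays.push(Math.min(uncapped, policy.maximumIntervalMs));
  }
  return delays;
}

const gatewayDelays = retryDelaysMs({
  initialIntervalMs: 1000,
  backoffCoefficient: 2,
  maximumIntervalMs: 10000,
  maximumAttempts: 6,
});
console.log(gatewayDelays);
```

With these numbers the waits double from one second up to the ten-second cap, which is what keeps a rate-limited downstream API from being hammered on every attempt.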

The polling pattern

One trigger type we use constantly is polling — checking an external API on an interval until something changes. Temporal makes this clean:

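import * as workflow from "@temporalio/workflow";
// Assumed module layout: activities and input types live alongside this file
import type * as pollActivities from "./activities";
import type { PollInput } from "./types";

// The timeout value is illustrative
const activities = workflow.proxyActivities<typeof pollActivities>({
  startToCloseTimeout: "1 minute",
});

export async function pollUntilDone(input: PollInput) {
  let done = false;

  while (!done) {
    // Each check runs as an activity, so its result is recorded in history
    const status = await activities.checkStatus(input.resourceId);
    done = status === "complete";

    if (!done) {
      // Durable timer: no worker thread is held while waiting
      await workflow.sleep("30s");
    }
  }

  await activities.onComplete(input.resourceId);
}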

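This looks like a simple loop. Under the hood, Temporal records each activity result and each timer in the workflow's event history. If the worker crashes during checkStatus, the activity times out and Temporal retries it according to its retry policy. If the worker crashes during workflow.sleep, the timer keeps running on the Temporal server, and the workflow resumes on the next available worker when it fires. The whole thing is durable without you touching a database.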
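
One way to picture why a sleep survives a crash: the wait is stored as data (a start time and a duration) rather than held in a running thread. The sketch below is a toy model with hypothetical names (TimerRecord, remainingMs), not Temporal's internals; it assumes the timer's start time was persisted before the crash.

```typescript
// Toy model: a timer persisted as data instead of a blocked thread.
interface TimerRecord {
  startedAtMs: number; // when the sleep began, as recorded in durable storage
  durationMs: number;  // requested sleep length
}

// How long a freshly started worker still has to wait for this timer.
function remainingMs(timer: TimerRecord, nowMs: number): number {
  const firesAtMs = timer.startedAtMs + timer.durationMs;
  return Math.max(0, firesAtMs - nowMs);
}

const pollTimer: TimerRecord = { startedAtMs: 0, durationMs: 30_000 };

// The worker crashes 12 s into the sleep and a new one picks up at t = 12 s.
const afterCrash = remainingMs(pollTimer, 12_000);

// A worker that comes back late finds the timer already due.
const afterLongOutage = remainingMs(pollTimer, 45_000);

console.log(afterCrash, afterLongOutage);
```

Because the deadline is derived from recorded data, a replacement worker neither restarts the full 30 seconds nor loses the wait.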

What breaks

Temporal is not free. The things that bite you:

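Non-determinism. Workflow code must be deterministic across replays. No network calls, no file I/O, no real clock or unseeded randomness inside the workflow function, and no unversioned changes to workflow logic while executions are in flight. The TypeScript SDK's workflow sandbox replaces Math.random() and Date.now() with deterministic versions, but other SDKs do not, so treat both as off-limits by habit. All side effects go in activities.
Activity timeouts. Pick them deliberately. Temporal does not pick a start-to-close timeout for you; the SDK requires you to set it or a schedule-to-close timeout. An activity that can run for five minutes needs a start-to-close timeout of at least five minutes, not a ten-second value carried over from a quick example.
History size. Long-running workflows with tight loops can outgrow the event history limit. Use continueAsNew to start a fresh run with an empty history.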
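
For the history-size item, a back-of-the-envelope estimate helps decide when to call continueAsNew. The sketch below assumes roughly 11 history events per polling iteration (activity scheduled, started, and completed; timer started and fired; and the workflow task events in between) and a budget of 51,200 events, the figure Temporal's documentation gives for the hard limit at the time of writing. Both numbers are assumptions to check against your own workflow histories and cluster configuration, and iterationsBeforeLimit is a hypothetical helper.

```typescript
// Rough estimate of how long a polling loop can run before its history fills up.
// All inputs are assumptions; measure real histories before relying on them.
function iterationsBeforeLimit(eventLimit: number, eventsPerIteration: number): number {
  return Math.floor(eventLimit / eventsPerIteration);
}

const eventLimit = 51_200;      // assumed history event limit
const eventsPerIteration = 11;  // assumed events per checkStatus + sleep cycle
const pollIntervalSeconds = 30; // matches the 30s sleep in the polling loop

const maxIterations = iterationsBeforeLimit(eventLimit, eventsPerIteration);
const hoursUntilFull = (maxIterations * pollIntervalSeconds) / 3600;

console.log(maxIterations, hoursUntilFull.toFixed(1));
```

Under these assumptions a 30-second poller fills its history in under two days, so a loop expected to run longer should hand off to a fresh run well before that point.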

The non-determinism requirement is the one that catches people. Write your workflows as if they might be replayed from the start of their history at any moment, because they will be.
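
A small illustration of why non-determinism breaks replay. This is a toy model with hypothetical names (planCommands, firstDivergence, and the skipGateway command), not the SDK's actual check: it assumes replay compares the commands the code issues against the commands recorded in history and fails at the first mismatch.

```typescript
// Toy model of a replay determinism check; hypothetical names, not SDK code.
// A "workflow" here is just a function that returns the commands it would issue.
function planCommands(coinFlip: number): string[] {
  const commands = ["buildContext"];
  // Branching on an uncontrolled value is the bug being demonstrated.
  if (coinFlip < 0.5) {
    commands.push("callMcpGateway");
  } else {
    commands.push("skipGateway");
  }
  commands.push("persistResult");
  return commands;
}

// Index of the first command that differs, or -1 if the replay matches.
function firstDivergence(recorded: string[], replayed: string[]): number {
  const length = Math.max(recorded.length, replayed.length);
  for (let i = 0; i < length; i++) {
    if (recorded[i] !== replayed[i]) return i;
  }
  return -1;
}

// The original run saw 0.3; the replay on another worker sees 0.8.
const recordedRun = planCommands(0.3);
const divergentReplay = planCommands(0.8);
// Same input again, as if the value had been read back from history.
const stableReplay = planCommands(0.3);

const mismatchAt = firstDivergence(recordedRun, divergentReplay);
const matchResult = firstDivergence(recordedRun, stableReplay);

console.log(mismatchAt, matchResult);
```

The divergent replay disagrees with history at the second command, while the stable one matches throughout. Moving the uncontrolled choice into an activity makes its value part of history, so every replay takes the same branch.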

The right mental model

Think of a Temporal workflow as a reliable, pausable, inspectable process that just happens to be distributed. Every step is logged. Every retry is recorded. You can query any workflow's state at any time. You can signal it, cancel it, or let it finish on its own schedule.

For agentic work — where the "function" might span hours, call a dozen APIs, and need to survive three infrastructure restarts — this is the right primitive.

Cron is a scheduler. Temporal is an operating system for your workflows. Once you've used it for agent orchestration, going back feels like writing assembly.