Workflow Design Series · Part 3 of 4

Workflow Design: Error Handling, Polling & Resilience

What separates a workflow that works in a demo from one that holds up in production — intentional outcome handling, pre-flight state acquisition, error handler discipline, and polling patterns that don't lie about completion.

Author: vrodocs
Published: June 2026
Version: 1.0

The Green Lie

A completed workflow is not a successful workflow.

That distinction sounds obvious until you've watched a DNS decommission job finish green, closed the execution window, and moved on, only to discover days later that two record types were never deleted. The workflow did exactly what it was told to do. It deleted the records it knew about, reported success, and stopped. Nobody told it to check for the others.

This is the most dangerous failure mode in vRO: silent success. A loud failure leaves evidence. A red element, a stack trace, an alert: something that tells you to go look. Silent success leaves nothing. The workflow completed. The log is clean. The infrastructure is wrong.

The gap is almost never bad logic. It's incomplete logic, an assumption baked into the workflow that the environment will always look the way it did when the workflow was written. In production, that assumption fails constantly.

Fail Loud or Fail Intentionally: There Is No Third Option

Every outcome a workflow can encounter falls into one of two categories.

The first category is handled: the outcome was anticipated, a decision was made about what to do with it, and that decision is encoded in the workflow. Maybe it fails with a clear error message. Maybe it logs a warning and continues. Maybe it pauses and waits for a signal. The specific response doesn't matter. What matters is that it was a choice.

The second category is unhandled: the outcome was not anticipated, or it was anticipated and quietly ignored. This is where partial deletes live. This is where a null that should have thrown an error gets passed downstream until something finally breaks three elements later, far from the actual problem.

There is no third category. "It probably won't happen" is not a design choice; it's an unhandled case with optimistic framing.

Production-ready workflows are ones where every outcome that matters has been moved from the second category into the first. That work is never fully done, but the discipline of doing it is what separates automation you can trust from automation you have to babysit.

Pre-Flight: Acquire Before You Act

The DNS decommission pattern is worth examining in detail because the lesson generalizes to nearly every destructive workflow.

The naive implementation looks reasonable: query the hostname, delete the record, done. It works in testing. The record you built the workflow around disappears. Green.

The production problem is that DNS environments accumulate record types. A hostname might have an A record, a PTR record, a CNAME, and an alias, each managed separately, each requiring a distinct delete operation. Some DNS platforms require the IP address to locate and delete a PTR record, not just the hostname. A short name and a fully qualified name may resolve to separate records that both need to go.

The workflow that deletes by hostname alone leaves the others behind. It doesn't fail. It just doesn't finish the job.

⚠ The Silent Remainder

A partial delete that returns green is harder to detect and recover from than a clean failure. You don't know what was left behind until something downstream breaks, or until you go looking.

The fix is a pre-flight phase: before touching anything, acquire the complete state of what you're about to act on. Query all record types associated with the target. Resolve both the short name and the FQDN. Capture the IP address. Build a complete picture of what needs to be deleted, then delete each piece explicitly with the right identifier for that record type.

This reframes the lookup phase from optional overhead into the contract that makes the operation safe. You don't know what you're deleting until you've looked. And if the lookup reveals something unexpected, such as a record type you didn't account for or a hostname that resolves to multiple IPs, that's the moment to fail loud, not after you've deleted half of it.

The General Pattern

Query the full current state of the target resource before any mutation
Validate that what you found matches what you expected, and fail if it doesn't
Identify every discrete action required to complete the operation
Execute each action explicitly, with appropriate error handling on each one

Skip the pre-flight and you're operating on assumptions. In production, assumptions have a cost.
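The four steps of the general pattern can be sketched in plain JavaScript. Everything in this sketch is an assumption made for illustration: `buildDeletePlan` is a hypothetical helper, `lookupRecords` stands in for whatever DNS query action your platform provides, and the record shape (`type`, `name`, `ip`) and the list of expected types are invented for the example.

```javascript
// Hypothetical pre-flight sketch. lookupRecords(name) is an assumed stand-in
// for a DNS query action; it returns an array of { type, name, ip } objects.
var EXPECTED_TYPES = ["A", "PTR", "CNAME"]; // assumption: types this workflow can delete

function buildDeletePlan(shortName, domain, lookupRecords) {
    var fqdn = shortName + "." + domain;

    // 1. Acquire full state: query both name formats before any mutation.
    var found = lookupRecords(shortName).concat(lookupRecords(fqdn));

    // 2. Validate: an empty result or an unknown record type is a stop condition.
    if (found.length === 0) {
        throw "[PreFlight] No records found for " + shortName + " or " + fqdn;
    }
    var plan = [];
    var ips = {};
    for (var i = 0; i < found.length; i++) {
        var rec = found[i];
        if (EXPECTED_TYPES.indexOf(rec.type) === -1) {
            throw "[PreFlight] Unexpected record type " + rec.type + " on " + rec.name;
        }
        if (rec.type === "A") {
            ips[rec.ip] = true;
        }
        // 3. Identify each discrete action with the identifier that type needs.
        //    Assumption: PTR deletes are keyed by IP, everything else by name.
        plan.push({
            type: rec.type,
            identifier: rec.type === "PTR" ? rec.ip : rec.name
        });
    }
    if (Object.keys(ips).length > 1) {
        throw "[PreFlight] " + fqdn + " resolves to multiple IPs: " +
              Object.keys(ips).join(", ");
    }
    return plan; // 4. The caller executes each entry explicitly, one at a time.
}
```

Nothing is deleted inside this function. It either returns a complete plan or throws before any mutation has happened, which is the property the pre-flight phase exists to provide.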

Error Handlers Are Architecture, Not Afterthoughts

Every element in a vRO workflow that touches an external system needs an error handler connected. Not most of them. All of them.

REST calls fail. vCenter operations fail. Config element lookups return null. DNS queries time out. The question is never whether these things will happen; it's whether your workflow has a plan when they do.

An error handler in vRO is a schema-level connection: the error path out of an element routes to a separate element instead of terminating the workflow in an undefined state. What that element does is your decision: log and continue, notify and stop, attempt a retry, route to an Error End with a meaningful message. The important thing is that the decision exists and is visible in the diagram.

Throw Discipline

Inside scripts, the throw pattern determines what shows up in the execution log when something goes wrong. The [TaskName] prefix on every log and throw is the mechanism that makes a workflow log readable after a failure. Without it, a workflow with thirty elements produces a wall of messages with no indication of where in the execution each one came from.

JavaScript
// Descriptive throw: states exactly what failed and why
throw "[DeleteDNSRecord] PTR delete failed for IP " + ipAddress + ": " + responseBody;

// Separate example: log, then re-throw so the element's error handler fires
try {
    var result = System.getModule("com.company.dns").DeletePTRRecord(ipAddress);
    if (!result) {
        throw "DeletePTRRecord returned no result for: " + ipAddress;
    }
} catch (e) {
    // Prefix once here so both the log line and the thrown value carry [TaskName]
    var message = "[DeleteDNSRecord] Failed: " + e;
    System.error(message);
    throw message;
}
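One way to keep the prefix consistent is to build it once per element instead of retyping it in every message. The sketch below is a hypothetical helper, not a vRO built-in: `makeTaskLogger` and its `sink` parameter are names introduced here. The assumption is that in a scriptable task the sink would be the `System` object, using its `log`, `warn`, and `error` methods.

```javascript
// Hypothetical helper: binds a task name once so every message carries it.
// sink is any object with log/warn/error methods (assumed: System in vRO).
function makeTaskLogger(taskName, sink) {
    var prefix = "[" + taskName + "] ";
    return {
        log:   function (msg) { sink.log(prefix + msg); },
        warn:  function (msg) { sink.warn(prefix + msg); },
        error: function (msg) { sink.error(prefix + msg); },
        // Logs at error level, then throws the same prefixed string so the
        // element's error path receives the identical message.
        fail:  function (msg) {
            sink.error(prefix + msg);
            throw prefix + msg;
        }
    };
}

// Usage inside a scriptable task (assumes the vRO System object):
//   var task = makeTaskLogger("DeleteDNSRecord", System);
//   task.fail("PTR delete failed for IP " + ipAddress);
```

Passing the sink in as a parameter also makes the helper testable outside the orchestrator, since any object with the same three methods will do.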

Terminal vs. Recoverable

Two categories of errors require different handling decisions.

Terminal errors mean the operation cannot proceed and attempting to continue would leave infrastructure in a worse state than stopping. A pre-flight query that returns no record when one was expected. An auth failure on the first REST call. These should throw immediately, route to an Error End, and produce a clear message. Do not attempt to work around them.

Recoverable conditions are ones where the operation can continue with a modified approach, or where the condition is expected and has a known response. A record type that wasn't found during a delete sweep (it may have already been removed). A retry-eligible timeout on a non-critical lookup. These can be logged as warnings and allowed to continue, but the decision to continue must be explicit, not a silent pass.
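A delete sweep is a natural place to encode that split. The sketch below assumes a hypothetical `deleteRecord` function that returns an HTTP-style status code, a hypothetical `warn` callback, and plan entries shaped as `{ type, identifier }`. Treating 404 as recoverable and every other non-2xx code as terminal is an illustrative choice, not a rule.

```javascript
// Hypothetical sweep. deleteRecord(entry) is an assumed stand-in that returns
// an HTTP-style status code; warn is any function that records a warning.
function sweepDeletes(plan, deleteRecord, warn) {
    var deleted = 0;
    var skipped = 0;
    for (var i = 0; i < plan.length; i++) {
        var entry = plan[i];
        var code = deleteRecord(entry);

        if (code >= 200 && code < 300) {
            deleted++;
        } else if (code === 404) {
            // Recoverable by explicit decision: "already gone" is acceptable here.
            warn("[DeleteSweep] " + entry.type + " " + entry.identifier +
                 " not found; treating as already removed");
            skipped++;
        } else {
            // Terminal: auth failures, server errors, anything unclassified.
            throw "[DeleteSweep] " + entry.type + " delete failed for " +
                  entry.identifier + " with status " + code;
        }
    }
    return { deleted: deleted, skipped: skipped };
}
```

The final `else` is the important branch: any status the author did not classify lands in the terminal path by default, so a new failure mode stops the workflow instead of passing silently.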

ℹ The Distinction Is Yours to Make

Whether a given condition is terminal or recoverable is environment-specific. What matters is that the decision is made and encoded, not left to default behavior.

Intentional Failure: When the Right Answer Is to Stop

Not every error condition should be recovered from. Some should stop the workflow on purpose.

Consider a disk resize workflow that checks for snapshots before extending a volume. In a development environment, a snapshot is probably fine to work around: log it and proceed anyway, or remove it as part of the operation. In production, a snapshot is a different signal entirely. It may indicate a pending change window, an open incident, or a backup in progress. Extending a disk while a snapshot exists in production is a decision that should require human judgment, not automation.

The correct response is an intentional failure: detect the snapshot, stop the workflow, and surface a clear message that tells the operator exactly why it stopped and what to do next.

JavaScript
if (snapshotCount > 0) {
    throw "[CheckSnapshot] VM " + vmName + " has " + snapshotCount +
          " snapshot(s). Disk extension requires snapshot removal in this environment. " +
          "Resolve snapshots and re-run.";
}

This is not a bug. It is a feature. The workflow refusing to proceed is the correct behavior, protecting the environment from a well-intentioned operation that could cause problems in context.

Design Options for Stop Conditions

Pattern	Behavior	When to Use
Fail and alert	Stop the workflow, surface the condition, require human intervention	Condition indicates something that should never be automated past
Fail with override	Stop by default; accept an explicit input flag to proceed	Condition is usually a stop signal but occasionally needs a deliberate bypass
Log and continue	Record the condition, proceed without stopping	Condition is genuinely advisory and proceeding is always safe

The choice between these is an architectural decision specific to your environment. It cannot be made generically. What matters is that the decision is made, documented, encoded, and not left to default behavior.
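The "fail with override" row is the one most easily implemented loosely. A minimal sketch, assuming a hypothetical boolean workflow input passed in as `allowOverride` and a hypothetical `warn` callback: the default path stops, and any bypass is logged so it stays visible in the execution record.

```javascript
// Hypothetical "fail with override" check. allowOverride would be a boolean
// workflow input; warn is any function that records a warning.
function checkSnapshots(vmName, snapshotCount, allowOverride, warn) {
    if (snapshotCount === 0) {
        return "clear";
    }
    // Only a literal true bypasses the stop; undefined, null, or "true" do not.
    if (allowOverride === true) {
        warn("[CheckSnapshot] Override accepted: proceeding on " + vmName +
             " with " + snapshotCount + " snapshot(s)");
        return "overridden";
    }
    throw "[CheckSnapshot] VM " + vmName + " has " + snapshotCount +
          " snapshot(s). Resolve snapshots and re-run, or set the override input.";
}
```

The strict comparison is deliberate. An override that activates on any truthy value can be triggered by a mis-bound input, which turns an explicit bypass back into a silent one.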

✓ Workflows That Fail on Purpose Are Workflows You Can Trust

If you know a workflow will stop when it encounters a condition it wasn't designed to handle, you can deploy it with confidence. Workflows that silently proceed past unexpected conditions are the ones that produce infrastructure surprises at 2am.

Polling Patterns

Some operations complete synchronously. Most meaningful ones don't.

A REST delete that returns 202 Accepted is telling you the operation was received, not that it finished. A vCenter reconfigure task returns a task object that you have to watch. A DNS record propagation may require a confirmation query after the delete call returns success. Treating any of these as complete when the initial response arrives is the same class of mistake as not checking record types in the DNS example: an optimistic assumption about what "done" means.

Canonical Polling Loop

JavaScript
var MAX_POLLS = 24;    // 24 × 5s = 2 minutes max
var POLL_MS   = 5000;
var polls     = 0;
var done      = false;

while (!done && polls < MAX_POLLS) {
    System.sleep(POLL_MS);
    polls++;

    var status = System.getModule("com.company.platform").CheckOperationStatus(statusUrl);

    if (status === "Succeeded") {
        done = true;
    } else if (status === "Failed") {
        throw "[PollOperationStatus] Operation failed after " + polls +
              " poll(s). URL: " + statusUrl;
    }
    // "InProgress" — continue looping
}

if (!done) {
    throw "[PollOperationStatus] Timed out after " + polls +
          " poll(s). URL: " + statusUrl;
}

Three Things This Pattern Enforces

Maximum iterations. An unbounded polling loop is a workflow that can run forever if the remote system stalls. Set a ceiling and throw explicitly when you hit it, rather than letting the workflow hang.
Explicit failure on terminal status. If the operation reports failed, stop immediately with a meaningful message. Don't keep polling.
Timeout throw with context. When you hit the poll ceiling, the throw needs enough to diagnose the situation: the operation URL, how many polls were attempted. A bare "timed out" message is not actionable.
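The ceiling is easier to reason about when it is derived from a time budget instead of being chosen directly. A small sketch, with `pollCeiling` as a hypothetical helper name introduced here:

```javascript
// Hypothetical helper: derive the iteration ceiling from a timeout budget.
function pollCeiling(timeoutMs, pollMs) {
    if (!(timeoutMs > 0) || !(pollMs > 0)) {
        throw "[PollCeiling] timeoutMs and pollMs must be positive numbers";
    }
    // Round up so the loop never gives up before the budget is spent.
    return Math.ceil(timeoutMs / pollMs);
}

// A two-minute budget at a five-second interval gives 120000 / 5000 = 24 polls.
var MAX_POLLS_2MIN = pollCeiling(2 * 60 * 1000, 5000);
```

Stating the budget in time units keeps the intent readable when someone later changes the interval, because the ceiling moves with it.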

ℹ Capture the Status URL Before You Poll

For async REST operations that return a status URL in response headers (common in Azure ARM), store that URL before entering the polling loop. It's the only way to check actual operation status, and if you don't capture it from the initial response, it's gone.
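A minimal sketch of that capture step, assuming the response headers are available as a plain object keyed by header name. `extractStatusUrl` is a hypothetical helper, and requiring a 202 status is an assumption of this example. The `Azure-AsyncOperation` and `Location` header names are the ones Azure Resource Manager uses for asynchronous operations; other platforms will differ.

```javascript
// Hypothetical helper: pull the status URL out of an async (202) response.
// headers is assumed to be a plain object; names are matched without regard
// to case because HTTP header names are case-insensitive.
function extractStatusUrl(statusCode, headers) {
    if (statusCode !== 202) {
        throw "[CaptureStatusUrl] Expected 202 Accepted, got " + statusCode;
    }
    var wanted = ["azure-asyncoperation", "location"]; // checked in this order
    var lowered = {};
    for (var name in headers) {
        if (Object.prototype.hasOwnProperty.call(headers, name)) {
            lowered[name.toLowerCase()] = headers[name];
        }
    }
    for (var i = 0; i < wanted.length; i++) {
        if (lowered[wanted[i]]) {
            return lowered[wanted[i]];
        }
    }
    // No URL means there is nothing to poll: fail before the loop, not inside it.
    throw "[CaptureStatusUrl] 202 response carried no status URL header";
}
```

Throwing when the header is missing keeps the failure next to its cause. A polling loop that starts with an empty URL fails later and with a far less useful message.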

Sleep vs. Timer Element

System.sleep() in a polling loop is appropriate for operations measured in seconds or minutes, where holding the workflow thread while waiting is an acceptable cost. For operations measured in days, use the native Timer element in the workflow schema instead. The Timer element suspends the execution entirely and resumes it at the scheduled time, freeing the thread rather than blocking it indefinitely.

Production Readiness Checklist

The gap between a demo workflow and a production workflow is not complexity; it's coverage. Every item below is a question to ask about an existing workflow before calling it production-ready.

Pre-Flight

Does the workflow acquire the complete current state of the target before acting?
Are all variants of the target resource accounted for: record types, name formats, associated objects?
Does the workflow validate that what it found matches what it expected before proceeding?

Error Handling

Is an error handler connected on every element that touches an external system?
Do all throw statements include a [TaskName] prefix and enough context to diagnose the failure?
Are terminal errors distinguished from recoverable conditions, with explicit handling for each?

Intentional Failure Conditions

Has every condition that should stop the workflow been identified and encoded?
For conditions that require a judgment call, is the decision documented and not left to default behavior?
If an override path exists, does it require an explicit input flag, not a silent fallback?

Polling

Do all async operations poll for actual completion rather than trusting the initial response?
Does every polling loop have a maximum iteration count and an explicit timeout throw?
For day-scale waits, is the Timer element used instead of System.sleep()?

Log Quality

Can you read the execution log of a failed run and identify within seconds which element failed and why?
Does every System.log, System.warn, and System.error call include a task-level prefix?

Key Takeaways

Silent success is a failure mode. A green execution is not evidence that the operation completed correctly, only that the workflow didn't encounter anything it was told to treat as an error.
Acquire before you act. Pre-flight state acquisition is the contract that makes destructive operations safe. Query everything, validate it, then proceed with precision.
Every outcome gets an explicit decision. Handled or unhandled: there is no middle category. Unhandled conditions are gaps, not acceptable defaults.
Some workflows should fail on purpose. Detecting a condition and stopping is correct behavior when proceeding would require human judgment. Encode the stop, document the reason, and optionally provide an override path.
Polling requires a ceiling. Async operations need polling loops with maximum iterations, explicit failure on terminal status, and timeout throws with diagnostic context.
Log quality is a production requirement. A workflow log that doesn't tell you where and why it failed in the first thirty seconds of reading is not production-ready.

ℹ Up Next

Part 4 covers composition and building a library: the shift from writing individual automations to building a platform where workflows call workflows and actions form a reusable foundation.