What separates a workflow that works in a demo from one that holds up in production — intentional outcome handling, pre-flight state acquisition, error handler discipline, and polling patterns that don't lie about completion.
A completed workflow is not a successful workflow.
That distinction sounds obvious until you've watched a DNS decommission job finish green, closed the execution window, and moved on, only to discover days later that two record types were never deleted. The workflow did exactly what it was told to do. It deleted the records it knew about, reported success, and stopped. Nobody told it to check for the others.
This is the most dangerous failure mode in vRO: silent success. A loud failure leaves evidence. A red element, a stack trace, an alert: something that tells you to go look. Silent success leaves nothing. The workflow completed. The log is clean. The infrastructure is wrong.
The gap is almost never bad logic. It's incomplete logic, an assumption baked into the workflow that the environment will always look the way it did when the workflow was written. In production, that assumption fails constantly.
Every outcome a workflow can encounter falls into one of two categories.
The first category is handled: the outcome was anticipated, a decision was made about what to do with it, and that decision is encoded in the workflow. Maybe it fails with a clear error message. Maybe it logs a warning and continues. Maybe it pauses and waits for a signal. The specific response doesn't matter. What matters is that it was a choice.
The second category is unhandled: the outcome was not anticipated, or it was anticipated and quietly ignored. This is where partial deletes live. This is where a null that should have thrown an error gets passed downstream until something finally breaks three elements later, far from the actual problem.
There is no third category. "It probably won't happen" is not a design choice; it's an unhandled case with optimistic framing.
Production-ready workflows are ones where every outcome that matters has been moved from the second category into the first. That work is never fully done, but the discipline of doing it is what separates automation you can trust from automation you have to babysit.
The DNS decommission pattern is worth examining in detail because the lesson generalizes to nearly every destructive workflow.
The naive implementation looks reasonable: query the hostname, delete the record, done. It works in testing. The record you built the workflow around disappears. Green.
The production problem is that DNS environments accumulate record types. A hostname might have an A record, a PTR record, a CNAME, and an alias, each managed separately, each requiring a distinct delete operation. Some DNS platforms require the IP address to locate and delete a PTR record, not just the hostname. A short name and a fully qualified name may resolve to separate records that both need to go.
The workflow that deletes by hostname alone leaves the others behind. It doesn't fail. It just doesn't finish the job.
The fix is a pre-flight phase: before touching anything, acquire the complete state of what you're about to act on. Query all record types associated with the target. Resolve both the short name and the FQDN. Capture the IP address. Build a complete picture of what needs to be deleted, then delete each piece explicitly with the right identifier for that record type.
This reframes the lookup phase from optional overhead into the contract that makes the operation safe. You don't know what you're deleting until you've looked. If the lookup reveals something unexpected (a record type you didn't account for, a hostname that resolves to multiple IPs), that's the moment to fail loud, not after you've deleted half of it.
Skip the pre-flight and you're operating on assumptions. In production, assumptions have a cost.
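A minimal sketch of the pre-flight step, assuming the lookup has already returned a list of record objects. The record shape (`type`, `fqdn`, `ip`), the `IDENTIFIER_FIELD_BY_TYPE` table, and the `buildDeletePlan` name are hypothetical, not part of any DNS plugin. The point is only that the plan is built, and validated, before the first delete call runs.

```javascript
// Hypothetical mapping: which field identifies a record of each type for deletion.
// Adjust to whatever your DNS platform actually requires.
var IDENTIFIER_FIELD_BY_TYPE = { A: "fqdn", CNAME: "fqdn", PTR: "ip" };

// Builds the full list of delete steps, or throws before anything is touched.
function buildDeletePlan(hostname, records) {
    if (!records || records.length === 0) {
        throw "[PreflightDNS] No records found for " + hostname + "; nothing to decommission.";
    }
    var plan = [];
    var problems = [];
    for (var i = 0; i < records.length; i++) {
        var rec = records[i];
        if (!Object.prototype.hasOwnProperty.call(IDENTIFIER_FIELD_BY_TYPE, rec.type)) {
            problems.push("unexpected record type " + rec.type);
            continue;
        }
        var field = IDENTIFIER_FIELD_BY_TYPE[rec.type];
        if (!rec[field]) {
            problems.push(rec.type + " record is missing its " + field);
            continue;
        }
        plan.push({ type: rec.type, identifier: rec[field] });
    }
    if (problems.length > 0) {
        // Fail loud while the environment is still untouched.
        throw "[PreflightDNS] Refusing to delete anything for " + hostname + ": " + problems.join("; ");
    }
    return plan;
}
```

Each entry in the returned plan carries the identifier that its record type needs, so the delete phase never has to guess whether a PTR wants a hostname or an IP.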
Every element in a vRO workflow that touches an external system needs an error handler connected. Not most of them. All of them.
REST calls fail. vCenter operations fail. Config element lookups return null. DNS queries time out. The question is never whether these things will happen; it's whether your workflow has a plan when they do.
An error handler in vRO is a schema-level connection: the error path out of an element routes to a separate element instead of terminating the workflow in an undefined state. What that element does is your decision: log and continue, notify and stop, attempt a retry, route to an Error End with a meaningful message. The important thing is that the decision exists and is visible in the diagram.
Inside scripts, the throw pattern determines what shows up in the execution log when something goes wrong. The [TaskName] prefix on every log and throw is the mechanism that makes a workflow log readable after a failure. Without it, a workflow with thirty elements produces a wall of messages with no indication of where in the execution each one came from.
```javascript
// Descriptive throw: tells you exactly what failed and why
throw "[DeleteDNSRecord] PTR delete failed for IP " + ipAddress + ": " + responseBody;

// Re-throw after logging: fires the element's error handler
try {
    var result = System.getModule("com.company.dns").DeletePTRRecord(ipAddress);
    if (!result) {
        throw "DeletePTRRecord returned null for: " + ipAddress;
    }
} catch (e) {
    System.error("[DeleteDNSRecord] Failed: " + e);
    throw e;
}
```
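One way to keep the prefix consistent is to build it once per element instead of retyping it in every message. The helper below is a hypothetical sketch, not a vRO built-in: `makeTaskLogger` and its `fail` method are invented names, and `sink` is assumed to be any object exposing `log`, `warn`, and `error` (in a scriptable task, that would be `System`).

```javascript
// Hypothetical helper: wraps a log sink so every message carries the task prefix.
function makeTaskLogger(taskName, sink) {
    var prefix = "[" + taskName + "] ";
    return {
        log: function (msg) { sink.log(prefix + msg); },
        warn: function (msg) { sink.warn(prefix + msg); },
        error: function (msg) { sink.error(prefix + msg); },
        // Logs the failure, then throws the same prefixed string
        // so the element's error path fires with a readable message.
        fail: function (msg) {
            sink.error(prefix + msg);
            throw prefix + msg;
        }
    };
}
```

Inside an element this might be used as `var task = makeTaskLogger("DeleteDNSRecord", System);` followed by `task.fail("PTR delete failed for IP " + ipAddress)`, which keeps the log line and the thrown message identical.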
Two categories of errors require different handling decisions.
Terminal errors mean the operation cannot proceed and attempting to continue would leave infrastructure in a worse state than stopping. A pre-flight query that returns no record when one was expected. An auth failure on the first REST call. These should throw immediately, route to an Error End, and produce a clear message. Do not attempt to work around them.
Recoverable conditions are ones where the operation can continue with a modified approach, or the condition is expected and has a known response. A record type that wasn't found during a delete sweep, because it may have already been removed. A retry-eligible timeout on a non-critical lookup. These can be logged as warnings and allowed to continue, but the decision to continue must be explicit, not a silent pass.
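The split between the two categories can be made visible in the script itself. The sketch below is illustrative only: `runDeleteSweep`, the `deleteFn` callback, and the `"DELETED"` / `"NOT_FOUND"` result codes are assumptions standing in for whatever your DNS actions actually return. Each step is assumed to be an object with a `type` and an `identifier`.

```javascript
// Hypothetical sweep: deleteFn(step) is assumed to return "DELETED", "NOT_FOUND",
// or some other value that indicates a real failure.
function runDeleteSweep(plan, deleteFn, logger) {
    var summary = { deleted: [], skipped: [] };
    for (var i = 0; i < plan.length; i++) {
        var step = plan[i];
        var result = deleteFn(step);
        if (result === "DELETED") {
            summary.deleted.push(step.type);
        } else if (result === "NOT_FOUND") {
            // Recoverable by explicit decision: warn, record it, and keep going.
            logger.warn("[DeleteSweep] " + step.type + " record for " + step.identifier +
                " not found; treating as already removed.");
            summary.skipped.push(step.type);
        } else {
            // Terminal: anything unrecognized stops the sweep immediately.
            throw "[DeleteSweep] " + step.type + " delete for " + step.identifier +
                " returned unexpected result: " + result;
        }
    }
    return summary;
}
```

Returning the `skipped` list alongside `deleted` means the caller can still report what was assumed rather than verified.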
Not every error condition should be recovered from. Some should stop the workflow on purpose.
Consider a disk resize workflow that checks for snapshots before extending a volume. In a development environment, a snapshot is probably fine to work around: log it, proceed anyway, or remove it as part of the operation. In production, a snapshot is a different signal entirely. It may indicate a pending change window, an open incident, or a backup in progress. Extending a disk while a snapshot exists in production is a decision that should require human judgment, not automation.
The correct response is an intentional failure: detect the snapshot, stop the workflow, and surface a clear message that tells the operator exactly why it stopped and what to do next.
```javascript
if (snapshotCount > 0) {
    throw "[CheckSnapshot] VM " + vmName + " has " + snapshotCount +
        " snapshot(s). Disk extension requires snapshot removal in this environment. " +
        "Resolve snapshots and re-run.";
}
```
This is not a bug. It is a feature. The workflow refusing to proceed is the correct behavior, protecting the environment from a well-intentioned operation that could cause problems in context.
| Pattern | Behavior | When to Use |
|---|---|---|
| Fail and alert | Stop the workflow, surface the condition, require human intervention | Condition indicates something that should never be automated past |
| Fail with override | Stop by default; accept an explicit input flag to proceed | Condition is usually a stop signal but occasionally needs a deliberate bypass |
| Log and continue | Record the condition, proceed without stopping | Condition is genuinely advisory and proceeding is always safe |
The choice between these is an architectural decision specific to your environment. It cannot be made generically. What matters is that the decision is made, documented, encoded, and not left to default behavior.
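The three patterns in the table can be encoded as an explicit gate so the chosen behavior is readable in the script rather than implied. This is a sketch under stated assumptions: the policy names, the `allowOverride` input, and the `evaluateSnapshotGate` function are all hypothetical, and the snapshot check is used only because it is the running example.

```javascript
// Hypothetical gate: returns { proceed, note } or throws, depending on the chosen policy.
function evaluateSnapshotGate(vmName, snapshotCount, policy, allowOverride) {
    if (snapshotCount === 0) {
        return { proceed: true, note: null };
    }
    var detail = "VM " + vmName + " has " + snapshotCount + " snapshot(s).";
    if (policy === "FAIL_AND_ALERT") {
        throw "[CheckSnapshot] " + detail + " Resolve snapshots and re-run.";
    }
    if (policy === "FAIL_WITH_OVERRIDE") {
        if (allowOverride === true) {
            return { proceed: true,
                     note: "[CheckSnapshot] " + detail + " Proceeding because the override was explicitly set." };
        }
        throw "[CheckSnapshot] " + detail + " Re-run with the override input set to proceed anyway.";
    }
    if (policy === "LOG_AND_CONTINUE") {
        return { proceed: true, note: "[CheckSnapshot] " + detail + " Advisory only; continuing." };
    }
    // An unrecognized policy is itself an unhandled case, so it stops the workflow.
    throw "[CheckSnapshot] Unknown policy: " + policy;
}
```

The caller is expected to log any returned `note`, so even the permissive paths leave a trace in the execution log.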
Some operations complete synchronously. Most meaningful ones don't.
A REST delete that returns 202 Accepted is telling you the operation was received, not that it finished. A vCenter reconfigure task returns a task object that you have to watch. A DNS record propagation may require a confirmation query after the delete call returns success. Treating any of these as complete when the initial response arrives is the same class of mistake as not checking record types in the DNS example: an optimistic assumption about what "done" means.
```javascript
var MAX_POLLS = 24; // 24 × 5s = 2 minutes max
var POLL_MS = 5000;
var polls = 0;
var done = false;

while (!done && polls < MAX_POLLS) {
    System.sleep(POLL_MS);
    polls++;
    var status = System.getModule("com.company.platform").CheckOperationStatus(statusUrl);
    if (status === "Succeeded") {
        done = true;
    } else if (status === "Failed") {
        throw "[PollOperationStatus] Operation failed after " + polls +
            " poll(s). URL: " + statusUrl;
    }
    // "InProgress": continue looping
}

if (!done) {
    throw "[PollOperationStatus] Timed out after " + polls +
        " poll(s). URL: " + statusUrl;
}
```
System.sleep() in a polling loop is appropriate for operations measured in seconds or minutes, but it holds the workflow thread for the entire wait. For operations measured in days, use the native Timer element in the workflow schema instead. The Timer element suspends the execution entirely and resumes it at the scheduled time, freeing the thread rather than blocking it indefinitely.
The gap between a demo workflow and a production workflow is not complexity; it's coverage. Every item below is a question to ask about an existing workflow before calling it production-ready.
- Does every throw include a [TaskName] prefix and enough context to diagnose the failure?
- Do long waits use the Timer element rather than System.sleep()?
- Does every System.log, System.warn, and System.error call include a task-level prefix?