n8n Error Handling

Category: DevOps & Automation
Repo: wilkomarketing-antigravity-n8n-skills
Path: n8n-error-handling/SKILL.md
Updated: 6/22/2026, 4:17:15 PM

By default, when an n8n node throws, the whole workflow halts. For an interactive run you're watching, that's fine — you see the red node and fix it. For anything unattended (a webhook API, a cron job, a queue worker, an agent tool), it's the wrong default: the caller gets a timeout or an empty 500, the operator gets no alert, and the symptom is "the integration just stopped working" with no log and no clue.

This skill is about making failures loud, structured, and recoverable — and, best case, self-healing so transient blips never reach a human at all.

The two ideas that prevent most silent failures:

  • Per-node error outputs — a node's failure routes down a second output you control, instead of killing the run.
  • A workflow-level error workflow — a catch-all that fires for anything that escapes per-node handling (timeouts, crashes between nodes, unwired failures).

When you actually need this

| Workflow shape | Error handling posture |
| --- | --- |
| Webhook / API (anything with Respond to Webhook) | Required. Every fallible node's error output wired; status code matches cause. |
| Scheduled / cron / queue worker / agent tool (unattended) | Required. A workflow-level error workflow, plus retryOnFail on network nodes. |
| Internal one-off you run and watch yourself | Optional. Default onError: "stopWorkflow" is fine — you'll see the red node and re-run. |

The dividing line: if anyone other than you sees the output — a downstream system, an end user, an on-call engineer — the failure has to be handled, not swallowed. If you're the only watcher and the cost of failure is "I notice and re-run", looser is fine.


The #1 silent trap: per-node error output is a TWO-step setup

This is the single most common way an n8n workflow "handles" errors while actually swallowing them. Routing a node's failure to a handler takes two changes, and doing only one looks complete but misbehaves:

  1. Set onError: "continueErrorOutput" on the node. This is what creates the second output. Without it, main[1] doesn't exist no matter what you wire.
  2. Wire that error output (connections.<node>.main[1], i.e. sourceIndex: 1) to a real handler. Without a target, the error data is emitted into the void.

Get one without the other and you hit a failure mode:

| What you did | What happens at runtime |
| --- | --- |
| onError set, error output not wired | Error data is silently discarded. Downstream doesn't fire. The dashboard shows the run as succeeded. Worst case — no error logged anywhere. |
| Error output wired, onError not set | The slot never fires; the handler is unreachable. On failure the workflow just halts (default stopWorkflow). |
| Both done | Failure routes down main[1] to your handler. ✅ |

Doing both with n8n_update_partial_workflow

// 1) Turn on the error output (creates main[1])
{ type: "updateNode", nodeName: "HTTP Request",
  changes: { onError: "continueErrorOutput" } }

// 2) Wire the error output to a handler. sourceIndex: 1 = the error output.
{ type: "addConnection",
  source: "HTTP Request",
  target: "Handle Error",
  sourceIndex: 1 }

sourceIndex: 0 is the success path, sourceIndex: 1 is the error path. (For IF nodes the aliases branch: "true"/"false" map to index 0/1; for a generic fallible node, use the explicit sourceIndex: 1.)
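
Because each step fails silently without the other, it can help to generate the two operations as a pair. The following is a minimal sketch, assuming the operation shapes shown above; `errorRoutingOps` is a hypothetical helper and not part of the MCP.

```typescript
// Operation shapes copied from the examples above; the helper itself is hypothetical.
type UpdateNodeOp = {
  type: "updateNode";
  nodeName: string;
  changes: { onError: "continueErrorOutput" };
};
type AddConnectionOp = {
  type: "addConnection";
  source: string;
  target: string;
  sourceIndex: 1; // 1 = the error output
};

// Returns both halves of the setup so one is never applied without the other.
function errorRoutingOps(
  nodeName: string,
  handlerName: string,
): [UpdateNodeOp, AddConnectionOp] {
  return [
    { type: "updateNode", nodeName, changes: { onError: "continueErrorOutput" } },
    { type: "addConnection", source: nodeName, target: handlerName, sourceIndex: 1 },
  ];
}

const ops = errorRoutingOps("HTTP Request", "Handle Error");
```

Passing `ops` as the operations list of a single n8n_update_partial_workflow call keeps the change atomic from the author's point of view; whether the call itself is atomic is not something this sketch assumes.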

Then verify. This trap doesn't surface in validate_workflow — a half-wired error output validates clean. Pull the workflow with n8n_get_workflow and confirm both halves:

  • The node's onError is "continueErrorOutput".
  • connections["HTTP Request"].main[1] contains your handler.
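
The two-part check above can be automated over the JSON returned by n8n_get_workflow. This is a sketch under stated assumptions: the `Workflow` type is a minimal slice of the export format, `auditErrorOutputs` is a hypothetical name, and the check treats main[1] as the error output, which only holds for single-output nodes. Node types listed in `MULTI_OUTPUT` are skipped because their main[1] is a regular branch.

```typescript
// Minimal slice of the workflow JSON (assumed shape).
type Target = { node: string; type: string; index: number };
type Workflow = {
  nodes: { name: string; type: string; onError?: string }[];
  connections: Record<string, { main?: (Target[] | null)[] }>;
};

// Node types whose main[1] is a regular output, not an error output.
// Assumption: extend this set for other multi-output nodes you use.
const MULTI_OUTPUT = new Set(["n8n-nodes-base.if", "n8n-nodes-base.switch"]);

// Reports every node where only one of the two steps has been done.
function auditErrorOutputs(wf: Workflow): string[] {
  const problems: string[] = [];
  for (const node of wf.nodes) {
    const errorTargets = wf.connections[node.name]?.main?.[1] ?? [];
    const wired = errorTargets.length > 0;
    const enabled = node.onError === "continueErrorOutput";
    if (enabled && !wired) {
      problems.push(`${node.name}: onError set, error output not wired`);
    } else if (wired && !enabled && !MULTI_OUTPUT.has(node.type)) {
      problems.push(`${node.name}: error output wired, onError not set`);
    }
  }
  return problems;
}
```

An empty result means every node is either fully routed or left on the default; it says nothing about whether the handler at the other end does anything useful.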

Valid onError values:

| Value | Effect |
| --- | --- |
| "stopWorkflow" (default) | Error halts the whole workflow. |
| "continueRegularOutput" | Error item flows out the normal output. Rare, usually wrong — downstream gets error-shaped data and keeps going. |
| "continueErrorOutput" | Error item flows out the separate error output (main[1]). The one you wire. |

Full failure-mode catalog, fan-in/fan-out shapes, and verification: NODE_ERROR_OUTPUTS.md.


Self-healing first: retryOnFail before you wire error paths

Before you build error branches, absorb the transient failures so they never reach those branches. On any node that calls a network service — HTTP Request, comms (Gmail/Slack/Discord), databases, AI nodes, third-party integrations — set node-level retry:

{ type: "updateNode", nodeName: "HTTP Request",
  changes: {
    retryOnFail: true,
    maxTries: 3,
    waitBetweenTries: 5000   // ms
  } }

Why this comes first: a 429 or a brief upstream hiccup will retry and usually succeed on its own. The error output then fires only on real, persistent failures — so your 5xx responses and on-call alerts reflect actual problems instead of noise.

Engine limits to know: retry fires on any error (there's no per-status-code filter), maxTries caps at 5, and waitBetweenTries caps at 5000ms — so 5000 is both the max and a sensible default. See n8n-node-configuration (NODE_FAMILY_GOTCHAS.md) for node-specific notes.
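
The caps above also bound how long a retrying node can stall a run, which matters when a webhook caller is waiting. A small sketch of that arithmetic, assuming maxTries counts total attempts (so there is one wait between each pair of attempts); `clampRetry` and `worstCaseWaitMs` are hypothetical helpers.

```typescript
type RetrySettings = {
  retryOnFail: boolean;
  maxTries: number;
  waitBetweenTries: number; // ms
};

// Caps taken from the limits described above.
const MAX_TRIES_CAP = 5;
const WAIT_CAP_MS = 5000;

// Clamp requested values to what the engine will honor.
function clampRetry(maxTries: number, waitMs: number): RetrySettings {
  return {
    retryOnFail: true,
    maxTries: Math.min(Math.max(Math.trunc(maxTries), 1), MAX_TRIES_CAP),
    waitBetweenTries: Math.min(Math.max(Math.trunc(waitMs), 0), WAIT_CAP_MS),
  };
}

// Delay added by waiting alone; excludes the time each attempt itself takes.
function worstCaseWaitMs(s: RetrySettings): number {
  return (s.maxTries - 1) * s.waitBetweenTries;
}

const recommended = clampRetry(3, 5000);  // 3 tries, 5000 ms apart
const overAsked = clampRetry(10, 60000);  // clamped to 5 tries, 5000 ms apart
```

With the recommended settings the waits add up to 10 seconds; at the caps they add up to 20 seconds. Both figures need to fit inside the caller's timeout for the retry to be invisible to them.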


API workflows: the canonical shape

A webhook-triggered workflow that responds to its caller has one rule that overrides everything else: no hanging branches. Every path — success and every error — must end at a Respond to Webhook, or the caller sits there until it times out.

Webhook (responseMode: "responseNode")
  ├── validate input → process → Respond (200, body)
  └── (any fallible node's error output → sourceIndex 1)
            → Respond (4xx/5xx, structured error body)
            → optional: log full error privately / notify

Three things make this work:

  1. Fan-in to one error responder. Many fallible nodes can route their main[1] to a single Respond node. Keeps the graph readable.
  2. Validation failures (4xx) are checked upstream, not via error outputs. A missing field isn't a node crashing — it's an expected outcome with a known response. Branch on it with IF/Switch (or the schema validator below) and return 400/401/403/404 directly. Error outputs are for unexpected failures (5xx).
  3. responseCode defaults to 200 — even on error branches. This is its own silent trap (see RESPONSE_SHAPES.md and n8n-node-configuration NODE_FAMILY_GOTCHAS.md): an error branch that returns 200 with an error body looks like success to the caller's HTTP client, so their error handling never fires. Set responseCode explicitly on every Respond node.
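
The "no hanging branches" rule can be checked mechanically by walking the connection graph from the Webhook node. The sketch below assumes the same minimal workflow JSON shape used by n8n exports and the node type string `n8n-nodes-base.respondToWebhook`; `findHangingLeaves` is a hypothetical helper. It only follows wired connections, so it does not detect an error output that is enabled but unwired on a node whose other outputs are connected.

```typescript
type Target = { node: string; type: string; index: number };
type Workflow = {
  nodes: { name: string; type: string }[];
  connections: Record<string, { main?: (Target[] | null)[] }>;
};

const RESPOND = "n8n-nodes-base.respondToWebhook";

// Returns leaf nodes reachable from `start` without passing through a
// Respond to Webhook node, i.e. paths that leave the caller waiting.
function findHangingLeaves(wf: Workflow, start: string): string[] {
  const typeOf = new Map<string, string>();
  for (const n of wf.nodes) typeOf.set(n.name, n.type);

  const hanging = new Set<string>();
  const seen = new Set<string>();
  const stack: string[] = [start];

  while (stack.length > 0) {
    const name = stack.pop() as string;
    if (seen.has(name)) continue;
    seen.add(name);
    // A Respond node answers the caller; anything after it is optional.
    if (typeOf.get(name) === RESPOND) continue;

    let isLeaf = true;
    for (const output of wf.connections[name]?.main ?? []) {
      for (const target of output ?? []) {
        isLeaf = false;
        stack.push(target.node);
      }
    }
    if (isLeaf) hanging.add(name);
  }
  return Array.from(hanging).sort();
}
```

Nodes placed after a Respond (private logging, notifications) are deliberately ignored, since the caller already has an answer by then.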

Input validation: the Set-node schema validator

For any endpoint doing structured input validation, run the check as an IIFE inside a single Set node rather than a chain of IF/Switch nodes per field. One node validates the whole payload, returns { valid, validationError, details, requiredSchema }, and an IF branches on valid → your logic (200) or a 400 Respond that echoes the schema back so the caller can self-correct. It's also dramatically faster than a recursive validator in a Code node + sub-workflow. The full pattern, the constraint cookbook, and the expression-escaping gotchas live in API_WORKFLOWS.md.
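
To make the return shape concrete, here is the validation logic sketched as a standalone function. It is illustrative only: the real pattern runs as an expression inside a Set node and is documented in API_WORKFLOWS.md, while the `Schema` format, `validatePayload`, and the sample field names below are assumptions made for this example.

```typescript
// Hypothetical schema format: field name -> expected typeof plus a required flag.
type FieldRule = { type: "string" | "number" | "boolean"; required: boolean };
type Schema = Record<string, FieldRule>;

// Result shape named in the text above.
type ValidationResult = {
  valid: boolean;
  validationError: string | null;
  details: string[];
  requiredSchema: Schema;
};

function validatePayload(
  payload: Record<string, unknown>,
  schema: Schema,
): ValidationResult {
  const details: string[] = [];
  for (const field of Object.keys(schema)) {
    const rule = schema[field];
    const value = payload[field];
    if (value === undefined || value === null) {
      if (rule.required) details.push(`${field}: required`);
    } else if (typeof value !== rule.type) {
      details.push(`${field}: expected ${rule.type}, got ${typeof value}`);
    }
  }
  const valid = details.length === 0;
  return {
    valid,
    validationError: valid ? null : `${details.length} field(s) failed validation`,
    details,
    requiredSchema: schema, // echoed back so the caller can self-correct
  };
}

// Sample schema with made-up fields.
const orderSchema: Schema = {
  email: { type: "string", required: true },
  quantity: { type: "number", required: true },
  note: { type: "string", required: false },
};
const bad = validatePayload({ email: "a@example.com", quantity: "3" }, orderSchema);
const good = validatePayload({ email: "a@example.com", quantity: 3 }, orderSchema);
```

Here `bad.valid` is false with one entry in `details`, and `good.valid` is true. The IF node then only has to branch on `valid`.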


Response shapes: map cause → status code

A 5xx with text/plain "Internal Server Error" is technically an error response and practically useless. And not every failure is a 5xx. Match the status code to why the request failed, because the caller branches on it: their monitoring alerts on 5xx (your fault) but not 4xx (their fault), and 5xx suggests "retry" while 4xx suggests "don't".

The common mistake: wiring everything — including bad input — to one Respond that returns 500 internal_error. Now the caller can't tell their bug from your outage, and your error rates can't separate real incidents from client noise.

| Cause | Status | error code | Where it's handled |
| --- | --- | --- | --- |
| Required field missing / wrong type | 400 | validation_error | Upstream check (schema validator / IF), not error output |
| Auth missing or invalid | 401 | unauthorized | Upstream check |
| Authenticated but not allowed | 403 | forbidden | Upstream check |
| Resource ID valid in request, absent in your data | 404 | not_found | Branch on the lookup result, not its error |
| Conflicts with current state (duplicate, race) | 409 | conflict | Detect with logic |
| Caller exceeded rate limit | 429 | rate_limit_exceeded | Set Retry-After header |
| Node threw, cause unknown | 500 | internal_error | Error output path |
| Third-party API returned an error | 502 | upstream_error | Error output of the HTTP node |
| Can't process right now (downstream down) | 503 | service_unavailable | Detect specific error, hint retry |
| Third-party API timed out | 504 | upstream_timeout | Error output filtered by message |

So there are two distinct flows: 4xx is decided before the work (IF/Switch + dedicated Respond), 5xx comes out of error outputs ("we tried, it broke").

One Respond, expression-driven code. When error paths differ only by number and message (same body shape, same headers), don't fan out to N Respond nodes through a Switch. The Respond node accepts expressions in both Response Code and body — compute the code inline:

// Response Code field on a single Respond to Webhook:
{{ (() => {
    const msg = $json.error?.message || $json.message || '';
    if (msg.includes('INVALID_ID')) return 400;
    if (/429|too many/i.test(msg)) return 429;
    if (/timeout/i.test(msg))      return 504;
    if (/upstream|llm|api/i.test(msg)) return 502;
    return 500;
})() }}

Reserve Switch + multiple Responds for paths that diverge structurally (different headers, different body shapes, redirects). Same shape with a different number is one expression-driven Respond.
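
Message matching of this kind is order-sensitive and easy to get subtly wrong, so it is worth exercising outside n8n before pasting it into a Response Code field. The sketch below copies the rules from the expression into a plain function; `statusFor` and the sample messages are illustrative, not taken from a real workflow.

```typescript
// Same rules, in the same order, as the Response Code expression above.
function statusFor(msg: string): number {
  if (msg.includes("INVALID_ID")) return 400;
  if (/429|too many/i.test(msg)) return 429;
  if (/timeout/i.test(msg)) return 504;
  if (/upstream|llm|api/i.test(msg)) return 502;
  return 500;
}

// First match wins, so rule order decides ambiguous messages.
const samples: [string, number][] = [
  ["INVALID_ID: abc", 400],
  ["Too Many Requests", 429],
  ["Upstream API timeout after 30s", 504],      // "timeout" is tested before "upstream"
  ["LLM returned malformed JSON", 502],
  ["Cannot read properties of undefined", 500],
  ["rapid disconnect", 502],                    // "rapid" contains "api"
  ["order 14290 not found", 429],               // "14290" contains "429"
];

const mismatches = samples.filter(([msg, expected]) => statusFor(msg) !== expected);
```

The last two samples show the cost of bare substring patterns. If that looseness matters for your messages, tightening the patterns (for example with word boundaries) is one option, at the price of a longer expression.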

The default envelope is { "error": "<code>", "message": "<human text>" } — the HTTP status already says success-vs-failure, so no ok: false flag. Never leak internals (stack traces, SQL, upstream bodies, tokens) into the response — log those privately, return a sanitized message. Correlation IDs, retry_after, validation details, and the full do-not-leak list are in RESPONSE_SHAPES.md.
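
One way to keep the envelope consistent and the internals out of it is to build every error response through a single function that never receives the raw error. A minimal sketch, assuming the status and code pairs from the table in this section; `errorResponse` and its parameters are hypothetical names.

```typescript
type ErrorCode =
  | "validation_error" | "unauthorized" | "forbidden" | "not_found"
  | "conflict" | "rate_limit_exceeded" | "internal_error"
  | "upstream_error" | "service_unavailable" | "upstream_timeout";

// Cause -> status mapping from the table in this section.
const STATUS: Record<ErrorCode, number> = {
  validation_error: 400,
  unauthorized: 401,
  forbidden: 403,
  not_found: 404,
  conflict: 409,
  rate_limit_exceeded: 429,
  internal_error: 500,
  upstream_error: 502,
  service_unavailable: 503,
  upstream_timeout: 504,
};

type ErrorResponse = {
  status: number;
  headers: Record<string, string>;
  body: { error: ErrorCode; message: string };
};

// Takes only a caller-safe message. The raw error (stack, SQL, upstream body)
// is logged elsewhere and never passed in, so it cannot leak through here.
function errorResponse(
  code: ErrorCode,
  publicMessage: string,
  retryAfterSeconds?: number,
): ErrorResponse {
  const headers: Record<string, string> = {};
  if (code === "rate_limit_exceeded" && retryAfterSeconds !== undefined) {
    headers["Retry-After"] = String(retryAfterSeconds);
  }
  return { status: STATUS[code], headers, body: { error: code, message: publicMessage } };
}

const timedOut = errorResponse("upstream_timeout", "The provider did not respond in time.");
const limited = errorResponse("rate_limit_exceeded", "Too many requests.", 30);
```

In a workflow the same idea usually lives in a Set node ahead of the Respond, with the status feeding Response Code and the body feeding the response body.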


Workflow-level error workflow (the catch-all)

Per-node outputs handle the failures you anticipated on the nodes you remembered to wire. An error workflow catches everything else: a node you forgot to wire, a crash between nodes, a whole-workflow timeout, a trigger failure. For unattended workflows this is the safety net that turns "it silently stopped" into "an alert arrived".

Build it as a separate workflow starting with an Error Trigger node. n8n invokes it with the failure context:

{
  "execution": { "id": "...", "url": "...", "lastNodeExecuted": "Fetch order",
    "error": { "name": "NodeApiError", "message": "...", "timestamp": 1715000000000 } },
  "workflow": { "id": "...", "name": "Sync Stripe customers" }
}

Minimal version — capture → notify:

Error Trigger → Set (build alert from execution + error) → Slack/email (post to #incidents)

A good alert includes the workflow name, a link to the editor and a link to the failed execution, the failed node name, and the real error message (not "Workflow failed"). Field expressions and the optional "fetch the failing input via the n8n node" upgrade are in ERROR_WORKFLOWS.md.
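
As an illustration of that checklist, the function below assembles an alert from the Error Trigger payload shown above. It is a sketch: the actual field expressions are in ERROR_WORKFLOWS.md, `buildAlert` is a hypothetical name, and both `baseUrl` and the `/workflow/<id>` editor path are assumptions about your instance.

```typescript
// Payload shape from the Error Trigger example above.
type ErrorPayload = {
  execution: {
    id: string;
    url: string;
    lastNodeExecuted: string;
    error: { name: string; message: string; timestamp: number };
  };
  workflow: { id: string; name: string };
};

// `baseUrl` is an assumption: the public URL of your n8n instance.
function buildAlert(p: ErrorPayload, baseUrl: string): string {
  const when = new Date(p.execution.error.timestamp).toISOString();
  const editor = `${baseUrl.replace(/\/+$/, "")}/workflow/${p.workflow.id}`;
  return [
    `Workflow failed: ${p.workflow.name}`,
    `Node: ${p.execution.lastNodeExecuted}`,
    `Error: ${p.execution.error.name}: ${p.execution.error.message}`,
    `When: ${when}`,
    `Execution: ${p.execution.url}`,
    `Editor: ${editor}`,
  ].join("\n");
}
```

Putting the workflow name and failed node on the first two lines keeps the alert readable in a notification preview, where only the top of the message is visible.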

Two traps worth flagging up front:

  • The recursion trap. If the error workflow notifies Slack and Slack is what's down, the error workflow fails too — and the original error vanishes. Notify on a different channel than your monitored workflows use (most workflows alert Slack → error workflow uses email), and add a fallback (write to a Data Table) so a failed notification still leaves a trace.
  • A "handled" error won't bubble up. If a node's error output is wired to a no-op that drops the data, n8n considers the error handled and the error workflow does not fire. Only catch per-node when you're actually doing something with the error.
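
The recursion trap comes down to an ordered fallback: try the primary channel, and if it fails, fall through to one that shares no dependency with it. In n8n this would typically be wired with the notification node's own error output leading to the fallback write; the code below only sketches the control flow, and `Notifier` and `notifyWithFallback` are hypothetical names.

```typescript
type Notifier = { name: string; send: (message: string) => Promise<void> };

// Tries each channel in order and resolves with the name of the first that
// succeeds. Rejects only if every channel fails, so total failure stays loud.
async function notifyWithFallback(
  message: string,
  channels: Notifier[],
): Promise<string> {
  const failures: string[] = [];
  for (const channel of channels) {
    try {
      await channel.send(message);
      return channel.name;
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(`${channel.name}: ${reason}`);
    }
  }
  throw new Error(`all channels failed (${failures.join("; ")})`);
}
```

The order of `channels` encodes the policy: put the out-of-band channel first and the durable record (for example a Data Table write) last, so that a notification outage still leaves a trace.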

What the community MCP can't do: assigning the error workflow (instance default or per-workflow override) is an n8n UI setting — Workflow Settings → Error Workflow. There is no MCP tool to set it. Build the error workflow with the MCP, then tell the user the exact UI step to wire it up, and to repeat it (or set the instance default) for every unattended workflow.


What's NOT available via the community MCP

| Want to do | Reality |
| --- | --- |
| Set a workflow's Error Workflow setting | UI only (Workflow Settings → Error Workflow). No MCP tool. Build the workflow, then hand the user the UI step. |
| Toggle other workflow settings (Save Execution Data, timezone, timeout, caller policy) | UI only. n8n_update_partial_workflow has updateSettings, but the error-workflow assignment is not reliably exposed — confirm in the UI. |
| Enable instance-wide error logging (Sentry, server logs) | Instance config, outside n8n workflows entirely. |

What the MCP can do: build the error workflow, set onError/retryOnFail on nodes (updateNode/patchNodeField), wire error outputs (addConnection with sourceIndex: 1), validate (validate_workflow, n8n_validate_workflow), auto-fix common issues (n8n_autofix_workflow), test (n8n_test_workflow), and inspect failures (n8n_executions).


Anti-patterns

| Anti-pattern | What goes wrong | Fix |
| --- | --- | --- |
| onError set but error output unwired | Error silently discarded; run shows as succeeded | Wire sourceIndex: 1 to a real handler, or revert onError to stopWorkflow so it's loud |
| Error output wired but onError not set | Slot never fires; handler unreachable; workflow halts on failure | Set onError: "continueErrorOutput" |
| Webhook → process → respond, no error branch | Caller gets a timeout or n8n's generic 500 | Wire every fallible node's error output to a Respond |
| Error branch returns 200 with an {error} body | Caller's client reads success; their error handling never fires | Set responseCode to 4xx/5xx explicitly on error Responds |
| One 500 internal_error for everything | Caller can't tell their bad input from your outage | Map cause → status (4xx caller, 5xx you) |
| Catching errors in a Code node and returning them as data | Downstream processes error-shaped data and continues | Let it throw; use onError: "continueErrorOutput" + wired path |
| Network node with no retryOnFail | Every transient 429/blip surfaces as a 5xx; alerts fire on noise | retryOnFail: true, maxTries: 3, waitBetweenTries: 5000 |
| Switch → N Responds differing only by status code | 5 nodes for what's one Respond | Compute the code inline in one expression-driven Respond |
| Unattended workflow with no error workflow | A genuine failure goes nowhere | Build an Error Trigger workflow + assign it in the UI |
| Error workflow notifies the same channel the workflows monitor | Channel down → error workflow also fails → error vanishes | Use a different channel + a Data Table fallback |
| Leaking $json.error (stack/SQL/tokens) into the response | Exposes internals to callers/attackers | Log privately, return a sanitized message |

Reference files

| File | Read when |
| --- | --- |
| NODE_ERROR_OUTPUTS.md | Wiring a per-node error output on individual fallible nodes |
| API_WORKFLOWS.md | Building/reviewing a webhook → Respond workflow, including the schema validator |
| RESPONSE_SHAPES.md | Defining response body conventions, status codes, and what not to leak |
| ERROR_WORKFLOWS.md | Setting up the workflow-level catch-all for unattended workflows |

Integration with other skills

  • n8n-workflow-patterns — the webhook/API and scheduled patterns are where error handling lives. Use it for the overall shape; use this skill to harden it.
  • n8n-node-configuration — onError/retryOnFail are node config; NODE_FAMILY_GOTCHAS.md covers the Webhook/Respond response-code traps in depth.
  • n8n-validation-expert — the half-wired error output (one of the two steps missing) is a connection/config audit item, not a validation error. This skill is the fix.
  • n8n-expression-syntax — the expression-driven Response Code and the alert-message expressions rely on correct {{ }} syntax and $json.error access.
  • n8n-code-javascript / n8n-code-python — if you catch errors inside a Code node, decide deliberately: re-throw to use the error output, or handle and continue. Don't return error-shaped data and pretend it succeeded.
  • n8n-code-tool — an agent's Code Tool surfaces thrown errors back to the LLM, which then retries; that's a different error contract from workflow nodes.
  • n8n-binary-and-data — file/binary operations are fallible too; wire their error outputs like any network node.

Quick reference checklist

For an API / webhook workflow:

  • Webhook trigger uses responseMode: "responseNode"
  • Input validated upstream → 4xx Respond (schema validator or IF)
  • Every fallible node has onError: "continueErrorOutput" and main[1] wired
  • Network nodes have retryOnFail: true, maxTries: 3, waitBetweenTries: 5000
  • Error path ends at a Respond with an explicit 4xx/5xx responseCode
  • Status code matches cause (4xx caller, 5xx you)
  • Error body is { error, message } — no stack traces, SQL, or tokens
  • Verified with n8n_get_workflow: both onError and main[1] present on each fallible node

For an unattended (scheduled/cron/queue) workflow:

  • Network nodes have retryOnFail configured
  • An Error Trigger workflow exists (capture → notify, optional retry)
  • The error workflow notifies on a different channel + has a fallback (recursion trap)
  • The error-workflow setting is assigned in the n8n UI (MCP can't do it — remind the user)

Remember: the default is silence. Error handling is two moves — make the failure route (per-node onError + wired output, or a catch-all error workflow) and make it speak (a status code and body that tell the truth). Half a move is worse than none, because it looks done.
