Error Handling Patterns

Error handling separates architects who design for the happy path from those who design for reality. External systems go down. Networks fail. Data arrives malformed. Rate limits get exceeded. These patterns cover what every CTA must know to handle failures properly.


Classify the error before choosing a pattern. Different error types demand different responses.

| Category | Examples | Correct Response | Wrong Response |
| --- | --- | --- | --- |
| Transient | Network timeout, 503 Service Unavailable, rate limit (429) | Retry with backoff | Fail immediately |
| Persistent | 404 Not Found, 400 Bad Request, invalid data | Route to dead letter queue, alert | Retry infinitely |
| Systemic | External system fully down, certificate expired | Circuit breaker, fallback | Keep retrying (wastes resources) |
| Data quality | Missing required fields, invalid format, duplicates | Reject and notify, data cleansing | Silently drop or force through |
| Capacity | Bulk API daily limit reached, governor limits | Queue and defer, throttle | Fail the entire batch |
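
The classification step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `classify` function, its flag parameters, and the `RESPONSES` mapping are hypothetical names, and the rule that five consecutive failures signals a systemic outage is an assumption chosen for the example.

```python
from enum import Enum
from typing import Optional


class ErrorCategory(Enum):
    TRANSIENT = "transient"
    PERSISTENT = "persistent"
    SYSTEMIC = "systemic"
    DATA_QUALITY = "data_quality"
    CAPACITY = "capacity"


# Hypothetical lookup mirroring the "Correct Response" column of the table.
RESPONSES = {
    ErrorCategory.TRANSIENT: "retry with backoff",
    ErrorCategory.PERSISTENT: "route to dead letter queue and alert",
    ErrorCategory.SYSTEMIC: "open circuit breaker and use fallback",
    ErrorCategory.DATA_QUALITY: "reject and notify",
    ErrorCategory.CAPACITY: "queue and defer",
}


def classify(status: Optional[int], *, timed_out: bool = False,
             validation_failed: bool = False,
             daily_limit_reached: bool = False,
             consecutive_failures: int = 0,
             systemic_threshold: int = 5) -> ErrorCategory:
    """Map one failed call to a category, checking the most specific signals first."""
    if validation_failed:
        return ErrorCategory.DATA_QUALITY
    if daily_limit_reached:
        return ErrorCategory.CAPACITY
    # Assumption: a run of failures against the same target means it is down.
    if consecutive_failures >= systemic_threshold:
        return ErrorCategory.SYSTEMIC
    # Timeouts, 429, and 5xx responses are treated as transient and retryable.
    if timed_out or status == 429 or (status is not None and 500 <= status <= 599):
        return ErrorCategory.TRANSIENT
    # Everything else (for example 400 or 404) will not succeed on retry.
    return ErrorCategory.PERSISTENT
```

For example, `RESPONSES[classify(404)]` yields the dead letter queue response, while `classify(503)` is transient until the consecutive-failure count reaches the threshold.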

Retry with exponential backoff: retry failed operations with increasing delays between attempts. This is the foundation of transient error handling.

Failed API calls classified as retryable get exponential backoff with jitter until max retries; non-retryable or exhausted calls route to the dead letter queue.
Figure 1. Exponential backoff doubles the wait between each retry attempt while jitter randomizes exact timing to prevent synchronized retry spikes. Only 5xx, timeout, and 429 errors are retryable. All other 4xx client errors route directly to the dead letter queue.

| Parameter | Recommended Value | Rationale |
| --- | --- | --- |
| Max retries | 3-5 | Enough for transient issues, not so many that persistent failures waste time |
| Base delay | 1 second | Starting delay before first retry |
| Max delay | 60 seconds | Cap to prevent excessively long waits |
| Backoff multiplier | 2x (exponential) | 1s, 2s, 4s, 8s, 16s… |
| Jitter | Random 0-1s added | Prevents thundering herd when many clients retry simultaneously |

| Attempt | Delay (no jitter) | Delay (with jitter) |
| --- | --- | --- |
| 1 | 1 second | 1.0 - 2.0 seconds |
| 2 | 2 seconds | 2.0 - 3.0 seconds |
| 3 | 4 seconds | 4.0 - 5.0 seconds |
| 4 | 8 seconds | 8.0 - 9.0 seconds |
| 5 | 16 seconds | 16.0 - 17.0 seconds |
| Total | ~31 seconds | ~31 - 36 seconds |
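
The delay schedule above can be reproduced with a short Python sketch. The function names `backoff_delays` and `call_with_retry` are hypothetical, the defaults are taken from the recommended values, and the choice to apply the 60-second cap before adding jitter is an assumption. The random source and the sleep function are injectable so the timing can be checked without waiting.

```python
import random
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int = 5, base: float = 1.0,
                   multiplier: float = 2.0, max_delay: float = 60.0,
                   jitter: float = 1.0,
                   rng: Callable[[], float] = random.random) -> List[float]:
    """Return one wait time per retry: base * multiplier^n, capped, plus jitter."""
    delays = []
    for attempt in range(max_retries):
        capped = min(base * multiplier ** attempt, max_delay)
        # rng() is in [0, 1), so each delay gains up to `jitter` seconds.
        delays.append(capped + jitter * rng())
    return delays


def call_with_retry(operation: Callable[[], T],
                    is_retryable: Callable[[Exception], bool],
                    delays: Iterable[float],
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, waiting through `delays` after each retryable failure."""
    for delay in delays:
        try:
            return operation()
        except Exception as exc:
            if not is_retryable(exc):
                raise  # non-retryable: let the caller route it to the DLQ
            sleep(delay)
    # Final attempt after the last delay; a failure here propagates to the caller.
    return operation()
```

With jitter disabled (`rng=lambda: 0.0`) the schedule is 1, 2, 4, 8, and 16 seconds, which sums to the 31 seconds shown in the table.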

A circuit breaker stops a system from repeatedly calling an external service that is known to be down. The pattern is modeled after electrical circuit breakers.

Three-state machine transitions from Closed to Open on failure threshold, to Half-Open after timeout, then back to Closed on test success or Open on test failure.
Figure 2. The circuit breaker prevents cascading failures by stopping calls to a known-down system and failing fast instead. The Half-Open probe allows automatic recovery without manual intervention once the external system comes back online.

| State | Behavior | Salesforce Implementation |
| --- | --- | --- |
| Closed | Normal operation, calls pass through | Standard callout behavior |
| Open | All calls fail immediately, no callout attempted | Check Custom Metadata / Platform Cache before callout |
| Half-Open | Allow one test call to check if service recovered | Scheduled job or manual reset attempts one call |

| Parameter | Recommended | Purpose |
| --- | --- | --- |
| Failure threshold | 5 consecutive failures | Number of failures before opening circuit |
| Open timeout | 30-60 seconds | How long to wait before testing recovery |
| Success threshold | 2-3 successes in half-open | Successes needed to close circuit |
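
The three states and the parameters above can be illustrated with a small in-memory Python sketch. The `CircuitBreaker` class and its method names are hypothetical. State lives in a plain object here, whereas on Salesforce it would sit in something like Platform Cache or Custom Metadata as the table notes. The sketch does not limit how many probe calls run concurrently in the half-open state.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Minimal three-state breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold: int = 5, open_timeout: float = 60.0,
                 success_threshold: int = 2,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.success_threshold = success_threshold
        self._clock = clock
        self.state = "CLOSED"
        self._failures = 0    # consecutive failures while CLOSED
        self._successes = 0   # consecutive successes while HALF_OPEN
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        """Return False to fail fast; move OPEN -> HALF_OPEN after the timeout."""
        if self.state == "OPEN":
            if self._clock() - self._opened_at >= self.open_timeout:
                self.state = "HALF_OPEN"
                self._successes = 0
                return True
            return False
        return True

    def record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = "CLOSED"
                self._failures = 0
        else:
            self._failures = 0  # a success breaks the run of consecutive failures

    def record_failure(self) -> None:
        if self.state == "HALF_OPEN":
            self._trip()  # a failed probe reopens the circuit immediately
        else:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._trip()

    def _trip(self) -> None:
        self.state = "OPEN"
        self._opened_at = self._clock()
        self._failures = 0
```

A caller checks `allow_request()` before each callout and reports the outcome with `record_success()` or `record_failure()`. Injecting `clock` keeps the timeout testable without real waiting.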

Messages that cannot be processed after all retries go to a dead letter queue for inspection, reprocessing, or alerting.

Messages that exhaust retries route to a dead letter queue, triggering operations alerts and a manual review dashboard where fixable messages are corrected and resubmitted.
Figure 3. The dead letter queue captures messages that failed all retry attempts, preserving them for human review and resubmission. Without a DLQ, failed messages are silently lost and the integration appears to run while data quietly fails to transfer.

| Approach | Best For | Persistence |
| --- | --- | --- |
| Custom Object (Integration_Error__c) | Full audit trail, reporting | Permanent (until deleted) |
| Platform Events (Error_Event__e) | Real-time alerting | 24-72 hours |
| Big Object | High-volume error logging | Permanent, archive-oriented |
| Middleware DLQ (MuleSoft/Anypoint MQ) | Middleware-managed integrations | Configurable retention |
| External monitoring (Splunk, Datadog) | Centralized ops monitoring | Per tool retention |

A well-designed DLQ record captures everything needed for diagnosis and reprocessing:

| Field | Purpose |
| --- | --- |
| Source System | Where the message originated |
| Target System | Where it was being sent |
| Payload | The original message content |
| Error Message | What went wrong |
| Error Code | HTTP status, exception type |
| Retry Count | How many attempts were made |
| First Failure Timestamp | When it first failed |
| Last Failure Timestamp | When retries were exhausted |
| Status | New / Under Review / Resubmitted / Archived |
| Correlation ID | Links to the original transaction |
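
As a rough illustration, the fields above map onto a record type like the following Python dataclass. The class name `DlqRecord`, the attribute names, and the `record_retry_failure` helper are hypothetical. In a Salesforce org the same shape would more likely be a custom object such as Integration_Error__c.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class DlqStatus(Enum):
    NEW = "New"
    UNDER_REVIEW = "Under Review"
    RESUBMITTED = "Resubmitted"
    ARCHIVED = "Archived"


@dataclass
class DlqRecord:
    source_system: str
    target_system: str
    payload: str          # original message content, kept verbatim for replay
    error_message: str
    error_code: str       # HTTP status or exception type
    correlation_id: str   # links back to the original transaction
    retry_count: int = 0
    first_failure_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    last_failure_at: Optional[datetime] = None
    status: DlqStatus = DlqStatus.NEW

    def __post_init__(self) -> None:
        # Until a retry fails, the first and last failure are the same moment.
        if self.last_failure_at is None:
            self.last_failure_at = self.first_failure_at

    def record_retry_failure(self, error_message: str, error_code: str,
                             at: Optional[datetime] = None) -> None:
        """Update the record after another failed attempt."""
        self.retry_count += 1
        self.error_message = error_message
        self.error_code = error_code
        self.last_failure_at = at or datetime.now(timezone.utc)
```

Keeping the first failure timestamp fixed while the last one moves lets an operator see how long a message has been failing, not just when it last failed.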

Idempotency means that processing the same message multiple times produces the same result as processing it once. It is mandatory for any at-least-once delivery system.

Platform Events, CDC, and most middleware deliver at-least-once. Duplicates will happen because of:

  • Network retries at the transport layer
  • Subscriber reconnection replaying events
  • Middleware retry on ambiguous failures
  • Bulk API partial retries

Incoming messages are checked against stored idempotency keys; already-processed messages return cached results while new messages process and store their key.
Figure 4. Idempotency key checks prevent duplicate processing when at-least-once delivery systems (Platform Events, CDC, middleware retries) deliver the same message more than once. Keys generated from natural business identifiers are more reliable than payload hashes.

| Strategy | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Idempotency key | Client sends unique key; server checks before processing | Most reliable | Requires key storage and lookup |
| Natural key dedup | Use business key (Order Number) to detect duplicates | No extra infrastructure | Requires unique business key |
| Upsert operations | Use External ID for upsert instead of insert | Built into Salesforce | Only works for CRUD, not business logic |
| Payload hash | Hash the message content, check for duplicate hashes | Works without client changes | Hash collisions (rare): different messages may produce the same hash |
| Timestamp comparison | Only process if timestamp is newer than last processed | Simple | Clock skew issues |
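
The idempotency key strategy can be sketched as a thin wrapper around the real handler. The `IdempotentProcessor` class is a hypothetical name. The in-memory dictionary is an assumption for illustration only, since a production key store has to be durable and shared across workers.

```python
from typing import Any, Callable, Dict


class IdempotentProcessor:
    """Run a handler at most once per idempotency key; replay the stored result."""

    def __init__(self, handler: Callable[[dict], Any]) -> None:
        self._handler = handler
        # Assumption: an in-memory dict stands in for a durable key store.
        self._results: Dict[str, Any] = {}

    def process(self, idempotency_key: str, message: dict) -> Any:
        if idempotency_key in self._results:
            # Duplicate delivery: return the cached result, do no new work.
            return self._results[idempotency_key]
        result = self._handler(message)
        self._results[idempotency_key] = result
        return result
```

Using a natural business identifier such as the order number as the key means a redelivered event maps to the same entry even if the payload was re-serialized along the way.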

Error handling without monitoring is a fire alarm with no sound. Failures must be detected and addressed before they create business impact.

Integration processes, dead letter queues, and error logs feed a log collector that drives an alert rules engine and operations dashboard with tiered response channels.
Figure 5. Tiered alerting routes warning-level events to email and Slack for awareness while critical events page on-call engineers through PagerDuty. All alerts auto-create tickets for traceability, and the operations dashboard provides continuous visibility without alert fatigue.

| Metric | Threshold | Alert Level |
| --- | --- | --- |
| Integration failure rate | > 5% of transactions | Warning |
| Integration failure rate | > 20% of transactions | Critical |
| DLQ depth | > 100 messages | Warning |
| DLQ depth growing | Increasing for 30+ minutes | Critical |
| API call consumption | > 80% of daily limit | Warning |
| API call consumption | > 95% of daily limit | Critical |
| Average response time | > 5 seconds (for real-time) | Warning |
| Circuit breaker open | Any circuit open | Critical |
| Event subscriber lag | > 1 hour behind | Warning |
| Event subscriber lag | > 12 hours behind | Critical (approaching retention limit) |

| Tool | What It Monitors | Cost |
| --- | --- | --- |
| Event Monitoring | API calls, logins, report exports | Shield add-on |
| Custom Dashboard | Integration_Error__c records | Included |
| Flow Email Alerts | Trigger on error records | Included |
| Platform Events | Real-time error broadcasting | Included |
| CRM Analytics (formerly Einstein Analytics) | Trend analysis on error patterns | Add-on |
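
The alert thresholds listed in this section translate directly into a rules function. The sketch below is one possible encoding in Python. The `Snapshot` fields, the `evaluate` function, and the alert names are hypothetical, and rates are expressed as fractions (0.05 for 5%).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Snapshot:
    failure_rate: float = 0.0          # fraction of transactions that failed
    dlq_depth: int = 0                 # messages currently in the DLQ
    dlq_growing_minutes: float = 0.0   # how long DLQ depth has been increasing
    api_usage: float = 0.0             # fraction of the daily API limit consumed
    avg_response_seconds: float = 0.0  # for real-time integrations
    open_circuits: int = 0
    subscriber_lag_hours: float = 0.0


def _tiered(value: float, warning: float, critical: float) -> Optional[str]:
    """Return the highest alert level whose threshold is exceeded, if any."""
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return None


def evaluate(s: Snapshot) -> Dict[str, str]:
    """Return {alert_name: level} for every threshold the snapshot breaches."""
    checks = {
        "failure_rate": _tiered(s.failure_rate, 0.05, 0.20),
        "api_usage": _tiered(s.api_usage, 0.80, 0.95),
        "subscriber_lag": _tiered(s.subscriber_lag_hours, 1, 12),
        "dlq_depth": "warning" if s.dlq_depth > 100 else None,
        "dlq_growth": "critical" if s.dlq_growing_minutes >= 30 else None,
        "response_time": "warning" if s.avg_response_seconds > 5 else None,
        "circuit_breaker": "critical" if s.open_circuits > 0 else None,
    }
    return {name: level for name, level in checks.items() if level}
```

Returning a single level per metric means a 25% failure rate raises one critical alert instead of a warning and a critical together, which helps with alert fatigue.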

Combining Patterns: The Complete Error Handling Stack

In a CTA scenario, present a layered error handling strategy, not just a single pattern.

Error classification drives pattern selection: transient errors retry with backoff, systemic errors open the circuit breaker, data quality errors reject to DLQ immediately.
Figure 6. Error classification is the foundation of the full error handling stack. Transient errors retry, systemic failures trigger the circuit breaker to stop wasting resources, and data quality rejections go directly to the DLQ with validation context for the operations team.

End-to-End Failure Scenario: ERP Goes Down

This sequence diagram shows how the patterns work together when the ERP becomes unavailable during order processing. This type of walkthrough scores well at the CTA board.

Order event triggers three retry attempts against a down ERP, opens the circuit breaker after threshold, routes to DLQ with alert, then auto-recovers and resubmits on ERP restoration.
Figure 7. Walking through a complete ERP outage scenario end-to-end demonstrates how retry, circuit breaker, DLQ, and alerting work together. No orders are lost: they queue in the DLQ and resubmit automatically once the circuit closes, with full operations visibility throughout.
Detailed walkthrough

This sequence has five distinct phases. Reading it as a runtime narrative rather than an architecture diagram is exactly how you should present it to the review board.

Phase 1: Normal handoff. Salesforce fires a Platform Event when an order is submitted. Middleware receives it and immediately checks circuit state. The circuit breaker returns CLOSED, meaning the ERP is considered healthy. Middleware makes its first POST to /orders. The ERP returns a 503.

Phase 2: Retry with exponential backoff. The 503 is a retryable error (transient, server-side). Middleware waits one second and tries again. Another 503. It waits two seconds and tries a third time. Another 503. The backoff interval doubles between attempts (1s, 2s) deliberately. A recovering ERP under load needs breathing room. If every failing client retries at identical intervals, the recovered system receives a traffic spike at the exact moment it is trying to stabilize, which can re-collapse it. The increasing wait distributes pressure. Three attempts is enough to distinguish a brief self-correcting flap from a genuine outage.

Phase 3: Circuit trips. After the third failure, middleware reports to the circuit breaker state store. The failure threshold, set to three consecutive failures in this scenario, is met and the breaker flips from CLOSED to OPEN. Two things happen simultaneously: the failed order routes to the DLQ, and operations gets a PagerDuty alert plus an auto-created Jira ticket for traceability. Any subsequent order events that arrive while the circuit is OPEN fail fast without touching the ERP. This stops a broken integration from wasting resources and amplifying load on an already-struggling system.

Phase 4: Half-open probe. After 60 seconds, the circuit moves to HALF-OPEN. One test call goes out to the ERP. If it succeeds, the breaker resets to CLOSED. If it fails, it snaps back to OPEN and the cooldown restarts. No bulk traffic crosses until the single probe succeeds.

Phase 5: Recovery and replay. The ERP returns 200 OK on the probe. Circuit closes. Operations receives a recovery notification and triggers a bulk DLQ resubmit. The queued orders replay through middleware to the ERP in sequence. Every order that arrived during the outage is eventually delivered, with a complete audit trail from original Platform Event timestamp through successful resubmit.

The zero-data-loss guarantee comes from the DLQ, not from the retry mechanism. Retries handle transient glitches. The DLQ handles the cases retries cannot resolve. Together they are why this pattern scores well at the board.
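
The five phases can be replayed as a toy simulation. The Python sketch below is illustrative only: the `Middleware` class, its method names, and the simulated clock are hypothetical. The ERP is a stub function returning an HTTP status code, and the health probe reuses that stub with a placeholder id.

```python
from typing import Callable, List


class Middleware:
    """Toy model of the ERP outage walkthrough; every name here is hypothetical."""

    def __init__(self, call_erp: Callable[[str], int],
                 open_timeout: float = 60.0) -> None:
        self.call_erp = call_erp   # stub: takes an order id, returns a status code
        self.open_timeout = open_timeout
        self.circuit = "CLOSED"
        self.opened_at = 0.0
        self.now = 0.0             # simulated clock in seconds
        self.dlq: List[str] = []
        self.alerts: List[str] = []

    def handle(self, order_id: str) -> str:
        # While OPEN, fail fast: park the order without touching the ERP.
        if self.circuit == "OPEN":
            self.dlq.append(order_id)
            return "dead-lettered (circuit open)"
        # Phases 1-2: first call plus two retries, waiting 1s and then 2s.
        for wait in (1.0, 2.0, None):
            if self.call_erp(order_id) == 200:
                return "delivered"
            if wait is not None:
                self.now += wait   # stand-in for sleeping
        # Phase 3: threshold met, so trip the breaker, dead-letter, and alert.
        self.circuit, self.opened_at = "OPEN", self.now
        self.dlq.append(order_id)
        self.alerts.append("critical: ERP circuit open")
        return "dead-lettered (retries exhausted)"

    def probe(self) -> bool:
        # Phase 4: after the cooldown, a single test call decides the next state.
        if self.circuit != "OPEN" or self.now - self.opened_at < self.open_timeout:
            return False
        self.circuit = "HALF_OPEN"
        if self.call_erp("probe") == 200:
            self.circuit = "CLOSED"
            self.alerts.append("recovered: ERP circuit closed")
            return True
        self.circuit, self.opened_at = "OPEN", self.now  # restart the cooldown
        return False

    def replay(self) -> List[str]:
        # Phase 5: resubmit queued orders in arrival order, stopping on failure.
        delivered = []
        while self.circuit == "CLOSED" and self.dlq:
            if self.call_erp(self.dlq[0]) != 200:
                break
            delivered.append(self.dlq.pop(0))
        return delivered
```

Driving this with a stub that returns 503 and later 200 shows the property the walkthrough stresses: orders received during the outage stay in `dlq` until `replay()` delivers them, so nothing is dropped.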


| Anti-Pattern | Why It Fails | Better Approach |
| --- | --- | --- |
| Retry forever | Wastes resources, masks permanent failures | Max retries + DLQ |
| Retry without backoff | Hammers already-struggling systems | Exponential backoff with jitter |
| Swallow errors silently | Nobody knows the integration is broken | Log, alert, DLQ |
| Same retry policy for all errors | 400 Bad Request will never succeed with retry | Classify errors, only retry transient |
| No idempotency | Duplicate processing on retry | Idempotency keys or upsert |
| Manual-only error recovery | Does not scale, creates a human bottleneck | Automated reprocessing with manual review for edge cases |

  • Risk Management: integration failures are a top risk category; error handling feeds directly into risk registers
  • Data Quality & Governance: data quality errors are a major category of integration failures; governance prevents bad data from propagating
  • Review Board Presentation & Q&A: judges ask “what happens when this fails?” on every integration. Prepare error handling explanations.

Personal study notes for the Salesforce CTA exam. Content compiled from VJ's study notes, official Salesforce documentation, community sources, and online publicly available content, then organized and presented with AI assistance. Not affiliated with Salesforce. © 2025–2026 VJ Srivastava.