Error Handling Patterns

Error handling separates architects who design for the happy path from those who design for reality. External systems go down. Networks fail. Data arrives malformed. Rate limits get exceeded. These patterns cover what every CTA must know to handle failures properly.

Error Categories

Classify the error before choosing a pattern. Different error types demand different responses.

Category	Examples	Correct Response	Wrong Response
Transient	Network timeout, 503 Service Unavailable, rate limit (429)	Retry with backoff	Fail immediately
Persistent	404 Not Found, 400 Bad Request, invalid data	Route to dead letter queue, alert	Retry infinitely
Systemic	External system fully down, certificate expired	Circuit breaker, fallback	Keep retrying (wastes resources)
Data quality	Missing required fields, invalid format, duplicates	Reject and notify, data cleansing	Silently drop or force through
Capacity	Bulk API daily limit reached, governor limits	Queue and defer, throttle	Fail the entire batch
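
The classification step can be sketched in code. This is a minimal illustration in Python, not a Salesforce or middleware API: the `ErrorCategory` enum, the `classify` function, and its flag parameters are all hypothetical names chosen for this example. Systemic failures are deliberately left out, because a single response cannot reveal them; they are usually inferred from a run of consecutive failures (see the circuit breaker pattern).

```python
from enum import Enum
from typing import Optional


class ErrorCategory(Enum):
    # Values mirror the "Correct Response" column of the table above.
    TRANSIENT = "retry with backoff"
    PERSISTENT = "route to dead letter queue and alert"
    DATA_QUALITY = "reject and notify"
    CAPACITY = "queue and defer"


def classify(
    status: Optional[int] = None,
    *,
    timed_out: bool = False,
    daily_limit_reached: bool = False,
    validation_failed: bool = False,
) -> ErrorCategory:
    """Map one failed call to an error category (hypothetical helper)."""
    if validation_failed:
        return ErrorCategory.DATA_QUALITY
    if daily_limit_reached:
        return ErrorCategory.CAPACITY
    # Timeouts, 429, and 5xx are treated as transient, as in the table.
    if timed_out or status == 429 or (status is not None and 500 <= status <= 599):
        return ErrorCategory.TRANSIENT
    # Every other 4xx will not succeed on retry.
    if status is not None and 400 <= status <= 499:
        return ErrorCategory.PERSISTENT
    raise ValueError(f"not a recognized error: status={status!r}")
```

Checking the flags before the status code is a design choice in this sketch: a payload that failed validation should never be retried, whatever status the target system returned.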

Pattern 1: Retry with Exponential Backoff

Retry failed operations with increasing delays between attempts. The foundation of transient error handling.

How It Works

Figure 1. Exponential backoff doubles the wait between each retry attempt while jitter randomizes exact timing to prevent synchronized retry spikes. Only 5xx, timeout, and 429 errors are retryable. All other 4xx client errors route directly to the dead letter queue.

Implementation Parameters

Parameter	Recommended Value	Rationale
Max retries	3-5	Enough for transient issues, not so many that persistent failures waste time
Base delay	1 second	Starting delay before first retry
Max delay	60 seconds	Cap to prevent excessively long waits
Backoff multiplier	2x (exponential)	1s, 2s, 4s, 8s, 16s…
Jitter	Random 0-1s added	Prevents thundering herd when many clients retry simultaneously

Retry Timing Example

Attempt	Delay (no jitter)	Delay (with jitter)
1	1 second	1.0 - 2.0 seconds
2	2 seconds	2.0 - 3.0 seconds
3	4 seconds	4.0 - 5.0 seconds
4	8 seconds	8.0 - 9.0 seconds
5	16 seconds	16.0 - 17.0 seconds
Total	~31 seconds	~31 - 36 seconds
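
The parameters and timing table above can be expressed as a short sketch. This is an illustrative Python version under stated assumptions, not a platform API: `TransientError`, `backoff_delay`, and `call_with_retry` are hypothetical names, and the caller is assumed to raise `TransientError` only for retryable failures (timeout, 5xx, 429). Jitter is added after the cap here, so a capped delay can exceed the max delay by up to the jitter amount.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeout, 5xx, 429)."""


def backoff_delay(
    retry_number: int,
    base: float = 1.0,
    multiplier: float = 2.0,
    max_delay: float = 60.0,
    jitter: float = 1.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry `retry_number` (1-based)."""
    exponential = min(base * multiplier ** (retry_number - 1), max_delay)
    # Random 0..jitter seconds spreads out clients that failed together.
    return exponential + rng() * jitter


def call_with_retry(
    operation: Callable[[], T],
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying transient failures with backoff."""
    for retry_number in range(1, max_retries + 1):
        try:
            return operation()
        except TransientError:
            sleep(backoff_delay(retry_number))
    # Final attempt: if it still fails, the exception propagates so the
    # caller can route the message to a dead letter queue.
    return operation()
```

Passing `sleep` and `rng` as parameters keeps the sketch testable without real waiting. Non-transient exceptions are not caught, so they fail immediately instead of being retried.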

Pattern 2: Circuit Breaker

Stops a system from repeatedly calling an external service that is known to be down. Modeled after electrical circuit breakers.

States

Figure 2. The circuit breaker prevents cascading failures by stopping calls to a known-down system and failing fast instead. The HalfOpen probe allows automatic recovery without manual intervention once the external system comes back online.

Implementation in Salesforce

State	Behavior	Salesforce Implementation
Closed	Normal operation, calls pass through	Standard callout behavior
Open	All calls fail immediately, no callout attempted	Check Custom Metadata / Platform Cache before callout
Half-Open	Allow one test call to check if service recovered	Scheduled job or manual reset attempts one call

Configuration Parameters

Parameter	Recommended	Purpose
Failure threshold	5 consecutive failures	Number of failures before opening circuit
Open timeout	30-60 seconds	How long to wait before testing recovery
Success threshold	2-3 successes in half-open	Successes needed to close circuit
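
The three states and the configuration parameters above can be sketched as a small state machine. This is a language-neutral illustration written in Python, not Apex and not a Salesforce feature: the `CircuitBreaker` class and its method names are hypothetical, and state is held in memory, whereas the table above suggests Custom Metadata or Platform Cache as the state store. For brevity, the sketch lets every call through while half-open; a production breaker would typically admit only one probe at a time.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Minimal in-memory circuit breaker (illustrative sketch)."""

    def __init__(
        self,
        failure_threshold: int = 5,
        open_timeout: float = 60.0,
        success_threshold: int = 2,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.success_threshold = success_threshold
        self.state = "CLOSED"
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0
        self._clock = clock

    def allow_request(self) -> bool:
        """Check before each callout; False means fail fast."""
        if self.state == "OPEN":
            if self._clock() - self._opened_at >= self.open_timeout:
                # Cooldown elapsed: let a probe through.
                self.state = "HALF_OPEN"
                self._successes = 0
                return True
            return False
        return True

    def record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = "CLOSED"
                self._failures = 0
        else:
            self._failures = 0  # only consecutive failures count

    def record_failure(self) -> None:
        if self.state == "HALF_OPEN":
            self._open()  # probe failed: restart the cooldown
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._open()

    def _open(self) -> None:
        self.state = "OPEN"
        self._opened_at = self._clock()
```

A caller would check `allow_request()` before each callout and report the outcome with `record_success()` or `record_failure()` afterwards.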

Pattern 3: Dead Letter Queue (DLQ)

Messages that cannot be processed after all retries go to a dead letter queue for inspection, reprocessing, or alerting.

Flow

Figure 3. The dead letter queue captures messages that failed all retry attempts, preserving them for human review and resubmission. Without a DLQ, failed messages are silently lost and the integration appears to run while data quietly fails to transfer.

Salesforce DLQ Options

Approach	Best For	Persistence
Custom Object (Integration_Error__c)	Full audit trail, reporting	Permanent (until deleted)
Platform Events (Error_Event__e)	Real-time alerting	24-72 hours
Big Object	High-volume error logging	Permanent, archive-oriented
Middleware DLQ (MuleSoft/Anypoint MQ)	Middleware-managed integrations	Configurable retention
External monitoring (Splunk, Datadog)	Centralized ops monitoring	Per tool retention

DLQ Record Design

A well-designed DLQ record captures everything needed for diagnosis and reprocessing:

Field	Purpose
Source System	Where the message originated
Target System	Where it was being sent
Payload	The original message content
Error Message	What went wrong
Error Code	HTTP status, exception type
Retry Count	How many attempts were made
First Failure Timestamp	When it first failed
Last Failure Timestamp	When retries were exhausted
Status	New / Under Review / Resubmitted / Archived
Correlation ID	Links to the original transaction
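
As a rough illustration, the fields above map onto a simple record type. The sketch below uses a Python dataclass with hypothetical names (`DeadLetterRecord`, `DlqStatus`, `record_retry`); in a Salesforce org the same shape would more likely be fields on a custom object such as Integration_Error__c.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class DlqStatus(Enum):
    NEW = "New"
    UNDER_REVIEW = "Under Review"
    RESUBMITTED = "Resubmitted"
    ARCHIVED = "Archived"


@dataclass
class DeadLetterRecord:
    """One failed message, with what is needed to diagnose and replay it."""

    source_system: str
    target_system: str
    payload: str            # original message content, unmodified
    error_message: str
    error_code: str         # HTTP status or exception type
    correlation_id: str     # links back to the original transaction
    first_failure_at: datetime
    last_failure_at: datetime
    retry_count: int = 0
    status: DlqStatus = DlqStatus.NEW

    def record_retry(self, failed_at: datetime) -> None:
        """Note another failed attempt without losing the first timestamp."""
        self.retry_count += 1
        self.last_failure_at = failed_at


# Usage: all values here are made-up sample data.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
record = DeadLetterRecord(
    source_system="Salesforce",
    target_system="ERP",
    payload='{"orderNumber": "SAMPLE-001"}',
    error_message="Service Unavailable",
    error_code="503",
    correlation_id="sample-correlation-id",
    first_failure_at=now,
    last_failure_at=now,
)
```

Keeping the first and last failure timestamps separate lets operations see how long a message spent in retries before it was dead-lettered.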

Pattern 4: Idempotency

Processing the same message multiple times must produce the same result as processing it once. Mandatory for any at-least-once delivery system.

Why It Matters

Platform Events, CDC, and most middleware deliver at-least-once. Duplicates will happen because of:

Network retries at the transport layer
Subscriber reconnection replaying events
Middleware retry on ambiguous failures
Bulk API partial retries

Implementation Strategies

Figure 4. Idempotency key checks prevent duplicate processing when at-least-once delivery systems (Platform Events, CDC, middleware retries) deliver the same message more than once. Keys generated from natural business identifiers are more reliable than payload hashes.

Strategy	How It Works	Pros	Cons
Idempotency key	Client sends unique key; server checks before processing	Most reliable	Requires key storage and lookup
Natural key dedup	Use business key (Order Number) to detect duplicates	No extra infrastructure	Requires unique business key
Upsert operations	Use External ID for upsert instead of insert	Built into Salesforce	Only works for CRUD, not business logic
Payload hash	Hash the message content, check for duplicate hashes	Works without client changes	Hash collisions (rare): different messages may produce the same hash
Timestamp comparison	Only process if timestamp is newer than last processed	Simple	Clock skew issues
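
The idempotency key strategy from the first row can be sketched as follows. This is a minimal Python illustration with hypothetical names (`IdempotentProcessor`, `process`); the key store is an in-memory dictionary, which stands in for the durable storage and lookup a real system would need. The check-then-store sequence is not atomic, so concurrent duplicates would need a unique constraint or lock in practice.

```python
from typing import Any, Callable, Dict


class IdempotentProcessor:
    """Runs a handler at most once per idempotency key (illustrative)."""

    def __init__(self, handler: Callable[[Any], Any]) -> None:
        self._handler = handler
        # Assumption: in-memory store; production needs durable storage.
        self._results: Dict[str, Any] = {}

    def process(self, idempotency_key: str, message: Any) -> Any:
        # Duplicate delivery: return the stored result, do no new work.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self._handler(message)
        self._results[idempotency_key] = result
        return result


# Usage: the second delivery of the same key does not create a second order.
created_orders = []

def create_order(message: dict) -> str:
    created_orders.append(message["orderNumber"])
    return f"created {message['orderNumber']}"

processor = IdempotentProcessor(create_order)
first = processor.process("order-SAMPLE-001", {"orderNumber": "SAMPLE-001"})
second = processor.process("order-SAMPLE-001", {"orderNumber": "SAMPLE-001"})
```

Here the key is derived from a natural business identifier (the order number), which matches the note in Figure 4 that such keys are more reliable than payload hashes.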

Pattern 5: Monitoring and Alerting

Error handling without monitoring is a fire alarm with no sound. Failures must be detected and addressed before they create business impact.

Monitoring Architecture

Figure 5. Tiered alerting routes warning-level events to email and Slack for awareness while critical events page on-call engineers through PagerDuty. All alerts auto-create tickets for traceability, and the operations dashboard provides continuous visibility without alert fatigue.

What to Monitor

Metric	Threshold	Alert Level
Integration failure rate	> 5% of transactions	Warning
Integration failure rate	> 20% of transactions	Critical
DLQ depth	> 100 messages	Warning
DLQ depth growing	Increasing for 30+ minutes	Critical
API call consumption	> 80% of daily limit	Warning
API call consumption	> 95% of daily limit	Critical
Average response time	> 5 seconds (for real-time)	Warning
Circuit breaker open	Any circuit open	Critical
Event subscriber lag	> 1 hour behind	Warning
Event subscriber lag	> 12 hours behind	Critical (approaching retention limit)
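
Paired warning and critical thresholds like these lend themselves to a table-driven check. The sketch below is a hypothetical Python helper, not a feature of any monitoring tool: the metric keys and the `alert_level` function are names invented for this example, and only the three metrics that have both a warning and a critical threshold in the table are included.

```python
from typing import Dict, Optional, Tuple

# metric name -> (warning threshold, critical threshold), from the table above
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "failure_rate_pct": (5.0, 20.0),
    "api_consumption_pct": (80.0, 95.0),
    "subscriber_lag_hours": (1.0, 12.0),
}


def alert_level(metric: str, value: float) -> Optional[str]:
    """Return "Critical", "Warning", or None for a metric reading."""
    warning, critical = THRESHOLDS[metric]
    # Check the critical threshold first so it takes precedence.
    if value > critical:
        return "Critical"
    if value > warning:
        return "Warning"
    return None
```

The comparisons are strict, matching the ">" in the table: a failure rate of exactly 5% raises no alert, while 6% raises a warning.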

Salesforce-Native Monitoring Options

Tool	What It Monitors	Cost
Event Monitoring	API calls, logins, report exports	Shield add-on
Custom Dashboard	Integration_Error__c records	Included
Flow Email Alerts	Trigger on error records	Included
Platform Events	Real-time error broadcasting	Included
CRM Analytics (formerly Einstein Analytics)	Trend analysis on error patterns	Add-on

Combining Patterns: The Complete Error Handling Stack

In a CTA scenario, present a layered error handling strategy, not just a single pattern.

Figure 6. Error classification is the foundation of the full error handling stack. Transient errors retry, systemic failures trigger the circuit breaker to stop wasting resources, and data quality rejections go directly to the DLQ with validation context for the operations team.

End-to-End Failure Scenario: ERP Goes Down

This sequence diagram shows how the patterns work together when the ERP becomes unavailable during order processing. This type of walkthrough scores well at the CTA board.

Figure 7. Walking through a complete ERP outage scenario end-to-end demonstrates how retry, circuit breaker, DLQ, and alerting work together. No orders are lost: they queue in the DLQ and are resubmitted once the circuit closes, with full operations visibility throughout.

Detailed walkthrough

This sequence has five distinct phases. Reading it as a runtime narrative rather than an architecture diagram is exactly how you should present it to the review board.

Phase 1: Normal handoff. Salesforce fires a Platform Event when an order is submitted. Middleware receives it and immediately checks circuit state. The circuit breaker returns CLOSED, meaning the ERP is considered healthy. Middleware makes its first POST to /orders. The ERP returns a 503.

Phase 2: Retry with exponential backoff. The 503 is a retryable error (transient, server-side). Middleware waits one second and tries again. Another 503. It waits two seconds and tries a third time. Another 503. The backoff interval deliberately doubles between attempts (1s, then 2s). A recovering ERP under load needs breathing room. If every failing client retries at identical intervals, the recovered system receives a traffic spike at the exact moment it is trying to stabilize, which can re-collapse it. The increasing wait distributes pressure. Three attempts are enough to distinguish a brief self-correcting flap from a genuine outage.

Phase 3: Circuit trips. After the third failure, middleware reports to the circuit breaker state store. The threshold (three consecutive failures in this walkthrough, lower than the five recommended above) is met and the breaker flips from CLOSED to OPEN. Two things happen simultaneously: the failed order routes to the DLQ, and operations gets a PagerDuty alert plus an auto-created Jira ticket for traceability. Any subsequent order events that arrive while the circuit is OPEN fail fast without touching the ERP. This stops a broken integration from wasting resources and amplifying load on an already-struggling system.

Phase 4: Half-open probe. After 60 seconds, the circuit moves to HALF-OPEN. One test call goes out to the ERP. If it succeeds, the breaker resets to CLOSED. If it fails, it snaps back to OPEN and the cooldown restarts. No bulk traffic crosses until the single probe succeeds.

Phase 5: Recovery and replay. The ERP returns 200 OK on the probe. Circuit closes. Operations receives a recovery notification and triggers a bulk DLQ resubmit. The queued orders replay through middleware to the ERP in sequence. Every order that arrived during the outage is eventually delivered, with a complete audit trail from original Platform Event timestamp through successful resubmit.

The zero-data-loss guarantee comes from the DLQ, not from the retry mechanism. Retries handle transient glitches. The DLQ handles the cases retries cannot resolve. Together they are why this pattern scores well at the board.
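
The five phases can be traced in a compact simulation. This is a deliberately simplified Python sketch, not middleware code: `process_orders` and `probe_and_replay` are hypothetical names, backoff waits and alerting are omitted, the breaker trips after one order exhausts its three attempts, and the oldest queued order doubles as the half-open probe (an assumption made here for brevity). Replaying an order that may have partly succeeded assumes the ERP endpoint is idempotent.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def process_orders(
    orders: List[str],
    send: Callable[[str], int],
    max_attempts: int = 3,
) -> Tuple[List[str], Deque[str], str]:
    """Phases 1-3: attempt delivery, trip the circuit, fill the DLQ.

    `send` returns an HTTP status code. Returns (delivered, dlq, state).
    """
    delivered: List[str] = []
    dlq: Deque[str] = deque()
    circuit = "CLOSED"
    for order in orders:
        if circuit == "OPEN":
            dlq.append(order)  # fail fast: the ERP is not called
            continue
        for _ in range(max_attempts):
            if send(order) == 200:
                delivered.append(order)
                break
            # A real implementation would wait 1s, 2s, ... here.
        else:
            # All attempts failed: trip the breaker, dead-letter the order.
            circuit = "OPEN"
            dlq.append(order)
    return delivered, dlq, circuit


def probe_and_replay(
    dlq: Deque[str],
    send: Callable[[str], int],
    delivered: List[str],
) -> str:
    """Phases 4-5: probe with the oldest order, then replay in sequence."""
    while dlq:
        if send(dlq[0]) != 200:
            return "OPEN"  # probe or replay failed: order stays queued
        delivered.append(dlq.popleft())
    return "CLOSED"
```

An order leaves the DLQ only after the ERP confirms it with a 200, which is the property behind the zero-data-loss claim: a failed probe or an interrupted replay leaves the remaining orders queued in their original sequence.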

Anti-Patterns

Anti-Pattern	Why It Fails	Better Approach
Retry forever	Wastes resources, masks permanent failures	Max retries + DLQ
Retry without backoff	Hammers already-struggling systems	Exponential backoff with jitter
Swallow errors silently	Nobody knows the integration is broken	Log, alert, DLQ
Single retry for all errors	400 Bad Request will never succeed with retry	Classify errors, only retry transient
No idempotency	Duplicate processing on retry	Idempotency keys or upsert
Manual-only error recovery	Does not scale, creates a human bottleneck	Automated reprocessing with manual review for edge cases

Related Topics

Risk Management: integration failures are a top risk category; error handling feeds directly into risk registers
Data Quality & Governance: data quality errors are a major category of integration failures; governance prevents bad data from propagating
Review Board Presentation & Q&A: judges ask “what happens when this fails?” on every integration. Prepare error handling explanations.

Sources

Personal study notes for the Salesforce CTA exam. Content compiled from VJ's study notes, official Salesforce documentation, community sources, and online publicly available content, then organized and presented with AI assistance. Not affiliated with Salesforce. © 2025–2026 VJ Srivastava.