Error Handling Patterns
Error handling separates architects who design for the happy path from those who design for reality. External systems go down. Networks fail. Data arrives malformed. Rate limits get exceeded. These patterns cover what every CTA must know to handle failures properly.
Error Categories
Classify the error before choosing a pattern. Different error types demand different responses.
| Category | Examples | Correct Response | Wrong Response |
|---|---|---|---|
| Transient | Network timeout, 503 Service Unavailable, rate limit (429) | Retry with backoff | Fail immediately |
| Persistent | 404 Not Found, 400 Bad Request, invalid data | Route to dead letter queue, alert | Retry infinitely |
| Systemic | External system fully down, certificate expired | Circuit breaker, fallback | Keep retrying (wastes resources) |
| Data quality | Missing required fields, invalid format, duplicates | Reject and notify, data cleansing | Silently drop or force through |
| Capacity | Bulk API daily limit reached, governor limits | Queue and defer, throttle | Fail the entire batch |
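As a rough illustration of "classify first, then pick a response", the sketch below maps the HTTP examples from the table to a category and its response. The names `classify` and `correct_response` are hypothetical, only the transient and persistent rows are covered, and treating every unknown failure as persistent is an assumption made for this example.

```python
from enum import Enum
from typing import Optional


class ErrorCategory(Enum):
    TRANSIENT = "transient"
    PERSISTENT = "persistent"


# Status codes taken from the table's examples; extend per integration.
TRANSIENT_STATUSES = {429, 503}

RESPONSES = {
    ErrorCategory.TRANSIENT: "retry with backoff",
    ErrorCategory.PERSISTENT: "route to dead letter queue and alert",
}


def classify(status_code: Optional[int] = None, timed_out: bool = False) -> ErrorCategory:
    """Classify a failed call before any handling pattern is chosen."""
    if timed_out or status_code in TRANSIENT_STATUSES:
        return ErrorCategory.TRANSIENT
    # Assumption: anything not known to be transient is treated as persistent,
    # so unknown failures get inspected instead of being retried endlessly.
    return ErrorCategory.PERSISTENT


def correct_response(status_code: Optional[int] = None, timed_out: bool = False) -> str:
    """Return the response the table recommends for this failure."""
    return RESPONSES[classify(status_code, timed_out)]
```

Calling `correct_response(503)` selects the retry path, while `correct_response(400)` goes straight to the dead letter queue without wasting retries.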
Pattern 1: Retry with Exponential Backoff
Retry failed operations with increasing delays between attempts. The foundation of transient error handling.
How It Works
Implementation Parameters
| Parameter | Recommended Value | Rationale |
|---|---|---|
| Max retries | 3-5 | Enough for transient issues, not so many that persistent failures waste time |
| Base delay | 1 second | Starting delay before first retry |
| Max delay | 60 seconds | Cap to prevent excessively long waits |
| Backoff multiplier | 2x (exponential) | 1s, 2s, 4s, 8s, 16s… |
| Jitter | Random 0-1s added | Prevents thundering herd when many clients retry simultaneously |
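The parameters above can be sketched as a small retry helper. This is a minimal illustration, not a production client: `retry_with_backoff` and `TransientError` are hypothetical names, the defaults mirror the table, and the `sleep` argument is injectable only so the timing can be observed without real waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Raised by the operation for failures worth retrying (e.g. a 503)."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    multiplier: float = 2.0,
    max_jitter: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient failures with capped exponential backoff."""
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            if attempt >= max_retries:
                raise  # retries exhausted; the caller decides what happens next
            # Delay grows 1s, 2s, 4s, ... and never exceeds max_delay.
            delay = min(base_delay * multiplier ** attempt, max_delay)
            # Random jitter keeps many clients from retrying at the same instant.
            sleep(delay + random.uniform(0.0, max_jitter))
            attempt += 1
```

Only `TransientError` is retried here; any other exception propagates immediately, which matches the rule that persistent errors should not be retried.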
Retry Timing Example
| Attempt | Delay (no jitter) | Delay (with jitter) |
|---|---|---|
| 1 | 1 second | 1.0 - 2.0 seconds |
| 2 | 2 seconds | 2.0 - 3.0 seconds |
| 3 | 4 seconds | 4.0 - 5.0 seconds |
| 4 | 8 seconds | 8.0 - 9.0 seconds |
| 5 | 16 seconds | 16.0 - 17.0 seconds |
| Total | 31 seconds | 31 - 36 seconds |
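The timing table can be reproduced with a few lines of arithmetic. The function name `backoff_schedule` is hypothetical; its defaults assume the recommended values (1 second base delay, 2x multiplier, 60 second cap, 0-1 second jitter).

```python
def backoff_schedule(retries=5, base_delay=1.0, multiplier=2.0,
                     max_delay=60.0, max_jitter=1.0):
    """Return (shortest, longest) possible delay in seconds for each retry."""
    schedule = []
    for attempt in range(retries):
        delay = min(base_delay * multiplier ** attempt, max_delay)
        schedule.append((delay, delay + max_jitter))
    return schedule


schedule = backoff_schedule()
total_min = sum(low for low, _ in schedule)    # 31.0 seconds without jitter
total_max = sum(high for _, high in schedule)  # 36.0 seconds with maximum jitter
```

The no-jitter total is the geometric sum 1 + 2 + 4 + 8 + 16 = 31, and five retries with up to one second of jitter each add at most five seconds more.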
Pattern 2: Circuit Breaker
Stops a system from repeatedly calling an external service that is known to be down. Modeled after electrical circuit breakers.
States
Implementation in Salesforce
| State | Behavior | Salesforce Implementation |
|---|---|---|
| Closed | Normal operation, calls pass through | Standard callout behavior |
| Open | All calls fail immediately, no callout attempted | Check Custom Metadata / Platform Cache before callout |
| Half-Open | Allow one test call to check if service recovered | Scheduled job or manual reset attempts one call |
Configuration Parameters
| Parameter | Recommended | Purpose |
|---|---|---|
| Failure threshold | 5 consecutive failures | Number of failures before opening circuit |
| Open timeout | 30-60 seconds | How long to wait before testing recovery |
| Success threshold | 2-3 successes in half-open | Successes needed to close circuit |
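The three states and the configuration parameters can be sketched as a small in-memory state machine. This is an illustrative, single-threaded sketch: the class and method names are hypothetical, and in Salesforce the state would live in a shared store such as the Custom Metadata or Platform Cache options listed above rather than in an object field. The `clock` argument is injectable only so the open timeout can be exercised without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, open_timeout=60.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.success_threshold = success_threshold
        self._clock = clock
        self.state = "CLOSED"
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def call(self, operation):
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.open_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Cooldown elapsed: let test traffic through.
            self.state = "HALF_OPEN"
            self._successes = 0
        try:
            result = operation()
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self._failures += 1
        # Any failure while half-open reopens the circuit immediately.
        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self.state = "OPEN"
            self._opened_at = self._clock()
            self._failures = 0

    def _record_success(self):
        if self.state == "HALF_OPEN":
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = "CLOSED"
                self._failures = 0
        else:
            self._failures = 0  # only consecutive failures count
```

Wrapping a callout as `breaker.call(lambda: send_order(order))`, where `send_order` stands in for whatever performs the callout, means callers see `CircuitOpenError` immediately while the circuit is open instead of waiting on a timeout.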
Pattern 3: Dead Letter Queue (DLQ)
Messages that cannot be processed after all retries go to a dead letter queue for inspection, reprocessing, or alerting.
Salesforce DLQ Options
| Approach | Best For | Persistence |
|---|---|---|
| Custom Object (Integration_Error__c) | Full audit trail, reporting | Permanent (until deleted) |
| Platform Events (Error_Event__e) | Real-time alerting | 24-72 hours |
| Big Object | High-volume error logging | Permanent, archive-oriented |
| Middleware DLQ (MuleSoft/Anypoint MQ) | Middleware-managed integrations | Configurable retention |
| External monitoring (Splunk, Datadog) | Centralized ops monitoring | Per tool retention |
DLQ Record Design
A well-designed DLQ record captures everything needed for diagnosis and reprocessing:
| Field | Purpose |
|---|---|
| Source System | Where the message originated |
| Target System | Where it was being sent |
| Payload | The original message content |
| Error Message | What went wrong |
| Error Code | HTTP status, exception type |
| Retry Count | How many attempts were made |
| First Failure Timestamp | When it first failed |
| Last Failure Timestamp | When retries were exhausted |
| Status | New / Under Review / Resubmitted / Archived |
| Correlation ID | Links to the original transaction |
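For illustration, the fields above could be modelled as a simple record type. This is a sketch only: the class, attribute, and method names are hypothetical, and in Salesforce the same shape would more likely be a custom object such as Integration_Error__c.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class DlqStatus(Enum):
    NEW = "New"
    UNDER_REVIEW = "Under Review"
    RESUBMITTED = "Resubmitted"
    ARCHIVED = "Archived"


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class DeadLetterRecord:
    source_system: str
    target_system: str
    payload: str
    error_message: str
    error_code: str
    correlation_id: str
    retry_count: int = 0
    first_failure_at: datetime = field(default_factory=_utc_now)
    last_failure_at: Optional[datetime] = None
    status: DlqStatus = DlqStatus.NEW

    def record_failure(self, error_message: str, error_code: str) -> None:
        """Update the record after another failed retry."""
        self.retry_count += 1
        self.error_message = error_message
        self.error_code = error_code
        self.last_failure_at = _utc_now()

    def mark_resubmitted(self) -> None:
        """Flag the record once it has been sent for reprocessing."""
        self.status = DlqStatus.RESUBMITTED
```

Keeping the original payload and correlation ID on the record is what makes later reprocessing possible without going back to the source system.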
Pattern 4: Idempotency
Processing the same message multiple times must produce the same result. Mandatory for any at-least-once delivery system.
Why It Matters
Platform Events, CDC, and most middleware deliver at-least-once. Duplicates will happen because of:
- Network retries at the transport layer
- Subscriber reconnection replaying events
- Middleware retry on ambiguous failures
- Bulk API partial retries
Implementation Strategies
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Idempotency key | Client sends unique key; server checks before processing | Most reliable | Requires key storage and lookup |
| Natural key dedup | Use business key (Order Number) to detect duplicates | No extra infrastructure | Requires unique business key |
| Upsert operations | Use External ID for upsert instead of insert | Built into Salesforce | Only works for CRUD, not business logic |
| Payload hash | Hash the message content, check for duplicate hashes | Works without client changes | Hash collisions (rare) can make two different messages look like duplicates |
| Timestamp comparison | Only process if timestamp is newer than last processed | Simple | Clock skew issues |
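Two of the strategies above, the idempotency key and the payload hash, can be sketched together. The names `IdempotentProcessor` and `payload_hash` are hypothetical, and the in-memory dictionary stands in for whatever durable key store a real integration would use.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Optional


def payload_hash(message: Dict[str, Any]) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(message, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotentProcessor:
    def __init__(self, handler: Callable[[Dict[str, Any]], Any]) -> None:
        self._handler = handler
        self._results: Dict[str, Any] = {}  # assumption: in-memory store for the sketch

    def process(self, message: Dict[str, Any],
                idempotency_key: Optional[str] = None) -> Any:
        # Prefer the client-supplied key; fall back to a payload hash.
        key = idempotency_key or payload_hash(message)
        if key in self._results:
            return self._results[key]  # duplicate delivery: skip the handler
        result = self._handler(message)
        self._results[key] = result
        return result
```

The hash fallback also treats two genuinely separate but identical messages as one, which is why a client-supplied key is the more reliable option when the sender can provide it.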
Pattern 5: Monitoring and Alerting
Error handling without monitoring is a fire alarm with no sound. Failures must be detected and addressed before they create business impact.
Monitoring Architecture
What to Monitor
| Metric | Threshold | Alert Level |
|---|---|---|
| Integration failure rate | > 5% of transactions | Warning |
| Integration failure rate | > 20% of transactions | Critical |
| DLQ depth | > 100 messages | Warning |
| DLQ depth growing | Increasing for 30+ minutes | Critical |
| API call consumption | > 80% of daily limit | Warning |
| API call consumption | > 95% of daily limit | Critical |
| Average response time | > 5 seconds (for real-time) | Warning |
| Circuit breaker open | Any circuit open | Critical |
| Event subscriber lag | > 1 hour behind | Warning |
| Event subscriber lag | > 12 hours behind | Critical (approaching retention limit) |
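The two-level thresholds in the table can be evaluated with a small lookup. This sketch covers only the metrics that have both a warning and a critical threshold; the metric keys and the `alert_level` function are hypothetical names, and rates are expressed as fractions (0.05 for 5%).

```python
from typing import Optional

# (warning threshold, critical threshold) per metric, taken from the table.
THRESHOLDS = {
    "failure_rate": (0.05, 0.20),
    "api_consumption": (0.80, 0.95),
    "subscriber_lag_hours": (1.0, 12.0),
}


def alert_level(metric: str, value: float) -> Optional[str]:
    """Return "Critical", "Warning", or None when the value is within limits."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "Critical"
    if value > warning:
        return "Warning"
    return None
```

The critical threshold is checked first so that a value exceeding both thresholds raises only the more severe alert.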
Salesforce-Native Monitoring Options
| Tool | What It Monitors | Cost |
|---|---|---|
| Event Monitoring | API calls, logins, report exports | Shield add-on |
| Custom Dashboard | Integration_Error__c records | Included |
| Flow Email Alerts | Trigger on error records | Included |
| Platform Events | Real-time error broadcasting | Included |
| CRM Analytics (formerly Einstein Analytics) | Trend analysis on error patterns | Add-on |
Combining Patterns: The Complete Error Handling Stack
In a CTA scenario, present a layered error handling strategy, not just a single pattern.
End-to-End Failure Scenario: ERP Goes Down
This sequence diagram shows how the patterns work together when the ERP becomes unavailable during order processing. This type of walkthrough scores well at the CTA board.
Detailed walkthrough
This sequence has five distinct phases. Reading it as a runtime narrative rather than an architecture diagram is exactly how you should present it to the review board.
Phase 1: Normal handoff. Salesforce fires a Platform Event when an order is submitted. Middleware receives it and immediately checks circuit state. The circuit breaker returns CLOSED, meaning the ERP is considered healthy. Middleware makes its first POST to /orders. The ERP returns a 503.
Phase 2: Retry with exponential backoff. The 503 is a retryable error (transient, server-side). Middleware waits one second and tries again. Another 503. It waits two seconds and tries a third time. Another 503. The backoff interval doubles between attempts (1s, 2s) deliberately. A recovering ERP under load needs breathing room. If every failing client retries at identical intervals, the recovered system receives a traffic spike at the exact moment it is trying to stabilize, which can re-collapse it. The increasing wait eases that pressure, and jitter keeps clients from retrying in lockstep. Three attempts are enough to distinguish a brief self-correcting flap from a genuine outage.
Phase 3: Circuit trips. After the third failure, middleware reports to the circuit breaker state store. The threshold, three consecutive failures in this scenario, is met and the breaker flips from CLOSED to OPEN. Two things happen simultaneously: the failed order routes to the DLQ, and operations gets a PagerDuty alert plus an auto-created Jira ticket for traceability. Any subsequent order events that arrive while the circuit is OPEN fail fast without touching the ERP. This stops a broken integration from wasting resources and amplifying load on an already-struggling system.
Phase 4: Half-open probe. After 60 seconds, the circuit moves to HALF-OPEN. One test call goes out to the ERP. If it succeeds, the breaker resets to CLOSED. If it fails, it snaps back to OPEN and the cooldown restarts. No bulk traffic crosses until the single probe succeeds.
Phase 5: Recovery and replay. The ERP returns 200 OK on the probe. Circuit closes. Operations receives a recovery notification and triggers a bulk DLQ resubmit. The queued orders replay through middleware to the ERP in sequence. Every order that arrived during the outage is eventually delivered, with a complete audit trail from original Platform Event timestamp through successful resubmit.
The zero-data-loss guarantee comes from the DLQ, not from the retry mechanism. Retries handle transient glitches. The DLQ handles the cases retries cannot resolve. Together they are why this pattern scores well at the board.
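The five phases can be condensed into a toy simulation. Everything here is a hypothetical stand-in: `FakeErp` replaces the real ERP, the waits between retries and the 60-second cooldown are omitted, the probe is modelled as a separate health check, and routing fail-fast orders to the DLQ is an assumption made so that every order from the outage can be replayed.

```python
from collections import deque


class FakeErp:
    """Hypothetical ERP stub: answers 503 while down, 200 once recovered."""

    def __init__(self) -> None:
        self.up = False
        self.calls = 0
        self.received = []

    def post_order(self, order_id: str) -> int:
        self.calls += 1
        if not self.up:
            return 503
        self.received.append(order_id)
        return 200

    def health_check(self) -> int:
        return 200 if self.up else 503


class Middleware:
    MAX_ATTEMPTS = 3  # first call plus two retries, as in the walkthrough

    def __init__(self, erp: FakeErp) -> None:
        self.erp = erp
        self.circuit = "CLOSED"
        self.dlq = deque()
        self.alerts = []

    def handle_order(self, order_id: str) -> str:
        if self.circuit == "OPEN":
            self.dlq.append(order_id)  # assumption: fail-fast orders are dead-lettered too
            return "failed fast"
        for _ in range(self.MAX_ATTEMPTS):
            # A real client would wait 1s, then 2s, between these attempts.
            if self.erp.post_order(order_id) == 200:
                return "delivered"
        self.circuit = "OPEN"
        self.dlq.append(order_id)
        self.alerts.append("circuit opened")
        return "dead-lettered"

    def probe(self) -> None:
        """Half-open probe, run once the cooldown has elapsed."""
        self.circuit = "HALF_OPEN"
        if self.erp.health_check() == 200:
            self.circuit = "CLOSED"
            self.alerts.append("recovered")
        else:
            self.circuit = "OPEN"

    def replay_dlq(self) -> None:
        """Resubmit queued orders in arrival order; stop if the ERP fails again."""
        while self.dlq and self.circuit == "CLOSED":
            order_id = self.dlq.popleft()
            if self.erp.post_order(order_id) != 200:
                self.dlq.appendleft(order_id)
                self.circuit = "OPEN"
```

Running an order through `handle_order` while the stub is down, then setting `erp.up = True`, calling `probe()` and `replay_dlq()`, walks through the same sequence: three failed calls, a tripped circuit, a fail-fast period, and a replay that delivers the queued orders in order.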
Anti-Patterns
| Anti-Pattern | Why It Fails | Better Approach |
|---|---|---|
| Retry forever | Wastes resources, masks permanent failures | Max retries + DLQ |
| Retry without backoff | Hammers already-struggling systems | Exponential backoff with jitter |
| Swallow errors silently | Nobody knows the integration is broken | Log, alert, DLQ |
| One retry policy for all errors | 400 Bad Request will never succeed with retry | Classify errors, only retry transient |
| No idempotency | Duplicate processing on retry | Idempotency keys or upsert |
| Manual-only error recovery | Does not scale, creates a human bottleneck | Automated reprocessing with manual review for edge cases |
Related Topics
- Risk Management: integration failures are a top risk category; error handling feeds directly into risk registers
- Data Quality & Governance: data quality errors are a major category of integration failures; governance prevents bad data from propagating
- Review Board Presentation & Q&A: judges ask “what happens when this fails?” on every integration. Prepare error handling explanations.
Personal study notes for the Salesforce CTA exam. Content compiled from VJ's study notes, official Salesforce documentation, community sources, and online publicly available content, then organized and presented with AI assistance. Not affiliated with Salesforce. © 2025–2026 VJ Srivastava.