Error Handling: Quick Reference

When the board asks “what happens when this fails?” your answer must cover error classification, retry strategy, circuit breaker behavior, dead letter queue design, and monitoring. For full details, see Error Handling Patterns Deep Dive.

Error Classification - Do This First

Different error responses demand different actions. Retrying a 400 Bad Request forever is an anti-pattern. Failing immediately on a 503 wastes a recoverable situation.

Category	HTTP Codes	Examples	Correct Action	Wrong Action
Transient	408, 429, 500, 502, 503, 504	Timeout, rate limit, service unavailable	Retry with exponential backoff	Fail immediately
Persistent	400, 401, 403, 404, 422	Bad request, unauthorized, not found	Route to DLQ, alert team	Retry (will never succeed)
Systemic	Repeated 503, connection refused	System fully down, cert expired	Circuit breaker, fallback mode	Keep retrying (wastes resources)
Data quality	400, 422 (validation)	Missing fields, invalid format, duplicates	Reject, notify, data cleansing	Silently drop or force through
Capacity	429; 403 REQUEST_LIMIT_EXCEEDED (Salesforce daily limit)	API limit exhausted, governor limits	Queue and defer, throttle	Fail the entire batch
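
The table above can be read as a small decision function. The sketch below is illustrative only: the function name, the flags (is_validation_error, daily_limit_exhausted), the use of None for "connection refused", and the threshold of 5 consecutive failures are assumptions made for the example, not part of any Salesforce or MuleSoft API.

```python
from enum import Enum


class ErrorCategory(Enum):
    TRANSIENT = "retry with exponential backoff"
    PERSISTENT = "route to DLQ and alert"
    SYSTEMIC = "open circuit breaker, use fallback"
    DATA_QUALITY = "reject, notify, cleanse"
    CAPACITY = "queue and defer, throttle"


TRANSIENT_CODES = {408, 429, 500, 502, 503, 504}
PERSISTENT_CODES = {400, 401, 403, 404, 422}


def classify(status, consecutive_failures=0, is_validation_error=False,
             daily_limit_exhausted=False, systemic_threshold=5):
    """Map one failed response to a category from the classification table."""
    if daily_limit_exhausted:
        # Capacity wins over the raw status code: the call is fine, the quota is not.
        return ErrorCategory.CAPACITY
    if status is None or (status == 503 and consecutive_failures >= systemic_threshold):
        # status None stands for "connection refused" (no HTTP response at all).
        return ErrorCategory.SYSTEMIC
    if status in (400, 422) and is_validation_error:
        return ErrorCategory.DATA_QUALITY
    if status in TRANSIENT_CODES:
        return ErrorCategory.TRANSIENT
    # Known persistent codes and anything unrecognized go to the DLQ path,
    # so that unknown failures surface instead of being retried blindly.
    return ErrorCategory.PERSISTENT
```

For example, classify(503) yields TRANSIENT, while classify(503, consecutive_failures=5) yields SYSTEMIC, which is the point at which retrying stops and the circuit breaker takes over.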

The Complete Error Handling Stack

Present this layered strategy at the board. Every integration touchpoint should use this stack.

Figure 1. Error classification is the critical first step: retrying a persistent 400 error wastes resources and will never succeed, while failing immediately on a transient 503 discards a recoverable situation. All exhausted retries and unrecoverable errors converge on the dead letter queue, which must trigger active alerting, not silent parking.

Pattern 1: Retry with Exponential Backoff

Parameters

Parameter	Value	Rationale
Max retries	3-5	Enough for transient; not so many it wastes time on persistent failures
Base delay	1 second	Starting wait before first retry
Multiplier	2x (exponential)	1s -> 2s -> 4s -> 8s -> 16s
Max delay cap	60 seconds	Prevent absurdly long waits
Jitter	Random 0-1s added	Prevents thundering herd
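
As a minimal sketch of how these parameters combine, the following retries an operation using the values from the table as defaults. TransientError, backoff_delay, and call_with_retry are hypothetical names introduced for illustration; the sketch also assumes the cap is applied before jitter is added.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors already classified as transient."""


def backoff_delay(attempt, base=1.0, multiplier=2.0, cap=60.0, jitter=1.0,
                  rng=random.random):
    """Delay before retry number `attempt` (1-based).

    base * multiplier^(attempt - 1), capped at `cap`, plus 0..jitter seconds.
    """
    return min(base * multiplier ** (attempt - 1), cap) + rng() * jitter


def call_with_retry(operation, max_retries=5, sleep=time.sleep, **backoff_kwargs):
    """Run operation(); retry only on TransientError, at most max_retries times."""
    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except TransientError:
            sleep(backoff_delay(attempt, **backoff_kwargs))
    # Last attempt: let the exception propagate so the caller can route to the DLQ.
    return operation()
```

Any other exception type propagates immediately, which matches the rule that only transient errors are retried.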

Retry Timing Table

Attempt	Delay	Cumulative
1	1s (+jitter)	~1-2s
2	2s (+jitter)	~3-5s
3	4s (+jitter)	~7-10s
4	8s (+jitter)	~15-19s
5	16s (+jitter)	~31-36s
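
The cumulative column can be rederived from the parameters. This short sketch (the function name is an assumption for illustration) computes the lower bound with every jitter draw at 0 and the upper bound with every draw at the 1-second maximum.

```python
def cumulative_wait_bounds(attempts=5, base=1.0, multiplier=2.0, cap=60.0,
                           max_jitter=1.0):
    """Return (attempt, min_total_s, max_total_s) rows of cumulative wait time."""
    rows, low, high = [], 0.0, 0.0
    for attempt in range(1, attempts + 1):
        delay = min(base * multiplier ** (attempt - 1), cap)
        low += delay                # every jitter draw is 0
        high += delay + max_jitter  # every jitter draw is the maximum
        rows.append((attempt, low, high))
    return rows
```

With the defaults this gives 1-2s, 3-5s, 7-10s, 15-19s, and 31-36s, matching the table.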

Pattern 2: Circuit Breaker

Three states, three behaviors - the goal is to stop wasting callouts against a system that is clearly down:

Figure 2. The circuit breaker’s open state stops all callout attempts against a known-down system, preventing wasted API quota and avoiding cascading failures. The half-open state allows exactly one test probe to confirm recovery before resuming normal traffic, avoiding the thundering herd that would result from immediately resuming all calls.

State	Behavior	Salesforce Implementation
Closed	Normal - calls pass through, track failures	Standard callout behavior
Open	All calls fail fast - no callout attempted	Check Platform Cache / Custom Metadata before calling
Half-Open	Allow one test call to check recovery	Scheduled job or manual reset tests one call

Configuration

Parameter	Recommended	Purpose
Failure threshold	5 consecutive	Opens the circuit
Open timeout	30-60 seconds	Time before testing recovery
Success threshold	2-3 consecutive successes in half-open	Confirms recovery before closing
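
The three states and the configuration values can be sketched as a small state machine. This is a single-threaded, in-memory illustration, not a Salesforce implementation: on the platform the state would live in Platform Cache or Custom Metadata as described above. The class and method names are assumptions for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without attempting the callout."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, open_timeout=60.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "closed"
        self.failures = 0    # consecutive failures while closed
        self.successes = 0   # consecutive successes while half-open
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_timeout:
                raise CircuitOpenError("failing fast; no callout attempted")
            self.state = "half_open"  # timeout elapsed: let a probe through
            self.successes = 0
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # Any failed probe reopens the circuit; otherwise wait for the threshold.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
            self.failures = 0

    def _on_success(self):
        self.failures = 0
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
```

Because calls are sequential here, only one probe is in flight at a time in the half-open state; a concurrent implementation would need an explicit guard to preserve that property.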

Pattern 3: Dead Letter Queue (DLQ)

Messages that exhaust all retries get parked in a DLQ for inspection, diagnosis, and reprocessing.

DLQ Record Design

Field	Purpose
Source_System__c	Where message originated
Target_System__c	Where it was going
Payload__c	Original message (Long Text Area)
Error_Message__c	What went wrong
Error_Code__c	HTTP status / exception type
Retry_Count__c	How many attempts were made
First_Failure__c	When it first failed
Last_Failure__c	When retries exhausted
Status__c	New / Under Review / Resubmitted / Archived
Correlation_ID__c	Links to original transaction
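
To make the record shape concrete, here is a sketch of the same fields as a Python data class, with a helper that maps them to the custom field names in the table. The class, its methods, and the Python-side attribute names are hypothetical; only the __c field names and the Status values come from the design above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DlqRecord:
    source_system: str
    target_system: str
    payload: str
    error_message: str
    error_code: str
    retry_count: int
    first_failure: datetime
    last_failure: datetime
    correlation_id: str
    status: str = "New"  # New / Under Review / Resubmitted / Archived

    @classmethod
    def from_exhausted_retry(cls, source, target, payload, error_message,
                             error_code, retry_count, first_failure, correlation_id):
        """Build a record at the moment retries run out (last failure = now, UTC)."""
        return cls(source, target, payload, error_message, error_code,
                   retry_count, first_failure, datetime.now(timezone.utc),
                   correlation_id)

    def to_sobject(self):
        """Return a dict keyed by the custom field API names from the table."""
        return {
            "Source_System__c": self.source_system,
            "Target_System__c": self.target_system,
            "Payload__c": self.payload,
            "Error_Message__c": self.error_message,
            "Error_Code__c": self.error_code,
            "Retry_Count__c": self.retry_count,
            "First_Failure__c": self.first_failure.isoformat(),
            "Last_Failure__c": self.last_failure.isoformat(),
            "Status__c": self.status,
            "Correlation_ID__c": self.correlation_id,
        }
```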

Salesforce DLQ Implementation Options

Approach	Best For	Retention
Custom Object (Integration_Error__c)	Audit trail, reporting, dashboards	Permanent
Platform Events (Error_Event__e)	Real-time alerting to monitoring tools	24-72h
Big Object	High-volume error logging	Permanent, archive-oriented
MuleSoft Anypoint MQ DLQ	Middleware-managed integrations	Configurable
External (Splunk, Datadog)	Centralized ops monitoring	Per tool

Pattern 4: Idempotency

At-least-once delivery means duplicates will happen. Every receiver must handle repeated messages without side effects.

Idempotency Strategy Quick Pick

Strategy	How It Works	When to Use
Upsert + External ID	SF upsert is naturally idempotent	Data sync (default choice)
Idempotency key	Sender includes unique key; receiver checks before processing	Custom business logic
Natural key dedup	Check by business key (Order Number) before insert	When unique business key exists
Payload hash	Hash message content, reject duplicates	No client-side key available
Timestamp comparison	Only process if newer than last processed	Simple, but clock skew risk

Figure 3. Idempotency handling must cover the case where no key is provided by the sender. Generating one from a payload hash or natural business key ensures the receiver can still deduplicate. At-least-once delivery guarantees that duplicates will arrive; this pattern ensures they are harmless.
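
The fallback order described in the caption can be sketched as follows: use the sender's key if present, otherwise a natural business key, otherwise a hash of the canonicalized payload. The message layout (idempotency_key, payload, order_number) and all function and class names are assumptions for illustration, and the in-memory set stands in for a durable store.

```python
import hashlib
import json


def resolve_idempotency_key(message, natural_key_field="order_number"):
    """Pick a dedup key: explicit key, else natural business key, else payload hash."""
    if message.get("idempotency_key"):
        return "key:" + str(message["idempotency_key"])
    payload = message.get("payload", {})
    if payload.get(natural_key_field):
        return "natural:" + str(payload[natural_key_field])
    # Sort keys so that field order does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "hash:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotentReceiver:
    def __init__(self, handler):
        self.handler = handler
        self.seen = set()  # in-memory only; a real receiver needs a durable store

    def receive(self, message):
        """Process a message once; return False if it was a duplicate."""
        key = resolve_idempotency_key(message)
        if key in self.seen:
            return False  # duplicate: acknowledge without side effects
        self.handler(message.get("payload", {}))
        self.seen.add(key)  # record only after the handler succeeds
        return True
```

Recording the key only after the handler succeeds means a failed attempt can still be retried, at the cost of relying on the handler itself being safe to repeat if it fails partway.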

Pattern 5: Monitoring and Alerting

Error handling without monitoring means end users discover failures days later. Build alerting first, not as an afterthought.

What to Monitor - Alert Thresholds

Metric	Warning	Critical
Integration failure rate	> 5% of transactions	> 20% of transactions
DLQ depth	> 100 messages	Growing for 30+ min
API call consumption	> 80% of daily limit	> 95% of daily limit
Response time (real-time)	> 5 seconds	> 10 seconds
Circuit breaker state	-	Any circuit open
Event subscriber lag	> 1 hour behind	> 12 hours behind
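
A sketch of how the thresholds above could be evaluated in a monitoring job. The metric keys and function names are hypothetical; the numbers are taken directly from the table. DLQ depth and circuit state get their own functions because their critical conditions are not simple numeric thresholds.

```python
# (warning, critical) pairs; a value must exceed the number to trigger.
THRESHOLDS = {
    "failure_rate_pct": (5, 20),
    "api_consumption_pct": (80, 95),
    "response_time_s": (5, 10),
    "subscriber_lag_h": (1, 12),
}


def severity(metric, value):
    """Return 'ok', 'warning', or 'critical' for a simple numeric metric."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def dlq_severity(depth, minutes_growing):
    """DLQ is critical when it has been growing for 30+ minutes, whatever its depth."""
    if minutes_growing >= 30:
        return "critical"
    if depth > 100:
        return "warning"
    return "ok"


def circuit_severity(open_circuits):
    """Any open circuit is critical; there is no warning level."""
    return "critical" if open_circuits > 0 else "ok"
```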

Monitoring Stack

Layer	Salesforce-Native	External
Metrics collection	Event Monitoring (Shield add-on)	Splunk, Datadog, ELK
Dashboards	Custom dashboard on Integration_Error__c	Grafana, Datadog
Alerting	Flow email alerts, Platform Events	PagerDuty, OpsGenie, Slack
Ticketing	Auto-create Case from Flow	Jira, ServiceNow

Reverse-Engineered Use Cases

Scenario 1: ERP Goes Down During Order Processing

Situation: Salesforce sends orders to SAP via Fire-and-Forget (Platform Events + MuleSoft). SAP goes down for 2 hours during peak.

What you’d present:

First 5 failures: MuleSoft retries with exponential backoff (1s, 2s, 4s, 8s, 16s + jitter)
After 5 failures: Circuit breaker opens. Subsequent orders fail fast (no SAP call attempted)
Failed orders: Route to Anypoint MQ dead letter queue with full payload and error context
Alert: PagerDuty pages integration team; auto-created Jira ticket
Recovery: After 60s, circuit breaker half-opens and tests one order. SAP is still down, so the probe fails and the circuit returns to open
SAP recovers: Half-open test succeeds. Circuit closes. Normal flow resumes
DLQ replay: Integration team bulk-replays 2 hours of queued orders from DLQ
Idempotency: SAP uses order number as idempotency key - replayed orders that partially processed are safe
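
The replay and idempotency steps above can be sketched as a loop over the queued records. Everything here is a stand-in: replay_dlq is a hypothetical helper, records are plain dicts, and processed_order_numbers simulates the check that SAP performs on its side using the order number.

```python
def replay_dlq(records, send, processed_order_numbers):
    """Replay DLQ entries oldest first; return counts per outcome.

    records: dicts with order_number, payload, first_failure, status.
    send: callable that delivers one payload and raises on failure.
    processed_order_numbers: set of orders the target has already handled.
    """
    counts = {"resubmitted": 0, "skipped_duplicate": 0, "still_failing": 0}
    for record in sorted(records, key=lambda r: r["first_failure"]):
        order_number = record["order_number"]
        if order_number in processed_order_numbers:
            # Already processed before the outage cut the response off.
            record["status"] = "Archived"
            counts["skipped_duplicate"] += 1
            continue
        try:
            send(record["payload"])
        except Exception:
            counts["still_failing"] += 1  # leave status unchanged for review
            continue
        processed_order_numbers.add(order_number)
        record["status"] = "Resubmitted"
        counts["resubmitted"] += 1
    return counts
```

Replaying in order of first failure keeps the original order sequence, which matters if later orders depend on earlier ones.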

Scenario 2: Bulk API Partial Failure

Situation: Nightly sync of 500,000 Account records from data warehouse via Bulk API 2.0. Job completes with 498,000 success and 2,000 failures.

What you’d present:

Successful records: Committed normally (no rollback of successes)
Failed records: Download error results file (GET /jobs/ingest/{id}/failedResults)
Classify failures: 1,800 validation rule failures (data quality), 200 duplicate External ID conflicts
Data quality errors: Route to data steward dashboard for cleansing, fix source data, resubmit only failed records
Duplicate errors: Investigate the cause (likely a stale dedup window). If the job uses insert, switch to upsert on the External ID
Monitoring: Dashboard shows 99.6% success rate (within SLA), alert on the 2,000 failures for review
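
A sketch of the classification step, assuming the failed-results file is CSV with an sf__Error column whose value begins with a status code followed by a colon. The grouping of status codes into categories and the function names are assumptions for illustration.

```python
import csv
import io
from collections import Counter

# Assumed grouping of status codes into the categories used in this scenario.
DATA_QUALITY_CODES = {"FIELD_CUSTOM_VALIDATION_EXCEPTION", "REQUIRED_FIELD_MISSING"}
DUPLICATE_CODES = {"DUPLICATE_VALUE"}


def classify_failed_results(csv_text):
    """Count failed rows per category based on the leading code in sf__Error."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        code = (row.get("sf__Error") or "").split(":", 1)[0].strip()
        if code in DATA_QUALITY_CODES:
            counts["data_quality"] += 1
        elif code in DUPLICATE_CODES:
            counts["duplicate"] += 1
        else:
            counts["other"] += 1  # needs manual triage
    return counts


def success_rate_pct(total, failed):
    """Percentage of records that succeeded, rounded to one decimal place."""
    return round(100.0 * (total - failed) / total, 1)
```

For this scenario, success_rate_pct(500000, 2000) gives the 99.6% shown on the dashboard.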

Scenario 3: Event Subscriber Falls Behind

Situation: External analytics system subscribes to CDC on Opportunity via Pub/Sub API. The analytics system goes down for maintenance over a 4-day weekend. CDC retention is 3 days.

What you’d present:

Day 1-3: Events accumulate on bus. When subscriber reconnects, it replays from last checkpoint
Day 3+: Events older than 3 days are lost - beyond retention window
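Recovery: Subscriber detects the gap on reconnect (its last checkpoint is older than the 3-day retention window, so replay cannot resume from it) and triggers a batch reconciliation job
Reconciliation: Run a Bulk API 2.0 query for all Opportunities modified in the last 5 days (the 4-day outage plus a buffer) and sync them in full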
Prevention: Monitor subscriber lag; alert when lag > 12 hours (gives 2.5 days to fix before data loss)
Design improvement: Hybrid architecture - CDC for near-real-time updates, nightly batch sync as safety net
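
The lag arithmetic behind the prevention step can be written out as a small sketch. The function names are hypothetical; the 72-hour default reflects the 3-day retention stated in the scenario.

```python
def hours_until_data_loss(lag_hours, retention_hours=72.0):
    """Time left before the oldest unread event falls out of retention."""
    return max(retention_hours - lag_hours, 0.0)


def reconciliation_needed(outage_hours, retention_hours=72.0):
    """True when the outage outlasted retention, so replay alone cannot catch up."""
    return outage_hours > retention_hours
```

Alerting at 12 hours of lag leaves hours_until_data_loss(12) = 60 hours, the 2.5 days mentioned above, and a 4-day (96-hour) outage makes reconciliation_needed(96) true.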

Anti-Pattern Quick Reference

Anti-Pattern	Why It Fails	Do This Instead
Retry forever	Wastes resources, masks permanent failures	Max retries + DLQ
Retry without backoff	Hammers struggling system	Exponential backoff + jitter
Retry all errors equally	400 will never succeed on retry	Classify first, only retry transient
Swallow errors silently	Nobody knows integration is broken	Log + alert + DLQ
No idempotency	Duplicates on retry	External ID upsert or idempotency keys
Manual-only recovery	Does not scale	Automated retry + manual for edge cases
Monitor reactively	Users discover failures days later	Proactive alerting with thresholds

Sources

Personal study notes for the Salesforce CTA exam. Content compiled from VJ's study notes, official Salesforce documentation, community sources, and online publicly available content, then organized and presented with AI assistance. Not affiliated with Salesforce. © 2025–2026 VJ Srivastava.