
Error Handling: Quick Reference

When the board asks “what happens when this fails?” your answer must cover error classification, retry strategy, circuit breaker behavior, dead letter queue design, and monitoring. For full details, see Error Handling Patterns Deep Dive.

Different error responses demand different actions. Retrying a 400 Bad Request forever is an anti-pattern. Failing immediately on a 503 wastes a recoverable situation.

| Category | HTTP Codes | Examples | Correct Action | Wrong Action |
|---|---|---|---|---|
| Transient | 408, 429, 500, 502, 503, 504 | Timeout, rate limit, service unavailable | Retry with exponential backoff | Fail immediately |
| Persistent | 400, 401, 403, 404, 422 | Bad request, unauthorized, not found | Route to DLQ, alert team | Retry (will never succeed) |
| Systemic | Repeated 503, connection refused | System fully down, cert expired | Circuit breaker, fallback mode | Keep retrying (wastes resources) |
| Data quality | 400, 422 (validation) | Missing fields, invalid format, dupes | Reject, notify, data cleansing | Silently drop or force through |
| Capacity | 429 (Salesforce daily limit) | API limit exhausted, governor limits | Queue and defer, throttle | Fail the entire batch |
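
As a rough illustration of the classification table, the sketch below maps a failed response to one of the five categories. The function name `classify`, its flags (`validation_error`, `daily_limit_exhausted`), and the threshold of 5 consecutive failures for "systemic" are assumptions made for this example, not part of any Salesforce or MuleSoft API.

```python
from enum import Enum


class ErrorCategory(Enum):
    TRANSIENT = "retry with exponential backoff"
    PERSISTENT = "route to DLQ, alert team"
    SYSTEMIC = "circuit breaker, fallback mode"
    DATA_QUALITY = "reject, notify, data cleansing"
    CAPACITY = "queue and defer, throttle"


TRANSIENT_CODES = {408, 429, 500, 502, 503, 504}


def classify(status, consecutive_failures=0, validation_error=False,
             daily_limit_exhausted=False, systemic_threshold=5):
    """Map one failed call to a category. status=None means connection refused."""
    retryable = status is None or status in TRANSIENT_CODES
    # An exhausted daily limit will not clear within a few seconds of backoff.
    if status == 429 and daily_limit_exhausted:
        return ErrorCategory.CAPACITY
    # Repeated retryable failures suggest the whole target system is down.
    if retryable and consecutive_failures >= systemic_threshold:
        return ErrorCategory.SYSTEMIC
    if status in (400, 422) and validation_error:
        return ErrorCategory.DATA_QUALITY
    if retryable:
        return ErrorCategory.TRANSIENT
    # Everything else (401, 403, 404, unknown codes) goes to a human via the DLQ.
    return ErrorCategory.PERSISTENT
```

For example, `classify(503)` yields `TRANSIENT`, while `classify(503, consecutive_failures=5)` yields `SYSTEMIC`, so the same status code leads to a different action once the failure pattern changes.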

Present this layered strategy at the board. Every integration touchpoint should use this stack.

Layered error handling flow classifying failures as transient, systemic, or persistent and routing them through retry with backoff, circuit breaker, or direct dead letter queue as appropriate.
Figure 1. Error classification is the critical first step: retrying a persistent 400 error wastes resources and will never succeed, while failing immediately on a transient 503 discards a recoverable situation. All exhausted retries and unrecoverable errors converge on the dead letter queue, which must trigger active alerting, not silent parking.

| Parameter | Value | Rationale |
|---|---|---|
| Max retries | 3-5 | Enough for transient; not so many it wastes time on persistent failures |
| Base delay | 1 second | Starting wait before first retry |
| Multiplier | 2x (exponential) | 1s → 2s → 4s → 8s → 16s |
| Max delay cap | 60 seconds | Prevent absurdly long waits |
| Jitter | Random 0-1s added | Prevents thundering herd |

| Attempt | Delay | Cumulative |
|---|---|---|
| 1 | 1s (+jitter) | ~1-2s |
| 2 | 2s (+jitter) | ~3-5s |
| 3 | 4s (+jitter) | ~7-10s |
| 4 | 8s (+jitter) | ~15-19s |
| 5 | 16s (+jitter) | ~31-36s |
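
The retry parameters above can be sketched as a small delay function plus a retry wrapper. This is a minimal illustration, not MuleSoft's or Salesforce's implementation; `backoff_delay`, `call_with_retry`, and the `TransientError` marker exception are hypothetical names introduced here.

```python
import random
import time


def backoff_delay(attempt, base=1.0, multiplier=2.0, cap=60.0,
                  max_jitter=1.0, rng=random):
    """Seconds to wait before retry number `attempt` (1-based)."""
    exponential = min(cap, base * multiplier ** (attempt - 1))
    # Jitter spreads out clients that failed at the same moment.
    return exponential + rng.uniform(0.0, max_jitter)


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (for example a 503)."""


def call_with_retry(operation, max_retries=5, sleep=time.sleep):
    """Run `operation`, retrying transient failures with backoff."""
    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except TransientError:
            sleep(backoff_delay(attempt))
    # Last try: any exception propagates so the caller can route to the DLQ.
    return operation()
```

With jitter disabled (`max_jitter=0.0`), retries 1 through 5 wait 1, 2, 4, 8, and 16 seconds, matching the delay column in the table. Passing `sleep` as a parameter keeps the wrapper testable without real waiting.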

Three states, three behaviors - the goal is to stop wasting callouts against a system that is clearly down:

Three-state diagram showing transitions between Closed, Open, and Half-Open circuit breaker states based on failure threshold, timeout expiry, and test call results.
Figure 2. The circuit breaker’s open state stops all callout attempts against a known-down system, preventing wasted API quota and avoiding cascading failures. The half-open state allows exactly one test probe to confirm recovery before resuming normal traffic, avoiding the thundering herd that would result from immediately resuming all calls.

| State | Behavior | Salesforce Implementation |
|---|---|---|
| Closed | Normal - calls pass through, track failures | Standard callout behavior |
| Open | All calls fail fast - no callout attempted | Check Platform Cache / Custom Metadata before calling |
| Half-Open | Allow one test call to check recovery | Scheduled job or manual reset tests one call |

| Parameter | Recommended | Purpose |
|---|---|---|
| Failure threshold | 5 consecutive | Opens the circuit |
| Open timeout | 30-60 seconds | Time before testing recovery |
| Success threshold | 2-3 in half-open | Confirms recovery before closing |
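
The three states and their parameters can be sketched as a small state machine. This is a single-threaded, in-memory illustration only: on Salesforce the state would live in Platform Cache or Custom Metadata as noted in the table, and the class and attribute names here are assumptions for the example. It also does not guard against concurrent probes in the half-open state.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the target."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, open_timeout=60.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "closed"
        self.failures = 0    # consecutive failures while closed
        self.successes = 0   # consecutive successes while half-open
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_timeout:
                raise CircuitOpenError("failing fast, no callout attempted")
            # Timeout expired: let a test call through.
            self.state = "half_open"
            self.successes = 0
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe, or too many failures while closed, opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
            self.failures = 0

    def _on_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
        self.failures = 0
```

The injectable `clock` is a design choice for testability: a test can advance time past the open timeout without sleeping.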

Messages that exhaust all retries get parked in a DLQ for inspection, diagnosis, and reprocessing.

| Field | Purpose |
|---|---|
| Source_System__c | Where message originated |
| Target_System__c | Where it was going |
| Payload__c | Original message (Long Text Area) |
| Error_Message__c | What went wrong |
| Error_Code__c | HTTP status / exception type |
| Retry_Count__c | How many attempts were made |
| First_Failure__c | When it first failed |
| Last_Failure__c | When retries exhausted |
| Status__c | New / Under Review / Resubmitted / Archived |
| Correlation_ID__c | Links to original transaction |

| Approach | Best For | Retention |
|---|---|---|
| Custom Object (Integration_Error__c) | Audit trail, reporting, dashboards | Permanent |
| Platform Events (Error_Event__e) | Real-time alerting to monitoring tools | 24-72h |
| Big Object | High-volume error logging | Permanent, archive-oriented |
| MuleSoft Anypoint MQ DLQ | Middleware-managed integrations | Configurable |
| External (Splunk, Datadog) | Centralized ops monitoring | Per tool |
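
To make the DLQ field list concrete, here is a sketch of the record a middleware layer might build before writing to `Integration_Error__c`. The `IntegrationError` class and its `to_sobject` method are hypothetical helpers for illustration; only the `__c` field names come from the table above, and generating a UUID when no correlation ID is supplied is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime
import uuid


@dataclass
class IntegrationError:
    """In-memory stand-in for one Integration_Error__c row."""
    source_system: str
    target_system: str
    payload: str
    error_message: str
    error_code: str
    retry_count: int
    first_failure: datetime
    last_failure: datetime
    status: str = "New"
    # Assumption: fall back to a random UUID if the caller has no correlation ID.
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_sobject(self):
        """Return a dict keyed by the custom field API names."""
        return {
            "Source_System__c": self.source_system,
            "Target_System__c": self.target_system,
            "Payload__c": self.payload,
            "Error_Message__c": self.error_message,
            "Error_Code__c": self.error_code,
            "Retry_Count__c": self.retry_count,
            "First_Failure__c": self.first_failure.isoformat(),
            "Last_Failure__c": self.last_failure.isoformat(),
            "Status__c": self.status,
            "Correlation_ID__c": self.correlation_id,
        }
```

Keeping the full payload and both failure timestamps on the record is what makes later replay and root-cause analysis possible without going back to the source system.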

At-least-once delivery means duplicates will happen. Every receiver must handle repeated messages without side effects.

| Strategy | How It Works | When to Use |
|---|---|---|
| Upsert + External ID | SF upsert is naturally idempotent | Data sync (default choice) |
| Idempotency key | Sender includes unique key; receiver checks before processing | Custom business logic |
| Natural key dedup | Check by business key (Order Number) before insert | When unique business key exists |
| Payload hash | Hash message content, reject duplicates | No client-side key available |
| Timestamp comparison | Only process if newer than last processed | Simple, but clock skew risk |

Flowchart showing how an incoming message is checked for an idempotency key, deduped against previously processed keys, and either skipped or processed and stored.
Figure 3. Idempotency handling must cover the case where no key is provided by the sender. Generating one from a payload hash or natural business key ensures the receiver can still deduplicate. At-least-once delivery guarantees that duplicates will arrive; this pattern ensures they are harmless.
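
The flow in Figure 3 could look roughly like the receiver below. It is a sketch under stated assumptions: the message shape (`idempotency_key`, `payload`), the class name `IdempotentReceiver`, and the in-memory set of processed keys are all invented for illustration, and a real receiver would keep processed keys in a durable store.

```python
import hashlib
import json


class IdempotentReceiver:
    def __init__(self, handler):
        self.handler = handler
        self.processed = set()  # assumption: in-memory only, for illustration

    @staticmethod
    def key_for(message):
        """Use the sender's key if present, otherwise hash the payload."""
        if message.get("idempotency_key"):
            return message["idempotency_key"]
        # Sorted keys make the hash independent of field order.
        canonical = json.dumps(message["payload"], sort_keys=True,
                               separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def receive(self, message):
        key = self.key_for(message)
        if key in self.processed:
            return "skipped"
        self.handler(message["payload"])
        # Record the key only after success so a failed message can be retried.
        self.processed.add(key)
        return "processed"
```

Recording the key after the handler runs favors at-least-once processing: a crash between the two steps causes a repeat rather than a lost message, which is safe only if the handler itself tolerates repeats.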

Error handling without monitoring means end users discover failures days later. Build alerting first, not as an afterthought.

| Metric | Warning | Critical |
|---|---|---|
| Integration failure rate | > 5% of transactions | > 20% of transactions |
| DLQ depth | > 100 messages | Growing for 30+ min |
| API call consumption | > 80% of daily limit | > 95% of daily limit |
| Response time (real-time) | > 5 seconds | > 10 seconds |
| Circuit breaker state | - | Any circuit open |
| Event subscriber lag | > 1 hour behind | > 12 hours behind |

| Layer | Salesforce-Native | External |
|---|---|---|
| Metrics collection | Event Monitoring (Shield add-on) | Splunk, Datadog, ELK |
| Dashboards | Custom dashboard on Integration_Error__c | Grafana, Datadog |
| Alerting | Flow email alerts, Platform Events | PagerDuty, OpsGenie, Slack |
| Ticketing | Auto-create Case from Flow | Jira, ServiceNow |
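
The alert thresholds above translate fairly directly into an evaluation function. The sketch below is illustrative only: the metric names (`failure_rate_pct`, `dlq_growing_minutes`, and so on) and the shape of the input dict are assumptions, not the schema of any monitoring tool.

```python
# metric name: (warning, critical); a value strictly above the number trips the level
THRESHOLDS = {
    "failure_rate_pct": (5, 20),
    "api_usage_pct": (80, 95),
    "response_seconds": (5, 10),
    "subscriber_lag_hours": (1, 12),
}


def severity(value, warning, critical):
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def evaluate(metrics):
    """Return {metric: 'ok' | 'warning' | 'critical'} for the metrics supplied."""
    levels = {
        name: severity(metrics[name], *limits)
        for name, limits in THRESHOLDS.items()
        if name in metrics
    }
    # DLQ depth: warning on size, critical on sustained growth.
    if "dlq_depth" in metrics:
        if metrics.get("dlq_growing_minutes", 0) >= 30:
            levels["dlq_depth"] = "critical"
        elif metrics["dlq_depth"] > 100:
            levels["dlq_depth"] = "warning"
        else:
            levels["dlq_depth"] = "ok"
    # Any open circuit is critical; the table defines no warning level.
    if "open_circuits" in metrics:
        levels["circuit_breaker"] = "critical" if metrics["open_circuits"] > 0 else "ok"
    return levels
```

A scheduler could call `evaluate` on each metrics snapshot and page on any `critical` entry, while `warning` entries feed a dashboard or ticket queue.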

Scenario 1: ERP Goes Down During Order Processing

Situation: Salesforce sends orders to SAP via Fire-and-Forget (Platform Events + MuleSoft). SAP goes down for 2 hours during peak.

What you’d present:

  1. First 5 failures: MuleSoft retries with exponential backoff (1s, 2s, 4s, 8s, 16s + jitter)
  2. After 5 failures: Circuit breaker opens. Subsequent orders fail fast (no SAP call attempted)
  3. Failed orders: Route to Anypoint MQ dead letter queue with full payload and error context
  4. Alert: PagerDuty pages integration team; auto-created Jira ticket
  5. Recovery: After 60s, circuit breaker half-opens, tests one order. SAP still down - circuit stays open
  6. SAP recovers: Half-open test succeeds. Circuit closes. Normal flow resumes
  7. DLQ replay: Integration team bulk-replays 2 hours of queued orders from DLQ
  8. Idempotency: SAP uses order number as idempotency key - replayed orders that partially processed are safe
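
Steps 7 and 8 of this scenario, bulk replay from the DLQ against an idempotent target, might be sketched as follows. Everything here is hypothetical: `replay`, the dict-based DLQ entries, and `FakeSap` stand in for Anypoint MQ and SAP purely to show why replaying a partially processed order is safe when the order number acts as the idempotency key.

```python
def replay(dlq, send):
    """Resubmit parked messages; return (resubmitted, still_failing) counts."""
    resubmitted = still_failing = 0
    for entry in dlq:
        if entry["status"] != "New":
            continue  # already handled in an earlier replay
        try:
            send(entry["payload"])
        except Exception as exc:
            # Keep the message parked and record the latest failure.
            entry["error_message"] = str(exc)
            entry["retry_count"] += 1
            still_failing += 1
        else:
            entry["status"] = "Resubmitted"
            resubmitted += 1
    return resubmitted, still_failing


class FakeSap:
    """Hypothetical target that keys orders by order number."""

    def __init__(self):
        self.orders = {}

    def create_order(self, payload):
        # Writing by order number makes a repeated create overwrite, not duplicate.
        self.orders[payload["order_number"]] = payload
```

If an order reached SAP just before the outage and was also parked in the DLQ, replaying it leaves a single order record rather than two.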

Situation: Nightly sync of 500,000 Account records from data warehouse via Bulk API 2.0. Job completes with 498,000 success and 2,000 failures.

What you’d present:

  1. Successful records: Committed normally (no rollback of successes)
  2. Failed records: Download error results file (GET /jobs/ingest/{id}/failedResults)
  3. Classify failures: 1,800 validation rule failures (data quality), 200 duplicate External ID conflicts
  4. Data quality errors: Route to data steward dashboard for cleansing, fix source data, resubmit only failed records
  5. Duplicate errors: Investigate - likely stale dedup window. Switch to upsert if using insert
  6. Monitoring: Dashboard shows 99.6% success rate (within SLA), alert on the 2,000 failures for review
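
Step 3 above, classifying the failed records, can be sketched by bucketing rows of the failed-results CSV on its `sf__Error` column. The sketch assumes each error string begins with a Salesforce status code such as `FIELD_CUSTOM_VALIDATION_EXCEPTION` or `DUPLICATE_VALUE`; the function name and bucket labels are invented for this example.

```python
import csv
import io
from collections import defaultdict


def classify_failed_results(csv_text):
    """Group failed Bulk API 2.0 rows by the kind of error reported."""
    buckets = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        error = row["sf__Error"]
        if error.startswith("FIELD_CUSTOM_VALIDATION_EXCEPTION"):
            buckets["data_quality"].append(row)   # route to data steward
        elif error.startswith("DUPLICATE_VALUE"):
            buckets["duplicate"].append(row)      # investigate External ID clash
        else:
            buckets["other"].append(row)          # needs manual triage
    return dict(buckets)


def success_rate(total, failed):
    """Percentage of records that succeeded."""
    return 100.0 * (total - failed) / total
```

For the job in this scenario, `success_rate(500_000, 2_000)` gives 99.6, the figure shown on the monitoring dashboard. Each bucket still carries the original field values, so only the failed rows need to be corrected and resubmitted.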

Situation: External analytics system subscribes to CDC on Opportunity via Pub/Sub API. The analytics system goes down for maintenance over a 4-day weekend. CDC retention is 3 days.

What you’d present:

  1. Day 1-3: Events accumulate on bus. When subscriber reconnects, it replays from last checkpoint
  2. Day 3+: Events older than 3 days are lost - beyond retention window
  3. Recovery: Subscriber detects the gap (its stored replay ID falls outside the retention window and is rejected), triggers batch reconciliation job
  4. Reconciliation: Run Bulk API 2.0 query for all Opportunities modified in the last 5 days, full sync
  5. Prevention: Monitor subscriber lag; alert when lag > 12 hours (gives 2.5 days to fix before data loss)
  6. Design improvement: Hybrid architecture - CDC for near-real-time delivery, nightly batch sync as safety net
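
The arithmetic behind the 12-hour lag alert in this scenario can be checked with a short sketch. The function names are hypothetical; the 3-day retention window is the figure given in the scenario.

```python
from datetime import timedelta

RETENTION = timedelta(days=3)  # CDC retention window from the scenario


def time_to_data_loss(lag, retention=RETENTION):
    """Time left before the oldest unprocessed event ages out of the bus."""
    return max(retention - lag, timedelta(0))


def needs_reconciliation(outage, retention=RETENTION):
    """True when the outage outlasted retention, so some events are gone."""
    return outage > retention
```

A subscriber that is 12 hours behind has `time_to_data_loss(timedelta(hours=12))`, which is 2 days 12 hours, left to recover. For the 4-day outage here, `needs_reconciliation(timedelta(days=4))` is true, which is why the batch reconciliation job is required.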

| Anti-Pattern | Why It Fails | Do This Instead |
|---|---|---|
| Retry forever | Wastes resources, masks permanent failures | Max retries + DLQ |
| Retry without backoff | Hammers struggling system | Exponential backoff + jitter |
| Retry all errors equally | 400 will never succeed on retry | Classify first, only retry transient |
| Swallow errors silently | Nobody knows integration is broken | Log + alert + DLQ |
| No idempotency | Duplicates on retry | External ID upsert or idempotency keys |
| Manual-only recovery | Does not scale | Automated retry + manual for edge cases |
| Monitor reactively | Users discover failures days later | Proactive alerting with thresholds |

Personal study notes for the Salesforce CTA exam. Content compiled from VJ's study notes, official Salesforce documentation, community sources, and online publicly available content, then organized and presented with AI assistance. Not affiliated with Salesforce. © 2025–2026 VJ Srivastava.