Retry & Idempotency

Webhook delivery is unreliable by nature — network failures, partner system outages, and transient errors can cause requests to fail or be received more than once. Bunjang's webhook system is designed to retry failed deliveries automatically, and your endpoint must be designed to handle duplicate deliveries safely.

This page describes Bunjang's retry behavior, circuit breaker mechanism, and how to implement idempotent processing on your side.

Response timeout

Your endpoint must respond with a 2xx status code within 10 seconds of receiving a webhook request. Responses that exceed this timeout are treated as delivery failures and trigger retries.

To meet this requirement reliably, do not perform heavy business logic synchronously in your webhook handler. Instead:

Verify the signature.

Check the eventId for duplicates.

Enqueue the event for asynchronous processing or write it to durable storage.

Return 2xx immediately.

Heavy operations — fetching the latest state from a REST API, updating downstream systems, sending notifications — should run in a separate worker after the response is sent.
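The four steps above can be sketched as a small, framework-independent handler. This is only an illustration under stated assumptions: the HMAC-SHA256 hex signature, the `SIGNING_SECRET` value, the in-memory `seen_event_ids` set, and the in-process `work_queue` are hypothetical stand-ins. Use the signature scheme from Bunjang's signature documentation, an atomic dedup store, and a durable queue in a real receiver.

```python
import hashlib
import hmac
import json
import queue

# Assumption: HMAC-SHA256 over the raw body, hex-encoded. The real scheme may differ.
SIGNING_SECRET = b"example-secret"      # hypothetical value
work_queue: "queue.Queue[dict]" = queue.Queue()  # stand-in for a durable queue
seen_event_ids: set = set()             # stand-in for an atomic dedup store


def handle_webhook(raw_body: bytes, signature: str) -> int:
    """Return the HTTP status to send. No heavy work happens here."""
    # 1. Verify the signature (constant-time comparison).
    expected = hmac.new(SIGNING_SECRET, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return 401

    # 2. Check the eventId for duplicates.
    event = json.loads(raw_body)
    event_id = event["eventId"]
    if event_id in seen_event_ids:
        return 200  # duplicate: acknowledge without re-processing

    # 3. Enqueue for asynchronous processing.
    seen_event_ids.add(event_id)
    work_queue.put(event)

    # 4. Return 2xx immediately; a separate worker drains work_queue.
    return 200
```

A worker process would then take events from the queue and perform the REST calls, downstream updates, and notifications outside the 10-second response budget.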

Retry policy

Bunjang automatically retries webhook deliveries on transient failures.

What is retried

Condition	Retried?
`2xx` response	No — delivery considered successful
`5xx` response	Yes
`429 Too Many Requests`	Yes
Network error (connection refused, DNS failure, etc.)	Yes
Read timeout (no response within 10 seconds)	Yes
Other `4xx` responses (`400`, `401`, `403`, `404`, etc.)	No — delivery considered permanently failed
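When testing your receiver, it can help to encode the table above as a small function and check which of your responses would trigger a retry. The sketch below is a reading of the table, not Bunjang's code. The function name `is_retried` is hypothetical, and status codes the table does not mention (such as 3xx) are deliberately rejected rather than guessed.

```python
from typing import Optional


def is_retried(status: Optional[int]) -> bool:
    """Return True if the table says this outcome is retried.

    status is the HTTP status code, or None for a network error
    or a read timeout (no response within 10 seconds).
    """
    if status is None:
        return True                      # network error or read timeout
    if 200 <= status < 300:
        return False                     # delivery considered successful
    if status == 429 or 500 <= status < 600:
        return True                      # rate limited or server error
    if 400 <= status < 500:
        return False                     # permanently failed
    raise ValueError(f"status {status} is not covered by the table")
```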

Retry schedule

When a transient failure occurs, Bunjang performs up to 3 retries using exponential backoff with jitter. The total delivery attempt window is bounded by an overall timeout of approximately 60 seconds.

Attempt 1 (initial) → fail
  ↓ ~1 second backoff (±50% jitter)
Attempt 2 → fail
  ↓ ~2 seconds backoff (±50% jitter)
Attempt 3 → fail
  ↓ ~4 seconds backoff (±50% jitter)
Attempt 4 (final) → fail → record failure, trigger circuit breaker logic

If all attempts fail, the event is considered failed and no further automatic retries are performed for that event. The failure is then evaluated by the circuit breaker.

Note: The retry schedule is provided for reference and may change without notice as Bunjang tunes the system based on operational data. It is not a service-level agreement. Design your system to handle eventual delivery rather than expecting exact retry timing.
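To make the schedule concrete, the following sketch computes backoff delays with the parameters described above (base of about 1 second, doubling each time, ±50% jitter, 3 retries). It is an illustration of exponential backoff with jitter in general. The function name, the parameter names, and the uniform jitter distribution are assumptions, not Bunjang's implementation.

```python
import random


def backoff_delays(retries=3, base=1.0, factor=2.0, jitter=0.5, rng=random.random):
    """Return one delay in seconds per retry.

    Each nominal delay (base * factor**i) is scaled by a value drawn
    uniformly from [1 - jitter, 1 + jitter).
    """
    delays = []
    for i in range(retries):
        nominal = base * factor ** i
        scale = 1 + jitter * (2 * rng() - 1)
        delays.append(nominal * scale)
    return delays


# Rough upper bound for one event: four attempts that each hit the
# 10-second timeout, plus the largest possible backoffs.
worst_case = 4 * 10 + sum(backoff_delays(rng=lambda: 1.0))
```

Under these assumptions `worst_case` comes to about 50 seconds, which is consistent with the overall window of approximately 60 seconds.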

Implications for your system

The total retry window is short (approximately 60 seconds). This protects against brief network blips and short partner-side issues like GC pauses or rolling restarts, but it does not protect against extended downtime.

If your endpoint is unavailable for more than about a minute, events whose delivery attempts all fall within that window are lost. To recover from extended outages, see Recovering missed events.


Circuit breaker

To protect both Bunjang's webhook infrastructure and your partner endpoint from cascading failures, Bunjang implements a circuit breaker per subscription.

How it works

The circuit breaker tracks failures per subscription and disables delivery when failures accumulate beyond a threshold. While disabled, no new webhook requests are sent. After a cooldown period, Bunjang attempts a single recovery delivery; if it succeeds, the subscription is re-enabled.

Trigger conditions

A subscription is immediately disabled when:

A response with a 401, 403, 404, or other non-retryable 4xx status code is received. These indicate misconfiguration that retries cannot resolve.

A response with 429 Too Many Requests is received after all retries have been exhausted. The subscription enters a short cooldown (1 minute) to respect rate limiting.

A subscription is disabled after repeated failures when:

10 or more failures occur within a 1-hour window. Failures counted toward this threshold include 400 responses, 5xx responses, network errors, and timeouts.

Cooldown and recovery

Trigger	Initial cooldown	Behavior on recovery attempt
`429 Too Many Requests`	1 minute	If successful, subscription is re-enabled. If failed, the subscription is disabled again with a fresh cooldown.
All other failures	30 minutes	Same as above.

A single successful delivery during a recovery attempt is sufficient to fully re-enable the subscription and reset the failure counter.
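The behavior described in this section can be modeled as a small state machine, which is useful for reasoning about how your endpoint's failures translate into disabled periods. The sketch below is a mental model reconstructed from the description, not Bunjang's implementation. The class name, method names, and failure `kind` labels are hypothetical, and it assumes a failed recovery attempt restarts a cooldown of the length that matches the failure's kind.

```python
from collections import deque

WINDOW_SECONDS = 3600          # 1-hour failure window
FAILURE_THRESHOLD = 10         # failures within the window
RATE_LIMIT_COOLDOWN = 60       # 429 after retries are exhausted
DEFAULT_COOLDOWN = 1800        # all other failures


class SubscriptionBreaker:
    def __init__(self):
        self.failures = deque()      # timestamps of counted failures
        self.disabled_until = None   # None means the subscription is enabled

    def allows_delivery(self, now):
        # Enabled, or the cooldown has elapsed and a recovery attempt may go out.
        return self.disabled_until is None or now >= self.disabled_until

    def record_success(self, now):
        # One success fully re-enables and resets the failure counter.
        self.failures.clear()
        self.disabled_until = None

    def record_failure(self, now, kind):
        """kind is 'permanent', 'rate_limited', or 'transient' (labels assumed)."""
        cooldown = RATE_LIMIT_COOLDOWN if kind == "rate_limited" else DEFAULT_COOLDOWN
        if self.disabled_until is not None or kind in ("permanent", "rate_limited"):
            # Failed recovery attempt or immediate trigger: (re)start the cooldown.
            self.disabled_until = now + cooldown
            return
        self.failures.append(now)
        while self.failures[0] <= now - WINDOW_SECONDS:
            self.failures.popleft()  # drop failures older than the window
        if len(self.failures) >= FAILURE_THRESHOLD:
            self.disabled_until = now + cooldown
```

For example, ten counted failures spaced one minute apart disable the modeled subscription for 30 minutes, and a single success on the recovery attempt clears both the cooldown and the failure history.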

What this means in practice

Brief partner outages (under ~1 minute): Bunjang's retries usually handle these without involving the circuit breaker.

Sustained partner outages (10 or more failures within a 1-hour window): The subscription is disabled. Events generated during the disabled period are not queued for later delivery; they are simply not sent.

Misconfiguration (wrong signing secret, expired credentials, missing endpoint): The subscription is disabled immediately on the first non-retryable 4xx response, such as 401, 403, or 404.

If your subscription is disabled, you must coordinate with your Bunjang integration contact to investigate the root cause before re-enabling.

Idempotency

Bunjang's webhook system guarantees at-least-once delivery, not exactly-once. The same event may be delivered multiple times due to:

Retries after a timeout where your endpoint actually processed the request but the response was lost.

Network-layer duplication.

Internal recovery mechanisms.

Your endpoint must therefore process each event idempotently — receiving the same event twice must produce the same result as receiving it once.

How to implement idempotency

Every webhook payload contains an eventId field that uniquely identifies the event. Use this as your idempotency key.

Recommended pattern:

When a webhook arrives, extract eventId from the payload.

Atomically check whether you've already processed this eventId:

If yes → return 200 OK immediately without re-processing.

If no → record the eventId as "processing" (with a TTL to handle crashes), process the event, then mark it as "processed".

Return 200 OK.

Storage options:

Database with unique constraint: Insert (event_id, processed_at) into a dedicated table; rely on the unique constraint to reject duplicates.

Redis with SET NX: Use SET event:{eventId} processed EX 86400 NX (24-hour TTL). Returns success only on first write.

Application-level cache: Suitable for short-lived deduplication only; not recommended as the sole mechanism.

TTL guidance: Retain processed eventIds for at least 24 hours. This comfortably covers Bunjang's retry window (~60 seconds) and circuit breaker cooldowns (up to 30 minutes), with significant safety margin for partner-side incident recovery. Longer retention adds storage overhead without practical benefit.
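As an illustration of the unique-constraint option together with 24-hour retention, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `processed_events`, the function names, and the in-memory database are assumptions made for the example. A production receiver would use its own durable database.

```python
import sqlite3
import time
from typing import Optional

RETENTION_SECONDS = 24 * 60 * 60  # retain eventIds for at least 24 hours

conn = sqlite3.connect(":memory:")  # in-memory only for the example
conn.execute(
    "CREATE TABLE processed_events ("
    "event_id TEXT PRIMARY KEY, processed_at REAL NOT NULL)"
)


def first_delivery(event_id: str, now: Optional[float] = None) -> bool:
    """Atomically record event_id; return True only the first time it is seen."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id, processed_at) "
            "VALUES (?, ?)",
            (event_id, now),
        )
    # The primary key rejects duplicates, so rowcount is 0 for a repeat.
    return cur.rowcount == 1


def purge_expired(now: Optional[float] = None) -> int:
    """Delete entries older than the retention period; return how many."""
    now = time.time() if now is None else now
    with conn:
        cur = conn.execute(
            "DELETE FROM processed_events WHERE processed_at < ?",
            (now - RETENTION_SECONDS,),
        )
    return cur.rowcount
```

Because the check and the insert are a single statement, two concurrent deliveries of the same eventId cannot both be treated as the first one.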

Common idempotency mistakes

Checking and inserting in separate steps without a transaction. Two concurrent webhook deliveries can both pass the "not yet processed" check and proceed to process. Use atomic operations (unique constraint, SET NX, or a transaction with appropriate isolation).

Using business identifiers instead of eventId. The same business entity (e.g. an orderId) can appear in many events. Deduplicate on eventId, not on entities mentioned in the payload.

Returning an error when a duplicate is detected. Return 200 OK for duplicates. Returning 4xx will cause Bunjang to disable your subscription.

Forgetting to handle "processing crashed mid-flight" cases. If your handler dies after marking an event as "processing" but before completing, the retry will see "processing" status and either skip or block. Design your TTL and state machine accordingly.
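One way to handle the crashed mid-flight case is to let a "processing" claim expire, so that a later delivery can take over an abandoned event. The sketch below shows that idea with an in-memory store guarded by a lock. The class name, the 60-second `PROCESSING_TTL`, and the injectable clock are hypothetical choices for illustration, and a real receiver would keep this state in a shared durable store.

```python
import threading
import time

PROCESSING_TTL = 60.0  # hypothetical: how long a "processing" claim stays valid


class EventClaims:
    def __init__(self, clock=time.monotonic):
        self._lock = threading.Lock()
        self._state = {}   # event_id -> (status, timestamp)
        self._clock = clock

    def try_claim(self, event_id):
        """Return True if the caller should process the event now."""
        now = self._clock()
        with self._lock:
            status, since = self._state.get(event_id, (None, 0.0))
            if status == "processed":
                return False            # already done: acknowledge and skip
            if status == "processing" and now - since < PROCESSING_TTL:
                return False            # another worker holds a live claim
            # New event, or a stale claim left behind by a crashed worker.
            self._state[event_id] = ("processing", now)
            return True

    def mark_processed(self, event_id):
        with self._lock:
            self._state[event_id] = ("processed", self._clock())
```

Since a stale claim can be taken over while the original worker is merely slow, the processing step itself should also be safe to run more than once.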

Recovering missed events

Because Bunjang's webhook retry window is bounded (approximately 60 seconds of retries, plus circuit breaker cooldowns), extended outages or subscription disablement may result in missed events.

Bunjang does not provide manual webhook replay. Missed events are not queued for later delivery.

To recover, use the corresponding REST API as the authoritative source of state. Each event type maps to a domain that has its own REST API for querying current state. When you suspect missed events — after an incident, after subscription re-enablement, or as a periodic reconciliation — query the relevant API for changes since your last successful sync.

General recovery pattern

Track the timestamp of the last successfully processed webhook (or last successful sync) per event domain in your system.

After any incident or extended downtime, call the corresponding REST API filtered by an "updated since" parameter set to that timestamp.

Reconcile each returned record against your local state.

Update your last-sync timestamp.
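The recovery pattern above can be sketched as a single function. Everything specific here is an assumption for illustration: `fetch_updated_since` is a hypothetical callable that wraps the relevant REST API (including any pagination), records are assumed to be dictionaries with an `id` key, and local state is a plain dictionary.

```python
from datetime import datetime, timezone


def reconcile(local_state, fetch_updated_since, last_sync, now=None):
    """Bring local_state up to date and return the new last-sync timestamp.

    local_state: dict mapping record id -> record (stand-in for your database)
    fetch_updated_since: callable taking a datetime and returning records
    last_sync: datetime of the last successful sync for this event domain
    """
    # Capture the time before fetching, so changes made while the sync is
    # running are picked up by the next run instead of being skipped.
    started_at = now or datetime.now(timezone.utc)

    for record in fetch_updated_since(last_sync):
        # The REST API is the source of truth: overwrite local state.
        local_state[record["id"]] = record

    return started_at
```

Persist the returned timestamp only after the function completes without error, so that a failed sync is retried from the same starting point.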

Example: recovering missed `order.status.changed` events

For order status events, use the Order API with the statusUpdateStartDate and statusUpdateEndDate filters:

GET /api/v1/orders?statusUpdateStartDate=2026-05-15T19%3A12%3A00Z&statusUpdateEndDate=2026-05-15T20%3A12%3A00Z&page=0&size=100

This returns all orders whose status was updated within the given time range, regardless of whether a webhook was delivered for the change.
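The request path above can be generated from timezone-aware timestamps rather than assembled by hand, which avoids encoding mistakes in the colons. The helper name `orders_query` is hypothetical. The path and parameter names are taken from the example request, and the UTC `Z` timestamp format is inferred from it.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def orders_query(start: datetime, end: datetime, page: int = 0, size: int = 100) -> str:
    """Build the Order API path for one page of a status-update time range.

    start and end must be timezone-aware; they are converted to UTC.
    """
    params = {
        "statusUpdateStartDate": start.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT),
        "statusUpdateEndDate": end.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT),
        "page": page,
        "size": size,
    }
    # urlencode percent-encodes the colons in the timestamps as %3A.
    return "/api/v1/orders?" + urlencode(params)
```

To cover a long outage, call the helper with increasing `page` values for each time range until a page comes back with no more results.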

As new event types are added, refer to each event's reference page for the corresponding recovery API and filter parameter.

Design principle

Treat webhook delivery as a notification mechanism that reduces polling frequency, not as a guaranteed event stream. For business-critical state, the corresponding REST API is the source of truth, and webhooks are an optimization that lets you avoid constant polling under normal operating conditions.

For event types not yet covered by webhooks, periodic polling of the corresponding REST API remains the appropriate pattern.

Summary checklist

When implementing your webhook receiver, ensure you:

Respond with 2xx within 10 seconds.

Defer heavy processing to an asynchronous worker.

Deduplicate events using eventId with atomic storage operations.

Return 200 OK on duplicate detection — never 4xx.

Retain processed eventIds for at least 24 hours.

Implement a recovery process using the corresponding REST API for each event domain.

Monitor your endpoint's response times and error rates to avoid triggering the circuit breaker.
