A Hybrid Transactional Outbox Event Delivery Pattern
The Problem: Reliable Event Publishing is Hard
In distributed systems, publishing events reliably is deceptively difficult. You need to atomically:
- Update your database
- Publish an event to a message queue
But these are two separate systems. What happens when your database transaction succeeds but the message queue is down? You lose events. What if the message publishes but the database rolls back? You get phantom events.
This is the dual-write problem, and it’s the bane of event-driven architectures.
Transactional Outbox
The Transactional Outbox pattern solves this by:
- Writing events to an outbox table within the same database transaction
- A separate process reads the outbox and publishes to the message queue
- Events are marked as published after successful delivery
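A minimal sketch of the classic polling publisher makes the mechanics concrete. The outbox table, its published_at column, and the MessageQueue::publish method here are assumptions for illustration, not the hybrid implementation described later:
async fn poll_outbox(pool: &PgPool, mq: &MessageQueue) -> Result<()> {
    loop {
        // Read unpublished events; in the classic pattern the full payload travels through the MQ
        let pending = sqlx::query!(
            "SELECT id, payload FROM outbox WHERE published_at IS NULL ORDER BY id LIMIT 100"
        )
        .fetch_all(pool)
        .await?;

        for row in pending {
            mq.publish("events", row.payload.to_string().into_bytes()).await?;
            sqlx::query!("UPDATE outbox SET published_at = NOW() WHERE id = $1", row.id)
                .execute(pool)
                .await?;
        }

        // This polling interval is exactly where the extra latency comes from
        tokio::time::sleep(std::time::Duration::from_secs(5)).await;
    }
}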
This works, but it comes with trade-offs:
- Polling adds latency: Checking the database every few seconds delays event delivery
- Change Data Capture is complex: Real-time CDC solutions like Debezium are powerful but add operational overhead
- Large payloads in MQ: Full event data flows through the message queue
Real-World Example: Multi-Provider AI Chat Platform
Let’s ground this in a concrete example: building a chat platform that supports multiple AI providers (OpenAI, Anthropic, Google, etc.). This system needs to handle:
- User subscriptions that expire at specific times
- Token usage tracking across millions of API calls
- Payment processing that triggers service activation
- AI responses cached in Redis that need eventual persistence
Each of these has different consistency and latency requirements, making it a perfect case study for our hybrid pattern.
A Better Approach: Hybrid Outbox with Claim Check and Buffer
I implemented a pattern that separates different delivery mechanisms into dedicated tables. The key insight: different event types have different requirements and should use optimized storage.
┌─────────────────────────────────────────────────────────────┐
│ AI Chat Service │
└───┬─────────────────┬─────────────────┬─────────────────────┘
│ │ │
│ 1. Eager Event │ 2. Schedule │ 3. Accumulate
│ │ Future Event │ High Volume
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ Three Specialized Tables │
├─────────────────────────────────────────────────────────────┤
│ ┌───────────────────────────────────────────────────────┐ │
│ │ outbox_events: Hybrid (AI Response, Payments) │ │
│ │ → Eager publish + 30s scanner fallback │ │
│ │ → Index: (status, created_at) for fast scanning │ │
│ │ → Cleanup: Archive after 30 days │ │
│ └───────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ scheduled_events: Future triggers (Subscriptions) │ │
│ │ → Poll for scheduled_at <= NOW() │ │
│ │ → Index: (scheduled_at, status) for time queries │ │
│ │ → Cleanup: Delete after execution │ │
│ └───────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ accumulation_buffer: Batching (Token Usage) │ │
│ │ → No MQ, direct aggregation every 5 minutes │ │
│ │ → Index: (user_id, created_at) for grouping │ │
│ │ → Cleanup: Delete immediately after aggregation │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
│ { event_id: UUID or series }
│
▼
┌──────────┐
│ MQ │
└─────┬────┘
│
▼
┌──────────┐
│ Consumer │
└──────────┘
Even in the worst case (MQ failure, process crash, network partition), all events eventually get processed. The database transaction ensures events are never lost.
Since consumers fetch events by ID and check status, duplicate deliveries are naturally handled. This is critical when the scanner republishes events that were actually already sent.
And since there is no CDC infrastructure to operate, day-to-day maintenance stays simple.
The key insight is that different event types need different storage strategies.
Database schema design decisions:
- outbox_events: No scheduled_at column (not needed)
- scheduled_events: Requires scheduled_at for future triggers
- accumulation_buffer: Minimal schema with a BIGSERIAL key for fast inserts and no status field; rows are simply deleted after processing
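Put together, the schemas might look roughly like this. Column names follow the queries in this post; the exact types, defaults, and index names are assumptions, and you would run the DDL through your migration tool of choice:
// Sketch only: types and defaults are assumptions based on the queries below.
const EVENT_TABLES_DDL: &str = r#"
CREATE TYPE event_status AS ENUM ('pending', 'processed', 'failed');

CREATE TABLE outbox_events (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type   TEXT NOT NULL,
    payload      JSONB NOT NULL,
    status       event_status NOT NULL DEFAULT 'pending',
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    sent_at      TIMESTAMPTZ,
    processed_at TIMESTAMPTZ
);
CREATE INDEX idx_outbox_scan ON outbox_events (status, created_at);

CREATE TABLE scheduled_events (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type   TEXT NOT NULL,
    payload      JSONB NOT NULL,
    scheduled_at TIMESTAMPTZ NOT NULL,
    status       event_status NOT NULL DEFAULT 'pending',
    sent_at      TIMESTAMPTZ,
    processed_at TIMESTAMPTZ
);
CREATE INDEX idx_scheduled_due ON scheduled_events (scheduled_at, status);

-- No status column: rows are deleted as soon as they are aggregated
CREATE TABLE accumulation_buffer (
    id         BIGSERIAL PRIMARY KEY,
    user_id    UUID NOT NULL,
    tokens     INTEGER NOT NULL,
    model      TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_buffer_group ON accumulation_buffer (user_id, created_at);
"#;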
Hybrid Events (Hot Path): AI Response Persistence
When a user sends a message, the AI response is initially cached in Redis for instant retrieval. We need to persist it to PostgreSQL for:
- Long-term storage and search
- Analytics and training data
- Audit trails
This is latency-sensitive but not critical. If eager publish fails, a 30-second delay is acceptable.
async fn save_ai_response(
pool: &PgPool,
mq: &MessageQueue,
response: AiResponse,
) -> Result<()> {
let mut tx = pool.begin().await?;
// do something here, then publish the event
let event = sqlx::query_as!(
OutboxEvent,
r#"
INSERT INTO outbox_events (event_type, payload, status)
VALUES ($1, $2, $3)
RETURNING id, event_type, payload as "payload: sqlx::types::Json<EventPayload>", status as "status: EventStatus", created_at, sent_at, processed_at
"#,
"ai_response_ready",
sqlx::types::Json(EventPayload::AiResponseReady { /* ... */ }) as _,
EventStatus::Pending as EventStatus,
)
.fetch_one(&mut *tx)
.await?;
tx.commit().await?;
// Eager publish after the commit (non-blocking)
let event_id = event.id;
let mq = mq.clone();
tokio::spawn(async move {
if let Err(e) = eager_publish(&mq, event_id).await {
tracing::warn!("Eager publish failed for AI response: {:?}", e);
// Scanner will catch it within 30 seconds
}
});
Ok(())
}
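The eager_publish helper isn't defined above. A minimal sketch, assuming the MessageQueue client exposes an async publish(topic, bytes) method; its only job is to push the event ID (claim check), leaving status bookkeeping to the scanner and consumer:
// Sketch only: MessageQueue::publish is an assumed API.
async fn eager_publish(mq: &MessageQueue, event_id: Uuid) -> Result<()> {
    // Only the ID crosses the MQ; consumers fetch the payload from Postgres
    mq.publish("outbox_events", event_id.to_string().into_bytes()).await?;
    Ok(())
}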
Why hybrid here? Most AI responses need quick persistence for analytics dashboards showing real-time usage. The eager publish ensures sub-second latency 99% of the time, while the scanner guarantees eventual consistency.
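The 30-second fallback scanner is equally simple to sketch, assuming it just republishes pending outbox rows older than the threshold (duplicates are fine because consumers are idempotent, as discussed below):
// Illustrative fallback scanner; run it on a 30-second interval.
async fn scan_outbox_events(pool: &PgPool, mq: &MessageQueue) -> Result<()> {
    let stale = sqlx::query!(
        r#"
        SELECT id
        FROM outbox_events
        WHERE status = $1
          AND created_at <= NOW() - INTERVAL '30 seconds'
        ORDER BY created_at
        LIMIT 100
        "#,
        EventStatus::Pending as EventStatus,
    )
    .fetch_all(pool)
    .await?;

    for row in stale {
        // Republishing an already-sent event is safe; consumers check status
        if let Err(e) = eager_publish(mq, row.id).await {
            tracing::warn!("Outbox scanner republish failed for {}: {:?}", row.id, e);
        }
    }
    Ok(())
}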
Scheduled Events (Cold Path): Subscription Expiry
Subscriptions expire at known future times. There’s no need for eager publishing—just poll for due events.
async fn create_subscription(
pool: &PgPool,
user_id: Uuid,
plan: SubscriptionPlan,
duration_days: i32,
) -> Result<Subscription> {
let mut tx = pool.begin().await?;
let expires_at = Utc::now() + Duration::days(duration_days as i64);
// Create subscription
let subscription = sqlx::query_as!(
Subscription,
"INSERT INTO subscriptions (user_id, plan, expires_at, status)
VALUES ($1, $2, $3, $4)
RETURNING *",
user_id,
plan as SubscriptionPlan,
expires_at,
SubscriptionStatus::Active as SubscriptionStatus,
)
.fetch_one(&mut *tx)
.await?;
// Schedule expiry event in dedicated table
sqlx::query!(
r#"
INSERT INTO scheduled_events (event_type, payload, scheduled_at, status)
VALUES ($1, $2, $3, $4)
"#,
"subscription_expiry",
sqlx::types::Json(ScheduledEventPayload::SubscriptionExpiry { /* ... */ }) as _,
expires_at,
EventStatus::Pending as EventStatus,
)
.execute(&mut *tx)
.await?;
tx.commit().await?;
Ok(subscription)
}
// Scanner runs every minute, checks for due subscriptions
async fn scan_scheduled_events(pool: &PgPool, mq: &MessageQueue) -> Result<()> {
let now = Utc::now();
let due_events = sqlx::query_as!(
ScheduledEvent,
r#"
SELECT id, event_type, payload, scheduled_at,
status as "status: EventStatus", sent_at, processed_at
FROM scheduled_events
WHERE status = $1 AND scheduled_at <= $2
LIMIT 100
"#,
EventStatus::Pending as EventStatus,
now,
)
.fetch_all(pool)
.await?;
for event in due_events {
publish_scheduled_event(pool, mq, event.id).await?;
}
Ok(())
}
// Consumer: Handle subscription expiry
async fn handle_subscription_expiry(
pool: &PgPool,
event: ScheduledEvent,
) -> Result<()> {
let payload: ScheduledEventPayload = serde_json::from_value(event.payload)?;
if let ScheduledEventPayload::SubscriptionExpiry { subscription_id, user_id } = payload {
// Update subscription status
sqlx::query!(
"UPDATE subscriptions SET status = $1 WHERE id = $2",
SubscriptionStatus::Expired as SubscriptionStatus,
subscription_id,
)
.execute(pool)
.await?;
// Revoke API access
revoke_api_keys(user_id).await?;
// Send notification email
send_expiry_notification(user_id).await?;
}
Ok(())
}
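The scanner above calls publish_scheduled_event, which isn't shown. A minimal sketch, again assuming a publish(topic, bytes) method on the MQ client and recording sent_at after a successful publish:
async fn publish_scheduled_event(
    pool: &PgPool,
    mq: &MessageQueue,
    event_id: Uuid,
) -> Result<()> {
    // Claim check: only the ID is published; the consumer loads the row
    mq.publish("scheduled_events", event_id.to_string().into_bytes()).await?;

    // Best-effort bookkeeping; if this update fails, the next scan republishes
    sqlx::query!(
        "UPDATE scheduled_events SET sent_at = NOW() WHERE id = $1",
        event_id,
    )
    .execute(pool)
    .await?;
    Ok(())
}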
Why schedule-only? Subscription expiry is never urgent. Whether it happens at exactly midnight or 60 seconds later doesn’t matter. Polling every minute is sufficient, and we avoid the complexity of eager publishing entirely.
Poll-Only Events (No MQ): Token Usage Accumulation
Every AI API call generates token usage. Tracking this in real-time would overwhelm the system with millions of events per day. Instead, we accumulate usage locally and batch-update periodically.
// Background job runs every 5 minutes
async fn accumulate_token_usage(pool: &PgPool) -> Result<()> {
// Find all pending token usage from accumulation buffer
let usage_records = sqlx::query!(
r#"
SELECT id, user_id, tokens, model, created_at
FROM accumulation_buffer
LIMIT 10000
"#,
)
.fetch_all(pool)
.await?;
if usage_records.is_empty() {
return Ok(());
}
// Group by user_id and sum tokens
let mut usage_by_user: HashMap<Uuid, i32> = HashMap::new();
for record in &usage_records {
*usage_by_user.entry(record.user_id).or_insert(0) += record.tokens;
}
// Batch update user quotas
let mut tx = pool.begin().await?;
// do something here
tx.commit().await?;
tracing::info!(
"Accumulated {} token usage records for {} users",
usage_records.len(),
usage_by_user.len()
);
Ok(())
}
// Run accumulator periodically
async fn run_token_accumulator(pool: PgPool) {
let mut interval = tokio::time::interval(Duration::from_secs(300)); // 5 minutes
loop {
interval.tick().await;
if let Err(e) = accumulate_token_usage(&pool).await {
tracing::error!("Token accumulator error: {:?}", e);
}
}
}
Why poll-only? Token usage tracking is:
- Not latency-sensitive: Users check their quota in dashboards, not real-time
- High volume: Millions of tiny events per day
- Naturally batched: Accumulating every 5 minutes is perfectly fine
Using MQ here would be wasteful. The accumulation_buffer table acts as a simple batching mechanism, and periodic polling aggregates efficiently.
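The "do something here" placeholder above is where the batched write happens. One way it could look, assuming users.tokens_used is an integer counter (the column is referenced again in the cleanup notes below) and the helper is called between begin() and commit():
// Sketch of the elided batch write; call it inside the transaction opened
// in accumulate_token_usage, between begin() and commit().
async fn apply_token_usage(
    tx: &mut sqlx::Transaction<'_, sqlx::Postgres>,
    usage_by_user: &HashMap<Uuid, i32>,
    processed_ids: &[i64],
) -> Result<()> {
    for (user_id, tokens) in usage_by_user {
        sqlx::query!(
            "UPDATE users SET tokens_used = tokens_used + $1 WHERE id = $2",
            *tokens,
            *user_id,
        )
        .execute(&mut **tx)
        .await?;
    }

    // Delete the buffer rows in the same transaction so a crash can't double-count
    sqlx::query!(
        "DELETE FROM accumulation_buffer WHERE id = ANY($1)",
        processed_ids,
    )
    .execute(&mut **tx)
    .await?;
    Ok(())
}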
Choose the Right Tool
Should this event use outbox_events, scheduled_events, or accumulation_buffer?
├─ Known future execution time?
│ └─ YES → scheduled_events
│
├─ High volume (>1000/sec) + aggregatable?
│ └─ YES → accumulation_buffer
│
└─ Needs low latency delivery?
├─ YES → outbox_events (hybrid)
└─ NO → classic_outbox (poll-only)
| Use Case | Table | Pattern | Latency | Why |
|---|---|---|---|---|
| AI Response Sync | outbox_events | Hybrid (eager + scanner) | <1s (99%), <30s (99.99%) | User-facing analytics need speed |
| Payment Processing | outbox_events | Hybrid (eager + scanner) | <1s (99%), <30s (99.99%) | Service activation should be quick |
| Subscription Expiry | scheduled_events | Scheduled (poll-only) | ~60s | Exact timing doesn’t matter |
| Token Usage | accumulation_buffer | Poll-only (no MQ) | ~5 min | High volume, not time-sensitive |
Why and Why Not
Why Not One Table?
Using three dedicated tables provides significant architectural benefits that a single unified table cannot match:
1. Optimized Indexes
- outbox_events: Index on (status, created_at) for fast scanning of recent failures
- scheduled_events: Index on (scheduled_at, status) for efficient time-based queries
- accumulation_buffer: Index on (user_id, created_at) for fast grouping during aggregation
A single table would require multiple indexes covering different access patterns, causing index bloat and slower writes.
2. Independent Cleanup Strategies
- outbox_events: Archive after 30 days (audit trail)
- scheduled_events: Delete immediately after execution (no historical value)
- accumulation_buffer: Delete after aggregation (already reflected in users.tokens_used)
This prevents each table from growing indefinitely and keeps query performance consistent.
3. Isolated Performance Characteristics
- High-volume token usage writes don’t block latency-sensitive AI response events
- Scheduled event scans don’t interfere with outbox scanner performance
- Each table can be tuned independently (vacuum settings, autovacuum thresholds)
4. Clear Operational Boundaries
Different teams or services can own different tables:
- Payment team: outbox_events (critical path)
- Subscription team: scheduled_events (background jobs)
- Analytics team: accumulation_buffer (data pipeline)
Why Only Pass the Event ID?
By only publishing event IDs to the message queue instead of full payloads (also known as claim check), we gain several advantages:
1. Reduced Message Queue Load
- Tiny messages (just a UUID) vs potentially large event payloads
- Lower network bandwidth usage between MQ and consumers
- MQ can handle significantly higher throughput with smaller messages
2. Avoids Message Size Limits
- Most message queues have size limits (e.g., RabbitMQ 128MB default, SQS 256KB)
- AI responses with embeddings or large context can exceed these limits
- Event payloads in PostgreSQL are effectively unbounded by comparison (a jsonb value can be up to 1 GB)
3. Single Source of Truth
- Event data lives only in the database, not duplicated in MQ
- Updates to event processing logic can query the latest data
- No stale payload issues when consumers are slow
4. Better Resource Utilization
- Database optimized for storing structured data with indexes
- Message queue optimized for fast delivery, not storage
- Each system does what it’s best at
5. Simplified Debugging
- Query database directly to inspect event details
- No need to capture messages from MQ for investigation
- Event history preserved independently of MQ retention
The trade-off is an additional database query per event in the consumer, but for our use case with thousands (not millions) of events per second, this is negligible compared to the benefits.
Trade-offs and Considerations
Potential Duplicate Deliveries
If eager publishing succeeds but updating the status fails, the scanner will republish. Your consumers must be idempotent. For the AI chat platform:
- AI response sync: Check if content is already set before updating
- Payment processing: Use payment gateway's idempotency keys
- Token accumulation: Naturally idempotent (already aggregated by ID)
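Here's what an idempotent consumer for the claim-check flow might look like. consume_ai_response and the messages.content write are illustrative only; the key points are fetching by ID and bailing out when the status says the event already went through:
async fn consume_ai_response(pool: &PgPool, event_id: Uuid) -> Result<()> {
    let event = sqlx::query!(
        r#"
        SELECT payload, status as "status: EventStatus"
        FROM outbox_events
        WHERE id = $1
        "#,
        event_id,
    )
    .fetch_one(pool)
    .await?;

    // Duplicate delivery: the scanner may republish an event that already went through
    if event.status == EventStatus::Processed {
        return Ok(());
    }

    if let EventPayload::AiResponseReady { message_id, redis_key, .. } =
        serde_json::from_value(event.payload)?
    {
        // ... copy the cached response from Redis into messages.content,
        //     skipping the write if content is already set ...
        let _ = (message_id, redis_key);
    }

    sqlx::query!(
        "UPDATE outbox_events SET status = $1, processed_at = NOW() WHERE id = $2",
        EventStatus::Processed as EventStatus,
        event_id,
    )
    .execute(pool)
    .await?;
    Ok(())
}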
Additional Database Load
Every hybrid consumer must query the database to fetch event details. For high throughput:
- Use read replicas for consumer queries
- Add connection pooling (e.g., pgBouncer)
- Cache frequently accessed events in Redis
Here is how the three-table split helps:
- Each table is smaller = better cache hit rates
- Scanners don’t compete on the same indexes
- Write-heavy accumulation_buffer doesn't block reads on outbox_events
Event Cleanup Strategy
Each table has its own cleanup strategy based on its purpose:
- Cleanup for outbox_events: Archive after 30 days (audit trail)
- Cleanup for scheduled_events: Delete immediately after execution
- Cleanup for accumulation_buffer: Delete after aggregation (see above). This happens inline during the accumulation process.
Benefits of table-specific cleanup:
- No “one size fits all” retention policy compromises
- Smaller tables = faster queries and better vacuum performance
- Clear data lifecycle management per use case
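A minimal sketch of those per-table cleanup jobs (the outbox_events_archive table is hypothetical; table partitioning or pg_partman would work just as well):
async fn cleanup_event_tables(pool: &PgPool) -> Result<()> {
    let mut tx = pool.begin().await?;

    // outbox_events: copy rows older than 30 days to the archive, then delete them
    sqlx::query!(
        "INSERT INTO outbox_events_archive
         SELECT * FROM outbox_events WHERE created_at < NOW() - INTERVAL '30 days'"
    )
    .execute(&mut *tx)
    .await?;
    sqlx::query!(
        "DELETE FROM outbox_events WHERE created_at < NOW() - INTERVAL '30 days'"
    )
    .execute(&mut *tx)
    .await?;

    // scheduled_events: nothing to keep once the event has been processed
    sqlx::query!("DELETE FROM scheduled_events WHERE processed_at IS NOT NULL")
        .execute(&mut *tx)
        .await?;

    // accumulation_buffer is cleaned inline by the accumulator, so nothing to do here
    tx.commit().await?;
    Ok(())
}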
Scanner Interval Tuning
The 30-second backlog scan and 5-minute token accumulation intervals are design choices. Tune them based on your requirements:
| Table | Scanner Type | Interval | Rationale |
|---|---|---|---|
| outbox_events | Hybrid backlog scanner | 30-60s | Balance between latency and database load |
| scheduled_events | Scheduled event scanner | 1 min | Subscription expiry doesn’t need sub-minute precision |
| accumulation_buffer | Token accumulator | 5-10 min | High volume, users check quotas infrequently |
Pro tip: Start with longer intervals and decrease based on actual user needs. Premature optimization wastes resources.
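Wiring it together, the three loops might be spawned like this. The intervals come from the table above, and scan_outbox_events is the fallback scanner sketched earlier:
fn spawn_background_workers(pool: PgPool, mq: MessageQueue) {
    // Hybrid backlog scanner: every 30 seconds
    {
        let (pool, mq) = (pool.clone(), mq.clone());
        tokio::spawn(async move {
            let mut tick = tokio::time::interval(std::time::Duration::from_secs(30));
            loop {
                tick.tick().await;
                if let Err(e) = scan_outbox_events(&pool, &mq).await {
                    tracing::error!("Outbox scanner error: {:?}", e);
                }
            }
        });
    }

    // Scheduled event scanner: every minute
    {
        let pool = pool.clone();
        tokio::spawn(async move {
            let mut tick = tokio::time::interval(std::time::Duration::from_secs(60));
            loop {
                tick.tick().await;
                if let Err(e) = scan_scheduled_events(&pool, &mq).await {
                    tracing::error!("Scheduled event scanner error: {:?}", e);
                }
            }
        });
    }

    // Token accumulator: every 5 minutes (run_token_accumulator loops internally)
    tokio::spawn(run_token_accumulator(pool));
}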
How Rust Makes This Pattern Better
Beyond the code examples above, Rust provides unique advantages for this architecture:
Type-Safe Schema and Event State
Strong typing for event payloads with serde eliminates whole classes of serialization and deserialization bugs, while algebraic data types and pattern matching give you an exhaustive, graceful way to handle every event variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq, sqlx::Type)]
#[sqlx(type_name = "event_status", rename_all = "lowercase")]
pub enum EventStatus {
Pending, // Awaiting delivery
Processed, // Consumer completed
Failed, // Permanent failure
}
#[derive(Debug, Serialize, Deserialize)]
#[serde(tag = "event_name")]
pub enum EventPayload {
AiResponseReady {
conversation_id: Uuid,
message_id: Uuid,
redis_key: String,
provider: AiProvider,
model: String,
tokens_used: i32,
},
PaymentCompleted {
user_id: Uuid,
payment_id: Uuid,
plan: SubscriptionPlan,
amount: Decimal,
},
// ... other hybrid events
}
#[derive(Debug, sqlx::FromRow)]
pub struct OutboxEvent {
pub id: Uuid,
pub event_type: String,
pub payload: sqlx::types::Json<EventPayload>,
pub status: EventStatus,
pub created_at: DateTime<Utc>,
pub sent_at: Option<DateTime<Utc>>,
pub processed_at: Option<DateTime<Utc>>,
}
#[derive(Debug, sqlx::FromRow)]
pub struct ScheduledEvent {
// ...
}
#[derive(Debug, sqlx::FromRow)]
pub struct AccumulationRecord {
// ...
}
Compile-Time Database Schema Validation
This is the killer feature. When you run cargo build, sqlx:
- Connects to your development database
- Validates every query against the actual schema
- Generates type-safe Rust structs
- Catches mismatches before deployment
$ cargo sqlx prepare
Connecting to database...
Building query metadata for 94 queries...
Successfully saved query metadata to .sqlx/
$ cargo build
Compiling outbox-service v0.1.0
Finished dev [unoptimized + debuginfo] target(s) in 1.14s
If you change the database schema, queries break at compile time:
$ cargo build
error: error returned from database: column "scheduled_for" does not exist
--> src/scanner.rs:23:5
This eliminates an entire class of production bugs.
When to Use This Pattern
Great fit when you have:
- Mixed latency requirements across different event types
- High-volume events that don’t need MQ overhead
- Time-based events with known schedules
- Need for auditability and event replay
- Rust/TypeScript stack with strong typing requirements
Not ideal when:
- All events have identical requirements (use simpler CDC or pure polling)
- You need cross-datacenter replication (consider event streaming platforms)
- Events are truly ephemeral with no persistence needs
- Your team lacks operational capacity for managing scanners