2024 · Tech lead

Real-Time Event Pipeline

A 12M-req/day event ingest and replay system built on Kafka and Postgres, with per-tenant schema evolution and deterministic replay.

stack
Go, Kafka, Postgres, Terraform, AWS
tags
backend, distributed-systems, data
status
shipped


Problem

The legacy ingest path was a synchronous monolith. Every customer onboarding required a deploy because the schema lived in code. p99 latency was 1.4s on a good day. Schema mistakes leaked into production roughly every six weeks.

Constraints

  • Zero downtime cutover — customers were live the whole time.
  • Schemas had to be customer-editable but contract-tested before activation.
  • Replay had to be deterministic for compliance audits.

What I did

Designed a Kafka-backed pipeline with a schema-registry layer that enforces contract tests on every schema change. Each tenant gets a logical topic; partitioning keeps replay deterministic. The Postgres write path uses logical replication slots to power the audit-replay tool.

// Excerpt from the contract-test runner. Real version handles
// schema evolution graphs; this is the inner loop.
// Backward-compatibility check: any fixture that validates against the
// old schema must still validate against the new one.
func runContractTests(prev, next *Schema, fixtures []Event) error {
    for _, ev := range fixtures {
        if err := validate(prev, ev); err != nil {
            continue // not valid under the old schema: fixture targets the new shape
        }
        if err := validate(next, ev); err != nil {
            return fmt.Errorf("regression on fixture %s: %w", ev.ID, err)
        }
    }
    return nil
}

See the full validator on GitHub →

Outcome

  • p99 latency: 1.4s → 220ms
  • Schema-related incidents: ~9/year → 0 in the last 12 months
  • Onboarding new customer schemas: 3-day deploy cycle → 20-minute self-serve

What I’d do differently

I underestimated how much operational tooling we'd need around replay. If I were doing it again, I'd build the replay UI in parallel with the ingest path, not six months later.