
Klarna SEPA Instant Payments

Klarna | Software Engineer II

Oct 2021 - Oct 2022

payments, sepa-instant, reliability, observability, distributed-systems

Context and business stakes

At Klarna, this project introduced SEPA Instant capabilities for customers who were previously limited to slower transfer rails. The shift mattered because it changed the core customer experience from scheduled settlement to near real-time money movement.

The external envelope was strict: transfer handling had to fit within a hard timing budget, and that budget was shared across multiple institutions, not only our own systems. That meant every internal stage had to be fast, observable, and predictable.

From a business standpoint, this was both a product unlock and a trust contract. If we shipped without operational control, even a technically working path could generate support volume and customer anxiety.

Constraints and non-goals

Hard constraints:

  • MVP delivery window around 6 to 8 weeks
  • strict end-to-end timing budget per transfer
  • dependency chain across Form3, threat/security checks, internal accounting systems, and clearinghouse handoff
  • compliance and auditability requirements for payment state transitions

Non-goals for phase one:

  • perfect automation for every rare failure branch at initial launch
  • full retry and dead-letter queue (DLQ) sophistication before happy-path readiness
  • broad optimization of non-critical routes before core rail stability

We intentionally staged this system into phases so we could ship value without pretending the hardening work was already complete.

Architecture overview

The core architecture was an event-driven, multi-stage processing chain with explicit stage ownership and latency monitoring.

[Diagram: SEPA Instant transfer lifecycle]
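
As a rough illustration of the staged chain, the sketch below runs a transfer through named stages and records how much of a shared timing budget each hop consumes. The stage names, the 10-second `BUDGET_SECONDS` value, and the `run_pipeline` helper are assumptions made for this example; the production system was event-driven across queues rather than a single in-process loop.

```python
import time
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed end-to-end budget in seconds (illustrative, not the contractual value).
BUDGET_SECONDS = 10.0


@dataclass
class TransferTrace:
    """Per-transfer record of how long each stage took."""
    transfer_id: str
    stage_seconds: List[Tuple[str, float]] = field(default_factory=list)

    @property
    def elapsed(self) -> float:
        return sum(seconds for _, seconds in self.stage_seconds)

    @property
    def remaining(self) -> float:
        return BUDGET_SECONDS - self.elapsed


def run_pipeline(transfer_id, stages, clock=time.monotonic) -> TransferTrace:
    """Run each (name, handler) stage in order and time it.

    Raises TimeoutError as soon as the cumulative time uses up the budget,
    so a slow stage is reported by name instead of failing silently.
    """
    trace = TransferTrace(transfer_id)
    for name, handler in stages:
        started = clock()
        handler(transfer_id)
        trace.stage_seconds.append((name, clock() - started))
        if trace.remaining <= 0:
            raise TimeoutError(
                f"{transfer_id}: budget exhausted after stage {name!r}"
            )
    return trace
```

Passing the clock in as a parameter keeps the timing logic testable without real sleeps, which matters when the budget itself is the thing under test.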

A second practical path handled internal transfers more efficiently:

[Diagram: Intra-Klarna short-circuit path]

This short-circuit reduced avoidable external dependencies for internal transfers and improved latency headroom.
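
The routing decision behind the short-circuit can be sketched as a small pure function. This is a minimal sketch, assuming internal accounts can be recognized through a lookup callback; `Route`, `Transfer`, `choose_route`, and `is_internal_account` are hypothetical names, and which checks still ran on the internal path is not covered here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Route(Enum):
    INTERNAL_BOOK_TRANSFER = "internal_book_transfer"
    EXTERNAL_SEPA_INSTANT = "external_sepa_instant"


@dataclass(frozen=True)
class Transfer:
    debtor_iban: str
    creditor_iban: str
    amount_cents: int


def choose_route(
    transfer: Transfer, is_internal_account: Callable[[str], bool]
) -> Route:
    """Pick the internal path only when both sides are internal accounts."""
    if is_internal_account(transfer.debtor_iban) and is_internal_account(
        transfer.creditor_iban
    ):
        return Route.INTERNAL_BOOK_TRANSFER
    return Route.EXTERNAL_SEPA_INSTANT
```

For example, with a set of placeholder account identifiers, `choose_route(transfer, internal_ibans.__contains__)` returns the internal route only when both IBANs are in the set.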

Critical design decisions and tradeoffs

1) Two-phase delivery plan

I proposed a split:

  • phase 1: stable happy path with end-to-end integration
  • phase 2: retries, DLQs, richer alerting, and broader edge-case automation

Tradeoff: some failure automation arrived after MVP, but we protected launch confidence and deadline commitments.

2) Real integration testing over mock-heavy confidence

We leaned on Form3 staging integration with dedicated queues and realistic message flows. This surfaced timing and contract issues earlier than synthetic tests would have.

Tradeoff: test cycles were heavier, but fidelity was much higher and reduced production surprise.

3) Audit and observability as architecture, not logging afterthought

We introduced fine-grained transfer lifecycle logs and Grafana stage metrics so we could trace where a transfer was and how long each hop consumed.

Tradeoff: upfront engineering effort increased, but operational debugging time dropped significantly after launch.
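
To make "where a transfer was and how long each hop consumed" concrete, here is a minimal sketch of lifecycle events and two queries over them. The `AuditEvent` shape, the status values, and the helper names are assumptions for illustration, not the schema we actually logged.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class AuditEvent:
    transfer_id: str
    stage: str
    status: str  # assumed values: "entered" or "completed"
    at: datetime


def hop_durations(events: Iterable[AuditEvent]) -> Dict[str, float]:
    """Seconds between 'entered' and 'completed' for each finished stage."""
    entered: Dict[str, datetime] = {}
    durations: Dict[str, float] = {}
    for event in sorted(events, key=lambda e: e.at):
        if event.status == "entered":
            entered[event.stage] = event.at
        elif event.status == "completed" and event.stage in entered:
            durations[event.stage] = (event.at - entered[event.stage]).total_seconds()
    return durations


def current_stage(events: Iterable[AuditEvent]) -> Optional[str]:
    """Most recently entered stage that has not completed, or None."""
    open_stages: Dict[str, datetime] = {}
    for event in sorted(events, key=lambda e: e.at):
        if event.status == "entered":
            open_stages[event.stage] = event.at
        elif event.status == "completed":
            open_stages.pop(event.stage, None)
    if not open_stages:
        return None
    return max(open_stages, key=lambda stage: open_stages[stage])
```

The same durations could feed stage-level metrics, while `current_stage` answers the question support asks first: where is this transfer right now?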

4) Contract alignment across sister teams

The flow crossed team boundaries. We aligned expectations and stage-level behaviors with Core Account and Threat Service teams so SLA assumptions were explicit.

Tradeoff: additional coordination overhead, but fewer hidden integration mismatches late in the cycle.

Failure modes and mitigations

Stage timeout and queue buildup

Risk: transfers breach SLA when one stage slows down.

Mitigation: stage latency dashboards, alert thresholds, and queue-level monitoring.
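
The threshold check behind such an alert can be sketched in a few lines. This is an illustration only: the stage names, the millisecond thresholds, and both helper functions are assumptions, and in practice a check like this would normally live in the monitoring stack's alert rules rather than in application code.

```python
import math
from typing import Dict, List


def percentile(samples: List[int], pct: int) -> int:
    """Nearest-rank percentile of a non-empty list of latency samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks stay exact.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def stages_over_threshold(
    latencies_ms: Dict[str, List[int]],
    thresholds_ms: Dict[str, int],
    pct: int = 95,
) -> Dict[str, int]:
    """Return each stage whose pct-th percentile latency exceeds its threshold."""
    breaches: Dict[str, int] = {}
    for stage, samples in latencies_ms.items():
        observed = percentile(samples, pct)
        if observed > thresholds_ms[stage]:
            breaches[stage] = observed
    return breaches
```

Alerting per stage, not only on the end-to-end number, is what lets on-call see which hop is eating the budget.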

Unprocessed messages

Risk: silent transfer stagnation in asynchronous workflows.

Mitigation: retry policies with exponential backoff and DLQ routing with on-call Slack alerting.
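
A minimal sketch of that retry-then-dead-letter flow follows. The attempt count, delays, and the `dead_letters` list and `alert` callback are placeholders for a real DLQ and Slack notification; none of these values are the production settings.

```python
import time
from typing import Callable, List


def backoff_delays(base_seconds: float, factor: float, max_attempts: int) -> List[float]:
    """Delay before each retry: base, base*factor, base*factor^2, ..."""
    return [base_seconds * factor ** i for i in range(max_attempts - 1)]


def process_with_retry(
    message: dict,
    handler: Callable[[dict], None],
    dead_letters: list,
    alert: Callable[[str], None],
    *,
    max_attempts: int = 4,
    base_seconds: float = 0.5,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the handler up to max_attempts times, then dead-letter and alert."""
    delays = backoff_delays(base_seconds, factor, max_attempts)
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except Exception as exc:  # sketch only: real code would catch narrower errors
            last_error = exc
            if attempt < max_attempts:
                sleep(delays[attempt - 1])
    # Retries exhausted: park the message where it stays visible, then page someone.
    dead_letters.append(
        {"message": message, "error": repr(last_error), "attempts": max_attempts}
    )
    alert(f"message dead-lettered after {max_attempts} attempts: {last_error!r}")
    return False
```

The point of the final branch is that a message never just disappears: it either succeeds or lands somewhere a human is told about.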

Ambiguous transfer state for support

Risk: support teams escalate blindly when status is unclear.

Mitigation: lifecycle audit logs and per-stage state visibility to reduce guesswork.
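
One way to keep state unambiguous is an explicit transition table plus a plain-language description per state. The states, the allowed transitions, and the support wording below are all hypothetical; they illustrate the idea, not our actual state model.

```python
from enum import Enum


class TransferState(Enum):
    RECEIVED = "received"
    SCREENING = "screening"
    BOOKED = "booked"
    SENT_TO_CLEARING = "sent_to_clearing"
    SETTLED = "settled"
    REJECTED = "rejected"


S = TransferState

# Assumed legal transitions; anything else is treated as a bug.
ALLOWED = {
    S.RECEIVED: {S.SCREENING, S.REJECTED},
    S.SCREENING: {S.BOOKED, S.REJECTED},
    S.BOOKED: {S.SENT_TO_CLEARING, S.REJECTED},
    S.SENT_TO_CLEARING: {S.SETTLED, S.REJECTED},
    S.SETTLED: set(),
    S.REJECTED: set(),
}

SUPPORT_TEXT = {
    S.RECEIVED: "Transfer received, processing has not started yet.",
    S.SCREENING: "Transfer is in security screening.",
    S.BOOKED: "Transfer is booked internally, not yet sent out.",
    S.SENT_TO_CLEARING: "Transfer was handed to clearing, awaiting confirmation.",
    S.SETTLED: "Transfer completed.",
    S.REJECTED: "Transfer was rejected, see the audit log for the reason.",
}


class IllegalTransition(ValueError):
    pass


def advance(current: TransferState, target: TransferState) -> TransferState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise IllegalTransition(f"{current.value} -> {target.value}")
    return target


def describe_for_support(state: TransferState) -> str:
    return SUPPORT_TEXT[state]
```

Rejecting illegal transitions at write time means the audit trail can be trusted at read time, which is what support actually depends on.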

External dependency volatility

Risk: upstream or downstream behavior drifts unexpectedly.

Mitigation: end-to-end integration validation in staging and clear operational runbooks for known degradation modes.

The key reliability move was to make bad states visible and actionable, not hidden behind partial success logs.

Outcomes with concrete metrics

The delivery sequence achieved the intended business and operational outcomes:

  • MVP shipped on schedule for high-priority use cases
  • production latency stayed comfortably within the SLA budget, with P95 around 4 to 5 seconds
  • phase 2 added robust retries, DLQ support, and richer failure diagnostics
  • support and on-call burden for stuck or unclear transfers dropped after observability hardening
  • implementation patterns became a template for future real-time scheme work

This project reinforced a lesson I value: in payments, throughput matters, but state clarity matters more when systems fail.

What I'd change now

I would prioritize transaction observability even earlier in the timeline. We reached the right end state, but earlier stage-level instrumentation would have reduced integration uncertainty sooner.

I would also add one more design upgrade from day one:

  • a transfer-state timeline view that combines queue metadata, stage latencies, and policy decisions in one operational surface for engineers and support teams
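
A minimal sketch of what that timeline view could look like, assuming each source can be reduced to `(offset_ms, detail)` pairs for one transfer. `TimelineEntry`, `build_timeline`, `render`, and the sample event text are hypothetical; this was never built in this form.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Event = Tuple[int, str]  # (milliseconds since the transfer was received, detail)


@dataclass(frozen=True)
class TimelineEntry:
    offset_ms: int
    source: str  # "queue", "stage", or "policy"
    detail: str


def build_timeline(
    queue_events: Iterable[Event],
    stage_events: Iterable[Event],
    policy_events: Iterable[Event],
) -> List[TimelineEntry]:
    """Merge the three sources into one list ordered by time offset."""
    entries: List[TimelineEntry] = []
    for source, events in (
        ("queue", queue_events),
        ("stage", stage_events),
        ("policy", policy_events),
    ):
        entries.extend(TimelineEntry(offset, source, detail) for offset, detail in events)
    # sorted() is stable, so entries with equal offsets keep source order.
    return sorted(entries, key=lambda entry: entry.offset_ms)


def render(entries: Iterable[TimelineEntry]) -> List[str]:
    """One human-readable line per entry."""
    return [f"+{e.offset_ms} ms [{e.source}] {e.detail}" for e in entries]
```

For instance, merging a queue enqueue at 0 ms, a stage completion at 850 ms, and a policy decision at 400 ms yields a single ordered list that an engineer or a support agent can read top to bottom.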

The principle I would keep unchanged is phased delivery with explicit reliability milestones. In systems with external dependencies and hard SLAs, pretending everything is solved in one release is riskier than shipping a disciplined phase one and hardening fast.