Hasura Cloud Billing Engine

Context and business stakes

When I joined this project, Hasura Cloud billing was still optimized for a smaller stage of the business. Plans were mostly fixed tiers, for example a flat monthly amount, and that worked until enterprise conversations became more advanced.

The sales conversations changed first. Customers were asking for contracts that reflected real usage patterns across request volume, throughput, and workloads. Our billing system could not express those deals without heavy engineering support every single time. That created a bottleneck in two places:

revenue operations, because pricing iteration depended on engineering cycles
customer onboarding, because final contract to invoice mapping was slow and fragile

The business stakes were immediate. We had a hard deadline of roughly 10 weeks before several enterprise deals needed a working usage-based model. Missing that window would not just delay one feature; it would directly slow revenue expansion and reduce confidence in our enterprise motion.

Constraints and non-goals

We had to design this as a financial system, not just another backend service. That meant correctness, determinism, and recoverability were first-class requirements.

Hard constraints:

invoice correctness had to be defensible under audit scrutiny
usage inputs were distributed across Prometheus and BigQuery
all monthly processing had to run across multiple cloud regions reliably
implementation timeline was fixed by business commitments

Non-goals for V1:

a fully real-time usage dashboard in-product
complete automation for every rare failure mode in month one
immediate removal of all legacy fixed-tier plan behavior

Those non-goals were intentional. We cut scope on things that were visible but non-critical, while preserving things that directly affected money movement and customer trust.

Architecture overview

The architecture separated data aggregation, pricing logic, and invoicing so each layer could evolve independently.

Billing system architecture

rendering diagram...

Key mechanics:

Regional workers ran daily jobs to aggregate usage per project, customer, and metric.
Aggregates were persisted in partitioned PostgreSQL tables using a time-based strategy (yyyymm with quarterly operational windows).
Pricing configuration was modeled as composable metric formulas, so plans could represent combinations like base price plus metric multipliers.
Stripe handled invoice artifacts while Hasura controlled usage calculation and charge inputs.

This split gave us a practical control plane: business teams could iterate on plan structure, while engineering retained a reliable calculation core.

Critical design decisions and tradeoffs

1) Postgres as transactional core

For billing, ACID guarantees were non-negotiable. We selected PostgreSQL for the source of financial truth and used partitioning to keep write throughput and report queries predictable as volume grew.

Tradeoff: we accepted tighter vertical constraints on the transactional core while still using distributed systems for high-volume telemetry sources.

2) Daily aggregation over raw-stream billing

We intentionally used daily aggregated usage snapshots rather than trying to price directly on raw high-cardinality streams in V1.

Why this worked:

easier reconciliation with finance
deterministic daily checkpoints for investigation and backfills
simpler query patterns for invoice generation

Tradeoff: less granular intra-day visibility, which we planned to address through later dashboard work.

3) Formula-driven pricing model

Instead of hardcoded plans, we modeled prices as configuration-backed metric formulas. This reduced engineering handoffs during deal structuring.

Tradeoff: we added strict validation guardrails to prevent invalid or ambiguous formulas from reaching billing execution.

4) Staged rollout and selective manual operations

To hit the 10-week deadline safely, we deferred some low-frequency automation:

rare retry edge cases initially surfaced through Slack alerts and were handled with controlled manual intervention
legacy fixed-tier support remained behind explicit guardrails during transition

Tradeoff: more operational runbook load in the short term, in exchange for shipping core correctness and business enablement on time.

Failure modes and mitigations

We treated failure-path design as part of launch criteria, not a post-launch clean-up.

Incomplete or delayed usage data

Risk: delayed aggregation could produce missing invoice inputs. Mitigation: daily checkpoint jobs, explicit completeness validation, and rerun-capable workers by region.

Incorrect plan configuration

Risk: invalid formulas could generate billing anomalies. Mitigation: strict config validation before activation plus internal review workflows for high-value plans.

Invoice generation bottlenecks

Risk: large monthly invoice runs could regress into multi-day cycles. Mitigation: batched query patterns on partitioned tables and focused operational dashboards for job progress.

Regulatory retention drift

Risk: storing usage records beyond policy windows. Mitigation: enforced two-year retention window and controlled cleanup policies.

The guiding idea was simple: if something fails, we should know where it failed, why it failed, and how to recover without financial ambiguity.

Outcomes with concrete metrics

The launch landed on schedule and supported immediate business outcomes:

shipped within the 10-week delivery window
onboarded 3 major enterprise customers at launch
increased paid conversion by roughly 20 percent compared to the old fixed-tier motion
reduced invoice generation from multi-day operations to same-day processing
established the foundation for Hasura's later pay-as-you-go expansion

Beyond the numbers, the system changed team dynamics. Finance and sales gained controlled flexibility in plan management, and engineering moved from reactive pricing edits to platform stewardship.

What I'd change now

If I were rebuilding the same platform today, I would ship two upgrades earlier:

A near real-time customer usage visibility layer with clear alignment to invoice snapshots, so customers can self-verify trends before billing cutoffs.
A policy-driven retry and exception workflow that preserves the same correctness guarantees but reduces manual intervention for low-frequency edge cases.

I would still keep the same principle stack:

deterministic usage checkpoints
strict transactional billing core
configuration-driven pricing
explicit guardrails around every state transition that affects money

This project was a strong reminder that billing systems are never just technical plumbing. They are revenue infrastructure, and the right architecture has to balance correctness, speed, and adaptability at the same time.