Real-Time API Usage Analytics & Billing with ClickHouse and Redis Streams
How GoGoDuk meters API usage at scale with ClickHouse — Redis Streams ingestion, a ReplacingMergeTree schema, and AggregatingMergeTree materialized views for real-time per-customer dashboards and usage-based billing.
Every request to the GoGoDuk map API — an address autocomplete, a reverse geocode, a directions call — has to be metered. We need per-customer usage dashboards, rate-limit accounting, and accurate usage-based billing. The naive approach is to write one row to Postgres per request, then GROUP BY it later. At any serious request volume, that approach quietly destroys your primary database.
This post is the analytics layer that sits behind the rate limiter. If rate limiting is about enforcement — deciding whether a request is allowed right now — analytics is about measurement: counting, aggregating, and billing after the fact. We covered the enforcement side in Building Rate Limiters with Redis Lua Scripts. Here we build the measurement side with ClickHouse and Redis Streams.
Why Postgres and Redis Aren't Enough for Usage Analytics
The two stores we already run are great at what they do, and wrong for this job.
Redis holds live counters. It answers "is this API key over its limit right now?" in microseconds. But it keeps no history — you cannot ask it "how many requests did this customer make per endpoint over the last 30 days?"
Postgres is transactional (OLTP). It is excellent for reads and writes against indexed rows, which is exactly why we lean on it for the geocoding hot path — see PostGIS Performance Tuning. But analytical scans — GROUP BY organization, day across hundreds of millions of rows — are a different shape of problem. They bloat the primary DB, compete with production traffic for buffer cache, and get slower every week.
The trigger for us was concrete: a per-customer "usage over the last 30 days, broken down by endpoint" dashboard query started timing out. That is the moment you stop bolting analytics onto your OLTP database and move it to a columnar OLAP store. ClickHouse is built exactly for this: columnar storage, aggressive compression, and aggregation queries that scan billions of rows in well under a second.
Don't Write Analytics on the Hot Path
The first instinct — INSERT INTO usage_events synchronously inside the request handler — is a trap for two reasons.
First, it adds a network round-trip and a write to the latency of every single API call. Analytics should never make the product slower.
Second, ClickHouse hates many tiny inserts. Each INSERT creates a new data "part" on disk that the engine must later merge in the background. Thousands of single-row inserts per second produce a storm of tiny parts and crushing merge pressure. ClickHouse strongly prefers fewer, larger, batched inserts.
Both problems have the same answer: get the event off the request path immediately, and batch the writes elsewhere. The request handler emits a usage event to a Redis Stream (usage_logs_stream) and returns. A separate background worker drains that stream and writes to ClickHouse in batches.
API request ──emit──▶ Redis Stream (usage_logs_stream)
│
▼
Go consumer-group worker
(batch up to 1000)
│
▼
ClickHouse usage_eventsRedis Streams is a deliberate choice over a plain list or pub/sub: it is a durable, append-only log with consumer groups, which gives us at-least-once delivery, acknowledgements, and the ability to recover messages a crashed worker never finished.
The Ingestion Worker: a Redis Streams Consumer Group
The worker is a small Go service. It reads new messages with XReadGroup, processes them in batches, and acknowledges them. The important detail is that it also reclaims messages that were delivered to a consumer that died before acknowledging, using XAutoClaim — so a crash mid-batch never silently loses events.
const (
StreamName = "usage_logs_stream"
GroupName = "clickhouse_consumer"
BatchSize = 1000
IdleTimeout = 2 * time.Second
)
func (w *ClickhouseWorker) consume(repo *clickhouse.Repository) error {
for {
// 1) Reclaim messages a dead consumer never ACKed.
pending, _, err := w.rdb.XAutoClaim(w.ctx, &redis.XAutoClaimArgs{
Stream: StreamName, Group: GroupName, Consumer: "worker-1",
MinIdle: 5 * time.Second, Start: "0-0", Count: BatchSize,
}).Result()
if err != nil && !errors.Is(err, redis.Nil) {
return err
}
if len(pending) > 0 {
if err := w.processMessages(repo, pending); err != nil {
return err
}
continue
}
// 2) Read a fresh batch, blocking up to IdleTimeout.
streams, err := w.rdb.XReadGroup(w.ctx, &redis.XReadGroupArgs{
Group: GroupName, Consumer: "worker-1",
Streams: []string{StreamName, ">"},
Count: BatchSize, Block: IdleTimeout,
}).Result()
if errors.Is(err, redis.Nil) {
continue
}
if err != nil {
return err
}
for _, s := range streams {
if err := w.processMessages(repo, s.Messages); err != nil {
return err
}
}
}
}processMessages parses each entry into a typed UsageEvent, drops anything malformed into a dead-letter stream (usage_logs_dead_letter) instead of crashing the pipeline, batch-inserts the good rows, and only then XAcks the whole batch. The ClickHouse write uses a prepared batch — one INSERT for up to a thousand rows:
batch, _ := conn.PrepareBatch(ctx, `INSERT INTO usage_events (...)`)
for _, e := range events {
batch.Append(e.RequestID, e.OrganizationID, e.Endpoint, /* ... */)
}
if err := batch.Send(); err != nil {
return err
}A property worth calling out: if ClickHouse is unavailable, the worker logs it and keeps Postgres metering active while the events accumulate safely in the Redis Stream. The worker reconnects on a fixed delay and drains the backlog. ClickHouse being down degrades analytics freshness — it never drops events and never touches the request path.
At-Least-Once Delivery, Effectively-Once Billing
Consumer groups guarantee at-least-once delivery: after a crash-and-reclaim, the same event can be processed twice. For billing, double-counting is unacceptable. We handle it on two layers:
- The worker deduplicates within a batch by
requestIdbefore inserting. - The table is a
ReplacingMergeTree(ingestedAt)keyed so that duplicaterequestIds collapse to a single row during background merges.
Together these turn at-least-once delivery into effectively-once accounting — the foundation of a billing system you can trust.
Designing the usage_events Schema
Here is the raw events table, trimmed to its load-bearing columns:
CREATE TABLE usage_events
(
requestId String,
organizationId String,
apiKeyId Nullable(String),
operation LowCardinality(String),
endpoint LowCardinality(String),
statusCode UInt16,
statusClass UInt8,
elapsedMs UInt32,
providerUsed LowCardinality(String) DEFAULT '',
country LowCardinality(String) DEFAULT '',
createdAt DateTime64(3),
ingestedAt DateTime64(3) DEFAULT now64(3)
)
ENGINE = ReplacingMergeTree(ingestedAt)
PARTITION BY toYYYYMM(createdAt)
ORDER BY (organizationId, toDate(createdAt), endpoint, createdAt, requestId)
TTL createdAt + INTERVAL 90 DAY DELETE;A few decisions matter here:
LowCardinality(String)onendpoint,operation,providerUsed,country, and similar columns. These have a handful of distinct values across the whole table;LowCardinalitydictionary-encodes them for a large compression and speed win versus plainString.ORDER BY (organizationId, toDate(createdAt), endpoint, ...)— the sort key leads withorganizationIdbecause the dominant access pattern is per-customer: every dashboard and every invoice filters by organization first. Leading the primary key with the column you filter on lets ClickHouse skip almost all the data.PARTITION BY toYYYYMM(createdAt)plusTTL ... 90 DAY DELETEmakes retention nearly free: old months are whole partitions that get dropped, not row-by-row deletes.ReplacingMergeTree(ingestedAt)gives us the idempotency described above, using the latestingestedAtas the version.
Pre-Aggregating with Materialized Views
Reading raw events for every dashboard load is wasteful: a dashboard wants "requests per endpoint per hour for this customer", not a billion individual rows. So we roll the raw stream up as it arrives, using materialized views that write into AggregatingMergeTree tables.
CREATE TABLE usage_hourly_identity
(
hour DateTime,
organizationId String,
endpoint LowCardinality(String),
requests AggregateFunction(uniqCombined64, String),
errors AggregateFunction(uniqCombined64, String),
latencyMs AggregateFunction(quantilesTiming(0.5, 0.95, 0.99), UInt32)
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(hour)
ORDER BY (organizationId, hour, endpoint);
CREATE MATERIALIZED VIEW usage_hourly_identity_mv TO usage_hourly_identity AS
SELECT
toStartOfHour(createdAt) AS hour, organizationId, endpoint,
uniqCombined64State(requestId) AS requests,
uniqCombined64IfState(requestId, statusCode >= 400) AS errors,
quantilesTimingState(0.5, 0.95, 0.99)(elapsedMs) AS latencyMs
FROM usage_events
GROUP BY hour, organizationId, endpoint;The *State combinators (uniqCombined64State, quantilesTimingState) store partial aggregation states, not final values — so percentiles and distinct counts stay correct as ClickHouse merges hourly buckets over time. A dashboard reads from this compact rollup and responds fast, regardless of how large the raw table grows. This same rollup powers the 80% and 100% quota thresholds and the monthly invoice totals that customers see in their GoGoDuk dashboard.
Querying for Real-Time Insight
To read an AggregatingMergeTree rollup you finalize the stored states with the matching *Merge combinator. For example, a per-customer usage summary over a window:
SELECT
uniqCombined64Merge(requests) AS requests,
uniqCombined64Merge(errors) AS errors,
quantilesTimingMerge(0.5, 0.95, 0.99)(latencyMs) AS latency
FROM usage_hourly_identity
WHERE organizationId = {org:String}
AND hour >= {from:DateTime} AND hour < {to:DateTime};Top endpoints by p95 latency over the last seven days:
SELECT
endpoint,
quantilesTimingMerge(0.95)(latencyMs)[1] AS p95_ms,
uniqCombined64Merge(requests) AS requests
FROM usage_hourly_identity
WHERE hour >= now() - INTERVAL 7 DAY
GROUP BY endpoint
ORDER BY p95_ms DESC;Because we read pre-aggregated hourly buckets instead of raw events, these stay fast as history grows — the cost scales with the number of hours, not the number of requests.
Operational Lessons and Pitfalls
- Batch size is a trade-off. Too small and you recreate the tiny-inserts problem. Too large and a crash loses a bigger in-flight window (and uses more memory). A batch around a thousand rows with a short idle flush has been a good balance for us.
- Async means eventual. Dashboards lag reality by the flush interval. That is perfectly fine for analytics and billing, but it is exactly why Redis — not ClickHouse — still owns live rate-limit enforcement. The two-system split is intentional: Redis for the decision now, ClickHouse for the truth over time.
- Watch the dead-letter stream. A spike in
usage_logs_dead_lettermeans a producer changed its event shape. Alert on it; don't let bad events vanish silently. - Let partitions and TTL do retention. Dropping a month-old partition is nearly instant; deleting rows one by one is not.
Wrapping Up
The architecture comes down to a clean division of labor: Redis enforces limits in real time, ClickHouse measures usage at scale, and materialized views serve dashboards and billing instantly. Three takeaways worth carrying to any high-volume API:
- Never meter synchronously on the hot path — emit to a durable log and batch the writes.
- Match the storage engine to the access pattern: OLTP (Postgres) for transactions, OLAP (ClickHouse) for aggregation.
- Pre-aggregate with materialized views so dashboard cost scales with time, not traffic.
For the enforcement side of this same system, see Building Rate Limiters with Redis Lua Scripts.
Facing performance issues or scaling challenges?
I specialize in building low-latency map infrastructure, real-time streaming pipelines (Kafka, ClickHouse), and highly optimized backend systems. Let's work together to scale your product.
Related Articles
8 Jun 2026
Redis 8.8 Released: New Native Rate Limiter, Array Data Structure, and Up to +83% Performance Boost
A deep technical breakdown of the newly released Redis 8.8 (June 2, 2026). Explore the new O(1) sparse-friendly Array structure by antirez, the native INCREX rate-limiting command, and Hash field-level subkey notifications.
8 Jun 2026
PostGIS Performance Tuning: From 2s to 10ms for Vietnamese Spatial Queries (Gogoduk Case Study)
Explore practical PostGIS database optimization techniques from the Gogoduk Map API project. Learn how migrating from Geometry to Geography, designing Partial GIST indexes, and simplifying polygons can achieve 10ms query times.
8 Jun 2026
Redis Lua Script & SETNX: High-Performance Rate Limiting & Quota Alerting for APIs
Learn how Gogoduk builds API Rate Limiting and Quota Alerting systems with Redis and Go. Discover how to use Lua scripts for atomicity, prevent memory leaks, and leverage SETNX to deduplicate notifications.
4 Jun 2026
PostgreSQL Full-Text Search: Optimizing Fast Address Autocomplete for Vietnamese Text
A comprehensive guide to building a fast, accent-insensitive, and typo-tolerant Vietnamese address autocomplete system in PostgreSQL using unaccent, FTS, and Trigram indexing.