Observability Hub: Event-Driven Observability Platform
A distributed, event-driven log collection platform where Node.js microservices publish structured events to RabbitMQ, and a Go collector with a 20-goroutine worker pool ingests them into PostgreSQL, Redis, and Elasticsearch — with strict-typed contracts via JSON Schema enforced across both TypeScript and Go.
Gallery
Problem Statement
In distributed microservice systems, there is no natural contract between the services emitting logs and the infrastructure consuming them. The result: different services write logs in different shapes, correlation between requests is lost across service boundaries, and there is no guarantee that what a producer emits will be parseable by the collector. The challenge was designing a system where the data contract is enforced at both ends — without blocking the services that emit logs.
Role: Lead Developer & Architect
Observability in microservices is only effective when the data flowing through it is structured, validated, and consistently stored. Without a centralized schema contract, each service emits logs in its own format — leading to broken parsing, lost request context, and an observability layer that cannot be trusted.
Observability Hub addresses this with a layered architecture: a shared @observability-hub/observability package provides all Node.js services with a Circuit Breaker-protected RabbitMQ publisher, a Correlation ID middleware, and Prometheus metrics. A Go collector with a 20-goroutine worker pool consumes these events and fans them out to three storage sinks — PostgreSQL for durable batch writes, Redis for deduplication and metadata caching, and Elasticsearch for full-text search indexing — all while exposing its own Prometheus metrics and handling failures via a Dead Letter Queue.
Architecture & Engineering Approach
A. Message Contracts and Type Safety
Cross-language schema enforcement
Used JSON Schema Draft 7+ to define event structures (Logs, Metrics, Traces). From these schemas, strict TypeScript interfaces and Go structs with json/validate tags are generated. The shared @observability-hub/event-contracts package guarantees both producers and consumers adhere to the exact same data definitions.
B. High-Performance Validation Pipeline
Processing 35K+ operations per second
Developed highly optimized validators including a Simple Validator capable of 35,714 validations/second and a Schema Validator reaching 8,333 ops/second. It enforces field-level error reporting, auto-detection of schema types, and batched validations to keep overhead negligible.
C. The Go Collector
Worker pool, multi-sink fan-out, fault isolation
The collector starts a configurable goroutine pool (default: 20) where all workers race on the same RabbitMQ deliveries channel — zero coordination overhead. Each message is fanned out to three sinks: PostgreSQL via pq.CopyIn batch writes (coordinated by an Adaptive Batch Optimizer that tunes batch size based on Redis cache hit ratio), Redis for EventID-based deduplication (24h TTL) and per-service metadata caching, and Elasticsearch via a fire-and-forget goroutine so that ES unavailability never blocks RabbitMQ acknowledgement. Malformed messages that fail JSON unmarshal receive Nack(false, false) and are routed to a Dead Letter Queue, isolating them from the healthy message flow without infinite retry loops.
D. Versioning and Log Client Library
Seamless backward compatibility
Implemented a rigorous Message Versioning Strategy (SemVer) with a migration framework to ensure older log structures don't break current analytics. A reusable @observability-hub/log-client was created for Node.js Express microservices, enabling developers to easily log business, trace, and metric events with injected correlation IDs.
E. Storage Architecture & Observability Suite
Three-tier storage: durability, speed, and search
**PostgreSQL** is the source of truth. The collector writes to it in batches using pq.CopyIn (the COPY protocol), which dramatically reduces round-trips compared to individual INSERTs. **Redis** plays a dual role inside the collector: it deduplicates events by storing EventID+CorrelationID keys with a 24-hour TTL, and it caches per-service metadata so the Adaptive Batch Optimizer can tune batch sizes at runtime without additional DB queries. **Elasticsearch** is the search layer — events are indexed asynchronously into per-service, per-month indices (logs-{service}-{YYYY-MM}), and failures there never affect the main pipeline because the goroutine writing to ES is decoupled from RabbitMQ acknowledgement.
Resilience: circuit breaker, DLQ, graceful shutdown
The Node.js @observability-hub/observability package embeds a CLOSED/OPEN/HALF_OPEN circuit breaker in the log publisher. If RabbitMQ becomes unavailable, after N failures the circuit opens and subsequent log calls fail fast instead of timing out — preventing cascading failures into HTTP handlers. On the collector side, malformed messages go to a Dead Letter Queue (DLQ) instead of retrying infinitely. On shutdown (SIGINT/SIGTERM), a 10-second drain window lets all in-flight workers finish and flushes the remaining batch buffer before closing connections.
Full distributed tracing and metrics visibility
**Jaeger** traces each request across service boundaries via the OpenTelemetry-compatible @obs/observability tracing module — a single trace ID reveals exactly where latency accumulates. **Prometheus** scrapes collector-level metrics (messages_processed, flush_duration_seconds, cache_hit_ratio, batch_size_optimized) and service-level metrics (request rate, error rate) from every Node.js service. **Grafana** unifies all of this into real-time dashboards. **RabbitMQ:** As the event backbone, it decouples producers from the collector entirely — traffic spikes buffer in the queue instead of back-pressuring the microservices.
Outcome: Measurable benefit
A production-grade observability pipeline where data contract enforcement, fault isolation, and storage fan-out are first-class concerns — not afterthoughts.
- 100% schema coverage across log, trace, and metric event types enforced at both producer and collector.
- TypeScript Simple Validator benchmarks at 35,714 ops/sec; Go collector sustains throughput under continuous load with batch write optimization.
- Three-sink fan-out (PostgreSQL + Redis + Elasticsearch) with independent failure modes — a dead ES node does not affect message acknowledgement.
- Full request traceability across service boundaries via Jaeger and injected correlation IDs.
Engineering Decisions
Architecture Diagram
Cookies & analytics
We use anonymous page views and usage statistics to improve the site. Data is used only for measurement; no personal data is collected.
