OpsConsole — Deployment Pipeline Observability Dashboard
A demo snapshot of an internal ops tool automating pipeline observability and on-call routing for small dev teams.
Demo Snapshot
This is a curated demo snapshot. Real project data is reviewed before publishing.
PRD
Section Layout ↔ Industry-Standard PRD Template
| Industry-Standard PRD Section | This Fixture's Section | Depth Notes |
|---|---|---|
| Problem Statement | Problem Definition | Slack channel saturation; PagerDuty lacks unified context view |
| Success Metrics | Success Metrics | Median MTTR 12 min to 4 min 30 s |
| Non-Functional Requirements | Non-Functional Requirements | viewer/operator/admin (3 roles); webhook-to-dashboard lag p95 under 3 s |
| (Domain-specific) | On-Call Flow | severity, throttle (10 min), rotation, escalation (15 min) |
Project Overview
OpsConsole is an internal operations tool that helps teams of 5 to 20 engineers observe deploy pipeline events on a single dashboard and route alerts to the right on-call responder without drowning in Slack noise. As CI/CD runs grow to hundreds per day, shared Slack channels saturate with system logs and real outage signals get buried; OpsConsole targets that failure mode head-on. The product goal is a sub-5-minute MTTR from detect to first human ack; to hit that, severity classification, throttle windows, and escalation policies are first-class primitives, not afterthoughts. Vooster was used to structure the requirements and auto-distribute sprint tasks so that a single SRE plus a reviewer can ship and maintain the tool. OpsConsole closes two gaps, pipeline visibility and alert fidelity, without introducing a separate observability SaaS (Datadog, New Relic). The Vooster-generated PRD and task tree were fed directly into Claude Code, shipping webhook ingestion, the throttle engine, and the Slack template service in 2-3 day cycles each. After one month in production, median MTTR dropped from 12 minutes to 4 minutes 30 seconds, making OpsConsole the "fastest ROI internal tool" on the team's list.
Problem Statement
Growth-stage startup ops teams cannot separate signal from noise when deploy events spike, and the incident response lag grows accordingly. The incumbent options fall short in four distinct ways. First, raw Slack webhook integrations mix successful builds with failures, so severity is indistinguishable at a glance. Second, stand-alone PagerDuty can page the on-call but cannot show "what broke, in which environment, and how many times in a row" in a single dashboard view. Third, log viewers like Grafana or Loki lack a semantic classification of pipeline events (build/test/deploy), forcing each team to rebuild a dashboard from scratch. Fourth, Slack App workflows on their own cannot accumulate or query state such as throttle and escalation, which makes post-incident retros impossible. In practice, real outage notifications arrive 7-10 minutes late on average, and alert floods from repeat errors train responders to tune out notifications entirely. OpsConsole fuses "semantic event classification + severity-based throttling + time-aware routing" into a single workflow that unifies the dashboard with paging. The guiding principle is that an on-call engineer should be able to answer "what is broken right now, and who does it belong to" in under five seconds. On top of that, every alert has a permanent history so retros can examine the bottlenecks the next week.
Target Personas
Persona A — On-call SRE / Platform Engineer (alias: Hyunwoo Jung)
Role: Platform team engineer at a 10-15 engineer org, weekly night on-call rotation
Context: Must respond instantly to Slack alerts outside business hours and decide severity within the first five minutes
Daily Pain:
- Wastes time hand-filtering prod failures out of a webhook stream mixed with staging noise
- Builds alert fatigue from repeat errors and starts ignoring new, important signals
- Must open GitHub/GitLab tabs repeatedly to see the diff, commit, and author for a given failure
- Scrolls Slack endlessly during post-mortems to reconstruct "what happened at which minute"
- Cannot trace who handed off to whom because the tooling does not record handoffs
Usage Context:
- A 2 AM Slack DM should land on a dashboard link that shows pipeline state, logs, and deploy diff in one view
- First five minutes of an incident should be enough to decide whether to roll back
- Shares the auto-generated incident summary page the next morning for the team retro
Current alternatives: Slack webhooks + PagerDuty starter plan, homegrown log scrapers
Adoption trigger: Last quarter's 22-minute MTTR outage from a missed notification, called out in the retro
Core needs: severity-based throttle + environment color coding + a unified view of recent deploy diffs
Persona B — Deploy-approving Tech Lead (alias: Sejung Yoon)
Role: Product team tech lead, reviews 40-70 PRs per week and approves prod deploys
Context: Watches success/failure signals for the ten minutes after a deploy and moves on
Daily Pain:
- Opens multiple windows to check pipeline status across several services
- Discovers teammate deploy failures first and has to DM the author manually
- Approval workflow and monitoring are separate, so some approved deploys drift out of sight
- Hand-computes weekly deploy success rate for the release notes
- Re-explains "where everything runs" verbally every time a new hire onboards
Usage Context:
- Filters currently running pipelines on the dashboard and auto-mentions the author on failure
- Pulls the week's success rate and MTTR for retros
- Shares the dashboard link with new hires as an instant service map
Current alternatives: GitHub Actions dashboard, personal Slack DM channel
Adoption trigger: Service count crosses five and tab-switching cost exceeds the threshold
Core needs: per-service filters, role-based access, weekly reports
Persona C — QA / Release Manager (alias: Soomin Han)
Role: Release manager at a mid-market B2B SaaS team, owns the twice-weekly prod release cadence
Context: Watches pipeline state around the release window and coordinates hotfix flows on failure
Daily Pain:
- Collects post-release success/failure context from Slack, PR comments, and dashboards by hand
- Copies weekly deploy success rate into an exec-facing slide manually
- Struggles to distinguish staging pre-check failures from prod failures in Slack
Usage Context:
- Applies a 24-hour window filter around the release to snapshot every pipeline in scope
- Exports the weekly report as CSV to roll up quarterly retro decks
Current alternatives: Spreadsheet-based manual tracking plus GitHub Actions tabs
Adoption trigger: Release cadence grows and manual tracking becomes the bottleneck
Core needs: time-range filters, CSV export, success-rate breakdown by environment
User Stories
US-001: As an on-call SRE, I want to see only critical-severity pipeline failures in my paging feed, so that I can focus on real incidents without alert fatigue.
Acceptance Criteria:
- Critical events only in the default view; info/warn are collapsed by default
- Repeat events within 10 min on the same pipeline are throttled to a counter
- Each critical card has a colored environment badge (prod/staging)
- Tapping the card jumps to the last three deploy diffs and author info
- Empty state explicitly signals "all clear" when no critical events exist
US-002: As an SRE, I want unacknowledged pages to escalate to a secondary on-call after 15 minutes, so that a single missed notification does not prolong downtime.
Acceptance Criteria:
- Pages escalate to secondary after 15 minutes without ack
- Escalation events link back to the original alert thread as replies
- Escalation rules are configurable per service
- Audit log records timestamp, original responder, and final responder
- After 30 minutes broadcast to the admin group; after 60 minutes page the org owner
US-003: As a tech lead, I want to filter pipeline events by service name and environment, so that I can scope my review to the services I own.
Acceptance Criteria:
- Top-bar filter allows multi-select of service and environment
- Filter state is reflected in the URL for bookmarking and sharing
- Empty filter results show an explicit empty state
- Five most recent filter combinations are available as quick-access chips
- Filters can be saved as personal or team-shared presets
US-004: As an on-call responder, I want the Slack alert message to include the last successful deploy's commit SHA, so that I can instantly compare what changed before rolling back.
Acceptance Criteria:
- Slack Block Kit template includes short hashes for last-green SHA and current failing SHA
- "View Diff" button deep-links to GitHub/GitLab
- If the previous green was more than 24 hours ago, a warning icon is rendered
- Footer exposes throttle state and cumulative suppressed count
- Message carries three interactive buttons: ack, silence 30m, escalate now
US-005: As an admin, I want to manage on-call rotation and routing rules through a UI, so that I do not need to edit JSON files manually for every schedule change.
Acceptance Criteria:
- Weekly rotation editor supports drag-and-drop for people and time slots
- Changes simulate the next 24 hours of paging in dry-run mode before save
- Save blocks conflicting rules with validation errors
- Audit log captures the diff against the previous saved state
- Supports holiday overrides (country calendar defaults plus custom dates)
US-006: As a tech lead, I want a weekly summary of deploy success rate and MTTR, so that I can discuss operational health in team retros with data.
Acceptance Criteria:
- Auto-posts summary message to Slack every Monday 9 AM
- Summary lists success rate, median MTTR, and the top five flaky pipelines
- Export button emits a CSV
- Week-over-week deltas are color-coded
- Message carries a deep link to the dashboard scoped to that week
- Report cadence shifts earlier automatically on holiday weeks
US-007: As a viewer role team member, I want to browse the dashboard without modifying routing rules, so that I can stay informed without risk of accidental configuration changes.
Acceptance Criteria:
- Routing and alert-template editors are read-only for viewer role
- Operator role can edit, admin role can manage roles and audit log
- Out-of-scope actions return HTTP 403 with an explicit UI banner
- Role changes only from the admin page
- Role-grant requests are routed to admins as in-app notifications
US-008: As a release manager, I want to filter deploy events within a specific release window and export success rate as CSV, so that I can include production data in executive release reports.
Acceptance Criteria:
- Dashboard supports a custom time-range filter
- Summary row aggregates success rate and failure counts by environment and service
- CSV export is gated to operator+ and logged in the audit trail
- Filename encodes org, time range, and generation run
- Large exports are queued asynchronously and deliver a Slack DM on completion
US-009: As an on-call SRE, I want throttled suppression counts to be visible when the next real alert finally arrives, so that I understand the scale of the noise that was held back.
Acceptance Criteria:
- The first message after suppression release shows the suppressed count in its footer
- Suppression period and release timestamp are both shown
- Changing throttle settings re-evaluates existing suppression state immediately
- Pipeline detail page renders cumulative suppression history as a timeline
- The list of suppressed source messages is paginated in suppression detail
US-010: As a responder, I want incident post-mortem summaries generated automatically, so that I can focus on writing what went wrong instead of compiling raw timestamps.
Acceptance Criteria:
- Acknowledged critical events auto-generate an "incident summary" page
- Timeline covers exact timestamps for detect, page, ack, and resolved
- Relevant deploy diffs and author/reviewer are inlined
- Suppressed alert count and re-occurrence graph are included
- Summary pages are shareable via link scoped to the org
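US-003 asks for filter state to be reflected in the URL so that views can be bookmarked and shared. The following is a minimal sketch of one way to round-trip a multi-select filter through a query string. The `DashboardFilter` shape and the `service` / `env` parameter names are assumptions made for illustration, not part of the spec.

```typescript
// Hypothetical filter shape; the PRD only states that service and
// environment are multi-select and must be reflected in the URL.
interface DashboardFilter {
  services: string[];
  environments: string[];
}

// Serialize the filter into a stable query string (values sorted so that
// the same selection always produces the same shareable URL).
function filterToQuery(filter: DashboardFilter): string {
  const parts: string[] = [];
  for (const service of [...filter.services].sort()) {
    parts.push(`service=${encodeURIComponent(service)}`);
  }
  for (const env of [...filter.environments].sort()) {
    parts.push(`env=${encodeURIComponent(env)}`);
  }
  return parts.join("&");
}

// Parse the query string back into a filter; unknown keys are ignored.
function queryToFilter(query: string): DashboardFilter {
  const filter: DashboardFilter = { services: [], environments: [] };
  for (const pair of query.split("&")) {
    const idx = pair.indexOf("=");
    if (idx < 0) continue;
    const key = pair.slice(0, idx);
    const value = decodeURIComponent(pair.slice(idx + 1));
    if (key === "service") filter.services.push(value);
    else if (key === "env") filter.environments.push(value);
  }
  return filter;
}
```

Sorting before serializing is a design choice in this sketch: it keeps bookmarked URLs identical regardless of the order in which the user clicked the filter chips.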
Core Feature Specs
F1. Webhook event ingestion & classification
Purpose: Ingest GitHub Actions / GitLab CI webhooks and auto-classify build/test/deploy events by environment (staging/prod)
Behavior:
- Verify the webhook signature before parsing the payload
- Resolve the service name from repository, workflow name, and branch pattern
- Normalize event type to build/test/deploy and append to the timeline
- Merge re-deliveries via an idempotency key
- Default severity is info; prod deploy failures auto-promote to critical
- Payloads that fail environment classification are routed to a dead-letter queue with an admin alert
- First-seen repositories land in an admin-approval quarantine
Priority: MUST
Success signal: median webhook-to-dashboard lag under 3 seconds, classification accuracy above 98%
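The classification and idempotency behavior in F1 can be sketched as follows. This is an illustration under stated assumptions, not the shipped implementation: the `RawEvent` fields, the rule that the workflow name carries the stage, and the `main` / `staging` branch convention are all hypothetical. Only the default-info and prod-deploy-failure-is-critical rules come from the spec.

```typescript
type EventType = "build" | "test" | "deploy";
type Environment = "staging" | "prod";
type Severity = "info" | "warn" | "critical";

// Hypothetical normalized input; real GitHub/GitLab payloads differ.
interface RawEvent {
  deliveryId: string; // assumed unique per webhook delivery
  repository: string;
  workflow: string;
  branch: string;
  status: "success" | "failure";
}

interface ClassifiedEvent {
  idempotencyKey: string;
  service: string;
  type: EventType;
  environment: Environment;
  severity: Severity;
}

// Returns null when classification fails; per F1 the caller would route
// such payloads to the dead-letter queue.
function classifyEvent(raw: RawEvent): ClassifiedEvent | null {
  const name = raw.workflow.toLowerCase();
  const type: EventType | null = name.includes("deploy") ? "deploy"
    : name.includes("test") ? "test"
    : name.includes("build") ? "build"
    : null;
  // Assumed branch convention: main -> prod, staging -> staging.
  const environment: Environment | null =
    raw.branch === "main" ? "prod" : raw.branch === "staging" ? "staging" : null;
  if (type === null || environment === null) return null;

  const failedProdDeploy =
    raw.status === "failure" && type === "deploy" && environment === "prod";
  return {
    idempotencyKey: raw.deliveryId,
    service: raw.repository,
    type,
    environment,
    // Default severity is info; prod deploy failures promote to critical.
    severity: failedProdDeploy ? "critical" : "info",
  };
}

// In-memory stand-in for the events table: a re-delivery with a known
// idempotency key is merged (ignored) instead of appended twice.
class Timeline {
  private readonly events = new Map<string, ClassifiedEvent>();

  append(event: ClassifiedEvent): boolean {
    if (this.events.has(event.idempotencyKey)) return false;
    this.events.set(event.idempotencyKey, event);
    return true;
  }

  get size(): number {
    return this.events.size;
  }
}
```

In a Postgres-backed version the same effect would typically come from a unique constraint on the idempotency key rather than an in-memory map.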
F2. Severity-based throttle & escalation engine
Purpose: Damp repeat alert floods and route unacknowledged pages to a secondary responder
Behavior:
- Same pipeline + severity within 10 minutes only bumps the counter — no new alert
- Critical pages without ack in 15 minutes escalate to the secondary
- Throttle window and escalation time are overridable per service
- Escalation events link as replies to the original alert thread to preserve history
- Per-responder unack rates are collected as metrics
- Changing throttle settings re-evaluates existing suppression state immediately
- Manual "silence" button pauses alerts during a planned maintenance window
Priority: MUST
Success signal: repeat-alert suppression above 80%, escalation rate below 5%
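The throttle rule in F2 (same pipeline and severity within 10 minutes only bumps a counter) can be sketched as a small keyed window. This is a simplified in-memory illustration; the class and field names are hypothetical, and the choice of a fixed window that opens at the first alert, rather than a sliding one, is an assumption the spec does not settle.

```typescript
type Severity = "info" | "warn" | "critical";

interface ThrottleState {
  windowStartMs: number; // when the current window opened
  suppressed: number;    // repeats held back during this window
}

interface ThrottleDecision {
  send: boolean;
  // On a suppressed event: running count in the window.
  // On a sent event: how many repeats the previous window held back.
  suppressedCount: number;
}

class ThrottleEngine {
  private readonly state = new Map<string, ThrottleState>();
  private readonly windowMs: number;

  constructor(windowMs: number = 10 * 60 * 1000) {
    this.windowMs = windowMs;
  }

  check(pipeline: string, severity: Severity, nowMs: number): ThrottleDecision {
    const key = `${pipeline}:${severity}`;
    const current = this.state.get(key);

    // Inside the window: suppress and bump the counter.
    if (current !== undefined && nowMs - current.windowStartMs < this.windowMs) {
      current.suppressed += 1;
      return { send: false, suppressedCount: current.suppressed };
    }

    // Window expired (or first event): send, and surface what was held back
    // so the message footer can show the suppressed count.
    const carried = current?.suppressed ?? 0;
    this.state.set(key, { windowStartMs: nowMs, suppressed: 0 });
    return { send: true, suppressedCount: carried };
  }
}
```

A per-service override, as the spec requires, could be layered on by looking up `windowMs` per key instead of storing a single value on the engine.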
F3. Slack alert template & deploy-diff context
Purpose: Provide Block Kit message templates by environment and severity plus immediate deploy-diff context
Behavior:
- Block Kit templates keyed by environment (staging/prod) × severity (info/warn/critical)
- Messages carry last-green SHA, current failing SHA, and a diff link
- Responder mentions are dynamically resolved from the current rotation
- Message footer shows throttle state and a dashboard deep link
- Template edits go through preview-before-save in the admin UI
- Templates are versioned and can be rolled back
- Both ko/en templates are maintained; channel settings pick the language
Priority: MUST
Success signal: "View Diff" CTR above 40% in the alert message
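A possible shape for the alert message described in F3 and US-004 is sketched below. The block types (`section`, `actions`, `context`) and the `button` element are standard Slack Block Kit; the `AlertContext` fields, the `action_id` values, and the message wording are hypothetical.

```typescript
type Environment = "staging" | "prod";
type Severity = "info" | "warn" | "critical";

// Hypothetical input assembled by the routing pipeline.
interface AlertContext {
  service: string;
  environment: Environment;
  severity: Severity;
  lastGreenSha: string;
  failingSha: string;
  diffUrl: string;
  dashboardUrl: string;
  suppressedCount: number;
}

interface TextObject { type: "mrkdwn" | "plain_text"; text: string; }
interface ButtonElement { type: "button"; text: TextObject; action_id: string; url?: string; }
type Block =
  | { type: "section"; text: TextObject }
  | { type: "actions"; elements: ButtonElement[] }
  | { type: "context"; elements: TextObject[] };

function shortSha(sha: string): string {
  return sha.slice(0, 7);
}

function button(label: string, actionId: string, url?: string): ButtonElement {
  const base: ButtonElement = {
    type: "button",
    text: { type: "plain_text", text: label },
    action_id: actionId,
  };
  return url === undefined ? base : { ...base, url };
}

function renderAlertBlocks(ctx: AlertContext): Block[] {
  const headline =
    `*[${ctx.environment}/${ctx.severity}]* ${ctx.service} pipeline failed\n` +
    `Last green: \`${shortSha(ctx.lastGreenSha)}\` | Failing: \`${shortSha(ctx.failingSha)}\``;
  return [
    { type: "section", text: { type: "mrkdwn", text: headline } },
    {
      type: "actions",
      elements: [
        button("View Diff", "view_diff", ctx.diffUrl),
        button("Ack", "ack"),
        button("Silence 30m", "silence_30m"),
        button("Escalate now", "escalate_now"),
      ],
    },
    {
      // Footer: throttle state plus a dashboard deep link.
      type: "context",
      elements: [{
        type: "mrkdwn",
        text: `Suppressed repeats: ${ctx.suppressedCount} | <${ctx.dashboardUrl}|Open dashboard>`,
      }],
    },
  ];
}
```

The returned array would be passed as the `blocks` field of a Slack message. Keying six such renderers by environment and severity is one way to realize the template matrix in the spec.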
F4. Role-based dashboard & weekly report
Purpose: Provide viewer/operator/admin access control and weekly aggregate reports for operational transparency
Behavior:
- Middleware enforces role-based route access and toggles UI element visibility
- Dashboard filters and bookmarks reflect into the URL query
- Weekly aggregate reports auto-post to Slack and email
- CSV export is gated to operator+ roles
- Every state change is recorded in the audit log
- Role changes require a 2FA re-auth on the admin page
- Weekly report highlights the top five flaky pipelines
Priority: SHOULD
Success signal: weekly report open rate above 70% per team
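The three-tier role check behind F4 and US-007 can be illustrated with a rank comparison. The action names and the mapping from action to minimum role are assumptions drawn loosely from the acceptance criteria (viewer reads, operator edits and exports, admin manages roles); the real middleware may differ.

```typescript
type Role = "viewer" | "operator" | "admin";

// Roles are ordered, so "operator+" means rank >= operator.
const ROLE_RANK: Record<Role, number> = { viewer: 0, operator: 1, admin: 2 };

// Hypothetical action catalogue for illustration.
type Action = "view_dashboard" | "edit_routing" | "export_csv" | "manage_roles";

const MINIMUM_ROLE: Record<Action, Role> = {
  view_dashboard: "viewer",
  edit_routing: "operator",
  export_csv: "operator",
  manage_roles: "admin",
};

// Returns the HTTP status the middleware would respond with.
function authorize(role: Role, action: Action): 200 | 403 {
  return ROLE_RANK[role] >= ROLE_RANK[MINIMUM_ROLE[action]] ? 200 : 403;
}
```

For example, `authorize("viewer", "edit_routing")` yields 403, which the UI would surface as the explicit banner described in US-007.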
F5. Audit log & change history viewer
Purpose: Persist routing/alert-template/rotation changes as diffs so admins can look them up retrospectively
Behavior:
- Record every admin/operator write with author, timestamp, and diff
- Admin page filters by time range, actor, and target object
- Each record exposes a "restore to this point" button (with preview)
- Records older than 90 days move to cold storage for cost efficiency
- Supports JSON-L export to external SIEM systems
- Saved search presets manage filter combinations
- Major events (role grants, rotation changes) auto-notify the org owner
Priority: COULD
Success signal: audit lookup usage of 5+ queries per month in incident retros
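To make "persist changes as diffs" concrete, here is a minimal sketch of a field-level diff recorded with author and timestamp, as F5 requires. It assumes flat configuration objects with primitive values; nested rules would need a recursive diff. All type and function names are hypothetical.

```typescript
type ConfigValue = string | number | boolean;
type Config = Record<string, ConfigValue>;

interface FieldChange {
  field: string;
  before: ConfigValue | undefined; // undefined = field was added
  after: ConfigValue | undefined;  // undefined = field was removed
}

interface AuditRecord {
  actor: string;
  timestamp: string; // ISO 8601
  target: string;    // e.g. a routing rule or template identifier
  changes: FieldChange[];
}

// Shallow diff over the union of keys, in sorted order for stable output.
function diffConfig(before: Config, after: Config): FieldChange[] {
  const fields = Array.from(
    new Set(Object.keys(before).concat(Object.keys(after))),
  ).sort();
  const changes: FieldChange[] = [];
  for (const field of fields) {
    if (before[field] !== after[field]) {
      changes.push({ field, before: before[field], after: after[field] });
    }
  }
  return changes;
}

function buildAuditRecord(
  actor: string, target: string, before: Config, after: Config, now: Date,
): AuditRecord {
  return {
    actor,
    target,
    timestamp: now.toISOString(),
    changes: diffConfig(before, after),
  };
}
```

Storing `before` alongside `after` is what would make a "restore to this point" preview possible without replaying the full history.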
Success Metrics
Each metric is aggregated in real time on the internal dashboard and exported automatically into the weekly retro packet.
- Median MTTR: from prod critical event to first ack. Target under 5 minutes; current baseline around 12 minutes → 58% reduction.
- Acknowledge rate: share of critical pages acknowledged within 15 minutes. Target above 95%; baseline 74%.
- Alert noise suppression rate (per 10 min): share of duplicate alerts damped by throttle. Target above 80%.
- Incident-to-Slack lag: webhook receipt to Slack message dispatch. Target p95 under 3 seconds.
- Weekly false-positive alerts: critical alerts with no resulting action. Target under 5 per week.
- Dashboard TTI: first-interactive time on dashboard entry within 2 seconds, p95.
- Deploy success visibility coverage: 95%+ of services auto-classified and surfaced on the dashboard.
- Onboarding time for a new service: webhook wired, first alert routed correctly within 30 minutes.
- Retro coverage: post-incident summary pages are referenced in 90%+ of weekly retros.
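The median MTTR metric and the quoted 58% reduction can be checked with a short calculation. The sketch below assumes each incident carries a detect and a first-ack timestamp in epoch milliseconds; the `Incident` shape is hypothetical.

```typescript
interface Incident {
  detectedAtMs: number; // prod critical event received
  ackedAtMs: number;    // first human acknowledgement
}

// Median time from detect to first ack, in minutes.
function medianMttrMinutes(incidents: Incident[]): number {
  if (incidents.length === 0) {
    throw new Error("median is undefined for an empty incident list");
  }
  const minutes = incidents
    .map((i) => (i.ackedAtMs - i.detectedAtMs) / 60000)
    .sort((a, b) => a - b);
  const mid = Math.floor(minutes.length / 2);
  return minutes.length % 2 === 1
    ? minutes[mid]
    : (minutes[mid - 1] + minutes[mid]) / 2;
}

// Relative reduction from a baseline, as a percentage.
function percentReduction(baseline: number, current: number): number {
  return ((baseline - current) / baseline) * 100;
}
```

With the figures in this document, `percentReduction(12, 5)` is about 58.3, matching the 58% target reduction, and `percentReduction(12, 4.5)` is 62.5 for the 4 min 30 s median reported after one month.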
Non-Functional Requirements
- Performance: webhook-to-dashboard lag p95 under 3 seconds; Slack dispatch p95 under 5 seconds
- Reliability: idempotency keys collapse duplicate webhooks, so at-least-once delivery results in effectively exactly-once processing
- Authorization: three-tier role model (viewer/operator/admin); all writes require operator+ and emit an audit log
- Audit: routing rules and alert templates are diff-persisted in the audit log, queryable for 90 days
- Observability: internal Prometheus metrics cover throttle backlogs, escalation rates, and Slack dispatch latency
- Localization: ko/en UI ship together; per-channel language settings drive alert copy
- Security: mandatory HMAC-SHA256 webhook signature verification; admin tokens rotate every 90 days per org
- Disaster recovery: daily DB snapshot plus transaction log retention; RPO 15 min, RTO 2 hours
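The HMAC-SHA256 signature requirement can be sketched with Node's built-in `crypto` module. The `sha256=<hex>` header format follows GitHub's `X-Hub-Signature-256` convention; whether OpsConsole normalizes GitLab deliveries to the same scheme is not stated in the source, so treat that part as an assumption. The function names are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute the expected header value for a raw request body.
function signPayload(secret: string, rawBody: string): string {
  const digest = createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
  return `sha256=${digest}`;
}

// Verify before parsing: the comparison must run on the raw body bytes,
// not on re-serialized JSON, and in constant time.
function verifySignature(
  secret: string, rawBody: string, signatureHeader: string | undefined,
): boolean {
  if (signatureHeader === undefined) return false;
  const expected = Buffer.from(signPayload(secret, rawBody), "utf8");
  const received = Buffer.from(signatureHeader, "utf8");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  if (expected.length !== received.length) return false;
  return timingSafeEqual(expected, received);
}
```

A handler would call `verifySignature` with the shared secret and reject the request (for example with HTTP 401) before any payload parsing when it returns false.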
On-Call Flow & Alert Routing
OpsConsole's on-call and routing pipeline runs through four discrete stages: severity classification → throttle window → rotation match → escalation. Every stage is independently configurable, and edits are recorded as diffs in the audit log so retros can trace back changes over time.
- Severity buckets:
  - info: successful builds, staging failures, feature flag changes
  - warn: prod test failures, same error three or more times in a row, slow-query threshold breach
  - critical: prod deploy failures, health-check failures, manual critical flag, connection-pool exhaustion
- Throttle window: same pipeline + severity within 10 minutes is suppressed and the counter is incremented; the first post-suppression message exposes the suppressed count so the responder can size the silence.
- Time-aware rotation: business hours (09:00-18:00 KST) go to the primary tech lead, nights and weekends to the primary SRE on-call. Each service can override the rotation, and holidays are handled via admin override.
- Escalation rules: critical pages without ack in 15 minutes auto-route to the secondary; 30 minutes broadcasts to the admin group; 60 minutes falls back to an SMS to the org owner.
- Slack Block Kit templates: six key combinations of environment (staging/prod) × severity (info/warn/critical) — edited with preview-before-save, versioned, rollback-safe.
- Event flow: webhook ingest → signature verify → severity classify → throttle check → rotation resolve → Block Kit render → Slack dispatch → ack wait → escalation if no ack. Each stage logs independently for post-incident forensics.
- Ack paths: interactive Slack button, dashboard ack button, and REST API all accepted. Whichever path resolves, the event state is propagated everywhere.
- Safety net: if the escalation chain finishes without ack, a banner goes up on the dashboard and hand-off to the next rotation happens automatically.
- Post-incident review: acknowledged events auto-generate an "incident summary" page that colocates the timeline, related deploy diffs, and suppressed alert counts.
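The escalation ladder above (15 minutes to the secondary, 30 to the admin group, 60 to an SMS for the org owner) reduces to a threshold lookup. The sketch below uses those defaults; the policy object and target names are hypothetical, and a per-service override would simply pass a different policy.

```typescript
type EscalationTarget = "primary" | "secondary" | "admin_group" | "org_owner_sms";

interface EscalationPolicy {
  secondaryAfterMin: number;
  adminGroupAfterMin: number;
  orgOwnerAfterMin: number;
}

// Defaults taken from the escalation rules in this document.
const DEFAULT_ESCALATION: EscalationPolicy = {
  secondaryAfterMin: 15,
  adminGroupAfterMin: 30,
  orgOwnerAfterMin: 60,
};

// Who should currently hold the page. Returns null once the page is acked,
// because an ack from any path stops the chain.
function escalationTarget(
  minutesSincePage: number,
  acked: boolean,
  policy: EscalationPolicy = DEFAULT_ESCALATION,
): EscalationTarget | null {
  if (acked) return null;
  if (minutesSincePage >= policy.orgOwnerAfterMin) return "org_owner_sms";
  if (minutesSincePage >= policy.adminGroupAfterMin) return "admin_group";
  if (minutesSincePage >= policy.secondaryAfterMin) return "secondary";
  return "primary";
}
```

Evaluating the target from elapsed time, rather than scheduling three separate timers, keeps the logic stateless; a periodic job could call it for every open page.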
Scope Boundaries
V1 excludes the following.
- ML-based anomaly detection — rule-based severity ships first; ML is post-PMF
- Automated rollback actions — paging only; rollback stays a human decision
- Customer-facing status page — internal focus only; external status is delegated to a dedicated SaaS (StatusPage)
- Infra cost and resource analysis — pipeline events only, not cost metrics
- Long-horizon log aggregation — beyond 90 days delegated to an existing log platform
- Kubernetes and runtime health checks — pipeline events only, runtime state lives in the existing stack
- SSO/SAML integration — MVP uses Clerk's default auth; enterprise SSO waits until post-PMF
- Mobile app — responsive web is enough for on-call; native apps wait for demand
- Multi-cloud unification — GCP/AWS/Azure routing consolidation stays out of scope
Tech Stack & Architecture
- Frontend: Next.js + React, Tailwind CSS, shadcn/ui components. Dashboard is SSR with SSE for real-time refresh.
- API: tRPC end-to-end for type-safe routing, rule, and audit APIs.
- DB: PostgreSQL (Prisma ORM). Events table uses a composite index on (timestamp, service).
- Queue: Webhook ingestion runs on a Vercel Edge Function plus Postgres LISTEN/NOTIFY; there is no external MQ, and queueing scales inside Postgres.
- External: Slack Web API (Block Kit), PagerDuty (optional, escalation tier), GitHub/GitLab webhooks.
- Deployment: Vercel app + Supabase DB; Vercel Cron drives weekly reports.
- Observability: OpenTelemetry captures tRPC and webhook spans, exported to Grafana Cloud.
The constraining assumption is the "internal tool" positioning — a single org is assumed, so multi-tenancy is an org-ID partition rather than full isolation. Offering this as an external SaaS would require a separate architecture pass. Incident automation itself (auto-rollback, auto-recovery playbooks) is explicitly out of scope for this phase; OpsConsole concentrates on the three axes of observe, route, and record.
Task Tree
Sprint 1 — Ingestion & Observability
Pipeline event ingestion and base dashboard
[TASK-001] Pipeline event ingestion endpoint
DONE (complexity: 3, urgency: 5, importance: MUST)
Build webhook endpoint to receive GitHub Actions / GitLab CI events, classify by type and environment, and persist.
- ST-001-1: Webhook receiver route (DONE)
- ST-001-2: Event classification logic (DONE)
[TASK-002] Deployment status dashboard (base view)
IN_PROGRESS (complexity: 4, urgency: 4, importance: MUST)
Build real-time dashboard to visualize pipeline events by stage and environment.
- ST-002-1: Pipeline list component (DONE)
- ST-002-2: Real-time polling or SSE (IN_PROGRESS)
Sprint 2 — On-call & Alerts
On-call routing rule engine and Slack alert templates
[TASK-003] On-call routing rule engine
BACKLOG (complexity: 5, urgency: 3, importance: MUST)
Build a rule engine that maps service/time-zone/severity to on-call assignees using a priority-ordered JSON rule table. Higher-priority rules are evaluated first; first match wins on conflict.
[TASK-004] Slack notification template engine
BACKLOG (complexity: 4, urgency: 3, importance: SHOULD)
Build an engine that resolves Slack Block Kit templates keyed by environment and severity, integrated with throttle logic to suppress duplicate alerts.
- ST-004-1: Block Kit template definitions (BACKLOG)
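The rule engine described in TASK-003 (priority-ordered rule table, first match wins) could look roughly like the sketch below. The rule fields, the convention that a larger `priority` number is evaluated first, and the use of a local hour range for the time condition are all assumptions; the task is still in the backlog and the source does not define the rule schema.

```typescript
type Severity = "info" | "warn" | "critical";

// Hypothetical rule shape; an omitted condition matches anything.
interface RoutingRule {
  priority: number; // assumption: larger number = evaluated earlier
  service?: string;
  severity?: Severity;
  hours?: { start: number; end: number }; // local hours, [start, end)
  assignee: string;
}

interface RoutingQuery {
  service: string;
  severity: Severity;
  hour: number; // local hour 0-23 at the time of the event
}

function matches(rule: RoutingRule, query: RoutingQuery): boolean {
  if (rule.service !== undefined && rule.service !== query.service) return false;
  if (rule.severity !== undefined && rule.severity !== query.severity) return false;
  if (rule.hours !== undefined &&
      (query.hour < rule.hours.start || query.hour >= rule.hours.end)) return false;
  return true;
}

// Evaluate rules from highest to lowest priority; the first match wins.
function resolveAssignee(rules: RoutingRule[], query: RoutingQuery): string | null {
  const ordered = [...rules].sort((a, b) => b.priority - a.priority);
  for (const rule of ordered) {
    if (matches(rule, query)) return rule.assignee;
  }
  return null;
}

// Example table mirroring the time-aware rotation in this document:
// business hours go to the tech lead, everything else to the SRE on-call,
// and one service overrides both.
const EXAMPLE_RULES: RoutingRule[] = [
  { priority: 0, assignee: "sre-oncall" },
  { priority: 10, hours: { start: 9, end: 18 }, assignee: "tech-lead" },
  { priority: 20, service: "billing", assignee: "billing-oncall" },
];
```

This sketch ignores weekends and holiday overrides, which the on-call flow also requires; those would be additional conditions on the rule or a pre-pass that rewrites the query.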