PRD

Section Layout ↔ Industry-Standard PRD Template

Industry-Standard PRD Section	This Fixture's Section	Depth Notes
Problem Statement	Problem Definition	Slack channel saturation; PagerDuty lacks unified context view
Success Metrics	Success Metrics	Median MTTR 12 min to 4 min 30 s
Non-Functional Requirements	Non-Functional Requirements	viewer/operator/admin (3 roles); dashboard p95 under 3 s
(Domain-specific)	On-Call Flow	severity, throttle (10 min), rotation, escalation (15 min)

Project Overview

OpsConsole is an internal operations tool that helps 5-20 engineer teams observe deploy pipeline events on a single dashboard and route alerts to the right on-call responder without drowning in Slack noise. As CI/CD runs grow to hundreds per day, shared Slack channels saturate with system logs and real outage signals get buried — OpsConsole targets that failure mode head-on. The product goal is a sub-5-minute MTTR from detect to first human ack; to hit that, severity classification, throttle windows, and escalation policies are first-class primitives, not afterthoughts. Vooster was used to structure the requirements and auto-distribute sprint tasks so that a single SRE plus a reviewer can ship and maintain the tool. It closes two gaps — pipeline visibility and alert fidelity — without introducing a separate observability SaaS (Datadog, NewRelic). The Vooster-generated PRD and task tree were fed directly into Claude Code, shipping webhook ingestion, the throttle engine, and the Slack template service in 2-3 day cycles each. After one month in production, median MTTR dropped from 12 minutes to 4 minutes 30 seconds, making OpsConsole the "fastest ROI internal tool" on the team's list.

Problem Statement

Growth-stage startup ops teams cannot separate signal from noise when deploy events spike, and the incident response lag grows accordingly. The incumbent options fall short in four distinct ways. First, raw Slack webhook integrations mix successful builds with failures, so severity is indistinguishable at a glance. Second, stand-alone PagerDuty can page the on-call but cannot show "what broke, in which environment, and how many times in a row" in a single dashboard view. Third, log viewers like Grafana or Loki lack a semantic classification of pipeline events (build/test/deploy), forcing each team to rebuild a dashboard from scratch. Fourth, Slack App workflows on their own cannot accumulate or query state such as throttle and escalation, which makes post-incident retros impossible. In practice, real outage notifications arrive 7-10 minutes late on average, and alert floods from repeat errors train responders to tune out notifications entirely. OpsConsole fuses "semantic event classification + severity-based throttling + time-aware routing" into a single workflow that unifies the dashboard with paging. The guiding principle is that an on-call engineer should be able to answer "what is broken right now, and who does it belong to" in under five seconds. On top of that, every alert keeps a permanent history so that the following week's retro can examine the bottlenecks.

Target Personas

Persona A — On-call SRE / Platform Engineer (alias: Hyunwoo Jung)

Role: Platform team engineer at a 10-15 engineer org, weekly night on-call rotation
Context: Must respond instantly to Slack alerts outside business hours and decide severity within the first five minutes
Daily Pain:

Wastes time hand-filtering prod failures out of a webhook stream mixed with staging noise
Builds alert fatigue from repeat errors and starts ignoring new, important signals
Must open GitHub/GitLab tabs repeatedly to see the diff, commit, and author for a given failure
Scrolls Slack endlessly during post-mortems to reconstruct "what happened at which minute"
Cannot trace who handed off to whom because the tooling does not record handoffs
Usage Context:
A 2 AM Slack DM should land on a dashboard link that shows pipeline state, logs, and deploy diff in one view
First five minutes of an incident should be enough to decide whether to roll back
Shares the auto-generated incident summary page the next morning for the team retro
Current alternatives: Slack webhooks + PagerDuty starter plan, homegrown log scrapers
Adoption trigger: Last quarter's 22-minute MTTR outage from a missed notification, called out in the retro
Core needs: severity-based throttle + environment color coding + a unified view of recent deploy diffs

Persona B — Deploy-approving Tech Lead (alias: Sejung Yoon)

Role: Product team tech lead, reviews 40-70 PRs per week and approves prod deploys
Context: Watches success/failure signals for the ten minutes after a deploy and moves on
Daily Pain:

Opens multiple windows to check pipeline status across several services
Discovers teammate deploy failures first and has to DM the author manually
Approval workflow and monitoring are separate, so some approved deploys drift out of sight
Hand-computes weekly deploy success rate for the release notes
Re-explains "where everything runs" verbally every time a new hire onboards
Usage Context:
Filters currently running pipelines on the dashboard and auto-mentions the author on failure
Pulls the week's success rate and MTTR for retros
Shares the dashboard link with new hires as an instant service map
Current alternatives: GitHub Actions dashboard, personal Slack DM channel
Adoption trigger: Service count crosses five and tab-switching cost exceeds the threshold
Core needs: per-service filters, role-based access, weekly reports

Persona C — QA / Release Manager (alias: Soomin Han)

Role: Release manager at a mid-market B2B SaaS team, owns the twice-weekly prod release cadence
Context: Watches pipeline state around the release window and coordinates hotfix flows on failure
Daily Pain:

Collects post-release success/failure context from Slack, PR comments, and dashboards by hand
Copies weekly deploy success rate into an exec-facing slide manually
Struggles to distinguish staging pre-check failures from prod failures in Slack
Usage Context:
Applies a 24-hour window filter around the release to snapshot every pipeline in scope
Exports the weekly report as CSV to roll up quarterly retro decks
Current alternatives: Spreadsheet-based manual tracking plus GitHub Actions tabs
Adoption trigger: Release cadence grows and manual tracking becomes the bottleneck
Core needs: time-range filters, CSV export, success-rate breakdown by environment

User Stories

US-001: As an on-call SRE, I want to see only critical-severity pipeline failures in my paging feed, so that I can focus on real incidents without alert fatigue.
Acceptance Criteria:
- Critical events only in the default view; info/warn are collapsed by default
- Repeat events within 10 min on the same pipeline are throttled to a counter
- Each critical card has a colored environment badge (prod/staging)
- Tapping the card jumps to the last three deploy diffs and author info
- Empty state explicitly signals "all clear" when no critical events exist
US-002: As an SRE, I want unacknowledged pages to escalate to a secondary on-call after 15 minutes, so that a single missed notification does not prolong downtime.
Acceptance Criteria:
- Pages escalate to secondary after 15 minutes without ack
- Escalation events link back to the original alert thread as replies
- Escalation rules are configurable per service
- Audit log records timestamp, original responder, and final responder
- After 30 minutes broadcast to the admin group; after 60 minutes page the org owner
US-003: As a tech lead, I want to filter pipeline events by service name and environment, so that I can scope my review to the services I own.
Acceptance Criteria:
- Top-bar filter allows multi-select of service and environment
- Filter state is reflected in the URL for bookmarking and sharing
- Empty filter results show an explicit empty state
- Five most recent filter combinations are available as quick-access chips
- Filters can be saved as personal or team-shared presets
US-004: As an on-call responder, I want the Slack alert message to include the last successful deploy's commit SHA, so that I can instantly compare what changed before rolling back.
Acceptance Criteria:
- Slack Block Kit template includes short hashes for last-green SHA and current failing SHA
- "View Diff" button deep-links to GitHub/GitLab
- If the previous green was more than 24 hours ago, a warning icon is rendered
- Footer exposes throttle state and cumulative suppressed count
- Message carries three interactive buttons: ack, silence 30m, escalate now
US-005: As an admin, I want to manage on-call rotation and routing rules through a UI, so that I do not need to edit JSON files manually for every schedule change.
Acceptance Criteria:
- Weekly rotation editor supports drag-and-drop for people and time slots
- Changes simulate the next 24 hours of paging in dry-run mode before save
- Save blocks conflicting rules with validation errors
- Audit log captures the diff against the previous saved state
- Supports holiday overrides (country calendar defaults plus custom dates)
US-006: As a tech lead, I want a weekly summary of deploy success rate and MTTR, so that I can discuss operational health in team retros with data.
Acceptance Criteria:
- Auto-posts summary message to Slack every Monday 9 AM
- Summary lists success rate, median MTTR, and the top five flaky pipelines
- Export button emits a CSV
- Week-over-week deltas are color-coded
- Message carries a deep link to the dashboard scoped to that week
- Report cadence shifts earlier automatically on holiday weeks
US-007: As a viewer role team member, I want to browse the dashboard without modifying routing rules, so that I can stay informed without risk of accidental configuration changes.
Acceptance Criteria:
- Routing and alert-template editors are read-only for viewer role
- Operator role can edit, admin role can manage roles and audit log
- Out-of-scope actions return HTTP 403 with an explicit UI banner
- Role changes only from the admin page
- Role-grant requests are routed to admins as in-app notifications
US-008: As a release manager, I want to filter deploy events within a specific release window and export success rate as CSV, so that I can include production data in executive release reports.
Acceptance Criteria:
- Dashboard supports a custom time-range filter
- Summary row aggregates success rate and failure counts by environment and service
- CSV export is gated to operator+ and logged in the audit trail
- Filename encodes org, time range, and generation run
- Large exports are queued asynchronously and deliver a Slack DM on completion
US-009: As an on-call SRE, I want throttled suppression counts to be visible when the next real alert finally arrives, so that I understand the scale of the noise that was held back.
Acceptance Criteria:
- The first message after suppression release shows the suppressed count in its footer
- Suppression period and release timestamp are both shown
- Changing throttle settings re-evaluates existing suppression state immediately
- Pipeline detail page renders cumulative suppression history as a timeline
- The list of suppressed source messages is paginated in suppression detail
US-010: As a responder, I want incident post-mortem summaries generated automatically, so that I can focus on writing what went wrong instead of compiling raw timestamps.
Acceptance Criteria:
- Acknowledged critical events auto-generate an "incident summary" page
- Timeline covers exact timestamps for detect, page, ack, and resolved
- Relevant deploy diffs and author/reviewer are inlined
- Suppressed alert count and re-occurrence graph are included
- Summary pages are shareable via link scoped to the org
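
US-003 asks that filter state be reflected in the URL so a view can be bookmarked and shared. A minimal sketch of that round trip, in Python for brevity: the parameter names `service` and `env` and both function names are assumptions, not part of the spec.

```python
from urllib.parse import parse_qs, urlencode


def filters_to_query(services, environments):
    """Serialize multi-select filters into a stable, shareable query string."""
    # Sorting keeps the URL identical regardless of click order.
    params = [("service", s) for s in sorted(services)]
    params += [("env", e) for e in sorted(environments)]
    return urlencode(params)


def query_to_filters(query):
    """Parse a query string back into filter state; missing keys mean 'no filter'."""
    parsed = parse_qs(query)
    return {"service": parsed.get("service", []), "env": parsed.get("env", [])}


query = filters_to_query(["payments", "auth"], ["prod"])
print(query)                    # service=auth&service=payments&env=prod
print(query_to_filters(query))  # {'service': ['auth', 'payments'], 'env': ['prod']}
```

Repeating the key per selected value keeps multi-select state readable in the address bar and lets an empty query fall back to the explicit empty-filter state.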

Core Feature Specs

F1. Webhook event ingestion & classification

Purpose: Ingest GitHub Actions / GitLab CI webhooks and auto-classify build/test/deploy events by environment (staging/prod)
Behavior:

Verify the webhook signature before parsing the payload
Resolve the service name from repository, workflow name, and branch pattern
Normalize event type to build/test/deploy and append to the timeline
Merge re-deliveries via an idempotency key
Default severity is info; prod deploy failures auto-promote to critical
Payloads that fail environment classification are routed to a dead-letter queue with an admin alert
First-seen repositories land in an admin-approval quarantine
Priority: MUST
Success signal: median webhook-to-dashboard lag under 3 seconds, classification accuracy above 98%
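
The classification and idempotency behavior above can be sketched as follows. This is an illustrative Python model, not the shipped implementation: `PipelineEvent`, `Timeline`, and the `delivery_id` field are hypothetical names, the in-memory set stands in for a unique constraint in the events table, and the `warn` rule for prod test failures is taken from the severity buckets in the on-call flow section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineEvent:
    delivery_id: str   # hypothetical idempotency key, e.g. a provider delivery ID
    service: str
    event_type: str    # "build", "test", or "deploy"
    environment: str   # "staging" or "prod"
    succeeded: bool


def classify_severity(event):
    """Default to info; promote failed prod deploys to critical."""
    failed_in_prod = event.environment == "prod" and not event.succeeded
    if failed_in_prod and event.event_type == "deploy":
        return "critical"
    if failed_in_prod and event.event_type == "test":
        return "warn"
    return "info"


class Timeline:
    """In-memory stand-in for the events table."""

    def __init__(self):
        self._seen = set()
        self.entries = []

    def append(self, event):
        """Append once per delivery_id; return False for a re-delivery."""
        if event.delivery_id in self._seen:
            return False
        self._seen.add(event.delivery_id)
        self.entries.append((event, classify_severity(event)))
        return True


timeline = Timeline()
failed = PipelineEvent("d-1", "payments", "deploy", "prod", succeeded=False)
print(timeline.append(failed))  # True
print(timeline.append(failed))  # False (re-delivery merged)
print(timeline.entries[0][1])   # critical
```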

F2. Severity-based throttle & escalation engine

Purpose: Damp repeat alert floods and route unacknowledged pages to a secondary responder
Behavior:

Same pipeline + severity within 10 minutes only bumps the counter — no new alert
Critical pages without ack in 15 minutes escalate to the secondary
Throttle window and escalation time are overridable per service
Escalation events link as replies to the original alert thread to preserve history
Per-responder unack rates are collected as metrics
Changing throttle settings re-evaluates existing suppression state immediately
Manual "silence" button pauses alerts during a planned maintenance window
Priority: MUST
Success signal: repeat-alert suppression above 80%, escalation rate below 5%
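
One way to read the throttle rule is as a fixed window keyed by (pipeline, severity): the first event alerts, repeats inside the window only increment a counter, and the next alert after the window reports how many were held back. The Python sketch below assumes that fixed-window interpretation (a sliding window anchored at the last event would also satisfy the text); the `Throttle` class and its method names are hypothetical.

```python
THROTTLE_WINDOW_S = 10 * 60  # default window; the spec allows per-service overrides


class Throttle:
    def __init__(self, window_s=THROTTLE_WINDOW_S):
        self.window_s = window_s
        # (pipeline, severity) -> [window_start, suppressed_count]
        self._state = {}

    def observe(self, pipeline, severity, now):
        """Return (should_alert, suppressed_count_released_with_this_alert)."""
        key = (pipeline, severity)
        state = self._state.get(key)
        if state is not None and now - state[0] < self.window_s:
            state[1] += 1          # inside the window: count, do not alert
            return False, 0
        released = state[1] if state else 0
        self._state[key] = [now, 0]  # open a fresh window
        return True, released


throttle = Throttle()
print(throttle.observe("api-deploy", "critical", now=0))    # (True, 0)
print(throttle.observe("api-deploy", "critical", now=60))   # (False, 0)
print(throttle.observe("api-deploy", "critical", now=300))  # (False, 0)
print(throttle.observe("api-deploy", "critical", now=600))  # (True, 2)
```

The second element of the tuple is what the first post-suppression message would surface in its footer.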

F3. Slack alert template & deploy-diff context

Purpose: Provide Block Kit message templates by environment and severity plus immediate deploy-diff context
Behavior:

Block Kit templates keyed by environment (staging/prod) × severity (info/warn/critical)
Messages carry last-green SHA, current failing SHA, and a diff link
Responder mentions are dynamically resolved from the current rotation
Message footer shows throttle state and a dashboard deep link
Template edits go through preview-before-save in the admin UI
Templates are versioned and can be rolled back
Both ko/en templates are maintained; channel settings pick the language
Priority: MUST
Success signal: "View Diff" CTR above 40% in the alert message
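
As a rough illustration of the message shape described above, the following Python sketch builds a Block Kit payload with a section, an actions block, and a context footer. The function name, the `action_id` values, and the copy are assumptions; only the block and element types (`section`, `actions`, `context`, `button`) come from Slack's Block Kit.

```python
def build_alert_blocks(service, environment, severity,
                       last_green_sha, failing_sha, diff_url, suppressed_count):
    """Return a list of Block Kit blocks for one alert message."""

    def button(label, action_id, **extra):
        return {"type": "button",
                "text": {"type": "plain_text", "text": label},
                "action_id": action_id, **extra}

    headline = (f"*[{environment}/{severity}]* `{service}` deploy failed\n"
                f"last green `{last_green_sha[:7]}` -> failing `{failing_sha[:7]}`")
    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": headline}},
        {"type": "actions", "elements": [
            button("View Diff", "view_diff", url=diff_url),
            button("Ack", "ack", style="primary"),
            button("Silence 30m", "silence_30m"),
            button("Escalate now", "escalate_now", style="danger"),
        ]},
        # Footer: throttle state and cumulative suppressed count.
        {"type": "context", "elements": [
            {"type": "mrkdwn", "text": f"throttle active, {suppressed_count} suppressed"},
        ]},
    ]


blocks = build_alert_blocks(
    "payments", "prod", "critical",
    "abc1234def5678", "9876543fedcba0",
    "https://example.com/diff", suppressed_count=4,
)
print(blocks[0]["text"]["text"])
```

Keeping one builder per message and varying only the inputs is one option; the spec's six templates keyed by environment × severity could equally be six stored JSON documents.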

F4. Role-based dashboard & weekly report

Purpose: Provide viewer/operator/admin access control and weekly aggregate reports for operational transparency
Behavior:

Middleware enforces role-based route access and toggles UI element visibility
Dashboard filters and bookmarks reflect into the URL query
Weekly aggregate reports auto-post to Slack and email
CSV export is gated to operator+ roles
Every state change is recorded in the audit log
Role changes require a 2FA re-auth on the admin page
Weekly report highlights the top five flaky pipelines
Priority: SHOULD
Success signal: weekly report open rate above 70% per team
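
The three-tier role check behind the middleware can be modeled as a rank comparison. The Python sketch below is illustrative only: the action names in `REQUIRED_ROLE` are hypothetical, chosen to mirror the behaviors listed above (CSV export gated to operator+, role management reserved for admins).

```python
ROLE_RANK = {"viewer": 0, "operator": 1, "admin": 2}

# Hypothetical action names mapped to the minimum role that may perform them.
REQUIRED_ROLE = {
    "view_dashboard": "viewer",
    "edit_routing": "operator",
    "export_csv": "operator",
    "manage_roles": "admin",
}


def authorize(role, action):
    """Return an HTTP-style status: 200 when allowed, 403 when out of scope."""
    required = REQUIRED_ROLE[action]
    return 200 if ROLE_RANK[role] >= ROLE_RANK[required] else 403


print(authorize("viewer", "view_dashboard"))  # 200
print(authorize("viewer", "edit_routing"))    # 403
print(authorize("operator", "export_csv"))    # 200
print(authorize("operator", "manage_roles"))  # 403
```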

F5. Audit log & change history viewer

Purpose: Persist routing/alert-template/rotation changes as diffs so admins can look them up retrospectively
Behavior:

Record every admin/operator write with author, timestamp, and diff
Admin page filters by time range, actor, and target object
Each record exposes a "restore to this point" button (with preview)
Records older than 90 days move to cold storage for cost efficiency
Supports JSON-L export to external SIEM systems
Saved search presets manage filter combinations
Major events (role grants, rotation changes) auto-notify the org owner
Priority: COULD
Success signal: audit lookup usage of 5+ queries per month in incident retros
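
A diff-persisting audit record and its JSON-L export might look like the Python sketch below. The record fields (`actor`, `target`, `timestamp`, `diff`) and the flat key-by-key diff are assumptions for illustration; a real implementation would likely diff nested structures as well.

```python
import json


def diff_config(before, after):
    """Key-by-key diff of two flat config dicts."""
    changes = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = {"from": old, "to": new}
    return changes


def audit_record(actor, target, before, after, timestamp):
    """One audit entry: who changed what, when, and how."""
    return {"actor": actor, "target": target, "timestamp": timestamp,
            "diff": diff_config(before, after)}


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


record = audit_record(
    actor="admin@example.com",
    target="routing/payments",
    before={"throttle_window_min": 10, "escalation_min": 15},
    after={"throttle_window_min": 5, "escalation_min": 15},
    timestamp="2024-01-08T01:00:00Z",
)
print(to_jsonl([record]))
```

Storing the `from` value alongside the `to` value is what makes a "restore to this point" preview possible without replaying the full history.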

Success Metrics

Each metric is aggregated in real time on the internal dashboard and auto-exports into the weekly retro packet.

Median MTTR: time from a prod critical event to first ack. Target under 5 minutes against a baseline of about 12 minutes, a reduction of roughly 58%.
Acknowledge rate: share of critical pages acknowledged within 15 minutes. Target above 95%; baseline 74%.
Alert noise suppression rate (per 10 min): share of duplicate alerts damped by throttle. Target above 80%.
Incident-to-Slack lag: webhook receipt to Slack message dispatch. Target p95 under 3 seconds.
Weekly false-positive alerts: critical alerts with no resulting action. Target under 5 per week.
Dashboard TTI: first-interactive time on dashboard entry within 2 seconds, p95.
Deploy success visibility coverage: 95%+ of services auto-classified and surfaced on the dashboard.
Onboarding time for a new service: webhook wired, first alert routed correctly within 30 minutes.
Retro coverage: post-incident summary pages are referenced in 90%+ of weekly retros.
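
The arithmetic behind the first two metrics can be checked with a short Python sketch. The function names and the `(detected_at, first_ack_at)` tuple shape are assumptions; timestamps are plain seconds and an unacknowledged page carries `None`.

```python
from statistics import median


def median_mttr_seconds(incidents):
    """Median detect-to-first-ack time over acknowledged incidents."""
    return median(ack - detected for detected, ack in incidents if ack is not None)


def ack_rate(incidents, within_s=15 * 60):
    """Share of pages acknowledged within the escalation window."""
    acked = sum(1 for d, a in incidents if a is not None and a - d <= within_s)
    return acked / len(incidents)


def reduction_pct(baseline, current):
    """Percentage reduction relative to the baseline."""
    return (baseline - current) / baseline * 100


incidents = [(0, 200), (0, 270), (0, 400), (0, None)]
print(median_mttr_seconds(incidents))           # 270
print(ack_rate(incidents))                      # 0.75
print(round(reduction_pct(12 * 60, 5 * 60), 1)) # 58.3 (baseline to target)
print(reduction_pct(12 * 60, 4 * 60 + 30))      # 62.5 (baseline to observed 4 min 30 s)
```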

Non-Functional Requirements

Performance: webhook-to-dashboard lag p95 under 3 seconds; Slack dispatch p95 under 5 seconds
Reliability: idempotency keys collapse duplicate webhooks, so at-least-once delivery yields effectively exactly-once processing
Authorization: three-tier role model (viewer/operator/admin); all writes require operator+ and emit an audit log
Audit: routing rules and alert templates are diff-persisted in the audit log, queryable for 90 days
Observability: internal Prometheus metrics cover throttle backlogs, escalation rates, and Slack dispatch latency
Localization: ko/en UI ship together; per-channel language settings drive alert copy
Security: mandatory HMAC-SHA256 webhook signature verification; admin tokens rotate every 90 days per org
Disaster recovery: daily DB snapshot plus transaction log retention; RPO 15 min, RTO 2 hours
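
The HMAC-SHA256 signature check in the security requirement can be sketched in Python with the standard library. The `sha256=<hex>` header format below follows GitHub's `X-Hub-Signature-256` convention; other providers sign differently, and the function names and secret value here are illustrative assumptions.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute the signature header value for a raw request body."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), header_value.encode())


secret = b"example-webhook-secret"  # placeholder; load from config in practice
body = b'{"action": "completed"}'
header = sign(secret, body)

print(verify_signature(secret, body, header))         # True
print(verify_signature(secret, body + b" ", header))  # False
```

The check has to run against the raw bytes of the request, before any JSON parsing, because re-serializing the payload can change whitespace and invalidate the signature.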

On-Call Flow & Alert Routing

OpsConsole's on-call and routing pipeline runs through four discrete stages: severity classification → throttle window → rotation match → escalation. Every stage is independently configurable, and edits are recorded as diffs in the audit log so retros can trace back changes over time.

Severity buckets:
- info: successful builds, staging failures, feature flag changes
- warn: prod test failures, same error three or more times in a row, slow-query threshold breach
- critical: prod deploy failures, health-check failures, manual critical flag, connection-pool exhaustion
Throttle window: same pipeline + severity within 10 minutes is suppressed and the counter is incremented; the first post-suppression message exposes the suppressed count so the responder can size the silence.
Time-aware rotation: business hours (09:00-18:00 KST) go to the primary tech lead, nights and weekends to the primary SRE on-call. Each service can override the rotation, and holidays are handled via admin override.
Escalation rules: critical pages without ack in 15 minutes auto-route to the secondary; 30 minutes broadcasts to the admin group; 60 minutes falls back to an SMS to the org owner.
Slack Block Kit templates: six key combinations of environment (staging/prod) × severity (info/warn/critical) — edited with preview-before-save, versioned, rollback-safe.
Event flow: webhook ingest → signature verify → severity classify → throttle check → rotation resolve → Block Kit render → Slack dispatch → ack wait → escalation if no ack. Each stage logs independently for post-incident forensics.
Ack paths: interactive Slack button, dashboard ack button, and REST API all accepted. Whichever path resolves, the event state is propagated everywhere.
Safety net: if the escalation chain finishes without ack, a banner goes up on the dashboard and hand-off to the next rotation happens automatically.
Post-incident review: acknowledged events auto-generate an "incident summary" page that colocates the timeline, related deploy diffs, and suppressed alert counts.
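
The time-aware rotation and the 15/30/60-minute escalation ladder can be expressed compactly. The Python sketch below is a simplified model under stated assumptions: the responder labels are hypothetical, per-service overrides and holiday handling are omitted, and the ladder is cumulative (each tier is notified in addition to the earlier ones).

```python
from datetime import datetime, timedelta, timezone

KST = timezone(timedelta(hours=9))

# (minutes without ack, who gets notified at that point)
ESCALATION_LADDER = [(15, "secondary"), (30, "admin_group"), (60, "org_owner_sms")]


def primary_responder(at):
    """Business hours (09:00-18:00 KST, Mon-Fri) go to the tech lead, else SRE on-call."""
    local = at.astimezone(KST)  # 'at' must be a timezone-aware datetime
    is_weekday = local.weekday() < 5
    if is_weekday and 9 <= local.hour < 18:
        return "tech_lead"
    return "sre_on_call"


def escalation_targets(minutes_without_ack):
    """Every tier whose threshold has been reached so far."""
    return [target for threshold, target in ESCALATION_LADDER
            if minutes_without_ack >= threshold]


print(primary_responder(datetime(2024, 1, 8, 10, 0, tzinfo=KST)))  # tech_lead (Monday)
print(primary_responder(datetime(2024, 1, 8, 2, 0, tzinfo=KST)))   # sre_on_call
print(escalation_targets(14))  # []
print(escalation_targets(45))  # ['secondary', 'admin_group']
```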

Scope Boundaries

V1 excludes the following.

ML-based anomaly detection — rule-based severity ships first; ML is post-PMF
Automated rollback actions — paging only; rollback stays a human decision
Customer-facing status page — internal focus only; external status is delegated to a dedicated SaaS (StatusPage)
Infra cost and resource analysis — pipeline events only, not cost metrics
Long-horizon log aggregation — beyond 90 days delegated to an existing log platform
Kubernetes and runtime health checks — pipeline events only, runtime state lives in the existing stack
SSO/SAML integration — MVP uses Clerk's default auth; enterprise SSO waits until post-PMF
Mobile app — responsive web is enough for on-call; native apps wait for demand
Multi-cloud unification — GCP/AWS/Azure routing consolidation stays out of scope

Tech Stack & Architecture

Frontend: Next.js + React, Tailwind CSS, shadcn/ui components. Dashboard is SSR with SSE for real-time refresh.
API: tRPC end-to-end for type-safe routing, rule, and audit APIs.
DB: PostgreSQL (Prisma ORM). Events table uses a composite index on (timestamp, service).
Queue: Webhook ingestion is Vercel Edge Function + Postgres Listen/Notify — no external MQ, scales inside Postgres.
External: Slack Web API (Block Kit), PagerDuty (optional, escalation tier), GitHub/GitLab webhooks.
Deployment: Vercel app + Supabase DB; Vercel Cron drives weekly reports.
Observability: OpenTelemetry captures tRPC and webhook spans, exported to Grafana Cloud.

The constraining assumption is the "internal tool" positioning — a single org is assumed, so multi-tenancy is an org-ID partition rather than full isolation. Offering this as an external SaaS would require a separate architecture pass. Incident automation itself (auto-rollback, auto-recovery playbooks) is explicitly out of scope for this phase; OpsConsole concentrates on the three axes of observe, route, and record.

Task Tree

Sprint 1 — Ingestion & Observability

Pipeline event ingestion and base dashboard

[TASK-001] Pipeline event ingestion endpoint DONE (complexity: 3, urgency: 5, importance: MUST)

Build webhook endpoint to receive GitHub Actions / GitLab CI events, classify by type and environment, and persist.
- ST-001-1: Webhook receiver route DONE
- ST-001-2: Event classification logic DONE
[TASK-002] Deployment status dashboard (base view) IN_PROGRESS (complexity: 4, urgency: 4, importance: MUST)

Build real-time dashboard to visualize pipeline events by stage and environment.
- ST-002-1: Pipeline list component DONE
- ST-002-2: Real-time polling or SSE IN_PROGRESS

Sprint 2 — On-call & Alerts

On-call routing rule engine and Slack alert templates

[TASK-003] On-call routing rule engine BACKLOG (complexity: 5, urgency: 3, importance: MUST)

Build a rule engine that maps service/time-zone/severity to on-call assignees using a priority-ordered JSON rule table. Higher-priority rules are evaluated first; first match wins on conflict.
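
A minimal Python sketch of priority-ordered, first-match-wins evaluation follows. The rule fields, the `"*"` wildcard, and the hour-range model of the time dimension are assumptions for illustration; in the tool itself the table would be loaded from the JSON rule store.

```python
# Hypothetical rule table; higher priority is evaluated first.
RULES = [
    {"priority": 100, "service": "payments", "severity": "critical",
     "hours": (0, 24), "assignee": "payments-oncall"},
    {"priority": 50, "service": "*", "severity": "critical",
     "hours": (9, 18), "assignee": "tech-lead"},
    {"priority": 10, "service": "*", "severity": "*",
     "hours": (0, 24), "assignee": "sre-oncall"},
]


def matches(rule, service, severity, hour):
    """A rule matches when service, severity, and hour all fit ('*' is a wildcard)."""
    start, end = rule["hours"]
    return (rule["service"] in ("*", service)
            and rule["severity"] in ("*", severity)
            and start <= hour < end)


def resolve_assignee(rules, service, severity, hour):
    """Walk rules from highest to lowest priority; the first match wins."""
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        if matches(rule, service, severity, hour):
            return rule["assignee"]
    return None  # no rule matched


print(resolve_assignee(RULES, "payments", "critical", 3))  # payments-oncall
print(resolve_assignee(RULES, "search", "critical", 10))   # tech-lead
print(resolve_assignee(RULES, "search", "critical", 22))   # sre-oncall
```

A low-priority catch-all rule, as in the last entry, guarantees that every event resolves to someone.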
[TASK-004] Slack notification template engine BACKLOG (complexity: 4, urgency: 3, importance: SHOULD)

Build an engine that resolves Slack Block Kit templates keyed by environment and severity, integrated with throttle logic to suppress duplicate alerts.
- ST-004-1: Block Kit template definitions BACKLOG