Free SKILL.md scraped from GitHub. Clone the repo or copy the file directly into your Claude Code skills directory.
npx versuz@latest install jmagly-aiwg-plugins-sdlc-skills-flow-hypercare-monitoringgit clone https://github.com/jmagly/aiwg.gitcp aiwg/SKILL.MD ~/.claude/skills/jmagly-aiwg-plugins-sdlc-skills-flow-hypercare-monitoring/SKILL.md---
namespace: aiwg
name: flow-hypercare-monitoring
platforms: [all]
description: Orchestrate hypercare monitoring period with 24/7 support, SLO tracking, and rapid issue response
commandHint:
argumentHint: '[hypercare-duration-days] [project-directory] [--guidance "text"] [--interactive]'
allowedTools: 'Task, Read, Write, Glob, TodoWrite'
model: opus
category: sdlc-orchestration
orchestration: true
---
# Hypercare Monitoring Flow
**You are the Core Orchestrator** for the post-deployment hypercare monitoring period.
## Your Role
**You orchestrate multi-agent workflows. You do NOT execute bash scripts.**
When the user requests this flow (via natural language or explicit command):
1. **Interpret the request** and confirm understanding
2. **Read this template** as your orchestration guide
3. **Extract agent assignments** and workflow steps
4. **Launch agents via Task tool** in correct sequence
5. **Synthesize results** and finalize artifacts
6. **Report completion** with summary
## Hypercare Overview
**Definition**: Hypercare is an elevated support period immediately following production deployment, characterized by heightened monitoring, rapid response, and intensive issue resolution.
**Typical Duration**: 7-14 days (configurable based on release complexity and risk)
**Focus Areas**:
- Production stability and SLO compliance
- Rapid incident identification and response
- User adoption and feedback collection
- Support team enablement
- Smooth transition to business-as-usual operations
**Exit Criteria**:
- Zero P0 (Critical) incidents in last 48 hours
- Zero P1 (High) incidents in last 24 hours
- All SLOs met for 72 consecutive hours
- User adoption metrics trending positive
- Support team ready for standard operations
- Hypercare report complete and approved
**Expected Duration**: 7-14 days (typical), 20-30 minutes orchestration
## Natural Language Triggers
Users may say:
- "Start hypercare"
- "Begin hypercare period"
- "Post-launch monitoring"
- "24/7 support period"
- "Activate hypercare monitoring"
- "Launch post-deployment support"
You recognize these as requests for this orchestration flow.
## Parameter Handling
### Hypercare Duration Parameter
**Purpose**: Specify hypercare period length
**Examples**:
```
/flow-hypercare-monitoring 7 .
/flow-hypercare-monitoring 14 .
```
**Default**: 7 days (low-risk deployments), 14 days (high-risk deployments)
### --guidance Parameter
**Purpose**: User provides upfront direction to tailor hypercare priorities
**Examples**:
```
--guidance "Focus on security monitoring, financial transaction integrity critical"
--guidance "Performance is key, sub-200ms p95 response time SLO"
--guidance "First production launch, team needs extra support and documentation"
--guidance "High-traffic deployment, anticipate 100K daily active users"
```
**How to Apply**:
- Parse guidance for keywords: security, performance, compliance, scale, team experience
- Adjust agent assignments (add security-gatekeeper, performance-engineer for specific focuses)
- Modify monitoring depth (lightweight vs comprehensive based on complexity)
- Influence priority ordering (stability vs. adoption focus)
### --interactive Parameter
**Purpose**: You ask 5-8 strategic questions to understand project context
**Questions to Ask** (if --interactive):
```
I'll ask 8 strategic questions to tailor hypercare to your needs:
Q1: What are your top priorities for hypercare?
(e.g., stability validation, user adoption, performance monitoring)
Q2: What's the deployment risk level?
(Helps determine monitoring intensity and duration)
Q3: What are your critical SLOs?
(Availability, response time, error rate targets)
Q4: What's your expected user volume?
(Helps set alert thresholds and capacity monitoring)
Q5: What's your support team's experience level?
(Influences runbook detail and escalation paths)
Q6: What are your biggest concerns about this deployment?
(These become focus areas for monitoring and validation)
Q7: Are there regulatory or compliance requirements?
(e.g., HIPAA, SOC2, PCI-DSS - affects audit logging and security monitoring)
Q8: What's your incident response capability?
(24/7 on-call? Business hours? Helps plan escalation and response)
Based on your answers, I'll adjust:
- Monitoring intensity (alert thresholds, dashboard focus)
- Agent assignments (add specialized monitoring agents)
- Exit criteria strictness (standard vs. elevated)
- Support team guidance level (detailed runbooks vs. minimal)
```
**Synthesize Guidance**: Combine answers into structured guidance string for execution
## Artifacts to Generate
**Primary Deliverables**:
- **Hypercare Team Roster**: Roles, on-call rotation, contacts → `.aiwg/deployment/hypercare-team-roster.md`
- **Production Health Dashboard**: Real-time monitoring config → `.aiwg/deployment/production-dashboard-config.md`
- **Alert Escalation Matrix**: Severity definitions and response SLAs → `.aiwg/deployment/alert-escalation-matrix.md`
- **Daily Hypercare Standups**: Status reports (daily) → `.aiwg/deployment/hypercare-standup-{YYYY-MM-DD}.md`
- **Incident Response Logs**: All P0/P1 incidents → `.aiwg/deployment/incidents/incident-{ID}.md`
- **Risk Retirement Report**: Validation evidence → `.aiwg/risks/hypercare-risk-validation.md`
- **Hypercare Exit Report**: Final status and transition plan → `.aiwg/reports/hypercare-exit-report.md`
**Supporting Artifacts**:
- SLO tracking logs (hourly updates)
- User adoption metrics (daily updates)
- Support ticket analysis (daily summary)
- Post-incident reviews (PIRs) for all P0/P1
- Corrective action tracker
## Multi-Agent Orchestration Workflow
### Step 1: Establish Hypercare Team and Schedule
**Purpose**: Create dedicated support structure with clear ownership and 24/7 coverage
**Your Actions**:
1. **Read Deployment Context**:
```
Read:
- .aiwg/deployment/operational-readiness-review.md (team assignments, contacts)
- .aiwg/deployment/slo-sli-definition.md (SLO targets, monitoring approach)
- .aiwg/deployment/incident-response-runbook.md (escalation paths)
```
2. **Launch Hypercare Planning Agents** (parallel):
```
# Agent 1: Operations Manager
Task(
subagent_type="operations-manager",
description="Create hypercare team roster and on-call rotation",
prompt="""
Read ORR team assignments and contacts
Create Hypercare Team Roster:
## Core Team
- Hypercare Lead: {name} (overall coordination, daily standups)
- On-Call Engineers: {rotation-schedule} (24/7 coverage)
- Reliability Engineer: {name} (SLO monitoring, performance analysis)
- Support Lead: {name} (user-facing issues, ticket triage)
- DevOps Engineer: {name} (rapid deployment, rollback authority)
## Extended Team
- Product Owner: {name} (prioritization, user impact)
- Security Gatekeeper: {name} (security incidents)
- Component Owners: {list by component}
Create 24/7 On-Call Rotation ({duration} days):
- Primary on-call schedule (8-hour shifts or daily rotation)
- Backup on-call contacts
- Escalation path (P0/P1/P2/P3 response procedures)
Schedule Daily Standups:
- Time: {suggest optimal time}
- Duration: 30 minutes
- Attendees: Core team (mandatory), Extended team (optional)
Save to: .aiwg/deployment/hypercare-team-roster.md
"""
)
# Agent 2: Reliability Engineer
Task(
subagent_type="reliability-engineer",
description="Configure production monitoring and alerting",
prompt="""
Read SLO/SLI definitions
Configure Production Health Dashboard:
## Key Metrics (Auto-Refresh: 30s)
**Availability**
- Current Uptime: {percentage}% (Target: ≥99.9%)
- Service Health: {GREEN | YELLOW | RED}
- Failed Health Checks: {count}
**Performance (Last 5 min)**
- Response Time (p50/p95/p99): {value}ms
- Throughput: {requests-per-second} req/s
- Target: p95 < {SLA}ms
**Errors (Last 5 min)**
- Error Rate: {percentage}% (Target: <0.1%)
- 4xx/5xx Errors: {count}
- Database Errors: {count}
**Business Metrics**
- Active Users (Current): {count}
- Successful Transactions: {count}
- Transaction Success Rate: {percentage}%
**Infrastructure**
- CPU/Memory Utilization: {percentage}%
- Disk I/O, Network Traffic
Define alert thresholds for P0/P1/P2/P3 severity levels
Save to: .aiwg/deployment/production-dashboard-config.md
"""
)
# Agent 3: Support Lead
Task(
subagent_type="support-lead",
description="Define alert escalation and incident response",
prompt="""
Read incident response runbook
Create Alert Escalation Matrix:
## P0 (Critical) - Page Immediately
- Availability <99%
- Error rate >1%
- All instances down
- Security breach detected
Action: Page on-call engineer + Hypercare Lead
Response SLA: Immediate acknowledgment, 15 min time-to-engage
## P1 (High) - Alert Within 5 Minutes
- Availability <99.5%
- Error rate >0.5%
- Response time p95 >2x SLA
Action: Alert on-call engineer via Slack + SMS
Response SLA: 30 min acknowledgment, 1 hour time-to-mitigation
## P2 (Medium) - Alert Within 30 Minutes
- Availability <99.9%
- Error rate >0.1%
- Resource utilization >80%
Action: Alert on-call engineer via Slack
Response SLA: 4 hours
## P3 (Low) - Log and Review
- Minor performance degradation
- Non-critical errors
Action: Create ticket for review
Response SLA: 1 business day
Document incident response workflow (5 phases):
1. Detection (Target: <5 min)
2. Triage (Target: <15 min)
3. Investigation (P0=30min, P1=1h)
4. Mitigation (P0=1h, P1=4h)
5. Resolution (P0=2h, P1=8h)
6. Post-Incident Review (Within 48h)
Save to: .aiwg/deployment/alert-escalation-matrix.md
"""
)
```
3. **Synthesize Hypercare Setup Plan**:
```
# You do this directly (no agent needed)
Read all hypercare planning artifacts
Validate completeness:
- Team roster: All roles assigned?
- On-call rotation: 24/7 coverage confirmed?
- Monitoring: All SLOs tracked?
- Escalation: Response SLAs defined?
Create dedicated communication channel: #hypercare-{project-name}-{YYYY-MM}
```
**Communicate Progress**:
```
✓ Initialized hypercare setup
⏳ Establishing hypercare team and monitoring...
✓ Hypercare team roster created (Core + Extended teams)
✓ 24/7 on-call rotation scheduled ({duration} days)
✓ Production dashboard configured (5 metric categories)
✓ Alert escalation matrix defined (P0/P1/P2/P3)
✓ Hypercare infrastructure ready: .aiwg/deployment/
```
### Step 2: Monitor Production Stability and SLOs (Daily)
**Purpose**: Continuously validate production system meets SLO targets and stability expectations
**Your Actions**:
1. **Launch SLO Monitoring Agents** (automated, repeat daily):
```
# Agent 1: Reliability Engineer (Daily SLO Report)
Task(
subagent_type="reliability-engineer",
description="Generate daily SLO compliance report",
prompt="""
Read production metrics from monitoring dashboard
Read SLO definitions: .aiwg/deployment/slo-sli-definition.md
Generate Daily SLO Report:
## SLO Tracking (Updated Hourly)
### Availability SLO
- Target: ≥99.9% uptime
- Current (24h): {percentage}%
- Current (7d): {percentage}%
- Error Budget Remaining: {percentage}%
- Status: {ON TARGET | AT RISK | EXCEEDED}
### Performance SLO
- Target: p95 response time <{value}ms
- Current p95 (24h): {value}ms
- Current p95 (7d): {value}ms
- Status: {ON TARGET | AT RISK | EXCEEDED}
### Error Rate SLO
- Target: <0.1% error rate
- Current (24h): {percentage}%
- Current (7d): {percentage}%
- Status: {ON TARGET | AT RISK | EXCEEDED}
### Throughput SLO
- Target: Handle {value} req/s
- Current Peak: {value} req/s
- Current Average: {value} req/s
- Status: {ON TARGET | AT RISK | EXCEEDED}
Calculate Error Budget Burn Rate:
- Monthly error budget: {value} minutes downtime allowed
- Hypercare period budget: {value} minutes
- Current burn rate: {value} minutes consumed
- Budget remaining: {percentage}%
- Assessment: {HEALTHY | MONITOR | CRITICAL}
If CRITICAL: Recommend incident freeze, focus on stability
If MONITOR: Recommend increased monitoring, defer risky changes
Save to: .aiwg/deployment/slo-report-{YYYY-MM-DD}.md
"""
)
# Agent 2: Support Lead (Daily Support Analysis)
Task(
subagent_type="support-lead",
description="Analyze user adoption and support tickets",
prompt="""
Read support ticket system
Read user analytics
Generate User Adoption Dashboard:
### Active Users
- DAU (Daily Active Users): {count} (Target: >{target})
- WAU/MAU: {count}
- User Growth Rate: {+/-percentage}%
### Feature Adoption (New Features)
For each new feature:
- Total Users: {count}
- Users Engaged: {count} ({percentage}%)
- Adoption Rate: {percentage}% (Target: >{target}%)
- Trend: {INCREASING | STABLE | DECREASING}
### Support Ticket Analysis
- Total Tickets (24h): {count}
- By Category: Bug Reports, How-To, Performance, etc.
- Critical Issues: {count} (blockers)
- Average Response Time: {value}h (Target: <{SLA}h)
### User Feedback Summary
- Sentiment: {POSITIVE | NEUTRAL | NEGATIVE} ({percentage}%)
- Top Issues: {list top 3}
- Top Praises: {list top 3}
Flag Critical User Blockers (if any)
Save to: .aiwg/deployment/user-adoption-{YYYY-MM-DD}.md
"""
)
```
2. **Incident Tracking** (on-demand per incident):
```
# When incident detected:
Task(
subagent_type="devops-engineer",
description="Document and respond to incident {incident-ID}",
prompt="""
Incident detected: {incident-description}
Severity: {P0 | P1 | P2 | P3}
Follow Incident Response Workflow:
1. Detection (<5 min):
- Alert acknowledged
- Initial severity assessment
- Create incident channel: #incident-{YYYY-MM-DD}-{ID}
2. Triage (<15 min):
- Gather evidence (logs, metrics, user reports)
- Identify affected systems/users
- Estimate business impact
- Engage Component Owners
- Update severity if needed
3. Investigation (P0=30min, P1=1h):
- Review logs/metrics for root cause
- Check recent deployments/changes
- Reproduce in non-prod if possible
- Identify probable root cause
4. Mitigation (P0=1h, P1=4h):
- Execute mitigation (rollback/hotfix/config change)
- Validate effectiveness
- Monitor for regression
5. Resolution (P0=2h, P1=8h):
- Confirm fully resolved
- Validate SLOs back to normal
- Close incident
Document incident timeline and actions
Save to: .aiwg/deployment/incidents/incident-{ID}.md
If P0/P1: Schedule post-incident review within 48h
"""
)
```
**Communicate Progress** (daily update):
```
✓ Hypercare Day {N} of {duration}
⏳ Monitoring production stability...
✓ SLO compliance: {percentage}% of SLOs met (target: 100%)
✓ Incidents (24h): {count} total (P0: {count}, P1: {count}, P2: {count})
✓ User adoption: {percentage}% ({trend})
✓ Support tickets: {count} (Trend: {↑/→/↓})
✓ Daily reports: .aiwg/deployment/slo-report-{date}.md, user-adoption-{date}.md
```
### Step 3: Conduct Daily Hypercare Standups
**Purpose**: Maintain team alignment, surface issues early, coordinate rapid response
**Your Actions**:
1. **Generate Daily Standup Report** (automated):
```
Task(
subagent_type="operations-manager",
description="Generate daily hypercare standup report",
prompt="""
Read daily reports:
- .aiwg/deployment/slo-report-{YYYY-MM-DD}.md
- .aiwg/deployment/user-adoption-{YYYY-MM-DD}.md
- .aiwg/deployment/incidents/* (all open/recent incidents)
Create Daily Standup Agenda:
## Hypercare Daily Standup - Day {N} of {duration}
**Date**: {YYYY-MM-DD}
**Facilitator**: {Hypercare Lead}
### 1. Production Health Review (5 min)
**Presented by**: Reliability Engineer
- Availability: {percentage}% (Target: ≥99.9%) - {STATUS}
- Performance: p95 {value}ms (Target: <{SLA}ms) - {STATUS}
- Error Rate: {percentage}% (Target: <0.1%) - {STATUS}
- Error Budget: {percentage}% remaining - {STATUS}
Overall Health: {GREEN | YELLOW | RED}
### 2. Incident Summary (Last 24h) (10 min)
**Presented by**: On-Call Engineer
Total Incidents: {count}
- P0 (Critical): {count} - {list titles if any}
- P1 (High): {count} - {list titles if any}
- P2 (Medium): {count}
- P3 (Low): {count}
Key Incidents:
For each P0/P1:
- Incident-ID: {title}
- Status: {Open/Resolved/Closed}
- Impact: {user-count} users, {duration} minutes
- Root Cause: {brief description}
- Action Items: {list}
Patterns/Trends: {emerging issues or recurring problems}
### 3. User Feedback Review (5 min)
**Presented by**: Support Lead
- Support Tickets (24h): {count} (Trend: {↑/→/↓})
- Critical User Issues: {count}
- Top Complaints: {list top 3}
- Top Praises: {list top 3}
- Sentiment: {POSITIVE | NEUTRAL | NEGATIVE}
Blockers for Users: {list critical issues}
### 4. SLO/SLI Status (5 min)
**Presented by**: Reliability Engineer
| SLO | Target | Current (24h) | Status |
|-----|--------|---------------|--------|
| Availability | ≥99.9% | {percentage}% | {✓/⚠/✗} |
| Response Time | p95<{value}ms | {value}ms | {✓/⚠/✗} |
| Error Rate | <0.1% | {percentage}% | {✓/⚠/✗} |
| Throughput | >{value} req/s | {value} req/s | {✓/⚠/✗} |
### 5. Action Items and Blockers (5 min)
Open Action Items:
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
{list open actions}
New Blockers:
{list blockers requiring escalation}
Tomorrow's On-Call: {name} (taking over at {HH:MM})
---
Overall Status: {GREEN | YELLOW | RED}
Key Decisions Made:
{list decisions from standup}
New Action Items:
{list new actions assigned}
Save to: .aiwg/deployment/hypercare-standup-{YYYY-MM-DD}.md
"""
)
```
2. **Weekly Summary** (if hypercare > 7 days):
```
# On Day 7, 14, etc.:
Task(
subagent_type="operations-manager",
description="Generate weekly hypercare summary",
prompt="""
Read all daily standups for week: .aiwg/deployment/hypercare-standup-*.md
Create Weekly Summary:
## Hypercare Week {N} Summary
**Week**: {date-range}
**Overall Status**: {GREEN | YELLOW | RED}
### Production Stability
- Availability: {percentage}% (Target: ≥99.9%)
- Total Incidents: {count} (P0: {count}, P1: {count})
- MTTR: {value} min
- SLO Compliance: {percentage}%
### User Adoption
- Active Users: {count} ({+/-percentage}% vs. previous week)
- Feature Adoption: {percentage}%
- User Sentiment: {POSITIVE | NEUTRAL | NEGATIVE}
### Support Health
- Support Tickets: {count} ({+/-percentage}% vs. previous week)
- Critical Issues: {count}
- Response Time: {value}h (Target: <{SLA}h)
### Accomplishments
{list accomplishments}
### Challenges
{list challenges}
### Next Week Focus
{list focus areas}
Save to: .aiwg/reports/hypercare-week-{N}-summary.md
"""
)
```
**Communicate Progress**:
```
⏳ Conducting daily standup...
✓ Daily standup report generated: .aiwg/deployment/hypercare-standup-{date}.md
- Overall Health: {GREEN | YELLOW | RED}
- Key Decisions: {count}
- New Action Items: {count}
- Escalations: {count}
```
### Step 4: Post-Incident Reviews (For P0/P1 Incidents)
**Purpose**: Document root cause and corrective actions for all critical incidents
**Your Actions**:
1. **For Each P0/P1 Incident** (within 48h of resolution):
```
Task(
subagent_type="reliability-engineer",
description="Conduct post-incident review for {incident-ID}",
prompt="""
Read incident log: .aiwg/deployment/incidents/incident-{ID}.md
Create Post-Incident Review (PIR):
## Post-Incident Review: {Incident-ID}
**Date**: {YYYY-MM-DD}
**Severity**: {P0/P1/P2/P3}
**Duration**: {detection-to-resolution}
**Impact**: {user-count} users, {downtime-minutes} minutes downtime
### Incident Summary
{1-2 sentence description of what happened}
### Timeline
| Time | Event | Actor |
|------|-------|-------|
{incident timeline from detection to resolution}
### Root Cause
{Detailed technical root cause analysis}
### Contributing Factors
1. {Factor 1 - e.g., insufficient testing}
2. {Factor 2 - e.g., monitoring gap}
3. {Factor 3 - e.g., unclear runbook}
### Corrective Actions
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
{list corrective actions to prevent recurrence}
### Lessons Learned
- What went well: {list}
- What could improve: {list}
- Process changes needed: {list}
Save to: .aiwg/deployment/incidents/pir-{ID}.md
Update incident log with PIR link
Track corrective actions in action tracker
"""
)
```
**Communicate Progress**:
```
⏳ Conducting post-incident reviews...
✓ PIR complete: Incident-{ID} ({title})
- Root cause: {summary}
- Corrective actions: {count} assigned
- Status: Tracking to completion
```
### Step 5: Validate Exit Criteria and Generate Hypercare Report
**Purpose**: Ensure production is stable and support team is ready before ending hypercare
**Your Actions**:
1. **Validate Exit Criteria** (on final day or when user requests):
```
Task(
subagent_type="operations-manager",
description="Validate hypercare exit criteria",
prompt="""
Read all hypercare artifacts
Validate Hypercare Exit Criteria:
## Hypercare Exit Criteria Validation
**Hypercare Period**: Day {N} of {duration}
**Validation Date**: {YYYY-MM-DD}
### Production Stability
- [ ] Zero P0 (Critical) incidents in last 48 hours
- [ ] Zero P1 (High) incidents in last 24 hours
- [ ] All SLOs met for 72 consecutive hours
- [ ] Availability ≥99.9%
- [ ] Response time p95 <{SLA}ms
- [ ] Error rate <0.1%
- [ ] Throughput >{target} req/s
- [ ] Error budget healthy: >{percentage}% remaining
- [ ] No open P0/P1 incidents
### User Adoption
- [ ] User adoption trending positive ({percentage}% growth)
- [ ] Feature adoption >{target}% for critical features
- [ ] User sentiment majority positive (≥70%)
- [ ] Support ticket volume stable or decreasing
- [ ] No critical user blockers unresolved
### Support Readiness
- [ ] Support team trained and confident
- [ ] Runbooks validated (all common issues documented)
- [ ] Escalation paths tested and effective
- [ ] Knowledge base updated with hypercare learnings
- [ ] On-call rotation transitioned to standard support
### Documentation Complete
- [ ] Hypercare report completed
- [ ] Post-incident reviews completed (all P0/P1)
- [ ] Corrective actions tracked (assigned, due dates set)
- [ ] Lessons learned documented
- [ ] Runbooks updated
Overall Exit Criteria Status: {PASS | CONDITIONAL | FAIL}
Decision: {END HYPERCARE | EXTEND HYPERCARE | ESCALATE}
Save to: .aiwg/reports/hypercare-exit-criteria.md
"""
)
```
2. **Generate Hypercare Exit Report** (comprehensive final report):
```
Task(
subagent_type="operations-manager",
description="Generate comprehensive hypercare exit report",
prompt="""
Read all hypercare artifacts:
- .aiwg/deployment/hypercare-team-roster.md
- .aiwg/deployment/slo-report-*.md (all days)
- .aiwg/deployment/user-adoption-*.md (all days)
- .aiwg/deployment/hypercare-standup-*.md (all days)
- .aiwg/deployment/incidents/*.md (all incidents)
- .aiwg/reports/hypercare-exit-criteria.md
Generate Hypercare Exit Report:
# Hypercare Report: {Project-Name}
**Hypercare Period**: {start-date} to {end-date} ({duration} days)
**Report Date**: {YYYY-MM-DD}
**Report Author**: {Hypercare Lead}
## Executive Summary
{2-3 sentence summary of hypercare outcomes}
Overall Status: {SUCCESS | SUCCESS WITH CONDITIONS | CHALLENGES}
Key Metrics:
- Availability: {percentage}%
- Total Incidents: {count} (P0: {count}, P1: {count})
- User Adoption: {percentage}%
- Support Tickets: {count}
## Production Stability Summary
### SLO Performance
| SLO | Target | Achieved | Status |
|-----|--------|----------|--------|
{SLO compliance table}
SLO Compliance Rate: {percentage}%
### Incident Summary
Total Incidents: {count}
- P0 (Critical): {count}
- P1 (High): {count}
- P2 (Medium): {count}
- P3 (Low): {count}
Key Metrics:
- MTTD (Mean Time to Detect): {value} min
- MTTA (Mean Time to Acknowledge): {value} min
- MTTR (Mean Time to Resolve): {value} min
Major Incidents:
For each P0/P1:
- Incident-ID: {title}
- Date, Duration, Impact, Root Cause, Resolution
- Corrective Actions: {count} assigned
### Performance Trends
- Response Time: {IMPROVED | STABLE | DEGRADED} ({+/-percentage}% vs. pre-deployment)
- Error Rate: {IMPROVED | STABLE | DEGRADED}
- Resource Utilization: {HEALTHY | CONCERNING}
## User Adoption Summary
### Adoption Metrics
- Active Users: {count} ({+/-percentage}% vs. pre-deployment)
- Feature Adoption: {percentage}% (Target: >{target}%)
- User Retention (Day 14): {percentage}%
### User Feedback
- Total Feedback Items: {count}
- Sentiment: {percentage}% positive
- Net Promoter Score: {value}
Top Praises: {list top 3}
Top Complaints: {list top 3 with resolution status}
## Support Summary
### Ticket Volume
- Total Support Tickets: {count}
- Daily Average: {count} tickets/day
- Trend: {DECREASING | STABLE | INCREASING}
### Support Performance
- Average Response Time: {value}h (Target: <{SLA}h) - {✓/⚠/✗}
- First Contact Resolution: {percentage}%
### Support Team Readiness
- Team Confidence Level: {HIGH | MEDIUM | LOW}
- Runbook Completeness: {percentage}%
## Lessons Learned
### What Went Well
{list successes}
### What Could Improve
{list improvements}
### Process Recommendations
{list recommendations for future deployments}
## Corrective Actions
Total Actions Identified: {count}
| Action | Category | Owner | Due Date | Status |
|--------|----------|-------|----------|--------|
{corrective actions table}
## Handover to Standard Support
### Transition Plan
- [ ] Standard on-call rotation activated (starting {date})
- [ ] Support runbooks transferred
- [ ] Knowledge base published
- [ ] Support team training complete
- [ ] Escalation paths updated for BAU
### Post-Hypercare Monitoring
- Duration: {duration} days continued close monitoring
- Responsible: {Support Lead}
- Review Cadence: Weekly check-ins for {duration} weeks
## Conclusion
{2-3 sentence summary and readiness for standard support}
Recommendation: {END HYPERCARE | EXTEND HYPERCARE}
Signoff:
- Hypercare Lead: {name} - {date}
- Reliability Engineer: {name} - {date}
- Support Lead: {name} - {date}
- Product Owner: {name} - {date}
- Project Manager: {name} - {date}
Save to: .aiwg/reports/hypercare-exit-report.md
"""
)
```
3. **Present Exit Summary to User**:
```
# You present this directly (not via agent)
Read .aiwg/reports/hypercare-exit-report.md
Present summary:
─────────────────────────────────────────────
Hypercare Monitoring Period Complete
─────────────────────────────────────────────
**Hypercare Period**: {start-date} to {end-date} ({duration} days)
**Overall Status**: {SUCCESS | SUCCESS WITH CONDITIONS | CHALLENGES}
**Key Metrics**:
✓ Availability: {percentage}% (Target: ≥99.9%)
✓ Total Incidents: {count} (P0: {count}, P1: {count})
✓ User Adoption: {percentage}% of target
✓ Support Readiness: Team confident and ready
**Exit Criteria Status**:
✓ Production Stability: {PASS | CONDITIONAL | FAIL}
✓ User Adoption: {PASS | CONDITIONAL | FAIL}
✓ Support Readiness: {PASS | CONDITIONAL | FAIL}
✓ Documentation: {PASS | CONDITIONAL | FAIL}
**Decision**: {END HYPERCARE | EXTEND HYPERCARE | ESCALATE}
**Artifacts Generated**:
- Hypercare Team Roster (.aiwg/deployment/hypercare-team-roster.md)
- Production Dashboard Config (.aiwg/deployment/production-dashboard-config.md)
- Alert Escalation Matrix (.aiwg/deployment/alert-escalation-matrix.md)
- Daily Standup Reports (.aiwg/deployment/hypercare-standup-*.md, {count} files)
- SLO Reports (.aiwg/deployment/slo-report-*.md, {count} files)
- User Adoption Reports (.aiwg/deployment/user-adoption-*.md, {count} files)
- Incident Logs (.aiwg/deployment/incidents/*.md, {count} files)
- Post-Incident Reviews (.aiwg/deployment/incidents/pir-*.md, {count} files)
- Hypercare Exit Report (.aiwg/reports/hypercare-exit-report.md)
**Next Steps**:
- Review hypercare exit report with stakeholders
- Obtain formal signoffs (5 required signatures)
- If END HYPERCARE: Transition to standard support (run handoff workflow)
- If EXTEND HYPERCARE: Address gaps, continue monitoring
- If ESCALATE: Executive decision required
**Transition to Standard Support**:
- Standard on-call rotation activated: {date}
- Continued monitoring period: {duration} days
- Weekly check-ins scheduled
─────────────────────────────────────────────
```
**Communicate Progress**:
```
⏳ Validating hypercare exit criteria...
✓ Exit criteria validated: {PASS | CONDITIONAL | FAIL}
✓ Hypercare Exit Report generated: .aiwg/reports/hypercare-exit-report.md
✓ Transition plan documented
```
## Quality Gates
Before marking workflow complete, verify:
- [ ] Hypercare team established with 24/7 coverage
- [ ] Production monitoring operational (dashboards, alerts)
- [ ] Daily standups conducted and documented
- [ ] All P0/P1 incidents have post-incident reviews
- [ ] SLO compliance tracked daily
- [ ] User adoption monitored and reported
- [ ] Exit criteria validated
- [ ] Hypercare exit report complete and approved
- [ ] Transition to standard support planned
## User Communication
**At start**: Confirm understanding and list activities
```
Understood. I'll orchestrate the hypercare monitoring period.
Hypercare Duration: {duration} days
Hypercare Period: {start-date} to {estimated-end-date}
This will establish:
- Hypercare team roster and 24/7 on-call rotation
- Production health monitoring dashboards
- Alert escalation and incident response procedures
- Daily standup coordination
- SLO tracking and user adoption monitoring
- Post-incident review process
- Hypercare exit criteria validation
I'll coordinate multiple agents for comprehensive monitoring and support.
Expected setup: 20-30 minutes.
Starting orchestration...
```
**During**: Update progress with clear indicators
```
✓ = Complete
⏳ = In progress
❌ = Error/blocked
⚠️ = Warning/attention needed
```
**Daily**: Provide daily status summary
```
Hypercare Day {N} of {duration}: {GREEN | YELLOW | RED}
Production Health:
✓ Availability: {percentage}% (Target: ≥99.9%)
✓ Performance: p95 {value}ms (Target: <{SLA}ms)
{⚠️ | ✓} Error Rate: {percentage}% (Target: <0.1%)
Incidents (24h):
- P0: {count}
- P1: {count}
- P2: {count}
User Adoption: {percentage}% ({trend})
Daily reports: .aiwg/deployment/hypercare-standup-{date}.md
```
**At end**: Summary report (see Step 5.3 above)
## Error Handling
**If P0 Incident During Hypercare**:
```
❌ Critical incident detected - immediate response initiated
Incident: {incident-ID} - {title}
Severity: P0 (Complete outage / Data loss / Security breach)
Impact: {user-count} users affected
Actions:
1. On-call engineer + Hypercare Lead paged
2. Incident war room created: #incident-{date}-{ID}
3. Executive Sponsor notified
4. Status page updated
Response Timeline:
- Detection: {timestamp}
- Acknowledgment: {timestamp} (Target: Immediate)
- Time-to-engage: {minutes} min (Target: <15 min)
Current Status: {INVESTIGATING | MITIGATING | RESOLVED}
Impact on Exit Criteria: P0 incident resets 48h "zero critical incidents" requirement
Monitoring incident response...
```
**If SLO Breach**:
```
⚠️ SLO breach detected - immediate investigation required
SLO Breached: {SLO-name}
- Target: {target-value}
- Current: {actual-value}
- Duration: {duration} (continuous breach)
Impact:
- Error budget consumed: {percentage}%
- User impact: {description}
Actions:
1. Reliability Engineer investigating root cause
2. Metrics and logs under review
3. Mitigation plan in progress
If breach persists >24h: Recommend extending hypercare period
If error budget critically low: Recommend incident freeze
Monitoring for improvement...
```
**If User Adoption Low**:
```
⚠️ User adoption below target
Current Adoption: {percentage}% (Target: >{target}%)
Gap: {percentage} points
Analysis:
- Top User Issues: {list issues}
- Support Ticket Themes: {list themes}
- Potential Blockers: {list blockers}
Actions:
1. Product Owner engaged for adoption analysis
2. Support team reviewing common user issues
3. Documentation and training gaps identified
Decision Point:
- If blockers identified: Prioritize fixes, may extend hypercare
- If education needed: Launch awareness campaign
- If feature not valuable: Escalate to stakeholders
Impact on Exit Criteria: User adoption trend must improve before exit approval
```
**If Support Team Overwhelmed**:
```
⚠️ Support team capacity exceeded
Support Volume: {count} tickets/day (Capacity: {capacity})
Team Status: {STRESSED | OVERWHELMED}
Root Cause Analysis:
- Top Issue Categories: {list categories with counts}
- Product Bugs vs User Education: {ratio}
Immediate Relief Actions:
1. Additional support staff brought in (temp)
2. Engineering team handling overflow tickets
3. Workarounds created for top issues
4. FAQ and self-service guides published
Mitigation:
- Deploy hotfixes for high-frequency bugs
- Update documentation for common questions
- Additional training sessions scheduled
Impact on Exit Criteria: Support team must be confident and staffed before exit
```
**If Exit Criteria Not Met**:
```
⚠️ Hypercare exit criteria not met - extension recommended
Exit Criteria Status: {FAIL | CONDITIONAL}
Gaps Identified:
{list unmet criteria with details}
Recommendation: {EXTEND HYPERCARE | CONDITIONAL EXIT | ESCALATE}
Extension Plan:
- Additional Duration: {days} days
- Focus Areas: {list areas needing improvement}
- Re-validation Date: {date}
Escalating to user for decision...
```
## Success Criteria
This orchestration succeeds when:
- [ ] Hypercare team established with 24/7 coverage
- [ ] Production monitoring operational (dashboards, alerts, SLO tracking)
- [ ] Daily standups conducted for entire hypercare period
- [ ] All incidents documented and P0/P1 incidents have PIRs
- [ ] SLO compliance ≥95% (all SLOs met for 72+ consecutive hours)
- [ ] Zero P0/P1 incidents in final 48/24 hours
- [ ] User adoption trending positive (≥target%)
- [ ] Support team ready and confident
- [ ] Exit criteria validated: PASS or CONDITIONAL PASS
- [ ] Hypercare exit report complete and approved
- [ ] Transition to standard support planned and executed
## Metrics to Track
**During orchestration, track**:
- SLO compliance rate: % of SLOs met (target: 100% for 72h before exit)
- Incident frequency: # of P0/P1/P2/P3 incidents (target: P0/P1 = 0 in final 48/24h)
- Mean time to detect (MTTD): Minutes from incident to detection (target: <5 min)
- Mean time to resolve (MTTR): Minutes from detection to resolution (target: P0 <120 min, P1 <480 min)
- Error budget burn rate: % of monthly budget consumed (target: <50% during hypercare)
- User adoption rate: % of target users actively engaged (target: ≥70%)
- Support ticket volume: # of tickets/day (target: decreasing trend)
- Support response time: Hours to first response (target: <SLA)
## References
**Templates** (via $AIWG_ROOT):
- Operational Readiness Review: `templates/deployment/operational-readiness-review-template.md`
- SLO/SLI Definition: `templates/deployment/slo-sli-template.md`
- Incident Response Runbook: `templates/support/incident-response-runbook-template.md`
- Support Plan: `templates/support/support-plan-template.md`
**Related Flows**:
- Gate Check: `commands/flow-gate-check.md`
- Handoff Checklist: `commands/flow-handoff-checklist.md`
- Deployment Workflow: `commands/flow-deployment-workflow.md`
**SDLC Phase Context**:
- Phase: Transition (Deployment → Operations)
- Milestone: Hypercare Complete (transition to BAU support)