flow-hypercare-monitoring

Show SKILL.md content (~10.4k tokens)
---
namespace: aiwg
name: flow-hypercare-monitoring
platforms: [all]
description: Orchestrate hypercare monitoring period with 24/7 support, SLO tracking, and rapid issue response
commandHint:
  argumentHint: '[hypercare-duration-days] [project-directory] [--guidance "text"] [--interactive]'
  allowedTools: 'Task, Read, Write, Glob, TodoWrite'
  model: opus
  category: sdlc-orchestration
  orchestration: true
---

# Hypercare Monitoring Flow

**You are the Core Orchestrator** for the post-deployment hypercare monitoring period.

## Your Role

**You orchestrate multi-agent workflows. You do NOT execute bash scripts.**

When the user requests this flow (via natural language or explicit command):

1. **Interpret the request** and confirm understanding
2. **Read this template** as your orchestration guide
3. **Extract agent assignments** and workflow steps
4. **Launch agents via Task tool** in correct sequence
5. **Synthesize results** and finalize artifacts
6. **Report completion** with summary

## Hypercare Overview

**Definition**: Hypercare is an elevated support period immediately following production deployment, characterized by heightened monitoring, rapid response, and intensive issue resolution.

**Typical Duration**: 7-14 days (configurable based on release complexity and risk)

**Focus Areas**:
- Production stability and SLO compliance
- Rapid incident identification and response
- User adoption and feedback collection
- Support team enablement
- Smooth transition to business-as-usual operations

**Exit Criteria**:
- Zero P0 (Critical) incidents in last 48 hours
- Zero P1 (High) incidents in last 24 hours
- All SLOs met for 72 consecutive hours
- User adoption metrics trending positive
- Support team ready for standard operations
- Hypercare report complete and approved

**Expected Duration**: 7-14 days (typical), 20-30 minutes orchestration

## Natural Language Triggers

Users may say:
- "Start hypercare"
- "Begin hypercare period"
- "Post-launch monitoring"
- "24/7 support period"
- "Activate hypercare monitoring"
- "Launch post-deployment support"

You recognize these as requests for this orchestration flow.

## Parameter Handling

### Hypercare Duration Parameter

**Purpose**: Specify hypercare period length

**Examples**:
```
/flow-hypercare-monitoring 7 .
/flow-hypercare-monitoring 14 .
```

**Default**: 7 days (low-risk deployments), 14 days (high-risk deployments)

### --guidance Parameter

**Purpose**: User provides upfront direction to tailor hypercare priorities

**Examples**:
```
--guidance "Focus on security monitoring, financial transaction integrity critical"
--guidance "Performance is key, sub-200ms p95 response time SLO"
--guidance "First production launch, team needs extra support and documentation"
--guidance "High-traffic deployment, anticipate 100K daily active users"
```

**How to Apply**:
- Parse guidance for keywords: security, performance, compliance, scale, team experience
- Adjust agent assignments (add security-gatekeeper, performance-engineer for specific focuses)
- Modify monitoring depth (lightweight vs comprehensive based on complexity)
- Influence priority ordering (stability vs. adoption focus)

### --interactive Parameter

**Purpose**: You ask 5-8 strategic questions to understand project context

**Questions to Ask** (if --interactive):

```
I'll ask 8 strategic questions to tailor hypercare to your needs:

Q1: What are your top priorities for hypercare?
    (e.g., stability validation, user adoption, performance monitoring)

Q2: What's the deployment risk level?
    (Helps determine monitoring intensity and duration)

Q3: What are your critical SLOs?
    (Availability, response time, error rate targets)

Q4: What's your expected user volume?
    (Helps set alert thresholds and capacity monitoring)

Q5: What's your support team's experience level?
    (Influences runbook detail and escalation paths)

Q6: What are your biggest concerns about this deployment?
    (These become focus areas for monitoring and validation)

Q7: Are there regulatory or compliance requirements?
    (e.g., HIPAA, SOC2, PCI-DSS - affects audit logging and security monitoring)

Q8: What's your incident response capability?
    (24/7 on-call? Business hours? Helps plan escalation and response)

Based on your answers, I'll adjust:
- Monitoring intensity (alert thresholds, dashboard focus)
- Agent assignments (add specialized monitoring agents)
- Exit criteria strictness (standard vs. elevated)
- Support team guidance level (detailed runbooks vs. minimal)
```

**Synthesize Guidance**: Combine answers into structured guidance string for execution

## Artifacts to Generate

**Primary Deliverables**:
- **Hypercare Team Roster**: Roles, on-call rotation, contacts → `.aiwg/deployment/hypercare-team-roster.md`
- **Production Health Dashboard**: Real-time monitoring config → `.aiwg/deployment/production-dashboard-config.md`
- **Alert Escalation Matrix**: Severity definitions and response SLAs → `.aiwg/deployment/alert-escalation-matrix.md`
- **Daily Hypercare Standups**: Status reports (daily) → `.aiwg/deployment/hypercare-standup-{YYYY-MM-DD}.md`
- **Incident Response Logs**: All P0/P1 incidents → `.aiwg/deployment/incidents/incident-{ID}.md`
- **Risk Retirement Report**: Validation evidence → `.aiwg/risks/hypercare-risk-validation.md`
- **Hypercare Exit Report**: Final status and transition plan → `.aiwg/reports/hypercare-exit-report.md`

**Supporting Artifacts**:
- SLO tracking logs (hourly updates)
- User adoption metrics (daily updates)
- Support ticket analysis (daily summary)
- Post-incident reviews (PIRs) for all P0/P1
- Corrective action tracker

## Multi-Agent Orchestration Workflow

### Step 1: Establish Hypercare Team and Schedule

**Purpose**: Create dedicated support structure with clear ownership and 24/7 coverage

**Your Actions**:

1. **Read Deployment Context**:
   ```
   Read:
   - .aiwg/deployment/operational-readiness-review.md (team assignments, contacts)
   - .aiwg/deployment/slo-sli-definition.md (SLO targets, monitoring approach)
   - .aiwg/deployment/incident-response-runbook.md (escalation paths)
   ```

2. **Launch Hypercare Planning Agents** (parallel):
   ```
   # Agent 1: Operations Manager
   Task(
       subagent_type="operations-manager",
       description="Create hypercare team roster and on-call rotation",
       prompt="""
       Read ORR team assignments and contacts

       Create Hypercare Team Roster:

       ## Core Team
       - Hypercare Lead: {name} (overall coordination, daily standups)
       - On-Call Engineers: {rotation-schedule} (24/7 coverage)
       - Reliability Engineer: {name} (SLO monitoring, performance analysis)
       - Support Lead: {name} (user-facing issues, ticket triage)
       - DevOps Engineer: {name} (rapid deployment, rollback authority)

       ## Extended Team
       - Product Owner: {name} (prioritization, user impact)
       - Security Gatekeeper: {name} (security incidents)
       - Component Owners: {list by component}

       Create 24/7 On-Call Rotation ({duration} days):
       - Primary on-call schedule (8-hour shifts or daily rotation)
       - Backup on-call contacts
       - Escalation path (P0/P1/P2/P3 response procedures)

       Schedule Daily Standups:
       - Time: {suggest optimal time}
       - Duration: 30 minutes
       - Attendees: Core team (mandatory), Extended team (optional)

       Save to: .aiwg/deployment/hypercare-team-roster.md
       """
   )

   # Agent 2: Reliability Engineer
   Task(
       subagent_type="reliability-engineer",
       description="Configure production monitoring and alerting",
       prompt="""
       Read SLO/SLI definitions

       Configure Production Health Dashboard:

       ## Key Metrics (Auto-Refresh: 30s)

       **Availability**
       - Current Uptime: {percentage}% (Target: ≥99.9%)
       - Service Health: {GREEN | YELLOW | RED}
       - Failed Health Checks: {count}

       **Performance (Last 5 min)**
       - Response Time (p50/p95/p99): {value}ms
       - Throughput: {requests-per-second} req/s
       - Target: p95 < {SLA}ms

       **Errors (Last 5 min)**
       - Error Rate: {percentage}% (Target: <0.1%)
       - 4xx/5xx Errors: {count}
       - Database Errors: {count}

       **Business Metrics**
       - Active Users (Current): {count}
       - Successful Transactions: {count}
       - Transaction Success Rate: {percentage}%

       **Infrastructure**
       - CPU/Memory Utilization: {percentage}%
       - Disk I/O, Network Traffic

       Define alert thresholds for P0/P1/P2/P3 severity levels

       Save to: .aiwg/deployment/production-dashboard-config.md
       """
   )

   # Agent 3: Support Lead
   Task(
       subagent_type="support-lead",
       description="Define alert escalation and incident response",
       prompt="""
       Read incident response runbook

       Create Alert Escalation Matrix:

       ## P0 (Critical) - Page Immediately
       - Availability <99%
       - Error rate >1%
       - All instances down
       - Security breach detected

       Action: Page on-call engineer + Hypercare Lead
       Response SLA: Immediate acknowledgment, 15 min time-to-engage

       ## P1 (High) - Alert Within 5 Minutes
       - Availability <99.5%
       - Error rate >0.5%
       - Response time p95 >2x SLA

       Action: Alert on-call engineer via Slack + SMS
       Response SLA: 30 min acknowledgment, 1 hour time-to-mitigation

       ## P2 (Medium) - Alert Within 30 Minutes
       - Availability <99.9%
       - Error rate >0.1%
       - Resource utilization >80%

       Action: Alert on-call engineer via Slack
       Response SLA: 4 hours

       ## P3 (Low) - Log and Review
       - Minor performance degradation
       - Non-critical errors

       Action: Create ticket for review
       Response SLA: 1 business day

       Document incident response workflow (5 phases):
       1. Detection (Target: <5 min)
       2. Triage (Target: <15 min)
       3. Investigation (P0=30min, P1=1h)
       4. Mitigation (P0=1h, P1=4h)
       5. Resolution (P0=2h, P1=8h)
       6. Post-Incident Review (Within 48h)

       Save to: .aiwg/deployment/alert-escalation-matrix.md
       """
   )
   ```

3. **Synthesize Hypercare Setup Plan**:
   ```
   # You do this directly (no agent needed)

   Read all hypercare planning artifacts

   Validate completeness:
   - Team roster: All roles assigned?
   - On-call rotation: 24/7 coverage confirmed?
   - Monitoring: All SLOs tracked?
   - Escalation: Response SLAs defined?

   Create dedicated communication channel: #hypercare-{project-name}-{YYYY-MM}
   ```

**Communicate Progress**:
```
✓ Initialized hypercare setup
⏳ Establishing hypercare team and monitoring...
  ✓ Hypercare team roster created (Core + Extended teams)
  ✓ 24/7 on-call rotation scheduled ({duration} days)
  ✓ Production dashboard configured (5 metric categories)
  ✓ Alert escalation matrix defined (P0/P1/P2/P3)
✓ Hypercare infrastructure ready: .aiwg/deployment/
```

### Step 2: Monitor Production Stability and SLOs (Daily)

**Purpose**: Continuously validate production system meets SLO targets and stability expectations

**Your Actions**:

1. **Launch SLO Monitoring Agents** (automated, repeat daily):
   ```
   # Agent 1: Reliability Engineer (Daily SLO Report)
   Task(
       subagent_type="reliability-engineer",
       description="Generate daily SLO compliance report",
       prompt="""
       Read production metrics from monitoring dashboard
       Read SLO definitions: .aiwg/deployment/slo-sli-definition.md

       Generate Daily SLO Report:

       ## SLO Tracking (Updated Hourly)

       ### Availability SLO
       - Target: ≥99.9% uptime
       - Current (24h): {percentage}%
       - Current (7d): {percentage}%
       - Error Budget Remaining: {percentage}%
       - Status: {ON TARGET | AT RISK | EXCEEDED}

       ### Performance SLO
       - Target: p95 response time <{value}ms
       - Current p95 (24h): {value}ms
       - Current p95 (7d): {value}ms
       - Status: {ON TARGET | AT RISK | EXCEEDED}

       ### Error Rate SLO
       - Target: <0.1% error rate
       - Current (24h): {percentage}%
       - Current (7d): {percentage}%
       - Status: {ON TARGET | AT RISK | EXCEEDED}

       ### Throughput SLO
       - Target: Handle {value} req/s
       - Current Peak: {value} req/s
       - Current Average: {value} req/s
       - Status: {ON TARGET | AT RISK | EXCEEDED}

       Calculate Error Budget Burn Rate:
       - Monthly error budget: {value} minutes downtime allowed
       - Hypercare period budget: {value} minutes
       - Current burn rate: {value} minutes consumed
       - Budget remaining: {percentage}%
       - Assessment: {HEALTHY | MONITOR | CRITICAL}

       If CRITICAL: Recommend incident freeze, focus on stability
       If MONITOR: Recommend increased monitoring, defer risky changes

       Save to: .aiwg/deployment/slo-report-{YYYY-MM-DD}.md
       """
   )

   # Agent 2: Support Lead (Daily Support Analysis)
   Task(
       subagent_type="support-lead",
       description="Analyze user adoption and support tickets",
       prompt="""
       Read support ticket system
       Read user analytics

       Generate User Adoption Dashboard:

       ### Active Users
       - DAU (Daily Active Users): {count} (Target: >{target})
       - WAU/MAU: {count}
       - User Growth Rate: {+/-percentage}%

       ### Feature Adoption (New Features)
       For each new feature:
       - Total Users: {count}
       - Users Engaged: {count} ({percentage}%)
       - Adoption Rate: {percentage}% (Target: >{target}%)
       - Trend: {INCREASING | STABLE | DECREASING}

       ### Support Ticket Analysis
       - Total Tickets (24h): {count}
       - By Category: Bug Reports, How-To, Performance, etc.
       - Critical Issues: {count} (blockers)
       - Average Response Time: {value}h (Target: <{SLA}h)

       ### User Feedback Summary
       - Sentiment: {POSITIVE | NEUTRAL | NEGATIVE} ({percentage}%)
       - Top Issues: {list top 3}
       - Top Praises: {list top 3}

       Flag Critical User Blockers (if any)

       Save to: .aiwg/deployment/user-adoption-{YYYY-MM-DD}.md
       """
   )
   ```

2. **Incident Tracking** (on-demand per incident):
   ```
   # When incident detected:
   Task(
       subagent_type="devops-engineer",
       description="Document and respond to incident {incident-ID}",
       prompt="""
       Incident detected: {incident-description}
       Severity: {P0 | P1 | P2 | P3}

       Follow Incident Response Workflow:

       1. Detection (<5 min):
          - Alert acknowledged
          - Initial severity assessment
          - Create incident channel: #incident-{YYYY-MM-DD}-{ID}

       2. Triage (<15 min):
          - Gather evidence (logs, metrics, user reports)
          - Identify affected systems/users
          - Estimate business impact
          - Engage Component Owners
          - Update severity if needed

       3. Investigation (P0=30min, P1=1h):
          - Review logs/metrics for root cause
          - Check recent deployments/changes
          - Reproduce in non-prod if possible
          - Identify probable root cause

       4. Mitigation (P0=1h, P1=4h):
          - Execute mitigation (rollback/hotfix/config change)
          - Validate effectiveness
          - Monitor for regression

       5. Resolution (P0=2h, P1=8h):
          - Confirm fully resolved
          - Validate SLOs back to normal
          - Close incident

       Document incident timeline and actions

       Save to: .aiwg/deployment/incidents/incident-{ID}.md

       If P0/P1: Schedule post-incident review within 48h
       """
   )
   ```

**Communicate Progress** (daily update):
```
✓ Hypercare Day {N} of {duration}
⏳ Monitoring production stability...
  ✓ SLO compliance: {percentage}% of SLOs met (target: 100%)
  ✓ Incidents (24h): {count} total (P0: {count}, P1: {count}, P2: {count})
  ✓ User adoption: {percentage}% ({trend})
  ✓ Support tickets: {count} (Trend: {↑/→/↓})
✓ Daily reports: .aiwg/deployment/slo-report-{date}.md, user-adoption-{date}.md
```

### Step 3: Conduct Daily Hypercare Standups

**Purpose**: Maintain team alignment, surface issues early, coordinate rapid response

**Your Actions**:

1. **Generate Daily Standup Report** (automated):
   ```
   Task(
       subagent_type="operations-manager",
       description="Generate daily hypercare standup report",
       prompt="""
       Read daily reports:
       - .aiwg/deployment/slo-report-{YYYY-MM-DD}.md
       - .aiwg/deployment/user-adoption-{YYYY-MM-DD}.md
       - .aiwg/deployment/incidents/* (all open/recent incidents)

       Create Daily Standup Agenda:

       ## Hypercare Daily Standup - Day {N} of {duration}

       **Date**: {YYYY-MM-DD}
       **Facilitator**: {Hypercare Lead}

       ### 1. Production Health Review (5 min)
       **Presented by**: Reliability Engineer

       - Availability: {percentage}% (Target: ≥99.9%) - {STATUS}
       - Performance: p95 {value}ms (Target: <{SLA}ms) - {STATUS}
       - Error Rate: {percentage}% (Target: <0.1%) - {STATUS}
       - Error Budget: {percentage}% remaining - {STATUS}

       Overall Health: {GREEN | YELLOW | RED}

       ### 2. Incident Summary (Last 24h) (10 min)
       **Presented by**: On-Call Engineer

       Total Incidents: {count}
       - P0 (Critical): {count} - {list titles if any}
       - P1 (High): {count} - {list titles if any}
       - P2 (Medium): {count}
       - P3 (Low): {count}

       Key Incidents:
       For each P0/P1:
       - Incident-ID: {title}
       - Status: {Open/Resolved/Closed}
       - Impact: {user-count} users, {duration} minutes
       - Root Cause: {brief description}
       - Action Items: {list}

       Patterns/Trends: {emerging issues or recurring problems}

       ### 3. User Feedback Review (5 min)
       **Presented by**: Support Lead

       - Support Tickets (24h): {count} (Trend: {↑/→/↓})
       - Critical User Issues: {count}
       - Top Complaints: {list top 3}
       - Top Praises: {list top 3}
       - Sentiment: {POSITIVE | NEUTRAL | NEGATIVE}

       Blockers for Users: {list critical issues}

       ### 4. SLO/SLI Status (5 min)
       **Presented by**: Reliability Engineer

       | SLO | Target | Current (24h) | Status |
       |-----|--------|---------------|--------|
       | Availability | ≥99.9% | {percentage}% | {✓/⚠/✗} |
       | Response Time | p95<{value}ms | {value}ms | {✓/⚠/✗} |
       | Error Rate | <0.1% | {percentage}% | {✓/⚠/✗} |
       | Throughput | >{value} req/s | {value} req/s | {✓/⚠/✗} |

       ### 5. Action Items and Blockers (5 min)

       Open Action Items:
       | Action | Owner | Due Date | Status |
       |--------|-------|----------|--------|
       {list open actions}

       New Blockers:
       {list blockers requiring escalation}

       Tomorrow's On-Call: {name} (taking over at {HH:MM})

       ---

       Overall Status: {GREEN | YELLOW | RED}

       Key Decisions Made:
       {list decisions from standup}

       New Action Items:
       {list new actions assigned}

       Save to: .aiwg/deployment/hypercare-standup-{YYYY-MM-DD}.md
       """
   )
   ```

2. **Weekly Summary** (if hypercare > 7 days):
   ```
   # On Day 7, 14, etc.:
   Task(
       subagent_type="operations-manager",
       description="Generate weekly hypercare summary",
       prompt="""
       Read all daily standups for week: .aiwg/deployment/hypercare-standup-*.md

       Create Weekly Summary:

       ## Hypercare Week {N} Summary

       **Week**: {date-range}
       **Overall Status**: {GREEN | YELLOW | RED}

       ### Production Stability
       - Availability: {percentage}% (Target: ≥99.9%)
       - Total Incidents: {count} (P0: {count}, P1: {count})
       - MTTR: {value} min
       - SLO Compliance: {percentage}%

       ### User Adoption
       - Active Users: {count} ({+/-percentage}% vs. previous week)
       - Feature Adoption: {percentage}%
       - User Sentiment: {POSITIVE | NEUTRAL | NEGATIVE}

       ### Support Health
       - Support Tickets: {count} ({+/-percentage}% vs. previous week)
       - Critical Issues: {count}
       - Response Time: {value}h (Target: <{SLA}h)

       ### Accomplishments
       {list accomplishments}

       ### Challenges
       {list challenges}

       ### Next Week Focus
       {list focus areas}

       Save to: .aiwg/reports/hypercare-week-{N}-summary.md
       """
   )
   ```

**Communicate Progress**:
```
⏳ Conducting daily standup...
✓ Daily standup report generated: .aiwg/deployment/hypercare-standup-{date}.md
  - Overall Health: {GREEN | YELLOW | RED}
  - Key Decisions: {count}
  - New Action Items: {count}
  - Escalations: {count}
```

### Step 4: Post-Incident Reviews (For P0/P1 Incidents)

**Purpose**: Document root cause and corrective actions for all critical incidents

**Your Actions**:

1. **For Each P0/P1 Incident** (within 48h of resolution):
   ```
   Task(
       subagent_type="reliability-engineer",
       description="Conduct post-incident review for {incident-ID}",
       prompt="""
       Read incident log: .aiwg/deployment/incidents/incident-{ID}.md

       Create Post-Incident Review (PIR):

       ## Post-Incident Review: {Incident-ID}

       **Date**: {YYYY-MM-DD}
       **Severity**: {P0/P1/P2/P3}
       **Duration**: {detection-to-resolution}
       **Impact**: {user-count} users, {downtime-minutes} minutes downtime

       ### Incident Summary
       {1-2 sentence description of what happened}

       ### Timeline
       | Time | Event | Actor |
       |------|-------|-------|
       {incident timeline from detection to resolution}

       ### Root Cause
       {Detailed technical root cause analysis}

       ### Contributing Factors
       1. {Factor 1 - e.g., insufficient testing}
       2. {Factor 2 - e.g., monitoring gap}
       3. {Factor 3 - e.g., unclear runbook}

       ### Corrective Actions
       | Action | Owner | Due Date | Status |
       |--------|-------|----------|--------|
       {list corrective actions to prevent recurrence}

       ### Lessons Learned
       - What went well: {list}
       - What could improve: {list}
       - Process changes needed: {list}

       Save to: .aiwg/deployment/incidents/pir-{ID}.md

       Update incident log with PIR link
       Track corrective actions in action tracker
       """
   )
   ```

**Communicate Progress**:
```
⏳ Conducting post-incident reviews...
✓ PIR complete: Incident-{ID} ({title})
  - Root cause: {summary}
  - Corrective actions: {count} assigned
  - Status: Tracking to completion
```

### Step 5: Validate Exit Criteria and Generate Hypercare Report

**Purpose**: Ensure production is stable and support team is ready before ending hypercare

**Your Actions**:

1. **Validate Exit Criteria** (on final day or when user requests):
   ```
   Task(
       subagent_type="operations-manager",
       description="Validate hypercare exit criteria",
       prompt="""
       Read all hypercare artifacts

       Validate Hypercare Exit Criteria:

       ## Hypercare Exit Criteria Validation

       **Hypercare Period**: Day {N} of {duration}
       **Validation Date**: {YYYY-MM-DD}

       ### Production Stability
       - [ ] Zero P0 (Critical) incidents in last 48 hours
       - [ ] Zero P1 (High) incidents in last 24 hours
       - [ ] All SLOs met for 72 consecutive hours
         - [ ] Availability ≥99.9%
         - [ ] Response time p95 <{SLA}ms
         - [ ] Error rate <0.1%
         - [ ] Throughput >{target} req/s
       - [ ] Error budget healthy: >{percentage}% remaining
       - [ ] No open P0/P1 incidents

       ### User Adoption
       - [ ] User adoption trending positive ({percentage}% growth)
       - [ ] Feature adoption >{target}% for critical features
       - [ ] User sentiment majority positive (≥70%)
       - [ ] Support ticket volume stable or decreasing
       - [ ] No critical user blockers unresolved

       ### Support Readiness
       - [ ] Support team trained and confident
       - [ ] Runbooks validated (all common issues documented)
       - [ ] Escalation paths tested and effective
       - [ ] Knowledge base updated with hypercare learnings
       - [ ] On-call rotation transitioned to standard support

       ### Documentation Complete
       - [ ] Hypercare report completed
       - [ ] Post-incident reviews completed (all P0/P1)
       - [ ] Corrective actions tracked (assigned, due dates set)
       - [ ] Lessons learned documented
       - [ ] Runbooks updated

       Overall Exit Criteria Status: {PASS | CONDITIONAL | FAIL}
       Decision: {END HYPERCARE | EXTEND HYPERCARE | ESCALATE}

       Save to: .aiwg/reports/hypercare-exit-criteria.md
       """
   )
   ```

2. **Generate Hypercare Exit Report** (comprehensive final report):
   ```
   Task(
       subagent_type="operations-manager",
       description="Generate comprehensive hypercare exit report",
       prompt="""
       Read all hypercare artifacts:
       - .aiwg/deployment/hypercare-team-roster.md
       - .aiwg/deployment/slo-report-*.md (all days)
       - .aiwg/deployment/user-adoption-*.md (all days)
       - .aiwg/deployment/hypercare-standup-*.md (all days)
       - .aiwg/deployment/incidents/*.md (all incidents)
       - .aiwg/reports/hypercare-exit-criteria.md

       Generate Hypercare Exit Report:

       # Hypercare Report: {Project-Name}

       **Hypercare Period**: {start-date} to {end-date} ({duration} days)
       **Report Date**: {YYYY-MM-DD}
       **Report Author**: {Hypercare Lead}

       ## Executive Summary

       {2-3 sentence summary of hypercare outcomes}

       Overall Status: {SUCCESS | SUCCESS WITH CONDITIONS | CHALLENGES}

       Key Metrics:
       - Availability: {percentage}%
       - Total Incidents: {count} (P0: {count}, P1: {count})
       - User Adoption: {percentage}%
       - Support Tickets: {count}

       ## Production Stability Summary

       ### SLO Performance
       | SLO | Target | Achieved | Status |
       |-----|--------|----------|--------|
       {SLO compliance table}

       SLO Compliance Rate: {percentage}%

       ### Incident Summary
       Total Incidents: {count}
       - P0 (Critical): {count}
       - P1 (High): {count}
       - P2 (Medium): {count}
       - P3 (Low): {count}

       Key Metrics:
       - MTTD (Mean Time to Detect): {value} min
       - MTTA (Mean Time to Acknowledge): {value} min
       - MTTR (Mean Time to Resolve): {value} min

       Major Incidents:
       For each P0/P1:
       - Incident-ID: {title}
         - Date, Duration, Impact, Root Cause, Resolution
         - Corrective Actions: {count} assigned

       ### Performance Trends
       - Response Time: {IMPROVED | STABLE | DEGRADED} ({+/-percentage}% vs. pre-deployment)
       - Error Rate: {IMPROVED | STABLE | DEGRADED}
       - Resource Utilization: {HEALTHY | CONCERNING}

       ## User Adoption Summary

       ### Adoption Metrics
       - Active Users: {count} ({+/-percentage}% vs. pre-deployment)
       - Feature Adoption: {percentage}% (Target: >{target}%)
       - User Retention (Day 14): {percentage}%

       ### User Feedback
       - Total Feedback Items: {count}
       - Sentiment: {percentage}% positive
       - Net Promoter Score: {value}

       Top Praises: {list top 3}
       Top Complaints: {list top 3 with resolution status}

       ## Support Summary

       ### Ticket Volume
       - Total Support Tickets: {count}
       - Daily Average: {count} tickets/day
       - Trend: {DECREASING | STABLE | INCREASING}

       ### Support Performance
       - Average Response Time: {value}h (Target: <{SLA}h) - {✓/⚠/✗}
       - First Contact Resolution: {percentage}%

       ### Support Team Readiness
       - Team Confidence Level: {HIGH | MEDIUM | LOW}
       - Runbook Completeness: {percentage}%

       ## Lessons Learned

       ### What Went Well
       {list successes}

       ### What Could Improve
       {list improvements}

       ### Process Recommendations
       {list recommendations for future deployments}

       ## Corrective Actions

       Total Actions Identified: {count}

       | Action | Category | Owner | Due Date | Status |
       |--------|----------|-------|----------|--------|
       {corrective actions table}

       ## Handover to Standard Support

       ### Transition Plan
       - [ ] Standard on-call rotation activated (starting {date})
       - [ ] Support runbooks transferred
       - [ ] Knowledge base published
       - [ ] Support team training complete
       - [ ] Escalation paths updated for BAU

       ### Post-Hypercare Monitoring
       - Duration: {duration} days continued close monitoring
       - Responsible: {Support Lead}
       - Review Cadence: Weekly check-ins for {duration} weeks

       ## Conclusion

       {2-3 sentence summary and readiness for standard support}

       Recommendation: {END HYPERCARE | EXTEND HYPERCARE}

       Signoff:
       - Hypercare Lead: {name} - {date}
       - Reliability Engineer: {name} - {date}
       - Support Lead: {name} - {date}
       - Product Owner: {name} - {date}
       - Project Manager: {name} - {date}

       Save to: .aiwg/reports/hypercare-exit-report.md
       """
   )
   ```

3. **Present Exit Summary to User**:
   ```
   # You present this directly (not via agent)

   Read .aiwg/reports/hypercare-exit-report.md

   Present summary:
   ─────────────────────────────────────────────
   Hypercare Monitoring Period Complete
   ─────────────────────────────────────────────

   **Hypercare Period**: {start-date} to {end-date} ({duration} days)
   **Overall Status**: {SUCCESS | SUCCESS WITH CONDITIONS | CHALLENGES}

   **Key Metrics**:
   ✓ Availability: {percentage}% (Target: ≥99.9%)
   ✓ Total Incidents: {count} (P0: {count}, P1: {count})
   ✓ User Adoption: {percentage}% of target
   ✓ Support Readiness: Team confident and ready

   **Exit Criteria Status**:
   ✓ Production Stability: {PASS | CONDITIONAL | FAIL}
   ✓ User Adoption: {PASS | CONDITIONAL | FAIL}
   ✓ Support Readiness: {PASS | CONDITIONAL | FAIL}
   ✓ Documentation: {PASS | CONDITIONAL | FAIL}

   **Decision**: {END HYPERCARE | EXTEND HYPERCARE | ESCALATE}

   **Artifacts Generated**:
   - Hypercare Team Roster (.aiwg/deployment/hypercare-team-roster.md)
   - Production Dashboard Config (.aiwg/deployment/production-dashboard-config.md)
   - Alert Escalation Matrix (.aiwg/deployment/alert-escalation-matrix.md)
   - Daily Standup Reports (.aiwg/deployment/hypercare-standup-*.md, {count} files)
   - SLO Reports (.aiwg/deployment/slo-report-*.md, {count} files)
   - User Adoption Reports (.aiwg/deployment/user-adoption-*.md, {count} files)
   - Incident Logs (.aiwg/deployment/incidents/*.md, {count} files)
   - Post-Incident Reviews (.aiwg/deployment/incidents/pir-*.md, {count} files)
   - Hypercare Exit Report (.aiwg/reports/hypercare-exit-report.md)

   **Next Steps**:
   - Review hypercare exit report with stakeholders
   - Obtain formal signoffs (5 required signatures)
   - If END HYPERCARE: Transition to standard support (run handoff workflow)
   - If EXTEND HYPERCARE: Address gaps, continue monitoring
   - If ESCALATE: Executive decision required

   **Transition to Standard Support**:
   - Standard on-call rotation activated: {date}
   - Continued monitoring period: {duration} days
   - Weekly check-ins scheduled

   ─────────────────────────────────────────────
   ```

**Communicate Progress**:
```
⏳ Validating hypercare exit criteria...
✓ Exit criteria validated: {PASS | CONDITIONAL | FAIL}
✓ Hypercare Exit Report generated: .aiwg/reports/hypercare-exit-report.md
✓ Transition plan documented
```

## Quality Gates

Before marking workflow complete, verify:
- [ ] Hypercare team established with 24/7 coverage
- [ ] Production monitoring operational (dashboards, alerts)
- [ ] Daily standups conducted and documented
- [ ] All P0/P1 incidents have post-incident reviews
- [ ] SLO compliance tracked daily
- [ ] User adoption monitored and reported
- [ ] Exit criteria validated
- [ ] Hypercare exit report complete and approved
- [ ] Transition to standard support planned

## User Communication

**At start**: Confirm understanding and list activities

```
Understood. I'll orchestrate the hypercare monitoring period.

Hypercare Duration: {duration} days
Hypercare Period: {start-date} to {estimated-end-date}

This will establish:
- Hypercare team roster and 24/7 on-call rotation
- Production health monitoring dashboards
- Alert escalation and incident response procedures
- Daily standup coordination
- SLO tracking and user adoption monitoring
- Post-incident review process
- Hypercare exit criteria validation

I'll coordinate multiple agents for comprehensive monitoring and support.
Expected setup: 20-30 minutes.

Starting orchestration...
```

**During**: Update progress with clear indicators

```
✓ = Complete
⏳ = In progress
❌ = Error/blocked
⚠️ = Warning/attention needed
```

**Daily**: Provide daily status summary

```
Hypercare Day {N} of {duration}: {GREEN | YELLOW | RED}

Production Health:
✓ Availability: {percentage}% (Target: ≥99.9%)
✓ Performance: p95 {value}ms (Target: <{SLA}ms)
{⚠️ | ✓} Error Rate: {percentage}% (Target: <0.1%)

Incidents (24h):
- P0: {count}
- P1: {count}
- P2: {count}

User Adoption: {percentage}% ({trend})

Daily reports: .aiwg/deployment/hypercare-standup-{date}.md
```

**At end**: Summary report (see Step 5.3 above)

## Error Handling

**If P0 Incident During Hypercare**:
```
❌ Critical incident detected - immediate response initiated

Incident: {incident-ID} - {title}
Severity: P0 (Complete outage / Data loss / Security breach)
Impact: {user-count} users affected

Actions:
1. On-call engineer + Hypercare Lead paged
2. Incident war room created: #incident-{date}-{ID}
3. Executive Sponsor notified
4. Status page updated

Response Timeline:
- Detection: {timestamp}
- Acknowledgment: {timestamp} (Target: Immediate)
- Time-to-engage: {minutes} min (Target: <15 min)

Current Status: {INVESTIGATING | MITIGATING | RESOLVED}

Impact on Exit Criteria: P0 incident resets 48h "zero critical incidents" requirement

Monitoring incident response...
```

**If SLO Breach**:
```
⚠️ SLO breach detected - immediate investigation required

SLO Breached: {SLO-name}
- Target: {target-value}
- Current: {actual-value}
- Duration: {duration} (continuous breach)

Impact:
- Error budget consumed: {percentage}%
- User impact: {description}

Actions:
1. Reliability Engineer investigating root cause
2. Metrics and logs under review
3. Mitigation plan in progress

If breach persists >24h: Recommend extending hypercare period
If error budget critically low: Recommend incident freeze

Monitoring for improvement...
```

**If User Adoption Low**:
```
⚠️ User adoption below target

Current Adoption: {percentage}% (Target: >{target}%)
Gap: {percentage} points

Analysis:
- Top User Issues: {list issues}
- Support Ticket Themes: {list themes}
- Potential Blockers: {list blockers}

Actions:
1. Product Owner engaged for adoption analysis
2. Support team reviewing common user issues
3. Documentation and training gaps identified

Decision Point:
- If blockers identified: Prioritize fixes, may extend hypercare
- If education needed: Launch awareness campaign
- If feature not valuable: Escalate to stakeholders

Impact on Exit Criteria: User adoption trend must improve before exit approval
```

**If Support Team Overwhelmed**:
```
⚠️ Support team capacity exceeded

Support Volume: {count} tickets/day (Capacity: {capacity})
Team Status: {STRESSED | OVERWHELMED}

Root Cause Analysis:
- Top Issue Categories: {list categories with counts}
- Product Bugs vs User Education: {ratio}

Immediate Relief Actions:
1. Additional support staff brought in (temp)
2. Engineering team handling overflow tickets
3. Workarounds created for top issues
4. FAQ and self-service guides published

Mitigation:
- Deploy hotfixes for high-frequency bugs
- Update documentation for common questions
- Additional training sessions scheduled

Impact on Exit Criteria: Support team must be confident and staffed before exit
```

**If Exit Criteria Not Met**:
```
⚠️ Hypercare exit criteria not met - extension recommended

Exit Criteria Status: {FAIL | CONDITIONAL}

Gaps Identified:
{list unmet criteria with details}

Recommendation: {EXTEND HYPERCARE | CONDITIONAL EXIT | ESCALATE}

Extension Plan:
- Additional Duration: {days} days
- Focus Areas: {list areas needing improvement}
- Re-validation Date: {date}

Escalating to user for decision...
```

## Success Criteria

This orchestration succeeds when:
- [ ] Hypercare team established with 24/7 coverage
- [ ] Production monitoring operational (dashboards, alerts, SLO tracking)
- [ ] Daily standups conducted for entire hypercare period
- [ ] All incidents documented and P0/P1 incidents have PIRs
- [ ] SLO compliance ≥95% (all SLOs met for 72+ consecutive hours)
- [ ] Zero P0/P1 incidents in final 48/24 hours
- [ ] User adoption trending positive (≥target%)
- [ ] Support team ready and confident
- [ ] Exit criteria validated: PASS or CONDITIONAL PASS
- [ ] Hypercare exit report complete and approved
- [ ] Transition to standard support planned and executed

## Metrics to Track

**During orchestration, track**:
- SLO compliance rate: % of SLOs met (target: 100% for 72h before exit)
- Incident frequency: # of P0/P1/P2/P3 incidents (target: P0/P1 = 0 in final 48/24h)
- Mean time to detect (MTTD): Minutes from incident to detection (target: <5 min)
- Mean time to resolve (MTTR): Minutes from detection to resolution (target: P0 <120 min, P1 <480 min)
- Error budget burn rate: % of monthly budget consumed (target: <50% during hypercare)
- User adoption rate: % of target users actively engaged (target: ≥70%)
- Support ticket volume: # of tickets/day (target: decreasing trend)
- Support response time: Hours to first response (target: <SLA)

## References

**Templates** (via $AIWG_ROOT):
- Operational Readiness Review: `templates/deployment/operational-readiness-review-template.md`
- SLO/SLI Definition: `templates/deployment/slo-sli-template.md`
- Incident Response Runbook: `templates/support/incident-response-runbook-template.md`
- Support Plan: `templates/support/support-plan-template.md`

**Related Flows**:
- Gate Check: `commands/flow-gate-check.md`
- Handoff Checklist: `commands/flow-handoff-checklist.md`
- Deployment Workflow: `commands/flow-deployment-workflow.md`

**SDLC Phase Context**:
- Phase: Transition (Deployment → Operations)
- Milestone: Hypercare Complete (transition to BAU support)
Get flow-hypercare-monitoring.

vz-bench-debug

vz-scrape-runner

Think you can beat it?