---
name: hadr-health-review
description: Analyzes sys.dm_hadr_* DMV output to assess Always On Availability Group replica health, synchronization state, secondary lag, redo and log send queue sizes, and configuration gaps. Use this skill when an availability group is behaving unexpectedly, a secondary replica is lagging, data loss is a concern, or you need a SQL-side snapshot of AG health to complement CLUSTER.LOG and ERRORLOG diagnostics. Applies 22 checks (H1–H22) covering replica connectivity, data loss risk, recovery time, throughput, and configuration.
triggers:
- /hadr-health-review
- /hadr-review
---
# SQL Server Always On AG Health Review Skill
## Purpose
Analyze output from the `sys.dm_hadr_*` DMV family to assess the health of one or more
Always On Availability Groups. Applies 22 checks (H1–H22) across four categories:
- **H1–H6** — Replica connectivity and role: detect disconnected replicas, resolving state,
unhealthy synchronization health, replicas not synchronizing, last-connect errors, and
failover mode mismatches
- **H7–H11** — Data loss and recovery time: flag estimated data loss, excessive recovery time,
secondary lag, redo queue buildup, and log send queue buildup
- **H12–H16** — Throughput and performance: detect stalled redo rate, stalled log send rate,
rate mismatch causing queue accumulation, multiple databases lagging on the same replica,
and commit latency signals on sync-commit replicas
- **H17–H22** — Configuration: async replica in unexpected position, no automatic failover
replica, single-replica AG, missing listener, read-only routing not configured, and
automatic seeding in progress
## Input
Accept any of:
- **File path** — path to a saved text/CSV file containing the DMV query output
- **Inline paste** — DMV result grid pasted directly into chat (tab- or pipe-delimited)
- **Natural language description** — description of AG symptoms ("secondary is 90 seconds
behind", "replica shows NOT_HEALTHY")
### Capture Query
Run the following on the primary replica to collect the required columns:
```sql
SELECT
    ag.name AS ag_name,
    ar.replica_server_name,
    ar.availability_mode_desc,
    ar.failover_mode_desc,
    ars.role_desc,
    ars.connected_state_desc,
    ars.synchronization_health_desc,
    ars.last_connect_error_number,
    ars.last_connect_error_description,
    DB_NAME(drs.database_id) AS database_name,
    drs.synchronization_state_desc,
    drs.synchronization_health_desc AS db_sync_health,
    drs.log_send_queue_size,
    drs.log_send_rate,
    drs.redo_queue_size,
    drs.redo_rate,
    drs.secondary_lag_seconds,
    -- sys.dm_hadr_database_replica_states exposes no estimated_* columns;
    -- derive data loss as the commit-time gap versus the primary row
    DATEDIFF(SECOND, drs.last_commit_time, pdrs.last_commit_time)
        AS estimated_data_loss_seconds,
    -- derive recovery time as the redo queue drained at the current redo rate
    drs.redo_queue_size / NULLIF(drs.redo_rate, 0)
        AS estimated_recovery_time_seconds
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar
    ON ag.group_id = ar.group_id
JOIN sys.dm_hadr_availability_replica_states ars
    ON ar.replica_id = ars.replica_id
JOIN sys.dm_hadr_database_replica_states drs
    ON ar.replica_id = drs.replica_id
LEFT JOIN sys.dm_hadr_database_replica_states pdrs
    ON drs.group_database_id = pdrs.group_database_id
   AND pdrs.is_primary_replica = 1
ORDER BY ar.replica_server_name, database_name;
```
Also capture listener and routing configuration for H20–H21. Use outer joins so replica
routing rows still appear when no listener exists (H20 fires on a NULL `dns_name`):
```sql
SELECT ag.name AS ag_name, agl.dns_name, agl.port,
       aglip.ip_address, aglip.ip_subnet_mask,
       r.replica_server_name, r.read_only_routing_url
FROM sys.availability_groups ag
LEFT JOIN sys.availability_group_listeners agl
    ON ag.group_id = agl.group_id
LEFT JOIN sys.availability_group_listener_ip_addresses aglip
    ON agl.listener_id = aglip.listener_id
JOIN sys.availability_replicas r
    ON ag.group_id = r.group_id;
```
### Column Reference
| Column | Source DMV | Notes |
|--------|-----------|-------|
| `connected_state_desc` | `dm_hadr_availability_replica_states` | CONNECTED or DISCONNECTED |
| `role_desc` | `dm_hadr_availability_replica_states` | PRIMARY, SECONDARY, RESOLVING |
| `synchronization_health_desc` (replica) | `dm_hadr_availability_replica_states` | NOT_HEALTHY, PARTIALLY_HEALTHY, HEALTHY |
| `last_connect_error_number` | `dm_hadr_availability_replica_states` | 0 = no error |
| `last_connect_error_description` | `dm_hadr_availability_replica_states` | Error text when non-zero |
| `availability_mode_desc` | `sys.availability_replicas` | SYNCHRONOUS_COMMIT or ASYNCHRONOUS_COMMIT |
| `failover_mode_desc` | `sys.availability_replicas` | AUTOMATIC or MANUAL |
| `synchronization_state_desc` | `dm_hadr_database_replica_states` | NOT SYNCHRONIZING, SYNCHRONIZING, SYNCHRONIZED (INITIALIZING/REVERTING during transitions) |
| `db_sync_health` | `dm_hadr_database_replica_states` | NOT_HEALTHY, PARTIALLY_HEALTHY, HEALTHY |
| `log_send_queue_size` | `dm_hadr_database_replica_states` | KB of log not yet sent to secondary |
| `log_send_rate` | `dm_hadr_database_replica_states` | KB/s sent to secondary (0 = stalled) |
| `redo_queue_size` | `dm_hadr_database_replica_states` | KB of log received but not yet redone |
| `redo_rate` | `dm_hadr_database_replica_states` | KB/s being redone on secondary (0 = stalled) |
| `secondary_lag_seconds` | `dm_hadr_database_replica_states` | Seconds secondary is behind primary |
| `estimated_data_loss_seconds` | derived in capture query | Commit-time gap between primary and secondary rows; potential data loss if primary fails now |
| `estimated_recovery_time_seconds` | derived in capture query | `redo_queue_size / redo_rate`; seconds to redo queued log after failover |
---
## Thresholds Reference
| Threshold | Value | Used by |
|-----------|-------|---------|
| Estimated data loss | >30 sec → Critical; >5 sec → Warning | H7 |
| Estimated recovery time | >300 sec → Warning | H8 |
| Secondary lag | >60 sec → Critical; >10 sec → Warning | H9 |
| Redo queue size | >500 MB → Critical; >100 MB → Warning | H10 |
| Log send queue size | >500 MB → Warning | H11 |
| Multiple databases lagging | ≥3 databases with secondary_lag_seconds >10 sec on same replica → Critical | H15 |
---
## Category 1 — Replica Connectivity and Role (H1–H6)
Evaluate these first. A disconnected or resolving replica supersedes all other findings.
### H1 — Replica Disconnected
- **Trigger:** `connected_state_desc = DISCONNECTED` for any replica row
- **Severity:** Critical
- **Fix:** Check network connectivity between the primary and the disconnected node. Review
`last_connect_error_description` for the specific failure. Inspect CLUSTER.LOG on the
Windows Server Failover Cluster node for eviction or network partition events. Confirm the
SQL Server service is running on the target node.
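A focused query for this check, run on the primary (a sketch using the same DMVs as the
capture query):
```sql
-- List disconnected replicas with their last recorded connection error
SELECT ar.replica_server_name,
       ars.role_desc,
       ars.last_connect_error_number,
       ars.last_connect_error_description,
       ars.last_connect_error_timestamp
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar
    ON ars.replica_id = ar.replica_id
WHERE ars.connected_state_desc = 'DISCONNECTED';
```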
### H2 — Replica in Resolving State
- **Trigger:** `role_desc = RESOLVING` for any replica row
- **Severity:** Critical
- **Fix:** A replica in RESOLVING state has lost quorum contact or its role cannot be determined.
Check WSFC quorum health in Failover Cluster Manager. If this is a planned failover in
progress, wait for it to complete. If unplanned, investigate CLUSTER.LOG for quorum loss.
### H3 — Synchronization Unhealthy at Replica Level
- **Trigger:** `synchronization_health_desc = NOT_HEALTHY` on a replica row
- **Severity:** Critical
- **Fix:** At least one database on this replica is not synchronizing. Drill into
`db_sync_health` per database to identify which database is unhealthy (H4 will co-fire).
Check the SQL Server ERRORLOG on the secondary for hadr_work_queue or transport errors.
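To drill from the replica-level signal down to the offending databases, one option is
(`suspend_reason_desc` explains a suspended data movement state):
```sql
-- Identify which databases on the unhealthy replica are not healthy
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc,
       drs.is_suspended,
       drs.suspend_reason_desc
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
WHERE drs.synchronization_health_desc <> 'HEALTHY';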
### H4 — Replica Not Synchronizing (Sync-Commit)
- **Trigger:** `synchronization_state_desc = NOT SYNCHRONIZING` AND `availability_mode_desc
= SYNCHRONOUS_COMMIT`
- **Severity:** Critical
- **Fix:** A synchronous-commit secondary that is not synchronizing blocks commits on the
primary (the primary waits for acknowledgement). Restart HADR transport if the secondary
is otherwise healthy: `ALTER DATABASE [db] SET HADR RESUME`. Check for full transaction
log on the secondary — a full log halts redo and breaks synchronization.
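A query to surface exactly the rows this check fires on (sketch; run on the primary):
```sql
-- Sync-commit databases that are not synchronizing (these block primary commits)
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,
       drs.is_suspended,
       drs.suspend_reason_desc
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
WHERE ar.availability_mode_desc = 'SYNCHRONOUS_COMMIT'
  AND drs.synchronization_state_desc = 'NOT SYNCHRONIZING';
```
If `is_suspended = 1`, resume data movement on the secondary as described above.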
### H5 — Last Connect Error Present
- **Trigger:** `last_connect_error_number != 0`
- **Severity:** Warning
- **Fix:** A past connection failure was recorded. The replica may have recovered, but the
error reveals prior instability. Review `last_connect_error_description` for the error
text. Common causes: endpoint certificate expiry, firewall change, or network blip. Rotate
certificates if the error mentions authentication or certificate issues.
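When the error text points at certificates, a quick expiry check on each node (sketch;
`certificate_id` is 0 when the endpoint uses Windows authentication, so this only returns
rows for certificate-authenticated endpoints):
```sql
-- Check expiry of the certificate backing the HADR endpoint (run in master)
SELECT e.name AS endpoint_name,
       c.name AS certificate_name,
       c.expiry_date
FROM sys.database_mirroring_endpoints e
JOIN master.sys.certificates c
    ON e.certificate_id = c.certificate_id;
```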
### H6 — Manual Failover Mode on Sync-Commit Replica
- **Trigger:** `failover_mode_desc = MANUAL` AND `availability_mode_desc = SYNCHRONOUS_COMMIT`
- **Severity:** Warning
- **Fix:** A synchronous-commit replica configured for manual failover only will not
automatically protect against primary failure. If automatic protection is intended, change
to `AUTOMATIC` failover mode: `ALTER AVAILABILITY GROUP [ag] MODIFY REPLICA ON
N'server' WITH (FAILOVER_MODE = AUTOMATIC)`. Verify WSFC quorum can support automatic
failover before making this change.
---
## Category 2 — Data Loss and Recovery Time (H7–H11)
These checks quantify the risk of data loss and the time to recover if the primary fails.
### H7 — Estimated Data Loss
- **Trigger:** `estimated_data_loss_seconds` exceeds the data loss threshold (see Thresholds
Reference)
- **Severity:** Critical if >30 sec; Warning if >5 sec
- **Fix:** The log has not been hardened on the secondary within the threshold window. For
sync-commit replicas, this indicates the synchronization is stalled (see H4). For async
replicas, consider increasing log send rate, improving network bandwidth, or accepting the
RPO by switching a critical database to sync-commit. If the value is consistently high,
evaluate whether the secondary has sufficient I/O to keep up with redo.
### H8 — Estimated Recovery Time
- **Trigger:** `estimated_recovery_time_seconds` exceeds the recovery time threshold (see
Thresholds Reference)
- **Severity:** Warning
- **Fix:** After a failover, it will take longer than the threshold to redo the queued log on
the secondary before it opens for reads or promotes to primary. Reduce redo queue size (H10)
to reduce recovery time. Check secondary disk I/O — redo is sequential log apply and is
bounded by disk write throughput. Evaluate whether this RTO is acceptable for the SLA.
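The recovery-time estimate can be approximated directly from the queue and rate columns
(sketch; values are cumulative DMV snapshots, so treat a single reading as indicative):
```sql
-- Approximate per-database recovery time: redo queue drained at the current redo rate
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.redo_queue_size,   -- KB queued
       drs.redo_rate,         -- KB/s applied
       drs.redo_queue_size / NULLIF(drs.redo_rate, 0) AS est_recovery_seconds
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
WHERE drs.is_primary_replica = 0
ORDER BY est_recovery_seconds DESC;
```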
### H9 — Secondary Lag
- **Trigger:** `secondary_lag_seconds` exceeds the lag threshold (see Thresholds Reference)
- **Severity:** Critical if >60 sec; Warning if >10 sec
- **Fix:** The secondary is behind the primary. For async replicas, check `log_send_rate`
(H13) — if zero, log is not being sent. For sync replicas, lag indicates the primary is
waiting on acknowledgement. Check network latency between primary and secondary. On the
secondary, check for I/O bottlenecks limiting redo throughput (`redo_rate`, H12). If
secondary_lag_seconds equals estimated_data_loss_seconds, the lag is entirely in the
send queue; if recovery time is also high, redo is behind as well.
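A query that splits the lag into its send-side and redo-side components (sketch; the 10-second
filter mirrors the Warning threshold):
```sql
-- Locate where the lag lives: unsent log vs received-but-not-redone log
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.secondary_lag_seconds,
       drs.log_send_queue_size,   -- KB not yet sent (send-side lag)
       drs.redo_queue_size        -- KB received but not redone (redo-side lag)
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
WHERE drs.secondary_lag_seconds > 10
ORDER BY drs.secondary_lag_seconds DESC;
```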
### H10 — Redo Queue Buildup
- **Trigger:** `redo_queue_size` exceeds the redo queue threshold (see Thresholds Reference)
- **Severity:** Critical if >500 MB; Warning if >100 MB
- **Fix:** Log records are arriving on the secondary faster than they are being redone. The
secondary's redo thread cannot keep up. Check secondary disk write latency — redo is
bottlenecked on sequential log writes to the data files. Consider increasing secondary
storage throughput (SSD, faster controller). Check for long-running transactions on the
secondary blocking redo (readable secondary scenario). Verify `redo_rate` > 0 (see H12).
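To check the disk-latency hypothesis, file-level I/O stall data is available on the secondary
(sketch; these counters are cumulative since instance startup, so sample twice and compare
deltas to measure current latency):
```sql
-- Run on the secondary: average write latency per database file
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_writes,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
JOIN sys.master_files mf
    ON vfs.database_id = mf.database_id
   AND vfs.file_id = mf.file_id
ORDER BY avg_write_ms DESC;
```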
### H11 — Log Send Queue Buildup
- **Trigger:** `log_send_queue_size` exceeds the send queue threshold (see Thresholds
Reference)
- **Severity:** Warning
- **Fix:** Log generated on the primary has not been sent to the secondary. Check network
bandwidth between primary and secondary. High `log_send_queue_size` with `log_send_rate`
= 0 (see H13) indicates a stalled transport — check endpoint connectivity. High
`log_send_queue_size` with nonzero `log_send_rate` indicates network saturation or burst
log generation outpacing the link.
---
## Category 3 — Throughput and Performance (H12–H16)
These checks detect stalled or mismatched throughput that will cause queues to grow.
### H12 — Zero Redo Rate on Synchronizing Database
- **Trigger:** `redo_rate = 0` AND `synchronization_state_desc = SYNCHRONIZING` AND
`redo_queue_size > 0`
- **Severity:** Warning
- **Fix:** The redo thread has stalled despite queued log. Common causes: (1) long-running
read query on a readable secondary holding a lock that blocks redo; (2) the secondary
database is in a transitional state — check ERRORLOG; (3) redo thread has encountered an
error — check `dm_hadr_database_replica_states.last_redone_lsn` for progress. Restarting
HADR on the secondary (`ALTER DATABASE [db] SET HADR SUSPEND / RESUME`) can clear
transient stalls.
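To test cause (1), look for blocking chains on the readable secondary (a generic sketch; a
long-running reader appearing as a `blocking_session_id` for a suspended system request is
the signature of a redo block):
```sql
-- Run on the readable secondary: sessions involved in blocking
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       r.command,
       DB_NAME(r.database_id) AS database_name
FROM sys.dm_exec_requests r
WHERE r.blocking_session_id <> 0
   OR r.session_id IN (SELECT blocking_session_id
                       FROM sys.dm_exec_requests
                       WHERE blocking_session_id <> 0);
```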
### H13 — Zero Log Send Rate with Non-Empty Send Queue
- **Trigger:** `log_send_rate = 0` AND `log_send_queue_size > 0`
- **Severity:** Warning
- **Fix:** Log is queued but not being sent. The HADR transport thread has stalled. Check
endpoint health: `SELECT * FROM sys.dm_hadr_availability_replica_states WHERE
connected_state_desc = 'DISCONNECTED'`. Verify the database mirroring endpoint is
running: `SELECT state_desc FROM sys.database_mirroring_endpoints`. Restart the endpoint
if necessary: `ALTER ENDPOINT [Hadr_endpoint] STATE = STOPPED; ALTER ENDPOINT
[Hadr_endpoint] STATE = STARTED`.
### H14 — Redo Rate / Send Rate Mismatch
- **Trigger:** `log_send_rate > 0` AND `redo_rate > 0` AND `redo_queue_size` is growing
(redo_rate significantly less than log_send_rate, such that the queue accumulates)
- **Severity:** Warning
- **Fix:** Log is being sent faster than the secondary can redo it, causing redo queue
growth. The bottleneck is secondary redo throughput, not the network. Investigate secondary
disk I/O latency. Check whether readable secondary workloads (reporting queries) are
competing with redo for I/O. Consider dedicated storage for secondary data files.
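The trigger condition can be expressed directly against the DMV (sketch; a single snapshot
shows the rate gap, while queue growth requires comparing two captures over time):
```sql
-- Databases where log is arriving faster than it is being redone
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.log_send_rate,    -- KB/s arriving at the secondary
       drs.redo_rate,        -- KB/s applied by redo
       drs.redo_queue_size   -- KB accumulated
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
WHERE drs.log_send_rate > 0
  AND drs.redo_rate > 0
  AND drs.log_send_rate > drs.redo_rate
  AND drs.redo_queue_size > 0;
```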
### H15 — Multiple Databases Lagging on Same Replica
- **Trigger:** ≥3 databases on the same replica have `secondary_lag_seconds` exceeding the
multiple-database lag threshold (see Thresholds Reference)
- **Severity:** Critical
- **Fix:** When multiple databases lag simultaneously, the root cause is at the replica level,
not per-database. Check overall secondary node health: CPU, memory, and disk I/O. A
saturated secondary node falls behind across all databases at once. Also check CLUSTER.LOG
for node-level resource pressure. Investigate whether a single database with large
transactions is monopolizing redo threads.
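The trigger aggregation, as a query (sketch; thresholds match the Thresholds Reference):
```sql
-- Replicas with three or more databases lagging beyond 10 seconds
SELECT ar.replica_server_name,
       COUNT(*) AS lagging_databases,
       MAX(drs.secondary_lag_seconds) AS worst_lag_seconds
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
WHERE drs.secondary_lag_seconds > 10
GROUP BY ar.replica_server_name
HAVING COUNT(*) >= 3;
```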
### H16 — Commit Latency Signal on Sync-Commit Replica
- **Trigger:** `availability_mode_desc = SYNCHRONOUS_COMMIT` AND
`synchronization_state_desc = SYNCHRONIZING` (database not yet SYNCHRONIZED, indicating
the sync is in progress but not complete, potentially stalling primary commits)
- **Severity:** Warning
- **Fix:** Primary commits wait for the synchronous secondary to harden the log before
acknowledging. While SYNCHRONIZING is normal during catchup, a sync-commit secondary that
remains SYNCHRONIZING for an extended period adds latency to every primary transaction.
Check `estimated_data_loss_seconds` and `secondary_lag_seconds` to quantify the stall.
If the secondary is persistently SYNCHRONIZING, investigate redo and send queue (H10, H11).
---
## Category 4 — Configuration (H17–H22)
These checks surface AG topology gaps that may not cause immediate problems but increase risk.
### H17 — Async Replica in Sync-Expected Position
- **Trigger:** `availability_mode_desc = ASYNCHRONOUS_COMMIT` on a replica that is the
only secondary in the AG, or is designated as the DR target in a two-replica topology
- **Severity:** Info
- **Fix:** An async-commit secondary provides no data-loss protection for synchronous RPO
requirements. If the topology intends zero data loss, change the replica to
SYNCHRONOUS_COMMIT: `ALTER AVAILABILITY GROUP [ag] MODIFY REPLICA ON N'server' WITH
(AVAILABILITY_MODE = SYNCHRONOUS_COMMIT)`. Verify the network and I/O can sustain the
additional commit latency before switching.
### H18 — No Automatic Failover Replica
- **Trigger:** No replica in the AG has `failover_mode_desc = AUTOMATIC`
- **Severity:** Warning
- **Fix:** Without an automatic failover replica, a primary failure requires manual
intervention, increasing recovery time. Configure at least one synchronous-commit secondary
for automatic failover: `ALTER AVAILABILITY GROUP [ag] MODIFY REPLICA ON N'server' WITH
(FAILOVER_MODE = AUTOMATIC)`. Confirm WSFC quorum supports automatic failover before
enabling it.
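A catalog query for the trigger condition (sketch):
```sql
-- Availability groups with no AUTOMATIC-failover replica
SELECT ag.name AS ag_name
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar
    ON ag.group_id = ar.group_id
GROUP BY ag.name
HAVING SUM(CASE WHEN ar.failover_mode_desc = 'AUTOMATIC'
                THEN 1 ELSE 0 END) = 0;
```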
### H19 — Single Replica AG
- **Trigger:** Only one replica row exists for the availability group (no secondaries)
- **Severity:** Info
- **Fix:** A single-replica AG provides no high-availability protection: there is no
secondary to receive log or to fail over to. Add a secondary replica if HA is a requirement.
Document the intent if the AG is deliberately a placeholder, for example a read-scale AG
whose secondaries have not yet been added.
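The trigger condition as a catalog query (sketch):
```sql
-- Availability groups with only one replica (no HA partner)
SELECT ag.name AS ag_name, COUNT(*) AS replica_count
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar
    ON ag.group_id = ar.group_id
GROUP BY ag.name
HAVING COUNT(*) = 1;
```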
### H20 — Listener Not Configured
- **Trigger:** No rows in `sys.availability_group_listeners` for this AG
- **Severity:** Info
- **Fix:** Without a listener, applications must connect directly to the primary by server
name, which requires a connection string change after every failover. Create a listener:
`ALTER AVAILABILITY GROUP [ag] ADD LISTENER N'ag-listener' (WITH IP ((N'10.0.0.10',
N'255.255.255.0')), PORT=1433)`. Update application connection strings to use the
listener DNS name.
### H21 — Read-Only Routing Not Configured
- **Trigger:** A readable secondary exists (`secondary_role_allow_connections_desc =
ALL` or `READ_ONLY`) AND `read_only_routing_url` IS NULL on that replica
- **Severity:** Info
- **Fix:** Readable secondaries without routing configuration require explicit connection
string targeting. Configure routing: set `READ_ONLY_ROUTING_URL` on each secondary and
`READ_ONLY_ROUTING_LIST` on the primary replica so that `ApplicationIntent=ReadOnly`
connections are automatically directed to a readable secondary.
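The two-step configuration, sketched with hypothetical names (`[ag]`, `NODE1\SQL2019` as
primary, `NODE3\SQL2019` as the readable secondary, `node3.contoso.com` as its routing URL;
substitute your own):
```sql
-- 1. Set the routing URL on the readable secondary
ALTER AVAILABILITY GROUP [ag] MODIFY REPLICA ON N'NODE3\SQL2019'
WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://node3.contoso.com:1433'));
-- 2. Set the routing list on the primary so ReadOnly-intent connections redirect
ALTER AVAILABILITY GROUP [ag] MODIFY REPLICA ON N'NODE1\SQL2019'
WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (N'NODE3\SQL2019')));
```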
### H22 — Automatic Seeding Active
- **Trigger:** `seeding_mode_desc = AUTOMATIC` AND a secondary database is in
`synchronization_state_desc = NOT SYNCHRONIZING` (seeding in progress)
- **Severity:** Info
- **Fix:** Automatic seeding is transferring the database to the secondary. This is normal
after adding a new replica or database to the AG. Monitor progress with:
`SELECT * FROM sys.dm_hadr_automatic_seeding`. High network utilization is expected during
seeding. Seeding of large databases can take hours — plan maintenance windows accordingly.
---
## Output Format
Structure the report exactly as follows. Labeling convention: each finding label pairs a
severity index with the check ID in a single bracket, e.g. `[C1 — H1]`, `[W1 — H18]`,
`[I1 — H21]`.
```
## HADR Health Analysis
### Summary
- X Critical, Y Warnings, Z Info
- Availability group: [ag_name]
- Replicas: [list with roles, e.g. NODE1\SQL2019 (PRIMARY), NODE2\SQL2019 (SECONDARY — DISCONNECTED)]
- Highest-risk finding: [check name and ID]
### Critical Issues
### [C1 — H1] Replica Disconnected — NODE2\SQL2019
- **Observed:** connected_state_desc = DISCONNECTED; last_connect_error_number = 35206;
last_connect_error_description = "The connection attempt to secondary replica 'NODE2\SQL2019'
timed out."
- **Impact:** All databases on this secondary are no longer receiving log from the primary.
If this is the only secondary, automatic failover protection is lost.
- **Fix:** Verify network connectivity and SQL Server service state on NODE2\SQL2019. Review
CLUSTER.LOG for network partition events. Check Windows Event Log for SQL Server service
failures.
### Warnings
### [W1 — H18] No Automatic Failover Replica
- **Observed:** All replicas have failover_mode_desc = MANUAL
- **Impact:** Primary failure requires manual DBA intervention before any secondary can
promote, increasing downtime.
- **Fix:** Configure FAILOVER_MODE = AUTOMATIC on a sync-commit secondary after verifying
WSFC quorum health.
### Info
### [I1 — H21] Read-Only Routing Not Configured — NODE3\SQL2019
- **Observed:** read_only_routing_url IS NULL; secondary is readable
- **Impact:** ApplicationIntent=ReadOnly connections will not be redirected automatically.
- **Fix:** Set READ_ONLY_ROUTING_URL and configure READ_ONLY_ROUTING_LIST on primary.
### Passed Checks
| Check | Result |
|-------|--------|
| H2 — Replica in Resolving State | PASS — no replica in RESOLVING role |
| H3 — Synchronization Unhealthy | PASS — all connected replicas report HEALTHY |
```
Include a **Prioritized Action Order** table after all findings:
```
### Prioritized Action Order
| Priority | Action | Resolves | Effort |
|----------|--------|----------|--------|
| 1 — Immediately | Investigate replica connectivity on NODE2\SQL2019 | C1 | 15 min |
| 2 — Today | Enable AUTOMATIC failover mode on sync secondary | W1 | 30 min |
| 3 — This sprint | Configure read-only routing on NODE3\SQL2019 | I1 | 20 min |
```
## Notes
- When only natural language input is provided, state which columns are missing and apply
only the checks that can be evaluated from the described values.
- `estimated_data_loss_seconds` and `estimated_recovery_time_seconds` are NULL for async
replicas that are currently disconnected — note this limitation rather than firing H7/H8.
- `secondary_lag_seconds` is NULL for the primary replica row — skip H9 for primary rows.
- If `log_send_rate` and `redo_rate` are both NULL, the DMV was captured on a secondary
replica (these columns are populated only on the primary). Note this and advise recapture
on the primary.
- Do not invent findings not triggered by the rules above.
## Companion Skills
- `/sqlwait-review` — Analyze `HADR_SYNC_COMMIT`, `HADR_WORK_QUEUE`, `HADR_LOGCAPTURE_WAIT`,
and `HADR_TRANSPORT_SESSION_CHANNEL_LOCK` waits on the primary to quantify the commit
latency overhead imposed by synchronous replicas (H16, H4).
- `/sqlplan-review` + `/query-store-review` — After a failover or extended lag event,
applications may use suboptimal plans on the new primary due to cold plan cache or
parameter sniffing. Run post-failover plan review and Query Store regression checks.
- `/procstats-review` — Identify whether a high-CPU or high-read procedure on the primary
is generating excessive log volume, contributing to send queue buildup (H11, H14).
- `/tsql-review` — Review T-SQL that runs on a readable secondary to identify implicit
conversions or non-sargable predicates that add read load and compete with redo threads.