---
name: hadr-health-review
description: Analyzes sys.dm_hadr_* DMV output to assess Always On Availability Group replica health, synchronization state, secondary lag, redo and log send queue sizes, and configuration gaps. Use this skill when an availability group is behaving unexpectedly, a secondary replica is lagging, data loss is a concern, or you need a SQL-side snapshot of AG health to complement CLUSTER.LOG and ERRORLOG diagnostics. Applies 22 checks (H1–H22) covering replica connectivity, data loss risk, recovery time, throughput, and configuration.
triggers:
- /hadr-health-review
- /hadr-review
---
# SQL Server Always On AG Health Review Skill
## Purpose
Analyze output from the `sys.dm_hadr_*` DMV family to assess the health of one or more
Always On Availability Groups. Applies 22 checks (H1–H22) across four categories:
- **H1–H6** — Replica connectivity and role: detect disconnected replicas, resolving state,
unhealthy synchronization health, replicas not synchronizing, last-connect errors, and
failover mode mismatches
- **H7–H11** — Data loss and recovery time: flag estimated data loss, excessive recovery time,
secondary lag, redo queue buildup, and log send queue buildup
- **H12–H16** — Throughput and performance: detect stalled redo rate, stalled log send rate,
rate mismatch causing queue accumulation, multiple databases lagging on the same replica,
and commit latency signals on sync-commit replicas
- **H17–H22** — Configuration: async replica in unexpected position, no automatic failover
replica, single-replica AG, missing listener, read-only routing not configured, and
automatic seeding in progress
## Input
Accept any of:
- **File path** — path to a saved text/CSV file containing the DMV query output
- **Inline paste** — DMV result grid pasted directly into chat (tab- or pipe-delimited)
- **Natural language description** — description of AG symptoms ("secondary is 90 seconds
behind", "replica shows NOT_HEALTHY")
### Capture Query
Run the following on the primary replica to collect the required columns:
```sql
SELECT
    ag.name AS ag_name,
    ar.replica_server_name,
    ar.availability_mode_desc,
    ar.failover_mode_desc,
    ars.role_desc,
    ars.connected_state_desc,
    ars.synchronization_health_desc,
    ars.last_connect_error_number,
    ars.last_connect_error_description,
    DB_NAME(drs.database_id) AS database_name,
    drs.synchronization_state_desc,
    drs.synchronization_health_desc AS db_sync_health,
    drs.log_send_queue_size,
    drs.log_send_rate,
    drs.redo_queue_size,
    drs.redo_rate,
    drs.secondary_lag_seconds,
    -- sys.dm_hadr_database_replica_states exposes no estimated_* columns;
    -- derive data loss as the commit-time gap versus the primary row
    DATEDIFF(SECOND, drs.last_commit_time, pdrs.last_commit_time)
        AS estimated_data_loss_seconds,
    -- derive recovery time as the redo queue drained at the current redo rate
    drs.redo_queue_size / NULLIF(drs.redo_rate, 0)
        AS estimated_recovery_time_seconds
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar
    ON ag.group_id = ar.group_id
JOIN sys.dm_hadr_availability_replica_states ars
    ON ar.replica_id = ars.replica_id
JOIN sys.dm_hadr_database_replica_states drs
    ON ar.replica_id = drs.replica_id
LEFT JOIN sys.dm_hadr_database_replica_states pdrs
    ON drs.group_database_id = pdrs.group_database_id
   AND pdrs.is_primary_replica = 1
ORDER BY ar.replica_server_name, database_name;
```
Also capture listener and routing configuration for H20–H21. Use outer joins so replica
routing rows still appear when no listener exists (H20 fires on a NULL `dns_name`):
```sql
SELECT ag.name AS ag_name, agl.dns_name, agl.port,
       aglip.ip_address, aglip.ip_subnet_mask,
       r.replica_server_name, r.read_only_routing_url
FROM sys.availability_groups ag
LEFT JOIN sys.availability_group_listeners agl
    ON ag.group_id = agl.group_id
LEFT JOIN sys.availability_group_listener_ip_addresses aglip
    ON agl.listener_id = aglip.listener_id
JOIN sys.availability_replicas r
    ON ag.group_id = r.group_id;
```
### Column Reference
| Column | Source DMV | Notes |
|--------|-----------|-------|
| `connected_state_desc` | `dm_hadr_availability_replica_states` | CONNECTED or DISCONNECTED |
| `role_desc` | `dm_hadr_availability_replica_states` | PRIMARY, SECONDARY, RESOLVING |
| `synchronization_health_desc` (replica) | `dm_hadr_availability_replica_states` | NOT_HEALTHY, PARTIALLY_HEALTHY, HEALTHY |
| `last_connect_error_number` | `dm_hadr_availability_replica_states` | 0 = no error |
| `last_connect_error_description` | `dm_hadr_availability_replica_states` | Error text when non-zero |
| `availability_mode_desc` | `sys.availability_replicas` | SYNCHRONOUS_COMMIT or ASYNCHRONOUS_COMMIT |
| `failover_mode_desc` | `sys.availability_replicas` | AUTOMATIC or MANUAL |
| `synchronization_state_desc` | `dm_hadr_database_replica_states` | NOT SYNCHRONIZING, SYNCHRONIZING, SYNCHRONIZED (INITIALIZING/REVERTING during transitions) |
| `db_sync_health` | `dm_hadr_database_replica_states` | NOT_HEALTHY, PARTIALLY_HEALTHY, HEALTHY |
| `log_send_queue_size` | `dm_hadr_database_replica_states` | KB of log not yet sent to secondary |
| `log_send_rate` | `dm_hadr_database_replica_states` | KB/s sent to secondary (0 = stalled) |
| `redo_queue_size` | `dm_hadr_database_replica_states` | KB of log received but not yet redone |
| `redo_rate` | `dm_hadr_database_replica_states` | KB/s being redone on secondary (0 = stalled) |
| `secondary_lag_seconds` | `dm_hadr_database_replica_states` | Seconds secondary is behind primary |
| `estimated_data_loss_seconds` | derived in capture query | Commit-time gap between primary and secondary rows; potential data loss if primary fails now |
| `estimated_recovery_time_seconds` | derived in capture query | `redo_queue_size / redo_rate`; seconds to redo queued log after failover |
---
## Thresholds Reference
| Threshold | Value | Used by |
|-----------|-------|---------|
| Estimated data loss | >30 sec → Critical; >5 sec → Warning | H7 |
| Estimated recovery time | >300 sec → Warning | H8 |
| Secondary lag | >60 sec → Critical; >10 sec → Warning | H9 |
| Redo queue size | >500 MB → Critical; >100 MB → Warning | H10 |
| Log send queue size | >500 MB → Warning | H11 |
| Multiple databases lagging | ≥3 databases with secondary_lag_seconds >10 sec on same replica → Critical | H15 |
---
## Category 1 — Replica Connectivity and Role (H1–H6)
Evaluate these first. A disconnected or resolving replica supersedes all other findings.
### H1 — Replica Disconnected
- **Trigger:** `connected_state_desc = DISCONNECTED` for any replica row
- **Severity:** Critical
- **Fix:** Check network connectivity between the primary and the disconnected node. Review
`last_connect_error_description` for the specific failure. Inspect CLUSTER.LOG on the
Windows Server Failover Cluster node for eviction or network partition events. Confirm the
SQL Server service is running on the target node.
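A focused query for this check, run on the primary (a sketch using the same DMVs as the
capture query):
```sql
-- List disconnected replicas with their last recorded connection error
SELECT ar.replica_server_name,
       ars.role_desc,
       ars.last_connect_error_number,
       ars.last_connect_error_description,
       ars.last_connect_error_timestamp
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar
    ON ars.replica_id = ar.replica_id
WHERE ars.connected_state_desc = 'DISCONNECTED';
```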
### H2 — Replica in Resolving State
- **Trigger:** `role_desc = RESOLVING` for any replica row
- **Severity:** Critical
- **Fix:** A replica in RESOLVING state has lost quorum contact or its role cannot be determined.
Check WSFC quorum health in Failover Cluster Manager. If this is a planned failover in
progress, wait for it to complete. If unplanned, investigate CLUSTER.LOG for quorum loss.
### H3 — Synchronization Unhealthy at Replica Level
- **Trigger:** `synchronization_health_desc = NOT_HEALTHY` on a replica row
- **Severity:** Critical
- **Fix:** At least one database on this replica is not synchronizing. Drill into
`db_sync_health` per database to identify which database is unhealthy (H4 will co-fire).
Check the SQL Server ERRORLOG on the secondary for hadr_work_queue or transport errors.
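To drill from the replica-level signal down to the offending databases, one option is
(`suspend_reason_desc` explains a suspended data movement state):
```sql
-- Identify which databases on the unhealthy replica are not healthy
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc,
       drs.is_suspended,
       drs.suspend_reason_desc
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
WHERE drs.synchronization_health_desc <> 'HEALTHY';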
### H4 — Replica Not Synchronizing (Sync-Commit)
- **Trigger:** `synchronization_state_desc = NOT SYNCHRONIZING` AND `availability_mode_desc
= SYNCHRONOUS_COMMIT`
- **Severity:** Critical
- **Fix:** A synchronous-commit secondary that is not synchronizing blocks commits on the
primary (the primary waits for acknowledgement). Restart HADR transport if the secondary
is otherwise healthy: `ALTER DATABASE [db] SET HADR RESUME`. Check for full transaction
log on the secondary — a full log halts redo and breaks synchronization.
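A query to surface exactly the rows this check fires on (sketch; run on the primary):
```sql
-- Sync-commit databases that are not synchronizing (these block primary commits)
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,
       drs.is_suspended,
       drs.suspend_reason_desc
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
WHERE ar.availability_mode_desc = 'SYNCHRONOUS_COMMIT'
  AND drs.synchronization_state_desc = 'NOT SYNCHRONIZING';
```
If `is_suspended = 1`, resume data movement on the secondary as described above.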
### H5 — Last Connect Error Present
- **Trigger:** `last_connect_error_number != 0`
- **Severity:** Warning
- **Fix:** A past connection failure was recorded. The replica may have recovered, but the
error reveals prior instability. Review `last_connect_error_description` for the error
text. Common causes: endpoint certificate expiry, firewall change, or network blip. Rotate
certificates if the error mentions authentication or certificate issues.
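When the error text points at certificates, a quick expiry check on each node (sketch;
`certificate_id` is 0 when the endpoint uses Windows authentication, so this only returns
rows for certificate-authenticated endpoints):
```sql
-- Check expiry of the certificate backing the HADR endpoint (run in master)
SELECT e.name AS endpoint_name,
       c.name AS certificate_name,
       c.expiry_date
FROM sys.database_mirroring_endpoints e
JOIN master.sys.certificates c
    ON e.certificate_id = c.certificate_id;
```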
### H6 — Manual Failover Mode on Sync-Commit Replica
- **Trigger:** `failover_mode_desc = MANUAL` AND `availability_mode_desc = SYNCHRONOUS_COMMIT`
- **Severity:** Warning
- **Fix:** A synchronous-commit replica configured for manual failover only will not
automatically protect against primary failure. If automatic protection is intended, change
to `AUTOMATIC` failover mode: `ALTER AVAILABILITY GROUP [ag] MODIFY REPLICA ON
N'server' WITH (FAILOVER_MODE = AUTOMATIC)`. Verify WSFC quorum can support automatic
failover before making this change.
---
## Category 2 — Data Loss and Recovery Time (H7–H11)
These checks quantify the risk of data loss and the time to recover if the primary fails.
### H7 — Estimated Data Loss
- **Trigger:** `estimated_data_loss_seconds` exceeds the data loss threshold (see Thresholds
Reference)
- **Severity:** Critical if >30 sec; Warning if >5 sec
- **Fix:** The log has not been hardened on the secondary within the threshold window. For
sync-commit replicas, this indicates the synchronization is stalled (see H4). For async
replicas, consider increasing log send rate, improving network bandwidth, or accepting the
RPO by switching a critical database to sync-commit. If the value is consistently high,
evaluate whether the secondary has sufficient I/O to keep up with redo.
### H8 — Estimated Recovery Time
- **Trigger:** `estimated_recovery_time_seconds` exceeds the recovery time threshold (see
Thresholds Reference)
- **Severity:** Warning
- **Fix:** After a failover, it will take longer than the threshold to redo the queued log on
the secondary before it opens for reads or promotes to primary. Reduce redo queue size (H10)
to reduce recovery time. Check secondary disk I/O — redo is sequential log apply and is
bounded by disk write throughput. Evaluate whether this RTO is acceptable for the SLA.
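The recovery-time estimate can be approximated directly from the queue and rate columns
(sketch; values are cumulative DMV snapshots, so treat a single reading as indicative):
```sql
-- Approximate per-database recovery time: redo queue drained at the current redo rate
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.redo_queue_size,   -- KB queued
       drs.redo_rate,         -- KB/s applied
       drs.redo_queue_size / NULLIF(drs.redo_rate, 0) AS est_recovery_seconds
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
WHERE drs.is_primary_replica = 0
ORDER BY est_recovery_seconds DESC;
```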
### H9 — Secondary Lag
- **Trigger:** `secondary_lag_seconds` exceeds the lag threshold (see Thresholds Reference)
- **Severity:** Critical if >60 sec; Warning if >10 sec
- **Fix:** The secondary is behind the primary. For async replicas, check `log_send_rate`
(H13) — if zero, log is not being sent. For sync replicas, lag indicates the primary is
waiting on acknowledgement. Check network latency between primary and secondary. On the
secondary, check for I/O bottlenecks limiting redo throughput (`redo_rate`, H12). If
secondary_lag_seconds equals estimated_data_loss_seconds, the lag is entirely in the
send queue; if recovery time is also high, redo is behind as well.
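A query that splits the lag into its send-side and redo-side components (sketch; the 10-second
filter mirrors the Warning threshold):
```sql
-- Locate where the lag lives: unsent log vs received-but-not-redone log
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.secondary_lag_seconds,
       drs.log_send_queue_size,   -- KB not yet sent (send-side lag)
       drs.redo_queue_size        -- KB received but not redone (redo-side lag)
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
WHERE drs.secondary_lag_seconds > 10
ORDER BY drs.secondary_lag_seconds DESC;
```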
### H10 — Redo Queue Buildup
- **Trigger:** `redo_queue_size` exceeds the redo queue threshold (see Thresholds Reference)
- **Severity:** Critical if >500 MB; Warning if >100 MB
- **Fix:** Log records are arriving on the secondary faster than they are being redone. The
secondary's redo thread cannot keep up. Check secondary disk write latency — redo is
bottlenecked on sequential log writes to the data files. Consider increasing secondary
storage throughput (SSD, faster controller). Check for long-running transactions on the
secondary blocking redo (readable secondary scenario). Verify `redo_rate` > 0 (see H12).
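To check the disk-latency hypothesis, file-level I/O stall data is available on the secondary
(sketch; these counters are cumulative since instance startup, so sample twice and compare
deltas to measure current latency):
```sql
-- Run on the secondary: average write latency per database file
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_writes,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
JOIN sys.master_files mf
    ON vfs.database_id = mf.database_id
   AND vfs.file_id = mf.file_id
ORDER BY avg_write_ms DESC;
```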
### H11 — Log Send Queue Buildup
- **Trigger:** `log_send_queue_size` exceeds the send queue threshold (see Thresholds
Reference)
- **Severity:** Warning
- **Fix:** Log generated on the primary has not been sent to the secondary. Check network
bandwidth between primary and secondary. High `log_send_queue_size` with `log_send_rate`
= 0 (see H13) indicates a stalled transport — check endpoint connectivity. High
`log_send_queue_size` with nonzero `log_send_rate` indicates network saturation or burst
log generation outpacing the link.
---
## Category 3 — Throughput and Performance (H12–H16)
These checks detect stalled or mismatched throughput that will cause queues to grow.
### H12 — Zero Redo Rate on Synchronizing Database
- **Trigger:** `redo_rate = 0` AND `synchronization_state_desc = SYNCHRONIZING` AND
`redo_queue_size > 0`
- **Severity:** Warning
- **Fix:** The redo thread has stalled despite queued log. Common causes: (1) long-running
read query on a readable secondary holding a lock that blocks redo; (2) the secondary
database is in a transitional state — check ERRORLOG; (3) redo thread has encountered an
error — check `dm_hadr_database_replica_states.last_redone_lsn` for progress. Restarting
HADR on the secondary (`ALTER DATABASE [db] SET HADR SUSPEND / RESUME`) can clear
transient stalls.
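To test cause (1), look for blocking chains on the readable secondary (a generic sketch; a
long-running reader appearing as a `blocking_session_id` for a suspended system request is
the signature of a redo block):
```sql
-- Run on the readable secondary: sessions involved in blocking
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       r.command,
       DB_NAME(r.database_id) AS database_name
FROM sys.dm_exec_requests r
WHERE r.blocking_session_id <> 0
   OR r.session_id IN (SELECT blocking_session_id
                       FROM sys.dm_exec_requests
                       WHERE blocking_session_id <> 0);
```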
### H13 — Zero Log Send Rate with Non-Empty Send Queue
- **Trigger:** `log_send_rate = 0` AND `log_send_queue_size > 0`
- **Severity:** Warning
- **Fix:** Log is queued but not being sent. The HADR transport thread has stalled. Check
endpoint health: `SELECT * FROM sys.dm_hadr_availability_replica_states WHERE
connected_state_desc = 'DISCONNECTED'`. Verify the database mirroring endpoint is
running: `SELECT state_desc FROM sys.database_mirroring_endpoints`. Restart the endpoint
if necessary: `ALTER ENDPOINT [Hadr_endpoint] STATE = STOPPED; ALTER ENDPOINT
[Hadr_endpoint] STATE = STARTED`.
### H14 — Redo Rate / Send Rate Mismatch
- **Trigger:** `log_send_rate > 0` AND `redo_rate > 0` AND `redo_queue_size` is growing
(redo_rate significantly less than log_send_rate, such that the queue accumulates)
- **Severity:** Warning
- **Fix:** Log is being sent faster than the secondary can redo it, causing redo queue
growth. The bottleneck is secondary redo throughput, not the network. Investigate secondary
disk I/O latency. Check whether readable secondary workloads (reporting queries) are
competing with redo for I/O. Consider dedicated storage for secondary data files.
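The trigger condition can be expressed directly against the DMV (sketch; a single snapshot
shows the rate gap, while queue growth requires comparing two captures over time):
```sql
-- Databases where log is arriving faster than it is being redone
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.log_send_rate,    -- KB/s arriving at the secondary
       drs.redo_rate,        -- KB/s applied by redo
       drs.redo_queue_size   -- KB accumulated
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
WHERE drs.log_send_rate > 0
  AND drs.redo_rate > 0
  AND drs.log_send_rate > drs.redo_rate
  AND drs.redo_queue_size > 0;
```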
### H15 — Multiple Databases Lagging on Same Replica
- **Trigger:** ≥3 databases on the same replica have `secondary_lag_seconds` exceeding the
multiple-database lag threshold (see Thresholds Reference)
- **Severity:** Critical
- **Fix:** When multiple databases lag simultaneously, the root cause is at the replica level,
not per-database. Check overall secondary node health: CPU, memory, and disk I/O. A
saturated secondary node falls behind across all databases at once. Also check CLUSTER.LOG
for node-level resource pressure. Investigate whether a single database with large
transactions is monopolizing redo threads.
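The trigger aggregation, as a query (sketch; thresholds match the Thresholds Reference):
```sql
-- Replicas with three or more databases lagging beyond 10 seconds
SELECT ar.replica_server_name,
       COUNT(*) AS lagging_databases,
       MAX(drs.secondary_lag_seconds) AS worst_lag_seconds
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
WHERE drs.secondary_lag_seconds > 10
GROUP BY ar.replica_server_name
HAVING COUNT(*) >= 3;
```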
### H16 — Commit Latency Signal on Sync-Commit Replica
- **Trigger:** `availability_mode_desc = SYNCHRONOUS_COMMIT` AND
`synchronization_state_desc = SYNCHRONIZING` (database not yet SYNCHRONIZED, indicating
the sync is in progress but not complete, potentially stalling primary commits)
- **Severity:** Warning
- **Fix:** Primary commits wait for the synchronous secondary to harden the log before
acknowledging. While SYNCHRONIZING is normal during catchup, a sync-commit secondary that
remains SYNCHRONIZING for an extended period adds latency to every primary transaction.
Check `estimated_data_loss_seconds` and `secondary_lag_seconds` to quantify the stall.
If the secondary is persistently SYNCHRONIZING, investigate redo and send queue (H10, H11).
---
## Category 4 — Configuration (H17–H22)
These checks surface AG topology gaps that may not cause immediate problems but increase risk.
### H17 — Async Replica in Sync-Expected Position
- **Trigger:** `availability_mode_desc = ASYNCHRONOUS_COMMIT` on a replica that is the
only secondary in the AG, or is designated as the DR target in a two-replica topology
- **Severity:** Info
- **Fix:** An async-commit secondary provides no data-loss protection for synchronous RPO
requirements. If the topology intends zero data loss, change the replica to
SYNCHRONOUS_COMMIT: `ALTER AVAILABILITY GROUP [ag] MODIFY REPLICA ON N'server' WITH
(AVAILABILITY_MODE = SYNCHRONOUS_COMMIT)`. Verify the network and I/O can sustain the
additional commit latency before switching.
### H18 — No Automatic Failover Replica
- **Trigger:** No replica in the AG has `failover_mode_desc = AUTOMATIC`
- **Severity:** Warning
- **Fix:** Without an automatic failover replica, a primary failure requires manual
intervention, increasing recovery time. Configure at least one synchronous-commit secondary
for automatic failover: `ALTER AVAILABILITY GROUP [ag] MODIFY REPLICA ON N'server' WITH
(FAILOVER_MODE = AUTOMATIC)`. Confirm WSFC quorum supports automatic failover before
enabling it.
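A catalog query for the trigger condition (sketch):
```sql
-- Availability groups with no AUTOMATIC-failover replica
SELECT ag.name AS ag_name
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar
    ON ag.group_id = ar.group_id
GROUP BY ag.name
HAVING SUM(CASE WHEN ar.failover_mode_desc = 'AUTOMATIC'
                THEN 1 ELSE 0 END) = 0;
```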
### H19 — Single Replica AG
- **Trigger:** Only one replica row exists for the availability group (no secondaries)
- **Severity:** Info
- **Fix:** A single-replica AG provides no high-availability protection: there is no
secondary to receive log or to fail over to. Add a secondary replica if HA is a requirement.
Document the intent if the AG is deliberately a placeholder, for example a read-scale AG
whose secondaries have not yet been added.
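The trigger condition as a catalog query (sketch):
```sql
-- Availability groups with only one replica (no HA partner)
SELECT ag.name AS ag_name, COUNT(*) AS replica_count
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar
    ON ag.group_id = ar.group_id
GROUP BY ag.name
HAVING COUNT(*) = 1;
```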
### H20 — Listener Not Configured
- **Trigger:** No rows in `sys.availability_group_listeners` for this AG
- **Severity:** Info
- **Fix:** Without a listener, applications must connect directly to the primary by server
name, which requires a connection string change after every failover. Create a listener:
`ALTER AVAILABILITY GROUP [ag] ADD LISTENER N'ag-listener' (WITH IP ((N'10.0.0.10',
N'255.255.255.0')), PORT=1433)`. Update application connection strings to use the
listener DNS name.
### H21 — Read-Only Routing Not Configured
- **Trigger:** A readable secondary exists (`secondary_role_allow_connections_desc =
ALL` or `READ_ONLY`) AND `read_only_routing_url` IS NULL on that replica
- **Severity:** Info
- **Fix:** Readable secondaries without routing configuration require explicit connection
string targeting. Configure routing: set `READ_ONLY_ROUTING_URL` on each secondary and
`READ_ONLY_ROUTING_LIST` on the primary replica so that `ApplicationIntent=ReadOnly`
connections are automatically directed to a readable secondary.
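The two-step configuration, sketched with hypothetical names (`[ag]`, `NODE1\SQL2019` as
primary, `NODE3\SQL2019` as the readable secondary, `node3.contoso.com` as its routing URL;
substitute your own):
```sql
-- 1. Set the routing URL on the readable secondary
ALTER AVAILABILITY GROUP [ag] MODIFY REPLICA ON N'NODE3\SQL2019'
WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://node3.contoso.com:1433'));
-- 2. Set the routing list on the primary so ReadOnly-intent connections redirect
ALTER AVAILABILITY GROUP [ag] MODIFY REPLICA ON N'NODE1\SQL2019'
WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (N'NODE3\SQL2019')));
```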
### H22 — Automatic Seeding Active
- **Trigger:** `seeding_mode_desc = AUTOMATIC` AND a secondary database is in
`synchronization_state_desc = NOT SYNCHRONIZING` (seeding in progress)
- **Severity:** Info
- **Fix:** Automatic seeding is transferring the database to the secondary. This is normal
after adding a new replica or database to the AG. Monitor progress with:
`SELECT * FROM sys.dm_hadr_automatic_seeding`. High network utilization is expected during
seeding. Seeding of large databases can take hours — plan maintenance windows accordingly.
---
## Output Format
Structure the report exactly as follows. Labeling convention: each finding label pairs a
severity index with the check ID in a single bracket, e.g. `[C1 — H1]`, `[W1 — H18]`,
`[I1 — H21]`.
```
## HADR Health Analysis
### Summary
- X Critical, Y Warnings, Z Info
- Availability group: [ag_name]
- Replicas: [list with roles, e.g. NODE1\SQL2019 (PRIMARY), NODE2\SQL2019 (SECONDARY — DISCONNECTED)]
- Highest-risk finding: [check name and ID]
### Critical Issues
### [C1 — H1] Replica Disconnected — NODE2\SQL2019
- **Observed:** connected_state_desc = DISCONNECTED; last_connect_error_number = 35206;
last_connect_error_description = "The connection attempt to secondary replica 'NODE2\SQL2019'
timed out."
- **Impact:** All databases on this secondary are no longer receiving log from the primary.
If this is the only secondary, automatic failover protection is lost.
- **Fix:** Verify network connectivity and SQL Server service state on NODE2\SQL2019. Review
CLUSTER.LOG for network partition events. Check Windows Event Log for SQL Server service
failures.
### Warnings
### [W1 — H18] No Automatic Failover Replica
- **Observed:** All replicas have failover_mode_desc = MANUAL
- **Impact:** Primary failure requires manual DBA intervention before any secondary can
promote, increasing downtime.
- **Fix:** Configure FAILOVER_MODE = AUTOMATIC on a sync-commit secondary after verifying
WSFC quorum health.
### Info
### [I1 — H21] Read-Only Routing Not Configured — NODE3\SQL2019
- **Observed:** read_only_routing_url IS NULL; secondary is readable
- **Impact:** ApplicationIntent=ReadOnly connections will not be redirected automatically.
- **Fix:** Set READ_ONLY_ROUTING_URL and configure READ_ONLY_ROUTING_LIST on primary.
### Passed Checks
| Check | Result |
|-------|--------|
| H2 — Replica in Resolving State | PASS — no replica in RESOLVING role |
| H3 — Synchronization Unhealthy | PASS — all connected replicas report HEALTHY |
```
Include a **Prioritized Action Order** table after all findings:
```
### Prioritized Action Order
| Priority | Action | Resolves | Effort |
|----------|--------|----------|--------|
| 1 — Immediately | Investigate replica connectivity on NODE2\SQL2019 | C1 | 15 min |
| 2 — Today | Enable AUTOMATIC failover mode on sync secondary | W1 | 30 min |
| 3 — This sprint | Configure read-only routing on NODE3\SQL2019 | I1 | 20 min |
```
## Notes
- When only natural language input is provided, state which columns are missing and apply
only the checks that can be evaluated from the described values.
- `estimated_data_loss_seconds` and `estimated_recovery_time_seconds` are NULL for async
replicas that are currently disconnected — note this limitation rather than firing H7/H8.
- `secondary_lag_seconds` is NULL for the primary replica row — skip H9 for primary rows.
- If `log_send_rate` and `redo_rate` are both NULL, the DMV was captured on a secondary
replica (these columns are populated only on the primary). Note this and advise recapture
on the primary.
- Do not invent findings not triggered by the rules above.
## Companion Skills
- `/sqlwait-review` — Analyze `HADR_SYNC_COMMIT`, `HADR_WORK_QUEUE`, `HADR_LOGCAPTURE_WAIT`,
and `HADR_TRANSPORT_SESSION_CHANNEL_LOCK` waits on the primary to quantify the commit
latency overhead imposed by synchronous replicas (H16, H4).
- `/sqlplan-review` + `/query-store-review` — After a failover or extended lag event,
applications may use suboptimal plans on the new primary due to cold plan cache or
parameter sniffing. Run post-failover plan review and Query Store regression checks.
- `/procstats-review` — Identify whether a high-CPU or high-read procedure on the primary
is generating excessive log volume, contributing to send queue buildup (H11, H14).
- `/tsql-review` — Review T-SQL that runs on a readable secondary to identify implicit
conversions or non-sargable predicates that add read load and compete with redo threads.