Free SKILL.md scraped from GitHub. Clone the repo or copy the file directly into your Claude Code skills directory.
npx versuz@latest install kagenti-kagenti-claude-skills-hypershift-debuggit clone https://github.com/kagenti/kagenti.gitcp kagenti/SKILL.MD ~/.claude/skills/kagenti-kagenti-claude-skills-hypershift-debug/SKILL.md---
name: hypershift:debug
description: Debug AWS resources for HyperShift clusters - identify stuck resources, orphaned infrastructure, and deletion blockers
---
# AWS HyperShift Debug Skill
This skill helps debug AWS resources related to HyperShift clusters, identifying resources that may be blocking cluster deletion or orphaned infrastructure.
## Context-Safe Execution (MANDATORY)
**AWS CLI and kubectl output MUST go to files.**
```bash
export LOG_DIR=/tmp/kagenti/hypershift/${CLUSTER:-debug}
mkdir -p $LOG_DIR
# Pattern: redirect AWS/kubectl output
aws ec2 describe-vpcs ... > $LOG_DIR/vpcs.log 2>&1 && echo "OK" || echo "FAIL"
kubectl get hostedclusters -A > $LOG_DIR/clusters.log 2>&1 && echo "OK" || echo "FAIL"
# Analyze in subagent: Task(subagent_type='Explore') with Grep
```
## When to Use
- Cluster deletion is stuck (finalizer not removed)
- Need to identify orphaned AWS resources
- Debugging IAM permission issues during cluster operations
- Verifying cleanup after cluster destruction
- User asks "debug hypershift" or "why is cluster stuck"
## Arguments
- `cluster-name`: The HyperShift cluster name or suffix to debug (optional, defaults to local)
- `--check`: Quiet mode - returns exit code only (0=no resources, 1=resources exist)
## Quick Debug Commands
### Prerequisites
Ensure credentials are loaded:
```bash
# Load HyperShift credentials (from kagenti repo root)
source .env.kagenti-hypershift-custom # or .env.hypershift-ci
```
### Run Full Debug Script
```bash
# Run comprehensive AWS debug
./.github/scripts/hypershift/debug-aws-hypershift.sh [cluster-name]
# Examples:
./.github/scripts/hypershift/debug-aws-hypershift.sh # defaults to username
./.github/scripts/hypershift/debug-aws-hypershift.sh ladas # suffix only
./.github/scripts/hypershift/debug-aws-hypershift.sh kagenti-hypershift-custom-ladas # full name
./.github/scripts/hypershift/debug-aws-hypershift.sh pr529
# Check mode (quiet, returns exit code only)
./.github/scripts/hypershift/debug-aws-hypershift.sh --check ladas
echo $? # 0 = no resources, 1 = resources exist
```
## Manual Debug Commands
### Check HostedCluster Status
```bash
# Set management cluster kubeconfig
source .env.kagenti-hypershift-custom # loads KUBECONFIG
# Get HostedCluster status
oc get hostedcluster -n clusters <cluster-name> -o yaml
# Check deletion timestamp and finalizers
oc get hostedcluster -n clusters <cluster-name> -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}'
# Check conditions
oc get hostedcluster -n clusters <cluster-name> -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.status}{" - "}{.message}{"\n"}{end}'
```
### Check HyperShift Operator Logs
```bash
# Operator logs (filtered for cluster)
oc logs -n hypershift -l app=operator --tail=200 | grep -i "<cluster-name>\|error\|failed\|denied"
# All operator logs
oc logs -n hypershift -l app=operator --tail=500
```
### EC2 Resources
```bash
# EC2 Instances
aws ec2 describe-instances \
--filters "Name=tag-key,Values=kubernetes.io/cluster/<cluster-name>" \
--query 'Reservations[*].Instances[*].[InstanceId,State.Name,InstanceType,Tags[?Key==`Name`].Value|[0]]' \
--output table
# VPCs
aws ec2 describe-vpcs \
--filters "Name=tag-key,Values=kubernetes.io/cluster/<cluster-name>" \
--query 'Vpcs[*].[VpcId,State,CidrBlock]' \
--output table
# Subnets
aws ec2 describe-subnets \
--filters "Name=tag-key,Values=kubernetes.io/cluster/<cluster-name>" \
--query 'Subnets[*].[SubnetId,State,CidrBlock,AvailabilityZone]' \
--output table
# Security Groups
aws ec2 describe-security-groups \
--filters "Name=tag-key,Values=kubernetes.io/cluster/<cluster-name>" \
--query 'SecurityGroups[*].[GroupId,GroupName,VpcId]' \
--output table
# NAT Gateways
aws ec2 describe-nat-gateways \
--filter "Name=tag-key,Values=kubernetes.io/cluster/<cluster-name>" \
--query 'NatGateways[*].[NatGatewayId,State,VpcId]' \
--output table
# Internet Gateways
aws ec2 describe-internet-gateways \
--filters "Name=tag-key,Values=kubernetes.io/cluster/<cluster-name>" \
--query 'InternetGateways[*].[InternetGatewayId,Attachments[0].VpcId]' \
--output table
# Elastic IPs
aws ec2 describe-addresses \
--filters "Name=tag-key,Values=kubernetes.io/cluster/<cluster-name>" \
--query 'Addresses[*].[AllocationId,PublicIp]' \
--output table
```
### Load Balancers
```bash
# Classic ELBs
aws elb describe-load-balancers \
--query "LoadBalancerDescriptions[?contains(LoadBalancerName, '<cluster-name>')].[LoadBalancerName,Scheme]" \
--output table
# ALB/NLBs
aws elbv2 describe-load-balancers \
--query "LoadBalancers[?contains(LoadBalancerName, '<cluster-name>')].[LoadBalancerName,Type,State.Code]" \
--output table
```
### S3 Buckets
```bash
# List S3 buckets for cluster
aws s3api list-buckets --query "Buckets[?contains(Name, '<cluster-name>')].Name" --output text
# Check bucket contents (if exists)
aws s3 ls s3://<bucket-name>/ --recursive
```
### IAM Resources
```bash
# IAM Roles
aws iam list-roles --query "Roles[?contains(RoleName, '<cluster-name>')].RoleName" --output text
# Instance Profiles
aws iam list-instance-profiles --query "InstanceProfiles[?contains(InstanceProfileName, '<cluster-name>')].InstanceProfileName" --output text
# OIDC Providers
aws iam list-open-id-connect-providers --query 'OpenIDConnectProviderList[*].Arn' --output text | tr '\t' '\n' | grep "<cluster-name>"
```
### Route53
```bash
# Hosted zones
aws route53 list-hosted-zones --query "HostedZones[?contains(Name, '<cluster-name>')].[Name,Id,Config.PrivateZone]" --output table
# All zones (for reference)
aws route53 list-hosted-zones --query 'HostedZones[*].[Name,Id,Config.PrivateZone]' --output table
```
## Cleanup Commands (Use with Caution)
### Force Remove HostedCluster Finalizer
**Warning**: This orphans AWS resources. Only use if resources are already deleted or you will clean them manually.
```bash
oc patch hostedcluster -n clusters <cluster-name> \
-p '{"metadata":{"finalizers":null}}' --type=merge
```
### Manual Resource Cleanup Order
If you need to manually delete AWS resources, delete in this order:
1. **EC2 Instances** - Terminate worker nodes
2. **NAT Gateways** - Delete and wait for deletion
3. **Elastic IPs** - Release after NAT gateways deleted
4. **Load Balancers** - Delete ELBs/ALBs/NLBs
5. **Security Groups** - Delete (may need multiple passes for dependencies)
6. **Subnets** - Delete all subnets
7. **Internet Gateways** - Detach and delete
8. **VPCs** - Delete VPC
9. **Route53** - Delete private hosted zone
10. **S3 Buckets** - Empty and delete OIDC bucket
11. **IAM Roles** - Delete roles and instance profiles
12. **OIDC Provider** - Delete OIDC provider
```bash
# Example: Delete all EC2 instances for a cluster
INSTANCE_IDS=$(aws ec2 describe-instances \
--filters "Name=tag-key,Values=kubernetes.io/cluster/<cluster-name>" \
--query 'Reservations[*].Instances[*].InstanceId' \
--output text)
aws ec2 terminate-instances --instance-ids $INSTANCE_IDS
```
## Find Orphaned Resources
```bash
# Find orphaned resources across all clusters
./.github/scripts/hypershift/find-orphaned-resources.sh
```
## Common Issues
### Issue: Cluster stuck in deletion
**Symptoms**: HostedCluster has deletionTimestamp but finalizer remains
**Debug Steps**:
1. Check HyperShift operator logs for errors
2. Check AWS resources still exist (VPCs, subnets, etc.)
3. Verify IAM permissions for resource deletion
4. Check if resources have dependencies blocking deletion
**Common Causes**:
- Security group has dependencies
- ENIs attached to resources
- Load balancer still has targets
- IAM policy doesn't allow deletion
### Issue: Permission denied during deletion
**Symptoms**: Operator logs show "Access Denied" or "UnauthorizedOperation"
**Debug Steps**:
1. Check which resource deletion is failing
2. Verify IAM policy has permission for that resource type
3. Check if tag conditions match (kubernetes.io/cluster/<cluster-name>)
**Fix**: Update IAM policies and re-run setup script:
```bash
./.github/scripts/hypershift/setup-hypershift-ci-credentials.sh
```
### Issue: AWS resources orphaned
**Symptoms**: HostedCluster deleted but AWS resources remain
**Debug Steps**:
1. Run this debug skill to find remaining resources
2. Note all resource IDs
3. Follow manual cleanup order above
## Integration with Other Skills
- **hypershift:cluster**: Create and destroy clusters
- **hypershift:quotas**: Check AWS quotas before debugging
- **k8s:health**: Check overall cluster health after debugging
- **k8s:logs**: Examine HyperShift operator logs in detail
- **k8s:pods**: Debug control plane pods on management cluster
## Related Documentation
- `.github/scripts/local-setup/README.md` - Local setup documentation