Free SKILL.md scraped from GitHub. Clone the repo or copy the file directly into your Claude Code skills directory.
npx versuz@latest install jmagly-aiwg-plugins-codex-sdlc-skills-troubleshooting-guidegit clone https://github.com/jmagly/aiwg.gitcp aiwg/SKILL.MD ~/.claude/skills/jmagly-aiwg-plugins-codex-sdlc-skills-troubleshooting-guide/SKILL.md---
namespace: aiwg
name: troubleshooting-guide
platforms: [all]
description: Generate troubleshooting documentation
commandHint:
argumentHint: 1. **System Overview and Architecture** [--interactive] [--guidance "text"]
category: documentation-changelogs
---
# Troubleshooting Guide Generator Command
Generate troubleshooting documentation
## Instructions
Follow this systematic approach to create troubleshooting guides: **$ARGUMENTS**
1. **System Overview and Architecture**
- Document the system architecture and components
- Map out dependencies and integrations
- Identify critical paths and failure points
- Create system topology diagrams
- Document data flow and communication patterns
2. **Common Issues Identification**
- Collect historical support tickets and issues
- Interview team members about frequent problems
- Analyze error logs and monitoring data
- Review user feedback and complaints
- Identify patterns in system failures
3. **Troubleshooting Framework**
- Establish systematic diagnostic procedures
- Create problem isolation methodologies
- Document escalation paths and procedures
- Set up logging and monitoring checkpoints
- Define severity levels and response times
4. **Diagnostic Tools and Commands**
```markdown
## Essential Diagnostic Commands
### System Health
```bash
# Check system resources
top # CPU and memory usage
df -h # Disk space
free -m # Memory usage
netstat -tuln # Network connections
# Application logs
tail -f /var/log/app.log
journalctl -u service-name -f
# Database connectivity
mysql -u user -p -e "SELECT 1"
psql -h host -U user -d db -c "SELECT 1"
```
```
5. **Issue Categories and Solutions**
**Performance Issues:**
```markdown
### Slow Response Times
**Symptoms:**
- API responses > 5 seconds
- User interface freezing
- Database timeouts
**Diagnostic Steps:**
1. Check system resources (CPU, memory, disk)
2. Review application logs for errors
3. Analyze database query performance
4. Check network connectivity and latency
**Common Causes:**
- Database connection pool exhaustion
- Inefficient database queries
- Memory leaks in application
- Network bandwidth limitations
**Solutions:**
- Restart application services
- Optimize database queries
- Increase connection pool size
- Scale infrastructure resources
```
6. **Error Code Documentation**
```markdown
## Error Code Reference
### HTTP Status Codes
- **500 Internal Server Error**
- Check application logs for stack traces
- Verify database connectivity
- Check environment variables
- **404 Not Found**
- Verify URL routing configuration
- Check if resources exist
- Review API endpoint documentation
- **503 Service Unavailable**
- Check service health status
- Verify load balancer configuration
- Check for maintenance mode
```
7. **Environment-Specific Issues**
- Document development environment problems
- Address staging/testing environment issues
- Cover production-specific troubleshooting
- Include local development setup problems
8. **Database Troubleshooting**
```markdown
### Database Connection Issues
**Symptoms:**
- "Connection refused" errors
- "Too many connections" errors
- Slow query performance
**Diagnostic Commands:**
```sql
-- Check active connections
SHOW PROCESSLIST;
-- Check database size
SELECT table_schema,
ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) AS 'DB Size in MB'
FROM information_schema.tables
GROUP BY table_schema;
-- Check slow queries
SHOW VARIABLES LIKE 'slow_query_log';
```
```
9. **Network and Connectivity Issues**
```markdown
### Network Troubleshooting
**Basic Connectivity:**
```bash
# Test basic connectivity
ping example.com
telnet host port
curl -v https://api.example.com/health
# DNS resolution
nslookup example.com
dig example.com
# Network routing
traceroute example.com
```
**SSL/TLS Issues:**
```bash
# Check SSL certificate
openssl s_client -connect example.com:443
curl -vI https://example.com
```
```
10. **Application-Specific Troubleshooting**
**Memory Issues:**
```markdown
### Out of Memory Errors
**Java Applications:**
```bash
# Check heap usage
jstat -gc [PID]
jmap -dump:format=b,file=heapdump.hprof [PID]
# Analyze heap dump
jhat heapdump.hprof
```
**Node.js Applications:**
```bash
# Monitor memory usage
node --inspect app.js
# Use Chrome DevTools for memory profiling
```
```
11. **Security and Authentication Issues**
```markdown
### Authentication Failures
**Symptoms:**
- 401 Unauthorized responses
- Token validation errors
- Session timeout issues
**Diagnostic Steps:**
1. Verify credentials and tokens
2. Check token expiration
3. Validate authentication service
4. Review CORS configuration
**Common Solutions:**
- Refresh authentication tokens
- Clear browser cookies/cache
- Verify CORS headers
- Check API key permissions
```
12. **Deployment and Configuration Issues**
```markdown
### Deployment Failures
**Container Issues:**
```bash
# Check container status
docker ps -a
docker logs container-name
# Check resource limits
docker stats
# Debug container
docker exec -it container-name /bin/bash
```
**Kubernetes Issues:**
```bash
# Check pod status
kubectl get pods
kubectl describe pod pod-name
kubectl logs pod-name
# Check service connectivity
kubectl get svc
kubectl port-forward pod-name 8080:8080
```
```
13. **Monitoring and Alerting Setup**
- Configure health checks and monitoring
- Set up log aggregation and analysis
- Implement alerting for critical issues
- Create dashboards for system metrics
- Document monitoring thresholds
14. **Escalation Procedures**
```markdown
## Escalation Matrix
### Severity Levels
**Critical (P1):** System down, data loss
- Immediate response required
- Escalate to on-call engineer
- Notify management within 30 minutes
**High (P2):** Major functionality impaired
- Response within 2 hours
- Escalate to senior engineer
- Provide hourly updates
**Medium (P3):** Minor functionality issues
- Response within 8 hours
- Assign to appropriate team member
- Provide daily updates
```
15. **Recovery Procedures**
- Document system recovery steps
- Create data backup and restore procedures
- Establish rollback procedures for deployments
- Document disaster recovery processes
- Test recovery procedures regularly
16. **Preventive Measures**
- Implement monitoring and alerting
- Set up automated health checks
- Create deployment validation procedures
- Establish code review processes
- Document maintenance procedures
17. **Knowledge Base Integration**
- Link to relevant documentation
- Reference API documentation
- Include links to monitoring dashboards
- Connect to team communication channels
- Integrate with ticketing systems
18. **Team Communication**
```markdown
## Communication Channels
### Immediate Response
- Slack: #incidents channel
- Phone: On-call rotation
- Email: alerts@company.com
### Status Updates
- Status page: status.company.com
- Twitter: @company_status
- Internal wiki: troubleshooting section
```
19. **Documentation Maintenance**
- Regular review and updates
- Version control for troubleshooting guides
- Feedback collection from users
- Integration with incident post-mortems
- Continuous improvement processes
20. **Self-Service Tools**
- Create diagnostic scripts and tools
- Build automated recovery procedures
- Implement self-healing systems
- Provide user-friendly diagnostic interfaces
- Create chatbot integration for common issues
**Advanced Troubleshooting Techniques:**
**Log Analysis:**
```bash
# Search for specific errors
grep -i "error" /var/log/app.log | tail -50
# Analyze log patterns
awk '{print $1}' access.log | sort | uniq -c | sort -nr
# Monitor logs in real-time
tail -f /var/log/app.log | grep -i "exception"
```
**Performance Profiling:**
```bash
# System performance
iostat -x 1
sar -u 1 10
vmstat 1 10
# Application profiling
strace -p [PID]
perf record -p [PID]
```
Remember to:
- Keep troubleshooting guides up-to-date
- Test all documented procedures regularly
- Collect feedback from users and improve guides
- Include screenshots and visual aids where helpful
- Make guides searchable and well-organized
## References
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC framework context and documentation standards
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/research-before-decision.md — Research-first for root cause analysis
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/diagram-generation.md — System topology and architecture diagram standards
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference