Amazon DevOps Guru

Amazon DevOps Guru is a machine learning service for AIOps that analyzes an application's operational data to detect anomalies and operational issues automatically. It learns the application's normal behavior from CloudWatch metrics, CloudTrail events, AWS Config changes, and X-Ray traces, then raises an alert when it finds something unusual, along with specific remediation recommendations.

The service discovers application resources automatically through CloudFormation stacks or resource tags, so little configuration is needed. It supports proactive insights (warnings before an outage) and reactive insights (analysis after an anomaly occurs), which helps reduce Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).

AWS Docs: https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html

Architecture

Key Features

Anomaly Detection Across CloudWatch Metrics

Analyzes all of the application's CloudWatch metrics, including custom metrics, without requiring you to define thresholds. The ML model learns seasonal patterns, weekly cycles, and growth trends to separate normal variation from real anomalies.
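To make the idea of a learned seasonal baseline concrete, here is a toy sketch. It is not DevOps Guru's actual model, which this document does not describe. It swaps in a simple per-phase z-score, and the function name `seasonal_anomalies`, the `period`, and the threshold are illustrative assumptions.

```python
from statistics import mean, stdev


def seasonal_anomalies(values, period, z_threshold=3.0):
    """Flag indices whose value deviates from earlier samples at the same phase.

    Toy illustration only: each point is compared with the history of points
    that sit at the same position in the cycle (for example, the same hour
    of the day), so a regular peak is not treated as an anomaly.
    """
    flagged = []
    for i, value in enumerate(values):
        # Earlier samples at the same position in the cycle.
        history = values[i % period:i:period]
        if len(history) < 3:
            continue  # not enough history to form a baseline yet
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            if value != mu:
                flagged.append(i)
            continue
        if abs(value - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # A metric that peaks at 50 once per 4-sample cycle.
    series = [10, 50, 10, 10] * 5
    series[18] = 45  # unusual value at a normally quiet point in the cycle
    print(seasonal_anomalies(series, period=4))
```

A static threshold of 40 would fire on every regular peak in this series. The per-phase baseline flags only the out-of-pattern sample.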

AWS CloudTrail Integration

Analyzes CloudTrail events to detect operational changes that may affect the application, such as configuration changes, security group updates, or IAM policy modifications that correlate with performance degradation.

AWS Config Integration

Tracks resource configuration changes through AWS Config to identify which change triggered an anomaly. This supports root cause analysis by showing whether a recent deployment or configuration change caused the problem.

Auto-Discovers App Resources

Discovers the resources that belong to an application automatically through CloudFormation stacks, AWS tags, or the entire AWS account, so you do not need to maintain a resource list by hand. Coverage updates as new resources are deployed.

Proactive Insights (Before Outage)

Alerts when it detects signals that a problem is likely to occur, such as rising memory pressure, a connection pool approaching its limit, or disk space running low, so the ops team can act before downtime occurs.

Reactive Insights (After Anomaly)

Runs once an anomaly has occurred: it correlates metrics across services to find the root cause and shows a timeline of related events, helping the ops team understand and resolve the issue faster.

Root Cause Analysis with Recommendations

Provides recommendations specific to the detected problem, such as "add a read replica for RDS", "raise the Lambda concurrency limit", or "investigate slow queries in Aurora", with links to AWS documentation and runbooks.

Notification Integration (SNS, PagerDuty, OpsGenie)

Sends notifications to SNS topics, which can fan out to email, SMS, PagerDuty, OpsGenie, Slack, or other ticketing systems. Notifications can be filtered by severity or message type.
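As a rough sketch of notification filtering, the helper below builds the `Config` argument for boto3's `add_notification_channel`. The helper name `build_sns_channel_config` and the topic ARN are hypothetical. The `Filters` keys (`Severities`, `MessageTypes`) are believed to match the current API, but check them against the boto3 reference before relying on this.

```python
ALLOWED_SEVERITIES = {"LOW", "MEDIUM", "HIGH"}


def build_sns_channel_config(topic_arn, severities=None, message_types=None):
    """Build the Config dict for add_notification_channel (hypothetical helper)."""
    config = {"Sns": {"TopicArn": topic_arn}}
    filters = {}
    if severities:
        unknown = set(severities) - ALLOWED_SEVERITIES
        if unknown:
            raise ValueError(f"unknown severities: {sorted(unknown)}")
        filters["Severities"] = sorted(severities)
    if message_types:
        filters["MessageTypes"] = list(message_types)
    if filters:
        config["Filters"] = filters
    return config


if __name__ == "__main__":
    config = build_sns_channel_config(
        "arn:aws:sns:ap-southeast-1:123456789012:devops-alerts",  # placeholder ARN
        severities=["HIGH", "MEDIUM"],
        message_types=["NEW_INSIGHT", "SEVERITY_UPGRADED"],
    )
    print(config)
    # With AWS credentials configured, this could be passed as:
    # boto3.client("devops-guru").add_notification_channel(Config=config)
```

Keeping the builder separate from the API call makes the filter logic easy to check without touching an AWS account.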

Cost Estimation for CloudFormation Stacks

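Estimates the monthly cost of having DevOps Guru analyze the resources in a CloudFormation stack, so you can gauge the charge before widening coverage. This is an estimate of the DevOps Guru bill itself. It does not detect spending anomalies in your AWS account.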

Supports EC2, ECS, EKS, Lambda, RDS

Covers the core AWS services used in application stacks across the compute, container, serverless, and database layers, giving a complete view of application health.

Installation and Setup

Enabling via the Console

Open the AWS Console > Amazon DevOps Guru
Choose the resource coverage: CloudFormation stacks, tags, or all account resources
Choose an SNS topic for notifications
DevOps Guru starts analyzing during the first 24 hours (the learning period)
View insights on the Dashboard

Install the SDK

pip install boto3

Required IAM Permissions

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "devops-guru:*",
        "cloudwatch:GetMetricData",
        "cloudwatch:DescribeAlarms",
        "cloudtrail:LookupEvents",
        "config:GetResourceConfigHistory",
        "xray:GetServiceGraph",
        "sns:Publish"
      ],
      "Resource": "*"
    }
  ]
}

Defining Resource Coverage with Tags

# Tag the resources to monitor (the tag key must start with "devops-guru-")
aws ec2 create-tags \
  --resources i-1234567890abcdef0 \
  --tags Key=devops-guru-enabled,Value=true

# Or configure coverage by CloudFormation stack name
aws devops-guru update-resource-collection \
  --action ADD \
  --resource-collection '{
    "CloudFormation": {
      "StackNames": ["my-production-stack", "my-api-stack"]
    }
  }'
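For tag-based coverage, the same `update-resource-collection` call takes a `Tags` entry instead of `CloudFormation`. The sketch below builds that payload for boto3. The helper name `build_tag_collection` is hypothetical. It assumes the AWS user guide's rule that the tag key starts with the prefix `devops-guru-` (case-insensitive), so confirm that against the current documentation.

```python
TAG_KEY_PREFIX = "devops-guru-"


def build_tag_collection(app_boundary_key, tag_values=("*",)):
    """Build a ResourceCollection payload that selects resources by tag.

    Hypothetical helper: rejects keys without the expected prefix so a
    mistyped tag key fails locally instead of silently matching nothing.
    """
    if not app_boundary_key.lower().startswith(TAG_KEY_PREFIX):
        raise ValueError(
            f"tag key {app_boundary_key!r} must start with {TAG_KEY_PREFIX!r}"
        )
    return {
        "Tags": [
            {"AppBoundaryKey": app_boundary_key, "TagValues": list(tag_values)}
        ]
    }


if __name__ == "__main__":
    collection = build_tag_collection("devops-guru-enabled", ["true"])
    print(collection)
    # With AWS credentials configured, this could be applied as:
    # boto3.client("devops-guru").update_resource_collection(
    #     Action="ADD", ResourceCollection=collection
    # )
```

Passing `"*"` as the only tag value is meant to cover every resource that carries the key, whatever its value.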

Usage

Viewing Insights and Anomalies

import boto3
from datetime import datetime, timedelta, timezone

devops_guru = boto3.client('devops-guru', region_name='ap-southeast-1')
now = datetime.now(timezone.utc)  # timezone-aware replacement for utcnow()

# List reactive insights of any status that started in the last 7 days
insights_response = devops_guru.list_insights(
    StatusFilter={
        'Any': {
            'Type': 'REACTIVE',
            'StartTimeRange': {
                'FromTime': now - timedelta(days=7),
                'ToTime': now
            }
        }
    }
)

print("Reactive Insights (last 7 days):")
for insight in insights_response.get('ReactiveInsights', []):
    print(f"\nInsight ID: {insight['Id']}")
    print(f"Name: {insight['Name']}")
    print(f"Severity: {insight['Severity']}")
    print(f"Status: {insight['Status']}")
    print(f"Start Time: {insight['InsightTimeRange']['StartTime']}")

# List proactive insights (early warnings)
proactive_response = devops_guru.list_insights(
    StatusFilter={
        'Any': {
            'Type': 'PROACTIVE',
            'StartTimeRange': {
                'FromTime': now - timedelta(days=1),
                'ToTime': now + timedelta(days=3)
            }
        }
    }
)

print("\nProactive Insights (early warnings):")
for insight in proactive_response.get('ProactiveInsights', []):
    print(f"\nInsight: {insight['Name']}")
    print(f"Severity: {insight['Severity']}")
Viewing Root Cause Analysis

# Get the details of a specific insight (placeholder ID)
insight_detail = devops_guru.describe_insight(
    Id='REACTIVE_123456789'
)

insight = insight_detail.get('ReactiveInsight', {})
print(f"Insight: {insight.get('Name', 'N/A')}")
print(f"Description: {insight.get('Description', 'N/A')}")

# List the anomalies associated with the insight
anomalies = devops_guru.list_anomalies_for_insight(
    InsightId='REACTIVE_123456789',
    MaxResults=20
)

print("\nRelated Anomalies:")
for anomaly in anomalies.get('ReactiveAnomalies', []):
    print(f"\n- Anomaly ID: {anomaly['Id']}")
    print(f"  Severity: {anomaly['Severity']}")

    # Show the affected resources
    for resource in anomaly.get('AnomalyResources', []):
        print(f"  Affected Resource: {resource['Name']} ({resource['Type']})")

    # Show the CloudWatch metrics behind the anomaly
    sources = anomaly.get('SourceDetails', {})
    for metric in sources.get('CloudWatchMetrics', []):
        print(f"  Metric: {metric['Namespace']}/{metric['MetricName']}")
        print(f"  Stat: {metric['Stat']}, Period: {metric['Period']}s")

Viewing Recommendations

# Get the recommendations for an insight (placeholder ID)
recommendations = devops_guru.list_recommendations(
    InsightId='REACTIVE_123456789',
    Locale='EN_US'
)

print("Recommendations:")
for rec in recommendations.get('Recommendations', []):
    print(f"\nCategory: {rec.get('Category', 'N/A')}")
    print(f"Description: {rec.get('Description', 'N/A')}")

    # Link to the relevant documentation (a single URL string)
    if rec.get('Link'):
        print(f"  Reference: {rec['Link']}")

    # Events related to this recommendation
    for event in rec.get('RelatedEvents', []):
        print(f"  Related Event: {event['Name']}")

Setting Up Notifications with SNS

# Register an SNS topic for notifications (the ARN is a placeholder)
devops_guru.add_notification_channel(
    Config={
        'Sns': {
            'TopicArn': 'arn:aws:sns:ap-southeast-1:123456789012:devops-alerts'
        }
    }
)

# Example Lambda handler that receives DevOps Guru notifications via SNS.
# send_to_pagerduty and send_to_slack_channel are placeholder helpers
# that you need to implement yourself.
import json

def handle_devops_guru_notification(event, context):
    for record in event['Records']:
        message = json.loads(record['Sns']['Message'])

        insight_id = message.get('insightId')
        insight_type = message.get('insightType')  # REACTIVE or PROACTIVE
        # Normalize case so the comparison below does not depend on it
        severity = (message.get('insightSeverity') or '').upper()
        account = message.get('accountId')
        region = message.get('region')

        print("DevOps Guru Alert:")
        print(f"  Account/Region: {account}/{region}")
        print(f"  Type: {insight_type}")
        print(f"  Severity: {severity}")
        print(f"  Insight ID: {insight_id}")

        # Page on-call for higher severities; always post to Slack
        if severity in ['HIGH', 'MEDIUM']:
            send_to_pagerduty(insight_id, severity, insight_type)

        send_to_slack_channel('#devops-alerts', message)
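The routing decision in the handler can be pulled into a pure function, which makes it testable without PagerDuty or Slack credentials. This is a minimal sketch: `route_notifications` and the target labels are hypothetical, and the message field names (`insightId`, `insightSeverity`) are assumed to match the handler above.

```python
import json

PAGE_SEVERITIES = {"HIGH", "MEDIUM"}


def route_notifications(event):
    """Return (insight_id, severity, targets) for each SNS record in the event."""
    routed = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        # Normalize case so "medium" and "MEDIUM" are treated the same.
        severity = str(message.get("insightSeverity", "")).upper()
        targets = ["slack"]
        if severity in PAGE_SEVERITIES:
            targets.insert(0, "pagerduty")
        routed.append((message.get("insightId"), severity, targets))
    return routed


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {"Sns": {"Message": json.dumps(
                {"insightId": "abc", "insightSeverity": "medium"})}},
            {"Sns": {"Message": json.dumps(
                {"insightId": "def", "insightSeverity": "low"})}},
        ]
    }
    for insight_id, severity, targets in route_notifications(sample_event):
        print(insight_id, severity, targets)
```

The Lambda handler would then only need to loop over the returned tuples and call the real delivery helpers.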

Pricing (Estimates in Thai Baht)

Item	Price (USD)	Price (THB, at 1 USD = 35 THB)
Resources 1-200	$0.0028/resource/hour	~0.098 THB/resource/hour
Resources beyond 200	$0.0025/resource/hour	~0.088 THB/resource/hour
DevOps Guru for RDS	$0.018/vCPU/hour	~0.63 THB/vCPU/hour

Free Tier: free for the first month, up to 7 resources

Cost example: an application with 50 resources (EC2, Lambda, RDS, etc.)

50 resources x $0.0028 x 24 h x 30 days = $100.80/month (~3,528 THB/month)

Cost example: a large system with 300 resources

200 x $0.0028 x 720 h = $403.20
100 x $0.0025 x 720 h = $180.00
Total: $583.20/month (~20,412 THB/month)
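The arithmetic in these examples can be reproduced with a small calculator. This is a sketch under the rates and exchange rate quoted in this section, which are estimates rather than authoritative prices. The function names are illustrative.

```python
HOURS_PER_MONTH = 24 * 30   # 720 hours, as used in the examples
THB_PER_USD = 35            # exchange rate assumed in this section
TIER_LIMIT = 200            # resources billed at the first-tier rate
TIER1_RATE = 0.0028         # USD per resource-hour, first tier
TIER2_RATE = 0.0025         # USD per resource-hour, beyond the first tier


def monthly_cost_usd(resources, hours=HOURS_PER_MONTH):
    """Estimated monthly cost in USD using the two-tier rates above."""
    first_tier = min(resources, TIER_LIMIT)
    second_tier = max(resources - TIER_LIMIT, 0)
    return round((first_tier * TIER1_RATE + second_tier * TIER2_RATE) * hours, 2)


def to_thb(usd):
    """Convert USD to THB at the assumed fixed rate."""
    return round(usd * THB_PER_USD, 2)


if __name__ == "__main__":
    for count in (50, 300):
        usd = monthly_cost_usd(count)
        print(f"{count} resources: ${usd:.2f}/month (~{to_thb(usd):,.0f} THB/month)")
```

Changing `THB_PER_USD` or the rate constants is enough to rerun the estimate when prices or the exchange rate move.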

Best Suited For

DevOps/SRE teams that want proactive monitoring without setting threshold alerts by hand
Organizations running complex microservices or distributed systems with many services
Production systems that need high availability and a lower MTTR
Small ops teams responsible for large infrastructure
Companies that want to reduce alert fatigue caused by too many threshold alerts
Applications that scale up and down frequently, where static thresholds do not fit

Works With Other AWS Services

Amazon CloudWatch - primary source of metrics for anomaly detection
AWS CloudTrail - analysis of API calls and configuration changes
AWS Config - tracking of resource configuration changes
AWS X-Ray - distributed tracing analysis for microservices
Amazon SNS - delivery of notifications to on-call teams
AWS CloudFormation - scoping of monitoring by stack
Amazon EventBridge - routing of DevOps Guru events to downstream systems
AWS Systems Manager OpsCenter - creation of OpsItems for operational issues

Example Use Cases

1. E-Commerce: Preventing Downtime During a Flash Sale

An e-commerce company uses DevOps Guru to monitor its order processing system, which comprises more than 80 resources across Lambda functions, RDS Aurora, SQS, and ElastiCache. Six hours before a flash sale, the service raised a proactive insight that the RDS connection count was trending upward abnormally. The ops team added RDS Proxy and tuned connection pooling before the traffic surge, and there was no downtime even though traffic grew tenfold.

2. Fintech App: Finding the Root Cause Quickly

A mobile banking app saw API latency jump suddenly from 200 ms to 3,000 ms. DevOps Guru correlated metrics from API Gateway, Lambda, RDS, and ElastiCache and found that ElastiCache had run out of memory, which drove up the cache miss rate and sent a sudden surge of queries to RDS. The recommendation was to adjust the eviction policy. The team fixed the issue in 15 minutes, compared with traditional debugging that could have taken hours.

3. SaaS Platform: Analyzing a Cost Anomaly

A SaaS company with several customer tiers uses DevOps Guru to monitor its production CloudFormation stack. The service flagged Lambda invocations running at five times the usual rate late at night, which did not match the normal pattern. The team traced the spike to an infinite-loop bug in an event processing function that kept re-triggering itself, and fixed it before the AWS bill ballooned, saving more than 50,000 THB in Lambda charges that month.
