AI-Powered: Building a Route Monitoring System for Amazon VPC Direct Connect

Posted in AI by beixibaobao, 1 day ago

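New users can receive up to $200 in service credits

New Amazon Web Services (AWS) users can use the AWS Free Tier at no charge. You receive a $100 service credit when you sign up and can earn up to an additional $100 in credits while exploring key AWS services. The free plan lets you try AWS services for up to 6 months at no cost unless you choose the paid plan. The paid plan lets you scale your operations and gives you access to more than 150 AWS services.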

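Introduction

In enterprise cloud architectures, the stability of the network connection that Direct Connect provides between a VPC and the on-premises data center is critical. Traditional route monitoring tends to miss anomalies and relies on slow manual optimization. This article shows how to combine AI with native AWS services to build a monitoring system that tracks route status on a schedule, analyzes route aggregation intelligently, and raises alerts automatically, helping enterprises reduce operations costs and keep network communication stable.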

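Technical Architecture

The architecture is built on AWS. The on-premises data center propagates routes to the VPC over BGP through a Direct Connect Gateway. EventBridge triggers a Lambda function on a schedule, and the function calls the EC2 API to query the VPC route tables. When the route count exceeds a threshold, an alert goes out through an SNS topic with an email subscription, and CloudWatch Logs records the execution logs. Together these form a VPC Direct Connect route monitoring system that covers on-premises to cloud route exchange, scheduled monitoring, intelligent alerting, and log retention, so that routing stays stable and anomalies get a timely response.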

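  • Amazon EventBridge: triggers the monitoring task at a preset interval. The frequency is configurable; for example, an enterprise can monitor every 30 minutes or every hour to balance response time against cost.
  • AWS Lambda: hosts the core monitoring logic. It queries the VPC route tables, identifies the routes propagated from Direct Connect, and computes the statistics that drive the route status check.
  • Amazon SNS: receives anomaly information from the monitor and distributes alerts, emailing a notification that contains a summary, route details, and optimization suggestions so that operators can respond quickly.
  • AWS IAM: follows the principle of least privilege, granting Lambda only the permissions it needs to query route tables and publish SNS notifications, which keeps the security risk of the monitoring process to a minimum.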
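
The check that Lambda performs can be sketched offline in a few lines. The block below is a minimal illustration, not the deployed function: `SAMPLE_ROUTE_TABLES`, `count_propagated_routes`, and `alert_level` are hypothetical names, and the route table IDs and CIDRs are made up. The sample is shaped like the `RouteTables` list returned by EC2 `DescribeRouteTables`, where propagated routes carry the origin `EnableVgwRoutePropagation`.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical sample shaped like the 'RouteTables' list from EC2
# DescribeRouteTables; all IDs and CIDRs are invented for illustration.
SAMPLE_ROUTE_TABLES: List[Dict] = [
    {
        "RouteTableId": "rtb-aaa",
        "Routes": [
            {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local",
             "Origin": "CreateRouteTable"},
            {"DestinationCidrBlock": "192.168.0.0/24", "GatewayId": "vgw-123",
             "Origin": "EnableVgwRoutePropagation"},
            {"DestinationCidrBlock": "192.168.1.0/24", "GatewayId": "vgw-123",
             "Origin": "EnableVgwRoutePropagation"},
        ],
    },
    {
        "RouteTableId": "rtb-bbb",
        "Routes": [
            # The same propagated route seen in a second table is counted once.
            {"DestinationCidrBlock": "192.168.0.0/24", "GatewayId": "vgw-123",
             "Origin": "EnableVgwRoutePropagation"},
        ],
    },
]


def count_propagated_routes(route_tables: List[Dict]) -> int:
    """Count unique (destination, gateway) pairs learned through propagation."""
    seen: Set[Tuple[str, str]] = set()
    for table in route_tables:
        for route in table.get("Routes", []):
            if route.get("Origin") != "EnableVgwRoutePropagation":
                continue  # static and local routes are ignored
            dest = (route.get("DestinationCidrBlock")
                    or route.get("DestinationIpv6CidrBlock", "unknown"))
            seen.add((dest, route.get("GatewayId", "unknown")))
    return len(seen)


def alert_level(count: int, max_routes: int,
                warn_pct: int = 60, high_pct: int = 80) -> Optional[str]:
    """Map a route count to an alert level, or None when below both thresholds."""
    usage = count * 100 / max_routes
    if usage >= high_pct:
        return "HIGH"
    if usage >= warn_pct:
        return "MEDIUM"
    return None


count = count_propagated_routes(SAMPLE_ROUTE_TABLES)
print(count, alert_level(count, max_routes=3))  # prints: 2 MEDIUM
```

The default thresholds of 60 and 80 mirror the defaults of the template's `WarningThreshold1` and `WarningThreshold2` parameters; the tiny `max_routes=3` is only there to make the sample cross a threshold.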

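Prerequisite: Signing Up for AWS

Step.1 Visit the official website

Go to the AWS website and enter your email address and account name to start verification. Sign-up asks for the root user email address and an account name; AWS then emails a verification code, and once you enter it you set and confirm the root user password.
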
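Step.2 Choose an account plan

There are two account plans. Pick "Choose free plan" or "Choose paid plan" as needed to continue.

  • Free (6 months; suited to learning and experiments; includes up to $200 in credits; limited to selected services; when you exceed the limits or the period ends you can upgrade to the paid plan, otherwise the account is closed)
  • Paid (suited to production; the same $200 in credits; access to all services; credits apply broadly; once they are used up, billing is pay-as-you-go)
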
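Step.3 Fill in contact information

Choose your usage scenario, enter the contact's full name and phone number, select the country or region, complete the address and postal code, tick the box to accept the customer agreement, and click Continue to proceed.
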
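Step.4 Link your information

Link the required information: select the country or region, click "Send code" and enter the verification code you receive, accept the agreement, then click "Verify and continue" to proceed.
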
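Step.5 Phone verification

Enter a real mobile number, choose a verification method, and complete the security check. If you choose a voice call, the web page displays a 4-digit code at the same time; answer the call and enter that code when prompted, then fill in the verification information you receive. If nothing arrives within 10 minutes, go back and try again.
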
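Step.6 Support plan

The free plan automatically includes Basic support, while the paid plan asks you to choose a support plan. Every plan includes customer service and access to documentation and whitepapers. Pick the plan you need and click "Complete sign up"; if you need enterprise-grade support, review the paid upgrade options. Confirming the selection completes the sign-up process.
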
Amazon VPC Direct Connect Route Monitoring System

1. Download the CloudFormation template below and save it locally as a YAML file

AWSTemplateFormatVersion: '2010-09-09'
Description: 'VPC Direct Connect Route Monitor with AI-powered route aggregation analysis'
Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id
    Description: Select the VPC to monitor for DX propagated routes
  MaxRoutes:
    Type: Number
    Default: 100
    MinValue: 1
    MaxValue: 1000
    Description: Maximum number of routes limit (default 100)
  WarningThreshold1:
    Type: Number
    Default: 60
    MinValue: 1
    MaxValue: 100
    Description: First warning threshold percentage (default 60%)
  WarningThreshold2:
    Type: Number
    Default: 80
    MinValue: 1
    MaxValue: 100
    Description: Second warning threshold percentage (default 80%)
  MonitoringFrequency:
    Type: String
    Default: '1 hour'
    AllowedValues:
      - '5 minutes'
      - '10 minutes'
      - '30 minutes'
      - '1 hour'
      - '1 day'
    Description: How often to check the route count (default 1 hour)
  NotificationEmail:
    Type: String
    Description: Email address to receive alert notifications
    AllowedPattern: '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    ConstraintDescription: Please enter a valid email address
  EnableAIAnalysis:
    Type: String
    Default: 'false'
    AllowedValues:
      - 'true'
      - 'false'
    Description: Enable AI-powered route aggregation analysis using Amazon Bedrock Nova Lite
  BedrockRegion:
    Type: String
    Default: 'us-east-1'
    AllowedValues:
      - 'us-east-1'
      - 'us-west-2'
      - 'eu-west-1'
      - 'ap-southeast-1'
      - 'ap-northeast-1'
      - 'eu-north-1'
    Description: AWS region where Bedrock is available (default us-east-1)
Conditions:
  Is5Minutes: !Equals [!Ref MonitoringFrequency, '5 minutes']
  Is10Minutes: !Equals [!Ref MonitoringFrequency, '10 minutes']
  Is30Minutes: !Equals [!Ref MonitoringFrequency, '30 minutes']
  Is1Day: !Equals [!Ref MonitoringFrequency, '1 day']
  AIAnalysisEnabled: !Equals [!Ref EnableAIAnalysis, 'true']
Resources:
  AlertTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: !Sub '${AWS::StackName}-vpc-dx-route-alerts'
      DisplayName: 'VPC DX Route Monitor Alerts'
  EmailSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      TopicArn: !Ref AlertTopic
      Protocol: email
      Endpoint: !Ref NotificationEmail
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: VPCRouteMonitoringPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - ec2:DescribeRouteTables
                  - ec2:DescribeVpcs
                Resource: '*'
              - Effect: Allow
                Action:
                  - sns:Publish
                Resource: !Ref AlertTopic
        - !If
          - AIAnalysisEnabled
          - PolicyName: BedrockAccessPolicy
            PolicyDocument:
              Version: '2012-10-17'
              Statement:
                - Effect: Allow
                  Action:
                    - bedrock:InvokeModel
                  Resource: !Sub 'arn:aws:bedrock:${BedrockRegion}::foundation-model/amazon.nova-lite-v1:0'
          - !Ref 'AWS::NoValue'
  RouteMonitorFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Sub '${AWS::StackName}-vpc-dx-route-monitor'
      Runtime: python3.12
      Handler: index.lambda_handler
      Role: !GetAtt LambdaExecutionRole.Arn
      Timeout: 600
      MemorySize: 512
      Description: 'Monitor VPC DX propagated routes with optional AI analysis'
      Environment:
        Variables:
          VPC_ID: !Ref VpcId
          MAX_ROUTES: !Ref MaxRoutes
          WARNING_THRESHOLD_1: !Ref WarningThreshold1
          WARNING_THRESHOLD_2: !Ref WarningThreshold2
          SNS_TOPIC_ARN: !Ref AlertTopic
          ENABLE_AI_ANALYSIS: !Ref EnableAIAnalysis
          BEDROCK_REGION: !Ref BedrockRegion
      Code:
        ZipFile: |
          #!/usr/bin/env python3
          """
          AWS VPC DX route monitoring Lambda function (AI-enhanced version).
          Supports route aggregation analysis through Amazon Bedrock Nova Lite.
          """
          import boto3
          import json
          import os
          from datetime import datetime
          from typing import Dict, List, Any, Set, Tuple, Optional
          def lambda_handler(event, context):
              """Lambda entry point."""
              try:
                  vpc_id = os.environ.get('VPC_ID')
                  max_routes = int(os.environ.get('MAX_ROUTES', '100'))
                  warning_threshold_1 = int(os.environ.get('WARNING_THRESHOLD_1', '60'))
                  warning_threshold_2 = int(os.environ.get('WARNING_THRESHOLD_2', '80'))
                  sns_topic_arn = os.environ.get('SNS_TOPIC_ARN')
                  enable_ai_analysis = os.environ.get('ENABLE_AI_ANALYSIS', 'false').lower() == 'true'
                  bedrock_region = os.environ.get('BEDROCK_REGION', 'us-east-1')
                  if not vpc_id or not sns_topic_arn:
                      raise ValueError("Missing required environment variables")
                  # Query the routes propagated from DX
                  route_info = query_dx_propagated_routes(vpc_id)
                  if route_info is None:
                      send_error_notification(sns_topic_arn, vpc_id, "Route query failed")
                      return {'statusCode': 500, 'body': json.dumps({'error': 'Route query failed'})}
                  # Compute the usage percentage
                  current_routes = route_info['unique_dx_routes']
                  usage_percentage = (current_routes / max_routes) * 100
                  # AI analysis (if enabled)
                  ai_analysis = None
                  if enable_ai_analysis and current_routes > 0:
                      try:
                          ai_analysis = analyze_routes_with_ai(route_info['unique_routes'], bedrock_region)
                          print("AI analysis finished")
                      except Exception as e:
                          print(f"AI analysis failed: {e}")
                          ai_analysis = {"error": f"AI analysis failed: {str(e)}"}
                  # Check the alert thresholds
                  should_alert = False
                  alert_level = None
                  if usage_percentage >= warning_threshold_2:
                      should_alert = True
                      alert_level = 'HIGH'
                  elif usage_percentage >= warning_threshold_1:
                      should_alert = True
                      alert_level = 'MEDIUM'
                  # Send the alert
                  if should_alert:
                      send_alert_notification(
                          sns_topic_arn, vpc_id, current_routes, max_routes, 
                          usage_percentage, alert_level, route_info['unique_routes'], ai_analysis
                      )
                  result = {
                      'vpc_id': vpc_id,
                      'unique_dx_routes': current_routes,
                      'max_routes': max_routes,
                      'usage_percentage': round(usage_percentage, 2),
                      'alert_sent': should_alert,
                      'alert_level': alert_level,
                      'ai_analysis_enabled': enable_ai_analysis,
                      'ai_analysis_status': 'completed' if ai_analysis and 'error' not in ai_analysis else 'failed' if ai_analysis else 'disabled',
                      'timestamp': datetime.now().isoformat()
                  }
                  print(f"Monitoring result: {json.dumps(result, ensure_ascii=False)}")
                  return {'statusCode': 200, 'body': json.dumps(result, ensure_ascii=False)}
              except Exception as e:
                  error_msg = f"Lambda execution failed: {str(e)}"
                  print(error_msg)
                  try:
                      if 'sns_topic_arn' in locals():
                          send_error_notification(sns_topic_arn, vpc_id if 'vpc_id' in locals() else 'Unknown', str(e))
                  except Exception:
                      pass
                  return {'statusCode': 500, 'body': json.dumps({'error': error_msg})}
          def query_dx_propagated_routes(vpc_id: str) -> Optional[Dict[str, Any]]:
              """Query the route entries propagated into the VPC through DX."""
              try:
                  ec2_client = boto3.client('ec2')
                  # Paginate so that every route table in the VPC is inspected
                  paginator = ec2_client.get_paginator('describe_route_tables')
                  route_tables = []
                  for page in paginator.paginate(
                      Filters=[{'Name': 'vpc-id', 'Values': [vpc_id]}]
                  ):
                      route_tables.extend(page.get('RouteTables', []))
                  if not route_tables:
                      print(f"No route tables found for VPC {vpc_id}")
                      return None
                  # Deduplicate with a set
                  unique_dx_routes: Set[Tuple[str, str, str]] = set()
                  unique_routes_list = []
                  for rt in route_tables:
                      rt_id = rt['RouteTableId']
                      routes = rt.get('Routes', [])
                      for route in routes:
                          if route.get('Origin') == 'EnableVgwRoutePropagation':
                              destination = route.get('DestinationCidrBlock') or route.get('DestinationIpv6CidrBlock', 'unknown')
                              target_type, target_value = get_route_target(route)
                              route_key = (destination, target_type, target_value)
                              if route_key not in unique_dx_routes:
                                  unique_dx_routes.add(route_key)
                                  unique_routes_list.append({
                                      'route_table_id': rt_id,
                                      'destination': destination,
                                      'target_type': target_type,
                                      'target_value': target_value,
                                      'state': route.get('State', 'unknown')
                                  })
                  return {
                      'unique_dx_routes': len(unique_dx_routes),
                      'unique_routes': unique_routes_list
                  }
              except Exception as e:
                  print(f"Failed to query DX routes: {e}")
                  return None
          def analyze_routes_with_ai(routes: List[Dict], bedrock_region: str) -> Optional[Dict]:
              """Analyze route aggregation with Amazon Bedrock Nova Lite."""
              try:
                  bedrock_client = boto3.client('bedrock-runtime', region_name=bedrock_region)
                  # Prepare the route data
                  route_data = []
                  for route in routes:
                      route_data.append({
                          'destination': route['destination'],
                          'target_type': route['target_type'],
                          'target_value': route['target_value'],
                          'state': route['state']
                      })
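                  # Build the AI prompt
                  prompt = f"""You are an AWS networking expert. Analyze the following routes propagated over Direct Connect and provide route aggregation recommendations.
          Current route list ({len(routes)} routes in total):
          {json.dumps(route_data, indent=2, ensure_ascii=False)}
          Please analyze and provide the following:
          1. Analysis of route aggregation opportunities
          2. Specific CIDR aggregation recommendations
          3. Expected reduction in route count
          4. Implementation advice and caveats
          5. Risk assessment
          Answer in English, in a clear and readable format."""
                  # Call Nova Lite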
                  request_body = {
                      "messages": [
                          {
                              "role": "user",
                              "content": [
                                  {
                                      "text": prompt
                                  }
                              ]
                          }
                      ],
                      "inferenceConfig": {
                          "max_new_tokens": 4000,
                          "temperature": 0.1
                      }
                  }
                  response = bedrock_client.invoke_model(
                      modelId='amazon.nova-lite-v1:0',
                      body=json.dumps(request_body)
                  )
                  response_body = json.loads(response['body'].read())
                  ai_analysis = response_body['output']['message']['content'][0]['text']
                  return {
                      'analysis': ai_analysis,
                      'route_count': len(routes),
                      'analysis_timestamp': datetime.now().isoformat(),
                      'model_used': 'Amazon Nova Lite'
                  }
              except Exception as e:
                  print(f"AI analysis failed: {e}")
                  return {"error": str(e)}
          def get_route_target(route: Dict) -> tuple:
              """Return the route target type and value."""
              if 'VirtualPrivateGatewayId' in route:
                  return 'vpn-gateway', route['VirtualPrivateGatewayId']
              elif 'TransitGatewayId' in route:
                  return 'transit-gateway', route['TransitGatewayId']
              elif 'DirectConnectGatewayId' in route:
                  return 'dx-gateway', route['DirectConnectGatewayId']
              elif 'GatewayId' in route:
                  return 'gateway', route['GatewayId']
              elif 'NatGatewayId' in route:
                  return 'nat-gateway', route['NatGatewayId']
              elif 'NetworkInterfaceId' in route:
                  return 'network-interface', route['NetworkInterfaceId']
              elif 'InstanceId' in route:
                  return 'instance', route['InstanceId']
              else:
                  return 'unknown', 'unknown'
          def send_alert_notification(sns_topic_arn: str, vpc_id: str, current_routes: int, 
                                    max_routes: int, usage_percentage: float, alert_level: str, 
                                    routes: List[Dict], ai_analysis: Optional[Dict] = None):
              """Send the alert notification (including the AI analysis)."""
              try:
                  sns_client = boto3.client('sns')
                  subject = f"🚨 VPC DX route alert - {alert_level} level ({usage_percentage:.1f}%)"
                  if ai_analysis and 'error' not in ai_analysis:
                      subject += " [with AI analysis]"
                  message_lines = [
                      "VPC Direct Connect Route Monitoring Alert",
                      "",
                      "📊 Monitoring summary:",
                      f"  VPC ID: {vpc_id}",
                      f"  Current DX propagated routes: {current_routes}",
                      f"  Maximum route limit: {max_routes}",
                      f"  Usage: {usage_percentage:.2f}%",
                      f"  Alert level: {alert_level}",
                      f"  Checked at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S UTC')}",
                      "",
                      "🔍 DX propagated route details:"
                  ]
                  if routes:
                      message_lines.append(f"  {'Destination':<18} {'Target type':<15} {'Target value':<25}")
                      message_lines.append(f"  {'-'*18} {'-'*15} {'-'*25}")
                      for route in routes[:15]:  # Cap the listing to leave room for the AI analysis
                          message_lines.append(
                              f"  {route['destination']:<18} {route['target_type']:<15} {route['target_value']:<25}"
                          )
                      if len(routes) > 15:
                          message_lines.append(f"  ... {len(routes) - 15} more routes not shown")
                  # Append the AI analysis result
                  if ai_analysis:
                      message_lines.extend([
                          "",
                          "🤖 AI route aggregation analysis (Amazon Nova Lite):",
                          "=" * 60
                      ])
                      if 'error' in ai_analysis:
                          message_lines.append(f"AI analysis failed: {ai_analysis['error']}")
                      else:
                          # Split the AI analysis into lines, keeping blank lines as separators
                          analysis_lines = ai_analysis['analysis'].split('\n')
                          for line in analysis_lines:
                              if line.strip():
                                  message_lines.append(f"{line}")
                              else:
                                  message_lines.append("")
                          message_lines.extend([
                              "",
                              f"Analyzed at: {ai_analysis.get('analysis_timestamp', 'Unknown')}",
                              f"Model used: {ai_analysis.get('model_used', 'Unknown')}"
                          ])
                  message_lines.extend([
                      "",
                      "⚠️  Recommended actions:",
                      "  - Check for unnecessary route propagation",
                      "  - Consider optimizing route aggregation",
                      "  - To raise the route limit, contact AWS Support"
                  ])
                  if ai_analysis and 'error' not in ai_analysis:
                      message_lines.append("  - Refer to the AI analysis above when optimizing routes")
                  message_lines.append("\nThis message was generated automatically by AWS Lambda")
                  sns_client.publish(
                      TopicArn=sns_topic_arn,
                      Subject=subject,
                      Message="\n".join(message_lines)
                  )
                  print("Alert notification sent")
              except Exception as e:
                  print(f"Failed to send alert notification: {e}")
          def send_error_notification(sns_topic_arn: str, vpc_id: str, error_message: str):
              """Send an error notification."""
              try:
                  sns_client = boto3.client('sns')
                  subject = f"❌ VPC DX route monitoring error - {vpc_id}"
                  message = f"""VPC Direct Connect route monitoring execution error
          Error details:
            VPC ID: {vpc_id}
            Error message: {error_message}
            Occurred at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S UTC')}
          Please check the Lambda function configuration and permissions."""
                  sns_client.publish(
                      TopicArn=sns_topic_arn,
                      Subject=subject,
                      Message=message
                  )
              except Exception as e:
                  print(f"Failed to send error notification: {e}")
  ScheduleRule:
    Type: AWS::Events::Rule
    Properties:
      Name: !Sub '${AWS::StackName}-vpc-dx-route-monitor-schedule'
      Description: 'Schedule trigger for VPC DX route monitoring'
      ScheduleExpression: !If 
        - Is5Minutes
        - 'rate(5 minutes)'
        - !If 
          - Is10Minutes
          - 'rate(10 minutes)'
          - !If 
            - Is30Minutes
            - 'rate(30 minutes)'
            - !If 
              - Is1Day
              - 'rate(1 day)'
              - 'rate(1 hour)'
      State: ENABLED
      Targets:
        - Arn: !GetAtt RouteMonitorFunction.Arn
          Id: 'RouteMonitorTarget'
  LambdaInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref RouteMonitorFunction
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt ScheduleRule.Arn
  LambdaLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub '/aws/lambda/${RouteMonitorFunction}'
      RetentionInDays: 14
Outputs:
  LambdaFunctionName:
    Description: 'Lambda function name for VPC DX route monitoring'
    Value: !Ref RouteMonitorFunction
  SNSTopicArn:
    Description: 'SNS topic ARN for alert notifications'
    Value: !Ref AlertTopic
  MonitoredVPC:
    Description: 'VPC ID being monitored'
    Value: !Ref VpcId
  AIAnalysisEnabled:
    Description: 'Whether AI analysis is enabled'
    Value: !Ref EnableAIAnalysis
  BedrockRegion:
    Description: 'Bedrock region for AI analysis'
    Value: !Ref BedrockRegion
    Condition: AIAnalysisEnabled

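2. In the CloudFormation console, upload the CloudFormation.yaml file you just created

3. Fill in the required monitoring parameters: enter the email address that should receive alerts, select the VPC that has already learned the DX routes, and click Next
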
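4. Tick the acknowledgement checkbox to grant the minimal permissions the system needs, then click Next to finish deploying the stack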
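
For readers who prefer scripting over the console, steps 2 to 4 can be approximated with boto3. This is a sketch under assumptions rather than the article's own procedure: `build_parameters` and `deploy_stack` are hypothetical helpers, the VPC ID and email address are placeholders, and `deploy_stack` needs boto3 plus valid AWS credentials, so it is defined here but not called. Passing `CAPABILITY_IAM` is the programmatic counterpart of ticking the acknowledgement checkbox, since the template creates an IAM role.

```python
from typing import Dict, List


def build_parameters(values: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert a plain dict into the Parameters list that CreateStack expects."""
    return [{"ParameterKey": k, "ParameterValue": str(v)} for k, v in values.items()]


def deploy_stack(stack_name: str, template_path: str, values: Dict[str, str]) -> str:
    """Create the stack and return its ID. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so build_parameters works without boto3

    with open(template_path, encoding="utf-8") as f:
        template_body = f.read()
    cfn = boto3.client("cloudformation")
    response = cfn.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Parameters=build_parameters(values),
        # Acknowledge that the template creates an IAM role for Lambda.
        Capabilities=["CAPABILITY_IAM"],
    )
    return response["StackId"]


# Placeholder values; replace them with your own VPC ID and mailbox.
params = build_parameters({
    "VpcId": "vpc-0123456789abcdef0",
    "NotificationEmail": "ops@example.com",
    "MonitoringFrequency": "30 minutes",
    "EnableAIAnalysis": "false",
})
print(params[0])
```

Parameters that are left out (for example `MaxRoutes` or the two thresholds) fall back to the defaults declared in the template.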

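5. Once deployment completes, the system starts working automatically. When EnableAIAnalysis is set to true, the large language model adds route optimization suggestions to each alert. Email alerts are delivered only after the recipient confirms the subscription message that SNS sends to the address entered in step 3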
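
To make "route aggregation" concrete, the sketch below collapses adjacent prefixes with Python's standard `ipaddress` module. It is an illustration of the idea only and is not part of the deployed system: the `aggregate` helper and the sample CIDRs are assumptions. `ipaddress.collapse_addresses` performs lossless aggregation, merging only prefixes that combine exactly, whereas the model may also propose broader supernets that need human review. All inputs must share one IP version.

```python
import ipaddress
from typing import List


def aggregate(cidrs: List[str]) -> List[str]:
    """Losslessly merge adjacent or overlapping prefixes of one IP version."""
    networks = [ipaddress.ip_network(c) for c in cidrs]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]


# Hypothetical on-premises prefixes advertised over Direct Connect.
routes = [
    "192.168.0.0/24",
    "192.168.1.0/24",
    "192.168.2.0/24",
    "192.168.3.0/24",
    "10.1.0.0/16",
]
summary = aggregate(routes)
print(f"{len(routes)} routes -> {len(summary)}: {summary}")
# prints: 5 routes -> 2: ['10.1.0.0/16', '192.168.0.0/22']
```

Advertising the single /22 from the on-premises router instead of four /24 prefixes is the kind of change that lowers the propagated route count the monitor tracks.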

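About Amazon VPC Direct Connect

AWS Direct Connect is a dedicated network service from AWS. It establishes a private connection between an on-premises data center and Amazon VPC over a physical dedicated line or a partner network, replacing the public internet and giving hybrid cloud architectures a fast, stable, and secure network path.

  • Low latency and high stability: bypassing the public internet reduces jitter and latency fluctuation, which keeps real-time workloads such as financial trading and data synchronization running without interruption
  • Security and cost optimization: the private link lowers data transfer risk, and fixed-bandwidth pricing costs less over the long term than frequent high-volume transfer over the public internet
  • Flexible scaling and integration: bandwidth from 1 to 100 Gbps can be scaled on demand, and the link can reach multiple VPCs, cross-Region resources, and AWS services, which suits complex hybrid cloud architectures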

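Summary

The AI-powered Amazon VPC Direct Connect route monitoring system described in this article uses a serverless architecture: EventBridge triggers the check on a schedule, Lambda performs the core analysis, and SNS delivers the alerts. It monitors route status, warns when thresholds are crossed, and integrates an Amazon Bedrock model to provide route optimization suggestions. Deployed automatically through CloudFormation, it balances security with flexibility, lowers operations costs, and helps keep hybrid cloud network connections stable and reliable.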
