AI-Powered: Building a Route Monitoring System for Amazon VPC Direct Connect
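New users can receive up to $200 in service credits
New Amazon Web Services customers can use the AWS Free Tier at no cost. You receive $100 in service credits when you sign up and can earn up to an additional $100 in credits while exploring key AWS services. The free plan lets you try AWS services for up to 6 months at no charge unless you choose the paid plan. The paid plan lets you scale your operations and gives you access to more than 150 AWS services.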
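Introduction
In enterprise cloud architectures, the stability of the network connection between a VPC and an on-premises data center over Direct Connect is critical. Traditional route monitoring tends to miss anomalies and relies on slow manual optimization. This article shows how to combine AI with native Amazon Web Services (AWS) services to build a monitoring system that checks route status at regular intervals, analyzes route aggregation opportunities, and sends alerts automatically, helping enterprises reduce operating costs and keep network communication stable.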
Technical Architecture

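The architecture runs on AWS. The on-premises data center exchanges routes with the VPC over BGP through a Direct Connect gateway. EventBridge triggers a Lambda function on a schedule; the function calls the EC2 API to query the VPC route tables and, when the route count exceeds a threshold, publishes an alert to an SNS topic that delivers it to email subscribers. CloudWatch Logs keeps the execution logs. Together these form a VPC Direct Connect route monitoring pipeline that covers on-premises-to-cloud route exchange, scheduled monitoring, intelligent alerting, and log retention, so that routing stays stable and anomalies get a timely response.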
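- Amazon EventBridge: triggers the monitoring task at a preset interval. The frequency is configurable; for example, an enterprise can monitor every 30 minutes or every hour to balance response time against cost
- AWS Lambda: hosts the core monitoring logic. It queries the VPC route tables, identifies routes propagated from Direct Connect, and computes the statistics used for the route status check
- Amazon SNS: receives alert information from the monitor and distributes it, sending email notifications that contain a summary, route details, and optimization suggestions so that operations staff can respond quickly
- AWS IAM: follows the principle of least privilege, granting the Lambda function only the permissions it needs to query route tables and publish SNS notifications, which keeps the security risk of the monitoring process low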
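The check that the Lambda function performs can be reduced to two small steps: count the unique propagated routes, then compare the count with the warning thresholds. The sketch below illustrates that idea with plain Python and no AWS calls. It assumes the input has the shape of the `RouteTables` list returned by the EC2 DescribeRouteTables API; the function names `count_propagated_routes` and `classify_usage` and the sample data are hypothetical and are not taken from the deployed template.

```python
from typing import Any, Dict, List, Optional


def count_propagated_routes(route_tables: List[Dict[str, Any]]) -> int:
    """Count unique propagated routes across all route tables of a VPC.

    `route_tables` is assumed to look like the "RouteTables" list
    returned by the EC2 DescribeRouteTables API.
    """
    unique = set()
    for table in route_tables:
        for route in table.get("Routes", []):
            # Propagated routes carry this Origin value; local and static routes do not.
            if route.get("Origin") != "EnableVgwRoutePropagation":
                continue
            destination = (route.get("DestinationCidrBlock")
                           or route.get("DestinationIpv6CidrBlock"))
            # The same prefix propagated into several route tables counts once.
            unique.add((destination, route.get("GatewayId")))
    return len(unique)


def classify_usage(current: int, max_routes: int,
                   medium_pct: int = 60, high_pct: int = 80) -> Optional[str]:
    """Return "HIGH", "MEDIUM", or None for a route count against a limit."""
    usage = current / max_routes * 100
    if usage >= high_pct:
        return "HIGH"
    if usage >= medium_pct:
        return "MEDIUM"
    return None


# Hypothetical sample data: two route tables that learned overlapping prefixes.
sample_tables = [
    {"RouteTableId": "rtb-aaa", "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local",
         "Origin": "CreateRouteTable"},
        {"DestinationCidrBlock": "172.16.0.0/24", "GatewayId": "vgw-111",
         "Origin": "EnableVgwRoutePropagation"},
    ]},
    {"RouteTableId": "rtb-bbb", "Routes": [
        {"DestinationCidrBlock": "172.16.0.0/24", "GatewayId": "vgw-111",
         "Origin": "EnableVgwRoutePropagation"},
        {"DestinationCidrBlock": "172.16.1.0/24", "GatewayId": "vgw-111",
         "Origin": "EnableVgwRoutePropagation"},
    ]},
]

count = count_propagated_routes(sample_tables)  # 2 unique propagated routes
level = classify_usage(65, 100)                 # "MEDIUM": 65% lies between 60% and 80%
```

Keeping the counting and the threshold comparison as pure functions makes them easy to unit test without AWS credentials; only the code that fetches route tables and publishes to SNS needs real clients.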
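Prerequisite: Registering an Amazon Web Services Account
Step.1 Sign in to the official website
Go to the Amazon Web Services website and enter your email address and account name to start verification (enter the root user email and account name, verify the email address by entering the code sent to your inbox, then set and confirm the root user password once verification succeeds)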
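Step.2 Choose an account plan
Two account plans are available; select "Choose free plan" or "Choose paid plan" as needed to continue
- Free (6 months; suited to learning and experiments; includes $200 in credits and access to select services; you can upgrade to the paid plan when you exceed the limits or the period ends, otherwise the account is closed)
- Paid (suited to production; includes the same $200 in credits; gives access to all services; the credits apply broadly, and once they are used up you are billed pay-as-you-go)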
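Step.3 Fill in contact information
Fill in your contact information (choose your usage scenario, enter the contact's full name and phone number, select your country or region, complete the address and postal code, check the box to accept the customer agreement, and click Continue to go to the next step)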
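Step.4 Link your information
Link the required information: select your country or region, click "Send code", enter the verification code you receive, check the box to accept the agreement, and click "Verify and continue" to go to the next step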
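Step.5 Phone verification
Enter a real mobile number, choose a verification method, and complete the security check. If you choose a voice call, the web page shows a 4-digit code at the same time; answer the call and enter that code, then enter the verification information you receive. If nothing arrives within 10 minutes, go back and try again.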
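Step.6 Support plan
The free plan automatically includes Basic support, while the paid plan requires you to choose a support plan. Every plan includes customer service and access to documentation and whitepapers. Choose the one you need and click "Complete sign up"; if you need enterprise-grade support, review the paid upgrade options. Once your selection is confirmed, registration is complete.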
Amazon VPC Direct Connect Route Monitoring System
1. Download the CloudFormation template below and save it locally as a YAML file
    AWSTemplateFormatVersion: '2010-09-09'
    Description: 'VPC Direct Connect Route Monitor with AI-powered route aggregation analysis'

    Parameters:
      VpcId:
        Type: AWS::EC2::VPC::Id
        Description: Select the VPC to monitor for DX propagated routes
      MaxRoutes:
        Type: Number
        Default: 100
        MinValue: 1
        MaxValue: 1000
        Description: Maximum number of routes limit (default 100)
      WarningThreshold1:
        Type: Number
        Default: 60
        MinValue: 1
        MaxValue: 100
        Description: First warning threshold percentage (default 60%)
      WarningThreshold2:
        Type: Number
        Default: 80
        MinValue: 1
        MaxValue: 100
        Description: Second warning threshold percentage (default 80%)
      MonitoringFrequency:
        Type: String
        Default: '1 hour'
        AllowedValues:
          - '5 minutes'
          - '10 minutes'
          - '30 minutes'
          - '1 hour'
          - '1 day'
        Description: How often to check the route count (default 1 hour)
      NotificationEmail:
        Type: String
        Description: Email address to receive alert notifications
        AllowedPattern: '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
        ConstraintDescription: Please enter a valid email address
      EnableAIAnalysis:
        Type: String
        Default: 'false'
        AllowedValues:
          - 'true'
          - 'false'
        Description: Enable AI-powered route aggregation analysis using Amazon Bedrock Nova Lite
      BedrockRegion:
        Type: String
        Default: 'us-east-1'
        AllowedValues:
          - 'us-east-1'
          - 'us-west-2'
          - 'eu-west-1'
          - 'ap-southeast-1'
          - 'ap-northeast-1'
          - 'eu-north-1'
        Description: AWS region where Bedrock is available (default us-east-1)

    Conditions:
      Is5Minutes: !Equals [!Ref MonitoringFrequency, '5 minutes']
      Is10Minutes: !Equals [!Ref MonitoringFrequency, '10 minutes']
      Is30Minutes: !Equals [!Ref MonitoringFrequency, '30 minutes']
      Is1Day: !Equals [!Ref MonitoringFrequency, '1 day']
      AIAnalysisEnabled: !Equals [!Ref EnableAIAnalysis, 'true']

    Resources:
      AlertTopic:
        Type: AWS::SNS::Topic
        Properties:
          TopicName: !Sub '${AWS::StackName}-vpc-dx-route-alerts'
          DisplayName: 'VPC DX Route Monitor Alerts'

      EmailSubscription:
        Type: AWS::SNS::Subscription
        Properties:
          TopicArn: !Ref AlertTopic
          Protocol: email
          Endpoint: !Ref NotificationEmail

      LambdaExecutionRole:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Principal:
                  Service: lambda.amazonaws.com
                Action: sts:AssumeRole
          ManagedPolicyArns:
            - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
          Policies:
            - PolicyName: VPCRouteMonitoringPolicy
              PolicyDocument:
                Version: '2012-10-17'
                Statement:
                  - Effect: Allow
                    Action:
                      - ec2:DescribeRouteTables
                      - ec2:DescribeVpcs
                    Resource: '*'
                  - Effect: Allow
                    Action:
                      - sns:Publish
                    Resource: !Ref AlertTopic
            - !If
              - AIAnalysisEnabled
              - PolicyName: BedrockAccessPolicy
                PolicyDocument:
                  Version: '2012-10-17'
                  Statement:
                    - Effect: Allow
                      Action:
                        - bedrock:InvokeModel
                      Resource: !Sub 'arn:aws:bedrock:${BedrockRegion}::foundation-model/amazon.nova-lite-v1:0'
              - !Ref 'AWS::NoValue'

      RouteMonitorFunction:
        Type: AWS::Lambda::Function
        Properties:
          FunctionName: !Sub '${AWS::StackName}-vpc-dx-route-monitor'
          Runtime: python3.9
          Handler: index.lambda_handler
          Role: !GetAtt LambdaExecutionRole.Arn
          Timeout: 600
          MemorySize: 512
          Description: 'Monitor VPC DX propagated routes with optional AI analysis'
          Environment:
            Variables:
              VPC_ID: !Ref VpcId
              MAX_ROUTES: !Ref MaxRoutes
              WARNING_THRESHOLD_1: !Ref WarningThreshold1
              WARNING_THRESHOLD_2: !Ref WarningThreshold2
              SNS_TOPIC_ARN: !Ref AlertTopic
              ENABLE_AI_ANALYSIS: !Ref EnableAIAnalysis
              BEDROCK_REGION: !Ref BedrockRegion
          Code:
            ZipFile: |
              #!/usr/bin/env python3
              """
              AWS VPC DX route monitoring Lambda function - AI-enhanced version
              Supports route aggregation analysis through Amazon Bedrock Nova Lite
              """
              import boto3
              import json
              import os
              from datetime import datetime
              from typing import Dict, List, Any, Set, Tuple, Optional

              def lambda_handler(event, context):
                  """Lambda entry point"""
                  try:
                      vpc_id = os.environ.get('VPC_ID')
                      max_routes = int(os.environ.get('MAX_ROUTES', '100'))
                      warning_threshold_1 = int(os.environ.get('WARNING_THRESHOLD_1', '60'))
                      warning_threshold_2 = int(os.environ.get('WARNING_THRESHOLD_2', '80'))
                      sns_topic_arn = os.environ.get('SNS_TOPIC_ARN')
                      enable_ai_analysis = os.environ.get('ENABLE_AI_ANALYSIS', 'false').lower() == 'true'
                      bedrock_region = os.environ.get('BEDROCK_REGION', 'us-east-1')

                      if not vpc_id or not sns_topic_arn:
                          raise ValueError("Missing required environment variables")

                      # Query the routes propagated from DX
                      route_info = query_dx_propagated_routes(vpc_id)
                      if route_info is None:
                          send_error_notification(sns_topic_arn, vpc_id, "Route query failed")
                          return {'statusCode': 500, 'body': json.dumps({'error': 'Route query failed'})}

                      # Compute the usage percentage
                      current_routes = route_info['unique_dx_routes']
                      usage_percentage = (current_routes / max_routes) * 100

                      # AI analysis (if enabled)
                      ai_analysis = None
                      if enable_ai_analysis and current_routes > 0:
                          try:
                              ai_analysis = analyze_routes_with_ai(route_info['unique_routes'], bedrock_region)
                              print("AI analysis completed")
                          except Exception as e:
                              print(f"AI analysis failed: {e}")
                              ai_analysis = {"error": f"AI analysis failed: {str(e)}"}

                      # Check the warning thresholds
                      should_alert = False
                      alert_level = None
                      if usage_percentage >= warning_threshold_2:
                          should_alert = True
                          alert_level = 'HIGH'
                      elif usage_percentage >= warning_threshold_1:
                          should_alert = True
                          alert_level = 'MEDIUM'

                      # Send the alert
                      if should_alert:
                          send_alert_notification(
                              sns_topic_arn, vpc_id, current_routes, max_routes,
                              usage_percentage, alert_level, route_info['unique_routes'],
                              ai_analysis
                          )

                      result = {
                          'vpc_id': vpc_id,
                          'unique_dx_routes': current_routes,
                          'max_routes': max_routes,
                          'usage_percentage': round(usage_percentage, 2),
                          'alert_sent': should_alert,
                          'alert_level': alert_level,
                          'ai_analysis_enabled': enable_ai_analysis,
                          'ai_analysis_status': 'completed' if ai_analysis and 'error' not in ai_analysis else 'failed' if ai_analysis else 'disabled',
                          'timestamp': datetime.now().isoformat()
                      }
                      print(f"Monitoring result: {json.dumps(result, ensure_ascii=False)}")
                      return {'statusCode': 200, 'body': json.dumps(result, ensure_ascii=False)}

                  except Exception as e:
                      error_msg = f"Lambda execution failed: {str(e)}"
                      print(error_msg)
                      try:
                          if 'sns_topic_arn' in locals():
                              send_error_notification(sns_topic_arn, vpc_id if 'vpc_id' in locals() else 'Unknown', str(e))
                      except Exception:
                          pass
                      return {'statusCode': 500, 'body': json.dumps({'error': error_msg})}

              def query_dx_propagated_routes(vpc_id: str) -> Optional[Dict[str, Any]]:
                  """Query the route entries in the VPC that were propagated from DX"""
                  try:
                      ec2_client = boto3.client('ec2')
                      response = ec2_client.describe_route_tables(
                          Filters=[{'Name': 'vpc-id', 'Values': [vpc_id]}]
                      )
                      route_tables = response.get('RouteTables', [])
                      if not route_tables:
                          print(f"No route tables found for VPC {vpc_id}")
                          return None
                      if response.get('NextToken'):
                          print("Warning: pagination detected, an upgrade to the full version may be needed")

                      # Use a set to deduplicate routes
                      unique_dx_routes: Set[Tuple[str, str, str]] = set()
                      unique_routes_list = []
                      for rt in route_tables:
                          rt_id = rt['RouteTableId']
                          routes = rt.get('Routes', [])
                          for route in routes:
                              if route.get('Origin') == 'EnableVgwRoutePropagation':
                                  destination = route.get('DestinationCidrBlock') or route.get('DestinationIpv6CidrBlock', 'unknown')
                                  target_type, target_value = get_route_target(route)
                                  route_key = (destination, target_type, target_value)
                                  if route_key not in unique_dx_routes:
                                      unique_dx_routes.add(route_key)
                                      unique_routes_list.append({
                                          'route_table_id': rt_id,
                                          'destination': destination,
                                          'target_type': target_type,
                                          'target_value': target_value,
                                          'state': route.get('State', 'unknown')
                                      })

                      return {
                          'unique_dx_routes': len(unique_dx_routes),
                          'unique_routes': unique_routes_list
                      }
                  except Exception as e:
                      print(f"Failed to query DX routes: {e}")
                      return None

              def analyze_routes_with_ai(routes: List[Dict], bedrock_region: str) -> Optional[Dict]:
                  """Analyze route aggregation with Amazon Bedrock Nova Lite"""
                  try:
                      bedrock_client = boto3.client('bedrock-runtime', region_name=bedrock_region)

                      # Prepare the route data
                      route_data = []
                      for route in routes:
                          route_data.append({
                              'destination': route['destination'],
                              'target_type': route['target_type'],
                              'target_value': route['target_value'],
                              'state': route['state']
                          })

                      # Build the AI prompt
                      prompt = f"""You are an AWS networking expert. Analyze the following routes propagated from Direct Connect and provide route aggregation recommendations.

              Current route list ({len(routes)} routes):
              {json.dumps(route_data, indent=2, ensure_ascii=False)}

              Please analyze and provide the following:
              1. Route aggregation opportunities
              2. Specific CIDR aggregation recommendations
              3. Expected reduction in the number of routes
              4. Implementation advice and caveats
              5. Risk assessment

              Please answer in English, in a clear and readable format."""

                      # Call Nova Lite
                      request_body = {
                          "messages": [
                              {
                                  "role": "user",
                                  "content": [
                                      {"text": prompt}
                                  ]
                              }
                          ],
                          "inferenceConfig": {
                              "max_new_tokens": 4000,
                              "temperature": 0.1
                          }
                      }
                      response = bedrock_client.invoke_model(
                          modelId='amazon.nova-lite-v1:0',
                          body=json.dumps(request_body)
                      )
                      response_body = json.loads(response['body'].read())
                      ai_analysis = response_body['output']['message']['content'][0]['text']

                      return {
                          'analysis': ai_analysis,
                          'route_count': len(routes),
                          'analysis_timestamp': datetime.now().isoformat(),
                          'model_used': 'Amazon Nova Lite'
                      }
                  except Exception as e:
                      print(f"AI analysis failed: {e}")
                      return {"error": str(e)}

              def get_route_target(route: Dict) -> Tuple[str, str]:
                  """Return the target type and value of a route"""
                  if 'VirtualPrivateGatewayId' in route:
                      return 'vpn-gateway', route['VirtualPrivateGatewayId']
                  elif 'TransitGatewayId' in route:
                      return 'transit-gateway', route['TransitGatewayId']
                  elif 'DirectConnectGatewayId' in route:
                      return 'dx-gateway', route['DirectConnectGatewayId']
                  elif 'GatewayId' in route:
                      return 'gateway', route['GatewayId']
                  elif 'NatGatewayId' in route:
                      return 'nat-gateway', route['NatGatewayId']
                  elif 'NetworkInterfaceId' in route:
                      return 'network-interface', route['NetworkInterfaceId']
                  elif 'InstanceId' in route:
                      return 'instance', route['InstanceId']
                  else:
                      return 'unknown', 'unknown'

              def send_alert_notification(sns_topic_arn: str, vpc_id: str, current_routes: int,
                                          max_routes: int, usage_percentage: float, alert_level: str,
                                          routes: List[Dict], ai_analysis: Optional[Dict] = None):
                  """Send an alert notification (including the AI analysis when available)"""
                  try:
                      sns_client = boto3.client('sns')
                      subject = f"🚨 VPC DX route alert - {alert_level} level ({usage_percentage:.1f}%)"
                      if ai_analysis and 'error' not in ai_analysis:
                          subject += " [with AI analysis]"

                      message_lines = [
                          "VPC Direct Connect route monitoring alert",
                          "",
                          "📊 Monitoring summary:",
                          f"  VPC ID: {vpc_id}",
                          f"  Current DX propagated routes: {current_routes}",
                          f"  Maximum route limit: {max_routes}",
                          f"  Usage percentage: {usage_percentage:.2f}%",
                          f"  Alert level: {alert_level}",
                          f"  Check time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S UTC')}",
                          "",
                          "🔍 DX propagated route details:"
                      ]

                      if routes:
                          message_lines.append(f"  {'Destination':<18} {'Target type':<15} {'Target value':<25}")
                          message_lines.append(f"  {'-'*18} {'-'*15} {'-'*25}")
                          for route in routes[:15]:  # Limit the rows shown to leave room for the AI analysis
                              message_lines.append(
                                  f"  {route['destination']:<18} {route['target_type']:<15} {route['target_value']:<25}"
                              )
                          if len(routes) > 15:
                              message_lines.append(f"  ... {len(routes) - 15} more routes not shown")

                      # Append the AI analysis result
                      if ai_analysis:
                          message_lines.extend([
                              "",
                              "🤖 AI route aggregation analysis (Amazon Nova Lite):",
                              f"{'='*60}"
                          ])
                          if 'error' in ai_analysis:
                              message_lines.append(f"AI analysis failed: {ai_analysis['error']}")
                          else:
                              # Split the AI analysis into lines and keep blank lines as separators
                              analysis_lines = ai_analysis['analysis'].split('\n')
                              for line in analysis_lines:
                                  if line.strip():
                                      message_lines.append(f"{line}")
                                  else:
                                      message_lines.append("")
                              message_lines.extend([
                                  "",
                                  f"Analysis time: {ai_analysis.get('analysis_timestamp', 'Unknown')}",
                                  f"Model used: {ai_analysis.get('model_used', 'Unknown')}"
                              ])

                      message_lines.extend([
                          "",
                          "⚠️ Recommended actions:",
                          "  - Check for unnecessary route propagation",
                          "  - Consider optimizing route aggregation",
                          "  - To raise the route limit, contact AWS Support"
                      ])
                      if ai_analysis and 'error' not in ai_analysis:
                          message_lines.append("  - Use the AI analysis above as a reference for route optimization")
                      message_lines.append("\nThis message was generated automatically by AWS Lambda")

                      sns_client.publish(
                          TopicArn=sns_topic_arn,
                          Subject=subject,
                          Message="\n".join(message_lines)
                      )
                      print("Alert notification sent")
                  except Exception as e:
                      print(f"Failed to send alert notification: {e}")

              def send_error_notification(sns_topic_arn: str, vpc_id: str, error_message: str):
                  """Send an error notification"""
                  try:
                      sns_client = boto3.client('sns')
                      subject = f"❌ VPC DX route monitoring error - {vpc_id}"
                      message = f"""VPC Direct Connect route monitoring execution error

              Error details:
                VPC ID: {vpc_id}
                Error message: {error_message}
                Time of occurrence: {datetime.now().strftime('%Y-%m-%d %H:%M:%S UTC')}

              Please check the Lambda function configuration and permissions."""
                      sns_client.publish(
                          TopicArn=sns_topic_arn,
                          Subject=subject,
                          Message=message
                      )
                  except Exception as e:
                      print(f"Failed to send error notification: {e}")

      ScheduleRule:
        Type: AWS::Events::Rule
        Properties:
          Name: !Sub '${AWS::StackName}-vpc-dx-route-monitor-schedule'
          Description: 'Schedule trigger for VPC DX route monitoring'
          ScheduleExpression: !If
            - Is5Minutes
            - 'rate(5 minutes)'
            - !If
              - Is10Minutes
              - 'rate(10 minutes)'
              - !If
                - Is30Minutes
                - 'rate(30 minutes)'
                - !If
                  - Is1Day
                  - 'rate(1 day)'
                  - 'rate(1 hour)'
          State: ENABLED
          Targets:
            - Arn: !GetAtt RouteMonitorFunction.Arn
              Id: 'RouteMonitorTarget'

      LambdaInvokePermission:
        Type: AWS::Lambda::Permission
        Properties:
          FunctionName: !Ref RouteMonitorFunction
          Action: lambda:InvokeFunction
          Principal: events.amazonaws.com
          SourceArn: !GetAtt ScheduleRule.Arn

      LambdaLogGroup:
        Type: AWS::Logs::LogGroup
        Properties:
          LogGroupName: !Sub '/aws/lambda/${RouteMonitorFunction}'
          RetentionInDays: 14

    Outputs:
      LambdaFunctionName:
        Description: 'Lambda function name for VPC DX route monitoring'
        Value: !Ref RouteMonitorFunction
      SNSTopicArn:
        Description: 'SNS topic ARN for alert notifications'
        Value: !Ref AlertTopic
      MonitoredVPC:
        Description: 'VPC ID being monitored'
        Value: !Ref VpcId
      AIAnalysisEnabled:
        Description: 'Whether AI analysis is enabled'
        Value: !Ref EnableAIAnalysis
      BedrockRegion:
        Description: 'Bedrock region for AI analysis'
        Value: !Ref BedrockRegion
        Condition: AIAnalysisEnabled

2. Open CloudFormation in the web console and upload the CloudFormation.yaml file you just created
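3. Fill in the required monitoring parameters: enter the email address that should receive alerts, select a VPC that has already learned DX routes, and click Next
4. Check the acknowledgment box to grant the minimum permissions the system needs, then click Next to finish deploying the stack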
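5. Once deployment completes, the system starts working automatically and, when EnableAIAnalysis is set to true, uses the large language model to suggest route optimizations. Confirm the subscription email that SNS sends to the alert address; alerts are not delivered until the subscription is confirmed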
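Because the aggregation advice in the alert email comes from a language model, it helps to have a deterministic baseline to compare it against. The sketch below uses Python's standard `ipaddress` module to compute the lossless aggregation of a list of CIDR blocks, that is, it merges only prefixes that are contiguous or already contained in another prefix. The helper name `aggregate_cidrs` and the sample prefixes are hypothetical. The sketch also assumes all routes share the same next hop; in practice the aggregation itself has to be configured on the on-premises router that advertises the prefixes over BGP.

```python
import ipaddress
from typing import List


def aggregate_cidrs(cidrs: List[str]) -> List[str]:
    """Collapse contiguous or overlapping CIDR blocks into the fewest equivalent blocks."""
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    # collapse_addresses() rejects mixed IP versions, so handle IPv4 and IPv6 separately.
    ipv4 = [net for net in networks if net.version == 4]
    ipv6 = [net for net in networks if net.version == 6]
    collapsed = (list(ipaddress.collapse_addresses(ipv4))
                 + list(ipaddress.collapse_addresses(ipv6)))
    return [str(net) for net in collapsed]


# Hypothetical propagated prefixes: four adjacent /24 blocks plus one unrelated block.
before = [
    "10.0.0.0/24",
    "10.0.1.0/24",
    "10.0.2.0/24",
    "10.0.3.0/24",
    "192.168.10.0/24",
]
after = aggregate_cidrs(before)  # ['10.0.0.0/22', '192.168.10.0/24']
print(f"{len(before)} routes can be advertised as {len(after)}: {after}")
```

If the model proposes a summary that this baseline does not produce, the proposal covers address space that is not currently advertised, and that is worth reviewing before changing any BGP configuration.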
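Introduction to Amazon VPC Direct Connect
AWS Direct Connect, referred to in this article as Amazon VPC Direct Connect, is a dedicated network service from Amazon Web Services. It establishes a private connection between an on-premises data center and Amazon VPC over a physical dedicated line or a partner network instead of the public internet, giving hybrid cloud architectures a fast, stable, and secure network path
- Low latency and high stability: bypasses the public internet, reducing jitter and latency fluctuation and keeping real-time workloads such as financial transactions and data synchronization running without interruption
- Security and cost optimization: private links lower the risk of data in transit, and the fixed-bandwidth billing model can cost less over the long term than frequent high-volume transfer over the public internet
- Flexible scaling and integration: supports bandwidth from 1 to 100 Gbps on demand and can connect multiple VPCs, cross-region resources, and AWS services, fitting complex hybrid cloud architectures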
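Summary
The AI-powered Amazon VPC Direct Connect route monitoring system described in this article uses a serverless architecture in which EventBridge triggers the check on a schedule, Lambda performs the core analysis, and SNS sends alert notifications. It monitors route status continuously, warns when thresholds are exceeded, and integrates an Amazon Bedrock large language model to provide route optimization suggestions. Automated deployment with CloudFormation balances security and flexibility, lowers operating costs, and helps keep hybrid cloud network connections stable and reliable.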









