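Introduction
With the emergence of hyperscale data centers and the spread of RDMA (Remote Direct Memory Access), the demands on network operations keep rising, and traditional monitoring techniques such as SNMP, CLI, and logs can no longer meet them. Telemetry is a remote data collection technology for monitoring device performance and faults; the data it gathers is more precise and closer to real time. In particular, in the subscription-based Stream mode the network device itself reports the information, which effectively reduces the load the collector carries to sustain monitoring.
SONiC is an open-source network operating system focused on data centers. Its architecture places each module in its own Docker container and uses redis-database for data persistence, replication, and inter-process communication among the subsystems. gRPC-based Telemetry can collect information stored in redis-database, such as interface traffic, encode it with Protocol Buffers, and report it in real time to a collector, which receives and stores it for an analyzer to read.
Packet loss is a common fault in network communication; the sooner the loss and its cause are known, the sooner troubleshooting can begin. SONiC's drop counter feature exposes the switch chip's packet-drop monitoring capability to the Telemetry module, so that drops are not only detected quickly but also attributed to a cause, which addresses this pain point in data center networks. The demonstration below uses telemetry to collect the drop counters of a Centec V682-SONiC switch and visualizes packet-drop detection with a set of open-source tools.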
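Environment Setup
Overview
The topology of the verification environment is as follows.
The server runs two Docker containers:
docker1 runs Telegraf and Prometheus
Prometheus is an open-source system monitoring and alerting system built on a time-series database.
Telegraf is an open-source, plugin-driven data collection agent. Here it is built from a branch that supports sonic-telemetry, so that sonic-telemetry data is converted into a format Prometheus can ingest.
docker2 runs Grafana
Grafana is an open-source data visualization tool for monitoring and statistics, with alerting support.
The monitoring data therefore flows Telemetry -> Telegraf -> Prometheus -> Grafana, and is finally visualized on the web interface that Grafana exposes.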
Operations on the Switch
On the SONiC switch, add VLAN10 and VLAN60, add Ethernet7 to VLAN60, and configure 60.1.1.254/24 on VLAN60.
admin@sonic:~$ sudo config vlan add 10
admin@sonic:~$ sudo config vlan add 60
admin@sonic:~$ sudo config vlan member add 60 Ethernet7
admin@sonic:~$ sudo config interface ip add Vlan60 60.1.1.254/24
On the SONiC switch, enter the telemetry docker and start the Telemetry service, listening on port 8181.
admin@sonic:~$ docker exec -it telemetry bash
root@sonic:/# /usr/sbin/telemetry -logtostderr --insecure --port 8181 --allow_no_client_auth -v=2
On the SONiC switch, add two drop reasons, INGRESS_VLAN_FILTER and LPM4_MISS.
admin@sonic:~$ sudo config dropcounters install DEBUG_VLAN_TAG_MISMATCH PORT_INGRESS_DROPS [INGRESS_VLAN_FILTER]
admin@sonic:~$ sudo config dropcounters install DEBUG_ROUTE_MISMATCH PORT_INGRESS_DROPS [LPM4_MISS]
admin@sonic:~$ show dropcounters configuration
Counter                  Alias                    Group    Type                Reasons              Description
-----------------------  -----------------------  -------  ------------------  -------------------  -------------
DEBUG_ROUTE_MISMATCH     DEBUG_ROUTE_MISMATCH     N/A      PORT_INGRESS_DROPS  LPM4_MISS            N/A
DEBUG_VLAN_TAG_MISMATCH  DEBUG_VLAN_TAG_MISMATCH  N/A      PORT_INGRESS_DROPS  INGRESS_VLAN_FILTER  N/A
Check COUNTERS_DB: two new fields have been added under the port.
admin@sonic:~$ sonic-db-cli COUNTERS_DB HGET COUNTERS:oid:0x10000000000c5 SAI_PORT_STAT_IN_CONFIGURED_DROP_REASONS_1_DROPPED_PKTS
0
admin@sonic:~$ sonic-db-cli COUNTERS_DB HGET COUNTERS:oid:0x10000000000c5 SAI_PORT_STAT_IN_CONFIGURED_DROP_REASONS_2_DROPPED_PKTS
0
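The two HGET lookups above can also be scripted when the counters need to be checked repeatedly. The following is a minimal sketch, assuming it runs on the switch where `sonic-db-cli` is on the PATH; the helper names `build_hget_command` and `read_drop_counters` are hypothetical, and the `run` parameter exists only so the logic can be exercised without a switch.

```python
import subprocess

# Field names exactly as they appear in the COUNTERS_DB output above.
DROP_FIELDS = [
    "SAI_PORT_STAT_IN_CONFIGURED_DROP_REASONS_1_DROPPED_PKTS",
    "SAI_PORT_STAT_IN_CONFIGURED_DROP_REASONS_2_DROPPED_PKTS",
]


def build_hget_command(port_oid, field):
    """Return the sonic-db-cli argument list for one counter field."""
    return ["sonic-db-cli", "COUNTERS_DB", "HGET", f"COUNTERS:{port_oid}", field]


def read_drop_counters(port_oid, run=subprocess.run):
    """Read both drop counters of a port.

    `run` defaults to subprocess.run and is injectable (an assumption made
    for testability, not part of any SONiC tooling).
    """
    values = {}
    for field in DROP_FIELDS:
        result = run(build_hget_command(port_oid, field),
                     capture_output=True, text=True, check=True)
        # An empty reply is treated as zero.
        values[field] = int(result.stdout.strip() or 0)
    return values
```

On the switch used in this article the call would be `read_drop_counters("oid:0x10000000000c5")`; the OID differs per port and per device.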
Operations on docker1
On the server, prepare the Prometheus configuration file /opt/prometheus/prometheus.yml.
admin@admin-server:~$ cat /opt/prometheus/prometheus.yml
global:
  scrape_interval:     2s
  evaluation_interval: 2s
scrape_configs:
  - job_name: telegraf
    static_configs:
      - targets: ['127.0.0.1:9100']
        labels:
          instance: localhost
Run docker1 from the image ubuntu/prometheus:2.32-20.04_beta, mapping the container's port 9090 to port 9191 on the server.
admin@admin-server:~/telemetry$ docker run -d \
>   -p 9191:9090 \
>   -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
>   --name prometheus \
>   ubuntu/prometheus:2.32-20.04_beta
Copy the compiled telegraf executable (which includes the sonic input plugin) into docker1 and run it with the specified configuration.
root@31e2f2a810ca:/prometheus# cd /root/
root@31e2f2a810ca:~# ls -l
total 92908
-rwxr-xr-x 1 1000 1000 95132120 Aug 12 17:31 telegraf
-rw-r--r-- 1 1000 1000      578 Aug 27 06:33 telegraf.conf
root@31e2f2a810ca:~# cat telegraf.conf
[agent]
  debug = true
[[inputs.sonic_telemetry_gnmi]]
  addresses = ["10.13.33.135:8181"]
  username = "admin"
  password = "YourPaSsWoRd"
  encoding = "json_ietf"
  redial = "10s"
  enable_tls = true
  insecure_skip_verify = true
  target = "COUNTERS_DB"
  [[inputs.sonic_telemetry_gnmi.subscription]]
    name = "test_135_ifcounters"
    origin = ""
    path = "/COUNTERS/Ethernet7"
    subscription_mode = "sample"
    sample_interval = "5s"
[[outputs.prometheus_client]]
  listen = ":9100"
  metric_version = 2
root@31e2f2a810ca:~# ./telegraf --config telegraf.conf
2023-09-24T06:06:34Z I! Starting Telegraf
2023-09-24T06:06:34Z I! Loaded inputs: sonic_telemetry_gnmi
2023-09-24T06:06:34Z I! Loaded aggregators:
2023-09-24T06:06:34Z I! Loaded processors:
2023-09-24T06:06:34Z I! Loaded outputs: prometheus_client
2023-09-24T06:06:34Z I! Tags enabled: host=31e2f2a810ca
2023-09-24T06:06:34Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"31e2f2a810ca", Flush Interval:10s
2023-09-24T06:06:34Z D! [agent] Initializing plugins
2023-09-24T06:06:34Z D! [agent] Connecting outputs
2023-09-24T06:06:34Z D! [agent] Attempting connection to [outputs.prometheus_client]
2023-09-24T06:06:34Z I! [outputs.prometheus_client] Listening on http://[::]:9100/metrics
2023-09-24T06:06:34Z D! [agent] Successfully connected to outputs.prometheus_client
2023-09-24T06:06:34Z D! [agent] Starting service inputs
2023-09-24T06:06:35Z D! [inputs.sonic_telemetry_gnmi] Connection to GNMI device 10.13.33.135:8181 established
2023-09-24T06:06:44Z D! [outputs.prometheus_client] Wrote batch of 2 metrics in 665.355μs
2023-09-24T06:06:44Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2023-09-24T06:06:54Z D! [outputs.prometheus_client] Wrote batch of 1 metrics in 280.59μs
2023-09-24T06:06:54Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2023-09-24T06:07:04Z D! [outputs.prometheus_client] Wrote batch of 1 metrics in 305.204μs
2023-09-24T06:07:04Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
......
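Before checking Prometheus, it can help to confirm that Telegraf is actually exposing the drop counters on its `/metrics` endpoint. The sketch below parses the Prometheus text exposition format and keeps only metrics whose name starts with a given prefix. The function names are hypothetical, the parser is deliberately simplified (it keeps one value per metric name and ignores labels), and the label set Telegraf attaches is not shown in the article, so none is assumed.

```python
import urllib.request


def parse_metrics(text, prefix):
    """Return {metric_name: value} for exposition lines whose name starts with prefix.

    Simplified: if several series share one name (different labels),
    the last one seen wins.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is `name{labels} value [timestamp]` or `name value [timestamp]`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if not name.startswith(prefix):
            continue
        rest = line.rsplit("}", 1)[-1] if "{" in line else line[len(name):]
        samples[name] = float(rest.split()[0])
    return samples


def fetch_metrics(url, prefix, timeout=5):
    """Download the exposition text from Telegraf and parse it."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"), prefix)
```

Inside docker1, `fetch_metrics("http://127.0.0.1:9100/metrics", "test_135_ifcounters_")` would list whatever the subscription named `test_135_ifcounters` has produced so far.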
Open the Prometheus targets page and confirm that the telegraf target is up.
Operations on docker2
Run docker2 from the image philhawthorne/docker-influxdb-grafana:latest.
admin@admin-server:~$ docker run -d \
>   --name docker-influxdb-grafana \
>   -p 3003:3003 \
>   -p 3004:8083 \
>   -p 8086:8086 \
>   -v /home/admin/influxdb:/var/lib/influxdb \
>   -v /home/admin/grafana:/var/lib/grafana \
>   philhawthorne/docker-influxdb-grafana:latest
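Open the Grafana configuration page, add the Prometheus details, and test to confirm that the connection works.
Add a Panel in Grafana
Select the Prometheus data source configured in the previous step.
Add two monitoring metrics: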
test_135_ifcounters_SAI_PORT_STAT_IN_CONFIGURED_DROP_REASONS_1_DROPPED_PKTS
test_135_ifcounters_SAI_PORT_STAT_IN_CONFIGURED_DROP_REASONS_2_DROPPED_PKTS
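The same two metrics can be read outside Grafana through the Prometheus HTTP API, which is convenient for scripted checks. This is a rough sketch using the instant-query endpoint `/api/v1/query`; the helper names are hypothetical, and the base URL must point at the server port that docker1's 9090 was mapped to (9191 in this setup).

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query_string = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def extract_values(payload):
    """Turn an instant-query JSON response into [(labels, value), ...]."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload}")
    # Each result carries "value": [unix_timestamp, "value_as_string"].
    return [(item["metric"], float(item["value"][1]))
            for item in payload["data"]["result"]]


def query(base_url, promql, timeout=5):
    """Run one instant query and return the parsed samples."""
    with urllib.request.urlopen(build_query_url(base_url, promql),
                                timeout=timeout) as response:
        return extract_values(json.load(response))
```

For example, `query("http://localhost:9191", "test_135_ifcounters_SAI_PORT_STAT_IN_CONFIGURED_DROP_REASONS_1_DROPPED_PKTS")` run on the server would return the latest scraped value of the first counter.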
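Traffic Verification
Construct packet 1 with vlan_id 10, which does not match the VLAN that Ethernet7 belongs to, and send it into Ethernet7.
Construct packet 2 with vlan_id 60 but with dest_ip 9.9.9.9, which does not exist in the switch's routing table, and send it into Ethernet7.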
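The article does not say which tool generated the two packets. As one possible illustration, the sketch below builds the raw bytes of an 802.1Q-tagged IPv4 frame with only the Python standard library. The MAC addresses, the source IP 60.1.1.1, and the IP protocol number 253 (reserved for experimentation) are all assumptions; for packet 2 to reach the route lookup, the destination MAC would have to be the switch's own MAC, which is not given here.

```python
import socket
import struct


def ipv4_checksum(header):
    """Ones'-complement checksum over an even-length IPv4 header."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_frame(dst_mac, src_mac, vlan_id, src_ip, dst_ip, payload=b""):
    """Return an Ethernet + 802.1Q + IPv4 frame as bytes."""
    tci = vlan_id & 0x0FFF  # priority 0, DEI 0, 12-bit VLAN ID
    ethernet = dst_mac + src_mac + struct.pack("!HHH", 0x8100, tci, 0x0800)
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0,                 # version 4, IHL 5; DSCP/ECN 0
        20 + len(payload),       # total length
        0, 0,                    # identification; flags and fragment offset
        64, 253, 0,              # TTL; protocol 253 (assumed); checksum placeholder
        socket.inet_aton(src_ip), socket.inet_aton(dst_ip),
    )
    checksum = ipv4_checksum(header)
    header = header[:10] + struct.pack("!H", checksum) + header[12:]
    return ethernet + header + payload


# Hypothetical addresses, for illustration only.
SWITCH_MAC = bytes.fromhex("020000000001")
TESTER_MAC = bytes.fromhex("020000000002")

packet1 = build_frame(SWITCH_MAC, TESTER_MAC, 10, "60.1.1.1", "60.1.1.254")  # wrong VLAN
packet2 = build_frame(SWITCH_MAC, TESTER_MAC, 60, "60.1.1.1", "9.9.9.9")     # no route
```

The resulting bytes can be handed to a traffic generator or a raw socket on the host connected to Ethernet7.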
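View the Panel page; the horizontal axis is time and the vertical axis is the packet count.
The green curve shows the change in the number of packets entering Ethernet7 that were dropped because they did not match the ingress port's VLAN.
The yellow curve shows the change in the number of packets entering Ethernet7 that were dropped because no matching route was found.
Operations staff or a controller can use the reported drop statistics to locate faults quickly, apply a response strategy, and restore network services.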
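A controller acting on these statistics mainly needs to know which counters grew between two samples. The following sketch shows one way to compute that; `new_drops` is a hypothetical helper, and treating a decreasing value as a counter reset is an assumption rather than documented SONiC behavior.

```python
def new_drops(previous, current):
    """Return {counter_name: delta} for every counter that grew between two samples."""
    deltas = {}
    for name, value in current.items():
        before = previous.get(name, 0)
        # A value lower than the previous one suggests the counter was reset
        # (for example after a reboot), so count from zero in that case.
        delta = value - before if value >= before else value
        if delta > 0:
            deltas[name] = delta
    return deltas
```

Feeding it two consecutive readings of the `..._DROP_REASONS_1_DROPPED_PKTS` and `..._DROP_REASONS_2_DROPPED_PKTS` metrics yields only the counters that need attention, which a script could then map to the drop reasons configured on the switch.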
Editor: Huang Fei