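1. Introduction: starting with the three axe strokes
Mention the "three axe strokes" and the first figure that comes to mind is Cheng Yaojin, the hero of the Sui-Tang romances. Wielding his great axe, he opens every fight with the same three moves (strictly speaking, two and a half). Most of the time that is enough to floor the opponent; if it is not, he simply drags his axe and runs.
surftrace, the tool introduced today, works much like those three axe strokes: a user who already has the relevant kernel knowledge can pick it up quickly. Let's start with a real case.
1.1 Who woke up Robert
While investigating a scheduling problem, we found that the Robert process (pid 1234) kept being woken up unexpectedly, so we needed to know which processes were waking it.
Solution: the kernel wakes a thread through the try_to_wake_up function, whose prototype is: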
static int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
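The first argument, struct task_struct *p, carries the pid of the task being woken. By filtering on that pid and reading information about current, we can find out who did the waking.
This is easy to implement with a ko (kernel module), systemtap, bcc, bpftrace and similar approaches. They all share one drawback, though: there is no ready-made command to run, so each requires writing code, debugging costs time and effort, and once the problem is located the code tends to be thrown away.
1.2 Enter surftrace
Code first: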
pip install surftrace
surftrace 'p try_to_wake_up pid=%0->pid comm=$comm f:pid==1234'
Output:
surftrace 'p try_to_wake_up pid=%0->pid comm=$comm f:pid==1234'
echo 'p:f0 try_to_wake_up pid=+0x948(%di):u32 comm=$comm' >> /sys/kernel/debug/tracing/kprobe_events
echo 'pid==1234' > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/filter
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
<idle>-0 [011] d.h. 11766726.224113: f0: (try_to_wake_up+0x0/0x580) pid=1234 comm="swapper/11"
<...>-2166943 [011] d.h. 11766727.225113: f0: (try_to_wake_up+0x0/0x580) pid=1234 comm="python3"
<idle>-0 [008] d.h. 11766728.226114: f0: (try_to_wake_up+0x0/0x580) pid=1234 comm="swapper/8"
<idle>-0 [008] d.h. 11766729.227114: f0: (try_to_wake_up+0x0/0x580) pid=1234 comm="swapper/8"
<...>-3391432 [008] d.h. 11766730.228131: f0: (try_to_wake_up+0x0/0x580) pid=1234 comm="docker-proxy-cu"
^Cecho 0 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo -:f0 >> /sys/kernel/debug/tracing/kprobe_events
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
Doesn't that feel like three axe strokes and done?
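To answer "who woke Robert" from such output, the trace lines can be tallied with a few lines of Python. This is only a sketch: `count_wakers` and `EVENT_RE` are hypothetical names, not part of surftrace, and the regular expression assumes the `pid=... comm="..."` field order produced by the expression in this example.

```python
import re
from collections import Counter

# Assumes the field order of this example: pid=<woken pid> comm="<waker comm>"
EVENT_RE = re.compile(r'\bpid=(?P<pid>\d+)\s+comm="(?P<comm>[^"]*)"')


def count_wakers(lines, target_pid):
    """Count, per waker comm, the events whose pid field equals target_pid."""
    wakers = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match and int(match.group("pid")) == target_pid:
            wakers[match.group("comm")] += 1
    return wakers


# Event lines taken from the output above.
sample_lines = [
    '<idle>-0 [011] d.h. 11766726.224113: f0: (try_to_wake_up+0x0/0x580) pid=1234 comm="swapper/11"',
    '<...>-2166943 [011] d.h. 11766727.225113: f0: (try_to_wake_up+0x0/0x580) pid=1234 comm="python3"',
    '<idle>-0 [008] d.h. 11766728.226114: f0: (try_to_wake_up+0x0/0x580) pid=1234 comm="swapper/8"',
    '<idle>-0 [008] d.h. 11766729.227114: f0: (try_to_wake_up+0x0/0x580) pid=1234 comm="swapper/8"',
    '<...>-3391432 [008] d.h. 11766730.228131: f0: (try_to_wake_up+0x0/0x580) pid=1234 comm="docker-proxy-cu"',
]

for comm, hits in count_wakers(sample_lines, 1234).most_common():
    print(comm, hits)
```

Because the probe records `$comm`, the name of the task running when try_to_wake_up is entered, the tally lists the wakers rather than the woken task.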
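2. About surftrace
surftrace is a set of tools built on top of ftrace and libbpf for tracing kernel information.
Project link: https://github.com/aliyun/surftrace.git
surftrace-cmd, which we introduce next, is a wrapper around ftrace, so we start with ftrace itself.
2.1 How ftrace works and where it falls short
For an introduction to ftrace, see davaddi's article "问题排查利器:Linux 原生跟踪工具 Ftrace 必知必会" (roughly, "A troubleshooting essential: what you must know about Ftrace, Linux's native tracing tool"), which covers it in detail. In short, ftrace is a tracer built into the kernel that helps system developers and designers see what the kernel is doing, and it can be used to debug or analyze common problems such as latency and performance. ftrace has since grown into a tracing framework. Introduced in the 2.6 kernel series, it is widely regarded as a safe, reliable and efficient way to obtain kernel data.
But ftrace asks a lot of its users. Take tracing the kernel symbol wake_up_new_task while also reading the comm member of its argument, (struct task_struct *)->comm. Starting the trace takes three steps:
echo 'p:f0 wake_up_new_task comm=+0x678(%di):string' >> /sys/kernel/debug/tracing/kprobe_events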
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
Stopping the trace takes three more:
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo -:f0 >> /sys/kernel/debug/tracing/kprobe_events
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
That makes six steps in total. The hardest is the first one, working out the argument expression: it usually means loading the matching kernel vmlinux into gdb and computing the offset of the comm member within struct task_struct. Unless done regularly, repeating this by hand is very costly, which is why ftrace is rarely used directly to collect kernel information and why material on doing so is scarce.
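The six manual steps can be written down as a small generator, which makes it clear what a wrapper has to automate. The sketch below only builds command strings and does not touch the system. `kprobe_steps` and `gdb_offset_argv` are hypothetical helper names, the paths mirror the ones shown above, and the gdb expression is the usual "offset of a member" idiom, not something taken from the surftrace source.

```python
TRACEFS = "/sys/kernel/debug/tracing"
INSTANCE = TRACEFS + "/instances/surftrace"


def kprobe_steps(name, symbol, fetch_args=""):
    """Return (start, stop): the three commands to start and three to stop."""
    probe = "p:{} {}".format(name, symbol)
    if fetch_args:
        probe += " " + fetch_args
    event_dir = "{}/events/kprobes/{}".format(INSTANCE, name)
    start = [
        "echo '{}' >> {}/kprobe_events".format(probe, TRACEFS),
        "echo 1 > {}/enable".format(event_dir),
        "echo 1 > {}/tracing_on".format(INSTANCE),
    ]
    stop = [
        "echo 0 > {}/enable".format(event_dir),
        "echo -:{} >> {}/kprobe_events".format(name, TRACEFS),
        "echo 0 > {}/tracing_on".format(INSTANCE),
    ]
    return start, stop


def gdb_offset_argv(vmlinux, struct, member):
    """argv asking gdb to print the byte offset of struct.member in hex."""
    expr = "p/x &((struct {} *)0)->{}".format(struct, member)
    return ["gdb", "-batch", "-ex", expr, vmlinux]


start, stop = kprobe_steps("f0", "wake_up_new_task", "comm=+0x678(%di):string")
for command in start + stop:
    print(command)
# "./vmlinux" is a placeholder path to the matching kernel image.
print(" ".join(gdb_offset_argv("./vmlinux", "task_struct", "comm")))
```

The offset printed by gdb (0x678 in the example above) is kernel-specific, which is why it has to be recomputed for every kernel build.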
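2.2 Goals of surftrace
The main goal of surftrace is to lower the barrier to kernel tracing so that kernel information can be obtained quickly and efficiently. Specifically, it aims to:
- trace a kernel symbol and fetch the specified kernel data with a single command;
- require no new knowledge beyond C and the Linux kernel (unless the collected data needs post-processing);
- cover the kernels of most mainstream distributions;
- offer a bcc-like development model with the low resource consumption of libbpf.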
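3. Using the surftrace command
Using surftrace requires the following:
- a publicly released distribution kernel; for the supported list see http://pylcc.openanolis.cn/version/ (continuously updated);
- a kernel with ftrace support, debugfs configured, and root privileges;
- Python 2 >= 2.7 or Python 3 >= 3.5, with pip installed.
surftrace supports three expression parsers, remote (the default), local and gdb, with these requirements:
- remote mode: pylcc.openanolis.cn must be reachable;
- local mode: download the db file for your arch and kernel from http://pylcc.openanolis.cn/db/ to the local machine;
- gdb mode: gdb version > 8.0 and the vmlinux of the target kernel available locally; gdb mode is not limited to public distribution kernels.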
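Some of the local requirements above can be checked before installing. The following is a minimal sketch, not a surftrace feature: `check_prerequisites` is a hypothetical helper, and it assumes tracefs is exposed under the debugfs path used throughout this article.

```python
import os
import sys


def check_prerequisites(tracefs="/sys/kernel/debug/tracing"):
    """Report which of the local requirements appear to be met."""
    version = sys.version_info
    return {
        # Writing to tracefs requires root.
        "root": os.geteuid() == 0,
        # The directory exists only if debugfs/tracefs is mounted.
        "tracefs": os.path.isdir(tracefs),
        # kprobe_events exists only if the kernel has kprobe event support.
        "kprobe_events": os.path.exists(os.path.join(tracefs, "kprobe_events")),
        # Python 2 >= 2.7 or Python 3 >= 3.5.
        "python": version >= (3, 5) or (2, 7) <= version < (3,),
    }


for item, ok in sorted(check_prerequisites().items()):
    print("{:14s} {}".format(item, "ok" if ok else "missing"))
```

The check says nothing about whether the running kernel is in the supported list; that still has to be looked up at the URL given above.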
3.1 Installation
Taking the Anolis OS (龙蜥) kernel 4.19.91-24.8.an8.x86_64 as an example, run the following as root to install:
pip3 install surftrace
Collecting surftrace
Downloading http://mirrors.cloud.aliyuncs.com/pypi/packages/b9/a2/f7e04bb8ebb12e6517162a70886e3ffe8d466437b15624590c9301fdcc52/surftrace-0.2.tar.gz
Building wheels for collected packages: surftrace
Running setup.py bdist_wheel for surftrace ... done
Stored in directory: /root/.cache/pip/wheels/cf/28/93/187f359be189bf0bf4a70197c53519c6ca54ffb957bcbebf5a
Successfully built surftrace
Installing collected packages: surftrace
Successfully installed surftrace-0.2
Check that the installation succeeded:
surftrace --help
usage: surftrace [-h] [-v VMLINUX] [-m MODE] [-d DB] [-r RIP] [-f FILE]
[-g GDB] [-F FUNC] [-o OUTPUT] [-l LINE] [-a ARCH] [-s] [-S]
[traces [traces ...]]
Trace ftrace kprobe events.
positional arguments:
traces set trace args.
optional arguments:
-h, --help show this help message and exit
-v VMLINUX, --vmlinux VMLINUX
set vmlinux path.
-m MODE, --mode MODE set arg parser, fro
-d DB, --db DB set local db path.
-r RIP, --rip RIP set remote server ip, remote mode only.
-f FILE, --file FILE set input args path.
-g GDB, --gdb GDB set gdb exe file path.
-F FUNC, --func FUNC disasassemble function.
-o OUTPUT, --output OUTPUT
set output bash file
-l LINE, --line LINE get file disasemble info
-a ARCH, --arch ARCH set architecture.
-s, --stack show call stacks.
-S, --show only show expressions.
examples:
3.2 Tracing at function entry
The following sections use two common kernel symbols as examples. Their prototypes are:
void wake_up_new_task(struct task_struct *p);
struct file *do_filp_open(int dfd, struct filename *pathname, const struct open_flags *op);
3.2.1 Tracing the entry and return of a symbol
Command: surftrace 'p wake_up_new_task' 'r wake_up_new_task'
surftrace 'p wake_up_new_task' 'r wake_up_new_task'
echo 'p:f0 wake_up_new_task' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 'r:f1 wake_up_new_task' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f1/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
surftrace-2336 [001] .... 1447.877666: f0: (wake_up_new_task+0x0/0x280)
surftrace-2336 [001] d... 1447.877670: f1: (_do_fork+0x153/0x3d0 <- wake_up_new_task)
The example passes two expressions; every expression must be enclosed in single quotes.
- 'p wake_up_new_task': p means probe the function entry;
- 'r wake_up_new_task': r means probe the function return.
The wake_up_new_task that follows is the symbol to trace. The symbol must be present in tracing/available_filter_functions.
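Whether a symbol can be traced can be checked up front by scanning available_filter_functions. The sketch below assumes the usual layout of that file, one symbol per line with module functions followed by a bracketed module name; `symbol_listed` is a hypothetical helper, not a surftrace API.

```python
def symbol_listed(symbol, lines):
    """Return True if symbol is the first field of any line in the listing."""
    for line in lines:
        fields = line.split()
        if fields and fields[0] == symbol:
            return True
    return False


# Illustrative listing; the second entry and its module name are made up.
listing = [
    "wake_up_new_task\n",
    "example_module_func [example_module]\n",
]
print(symbol_listed("wake_up_new_task", listing))
print(symbol_listed("no_such_symbol", listing))

# On a real system (as root) the file itself can be passed in:
# with open("/sys/kernel/debug/tracing/available_filter_functions") as f:
#     print(symbol_listed("wake_up_new_task", f))
```

Matching on the first field rather than on a substring avoids false positives such as a longer symbol that merely contains the name being searched for.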
3.2.2 Getting function arguments
To read dfd, the first argument of do_filp_open, whose data type is int:
Command: surftrace 'p do_filp_open dfd=%0'
surftrace 'p do_filp_open dfd=%0'
echo 'p:f0 do_filp_open dfd=%di:u32' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
surftrace-2435 [001] .... 2717.606277: f0: (do_filp_open+0x0/0x100) dfd=4294967196
AliYunDun-1812 [000] .... 2717.655955: f0: (do_filp_open+0x0/0x100) dfd=4294967196
AliYunDun-1812 [000] .... 2717.856227: f0: (do_filp_open+0x0/0x100) dfd=4294967196
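- dfd is a user-defined variable name; any name works as long as it does not clash with another.
- %0 denotes the first argument, %1 the second, and so on.
In the output above dfd is printed in decimal, which may be less readable than hexadecimal. To request hexadecimal:
Command: surftrace 'p do_filp_open dfd=X%0'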
surftrace 'p do_filp_open dfd=X%0'
echo 'p:f0 do_filp_open dfd=%di:x32' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
surftrace-2459 [000] .... 3137.167885: f0: (do_filp_open+0x0/0x100) dfd=0xffffff9c
AliYunDun-1812 [001] .... 3137.171997: f0: (do_filp_open+0x0/0x100) dfd=0xffffff9c
AliYunDun-1826 [001] .... 3137.201401: f0: (do_filp_open+0x0/0x100) dfd=0xffffff9c
The X placed before the argument specifier % is a format type. Three types are available, S, U and X, for signed decimal, unsigned decimal and hexadecimal respectively; the default is U.
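The effect of the three format types can be illustrated with the dfd value printed above. The mapping to the ftrace suffixes (s32, u32, x32) is read off the generated echo lines in this article. `ftrace_type` and `as_signed` are hypothetical helpers written for this sketch.

```python
# S/U/X format letters and the ftrace type prefix each appears to map to.
FORMAT_PREFIX = {"S": "s", "U": "u", "X": "x"}


def ftrace_type(fmt="U", bits=32):
    """Build an ftrace fetch-arg type such as 'u32' or 'x64'."""
    return "{}{}".format(FORMAT_PREFIX[fmt], bits)


def as_signed(value, bits=32):
    """Reinterpret an unsigned integer as a two's-complement signed one."""
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


dfd = 4294967196                       # printed with the default U type
print(ftrace_type("U"), dfd)           # unsigned decimal
print(ftrace_type("X"), hex(dfd))      # hexadecimal, 0xffffff9c
print(ftrace_type("S"), as_signed(dfd))  # signed decimal, -100
```

Read as a signed int, the value is -100, which is AT_FDCWD on Linux. For an int argument such as dfd, the S type would therefore give the most readable output.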
3.2.3 Parsing struct arguments
The argument of wake_up_new_task has type struct task_struct *. To read its comm member, i.e. the task name:
Command: surftrace 'p wake_up_new_task comm=%0->comm'
surftrace 'p wake_up_new_task comm=%0->comm'
echo 'p:f0 wake_up_new_task comm=+0xae0(%di):string' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
surftrace-2421 [000] .... 2368.261019: f0: (wake_up_new_task+0x0/0x280) comm="surftrace"
bash-2392 [001] .... 2375.809655: f0: (wake_up_new_task+0x0/0x280) comm="bash"
bash-2392 [001] .... 2379.038534: f0: (wake_up_new_task+0x0/0x280) comm="bash"
bash-2392 [000] .... 2381.237443: f0: (wake_up_new_task+0x0/0x280) comm="bash"
The syntax is the same as accessing a struct member in C.
Struct members can be accessed in a chain:
surftrace 'p wake_up_new_task uesrs=S%0->mm->mm_users'
echo 'p:f0 wake_up_new_task uesrs=+0x58(+0x850(%di)):s32' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
surftrace-2471 [001] .... 3965.234680: f0: (wake_up_new_task+0x0/0x280) uesrs=2
bash-2392 [000] .... 3970.094475: f0: (wake_up_new_task+0x0/0x280) uesrs=1
bash-2392 [000] .... 3971.954463: f0: (wake_up_new_task+0x0/0x280) uesrs=1
surftrace 'p wake_up_new_task node=%0->se.run_node.rb_left'
echo 'p:f0 wake_up_new_task node=+0xa8(%di):u64' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
surftrace-2543 [001] .... 5926.605145: f0: (wake_up_new_task+0x0/0x280) node=0
bash-2392 [001] .... 5940.292293: f0: (wake_up_new_task+0x0/0x280) node=0
bash-2392 [001] .... 5945.207106: f0: (wake_up_new_task+0x0/0x280) node=0
systemd-journal-553 [000] .... 5953.211998: f0: (wake_up_new_task+0x0/0x280) node=0
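The generated fetch arguments above follow a simple pattern: each pointer dereference (`->`) wraps the expression in another `+offset(...)`, while members of embedded structs (`.`) only add to the offset. The sketch below reproduces the two strings from the output. `fetch_arg` is a hypothetical helper, and the offsets are the ones printed above for that particular kernel.

```python
def fetch_arg(register, deref_offsets, ftype):
    """Nest one '+off(...)' per pointer dereference, innermost first."""
    expr = register
    for offset in deref_offsets:
        expr = "+0x{:x}({})".format(offset, expr)
    return "{}:{}".format(expr, ftype)


# %0->mm->mm_users: two dereferences, so two nested offsets.
print(fetch_arg("%di", [0x850, 0x58], "s32"))

# %0->se.run_node.rb_left: one dereference; the embedded struct members
# collapse into the single summed offset 0xa8.
print(fetch_arg("%di", [0xa8], "u64"))
```

The first call yields `+0x58(+0x850(%di)):s32` and the second `+0xa8(%di):u64`, matching the echo lines shown above.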
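3.2.4 Setting a filter
A filter goes at the end of the expression and starts with f:. Conditions can be combined with parentheses and the logical operators && and ||; see the ftrace documentation for the exact syntax.
Command: surftrace 'p wake_up_new_task comm=%0->comm f:comm=="python3"'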
surftrace 'p wake_up_new_task comm=%0->comm f:comm=="python3"'
echo 'p:f0 wake_up_new_task comm=+0xb28(%di):string' >> /sys/kernel/debug/tracing/kprobe_events
echo 'comm=="python3"' > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/filter
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
<...>-2640781 [002] .... 6305734.444913: f0: (wake_up_new_task+0x0/0x250) comm="python3"
<...>-2640781 [002] .... 6305734.447806: f0: (wake_up_new_task+0x0/0x250) comm="python3"
<...>-2640781 [002] .... 6305734.450897: f0: (wake_up_new_task+0x0/0x250) comm="python3"
The system also provides built-in variables, including 'common_pid', 'common_preempt_count', 'common_flags' and 'common_type', that can be used in filters without being defined.
3.2.5 Tracing inside a function
Tracing inside a function relies on reasoning from the function's disassembly, so the method is not general and this section is for reference only. Disassembly of do_filp_open:
3699 in fs/namei.c
0xffffffff812adb65 <+85>: mov %r13d,%edx
0xffffffff812adb70 <+96>: or $0x40,%edx
0xffffffff812adb73 <+99>: mov %r12,%rsi
0xffffffff812adb76 <+102>: mov %rsp,%rdi
0xffffffff812adb89 <+121>: callq 0xffffffff812ac760
0xffffffff812adb92 <+130>: mov %rax,%rbx
3700 in fs/namei.c
0xffffffff812adb8e <+126>: cmp $0xfffffffffffffff6,%rax
0xffffffff812adb95 <+133>: je 0xffffffff812adbb4 <do_filp_open+164>
3701 in fs/namei.c
0xffffffff812adbb4 <+164>: mov %r13d,%edx
0xffffffff812adbb7 <+167>: mov %r12,%rsi
0xffffffff812adbba <+170>: mov %rsp,%rdi
0xffffffff812adbbd <+173>: callq 0xffffffff812ac760
0xffffffff812adbc2 <+178>: mov %rax,%rbx
0xffffffff812adbc5 <+181>: jmp 0xffffffff812adb97 <do_filp_open+135>
3702 in fs/namei.c
0xffffffff812adb97 <+135>: cmp $0xffffffffffffff8c,%rbx
0xffffffff812adb9b <+139>: je 0xffffffff812adbc7 <do_filp_open+183>
Corresponding source:
struct file *do_filp_open(int dfd, struct filename *pathname,
const struct open_flags *op)
{
struct nameidata nd;
int flags = op->lookup_flags;
struct file *filp;
set_nameidata(&nd, dfd, pathname);
filp = path_openat(&nd, op, flags | LOOKUP_RCU);
if (unlikely(filp == ERR_PTR(-ECHILD)))
filp = path_openat(&nd, op, flags);
if (unlikely(filp == ERR_PTR(-ESTALE)))
filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
restore_nameidata();
return filp;
}
To get the value of filp from line 3699, filp = path_openat(&nd, op, flags | LOOKUP_RCU):
surftrace 'p do_filp_open+121 filp=X!(u64)%ax'
echo 'p:f0 do_filp_open+121 filp=%ax:x64' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
<...>-1315799 [006] d.Z. 6314249.201847: f0: (do_filp_open+0x79/0xd0) filp=0xffff929db2819840
<...>-4006158 [014] d.Z. 6314249.326736: f0: (do_filp_open+0x79/0xd0) filp=0xffff929daeac48c0
In the expression filp=X!(u64)%ax, ! casts the register to a data type, and the type is given inside the parentheses.
Expanding the definition of struct file:
struct file {
union {
struct llist_node fu_llist;
struct callback_head fu_rcuhead;
} f_u;
struct path f_path;
struct inode *f_inode;
const struct file_operations *f_op;
spinlock_t f_lock;
enum rw_hint f_write_hint;
atomic_long_t f_count;
unsigned int f_flags;
fmode_t f_mode;
struct mutex f_pos_lock;
loff_t f_pos;
struct fown_struct f_owner;
const struct cred *f_cred;
struct file_ra_state f_ra;
u64 f_version;
void *f_security;
void *private_data;
struct list_head f_ep_links;
struct list_head f_tfile_llink;
struct address_space *f_mapping;
errseq_t f_wb_err;
}
To read the value of f_pos at this point:
- Command: surftrace 'p do_filp_open+121 pos=X!(struct file*)%ax->f_pos'
surftrace 'p do_filp_open+121 pos=X!(struct file*)%ax->f_pos'
echo 'p:f0 do_filp_open+121 pos=+0x68(%ax):x64' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
<...>-1334277 [010] d.Z. 6314645.646230: f0: (do_filp_open+0x79/0xd0) pos=0x0
<...>-2916553 [002] d.Z. 6314645.653164: f0: (do_filp_open+0x79/0xd0) pos=0x0
<...>-2916553 [002] d.Z. 6314645.653253: f0: (do_filp_open+0x79/0xd0) pos=0x0
The access syntax is the same as before.
3.3 Getting the return value
As described earlier, r marks a return probe. The return register is always written as $retval, the same as in ftrace. To get the return value of do_filp_open:
- Command: surftrace 'r do_filp_open filp=$retval'
surftrace 'r do_filp_open filp=$retval'
echo 'r:f0 do_filp_open filp=$retval:u64' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
<...>-1362926 [010] d... 6315264.198718: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) filp=18446623804769722880
<...>-4006154 [008] d... 6315264.256749: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) filp=18446623804770426624
<...>-4006154 [008] d... 6315264.256776: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) filp=18446623804770425344
To read the f_pos member of the returned struct file:
- Command: surftrace 'r do_filp_open pos=$retval->f_pos'
surftrace 'r do_filp_open pos=$retval->f_pos'
echo 'r:f0 do_filp_open pos=+0x68($retval):u64' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
<...>-1371049 [008] d... 6315439.568814: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) pos=0
systemd-journal-3665 [012] d... 6315439.568962: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) pos=0
systemd-journal-3665 [012] d... 6315439.571519: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) pos=0
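The filp values printed in decimal by the first $retval example are hard to read as pointers, and do_filp_open may also return an encoded error such as ERR_PTR(-ECHILD) or ERR_PTR(-ESTALE), as the source shown earlier suggests. The following sketch decodes a 64-bit return value on the user side. `decode_retval` is a hypothetical helper; the 4095 bound is the MAX_ERRNO limit the kernel uses for error pointers.

```python
import errno

MAX_ERRNO = 4095  # the top 4095 values of the address space encode -errno


def decode_retval(value, bits=64):
    """Render a raw $retval as 'ERR_PTR(-NAME)' or as a hex pointer."""
    if value >= (1 << bits) - MAX_ERRNO:
        code = (1 << bits) - value
        return "ERR_PTR(-{})".format(errno.errorcode.get(code, code))
    return hex(value)


# A value from the output above, printed there as unsigned decimal.
print(decode_retval(18446623804769722880))
# How an error return would look after decoding.
print(decode_retval((1 << 64) - errno.ECHILD))
```

Using the X format type in the expression (filp=X$retval) would give the hexadecimal form directly; the decoding of error pointers is what needs the extra step.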
3.4 Handling skb
sk_buff is a key structure in the Linux network stack. The methods above cannot directly parse the packet contents we care about, so special handling is needed. Taking tracing of received ICMP ping packets as an example, we probe and filter in the __netif_receive_skb_core function:
- Command: surftrace 'p __netif_receive_skb_core proto=@(struct iphdr *)l3%0->protocol ip_src=@(struct iphdr *)%0->saddr ip_dst=@(struct iphdr *)l3%0->daddr data=X@(struct iphdr *)l3%0->sdata[1] f:proto==1&&ip_src==127.0.0.1'
- You may also need to run ping 127.0.0.1 at the same time.
surftrace 'p __netif_receive_skb_core proto=@(struct iphdr *)l3%0->protocol ip_src=@(struct iphdr *)%0->saddr ip_dst=@(struct iphdr *)l3%0->daddr data=X@(struct iphdr *)l3%0->sdata[1] f:proto==1&&ip_src==127.0.0.1'
echo 'p:f0 __netif_receive_skb_core proto=+0x9(+0xe8(%di)):u8 ip_src=+0xc(+0xe8(%di)):u32 ip_dst=+0x10(+0xe8(%di)):u32 data=+0x16(+0xe8(%di)):x16' >> /sys/kernel/debug/tracing/kprobe_events
echo 'proto==1&&ip_src==0x100007f' > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/filter
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
<...>-1420827 [013] ..s1 6316511.011244: f0: (__netif_receive_skb_core+0x0/0xc10) proto=1 ip_src=127.0.0.1 ip_dst=127.0.0.1 data=0x4a0d
<...>-1420827 [013] ..s1 6316511.011264: f0: (__netif_receive_skb_core+0x0/0xc10) proto=1 ip_src=127.0.0.1 ip_dst=127.0.0.1 data=0x4a1
The protocol is fetched with the expression @(struct iphdr *)l3%0->protocol. Unlike before, an @ is placed in front of the opening parenthesis of the struct cast to mark that the struct should be used to parse the data that skb->data points to, and an l3 marker (called the right marker) follows the closing parenthesis to indicate that skb->data currently points at layer 3 of the TCP/IP stack.
- The right marker can be l2, l3 or l4. It can also be omitted, in which case it defaults to l3, as in ip_src=@(struct iphdr *)%0->saddr.
- Five packet structs are supported: 'struct ethhdr', 'struct iphdr', 'struct icmphdr', 'struct tcphdr' and 'struct udphdr'. If the stack layer and the packet struct do not match, for example a right marker of l3 with a packet struct of struct ethhdr, the parser reports a parameter error.
- The three layer-4 structs, 'struct icmphdr', 'struct tcphdr' and 'struct udphdr', gain an extra xdata member for reading the payload of the corresponding protocol. xdata comes in five types, cdata, sdata, ldata, qdata and Sdata, corresponding to widths of 1, 2, 4 and 8 bytes and to a string. Array indexes are aligned to the width: in the example's data expression, sdata[1] extracts bytes 2-3 of the ICMP message.
- surftrace converts variables whose names start with ip_ between IPv4 notation and u32, so ip_src=@(struct iphdr *)%0->saddr is shown as a dotted IP address. Variables whose names start with B16, B32, B64, b16, b32 or b64 also get a byte-order conversion; B-prefixed ones are printed in hexadecimal and b-prefixed ones in decimal.
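Two numbers in the generated commands above can be reproduced by hand: the filter value 0x100007f for 127.0.0.1 and the offset 0x16 for sdata[1]. The sketch below shows the arithmetic under two assumptions that are inferred from the output, not taken from the surftrace source: a little-endian CPU (x86) and a 20-byte IPv4 header without options. The helper names are hypothetical.

```python
import socket
import struct


def ipv4_to_filter_value(addr):
    """u32 that a little-endian CPU reads from the 4 network-order bytes."""
    return struct.unpack("<I", socket.inet_aton(addr))[0]


def filter_value_to_ipv4(value):
    """Inverse conversion, as applied to ip_-prefixed variables on output."""
    return socket.inet_ntoa(struct.pack("<I", value))


def xdata_offset(l3_header_len, index, width):
    """Byte offset from the start of the IP header of xdata[index]."""
    return l3_header_len + index * width


print(hex(ipv4_to_filter_value("127.0.0.1")))  # value used in the filter
print(filter_value_to_ipv4(0x100007F))         # back to dotted notation
print(hex(xdata_offset(20, 1, 2)))             # sdata[1] after a 20-byte IP header
```

The results, 0x100007f and 0x16, match the filter and the data fetch argument in the echo lines above.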
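3.5 event
For trace events, see the event descriptions under /sys/kernel/debug/tracing/events. As an example, to trace tasks that waited more than 1 ms to run after wakeup (delay is in nanoseconds):
Command: surftrace 'e sched/sched_stat_wait f:delay>1000000'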
surftrace 'e sched/sched_stat_wait f:delay>1000000'
echo 'delay>1000000' > /sys/kernel/debug/tracing/instances/surftrace/events/sched/sched_stat_wait/filter
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/sched/sched_stat_wait/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
<idle>-0 [001] dN.. 11868700.419049: sched_stat_wait: comm=h2o pid=3046552 delay=87023763 [ns]
<idle>-0 [005] dN.. 11868700.419049: sched_stat_wait: comm=h2o pid=3046617 delay=87360020 [ns]
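The delay field of sched_stat_wait is reported in nanoseconds, so the threshold 1000000 in the filter corresponds to 1 ms. The sketch below converts a threshold and parses the event lines shown above. `ms_to_ns` and `parse_stat_wait` are hypothetical helpers, and the regular expression assumes the field layout printed in this example.

```python
import re

STAT_WAIT_RE = re.compile(
    r"comm=(?P<comm>\S+) pid=(?P<pid>\d+) delay=(?P<delay>\d+) \[ns\]"
)


def ms_to_ns(milliseconds):
    """Convert a threshold in milliseconds to the nanoseconds the filter needs."""
    return int(milliseconds * 1000 * 1000)


def parse_stat_wait(line):
    """Return (comm, pid, delay in ms) or None if the line does not match."""
    match = STAT_WAIT_RE.search(line)
    if match is None:
        return None
    delay_ms = int(match.group("delay")) / 1e6
    return match.group("comm"), int(match.group("pid")), delay_ms


print("f:delay>{}".format(ms_to_ns(1)))
line = "sched_stat_wait: comm=h2o pid=3046552 delay=87023763 [ns]"
print(parse_stat_wait(line))
```

For a 10 ms threshold the filter would be f:delay>10000000.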
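4. Summary
As the examples show, surftrace-cmd is a concise, easy-to-use kernel tracing tool in the spirit of the three axe strokes. It has clear advantages in the following scenarios:
- quick tracing of kernel symbols, with argument parsing and data filtering, in a single command;
- assembly-level tracing and data parsing inside a function, which is cumbersome with approaches such as libbpf and bcc;
- skb packet parsing, with conveniences such as byte-order and IP-format conversion already built in, making it easy to follow a network packet through every stage of the kernel.
That said, surftrace-cmd has no built-in data types like the hashmap in libbpf. For scenarios that need complex logic or storage in kernel space, libbpf and similar approaches are still recommended.
In follow-up articles we will use real cases to show typical applications of surftrace-cmd to network, IO and other kernel problems.
Original title: 内核trace三板斧-surtrace-cmd
Source: WeChat official account Linux阅码场 (WeChat ID: LinuxDev). Please credit the source when reposting.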