Linux Performance Optimization: CPU
- Preface
- Load
- CPU usage
- proc
- perf
- Some links
- `perf list`
- Useful events
- `perf stat`
- `perf record`
- Profiling
- Static Tracing
- `perf report`
- perf probe
- MISC
- Using perf to profile programs inside Docker
- `perf top`
- perf optimization
- Generating flame graphs with perf
- Flame graph recipes from Brendan Gregg's blog
- Generating flame graphs with SystemTap
- perf script
- perf usage notes
- Off-CPU analysis
- Measuring off-CPU time
- Caveats
- Request-Synchronous Context
- Scheduler Latency
- Involuntary Context Switching
- IO
- Off-CPU
- wake-up
- Chain Graphs
- Overview of the bcc tool suite
- Single Purpose Tools:
- Multi Tools: Kernel Dynamic Tracing
- Multi Tools: User Level Dynamic Tracing
- Multi Tools: Kernel Static Tracing
- Multi Tools: User Statically Defined Tracing (USDT)
- perf-tools
- References
Preface
This article is my working notes from day-to-day CPU performance tuning. The structure is somewhat messy and will be reorganized in the future.
In my view, the two most effective ways to learn performance optimization are reading the articles of the grandmaster, Brendan Gregg, and hands-on practice.
Load
- What is an appropriate load average?
  - Once the load average exceeds 70% of the CPU count, you should start investigating what is driving it up (a quick check is sketched after this list).
- Three situations push the load average up:
  - CPU-bound processes: heavy CPU use raises the load average.
  - I/O-bound processes: waiting on I/O also raises the load average, even though CPU usage may not be high.
  - Many processes waiting to be scheduled onto a CPU also raise the load average; in that case CPU usage is high as well.
- cswch: voluntary context switches per second; its counterpart is nvcswch.
  - A voluntary context switch happens when a process cannot get a resource it needs, for example when I/O or memory is short.
- nvcswch: non-voluntary (involuntary) context switches per second.
  - An involuntary context switch happens when the system forcibly reschedules a process, for example because its time slice ran out. When many processes compete for the CPU, involuntary context switches become frequent.
- Triggering a context switch costs roughly 3-5 microseconds even on a well-behaved system, which is slower than grabbing a lock or synchronizing caches.
- Steps in Context Switching: explains what actually happens during a context switch.
- How many Context Switches is "normal" (as a function of CPU cores (or other))?
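A quick check combining both ideas (a rough sketch; the 70% figure is the rule of thumb above, and pidstat comes from the sysstat package):

```bash
# Compare the 1-minute load average against 70% of the CPU count.
cpus=$(nproc)
load1=$(cut -d ' ' -f1 /proc/loadavg)
awk -v l="$load1" -v c="$cpus" 'BEGIN { printf "load1=%s  cpus=%d  threshold=%.1f\n", l, c, c * 0.7 }'

# Per-process voluntary (cswch/s) and involuntary (nvcswch/s) context switches,
# sampled every 5 seconds.
pidstat -w 5
```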
CPU usage
- Linux exposes internal kernel state to user space through the /proc virtual filesystem; /proc/stat provides the system-wide CPU and task statistics.
  - user (usually abbreviated us): user-mode CPU time. Note that it does not include nice time, but it does include guest time.
  - nice (ni): low-priority user-mode CPU time, i.e. time spent in processes whose nice value is between 1 and 19. Note that nice ranges from -20 to 19, and a larger value means a lower priority.
  - system (sys): kernel-mode CPU time.
  - idle (id): idle time. Note that it does not include time spent waiting for I/O (iowait).
  - irq (hi): CPU time spent servicing hard interrupts.
  - softirq (si): CPU time spent servicing soft interrupts.
  - steal (st): when the system runs inside a virtual machine, the CPU time taken by other virtual machines.
  - guest: CPU time spent running other operating systems under virtualization, i.e. time spent running virtual machines.
  - guest_nice (gnice): time spent running virtual machines at low priority.
- Performance tools report CPU usage averaged over a sampling interval, so pay attention to the interval you set (a sketch of computing usage by hand from two /proc/stat samples follows).
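Tools such as top derive CPU usage by diffing these counters between two samples. A rough, hand-rolled sketch (approximate, since guest time is already folded into user time):

```bash
# Sum all fields of the aggregate "cpu" line, and keep idle+iowait separately.
read_cpu() { awk '/^cpu /{t=0; for (i=2; i<=NF; i++) t+=$i; print t, $5+$6}' /proc/stat; }

read t1 i1 <<< "$(read_cpu)"   # sample 1: total jiffies, idle jiffies
sleep 1
read t2 i2 <<< "$(read_cpu)"   # sample 2

# Busy percentage over the interval = 1 - delta(idle) / delta(total).
awk -v dt=$((t2 - t1)) -v di=$((i2 - i1)) 'BEGIN { printf "CPU usage: %.1f%%\n", (1 - di / dt) * 100 }'
```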
proc
perf
Some links
When facing a misbehaving program, a top-down strategy works best: first get an overall picture of the program's runtime statistics, then drill into the details in a few chosen directions. Do not dive straight into the minutiae, or you will miss the forest for the trees.
Some programs are slow because they compute too much; they should spend most of their time on the CPU (CPU bound). Others are slow because of excessive I/O, in which case their CPU utilization is low (I/O bound). Tuning a CPU-bound program is different from tuning an I/O-bound one.
perf list
perf list shows every event that perf can instrument; perf list 'sched:*' lists only the scheduler-related tracepoints.
Different systems produce different lists. On a 2.6.35 kernel the list is already quite long, but however many entries there are, they can be grouped by subsystem prefix, for example:
- block: block device I/O
- ext4: file system operations
- kmem: kernel memory allocation events
- random: kernel random number generator events
- sched: CPU scheduler events
- syscalls: system call enter and exits
- task: task events
Useful events
perf stat
# CPU counter statistics for the specified command:
perf stat command
# Detailed CPU counter statistics (includes extras) for the specified command:
perf stat -d command
# CPU counter statistics for the specified PID, until Ctrl-C:
perf stat -p PID
# CPU counter statistics for the entire system, for 5 seconds:
perf stat -a sleep 5
# Various basic CPU statistics, system wide, for 10 seconds:
perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a sleep 10
# Various CPU level 1 data cache statistics for the specified command:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores command
# Various CPU data TLB statistics for the specified command:
perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses command
# Various CPU last level cache statistics for the specified command:
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command
# Using raw PMC counters, eg, counting unhalted core cycles:
perf stat -e r003c -a sleep 5
# PMCs: counting cycles and frontend stalls via raw specification:
perf stat -e cycles -e cpu/event=0x0e,umask=0x01,inv,cmask=0x01/ -a sleep 5
# Count syscalls per-second system-wide:
perf stat -e raw_syscalls:sys_enter -I 1000 -a
# Count system calls by type for the specified PID, until Ctrl-C:
perf stat -e 'syscalls:sys_enter_*' -p PID
# Count system calls by type for the entire system, for 5 seconds:
perf stat -e 'syscalls:sys_enter_*' -a sleep 5
# Count scheduler events for the specified PID, until Ctrl-C:
perf stat -e 'sched:*' -p PID
# Count scheduler events for the specified PID, for 10 seconds:
perf stat -e 'sched:*' -p PID sleep 10
# Count ext4 events for the entire system, for 10 seconds:
perf stat -e 'ext4:*' -a sleep 10
# Count block device I/O events for the entire system, for 10 seconds:
perf stat -e 'block:*' -a sleep 10
# Count all vmscan events, printing a report every second:
perf stat -e 'vmscan:*' -a -I 1000
perf record
- perf record saves the sampled data to a file; you then parse and display it with perf report.
- perf is usually run with -g, which also samples the call graph so you can see call chains.
- -p <pid> selects the process to profile.
Profiling
# Sample on-CPU functions for the specified command, at 99 Hertz:
perf record -F 99 command
# Sample on-CPU functions for the specified PID, at 99 Hertz, until Ctrl-C:
perf record -F 99 -p PID
# Sample on-CPU functions for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID sleep 10
# Sample CPU stack traces (via frame pointers) for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID -g -- sleep 10
# Sample CPU stack traces for the PID, using dwarf (dbg info) to unwind stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID --call-graph dwarf sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (< Linux 4.11):
perf record -F 99 -ag -- sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (>= Linux 4.11):
perf record -F 99 -g -- sleep 10
# If the previous command didn't work, try forcing perf to use the cpu-clock event:
perf record -F 99 -e cpu-clock -ag -- sleep 10
# Sample CPU stack traces for a container identified by its /sys/fs/cgroup/perf_event cgroup:
perf record -F 99 -e cpu-clock --cgroup=docker/1d567f4393190204...etc... -a -- sleep 10
# Sample CPU stack traces for the entire system, with dwarf stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -a --call-graph dwarf sleep 10
# Sample CPU stack traces for the entire system, using last branch record for stacks, ... (>= Linux 4.?):
perf record -F 99 -a --call-graph lbr sleep 10
# Sample CPU stack traces, once every 10,000 Level 1 data cache misses, for 5 seconds:
perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
# Sample CPU stack traces, once every 100 last level cache misses, for 5 seconds:
perf record -e LLC-load-misses -c 100 -ag -- sleep 5
# Sample on-CPU kernel instructions, for 5 seconds:
perf record -e cycles:k -a -- sleep 5
# Sample on-CPU user instructions, for 5 seconds:
perf record -e cycles:u -a -- sleep 5
# Sample on-CPU user instructions precisely (using PEBS), for 5 seconds:
perf record -e cycles:up -a -- sleep 5
# Perform branch tracing (needs HW support), for 1 second:
perf record -b -a sleep 1
Static Tracing
# Trace new processes, until Ctrl-C:
perf record -e sched:sched_process_exec -a
# Sample (take a subset of) context-switches, until Ctrl-C:
perf record -e context-switches -a
# Trace all context-switches, until Ctrl-C:
# -c 1 sets sample_period to 1 (visible with -vv), i.e. capture every event
perf record -e context-switches -c 1 -a
# Include raw settings used (see: man perf_event_open):
# -vv prints verbose details
perf record -vv -e context-switches -a
# Trace all context-switches via sched tracepoint, until Ctrl-C:
perf record -e sched:sched_switch -a
# Sample context-switches with stack traces, until Ctrl-C:
perf record -e context-switches -ag
# Sample context-switches with stack traces, for 10 seconds:
perf record -e context-switches -ag -- sleep 10
# Sample CS, stack traces, and with timestamps (< Linux 3.17, -T now default):
perf record -e context-switches -ag -T
# Sample CPU migrations, for 10 seconds:
perf record -e migrations -a -- sleep 10
# Trace all connect()s with stack traces (outbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_connect -ag
# Trace all accepts()s with stack traces (inbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_accept* -ag
# Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:
perf record -e block:block_rq_insert -ag
# Sample at most 100 block device requests per second, until Ctrl-C:
perf record -F 100 -e block:block_rq_insert -a
# Trace all block device issues and completions (has timestamps), until Ctrl-C:
perf record -e block:block_rq_issue -e block:block_rq_complete -a
# Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'nr_sector > 200'
# Trace all block completions, synchronous writes only, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs == "WS"'
# Trace all block completions, all types of writes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'
# Sample minor faults (RSS growth) with stack traces, until Ctrl-C:
perf record -e minor-faults -ag
# Trace all minor faults with stack traces, until Ctrl-C:
perf record -e minor-faults -c 1 -ag
# Sample page faults with stack traces, until Ctrl-C:
perf record -e page-faults -ag
# Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
perf record -e 'ext4:*' -o /tmp/perf.data -a
# Trace kswapd wakeup events, until Ctrl-C:
perf record -e vmscan:mm_vmscan_wakeup_kswapd -ag
# Add Node.js USDT probes (Linux 4.10+):
perf buildid-cache --add `which node`
# Trace the node http__server__request USDT event (Linux 4.10+):
perf record -e sdt_node:http__server__request -a
perf report
# Show perf.data in an ncurses browser (TUI) if possible:
perf report
# Show perf.data with a column for sample count:
perf report -n
# Show perf.data as a text report, with data coalesced and percentages:
perf report --stdio
# Report, with stacks in folded format: one line per stack (needs 4.4):
perf report --stdio -n -g folded
# List all events from perf.data:
perf script
# List all perf.data events, with data header (newer kernels; was previously default):
perf script --header
# List all perf.data events, with customized fields (< Linux 4.1):
perf script -f time,event,trace
# List all perf.data events, with customized fields (>= Linux 4.1):
perf script -F time,event,trace
# List all perf.data events, with my recommended fields (needs record -a; newer kernels):
perf script --header -F comm,pid,tid,cpu,time,event,ip,sym,dso
# List all perf.data events, with my recommended fields (needs record -a; older kernels):
perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso
# Dump raw contents from perf.data as hex (for debugging):
perf script -D
# Disassemble and annotate instructions with percentages (needs some debuginfo):
perf annotate --stdio
perf probe
Dynamic Tracing
# Add a tracepoint for the kernel tcp_sendmsg() function entry ("--add" is optional):
perf probe --add tcp_sendmsg
# Remove the tcp_sendmsg() tracepoint (or use "--del"):
perf probe -d tcp_sendmsg
# Add a tracepoint for the kernel tcp_sendmsg() function return:
perf probe 'tcp_sendmsg%return'
# Show available variables for the kernel tcp_sendmsg() function (needs debuginfo):
perf probe -V tcp_sendmsg
# Show available variables for the kernel tcp_sendmsg() function, plus external vars (needs debuginfo):
perf probe -V tcp_sendmsg --externs
# Show available line probes for tcp_sendmsg() (needs debuginfo):
perf probe -L tcp_sendmsg
# Show available variables for tcp_sendmsg() at line number 81 (needs debuginfo):
perf probe -V tcp_sendmsg:81
# Add a tracepoint for tcp_sendmsg(), with three entry argument registers (platform specific):
perf probe 'tcp_sendmsg %ax %dx %cx'
# Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for the %cx register (platform specific):
perf probe 'tcp_sendmsg bytes=%cx'
# Trace previously created probe when the bytes (alias) variable is greater than 100:
perf record -e probe:tcp_sendmsg --filter 'bytes > 100'
# Add a tracepoint for tcp_sendmsg() return, and capture the return value:
perf probe 'tcp_sendmsg%return $retval'
# Add a tracepoint for tcp_sendmsg(), and "size" entry argument (reliable, but needs debuginfo):
perf probe 'tcp_sendmsg size'
# Add a tracepoint for tcp_sendmsg(), with size and socket state (needs debuginfo):
perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'
# Tell me how on Earth you would do this, but don't actually do it (needs debuginfo):
perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state'
# Trace previous probe when size is non-zero, and state is not TCP_ESTABLISHED(1) (needs debuginfo):
perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a
# Add a tracepoint for tcp_sendmsg() line 81 with local variable seglen (needs debuginfo):
perf probe 'tcp_sendmsg:81 seglen'
# Add a tracepoint for do_sys_open() with the filename as a string (needs debuginfo):
perf probe 'do_sys_open filename:string'
# Add a tracepoint for myfunc() return, and include the retval as a string:
perf probe 'myfunc%return +0($retval):string'
# Add a tracepoint for the user-level malloc() function from libc:
perf probe -x /lib64/libc.so.6 malloc
# Add a tracepoint for this user-level static probe (USDT, aka SDT event):
perf probe -x /usr/lib64/libpthread-2.24.so %sdt_libpthread:mutex_entry
# List currently available dynamic probes:
perf probe -l
MISC
Mixed
# Trace system calls by process, showing a summary refreshing every 2 seconds:
perf top -e raw_syscalls:sys_enter -ns comm
# Trace sent network packets by on-CPU process, rolling output (no clear):
stdbuf -oL perf top -e net:net_dev_xmit -ns comm | strings
# Sample stacks at 99 Hertz, and, context switches:
perf record -F99 -e cpu-clock -e cs -a -g
# Sample stacks to 2 levels deep, and, context switch stacks to 5 levels (needs 4.8):
perf record -F99 -e cpu-clock/max-stack=2/ -e cs/max-stack=5/ -a -g
Special
# Record cacheline events (Linux 4.10+):
perf c2c record -a -- sleep 10
# Report cacheline events from previous recording (Linux 4.10+):
perf c2c report
Other uses
- perf c2c (Linux 4.10+): cache-2-cache and cacheline false sharing analysis.
- perf kmem: kernel memory allocation analysis.
- perf kvm: KVM virtual guest analysis.
- perf lock: lock analysis.
- perf mem: memory access analysis.
- perf sched: kernel scheduler statistics. Examples.
Using perf to profile programs inside Docker
- The simplest approach is to run perf record outside the container, then copy the data file into the container for analysis.
- First run perf record -g -p <pid>, let it run for a while (say 15 seconds), then press Ctrl+C to stop it.
- Then copy the generated perf.data file into the container and analyze it there (sketched below).
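A sketch of that workflow (the container name and paths are placeholders; it assumes perf and the application's symbols are available inside the container):

```bash
# On the host: sample the target process for 15 seconds.
perf record -g -p <pid> -- sleep 15

# Copy the profile into the container.
docker cp perf.data <container>:/tmp/perf.data

# Analyze inside the container, where the application's symbols resolve.
docker exec -it <container> perf report -i /tmp/perf.data
```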
perf top
- It shows, in real time, the functions or instructions consuming the most CPU cycles, so it is a good way to find hot functions.
- perf is usually run with -g, which also samples the call graph so you can see call chains.
- -p <pid> selects the process to profile (a typical invocation is sketched after this list).
- Each line has four columns:
  - The first column, Overhead, is the percentage of all samples attributed to that symbol.
  - The second column, Shared, is the dynamic shared object (DSO) the function or instruction belongs to: the kernel, a process name, a shared library, a kernel module, and so on.
  - The third column, Object, is the type of that DSO: [.] means a user-space executable or shared library, [k] means kernel space.
  - The last column, Symbol, is the symbol (function) name; when the name is unknown, a hexadecimal address is shown instead.
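A typical invocation (the PID is a placeholder):

```bash
# Live view of the hottest functions in one process, with call chains.
perf top -g -p <pid>
```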
perf optimization
Generating flame graphs with perf
# 1. Sample
# -g records the call graph (caller/callee relationships)
# -e cpu-clock samples on the cpu-clock software event
# -p selects the PID to record
# -a samples across all CPUs
sudo perf record -a -e cpu-clock -p <pid> -F 1000 -g sleep 60
sudo perf record -a -e cpu-clock -F 1000 -g -p <pid>
sudo perf record --call-graph dwarf -p pid
# 2. Parse perf.data with perf script
perf script -i perf.data &> perf.unfold
# The remaining steps use the FlameGraph repo: https://github.com/brendangregg/FlameGraph.git
# 3. Fold the stacks in perf.unfold
xxx/stackcollapse-perf.pl perf.unfold &> perf.folded
# 4. Generate the SVG
xxx/flamegraph.pl perf.folded > perf.svg
Flame graph recipes from Brendan Gregg's blog
Once the folded file has been produced with stackcollapse-perf.pl, grep can be used to generate exactly the flame graph we care about:
perf script | ./stackcollapse-perf.pl > out.perf-folded
# Filter out the cpu_idle threads so the focus stays on the code that actually burns CPU
grep -v cpu_idle out.perf-folded | ./flamegraph.pl > nonidle.svg
grep ext4 out.perf-folded | ./flamegraph.pl > ext4internals.svg
egrep 'system_call.*sys_(read|write)' out.perf-folded | ./flamegraph.pl > rw.svg
I frequently elide the cpu_idle threads in this way, to focus on the real threads that are consuming CPU resources. If I miss this step, the cpu_idle threads can often dominate the flame graph, squeezing the interesting code paths.
Note that it would be a little more efficient to process the output of perf report instead of perf script; better still, perf report could have a report style (eg, “-g folded”) that output folded stacks directly, obviating the need for stackcollapse-perf.pl. There could even be a perf mode that output the SVG directly (which wouldn’t be the first one; see perf-timechart), although, that would miss the value of being able to grep the folded stacks (which I use frequently).
There are more examples of perf_events CPU flame graphs on the CPU flame graph page, including a summary of these instructions. I have also shared an example of using perf for a Block Device I/O Flame Graph.
Generating flame graphs with SystemTap
SystemTap can do the aggregation in-kernel and pass a (much smaller) report to user-land. The data collected and output generated can be customized much further via its scripting language.
The timer.profile probe was used, which samples all CPUs at 100 Hertz
# -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 keep stack traces from being truncated
stap -s 32 -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 -D MAXMAPENTRIES=10240 \
    -D MAXACTION=10000 -D STP_OVERLOAD_THRESHOLD=5000000000 --all-modules \
    -ve 'global s; probe timer.profile { s[backtrace()] <<< 1; }
         probe end { foreach (i in s+) { print_stack(i); printf("\t%d\n", @count(s[i])); } }
         probe timer.s(60) { exit(); }' \
    > out.stap-stacks
./stackcollapse-stap.pl out.stap-stacks > out.stap-folded
cat out.stap-folded | ./flamegraph.pl > stap-kernel.svg
perf script
perf script converts the data recorded by perf record into readable, one-per-line event records.
perf usage notes
Off-CPU analysis
Off-CPU means the thread has entered the kernel through a system call and is blocked there. Off-CPU analysis measures which blocking calls the thread is stuck in, and how much time those calls take.
As the diagram below shows, off-CPU analysis targets blocking system calls (sys CPU is not high); the process is spending its time waiting for some resource, such as I/O.
[Diagram: off-CPU tracing timeline from Brendan Gregg's Off-CPU Analysis page. A thread running in user-land enters a syscall, blocks off-CPU in the kernel, and resumes after an interrupt/wakeup.]
Measuring off-CPU time
The simplest tool is time, which shows how much time a process spends in sys. In the example below the command took about 50 seconds of wall-clock time, yet user plus sys only add up to roughly 12.7 seconds. About 38 seconds are unaccounted for: that is the off-CPU time, spent here on disk I/O.
$ time tar cf archive.tar linux-4.15-rc2

real    0m50.798s
user    0m1.048s
sys     0m11.627s
bcc commands
# Summarize off-CPU time as a histogram
/usr/share/bcc/tools/cpudist -O -p <pid>

# Summarize off-CPU time; user-space and kernel-space stacks can both be recorded
/usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg

# Count malloc() calls for a PID. BPF aggregates in the kernel and stackcount only prints
# its summary on Ctrl+C, so there is no user/kernel transition for every malloc() call.
# Function names accept wildcards, e.g. c:*map matches 18 functions.
/usr/share/bcc/tools/stackcount -p 19183 -U c:malloc > out.stacks
# Turn the output into a flame graph
./stackcollapse.pl < out.stacks | ./flamegraph.pl --color=mem \
    --title="malloc() Flame Graph" --countname="calls" > out.svg
Caveats
Request-Synchronous Context
Many of the services we deploy spend most of their time waiting (network threads waiting for packets, worker threads waiting for tasks). That noise gets in the way of analyzing off-CPU time directly. What we really want is the off-CPU time incurred while a request is being handled, so from the offcputime output we keep only the relevant stacks for analysis (a filtering sketch follows).
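A sketch of that filtering, assuming folded output from offcputime -f and a hypothetical request-handling function named handle_request somewhere on the stack:

```bash
# Keep only the stacks that pass through the request handler, then render them.
/usr/share/bcc/tools/offcputime -df -p <pid> 30 > out.stacks
grep handle_request out.stacks | ./flamegraph.pl --color=io \
    --title="Request Off-CPU Time Flame Graph" --countname=us > request-offcpu.svg
```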
Scheduler Latency
Something that’s missing from these stacks is if the off-CPU time includes time spent waiting on a CPU run queue. This time is known as scheduler latency, run queue latency, or dispatcher queue latency. If the CPUs are running at saturation, then any time a thread blocks, it may endure additional time waiting its turn on a CPU after being woken up. That time will be included in the off-CPU time.
You can use extra trace events to tease apart off-CPU time into time blocked vs scheduler latency, but in practice, CPU saturation is pretty easy to spot, so you are unlikely to be spending much time studying off-CPU time when you have a known CPU saturation issue to deal with.
In other words, when the CPUs are saturated, part of the measured off-CPU time is really time spent waiting in the run queue (scheduler latency).
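One way to look at that component on its own (a sketch; assumes the bcc tools are installed under /usr/share/bcc/tools) is to measure run-queue latency directly with runqlat:

```bash
# Histogram of run-queue (scheduler) latency for one process, printed every 5 seconds.
/usr/share/bcc/tools/runqlat -p <pid> 5
```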
非自愿上下文切換 Involuntary Context Switching
Involuntary context switches also produce off-CPU time, but they are usually not what we are trying to track down. The flag below restricts the output to off-CPU time spent in the D state (TASK_UNINTERRUPTIBLE):
/usr/share/bcc/tools/offcputime -p `pgrep -nx mysqld` --state 2
On Linux, involuntary context switches occur for state TASK_RUNNING (0), whereas the blocking events we’re usually interested in are in TASK_INTERRUPTIBLE (1) or TASK_UNINTERRUPTIBLE (2), which offcputime can match on using --state. I used this feature in my Linux Load Averages: Solving the Mystery post.
IO
Any increase in synchronous I/O latency shows up directly as increased service latency. We need a way to record the time of every synchronous I/O system call and turn it into a flame graph. Brendan wrote a tool for this, fileiostacks.py. Once we have established that synchronous I/O is behind a performance problem, we can analyze it this way.
The following renders a flame graph of only the stacks that perform synchronous I/O through the file system:
/usr/share/bcc/tools/fileiostacks.py -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="File I/O Time Flame Graph" --countname=us < out.stacks > out.svg
But file system I/O is not the only kind: there is also block-device I/O and network I/O. Below is a block-device example.
Using perf
perf record -e block:block_rq_insert -a -g -- sleep 30
perf script --header > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./stackcollapse-perf.pl < out.stacks | ./flamegraph.pl --color=io \
    --title="Block I/O Flame Graph" --countname="I/O" > out.svg
Using bcc
/usr/share/bcc/tools/biostacks.py -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Block I/O Time Flame Graph" --countname=us < out.stacks > out.svg
Off-CPU
Unlike the cases above, here we want to look at all off-CPU events.
/usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg
wake-up
This flame graph focuses on the waking threads, so we can see exactly which thread woke a given process up. That is quite useful: for example, when lock overhead is high, we can work out which processes are doing the waking.
The wmb thundering-herd problem we once hit in the kernel should be diagnosable this way, for instance.
/usr/share/bcc/tools/wakeuptime -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=wakeup --title="Wakeup Time Flame Graph" --countname=us < out.stacks > out.svg
Chain Graphs
"Later readers, too, will be moved by these words."
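Chain graphs merge each blocked (off-CPU) stack with the stack of the thread that performed the wakeup, so the blockee and the waker appear in one graph. A sketch using the bcc offwaketime tool (flags per the bcc documentation; the "chain" palette is assumed to exist in your FlameGraph checkout):

```bash
# Record merged blocked + waker stacks system-wide for 30 seconds, in folded format.
/usr/share/bcc/tools/offwaketime -f 30 > out.stacks
./flamegraph.pl --color=chain --title="Off-Wake Time Flame Graph" --countname=us < out.stacks > out.svg
```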
Overview of the bcc tool suite
cd /usr/share/bcc/tools/
Single Purpose Tools:
# Trace new processes:
execsnoop
# Trace file opens with process and filename:
opensnoop
# Summarize block I/O (disk) latency as a power-of-2 distribution by disk:
biolatency -D
# Summarize block I/O size as a power-of-2 distribution by program name:
bitesize
# Trace common ext4 file system operations slower than 1 millisecond:
ext4slower 1
# Trace TCP active connections (connect()) with IP address and ports:
tcpconnect
# Trace TCP passive connections (accept()) with IP address and ports:
tcpaccept
# Trace TCP connections to local port 80, with session duration:
tcplife -L 80
# Trace TCP retransmissions with IP addresses and TCP state (only retransmitted packets are printed):
tcpretrans
# Sample stack traces at 49 Hertz for 10 seconds, emit folded format (for flame graphs):
profile -fd -F 49 10
# Trace details and latency of resolver DNS lookups:
gethostlatency
# Trace commands issued in all running bash shells:
bashreadline
Multi Tools: Kernel Dynamic Tracing
# Count "tcp_send*" kernel function, print output every second: funccount -i 1 'tcp_send*'# Count "vfs_*" calls for PID 185: funccount -p 185 'vfs_*'# Trace file names opened, using dynamic tracing of the kernel do_sys_open() function: trace 'p::do_sys_open "%s", arg2'# Same as before ("p:: is assumed if not specified): trace 'do_sys_open "%s", arg2'# Trace the return of the kernel do_sys_open() funciton, and print the retval: trace 'r::do_sys_open "ret: %d", retval'# Trace do_nanosleep() kernel function and the second argument (mode), with kernel stack traces: trace -K 'do_nanosleep "mode: %d", arg2'# Trace do_nanosleep() mode by providing the prototype (no debuginfo required): trace 'do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) "mode: %d", mode'# Trace do_nanosleep() with the task address (may be NULL), noting the dereference: trace 'do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) "task: %x", t->task'# Frequency count tcp_sendmsg() size: argdist -C 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size'# Summarize tcp_sendmsg() size as a power-of-2 histogram: argdist -H 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size'# Frequency count stack traces that lead to the submit_bio() function (disk I/O issue): stackcount submit_bio# Summarize the latency (time taken) by the vfs_read() function for PID 181: # 統計函數延時,我們也可以用來做自己進程的性能優化 funclatency -p 181 -u vfs_readMulti Tools: User Level Dynamic Tracing
# Trace the libc library function nanosleep() and print the requested sleep details:
trace 'p:c:nanosleep(struct timespec *req) "%d sec %d nsec", req->tv_sec, req->tv_nsec'
# Count the libc write() call for PID 181 by file descriptor:
argdist -p 181 -C 'p:c:write(int fd):int:fd'
# Summarize the latency (time taken) by libc getaddrinfo(), as a power-of-2 histogram in microseconds:
funclatency.py -u 'c:getaddrinfo'
Multi Tools: Kernel Static Tracing
# Count stack traces that led to issuing block I/O, tracing its kernel tracepoint:
stackcount t:block:block_rq_insert
Multi Tools: User Statically Defined Tracing (USDT)
# Trace the pthread_create USDT probe, and print arg1 as hex:
trace 'u:pthread:pthread_create "%x", arg1'
perf-tools
perf-tools
Some of these tools have been absorbed into bcc, but others, such as iosnoop, are still only available in perf-tools (a small usage sketch follows).
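A small iosnoop sketch (flags per the perf-tools README; check the options shipped with your version):

```bash
# Trace block device I/O with start and completion timestamps for 5 seconds.
./iosnoop -ts 5

# Trace only the I/O issued by one process.
./iosnoop -p <pid>
```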
References