Linux Performance Optimization: CPU
- Preface
- Load
- CPU usage
- proc
- perf
- Some links
- `perf list`
- Useful events
- `perf stat`
- `perf record`
- Profiling
- Static Tracing
- `perf report`
- perf probe
- MISC
- Using perf to profile programs inside Docker
- `perf top`
- perf optimization
- Generating flame graphs with perf
- Flame graph recipes from Brendan Gregg's blog
- Generating flame graphs with SystemTap
- perf script
- perf usage notes
- Off-CPU analysis
- Measuring off-CPU time
- Caveats
- Request-Synchronous Context
- Scheduler Latency
- Involuntary Context Switching
- IO
- Off-CPU
- wake-up
- Chain Graphs
- Overview of the bcc tool suite
- Single Purpose Tools:
- Multi Tools: Kernel Dynamic Tracing
- Multi Tools: User Level Dynamic Tracing
- Multi Tools: Kernel Static Tracing
- Multi Tools: User Statically Defined Tracing (USDT)
- perf-tools
- References
Preface
This article is my working notes from day-to-day CPU performance tuning. The structure is somewhat messy and will be reorganized in the future.
In my view, the two most effective ways to learn performance optimization are reading the articles of the grandmaster, Brendan Gregg, and hands-on practice.
Load
- What is an appropriate load average?
  - Once the load average exceeds 70% of the CPU count, you should start investigating what is driving it up (a quick check is sketched after this list).
- Three situations push the load average up:
  - CPU-bound processes: heavy CPU use raises the load average.
  - I/O-bound processes: waiting on I/O also raises the load average, even though CPU usage may not be high.
  - Many processes waiting to be scheduled onto a CPU also raise the load average; in that case CPU usage is high as well.
- cswch: voluntary context switches per second; its counterpart is nvcswch.
  - A voluntary context switch happens when a process cannot get a resource it needs, for example when I/O or memory is short.
- nvcswch: non-voluntary (involuntary) context switches per second.
  - An involuntary context switch happens when the system forcibly reschedules a process, for example because its time slice ran out. When many processes compete for the CPU, involuntary context switches become frequent.
- Triggering a context switch costs roughly 3-5 microseconds even on a well-behaved system, which is slower than grabbing a lock or synchronizing caches.
- Steps in Context Switching: explains what actually happens during a context switch.
- How many Context Switches is "normal" (as a function of CPU cores (or other))?
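A quick check combining both ideas (a rough sketch; the 70% figure is the rule of thumb above, and pidstat comes from the sysstat package):

```bash
# Compare the 1-minute load average against 70% of the CPU count.
cpus=$(nproc)
load1=$(cut -d ' ' -f1 /proc/loadavg)
awk -v l="$load1" -v c="$cpus" 'BEGIN { printf "load1=%s  cpus=%d  threshold=%.1f\n", l, c, c * 0.7 }'

# Per-process voluntary (cswch/s) and involuntary (nvcswch/s) context switches,
# sampled every 5 seconds.
pidstat -w 5
```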
CPU usage
- Linux exposes internal kernel state to user space through the /proc virtual filesystem; /proc/stat provides the system-wide CPU and task statistics.
  - user (usually abbreviated us): user-mode CPU time. Note that it does not include nice time, but it does include guest time.
  - nice (ni): low-priority user-mode CPU time, i.e. time spent in processes whose nice value is between 1 and 19. Note that nice ranges from -20 to 19, and a larger value means a lower priority.
  - system (sys): kernel-mode CPU time.
  - idle (id): idle time. Note that it does not include time spent waiting for I/O (iowait).
  - irq (hi): CPU time spent servicing hard interrupts.
  - softirq (si): CPU time spent servicing soft interrupts.
  - steal (st): when the system runs inside a virtual machine, the CPU time taken by other virtual machines.
  - guest: CPU time spent running other operating systems under virtualization, i.e. time spent running virtual machines.
  - guest_nice (gnice): time spent running virtual machines at low priority.
- Performance tools report CPU usage averaged over a sampling interval, so pay attention to the interval you set (a sketch of computing usage by hand from two /proc/stat samples follows).
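Tools such as top derive CPU usage by diffing these counters between two samples. A rough, hand-rolled sketch (approximate, since guest time is already folded into user time):

```bash
# Sum all fields of the aggregate "cpu" line, and keep idle+iowait separately.
read_cpu() { awk '/^cpu /{t=0; for (i=2; i<=NF; i++) t+=$i; print t, $5+$6}' /proc/stat; }

read t1 i1 <<< "$(read_cpu)"   # sample 1: total jiffies, idle jiffies
sleep 1
read t2 i2 <<< "$(read_cpu)"   # sample 2

# Busy percentage over the interval = 1 - delta(idle) / delta(total).
awk -v dt=$((t2 - t1)) -v di=$((i2 - i1)) 'BEGIN { printf "CPU usage: %.1f%%\n", (1 - di / dt) * 100 }'
```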
proc
perf
Some links
When facing a misbehaving program, a top-down strategy works best: first get an overall picture of the program's runtime statistics, then drill into the details in a few chosen directions. Do not dive straight into the minutiae, or you will miss the forest for the trees.
Some programs are slow because they compute too much; they should spend most of their time on the CPU (CPU bound). Others are slow because of excessive I/O, in which case their CPU utilization is low (I/O bound). Tuning a CPU-bound program is different from tuning an I/O-bound one.
perf list
perf list shows every event that perf can instrument; perf list 'sched:*' lists only the scheduler-related tracepoints.
Different systems produce different lists. On a 2.6.35 kernel the list is already quite long, but however many entries there are, they can be grouped by subsystem prefix, for example:
- block: block device I/O
- ext4: file system operations
- kmem: kernel memory allocation events
- random: kernel random number generator events
- sched: CPU scheduler events
- syscalls: system call enter and exits
- task: task events
Useful events
perf stat
# CPU counter statistics for the specified command:
perf stat command
# Detailed CPU counter statistics (includes extras) for the specified command:
perf stat -d command
# CPU counter statistics for the specified PID, until Ctrl-C:
perf stat -p PID
# CPU counter statistics for the entire system, for 5 seconds:
perf stat -a sleep 5
# Various basic CPU statistics, system wide, for 10 seconds:
perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a sleep 10
# Various CPU level 1 data cache statistics for the specified command:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores command
# Various CPU data TLB statistics for the specified command:
perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses command
# Various CPU last level cache statistics for the specified command:
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command
# Using raw PMC counters, eg, counting unhalted core cycles:
perf stat -e r003c -a sleep 5
# PMCs: counting cycles and frontend stalls via raw specification:
perf stat -e cycles -e cpu/event=0x0e,umask=0x01,inv,cmask=0x01/ -a sleep 5
# Count syscalls per-second system-wide:
perf stat -e raw_syscalls:sys_enter -I 1000 -a
# Count system calls by type for the specified PID, until Ctrl-C:
perf stat -e 'syscalls:sys_enter_*' -p PID
# Count system calls by type for the entire system, for 5 seconds:
perf stat -e 'syscalls:sys_enter_*' -a sleep 5
# Count scheduler events for the specified PID, until Ctrl-C:
perf stat -e 'sched:*' -p PID
# Count scheduler events for the specified PID, for 10 seconds:
perf stat -e 'sched:*' -p PID sleep 10
# Count ext4 events for the entire system, for 10 seconds:
perf stat -e 'ext4:*' -a sleep 10
# Count block device I/O events for the entire system, for 10 seconds:
perf stat -e 'block:*' -a sleep 10
# Count all vmscan events, printing a report every second:
perf stat -e 'vmscan:*' -a -I 1000
perf record
- perf record saves the sampled data to a file; you then parse and display it with perf report.
- perf is usually run with -g, which also samples the call graph so you can see call chains.
- -p <pid> selects the process to profile.
Profiling
# Sample on-CPU functions for the specified command, at 99 Hertz:
perf record -F 99 command
# Sample on-CPU functions for the specified PID, at 99 Hertz, until Ctrl-C:
perf record -F 99 -p PID
# Sample on-CPU functions for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID sleep 10
# Sample CPU stack traces (via frame pointers) for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID -g -- sleep 10
# Sample CPU stack traces for the PID, using dwarf (dbg info) to unwind stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID --call-graph dwarf sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (< Linux 4.11):
perf record -F 99 -ag -- sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (>= Linux 4.11):
perf record -F 99 -g -- sleep 10
# If the previous command didn't work, try forcing perf to use the cpu-clock event:
perf record -F 99 -e cpu-clock -ag -- sleep 10
# Sample CPU stack traces for a container identified by its /sys/fs/cgroup/perf_event cgroup:
perf record -F 99 -e cpu-clock --cgroup=docker/1d567f4393190204...etc... -a -- sleep 10
# Sample CPU stack traces for the entire system, with dwarf stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -a --call-graph dwarf sleep 10
# Sample CPU stack traces for the entire system, using last branch record for stacks, ... (>= Linux 4.?):
perf record -F 99 -a --call-graph lbr sleep 10
# Sample CPU stack traces, once every 10,000 Level 1 data cache misses, for 5 seconds:
perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
# Sample CPU stack traces, once every 100 last level cache misses, for 5 seconds:
perf record -e LLC-load-misses -c 100 -ag -- sleep 5
# Sample on-CPU kernel instructions, for 5 seconds:
perf record -e cycles:k -a -- sleep 5
# Sample on-CPU user instructions, for 5 seconds:
perf record -e cycles:u -a -- sleep 5
# Sample on-CPU user instructions precisely (using PEBS), for 5 seconds:
perf record -e cycles:up -a -- sleep 5
# Perform branch tracing (needs HW support), for 1 second:
perf record -b -a sleep 1
Static Tracing
# Trace new processes, until Ctrl-C:
perf record -e sched:sched_process_exec -a
# Sample (take a subset of) context-switches, until Ctrl-C:
perf record -e context-switches -a
# Trace all context-switches, until Ctrl-C:
# -c 1 sets sample_period to 1 (visible with -vv), i.e. capture every event
perf record -e context-switches -c 1 -a
# Include raw settings used (see: man perf_event_open):
# -vv prints verbose details
perf record -vv -e context-switches -a
# Trace all context-switches via sched tracepoint, until Ctrl-C:
perf record -e sched:sched_switch -a
# Sample context-switches with stack traces, until Ctrl-C:
perf record -e context-switches -ag
# Sample context-switches with stack traces, for 10 seconds:
perf record -e context-switches -ag -- sleep 10
# Sample CS, stack traces, and with timestamps (< Linux 3.17, -T now default):
perf record -e context-switches -ag -T
# Sample CPU migrations, for 10 seconds:
perf record -e migrations -a -- sleep 10
# Trace all connect()s with stack traces (outbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_connect -ag
# Trace all accepts()s with stack traces (inbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_accept* -ag
# Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:
perf record -e block:block_rq_insert -ag
# Sample at most 100 block device requests per second, until Ctrl-C:
perf record -F 100 -e block:block_rq_insert -a
# Trace all block device issues and completions (has timestamps), until Ctrl-C:
perf record -e block:block_rq_issue -e block:block_rq_complete -a
# Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'nr_sector > 200'
# Trace all block completions, synchronous writes only, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs == "WS"'
# Trace all block completions, all types of writes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'
# Sample minor faults (RSS growth) with stack traces, until Ctrl-C:
perf record -e minor-faults -ag
# Trace all minor faults with stack traces, until Ctrl-C:
perf record -e minor-faults -c 1 -ag
# Sample page faults with stack traces, until Ctrl-C:
perf record -e page-faults -ag
# Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
perf record -e 'ext4:*' -o /tmp/perf.data -a
# Trace kswapd wakeup events, until Ctrl-C:
perf record -e vmscan:mm_vmscan_wakeup_kswapd -ag
# Add Node.js USDT probes (Linux 4.10+):
perf buildid-cache --add `which node`
# Trace the node http__server__request USDT event (Linux 4.10+):
perf record -e sdt_node:http__server__request -a
perf report
# Show perf.data in an ncurses browser (TUI) if possible:
perf report
# Show perf.data with a column for sample count:
perf report -n
# Show perf.data as a text report, with data coalesced and percentages:
perf report --stdio
# Report, with stacks in folded format: one line per stack (needs 4.4):
perf report --stdio -n -g folded
# List all events from perf.data:
perf script
# List all perf.data events, with data header (newer kernels; was previously default):
perf script --header
# List all perf.data events, with customized fields (< Linux 4.1):
perf script -f time,event,trace
# List all perf.data events, with customized fields (>= Linux 4.1):
perf script -F time,event,trace
# List all perf.data events, with my recommended fields (needs record -a; newer kernels):
perf script --header -F comm,pid,tid,cpu,time,event,ip,sym,dso
# List all perf.data events, with my recommended fields (needs record -a; older kernels):
perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso
# Dump raw contents from perf.data as hex (for debugging):
perf script -D
# Disassemble and annotate instructions with percentages (needs some debuginfo):
perf annotate --stdio
perf probe
Dynamic Tracing
# Add a tracepoint for the kernel tcp_sendmsg() function entry ("--add" is optional):
perf probe --add tcp_sendmsg
# Remove the tcp_sendmsg() tracepoint (or use "--del"):
perf probe -d tcp_sendmsg
# Add a tracepoint for the kernel tcp_sendmsg() function return:
perf probe 'tcp_sendmsg%return'
# Show available variables for the kernel tcp_sendmsg() function (needs debuginfo):
perf probe -V tcp_sendmsg
# Show available variables for the kernel tcp_sendmsg() function, plus external vars (needs debuginfo):
perf probe -V tcp_sendmsg --externs
# Show available line probes for tcp_sendmsg() (needs debuginfo):
perf probe -L tcp_sendmsg
# Show available variables for tcp_sendmsg() at line number 81 (needs debuginfo):
perf probe -V tcp_sendmsg:81
# Add a tracepoint for tcp_sendmsg(), with three entry argument registers (platform specific):
perf probe 'tcp_sendmsg %ax %dx %cx'
# Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for the %cx register (platform specific):
perf probe 'tcp_sendmsg bytes=%cx'
# Trace previously created probe when the bytes (alias) variable is greater than 100:
perf record -e probe:tcp_sendmsg --filter 'bytes > 100'
# Add a tracepoint for tcp_sendmsg() return, and capture the return value:
perf probe 'tcp_sendmsg%return $retval'
# Add a tracepoint for tcp_sendmsg(), and "size" entry argument (reliable, but needs debuginfo):
perf probe 'tcp_sendmsg size'
# Add a tracepoint for tcp_sendmsg(), with size and socket state (needs debuginfo):
perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'
# Tell me how on Earth you would do this, but don't actually do it (needs debuginfo):
perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state'
# Trace previous probe when size is non-zero, and state is not TCP_ESTABLISHED(1) (needs debuginfo):
perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a
# Add a tracepoint for tcp_sendmsg() line 81 with local variable seglen (needs debuginfo):
perf probe 'tcp_sendmsg:81 seglen'
# Add a tracepoint for do_sys_open() with the filename as a string (needs debuginfo):
perf probe 'do_sys_open filename:string'
# Add a tracepoint for myfunc() return, and include the retval as a string:
perf probe 'myfunc%return +0($retval):string'
# Add a tracepoint for the user-level malloc() function from libc:
perf probe -x /lib64/libc.so.6 malloc
# Add a tracepoint for this user-level static probe (USDT, aka SDT event):
perf probe -x /usr/lib64/libpthread-2.24.so %sdt_libpthread:mutex_entry
# List currently available dynamic probes:
perf probe -l
MISC
Mixed
# Trace system calls by process, showing a summary refreshing every 2 seconds:
perf top -e raw_syscalls:sys_enter -ns comm
# Trace sent network packets by on-CPU process, rolling output (no clear):
stdbuf -oL perf top -e net:net_dev_xmit -ns comm | strings
# Sample stacks at 99 Hertz, and, context switches:
perf record -F99 -e cpu-clock -e cs -a -g
# Sample stacks to 2 levels deep, and, context switch stacks to 5 levels (needs 4.8):
perf record -F99 -e cpu-clock/max-stack=2/ -e cs/max-stack=5/ -a -g
Special
# Record cacheline events (Linux 4.10+):
perf c2c record -a -- sleep 10
# Report cacheline events from previous recording (Linux 4.10+):
perf c2c report
Other uses
- perf c2c (Linux 4.10+): cache-2-cache and cacheline false sharing analysis.
- perf kmem: kernel memory allocation analysis.
- perf kvm: KVM virtual guest analysis.
- perf lock: lock analysis.
- perf mem: memory access analysis.
- perf sched: kernel scheduler statistics. Examples.
Using perf to profile programs inside Docker
- The simplest approach is to run perf record outside the container, then copy the data file into the container for analysis.
- First run perf record -g -p <pid>, let it run for a while (say 15 seconds), then press Ctrl+C to stop it.
- Then copy the generated perf.data file into the container and analyze it there (sketched below).
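A sketch of that workflow (the container name and paths are placeholders; it assumes perf and the application's symbols are available inside the container):

```bash
# On the host: sample the target process for 15 seconds.
perf record -g -p <pid> -- sleep 15

# Copy the profile into the container.
docker cp perf.data <container>:/tmp/perf.data

# Analyze inside the container, where the application's symbols resolve.
docker exec -it <container> perf report -i /tmp/perf.data
```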
perf top
- It shows, in real time, the functions or instructions consuming the most CPU cycles, so it is a good way to find hot functions.
- perf is usually run with -g, which also samples the call graph so you can see call chains.
- -p <pid> selects the process to profile (a typical invocation is sketched after this list).
- Each line has four columns:
  - The first column, Overhead, is the percentage of all samples attributed to that symbol.
  - The second column, Shared, is the dynamic shared object (DSO) the function or instruction belongs to: the kernel, a process name, a shared library, a kernel module, and so on.
  - The third column, Object, is the type of that DSO: [.] means a user-space executable or shared library, [k] means kernel space.
  - The last column, Symbol, is the symbol (function) name; when the name is unknown, a hexadecimal address is shown instead.
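A typical invocation (the PID is a placeholder):

```bash
# Live view of the hottest functions in one process, with call chains.
perf top -g -p <pid>
```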
perf optimization
Generating flame graphs with perf
# 1. Sample
# -g records the call graph (caller/callee relationships)
# -e cpu-clock samples on the cpu-clock software event
# -p selects the PID to record
# -a samples across all CPUs
sudo perf record -a -e cpu-clock -p <pid> -F 1000 -g sleep 60
sudo perf record -a -e cpu-clock -F 1000 -g -p <pid>
sudo perf record --call-graph dwarf -p pid
# 2. Parse perf.data with perf script
perf script -i perf.data &> perf.unfold
# The remaining steps use the FlameGraph repo: https://github.com/brendangregg/FlameGraph.git
# 3. Fold the stacks in perf.unfold
xxx/stackcollapse-perf.pl perf.unfold &> perf.folded
# 4. Generate the SVG
xxx/flamegraph.pl perf.folded > perf.svg
Flame graph recipes from Brendan Gregg's blog
Once the folded file has been produced with stackcollapse-perf.pl, grep can be used to generate exactly the flame graph we care about:
perf script | ./stackcollapse-perf.pl > out.perf-folded
# Filter out the cpu_idle threads so the focus stays on the code that actually burns CPU
grep -v cpu_idle out.perf-folded | ./flamegraph.pl > nonidle.svg
grep ext4 out.perf-folded | ./flamegraph.pl > ext4internals.svg
egrep 'system_call.*sys_(read|write)' out.perf-folded | ./flamegraph.pl > rw.svg
I frequently elide the cpu_idle threads in this way, to focus on the real threads that are consuming CPU resources. If I miss this step, the cpu_idle threads can often dominate the flame graph, squeezing the interesting code paths.
Note that it would be a little more efficient to process the output of perf report instead of perf script; better still, perf report could have a report style (eg, “-g folded”) that output folded stacks directly, obviating the need for stackcollapse-perf.pl. There could even be a perf mode that output the SVG directly (which wouldn’t be the first one; see perf-timechart), although, that would miss the value of being able to grep the folded stacks (which I use frequently).
There are more examples of perf_events CPU flame graphs on the CPU flame graph page, including a summary of these instructions. I have also shared an example of using perf for a Block Device I/O Flame Graph.
Generating flame graphs with SystemTap
SystemTap can do the aggregation in-kernel and pass a (much smaller) report to user-land. The data collected and output generated can be customized much further via its scripting language.
The timer.profile probe was used, which samples all CPUs at 100 Hertz
# -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 keep stack traces from being truncated
stap -s 32 -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 -D MAXMAPENTRIES=10240 \
    -D MAXACTION=10000 -D STP_OVERLOAD_THRESHOLD=5000000000 --all-modules \
    -ve 'global s; probe timer.profile { s[backtrace()] <<< 1; }
         probe end { foreach (i in s+) { print_stack(i); printf("\t%d\n", @count(s[i])); } }
         probe timer.s(60) { exit(); }' \
    > out.stap-stacks
./stackcollapse-stap.pl out.stap-stacks > out.stap-folded
cat out.stap-folded | ./flamegraph.pl > stap-kernel.svg
perf script
perf script converts the data recorded by perf record into readable, one-per-line event records.
perf usage notes
Off-CPU analysis
Off-CPU means the thread has entered the kernel through a system call and is blocked there. Off-CPU analysis measures which blocking calls the thread is stuck in, and how much time those calls take.
As the diagram below shows, off-CPU analysis targets blocking system calls (sys CPU is not high); the process is spending its time waiting for some resource, such as I/O.
[Diagram: off-CPU tracing timeline from Brendan Gregg's Off-CPU Analysis page. A thread running in user-land enters a syscall, blocks off-CPU in the kernel, and resumes after an interrupt/wakeup.]
Measuring off-CPU time
The simplest tool is time, which shows how much time a process spends in sys. In the example below the command took about 50 seconds of wall-clock time, yet user plus sys only add up to roughly 12.7 seconds. About 38 seconds are unaccounted for: that is the off-CPU time, spent here on disk I/O.
$ time tar cf archive.tar linux-4.15-rc2

real    0m50.798s
user    0m1.048s
sys     0m11.627s
bcc commands
# Summarize off-CPU time as a histogram
/usr/share/bcc/tools/cpudist -O -p <pid>

# Summarize off-CPU time; user-space and kernel-space stacks can both be recorded
/usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg

# Count malloc() calls for a PID. BPF aggregates in the kernel and stackcount only prints
# its summary on Ctrl+C, so there is no user/kernel transition for every malloc() call.
# Function names accept wildcards, e.g. c:*map matches 18 functions.
/usr/share/bcc/tools/stackcount -p 19183 -U c:malloc > out.stacks
# Turn the output into a flame graph
./stackcollapse.pl < out.stacks | ./flamegraph.pl --color=mem \
    --title="malloc() Flame Graph" --countname="calls" > out.svg
Caveats
Request-Synchronous Context
Many of the services we deploy spend most of their time waiting (network threads waiting for packets, worker threads waiting for tasks). That noise gets in the way of analyzing off-CPU time directly. What we really want is the off-CPU time incurred while a request is being handled, so from the offcputime output we keep only the relevant stacks for analysis (a filtering sketch follows).
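A sketch of that filtering, assuming folded output from offcputime -f and a hypothetical request-handling function named handle_request somewhere on the stack:

```bash
# Keep only the stacks that pass through the request handler, then render them.
/usr/share/bcc/tools/offcputime -df -p <pid> 30 > out.stacks
grep handle_request out.stacks | ./flamegraph.pl --color=io \
    --title="Request Off-CPU Time Flame Graph" --countname=us > request-offcpu.svg
```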
Scheduler Latency
Something that’s missing from these stacks is if the off-CPU time includes time spent waiting on a CPU run queue. This time is known as scheduler latency, run queue latency, or dispatcher queue latency. If the CPUs are running at saturation, then any time a thread blocks, it may endure additional time waiting its turn on a CPU after being woken up. That time will be included in the off-CPU time.
You can use extra trace events to tease apart off-CPU time into time blocked vs scheduler latency, but in practice, CPU saturation is pretty easy to spot, so you are unlikely to be spending much time studying off-CPU time when you have a known CPU saturation issue to deal with.
In other words, when the CPUs are saturated, part of the measured off-CPU time is really time spent waiting in the run queue (scheduler latency).
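One way to look at that component on its own (a sketch; assumes the bcc tools are installed under /usr/share/bcc/tools) is to measure run-queue latency directly with runqlat:

```bash
# Histogram of run-queue (scheduler) latency for one process, printed every 5 seconds.
/usr/share/bcc/tools/runqlat -p <pid> 5
```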
非自愿上下文切換 Involuntary Context Switching
Involuntary context switches also produce off-CPU time, but they are usually not what we are trying to track down. The flag below restricts the output to off-CPU time spent in the D state (TASK_UNINTERRUPTIBLE):
/usr/share/bcc/tools/offcputime -p `pgrep -nx mysqld` --state 2
On Linux, involuntary context switches occur for state TASK_RUNNING (0), whereas the blocking events we’re usually interested in are in TASK_INTERRUPTIBLE (1) or TASK_UNINTERRUPTIBLE (2), which offcputime can match on using --state. I used this feature in my Linux Load Averages: Solving the Mystery post.
IO
Any increase in synchronous I/O latency shows up directly as increased service latency. We need a way to record the time of every synchronous I/O system call and turn it into a flame graph. Brendan wrote a tool for this, fileiostacks.py. Once we have established that synchronous I/O is behind a performance problem, we can analyze it this way.
The following renders a flame graph of only the stacks that perform synchronous I/O through the file system:
/usr/share/bcc/tools/fileiostacks.py -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="File I/O Time Flame Graph" --countname=us < out.stacks > out.svg
But file system I/O is not the only kind: there is also block-device I/O and network I/O. Below is a block-device example.
Using perf
perf record -e block:block_rq_insert -a -g -- sleep 30
perf script --header > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./stackcollapse-perf.pl < out.stacks | ./flamegraph.pl --color=io \
    --title="Block I/O Flame Graph" --countname="I/O" > out.svg
Using bcc
/usr/share/bcc/tools/biostacks.py -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Block I/O Time Flame Graph" --countname=us < out.stacks > out.svg
Off-CPU
Unlike the cases above, here we want to look at all off-CPU events.
/usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg
wake-up
This flame graph focuses on the waking threads, so we can see exactly which thread woke a given process up. That is quite useful: for example, when lock overhead is high, we can work out which processes are doing the waking.
The wmb thundering-herd problem we once hit in the kernel should be diagnosable this way, for instance.
/usr/share/bcc/tools/wakeuptime -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=wakeup --title="Wakeup Time Flame Graph" --countname=us < out.stacks > out.svg
Chain Graphs
"Later readers, too, will be moved by these words."
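Chain graphs merge each blocked (off-CPU) stack with the stack of the thread that performed the wakeup, so the blockee and the waker appear in one graph. A sketch using the bcc offwaketime tool (flags per the bcc documentation; the "chain" palette is assumed to exist in your FlameGraph checkout):

```bash
# Record merged blocked + waker stacks system-wide for 30 seconds, in folded format.
/usr/share/bcc/tools/offwaketime -f 30 > out.stacks
./flamegraph.pl --color=chain --title="Off-Wake Time Flame Graph" --countname=us < out.stacks > out.svg
```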
Overview of the bcc tool suite
cd /usr/share/bcc/tools/
Single Purpose Tools:
# Trace new processes:
execsnoop
# Trace file opens with process and filename:
opensnoop
# Summarize block I/O (disk) latency as a power-of-2 distribution by disk:
biolatency -D
# Summarize block I/O size as a power-of-2 distribution by program name:
bitesize
# Trace common ext4 file system operations slower than 1 millisecond:
ext4slower 1
# Trace TCP active connections (connect()) with IP address and ports:
tcpconnect
# Trace TCP passive connections (accept()) with IP address and ports:
tcpaccept
# Trace TCP connections to local port 80, with session duration:
tcplife -L 80
# Trace TCP retransmissions with IP addresses and TCP state (only retransmitted packets are printed):
tcpretrans
# Sample stack traces at 49 Hertz for 10 seconds, emit folded format (for flame graphs):
profile -fd -F 49 10
# Trace details and latency of resolver DNS lookups:
gethostlatency
# Trace commands issued in all running bash shells:
bashreadline
Multi Tools: Kernel Dynamic Tracing
# Count "tcp_send*" kernel function, print output every second: funccount -i 1 'tcp_send*'# Count "vfs_*" calls for PID 185: funccount -p 185 'vfs_*'# Trace file names opened, using dynamic tracing of the kernel do_sys_open() function: trace 'p::do_sys_open "%s", arg2'# Same as before ("p:: is assumed if not specified): trace 'do_sys_open "%s", arg2'# Trace the return of the kernel do_sys_open() funciton, and print the retval: trace 'r::do_sys_open "ret: %d", retval'# Trace do_nanosleep() kernel function and the second argument (mode), with kernel stack traces: trace -K 'do_nanosleep "mode: %d", arg2'# Trace do_nanosleep() mode by providing the prototype (no debuginfo required): trace 'do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) "mode: %d", mode'# Trace do_nanosleep() with the task address (may be NULL), noting the dereference: trace 'do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) "task: %x", t->task'# Frequency count tcp_sendmsg() size: argdist -C 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size'# Summarize tcp_sendmsg() size as a power-of-2 histogram: argdist -H 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size'# Frequency count stack traces that lead to the submit_bio() function (disk I/O issue): stackcount submit_bio# Summarize the latency (time taken) by the vfs_read() function for PID 181: # 統計函數延時,我們也可以用來做自己進程的性能優化 funclatency -p 181 -u vfs_readMulti Tools: User Level Dynamic Tracing
# Trace the libc library function nanosleep() and print the requested sleep details:
trace 'p:c:nanosleep(struct timespec *req) "%d sec %d nsec", req->tv_sec, req->tv_nsec'
# Count the libc write() call for PID 181 by file descriptor:
argdist -p 181 -C 'p:c:write(int fd):int:fd'
# Summarize the latency (time taken) by libc getaddrinfo(), as a power-of-2 histogram in microseconds:
funclatency.py -u 'c:getaddrinfo'
Multi Tools: Kernel Static Tracing
# Count stack traces that led to issuing block I/O, tracing its kernel tracepoint:
stackcount t:block:block_rq_insert
Multi Tools: User Statically Defined Tracing (USDT)
# Trace the pthread_create USDT probe, and print arg1 as hex:
trace 'u:pthread:pthread_create "%x", arg1'
perf-tools
perf-tools
Some of these tools have been absorbed into bcc, but others, such as iosnoop, are still only available in perf-tools (a small usage sketch follows).
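A small iosnoop sketch (flags per the perf-tools README; check the options shipped with your version):

```bash
# Trace block device I/O with start and completion timestamps for 5 seconds.
./iosnoop -ts 5

# Trace only the I/O issued by one process.
./iosnoop -p <pid>
```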
References