
Linux Performance Optimization: CPU

    • Preface
    • Load average
    • CPU utilization
    • proc
    • perf
      • Useful links
      • `perf list`
        • Particularly useful events
      • `perf stat`
      • `perf record`
        • Profiling
        • Static Tracing
      • `perf report`
      • perf probe
      • MISC
      • Using perf on programs inside a Docker container
      • `perf top`
      • Optimizing with perf
      • Generating flame graphs with perf
        • Flame graph recipes from Brendan Gregg's blog
          • Generating flame graphs with SystemTap
      • perf script
      • perf usage notes
    • Off-CPU analysis
      • Measuring off-CPU time
      • Caveats
        • Request-Synchronous Context
        • Scheduler Latency
        • Involuntary Context Switching
      • IO
      • Off-CPU
      • Wake-ups
      • Chain Graphs
    • Overview of the bcc tool collection
      • Single Purpose Tools
      • Multi Tools: Kernel Dynamic Tracing
      • Multi Tools: User Level Dynamic Tracing
      • Multi Tools: Kernel Static Tracing
      • Multi Tools: User Statically Defined Tracing (USDT)
    • perf-tools
    • References


Preface

This article is my working notebook from day-to-day CPU performance tuning. The structure is still somewhat messy; I plan to reorganize it in the future.

In my view, the two most effective ways to learn performance optimization are, first, reading the grandmaster Brendan Gregg's articles, and second, practicing on real systems.

Load average

  • What is a reasonable load average?
    • When the load average rises above 70% of the CPU count, you should start investigating why it is high.
    • Three situations drive the load average up:
    • CPU-bound processes: heavy CPU use raises the load average.
    • I/O-bound processes: waiting on I/O also raises the load average, even though CPU usage may not be high.
    • Many processes waiting for the CPU: runnable processes queuing for the scheduler raise the load average, and in this case CPU usage is also high.
  • Context switches (see the pidstat example after this list):
    • cswch: the number of voluntary context switches per second; its counterpart is nvcswch.
      • A voluntary context switch happens when a process cannot obtain a resource it needs, for example when I/O or memory is insufficient.
    • nvcswch: the number of involuntary (non-voluntary) context switches per second.
      • An involuntary context switch happens when a process is forcibly descheduled by the kernel, for example because its time slice has expired. When many processes compete for the CPU, involuntary context switches become frequent.
    • Triggering a context switch costs roughly 3-5 microseconds on a well-behaved system, which is slower than taking a lock or synchronizing a cache line.
    • Steps in Context Switching explains what actually happens during a context switch.
    • How many Context Switches is “normal” (as a function of CPU cores (or other))? System calls Multi-Tasking
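
A minimal way to observe these counters is pidstat from sysstat; the interval, count, and PID below are placeholders:

    # Voluntary (cswch/s) and involuntary (nvcswch/s) context switches per task,
    # reported once per second, five times.
    pidstat -w 1 5
    # The same counters for one process and its threads (-t shows per-thread rows).
    pidstat -wt -p <pid> 1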

CPU utilization

  • Linux exposes internal kernel state to user space through the /proc virtual file system; /proc/stat provides the system-wide CPU and task statistics (see the sketch after this list).
  • user (usually abbreviated us): CPU time spent in user mode. Note that it excludes nice time but includes guest time.
  • nice (ni): user-mode CPU time spent at low priority, i.e. while the process nice value is between 1 and 19. Note that nice ranges from -20 to 19; the larger the value, the lower the priority.
  • system (sys): CPU time spent in kernel mode.
  • idle (id): idle time. Note that it does not include time spent waiting for I/O (iowait).
  • irq (hi): CPU time spent servicing hard interrupts.
  • softirq (si): CPU time spent servicing soft interrupts.
  • steal (st): when the system runs inside a virtual machine, the CPU time stolen by other virtual machines.
  • guest: CPU time spent running another operating system under virtualization, i.e. time spent running a virtual machine.
  • guest_nice (gnice): time spent running a virtual machine at low priority.
  • Performance tools report the average CPU utilization over a sampling interval, so pay attention to the interval you configure.
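
As a minimal sketch of how utilization is derived from /proc/stat (the field order assumes the standard aggregate "cpu" line: user nice system idle iowait irq softirq steal ...):

    # Read the aggregate "cpu" line twice, one second apart, and compute
    # utilization as 1 - (delta of idle+iowait) / (delta of total).
    read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat
    sleep 1
    read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat
    idle=$(( (i2 + w2) - (i1 + w1) ))
    total=$(( (u2+n2+s2+i2+w2+q2+sq2+st2) - (u1+n1+s1+i1+w1+q1+sq1+st1) ))
    echo "CPU usage: $(( 100 * (total - idle) / total ))%"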

proc

  • /proc/[pid]/stack shows the kernel-mode stack of the process (its main thread).
  • /proc/[PID]/task/ lists all threads of the process (see the example after this list).
  • /proc/[PID]/task/[TID]/stack shows the kernel-mode stack of a specific thread.
  • # The -d flag highlights the regions that change between refreshes
    $ watch -d cat /proc/interrupts
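
For example, to dump the kernel stack of every thread of a process (the PID is a placeholder; reading these files usually requires root):

    pid=<pid>
    for tid in /proc/$pid/task/*; do
        echo "== TID ${tid##*/} =="
        cat "$tid/stack"
    done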

perf

Useful links

  • linux 性能分析工具——perf(包括处理器特性介绍)
  • The reference links inside that article are quite good.
  • https://www.ibm.com/developerworks/cn/linux/l-cn-perf1/
  • 如何用Perf解开服务器消耗的困境(Perf:跟踪问题三板斧)
  • man page
  • When investigating a problematic program, take a top-down approach: first look at the overall statistics while the program runs, then drill into the details in a specific direction. Do not dive straight into the minutiae, or you will not see the forest for the trees.
    Some programs are slow because they simply compute a lot and should spend most of their time on the CPU; these are CPU bound. Others are slow because of excessive I/O, in which case their CPU utilization stays low; these are I/O bound. Tuning a CPU-bound program is different from tuning an I/O-bound one.

perf list

Use perf list to list every event that can trigger a perf sample. `perf list 'sched:*'` lists the sched-related tracepoints.

Different systems produce different lists. On a 2.6.35 kernel the list is already quite long, but however many entries there are, they can be divided into three categories:

  • Hardware events are generated by the PMU, for example cache hits. Sample these when you need to understand how a program exercises hardware features.
  • Software events are generated by kernel software, for example process switches and tick counts. `man perf_event_open` documents each of these events.
  • Tracepoint events are triggered by static tracepoints in the kernel and reveal details of kernel behaviour while the program runs, for example the number of slab allocations.
  • Kernel tracepoints fall into the following groups:
    • block: block device I/O
    • ext4: file system operations
    • kmem: kernel memory allocation events
    • random: kernel random number generator events
    • sched: CPU scheduler events
    • syscalls: system call enter and exits
    • task: task events
  • Dynamic tracing: probes created at runtime with perf probe (see below).

Particularly useful events

  • cpu-clock is the one I generally use; it samples based on the CPU clock.
  • block:block_rq_issue fires when the system issues a block I/O request.
perf stat

    # CPU counter statistics for the specified command (run perf over a command and show CPU-related counters):
    perf stat command
    # Detailed CPU counter statistics (includes extras) for the specified command:
    perf stat -d command
    # CPU counter statistics for the specified PID, until Ctrl-C:
    perf stat -p PID
    # CPU counter statistics for the entire system, for 5 seconds:
    perf stat -a sleep 5
    # Various basic CPU statistics, system wide, for 10 seconds:
    perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a sleep 10
    # Various CPU level 1 data cache statistics for the specified command:
    perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores command
    # Various CPU data TLB statistics for the specified command:
    perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses command
    # Various CPU last level cache statistics for the specified command:
    perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command
    # Using raw PMC counters, eg, counting unhalted core cycles:
    perf stat -e r003c -a sleep 5
    # PMCs: counting cycles and frontend stalls via raw specification:
    perf stat -e cycles -e cpu/event=0x0e,umask=0x01,inv,cmask=0x01/ -a sleep 5
    # Count syscalls per-second system-wide:
    perf stat -e raw_syscalls:sys_enter -I 1000 -a
    # Count system calls by type for the specified PID, until Ctrl-C:
    perf stat -e 'syscalls:sys_enter_*' -p PID
    # Count system calls by type for the entire system, for 5 seconds:
    perf stat -e 'syscalls:sys_enter_*' -a sleep 5
    # Count scheduler events for the specified PID, until Ctrl-C:
    perf stat -e 'sched:*' -p PID
    # Count scheduler events for the specified PID, for 10 seconds:
    perf stat -e 'sched:*' -p PID sleep 10
    # Count ext4 events for the entire system, for 10 seconds:
    perf stat -e 'ext4:*' -a sleep 10
    # Count block device I/O events for the entire system, for 10 seconds:
    perf stat -e 'block:*' -a sleep 10
    # Count all vmscan events, printing a report every second:
    perf stat -e 'vmscan:*' -a -I 1000

perf record

    • perf record saves the sampled data; you then parse and display it with perf report.
    • perf is often run with the -g option, which enables call-graph sampling so you can inspect call chains.
    • -p <pid> specifies the process ID.

Profiling

    # Sample on-CPU functions for the specified command, at 99 Hertz:
    perf record -F 99 command
    # Sample on-CPU functions for the specified PID, at 99 Hertz, until Ctrl-C:
    perf record -F 99 -p PID
    # Sample on-CPU functions for the specified PID, at 99 Hertz, for 10 seconds:
    perf record -F 99 -p PID sleep 10
    # Sample CPU stack traces (via frame pointers) for the specified PID, at 99 Hertz, for 10 seconds:
    perf record -F 99 -p PID -g -- sleep 10
    # Sample CPU stack traces for the PID, using dwarf (dbg info) to unwind stacks, at 99 Hertz, for 10 seconds:
    perf record -F 99 -p PID --call-graph dwarf sleep 10
    # Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (< Linux 4.11):
    perf record -F 99 -ag -- sleep 10
    # Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (>= Linux 4.11):
    perf record -F 99 -g -- sleep 10
    # If the previous command didn't work, try forcing perf to use the cpu-clock event:
    perf record -F 99 -e cpu-clock -ag -- sleep 10
    # Sample CPU stack traces for a container identified by its /sys/fs/cgroup/perf_event cgroup:
    perf record -F 99 -e cpu-clock --cgroup=docker/1d567f4393190204...etc... -a -- sleep 10
    # Sample CPU stack traces for the entire system, with dwarf stacks, at 99 Hertz, for 10 seconds:
    perf record -F 99 -a --call-graph dwarf sleep 10
    # Sample CPU stack traces for the entire system, using last branch record for stacks, ... (>= Linux 4.?):
    perf record -F 99 -a --call-graph lbr sleep 10
    # Sample CPU stack traces, once every 10,000 Level 1 data cache misses, for 5 seconds:
    perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
    # Sample CPU stack traces, once every 100 last level cache misses, for 5 seconds:
    perf record -e LLC-load-misses -c 100 -ag -- sleep 5
    # Sample on-CPU kernel instructions, for 5 seconds:
    perf record -e cycles:k -a -- sleep 5
    # Sample on-CPU user instructions, for 5 seconds:
    perf record -e cycles:u -a -- sleep 5
    # Sample on-CPU user instructions precisely (using PEBS), for 5 seconds:
    perf record -e cycles:up -a -- sleep 5
    # Perform branch tracing (needs HW support), for 1 second:
    perf record -b -a sleep 1

Static Tracing

    # Trace new processes, until Ctrl-C:
    perf record -e sched:sched_process_exec -a
    # Sample (take a subset of) context-switches, until Ctrl-C:
    perf record -e context-switches -a
    # Trace all context-switches, until Ctrl-C:
    # -c 1 sets sample_period to 1 (visible with -vv), i.e. capture every event
    perf record -e context-switches -c 1 -a
    # Include raw settings used (see: man perf_event_open):
    # -vv prints verbose details
    perf record -vv -e context-switches -a
    # Trace all context-switches via sched tracepoint, until Ctrl-C:
    perf record -e sched:sched_switch -a
    # Sample context-switches with stack traces, until Ctrl-C:
    perf record -e context-switches -ag
    # Sample context-switches with stack traces, for 10 seconds:
    perf record -e context-switches -ag -- sleep 10
    # Sample CS, stack traces, and with timestamps (< Linux 3.17, -T now default):
    perf record -e context-switches -ag -T
    # Sample CPU migrations, for 10 seconds:
    perf record -e migrations -a -- sleep 10
    # Trace all connect()s with stack traces (outbound connections), until Ctrl-C:
    perf record -e syscalls:sys_enter_connect -ag
    # Trace all accepts()s with stack traces (inbound connections), until Ctrl-C:
    perf record -e syscalls:sys_enter_accept* -ag
    # Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:
    perf record -e block:block_rq_insert -ag
    # Sample at most 100 block device requests per second, until Ctrl-C:
    perf record -F 100 -e block:block_rq_insert -a
    # Trace all block device issues and completions (has timestamps), until Ctrl-C:
    perf record -e block:block_rq_issue -e block:block_rq_complete -a
    # Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
    perf record -e block:block_rq_complete --filter 'nr_sector > 200'
    # Trace all block completions, synchronous writes only, until Ctrl-C:
    perf record -e block:block_rq_complete --filter 'rwbs == "WS"'
    # Trace all block completions, all types of writes, until Ctrl-C:
    perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'
    # Sample minor faults (RSS growth) with stack traces, until Ctrl-C:
    perf record -e minor-faults -ag
    # Trace all minor faults with stack traces, until Ctrl-C:
    perf record -e minor-faults -c 1 -ag
    # Sample page faults with stack traces, until Ctrl-C:
    perf record -e page-faults -ag
    # Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
    perf record -e 'ext4:*' -o /tmp/perf.data -a
    # Trace kswapd wakeup events, until Ctrl-C:
    perf record -e vmscan:mm_vmscan_wakeup_kswapd -ag
    # Add Node.js USDT probes (Linux 4.10+):
    perf buildid-cache --add `which node`
    # Trace the node http__server__request USDT event (Linux 4.10+):
    perf record -e sdt_node:http__server__request -a

perf report

    # Show perf.data in an ncurses browser (TUI) if possible:
    perf report
    # Show perf.data with a column for sample count:
    perf report -n
    # Show perf.data as a text report, with data coalesced and percentages:
    perf report --stdio
    # Report, with stacks in folded format: one line per stack (needs 4.4):
    perf report --stdio -n -g folded
    # List all events from perf.data:
    perf script
    # List all perf.data events, with data header (newer kernels; was previously default):
    perf script --header
    # List all perf.data events, with customized fields (< Linux 4.1):
    perf script -f time,event,trace
    # List all perf.data events, with customized fields (>= Linux 4.1):
    perf script -F time,event,trace
    # List all perf.data events, with my recommended fields (needs record -a; newer kernels):
    perf script --header -F comm,pid,tid,cpu,time,event,ip,sym,dso
    # List all perf.data events, with my recommended fields (needs record -a; older kernels):
    perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso
    # Dump raw contents from perf.data as hex (for debugging):
    perf script -D
    # Disassemble and annotate instructions with percentages (needs some debuginfo):
    perf annotate --stdio

perf probe

Dynamic Tracing

    # Add a tracepoint for the kernel tcp_sendmsg() function entry ("--add" is optional):
    perf probe --add tcp_sendmsg
    # Remove the tcp_sendmsg() tracepoint (or use "--del"):
    perf probe -d tcp_sendmsg
    # Add a tracepoint for the kernel tcp_sendmsg() function return:
    perf probe 'tcp_sendmsg%return'
    # Show available variables for the kernel tcp_sendmsg() function (needs debuginfo):
    perf probe -V tcp_sendmsg
    # Show available variables for the kernel tcp_sendmsg() function, plus external vars (needs debuginfo):
    perf probe -V tcp_sendmsg --externs
    # Show available line probes for tcp_sendmsg() (needs debuginfo):
    perf probe -L tcp_sendmsg
    # Show available variables for tcp_sendmsg() at line number 81 (needs debuginfo):
    perf probe -V tcp_sendmsg:81
    # Add a tracepoint for tcp_sendmsg(), with three entry argument registers (platform specific):
    perf probe 'tcp_sendmsg %ax %dx %cx'
    # Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for the %cx register (platform specific):
    perf probe 'tcp_sendmsg bytes=%cx'
    # Trace previously created probe when the bytes (alias) variable is greater than 100:
    perf record -e probe:tcp_sendmsg --filter 'bytes > 100'
    # Add a tracepoint for tcp_sendmsg() return, and capture the return value:
    perf probe 'tcp_sendmsg%return $retval'
    # Add a tracepoint for tcp_sendmsg(), and "size" entry argument (reliable, but needs debuginfo):
    perf probe 'tcp_sendmsg size'
    # Add a tracepoint for tcp_sendmsg(), with size and socket state (needs debuginfo):
    perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'
    # Tell me how on Earth you would do this, but don't actually do it (needs debuginfo):
    perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state'
    # Trace previous probe when size is non-zero, and state is not TCP_ESTABLISHED(1) (needs debuginfo):
    perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a
    # Add a tracepoint for tcp_sendmsg() line 81 with local variable seglen (needs debuginfo):
    perf probe 'tcp_sendmsg:81 seglen'
    # Add a tracepoint for do_sys_open() with the filename as a string (needs debuginfo):
    perf probe 'do_sys_open filename:string'
    # Add a tracepoint for myfunc() return, and include the retval as a string:
    perf probe 'myfunc%return +0($retval):string'
    # Add a tracepoint for the user-level malloc() function from libc:
    perf probe -x /lib64/libc.so.6 malloc
    # Add a tracepoint for this user-level static probe (USDT, aka SDT event):
    perf probe -x /usr/lib64/libpthread-2.24.so %sdt_libpthread:mutex_entry
    # List currently available dynamic probes:
    perf probe -l

MISC

Mixed

    # Trace system calls by process, showing a summary refreshing every 2 seconds:
    perf top -e raw_syscalls:sys_enter -ns comm
    # Trace sent network packets by on-CPU process, rolling output (no clear):
    stdbuf -oL perf top -e net:net_dev_xmit -ns comm | strings
    # Sample stacks at 99 Hertz, and, context switches:
    perf record -F99 -e cpu-clock -e cs -a -g
    # Sample stacks to 2 levels deep, and, context switch stacks to 5 levels (needs 4.8):
    perf record -F99 -e cpu-clock/max-stack=2/ -e cs/max-stack=5/ -a -g

Special

    # Record cacheline events (Linux 4.10+):
    perf c2c record -a -- sleep 10
    # Report cacheline events from previous recording (Linux 4.10+):
    perf c2c report

Other uses

    • perf c2c (Linux 4.10+): cache-2-cache and cacheline false sharing analysis.
    • perf kmem: kernel memory allocation analysis.
    • perf kvm: KVM virtual guest analysis.
    • perf lock: lock analysis.
    • perf mem: memory access analysis.
    • perf sched: kernel scheduler statistics. Examples.

Using perf on programs inside a Docker container

    • The simplest approach is to run perf record outside the container, then copy the data file into the container for analysis.
    • Run perf record -g -p <pid>, let it run for a while (say 15 seconds), then press Ctrl+C to stop.
    • Then copy the generated perf.data file into the container and analyze it there (the full workflow is sketched below):

    docker cp perf.data phpfpm:/tmp
    docker exec -i -t phpfpm bash
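
Put together, the workflow looks roughly like this (a sketch; the container name phpfpm comes from the example above, and it assumes perf is installed inside the container so symbols resolve against the container's binaries):

    # On the host: sample the target process for ~15 seconds.
    perf record -g -p <pid> -- sleep 15
    # Copy the data into the container, where the matching binaries and symbols live.
    docker cp perf.data phpfpm:/tmp
    # Inside the container: generate the report next to those symbols.
    docker exec -i -t phpfpm bash -c 'cd /tmp && perf report'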

perf top

    • It shows, in real time, the functions or instructions that consume the most CPU cycles, so it is useful for finding hot functions.
    • perf is often run with the -g option, which enables call-graph sampling so you can inspect call chains.
    • -p <pid> specifies the process ID.
    • Each line has four columns:
    • The first column, Overhead, is the percentage of all samples attributed to that symbol.
    • The second column, Shared, is the dynamic shared object (DSO) the function or instruction belongs to: the kernel, a process name, a shared library, a kernel module, and so on.
    • The third column, Object, is the type of the DSO. For example [.] means a user-space executable or shared library, while [k] means kernel space.
    • The last column, Symbol, is the symbol (function) name; when the name is unknown, a hexadecimal address is shown instead.

    # Sample CPUs at 49 Hertz, and show top addresses and symbols, live (no perf.data file):
    perf top -F 49
    # Sample CPUs at 49 Hertz, and show top process names and segments, live:
    perf top -F 49 -ns comm,dso

Optimizing with perf

  • Optimize system calls
  • Optimize cache utilization

Generating flame graphs with perf

    # 1. Sample
    # -g tells perf record to also record the call graph
    # -e cpu-clock sets the sampled metric to the CPU clock
    # -p specifies the pid to record
    # -a means across all CPUs
    sudo perf record -a -e cpu-clock -p <pid> -F 1000 -g sleep 60
    sudo perf record -a -e cpu-clock -F 1000 -g -p <pid>
    sudo perf record --call-graph dwarf -p pid
    # 2. Parse perf.data with perf script
    perf script -i perf.data &> perf.unfold
    # The following steps use the flame graph repo https://github.com/brendangregg/FlameGraph.git
    # 3. Fold the stacks in perf.unfold
    xxx/stackcollapse-perf.pl perf.unfold &> perf.folded
    # 4. Generate the SVG
    xxx/flamegraph.pl perf.folded > perf.svg

Flame graph recipes from Brendan Gregg's blog

Once stackcollapse-perf.pl has produced the folded file, you can grep it to build exactly the flame graph you want:

    perf script | ./stackcollapse-perf.pl > out.perf-folded
    # Filter out the cpu_idle threads to focus on the code that actually consumes CPU
    grep -v cpu_idle out.perf-folded | ./flamegraph.pl > nonidle.svg
    grep ext4 out.perf-folded | ./flamegraph.pl > ext4internals.svg
    egrep 'system_call.*sys_(read|write)' out.perf-folded | ./flamegraph.pl > rw.svg

    I frequently elide the cpu_idle threads in this way, to focus on the real threads that are consuming CPU resources. If I miss this step, the cpu_idle threads can often dominate the flame graph, squeezing the interesting code paths.
    Note that it would be a little more efficient to process the output of perf report instead of perf script; better still, perf report could have a report style (eg, “-g folded”) that output folded stacks directly, obviating the need for stackcollapse-perf.pl. There could even be a perf mode that output the SVG directly (which wouldn’t be the first one; see perf-timechart), although, that would miss the value of being able to grep the folded stacks (which I use frequently).
    There are more examples of perf_events CPU flame graphs on the CPU flame graph page, including a summary of these instructions. I have also shared an example of using perf for a Block Device I/O Flame Graph.

Generating flame graphs with SystemTap

    SystemTap can do the aggregation in-kernel and pass a (much smaller) report to user-land. The data collected and output generated can be customized much further via its scripting language.

    The timer.profile probe was used, which samples all CPUs at 100 Hertz

    # -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 keep stack traces from being truncated
    stap -s 32 -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 -D MAXMAPENTRIES=10240 \
        -D MAXACTION=10000 -D STP_OVERLOAD_THRESHOLD=5000000000 --all-modules \
        -ve 'global s; probe timer.profile { s[backtrace()] <<< 1; }
             probe end { foreach (i in s+) { print_stack(i); printf("\t%d\n", @count(s[i])); } }
             probe timer.s(60) { exit(); }' \
        > out.stap-stacks
    ./stackcollapse-stap.pl out.stap-stacks > out.stap-folded
    cat out.stap-folded | ./flamegraph.pl > stap-kernel.svg

perf script

perf script converts the file produced by perf record into a readable list of individual sample records.
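
For example (a quick look at the raw samples; the 10-second system-wide profile is just an illustration):

    # Record system-wide stacks for 10 seconds, then dump the first samples as text.
    perf record -F 99 -ag -- sleep 10
    perf script | head -n 20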

perf usage notes

  • Omitting frame pointers leaves perf with incomplete stacks. There are several workarounds (a compile example follows this list):
  • Build with -fno-omit-frame-pointer. If the build is optimized aggressively with -fomit-frame-pointer, the stacks will be incomplete. I suspect this optimization is turned on by -O2, so it is safest to build with -O0 -fno-omit-frame-pointer.
  • If the stacks are still incomplete after adding that flag, my guess is that libc itself was built with the frame pointer optimized away.
  • Pass --call-graph lbr to perf record; this requires CPU support for the last branch record (LBR) feature.
  • Use DWARF (i.e. gdb debug info): pass --call-graph dwarf (or -g dwarf) to perf record.
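
A minimal sketch of the build flags (the source and binary names are placeholders; -fno-omit-frame-pointer can also be combined with -O2 if -O0 is too slow):

    # Build with frame pointers preserved so perf can walk the frame-pointer chain.
    gcc -O0 -fno-omit-frame-pointer -g -o myapp myapp.c
    # Sample with frame-pointer based call graphs:
    perf record -F 99 -g -p $(pidof myapp) -- sleep 10
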
Off-CPU analysis

Off-CPU means the process has entered the kernel through a system call and is blocked there. The point of off-CPU analysis is to measure which system calls the CPU blocks in and how much time those calls take.

As the diagram below shows, off-CPU analysis focuses on blocking system calls (sys CPU is not necessarily high) where the process is waiting for some resource (such as I/O):

    Off-CPU Tracing -------------------------------------------->
                     |                   |
                     B                   B
                     A                   A                   A
      A(---------.                                .----------)
                  |                              |
                  B(--------.                .--)
                             |              |            user-land
     - - - - - - - - - - syscall - - - - - - - - - - - - - - - - -
                             |              |            kernel
                             X   Off-CPU    |
                           block . . . . . interrupt

Measuring off-CPU time

The simplest tool is the time command, which shows how much time a process spent in sys. In the example below the command took about 51 seconds of wall-clock time, but user plus sys only account for about 12.7 seconds. Roughly 38 seconds are unaccounted for: that is the off-CPU time, here spent reading and writing I/O.

    $ time tar cf archive.tar linux-4.15-rc2
    real    0m50.798s
    user    0m1.048s
    sys     0m11.627s

bcc commands

    # Summarize off-CPU time as a histogram
    /usr/share/bcc/tools/cpudist -O -p <pid>
    # Summarize off-CPU time; user-mode and kernel-mode stacks can be shown separately
    /usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
    git clone https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    ./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg
    # Count malloc() calls for a pid. Because BPF aggregates in the kernel, stackcount only
    # prints its summary when you press Ctrl-C, avoiding a kernel/user transition on every malloc() call.
    # Function names accept * wildcards, e.g. c:*map matches 18 functions.
    /usr/share/bcc/tools/stackcount -p 19183 -U c:malloc > out.stacks
    # Turn the output above into a flame graph
    ./stackcollapse.pl < out.stacks | ./flamegraph.pl --color=mem \
        --title="malloc() Flame Graph" --countname="calls" > out.svg

Caveats

Request-Synchronous Context

Many of the services we deploy keep most of their threads waiting most of the time (network threads waiting for packets, worker threads waiting for tasks). That noise gets in the way of analyzing off-CPU time directly. What we really want is the off-CPU time that occurs while a thread is handling a request, so from the offcputime output keep only the relevant stacks and analyze those (see the sketch after this paragraph).
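
A minimal sketch of that filtering on the folded output (run from the FlameGraph checkout; the function name do_command is only a placeholder for whatever marks request handling in your service):

    # Keep only stacks that contain the request-handling function, then render them.
    /usr/share/bcc/tools/offcputime -df -p <pid> 30 > out.stacks
    grep do_command out.stacks | ./flamegraph.pl --color=io \
        --title="Request Off-CPU Time" --countname=us > request-offcpu.svg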

Scheduler Latency

    Something that’s missing from these stacks is if the off-CPU time includes time spent waiting on a CPU run queue. This time is known as scheduler latency, run queue latency, or dispatcher queue latency. If the CPUs are running at saturation, then any time a thread blocks, it may endure additional time waiting its turn on a CPU after being woken up. That time will be included in the off-CPU time.

    You can use extra trace events to tease apart off-CPU time into time blocked vs scheduler latency, but in practice, CPU saturation is pretty easy to spot, so you are unlikely to be spending much time studying off-CPU time when you have a known CPU saturation issue to deal with.

In other words, if the CPUs are saturated, part of the off-CPU time is actually spent waiting in the run queue (scheduler latency) rather than blocked on a resource.
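
If you want to look at that component on its own, run queue latency can be measured directly; a minimal sketch using bcc's runqlat (assuming bcc is installed in its usual location):

    # Histogram of time threads spend waiting on a CPU run queue: one 10-second interval.
    /usr/share/bcc/tools/runqlat 10 1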

    非自愿上下文切換 Involuntary Context Switching

    Involuntary Context Switching也會造成上下文切換,但這一般情況下不是我們想要定位的問題,可以用如下的參數只關注D狀態下的offcpu TASK_UNINTERRUPTIBLE

    /usr/share/bcc/tools/offcputime -p `pgrep -nx mysqld` --state 2

    On Linux, involuntary context switches occur for state TASK_RUNNING (0), whereas the blocking events we’re usually interested in are in TASK_INTERRUPTIBLE (1) or TASK_UNINTERRUPTIBLE (2), which offcputime can match on using --state. I used this feature in my Linux Load Averages: Solving the Mystery post.

IO

Increased synchronous I/O latency shows up directly as increased service latency. We need a way to record the time of every synchronous I/O system call and turn it into a flame graph. Brendan wrote a tool for this, fileiostacks.py. Once we have determined that sync I/O is the cause of a performance problem, we can analyze it this way.

The commands below produce a flame graph containing only the stacks that perform synchronous I/O through the file system:

    /usr/share/bcc/tools/fileiostacks.py -f 30 > out.stacks
    git clone https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    ./flamegraph.pl --color=io --title="File I/O Time Flame Graph" --countname=us < out.stacks > out.svg

But there is more than one kind of I/O: there are also block devices and the network. Below is a block-device example.

Using perf

    perf record -e block:block_rq_insert -a -g -- sleep 30
    perf script --header > out.stacks
    git clone https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    ./stackcollapse-perf.pl < out.stacks | ./flamegraph.pl --color=io \
        --title="Block I/O Flame Graph" --countname="I/O" > out.svg

Using bcc

    /usr/share/bcc/tools/biostacks.py -f 30 > out.stacks
    git clone https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    ./flamegraph.pl --color=io --title="Block I/O Time Flame Graph" --countname=us < out.stacks > out.svg

Off-CPU

Unlike the cases above, here we want to analyze all off-CPU events.

    /usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
    git clone https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    ./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg

Wake-ups

This flame graph focuses on the threads doing the waking, so we can see exactly which thread woke a given process. It is quite useful: for example, when lock overhead is high, we can find out which processes are waking the lock waiters.

The wmb thundering-herd problem we hit in the kernel a while back, for instance, should be diagnosable with this method.

    /usr/share/bcc/tools/wakeuptime -f 30 > out.stacks
    git clone https://github.com/brendangregg/FlameGraph
    cd FlameGraph
    ./flamegraph.pl --color=wakeup --title="Wakeup Time Flame Graph" --countname=us < out.stacks > out.svg

Chain Graphs

"Later readers, too, will be moved by these words." (Placeholder: this section has not been written yet.)

Overview of the bcc tool collection

    cd /usr/share/bcc/tools/

Single Purpose Tools:

    # Trace new processes:
    execsnoop
    # Trace file opens with process and filename:
    opensnoop
    # Summarize block I/O (disk) latency as a power-of-2 distribution by disk:
    biolatency -D
    # Summarize block I/O size as a power-of-2 distribution by program name:
    bitesize
    # Trace common ext4 file system operations slower than 1 millisecond:
    ext4slower 1
    # Trace TCP active connections (connect()) with IP address and ports:
    tcpconnect
    # Trace TCP passive connections (accept()) with IP address and ports:
    tcpaccept
    # Trace TCP connections to local port 80, with session duration:
    tcplife -L 80
    # Trace TCP retransmissions with IP addresses and TCP state (only retransmitted packets are printed):
    tcpretrans
    # Sample stack traces at 49 Hertz for 10 seconds, emit folded format (for flame graphs):
    profile -fd -F 49 10
    # Trace details and latency of resolver DNS lookups:
    gethostlatency
    # Trace commands issued in all running bash shells:
    bashreadline

Multi Tools: Kernel Dynamic Tracing

    # Count "tcp_send*" kernel function, print output every second:
    funccount -i 1 'tcp_send*'
    # Count "vfs_*" calls for PID 185:
    funccount -p 185 'vfs_*'
    # Trace file names opened, using dynamic tracing of the kernel do_sys_open() function:
    trace 'p::do_sys_open "%s", arg2'
    # Same as before ("p::" is assumed if not specified):
    trace 'do_sys_open "%s", arg2'
    # Trace the return of the kernel do_sys_open() function, and print the retval:
    trace 'r::do_sys_open "ret: %d", retval'
    # Trace do_nanosleep() kernel function and the second argument (mode), with kernel stack traces:
    trace -K 'do_nanosleep "mode: %d", arg2'
    # Trace do_nanosleep() mode by providing the prototype (no debuginfo required):
    trace 'do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) "mode: %d", mode'
    # Trace do_nanosleep() with the task address (may be NULL), noting the dereference:
    trace 'do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) "task: %x", t->task'
    # Frequency count tcp_sendmsg() size:
    argdist -C 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size'
    # Summarize tcp_sendmsg() size as a power-of-2 histogram:
    argdist -H 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size'
    # Frequency count stack traces that lead to the submit_bio() function (disk I/O issue):
    stackcount submit_bio
    # Summarize the latency (time taken) by the vfs_read() function for PID 181
    # (measuring function latency like this is also useful for tuning your own process):
    funclatency -p 181 -u vfs_read

Multi Tools: User Level Dynamic Tracing

    # Trace the libc library function nanosleep() and print the requested sleep details:
    trace 'p:c:nanosleep(struct timespec *req) "%d sec %d nsec", req->tv_sec, req->tv_nsec'
    # Count the libc write() call for PID 181 by file descriptor:
    argdist -p 181 -C 'p:c:write(int fd):int:fd'
    # Summarize the latency (time taken) by libc getaddrinfo(), as a power-of-2 histogram in microseconds:
    funclatency.py -u 'c:getaddrinfo'

Multi Tools: Kernel Static Tracing

    # Count stack traces that led to issuing block I/O, tracing its kernel tracepoint:
    stackcount t:block:block_rq_insert

Multi Tools: User Statically Defined Tracing (USDT)

    # Trace the pthread_create USDT probe, and print arg1 as hex:
    trace 'u:pthread:pthread_create "%x", arg1'

perf-tools

perf-tools

Some of its tools have been absorbed into bcc, but others, such as iosnoop, are only available in perf-tools (see the example below).

References

  • linux 性能调优工具perf + 火焰图 常用命令
  • 系統級性能分析工具perf的介紹與使用 (a very thorough introduction to perf)
  • Flame Graphs (mind-opening; also carries the latest news on system optimization)
  • CPU Flame Graphs (Brendan Gregg; excellent)
  • Off-CPU Flame Graphs (a series of bcc/perf tools, including off-CPU analysis, I/O analysis, and more)
  • 關於-fno-omit-frame-pointer與-fomit-frame-pointer
  • perf Examples (excellent)
  • SystemTap新手指南 (Chinese translation of the SystemTap Beginners Guide)
  • perf-tools (many useful tools, e.g. tcpretrans captures only retransmitted TCP packets)
  • Linux Perf Tools Tips (excellent; covers many questions, e.g. what to do when perf cannot find symbols, plus SystemTap material)
  • eBPF
  • Linux Extended BPF (eBPF) Tracing Tools
  • Userspace stack is not unwinded in most samples with offcputime.py
  • Speed up SystemTap script monitoring of system calls
  • What do ‘real’, ‘user’ and ‘sys’ mean in the output of time(1)?
  • Linux Load Averages: Solving the Mystery
  • 漫话性能:CPU上下文切换 (an in-depth article on context switching)
