systemd at 100% CPU with a large number of zombie processes

Published: 2024/10/8

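One day a large number of CentOS 7 servers started misbehaving at the same time: systemd was using 100% CPU and zombie processes were piling up. [Screenshot: top output]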

As the zombies accumulated, system resources were gradually exhausted and the machines eventually went down.

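On CentOS 7, systemd runs as PID 1 and is responsible for reaping orphaned processes. In this incident, systemd at 100% CPU is the cause and the pile of zombies is the effect: a systemd that is stuck spinning does not get around to reaping its children. So the real question is why systemd was burning 100% of a CPU.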
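To put a number on the symptom, the zombie count can be read straight from procfs. The following is a minimal sketch, not something from the original investigation: the function name count_zombies and the proc_root parameter are hypothetical, and it assumes the standard Linux layout in which /proc/<pid>/stat carries the process state as the first field after the parenthesised command name.

```python
import os


def count_zombies(proc_root="/proc"):
    """Count processes whose state field in <proc_root>/<pid>/stat is 'Z'."""
    zombies = 0
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        stat_path = os.path.join(proc_root, name, "stat")
        try:
            with open(stat_path, errors="replace") as fh:
                stat_line = fh.read()
        except OSError:
            continue  # the process went away between listdir() and open()
        # The command name sits in parentheses and may itself contain spaces
        # or ')', so take everything after the LAST ')' and split that.
        fields = stat_line[stat_line.rfind(")") + 1:].split()
        if fields and fields[0] == "Z":
            zombies += 1
    return zombies
```

Calling count_zombies() periodically on an affected host would show whether the count keeps climbing while systemd is busy.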

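I installed the systemd debuginfo package and profiled systemd with perf. In user space, the functions using the most CPU were endswith, hidden_file_allow_backup, dirent_ensure_type, hidden_file and find_symlinks_fd; in kernel space, dcache_readdir stood out, which suggests the kernel was busy reading directories. [Screenshot: perf report output]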

I wrote a hidden_file.stp script that prints the user-space stack whenever hidden_file is called:

probe process("/usr/lib/systemd/systemd").function("hidden_file").call {
    print_usyms(ubacktrace())
}

Running stap hidden_file.stp gave the following:

0x7f927f3b4b35 : 0x7f927f3b4b35 [/usr/lib64/libc-2.17.so+0x21b35/0x3bc000]
0x7f9280d04b20 : hidden_file+0x0/0x50 [/usr/lib/systemd/systemd]
0x7f9280cdeaf8 : find_symlinks_fd+0xc8/0x510 [/usr/lib/systemd/systemd]
0x7f9280ce386e : unit_file_lookup_state+0x26e/0x410 [/usr/lib/systemd/systemd]
0x7f9280ce4be7 : unit_file_get_list+0x1b7/0x390 [/usr/lib/systemd/systemd]
0x7f9280cc77ce : method_list_unit_files+0xae/0x220 [/usr/lib/systemd/systemd]
0x7f9280d294c7 : object_find_and_run+0xac7/0x1670 [/usr/lib/systemd/systemd]
0x7f9280d2a189 : bus_process_object+0x119/0x310 [/usr/lib/systemd/systemd]
0x7f9280d32e53 : bus_process_internal+0xdb3/0x1210 [/usr/lib/systemd/systemd]
0x7f9280d332d3 : io_callback+0x13/0x50 [/usr/lib/systemd/systemd]
0x7f9280d387d0 : source_dispatch+0x1c0/0x320 [/usr/lib/systemd/systemd]
0x7f9280d3986a : sd_event_dispatch+0x6a/0x1b0 [/usr/lib/systemd/systemd]
0x7f9280c992e3 : manager_loop+0x403/0x500 [/usr/lib/systemd/systemd]
0x7f9280c8d72b : main+0x1e7b/0x3e00 [/usr/lib/systemd/systemd]

Judging from the code of systemd's source_dispatch function, a SOURCE_IO event was dispatched here.

static int source_dispatch(sd_event_source *s) {
        int r = 0;

        assert(s);
        assert(s->pending || s->type == SOURCE_EXIT);

        if (s->type != SOURCE_DEFER && s->type != SOURCE_EXIT) {
                r = source_set_pending(s, false);
                if (r < 0)
                        return r;
        }

        if (s->type != SOURCE_POST) {
                sd_event_source *z;
                Iterator i;

                /* If we execute a non-post source, let's mark all
                 * post sources as pending */

                SET_FOREACH(z, s->event->post_sources, i) {
                        if (z->enabled == SD_EVENT_OFF)
                                continue;

                        r = source_set_pending(z, true);
                        if (r < 0)
                                return r;
                }
        }

        if (s->enabled == SD_EVENT_ONESHOT) {
                r = sd_event_source_set_enabled(s, SD_EVENT_OFF);
                if (r < 0)
                        return r;
        }

        s->dispatching = true;

        switch (s->type) {

        case SOURCE_IO:
                r = s->io.callback(s, s->io.fd, s->io.revents, s->userdata);
                break;

        case SOURCE_TIME_REALTIME:
        case SOURCE_TIME_BOOTTIME:
        case SOURCE_TIME_MONOTONIC:
        case SOURCE_TIME_REALTIME_ALARM:
        case SOURCE_TIME_BOOTTIME_ALARM:
                r = s->time.callback(s, s->time.next, s->userdata);
                break;

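So my first suspicion was that a flood of SOURCE_IO events was being triggered on the machine. I located sd_event_add_io, the function that registers a new IO event source, and probed it with systemtap to print the user stack, but to my surprise it was never entered. So nobody was registering IO events in bulk when the problem occurred, which means systemd had to be stuck in a loop somewhere. Going back to the user stack and reading the code for loops, I suspected the following function, which at a glance iterates over every directory in paths.unit_path: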

int unit_file_get_list(
                UnitFileScope scope,
                const char *root_dir,
                Hashmap *h) {

        _cleanup_lookup_paths_free_ LookupPaths paths = {};
        char **i;
        int r;

        assert(scope >= 0);
        assert(scope < _UNIT_FILE_SCOPE_MAX);
        assert(h);

        r = verify_root_dir(scope, &root_dir);
        if (r < 0)
                return r;

        r = lookup_paths_init_from_scope(&paths, scope, root_dir);
        if (r < 0)
                return r;

        STRV_FOREACH(i, paths.unit_path) {
                _cleanup_closedir_ DIR *d = NULL;
                _cleanup_free_ char *units_dir;

                units_dir = path_join(root_dir, *i, NULL);
                if (!units_dir)
                        return -ENOMEM;

                d = opendir(units_dir);
                if (!d) {
                        if (errno == ENOENT)
                                continue;

                        return -errno;
                }

                for (;;) {
                        _cleanup_(unit_file_list_free_onep) UnitFileList *f = NULL;
                        struct dirent *de;

                        errno = 0;
                        de = readdir(d);
                        if (!de && errno != 0)
                                return -errno;

                        if (!de)
                                break;

                        if (hidden_file(de->d_name))
                                continue;

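To confirm where the loop was, I used systemtap probes to check which source lines were actually being executed, and the code that kept looping was indeed that for (;;).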

probe process("/usr/lib/systemd/systemd").function("*@src/shared/install.c:2461").call {
    printf("hit\n")
}

Printing the directory being read then showed that it was /run/systemd/system/session-***.scope.d/

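A look under /run/systemd/ showed just how large it had grown. On some machines the directory itself was 32M in size and held several hundred thousand entries.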
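With that many entries, even listing the directory is slow, so it helps to count without building the whole listing in memory. A minimal sketch under stated assumptions: count_entries and its prefix parameter are hypothetical names, and the "session-" prefix used in the usage note is only inferred from the path shown above.

```python
import os


def count_entries(directory, prefix=""):
    """Count entries in a directory whose names start with prefix.

    os.scandir() yields entries lazily, so memory use stays flat even
    when the directory holds hundreds of thousands of names.
    """
    total = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.name.startswith(prefix):
                total += 1
    return total
```

For example, count_entries("/run/systemd/system", "session-") would report how many session entries have accumulated, and os.stat("/run/systemd/system").st_size gives the size of the directory itself.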

Googling for the cause of so many /run/systemd/system/session entries turned up a dbus bug with the same symptoms: https://bugs.freedesktop.org/show_bug.cgi?id=95263. According to the discussion there:

I think what I'm going to do here is:

* For 1.11.x (master): apply the patches you used, and revert the "uid 0" workaround.

* For 1.10.x: stick with the "uid 0" workaround, because that workaround is enough to address this for logind, which is the most important impact. We can consider additionally backporting the patches from Bug #95619 and this bug later, once they have had more testing in master; but if we do, we will also keep the workaround.

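So the bug was addressed separately in 1.11.x and 1.10.x. But the dbus version on the affected machines was dbus-1.10.24-13.el7_6.x86_64, which looked like an already-fixed version... That gave me pause. I then checked when dbus had been upgraded: [Screenshot: dbus upgrade time]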

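Sorting the files on one affected machine by timestamp showed that entries from March and April were still present in large numbers and then stopped abruptly in May. That matches the time of the dbus upgrade.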
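The same month-by-month picture can be produced without sorting hundreds of thousands of names by eye. This is an illustrative sketch only: mtime_histogram is a hypothetical helper, and it buckets by UTC month, which may shift entries near a month boundary relative to local time.

```python
import os
from collections import Counter
from datetime import datetime, timezone


def mtime_histogram(directory):
    """Map 'YYYY-MM' (UTC) to the number of entries last modified in that month."""
    months = Counter()
    with os.scandir(directory) as entries:
        for entry in entries:
            try:
                # Do not follow symlinks: we want the entry's own timestamp.
                mtime = entry.stat(follow_symlinks=False).st_mtime
            except OSError:
                continue  # entry vanished while we were scanning
            stamp = datetime.fromtimestamp(mtime, tz=timezone.utc)
            months[stamp.strftime("%Y-%m")] += 1
    return months
```

Printing sorted(mtime_histogram("/run/systemd/system").items()) would make a sudden drop after a given month easy to spot.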

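So these sessions had accumulated under the old dbus version before the upgrade, and upgrading dbus did not clean them up. Leaving them in place was harmless until, one day, some systemd command was run that triggered a SOURCE_IO event and set off the problem. Deleting the session*.scope entries under /run/systemd/system/ that predated the dbus upgrade resolved it.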
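Before deleting anything, it is worth producing the list of candidates first. The sketch below is a dry run only and removes nothing. The name find_stale_sessions, the glob "session*.scope*" (widened so that the .scope.d directories seen earlier also match) and the cutoff timestamp are all assumptions for illustration; the actual cutoff has to come from the dbus upgrade time on the machine in question.

```python
import fnmatch
import os


def find_stale_sessions(directory, cutoff_ts, pattern="session*.scope*"):
    """Return sorted names matching pattern whose mtime is older than cutoff_ts.

    cutoff_ts is a Unix timestamp (seconds since the epoch). Nothing is
    deleted here; the caller decides what to do with the result.
    """
    stale = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not fnmatch.fnmatchcase(entry.name, pattern):
                continue
            try:
                mtime = entry.stat(follow_symlinks=False).st_mtime
            except OSError:
                continue  # entry disappeared during the scan
            if mtime < cutoff_ts:
                stale.append(entry.name)
    return sorted(stale)
```

Reviewing the returned names before removing them keeps sessions created after the upgrade, and unrelated unit files, out of harm's way.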
