Troubleshooting OOM Kills Triggered by Parallel make
While upgrading GCC on CentOS 7 with a parallel make, I ran into a problem: the many concurrent cc1plus
processes triggered the system OOM killer, so the large build could not complete:
[Tue Aug 22 16:10:45 2023] cc1plus invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[Tue Aug 22 16:10:45 2023] cc1plus cpuset=system.slice mems_allowed=0-7
[Tue Aug 22 16:10:45 2023] CPU: 39 PID: 46588 Comm: cc1plus Kdump: loaded Tainted: G W OE K 4.19.91-011.ali4000.alios7.x86_64 #1
[Tue Aug 22 16:10:45 2023] Hardware name: Alibaba Alibaba Cloud ECS/65N32-US, BIOS 1.3.HZ.SG.S.005.00 08/19/2021
[Tue Aug 22 16:10:45 2023] Call Trace:
[Tue Aug 22 16:10:45 2023] dump_stack+0x66/0x8b
[Tue Aug 22 16:10:45 2023] dump_memcg_header+0x12/0x40
[Tue Aug 22 16:10:45 2023] oom_kill_process+0x201/0x2f0
[Tue Aug 22 16:10:45 2023] out_of_memory+0x12f/0x510
[Tue Aug 22 16:10:45 2023] mem_cgroup_out_of_memory+0xdd/0x100
[Tue Aug 22 16:10:45 2023] try_charge+0x83e/0x860 [kpatch_9092863]
[Tue Aug 22 16:10:45 2023] mem_cgroup_charge+0xe0/0x3d0
[Tue Aug 22 16:10:45 2023] __add_to_page_cache_locked+0x6e/0x2c0
[Tue Aug 22 16:10:45 2023] add_to_page_cache_lru+0x4a/0xc0
[Tue Aug 22 16:10:45 2023] pagecache_get_page+0xfc/0x310
[Tue Aug 22 16:10:45 2023] filemap_fault+0x601/0x9a0
[Tue Aug 22 16:10:45 2023] ? page_add_file_rmap+0xe0/0x170
[Tue Aug 22 16:10:45 2023] ? alloc_set_pte+0x2a0/0x4b0
[Tue Aug 22 16:10:45 2023] ? filemap_map_pages+0x18a/0x390
[Tue Aug 22 16:10:45 2023] ext4_filemap_fault+0x2c/0x3b
[Tue Aug 22 16:10:45 2023] __do_fault+0x32/0xb0
[Tue Aug 22 16:10:45 2023] do_fault+0x31b/0x470
[Tue Aug 22 16:10:45 2023] handle_mm_fault+0x8a4/0x960
[Tue Aug 22 16:10:45 2023] __do_page_fault+0x26b/0x4a0
[Tue Aug 22 16:10:45 2023] do_page_fault+0x32/0x110
[Tue Aug 22 16:10:45 2023] ? page_fault+0x8/0x30
[Tue Aug 22 16:10:45 2023] page_fault+0x1e/0x30
[Tue Aug 22 16:10:45 2023] RIP: 0033:0xbe760b
[Tue Aug 22 16:10:45 2023] Code: Bad RIP value.
[Tue Aug 22 16:10:45 2023] RSP: 002b:00007ffdeb1652a0 EFLAGS: 00010246
[Tue Aug 22 16:10:45 2023] RAX: 0000000000000037 RBX: 00007f8c63f3f6a8 RCX: 0000000000000001
[Tue Aug 22 16:10:45 2023] RDX: 0000000000000000 RSI: 00007f8c63f3f6c0 RDI: 0000000000000037
[Tue Aug 22 16:10:45 2023] RBP: 00007f8c63f3f6c0 R08: 0000000000000100 R09: 00007f8c7365f840
[Tue Aug 22 16:10:45 2023] R10: 0000000003dbbb01 R11: 0000000000020000 R12: 0000000001643426
[Tue Aug 22 16:10:45 2023] R13: 0000000000000012 R14: 0000000000000008 R15: 0000000003ee0280
[Tue Aug 22 16:10:45 2023] Task in /system.slice/sshd.service killed as a result of limit of /system.slice/sshd.service
[Tue Aug 22 16:10:45 2023] memory: usage 10485760kB, limit 10485760kB, failcnt 4134727
[Tue Aug 22 16:10:45 2023] memory+swap: usage 10485760kB, limit 9007199254740988kB, failcnt 0
[Tue Aug 22 16:10:45 2023] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[Tue Aug 22 16:10:45 2023] Memory cgroup stats for /system.slice/sshd.service: cache:3036KB rss:10474860KB rss_huge:0KB shmem:0KB mapped_file:924KB dirty:0KB writeback:3696KB swap:0KB workingset_refault_anon:0KB workingset_refault_file:12214620KB workingset_activate_anon:0KB workingset_activate_file:11657976KB workingset_restore_anon:0KB workingset_restore_file:11403480KB workingset_nodereclaim:0KB inactive_anon:10473540KB active_anon:3036KB inactive_file:0KB active_file:0KB unevictable:0KB
[Tue Aug 22 16:10:45 2023] Tasks state (memory values in pages):
[Tue Aug 22 16:10:45 2023] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Tue Aug 22 16:10:45 2023] [ 123180] 0 123180 27733 1761 249856 0 -1000 sshd
[Tue Aug 22 16:10:45 2023] [ 429414] 0 429414 37647 1906 331776 0 0 sshd
...
[Tue Aug 22 16:10:45 2023] [ 433777] 500 433777 36752 5701 131072 0 0 screen
[Tue Aug 22 16:10:45 2023] [ 433778] 500 433778 29182 1174 77824 0 0 bash
[Tue Aug 22 16:10:45 2023] [ 437724] 500 437724 40874 1357 167936 0 0 top
[Tue Aug 22 16:10:45 2023] [ 441278] 500 441278 29182 1133 77824 0 0 bash
[Tue Aug 22 16:10:45 2023] [ 26966] 500 26966 27613 1079 73728 0 0 make
[Tue Aug 22 16:10:45 2023] [ 44310] 500 44310 28327 689 69632 0 0 sh
...
[Tue Aug 22 16:10:45 2023] [ 44311] 500 44311 82733 47056 512000 0 0 makeinfo
[Tue Aug 22 16:10:45 2023] [ 44976] 500 44976 25651 23914 249856 0 0 genattrtab
[Tue Aug 22 16:10:45 2023] [ 44977] 500 44977 28327 674 69632 0 0 sh
[Tue Aug 22 16:10:45 2023] [ 44982] 500 44982 74349 72647 630784 0 0 genautomata
[Tue Aug 22 16:10:45 2023] [ 45110] 500 45110 28134 462 69632 0 0 g++
[Tue Aug 22 16:10:45 2023] [ 45112] 500 45112 28134 461 73728 0 0 g++
[Tue Aug 22 16:10:45 2023] [ 45114] 500 45114 28134 493 69632 0 0 g++
[Tue Aug 22 16:10:45 2023] [ 45116] 500 45116 130299 84135 815104 0 0 cc1plus
[Tue Aug 22 16:10:45 2023] [ 45118] 500 45118 136680 96896 929792 0 0 cc1plus
[Tue Aug 22 16:10:45 2023] [ 45119] 500 45119 193033 149300 1335296 0 0 cc1plus
[Tue Aug 22 16:10:45 2023] [ 45925] 500 45925 28134 471 61440 0 0 g++
[Tue Aug 22 16:10:45 2023] [ 46197] 500 46197 28134 394 65536 0 0 g++
[Tue Aug 22 16:10:45 2023] [ 46297] 500 46297 28134 493 65536 0 0 g++
[Tue Aug 22 16:10:45 2023] [ 46299] 500 46299 83917 44393 487424 0 0 cc1plus
[Tue Aug 22 16:10:45 2023] [ 46324] 500 46324 28134 470 69632 0 0 g++
...
[Tue Aug 22 16:10:45 2023] [ 47046] 500 47046 39421 1784 139264 0 0 cc1plus
[Tue Aug 22 16:10:45 2023] [ 47047] 500 47047 29613 1983 86016 0 0 as
[Tue Aug 22 16:10:45 2023] [ 47048] 500 47048 28386 598 69632 0 0 as
[Tue Aug 22 16:10:45 2023] oom_reaper: reaped process 45119 (cc1plus), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
As the log shows, the memory limit is roughly 10 GB, and once usage reaches that limit
cc1plus is killed by the OOM killer.
So where is the memory limit on these sshd child processes inherited from and configured?
The line Task in /system.slice/sshd.service killed as a result of limit of /system.slice/sshd.service
shows that this OOM kill is imposed by a kernel cgroup limit, not by system-wide memory exhaustion.
In short, the cgroup for system.slice/sshd.service lives under
/sys/fs/cgroup/memory/system.slice/sshd.service
where the limit is exposed as memory.limit_in_bytes:
cat /sys/fs/cgroup/memory/system.slice/sshd.service/memory.limit_in_bytes
The current value is:
10737418240
which converts to exactly 10 GiB
( 10737418240/1024/1024/1024
)
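The conversion can be checked directly with shell arithmetic:

```shell
# 10737418240 bytes divided down to GiB
echo $((10737418240 / 1024 / 1024 / 1024))   # prints 10
```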
Temporary fix (this raises the limit to 1000 GiB):
echo 1073741824000 > memory.limit_in_bytes
Then re-run the parallel make.
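Note that writing to memory.limit_in_bytes directly only lasts until systemd recreates the cgroup (for example on service restart or reboot). A sketch of a persistent alternative, assuming the cgroup v1 memory controller used on CentOS 7, where the relevant systemd property is MemoryLimit:

```shell
# Persistently raise (here: remove) the memory limit for sshd.service;
# systemctl set-property applies the change immediately and writes a
# drop-in unit file under /etc/systemd/system/sshd.service.d/ so it
# survives restarts.
systemctl set-property sshd.service MemoryLimit=infinity
```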
Note
To go deeper, I plan to systematically study how systemd manages application resources and round out my knowledge in this area.