Fast path and slow path of memory allocation
Page allocation is ultimately handled by the buddy system's page allocator. The kernel exposes many page-allocation functions, but they all end up calling one common interface: __alloc_pages_nodemask().
Common page allocation APIs
__alloc_pages_node   /* returns a struct page pointer */
    -> __alloc_pages
    -> __alloc_pages_nodemask

alloc_pages          /* returns a struct page pointer */
    -> alloc_pages_current
    -> __alloc_pages_nodemask

__get_free_pages     /* returns the kernel virtual address of the pages */
    -> alloc_pages
    -> alloc_pages_current
    -> __alloc_pages_nodemask

They all end up calling __alloc_pages_nodemask().
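As a quick illustration of how these entry points are used from kernel code, here is a minimal sketch (GFP_KERNEL and the order value are just example parameters; error handling is trimmed):

#include <linux/gfp.h>
#include <linux/mm.h>

static void alloc_demo(void)
{
    unsigned int order = 2;              /* 2^2 = 4 contiguous pages */

    /* Returns a struct page pointer, or NULL on failure. */
    struct page *page = alloc_pages(GFP_KERNEL, order);
    if (page)
        __free_pages(page, order);

    /* Returns the kernel virtual address of the pages, or 0 on failure. */
    unsigned long addr = __get_free_pages(GFP_KERNEL, order);
    if (addr)
        free_pages(addr, order);
}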
The heart of the buddy system
__alloc_pages_nodemask() is the heart of the buddy system.
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
                       nodemask_t *nodemask)
{
    struct page *page;
    unsigned int alloc_flags = ALLOC_WMARK_LOW;
    gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
    struct alloc_context ac = { };

    /*
     * There are several places where we assume that the order value is sane
     * so bail out early if the request is out of bound.
     */
    if (unlikely(order >= MAX_ORDER)) {
        /* fail if the requested order exceeds the maximum order */
        WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
        return NULL;
    }

    gfp_mask &= gfp_allowed_mask;
    alloc_mask = gfp_mask;
    if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
        return NULL;

    finalise_ac(gfp_mask, &ac);

    /*
     * Forbid the first pass from falling back to types that fragment
     * memory until all local zones are considered.
     */
    alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp_mask);

    /* First allocation attempt */
    page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
    if (likely(page))
        goto out;

    /*
     * Apply scoped allocation constraints. This is mainly about GFP_NOFS
     * resp. GFP_NOIO which has to be inherited for all allocation requests
     * from a particular context which has been marked by
     * memalloc_no{fs,io}_{save,restore}.
     */
    alloc_mask = current_gfp_context(gfp_mask);
    ac.spread_dirty_pages = false;

    /*
     * Restore the original nodemask if it was potentially replaced with
     * &cpuset_current_mems_allowed to optimize the fast-path attempt.
     */
    if (unlikely(ac.nodemask != nodemask))
        ac.nodemask = nodemask;

    page = __alloc_pages_slowpath(alloc_mask, order, &ac);

out:
    if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
        unlikely(__memcg_kmem_charge(page, gfp_mask, order) != 0)) {
        __free_pages(page, order);
        page = NULL;
    }

    trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);

    return page;
}
EXPORT_SYMBOL(__alloc_pages_nodemask);

From the source above we can see that the core of __alloc_pages_nodemask() boils down to four things:
prepare_alloc_pages      // 1. prepare the allocation parameters
alloc_flags_nofragment   // 2. add allocation flags based on the preferred zone and the gfp mask
get_page_from_freelist   // 3. fast path: try to allocate from the free lists
__alloc_pages_slowpath   // 4. slow path: try harder to allocate

prepare_alloc_pages
static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
                int preferred_nid, nodemask_t *nodemask,
                struct alloc_context *ac, gfp_t *alloc_mask,
                unsigned int *alloc_flags)
{
    ac->high_zoneidx = gfp_zone(gfp_mask);
    ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
    ac->nodemask = nodemask;
    ac->migratetype = gfpflags_to_migratetype(gfp_mask);

    if (cpusets_enabled()) {
        *alloc_mask |= __GFP_HARDWALL;
        if (!ac->nodemask)
            ac->nodemask = &cpuset_current_mems_allowed;
        else
            *alloc_flags |= ALLOC_CPUSET;
    }

    fs_reclaim_acquire(gfp_mask);
    fs_reclaim_release(gfp_mask);

    might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);

    if (should_fail_alloc_page(gfp_mask, order))
        return false;

    if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
        *alloc_flags |= ALLOC_CMA;

    return true;
}

prepare_alloc_pages() mainly does the following:
1. Fill in the alloc_context structure.
2. Process the gfp mask and store the result in alloc_mask.
3. Fill in the alloc_flags.
Once this preparatory work is done, finalise_ac() is executed to determine the preferred zone the allocation can come from.
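For reference, finalise_ac() is a tiny helper; the body below is recalled from kernels of this vintage, so treat it as an approximation rather than a verbatim quote:

static inline void finalise_ac(gfp_t gfp_mask, struct alloc_context *ac)
{
    /* Dirty zone balancing is only done in the fast path. */
    ac->spread_dirty_pages = (gfp_mask & __GFP_WRITE);

    /*
     * The preferred zone is used for statistics and as the starting
     * point for the zonelist iterator. It may get reset for allocations
     * that ignore memory policies.
     */
    ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                    ac->high_zoneidx, ac->nodemask);
}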
alloc_flags_nofragment
static inline unsigned int alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
{
    unsigned int alloc_flags = 0;

    if (gfp_mask & __GFP_KSWAPD_RECLAIM)
        alloc_flags |= ALLOC_KSWAPD;

#ifdef CONFIG_ZONE_DMA32
    if (!zone)
        return alloc_flags;

    if (zone_idx(zone) != ZONE_NORMAL)
        return alloc_flags;

    /*
     * If ZONE_DMA32 exists, assume it is the one after ZONE_NORMAL and
     * the pointer is within zone->zone_pgdat->node_zones[]. Also assume
     * on UMA that if Normal is populated then so is DMA32.
     */
    BUILD_BUG_ON(ZONE_NORMAL - ZONE_DMA32 != 1);
    if (nr_online_nodes > 1 && !populated_zone(--zone))
        return alloc_flags;

    alloc_flags |= ALLOC_NOFRAGMENT;
#endif /* CONFIG_ZONE_DMA32 */
    return alloc_flags;
}

alloc_flags_nofragment() first checks whether the gfp mask allows kswapd reclaim (__GFP_KSWAPD_RECLAIM); if it does, ALLOC_KSWAPD is set so that kswapd can be woken to reclaim in the background when memory runs short. On systems with ZONE_DMA32, when the preferred zone is ZONE_NORMAL it additionally sets ALLOC_NOFRAGMENT, which forbids the first allocation pass from falling back in a way that fragments memory until all local zones have been considered.
With this preparation in place, allocation first tries the fast path.
Fast path allocation (fastpath)
If the check shows that a zone's number of free pages is above the watermark being compared against, the pages can be handed out directly; that is the fast path.
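The shape of that comparison can be sketched as follows. This is a simplified model of what zone_watermark_ok()/zone_watermark_fast() decide, not the kernel's exact code (the real check also accounts for ALLOC_HIGH/ALLOC_HARDER adjustments and walks the per-order free lists):

/* Simplified model of the watermark test: after taking 1 << order pages,
 * the zone must still sit above the selected watermark plus the lowmem
 * reserve kept for the requesting zone index. */
static bool watermark_ok_sketch(unsigned long free_pages, unsigned int order,
                                unsigned long mark, unsigned long lowmem_reserve)
{
    long remaining = (long)free_pages - (1L << order);

    return remaining >= (long)(mark + lowmem_reserve);
}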
get_page_from_freelist
This function tries to allocate from the free page lists and is the fast path of page allocation.
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
                       const struct alloc_context *ac)
{
    struct zoneref *z;
    struct zone *zone;
    struct pglist_data *last_pgdat_dirty_limit = NULL;
    bool no_fallback;

retry:
    /*
     * Scan zonelist, looking for a zone with enough free.
     * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
     */
    no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
    z = ac->preferred_zoneref;
    for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
                                    ac->nodemask) {
        struct page *page;
        unsigned long mark;

        if (cpusets_enabled() &&
            (alloc_flags & ALLOC_CPUSET) &&
            !__cpuset_zone_allowed(zone, gfp_mask))
                continue;
        /*
         * When allocating a page cache page for writing, we
         * want to get it from a node that is within its dirty
         * limit, such that no single node holds more than its
         * proportional share of globally allowed dirty pages.
         * The dirty limits take into account the node's
         * lowmem reserves and high watermark so that kswapd
         * should be able to balance it without having to
         * write pages from its LRU list.
         *
         * XXX: For now, allow allocations to potentially
         * exceed the per-node dirty limit in the slowpath
         * (spread_dirty_pages unset) before going into reclaim,
         * which is important when on a NUMA setup the allowed
         * nodes are together not big enough to reach the
         * global limit. The proper fix for these situations
         * will require awareness of nodes in the
         * dirty-throttling and the flusher threads.
         */
        if (ac->spread_dirty_pages) {
            if (last_pgdat_dirty_limit == zone->zone_pgdat)
                continue;

            if (!node_dirty_ok(zone->zone_pgdat)) {
                last_pgdat_dirty_limit = zone->zone_pgdat;
                continue;
            }
        }

        if (no_fallback && nr_online_nodes > 1 &&
            zone != ac->preferred_zoneref->zone) {
            int local_nid;

            /*
             * If moving to a remote node, retry but allow
             * fragmenting fallbacks. Locality is more important
             * than fragmentation avoidance.
             */
            local_nid = zone_to_nid(ac->preferred_zoneref->zone);
            if (zone_to_nid(zone) != local_nid) {
                alloc_flags &= ~ALLOC_NOFRAGMENT;
                goto retry;
            }
        }

        mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
        if (!zone_watermark_fast(zone, order, mark,
                                 ac_classzone_idx(ac), alloc_flags)) {
            int ret;

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
            /*
             * Watermark failed for this zone, but see if we can
             * grow this zone if it contains deferred pages.
             */
            if (static_branch_unlikely(&deferred_pages)) {
                if (_deferred_grow_zone(zone, order))
                    goto try_this_zone;
            }
#endif
            /* Checked here to keep the fast path fast */
            BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
            if (alloc_flags & ALLOC_NO_WATERMARKS)
                goto try_this_zone;

            if (node_reclaim_mode == 0 ||
                !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
                continue;

            ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
            switch (ret) {
            case NODE_RECLAIM_NOSCAN:
                /* did not scan */
                continue;
            case NODE_RECLAIM_FULL:
                /* scanned but unreclaimable */
                continue;
            default:
                /* did we reclaim enough */
                if (zone_watermark_ok(zone, order, mark,
                                      ac_classzone_idx(ac), alloc_flags))
                    goto try_this_zone;

                continue;
            }
        }

try_this_zone:
        page = rmqueue(ac->preferred_zoneref->zone, zone, order,
                       gfp_mask, alloc_flags, ac->migratetype);
        if (page) {
            prep_new_page(page, order, gfp_mask, alloc_flags);

            /*
             * If this is a high-order atomic allocation then check
             * if the pageblock should be reserved for the future
             */
            if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
                reserve_highatomic_pageblock(page, zone, order);

            return page;
        } else {
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
            /* Try again if zone has deferred pages */
            if (static_branch_unlikely(&deferred_pages)) {
                if (_deferred_grow_zone(zone, order))
                    goto try_this_zone;
            }
#endif
        }
    }

    /*
     * It's possible on a UMA machine to get through all zones that are
     * fragmented. If avoiding fragmentation, reset and try again.
     */
    if (no_fallback) {
        alloc_flags &= ~ALLOC_NOFRAGMENT;
        goto retry;
    }

    return NULL;
}

The function walks the zones on the zonelist, trying to find one it can allocate from. It first performs a few checks (cpuset and dirty-limit constraints); if any of them fails, it simply continues to the next zone. wmark_pages() computes the zone's watermark according to whether min, low, or high is selected in alloc_flags; that watermark is then passed to zone_watermark_fast()/zone_watermark_ok() to decide whether the zone's free pages satisfy the water line (the check may be relaxed depending on how urgent the allocation is). Which of the high/low/min water lines is used is determined by the ALLOC_WMARK_xx flag in alloc_flags; as seen in __alloc_pages_nodemask(), the fast path uses the low watermark. If the watermark check fails, the node_reclaim_mode setting decides whether to reclaim from the node or simply skip this zone. After that comes the core of the allocation: rmqueue() is called to take pages from the buddy system.
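The ALLOC_WMARK_xx selection mentioned above comes from mm/internal.h. The flag values below are the long-standing definitions; the exact wmark_pages() body varies a little between kernel versions (newer kernels add a watermark_boost term), so treat the last comment as an approximation:

/* Watermark selection flags (mm/internal.h). */
#define ALLOC_WMARK_MIN     WMARK_MIN
#define ALLOC_WMARK_LOW     WMARK_LOW
#define ALLOC_WMARK_HIGH    WMARK_HIGH
#define ALLOC_NO_WATERMARKS 0x04    /* don't check watermarks at all */

/* Mask to extract the watermark bits from alloc_flags. */
#define ALLOC_WMARK_MASK    (ALLOC_NO_WATERMARKS - 1)

/* wmark_pages(zone, i) then picks that watermark of the zone: roughly
 * zone->_watermark[i] + zone->watermark_boost on kernels with watermark
 * boosting, or simply zone->watermark[i] on older ones. */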
rmqueue
The kernel treats order-0 requests differently from requests with order > 0. Modern processors easily have a dozen or more cores, while there are only a handful of zones; when many cores hit the same zone at the same time, a lot of time is inevitably burned contending for the zone lock. Community developers noticed that order-0 requests are by far the most frequent ones in the kernel, and an order-0 allocation covers only a single page, so a per-CPU "memory pool" was introduced to satisfy order-0 allocations, which to a large extent relieves the buddy system's contention on the zone lock.
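The pool itself is struct per_cpu_pages, reached through zone->pageset on each CPU. Its approximate layout in kernels of this era is shown below (field set recalled from memory, so verify against your tree):

/* Approximate layout of the per-CPU page pool (one instance per CPU per
 * zone). Order-0 pages are cached here, sorted into one list per
 * migratetype, so most single-page allocations never touch the zone lock. */
struct per_cpu_pages {
    int count;      /* number of pages currently in the lists    */
    int high;       /* drain back to the buddy above this count  */
    int batch;      /* chunk size for refilling / draining       */

    /* Lists of pages, one per migrate type stored on the pcp-lists. */
    struct list_head lists[MIGRATE_PCPTYPES];
};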
If order == 0, rmqueue_pcplist() is called:
static struct page *rmqueue_pcplist(...)
{
    /* Disable local interrupts and save the interrupt state
     * (interrupt context can allocate memory too). */
    local_irq_save(flags);
    /* Get the per_cpu_pages pointer of the target zone on the current CPU. */
    pcp = &this_cpu_ptr(zone->pageset)->pcp;
    /* Get the page list of the requested migratetype in per_cpu_pages. */
    list = &pcp->lists[migratetype];
    /* Take the target page off that list. */
    page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
    /* On success, update the zone's statistics. */
    if (page) {
        __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
        zone_statistics(preferred_zone, zone);
    }
    /* Restore interrupts. */
    local_irq_restore(flags);
    return page;
}

If order > 0, __rmqueue_smallest() loops over the free_list of each order, from small to large, until get_page_from_free_area() successfully takes the smallest suitable page block (matching both order and migratetype) off a list.
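For completeness, __rmqueue_smallest() is short enough to show; the body below reflects ~v5.x kernels and is quoted from memory, so treat it as an approximation:

static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
                                int migratetype)
{
    unsigned int current_order;
    struct free_area *area;
    struct page *page;

    /* Find a block of the appropriate size in the preferred list,
     * starting at the requested order and working upwards. */
    for (current_order = order; current_order < MAX_ORDER; ++current_order) {
        area = &(zone->free_area[current_order]);
        page = get_page_from_free_area(area, migratetype);
        if (!page)
            continue;
        del_page_from_free_area(page, area);
        /* Split the unused remainder of an oversized block back onto
         * the lower-order free lists. */
        expand(zone, page, order, current_order, area, migratetype);
        set_pcppage_migratetype(page, migratetype);
        return page;
    }

    return NULL;
}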
Slow path allocation (slowpath)
The fast path checks each zone against its low watermark; if every zone is below low, the fast path fails and allocation enters the slow path (slowpath), where reclaim has to be done.
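As a reminder of what the three per-zone watermarks mean for the allocator, here is a tiny conceptual classifier (illustration only; the real thresholds live in the zone's watermark array and are evaluated by zone_watermark_ok()):

/* Conceptual mapping from a zone's free-page count to allocator behaviour. */
enum wmark_state {
    WMARK_STATE_PLENTY,    /* free > high: kswapd sleeps                        */
    WMARK_STATE_KSWAPD,    /* low < free <= high: kswapd reclaims in background */
    WMARK_STATE_SLOWPATH,  /* min < free <= low: fast path fails, slowpath runs */
    WMARK_STATE_CRITICAL,  /* free <= min: reserves / direct reclaim only       */
};

static enum wmark_state classify_zone(unsigned long free, unsigned long min,
                                      unsigned long low, unsigned long high)
{
    if (free > high)
        return WMARK_STATE_PLENTY;
    if (free > low)
        return WMARK_STATE_KSWAPD;
    if (free > min)
        return WMARK_STATE_SLOWPATH;
    return WMARK_STATE_CRITICAL;
}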
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                       struct alloc_context *ac)
{
    bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
    const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
    struct page *page = NULL;
    unsigned int alloc_flags;
    unsigned long did_some_progress;
    enum compact_priority compact_priority;
    enum compact_result compact_result;
    int compaction_retries;
    int no_progress_loops;
    unsigned int cpuset_mems_cookie;
    int reserve_flags;

    /*
     * We also sanity check to catch abuse of atomic reserves being used by
     * callers that are not in atomic context.
     */
    if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
                     (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
        gfp_mask &= ~__GFP_ATOMIC;

retry_cpuset:
    compaction_retries = 0;
    no_progress_loops = 0;
    compact_priority = DEF_COMPACT_PRIORITY;
    cpuset_mems_cookie = read_mems_allowed_begin();

    /*
     * The fast path uses conservative alloc_flags to succeed only until
     * kswapd needs to be woken up, and to avoid the cost of setting up
     * alloc_flags precisely. So we do that now.
     */
    alloc_flags = gfp_to_alloc_flags(gfp_mask);

    /*
     * We need to recalculate the starting point for the zonelist iterator
     * because we might have used different nodemask in the fast path, or
     * there was a cpuset modification and we are retrying - otherwise we
     * could end up iterating over non-eligible zones endlessly.
     */
    ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                    ac->high_zoneidx, ac->nodemask);
    if (!ac->preferred_zoneref->zone)
        goto nopage;

    if (alloc_flags & ALLOC_KSWAPD)
        wake_all_kswapds(order, gfp_mask, ac);

    /*
     * The adjusted alloc_flags might result in immediate success, so try
     * that first
     */
    page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
    if (page)
        goto got_pg;

    /*
     * For costly allocations, try direct compaction first, as it's likely
     * that we have enough base pages and don't need to reclaim. For non-
     * movable high-order allocations, do that as well, as compaction will
     * try prevent permanent fragmentation by migrating from blocks of the
     * same migratetype.
     * Don't try this for allocations that are allowed to ignore
     * watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.
     */
    if (can_direct_reclaim &&
            (costly_order ||
               (order > 0 && ac->migratetype != MIGRATE_MOVABLE))
            && !gfp_pfmemalloc_allowed(gfp_mask)) {
        page = __alloc_pages_direct_compact(gfp_mask, order,
                                            alloc_flags, ac,
                                            INIT_COMPACT_PRIORITY,
                                            &compact_result);
        if (page)
            goto got_pg;

        if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
            !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
            /*
             * If allocating entire pageblock(s) and compaction
             * failed because all zones are below low watermarks
             * or is prohibited because it recently failed at this
             * order, fail immediately unless the allocator has
             * requested compaction and reclaim retry.
             *
             * Reclaim is
             * - potentially very expensive because zones are far
             *   below their low watermarks or this is part of very
             *   bursty high order allocations,
             * - not guaranteed to help because isolate_freepages()
             *   may not iterate over freed pages as part of its
             *   linear scan, and
             * - unlikely to make entire pageblocks free on its
             *   own.
             */
            if (compact_result == COMPACT_SKIPPED ||
                compact_result == COMPACT_DEFERRED)
                goto nopage;
        }

        /*
         * Checks for costly allocations with __GFP_NORETRY, which
         * includes THP page fault allocations
         */
        if (costly_order && (gfp_mask & __GFP_NORETRY)) {
            /*
             * If compaction is deferred for high-order allocations,
             * it is because sync compaction recently failed. If
             * this is the case and the caller requested a THP
             * allocation, we do not want to heavily disrupt the
             * system, so we fail the allocation instead of entering
             * direct reclaim.
             */
            if (compact_result == COMPACT_DEFERRED)
                goto nopage;

            /*
             * Looks like reclaim/compaction is worth trying, but
             * sync compaction could be very expensive, so keep
             * using async compaction.
             */
            compact_priority = INIT_COMPACT_PRIORITY;
        }
    }

retry:
    /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
    if (alloc_flags & ALLOC_KSWAPD)
        wake_all_kswapds(order, gfp_mask, ac);

    reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
    if (reserve_flags)
        alloc_flags = reserve_flags;

    /*
     * Reset the nodemask and zonelist iterators if memory policies can be
     * ignored. These allocations are high priority and system rather than
     * user oriented.
     */
    if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
        ac->nodemask = NULL;
        ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                        ac->high_zoneidx, ac->nodemask);
    }

    /* Attempt with potentially adjusted zonelist and alloc_flags */
    page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
    if (page)
        goto got_pg;

    /* Caller is not willing to reclaim, we can't balance anything */
    if (!can_direct_reclaim)
        goto nopage;

    /* Avoid recursion of direct reclaim */
    if (current->flags & PF_MEMALLOC)
        goto nopage;

    /* Try direct reclaim and then allocating */
    page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
                                        &did_some_progress);
    if (page)
        goto got_pg;

    /* Try direct compaction and then allocating */
    page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
                                        compact_priority, &compact_result);
    if (page)
        goto got_pg;

    /* Do not loop if specifically requested */
    if (gfp_mask & __GFP_NORETRY)
        goto nopage;

    /*
     * Do not retry costly high order allocations unless they are
     * __GFP_RETRY_MAYFAIL
     */
    if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
        goto nopage;

    if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                             did_some_progress > 0, &no_progress_loops))
        goto retry;

    /*
     * It doesn't make any sense to retry for the compaction if the order-0
     * reclaim is not able to make any progress because the current
     * implementation of the compaction depends on the sufficient amount
     * of free memory (see __compaction_suitable)
     */
    if (did_some_progress > 0 &&
            should_compact_retry(ac, order, alloc_flags,
                                 compact_result, &compact_priority,
                                 &compaction_retries))
        goto retry;

    /* Deal with possible cpuset update races before we start OOM killing */
    if (check_retry_cpuset(cpuset_mems_cookie, ac))
        goto retry_cpuset;

    /* Reclaim has failed us, start killing things */
    page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
    if (page)
        goto got_pg;

    /* Avoid allocations with no watermarks from looping endlessly */
    if (tsk_is_oom_victim(current) &&
        (alloc_flags == ALLOC_OOM ||
         (gfp_mask & __GFP_NOMEMALLOC)))
        goto nopage;

    /* Retry as long as the OOM killer is making progress */
    if (did_some_progress) {
        no_progress_loops = 0;
        goto retry;
    }

nopage:
    /* Deal with possible cpuset update races before we fail */
    if (check_retry_cpuset(cpuset_mems_cookie, ac))
        goto retry_cpuset;

    /*
     * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
     * we always retry
     */
    if (gfp_mask & __GFP_NOFAIL) {
        /*
         * All existing users of the __GFP_NOFAIL are blockable, so warn
         * of any new users that actually require GFP_NOWAIT
         */
        if (WARN_ON_ONCE(!can_direct_reclaim))
            goto fail;

        /*
         * PF_MEMALLOC request from this context is rather bizarre
         * because we cannot reclaim anything and only can loop waiting
         * for somebody to do a work for us
         */
        WARN_ON_ONCE(current->flags & PF_MEMALLOC);

        /*
         * non failing costly orders are a hard requirement which we
         * are not prepared for much so let's warn about these users
         * so that we can identify them and convert them to something
         * else.
         */
        WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);

        /*
         * Help non-failing allocations by giving them access to memory
         * reserves but do not use ALLOC_NO_WATERMARKS because this
         * could deplete whole memory reserves which would just make
         * the situation worse
         */
        page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
        if (page)
            goto got_pg;

        cond_resched();
        goto retry;
    }
fail:
    warn_alloc(gfp_mask, ac->nodemask,
               "page allocation failure: order:%u", order);
got_pg:
    return page;
}

The slow path first adjusts the allocation flags from gfp_mask with gfp_to_alloc_flags(), then recomputes the preferred zone with first_zones_zonelist(); because the fast path may have used a different nodemask, or the cpuset may have been modified and we are retrying, the preferred zone must be recalculated to avoid iterating endlessly over non-eligible zones. If alloc_flags has ALLOC_KSWAPD set, the kswapd kernel threads are woken via wake_all_kswapds(). The first slow-path attempt then uses the adjusted flags, again through get_page_from_freelist(). If that fails, and the request allows direct reclaim (can_direct_reclaim), is a costly or non-movable high-order allocation, and is not a pfmemalloc-style request, a round of direct memory compaction is performed and allocation is attempted again (__alloc_pages_direct_compact).
Inside the retry loop, kswapd is woken again (to keep it from going back to sleep unexpectedly), the zonelist iterator is adjusted, and get_page_from_freelist() is tried once more; if that allocation fails and direct reclaim is not allowed, control jumps to "nopage". __alloc_pages_direct_reclaim() tries to allocate after direct reclaim, and __alloc_pages_direct_compact() performs a second round of direct compaction followed by allocation. should_reclaim_retry() decides whether reclaim is worth retrying and, if so, jumps back to "retry". If gfp_mask carries __GFP_NORETRY, or the allocation is costly and __GFP_RETRY_MAYFAIL is not set, there is no retry and control jumps straight to "nopage". should_compact_retry() decides whether compaction is worth retrying and jumps back to "retry"; check_retry_cpuset() jumps back to the very beginning, "retry_cpuset", if it detects a race caused by a cpuset change. If reclaim has failed, __alloc_pages_may_oom() tries to OOM-kill some processes to free memory. If the current task is itself being killed because of OOM, control moves to "nopage".
Finally, at "nopage": if gfp_mask carries __GFP_NOFAIL, the function keeps retrying until a page is obtained; without that flag, the allocation has failed and NULL is returned. __alloc_pages_cpuset_fallback() is used with the ALLOC_HARDER flag and, if the allowed nodes are exhausted, falls back to ignoring the cpuset restrictions.
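Condensing all of the above, the control flow of the slow path can be outlined as follows (labels mirror the gotos in the source quoted above; this is an outline, not compilable code):

/*
 *   gfp_to_alloc_flags()                 -- recompute alloc_flags
 *   wake_all_kswapds()                   -- if ALLOC_KSWAPD
 *   get_page_from_freelist()             -- retry with adjusted flags -> got_pg
 *   __alloc_pages_direct_compact()       -- costly/non-movable orders -> got_pg
 * retry:
 *   wake_all_kswapds()                   -- keep kswapd awake while looping
 *   get_page_from_freelist()             -> got_pg
 *   __alloc_pages_direct_reclaim()       -> got_pg
 *   __alloc_pages_direct_compact()       -> got_pg
 *   should_reclaim_retry() / should_compact_retry()  -> goto retry
 *   __alloc_pages_may_oom()              -> got_pg or goto retry
 * nopage:
 *   __GFP_NOFAIL ? retry with ALLOC_HARDER : warn_alloc() and return NULL
 */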
Summary
To sum up, __alloc_pages_nodemask() first prepares the allocation context, then tries the fast path via get_page_from_freelist(), which only succeeds while zones stay above their low watermark (order-0 requests are served from the per-CPU page lists, larger orders from the buddy free lists). If every zone is below low, __alloc_pages_slowpath() takes over, waking kswapd and escalating through direct reclaim, direct compaction and, as a last resort, the OOM killer.