mmap Source Code Analysis

Published: 2024/4/18

  • Preparation
  • Function prototype
  • Virtual memory area management
  • The Memory Descriptor
  • Virtual Memory Area
  • mmap mapping flow
  • Source code analysis
  • do_mmap()
  • mmap_region()
  • Anonymous mappings
  • Summary
  • Q&A

    Preparation

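    Kernel version: 4.20.1

    In the previous article, on how to steadily saturate disk I/O bandwidth when writing files on Linux, we used mmap to keep file writes at full disk I/O throughput. This article looks at mmap() in detail and walks through its source. Although we only used mmap() to map a file into memory, its design and implementation reach into the kernel's virtual address space management and memory mapping machinery.

    Function prototype

    void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);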
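    As a minimal user-space sketch of this prototype, the helper below writes a message to a temporary file, maps the file read-only, and compares the mapped bytes with the message. The function name map_and_compare() and the /tmp path template are assumptions made for this example only.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: write `msg` to a temp file, map the file read-only,
 * and compare the mapping with `msg`. Returns 0 on a match, -1 otherwise. */
int map_and_compare(const char *msg)
{
    char path[] = "/tmp/mmap_demo_XXXXXX";
    size_t len = strlen(msg);
    int rc = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                        /* the file disappears once fd is closed */

    if (len > 0 && write(fd, msg, len) == (ssize_t)len) {
        /* addr = NULL lets the kernel choose the address; offset 0 = file start */
        void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            rc = (memcmp(p, msg, len) == 0) ? 0 : -1;
            munmap(p, len);
        }
    }
    close(fd);
    return rc;
}
```

    Calling map_and_compare("hello mmap") exercises every argument of the prototype except a non-zero offset.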

    This is the prototype that user space sees. Inside the kernel, the system call is served by the following function in mm/mmap.c:

    unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
                                  unsigned long prot, unsigned long flags,
                                  unsigned long fd, unsigned long pgoff);
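    Note that the kernel-side interface takes pgoff, an offset counted in pages, not bytes. One visible consequence on Linux is that the byte offset passed to mmap() must be a multiple of the page size. The sketch below probes this from user space; try_map_at_offset() is a hypothetical helper, not a kernel or libc function.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one page of a 4-page temp file at byte offset `off`.
 * Returns 0 on success, otherwise the errno value left by the failing call. */
int try_map_at_offset(off_t off)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char path[] = "/tmp/mmap_off_XXXXXX";
    int rc = 0;
    int fd = mkstemp(path);

    if (fd < 0)
        return errno;
    unlink(path);

    if (ftruncate(fd, (off_t)(4 * page)) != 0) {
        rc = errno;
    } else {
        void *p = mmap(NULL, page, PROT_READ, MAP_PRIVATE, fd, off);
        if (p == MAP_FAILED)
            rc = errno;                  /* EINVAL for an unaligned offset */
        else
            munmap(p, page);
    }
    close(fd);
    return rc;
}
```

    A page-aligned offset such as 0 or one page size should succeed, while an offset of 1 should be rejected with EINVAL.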

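    Virtual memory area management

    We first introduce two data structures related to virtual memory. Material on the concept of virtual memory is plentiful, so here we look at it from the kernel's point of view. Virtual address space is managed per process: every process has its own user address space, while the kernel part of the address space is shared by all processes. A process's virtual address space is described mainly by two structures: mm_struct (the memory descriptor) and vm_area_struct (the virtual memory area descriptor).

    The Memory Descriptor

    mm_struct holds all the information about a process's virtual address space. It is defined in include/linux/mm_types.h: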

    struct mm_struct {
        struct {
            struct vm_area_struct *mmap;  /* list of vm_area_struct */
            pgd_t *pgd;                   /* the process's page directory */
            /* ... */
            int map_count;                /* number of vm_area_struct */
            /* ... */
            unsigned long total_vm;       /* number of pages mapped */
            /* ... */
            /* start and end of the code segment and of the data segment */
            unsigned long start_code, end_code, start_data, end_data;
            /* start and end of the heap; the stack, by its nature, only has a start */
            unsigned long start_brk, brk, start_stack;
            /* start and end of the argument area and of the environment area */
            unsigned long arg_start, arg_end, env_start, env_end;
            /* ... */
        };
    };

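    Reading mm_struct alongside the figure below, the typical virtual address space layout of a 32-bit system (from "Computer Systems: A Programmer's Perspective"), makes it easier to picture:

    Virtual Memory Area

    vm_area_struct describes one range of the virtual address space; a process's address space may contain many such ranges. vm_area_struct is also defined in include/linux/mm_types.h: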

    /*
     * This struct defines a memory VMM memory area. There is one of these
     * per VM-area/task. A VM area is any part of the process virtual memory
     * space that has a special rule for the page-fault handlers (ie a shared
     * library, the executable area etc).
     */
    struct vm_area_struct {
        /* The first cache line has the info for VMA tree walking. */

        unsigned long vm_start;    /* start address within the virtual address space */
        unsigned long vm_end;      /* end address within the virtual address space */

        /* linked list of VM areas per task, sorted by address */
        struct vm_area_struct *vm_next, *vm_prev;  /* next and previous VMA in the list */

        struct rb_node vm_rb;

        /*
         * Largest free memory gap in bytes to the left of this VMA.
         * Either between this VMA and vma->vm_prev, or between one of the
         * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
         * get_unmapped_area find a free area of the right size.
         */
        unsigned long rb_subtree_gap;

        /* Second cache line starts here. */

        /* Function pointers to deal with this struct. */
        const struct vm_operations_struct *vm_ops;  /* operations on this VMA */

        struct mm_struct *vm_mm;   /* the address space this VMA belongs to */
        pgprot_t vm_page_prot;     /* Access permissions of this VMA. */
        unsigned long vm_flags;    /* Flags, see mm.h. */

        unsigned long vm_pgoff;    /* offset within the file, in pages */
        struct file *vm_file;      /* the mapped file; NULL for an anonymous mapping */
        /* ... */
    };

    The figure below shows a simplified virtual memory layout of a process and how these data structures relate to each other:
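    On Linux, the VMAs of a process can be observed from user space through /proc/self/maps, where each line starts with the vm_start-vm_end range of one area. The sketch below is only an illustration; the two function names are made up for this example. It checks whether an address falls inside one of the listed ranges.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: 1 if `addr` lies in a range listed in /proc/self/maps,
 * 0 if it does not, -1 if the file cannot be opened. */
int addr_in_proc_maps(const void *addr)
{
    unsigned long start, end, a = (unsigned long)(uintptr_t)addr;
    char line[512];
    int found = 0;
    FILE *f = fopen("/proc/self/maps", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        /* each line begins with "start-end" in hexadecimal */
        if (sscanf(line, "%lx-%lx", &start, &end) == 2 &&
            a >= start && a < end) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}

/* Map one anonymous page and check that a VMA now covers it. */
int fresh_mapping_is_listed(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int listed;

    if (p == MAP_FAILED)
        return -1;
    listed = addr_in_proc_maps(p);
    munmap(p, page);
    return listed;
}
```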

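    mmap mapping flow

    • Check the arguments and set the VMA flags according to the requested mapping type.
    • Search the process's virtual address space for a free range that satisfies the request.
    • Initialize a VMA for the range that was found.
    • Set vma->vm_file.
    • Depending on the filesystem, set vma->vm_ops to the matching vm_operations_struct.
    • Insert the VMA into the mm's list.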
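    The search and insert steps above can be sketched with a toy user-space model. Everything here (toy_vma, toy_mm, toy_mmap) is hypothetical and heavily simplified: it keeps the areas in a sorted array and searches bottom-up for the first gap, whereas the kernel keeps a list plus an rbtree and uses get_unmapped_area() with an architecture-specific search policy.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_VMAS 16

struct toy_vma { unsigned long start, end; };   /* covers [start, end) */
struct toy_mm {
    struct toy_vma vmas[TOY_MAX_VMAS];          /* sorted by start, no overlap */
    int map_count;
};

/* Find the lowest gap of `len` bytes at or above `base`, insert a new area
 * there and return its start address. Returns 0 on failure. */
unsigned long toy_mmap(struct toy_mm *mm, unsigned long base, unsigned long len)
{
    unsigned long addr = base;
    int i, j;

    assert(mm != NULL);
    if (len == 0 || mm->map_count >= TOY_MAX_VMAS)
        return 0;

    for (i = 0; i < mm->map_count; i++) {
        if (mm->vmas[i].end <= addr)
            continue;                    /* area lies below the candidate */
        if (mm->vmas[i].start >= addr && mm->vmas[i].start - addr >= len)
            break;                       /* the gap before vmas[i] is big enough */
        addr = mm->vmas[i].end;          /* retry right after this area */
    }

    for (j = mm->map_count; j > i; j--)  /* keep the array sorted */
        mm->vmas[j] = mm->vmas[j - 1];
    mm->vmas[i].start = addr;
    mm->vmas[i].end = addr + len;
    mm->map_count++;
    return addr;
}
```

    With areas at [0x1000, 0x4000) and [0x8000, 0x9000), a request for 0x4000 bytes fits exactly into the hole between them, while a request for 0x5000 bytes has to go after the last area.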

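    Source code analysis

    We now turn to the mmap code itself:

    do_mmap()

    do_mmap() does the real work behind mmap(). We skip the system call entry and go straight to the implementation: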

    unsigned long do_mmap(struct file *file, unsigned long addr,
                          unsigned long len, unsigned long prot,
                          unsigned long flags, vm_flags_t vm_flags,
                          unsigned long pgoff, unsigned long *populate,
                          struct list_head *uf)
    {
        struct mm_struct *mm = current->mm; /* the calling process's memory descriptor */
        int pkey = 0;

        *populate = 0;

        /* Validate the arguments; any invalid argument returns an errno. */
        if (!len)
            return -EINVAL;

        /*
         * Does the application expect PROT_READ to imply PROT_EXEC?
         *
         * (the exception is when the underlying filesystem is noexec
         * mounted, in which case we dont add PROT_EXEC.)
         */
        if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
            if (!(file && path_noexec(&file->f_path)))
                prot |= PROT_EXEC;

        /* force arch specific MAP_FIXED handling in get_unmapped_area */
        if (flags & MAP_FIXED_NOREPLACE)
            flags |= MAP_FIXED;

        /*
         * Without MAP_FIXED the address is only a hint and may be changed:
         * if addr is below mmap_min_addr, use the page-aligned mmap_min_addr.
         */
        if (!(flags & MAP_FIXED))
            addr = round_hint_to_min(addr);

        /* Careful about overflows.. */
        /* Round len up to a multiple of the page size. */
        len = PAGE_ALIGN(len);
        if (!len)
            return -ENOMEM;

        /* offset overflow? */
        if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
            return -EOVERFLOW;

        /* Too many mappings? */
        /* Has the process exceeded the limit on the number of VMAs? */
        if (mm->map_count > sysctl_max_map_count)
            return -ENOMEM;

        /* Obtain the address to map to. we verify (or select) it and ensure
         * that it represents a valid section of the address space.
         */
        /* get_unmapped_area() returns the start of an unmapped range in user space. */
        addr = get_unmapped_area(file, addr, len, pgoff, flags);
        /* A value that is not page aligned is an error code: return it. */
        if (offset_in_page(addr))
            return addr;

        /*
         * With MAP_FIXED_NOREPLACE, look for an existing VMA that overlaps
         * [addr, addr + len). If there is one, return -EEXIST, as the flag requires.
         */
        if (flags & MAP_FIXED_NOREPLACE) {
            struct vm_area_struct *vma = find_vma(mm, addr);

            if (vma && vma->vm_start < addr + len)
                return -EEXIST;
        }

        if (prot == PROT_EXEC) {
            pkey = execute_only_pkey(mm);
            if (pkey < 0)
                pkey = 0;
        }

        /* Do simple checking here so the lower-level routines won't have
         * to. we assume access permissions have been handled by the open
         * of the memory object, so we don't do any here.
         */
        vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
                mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

        /*
         * MAP_LOCKED locks the new range in memory, much like mlock();
         * check that the caller is allowed to do so.
         */
        if (flags & MAP_LOCKED)
            if (!can_do_mlock())
                return -EPERM;

        if (mlock_future_check(mm, vm_flags, len))
            return -EAGAIN;

        if (file) { /* file is not NULL: a file-backed mapping */
            struct inode *inode = file_inode(file); /* the file's inode */
            unsigned long flags_mask;

            if (!file_mmap_ok(file, inode, pgoff, len))
                return -EOVERFLOW;

            flags_mask = LEGACY_MAP_MASK | file->f_op->mmap_supported_flags;

            /*
             * ...
             * Depending on the mapping type given in flags, take the file's
             * access mode into account. For a shared writable mapping, check
             * that the file was opened for writing and not in append mode,
             * and that no mandatory lock is held on it. For every mapping
             * type, check that the file was opened for reading.
             * ...
             */
        } else {
            switch (flags & MAP_TYPE) {
            case MAP_SHARED:
                if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
                    return -EINVAL;
                /*
                 * Ignore pgoff.
                 */
                pgoff = 0;
                vm_flags |= VM_SHARED | VM_MAYSHARE;
                break;
            case MAP_PRIVATE:
                /*
                 * Set pgoff according to addr for anon_vma.
                 */
                pgoff = addr >> PAGE_SHIFT;
                break;
            default:
                return -EINVAL;
            }
        }

        /*
         * Set 'VM_NORESERVE' if we should not account for the
         * memory use of this mapping.
         */
        if (flags & MAP_NORESERVE) {
            /* We honor MAP_NORESERVE if allowed to overcommit */
            if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
                vm_flags |= VM_NORESERVE;

            /* hugetlb applies strict overcommit unless MAP_NORESERVE */
            if (file && is_file_hugepages(file))
                vm_flags |= VM_NORESERVE;
        }

        addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
        if (!IS_ERR_VALUE(addr) &&
            ((vm_flags & VM_LOCKED) ||
             (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
            *populate = len;
        return addr;
    }
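    Two of the checks in do_mmap() are pure arithmetic and easy to try in user space. The sketch below re-creates them under the assumption of 4096-byte pages; the TOY_ names are invented for the example and are not the kernel macros. It shows why a second `if (!len)` test follows PAGE_ALIGN(): a huge length wraps around to 0 when rounded up.

```c
#include <assert.h>
#include <limits.h>

#define TOY_PAGE_SHIFT 12UL                       /* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* Round len up to a multiple of the page size, as PAGE_ALIGN() does.
 * Unsigned arithmetic wraps, so a length close to ULONG_MAX becomes 0. */
unsigned long toy_page_align(unsigned long len)
{
    return (len + TOY_PAGE_SIZE - 1) & ~(TOY_PAGE_SIZE - 1);
}

/* Mirrors the test (pgoff + (len >> PAGE_SHIFT)) < pgoff: true when the
 * last page offset of the mapping no longer fits in an unsigned long. */
int toy_pgoff_overflows(unsigned long pgoff, unsigned long len)
{
    return pgoff + (len >> TOY_PAGE_SHIFT) < pgoff;
}
```

    For example, toy_page_align(7000) is 8192, and toy_page_align(ULONG_MAX) is 0, which is the case do_mmap() turns into -ENOMEM.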

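    mmap_region()

    do_mmap() runs a series of checks on the user-supplied arguments and derives vm_flags, the flags of the future vm_area_struct, from them. mmap_region() then creates the virtual memory area itself; for a file mapping it ties the file to the VMA with vma->vm_file = get_file(file):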

    unsigned long mmap_region(struct file *file, unsigned long addr,
            unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
            struct list_head *uf)
    {
        struct mm_struct *mm = current->mm; /* the calling process's memory descriptor */
        struct vm_area_struct *vma, *prev;
        int error;
        struct rb_node **rb_link, *rb_parent;
        unsigned long charged = 0;

        /* Check against address space limit. */
        /* Would the requested range exceed the address space limit? */
        if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
            unsigned long nr_pages;

            /*
             * MAP_FIXED may remove pages of mappings that intersects with
             * requested mapping. Account for the pages it would unmap.
             */
            nr_pages = count_vma_pages_range(mm, addr, addr + len);

            if (!may_expand_vm(mm, vm_flags,
                        (len >> PAGE_SHIFT) - nr_pages))
                return -ENOMEM;
        }

        /* If [addr, addr + len) overlaps existing mappings, unmap them first. */
        while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,
                      &rb_parent)) {
            if (do_munmap(mm, addr, len, uf))
                return -ENOMEM;
        }

        /*
         * Private writable mapping: check memory availability
         */
        if (accountable_mapping(file, vm_flags)) {
            charged = len >> PAGE_SHIFT;
            if (security_vm_enough_memory_mm(mm, charged))
                return -ENOMEM;
            vm_flags |= VM_ACCOUNT;
        }

        /* Can [addr, addr + len) be merged into a neighbouring VMA? */
        vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
                NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
        if (vma) /* merged: use the merged VMA and jump to out */
            goto out;

        /*
         * Determine the object being mapped and call the appropriate
         * specific mapper. the address has already been validated, but
         * not unmapped, but the maps are removed from the list.
         */
        /* No merge was possible: allocate a new VMA for this memory descriptor. */
        vma = vm_area_alloc(mm);
        if (!vma) {
            error = -ENOMEM;
            goto unacct_error;
        }

        /* Initialize the VMA. */
        vma->vm_start = addr;
        vma->vm_end = addr + len;
        vma->vm_flags = vm_flags;
        vma->vm_page_prot = vm_get_page_prot(vm_flags);
        vma->vm_pgoff = pgoff;

        if (file) { /* file-backed mapping */
            if (vm_flags & VM_DENYWRITE) {
                /* The file must not be written: deny_write_access(file) blocks regular writers. */
                error = deny_write_access(file);
                if (error)
                    goto free_vma;
            }
            if (vm_flags & VM_SHARED) {
                /* Shared mapping, visible to other processes: account a writable mapping on the file. */
                error = mapping_map_writable(file->f_mapping);
                if (error)
                    goto allow_write_and_free_vma;
            }

            /* ->mmap() can change vma->vm_file, but must guarantee that
             * vma_link() below can deny write-access if VM_DENYWRITE is set
             * and map writably if VM_SHARED is set. This usually means the
             * new file must not have been exposed to user-space, yet.
             */
            vma->vm_file = get_file(file); /* take a reference on the file and store it in the VMA */
            error = call_mmap(file, vma);  /* call the filesystem's mmap method, discussed below */
            if (error)
                goto unmap_and_free_vma;

            /* Can addr have changed??
             *
             * Answer: Yes, several device drivers can do it in their
             *         f_op->mmap method. -DaveM
             * Bug: If addr is changed, prev, rb_link, rb_parent should
             *      be updated for vma_link()
             */
            WARN_ON_ONCE(addr != vma->vm_start);

            addr = vma->vm_start;
            vm_flags = vma->vm_flags;
        } else if (vm_flags & VM_SHARED) {
            /*
             * VM_SHARED without a file: shmem_zero_setup() backs the
             * mapping with a shmem file named dev/zero.
             */
            error = shmem_zero_setup(vma);
            if (error)
                goto free_vma;
        } else {
            /* Neither a file nor VM_SHARED: a private anonymous mapping. */
            vma_set_anonymous(vma);
        }

        /* Link the new VMA into the mm's VMA list. */
        vma_link(mm, vma, prev, rb_link, rb_parent);
        /* Once vma denies write, undo our temporary denial count */
        if (file) {
            if (vm_flags & VM_SHARED)
                mapping_unmap_writable(file->f_mapping);
            if (vm_flags & VM_DENYWRITE)
                allow_write_access(file);
        }
        file = vma->vm_file;
    out:
        perf_event_mmap(vma);

        /* Update the statistics of the process's address space. */
        vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
        if (vm_flags & VM_LOCKED) {
            if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
                        is_vm_hugetlb_page(vma) ||
                        vma == get_gate_vma(current->mm))
                vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
            else
                mm->locked_vm += (len >> PAGE_SHIFT);
        }

        if (file)
            uprobe_mmap(vma);

        /*
         * New (or expanded) vma always get soft dirty status.
         * Otherwise user-space soft-dirty page tracker won't
         * be able to distinguish situation when vma area unmapped,
         * then new mapped in-place (which must be aimed as
         * a completely new data area).
         */
        vma->vm_flags |= VM_SOFTDIRTY;

        vma_set_page_prot(vma);

        return addr;

    unmap_and_free_vma:
        vma->vm_file = NULL;
        fput(file);

        /* Undo any partial mapping done by a device driver. */
        unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
        charged = 0;
        if (vm_flags & VM_SHARED)
            mapping_unmap_writable(file->f_mapping);
    allow_write_and_free_vma:
        if (vm_flags & VM_DENYWRITE)
            allow_write_access(file);
    free_vma:
        vm_area_free(vma);
    unacct_error:
        if (charged)
            vm_unacct_memory(charged);
        return error;
    }
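    The find_vma_links()/do_munmap() loop near the top of mmap_region() is what gives MAP_FIXED its replace-in-place behaviour. A small user-space sketch (the function name is made up for this example) maps a page, fills it with a pattern, then maps a new anonymous page at the same address with MAP_FIXED and checks that the old contents are gone.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if a MAP_FIXED mapping replaced the old page with a fresh,
 * zero-filled one at the same address, -1 otherwise. */
int map_fixed_replaces_old_mapping(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *old, *fresh;
    int ok;

    old = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (old == MAP_FAILED)
        return -1;
    memset(old, 0xAB, len);              /* make the old contents recognisable */

    /* Same address, MAP_FIXED: the overlapping mapping is unmapped first. */
    fresh = mmap(old, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (fresh == MAP_FAILED) {
        munmap(old, len);
        return -1;
    }

    ok = (fresh == old) && fresh[0] == 0 && fresh[len - 1] == 0;
    munmap(fresh, len);
    return ok ? 0 : -1;
}
```

    This is also why MAP_FIXED is dangerous in ordinary code: it silently discards whatever was mapped in the range, which is the problem MAP_FIXED_NOREPLACE addresses in do_mmap().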

    mmap_region() calls call_mmap(file, vma), which invokes the mmap() method that the file's filesystem provides. We pick ext4, which is in common use today.

    ext4_file_mmap() is the mmap method of ext4. It does very little: it updates the file's access time with file_accessed(file) and assigns the matching operations to vma->vm_ops:

    static const struct vm_operations_struct ext4_file_vm_ops = {
        .fault        = ext4_filemap_fault,
        .map_pages    = filemap_map_pages,
        .page_mkwrite = ext4_page_mkwrite,
    };

    static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
    {
        struct inode *inode = file->f_mapping->host;

        if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
            return -EIO;

        /*
         * We don't support synchronous mappings for non-DAX files. At least
         * until someone comes with a sensible use case.
         */
        if (!IS_DAX(file_inode(file)) && (vma->vm_flags & VM_SYNC))
            return -EOPNOTSUPP;

        file_accessed(file);
        if (IS_DAX(file_inode(file))) {
            vma->vm_ops = &ext4_dax_vm_ops;
            vma->vm_flags |= VM_HUGEPAGE;
        } else {
            vma->vm_ops = &ext4_file_vm_ops;
        }
        return 0;
    }

    What the three operations do:

    • .fault: handles a page fault
    • .map_pages: maps pages that are already in the page cache into the process
    • .page_mkwrite: called when a read-only page is about to become writable

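    The source shows that a call to mmap() merely allocates a vm_area_struct that ties the file to a range of virtual addresses; it does not map virtual memory to physical memory. Unless MAP_POPULATE is set, Linux allocates no physical memory for the process at mmap() time. Only when the process actually touches the range and the data turns out not to be in physical memory does a page fault fire, and only then does Linux bring the missing page into memory. A later article will cover Linux page-fault handling and demand paging.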
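    This lazy allocation can be watched from user space with mincore(), which reports which pages of a range are resident. The sketch below is only an illustration with made-up function names: it maps DEMO_PAGES anonymous pages, counts the resident ones, touches every page, and counts again. On a typical Linux system one would expect the first count to be 0, but that depends on the environment, so the code only reports it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define DEMO_PAGES 8

/* Number of resident pages among DEMO_PAGES pages at addr, or -1 on error. */
static int resident_pages(void *addr, size_t page)
{
    unsigned char vec[DEMO_PAGES];
    int i, n = 0;

    if (mincore(addr, DEMO_PAGES * page, vec) != 0)
        return -1;
    for (i = 0; i < DEMO_PAGES; i++)
        n += vec[i] & 1;                 /* bit 0 set means the page is resident */
    return n;
}

/* Stores the resident page count before and after touching the mapping.
 * Returns 0 on success, -1 on error. */
int lazy_allocation_demo(int *before, int *after)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, DEMO_PAGES * page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int i;

    if (p == MAP_FAILED)
        return -1;
    *before = resident_pages(p, page);
    for (i = 0; i < DEMO_PAGES; i++)
        p[(size_t)i * page] = 1;         /* first write to a page faults it in */
    *after = resident_pages(p, page);
    munmap(p, DEMO_PAGES * page);
    return (*before < 0 || *after < 0) ? -1 : 0;
}
```

    Adding MAP_POPULATE to the mmap() flags would ask the kernel to fault the pages in up front, which corresponds to the *populate = len path at the end of do_mmap().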

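    Anonymous mappings

    Passing MAP_ANONYMOUS to mmap() requests an anonymous mapping, that is, a mapping for which the caller names no file or device. As mmap_region() shows, a shared anonymous mapping is set up by shmem_zero_setup() on a dev/zero shmem file, while a private anonymous mapping has no file at all (vma_set_anonymous()). Physical memory for anonymous pages is typically what backs a process's stack and heap.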
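    The difference between the two anonymous branches of mmap_region() is visible across fork(). In the sketch below (the function name is an assumption of this example), a child writes 42 into an anonymous mapping: with MAP_SHARED the parent sees the value, with MAP_PRIVATE the write stays in the child's copy-on-write page.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* map_type is MAP_SHARED or MAP_PRIVATE. Returns the value the parent reads
 * after the child stored 42 into the mapping, or -1 on error. */
int anon_value_after_child_write(int map_type)
{
    int status, value;
    pid_t pid;
    int *cell = mmap(NULL, sizeof *cell, PROT_READ | PROT_WRITE,
                     map_type | MAP_ANONYMOUS, -1, 0);

    if (cell == MAP_FAILED)
        return -1;
    *cell = 0;

    pid = fork();
    if (pid < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }
    if (pid == 0) {                      /* child: write and exit immediately */
        *cell = 42;
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }
    value = *cell;
    munmap(cell, sizeof *cell);
    return value;
}
```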

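    Summary

    The usual read() first brings the file's data into the kernel page cache and then copies it from kernel memory into user memory. mmap instead maps the file directly into the virtual address space, so the process reads the kernel's physical pages in place through the MMU's address translation, which saves the copy from kernel space to user space. Because stores to a MAP_SHARED file mapping land directly in page-cache pages that belong to the kernel, killing the process with kill -9 does not lose data that was already written (the kernel still has to write those pages back, so this does not cover a power failure or a kernel crash).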
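    The kill -9 claim can be checked with a small experiment, sketched below with a made-up function name. A child maps a temporary file with MAP_SHARED, stores a few bytes, and is killed by SIGKILL without calling msync() or munmap(). The parent then reads the file with pread().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 0 if data stored through a MAP_SHARED mapping is still readable
 * from the file after the writer was killed with SIGKILL, -1 otherwise. */
int write_survives_sigkill(void)
{
    char path[] = "/tmp/mmap_kill_XXXXXX";
    char buf[8] = {0};
    int status, fd = mkstemp(path);
    ssize_t n;
    pid_t pid;

    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, 4096) != 0 || (pid = fork()) < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            _exit(1);
        memcpy(p, "durable", 7);         /* lands in the page cache */
        raise(SIGKILL);                  /* die without msync() or munmap() */
        _exit(2);                        /* not reached */
    }
    if (waitpid(pid, &status, 0) < 0) {
        close(fd);
        return -1;
    }
    n = pread(fd, buf, 7, 0);
    close(fd);
    if (!WIFSIGNALED(status) || WTERMSIG(status) != SIGKILL)
        return -1;
    return (n == 7 && memcmp(buf, "durable", 7) == 0) ? 0 : -1;
}
```

    The same experiment with MAP_PRIVATE would not leave the bytes in the file, since private stores go to copy-on-write pages that are never written back.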

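    Q&A

    • How does a vm_area_struct find the physical pages behind it?

      vm_area_struct has no member that points directly at pages. It does hold the start and end virtual addresses, vm_start and vm_end, and the page behind any address in that range is found through ordinary virtual-to-physical translation, that is, by walking the page tables rooted at mm->pgd.

    • How are files of varying length handled?

      RocksDB writes files through mmap. It first uses fallocate to reserve a file of fixed length len and maps it with mmap, then advances a base pointer as the write position. Once len bytes have been written it calls munmap. If the file is closed before len bytes have been written, it calls munmap and then uses ftruncate() to cut off the part of the file that was never written.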

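    • SIGBUS from memcpy() after mmap():

      SIGBUS is raised while the page fault is being handled, in the .fault handler of the ext4_file_vm_ops mentioned earlier, ext4_filemap_fault(), which calls the generic filemap_fault(). do_mmap() contains the line len = PAGE_ALIGN(len), so the length that gets mapped is the len argument rounded up to whole pages, with no regard for the size of the file.
      The fault handler that actually reads the file does take the file length into account: if the page being accessed lies beyond the file size rounded up to whole pages, the fault returns SIGBUS.

      /*
       * DIV_ROUND_UP() rounds up; i_size_read(inode) returns the file length
       * (inode->i_size). max_off is counted in pages: for a 7000-byte file and
       * 4096-byte pages it is 2, so the file covers bytes [0, 8192).
       */
      max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
      /*
       * offset is the page offset in the file of the faulting address (here,
       * the memcpy() target). If it is at or beyond max_off, return SIGBUS.
       */
      if (unlikely(offset >= max_off))
          return VM_FAULT_SIGBUS;

    • SIGSEGV from memcpy() after mmap(): (mm/memory.c:handle_mm_fault())

      if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
                                     flags & FAULT_FLAG_INSTRUCTION,
                                     flags & FAULT_FLAG_REMOTE))
          /* The process attempted an access this VMA does not permit: return SIGSEGV. */
          return VM_FAULT_SIGSEGV;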
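    • Is mmap a silver bullet?

      No. With random writes, the frequent page faults and dirty-page writeback erode the advantage mmap gains from avoiding copies between kernel space and user space. The figure below is the flame graph of the sequential mmap write in option 3 of the previous article on saturating disk I/O bandwidth; it shows more directly where mmap's bottleneck lies:

    • With MAP_SHARED, does the memory used by the mapping count towards RSS?

      Yes. RSS (resident set size) is the memory resident in RAM, generally understood as the physical memory actually in use. Once pages of a MAP_SHARED mapping have taken a page fault and been given physical memory by the OS, they show up in the RSS figure.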
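    The SIGBUS case from the Q&A above can be reproduced in user space. The sketch below uses made-up names and assumes a regular file under /tmp. It maps four pages of a file that is shorter than that, reads one byte at a chosen offset, and catches SIGBUS with sigsetjmp()/siglongjmp(). A read inside the file's last partial page succeeds (the tail of that page reads as zeros), while a read in the first page wholly beyond the end of the file raises SIGBUS.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf sigbus_env;

static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(sigbus_env, 1);           /* jump back out of the faulting read */
}

/* Map 4 pages of a temp file of `file_size` bytes and read the byte at `off`.
 * Returns 0 if the read succeeded, 1 if it raised SIGBUS, -1 on setup error. */
int probe_mapped_read(size_t file_size, size_t off)
{
    size_t map_len = 4 * (size_t)sysconf(_SC_PAGESIZE);
    char path[] = "/tmp/mmap_bus_XXXXXX";
    struct sigaction sa, old;
    volatile char sink;
    char *p;
    int result;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (off >= map_len || ftruncate(fd, (off_t)file_size) != 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, map_len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                           /* the mapping keeps the file alive */
    if (p == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, &old);

    if (sigsetjmp(sigbus_env, 1) == 0) {
        sink = p[off];                   /* may fault beyond the end of the file */
        (void)sink;
        result = 0;
    } else {
        result = 1;                      /* arrived here via siglongjmp() */
    }

    sigaction(SIGBUS, &old, NULL);
    munmap(p, map_len);
    return result;
}
```

    With a page size of P and a file of P + 100 bytes, probe_mapped_read(P + 100, P + 200) should return 0 and probe_mapped_read(P + 100, 2 * P) should return 1, which matches the max_off check quoted in the Q&A.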
