Reading the Linux Kernel Source: Shared Memory

Introduction

I am reading the linux-4.2.3 source. As a reference I used Chapter 16 of 《邊干邊學——Linux內核指導》 (roughly "Learn by Doing: A Linux Kernel Guide", a bizarre title), which is based on the 2.6.15 kernel source.
Linux today offers two ways to use shared memory:

- POSIX: shm_open() opens a file under /dev/shm/, which is then mapped into the process's own address space with mmap()
- System V: shmget() returns the id of a shared memory object, which is then mapped into the process's own address space with shmat()
The POSIX implementation is based on tmpfs and its functions live in libc, so there is not much to say about it; we will mainly look at the System V implementation. In System V, shared memory belongs to the IPC subsystem. IPC stands for InterProcess Communication: System V added three IPC mechanisms on top of earlier Unix, namely shared memory, message queues, and semaphores, collectively called IPC. The main code is in the following files:
ipc/shm.c
include/linux/shm.h
ipc/util.c
ipc/util.h
include/linux/ipc.h
In the kernel, the same shared memory segment has at least three identifiers:

- The IPC object id (the IPC object is the data structure that stores the IPC information)
- The inode of the file in the process's virtual memory. In each process the shared memory also exists as a file, though not an explicit one; it can be identified via vm_area_struct->vm_file->f_dentry->d_inode->i_ino
- The key of the IPC object. Passing the same key to shmget() returns the same shared memory segment. But since the key is chosen by the user, keys can collide, and programs rarely agree on a key in advance, so this approach is not very common. System V shared memory is usually used between related processes (parent and child); alternatively, ftok() can generate a key from a pathname.
First, let's look at the data structure that represents a shared memory segment in the kernel, in include/linux/shm.h.

In the listings below, /* */ comments come from the kernel source; // comments are mine.
Next, look at shm_perm, the member of struct shmid_kernel that stores the permission information, defined in include/linux/ipc.h:

```c
/* used by in-kernel data structures */
struct kern_ipc_perm {
	spinlock_t	lock;
	bool		deleted;
	int		id;	// IPC object id
	key_t		key;	// IPC object key, i.e. the key the user passed when creating the segment
	kuid_t		uid;	// owner uid
	kgid_t		gid;	// owner gid
	kuid_t		cuid;	// creator uid
	kgid_t		cgid;	// creator gid
	umode_t		mode;
	unsigned long	seq;
	void		*security;
};
```

Why a separate struct? Because the permissions, id, and key are attributes that every IPC object has. For instance, struct semid_kernel, which represents a semaphore, embeds the same struct kern_ipc_perm. When IPC objects are passed around, what actually gets passed is a pointer to the struct kern_ipc_perm, and a macro like container_of recovers the enclosing struct. That way the same function can operate on all three kinds of IPC objects, achieving good code reuse.
Next, the shared-memory-related functions. They are all system calls; the corresponding user-space APIs in libc take the same parameters and only do the usual system-call housekeeping (saving and restoring state and so on), so we can read the system calls directly.

They are declared in include/linux/syscalls.h:

```c
asmlinkage long sys_shmat(int shmid, char __user *shmaddr, int shmflg);
asmlinkage long sys_shmget(key_t key, size_t size, int flag);
asmlinkage long sys_shmdt(char __user *shmaddr);
asmlinkage long sys_shmctl(int shmid, int cmd, struct shmid_ds __user *buf);
```

They are defined in ipc/shm.c.
shmget
```c
SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
{
	struct ipc_namespace *ns;
	static const struct ipc_ops shm_ops = {
		.getnew = newseg,
		.associate = shm_security,
		.more_checks = shm_more_checks,
	};
	struct ipc_params shm_params;

	ns = current->nsproxy->ipc_ns;

	shm_params.key = key;
	shm_params.flg = shmflg;
	shm_params.u.size = size;

	return ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params);
}
```

This definition may look odd at first, but SYSCALL_DEFINE3 ultimately expands to exactly the form declared in the header, i.e. long sys_shmget(key_t key, size_t size, int flag). The macro exists to work around a bug (pure black magic), so we won't dwell on it here.

The function that does the actual work is ipcget(). Great pains were taken to unify the IPC interface: all three object types (shared memory, semaphores, message queues) go through this function at creation time. The creation logic itself is not here, though; it lives in the three functions of shm_ops.
namespace
A quick note on current->nsproxy->ipc_ns, whose type is struct ipc_namespace. What is it? The IPC data structures, shared memory included, are global, but sometimes they need to be isolated: a group of processes may want to share these objects only among themselves, unaware of other processes' shared memory and free of conflicts with them. So the kernel, with considerable effort, gained namespaces. Passing the CLONE_NEWIPC flag to clone() creates a new IPC namespace.
So how does the IPC namespace relate to our shared memory data structures? Look at the structs:

```c
struct ipc_ids {
	int in_use;
	unsigned short seq;
	struct rw_semaphore rwsem;
	struct idr ipcs_idr;
	int next_id;
};

struct ipc_namespace {
	atomic_t count;
	struct ipc_ids ids[3];
	...
};
```

The important member is ids, which holds the ids of all IPC objects; shared memory lives in ids[2]. Within ids[2], the actual data management is done by ipcs_idr, another carefully crafted kernel mechanism for integer ID management, which maps each id to a unique object. You can simply think of it as an array. The relationship looks roughly like this:

```
                                                        [0] struct kern_ipc_perm <==> struct shmid_kernel
struct ipc_namespace => struct ipc_ids => struct idr => [1] struct kern_ipc_perm <==> struct shmid_kernel
                                                        [2] struct kern_ipc_perm <==> struct shmid_kernel
```

Back to shmget
OK, let's go back and see what shmget() actually does, starting with ipcget():

```c
int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
	   const struct ipc_ops *ops, struct ipc_params *params)
{
	if (params->key == IPC_PRIVATE)
		return ipcget_new(ns, ids, ops, params);
	else
		return ipcget_public(ns, ids, ops, params);
}
```

If the key argument is IPC_PRIVATE (the macro's value is 0), a new shared memory segment is always created, regardless of mode. If it is nonzero, the existing segments are searched for that key: a match is returned if found, otherwise a new segment is created.
First, the creation function, newseg():
```c
static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
{
	key_t key = params->key;
	int shmflg = params->flg;
	size_t size = params->u.size;
	int error;
	struct shmid_kernel *shp;
	size_t numpages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
	struct file *file;
	char name[13];
	int id;
	vm_flags_t acctflag = 0;

	if (size < SHMMIN || size > ns->shm_ctlmax)
		return -EINVAL;

	if (numpages << PAGE_SHIFT < size)
		return -ENOSPC;

	if (ns->shm_tot + numpages < ns->shm_tot ||
			ns->shm_tot + numpages > ns->shm_ctlall)
		return -ENOSPC;

	shp = ipc_rcu_alloc(sizeof(*shp));
	if (!shp)
		return -ENOMEM;

	shp->shm_perm.key = key;
	shp->shm_perm.mode = (shmflg & S_IRWXUGO);
	shp->mlock_user = NULL;

	shp->shm_perm.security = NULL;
	error = security_shm_alloc(shp);
	if (error) {
		ipc_rcu_putref(shp, ipc_rcu_free);
		return error;
	}

	sprintf(name, "SYSV%08x", key);
	if (shmflg & SHM_HUGETLB) {
		struct hstate *hs;
		size_t hugesize;

		hs = hstate_sizelog((shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
		if (!hs) {
			error = -EINVAL;
			goto no_file;
		}
		hugesize = ALIGN(size, huge_page_size(hs));

		/* hugetlb_file_setup applies strict accounting */
		if (shmflg & SHM_NORESERVE)
			acctflag = VM_NORESERVE;
		file = hugetlb_file_setup(name, hugesize, acctflag,
				  &shp->mlock_user, HUGETLB_SHMFS_INODE,
				(shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
	} else {
		/*
		 * Do not allow no accounting for OVERCOMMIT_NEVER, even
		 * if it's asked for.
		 */
		if ((shmflg & SHM_NORESERVE) &&
				sysctl_overcommit_memory != OVERCOMMIT_NEVER)
			acctflag = VM_NORESERVE;
		file = shmem_kernel_file_setup(name, size, acctflag);
	}
	error = PTR_ERR(file);
	if (IS_ERR(file))
		goto no_file;

	id = ipc_addid(&shm_ids(ns), &shp->shm_perm, ns->shm_ctlmni);
	if (id < 0) {
		error = id;
		goto no_id;
	}

	shp->shm_cprid = task_tgid_vnr(current);
	shp->shm_lprid = 0;
	shp->shm_atim = shp->shm_dtim = 0;
	shp->shm_ctim = get_seconds();
	shp->shm_segsz = size;
	shp->shm_nattch = 0;
	shp->shm_file = file;
	shp->shm_creator = current;
	list_add(&shp->shm_clist, &current->sysvshm.shm_clist);

	/*
	 * shmid gets reported as "inode#" in /proc/pid/maps. proc-ps tools use
	 * this. Changing this will break them.
	 */
	file_inode(file)->i_ino = shp->shm_perm.id;

	ns->shm_tot += numpages;
	error = shp->shm_perm.id;

	ipc_unlock_object(&shp->shm_perm);
	rcu_read_unlock();
	return error;

no_id:
	if (is_file_hugepages(file) && shp->mlock_user)
		user_shm_unlock(size, shp->mlock_user);
	fput(file);
no_file:
	ipc_rcu_putref(shp, shm_rcu_free);
	return error;
}
```

The function starts with a few if checks that validate size and make sure there are enough pages available. It then calls ipc_rcu_alloc() to allocate the shared memory data structure shp and fills some of the parameters into its shm_perm member. The big if-else after the sprintf sets up the file that represents the contents of the shared memory. Next comes ipc_addid(), a fairly important function: it adds the pointer to the newly created data structure to the namespace's ids (imagine inserting it into the array) and obtains an id by which it can be found later. This id is not simply the array index. To avoid reuse, a simple scheme keeps the generated ids almost unique: ids has a seq counter that is incremented every time a new object is added, and the real id is computed as SEQ_MULTIPLIER * seq + id. Finally the function initializes some members and adds the structure to a per-process list, and its work is essentially done.
Now let's look at what happens when an existing key is passed at creation time, i.e. the logic of ipcget_public():

```c
static int ipcget_public(struct ipc_namespace *ns, struct ipc_ids *ids,
		const struct ipc_ops *ops, struct ipc_params *params)
{
	struct kern_ipc_perm *ipcp;
	int flg = params->flg;
	int err;

	/*
	 * Take the lock as a writer since we are potentially going to add
	 * a new entry + read locks are not "upgradable"
	 */
	down_write(&ids->rwsem);
	ipcp = ipc_findkey(ids, params->key);
	if (ipcp == NULL) {
		/* key not used */
		if (!(flg & IPC_CREAT))
			err = -ENOENT;
		else
			err = ops->getnew(ns, params);
	} else {
		/* ipc object has been locked by ipc_findkey() */

		if (flg & IPC_CREAT && flg & IPC_EXCL)
			err = -EEXIST;
		else {
			err = 0;
			if (ops->more_checks)
				err = ops->more_checks(ipcp, params);
			if (!err)
				/*
				 * ipc_check_perms returns the IPC id on
				 * success
				 */
				err = ipc_check_perms(ns, ipcp, ops, params);
		}
		ipc_unlock(ipcp);
	}
	up_write(&ids->rwsem);

	return err;
}
```

The logic is very simple: first look up the key. If it is not found, a new segment is still created; note that ops->getnew() is the newseg() function we just read. If it is found, the permissions are checked and, if they are fine, the IPC id is returned directly.
Let's also look at ipc_findkey():

```c
static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key)
{
	struct kern_ipc_perm *ipc;
	int next_id;
	int total;

	for (total = 0, next_id = 0; total < ids->in_use; next_id++) {
		ipc = idr_find(&ids->ipcs_idr, next_id);

		if (ipc == NULL)
			continue;

		if (ipc->key != key) {
			total++;
			continue;
		}

		rcu_read_lock();
		ipc_lock_object(ipc);
		return ipc;
	}

	return NULL;
}
```

Again very simple. Note that ids->ipcs_idr is the integer ID management (idr) mechanism mentioned earlier, which stores the one-to-one mapping between shmids and objects. ids->in_use is the number of shared memory segments; since some entries in the middle may have been deleted, total is only incremented when a non-empty entry is found. We can also see that duplicate keys get no special handling here, which is one more reason to avoid hard-coding some agreed-upon number as a key in your programs.
shmat
Next is shmat(). All of its logic is in do_shmat(), so we go straight to that function.
```c
long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr,
	      unsigned long shmlba)
{
	struct shmid_kernel *shp;
	unsigned long addr;
	unsigned long size;
	struct file *file;
	int    err;
	unsigned long flags;
	unsigned long prot;
	int acc_mode;
	struct ipc_namespace *ns;
	struct shm_file_data *sfd;
	struct path path;
	fmode_t f_mode;
	unsigned long populate = 0;

	err = -EINVAL;
	if (shmid < 0)
		goto out;
	else if ((addr = (ulong)shmaddr)) {
		if (addr & (shmlba - 1)) {
			if (shmflg & SHM_RND)
				addr &= ~(shmlba - 1);	   /* round down */
			else
#ifndef __ARCH_FORCE_SHMLBA
				if (addr & ~PAGE_MASK)
#endif
					goto out;
		}
		flags = MAP_SHARED | MAP_FIXED;
	} else {
		if ((shmflg & SHM_REMAP))
			goto out;

		flags = MAP_SHARED;
	}

	if (shmflg & SHM_RDONLY) {
		prot = PROT_READ;
		acc_mode = S_IRUGO;
		f_mode = FMODE_READ;
	} else {
		prot = PROT_READ | PROT_WRITE;
		acc_mode = S_IRUGO | S_IWUGO;
		f_mode = FMODE_READ | FMODE_WRITE;
	}
	if (shmflg & SHM_EXEC) {
		prot |= PROT_EXEC;
		acc_mode |= S_IXUGO;
	}

	/*
	 * We cannot rely on the fs check since SYSV IPC does have an
	 * additional creator id...
	 */
	ns = current->nsproxy->ipc_ns;
	rcu_read_lock();
	shp = shm_obtain_object_check(ns, shmid);
	if (IS_ERR(shp)) {
		err = PTR_ERR(shp);
		goto out_unlock;
	}

	err = -EACCES;
	if (ipcperms(ns, &shp->shm_perm, acc_mode))
		goto out_unlock;

	err = security_shm_shmat(shp, shmaddr, shmflg);
	if (err)
		goto out_unlock;

	ipc_lock_object(&shp->shm_perm);

	/* check if shm_destroy() is tearing down shp */
	if (!ipc_valid_object(&shp->shm_perm)) {
		ipc_unlock_object(&shp->shm_perm);
		err = -EIDRM;
		goto out_unlock;
	}

	path = shp->shm_file->f_path;
	path_get(&path);
	shp->shm_nattch++;
	size = i_size_read(d_inode(path.dentry));
	ipc_unlock_object(&shp->shm_perm);
	rcu_read_unlock();

	err = -ENOMEM;
	sfd = kzalloc(sizeof(*sfd), GFP_KERNEL);
	if (!sfd) {
		path_put(&path);
		goto out_nattch;
	}

	file = alloc_file(&path, f_mode,
			  is_file_hugepages(shp->shm_file) ?
				&shm_file_operations_huge :
				&shm_file_operations);
	err = PTR_ERR(file);
	if (IS_ERR(file)) {
		kfree(sfd);
		path_put(&path);
		goto out_nattch;
	}

	file->private_data = sfd;
	file->f_mapping = shp->shm_file->f_mapping;
	sfd->id = shp->shm_perm.id;
	sfd->ns = get_ipc_ns(ns);
	sfd->file = shp->shm_file;
	sfd->vm_ops = NULL;

	err = security_mmap_file(file, prot, flags);
	if (err)
		goto out_fput;

	down_write(&current->mm->mmap_sem);
	if (addr && !(shmflg & SHM_REMAP)) {
		err = -EINVAL;
		if (addr + size < addr)
			goto invalid;

		if (find_vma_intersection(current->mm, addr, addr + size))
			goto invalid;
	}

	addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, &populate);
	*raddr = addr;
	err = 0;
	if (IS_ERR_VALUE(addr))
		err = (long)addr;
invalid:
	up_write(&current->mm->mmap_sem);
	if (populate)
		mm_populate(addr, populate);

out_fput:
	fput(file);

out_nattch:
	down_write(&shm_ids(ns).rwsem);
	shp = shm_lock(ns, shmid);
	shp->shm_nattch--;
	if (shm_may_destroy(ns, shp))
		shm_destroy(ns, shp);
	else
		shm_unlock(shp);
	up_write(&shm_ids(ns).rwsem);
	return err;

out_unlock:
	rcu_read_unlock();
out:
	return err;
}
```

The function first validates shmaddr and aligns it, rounding it down to a multiple of shmlba. If the addr passed in is 0, the checking code only sets the MAP_SHARED flag, because the mmap later on will pick an address automatically. Starting from the two-line comment, the function looks up the shared memory object by shmid and performs permission checks. It then updates some fields of shp, for example incrementing the attach count. Then alloc_file() creates the file that will actually be mmap'd. Before mmapping, the address range is checked: whether it overlaps another mapping and whether it is large enough. The actual mapping work is done in do_mmap_pgoff().
shmdt
```c
SYSCALL_DEFINE1(shmdt, char __user *, shmaddr)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma;
	unsigned long addr = (unsigned long)shmaddr;
	int retval = -EINVAL;
#ifdef CONFIG_MMU
	loff_t size = 0;
	struct file *file;
	struct vm_area_struct *next;
#endif

	if (addr & ~PAGE_MASK)
		return retval;

	down_write(&mm->mmap_sem);

	/*
	 * This function tries to be smart and unmap shm segments that
	 * were modified by partial mlock or munmap calls:
	 * - It first determines the size of the shm segment that should be
	 *   unmapped: It searches for a vma that is backed by shm and that
	 *   started at address shmaddr. It records it's size and then unmaps
	 *   it.
	 * - Then it unmaps all shm vmas that started at shmaddr and that
	 *   are within the initially determined size and that are from the
	 *   same shm segment from which we determined the size.
	 * Errors from do_munmap are ignored: the function only fails if
	 * it's called with invalid parameters or if it's called to unmap
	 * a part of a vma. Both calls in this function are for full vmas,
	 * the parameters are directly copied from the vma itself and always
	 * valid - therefore do_munmap cannot fail. (famous last words?)
	 */

	/*
	 * If it had been mremap()'d, the starting address would not
	 * match the usual checks anyway. So assume all vma's are
	 * above the starting address given.
	 */
	vma = find_vma(mm, addr);

#ifdef CONFIG_MMU
	while (vma) {
		next = vma->vm_next;

		/*
		 * Check if the starting address would match, i.e. it's
		 * a fragment created by mprotect() and/or munmap(), or it
		 * otherwise it starts at this address with no hassles.
		 */
		if ((vma->vm_ops == &shm_vm_ops) &&
			(vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) {

			/*
			 * Record the file of the shm segment being
			 * unmapped.  With mremap(), someone could place
			 * page from another segment but with equal offsets
			 * in the range we are unmapping.
			 */
			file = vma->vm_file;
			size = i_size_read(file_inode(vma->vm_file));
			do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start);
			/*
			 * We discovered the size of the shm segment, so
			 * break out of here and fall through to the next
			 * loop that uses the size information to stop
			 * searching for matching vma's.
			 */
			retval = 0;
			vma = next;
			break;
		}
		vma = next;
	}

	/*
	 * We need look no further than the maximum address a fragment
	 * could possibly have landed at. Also cast things to loff_t to
	 * prevent overflows and make comparisons vs. equal-width types.
	 */
	size = PAGE_ALIGN(size);
	while (vma && (loff_t)(vma->vm_end - addr) <= size) {
		next = vma->vm_next;

		/* finding a matching vma now does not alter retval */
		if ((vma->vm_ops == &shm_vm_ops) &&
		    ((vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) &&
		    (vma->vm_file == file))
			do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start);
		vma = next;
	}

#else	/* CONFIG_MMU */
	/*
	 * under NOMMU conditions, the exact address to be destroyed must be
	 * given
	 */
	if (vma && vma->vm_start == addr && vma->vm_ops == &shm_vm_ops) {
		do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start);
		retval = 0;
	}

#endif

	up_write(&mm->mmap_sem);
	return retval;
}
```

shmdt() is very simple: it finds the vma (virtual memory area) corresponding to the shmaddr passed in, checks that its address is valid, then calls do_munmap() to detach from the shared memory. Note that this does not destroy the segment; even with no process attached it lives on. It is only destroyed by an explicit shmctl(id, IPC_RMID, NULL).
shmctl() is essentially one big switch statement; most of its cases just read information or set flags, so I won't go through it here.
Summary