Wednesday, March 8, 2006

VFS/Vnode Layer in Solaris

In the past I have mostly written about dispatcher locks (thread locks), the scheduler, signals and procfs. This is the first time I'm writing about filesystems. I hope it helps raise awareness of how filesystems work, so that developing filesystem-specific code on Solaris becomes easier.

In this blog, I'll describe how to implement the VFS (Virtual Filesystem) layer and the vnode layer for any filesystem. There are two ways you can read disk data :-

(a) using the buffer cache : bread() is used to read a block of the device. The block number is always relative to the device. brelse() must be called once the buffer data has been read from buf_t->b_un.b_addr.

(b) using the segmap driver and setting up the pages.

Using the VFS layer, we can export the following filesystem operations :-

(a) mount : In this operation, we first need to see whether the device can be mounted or not. We also need to read the super-block (how depends on whether it's a primary partition or a logical partition). We are also required to create a pseudo device using the following calls

pseudodev = makedevice(getmajor(xdev), minor); // xdev is the device passed to mount(1m)
devvp = makespecvp(xdev, VBLK); // devvp is used to do reads

Once the pseudo device is created, we open the device to read the super-block and check the filesystem signature. This information is copied to the in-core super-block. Now comes the hard work of mounting the filesystem. Here we get the vnode for the root of the filesystem (the one that will sit on the mount point) and mark it VROOT (in vp->v_flag). The VFS structure is also filled in. For instance, vfs_data points to the filesystem-private structure (struct ufsvfs in the case of UFS), which holds the super-block and other general information about the filesystem. The VFS layer routines take care of linking the vfs structure into the global list of mounted filesystems; the global array 'vfssw' (of struct vfssw type) is the switch table of filesystem types, indexed by vfs_fstype.
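The signature check described above can be sketched in plain userland C. This is only an illustration under stated assumptions: the super-block layout, the magic value and the xxx_ names are all hypothetical, and a real filesystem would also have to deal with byte order and with where the super-block lives on disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XXX_MAGIC 0x58465331u   /* hypothetical filesystem signature */

/* Hypothetical on-disk super-block; real layouts are filesystem specific. */
struct xxx_superblock {
    uint32_t sb_magic;      /* filesystem signature */
    uint32_t sb_bsize;      /* block size in bytes */
    uint64_t sb_nblocks;    /* total number of blocks */
};

/*
 * Validate the raw bytes read from the device and, if the signature
 * matches, copy them into the in-core super-block.
 * Returns 0 on success, -1 if this is not our filesystem.
 */
static int
xxx_read_super(const void *raw, size_t rawlen, struct xxx_superblock *incore)
{
    struct xxx_superblock sb;

    assert(raw != NULL && incore != NULL);
    if (rawlen < sizeof (sb))
        return (-1);
    memcpy(&sb, raw, sizeof (sb));  /* avoids alignment assumptions on raw */
    if (sb.sb_magic != XXX_MAGIC)
        return (-1);                /* wrong signature: refuse the mount */
    *incore = sb;
    return (0);
}
```

In a real mount routine the raw bytes would come from the device opened through devvp, and a failed check would turn into an error returned from the mount operation.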

(b) unmount : This operation is very critical. Unmount should not go through while processes are inside the mount point, unless the force flag (-f) is passed to umount(1m). We need to maintain a reference count so that we don't allow unmount to go through while some process's current working directory is inside the mount point. For this we can increment the reference count whenever a vnode is allocated and decrement it whenever a vnode is released via VOP_INACTIVE(). Hence the xxx_unmount() operation should first check whether it's safe to unmount the filesystem or not. The DNLC will be purged by the VFS layer routines before we land in the filesystem-specific unmount operation.
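The reference counting idea can be modelled in a few lines of userland C. This is a minimal sketch, not kernel code: struct xxx_fs and the xxx_ helpers are hypothetical, there is no locking, and a real filesystem would also account for the hold on its own root vnode.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-mount state; a real filesystem keeps this in vfs_data. */
struct xxx_fs {
    unsigned int fs_vnodes;     /* vnodes currently handed out */
};

/* Called wherever the filesystem allocates a vnode. */
static void
xxx_vnode_alloc(struct xxx_fs *fs)
{
    fs->fs_vnodes++;
}

/* Called from the inactive path, when the last hold on a vnode goes away. */
static void
xxx_vnode_inactive(struct xxx_fs *fs)
{
    assert(fs->fs_vnodes > 0);
    fs->fs_vnodes--;
}

/* Returns 0 if unmount may proceed, EBUSY if vnodes are still in use. */
static int
xxx_unmount_check(const struct xxx_fs *fs, int force)
{
    if (fs->fs_vnodes > 0 && !force)
        return (EBUSY);
    return (0);
}
```

In the kernel the counter would need to be protected by a mutex (or updated atomically), since vnodes are allocated and released concurrently.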

(c) stat on the filesystem : df(1m) calls statvfs(2) for each mount point. In this operation, we are required to return the following information in the statvfs64 structure :

f_bsize // block size
f_frsize // fragment size (the block size here; UFS has a separate fragment size to accommodate small files)
f_blocks // total number of blocks in the filesystem
f_bfree // free blocks
f_files = (fsfilcnt64_t)-1;
f_ffree = (fsfilcnt64_t)-1;
f_favail = (fsfilcnt64_t)-1;
f_fsid // filesystem id
(void) strcpy(sp->f_basetype, vfssw[vfsp->vfs_fstype].vsw_name); // name
f_flag = vf_to_stf(vfsp->vfs_flag); // flag
f_namemax // MAX filename size.
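As a rough userland illustration of filling these fields, here is a sketch that uses the POSIX struct statvfs from <sys/statvfs.h> rather than the kernel's statvfs64. Everything named xxx_ is hypothetical. The Solaris-only f_basetype field is left out for portability, f_bavail is set even though it is not in the list above, and ST_RDONLY stands in for the result of vf_to_stf().

```c
#include <assert.h>
#include <string.h>
#include <sys/statvfs.h>

/* Hypothetical in-core summary of the filesystem (from the super-block). */
struct xxx_fsinfo {
    unsigned long bsize;        /* block size in bytes */
    unsigned long nblocks;      /* total blocks */
    unsigned long nfree;        /* free blocks */
    unsigned long fsid;         /* filesystem id */
    unsigned long namemax;      /* maximum filename length */
};

/* Fill *sp the way a read-only filesystem without inode counts might. */
static void
xxx_fill_statvfs(const struct xxx_fsinfo *fs, struct statvfs *sp)
{
    assert(fs != NULL && sp != NULL);
    memset(sp, 0, sizeof (*sp));
    sp->f_bsize = fs->bsize;
    sp->f_frsize = fs->bsize;           /* no fragments: same as block size */
    sp->f_blocks = fs->nblocks;
    sp->f_bfree = fs->nfree;
    sp->f_bavail = fs->nfree;           /* assumption: no reserved blocks */
    sp->f_files = (fsfilcnt_t)-1;       /* inode counts not tracked */
    sp->f_ffree = (fsfilcnt_t)-1;
    sp->f_favail = (fsfilcnt_t)-1;
    sp->f_fsid = fs->fsid;
    sp->f_flag = ST_RDONLY;             /* stand-in for vf_to_stf() */
    sp->f_namemax = fs->namemax;
}
```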


(d) sync operation : For a read-only filesystem, we don't need to implement sync. Otherwise, it's used for flushing the dirty pages of the filesystem.

(e) root operation : used by filesystem lookups to determine the root of the filesystem (the vnode sitting on the mount point). We are required to hold the vnode before returning it.

The vnode layer exports the following operations. We will focus on the operations which are required to support reads on the filesystem. Write operations are much trickier, as you need to implement a host of other operations and lock the filesystem.

(a) read : This operation is invoked whenever read(2) is called. In this routine, we use segmap to read the data of the file. We force-fault the pages using

segmap_getmapflt(segkmap, vp, (off + mapon), , 1, S_READ);

and then uiomove() is called to copy the data back to userland. We also release the smp (segmap entry) using segmap_release() once uiomove() is done. Please note that segmap works in windows of MAXBSIZE (8192) bytes, so you're required to manage the offset (off) and mapon accordingly; they are calculated as :

off = uoff & (offset_t)MAXBMASK;
mapon = (u_offset_t)(uoff & (offset_t)MAXBOFFSET);
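To make the arithmetic concrete, here is a small userland sketch of the same calculation. The three macros are defined locally to mirror what I understand the Solaris definitions to be; struct xxx_chunk and xxx_next_chunk() are hypothetical names, and int64_t stands in for offset_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAXBSIZE    8192
#define MAXBOFFSET  (MAXBSIZE - 1)      /* mask for the offset inside a window */
#define MAXBMASK    (~MAXBOFFSET)       /* mask for the window-aligned offset */

/* One step of a read loop: which window to map and how much to copy. */
struct xxx_chunk {
    int64_t off;    /* window-aligned file offset */
    size_t mapon;   /* offset of the data inside the window */
    size_t n;       /* bytes to copy from this window */
};

/* uoff is the current file offset, resid the bytes still wanted. */
static struct xxx_chunk
xxx_next_chunk(int64_t uoff, size_t resid)
{
    struct xxx_chunk c;

    assert(uoff >= 0);
    c.off = uoff & (int64_t)MAXBMASK;
    c.mapon = (size_t)(uoff & (int64_t)MAXBOFFSET);
    c.n = MAXBSIZE - c.mapon;       /* never cross the window boundary */
    if (c.n > resid)
        c.n = resid;
    return (c);
}
```

For example, a read at file offset 10000 falls in the window starting at 8192, at mapon 1808, so at most 6384 bytes can be copied before the next window has to be mapped. A real read routine would also clamp n to the remaining file size.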

(b) getattr : In this operation, we need to return the 'vattr' structure. 'ls -l' reads this structure. The following members are relevant here :-

va_type // type of vnode
va_mode // mode
va_uid // uid
va_gid // gid
va_atime.tv_sec // access time
va_mtime.tv_sec // modification time
va_ctime.tv_sec // inode change time
va_size // size
va_nlink // link count
va_blksize // block size
va_nblocks // number of blocks


(c) lookup : This is the heart of any filesystem. We must provide lookup in the filesystem before we can read files or search in a directory. This routine understands the on-disk structure of the filesystem. In this operation, you can also use the DNLC (Directory Name Lookup Cache) to speed up lookups. The vnode and name are cached, so we don't have to go to the disk every time we search for a file or directory. dnlc_enter() can be used to put an entry in the DNLC, and dnlc_lookup() can be used to check whether the vnode for a given name can be found in the DNLC. Both routines increment v_count using VN_HOLD().
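The caching behaviour can be modelled with a toy name cache in userland C. This is not the DNLC API: the xxx_ names, the fixed-size table and the round-robin replacement are all assumptions made for illustration. It only shows the (directory, name) to vnode mapping and the fact that both entering and looking up take a hold on the vnode.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define XXX_NC_SIZE 8       /* hypothetical cache size */
#define XXX_NAMELEN 32      /* longest name we bother to cache */

struct xxx_vnode {
    unsigned int v_count;   /* hold count, as with VN_HOLD() */
};

struct xxx_ncache_ent {
    const struct xxx_vnode *dvp;    /* directory the name lives in */
    char name[XXX_NAMELEN];
    struct xxx_vnode *vp;           /* NULL means the slot is empty */
};

static struct xxx_ncache_ent xxx_ncache[XXX_NC_SIZE];
static size_t xxx_nc_next;          /* next slot to reuse (round robin) */

/* Cache (dvp, name) -> vp, taking a hold on vp. */
static void
xxx_nc_enter(const struct xxx_vnode *dvp, const char *name,
    struct xxx_vnode *vp)
{
    struct xxx_ncache_ent *e;

    assert(dvp != NULL && name != NULL && vp != NULL);
    if (strlen(name) >= XXX_NAMELEN)
        return;                     /* too long: simply not cached */
    e = &xxx_ncache[xxx_nc_next];
    xxx_nc_next = (xxx_nc_next + 1) % XXX_NC_SIZE;
    if (e->vp != NULL)
        e->vp->v_count--;           /* drop the hold of the evicted entry */
    e->dvp = dvp;
    strcpy(e->name, name);
    e->vp = vp;
    vp->v_count++;
}

/* Return the cached vnode with an extra hold, or NULL on a miss. */
static struct xxx_vnode *
xxx_nc_lookup(const struct xxx_vnode *dvp, const char *name)
{
    size_t i;

    for (i = 0; i < XXX_NC_SIZE; i++) {
        struct xxx_ncache_ent *e = &xxx_ncache[i];

        if (e->vp != NULL && e->dvp == dvp && strcmp(e->name, name) == 0) {
            e->vp->v_count++;
            return (e->vp);
        }
    }
    return (NULL);
}
```

A lookup routine built on this would try xxx_nc_lookup() first and, only on a miss, search the directory on disk and then call xxx_nc_enter() with the result.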

(d) getpage_miss/getpage : This routine reads the block of a file at the given offset. Here we need to set up the page using page_create_va() and prepare for reading the block data using pageio_setup(). To issue the I/O, we do the following things in order: bdev_strategy(), biowait() and then pageio_done(). To support read-ahead, we can use the pvn_read_kluster() routine. The filesystem-specific getpage() routine calls getpage_miss() to read the block. In getpage(), we also do a page_lookup() first, to avoid going to disk if the page is already in memory.
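The split between getpage() and getpage_miss() can be modelled in userland as a lookup-then-fill page cache. This is only a toy: the xxx_ names are hypothetical, a memory buffer plays the role of the disk, and the memcpy() stands where the kernel would do page_create_va(), pageio_setup(), bdev_strategy(), biowait() and pageio_done().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define XXX_PAGESIZE 4096   /* hypothetical page size */
#define XXX_NPAGES   4      /* hypothetical number of cached pages */

struct xxx_page {
    int valid;                          /* page holds data for 'off' */
    size_t off;                         /* page-aligned file offset */
    unsigned char data[XXX_PAGESIZE];
};

struct xxx_file {
    const unsigned char *disk;          /* stand-in for the device */
    size_t size;                        /* file size in bytes */
    struct xxx_page cache[XXX_NPAGES];
    unsigned int io_count;              /* number of simulated disk reads */
};

/* Miss path: set up the page and fill it from "disk". */
static struct xxx_page *
xxx_getpage_miss(struct xxx_file *fp, size_t off)
{
    struct xxx_page *pp = &fp->cache[(off / XXX_PAGESIZE) % XXX_NPAGES];
    size_t n = fp->size - off;

    if (n > XXX_PAGESIZE)
        n = XXX_PAGESIZE;
    memset(pp->data, 0, XXX_PAGESIZE);  /* zero the tail of a short page */
    memcpy(pp->data, fp->disk + off, n);
    pp->off = off;
    pp->valid = 1;
    fp->io_count++;
    return (pp);
}

/* Fast path first: return the page if it is already in memory. */
static struct xxx_page *
xxx_getpage(struct xxx_file *fp, size_t off)
{
    struct xxx_page *pp;

    assert(off % XXX_PAGESIZE == 0 && off < fp->size);
    pp = &fp->cache[(off / XXX_PAGESIZE) % XXX_NPAGES];
    if (pp->valid && pp->off == off)
        return (pp);                    /* hit: no I/O needed */
    return (xxx_getpage_miss(fp, off));
}
```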

(e) readdir : This operation is used to read the directory entries. The uio_offset passed in the uio structure is the key thing here. If uio_offset is the same as the file size, then we have read all the directory entries. If that's not the case, then we read directory entries starting from the last offset, which is passed to us in uio_offset. At the end, we are required to return the new offset in uio_offset, so that the next time readdir() is called, we can read more directory entries.
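The offset handling can be sketched in userland C as follows. The cut-down struct xxx_uio, the fixed-size directory entry and xxx_readdir() are all hypothetical; a real implementation would convert on-disk entries to dirent64 records and copy them out with uiomove().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define XXX_DIRENT_SIZE 16  /* hypothetical fixed-size on-disk entry */

/* Hypothetical cut-down uio: just the fields this sketch needs. */
struct xxx_uio {
    size_t uio_offset;      /* where to resume in the directory */
    size_t uio_resid;       /* space left in the caller's buffer */
    char *uio_buf;          /* caller's buffer */
};

/*
 * Copy as many whole entries as fit, starting at uio_offset, and leave
 * the new offset in uio_offset. Returns the number of entries copied;
 * 0 means the end of the directory has been reached (or nothing fits).
 */
static int
xxx_readdir(const char *dir, size_t dirsize, struct xxx_uio *uio)
{
    int nread = 0;

    assert(dir != NULL && uio != NULL);
    assert(uio->uio_offset % XXX_DIRENT_SIZE == 0);
    while (uio->uio_offset + XXX_DIRENT_SIZE <= dirsize &&
        uio->uio_resid >= XXX_DIRENT_SIZE) {
        memcpy(uio->uio_buf, dir + uio->uio_offset, XXX_DIRENT_SIZE);
        uio->uio_buf += XXX_DIRENT_SIZE;
        uio->uio_resid -= XXX_DIRENT_SIZE;
        uio->uio_offset += XXX_DIRENT_SIZE;     /* remembered by the caller */
        nread++;
    }
    return (nread);
}
```

With a three-entry directory and a buffer that holds two entries, the first call copies two entries, the second call resumes at the saved offset and copies the last one, and the third call finds uio_offset equal to the directory size and copies nothing.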

There are a host of other functions which are required when writes are also supported on the filesystem, for instance putpage, write etc. In order to support mmap(), we need to use the segvn segment driver instead of segmap.