Tuesday, October 10, 2006

Multi-CPU Binding in Solaris

We are working on a framework that would allow processes/threads to have affinity to more than one CPU. The affinities fall into three categories -- (a) strong affinity, (b) weak affinity, and (c) negative affinity.

(a) strong affinity :- This type of affinity would allow processes/threads to run only on the specified CPUs.

(b) weak affinity :- This type of affinity would allow processes/threads to run on their home lgroup or the specified CPUs, falling back to any other CPU only if neither is possible. The Solaris dispatcher follows the same order when choosing a CPU.

(c) negative affinity :- This type of affinity would prevent processes/threads from running on the specified CPUs.

At present, only strong/negative affinity can change a thread's home lgroup, so on a NUMA-aware machine users need to be more cautious. The affinities are stored in a bitmask of CPUs (cpuset_t). When a CPU is offlined, it is removed from the thread's bitmask, and if it happens to be the only CPU left in that bitmask, we generate an event through the contract filesystem so that applications can take appropriate action when an affinity is revoked -- whether because of an offline or because a CPU is removed from a processor set.
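
To make the offline behaviour concrete, here is a rough, hypothetical sketch of that check. CPUSET_DEL()/CPUSET_ISNULL() are the existing cpuset_t macros from <sys/cpuvar.h>, but the helper name and the notification call are placeholders, not real interfaces :-

/* Hypothetical sketch: drop an offlined CPU from a thread's affinity bitmask. */
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/cpuvar.h>
#include <sys/thread.h>

static void affinity_revoke_and_notify(kthread_t *);    /* placeholder, not a real interface */

static void
affinity_cpu_offline(kthread_t *t, processorid_t cpu, cpuset_t *aff)
{
        CPUSET_DEL(*aff, cpu);          /* the offlined CPU leaves the bitmask */

        if (CPUSET_ISNULL(*aff)) {
                /*
                 * That was the last CPU in the thread's bitmask: revoke the
                 * affinity and raise a contract event so the application
                 * can react (placeholder call).
                 */
                affinity_revoke_and_notify(t);
        }
}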

The boundaries laid down by CPU partitions will still be respected: Multi-CPU binding will not allow processes/threads to cross partitions (processor sets).

The idle thread is also modified to look for work accordingly: a strong-affinity thread can't be stolen by a CPU that isn't in its bitmask, whereas weak-affinity threads can be stolen. Run-queue balancing in setbackdq() is done for all the affinities.

An example :-

bash-3.00# ./pbind -s 528-530 `pgrep aff`

bash-3.00# dtrace -s ./a.d ## D script capturing context switches.
CPU no. of times ran
529 197
528 208
530 210

bash-3.00# ./pbind -q `pgrep aff`
process id 3211: not bound
process id 3211: strong affinity to: 528-530

bash-3.00# psradm -f 529 528

bash-3.00# dtrace -s ./a.d ## D script capturing context switches.
CPU no. of times ran
530 255


If you were to offline CPU 530 as well, we would revoke the affinities, because this process has strong affinity and there would no longer be any CPU on which it could run. The purpose is to allow the offline to proceed (for DR or other FMA events). The same holds true for processor sets: if a CPU is removed from the pset and it happens to be the last CPU in the thread's CPU bitmask, the affinity is revoked.

We could preserve the affinity to a CPU when it is offlined, so that when it is brought back online users don't have to find a suitable CPU again (provided it's not the last CPU in the bitmask). I'm not sure whether this is a good idea or whether we really want to do it, but I do have a prototype based on it.

The above demo just illustrates what we are trying to achieve; it is still at the prototyping stage.

Wednesday, March 8, 2006

VFS/Vnode Layer in Solaris

In the past I have mostly written about dispatcher locks (thread locks), the scheduler, signals, and procfs. This is the first time I'm writing about filesystems. I hope it will raise your awareness of the filesystem layers and make developing filesystem-specific things on Solaris easier.

In this blog, I'll describe how to implement the VFS (Virtual Filesystem) layer and the vnode layer for any filesystem. There are two ways you can read disk data :-

(a) using the buffer cache : bread() is used to read a block of the device; the block number is always relative to the device. brelse() must be called once the buffer data has been read from buf_t->b_un.b_addr (see the sketch after this list).

(b) using the segmap driver and setting up the pages.
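
Here's a minimal sketch of path (a), assuming a single device-relative block and reducing error handling to the B_ERROR check; the function name and arguments are just for illustration :-

/* Minimal sketch: read one device-relative block through the buffer cache. */
#include <sys/types.h>
#include <sys/buf.h>
#include <sys/systm.h>
#include <sys/errno.h>

static int
read_one_block(dev_t dev, daddr_t blkno, void *to, long bsize)
{
        struct buf *bp;

        bp = bread(dev, blkno, bsize);          /* blkno is relative to the device */
        if (bp->b_flags & B_ERROR) {
                brelse(bp);
                return (EIO);
        }
        bcopy(bp->b_un.b_addr, to, bsize);      /* buffer data lives at b_un.b_addr */
        brelse(bp);                             /* always release the buffer */
        return (0);
}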

Using the VFS layer, we can export the following filesystem operations :-

(a) mount : In this operation, we first need to see whether the device can be mounted at all. We also need to read the super-block (its location depends on whether it's a primary or a logical partition). We are also required to create a pseudo device, using the following calls :

pseudodev = makedevice(getmajor(xdev), minor); // xdev is the device passed to mount(1m)
devvp = makespecvp(xdev, VBLK);                // devvp is used to do reads

Once the pseudo device is created, we open the device to read the super-block and check the filesystem signature. This information is copied into the in-core super-block. Now comes the hard work of mounting the filesystem: we get the vnode for the mount point and mark it VROOT (in vp->v_flag). The vfs structure is also filled in; for instance, vfs_data holds a pointer to the filesystem's private structure (struct ufsvfs in UFS), which in turn holds the super-block and other general information about the filesystem. The VFS layer routines take care of linking the vfs structure into the list of mounted filesystems (the filesystem type itself is registered in the global 'vfssw' array of struct vfssw).
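
A hedged sketch of that super-block check, continuing from the two calls above; the MYFS_* names and the on-disk layout are made up, and only makedevice()/makespecvp()/bread() are real interfaces :-

/* devvp was created with makespecvp() and opened via VOP_OPEN() for reading. */
struct buf *bp;
struct myfs_sblock *sb;                         /* hypothetical on-disk super-block */

bp = bread(devvp->v_rdev, MYFS_SBLOCK, MYFS_SBSIZE);
if (bp->b_flags & B_ERROR) {
        brelse(bp);
        return (EIO);
}
sb = (struct myfs_sblock *)bp->b_un.b_addr;
if (sb->sb_magic != MYFS_MAGIC) {               /* check the filesystem signature */
        brelse(bp);
        return (EINVAL);
}
/* ... copy into the in-core super-block, hang it off vfsp->vfs_data, mark the root vnode VROOT ... */
brelse(bp);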

(b) unmount : This operation is very critical. Unmount should not go through while processes are inside the mount point, unless -f (the force flag) is passed to umount(1m). We need to maintain a reference count so that we don't allow the unmount to go through while a process's current working directory is inside the mount point. For this we can increment the reference count whenever a vnode is allocated and decrement it whenever a vnode is released via VOP_INACTIVE(). The xxx_unmount() operation should therefore first check whether it is safe to unmount the filesystem. The DNLC is purged by the VFS layer routines before we land in the filesystem-specific unmount operation.
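
A small fragment of that busy check, assuming a hypothetical in-core structure that keeps the per-filesystem vnode reference count described above :-

/* fsp is the in-core fs structure from vfsp->vfs_data; fs_vnodecount is hypothetical. */
if (!(flag & MS_FORCE) && fsp->fs_vnodecount > 1)
        return (EBUSY);         /* something besides the root vnode is still held */
/* ... safe to unmount: release the root vnode and free the in-core super-block ... */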

(c) statvfs on the filesystem : df(1m) calls statvfs for each mount point. In this operation, we are required to return the following information in the statvfs64 structure :

f_bsize    // block size
f_frsize // fragment size (UFS uses fragments to accommodate small files)
f_blocks // total number of blocks in the filesystem
f_bfree // free blocks
f_files = (fsfilcnt64_t)-1;
f_ffree = (fsfilcnt64_t)-1;
f_favail = (fsfilcnt64_t)-1;
f_fsid // filesystem id
(void) strcpy(sp->f_basetype, vfssw[vfsp->vfs_fstype].vsw_name); // name
f_flag = vf_to_stf(vfsp->vfs_flag); // flag
f_namemax // maximum filename length


(d) sync operation : For a read-only filesystem, we don't need to implement sync; otherwise, it's used for flushing the filesystem's dirty pages.

(e) root operation : used during pathname lookup to find the root vnode of the filesystem (the mount point). We are required to return the vnode held.
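
A short sketch of a root operation, assuming a hypothetical in-core structure that remembers the root vnode held since mount time :-

#include <sys/vfs.h>
#include <sys/vnode.h>

struct myfs_vfs {                       /* hypothetical in-core fs structure */
        vnode_t *fs_rootvp;             /* root vnode, held since mount time */
        /* ... in-core super-block, etc. ... */
};

static int
myfs_root(struct vfs *vfsp, struct vnode **vpp)
{
        struct myfs_vfs *fsp = (struct myfs_vfs *)vfsp->vfs_data;

        *vpp = fsp->fs_rootvp;
        VN_HOLD(*vpp);                  /* caller expects the vnode held */
        return (0);
}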

The vnode layer exports the following operations. We will focus on the ones required to support reading from the filesystem; write support is much trickier, as you need to implement a host of other operations and handle filesystem locking.

(a) read : This operation is invoked when read(2) is called. In this routine, we use segmap to read the file's data. We force-fault the pages using

base = segmap_getmapflt(segkmap, vp, (off + mapon), n, 1, S_READ); // n = bytes to map in this pass

and then uiomove() is called to copy the data back to userland. We release the segmap slot (smp) using segmap_release() once uiomove() is done. Please note that segmap works in MAXBSIZE (8192-byte) windows, so you're required to manage the offset (off) and the offset within the window (mapon) accordingly; they are calculated as :

off = uoff & (offset_t)MAXBMASK;
mapon = (u_offset_t)(uoff & (offset_t)MAXBOFFSET);
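
Putting the pieces together, here is a condensed, hedged sketch of the read loop; locking, error paths, and the real VOP_READ signature are omitted, and 'filesize' would come from the in-core inode :-

#include <sys/param.h>
#include <sys/sysmacros.h>
#include <sys/vnode.h>
#include <sys/uio.h>
#include <vm/seg.h>
#include <vm/seg_map.h>

static int
myfs_read_loop(vnode_t *vp, struct uio *uiop, u_offset_t filesize)
{
        int error = 0;

        while (uiop->uio_resid > 0 && (u_offset_t)uiop->uio_loffset < filesize) {
                u_offset_t uoff = (u_offset_t)uiop->uio_loffset;
                u_offset_t off = uoff & (offset_t)MAXBMASK;
                int mapon = (int)(uoff & (offset_t)MAXBOFFSET);
                size_t n = MIN((size_t)(MAXBSIZE - mapon), (size_t)uiop->uio_resid);
                caddr_t base;

                if (n > filesize - uoff)                /* don't map past EOF */
                        n = (size_t)(filesize - uoff);

                /* force-fault the MAXBSIZE window that covers this offset */
                base = segmap_getmapflt(segkmap, vp, off + mapon, n, 1, S_READ);

                /* copy out to userland, then drop the segmap slot */
                error = uiomove(base + mapon, n, UIO_READ, uiop);
                (void) segmap_release(segkmap, base, 0);
                if (error)
                        break;
        }
        return (error);
}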

(b) getattr : In this operation, we need to fill in the 'vattr' structure; 'ls -l' reads this information. The following members are relevant here :-

va_type   // type of vnode
va_mode // mode
va_uid // uid
va_gid // gid
va_atime.tv_sec // access time
va_mtime.tv_sec // modification time
va_ctime.tv_sec // change time (last attribute change)
va_size // size
va_nlink // link count
va_blksize // block size
va_nblocks // number of blocks


(c) lookup : This is the heart of any filesystem. We must provide lookup before we can read files or search a directory, since this routine understands the on-disk filesystem structure. In this operation, you can also use the DNLC (Directory Name Lookup Cache) to speed up lookups: the vnode and name are cached, so we don't have to go to disk every time we search for a file/directory. dnlc_enter() can be used to put an entry into the DNLC, and dnlc_lookup() can be used to check whether a vnode for the given name is already cached. Both routines take a hold on the vnode with VN_HOLD().
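
A sketch of the fast/slow path split; myfs_dirlookup() stands in for the real on-disk directory search, negative-cache entries (DNLC_NO_VNODE) are ignored for brevity, and the real VOP_LOOKUP signature carries a few more arguments :-

#include <sys/vnode.h>
#include <sys/cred.h>
#include <sys/dnlc.h>

static int myfs_dirlookup(vnode_t *, char *, vnode_t **);       /* hypothetical on-disk search */

static int
myfs_lookup(vnode_t *dvp, char *nm, vnode_t **vpp, cred_t *cr)
{
        vnode_t *vp;
        int error;

        /* fast path: dnlc_lookup() returns the cached vnode already held */
        if ((vp = dnlc_lookup(dvp, nm)) != NULL) {
                *vpp = vp;
                return (0);
        }

        /* slow path: search the directory on disk */
        if ((error = myfs_dirlookup(dvp, nm, &vp)) != 0)
                return (error);

        dnlc_enter(dvp, nm, vp);        /* cache it for the next lookup */
        *vpp = vp;
        return (0);
}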

(d) getpage/getpage_miss : This routine reads a block of a file given the offset. Here we need to set up the page using page_create_va() and prepare for reading the block using pageio_setup(). To issue the I/O, we do the following in order -- bdev_strategy(), biowait(), and then pageio_done(). To support read-ahead, we can use the pvn_read_kluster() routine. The filesystem-specific getpage() routine calls getpage_miss() to read the block; in getpage() we also do a page_lookup() first, to avoid going to disk if the page is already in memory.
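
A very condensed fragment of that miss path -- read-ahead/klustering, locking, and error handling are all dropped, and lbn_to_dbn() is a placeholder for the file-offset-to-device-block translation :-

/* vp, off, seg, addr come from the getpage arguments; devvp/ip from the in-core inode. */
page_t *pp;
struct buf *bp;
int err;

pp = page_create_va(vp, off, PAGESIZE, PG_WAIT | PG_EXCL, seg, addr);

bp = pageio_setup(pp, PAGESIZE, devvp, B_READ); /* wrap the page in a buf for I/O */
bp->b_edev = devvp->v_rdev;
bp->b_dev = cmpdev(devvp->v_rdev);
bp->b_blkno = lbn_to_dbn(ip, off);              /* placeholder: file offset -> device block */
bp->b_un.b_addr = (caddr_t)0;

(void) bdev_strategy(bp);                       /* start the read */
err = biowait(bp);                              /* wait for it to finish */
pageio_done(bp);                                /* tear down the pageio buf */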

(e) readdir : This operation is used to read directory entries. The uio_offset passed in the uio structure is the key thing here: if uio_offset equals the directory size, we have already returned all the entries; otherwise we read entries starting from the offset passed in uio_offset. At the end, we are required to return the new offset in uio_offset, so that the next time readdir() is called we can continue reading entries from there.
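
A sketch of that offset handling; myfs_dir_size() and myfs_read_entry() are hypothetical helpers, and the real VOP_READDIR signature carries a few more arguments :-

#include <sys/param.h>
#include <sys/vnode.h>
#include <sys/cred.h>
#include <sys/uio.h>
#include <sys/dirent.h>

static u_offset_t myfs_dir_size(vnode_t *);                             /* hypothetical */
static u_offset_t myfs_read_entry(vnode_t *, u_offset_t, struct dirent64 *);    /* hypothetical */

static int
myfs_readdir(vnode_t *dvp, struct uio *uiop, cred_t *cr, int *eofp)
{
        char buf[DIRENT64_RECLEN(MAXNAMELEN)];
        struct dirent64 *dp = (struct dirent64 *)buf;
        u_offset_t off = (u_offset_t)uiop->uio_loffset;
        u_offset_t dirsize = myfs_dir_size(dvp);
        int error = 0;

        while (off < dirsize && uiop->uio_resid >= (ssize_t)sizeof (buf)) {
                /* fill *dp from the on-disk entry at 'off' and get the next offset */
                off = myfs_read_entry(dvp, off, dp);
                error = uiomove(buf, dp->d_reclen, UIO_READ, uiop);
                if (error)
                        break;
        }
        *eofp = (off >= dirsize);
        uiop->uio_loffset = (offset_t)off;      /* readdir resumes from here next time */
        return (error);
}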

There is a host of other functions required when writes are also supported on the filesystem, for instance putpage, write, etc. In order to support mmap(), we need to use the segvn segment driver instead of segmap.