Wednesday, September 2, 2009

Writing a new Ethernet device driver for Solaris

This blog entry describes what you should keep in mind while writing a new Ethernet device driver for Solaris. We will not go into LSO, hardware checksum offload or multiple RX rings, as I have not written code for those features.

Most Ethernet controllers have descriptor-based TX and RX. The starting point for writing a new device driver is getting attach() and detach() working. That's fairly easy; broadly, we want to do the following things in attach():

- Get the vendor-id/device-id and make sure we have the correct chip by looking at the revision.

- Pre-allocate all DMA buffers for TX. You will have to pre-allocate all RX buffers anyway. This is the simplest model you can think of, though it requires bcopy (an extra copy during TX/RX). But hey, you are just starting...

- Allocate interrupts, Register MAC and MII.

- Reset the PHY if required, and do it before starting MII (the mii_start() function). Reset the device too.

- You must enable device interrupts in attach(), and this should be the last operation before returning from it.

- The MII layer in Solaris takes care of PHY operations and dladm link properties too, so you need to provide getprop and setprop in the MAC callbacks (mc_callbacks). MII can also take care of some common statistics and ndd. You only need to implement the PHY read/write/reset operations, which are chip-specific. A sketch of the registration follows this list.
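Here is a minimal sketch of what the MAC/MII registration in attach() might look like. The xxxx_* names and the htable/handle fields are hypothetical placeholders, xxxx_m_callbacks is assumed to be a mac_callbacks_t with MC_GETPROP | MC_SETPROP set and mc_getprop/mc_setprop filled in, and the mii_alloc()/mii_ops_t shapes are from the MII support code of that era, so double-check against your build's headers:

static mii_ops_t xxxx_mii_ops = {
	MII_OPS_VERSION,
	xxxx_mii_read,		/* read a PHY register */
	xxxx_mii_write,		/* write a PHY register */
	xxxx_mii_notify,	/* link state change notification */
	NULL			/* optional PHY reset */
};

static int
xxxx_register_mac(xxxx_t *xxxxp, dev_info_t *dip)
{
	mac_register_t *macp;
	int err;

	if ((xxxxp->xxxx_mii = mii_alloc(xxxxp, dip, &xxxx_mii_ops)) == NULL)
		return (DDI_FAILURE);

	if ((macp = mac_alloc(MAC_VERSION)) == NULL)
		return (DDI_FAILURE);

	macp->m_type_ident = MAC_PLUGIN_IDENT_ETHER;
	macp->m_driver = xxxxp;
	macp->m_dip = dip;
	macp->m_src_addr = xxxxp->xxxx_macaddr;
	macp->m_callbacks = &xxxx_m_callbacks; /* includes getprop/setprop */
	macp->m_min_sdu = 0;
	macp->m_max_sdu = ETHERMTU;
	macp->m_margin = VLAN_TAGSZ;

	err = mac_register(macp, &xxxxp->xxxx_mach);
	mac_free(macp);

	return (err == 0 ? DDI_SUCCESS : DDI_FAILURE);
}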

One thing I'd like to point out here: have a single pair of DMA alloc and free functions for allocating and freeing a DMA handle and memory. It simplifies the code a lot. The same functions can be used to allocate the TX/RX descriptor rings, the DMA buffers for TX/RX, and memory for statistics or a control block. You need to pass a DMA attribute structure and a flag (the DMA read/write flag). A typical example of such functions will look like this:

typedef struct xxxx_dma_data {
	ddi_dma_handle_t	hdl;
	ddi_acc_handle_t	acchdl;
	ddi_dma_cookie_t	cookie;
	caddr_t			addr;
	size_t			len;
	uint_t			count;
} xxxx_dma_t;

xxxx_dma_t *
xxxx_alloc_a_dma_blk(xxxx_t *xxxxp, ddi_dma_attr_t *attr, int size, int flag)
{
	int err;
	xxxx_dma_t *dma;

	dma = kmem_zalloc(sizeof (xxxx_dma_t), KM_SLEEP);

	err = ddi_dma_alloc_handle(xxxxp->xxxx_dip, attr,
	    DDI_DMA_SLEEP, NULL, &dma->hdl);

	if (err != DDI_SUCCESS)
		goto fail;

	err = ddi_dma_mem_alloc(dma->hdl,
	    size, &xxxx_mem_attr, DDI_DMA_CONSISTENT, DDI_DMA_SLEEP, NULL,
	    &dma->addr, &dma->len, &dma->acchdl);

	if (err != DDI_SUCCESS) {
		ddi_dma_free_handle(&dma->hdl);
		goto fail;
	}

	err = ddi_dma_addr_bind_handle(dma->hdl, NULL, dma->addr,
	    dma->len, flag | DDI_DMA_CONSISTENT, DDI_DMA_SLEEP,
	    NULL, &dma->cookie, &dma->count);

	if (err != DDI_SUCCESS) {
		ddi_dma_mem_free(&dma->acchdl);
		ddi_dma_free_handle(&dma->hdl);
		goto fail;
	}

	return (dma);

fail:
	kmem_free(dma, sizeof (xxxx_dma_t));
	return (NULL);
}

void
xxxx_free_a_dma_blk(xxxx_dma_t *dma)
{
	if (dma != NULL) {
		(void) ddi_dma_unbind_handle(dma->hdl);
		ddi_dma_mem_free(&dma->acchdl);
		ddi_dma_free_handle(&dma->hdl);
		kmem_free(dma, sizeof (xxxx_dma_t));
	}
}
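As a usage sketch, allocating the TX descriptor ring might look like the following. Note that xxxx_dma_attr_desc, TX_RING_SIZE and xxxx_desc_t are hypothetical placeholders for your chip's DMA attributes and ring geometry:

	xxxx_dma_t *tx_ring;

	tx_ring = xxxx_alloc_a_dma_blk(xxxxp, &xxxx_dma_attr_desc,
	    TX_RING_SIZE * sizeof (xxxx_desc_t), DDI_DMA_RDWR);
	if (tx_ring == NULL)
		return (DDI_FAILURE);

	/* Program the chip with the physical address of the ring. */
	xxxxp->tx_ring_paddr = tx_ring->cookie.dmac_laddress;

	/* ... and on detach: */
	xxxx_free_a_dma_blk(tx_ring);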


Some of the corner cases you must take care of:

- Test the code path where there are no TX descriptors left for the driver to send a packet. You must call mac_tx_update() once descriptors have been reclaimed (see the sketch after this list). Some drivers start reclaiming once a threshold is reached.

- Make sure you handle the RX FIFO overflow interrupt properly. The driver may not have enough RX descriptors left to receive further packets, so you must consume the posted RX descriptors. Some chips even require a reset on RX FIFO overflow.
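Here is a minimal sketch of TX descriptor reclaim, assuming hypothetical driver state (tx_lock, tx_head, tx_free, tx_stalled and xxxx_desc_done() are all placeholders). Note that mac_tx_update() is called only after the lock is dropped, which also relates to the first item in the next list:

static void
xxxx_reclaim_tx(xxxx_t *xxxxp)
{
	boolean_t resched = B_FALSE;

	mutex_enter(&xxxxp->tx_lock);
	/* Walk the ring and free descriptors the chip has consumed. */
	while (xxxxp->tx_free < TX_RING_SIZE &&
	    xxxx_desc_done(xxxxp, xxxxp->tx_head)) {
		xxxxp->tx_head = (xxxxp->tx_head + 1) % TX_RING_SIZE;
		xxxxp->tx_free++;
	}
	/* If TX stalled for want of descriptors, ask MAC to resume. */
	if (xxxxp->tx_stalled && xxxxp->tx_free > 0) {
		xxxxp->tx_stalled = B_FALSE;
		resched = B_TRUE;
	}
	mutex_exit(&xxxxp->tx_lock);

	if (resched)
		mac_tx_update(xxxxp->xxxx_mach);	/* outside the lock */
}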

General things that you may want to consider:

- Call mac_tx_update() outside any lock.

- Consider raising a software interrupt from the hardware interrupt handler; don't spend too much time processing packets in hardware interrupt context.

- Make sure the chip is quiesced when detach() is called.

- Use DDI's ddi_periodic_add(9F) instead of timeout(9F).

- Test suspend/resume and quiesce (for fast reboot to work).

- I think most multicast filters are hash-based, but I have seen a CAM (Content Addressable Memory) based filter too. Supporting multicast can get tricky; in that case just enable ALL-multicast. Hash-based multicast filters are easy to implement: keep a reference count for every bit in the 64-bit hash variable. Once the reference count for a bit drops to zero, clear the bit; otherwise it should remain set (see the sketch after this list).

- Make sure you handle link status changes properly and re-program the MAC registers if required for a different link speed/duplex.

- Look for memory leaks (set kmem_flags = 0xf in /etc/system, take a crash dump, then run ::findleaks in mdb).
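A sketch of the reference-counted 64-bit hash filter described above, with hypothetical names (mcast_refs[], mcast_hash, xxxx_hash_bit() and xxxx_program_hash() are placeholders; the hash function is whatever CRC the chip documents):

static void
xxxx_multicst(xxxx_t *xxxxp, boolean_t add, const uint8_t *mca)
{
	uint_t bit = xxxx_hash_bit(mca);	/* index in 0..63 */

	if (add) {
		if (xxxxp->mcast_refs[bit]++ == 0)
			xxxxp->mcast_hash |= (1ULL << bit);
	} else {
		if (--xxxxp->mcast_refs[bit] == 0)
			xxxxp->mcast_hash &= ~(1ULL << bit);
	}

	/* Re-program the chip's hash registers with the new table. */
	xxxx_program_hash(xxxxp);
}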


You can use NICDRV or HCTS for testing. NICDRV will stress test most of the components in your driver, including MAXQ, FTP, ping with different payloads, load/unload of the driver, multicast, dladm(1M) features, VLAN, VNIC etc.

Tuesday, August 4, 2009

EOI (End-of-Interrupt) vs Directed-EOI

This post is to help distinguish between EOI and Directed-EOI. When the local APIC's EOI register is cleared, it does two things:

- Clear the appropriate bit in the ISR register of the local APIC.

- Issue a broadcast EOI message to all the IOAPICs in the system.

In Solaris, we clear the EOI register of the local APIC in two different places:

- For edge interrupts, we clear the EOI register while raising the TPR (Task Priority Register), i.e. in apic_intr_enter().

- For level-triggered interrupts, we clear the EOI register when exiting from the interrupt handler, i.e. in apic_intr_exit().

The notion of Directed-EOI comes from the x2APIC specification. Directed-EOI does not generate a broadcast EOI message to all the IOAPICs; instead, we clear the ISR in the local APIC (by writing 0 to its EOI register) and then clear the appropriate vector entry in the IOAPIC directly. Some CPUs are capable of suppressing the broadcast EOI message, and that's when Directed-EOI comes in handy. Note that Directed-EOI has no meaning for edge-triggered interrupts; for those we don't send any Directed-EOI.

Tuesday, June 16, 2009

x2APIC and a new device driver for Broadcom Fast Ethernet chips

It's been quite a while since I wrote something technical on my blog. I have been working on quite a few things of late. After my integration of x2APIC - a new local APIC model which uses MSRs (Model Specific Registers) on future generations of Intel processors - I took on a small challenge: to work on device drivers, and an Ethernet controller at that. Having gained no knowledge about networking and device drivers in past years, I thought this was the time to jump in. Better late than never, you know. So this blog is really about two major things:


x2APIC - A new local APIC (Advanced Programmable Interrupt Controller) model. It improves performance, as the local APIC registers can be written in parallel. With xAPIC (the MMIO model), we used to map the local APIC registers into memory, so any write to that I/O space was serialized. x2APIC has some improvements for IPIs (Inter-Processor Interrupts) too. It also extends support for local APIC IDs > 255, though I don't think any BIOS programs a local APIC ID > 255 as of now.

Broadcom Fast Ethernet (SUNWbfe) - This project turned out to be a good experience. I had no prior knowledge of writing device drivers or of Ethernet controllers. Initially, I was quite confused about the ring architecture, descriptors and buffers; I could not fit everything into a big picture and convince myself that it works. I managed to learn about them after spending some two weeks looking for documents on how TX/RX rings are organized. So the first thing was to document how a TX/RX ring is organized, and it's well described here.

Solaris now has support for the Broadcom 100Base-T Fast Ethernet controller. It is a somewhat old Ethernet controller but a popular one; moreover, it makes more sense on netbooks than laptops. This chip has only one TX and one RX ring. The number of descriptors is programmable, and it supports multicast through a CAM (Content Addressable Memory with 64 entries). It does not support jumbo frames though, hence the MTU is 1500. Having integrated bfe into Solaris Nevada the other day, my next target is to add support for Atheros/Attansic Ethernet controllers. They come in three flavors:

- Atheros/Attansic L2 Fast Ethernet as device-id 0x2048

- Atheros/Attansic's AR8121/AR8113 PCI-E Ethernet Controller as device-id 0x1026

- Atheros/Attansic L1 Gigabit Ethernet 10/100/1000 Base as device-id 0x1048

The plan is to support all three chips in atge (a new device driver, SUNWatge). I have started the work and I expect it to be complete in a two-to-three-month timeframe.

/Saurabh

http://saurabhslr.blogspot.com

Sunday, April 19, 2009

Install-Time-Update (ITU) and Driver Binding in Solaris

If you ever wonder how to create install-time driver updates for Solaris 10 and Nevada, you may want to read this blog entry, as it involves a few tricks here and there. There are two ways to make your device work with Solaris. The install-time-update (aka ITU, DU or ITU diskette) is only required when the disk drive will become the Solaris boot drive. For all other cases, you should be able to generate a package and run the pkgadd(1M) command to install the driver package on a running Solaris.

ITU Method

In order to install Solaris onto a bootable drive supported by your driver, you can use an Install Time Update (ITU). The ITU must contain your driver (both 32-bit and 64-bit binaries) and the PCI-IDs of the devices your driver supports.

How to construct an ITU

  • Make sure you have Solaris 10 and Nevada binaries of your driver for both the 32-bit and 64-bit operating system, plus the your_driver.conf (driver configuration) file. You can get the pkg_drv(1M) command by installing the SUNWpkgd package from this link.

    In order to create an ITU for Solaris 10 and Nevada, you will want to create two directories and run pkg_drv(1M) in each.

For Solaris 10

# mkdir -p /var/tmp/your_driver.5.10
# cd /var/tmp/your_driver.5.10

Copy your driver and your_driver.conf file in the current directory.

# mkdir -p kernel/drv/amd64
# cp <32-bit> .
# cp <32-bit> kernel/drv/your_driver
# cp <64-bit> kernel/drv/amd64
# cp your_driver.conf .
# pkg_drv -i '"pciVVVV,DDDD.SSSS.ssss"' -o `pwd`/PKG -c scsi -r 5.10 your_driver

VVVV = Vendor-id
DDDD = Device-id
SSSS = Subsystem-vendor-id
ssss = Subsystem-device-id
PKG = the output package directory for your_driver
'-c scsi' specifies the device class; in this example we have been discussing a disk drive.

The output of pkg_drv(1M) will resemble the following:

input file: drv=your_driver
input file: conf=your_driver.conf
WARNING: pkg_drv: pkg/driver name exists in /etc/driver_aliases
Suggested Package Naming Conventions: 8 characters, with the first capitalized characters uniquely specifying the company (e.g. stock market ticker). The remaining characters specify the driver (e.g. SUNWcadd for a CAD driver from Sun Microsystems). The driver name must be unique across all Solaris platforms and releases.

## Building pkgmap from package prototype file.
## Processing pkginfo file.
## Attempting to volumize 8 entries in pkgmap.
part 1 -- 276 blocks, 29 entries
## Packaging one part.
/tmp/12546/PKG/pkgmap
/tmp/12546/PKG/pkginfo
/tmp/12546/PKG/reloc/boot/solaris/devicedb/master
/tmp/12546/PKG/install/copyright
/tmp/12546/PKG/install/depend
/tmp/12546/PKG/install/i.master
/tmp/12546/PKG/reloc/kernel/drv/your_driver
/tmp/12546/PKG/reloc/kernel/drv/your_driver.conf
/tmp/12546/PKG/install/postinstall
/tmp/12546/PKG/install/postremove
/tmp/12546/PKG/install/r.master
## Validating control scripts.
## Packaging complete.
output pkg: See package directory PKG in /tmp/12546
pkg_drv: 2 warnings 0 errors


bash-3.2# find /tmp/12546
/tmp/12546
/tmp/12546/PKG
/tmp/12546/PKG/pkgmap
/tmp/12546/PKG/pkginfo
/tmp/12546/PKG/reloc
/tmp/12546/PKG/reloc/boot
/tmp/12546/PKG/reloc/boot/solaris
/tmp/12546/PKG/reloc/boot/solaris/devicedb
/tmp/12546/PKG/reloc/boot/solaris/devicedb/master
/tmp/12546/PKG/reloc/kernel
/tmp/12546/PKG/reloc/kernel/drv
/tmp/12546/PKG/reloc/kernel/drv/your_driver
/tmp/12546/PKG/reloc/kernel/drv/your_driver.conf
/tmp/12546/PKG/install
/tmp/12546/PKG/install/copyright
/tmp/12546/PKG/install/depend
/tmp/12546/PKG/install/i.master
/tmp/12546/PKG/install/postinstall
/tmp/12546/PKG/install/postremove
/tmp/12546/PKG/install/r.master

Copy the following files from '/tmp/12546' as follows :-

# cd /var/tmp/your_driver.5.10
# cp /tmp/12546/PKG/pkgmap .
# cp /tmp/12546/PKG/install/postinstall .
# cp /tmp/12546/PKG/install/postremove .
# cp /tmp/12546/PKG/install/copyright .

You can run the 'pkgproto' command or make a prototype file manually:

bash-3.2# cat > prototype
i copyright
i postremove
i postinstall
i pkginfo
d none kernel 0755 root sys
d none kernel/drv 0755 root sys
d none kernel/drv/amd64 0755 root sys
f none kernel/drv/amd64/your_driver 0644 root sys
f none kernel/drv/your_driver 0644 root sys
f none kernel/drv/your_driver.conf 0644 root sys

Make sure you include both the 32-bit and 64-bit binaries of your driver. Once this is done, we construct the package again so that it includes the 64-bit binary of the driver.

# cd /var/tmp/your_driver.5.10
# pkgmk -r . -d /tmp

This will create the 'PKG' directory under /tmp, and that's where the package is. For example:

bash-3.2# pkgmk -r . -d /tmp
## Building pkgmap from package prototype file.
## Processing pkginfo file.
## Attempting to volumize 6 entries in pkgmap.
part 1 -- 444 blocks, 23 entries
## Packaging one part.
/tmp/PKG/pkgmap
/tmp/PKG/pkginfo
/tmp/PKG/install/copyright
/tmp/PKG/reloc/kernel/drv/amd64/your_driver
/tmp/PKG/reloc/kernel/drv/your_driver
/tmp/PKG/reloc/kernel/drv/your_driver.conf
/tmp/PKG/install/postinstall
/tmp/PKG/install/postremove
## Validating control scripts.
## Packaging complete.
bash-3.2#

Do the following to repack the package as a DU (diskette):

# cd /tmp
# find PKG -print | cpio -o > /tmp/pkg_of_your_driver
# compress /tmp/pkg_of_your_driver
# cd /var/tmp/your_driver.5.10
# cp /tmp/pkg_of_your_driver.Z PKG/DU/sol_210/i86pc/Product/your_driver.Z

For Solaris Nevada

Repeat the same steps as for Solaris 10, except for the following:

  • Create a new directory '/var/tmp/your_driver.5.11', since you are working on Solaris Nevada. Make sure the pkg_drv(1M) command runs with '-r 5.11'.

  • When copying your_driver.Z to the DU, make sure you change 'sol_210' to 'sol_211' in the path 'PKG/DU/sol_210/i86pc/Product/your_driver.Z'.

Once you have created the ITUs for Solaris 10 and Nevada, bundle them on one DVD/CD (or in an ISO file). In the directories '/var/tmp/your_driver.5.11' and '/var/tmp/your_driver.5.10' you will find a directory called 'PKG'. You must copy the files under 'PKG' into one directory in order to bundle them together.

# mkdir -p /var/tmp/YOUR_DRIVER-DU
# cd /var/tmp/YOUR_DRIVER-DU
# cp -rf /var/tmp/your_driver.5.11/PKG/* .
# cp -rf /var/tmp/your_driver.5.10/PKG/* .


Please run the following command to make an ISO file from the directory /var/tmp/YOUR_DRIVER-DU :

# mkisofs -o your_driver.iso -r /var/tmp/YOUR_DRIVER-DU

This will create an ISO file 'your_driver.iso', and a DVD/CD can be burned by running the following command:

# cdrw -i /var/tmp/YOUR_DRIVER-DU/your_driver.iso

To install Solaris on a boot drive, use the Solaris installer DVD and choose option '5' (Apply Driver Updates). Follow the instructions when prompted.

The other way is to bundle the device driver into the Solaris bootable media itself, or for network installation. Follow the instructions described at this link; it describes how to pack/unpack the Solaris miniroot in order to make changes to Solaris bootable media.

Driver Binding in Solaris

Driver binding in Solaris is not so easy to understand. Solaris binds a driver based on precedence, and this precedence list is maintained in the 'compatible' property of the device node. The two functions responsible for creating the 'compatible' property and finding the correct binding for the driver are add_compatible() and ddi_compatible_driver_major() respectively.

The responsibility of the add_compatible() function is to create the 'compatible' property for driver binding in the order described below. For a PCI card, the precedence is created as follows:

* pciVVVV,DDDD.SSSS.ssss.RR (0)
* pciVVVV,DDDD.SSSS.ssss (1)
* pciSSSS,ssss (2)
* pciVVVV,DDDD.RR (3)
* pciVVVV,DDDD (4)
* pciclass,CCSSPP (5)
* pciclass,CCSS (6)

For a PCI Express card, the precedence looks like this:

* pciexVVVV,DDDD.SSSS.ssss.RR (0)
* pciexVVVV,DDDD.SSSS.ssss (1)
* pciexVVVV,DDDD.RR (2)
* pciexVVVV,DDDD (3)
* pciexclass,CCSSPP (4)
* pciexclass,CCSS (5)
* pciVVVV,DDDD.SSSS.ssss.RR (6)
* pciVVVV,DDDD.SSSS.ssss (7)
* pciSSSS,ssss (8)
* pciVVVV,DDDD.RR (9)
* pciVVVV,DDDD (10)
* pciclass,CCSSPP (11)
* pciclass,CCSS (12)

RR = Revision number
CC = Class code
SS = Subclass code
PP = Programming interface
(0) = highest precedence
(12) = lowest precedence

You can see the 'compatible' property by running the 'prtconf -vp' command. If Solaris fails to find a binding using the 'compatible' property, it then tries the 'nodename', which is constructed from the Subsystem-vendor-id (SSSS) and Subsystem-device-id (ssss) of the device. The PCI-IDs we have been looking at here are read from the PCI config space of the device.
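As a worked example, a hypothetical PCI NIC with vendor-id 0x168c, device-id 0x1026, subsystem-vendor-id 0x1043, subsystem-id 0x8226, revision 0x10 and class code 0x020000 (network/Ethernet) would get a 'compatible' property of:

* pci168c,1026.1043.8226.10
* pci168c,1026.1043.8226
* pci1043,8226
* pci168c,1026.10
* pci168c,1026
* pciclass,020000
* pciclass,0200

and an /etc/driver_aliases entry of, say, 'mydrv "pci168c,1026"' would bind the driver at precedence (4).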

Device drivers and device firmware must make sure that proper PCI-IDs are chosen to avoid conflicts with existing PCI-IDs. If your device is a PCI Express card, then you must add PCI-IDs of the 'pciexVVVV,DDDD.SSSS' form to /etc/driver_aliases, or via the add_drv(1M) or pkg_drv(1M) command.

Friday, March 20, 2009

Latency group (lgroup) in Solaris on NUMA aware machines

All of you would have heard about NUMA (non-uniform memory access) machines. I'm going to describe how the memory latency groups (called lgroups in Solaris) are laid out. While working on the Multi-CPU binding project, I had to learn these aspects in order to implement choosing, for a thread, the lgroup with the least latency from its earlier home lgroup.

The figure below describes how the lgroup structures are laid out on SPARC-based NUMA-aware machines. The root lgroup (0) is the topmost level of the hierarchy, containing all the resource sets in the system. lgroups 1, 2 and 3 have four CPUs each (one system board) and are leaf nodes in this case. On SPARC, the remote latency from lgroup 2 to 1 or 3 is the same, i.e. they are equidistant, with one local and one remote latency. In Solaris, we have something called the lgroup partition load (lpl_t), which represents the leaf nodes having CPUs and memory. Each cpu_t (CPU structure) has a cpu_lpl. lpls are also used when CPU partitions are created (processor sets are the best example). There's a global table of lgroups called lgrp_table[]. Each partition has its lpls in cp_lgrploads[] (cpupart_t). Both tables are indexed by lgroup id. A thread is homed to an lpl within its CPU partition.


On a 4-way amd64, the lgroup representation is quite interesting, as we have local latency plus remote latencies of one and two hops. For example, psrinfo(1M) revealed this:
0 on-line since 06/09/2006 06:49:25
1 on-line since 06/09/2006 06:49:31
2 on-line since 06/09/2006 06:49:33
3 on-line since 06/09/2006 06:49:35

Each CPU is a leaf lgroup. The diagram below explains this very well. In this kind of configuration, we have non-leaf nodes 5, 6, 7 and 8 representing resource sets which are one hop away. For example, lgroup 5 contains 1, 2 and 3 (local plus one hop away from lgroup 1). The root lgroup (0) contains everything.

On SPARC we have two levels of memory hierarchy, whereas a 4-way amd64 has three levels, and an 8-way amd64 should have four. Scheduling of a thread starts from its home lgroup and goes up the hierarchy. For example, if the home of a thread (t->t_lpl) is lgroup 1 (CPU 0 being the resource set), we first look at CPU 0, and if the thread can't run there, we look at the parent of lgroup 1 (lpl_parent), which is lgroup 5 with 1, 2 and 3 as resource sets. The same is true when an idle thread steals work from other CPUs: locality is kept in mind.

The lgroup hierarchical representation is more interesting when there are three hops (for example on an 8-way amd64 box); I'll leave that for next time. Thanks to Jonathan Chew for taking the time to explain all this. I thought it was worth blogging about since it's a somewhat complex design.

Solaris APIC implementation with respect to MSI/MSI-x interrupts

Here's some basic information on the APIC before we dive into Solaris details; if you want more detail on the APIC, you can refer to this Wiki. The Solaris details are based on Solaris Nevada build 84.

What's Local APIC

The local APIC (LAPIC) is part of the CPU chip. It (a) contains the mechanism for generating/accepting interrupts, (b) provides a timer, (c) manages all external interrupts for the processor, and (d) accepts and generates inter-processor interrupts (IPIs).

What's IOAPIC

This is a separate chip that is wired to the local APICs so that it can forward interrupts to the appropriate CPU (via that CPU's local APIC).

What's Local APIC Table

Interrupt vectors are numbered 0x00 through 0xFF in the APIC, and 0x00...0x1F are reserved for exceptions. The interrupt vectors in the range 0x20...0xFF are available for programming interrupts in the APIC. Like the local APIC, the IOAPIC assigns a priority to an interrupt based on the vector number: it uses the top 4 bits of the vector number as the priority and ignores the lower 4 bits. For example, if the vector number is 0x3F then the priority is 0x3. In Solaris, the priority mask is represented by APIC_IPL_MASK (0xF0) and the vector mask by APIC_VECTOR_MASK (0x0F).

Since we can't use the vector range 0x00...0x1F, Solaris defines APIC_BASE_VECT (0x20) as the base vector and APIC_MAX_VECTOR (0xFF) as the maximum vector number in the local APIC. APIC_AVAIL_VECTOR is calculated from this formula:

APIC_AVAIL_VECTOR = APIC_MAX_VECTOR + 1 - APIC_BASE_VECT, which translates to 0xFF + 1 - 0x20 = 0xE0, i.e. 224 vectors in decimal.

Note that vectors are grouped into 16 priority groups of 0x10 vectors each; the 16 vectors within a group share the same priority. For example, vector 0x61 (seen in the ::interrupts output below) belongs to the group 0x60...0x6F.

APIC Data Structures in Solaris

Here is the big picture on how the various APIC data structures are related to each other. These data structures are described below :-




apic_irq_table[] - Holds all IRQ entries. Each entry is of type apic_irq_t, and the total size of the table is APIC_MAX_VECTOR + 1. Note that the IRQ number has no meaning with respect to MSI/MSI-X.

A typical apic_irq_t entry in the apic_irq_table[] looks like this:

> ::interrupts
IRQ Vect IPL Bus Trg Type CPU Share APIC/INT# ISR(s)
22 0x61 6 PCI Lvl Fixed 1 2 0x0/0x16 bge_intr, ata_intr

> apic_irq_table+(0t22*8)/J
apic_irq_table+0xb0: fffffffec10d7f38

> fffffffec10d7f38::print apic_irq_t
{
airq_mps_intr_index = 0xfffd
airq_intin_no = 0x16 // set since it's FIXED type interrupt.
airq_ioapicindex = 0
airq_dip = 0xfffffffec01fd9c0 // dev info
airq_major = 0xca
airq_rdt_entry = 0xa061
airq_cpu = 0x1
airq_temp_cpu = 0x1
airq_vector = 0x61 // note that it matches with ::interrupts output
airq_share = 0x2 // two interrupts are sharing the same IRQ and vector
airq_share_id = 0
airq_ipl = 0x6 // IPL
airq_iflag = {
intr_po = 0x3
intr_el = 0x3
bustype = 0xd
}
airq_origirq = 0xa
airq_busy = 0
airq_next = 0
}
> 0xfffffffec01fd9c0::print 'struct dev_info' ! grep name
devi_binding_name = 0xfffffffec01fcf88 "pci-ide"
devi_node_name = 0xfffffffec01fcf88 "pci-ide"
devi_compat_names = 0xfffffffec0206940 "pci1002,4379.1025.10a.80"
devi_rebinding_name = 0
>

apic_ipltopri[] - This array maps a Solaris IPL to an APIC priority. For example:

> apic_ipltopri::print
[ 0x10, 0x20, 0x20, 0x20, 0x30, 0x50, 0x70, 0x80, 0x80, 0x80, 0x90, 0xa0, 0xb0, 0xc0, 0xd0,
0xf0, 0 ]
>

Note the order of priority assignment: higher vector ranges are assigned to higher IPLs. Also note that 0x20 is stored at indexes 1, 2 and 3, which means that IPLs 1, 2 and 3 share the same vector range 0x20...0x2F.

apic_ipltopri[] is declared as:

uchar_t apic_ipltopri[MAXIPL + 1]; /* unix ipl to apic pri */

apic_vectortoipl[] - This array is a bit complex. Its main purpose is to initialize the apic_ipltopri[] array:

apic_init()
{
	[.]
	apic_ipltopri[0] = APIC_VECTOR_PER_IPL; /* leave 0 for idle */
	for (i = 0; i < (APIC_AVAIL_VECTOR / APIC_VECTOR_PER_IPL); i++) {
		if ((i < ((APIC_AVAIL_VECTOR / APIC_VECTOR_PER_IPL) - 1)) &&
		    (apic_vectortoipl[i + 1] == apic_vectortoipl[i]))
			/* get to highest vector at the same ipl */
			continue;
		for (; j <= apic_vectortoipl[i]; j++) {
			apic_ipltopri[j] = (i << APIC_IPL_SHIFT) +
			    APIC_BASE_VECT;
		}
	}
	[.]
}

uchar_t apic_vectortoipl[APIC_AVAIL_VECTOR / APIC_VECTOR_PER_IPL] = {
	3, 4, 5, 5, 6, 6, 9, 10, 11, 12, 13, 14, 15, 15
};

Note that IPL 5 shares the vector range 0x40...0x5F (or 0x20...0x3F after the optimization mentioned below), and that's why vector indexes 2 and 3 have IPL 5. Similarly, vector indexes 4 and 5 have IPL 6 (0x60...0x7F, or 0x40...0x5F after the optimization).

/*
 * IPL		Vector range.		as passed to intr_enter
 * 0		none.
 * 1,2,3	0x20-0x2f		0x0-0xf
 * 4		0x30-0x3f		0x10-0x1f
 * 5		0x40-0x5f		0x20-0x3f
 * 6		0x60-0x7f		0x40-0x5f
 * 7,8,9	0x80-0x8f		0x60-0x6f
 * 10		0x90-0x9f		0x70-0x7f
 * 11		0xa0-0xaf		0x80-0x8f
 * ...		...
 * 15		0xe0-0xef		0xc0-0xcf
 * 15		0xf0-0xff		0xd0-0xdf
 */

apic_vector_to_irq[] - This array gives the IRQ number for a given vector number. If an element of this array contains APIC_RESV_IRQ (0xFE), the vector is free and can be allocated. The apic_navail_vector() function checks this array to figure out how many vectors are available.
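As a sketch of the idea (not the actual kernel source), counting the free vectors for a given IPL could look like this; the range computation mirrors the highest/lowest logic shown later in this post:

static int
xxxx_navail_sketch(int ipl)
{
	/* Vector range for this IPL; see the highest/lowest logic below. */
	int lowest = apic_ipltopri[ipl - 1] + APIC_VECTOR_PER_IPL;
	int highest = apic_ipltopri[ipl] + APIC_VECTOR_MASK;
	int i, navail = 0;

	for (i = lowest; i <= highest; i++) {
		if (apic_vector_to_irq[i] == APIC_RESV_IRQ)
			navail++;	/* this slot is free */
	}
	return (navail);
}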

Here's an example of how an IPL is mapped to a vector priority in Solaris:

Let's say we got a network interrupt at IPL 6 (ath - the wifi interrupt) with vector number 0x60. Solaris will now block all interrupts at and below IPL 6, which is done by the apic_intr_enter() function. The caller of this function actually subtracts 0x20 (APIC_BASE_VECT) from the vector number - an optimization - but let's come to the point: the apic_ipls[] array is used to get the IPL which will be programmed into the APIC register. So we first get nipl as

nipl = apic_ipls[vector]; // vector is 0x40 not 0x60 as mentioned above and nipl will be 0x6
*vectorp = irq = apic_vector_to_irq[vector + APIC_BASE_VECT]; // This is done to get actual vector and irq.

and then this statement blocks all the interrupts at and below the vector priority (or IPL).

apicadr[APIC_TASK_REG] = apic_ipltopri[nipl];

So we write 0x70 to the APIC task register to block interrupts. Note that Solaris uses the range 0x60...0x7F for IPL 6:

* IPL Vector range. as passed to apic_intr_enter()
* 6 0x60-0x7f 0x40-0x5f

and it does not matter whether you write 0x70 or 0x7F; they do the same thing, which is to block interrupts at IPL 6 and below.

Solaris x86 Interrupt Handling

Now that we have glanced through the data structures involved, let's look at how Solaris x86 handles an interrupt. I prefer to describe interrupt handling before interrupt allocation because I feel it is easier to understand.

Let's first go through how Solaris x86 is designed in terms of psm_ops. For example, PCI Express has its own psm_ops, which is apic_ops, and PCI has its own psm_ops, which is uppc_ops. In fact, xVM (the Xen-based hypervisor) has its own psm_ops called xen_psm_ops. psm_install() is responsible for installing a psm in the Solaris x86 world.

apic_probe_common() is what gets called when psm_install() jumps into psm_probe() for each psm_ops. apic_probe_common() does many things, one of them being mapping 'apicadr[]' (you have seen this before; I referred to it when setting the APIC priority, i.e. the task register). The apic_cpus[] array also gets initialized via ACPI, i.e. acpi_probe(), because the ACPI tables have all the information such as local APIC CPU id, version etc.

Now let's see what happens when the local APIC delivers an interrupt. The interrupt could come from an IOAPIC or be an MSI/MSI-X generated interrupt (an in-band message). Solaris calls cmnint() or _interrupt(); these are the same and call do_interrupt() once the register state is set up. do_interrupt() first sets the PIL so that the CPU does not get any interrupt at or below that PIL. Raising the priority of the CPU is done using the setlvl function pointer; this pointer is set to the appropriate psm_ops's psm_intr_enter, and in our case it will be apic_intr_enter(). Then comes the interrupt dispatch, done by calling switch_sp_and_call() once the interrupt thread's stack is set up. Recall that Solaris handles interrupts in thread context if the PIL is at or below LOCK_LEVEL (0xa); high-level interrupts (above 0xa, i.e. 0xb...0xf) are handled on the current thread's stack.

switch_sp_and_call() can dispatch three types of interrupts: (a) software interrupts, (b) high-level interrupts and (c) normal device interrupts.

In our example we have been looking at the wifi interrupt, which is (c) and maps to the dispatch_hardint() routine. dispatch_hardint() calls av_dispatch_autovect() after enabling interrupts. Now that we are touching the av_dispatch_autovect() routine, I must explain the autovect[] array. If you remember add_avintr(), which is responsible for registering a hardware interrupt handler, then I think you can skip this part. autovect[] has MAX_VECT (256) elements, and each element is of type 'struct av_head'. The first pointer in 'struct av_head' points to a 'struct autovec', and the autovec structure has all the information about the interrupt handler: the arguments passed to it, its priority level etc. Note that more than one interrupt handler can share the same vector; they are linked by 'av_link' in 'struct autovec'. For example:

> ::interrupts
IRQ Vect IPL Bus Trg Type CPU Share APIC/INT# ISR(s)
22 0x61 6 PCI Lvl Fixed 1 2 0x0/0x16 bge_intr, ata_intr

> ::sizeof 'struct av_head'
sizeof (struct av_head) = 0x10

> autovect+(0x10*0t22)=J // Take the IRQ and index into autovect[] array.
fffffffffbc52ba0

> fffffffffbc52ba0::print 'struct av_head'
{
avh_link = 0xfffffffec50d2cc0
avh_hi_pri = 0x6 // take a look at bge_intr() and its priority below
avh_lo_pri = 0x5 // take a look at ata_intr() and its priority below
}

> 0xfffffffec50d2cc0::print 'struct autovec'
{
av_link = 0xfffffffec10d2f40
av_vector = bge_intr
av_intarg1 = 0xfffffffec50d5000
av_intarg2 = 0
av_ticksp = 0xfffffffec506ae20
av_prilevel = 0x6
av_intr_id = 0xfffffffec537a078
av_dip = 0xfffffffec01f8400
}

> 0xfffffffec10d2f40::print 'struct autovec'
{
av_link = 0
av_vector = ata_intr
av_intarg1 = 0xfffffffec00bc8c0
av_intarg2 = 0
av_ticksp = 0xfffffffec0528898
av_prilevel = 0x5
av_intr_id = 0xfffffffec10cbe78
av_dip = 0xfffffffec01fd9c0
}
>

Here's a DTrace example of the dispatch we have been discussing:

bash-3.00# dtrace -n av_dispatch_autovect:entry'/`autovect[args[0]].avh_link->av_vector/{@[args[0]]=count(); printf("%a, %x", `autovect[args[0]].avh_link->av_vector, args[0])}'

1 2391 av_dispatch_autovect:entry ath`ath_intr, 13
1 2391 av_dispatch_autovect:entry ath`ath_intr, 13
1 2391 av_dispatch_autovect:entry ath`ath_intr, 13
1 2391 av_dispatch_autovect:entry ath`ath_intr, 13

There is a very interesting blog by Anish at this link on APIC and Solaris x86 interrupt handling.

How the Solaris APIC implementation allocates interrupts

Now that we have looked at how the APIC is structured in Solaris x86 and how interrupts are handled, let's look at how interrupts are allocated. There are three types of interrupts - DDI_INTR_TYPE_FIXED, DDI_INTR_TYPE_MSI and DDI_INTR_TYPE_MSIX, in the order they evolved. The Solaris DDI routine ddi_intr_get_supported_types() can be called to retrieve the interrupt types supported by the bus.
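For reference, here is a hedged sketch of the driver-facing side of this allocation dance using the standard DDI interrupt routines. The xxxx_* names, the htable field (an array of ddi_intr_handle_t) and the single-vector request are my own placeholders, and xxxx_intr is assumed to be a ddi_intr_handler_t-style ISR:

static int
xxxx_add_intrs(xxxx_t *xxxxp, dev_info_t *dip, int type)
{
	int types, navail, actual;
	uint_t pri;

	if (ddi_intr_get_supported_types(dip, &types) != DDI_SUCCESS ||
	    (types & type) == 0)
		return (DDI_FAILURE);

	if (ddi_intr_get_navail(dip, type, &navail) != DDI_SUCCESS ||
	    navail == 0)
		return (DDI_FAILURE);

	/* Ask for one vector; the PSM may give us fewer than requested. */
	if (ddi_intr_alloc(dip, xxxxp->htable, type, 0, 1, &actual,
	    DDI_INTR_ALLOC_NORMAL) != DDI_SUCCESS || actual == 0)
		return (DDI_FAILURE);

	/* This priority is the 'pri' seen by apic_alloc_msi*_vectors(). */
	(void) ddi_intr_get_pri(xxxxp->htable[0], &pri);

	(void) ddi_intr_add_handler(xxxxp->htable[0], xxxx_intr,
	    (caddr_t)xxxxp, NULL);
	(void) ddi_intr_enable(xxxxp->htable[0]);

	return (DDI_SUCCESS);
}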

For MSI, apic_alloc_msi_vectors() gets called, and for MSI-X, apic_alloc_msix_vectors() gets called, to allocate the appropriate number of interrupt vectors. Note that MSI supports up to 32 vectors per device function and MSI-X supports up to 2048; however, in Solaris x86 we support only 2 MSI-X interrupt vectors per device by default (my reason for studying APIC and MSI-X). On SPARC, Solaris supports far more MSI-X interrupts, configured by the #msix-request property in DDI. The hard limit is determined by the i_ddi_get_msix_alloc_limit() function; however, even on SPARC it seems we limit it to 8.

msix_alloc_limit = MAX(DDI_MAX_MSIX_ALLOC, ddi_msix_alloc_limit);

/* Default number of MSI-X resources to allocate */
#define DDI_DEFAULT_MSIX_ALLOC 2

/* Maximum number of MSI-X resources to allocate */
#define DDI_MAX_MSIX_ALLOC 8

These limits will change when the Interrupt Resource Management (IRM) framework is integrated into Solaris.

Anyway, let's get back to the topic. Depending upon the interrupt type and the bus's interrupt ops, Solaris will jump to the appropriate interrupt ops. In our case, ddi_intr_alloc(9F) gets us into pci_common_intr_ops() to allocate the interrupts with cmd DDI_INTROP_ALLOC. We will not get into FIXED-type interrupts, as they are hard-wired via the IOAPIC and fairly easy (I suppose). It's psm_intr_ops which goes into action with cmd PSM_INTR_OP_ALLOC_VECTORS, and we end up in apic_intr_ops().

apic_intr_ops()
{
	[.]
	case PSM_INTR_OP_ALLOC_VECTORS:
		if (hdlp->ih_type == DDI_INTR_TYPE_MSI)
			*result = apic_alloc_msi_vectors(dip, hdlp->ih_inum,
			    hdlp->ih_scratch1, hdlp->ih_pri,
			    (int)(uintptr_t)hdlp->ih_scratch2);
		else
			*result = apic_alloc_msix_vectors(dip, hdlp->ih_inum,
			    hdlp->ih_scratch1, hdlp->ih_pri,
			    (int)(uintptr_t)hdlp->ih_scratch2);
		break;
	[.]
}


apic_alloc_msi_vectors() - This function allocates 'count' vectors for the device; 'count' has to be a power of 2, and the priority is passed in by the caller. The first thing this function does is check whether we have enough vectors available at that priority to satisfy the request, which is done by the routine apic_navail_vector(). Then we search for contiguous vectors, and the value returned by apic_find_multi_vectors() is our starting point. It seems MSI has this constraint of requiring contiguous vectors; I don't know why.

The next step is to check whether we have enough free IRQs in the apic_irq_table[]; this is done by the function apic_check_free_irqs(). If we succeed in finding enough IRQ entries in the table, apic_alloc_msi_vectors() proceeds to allocate an IRQ, which is done by apic_allocate_irq(). The IRQ number returned by this function is what eventually indexes the autovect[] table for the appropriate vector. We will get back to autovect[] soon, but for now let's see how the CPU is selected. The selection of a CPU for this IRQ is done by apic_bind_intr() for the first interrupt of the 'count' vectors, and subsequent vectors are bound to the same CPU. These steps are done in a loop 'count' times.

Now that we have set up the IRQ in the apic_irq_table[] with priority, vector, target CPU etc., we are set to enable the interrupt. By the way, all of this is typically done in the driver's attach(9E) entry point, in two phases: (i) add interrupts by allocating them, and (ii) enable interrupts.

apic_alloc_msix_vectors() - This function does similar work as for MSI interrupts, except that we allocate a vector (apart from allocating the IRQ entry in the apic_irq_table[]) and bind the interrupt to a CPU by calling apic_bind_intr() for each of the 'count' requests. MSI-X does not have the contiguous-vector limitation that MSI has. Vector allocation is done by the routine apic_allocate_vector(), which returns a free vector by walking the apic_vector_to_irq[] table and looking for an APIC_RESV_IRQ slot. The range is determined by the priority passed to it. For example, if the priority passed is 6, then the range would be

highest = apic_ipltopri[ipl] + APIC_VECTOR_MASK;
lowest = apic_ipltopri[ipl - 1] + APIC_VECTOR_PER_IPL;

if (highest < lowest)	/* Both ipl and ipl - 1 map to same pri */
	lowest -= APIC_VECTOR_PER_IPL;

highest is 0x7F (0x70 + 0x0F) and lowest is 0x60 (0x50 + 0x10), which matches our observation at the beginning of the blog.

A typical flow of this dance is as follows :-

1 22557 apic_alloc_msix_vectors:entry name pciex8086,10a7, inum : 0, count : 2, pri :6
pcplusmp`apic_intr_ops+0x114
npe`pci_common_intr_ops+0x8f1
npe`npe_intr_ops+0x21
unix`i_ddi_intr_ops+0x54
unix`i_ddi_intr_ops+0x54
genunix`ddi_intr_alloc+0x263
igb`igb_alloc_intrs_msix+0x134
igb`igb_alloc_intrs+0x64
igb`igb_attach+0xcb
genunix`devi_attach+0x87

1 22485 apic_navail_vector:entry name : pciex8086,10a7, pri 6
1 22486 apic_navail_vector:return 31
1 22547 apic_allocate_irq:entry 72
1 22419 apic_find_free_irq:entry start :72, end : 253
1 22417 apic_find_io_intr:entry 72
1 22548 apic_allocate_irq:return 72
1 22479 apic_allocate_vector:entry ipl : 6, irq: 72, pri: 1
1 22480 apic_allocate_vector:return 96
1 22473 apic_bind_intr:entry name : pciex8086,10a7, irq 72
1 22474 apic_bind_intr:return 0

Now let's talk about how a driver enables interrupts once they are allocated. Interrupts can be enabled as a block (more than one at once, via ddi_intr_block_enable(9F)) or by explicitly calling ddi_intr_enable(9F) for each interrupt; we will discuss ddi_intr_enable(9F). Once again we end up in pci_common_intr_ops(), which calls pci_enable_intr(); this mainly does two things:

- Translate the interrupt if needed. This is done by apic_introp_xlate(). If the interrupt is MSI or MSI-X, we call apic_setup_irq_table() if the IRQ entry in the apic_irq_table[] is not yet set up. In our example we have already done this, so apic_introp_xlate() just returns the IRQ number from 'apic_vector_to_irq[airqp->airq_vector]'; airqp is an entry in the apic_irq_table[] which gets assigned by calling apic_find_irq().

- Add the interrupt handler by calling add_avintr(). We have already touched on this routine in this blog, but it is worth noting that this is the point in the life cycle of setting up interrupts where we bind an interrupt handler (ISR, or Interrupt Service Routine) to a vector. The main task of add_avintr() is to insert an 'autovec' at the appropriate index by calling insert_av(). The other, and most important, thing is to program the interrupt, which is done by addspl(). addspl() is another function pointer from the family of setlvl, setspl etc. In the APIC case it will be apic_addspl(), which is just a wrapper over apic_addspl_common(). Four arguments are passed to it:

apic_addspl_common(int irqno, int ipl, int min_ipl, int max_ipl)

We first get the pointer from apic_irq_table[] by indexing with irqno, and check whether we need to upgrade the vector, or just check the IPL in case this interrupt needs to be shared. Eventually we end up in apic_setup_io_intr(), which does the main task. In fact apic_rebind(), called from apic_setup_io_intr(), is what binds an interrupt to a CPU. Since we are discussing MSI/MSI-X, once apic_rebind() has done its sanity checks, it will call apic_pci_msi_enable_vector(). The following statements are what we write to program the interrupt:

/* MSI Address */
msi_addr = (MSI_ADDR_HDR | (target_apic_id << MSI_ADDR_DEST_SHIFT));
msi_addr |= ((MSI_ADDR_RH_FIXED << MSI_ADDR_RH_SHIFT) |
    (MSI_ADDR_DM_PHYSICAL << MSI_ADDR_DM_SHIFT));

/* MSI Data: MSI is edge triggered according to spec */
msi_data = ((MSI_DATA_TM_EDGE << MSI_DATA_TM_SHIFT) | vector);

apic_pci_msi_enable_mode() is also called from apic_rebind() to enable the interrupt once it's programmed; that's how per-vector masking is controlled, I suppose.

Since we have touched on how an interrupt is bound to a CPU, I should also mention how Solaris selects the CPU to bind an interrupt to. The routine apic_bind_intr() is responsible for this, and the decision is based on the value of the tunable 'apic_intr_policy'. There are three types of policy: (a) INTR_ROUND_ROBIN_WITH_AFFINITY - a round-robin and affinity-based policy which returns the same CPU for the same dip (or device); this is the default. (b) INTR_LOWEST_PRIORITY - I can't say, because it's not implemented. (c) INTR_ROUND_ROBIN - select the CPU in round-robin fashion using the 'apic_next_bind_cpu' global variable. Choosing between INTR_ROUND_ROBIN_WITH_AFFINITY and INTR_ROUND_ROBIN may not be easy, but I think the decision should be based on throughput vs. locality awareness.

Monday, January 14, 2008

xVM experience so far

I recently configured xVM on Solaris with both HVM (hardware-assisted virtual machine) and PV (paravirtualized) guest (domU) domains. I could easily install Solaris 10 Update 5 as an HVM domU, boot it, configure a network interface and assign an IP. The plan is to have multiple domUs as a testbed running Solaris 10 and Solaris Nevada. This cuts down on machines, and sanity checks can be done quickly, as I don't have to install/boot the OS every time; I can easily run functional tests if not performance benchmarks. The performance of Solaris 10 as an HVM domain is not as good as Solaris Nevada (PV domU), especially when there is more than one VCPU, but I guess that's being worked on. I think the performance will drastically improve when we have PV (paravirtualized) drivers for Solaris 10. I'll soon experiment with installing xVM on my laptop and configuring Windows XP as an HVM domain.

Here's a small demo describing my experience so far with xVM :-

For installing the Solaris PV domU, I used this sample script:

bash-3.2# cat snv.1.py
name = 'solaris-pv'
memory = '1024'
vcpus = 4
# for installation
disk = [ 'file:/var/tmp/solarisdvd.iso,6:cdrom,r', 'phy:/dev/zvol/dsk/snv-pool/vol,0,w' ]
on_poweroff = 'restart'
on_reboot = 'restart'
on_crash = 'preserve'

In 'disk', you will see 'file:' and 'phy:', which specify what kind of media it is. After the location in 'disk', you also specify the type of access, like read (r) or write (w).

Once you run '# xm create script.py', you will see the OS installation screen. Once the installation completed, I used a similar script, but removed the solarisdvd.iso entry from 'disk' (mentioned in the .py file):

name = 'solaris-pv'
memory = '1024'
vcpus = 4
disk = [ 'phy:/dev/zvol/dsk/snv-pool/vol,0,w' ]
on_poweroff = 'destroy'
on_reboot = 'restart'
on_crash = 'preserve'
vif = [ 'mac=0:14:4f:2:12:35, ip=10.5.63.98, bridge=nge1' ]

With the 'vif' property you can specify which network interface you want. You can also set the 'config/default-nic' property in the xvm/xend service if you want to override the NIC. Finally, once you have booted the guest domain, you will see the interface as rtls0. You can run 'dladm show-dev' to see whether the network interface is really configured, and run ifconfig(1M) to plumb the interface.

You can see the resources of each domain as follows:

bash-3.2# xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 4973 4 r----- 4019.6
S10U5HVM 8 2056 1 r----- 40.8
solaris-pv 10 1024 1 r----- 5.0



I also found the following links very helpful as I learnt how to configure a domU:
Write-up from Chris Beal
Write-up from mbrowarski