Friday, March 20, 2009
Latency group (lgroup) in Solaris on NUMA aware machines
The figure below describes how the lgroup structures are laid out on SPARC-based NUMA-aware machines. The root lgroup (0) is the topmost level of the hierarchy and contains all the resource sets in the system. lgroups 1, 2 and 3 have four CPUs each (one per system board) and are leaf nodes in this case. On SPARC, the remote latency from lgroup 2 to 1 or 3 is the same, i.e., the leaves are equidistant and there are only two latencies, local and remote. In Solaris, we have something called the lgroup partition load (lpl_t), which represents the leaf nodes having CPUs and memory. Each cpu_t (CPU structure) has a cpu_lpl pointer. lpls are also used when CPU partitions are created (processor sets are the best example). There's a global table of lgroups called lgrp_table[]. Each partition has its lpls in cp_lgrploads[] (cpupart_t). Both tables are indexed by lgroup id. A thread is homed to an lpl within its CPU partition.
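To make these relationships concrete, here is a minimal sketch -- not the real kernel definitions; the names cpu_lpl, lpl_lgrpid, lpl_parent and lgrp_table follow the actual code, but the structures are stripped down and NLGRPS_MAX is an arbitrary size for the sketch :-

#define NLGRPS_MAX 64                   /* arbitrary size for this sketch */

typedef int lgrp_id_t;

struct lgrp {                           /* stands in for the kernel's lgrp_t */
        lgrp_id_t lgrp_id;              /* index into lgrp_table[] */
};

struct lpl {                            /* stands in for the kernel's lpl_t */
        lgrp_id_t   lpl_lgrpid;         /* leaf lgroup this lpl represents */
        struct lpl *lpl_parent;         /* next level up the hierarchy */
};

struct cpu {                            /* stands in for the kernel's cpu_t */
        struct lpl *cpu_lpl;            /* lpl of this CPU's leaf lgroup */
};

struct lgrp *lgrp_table[NLGRPS_MAX];    /* global table, indexed by lgroup id */

/* Given a CPU, find its leaf lgroup by going through its lpl. */
struct lgrp *
cpu_to_lgrp(struct cpu *cp)
{
        return (lgrp_table[cp->cpu_lpl->lpl_lgrpid]);
}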
On a 4-way amd64 machine, the lgroup representation is quite interesting: besides local access, remote access can be either one or two hops away. For example, psrinfo(1M) revealed this :-
0 on-line since 06/09/2006 06:49:25
1 on-line since 06/09/2006 06:49:31
2 on-line since 06/09/2006 06:49:33
3 on-line since 06/09/2006 06:49:35
Each CPU is a leaf lgroup. The diagram below explains this very well. In this kind of configuration, the non-leaf nodes 5, 6, 7 and 8 represent resource sets that are one hop away. For example, lgroup 5 contains 1, 2 and 3 (local plus the resources one hop away from lgroup 1). The root lgroup (0) contains everything.
On SPARC, we have two levels of memory hierarchy, whereas a 4-way amd64 has three levels and an 8-way amd64 should have four. The scheduling of threads starts from a thread's home lgroup and goes up the hierarchy. For example, if the home of a thread (t->t_lpl) is lgroup 1 (with CPU 0 as its resource set), we first look at CPU 0; if the thread can't run there, we look at the parent of lgroup 1 (lpl_parent), which is lgroup 5 with 1, 2, 3 as its resource sets. The same is true when the idle thread steals work from other CPUs: locality is kept in mind.
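As a rough sketch of that search order (reusing the simplified lpl from the sketch above; find_runnable_cpu() is a hypothetical stand-in for the dispatcher's per-lgroup CPU search, which is considerably more involved in the real kernel) :-

/* Same simplified lpl as in the previous sketch. */
struct lpl {
        struct lpl *lpl_parent;         /* next level up the hierarchy */
};
struct cpu;

/* Hypothetical helper: find a CPU in this lpl that can run the thread. */
extern struct cpu *find_runnable_cpu(struct lpl *);

/* Start at the thread's home lpl (t->t_lpl) and walk up via
 * lpl_parent, preferring the most local level that can run it. */
struct cpu *
pick_cpu(struct lpl *home)
{
        struct lpl *lpl;
        struct cpu *cp;

        for (lpl = home; lpl != NULL; lpl = lpl->lpl_parent)
                if ((cp = find_runnable_cpu(lpl)) != NULL)
                        return (cp);
        return (NULL);
}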
The lgroup hierarchical representation gets even more interesting when there are three hops (for example, on an 8-way amd64 box). I'll leave that for next time. Thanks to Jonathan Chew for taking the time to explain all this. I thought it'd be worth blogging about since it's a fairly complex design.
Solaris APIC implementation with respect to MSI/MSI-x interrupts
What's a Local APIC
The Local APIC (LAPIC) is part of the CPU chip. It contains (a) a mechanism for generating/accepting interrupts, (b) a timer, (c) logic that manages all external interrupts for the processor, and (d) logic to accept and generate inter-processor interrupts (IPIs).
What's an IOAPIC
This is a separate chip wired to the local APICs so that it can forward interrupts from devices to the appropriate CPU's local APIC.
What's the Local APIC Vector Table
Interrupt vectors in the APIC are numbered 0x00 through 0xFF, and 0x00...0x1F are reserved for exceptions. The vectors in the range 0x20...0xFF are available for programming interrupts. The local APIC assigns a priority to an interrupt based on its vector number: it uses the top 4 bits of the vector as the priority and ignores the lower 4 bits. For example, if the vector number is 0x3F then the priority is 0x3. In Solaris, this priority mask is represented by APIC_IPL_MASK (0xF0) and the vector mask by APIC_VECTOR_MASK (0x0F).
Since we can't use the vector range 0x00...0x1F, Solaris defines APIC_BASE_VECT (0x20) as the base vector and APIC_MAX_VECTOR (0xFF) as the maximum vector number in the local APIC. APIC_AVAIL_VECTOR is calculated as:
APIC_MAX_VECTOR + 1 - APIC_BASE_VECT, which translates to 0xFF + 1 - 0x20, i.e., 224 vectors in decimal.
Note that vectors are grouped into 16 priority groups of 0x10 vectors each; the 16 vectors within a group share the same priority.
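A quick standalone illustration of that masking, using the Solaris mask values quoted above (demo code, not from the kernel) :-

#include <stdio.h>

#define APIC_IPL_MASK      0xf0         /* top 4 bits: priority group */
#define APIC_VECTOR_MASK   0x0f         /* low 4 bits: ignored for priority */

int
main(void)
{
        int vector = 0x3f;

        /* Priority is simply the top nibble of the vector number,
         * so this prints: vector 0x3f -> priority 0x3 */
        printf("vector 0x%x -> priority 0x%x\n",
            vector, (vector & APIC_IPL_MASK) >> 4);
        return (0);
}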
APIC Data Structures in Solaris
Here is the big picture of how the various APIC data structures relate to each other. They are described below :-
apic_irq_table[] - Holds all IRQ entries. Each entry is of type apic_irq_t, and the total size of the table is APIC_MAX_VECTOR + 1. Note that IRQ has no meaning with respect to MSI/MSI-x.
A typical apic_irq_t entry in the apic_irq_table[] looks like this :-
> ::interrupts
IRQ Vect IPL Bus Trg Type CPU Share APIC/INT# ISR(s)
22 0x61 6 PCI Lvl Fixed 1 2 0x0/0x16 bge_intr, ata_intr
> apic_irq_table+(0t22*8)/J
apic_irq_table+0xb0: fffffffec10d7f38
> fffffffec10d7f38::print apic_irq_t
{
airq_mps_intr_index = 0xfffd
airq_intin_no = 0x16 // set since it's FIXED type interrupt.
airq_ioapicindex = 0
airq_dip = 0xfffffffec01fd9c0 // dev info
airq_major = 0xca
airq_rdt_entry = 0xa061
airq_cpu = 0x1
airq_temp_cpu = 0x1
airq_vector = 0x61 // note that it matches with ::interrupts output
airq_share = 0x2 // two interrupts are sharing the same IRQ and vector
airq_share_id = 0
airq_ipl = 0x6 // IPL
airq_iflag = {
intr_po = 0x3
intr_el = 0x3
bustype = 0xd
}
airq_origirq = 0xa
airq_busy = 0
airq_next = 0
}
> 0xfffffffec01fd9c0::print 'struct dev_info' ! grep name
devi_binding_name = 0xfffffffec01fcf88 "pci-ide"
devi_node_name = 0xfffffffec01fcf88 "pci-ide"
devi_compat_names = 0xfffffffec0206940 "pci1002,4379.1025.10a.80"
devi_rebinding_name = 0
>
apic_ipltopri[] - This array maps a Solaris IPL to an APIC priority. For example :-
> apic_ipltopri::print
[ 0x10, 0x20, 0x20, 0x20, 0x30, 0x50, 0x70, 0x80, 0x80, 0x80, 0x90, 0xa0, 0xb0, 0xc0, 0xd0,
0xf0, 0 ]
>
Note the order of priority assignment: higher vector numbers are assigned to higher IPLs. Also note that 0x20 is given to indices 1, 2 and 3, which means that IPLs 1, 2 and 3 share the same vector range 0x20...0x2F.
And apic_ipltopri[] is declared as :-
uchar_t apic_ipltopri[MAXIPL + 1]; /* unix ipl to apic pri */
apic_vectortoipl[] - This array is a bit complex. Its main purpose is to initialize the apic_ipltopri[] array :-
apic_init()
{
[.]
apic_ipltopri[0] = APIC_VECTOR_PER_IPL; /* leave 0 for idle */
for (i = 0; i < (APIC_AVAIL_VECTOR / APIC_VECTOR_PER_IPL); i++) {
if ((i < ((APIC_AVAIL_VECTOR / APIC_VECTOR_PER_IPL) - 1)) &&
(apic_vectortoipl[i + 1] == apic_vectortoipl[i]))
/* get to highest vector at the same ipl */
continue;
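/* j starts at 1 here; it is initialized near the top of apic_init() */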
for (; j <= apic_vectortoipl[i]; j++) {
apic_ipltopri[j] = (i << APIC_IPL_SHIFT) +
APIC_BASE_VECT;
}
}
[.]
}
uchar_t apic_vectortoipl[APIC_AVAIL_VECTOR / APIC_VECTOR_PER_IPL] = {
3, 4, 5, 5, 6, 6, 9, 10, 11, 12, 13, 14, 15, 15
};
Note that IPL 5 shares the vector range 0x40...0x5F (or 0x20...0x3F after subtracting APIC_BASE_VECT, an optimization) and that's why vector indices 2 and 3 have IPL 5. Similarly, vector indices 4 and 5 have IPL 6 (0x60...0x7F, or 0x40...0x5F after the subtraction).
/*
* IPL Vector range. as passed to intr_enter
* 0 none.
* 1,2,3 0x20-0x2f 0x0-0xf
* 4 0x30-0x3f 0x10-0x1f
* 5 0x40-0x5f 0x20-0x3f
* 6 0x60-0x7f 0x40-0x5f
* 7,8,9 0x80-0x8f 0x60-0x6f
* 10 0x90-0x9f 0x70-0x7f
* 11 0xa0-0xaf 0x80-0x8f
* ... ...
* 15 0xe0-0xef 0xc0-0xcf
* 15 0xf0-0xff 0xd0-0xdf
*/
apic_vector_to_irq[] - This array holds the IRQ number for a given vector number. If an element of this array contains APIC_RESV_IRQ (0xFE), the vector is free and can be allocated. The apic_navail_vector() function checks this array to figure out how many vectors are available.
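Here is a simplified sketch of that scan (the real apic_navail_vector() derives the [lowest, highest] range from the priority passed to it) :-

#define APIC_MAX_VECTOR 0xff
#define APIC_RESV_IRQ   0xfe            /* marks a free vector */

unsigned char apic_vector_to_irq[APIC_MAX_VECTOR + 1];

/* Count the free vectors in [lowest, highest]. */
int
navail_vectors(int lowest, int highest)
{
        int i, count = 0;

        for (i = lowest; i <= highest; i++)
                if (apic_vector_to_irq[i] == APIC_RESV_IRQ)
                        count++;
        return (count);
}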
Here's an example of how an IPL is mapped to a vector priority in Solaris :-
Let's say we get a network interrupt at IPL 6 (ath - a wifi interrupt) with vector number 0x60. Solaris now has to block all interrupts at and below IPL 6, which is done by the apic_intr_enter() function. The caller actually subtracts 0x20 (APIC_BASE_VECT) from the vector number before calling it; this is done as an optimization. The apic_ipls[] array is then used to get the IPL that will be programmed into the APIC register. So we first get nipl as
nipl = apic_ipls[vector]; // vector here is 0x40 (0x60 minus APIC_BASE_VECT); nipl will be 6
*vectorp = irq = apic_vector_to_irq[vector + APIC_BASE_VECT]; // recover the actual vector and irq
and then this statement blocks all the interrupts at and below the vector priority (or IPL).
apicadr[APIC_TASK_REG] = apic_ipltopri[nipl];
So we write 0x70 to the APIC task register to block interrupts. Note that Solaris uses the range 0x60...0x7F for IPL 6 :-
* IPL Vector range. as passed to apic_intr_enter()
* 6 0x60-0x7f 0x40-0x5f
and it does not matter whether you write 0x70 or 0x7F: every value in that range has the same top nibble, so they all do the same work, which is to block interrupts at IPL 6 and below.
Solaris x86 Interrupt Handling
Now that we have taken a quick look at the data structures involved, let's look at how Solaris x86 handles interrupts. I prefer to describe interrupt handling before interrupt allocation because interrupt handling is easier to understand.
Let's first go through how Solaris x86 is designed in terms of psm ops. For example, PCI Express has its own psm_ops, apic_ops, and PCI has its own psm_ops, uppc_ops. In fact, xVM (the Xen-based hypervisor) has its own psm_ops called xen_psm_ops. psm_install() is responsible for installing a psm in the Solaris x86 world.
apic_probe_common() is what gets called when psm_install() jumps into psm_probe() for each psm_ops. apic_probe_common() does many things, one of them being mapping apicadr[] (you have seen this before; I referred to it when setting the APIC priority, i.e., the task register). The apic_cpus[] array also gets initialized via ACPI, i.e., acpi_probe(), because the ACPI tables have all the information such as local APIC CPU id, version, etc.
Now let's see what happens when the local APIC generates an interrupt. The interrupt could come from the IOAPIC or be an MSI/MSI-x generated interrupt (an in-band message). Solaris calls cmnint(), also known as _interrupt(); they are the same entry point and call do_interrupt() once the regs are set up. do_interrupt() will first set the PIL so that the CPU does not get any interrupt at or below that PIL. Raising the priority of the CPU is done through the setlvl function pointer. This pointer is set to the appropriate psm_ops' psm_intr_enter, which in our case is apic_intr_enter(). Then comes the interrupt-dispatching part, done by calling switch_sp_and_call() once the interrupt thread's stack is set up. Recall that Solaris handles interrupts in thread context if the PIL is at or below LOCK_LEVEL (0xa). High-level interrupts (0xb...0xf) are handled on the current thread's stack.
switch_sp_and_call() can dispatch three types of interrupts -- (a) software interrupts, (b) high-level interrupts and (c) normal device interrupts.
In our example, we have been looking at the wifi interrupt, which is type (c) and maps to the dispatch_hardint() routine. dispatch_hardint() calls av_dispatch_autovect() after enabling interrupts. Now that we are touching av_dispatch_autovect(), I must explain the autovect[] array. If you remember add_avintr(), which is responsible for registering a hardware interrupt handler, you can skip this part. autovect[] has MAX_VECT (256) elements and each element is of type 'struct av_head'. The first pointer in 'struct av_head' points to a 'struct autovec', and the autovec structure has all the information about the interrupt handler: the arguments passed to it, its priority level, etc. Note that more than one interrupt handler can share the same vector; they are linked by 'av_link' in 'struct autovec'. For example :-
> ::interrupts
IRQ Vect IPL Bus Trg Type CPU Share APIC/INT# ISR(s)
22 0x61 6 PCI Lvl Fixed 1 2 0x0/0x16 bge_intr, ata_intr
> ::sizeof 'struct av_head'
sizeof (struct av_head) = 0x10
> autovect+(0x10*0t22)=J // Take the IRQ and index into autovect[] array.
fffffffffbc52ba0
> fffffffffbc52ba0::print 'struct av_head'
{
avh_link = 0xfffffffec50d2cc0
avh_hi_pri = 0x6 // take a look at bge_intr() and its priority below
avh_lo_pri = 0x5 // take a look at ata_intr() and its priority below
}
> 0xfffffffec50d2cc0::print 'struct autovec'
{
av_link = 0xfffffffec10d2f40
av_vector = bge_intr
av_intarg1 = 0xfffffffec50d5000
av_intarg2 = 0
av_ticksp = 0xfffffffec506ae20
av_prilevel = 0x6
av_intr_id = 0xfffffffec537a078
av_dip = 0xfffffffec01f8400
}
> 0xfffffffec10d2f40::print 'struct autovec'
{
av_link = 0
av_vector = ata_intr
av_intarg1 = 0xfffffffec00bc8c0
av_intarg2 = 0
av_ticksp = 0xfffffffec0528898
av_prilevel = 0x5
av_intr_id = 0xfffffffec10cbe78
av_dip = 0xfffffffec01fd9c0
}
>
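To see how these handlers actually get run, here is a minimal sketch in the spirit of av_dispatch_autovect() -- the structures are simplified versions of the ones printed above, and the real routine also re-walks the chain while the interrupt remains claimed, among other things :-

/* Simplified versions of the structures printed above. */
struct autovec {
        struct autovec *av_link;        /* next handler sharing this vector */
        unsigned int  (*av_vector)(void *, void *); /* the ISR */
        void           *av_intarg1;     /* first ISR argument */
        void           *av_intarg2;     /* second ISR argument */
};

struct av_head {
        struct autovec *avh_link;       /* head of the handler chain */
};

/* Call every handler sharing this vector, one after the other. */
void
dispatch_chain(struct av_head *head)
{
        struct autovec *av;

        for (av = head->avh_link; av != NULL; av = av->av_link)
                (void) (*av->av_vector)(av->av_intarg1, av->av_intarg2);
}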
Here's the dispatch in action for the ath interrupt we have been discussing :-
bash-3.00# dtrace -n av_dispatch_autovect:entry'/`autovect[args[0]].avh_link->av_vector/{@[args[0]]=count(); printf("%a, %x", `autovect[args[0]].avh_link->av_vector, args[0])}'
1 2391 av_dispatch_autovect:entry ath`ath_intr, 13
1 2391 av_dispatch_autovect:entry ath`ath_intr, 13
1 2391 av_dispatch_autovect:entry ath`ath_intr, 13
1 2391 av_dispatch_autovect:entry ath`ath_intr, 13
There is a very interesting blog by Anish on APIC and Solaris x86 interrupt handling.
How does the Solaris APIC implementation allocate interrupts
Now that we have looked at how the APIC is structured in Solaris x86 and how interrupts are handled, let's look at how interrupts are allocated. There are three types of interrupts -- DDI_INTR_TYPE_FIXED, DDI_INTR_TYPE_MSI and DDI_INTR_TYPE_MSIX, in the order they evolved. The Solaris DDI routine ddi_intr_get_supported_types() can be called to retrieve the types of interrupt supported by the bus.
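For reference, here is a hedged sketch of the driver side of this dance, typically done in attach(9E); my_add_intrs() and its (trimmed) error handling are illustrative, while the DDI calls are the real 9F interfaces :-

#include <sys/ddi.h>
#include <sys/sunddi.h>

static int
my_add_intrs(dev_info_t *dip)
{
        ddi_intr_handle_t htable[2];
        int types, actual;

        /* Ask the bus which interrupt types this device can use. */
        if (ddi_intr_get_supported_types(dip, &types) != DDI_SUCCESS)
                return (DDI_FAILURE);

        /* Prefer MSI-X when the bus and device support it. */
        if ((types & DDI_INTR_TYPE_MSIX) &&
            ddi_intr_alloc(dip, htable, DDI_INTR_TYPE_MSIX, 0, 2,
            &actual, DDI_INTR_ALLOC_NORMAL) == DDI_SUCCESS)
                return (DDI_SUCCESS);

        return (DDI_FAILURE);
}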
In the case of MSI, apic_alloc_msi_vectors() gets called, and in the case of MSI-x, apic_alloc_msix_vectors() gets called to allocate the appropriate number of interrupt vectors. Note that MSI supports 32 vectors per device function and MSI-x supports 2048 vectors per device function; however, Solaris x86 supports only 2 MSI-x interrupt vectors per device (the reason I'm studying APIC and MSI-x). On SPARC, Solaris supports far more MSI-x interrupts, configured by the #msix-request property in DDI. This hard limit is determined by the i_ddi_get_msix_alloc_limit() function; even on SPARC it seems we limit it to 8 :-
msix_alloc_limit = MAX(DDI_MAX_MSIX_ALLOC, ddi_msix_alloc_limit);
/* Default number of MSI-X resources to allocate */
#define DDI_DEFAULT_MSIX_ALLOC 2
/* Maximum number of MSI-X resources to allocate */
#define DDI_MAX_MSIX_ALLOC 8
These limits will change when the Interrupt Resource Management (IRM) framework is integrated into Solaris.
Anyway, let's get back to the topic. Depending upon the interrupt type and the bus intr ops, Solaris will jump to the right interrupt ops. In our case, ddi_intr_alloc(9F) gets us into pci_common_intr_ops() with the DDI_INTROP_ALLOC command to allocate the interrupts. We will not get into FIXED type interrupts as they are hard-wired via the IOAPIC and fairly easy (I suppose). It's psm_intr_ops that gets into action with the PSM_INTR_OP_ALLOC_VECTORS command, and we land in apic_intr_ops() :-
apic_intr_ops(dev_info_t *dip, ddi_intr_handle_impl_t *hdlp,
    psm_intr_op_t intr_op, int *result)
{
[.]
case PSM_INTR_OP_ALLOC_VECTORS:
if (hdlp->ih_type == DDI_INTR_TYPE_MSI)
*result = apic_alloc_msi_vectors(dip, hdlp->ih_inum,
hdlp->ih_scratch1, hdlp->ih_pri,
(int)(uintptr_t)hdlp->ih_scratch2);
else
*result = apic_alloc_msix_vectors(dip, hdlp->ih_inum,
hdlp->ih_scratch1, hdlp->ih_pri,
(int)(uintptr_t)hdlp->ih_scratch2);
break;
[.]
}
apic_alloc_msi_vectors() - This function allocates 'count' vectors for the device. 'count' has to be a power of 2, and the priority is passed by the caller. The first thing this function does is check whether enough vectors are available at that priority to satisfy the request, which is done by the routine apic_navail_vector(). We then search for a block of contiguous vectors, and the value returned by apic_find_multi_vectors() is our starting point. MSI has this contiguous-vector constraint because an MSI device signals its different vectors by modifying the low-order bits of a single message data value, so the block must be contiguous (and suitably aligned).
The next step is to check whether we have enough IRQs in the apic_irq_table[]. This is done by the function apic_check_free_irqs(). If we succeed in finding enough free IRQ entries in the table, apic_alloc_msi_vectors() proceeds to allocate an IRQ, which is done by apic_allocate_irq(). The IRQ number returned by this function is what finally indexes the autovect[] table for the vector. We will return to autovect[] soon, but for now let's see how we select a CPU: apic_bind_intr() chooses the CPU for the first of the 'count' interrupts, and subsequent vectors are bound to the same CPU. These steps are repeated in a loop 'count' times; a paraphrased sketch of the whole flow is below.
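The function names in this sketch follow the real ones, but the signatures are simplified and do not match apic.c exactly :-

extern int apic_navail_vector(void *dip, int pri);
extern int apic_find_multi_vectors(int pri, int count);
extern int apic_check_free_irqs(int count);

static int
msi_alloc_sketch(void *dip, int count, int pri)
{
        /* 1. Enough free vectors at this priority? */
        if (apic_navail_vector(dip, pri) < count)
                return (0);

        /* 2. MSI requires a contiguous, aligned block of vectors. */
        if (apic_find_multi_vectors(pri, count) == 0)
                return (0);

        /* 3. Enough free slots in apic_irq_table[]? */
        if (apic_check_free_irqs(count) == 0)
                return (0);

        /* 4. Allocate an IRQ per vector; the first one picks the CPU
         * via apic_bind_intr() and the rest share that CPU. */
        return (count);
}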
Now that we have set up the IRQ in the apic_irq_table[] with priority, vector, target CPU, etc., we are ready to enable the interrupt. BTW, all this is mostly done in the driver's attach(9E) entry point, in two phases -- (i) add interrupts by allocating them and (ii) enable interrupts.
apic_alloc_msix_vectors() - This function does similar work as for MSI interrupts, except that we allocate a vector (apart from allocating the IRQ entry in the apic_irq_table[]) and bind the interrupt to a CPU by calling apic_bind_intr() for each of the 'count' requests. MSI-x does not have the contiguous-vector limitation that MSI has. Vector allocation is done by the routine apic_allocate_vector(), which returns a free vector by walking the apic_vector_to_irq[] table looking for an APIC_RESV_IRQ slot. The range is determined by the priority passed in. For example, if the priority passed is 6, then the range would be :-
highest = apic_ipltopri[ipl] + APIC_VECTOR_MASK;
lowest = apic_ipltopri[ipl - 1] + APIC_VECTOR_PER_IPL;
if (highest < lowest) /* Both ipl and ipl - 1 map to same pri */
lowest -= APIC_VECTOR_PER_IPL;
highest is 0x7F (0x70 + 0x0F) and lowest is 0x60 (0x50 + 0x10), which matches our observation at the beginning of the blog.
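Just to double-check that arithmetic, here is a tiny standalone snippet using the apic_ipltopri[] values printed earlier :-

#include <assert.h>

#define APIC_VECTOR_MASK     0x0f
#define APIC_VECTOR_PER_IPL  0x10

int
main(void)
{
        int ipltopri_5 = 0x50, ipltopri_6 = 0x70;   /* from ::print above */
        int highest = ipltopri_6 + APIC_VECTOR_MASK;        /* 0x7f */
        int lowest = ipltopri_5 + APIC_VECTOR_PER_IPL;      /* 0x60 */

        assert(highest == 0x7f && lowest == 0x60);
        return (0);
}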
A typical flow of this dance is as follows :-
1 22557 apic_alloc_msix_vectors:entry name pciex8086,10a7, inum : 0, count : 2, pri :6
pcplusmp`apic_intr_ops+0x114
npe`pci_common_intr_ops+0x8f1
npe`npe_intr_ops+0x21
unix`i_ddi_intr_ops+0x54
unix`i_ddi_intr_ops+0x54
genunix`ddi_intr_alloc+0x263
igb`igb_alloc_intrs_msix+0x134
igb`igb_alloc_intrs+0x64
igb`igb_attach+0xcb
genunix`devi_attach+0x87
1 22485 apic_navail_vector:entry name : pciex8086,10a7, pri 6
1 22486 apic_navail_vector:return 31
1 22547 apic_allocate_irq:entry 72
1 22419 apic_find_free_irq:entry start :72, end : 253
1 22417 apic_find_io_intr:entry 72
1 22548 apic_allocate_irq:return 72
1 22479 apic_allocate_vector:entry ipl : 6, irq: 72, pri: 1
1 22480 apic_allocate_vector:return 96
1 22473 apic_bind_intr:entry name : pciex8086,10a7, irq 72
1 22474 apic_bind_intr:return 0
Now let's talk about how a driver enables interrupts once they are allocated. Interrupts can be enabled in a block (more than one at once via ddi_intr_block_enable(9F)) or by calling ddi_intr_enable(9F) explicitly for each interrupt; we will discuss ddi_intr_enable(9F). Once again we end up in pci_common_intr_ops(), which calls pci_enable_intr(); that does two things mainly :-
- Translate the interrupt if needed. This is done by apic_introp_xlate(). If the interrupt is MSI or MSI-x, we call apic_setup_irq_table() if the IRQ entry in the apic_irq_table[] is not set up. In our example, we have already done this, so apic_introp_xlate() just returns the IRQ number from apic_vector_to_irq[airqp->airq_vector]. airqp is an entry in the apic_irq_table[] that gets assigned by calling apic_find_irq().
- Add the interrupt handler by calling add_avintr(). We have already touched this routine in this blog, but it is worth noting where in the life cycle of setting up interrupts we bind an interrupt handler (ISR, or Interrupt Service Routine) to a vector. The main task of add_avintr() is to insert an 'autovec' at the appropriate index by calling insert_av(). The other, and most important, task is to program the interrupt, which is done by addspl(). addspl() is another function pointer from the family of setlvl, setspl, etc. In the APIC case, it is apic_addspl(), which is just a wrapper over apic_addspl_common(). Four arguments are passed to it :-
apic_addspl_common(int irqno, int ipl, int min_ipl, int max_ipl)
We first get the pointer from apic_irq_table[] by indexing with irqno, then check whether we need to upgrade the vector, or just check the IPL in case this interrupt needs to be shared. Eventually we land in apic_setup_io_intr(), which does the main task. In fact, apic_rebind() is what binds an interrupt to a CPU, and it is called from apic_setup_io_intr(). Since we are discussing MSI/MSI-x: once apic_rebind() does its sanity checks, it calls apic_pci_msi_enable_vector(). The following statements are what we write to program the interrupt :-
/* MSI Address */
msi_addr = (MSI_ADDR_HDR | (target_apic_id << MSI_ADDR_DEST_SHIFT));
msi_addr |= ((MSI_ADDR_RH_FIXED << MSI_ADDR_RH_SHIFT) |
(MSI_ADDR_DM_PHYSICAL << MSI_ADDR_DM_SHIFT));
/* MSI Data: MSI is edge triggered according to spec */
msi_data = ((MSI_DATA_TM_EDGE << MSI_DATA_TM_SHIFT) | vector);
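To make that concrete, here is a tiny standalone computation with illustrative values (target APIC id 1, vector 0x60), using the MSI layout from the PCI spec: destination id in address bits 19:12, physical destination mode and fixed redirection (bits 2 and 3 clear), edge trigger (data bit 15 clear) :-

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
        uint32_t target_apic_id = 1;    /* hypothetical target CPU */
        uint32_t vector = 0x60;         /* hypothetical vector */

        /* 0xfee00000 is the MSI address header; RH/DM bits stay 0. */
        uint32_t msi_addr = 0xfee00000 | (target_apic_id << 12);
        uint32_t msi_data = vector;     /* edge: trigger-mode bit stays 0 */

        /* Prints: addr 0xfee01000 data 0x60 */
        printf("addr 0x%x data 0x%x\n", msi_addr, msi_data);
        return (0);
}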
apic_pci_msi_enable_mode() is also called from apic_rebind() to enable the interrupt once it's programmed; that's how per-vector masking is controlled, I suppose.
Since we are touching on how an interrupt gets bound to a CPU, I should also mention how Solaris selects the CPU to bind an interrupt to. The routine apic_bind_intr() is responsible for this, and the decision is based on the value of the tunable 'apic_intr_policy'. You can define three types of policy -- (a) INTR_ROUND_ROBIN_WITH_AFFINITY - round-robin with an affinity heuristic that returns the same CPU for the same dip (or device); this is the default policy. (b) INTR_LOWEST_PRIORITY - I don't know, because it's not implemented. (c) INTR_ROUND_ROBIN - select the CPU in round-robin fashion using the 'apic_next_bind_cpu' global variable. Choosing between INTR_ROUND_ROBIN_WITH_AFFINITY and INTR_ROUND_ROBIN may not be easy, but I think the decision should be based on throughput vs locality awareness.
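For illustration, a minimal sketch of the plain INTR_ROUND_ROBIN case -- the real apic_bind_intr() also honors user-specified bindings and skips offline CPUs; apic_next_bind_cpu mirrors the real global, everything else is simplified :-

static int apic_next_bind_cpu;          /* mirrors the real global */
static int apic_nproc = 4;              /* assume a 4-CPU box for the sketch */

/* Hand out CPUs 0, 1, 2, 3, 0, 1, ... for successive interrupts. */
static int
round_robin_cpu(void)
{
        return (apic_next_bind_cpu++ % apic_nproc);
}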