Friday, March 20, 2009

Latency group (lgroup) in Solaris on NUMA aware machines

Most of you will have heard of NUMA (non-uniform memory access) machines. I'm going to describe how the memory latency groups (called lgroups in Solaris) are laid out. While working on the Multi-CPU binding project, I had to learn these details in order to implement how a thread chooses the lgroup with the least latency from its earlier home lgroup.

The figure below shows how the lgroup structures are laid out on SPARC-based NUMA-aware machines. The root lgroup (0) is the topmost level of the hierarchy and contains all the resource sets in the system. Lgroups 1, 2 and 3 have four CPUs each (one system board each) and are the leaf nodes in this case. On SPARC, the remote latency from lgroup 2 to lgroup 1 or 3 is the same, i.e. the leaves are equidistant, so there are only two latencies: local and remote. In Solaris, we have something called the lgroup partition load (lpl_t), which represents the leaf nodes that have CPUs and memory. Each cpu_t (CPU structure) has a cpu_lpl. lpls are also used when CPU partitions are created (processor sets are the best example). There's a global table of lgroups called lgrp_table[]. Each partition keeps its lpls in cp_lgrploads[] (in cpupart_t). Both tables are indexed by lgroup id. A thread is homed to an lpl within its CPU partition.
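To make the SPARC layout above concrete, here is a minimal sketch in C. It is not the kernel's code: the sketch_lpl_t and sketch_cpu_t types are simplified stand-ins I made up for illustration, only the field names cpu_lpl and lpl_parent come from Solaris, and the assignment of CPU ids 0-3 to lgroup 1, 4-7 to lgroup 2 and 8-11 to lgroup 3 is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define NLGRPS 4    /* root (0) plus three leaf lgroups (1, 2, 3) */
#define NCPUS  12   /* four CPUs per leaf lgroup */

/* Simplified stand-in for lpl_t; NOT the real kernel definition. */
typedef struct sketch_lpl {
	int lpl_lgrpid;                 /* lgroup id, also the table index */
	struct sketch_lpl *lpl_parent;  /* NULL for the root */
	int lpl_ncpu;                   /* CPUs in this lgroup's resource set */
} sketch_lpl_t;

/* Simplified stand-in for cpu_t. */
typedef struct sketch_cpu {
	int cpu_id;
	sketch_lpl_t *cpu_lpl;          /* leaf lpl this CPU belongs to */
} sketch_cpu_t;

/* Plays the role of one partition's cp_lgrploads[], indexed by lgroup id. */
static sketch_lpl_t lgrploads[NLGRPS];
static sketch_cpu_t cpus[NCPUS];

/* Build the two-level hierarchy: root 0 with leaves 1, 2 and 3. */
static void
sketch_build_sparc(void)
{
	int id, c;

	lgrploads[0].lpl_lgrpid = 0;
	lgrploads[0].lpl_parent = NULL;
	lgrploads[0].lpl_ncpu = NCPUS;

	for (id = 1; id < NLGRPS; id++) {
		lgrploads[id].lpl_lgrpid = id;
		lgrploads[id].lpl_parent = &lgrploads[0];
		lgrploads[id].lpl_ncpu = NCPUS / (NLGRPS - 1);
	}

	/* Assumed numbering: CPUs 0-3 on board 1, 4-7 on board 2, and so on. */
	for (c = 0; c < NCPUS; c++) {
		cpus[c].cpu_id = c;
		cpus[c].cpu_lpl = &lgrploads[1 + c / 4];
		assert(cpus[c].cpu_lpl->lpl_parent == &lgrploads[0]);
	}
}
```

After calling sketch_build_sparc(), cpus[5].cpu_lpl->lpl_lgrpid is 2 and every leaf's lpl_parent is the root, which is the two-level shape described above.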


On a 4-way amd64 machine, the lgroup representation is more interesting because, besides local memory, remote memory can be one or two hops away. For example, psrinfo(1M) showed this:
0 on-line since 06/09/2006 06:49:25
1 on-line since 06/09/2006 06:49:31
2 on-line since 06/09/2006 06:49:33
3 on-line since 06/09/2006 06:49:35

Each CPU is a leaf lgroup. The diagram below shows this. In this kind of configuration, the non-leaf lgroups 5, 6, 7 and 8 represent the resource sets that are at most one hop away from a given leaf. For example, lgroup 5 contains 1, 2 and 3 (lgroup 1 itself plus the lgroups one hop away from it). The root lgroup (0) contains everything.
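Here is a small sketch of how such a "within one hop" resource set could be derived from hop counts. The hop matrix is my assumption (four sockets linked in a square, so that lgroup 4 is two hops from lgroup 1); the post only states that lgroup 5 contains 1, 2 and 3. The function name sketch_rset() is hypothetical and this is not how the kernel builds its lgroups.

```c
#include <assert.h>

#define NLEAF 4

/*
 * ASSUMED hop counts between leaf lgroups 1..4 (array index = id - 1),
 * for links 1-2, 1-3, 2-4 and 3-4. Only "lgroup 1 reaches 2 and 3 in
 * one hop" is taken from the text; the rest is illustrative.
 */
static const int hops[NLEAF][NLEAF] = {
	{ 0, 1, 1, 2 },
	{ 1, 0, 2, 1 },
	{ 1, 2, 0, 1 },
	{ 2, 1, 1, 0 },
};

/*
 * Return a bitmask of the leaf lgroup ids reachable from `leaf` in at
 * most `max_hops` hops. Bit i set means leaf lgroup i is in the set.
 */
static unsigned
sketch_rset(int leaf, int max_hops)
{
	unsigned set = 0;
	int other;

	assert(leaf >= 1 && leaf <= NLEAF);
	for (other = 1; other <= NLEAF; other++) {
		if (hops[leaf - 1][other - 1] <= max_hops)
			set |= 1u << other;
	}
	return (set);
}
```

With this matrix, sketch_rset(1, 0) gives just leaf 1, sketch_rset(1, 1) gives leaves 1, 2 and 3 (what lgroup 5 holds), and sketch_rset(1, 2) gives all four leaves (what the root holds).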

On SPARC we have two levels of memory hierarchy, whereas a 4-way amd64 has three. An 8-way amd64 should have four. Scheduling a thread starts at its home lgroup and goes up the hierarchy. For example, if the home of a thread (t->t_lpl) is lgroup 1 (with CPU 0 as its resource set), we first look at CPU 0; if the thread can't run there, we look at the parent of lgroup 1 (lpl_parent), which is lgroup 5 with 1, 2 and 3 as its resource sets. The same is true when an idle thread steals work from other CPUs: locality is kept in mind.
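The walk up the hierarchy can be sketched like this. It is a toy model, not the dispatcher: sketch_lpl_t and sketch_pick_cpu() are hypothetical names, the mapping of CPU n to leaf lgroup n+1 is an assumption (the post only says lgroup 1 holds CPU 0), and the real code weighs priorities and load rather than taking the first available CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for lpl_t; NOT the real kernel definition. */
typedef struct sketch_lpl {
	int lpl_lgrpid;
	unsigned lpl_cpus;              /* bitmask of CPU ids in the resource set */
	struct sketch_lpl *lpl_parent;  /* NULL for the root */
} sketch_lpl_t;

/* Only the chain for home lgroup 1 is modelled: 1 -> 5 -> 0 (root). */
static sketch_lpl_t root  = { 0, 0xFu, NULL };    /* CPUs 0-3 */
static sketch_lpl_t lgrp5 = { 5, 0x7u, &root };   /* CPUs 0, 1, 2 (assumed) */
static sketch_lpl_t lgrp1 = { 1, 0x1u, &lgrp5 };  /* CPU 0 */

/*
 * Starting at the thread's home lpl, return the lowest-numbered CPU in
 * `avail` (bitmask of CPUs that can take the thread) from the nearest
 * level that has one, or -1 if no level has any.
 */
static int
sketch_pick_cpu(const sketch_lpl_t *home, unsigned avail)
{
	const sketch_lpl_t *lpl;
	int cpu;

	for (lpl = home; lpl != NULL; lpl = lpl->lpl_parent) {
		unsigned candidates = lpl->lpl_cpus & avail;

		if (candidates == 0)
			continue;       /* nothing here, try the parent */
		for (cpu = 0; cpu < 32; cpu++) {
			if (candidates & (1u << cpu))
				return (cpu);
		}
	}
	return (-1);
}
```

If CPU 0 is busy, sketch_pick_cpu(&lgrp1, 0xE) finds CPU 1 in lgroup 5 without ever reaching the root; only when CPUs 0, 1 and 2 are all unavailable does the search fall through to CPU 3, which is two hops away.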

The lgroup hierarchy is more interesting when there are three hops (for example, on an 8-way amd64 box). I'll leave that for next time. Thanks to Jonathan Chew for taking the time to explain all this. I thought it would be worth blogging about, since the design is a bit complex.
