A simple example
Here is a simple example. As a test system I used a virtual machine with 4 CPUs and 2 running zones.
#>zoneadm list -icv
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              native   shared
   9 zone1            running    /zones/zone1                   native   shared
  10 zone2            running    /zones/zone2                   native   shared
   - zone3            installed  /zones/zone3                   native   shared
   - zone4            installed  /zones/zone4                   native   shared
The four CPUs show up in the output of psrinfo:
# >psrinfo
0 on-line since 11/03/2009 08:19:02
1 on-line since 11/03/2009 08:19:08
2 on-line since 11/03/2009 08:19:08
3 on-line since 11/03/2009 08:19:10
We take a short look at the CPU utilization. In essence, the system just runs itself at the moment. A very relaxing situation.
# >vmstat 1
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr cd cd f0 s0 in sy cs us sy id
0 0 0 978940 596200 116 439 95 0 0 0 79 10 2 -0 -0 701 3321 1456 4 14 82
0 0 0 812240 438140 8 30 0 0 0 0 0 0 0 0 0 430 173 232 0 1 98
0 0 0 812240 438140 0 6 0 0 0 0 0 16 0 0 0 715 142 350 0 2 98
0 0 0 812240 438140 0 6 0 0 0 0 0 0 0 0 0 442 157 229 0 1 99
0 0 0 812240 438140 0 6 0 0 0 0 0 0 0 0 0 433 200 217 0 1 99
0 0 0 812240 438140 0 6 0 0 0 0 0 0 0 0 0 437 214 233 0 1 99
0 0 0 812240 438140 0 6 0 0 0 0 0 0 0 0 0 438 162 243 0 1 99
^C
# >prstat -mLZ 1
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
2203 root 0.2 3.0 0.0 0.0 0.0 0.0 97 0.0 37 0 525 0 prstat/1
122 root 0.1 0.1 0.0 0.0 0.0 0.0 100 0.0 36 0 218 0 nscd/3
2137 noaccess 0.1 0.0 0.0 0.0 0.0 0.0 100 0.0 1 0 1 0 java/15
…
9 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 0 0 0 0 svc.configd/1
7 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 0 0 0 0 svc.startd/323
7 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 0 0 0 0 svc.startd/316
7 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 0 0 0 0 svc.startd/66
7 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 svc.startd/10
ZONEID NLWP SWAP RSS MEMORY TIME CPU ZONE
0 183 140M 212M 21% 0:01:16 0.3% global
2 123 135M 204M 20% 0:00:52 0.0% zone2
1 126 138M 208M 20% 0:00:50 0.0% zone1
Total: 102 processes, 432 lwps, load averages: 0.04, 0.39, 0.49
Configuration of the CPU capping
Next, the capping for a local zone called zone1 is activated and set to 10%. Keep in mind: the activation via prctl does not persist across reboots. The cap value is expressed in percent of a single CPU, so the calculation goes like this: 4 CPUs = 400 → 400 = 100% of the machine → 40 = 10%
# >prctl -t privileged -n zone.cpu-cap -s -v 40 -i zone zone1
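The calculation above can be written down as a small helper. This is only a sketch: the function names cpu_cap_value and cpu_cap_value_here are made up for illustration, and the second one assumes that sysconf(_SC_NPROCESSORS_ONLN) reports the CPUs you want to base the cap on.

```c
#include <unistd.h>

/* zone.cpu-cap is expressed in percent of ONE CPU: 100 means one full CPU.
   machine_percent is the share of the whole machine the zone may use. */
long cpu_cap_value(long ncpus, long machine_percent)
{
    long whole_machine = ncpus * 100L;              /* 4 CPUs -> 400 */
    return whole_machine * machine_percent / 100L;  /* 10% of 400 -> 40 */
}

/* Same calculation, based on the CPUs currently online on this machine.
   Returns -1 if the CPU count cannot be determined. */
long cpu_cap_value_here(long machine_percent)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

    if (ncpus < 1)
        return -1;
    return cpu_cap_value(ncpus, machine_percent);
}
```

For the test system, cpu_cap_value(4, 10) yields 40, which is the value handed to prctl -v.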
Let's check the configuration:
# >prctl -P -i zone zone1
zone: 1: zone1
zone.max-swap system 18446744073709551615 max deny -
zone.max-locked-memory system 18446744073709551615 max deny -
zone.max-shm-memory system 18446744073709551615 max deny -
zone.max-shm-ids system 16777216 max deny -
zone.max-sem-ids system 16777216 max deny -
zone.max-msg-ids system 16777216 max deny -
zone.max-lwps system 2147483647 max deny -
zone.cpu-cap privileged 40 - deny -
zone.cpu-cap system 4294967295 inf deny -
zone.cpu-shares privileged 1 - none -
zone.cpu-shares system 65535 max none -
BTW, there is also a corresponding kstat module for the CPU capping, called caps. The name of the relevant kstat is cpucaps_zone_ followed by the zone ID.
# >kstat -m caps -n cpucaps_zone_`zoneadm list -icv | grep zone1 | awk '{print $1}'`
module: caps instance: 1
name: cpucaps_zone_1 class: zone_caps
above_sec 0
below_sec 1335
crtime 4801.079480706
maxusage 12
nwait 0
snaptime 6135.748493858
usage 1
value 40
zonename zone1
Pedal to the metal - or testing the cap
To test the capping, just use your favorite CPU hog. I wrote a small, albeit not very elegant, C program for this purpose.
# >vi load.c
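#include <stdio.h>
#include <pthread.h>
#include <math.h>
#include <unistd.h>

#define NTHREADS 2000

/* Busy loop that burns CPU forever. volatile keeps the compiler
   from optimizing the calculation away. */
void *loop1(void *arg)
{
    volatile double x = 0, m;

    (void)arg;
    for (;;) {
        m = acos(x);
        x++;
    }
}

int main(void)
{
    int c;

    /* Spawn the CPU-bound threads; they are never joined. */
    for (c = 0; c < NTHREADS; c++) {
        pthread_t thread;

        if (pthread_create(&thread, NULL, loop1, NULL) != 0) {
            fprintf(stderr, "pthread_create failed after %d threads\n", c);
            break;
        }
    }

    /* Keep the main thread alive while the workers spin. */
    for (;;)
        sleep(1);
}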
Okay ... C source alone doesn't help us, so we have to compile it first. Compile, copy, done.
#> gcc -o load.bin load.c -lm -lpthread
#> cp load.bin /zones/zone1/root
Now we can log in to the zone and put some load on our system.
# >zlogin zone1
# >./load.bin & ./load.bin & ./load.bin &
^D
Now, wait a bit - until the cake rises. And after a short moment:
#> prstat -mLZ 1
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
3662 root 0.8 8.7 0.0 0.0 0.0 0.0 90 0.0 16 2 439 0 prstat/1
3690 root 2.1 0.0 0.0 0.0 0.0 0.0 0.0 98 0 2 0 0 load.bin/166
...
3691 root 1.0 0.0 0.0 0.0 0.0 0.0 0.0 99 0 1 0 0 load.bin/214
3690 root 1.0 0.0 0.0 0.0 0.0 0.0 0.0 99 0 1 0 0 load.bin/505
ZONEID NLWP SWAP RSS MEMORY TIME CPU ZONE
1 1907 153M 225M 22% 0:01:22 10% zone1
0 191 147M 219M 21% 0:03:32 2.4% global
2 122 135M 205M 20% 0:01:21 0.0% zone2
Total: 110 processes, 2220 lwps, load averages: 2164.75, 1980.04, 1527.83
As the CPU column shows, the capping works fine: zone1 is held at 10% CPU utilization, exactly as configured.
Contradictory numbers - and their reason
But wait ... the load average is extremely high.
# >uptime
3:58pm up 2:57, 2 users, load average: 2176.48, 1985.47, 1532.13
If the load average were a relevant indicator in this situation, the system would be in real trouble now. It isn't. However, applications that depend on the load average are indeed in trouble. One example is sendmail, which has a mechanism to mitigate the risk of a mail storm: when the load average of the system rises above a configurable level, sendmail stops accepting mail. It would stop accepting mail now, even though there is more than enough capacity available on the system. Okay, let's look at the system from a different perspective: what state does vmstat report?
# >vmstat 1
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr cd cd f0 s0 in sy cs us sy id
0 0 0 721984 358208 4 6 0 0 0 0 0 0 0 0 0 509 142 312 11 1 88
0 0 0 721984 358208 4 6 0 0 0 0 0 0 0 0 0 660 174 298 10 2 87
0 0 0 721984 358208 4 6 0 0 0 0 0 0 0 0 0 503 157 316 10 1 89
0 0 0 721916 358140 4 6 0 0 0 0 0 0 0 0 0 512 206 321 10 2 88
0 0 0 721916 358140 4 6 0 0 0 0 0 0 0 0 0 504 143 299 10 1 88
0 0 0 721916 358140 4 6 0 0 0 0 0 0 0 0 0 498 156 307 11 1 88
0 0 0 721916 358140 4 6 0 0 0 0 0 0 0 0 0 493 195 309 10 1 88
0 0 0 721916 358140 4 6 0 0 0 0 0 0 0 0 0 505 140 298 11 1 88
0 0 0 721916 358140 4 6 0 0 0 0 0 0 0 0 0 516 379 309 11 1 88
0 0 0 721916 357944 4 54 0 0 0 0 0 0 0 0 0 507 149 305 10 2 88
^C
88% idle. That's the number you would expect when you use just 10% of your system, plus a little bit of load from all the processes running on a freshly installed system.
Why is the load average that high? The number seems counterintuitive, especially since the first column in vmstat shows 0. That column just shows the number of kernel threads in the run queues. Is it possible that those threads are in a run queue, but not in the state "on_proc"? To understand this, you have to dig into the mechanism that enables CPU capping. CPU capping is implemented in the scheduling subsystem. The system monitors the threads in the run queue and dispatch queue of a CPU. When a group of threads reaches the capping limit, its threads are put into a wait state, so they don't run on a CPU. The load generator spawns more and more threads, so the number of threads keeps increasing, but they are put into the wait state almost immediately.
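The mechanism described above can be pictured with a toy model. This is not the kernel's code: the struct zone_threads and the function enforce_cap are invented for illustration, and the rule "park everything runnable once usage reaches the cap" is a deliberate simplification of the real scheduler.

```c
/* Toy model only: the real logic lives in the kernel's scheduler. */
struct zone_threads {
    int runnable;  /* threads sitting in a run queue */
    int cap_wait;  /* threads parked because the zone hit its CPU cap */
};

/* If the zone's CPU usage has reached its cap, move all runnable
   threads of the zone into the wait state. Below the cap nothing happens. */
void enforce_cap(struct zone_threads *z, int usage, int cap)
{
    if (usage >= cap) {
        z->cap_wait += z->runnable;
        z->runnable = 0;
    }
}
```

In this model a zone that sits at its cap ends up with an empty run queue and a growing number of waiting threads, no matter how many threads the load generator creates.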
This can be substantiated by a quick look into the ps man page:
#> man ps
…
S (l) The state of the process:
...
W Waiting: process is waiting for CPU usage to drop to the CPU-caps enforced limits.
...
Let's use this knowledge on our system:
#> ps -o s=state -o comm=command -aelfL | grep load | more
W ./load.bin
W ./load.bin
W ./load.bin
...
W ./load.bin
W ./load.bin
W ./load.bin
W ./load.bin
# > ps -o s=state -o comm=command -aelfL | grep load | grep W | wc -l
2135
The number of waiting kernel threads can also be checked via kstat. The value of interest in this situation is "nwait".
# >kstat -m caps -n cpucaps_zone_`zoneadm list -icv | grep zone1 | awk '{print $1}'`
module: caps instance: 1
name: cpucaps_zone_1 class: zone_caps
above_sec 2141
below_sec 3753
crtime 4801.079480706
maxusage 135
nwait 2156
snaptime 10694.627372473
usage 40
value 40
zonename zone1
What's the reason for this difference between the load average and the data displayed by other tools? Commands such as prstat, uptime and w use the syscall getloadavg(), which apparently evaluates the number of entries in the run queues but isn't aware of the wait flag. A short dtrace one-liner confirms that these commands call it:
# >dtrace -n 'syscall::: /probefunc=="getloadavg"/ {trace(execname)}'
dtrace: description 'syscall::: ' matched 466 probes
CPU ID FUNCTION:NAME
2 2626 getloadavg:entry uptime
2 2627 getloadavg:return uptime
2 2626 getloadavg:entry prstat
2 2627 getloadavg:return prstat
2 2626 getloadavg:entry prstat
2 2627 getloadavg:return prstat
2 2626 getloadavg:entry prstat
2 2627 getloadavg:return prstat
2 2626 getloadavg:entry prstat
2 2627 getloadavg:return prstat
2 2626 getloadavg:entry w
2 2627 getloadavg:return w
^C
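The sendmail behaviour mentioned earlier boils down to a threshold check on the value returned by getloadavg(). Here is a minimal sketch of such a check. The threshold of 12 and the function names should_refuse and check_load_now are assumptions made for illustration; this is not sendmail's actual code.

```c
#define _DEFAULT_SOURCE      /* makes glibc declare getloadavg() */
#include <stdlib.h>          /* getloadavg() on glibc and the BSDs */
#ifdef __sun
#include <sys/loadavg.h>     /* getloadavg() on Solaris */
#endif

/* Hypothetical policy: refuse new work above this 1-minute load average. */
#define REFUSE_THRESHOLD 12.0

/* Pure decision function: 1 means refuse, 0 means accept. */
int should_refuse(double load1, double threshold)
{
    return load1 > threshold;
}

/* Applies the policy to the current 1-minute load average.
   Returns 1 to refuse, 0 to accept, -1 if no load average is available. */
int check_load_now(void)
{
    double avg[3];

    if (getloadavg(avg, 3) < 1)
        return -1;
    return should_refuse(avg[0], REFUSE_THRESHOLD);
}
```

With the load average of 2176.48 reported by uptime above, should_refuse() says "refuse", although vmstat shows the machine as 88% idle. A check like this has no way of knowing that the queued threads are merely waiting for their cap.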
Conclusion
There is an interesting interaction between getloadavg() and zone.cpu-cap that leads to misleading, but perfectly correct numbers. Keep this in mind when you try to make sense of a system with an extremely high load average that's still responsive.
Do you want to learn more?
Misc
opensolaris.org: The implementation of CPU caps