Resume scp/rsync file transfer

This is very basic Linux knowledge, but the question has arisen so often that I decided to document it. Problem: transfer (upload or download) a large tree of files and directories from one machine to another over an SSH connection, and be able to resume if the operation is interrupted. Solution: Until now, I believe that the best solution is [...]

We should be as the linux kernel…

Lucas Meneghel Rodrigues: Linux on Power - Tue, 11/18/2008 - 21:31
And persevere, no matter what: From arch/x86/kernel/traps_32.c, line 690: printk(KERN_EMERG "Dazed and confused, but trying to continue\n"); This teaches us a lesson about how we should live our lives: even in moments where everything seems confusing and dazzling, we *must* continue. Word.

SPE gang scheduling policies

Jeremy Kerr - Tue, 11/18/2008 - 02:56

In my previous post here, I mentioned that:

We also need to allow contexts to be loaded outside of spu_run to implement gang scheduling correctly and efficiently.

I think that may require a little explanation, so here we go:

gang scheduling policies

The idea behind SPE gang scheduling is to allow a set of related SPE contexts to be scheduled together, to allow interactions between contexts to be performed in a timely manner. For example, consider two contexts (A and B) that send mailbox messages to each other. If context A is running while context B is scheduled out, then A will spend its time-slice waiting for a message from B. If they are scheduled as a single entity, neither context will have to spend its timeslice blocked, waiting for the other context to be run.

So, we have to come up with a policy to define the behaviour of the gang scheduler. When does the gang become schedulable? Under which conditions should the gang be descheduled?

I can see four possible approaches:

policy 1: the gang is only schedulable when all contexts are runnable

In this case, the gang is only ever scheduled when all of the gang's contexts are runnable (ie, they are being run by the spu_run system call).

Although this is the simplest policy, it can leave a gang unable to complete. Consider the following:

  1. Context A becomes runnable
  2. Context B becomes runnable
  3. The gang is now schedulable, so both contexts are scheduled
  4. Because it has slightly less work to do, context A finishes before context B
  5. Because only one of the two contexts is runnable, the gang is no longer schedulable. Context B is never re-scheduled, so cannot complete the rest of its task

So, this policy isn't much use; perhaps we can solve this with a new approach:

policy 2: the gang is scheduled when all contexts are runnable, and descheduled when no contexts are runnable

This will solve the previous non-termination problem, in that context B will be able to terminate - the context isn't immediately descheduled when A finishes.

However, now we have a new, slightly more complex non-termination case:

  1. Context A becomes runnable
  2. Context B becomes runnable
  3. The gang is now schedulable, so both contexts are scheduled
  4. Because it has slightly less work to do, context A finishes before context B
  5. At the same time, context B does a PPE-assisted callback, which requires a stop-and-signal (and so leaves spu_run for just a moment)
  6. Because neither context is currently runnable, the gang is descheduled
  7. Context B finishes its callback, so re-enters spu_run to be re-scheduled. However, the policy does not allow context B to be re-scheduled, as only one of the two contexts is runnable.

Although this may sound like a rare occurrence, it's not a restriction we can pass on to the programmer. Imagine the following SPE code:

#include <stdio.h>

void do_work(void);     /* the application's real work, defined elsewhere */

int main(void)
{
        do_work();
        printf("work done!\n");
        return 0;
}

Here we're doing a PPE-assisted callback (the call to printf is implemented as a callback) before finishing. If this callback were to occur when the other context has already completed, we would hit the non-termination condition above.

This means that the last-running context of a gang can never do a PPE-assisted callback. In fact, to be completely safe against this non-termination, a programmer would have to avoid callbacks after any context has finished, for risk of callbacks on the rest of the gang being synchronised.

So, it looks like we need to be a little more permissive when deciding if the gang is schedulable.

policy 3: the gang is scheduled when any context is runnable, and descheduled when no contexts are runnable

This is another fairly simple approach: the gang is scheduled whenever there is any work to do. We no longer have any non-termination conditions, as 'having work to do' will result in 'doing work'.

The tricky part is that it will require us to change one of the fundamental assumptions about spufs: currently, we don't schedule any context unless it is runnable. Because we schedule the entire gang when one of its contexts becomes runnable, we now have to schedule a number of non-runnable contexts.

The good news is that I've already done a little experimental work to overcome this general restriction in spufs.

The last approach is a little more complex, but works around this restriction:

policy 4: schedule the runnable contexts of a gang, and reserve SPEs for the non-runnable contexts

This is just like policy 3, but instead of actually scheduling the non-runnable contexts, we reserve an SPE for them.

This way, a non-runnable context does not need to be loaded, but can be quickly scheduled when it becomes runnable. The downside is that we're only half-implementing gang scheduling; there may still be interactions with a non-runnable context (eg, accesses to the problem state mapping from a running context in the same gang) that will cause running contexts to become blocked.
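
To compare the first three policies side by side, here is a small user-space model of the "should the gang be on SPEs?" decision. It is only a sketch: struct gang_model, enum gang_policy and gang_should_run are hypothetical names invented for illustration and do not exist in spufs, and "runnable" is reduced to a simple count of contexts currently inside spu_run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names exist in spufs. */
struct gang_model {
        size_t n_contexts;      /* contexts in the gang */
        size_t n_runnable;      /* contexts currently inside spu_run */
        bool scheduled;         /* is the gang on SPEs right now? */
};

enum gang_policy { POLICY_1 = 1, POLICY_2, POLICY_3 };

/* Re-evaluate the policy: should the gang be on SPEs from now on? */
bool gang_should_run(enum gang_policy policy, const struct gang_model *g)
{
        bool all, any;

        assert(g->n_runnable <= g->n_contexts);
        all = g->n_runnable == g->n_contexts;
        any = g->n_runnable > 0;

        switch (policy) {
        case POLICY_1:  /* on SPEs only while every context is runnable */
                return all;
        case POLICY_2:  /* start when all are runnable, stop when none are */
                return g->scheduled ? any : all;
        case POLICY_3:  /* on SPEs whenever any context is runnable */
                return any;
        }
        return false;
}
```

With a two-context gang that has been descheduled and has one runnable context (context B re-entering spu_run after its callback), policies 1 and 2 both answer "no" and only policy 3 answers "yes", which is the non-termination case described above.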

So, which policy is best for spufs?

Policies 1 and 2 have significant flaws in their approach. It's quite possible that either will lead to non-termination conditions in fairly simple user programs. I don't think we can 'work around' this with a restriction on the programmer.

Policy 4 will require a mechanism for reserving SPEs for a particular context; I'm not convinced the extra complexity is worth the effort, especially as this doesn't allow us to implement gang scheduling properly.

Currently, Luke Browning and André Detsch have a work-in-progress patch series for gang scheduling, based on policy 3.

Handy joystick udev rules

Lucas Meneghel Rodrigues: Linux on Power - Sun, 11/16/2008 - 13:04
To make a long story short, I haven’t blogged for ages, and the best way to resume any activity is, well… actually doing it. Lately, I’ve been doing some experiments on my personal laptop, aiming to make it a good entertainment center, running on free software only. At an appropriate time, I will discuss how I’m [...]

YAFAP - What else is there to do when you are ill?

Mel Gorman: Kernel Spanner - Fri, 11/14/2008 - 10:22
I've been out ill the last few days (and still am). As having a head full of goo made me dumber than the average stick, unable to do more than an hour or two of useful work in a day, I decided to fire up nethack again to chew up a bit of the day. The only games I play are guitar hero variants, final fantasy anything and nethack, which was the first game I played on the PS3. I hadn't played properly in months, as the last time I almost made it, which just sickened me. The one exception was a game a few weeks ago for a local competition that had a keg as a prize if someone ascended (no one did).

The game was mainly a grind but being dumb also made me patient. One major setback was cancelling the whole inventory, including the Bell of Opening, which needs to be charged to finish the game. This is dumb, don't do it. I had the Wizard of Yendor killed by the time I realized the artifact was a no-go and there was no means of charging or wishing left in the game, as I had cleared them out or used them already. In a somewhat disgruntled mood, I put the character in a situation where it was attacked literally all the time, to finish the game and call it a day. Fortunately for me, between the monsters summoned by being negatively aligned with negative luck and the monsters repeatedly summoned by the Wizard, eventually a scroll of charging fell out for the first time in the game, putting me back in action and levelling me up considerably in the process. After opening the sanctum I found another scroll, which was enough to get a magic marker to bring my AC back to something respectable and stand up to two wizards (double trouble), the high priest of Moloch and various beasties at the same time without using wands of death, as I had lost all instant-kill methods by that time. It was a real messy fight, but eventually, with the Amulet of Yendor in hand and after poking the Wizard a few more times in the eye, I got to the Astral Plane. After visiting all three altars, last night I saw the lovely message:

"In return for thy service, I grant thee the gift of Immortality!" You ascend to the status of Demigod...

Mel the Lord St:25 Dx:18 Co:18 In:16 Wi:18 Ch:18 Lawful
Astral Plane $:175 HP:441(441) Pw:141(141) AC:-23 Xp:30/100090501 T:99085 Satiated


Wooooo! This is the full dump file and now, it's time to jam yet more wonderful drugs into my head and clear it out!

external context scheduling in spufs

Jeremy Kerr - Fri, 11/14/2008 - 02:14

At present, the spufs code has the invariant that a context is only ever loaded to an SPE when it is being run; ie, a thread is calling the spu_run syscall on the context.

However, there are situations where we may want to load the context without it being run. For example, using the SPU's DMA engine from the PPE requires the PPE thread to write to registers in the SPU's problem-state mapping (psmap). Faults on the psmap area can only be serviced while the context is loaded, so they will block until someone runs the context. Ideally, we could allow such accesses to the psmap without the spu_run call. We also need to allow contexts to be loaded outside of spu_run to implement gang scheduling correctly and efficiently.

So, I've been working on some experimental changes to allow "external scheduling" for SPE contexts. The "external" refers to a thread external to the SPE's usual method of scheduling (ie, its owning thread calling spu_run). In the example above, the external schedule would be caused by the fault handler for the problem-state mapping.

Although a context may be scheduled to an SPE, we still can't always guarantee forward progress. For example, in the "use the psmap to access the DMA engine" scenario, a DMA may cause a major page fault, which needs a controlling thread to service. In this case, the only way to ensure forward progress is through calling spu_run. However, I have some ideas on how we can remove this restriction later.

the interface

First up, we need to tell the spufs scheduler that we want a context to be loaded:

/*
 * Request an 'external' schedule for this context.
 *
 * The context will be either loaded to an SPU, or added to the run queue,
 * depending on SPU availability.
 *
 * Should be called with the context's state mutex locked, and the context
 * in SPU_STATE_SAVED state.
 */
int spu_request_external_schedule(struct spu_context *ctx);

After loading the context with spu_request_external_schedule, we need a way to tell the scheduler that the context can be de-scheduled:

/*
 * The context should be unscheduled at the end of its timeslice
 */
void spu_cancel_external_schedule(struct spu_context *ctx);

These functions are implemented by incrementing or decrementing a count of "external schedulers" on the context. If multiple threads are requesting an external schedule, then the first will activate the context. When the last thread calls the cancel method, the context can be descheduled.
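
The counting scheme can be sketched in user space as follows. This is a simplified model, not the code in the ext-sched branch: struct ctx_model, its fields and the model_* functions are hypothetical names, and the locking (the real functions expect the context's state mutex to be held) and the interaction with spu_run are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of struct spu_context. */
struct ctx_model {
        int ext_sched_count;    /* threads currently requesting an external schedule */
        bool wants_spu;         /* should the scheduler load / keep this context? */
};

/* Model of a request: only the first requester activates the context. */
void model_request_external_schedule(struct ctx_model *ctx)
{
        if (ctx->ext_sched_count++ == 0)
                ctx->wants_spu = true;
}

/* Model of a cancel: only the last canceller allows a deschedule. */
void model_cancel_external_schedule(struct ctx_model *ctx)
{
        assert(ctx->ext_sched_count > 0);
        if (--ctx->ext_sched_count == 0)
                ctx->wants_spu = false; /* takes effect at the end of the timeslice */
}
```

In this model, two overlapping requests keep wants_spu set until the second cancel, so one thread finishing its access does not pull the context out from under another.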

usage

We can use these two functions to allow the problem-state mapping fault handler to proceed outside of spu_run:

--- a/arch/powerpc/platforms/cell/spufs/file.c
+++ b/arch/powerpc/platforms/cell/spufs/file.c
@@ -413,9 +413,11 @@ static int spufs_ps_fault(struct vm_area_struct *vma,

        if (ctx->state == SPU_STATE_SAVED) {
                up_read(&current->mm->mmap_sem);
+               spu_request_external_schedule(ctx);
                spu_context_nospu_trace(spufs_ps_fault__sleep, ctx);
                ret = spufs_wait(ctx->run_wq, ctx->state == SPU_STATE_LOADED);
                spu_context_trace(spufs_ps_fault__wake, ctx, ctx->spu);
+               spu_cancel_external_schedule(ctx);
                down_read(&current->mm->mmap_sem);
        } else {
                area = ctx->spu->problem_phys + ps_offs;

Note that the spu_cancel_external_schedule function doesn't unload the context right away; if it did, the refault would fail too, and we'd end up in an infinite loop of faults. Instead, it keeps the context scheduled for the rest of its timeslice. This gives the faulting thread time to access the mapping after the fault handler has been invoked.

We also need to do a bit of trickery with the priorities of contexts during external schedule operations. If a high-priority thread accesses the problem-state mapping of a low-priority context, we want the context to temporarily inherit the higher priority. To do this, we raise the priority when spu_request_external_schedule is called, and drop it back after the context has finished its timeslice on the SPU.
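
A rough sketch of that temporary inheritance, again as a user-space model rather than the actual spufs change: struct prio_model and both functions are hypothetical, and the model assumes the usual kernel convention that a numerically lower value means a higher priority.

```c
#include <assert.h>

/* Hypothetical model; assumes lower value == higher priority. */
struct prio_model {
        int base_prio;  /* the context's own priority */
        int prio;       /* effective priority seen by the scheduler */
};

/* On an external schedule request: inherit the requester's priority if higher. */
void model_inherit_prio(struct prio_model *ctx, int requester_prio)
{
        if (requester_prio < ctx->prio)
                ctx->prio = requester_prio;
}

/* After the context has finished its timeslice: drop back to the base priority. */
void model_timeslice_done(struct prio_model *ctx)
{
        /* the effective priority is never worse than the base priority */
        assert(ctx->prio <= ctx->base_prio);
        ctx->prio = ctx->base_prio;
}
```

A request from a lower-priority thread leaves the context's effective priority unchanged, so the boost only ever goes one way.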

the code

I've created a development branch in the spufs repository for these changes, which is available:

  • via git: git://git.kernel.org/pub/scm/linux/kernel/git/jk/spufs.git, in the ext-sched branch; or
  • on the browsable gitweb interface.

Note that this is an experimental codebase; expect breakage!

About Trilead SSH open source project

Some time ago I wrote an article about the JSch open source project. However, I soon lost all my motivation when I faced, again, the code complexity that JSch requires just to start an execution session, a file copy or a port forwarding. Nevertheless the completely misunderstood authentication API for JSch let me [...]

new patchwork beta

Jeremy Kerr - Tue, 10/28/2008 - 06:23

We've had a new version of patchwork - the web-based patch-tracking system - online for a few weeks now at patchwork.ozlabs.org.

After Paul's presentation on patchwork at the 2008 Kernel Summit, there has been wider interest in patchwork setups for other projects. Patchwork originally hosted the Linux on Power and Linux on Cell lists; we've since added netdev, linux-mtd and linux-ext4. I've also added the main Linux Kernel mailing list (lkml), just to see how the new patchwork handles the load; all has been okay so far.

If you're interested in installing the new patchwork at your own site, you can grab the source from the patchwork project page. Installation can be a little tricky, so feel free to mail me if you need a hand.

So many cores... so few students...

Bill Buros: Improving Performance on Linux - Mon, 10/27/2008 - 15:32
So it's that time of year when we return to the semi-serious thought process regarding next year's college interns and college hires. Looking down the road in the industry and for Linux, the new servers coming have so many processor cores, requiring new programming models and approaches, that the students coming have got to know how to dive into this new world. And recalling the way-too-many resumes dug through over 2008, there's an amazing disconnect coming.

Which leads me to the three key questions:
  1. Can the schools actually shift gears to teach more about parallel programming?
  2. Or do we need to do this ourselves in the industry and in our example the Linux community itself?
  3. Can the students really cope with so many cores? They will eventually be the users of all of these cores.
I suspect we'll all be chasing just a handful of students who have the background to help the industry down a new path of effective and efficient parallel programming.

As SC08 looms in less than a month here in Austin, the many topics around super-computing keep coming up. One of the biggest pieces we're interested in is this transition to the massive number of cores coming down the pipe across the hardware architectures and systems. This is especially apparent as our performance focus has branched into clusters as the systems scale up and scale out. And Linux plays on them all. HPC, commercial, educational, personal - the systems are going to have an amazing number of processor cores ready to do useful work.

For myself, it's been an interesting personal journey this year as I've learned more about cluster performance and the different ways that customers and performance benchmarks parallelize HPC work. I certainly have a long way to go, but I now know enough to see that many cluster products and cluster programming techniques make things way too hard for the end user and the system admins.

Getting ready for SC08, there are a number of sessions, panels, and birds-of-a-feather sessions emerging on this topic. For example, Paul Steinberg over at Intel popped a blog up on "Sequential programming is dead. So stop teaching it!". Interesting theme. Paul's blog has a number of pointers to related information and papers.

Paul McKenney in the LTC nicely hooked me up with this effort so I'll be spending more time on this over the next month. This looks like a really good time to snag the universities and prod the educational processes to shift gears. And, I'm hoping I'll learn enough about parallel programming to know what to look for in next year's students. We need the skills. Today.

PCMan: the real and true Gnome file manager

I have modified this post after several people complained that I had misunderstood: Nautilus can, sure, display a tree view of the file system. Sure, but only for those who know how. I do not like the Gnome desktop environment because of its excessively simplified GUI, which treats all users as if they lack the minimal skill to interact with a [...]

Using virtualization to provide "HA at wholesale"

Traditionally, the way people have implemented high availability is by using a high-availability management package like Linux-HA[1], then configuring it in detail for each application, file system mount, IP address and so on. This traditional method works quite well, but can be a bit labor intensive - particularly when using custom or uncommon applications. You may have to understand the structure of your applications, write some resource agents[2], debug them, and test them in detail. In addition, every time you change your mount structure, or other details you've told your HA system, you have to be sure to update your HA configuration to match - or it might not fail over correctly the next time.

When you have good resource agents, your HA system will also recover from application failures - by restarting applications that have failed.  This is a good thing.  On the other hand, this is enough work that virtually no one runs all their applications in an HA configuration.  It's just too much work for most applications.  I call this traditional boutique-like method "HA at retail".  It works well, but it is a little costly to set up and maintain all the details just so.

With virtualization, another approach is possible, and (big surprise), I call it "HA at wholesale".  In this paradigm, instead of needing to write scripts for each type of application, you just have one resource agent - one for managing a virtual machine.  You also don't need to know the structure of the applications - the OS still starts them in whatever way it has been starting them all along.  Wow, this sounds good - less work, fewer chances for errors!  As expected, there is still no such thing as a free lunch here - you do wind up with some disadvantages.

For example, you can no longer easily detect the failure of an application.  In addition, if an application fails, the only thing you can do about it is reboot the entire virtual machine.   Inevitably, this takes longer than just restarting the failed application.

So, HA at wholesale has these properties:
  • Simple enough that you can implement it for every machine
  • Works well for hardware failures
  • When coupled with hardware predictive failure analysis[3] and smart HA software, outages can sometimes be completely avoided.
  • Can't easily detect or recover from application failures
  • The only thing you can do about any failure is reboot the virtual machine
HA at retail has these properties:
  • It is complex enough that you need to limit how broadly you apply it in your environment
  • Works well for hardware failures
  • It can easily detect and recover from application failures
  • Individual applications can easily be restarted - and don't require a reboot
HA software like Linux-HA[1] can manage either type of environment.  In an ideal world, one would like to be able to do both in the same software infrastructure.


[1] http://linux-ha.org/
[2] http://linux-ha.org/ResourceAgent
[3] http://www-05.ibm.com/hu/termekismertetok/xseries/dn/pfa.pdf

Hey! Who's stealing my CPU cycles?!

Bill Buros: Improving Performance on Linux - Wed, 10/22/2008 - 10:45
I hear this every now and then on the Power systems from customers, programmers, and even peers. In the more recent distro versions, there's a new "st" column in the CPU metrics which tracks the usage of "stolen" CPU cycles, from the perspective of the CPU being measured. This "steal" column has been around for a while, but the most recent service packs of RHEL 5.2 and SLES 10 sp2 have the latest fixes which display the intended values - so the values are getting noticed more.

I believe this "cpu cycle stealing" all came into being when things like Xen were being developed and the programmers wanted a way to account for the CPU cycles which were allocated to another partition. I suspect the programmers were looking at it from the perspective of "my partition", where something devious and nefarious was daring to steal my CPU cycles. Thus the term "stolen CPU cycles". Just guessing though.

This "steal" term is a tad unfortunate. It's been suggested that a more gentle term of "sharing" would be preferred for customers. But digging around the source code I found the term "steal" is fairly pervasive. And what's in the code, tends to end up in the man pages. Ah well.

With Power hardware, there's a mode where the two hardware threads are juggled by the Linux scheduler. This is implemented via cpu pairs (for example, cpu0 and cpu1) which represent the schedulable individual hardware threads running on the single processor core. This is the SMT mode (simultaneous multi-threaded) on Power.
  • The term "hardware thread" is with respect to the processor core. Each processor core can have two active hardware threads. Software threads and software processes are scheduled on the processor cores by the operating system via the schedulable CPUs which correspond to the two hardware threads.
In the SMT realm, each SMT hardware thread can be considered a sibling (in the sense of brother or sister) of the other, running on the same processor core. So if the two hardware threads are flat-out-busy with work from the operating system and evenly balanced, then each of the corresponding CPUs being scheduled is generally getting 50% of the processor core's cycles.

From a performance perspective, this has tremendous advantages because the processor core can flip between the hardware threads as soon as one thread hits a short-wait for things like memory accesses. Essentially the processor core can fetch the instructions and memory accesses simultaneously for the two hardware threads which improves the efficiency of the core.

In days of old, each CPU's metrics were generally based on the premise that a CPU could get to 100% user busy. Now, the new steal column can account for the processor cycles being shared by the two SMT sibling threads, not to mention additional CPU cycles being shared with other partitions. It's still possible for an individual CPU to go to 100% user busy, while the SMT sibling thread is idle.

For example, in the vmstat output below, the rightmost CPU column is the steal column. On an idle system, this value isn't very meaningful.

# vmstat 1
procs ------------memory----------- ---swap-- -----io---- --system-- ------cpu------
 r  b   swpd     free   buff  cache   si   so    bi    bo   in    cs us sy  id wa st
 0  0      0 14578432 408768 943616    0    0     0     0    2     5  0  0 100  0  0
 0  0      0 14578368 408768 943616    0    0     0     0   25    44  0  0 100  0  0
 0  0      0 14578432 408768 943616    0    0     0    32   12    44  0  0 100  0  0
 0  0      0 14578432 408768 943616    0    0     0     0   21    45  0  0 100  0  0

In the next example, we push do-nothing work onto every CPU (in this case a four-core system with SMT on, so 8 CPUs were available). The vmstat "st" column quickly gets to the point where the CPU cycles on average are 50% user and 50% steal.
  • Try using "top", then press the "1" key to see more easily what's happening on a per-CPU basis.

while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
# vmstat 1
procs ------------memory----------- ---swap-- -----io---- --system-- ------cpu------
 r  b   swpd     free   buff  cache   si   so    bi    bo   in    cs us sy  id wa st
 8  0      0 14574400 408704 943488    0    0     0     0   26    42 50  0   0  0 50
 8  0      0 14574400 408704 943488    0    0     0     0   11    34 50  0   0  0 50
 8  0      0 14574400 408704 943488    0    0     0     0   26    42 50  0   0  0 50
 8  0      0 14574656 408704 943488    0    0     0     0   10    34 50  0   0  0 50

For customers and technical people who were used to seeing their CPUs up to 100% user busy, this can be... disconcerting... but it's now perfectly normal, even expected.

I just wish we could distinguish the SMT sharing of CPU cycles, and the CPU cycles being shared with other partitions.
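
For anyone who wants the raw numbers behind the "st" column: the kernel exports per-CPU tick counters in /proc/stat, and steal is the eighth counter on each "cpu" line. The sketch below parses two such lines and computes the steal share between the samples. The struct and function names are made up for this example, and it is a minimal illustration of the arithmetic rather than what vmstat or top literally do internally.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* First eight counters of one "cpu" line in /proc/stat, in clock ticks. */
struct cpu_ticks {
        unsigned long long user, nice, system, idle, iowait, irq, softirq, steal;
};

/* Parse a "cpu ..." or "cpuN ..." line; returns 0 on success, -1 on failure. */
int parse_cpu_line(const char *line, struct cpu_ticks *t)
{
        int n;

        if (strncmp(line, "cpu", 3) != 0)
                return -1;
        /* %*s skips the "cpu"/"cpuN" label without storing it */
        n = sscanf(line, "%*s %llu %llu %llu %llu %llu %llu %llu %llu",
                   &t->user, &t->nice, &t->system, &t->idle,
                   &t->iowait, &t->irq, &t->softirq, &t->steal);
        return n == 8 ? 0 : -1;
}

static unsigned long long total_ticks(const struct cpu_ticks *t)
{
        return t->user + t->nice + t->system + t->idle +
               t->iowait + t->irq + t->softirq + t->steal;
}

/* Percentage of the ticks between an earlier and a later sample that were steal. */
double steal_percent(const struct cpu_ticks *earlier, const struct cpu_ticks *later)
{
        unsigned long long before = total_ticks(earlier);
        unsigned long long after = total_ticks(later);

        assert(after >= before && later->steal >= earlier->steal);
        if (after == before)
                return 0.0;
        return 100.0 * (double)(later->steal - earlier->steal) /
               (double)(after - before);
}
```

To use it, read the line for one CPU from /proc/stat twice, about a second apart, and pass both samples to steal_percent. On the busy SMT system above, each CPU should come out near 50.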

For more details on the process of sharing the CPU cycles, especially when the CPU cycles are being shared between partitions, check out this page where we dive into more (but not yet all) of the gory details...

Hey! Who's stealing my CPU cycles?!

Bill Buros: Improving Performance on Linux - Wed, 10/22/2008 - 05:45
I hear this every now and then on the Power systems from customers, programmers, and even peers. In the more recent distro versions, there's a new "st" column in the CPU metrics which tracks the usage of "stolen" CPU cycles, from the perspective of the CPU being measured. This "steal" column has been around for a while, but the most recent service packs of RHEL 5.2 and SLES 10 sp2 have the latest fixes which display the intended values - so the values are getting noticed more.

I believe this "cpu cycle stealing" all came into being when things like Xen were being developed and the programmers wanted a way to account for the CPU cycles which were allocated to another partition. I suspect the programmers were looking at it from the perspective of "my partition", where something devious and nefarious was daring to steal my CPU cycles. Thus the term "stolen CPU cycles". Just guessing though.

This "steal" term is a tad unfortunate. It's been suggested that a gentler term such as "sharing" would be preferred for customers. But digging around the source code, I found the term "steal" is fairly pervasive. And what's in the code tends to end up in the man pages. Ah well.

With Power hardware, there's a mode where the two hardware threads are juggled by the Linux scheduler. This is implemented via CPU pairs (for example, cpu0 and cpu1) which represent the individual schedulable hardware threads running on a single processor core. This is the SMT (simultaneous multi-threading) mode on Power.
  • The term "hardware thread" is with respect to the processor core. Each processor core can have two active hardware threads. Software threads and software processes are scheduled on the processor cores by the operating system via the schedulable CPUs which correspond to the two hardware threads.
In the SMT realm, the two SMT hardware threads running on a processor core can be considered siblings (in the sense of brother or sister) of each other. So if the two hardware threads are flat-out busy with work from the operating system and evenly balanced, then each of the corresponding CPUs being scheduled is generally getting 50% of the processor core's cycles.

From a performance perspective, this has tremendous advantages because the processor core can flip between the hardware threads as soon as one thread hits a short wait for things like memory accesses. Essentially, the processor core can fetch the instructions and memory accesses simultaneously for the two hardware threads, which improves the efficiency of the core.

In days of old, each CPU's metrics were generally based on the premise that a CPU could get to 100% user busy. Now, the new steal column can account for the processor cycles being shared by the two SMT sibling threads, not to mention additional CPU cycles being shared with other partitions. It's still possible for an individual CPU to go to 100% user busy, while the SMT sibling thread is idle.

For example, in the vmstat output below, the rightmost CPU column is the steal column. On an idle system, this value isn't very meaningful.

# vmstat 1
procs -----------memory------------ ---swap-- -----io---- --system- ------cpu------
 r  b   swpd     free   buff  cache   si   so    bi    bo   in   cs us sy  id wa st
 0  0      0 14578432 408768 943616    0    0     0     0    2    5  0  0 100  0  0
 0  0      0 14578368 408768 943616    0    0     0     0   25   44  0  0 100  0  0
 0  0      0 14578432 408768 943616    0    0     0    32   12   44  0  0 100  0  0
 0  0      0 14578432 408768 943616    0    0     0     0   21   45  0  0 100  0  0
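
For anyone who wants to look at the steal number without vmstat, here is a minimal Python sketch that computes it from the aggregate "cpu" line of /proc/stat. The field order (user, nice, system, idle, iowait, irq, softirq, steal) is the one documented in proc(5); the function names are made up for illustration, and this is only an approximation of what tools like vmstat report, not their actual implementation.

```python
import time

# Field order of the "cpu" lines in /proc/stat, as documented in proc(5).
# The later guest fields are skipped: guest time is already counted in user.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Return a dict of tick counters from a /proc/stat 'cpu...' line."""
    parts = line.split()
    if not parts or not parts[0].startswith("cpu"):
        raise ValueError("not a cpu line: %r" % line)
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Older kernels report fewer fields; treat the missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def sample_percentages(interval=1.0, path="/proc/stat"):
    """Read the aggregate cpu line twice, 'interval' seconds apart (Linux only)."""
    def read():
        with open(path) as stat_file:
            return parse_cpu_line(stat_file.readline())

    before = read()
    time.sleep(interval)
    return cpu_percentages(before, read())
```

On a Linux box, sample_percentages()["steal"] gives the share of ticks counted as stolen over the last second. Passing a "cpu0", "cpu1", ... line to parse_cpu_line instead of the aggregate line gives the same breakdown per CPU.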

In the next example, we push do-nothing work onto every CPU (in this case a four-core system with SMT on, so 8 CPUs were available). The vmstat "st" column quickly gets to the point where the CPU cycles on average are 50% user and 50% steal.
  • Try using "top", then press the "1" key to more easily see what's happening on a per-CPU basis.

while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
# vmstat 1
procs -----------memory------------ ---swap-- -----io---- --system- ------cpu------
 r  b   swpd     free   buff  cache   si   so    bi    bo   in   cs us sy  id wa st
 8  0      0 14574400 408704 943488    0    0     0     0   26   42 50  0   0  0 50
 8  0      0 14574400 408704 943488    0    0     0     0   11   34 50  0   0  0 50
 8  0      0 14574400 408704 943488    0    0     0     0   26   42 50  0   0  0 50
 8  0      0 14574656 408704 943488    0    0     0     0   10   34 50  0   0  0 50

For customers and technical people who were used to seeing their CPUs up to 100% user busy, this can be... disconcerting... but it's now perfectly normal... even expected.
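
The eight background loops above keep spinning until somebody kills them. As a rough sketch (the helper names here are hypothetical, and it assumes a POSIX "sh" is on the PATH), the same experiment can be driven from Python so that one loop is started per CPU the operating system reports, and all of them are cleaned up afterwards:

```python
import os
import subprocess
import time

# The same do-nothing shell loop used in the example above.
BUSY_LOOP = "while : ; do : ; done"


def start_busy_loops(count=None):
    """Start 'count' busy shell loops (default: one per CPU the OS reports)."""
    if count is None:
        count = os.cpu_count() or 1
    return [subprocess.Popen(["sh", "-c", BUSY_LOOP]) for _ in range(count)]


def stop_busy_loops(procs):
    """Terminate the loops and wait, so no spinning process is left behind."""
    for proc in procs:
        proc.terminate()
    for proc in procs:
        proc.wait()


def run_busy_loops(seconds, count=None):
    """Keep the CPUs busy for 'seconds', then clean up; returns the loop count."""
    procs = start_busy_loops(count)
    try:
        time.sleep(seconds)
    finally:
        stop_busy_loops(procs)
    return len(procs)
```

Calling run_busy_loops(30) while "vmstat 1" runs in another terminal should reproduce the picture above on an SMT system. Note that os.cpu_count() counts the schedulable CPUs (hardware threads), not the processor cores.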

I just wish we could distinguish the SMT sharing of CPU cycles, and the CPU cycles being shared with other partitions.

For more details on the process of sharing the CPU cycles, especially when the CPU cycles are being shared between partitions, check out this page where we dive into more (but not yet all) of the gory details...

python scripting in gdb update

Thiago Bauermann: Linux on Power - Thu, 10/16/2008 - 01:03

It’s been a great while since I last posted about Python scripting in GDB, mostly because I’ve been busy coding the feature and getting it ready for upstream.

First of all, I’d like to take the opportunity to encourage people interested in using this feature to experiment with what we have implemented so far. The reason is that if you still can’t do what you want with the current code in the Python branch, we’d love to hear what you miss and implement it. We are working on what is useful for ourselves, and trying to decide what other people would find useful. But it’s not possible to imagine everything that people want to use this for, or even most things. Please refer to this wiki page to learn what currently works, what we plan to implement, and how to grab the code from the Python branch.

Feel free to write to the GDB mailing list or show up in the #gdb IRC channel at Freenode to discuss this work and/or bring your use case to our attention, so that we can support it. I hope that with enough input from prospective users we can ship something that’s immediately useful for most people, and avoid having to jump through hoops later and have to shoehorn something that we forgot to cater for initially, risking breaking scripts out there or ending up with an inconsistent API.

Anyway, back to business: I have just committed the second patch in the Python series! It exports GDB’s value subsystem to Python scripts. Basically, GDB values are objects which represent data in the inferior (GDB jargon for the program being debugged), holding its address in the inferior’s address space, its type and so on. See the “Python API” section in the GDB manual if you want to learn more about it (yes, we are even writing documentation for the feature!).

I committed the first patch back in August, but I didn’t mention it here because it didn’t do anything the user would find useful, really. It was just groundwork for the rest (autoconf and Makefile.in changes, a ‘python’ command in GDB which basically does nothing useful, initial documentation…). Still, it was about 1500 lines long (not counting the patch’s context)! This shows how much work it is to integrate Python support in GDB. I almost regret having joined this effort. :-)

The second patch also doesn’t allow the user to do anything useful yet, unfortunately. But it is noteworthy because it is a base upon which a lot of other Python support code depends. Also, it’s the first committed patch which actually exposes something from GDB to Python. It took a while to get this code ready for two reasons: one was that there was a long discussion regarding the syntax for accessing struct/union/class elements. The other was that implementing the Value class involved playing with little-documented aspects of Python’s C interface, and it took me time to discover how to do what I needed.

Now my next step is to choose the next patch from the Python series to submit upstream, and get it ready for posting (i.e., fix FIXMEs, add testcases and documentation). This brings me to another thing I’d like to mention. Back in April when I first prepared the Python patch series, I naively thought that after cutting them out, it was just a matter of posting them, iterating through a few review/rework steps, and they’d be committed. Simple enough. But here we are in mid-October and just two of the nine patches went in (now it’s more like 15 patches in total)! What happened?

The problem is that we’ve been working in the branch in an experimental and exploratory way, just hacking together enough to get something useful done. This was necessary because we didn’t know exactly what we would want to expose from GDB to Python, and how we wanted to do that. As we progressed and discussed the results, things started to become clear. The problem is that now we have a lot to clean up, voids to fill, and above all documentation and testcases to write. This takes time.

At least, that was the problem with the first two patches. I noticed Tom Tromey started to write more documentation and tie up more loose ends than in the beginning (me? I’ve just been working on the first two patches until they were ready. Haven’t written sexy new stuff since then…), so there’s hope that the next patches will be easier to work with. We still lack a lot of tests for the testsuite, though…

      

Linux on Power.. links and portals..

Bill Buros: Improving Performance on Linux - Wed, 10/15/2008 - 07:23
Several of us were playing around recently counting up how many "portals" we could find for information in and around Linux for Power systems. On the performance side, we were specifically interested in seeing whether there was information "out there" that we could leverage that we weren't really aware of. We actually found a lot of performance information which I'll try to highlight more in the coming weeks.

For the portals, we did find an amazing assortment of web pages available. We hit classic marketing portals, hardware and performance information, generic Linux information, technical portals, a couple of old and outdated portals, various download sites for added-value items, lots of IBM forums on developerWorks, IBM Redbooks (always good information), and pointers to wiki pages spanning a number of subjects.

The marketing and customer teams generally point to the IBM Linux page as the primary entry point (portal). The five web tabs out there (Overview, Getting Started, Solutions, About Linux, and Resources) can get the reader to all sorts of official information.

For our list of web sites and pages, rather than just file the list in another email bucket never to be seen again, we created an index page called Quick Links to keep track of what we wanted to hunt down and get updated and more current. We naturally didn't want to call it another portal. 'Course, now we're hunting down subject-matter experts (aka volunteers) to help update the various wiki pages, especially under the developerWorks Linux for Power architecture wiki. We're particularly interested in providing more of the practical details, one example being the HPC Central - Red Hat page where a series of technical wiki pages are available.

Another interesting observation is seeing IBM's classic reliance on the developerWorks forums which we listed on our Quick Links index page. The Linux community is far more used to mailing lists for interactions, questions, and development issues. Forums are fine for questions and answers, but in our mind many of the forums are rarely used, even if the technology or product covered by each forum is helpful and useful. I would expect that we'll start seeding the forums with answers to questions we get from customers, developers, and peers. Help nudge things along. Which will give us more places to link to and get the practical questions answered.

[edit'ed 10/30/2008 - we made the Quick Links page the LinuxP home page]
