In my previous post here, I mentioned that:
We also need to allow contexts to be loaded outside of spu_run to implement gang scheduling correctly and efficiently.
I think that may require a little explanation, so here we go:
The idea behind SPE gang scheduling is to allow a set of related SPE contexts to be scheduled together, to allow interactions between contexts to be performed in a timely manner. For example, consider two contexts (A and B) that send mailbox messages to each other. If context A is running while context B is scheduled out, then A will spend its time-slice waiting for a message from B. If they are scheduled as a single entity, neither context will have to spend its timeslice blocked, waiting for the other context to be run.
So, we have to come up with a policy to define the behaviour of the gang scheduler. When does the gang become schedulable? Under which conditions should the gang be descheduled?
I can see four possible approaches:
In this case, the gang is only ever scheduled when all of the gang's contexts are runnable (ie, they are being run by the spu_run system call).
Although the simplest, this approach will never complete&emdash;consider the following:
So, this policy isn't much use; perhaps we can solve this with a new approach:
This will solve the previous non-termination problem, in that context B will be able to terminate - the context isn't immediately descheduled when A finishes.
However, now we have a new, slightly more complex non-termination case:
Although this may sound like a rare occurrence, it's not a restriction we can pass on to the programmer. Imagine the following SPE code:
int main(void) { do_work(); printf("work done!\n"); return 0; }
Here we're doing a PPE-assisted callback (the call to printf is implemented as a callback) before finishing. If this callback were to occur when the other context has already completed, we would hit the non-termination condition above.
This means that the last-running context of a gang can never do a PPE-assisted callback. In fact, to be completely safe against this non-termination, a programmer would have to avoid callbacks after any context has finished, for risk of callbacks on the rest of the gang being synchronised.
So, it looks like we need to be a little more permissive when deciding if the gang is schedulable.
This is another fairly simple approach&emdash;the gang is scheduled whenever there is any work to do. We no longer have any non-termination conditions, as 'having work to do' will result in 'doing work'.
The tricky part is that it will require us to change one of the fundamental assumptions about spufs: currently, we don't schedule any context unless it is runnable. Because we schedule the entire gang when one if its context becomes runnable, we have to now schedule a number of non-runnable contexts.
The good news is that I've already done a little experimental work to overcome this general restriction in spufs.
The last approach is a little more complex, but works around this restriction:
This is just like policy 3, but instead of actually scheduling the non-runnable contexts, we reserve a SPE for them.
This way, a non-runnable context does not need to be loaded, but can be quickly scheduled when it becomes runnable. The downside is that we're only half-implementing gang scheduling; there still may be interactions to a non-runnable SPE (eg, accesses to the problem state mapping from a running context in the same gang) that will cause running contexts to become blocked.
So, which policy is best for spufs?
Policies 1 and 2 have significant flaws in their approach. It's quite possible that either will lead to non-termination conditions in fairly simple user programs. I don't think we can 'work around' this with a restriction on the programmer.
Policy 4 will require a mechanism for reserving SPEs for a particular context; I'm not convinced the extra complexity is worth the effort, especially as this doesn't allow us to implement gang scheduling properly.
Currently, Luke Browning and André Detsch have a work-in progress patch series for gang scheduling, based on policy 3.
"In return for thy service, I grant thee the gift of Immortality!" You ascend to the status of Demigod...
Mel the Lord St:25 Dx:18 Co:18 In:16 Wi:18 Ch:18 Lawful
Astral Plane $:175 HP:441(441) Pw:141(141) AC:-23 Xp:30/100090501 T:99085 Satiated
At present, the spufs code has the invariant that a context is only ever loaded to an SPE when it is being run; ie, a thread is calling the spu_run syscall on the context.
However, there are situations where we may want to load the context without it being run. For example, to use the SPU's DMA engine from the PPE, requires the PPE thread to write to registers in the SPU's problem-state mapping (psmap). Faults on the psmap area can only be serviced while the context is loaded, so will block until someone runs the context. Ideally, we could allow such accesses to the psmap without the spu_run call. We also need to allow contexts to be loaded outside of spu_run to implement gang scheduling correctly and efficiently.
So, I've been working on some experimental changes to allow "external scheduling" for SPE contexts. The "external" refers to a thread external to the SPE's usual method of scheduling (ie, it's owning thread calling spu_run). In the example above, the external schedule would be caused by the fault handler for the problem-state mapping.
Although a context may be scheduled to an SPE, we still can't always guarantee forward progress. For example, in the "use the psmap to access the DMA engine" scenario, a DMA may cause a major page fault, which needs a controlling thread to service. In this case, the only way to ensure forward progress is through calling spu_run. However, I have some ideas on how we can remove this restriction later.
First up, we need to tell the spufs scheduler that we want a context to be loaded:
/* * Request an 'external' schedule for this context. * * The context will be either loaded to an SPU, or added to the run queue, * depending on SPU availability. * * Should be called with the context's state mutex locked, and the context * in SPU_STATE_SAVED state. */ int spu_request_external_schedule(struct spu_context *ctx);
After loading the context with spu_request_external_schedule, we need a way to tell the scheduler that the context can be de-scheduled:
/* * The context should be unscheduled at the end of its timeslice */ void spu_cancel_external_schedule(struct spu_context *ctx);These functions are implemented by incrementing or decrementing a count of "external schedulers" on the context. If multiple threads are requesting an external schedule, then the first will activate the context. When the last thread calls the cancel method, the context can be descheduled.
We can use these two functions to allow the problem-state mapping fault handler to proceed outside of spu_run:
--- a/arch/powerpc/platforms/cell/spufs/file.c +++ b/arch/powerpc/platforms/cell/spufs/file.c @@ -413,9 +413,11 @@ static int spufs_ps_fault(struct vm_area_struct *vma, if (ctx->state == SPU_STATE_SAVED) { up_read(¤t->mm->mmap_sem); + spu_request_external_schedule(ctx); spu_context_nospu_trace(spufs_ps_fault__sleep, ctx); ret = spufs_wait(ctx->run_wq, ctx->state == SPU_STATE_LOADED); spu_context_trace(spufs_ps_fault__wake, ctx, ctx->spu); + spu_cancel_external_schedule(ctx); down_read(¤t->mm->mmap_sem); } else { area = ctx->spu->problem_phys + ps_offs;
Note that the spu_cancel_external_schedule function doesn't unload the context right away; if it did, the refault would fail too, and we'd end up in an infinite loop of faults. Instead, it keeps the context scheduled for the rest of its timeslice. This gives the faulting thread time to access the mapping after the fault handler has been invoked.
We also need to do a bit of trickery with the priorities of contexts during external schedule operations. If a high-priority thread access the problem-state mapping of a low-priority context, we want the context to temporarily inherit the higher priority. To do this, we raise the priority when spu_request_external_schedule is called, and drop it back after the context has finished its timeslice on the SPU.
I've created a development branch in the spufs repository for these changes, which is available:
Note that this is an experimental codebase, expect breakages!
We've had a new version of patchwork - the web-based patch-tracking system - online for a few weeks now at patchwork.ozlabs.org.
After Paul's presentation on patchwork at the 2008 Kernel Summit, there has been wider interest in patchwork setups for other projects. Patchwork originally hosted the Linux on Power and Linux on Cell lists, we've since added netdev, linux-mtd and linux-ext4. I've also added the main Linux Kernel mailing list (lkml), just to see how the new patchwork handles the load; all has been okay so far.
If you're interested in installing the new patchwork at your own site, you can grab the source from the patchwork project page. Installation can be a little tricky, so feel free to mail me if you need a hand.
# vmstat 1
procs ---- -------memory------- ---swap-- ---io--- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 14578432 408768 943616 0 0 0 0 2 5 0 0 100 0 0
0 0 0 14578368 408768 943616 0 0 0 0 25 44 0 0 100 0 0
0 0 0 14578432 408768 943616 0 0 0 32 12 44 0 0 100 0 0
0 0 0 14578432 408768 943616 0 0 0 0 21 45 0 0 100 0 0
while : ; do : ; done &For customers and technical people who were used to seeing their CPUs up to 100% user busy, this can be... disconcerting... but it's now perfectly normal.. even expected..
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
# vmstat 1
procs ---- -------memory------- ---swap-- ---io--- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 0 0 14574400 408704 943488 0 0 0 0 26 42 50 0 0 0 50
8 0 0 14574400 408704 943488 0 0 0 0 11 34 50 0 0 0 50
8 0 0 14574400 408704 943488 0 0 0 0 26 42 50 0 0 0 50
8 0 0 14574656 408704 943488 0 0 0 0 10 34 50 0 0 0 50
# vmstat 1
procs ---- -------memory------- ---swap-- ---io--- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 14578432 408768 943616 0 0 0 0 2 5 0 0 100 0 0
0 0 0 14578368 408768 943616 0 0 0 0 25 44 0 0 100 0 0
0 0 0 14578432 408768 943616 0 0 0 32 12 44 0 0 100 0 0
0 0 0 14578432 408768 943616 0 0 0 0 21 45 0 0 100 0 0
while : ; do : ; done &For customers and technical people who were used to seeing their CPUs up to 100% user busy, this can be... disconcerting... but it's now perfectly normal.. even expected..
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
while : ; do : ; done &
# vmstat 1
procs ---- -------memory------- ---swap-- ---io--- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 0 0 14574400 408704 943488 0 0 0 0 26 42 50 0 0 0 50
8 0 0 14574400 408704 943488 0 0 0 0 11 34 50 0 0 0 50
8 0 0 14574400 408704 943488 0 0 0 0 26 42 50 0 0 0 50
8 0 0 14574656 408704 943488 0 0 0 0 10 34 50 0 0 0 50
It’s been a great while since I last posted about Python scripting in GDB, mostly because I’ve been busy coding the feature and getting it ready for upstream.
First of all, I’d like to take the opportunity to encourage people interested in using this feature to experiment with what we have implemented so far. The reason is that if you still can’t do what you want with the current code in the Python branch, we’d love to hear what you miss and implement it. We are working on what is useful for ourselves, and trying to decide what other people would find useful. But it’s not possible to imagine everything that people want to use this for, or even most things. Please refer to this wiki page to learn what currently works, what we plan to implement, and how to grab the code from the Python branch.
Feel free to write to the GDB mailing list or show up in the #gdb IRC channel at Freenode to discuss this work and/or bring your use case to our attention, so that we can support it. I hope that with enough input from prospective users we can ship something that’s immediately useful for most people, and avoid having to jump through hoops later and have to shoehorn something that we forgot to cater for initially, risking breaking scripts out there or ending up with an inconsistent API.
Anyway, back to business: I have just committed the second patch in the Python series! It exports GDB’s value subsystem to Python scripts. Basically, GDB values are objects which represent data in the inferior (GDB jargon for the program being debugged), holding its address in the inferior’s addressspace, its type and so on. See the “Python API” section in the GDB manual if you want to learn more about it (yes, we are even writing documentation for the feature!).
I committed the first patch back in August, but I didn’t mention it here because it didn’t do anything the user would find useful, really. It was just groundwork for the rest (autoconf and Makefile.in changes, a ‘python’ command in GDB which basically does nothing useful, initial documentation…). Still, it was about 1500 lines long (not counting the patch’s context)! This shows how much work it is to integrate Python support in GDB. I almost regret having joined this effort.
The second patch also doesn’t allow the user to do anything useful yet, unfortunately. But it is noteworthy because it is a base upon which a lot of other Python support code depend upon. Also, it’s the first committed patch which actually exposes something from GDB to Python. It took a while to get this code ready for two reasons: one was that there was a long discussion regarding how the syntax of acessing struct/union/class elements. The other was that implementing the Value class involved playing with little-documented aspects of Python’s C interface, and it took me time to discover how to do what I needed.
Now my next step is to choose the next patch from the Python series to submit upstream, and get it ready for posting (i.e., fix FIXMEs, add testcases and documentation). This brings me to another thing I’d like to mention. Back in April when I first prepared the Python patch series, I naively thought that after cutting them out, it was just a matter of posting them, iterate through a few review/rework steps and they’d be committed. Simple enough. But here we are in mid-October and just two from nine patches went in (now it’s more like 15 patches in total)! What happened?
The problem is that we’ve been working in the branch in an experimental and exploratory way, just hacking together enough to get something useful done. This was necessary because we didn’t know exactly what we would want to expose from GDB to Python, and how we wanted to do that. As we progressed and discussed the results, things started to become clear. The problem is that now we have a lot to clean up, voids to fill, and above all documentation and testcases to write. This takes time.
At least, that was the problem with the first two patches. I noticed Tom Tromey started to write more documentation and tie more loose ends than in the beginning (me? I’ve just been working on the first two patches until they were ready. Didn’t write sexy new stuff since then…), so there’s hope that the next patches will be easier to work with. We still lack a lot of tests for the testsuite, though…
