Re: Blue Gene

Robert J. Bradbury (bradbury@www.aeiveos.com)
Tue, 7 Dec 1999 09:25:36 -0800 (PST)

On Tue, 7 Dec 1999, Mike Hall wrote:

>
> Maybe, but I've seen nothing in the published material that says this is
> anything other than a general-purpose machine. But again, the facts in
> these pieces are somewhat meager. I'd like to get a peek at the
> instruction set if they ever deign to publish it.

So would I. I think they will publish it, since you can't use a machine effectively unless you can work with it at multiple levels. I've rarely seen a compiler that I can't out-code. The question is whether they will publish it before the machine becomes available. For example, do we even have the Merced instruction set (or the Playstation instruction set) yet?

>
> But even if I'm right, the task of designing software to make full use
> of the machine's capabilities may be so daunting that no one else will
> want to take it on, effectively making it a single-purpose machine.

Not really. If it is general purpose, there are already software models for programming similar machines, e.g. the Oxford Bulk Synchronous Parallel (BSP) model, the OpenMP API for shared-memory programming, and the BIP message-passing model (for Myrinet). The only difference between programming for a Beowulf cluster and programming for Blue Gene is the granularity of the processor units.
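
To make the granularity point concrete, here is a toy, single-process simulation of a BSP-style computation in Python. It is only a sketch under stated assumptions: the function name bsp_sum and its parameters are hypothetical, the "processors" are just list slots, and nothing here reflects how Blue Gene or any real BSP library is actually programmed. It shows the usual superstep shape (local compute, communicate, barrier) and that the same program runs unchanged whether it is split across 4 units or 1000.

```python
# Toy simulation of a Bulk Synchronous Parallel (BSP) computation.
# Hypothetical names throughout; this is not a real BSP library API.

def bsp_sum(values, n_procs):
    """Sum `values` in two supersteps on `n_procs` simulated processors."""
    values = list(values)

    # Data distribution: processor p owns every n_procs-th element from p.
    local_data = [values[p::n_procs] for p in range(n_procs)]

    # Superstep 1: each processor computes on its own data only, then
    # "sends" its partial result to processor 0 by appending to its inbox.
    inbox = [[] for _ in range(n_procs)]
    for p in range(n_procs):
        partial = sum(local_data[p])   # local computation
        inbox[0].append(partial)       # communication
    # Barrier: in this sequential simulation the end of the loop plays the
    # role of the synchronization point; no message is read before here.

    # Superstep 2: processor 0 combines the messages it received.
    return sum(inbox[0])


if __name__ == "__main__":
    data = range(1, 101)
    # Same program, coarse-grained versus fine-grained "machine".
    print(bsp_sum(data, 4))
    print(bsp_sum(data, 1000))
```

The only thing that changes between the two calls is n_procs, which is the sense in which a cluster and a much finer-grained machine can share a programming model.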

What IBM probably did was ask themselves what the failure rate was going to be in the processor units. With 1M processors it might be quite high, and customers aren't going to be happy if the machine is down most of the time getting boards replaced. This is already solved in multiprocessor and clustered architectures, where you can afford to take out a node for a few minutes to hours to replace parts. However, if you are running integrated calculations (i.e. this isn't a client-server architecture) that take days to weeks, and the data in one node interacts with *all* of the other data, then when you pull a node you slow down the entire calculation. The clever trick is going to be detecting the failures (you don't want soft failures, you want hard failures) and having the data arranged so that multiple processors/nodes can rapidly get to it.
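
A rough back-of-the-envelope sketch of why the failure rate matters at this scale, in Python. It assumes independent failures at a constant rate per processor, so the mean time between failures (MTBF) of the whole machine is the per-node MTBF divided by the number of nodes. The per-node MTBF used below (one million hours) is a made-up illustrative figure, not anything IBM has published, and the function names are hypothetical.

```python
import math

def system_mtbf_hours(node_mtbf_hours, n_nodes):
    """MTBF of the whole machine, assuming independent, constant-rate failures."""
    return node_mtbf_hours / n_nodes

def prob_no_failure(node_mtbf_hours, n_nodes, run_hours):
    """Probability that no node fails during a run of `run_hours`."""
    return math.exp(-n_nodes * run_hours / node_mtbf_hours)


if __name__ == "__main__":
    NODE_MTBF = 1_000_000.0   # hours per node; ASSUMED for illustration only
    N_NODES = 1_000_000       # the "1M processors" figure from the text
    WEEK = 7 * 24             # a week-long calculation, in hours

    print(system_mtbf_hours(NODE_MTBF, N_NODES))      # hours between failures
    print(prob_no_failure(NODE_MTBF, N_NODES, WEEK))  # chance of a clean week
```

Under those assumptions some node fails about once an hour, and the chance of a week-long run finishing with no failure at all is effectively zero, which is why the calculation has to survive failures rather than avoid them.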

This is a new level in computer architecture, and it is getting very close to what goes on in the brain. If they get the architecture and the fault tolerance right, and because they have solved the bandwidth problem, you can expect a simple instruction set to expand gradually as people come up with other applications and as declining feature sizes give you more chip real estate to work with.

> And this is likely the only one they will build, like Deep Blue.

IBM is one of the most clever marketing organizations in the world. Unlike Deep Blue, they aren't doing this for publicity. (After all, how many machines are you going to sell when you know you are going to lose the game?) They realize the market for these machines is in the dozens (major pharma and governments), thousands (universities and small biotech), and potentially workstation quantities (individual researchers). I'll predict that with this one they are planning to make the software investment and then use it to follow declining hardware costs, making the machines available to progressively larger markets.

>
> P.S. I apologize for my sloppy editing on my original post
> (which was truly my first post to this list).

No problem. The information was quite helpful and appreciated.

Robert