Re: Blue Gene

Eugene Leitl (eugene.leitl@lrz.uni-muenchen.de)
Tue, 7 Dec 1999 11:55:09 -0800 (PST)

Robert J. Bradbury writes:

> So would I. I think they would publish it, you can't effectively use a
> machine unless you can work with it at multiple levels. I've rarely
> seen a compiler that I can't out-code. The question is whether they

Deeply pipelined VLIW stuff is a nightmare to hand-optimize. The TI 'C6x and Itanium come to mind here.

> will publish it before the machine becomes available, for example do we
> even have the Merced instruction set (or the Playstation instruction set?).

http://developer.intel.com/design/ia64/index.htm

As to PSX2 Beowulf (project named Wolfstation) there is a site

http://wulfstation.org/
and a mailing list

http://www.onelist.com/community/beowulf-psx2 though so far it is mostly fluff (we're waiting for the dev kits).

The PSX2 CPU is mostly vanilla MIPS. If you know MIPS, you'll find yourself immediately at home.

> > But even if I'm right, the task of designing software to make full use
> > of the machine's capabilities may be so daunting that no one else will
> > want to take it on, effectively making it a single-purpose machine.
>
> Not really, if it is general purpose, there are already software
> models (e.g. the Oxford Bulk Synchronous Parallel (BSP) model,
> the OpenMP API for Shared Memory Programming, and the BIP message
> passing model (for Myrinet) for programming similar machines.
> The only difference between programming something for a Beowulf
> cluster and Blue Gene is the granularity of the processor units.

Hopefully, IBM will integrate the switch/router into the CPU and implement direct hardware support for most MPI calls (i.e. the most basic MPI calls would compile down to one or a few machine instructions).

Otherwise, latencies will be prohibitive. The best we can do with cheap off-the-shelf code/hardware is GAMMA (with an MPI wrapper) and M-VIA:

http://www.disi.unige.it/project/gamma/
http://www.nersc.gov/research/FTG/via/

The latencies here are mostly due to software overhead. I don't see why one shouldn't be able to bring message passing latency down to a few tens of nanoseconds with proper hardware support.
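To see where the microseconds go, here is a back-of-envelope latency budget. All numbers are illustrative assumptions, not measurements; the point is just that the software path of a conventional protocol stack dwarfs both serialization and wire time for small messages:

```python
# Rough latency model: total = software overhead + serialization + wire delay.
# (All figures are hypothetical, chosen only to show the relative magnitudes.)

def latency_us(sw_overhead_us, msg_bytes, link_mbytes_s, wire_delay_us):
    serialization = msg_bytes / link_mbytes_s   # MB/s is the same as bytes/us
    return sw_overhead_us + serialization + wire_delay_us

# A conventional TCP/IP-over-Ethernet path for a 64-byte message:
tcp = latency_us(sw_overhead_us=50.0, msg_bytes=64,
                 link_mbytes_s=100.0, wire_delay_us=1.0)

# A hypothetical on-die message engine (MPI send as a machine instruction):
hw = latency_us(sw_overhead_us=0.01, msg_bytes=64,
                link_mbytes_s=1000.0, wire_delay_us=0.05)
```

With these assumed figures the software overhead accounts for nearly all of the ~50 us, and eliminating it in hardware brings the total down by more than two orders of magnitude.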

As to the optimal type of code for MD, I strongly suspect integer lattice gases might be up to the challenge.

http://xxx.lanl.gov/archive/comp-gas

Progress is slow, but steady. A generic forcefield engine implemented as an integer lattice gas would be trivial to cast in hardware. And you certainly can't beat the speed.
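For a feel of why this maps so well onto hardware, here is a minimal HPP-style lattice gas sketch (my own toy illustration, not any published engine): each cell is a 4-bit integer, and the whole update rule is bitwise operations plus shifts, exactly the kind of thing that becomes a handful of gates per cell.

```python
# Minimal HPP lattice gas on a periodic grid. Each cell holds 4 bits,
# one particle per direction: E=1, N=2, W=4, S=8. Everything is integer
# and bitwise, which is what makes hardware implementation trivial.

W, H = 8, 8

def step(grid):
    # Collision: exactly head-on pairs (E+W alone, or N+S alone) rotate 90 deg.
    collided = [[0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            c = grid[y][x]
            if c == 0b0101:      # E+W  -> N+S
                c = 0b1010
            elif c == 0b1010:    # N+S  -> E+W
                c = 0b0101
            collided[y][x] = c
    # Streaming: each particle moves one cell in its direction (periodic).
    new = [[0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            c = collided[y][x]
            if c & 1: new[y][(x + 1) % W] |= 1       # E
            if c & 2: new[(y - 1) % H][x] |= 2       # N
            if c & 4: new[y][(x - 1) % W] |= 4       # W
            if c & 8: new[(y + 1) % H][x] |= 8       # S
    return new

def particles(grid):
    # Particle count is exactly conserved by both collision and streaming.
    return sum(bin(c).count("1") for row in grid for c in row)
```

A real forcefield engine would need far richer state per cell, but the structure stays the same: purely local integer rules, applied everywhere in parallel.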

> What IBM probably did was ask themselves what the failure rate
> was going to be in the processor units. With 1M processors
> it might be quite high. Customers aren't going to be happy
> if your machine is down most of the time getting boards replaced.

I surmise one can keep a fresh CPU pool and checkpoint periodically. When a CPU fails, you fall back to the last snapshot state, thus losing only minutes of computation. Of course this means lots of I/O activity every few minutes, so each CPU should have its own disk. On the other hand, Big Blue is probably up to the challenge of building a monolithic monster RAID.
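The checkpoint-and-rollback scheme above can be sketched in a few lines. This is purely illustrative: a single dict stands in for one node's memory image, and another dict plays the role of its local disk:

```python
# Checkpoint/rollback sketch: snapshot every ckpt_every steps; on failure,
# restore the last snapshot and recompute only the work done since then.

import copy

def run(total_steps, ckpt_every, fail_at=None):
    state = {"step": 0, "acc": 0}
    disk = copy.deepcopy(state)          # last snapshot, on the node's disk
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            fail_at = None               # "swap in a fresh CPU", then...
            state = copy.deepcopy(disk)  # ...fall back to the last snapshot
            continue
        state["acc"] += state["step"]    # stand-in for real computation
        state["step"] += 1
        if state["step"] % ckpt_every == 0:
            disk = copy.deepcopy(state)  # periodic checkpoint (the I/O burst)
    return state
```

A failure mid-interval costs only the steps since the last checkpoint; the final result is identical to an uninterrupted run, which is the whole point of the fresh-CPU pool.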

> This is now solved in multiprocessor & clustered architectures
> where you can afford to take out a node for a few minutes to
> hours to replace parts. However if you are running integrated
> calculations (i.e. this isn't a client-server architecture) that
> take days to weeks and the data in one node interacts with *all*
> of the other data, then when you pull a node you slow down the
> entire calculation. The clever trick is going to be detecting
> the failures (you don't want soft failures, you want hard failures)
> and having the data arranged so that multiple processors/nodes can
> rapidly get to it.

The good part about MD is that it is mostly local-interaction (see particle-in-cell for a very good illustration), and long-range interactions (mostly Coulomb) can be simulated by propagating the information through the node lattice, bucket-brigade style.
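The bucket-brigade idea can be simulated directly. In this toy sketch (my own illustration, nothing to do with IBM's actual design), four "nodes" each own a slice of a 1-D field and exchange only their boundary cells each step, yet a disturbance injected at one end still crosses the whole lattice, one neighbor hop per step:

```python
# Halo-exchange / bucket-brigade sketch: NODES simulated nodes, each owning
# LOCAL cells. Per step, a node sees only its own slice plus one boundary
# cell from each neighbor -- all communication is nearest-neighbor.

NODES, LOCAL = 4, 5

def step(domains):
    new = []
    for r, d in enumerate(domains):
        left = domains[r - 1][-1] if r > 0 else d[0]
        right = domains[r + 1][0] if r < NODES - 1 else d[-1]
        padded = [left] + d + [right]
        # 3-point averaging stencil, a stand-in for a local force kernel
        new.append([(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
                    for i in range(1, LOCAL + 1)])
    return new

domains = [[0.0] * LOCAL for _ in range(NODES)]
domains[0][0] = 1.0                      # disturbance at the far left edge
for _ in range(NODES * LOCAL):           # enough hops to cross the lattice
    domains = step(domains)
```

Information travels one cell per step, so after NODES*LOCAL steps the disturbance has reached the far node. Long-range Coulomb terms would ride the same brigade instead of requiring all-to-all traffic.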

> This is a new level in computer architecture and getting very close
> to what goes on in the brain. If they get the architecture right
> and the fault tolerance right and because they have solved the
> bandwidth problem, you can expect a simple instruction set to
> gradually expand as people come up with other applications
> and declining feature sizes give you more chip real-estate to
> work with.

For a glimpse of what is possible with PIM-type devices (caution, self-plug again), have a look at (old, obsolete, half-baked, etc.):

http://www.lrz-muenchen.de/~ui22204/.html/txt/8uliw.txt

> > And this is likely the only one they will build, like Deep Blue.
>
> IBM is one of the most clever marketing organizations in the world.
> Unlike Deep Blue, they aren't doing this for publicity. (After all
> how many machines are you going to sell when you know you are going
> to lose the game...) They realize the market for these machines is
> in the dozens (major pharma & govmnts), thousands (universities &
> small-biotech), and potentially workstation quantities (individual
> researchers). I'll predict with this one they are planning to do
> the software investment and then use that to follow the declining
> hardware costs to make the machines available to larger markets.

If we're talking about 5 years before they deliver, there is a fair chance Beowulfs will be able to meet the challenge. It's hard to argue with economies of scale.

> > P.S. I apologize for my sloppy editing on my original post
> > (which was truly my first post to this list).
>
> No problem. The information was quite helpful and appreciated.
>
> Robert