COMP:WARS: RE: Software/Hardware Architectures

Eugene Leitl (eugene.leitl@lrz.uni-muenchen.de)
Thu, 15 Jul 1999 01:33:53 -0700 (PDT)

Billy Brown writes:
> I don't think we were really getting anywhere with the previous line of
> responses, so I decided to try it again from the beginning. Here goes:

Yep, this has also been my impression. We really do seem to have irreconcilably different ways of seeing things. Not that diversity is a bad thing, but it sure turns tedious having to reiterate things/sugarcoat them in different verbiage.

> Regarding Current Software
> Current PC software is written for hardware that is actually in use, not
> hypothetical designs that might or might not ever be built. This is
> perfectly logical, and I don't think it makes sense to blame anyone for it.

It might be logical, yet there are plenty of reasons to blame the widespread "investment protection" attitude for it. Investment protection is great for local optimization, but it is deleterious even over the medium run. And it is really, really disastrous over the long run.

To name a few notorious perpetrators: Big Blue, Intel and Microsoft have each been reserved a special circle in Dante's Hell. (At least in my private version of it.)

> If a better architecture becomes available, we can expect ordinary market
> forces to lead them to support it in short order (look at Microsoft's

Alas, free markets seem to fail here miserably. Technical excellence has very little to do with market penetration. Deja vu, deja vu, deja vu.

> efforts with regard to the only-marginally-superior Alpha chip, for
> example).

There is really no fundamental difference between the x86 family, PowerPC, the diverse MIPSen, or the Alpha. They all suck.

> The more sophisticated vendors (and like it or not, that included Microsoft)

The trouble with Microsoft is that our point of reference is a marketplace flattened by more than a decade of its influence. To be fair we would have to evaluate multiple alternate branches of reality-as-it-could-have-been, which necessarily makes for extremely subjective judgements. Your mileage WILL vary.

As to Microsoft, I guess all the intelligence cream they've been skimming off academia/industry for years & all that R&D spending will eventually lead somewhere. Right now, what I see doesn't strike me as especially innovative or even high-quality, no Sir. Particularly regarding the ROI on all those research gigabucks pouring in. Administrative hydrocephalus begets administrative hydrocephalus.

> have been writing 100% object-oriented, multithreaded code for several years
> now. They use asynchronous communication anywhere there is a chance that it

I hear you. It is still difficult to believe.

> might be useful, and they take full advantage of what little multiprocessor
> hardware is actually available. There is also a trend currently underway

Well, essentially all we've got is SMP, which is built on the shared-memory paradigm, and that is a dead end. Shared memory is at least as unphysical as caches, in fact more so.

> towards designing applications to run distributed across multiple machines
> on a network, and this seems likely to become the standard approach for
> high-performance software in the near future.

I know clustering is going to be big, and is eventually going to find its way into desktops. It's still a back-assed way of doing things; maybe smart RAM will have its say yet. If only the PlayStation 2 were already available, oh well. The marketplace will sure look different a year downstream. It's difficult to do any planning when things are so in flux.

> Regarding Fine-Grained Parallelism
> Parallel processing is not a new idea. The supercomputer industry has been

Heck, LISP is the second-oldest HLL known to man, and Alonzo Church invented the lambda calculus in the 1930's. ENIAC was a RISC machine. Unix is a 1970's OS. GUIs/mice/Ethernet are 1970's technologies. Von Neumann invented cellular automata in the 1950's.

What does age have to do with how good an idea is? If anything, a brand-new, untried idea is something to be wary of.

> doing it for some time now, and they've done plenty of experimenting with
> different kinds of architectures. They have apparently decided that it

Nope, sorry, don't think so. Most of the IT landscape is shaped by vogues, and right now parallelism is getting fashionable (what a damnable word) again. (While ALife is heading into oblivion, which is IMO a damn shame.)

> makes more sense to link 1,000 big, fast CPUs with large memory caches than
> 100,000 small, cheap CPUs with tiny independant memory blocks. That fits

Heck, cache hierarchies are unphysical. Fat CPUs are incompatible with non-negligible amounts of fast on-die SRAM _and_ good die yield. Wafer-scale integration is impossible without good die yield and failure tolerance. kBit-wide buses are impossible without embedded DRAM technology, due to packaging constraints. Embedded DRAM technology is barely a year old. VLIW is at the threshold of going mainstream, and VLIW only makes good sense with kBit-wide buses/on-die memory. High code density is impossible without threaded code, and threaded code requires stack-CPU hardware support. Common HLLs don't support threaded code/stack CPUs. No language supports fine-grained maspar systems. Blahblahblah. I could go on for a long time, but (I hope) my point is nauseatingly clear: it's a big ball of yarn buried in yonder tarpit, and it requires a whole lot of muscle to haul it out. You have to do it in one piece, because everything is coherent/contiguous/synergistic. A bit of tarry string wrestled from the pit won't excite anybody. Please go for the whole hog.
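
To make "threaded code" less abstract, here is a crude C sketch (my own toy, not any real Forth kernel; real inner interpreters use NEXT-style dispatch): the "compiled" program is just an array of addresses of primitives working on one shared data stack, one cell per operation, which is where the code density comes from and why a stack CPU executes it so cheaply.

#include <stdio.h>

/* Toy threaded code: a program is an array of pointers to primitives that
 * share one data stack.  No opcode decoding, no operand fields.  The
 * primitives below are made up for illustration. */

static long stack[64];
static int  sp = 0;                       /* next free stack slot */

static void push(long v) { stack[sp++] = v; }
static long pop_(void)   { return stack[--sp]; }

typedef void (*prim)(void);

static void lit3(void)  { push(3); }              /* push the literal 3 */
static void lit4(void)  { push(4); }              /* push the literal 4 */
static void add(void)   { push(pop_() + pop_()); }
static void print(void) { printf("%ld\n", pop_()); }

int main(void)
{
    prim program[] = { lit3, lit4, add, print };  /* the threaded "code"   */
    for (unsigned i = 0; i < sizeof program / sizeof program[0]; i++)
        program[i]();                             /* the inner interpreter */
    return 0;
}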

IT is just another acronym for Inertia Technology. We're caught in a local minimum, but that doesn't mean there is no lower one. A number of decisions made in the past landed us in this particular minimum. It could have been a different one.

Sorry if this sounds like just another technoshaman mantra, but that's just how things are.

> perfectly with what I know about parallel computing - the more nodes you
> have the higher your overhead tends to be, and tiny nodes can easily end up
> spending 100% of their resources on system overhead.

There are codes where Amdahl's law is going to bite you. There are a lot where overhead is not a problem. My particular problem (physical simulation on a 3D lattice) is of the latter kind.
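
For the record, the bite is easy to quantify. A back-of-the-envelope sketch, with made-up parallel fractions, of what Amdahl's law does to you on 1000 nodes:

#include <stdio.h>

/* Amdahl's law: speedup on n nodes when a fraction p of the runtime
 * parallelizes perfectly.  The fractions below are made up, purely to show
 * how brutally the serial remainder caps the speedup. */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    printf("p = 0.95,  n = 1000: %6.1fx\n", amdahl(0.95, 1000));   /* ~20x  */
    printf("p = 0.999, n = 1000: %6.1fx\n", amdahl(0.999, 1000));  /* ~500x */
    return 0;
}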

> Now, if someone has found a new technique that changes the picture, great.
> But if this is something you've thought up yourself, I suggest you do some
> more research (or at least propose a more complete design). When one of the

Heck, I did it years ago. Somebody even used the writeup in a CPU design class. As I don't have a fab and several hundred M$ to burn (and, incidentally, more important research to do), I can hardly be expected to assault the buttress alone, can I? You'll wind up in the moat, all dirty & bloody, and have to listen to stupid French jokes and be used for carcass target practice in the bargain.

> most competitive (and technically proficient) industries on the planet has
> already tried something and discarded it as unworkable, its going to take
> more than arm-waving to convince me that they are wrong.

Right, the whole area of supercomputing is going to vanish into a cloud of logic overnight -- because, as everybody knows, they are all monoprocessors. Beowulf is just a passing fad -- pray, pay no attention to the exponential growth of 'wulfers. Photolithographic semiconductor CPUs will scale into the 10, 100, 1000 GHz regime trivially. Einstein was dead wrong, and you can signal faster than the speed of light in vacuum. It makes actual sense to implement a Merced in buckytube logic. The people who proved that reversible cellular automata in molecular logic are the most efficient way of doing computation were just dweebs. Right.

> Regarding the Applicability of Parallelism
> The processes on a normal computer span a vast continuum between the
> completely serial and the massively parallel, but most of them cluster near
> the serial end of the spectrum. Yes, you have a few hundred process in

Says who.

> memory on your computer at any given time, but only a few of them are
> actually doing anything. Once you've allocated two or three fast CPUs (or a

How would you know? I gave you a list of straightforward jobs my machine could be doing right now. Sounds all very parallel to me. Remember, there is a reason why I need to build a Beowulf.

> dozen or so slow ones) to the OS and any running applications, there isn't
> much left to do on a typical desktop machine. Even things that in theory

I guess I don't have a typical desktop machine, then. I could really use an ASCI Red here, or better yet one of those kCPU QCD DSP machines.

> should be parallel, like spell checking, don't actually get much benifit
> from multiple processors (after all, the user only responds to one dialog
> box at a time).

Spell checking? I never do spell checking. I do have the C. elegans genome sitting on my hard drive here, though, which I'd love to do some statistical analysis on. Guess what? Another embarrassingly parallel app.
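
To show what I mean, a toy MPI sketch (the placeholder string and all names are mine): every node counts base frequencies in its own slice of the sequence, and a single reduction merges the tallies at the end. Zero inter-node traffic while the real work is being done.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Stand-in for the real genome; in practice each node would read its
     * own chunk of the file. */
    const char *genome = "ACGTACGTTTAGGCCAACGT";
    long n  = (long)strlen(genome);
    long lo = rank * n / nprocs;
    long hi = (rank + 1) * n / nprocs;

    long local[4] = {0, 0, 0, 0}, total[4];       /* A, C, G, T counts */
    for (long i = lo; i < hi; i++) {
        switch (genome[i]) {
        case 'A': local[0]++; break;
        case 'C': local[1]++; break;
        case 'G': local[2]++; break;
        case 'T': local[3]++; break;
        }
    }
    MPI_Reduce(local, total, 4, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("A=%ld C=%ld G=%ld T=%ld\n",
               total[0], total[1], total[2], total[3]);
    MPI_Finalize();
    return 0;
}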

> On servers there is more going on, and thus more opportunity for
> parallelism. However, the performance bottleneck is usuall in the network

You know what? We're going to move to xDSL pretty quickly. And we're going to need a database-backed web site, both for the intranet and for the outside. No never fork no more...

> or disk access, not CPU time. You can solve these problems by introducing
> more parallelism into the system, but ultimately it isn't cost-effective.
> For 99% of the applications out there, it makes more sense to buy 5
> standardized boxes for <$5,000 each than one $100,000 mega-server (and you
> get better performance, too).

Well, I guess I must be pretty special, because the $100,000 mega-server makes no sense at all when you want to do multi-million-particle MD. Lots of cheap PCs with full-duplex Fast Ethernet, very much yes. And there is no bottleneck, since beyond a certain minimal system size per node the thing scales O(N), and I mean _strictly_ O(N).
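
The O(N) is nothing exotic, just the usual linked-cell trick (my own naming below, positions assumed to lie in [0, box)): bin particles into cells no smaller than the interaction cutoff, so each particle only ever examines the 27 surrounding cells, and each node only exchanges a thin halo of boundary cells with its neighbours. Work and traffic both grow linearly with the particle count.

/* Sketch of the binning step behind O(N) short-range MD.  The force loop
 * (not shown) then walks only the 27 cells around each particle instead of
 * all N-1 partners. */

typedef struct { double x, y, z; } vec3;

#define NCELL 16                                 /* cells per box edge */
#define CELL(i, j, k) (((i) * NCELL + (j)) * NCELL + (k))

/* head[c] is the first particle in cell c, next[p] chains the rest;
 * both arrays are supplied by the caller. */
void build_cell_list(const vec3 *pos, int n, double box,
                     int *head, int *next)
{
    for (int c = 0; c < NCELL * NCELL * NCELL; c++)
        head[c] = -1;
    for (int p = 0; p < n; p++) {
        int i = (int)(pos[p].x / box * NCELL) % NCELL;
        int j = (int)(pos[p].y / box * NCELL) % NCELL;
        int k = (int)(pos[p].z / box * NCELL) % NCELL;
        int c = CELL(i, j, k);
        next[p] = head[c];                       /* push p onto its cell */
        head[c] = p;
    }
}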

> Of course, there are many processes that are highly amenable to being run in
> a parallel manner (video rendering, simulation of any kind, and lots of
> other things), but most of them are seldom actually done on PCs. The one

Well, I hate to keep repeating this, but it is not nearly as rare as you seem to think it is.

> example that has become commonplace (video rendering) is usually handled by
> a specialized board with 1 - 8 fast DSP chips run by custom driver-level
> software (once again, the vendors have decided that a few fast, expensive
> chips are more economical than a lot of slow, cheap ones).

Cheap != slow. A $30 DSP can outperform a $300 CPU because it doesn't have to put up with legacy bloat.

> Side Issues
> 1) Most parallel tasks require that a large fraction of the data in the
> system be shared among all of your CPUs. Thus, your system needs to provide

YMMV. Mine don't.

> for a lot of shared memory if it is going to be capable of tackling

Shared memory does not exist, at least not beyond 2-4 ports. If you attempt to simulate it, you have to pay dearly in logic and cache-coherence issues, which start to slow you down very quickly (the point of diminishing returns is just around the corner). You can simulate shared memory with message passing, though. If you really, really, really need it.
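
If you insist: a toy MPI sketch (entirely mine, not any real DSM package) in which rank 0 plays "the memory" and everyone else does its loads and stores by mailing it requests. The latency you pay per access is exactly the point.

#include <mpi.h>
#include <stdio.h>

enum { TAG_GET = 1, TAG_PUT = 2, TAG_REPLY = 3, TAG_STOP = 4 };
#define WORDS 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                        /* the "shared memory" process */
        long mem[WORDS] = {0};
        int done = 0;
        while (done < nprocs - 1) {
            long req[2];                    /* req[0] = address, req[1] = value */
            MPI_Status st;
            MPI_Recv(req, 2, MPI_LONG, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_PUT)
                mem[req[0]] = req[1];       /* remote "store" */
            else if (st.MPI_TAG == TAG_GET) /* remote "load"  */
                MPI_Send(&mem[req[0]], 1, MPI_LONG, st.MPI_SOURCE,
                         TAG_REPLY, MPI_COMM_WORLD);
            else                            /* TAG_STOP       */
                done++;
        }
    } else {                                /* a worker doing loads/stores */
        long put[2] = { rank, 42 + rank };
        long get[2] = { rank, 0 };
        long val;
        MPI_Send(put, 2, MPI_LONG, 0, TAG_PUT, MPI_COMM_WORLD);
        MPI_Send(get, 2, MPI_LONG, 0, TAG_GET, MPI_COMM_WORLD);
        MPI_Recv(&val, 1, MPI_LONG, 0, TAG_REPLY, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank %d read back %ld\n", rank, val);
        MPI_Send(get, 2, MPI_LONG, 0, TAG_STOP, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}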

> molecular CAD, atmospheric simulations, neural networks, etc. That brings

<laughter>. Molecular CAD, weather codes and neural codes either already are embarrassingly parallel or are patently formulable that way. Really. Look at the code shelves.

> up all those issues of caching, inter-node communication and general
> overhead you were trying to avoid.

Caching doesn't exist (pray, pay no attention to the clever fata morgana). Inter-node communication is readily addressable (= solved) by a failure-tolerant routing protocol with the switch fabric built into the CPU. The next Alpha is going to have multi-10-GByte/s inter-CPU signalling with 15 ns signalling latency. A 3D lattice topology (6 links/CPU) is really sufficient for most codes I care about.

See SGI/Cray, DSP clusters and Myrinet Beowulfs for illustration.
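
And the 3D-lattice view maps straight onto MPI's Cartesian communicators. A minimal sketch (assuming the rank count factors into a sensible grid): each rank ends up with exactly six neighbours, which is all a nearest-neighbour lattice code ever talks to.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int dims[3] = {0, 0, 0};
    MPI_Dims_create(nprocs, 3, dims);       /* factor nprocs into a 3D grid */
    int periods[3] = {1, 1, 1};             /* wrap around: a torus         */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    int rank, minus, plus;
    MPI_Comm_rank(cart, &rank);
    for (int axis = 0; axis < 3; axis++) {
        /* The two lattice neighbours along this axis; halo exchange only
         * ever involves these six ranks. */
        MPI_Cart_shift(cart, axis, 1, &minus, &plus);
        printf("rank %d, axis %d: neighbours %d and %d\n",
               rank, axis, minus, plus);
    }
    MPI_Finalize();
    return 0;
}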

> 2) You also can't get away from context switching. Any reasonably complex
> task is going to have to be broken down into procedures, and each processor
> will have to call a whole series of them in order to get any usefull work
> done. This isn't just an artifact of the way we currently write software,

Untrue. You almost never have to switch context if you have 1 kCPU to burn. You only have to do this if you run out of the allocatable CPU heap (when the number of your objects exceeds the number of your CPUs).

> either. It is an inevitable result of the fact that any interesting
> computation requires a long series of distinct operations, each of which may
> require very different code and/or data from the others.

Strangely, my needs are very different.

> Billy Brown, MCSE+I
> ewbrownv@mindspring.com