Re: We need better tools to make better tools. <guaRANTeed>

Eugene Leitl (Eugene.Leitl@lrz.uni-muenchen.de)
Wed, 4 Dec 1996 20:29:47 +0100 (MET)


On Wed, 4 Dec 1996, James Rogers wrote:

> > [modern RISC, e.g. Alpha AXP suck, and CISC does things unmentionable]
>
> In the Alpha AXP, this is primarily a result of mediocre CPU design rather
> than something you can blame on the compilers. The Alpha has a very deep

A simple question: why must compilers be large, slow monsters, having
to know lots of things about your hardware setup? (That should be the
domain of the OS.) This wasn't the rule just a few years ago. Gcc is
terribly difficult to optimize for the Alpha (it is an internal
representation problem); Digital's dedicated compiler performs much better.
This isn't God-given: people build the machines, and people write the
compilers for them.

When was the last time you read your compiler in one hour, and understood
it, at least sufficiently to modify it constructively? (Hint: read a
modern optimizing Forth compiler, which is just a few pages of
printout; one should definitely recite 'em at poetry readings. It's
a goddamn work of art.) When did the ole edit-compile-crash iteration
last take imperceptible milliseconds, not minutes? When could you last do
dynamic surgery on a live system, with true incremental compiling?
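
To make the size claim concrete, here is a toy sketch in C of the
Forth-style inner interpreter such compilers are built around: threaded
code, a data stack, a handful of primitives. All names are invented for
illustration (and the pointer casts are not strictly portable C); the
point is how little machinery execution actually needs.

    #include <stdio.h>
    #include <stdint.h>

    typedef void (*prim)(void);

    static intptr_t stack[64];
    static int sp;                /* data stack pointer  */
    static void **ip;             /* instruction pointer */

    static void push(intptr_t v) { stack[sp++] = v; }
    static intptr_t pop(void)    { return stack[--sp]; }

    static void lit(void) { push((intptr_t)*ip++); } /* next cell = literal */
    static void add(void) { intptr_t b = pop(); push(pop() + b); }
    static void dot(void) { printf("%ld\n", (long)pop()); }
    static void bye(void) { ip = NULL; }

    int main(void) {
        /* threaded code for the Forth phrase:  2 3 + .  */
        static void *program[] = { (void*)lit, (void*)2, (void*)lit,
                                   (void*)3, (void*)add, (void*)dot,
                                   (void*)bye };
        for (ip = program; ip; )      /* the entire inner interpreter */
            ((prim)*ip++)();
        return 0;
    }

A real Forth adds a dictionary and an outer interpreter on top, but the
dispatch loop above is essentially all there is to execution; that is
why the whole system stays readable in an afternoon.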

We had to wait since the '60s for gc to become industrial mainstream; nay,
not even mainstream: it is still considered top-notch newfangledness.
This should tell us something about current tech's potential for
optimization. Remember the UCSD p-System, Wirth's Oberon project, Lisp
machines, Taos, Forth chips. Now we see pieces of them cropping up in
mainstream products.

> pipeline, but almost no look-ahead or intelligent pre-fetch/decode logic.

What if your CPU core is so simple that you cannot afford a hardware
multiplier, let alone things like an FPU, pipeline, BPU, &c&c?
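
On such a core, multiplication becomes a job for the runtime. A minimal
sketch of the classic shift-and-add loop in C (the function name is
invented; any runtime library would supply its own):

    #include <stdint.h>

    /* Software multiply for a core without a hardware multiplier:
     * one conditional add and two shifts per bit of the multiplier. */
    uint64_t soft_mul(uint64_t a, uint64_t b)
    {
        uint64_t acc = 0;
        while (b) {
            if (b & 1)   /* low bit set: add the shifted multiplicand */
                acc += a;
            a <<= 1;     /* next bit weighs twice as much */
            b >>= 1;
        }
        return acc;      /* product mod 2^64 */
    }

Slow, yes, but it costs zero transistors, and that is the whole point.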

Consider a WSI system: a 64-bit primitive CPU, a few kBytes of SRAM to
contain the VM, 128 kBytes of DRAM, a router with, say, 12 links. How much
complexity can we afford before die yield drops to zero? Surely not much
beyond 1 MTransistors. (Those working in a silicon foundry in the Valley,
please step forth and cite chapter & verse.)

There is simply no way to put that much complexity into a die as tiny as
that. Moreover, it would be an entirely wrong direction: it would keep
the CPU architecture from converging towards a CAM (it is easy to see
the convergence in the extreme cases of both), a nonalgorithmic machine
making no distinction between code and data, everything being just a
hyperactive bit soup.

To approach it from a different direction: consider a 1 MBit DRAM. If we
ignore all I/O, we can surely organize it as 1 kWords of 1 kBit each.
So we need a 1 kBit SRAM register to refresh it occasionally, and an
incrementing PC to step through, as fast as the core lets us. We can add
shift capability to the register, we can take two of them, add an adder,
basic logic, a segmented ALU design (selectable comb tooth width), a
variety of primitive logic engines to protect parts of the word, to
shuffle things back & forth, etc. Anyway, starting from this point one
does not have too many choices about how the final beast will look. I
have seen the MediaProcessor people independently reinventing a lot of
my machinery I thought was cool but ad hoc. (My design is still better,
though ;)
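
For the curious, the "comb tooth" idea can be sketched in portable C:
one wide add whose carry chain is blocked at selectable segment
boundaries. In hardware this is just a gate in the carry chain; the
code below (names invented) is only an illustration.

    #include <stdint.h>

    /* Add a and b as independent lanes of `width` bits each (width
     * must divide 64).  Clear every lane's top bit, add (a carry can
     * then never cross a lane boundary), and patch the top bits back
     * in with XOR (sum bit = a ^ b ^ carry_in). */
    uint64_t seg_add(uint64_t a, uint64_t b, int width)
    {
        uint64_t top = 0, sum;
        int i;

        for (i = width - 1; i < 64; i += width)
            top |= (uint64_t)1 << i;      /* mask of each lane's MSB */

        sum = (a & ~top) + (b & ~top);    /* carries stay in-lane    */
        return sum ^ (a & top) ^ (b & top);
    }

With width = 64 it is an ordinary add; with width = 8 the same datapath
does eight byte-adds at once, which is exactly the word-wide parallelism
the multimedia and DSP load mentioned below is hungry for.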

> The result is that the pipeline is flushed pretty often, which is only
> worsened by its depth. Also, the high clock rates make pipeline stalls

So let's drop the pipeline, and move the core onto the die, as much of it
as die yield can take before wafer yield drops to zero.

> (due to things like cache misses) more serious than they would be in slower

So let's abolish the cache, giving core DRAM almost cache quality,
mapping a tiny SRAM islet into the address space, while eliminating
worst-case scenarios (locality is nonexistent in certain problems; in fact
it can be conjectured that access locality is a pathological case, an
artefact of the human programmer).
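
A minimal sketch of what that flat, cacheless map looks like to
software, with all addresses and sizes invented for illustration:

    #include <stdint.h>

    #define SRAM_SIZE 0x1000u       /* 4 kB fast on-die islet    */
    #define DRAM_BASE SRAM_SIZE     /* 128 kB DRAM right behind  */
    #define DRAM_SIZE 0x20000u

    static uint8_t sram[SRAM_SIZE];
    static uint8_t dram[DRAM_SIZE];

    /* One flat load path: no tags, no miss handling, no worst case,
     * just an address compare.  What lives in the fast islet is
     * decided by the program (or the OS), not by a cache controller. */
    uint8_t load(uint32_t addr)
    {
        if (addr < DRAM_BASE)
            return sram[addr];
        return dram[addr - DRAM_BASE];
    }

Deterministic access times, no pathological miss patterns: explicit
placement replaces the cache heuristic.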

> clocked chips. Add on top of this an inefficient superscalar

Astronomic clocks aren't everything, they just increase power dissipation
(dynamic power goes roughly as C*V^2*f, and since V must scale up with f,
it grows superlinearly in clock) and are of no use if the core cannot
deliver the requested words. Clock rates beyond 2-4 GHz aren't viable
with Si tech anyway (physics starts harrumphing in the background), and
GaAs is a much weaker substrate when it comes to integration density,
which is the key.

I say take a 40 kTransistor, highly tweaked CPU (Chuck does this), clock
it as high as it can take without melting, considering the dense packing
of a WSI system (about 1 GHz with current high-end structures),
flush your code into the tiny SRAM by hand, put it and a 128 k DRAM on
a die, then tile a wafer with 100s of such dies, letting them all work in
parallel. What are the wafer costs, about $500? Even if it sells for 1 k$,
it is still acceptable.

Who is going to program this? The multimedia craze will greedily gulp
down every imaginable DSP resource you can throw at it, as will GA and
ANN stuff.

> implementation, and you have a chip that virtually *requires* handcrafted
> assembly language to run efficiently.

I think we might see a (brief) revival of handcrafted assembly, at least
in the nanoOS kernel methods. Apps will stop being monolithic, instead
becoming live things, multiplying over nodes or shrinking back
dynamically, goaded by the OS load leveler.

> One of the reasons Intel's late generation chips have done so well,
> performance-wise, is that Intel probably has one of the best pre-execution
> logic designs and techniques on any CPU, RISC or otherwise. Add to this

Intel has grown sufficiently rich to afford the best people, as has M$. I
once laughed at them, ignored them. I cannot anymore. They have
demolished the markets, and upon the ruined marketplace (all thanks, ye
smart customers) even their braindead designs look good, since there is
no competition.

Once you are big enough, investments in the best manpower, new process
research, long-term planning, industrial espionage and open market
warfare can only make you bigger. It is profit maximization that has
brought us peu-a-peu (little-by-little) device introduction, instead of
drastic new designs at longer intervals.

> that Intel easily has one of the most efficient superscalar implementations,

I don't think the vector/scalar division makes sense at all. Precambrian
life forms, designed to become extinct.

> and you get some real performance out of an old architecture. One of the
> few RISC companies that I think has a really solid architecture concept is HP.

Alas, HP RISC is dying, as is Alpha. Speaking of monopolies...

> > [rant]
> >Corollaries:
> >
> >1) CPUs will become _very_ simple, see MISC. (I doubt InTeL will pioneer
> > this, though they sure will join the party, once it has started. I
> > don't know what M$ might or might not do...)
>
> Actually, you will probably see arrays of tiny cores on chips glued together
> with complex decode logic, or in the VLIW case, have the compiler do most of

Agreed, but it won't be complex logic, and the arrayed cores will
stretch over the entire wafer ;)

> the decode for you.

One DSP design from TI features 4 DSP cores and 4 on-die SRAM areas,
connected by an on-die crossbar and choreographed by a vanilla CPU. TI is
pretty innovative in terms of DSP/embedded hybrids, putting core memory
on the die and making DSPs cascadable (up to 6 fast links).

> > [ grow your own ]
> >
> > [ CLI=GUI already ]
>
> I'll take it one step further: An OS you can't customize is useless.
>

Have YOU rebuilt your kernel tonight? ;)

> -James Rogers
> jamesr@best.com

ciao,
'gene