Re: We need better tools to make better tools. <guaRANTeed>

James Rogers (jamesr@best.com)
Wed, 04 Dec 1996 13:52:22 -0800


At 08:29 PM 12/4/96 +0100, you wrote:

>> In the Alpha AXP, this is primarily a result of mediocre CPU design rather
>> than something you can blame on the compilers. The Alpha has a very deep
>
>A simple question: why must the compilers be large, slow monsters, having
>to know lots of things about your hardware setup? (That should be the
>domain of the OS). This wasn't the rule just a few years ago. Gcc is
>terribly difficult to optimize for the Alpha (it is an internal
>representation problem); Digital's dedicated compiler performs much better.
>This isn't God-given, men build both machines and write compilers for them.

If you left all the hardware details to the OS, you wouldn't be able to do
much optimization. You would be trusting that the OS was optimized,
something I am not yet willing to do. Granted, compilers *are* slow and
large, but for some languages (like C++) and systems, the complexity of
compiling software can approach absurdity. I once tried to write a
compiler. Not an easy task if you are concerned with optimization.
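
To make the point concrete, here is the kind of hardware knowledge I mean,
as a toy C fragment (just a sketch I made up; the matrix size is
arbitrary). Whether the fast traversal order is rows or columns depends
entirely on cache line size and geometry -- details the compiler has to
know about and the OS never sees:

#include <stdio.h>
#include <time.h>

#define N 1024

static double a[N][N];

int main(void)
{
    clock_t t;
    double sum = 0.0;
    int i, j;

    t = clock();
    for (i = 0; i < N; i++)        /* row-major: consecutive addresses, */
        for (j = 0; j < N; j++)    /* about one cache miss per line */
            sum += a[i][j];
    printf("row-major:    %ld ticks\n", (long)(clock() - t));

    t = clock();
    for (j = 0; j < N; j++)        /* column-major: strides of N*8 bytes, */
        for (i = 0; i < N; i++)    /* missing the cache on nearly every access */
            sum += a[i][j];
    printf("column-major: %ld ticks\n", (long)(clock() - t));

    return (int)sum;   /* keep the loops from being optimized away */
}

No OS can reorder those loops for you; only something that sees the source
and knows the memory hierarchy can.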

Actually, I think somebody needs to reinvent what an OS is supposed to do.
Take NT (which I am using to write this email) for example. That OS is so
bulky and has so much baggage it would be difficult to write a really small
or fast piece of software for it. Most versions of Unix have gone the
same way. More often than not, the OS gets in the way. There are a lot of
good experimental OSs out there, but unfortunately, most don't have enough
support to be usable. If I had lots of time and money, I would start
building a brand new computing environment from the ground up.

The one thing I liked about DOS, as limited as it was, was that it allowed
you to write really compact, fast apps. If it had multitasked natively and
had full 32-bit support, I might still be working on it.

>When was the last time you read your compiler in one hour, and understood
>it, at least sufficiently to modify it constructively? (Hint: read a
>modern optimizing Forth compiler, which is just a few pages of
>printout; one should definitely recite 'em at those poetry readings. It's
>a goddamn work of art). When was the last time the ole edit-compile-crash
>iteration took imperceptible milliseconds, not minutes? When you could do
>dynamic surgery on a live system, with true incremental compiling?

Oh no. Another Forth evangelist! ;)
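
Though I will grant the size point: the inner interpreter of a
threaded-code Forth really is tiny. A sketch in C (all of it invented for
illustration, not lifted from any particular Forth):

#include <stdio.h>

static int stack[64], sp = 0;

static void push(int v) { stack[sp++] = v; }
static int  pop(void)   { return stack[--sp]; }

/* Each "word" is a C function; a compiled program is just an array of
   pointers to them, all sharing one data stack. */
static void w_lit3(void) { push(3); }
static void w_lit4(void) { push(4); }
static void w_add(void)  { push(pop() + pop()); }
static void w_dot(void)  { printf("%d\n", pop()); }

int main(void)
{
    /* "3 4 + ." compiled as a thread of word pointers */
    void (*thread[])(void) = { w_lit3, w_lit4, w_add, w_dot };
    unsigned i;
    for (i = 0; i < sizeof thread / sizeof thread[0]; i++)
        thread[i]();   /* the NEXT loop: fetch and run the next word */
    return 0;
}

The real thing adds a dictionary and a compiler on top, but not a great
deal more.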

>We have had to wait since the '60s for GC to become industrial mainstream; nay,
>not even mainstream: it is still considered top-notch newfangledness.
>This should tell us something about current tech's potential for
>optimization. Remember the UCSD p-System, Wirth's Oberon project, Lisp machines,
>Taos, Forth chips. Now we see pieces of them cropping up in mainstream
>products.
>
>> pipeline, but almost no look-ahead or intelligent pre-fetch/decode logic.
>
>What if your CPU core is so simple, you cannot afford a hardware
>multiplier, apart from things like FPU, pipeline, BPU, &c&c?
>
>Consider a WSI system: a 64 bit primitive CPU, a few kBytes of SRAM to
>contain the VM, 128 kBytes of DRAM, a router with, say, 12 links. How much
>complexity can we afford before die yield drops to zero? Surely not much
>beyond 1 MTransistors. (Those working in a Valley silicon foundry, please
>step forth, to cite chapter & verse).

The major CPU producers seem to be getting acceptable yields at
3-5 MTransistors, depending on which company you are talking about. I know
Intel gets >80% yield by mid-cycle for most of its CPU products. I think
their initial yields are something like 15% though.

>There is simply no way to put above complexity into a die as tiny as
>that. Moreover, it would be an entirely wrong direction: it would keep
>the CPU architecture converging towards a CAM (it is easy to see the
>convergence in the extreme cases of both), which is a nonalgorithmic machine,
>making no distinction between code and data, everything being just a
>hyperactive bit soup.

We are seeing, and will probably continue to see, a trend towards
multi-chip modules. This is certainly coming true as far as memory is
concerned. It's more profitable to connect a lot of little chips than to
make one big one. Companies like Cray have been doing things this way for a
while. And of course, Intel did it to get the cache size and performance
they wanted on the P6. I expect most of the next generation of chips from
many of the major manufacturers will be MCMs.

>To approach it from a different direction: consider a 1 MBit DRAM. If we
>ignore all I/O, we can surely organize it as 1 kWords of 1 kBit each.
>So we need a 1 kBit SRAM register to refresh it occasionally, and an
>incrementing PC to step through, as fast as the core lets us. We can add
>shift capability to the register, we can take two of them, add an adder,
>basic logic, a segmented ALU design (selectable comb tooth width), a
>variety of primitive logic engines to protect parts of the word, to
>shuffle things back & forth, etc. Anyway, starting from this point one
>does not have too many choices about how the final beast will look. I
>have seen the MediaProcessor people independently reinventing a lot of
>my machinery I thought was cool but ad hoc. (My design is still better,
>though ;)

These architectures can be fast, but how well would something like this
really work for general computing applications such as spreadsheets? The
instruction set would be rather impoverished for efficient general-purpose
computing.
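
That said, the "selectable comb tooth width" idea maps onto conventional
hardware too. Here is a minimal C sketch of a segmented add done in
software (my own illustration, assuming 64-bit longs as on the Alpha):
mask off the top bit of each lane so carries cannot cross a lane boundary,
then patch the top bits back in:

#include <stdio.h>

typedef unsigned long u64;   /* assumes 64-bit longs */

/* Add eight 8-bit lanes packed into one 64-bit word, with the "comb"
   set to 8-bit teeth. */
static u64 add8(u64 x, u64 y)
{
    u64 hi = 0x8080808080808080UL;   /* top bit of every lane */
    return ((x & ~hi) + (y & ~hi)) ^ ((x ^ y) & hi);
}

int main(void)
{
    u64 x = 0x01020304050607FFUL;
    u64 y = 0x0101010101010101UL;
    printf("%016lx\n", add8(x, y));  /* low lane wraps FF+01 -> 00 */
    return 0;
}

Eight adds per cycle on a plain ALU is nothing to sneeze at, but try
expressing a spreadsheet's control flow that way.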

>> The result is that the pipeline is flushed pretty often, which is only
>> worsened by its depth. Also, the high clock rates make pipeline stalls
>
>So let's drop the pipeline, and move the core onto the die, as much of it
>as die yield can take before wafer yield drops to zero.

It is okay to have a pipeline. It is just poor design to have such a deep
and dependent one when you have essentially no branch prediction or
look-ahead logic. A well-designed pipeline is what allows you to get the
high clock speeds out of silicon processes. Either keep it simple and
short, or long but with a lot of intelligent pre-pipeline logic. The catch
with simple and short is that you need either a very simple instruction
set or a *very* good compiler. Personally, I don't think current compilers
are up to the challenge.
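
You can even watch the flush cost from plain C. A crude sketch (the
numbers are made up, and a clever compiler might turn the branch into a
conditional move): identical work over identical data, with only the
element order changed so the branch goes from random to predictable:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

static int cmp(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

static long count_big(const int *v, int n)
{
    long c = 0;
    int i;
    for (i = 0; i < n; i++)
        if (v[i] >= 128)     /* the branch the predictor sees */
            c++;
    return c;
}

int main(void)
{
    static int v[N];
    clock_t t;
    long c;
    int i;

    for (i = 0; i < N; i++)
        v[i] = rand() & 255; /* branch outcome is a coin flip */

    t = clock();
    c = count_big(v, N);
    printf("random order: %ld ticks (count %ld)\n", (long)(clock() - t), c);

    qsort(v, N, sizeof v[0], cmp);   /* now the branch is predictable */

    t = clock();
    c = count_big(v, N);
    printf("sorted order: %ld ticks (count %ld)\n", (long)(clock() - t), c);

    return 0;
}

On a deeply pipelined machine with no prediction, the random pass eats a
flush on roughly every other element; the sorted pass mispredicts only
where the data crosses 128.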

>
>> (due to things like cache misses) more serious than they would be in slower
>
>So let's abolish the cache, giving core DRAM almost cache quality,
>mapping a tiny SRAM islet into the address space, while eliminating
>worst-case scenarios (locality is nonexistent in certain problems; in fact
>it can be conjectured that access locality is a pathological case, an
>artefact of the human programmer).

Isn't this the way modern caches are essentially done?

>> clocked chips. Add on top of this an inefficient superscalar
>
>Astronomic clocks aren't everything, they just increase power dissipation
>(wasn't that an exponential relation?) and are of no use if the core
>cannot deliver requested words. Clock rates beyond 2-4 GHz aren't viable
>with Si tech anyway (physics starts harrumphing in the background), and
>GaAs is a much weaker substrate, when it comes to integration density,
>which is the key.

High clock rates force a deeper pipeline because you still have gate latency
to contend with. Deeper pipelines give an instruction more clock cycles to
complete, so each stage has less logic to fit into a single cycle. If your
clock is getting faster and your gates aren't, the only solution is to
deepen the pipeline.
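
The arithmetic fits in a few lines of C (the 200 ps gate delay and
50-gate logic depth are made-up illustration numbers, and latch overhead
is ignored):

#include <stdio.h>

int main(void)
{
    double gate_ps = 200.0;      /* assumed gate delay */
    double total_gates = 50.0;   /* assumed logic depth per instruction */
    int stages;

    /* The slowest stage sets the clock: fewer gates per stage,
       shorter cycle, higher frequency. */
    for (stages = 1; stages <= 8; stages *= 2) {
        double stage_ps = (total_gates / stages) * gate_ps;
        printf("%d stages: cycle %6.0f ps -> %4.2f GHz\n",
               stages, stage_ps, 1000.0 / stage_ps);
    }
    return 0;
}

Same gates, eight times the clock -- but only by carving the logic into
eight stages, each of which can now be flushed.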

The obvious physical limit for Si is when clock speed equals switching
speed, although realistically you can never get there, since it would
require a ridiculous level of complexity (a 100-stage pipeline?). I am not
sure if they have determined the maximum theoretical switching speed for a
Si transistor yet. The best they can do right now is improve the
fan-in/fan-out limits (as with BiCMOS) and generally decrease structural latency.

>I say take a 40 kTransistor, highly tweaked CPU (Chuck does this), clock
>it as high as it can take without melting, considering dense packing of
>a WSI system (about 1 GHz with current high-end structures),
>flush your code into the tiny SRAM by hand, put it and a 128 k DRAM on
>a die, then fill a wafer with 100s of 'em dies, letting them all work in
>parallel. What are the wafer costs, about $500? Even if it sells for 1 k$,
>it is still acceptable.

At 40 kTransistors, how wide will the ALU be? I assume this does not
include any type of FP capability.

>Who is going to program this? The multimedia craze will greedily gulp
>down every imaginable DSP resource you can throw at it, as will GA and
>ANN stuff.
>
>> implementation, and you have a chip that virtually *requires* handcrafted
>> assembly language to run efficiently.
>
>I think we might see a (brief) revival of handcrafted assembly, at least
>in the nanoOS kernel methods. Apps will stop being monolithic, instead
>being live things, multiplying over nodes dynamically or shrinking
>likewise, goaded by the OS load leveler.

I still use assembly for computationally intensive functions in some types
of software. I am still amazed at the kind of speed I can get out of
handcrafted assembler sometimes, even on old systems.
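
Most of the win is scheduling, not magic opcodes. Here is the flavor of it
in plain C (a toy example of my own): a dot product rewritten with four
independent accumulators so a pipelined FPU always has work in flight.
Hand assembly buys you the register allocation and instruction scheduling
on top of this:

#include <stdio.h>

static double dot_naive(const double *a, const double *b, int n)
{
    double s = 0.0;
    int i;
    for (i = 0; i < n; i++)      /* one long dependency chain */
        s += a[i] * b[i];
    return s;
}

static double dot_unrolled(const double *a, const double *b, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {   /* four independent chains */
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)                  /* leftover elements */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}

int main(void)
{
    double a[7] = { 1, 2, 3, 4, 5, 6, 7 };
    double b[7] = { 7, 6, 5, 4, 3, 2, 1 };
    printf("%g %g\n", dot_naive(a, b, 7), dot_unrolled(a, b, 7));
    return 0;
}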

>> One of the reasons Intel's late generation chips have done so well,
>> performance-wise, is that Intel probably has one of the best pre-execution
>> logic designs and techniques on any CPU, RISC or otherwise. Add to this
>
>Intel has grown sufficiently rich to afford the best people, as has M$. I
>once laughed about them, ignored them. I cannot, anymore. They have
>demolished the markets, and upon the ruined marketplace (all thanks, ye
>smart customers) their braindead designs even look good, since there is
>no competition.

Many aspects of Intel's designs are quite good. Other parts I question,
though. I think a lot of their bad design decisions are made to cater to
the consumer market that built them.

>Once you are big enough, investments into best manpower, new process
>research, long-term planning, industrial espionage and open market
>warfare can only make you bigger. It is profit maximization that has
>brought us peu à peu device introduction, instead of longer stages of
>drastic redesigns.
>
>> that Intel easily has one of the most efficient superscalar implementations,
>
>I don't think the vector/scalar division makes sense at all. Precambrian life
>forms, designed to become extinct.
>
>> and you get some real performance out of an old architecture. One of the
>> few RISC companies that I think has a really solid architecture concept
>> is HP.
>
>Alas, HP RISC is dying, as does Alpha. When speaking of monopolies...

Actually, HP RISC is being kind of absorbed by Intel. I guess either the P7
or P8 will come in a version that includes a PA-RISC decoder. This is
one feature of Intel's latest chips that I think is pretty cool: an
architecture-specific decoder sitting atop a powerful generic RISC core.
They could probably put a decoder for just about any common architecture on
top of that. Now if you could select your hardware architecture emulation
via software, *that* would be *too cool*.
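
In software the idea looks something like this (a toy model, everything
in it invented for illustration): one generic core executing micro-ops,
with a swappable front-end decoder per guest architecture:

#include <stdio.h>

typedef enum { UOP_LOADI, UOP_ADD, UOP_HALT } uop_t;
typedef struct { uop_t op; int val; } uop;

/* Guest ISA "A": byte pairs of (opcode, operand). */
static uop decode_a(const unsigned char *p, int *pc)
{
    uop u = { UOP_HALT, 0 };
    switch (p[*pc]) {
    case 1: u.op = UOP_LOADI; u.val = p[*pc + 1]; break;
    case 2: u.op = UOP_ADD;   u.val = p[*pc + 1]; break;
    }
    *pc += 2;
    return u;
}

/* Guest ISA "B": same semantics, different encoding (operand first). */
static uop decode_b(const unsigned char *p, int *pc)
{
    uop u = { UOP_HALT, 0 };
    switch (p[*pc + 1]) {
    case 10: u.op = UOP_LOADI; u.val = p[*pc]; break;
    case 20: u.op = UOP_ADD;   u.val = p[*pc]; break;
    }
    *pc += 2;
    return u;
}

/* The generic "core": executes micro-ops, ignorant of the guest ISA. */
static int run(uop (*decode)(const unsigned char *, int *),
               const unsigned char *prog)
{
    int r0 = 0, pc = 0;
    for (;;) {
        uop u = decode(prog, &pc);
        if (u.op == UOP_HALT)  return r0;
        if (u.op == UOP_LOADI) r0 = u.val;
        if (u.op == UOP_ADD)   r0 += u.val;
    }
}

int main(void)
{
    unsigned char prog_a[] = { 1, 40, 2, 2, 0, 0 };   /* load 40; add 2 */
    unsigned char prog_b[] = { 40, 10, 2, 20, 0, 0 }; /* same program, ISA B */
    printf("%d %d\n", run(decode_a, prog_a), run(decode_b, prog_b));
    return 0;
}

Swap the decode function and the same core runs a different instruction
set; Intel would just be doing the equivalent in hardware, ahead of the
pipeline.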

>
>One DSP design from TI features 4 DSP cores, 4 on-die SRAM areas,
>connected by an on-die crossbar, choreographed by a vanilla CPU. TI is
>pretty innovative in terms of DSP/embedded hybrids, putting core on the
>die and making DSPs cascadable (up to 6 fast links).

Let me see...this would be the TMS320C80, if I am not mistaken.
Theoretical throughput of 2,000 DSP MIPS, or something like that.

Doesn't IBM have a new architecture (MFast?) that has a 32 DSP-core mesh?
I vaguely remember reading something about this. 10,000 DSP MIPS
theoretical throughput (although I understand for most operations it is
something like 2-5 kMIPS).

>> I'll take it one step further: An OS you can't customize is useless.
>>
>
>Have YOU rebuilt your kernel tonight? ;)

Not since about two weeks ago :)

-James Rogers
jamesr@best.com