Re: We need better tools to make better tools. <guaRANTeed>

Eugene Leitl (Eugene.Leitl@lrz.uni-muenchen.de)
Sun, 8 Dec 1996 20:05:14 +0100 (MET)


On Wed, 4 Dec 1996, James Rogers wrote:

> At 08:29 PM 12/4/96 +0100, you wrote:
>
> >[ pointless complexity rant ]
>
> If you left all the hardware details to the OS, you wouldn't be able to do

Uh, I wasn't proposing leaving the optimization to the OS, just using a
simple hardware architecture which needs only a few straightforward
optimizations and, even more important, reacts _predictably/nonbrittle_
to optimization. Lacking single-node performance, the power must come
from maspar (massively parallel) systems.

But of course, to be hardware-independent the OS must contain a
just-in-time compiler (a concept first demonstrated by Taos). Though the
compile method will typically not be part of the OS nanokernel, it must
be extremely small (a few kBytes) and have excellent performance, i.e.
be able to translate faster than the HD can deliver data.

> much optimization. You would be trusting that the OS was optimized,
> something I am not yet willing to do. Granted, compilers *are* slow and

For the OS to be optimized, it must be just a few kBytes big. Then, yes,
it can be optimized, hand-coded in assembly. It can then be even
virtually bug-free. I think the OS must be a virtual machine,
interpreting a p-code (JIT-interpreting, of course). Depending on
implementation, the VM can be mostly software, or mostly hardware.
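The inner loop of such a p-code VM can be sketched in a few lines of C.
This is purely illustrative -- the opcodes, stack layout, and function
name are made up for the sketch, not Taos's actual format:

```c
#include <stdint.h>

/* Hypothetical sketch of a tiny p-code inner interpreter of the kind a
   nanokernel VM might use.  Opcode set and encoding are invented here. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Execute a p-code program against a small data stack; return top of stack. */
int32_t pcode_run(const int32_t *code)
{
    int32_t stack[64];
    int sp = 0;                          /* next free stack slot */

    for (;;) {
        switch (*code++) {
        case OP_PUSH: stack[sp++] = *code++;            break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
        case OP_HALT: return stack[sp - 1];
        }
    }
}
```

A JIT would translate such p-code into native sequences instead of
switching on it at run time, but the dispatch structure is the same.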

> large, but for some languages (like C++) and systems, the complexity of
> compiling software can approach absurdity. I once tried to write a

That's the reason why I say: hang compiler complexity. The investments
have gone far beyond any returns. Many developers do not even realize
how important it is to have blitz-fast turnaround cycles. By testing out
small functions thoroughly, the entire conglomerate becomes much less
brittle.

> compiler. Not an easy task if you are concerned with optimization.

Optimizing Forth compilers are embarrassingly simple and fast. Most of
these are simply lookup tables (sane processor architecture assumed, of
course).
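The "compiler as a lookup table" idea can be sketched like so; the
dictionary entries and opcode byte strings here are hypothetical
stand-ins, not any real Forth's code generator:

```c
#include <string.h>

/* Sketch: each Forth word maps straight to a machine-code sequence
   (placeholder bytes here); compiling is little more than find-and-append. */
typedef struct { const char *name; const char *code; } entry;

static const entry dict[] = {
    { "dup",  "\x8b\x06" },   /* invented opcode bytes, for illustration */
    { "+",    "\x01\xd8" },
    { "drop", "\x5b" },
};

/* Look a word up; return its code sequence, or NULL if undefined. */
const char *lookup(const char *word)
{
    for (unsigned i = 0; i < sizeof dict / sizeof dict[0]; i++)
        if (strcmp(dict[i].name, word) == 0)
            return dict[i].code;
    return 0;
}
```

Everything beyond this table -- peephole-combining adjacent words, say --
is optional polish, which is why such compilers stay small and fast.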

> Actually, I think somebody needs to reinvent what an OS is supposed to do.

We _have_ such OSes, not only Taos. A lot is happening in realtime
embedded systems, and in DSP OSes. It is just that the market has no
demand for them (yet). People like whales: they need MBytes on their
hard disks to feel they have got some value for their money. The same
thing with hardware boxes: marketing says a box must have a certain size
to be perceivable by the customer. Look at the severe case of featuritis
most office packages have: it makes them large, encrusted in useless
features, molasses-slow, but -- people buy them. Only very few people
deliberately run older versions because they are faster.

> Take NT (which I am using to write this email) for example. That OS is so
> bulky and has so much baggage it would be difficult to write a really small

It is purported to have a microkernel. Recently they said they are going
to integrate goddamn _video drivers_ into it. Microkernel, indeed. In
fact, Microsoft itself was unable to write NT; they had to engage a bunch
of VMS guys to do it. Considering the environment, they did a relatively
fine job.

> or fast piece of software for it. Most versions of Unix have become the
> same way. More often than not, the OS gets in the way. There are a lot of

Linux started as a monolithic system; by now kernel modularization has
progressed noticeably. At the same time the MkLinux project (for the
PowerPC Macs) has achieved wrapping a Linux personality around a Mach
microkernel. There is some chance of MkLinux becoming integrated into
the Linux source tree, and thus becoming available on a large number of
platforms.

However, Mach has not been designed for maspar systems, just SMPs.

> good experimental OSs out there, but unfortunately, most don't have enough
> support to be useable. If I had lots of time and money, I would start
> building a brand new computing environment from the ground up.

Yes, a tiny OO OS is necessary for maspar systems. There are several
possible candidates for desktop maspar system OSes already, though.

> The one thing I liked about DOS, as limited as it was, was that it allowed
> you to really write compact, fast apps. If it natively multi-tasked, and
> had full support for 32-bit operation, I might still be working on it.

I have been using a reentrant multitasking OO 32-clean microkernel
system since as early as the end of 1987. It even had a useful GUI, and
a shell.

> >[...]
>
> Oh no. Another Forth evangelist! ;)

Danger: pigeonhole alert. Apart from Lisp machines, I never met a more
powerful environment than a Forth machine. Of course, a Forth hacker eats
C hackers for breakfast, so not many had the guts to become Forth
wizards. A pity, imo. The Forth movement never achieved critical mass
the way Unix did. Lisp has failed at that as well.

> > [WSI optimal die size estimation]
>
> The major CPU producers seem to be getting acceptible yields at
> 3-5MTransistors, depending on which company you are talking about. I know
> Intel gets >80% yield by mid-cycle for most of its CPU products. I think
> their initial yields are something like 15% though.

Wow. I'd have thought 30% tops. 80% die yield for 1-2 MTransistor dies
is more than enough to produce 100% wafer yield WSI systems.

> >There is simply no way to put above complexity into a die as tiny as
> >that. Moreover, it would be an entirely wrong direction: it would keep
> >the CPU architecture converging towards a CAM (it is easy to see
> >convergenece in extreme cases of both), which is a nonalgorthmic machine,
> >making no distinction between code and data, everything being just a
> >hyperactive bit soup.
>
> We are currently seeing and will probably continue to see a trend towards
> multi-chip modules. This is certainly coming true as far as memory is

Yes, the last generation before WSI.

> concerned. It's more profitable to connect a lot of little chips than to
> make one big one. Companies like Cray have been doing things this way for a

This is true, and an obvious reason to create orthogonal, redundant WSI
systems, made from identical mini cores.

> while. And of course, Intel did it to get the cache size and performance
> they wanted on the P6. I estimate most of the next generation of chips for
> many of the major manufacturers will be MCMs.

Of course, one cannot put even the complexity of a P5 plus 1 MByte of
RAM on a die and expect yields >0.05%. WSI won't be pioneered by Intel,
that much is certain.

> >[...]
>
> These architectures can be fast, but how well would something like this
> really work for general computing applications such as spreadsheets? The

A spreadsheet is a quite parallel application (wonders over wonders...):
one can tessellate the spreadsheet over individual nodes, where only the
boundaries have to be communicated. Large-integer arithmetic will
substitute for any float. And of course, when it comes to displaying the
spreadsheet, a graph, etc., or to doing I/O on a RAID, maspar systems
will win hugely, again.
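A minimal sketch of the tessellation arithmetic, assuming a square grid
of nodes and invented helper names: each node owns a rectangular tile of
cells, and only the tile perimeter needs to be communicated to
neighbours, so communication shrinks relative to computation as tiles
grow:

```c
/* Which node (linear id) owns cell (row, col) when an R x C sheet is
   tiled over an n x n node grid?  Illustrative arithmetic only. */
int owner(int row, int col, int rows, int cols, int n)
{
    int tile_h = (rows + n - 1) / n;   /* tile height, rounded up */
    int tile_w = (cols + n - 1) / n;   /* tile width, rounded up  */
    return (row / tile_h) * n + (col / tile_w);
}

/* How many of a tile's cells sit on its boundary (its perimeter)?
   Only these ever need to be sent to neighbouring nodes. */
int boundary_cells(int tile_rows, int tile_cols)
{
    return 2 * (tile_rows + tile_cols) - 4;
}
```

For a 100x100 sheet on a 4x4 grid, each node holds 625 cells but shares
only 96 of them -- the ratio keeps improving with sheet size.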

> instruction set would be kind of poor for efficient general computing.

This is a myth, imo. A simple estimation will show that you'll get
several orders of magnitude more bang for the same bucks. The hard part
is writing maspar software which gets distributed over any architecture
automagically. Once again, software, and the minds of human programmers,
are the key to why we don't have maspar systems yet.

> >> The result is that the pipeline is flushed pretty often, which is only
> >> worstened by its depth. Also, the high clock rates make pipeline stalls
> >
> >So let's drop the pipeline, and move the core onto the die, as much of it
> >as die yield can take before wafer yield drops to zero.
>
> It is okay to have a pipeline. It is just poor design to have such a deep

It is not okay if the pipeline machinery doubles the necessary
transistor resources for the CPU. It is not okay if you use
self-modifying code (as ALife-bred code will routinely do). It is not
okay if pipeline stalls make execution time nondeterministic.

> and dependant one when you have essentially no branch prediction or
> look-ahead logic. A well-designed pipeline is what allows you to get the
> high clock speeds out of silicon processes. Either keep it simple and

Have a look at the MuP21 chip by Chuck Moore. It has 20-bit words and
5-bit instructions, 4 slots to a word. Each instruction gets executed
immediately; if the result is needed, it is instantly available. This
is not a pipeline, yet it has all the features of one.
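Unpacking the four 5-bit slots of a 20-bit word can be sketched as
below; the assumption that slot 0 sits in the high bits is mine, for
illustration, not necessarily MuP21's actual encoding:

```c
#include <stdint.h>

/* Extract the 5-bit instruction in slot n (0..3) of a 20-bit word,
   assuming slot 0 occupies the most significant bits. */
unsigned slot(uint32_t word, int n)
{
    return (word >> (15 - 5 * n)) & 0x1f;
}
```

Four instructions arrive per memory fetch, so the core keeps executing
while the next word is being read -- the pipeline-like effect with none
of the stall or flush machinery.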

> short, or long but with a lot of intelligent pre-pipeline logic. The thing
> with simple and short is, you have to have either a very simple instruction
> set, or a *very* good compiler. Personally, I don't think current compilers

Right. MISC -- Minimal Instruction Set Computer -- uses a very simple
instruction set.

> are up to the challenge.
>
> >
> >> (due to things like cache misses) more serious than they would be in slower
> >
> >So let's abolish the cache, giving core DRAM almost cache quality,
> >mapping a tiny SRAM islet into the address space, while eliminating
> >worst-case scenarios (locality is nonexistant in certain problems, if fact
> >it can be conjectured that access locality is a pathologicial case, an
> >artefact of the human programmer).
>
> Isn't this the way modern caches are essentially done?

Alas, not. I am talking about mapping a 0-wait-state SRAM into an
address slot. Caches are much more complex beasts: they mount a fast
memory transparently over the slow memory. If you have a cache miss, the
retrieval takes _longer_ than without a cache, maintaining cache
consistency is difficult, and caches eat huge amounts of silicon real
estate. Caches will have to go.
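The difference can be sketched as follows: the decode for a mapped SRAM
islet is a plain range check with deterministic timing, no tags and no
miss path. Base address and size here are made up:

```c
#include <stdint.h>

/* Hypothetical memory map: a small 0-wait-state SRAM islet occupies a
   fixed slot in the address space; everything else is ordinary DRAM. */
#define SRAM_BASE 0x00000000u
#define SRAM_SIZE 0x00001000u        /* 4 kByte islet */

/* Does this address hit the fast islet (1) or DRAM (0)?  One subtract
   and one compare -- no tag lookup, no miss penalty, same time always. */
int in_sram(uint32_t addr)
{
    return addr - SRAM_BASE < SRAM_SIZE;
}
```

The programmer (or compiler) decides what lives in the islet, instead of
a tag array guessing; access time becomes a constant of the address, not
of the access history.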

> >> clocked chips. Add on top of this an inefficient superscalar
> >
> >Astronomic clocks aren't everything, they just increase power dissipation
> >(wasn't that an exponential relation?) and are of no use if the core
> >cannot deliver requested words. Clock rates beyond 2-4 GHz aren't viable
> >with Si tech anyway (physics starting harrumphing in the background), and
> >GaAs is a much weaker substrate, when it comes to integration density,
> >which is the key.
>
> High clock rates force a deeper pipeline because you still have gate latency
> to contend with. Deeper pipelines mean you have more time to complete one

Nope. If the entire memory is on-die and the bus width approaches 1
kBit, core latency ceases to be a problem.

> instruction. If your clock is getting faster and your gates aren't, the
> only solution is to deepen the pipeline.

No pipeline.

> The obvious physical limit for Si is when clock speed equals switching
> speed, although realistically this isn't true, since it would require

I think for CHMOS the limit is about 4-5 GHz. Though frequencies will
rise as structures become smaller, the shrinking wire geometries cannot
take such high frequencies anymore. Sounds like saturation, huh?

> ridiculous level of complexity (a 100 level pipeline?). I am not sure if
> they have determined the maximum theoretical switching speed for a Si
> transistor yet. The best they can do right now is to just improve the
> fan-in/fan-out limits (like BiCMOS) and generally decrease structural latency.

Interesting observation: at very high frequencies the entire circuit
must be considered as an analogue one. Clean things with interfaces,
FSMs, etc. do not work anymore.

> >[...]
>
> At 40 kTransistors, how wide will the ALU be? I assume this does not
> include any type of FP capability.

32 bits, possibly even 64 bits; ALU width scales roughly with O(n). Of
course no FP, not even a hardware multiplier. I tried talking Chuck into
putting in a barrel shifter (a truly basic commodity, especially if you
don't have a multiplier), but he is a semireligious minimalist. i21, a
21-bit architecture, has 20 kTransistors complexity.
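Why a barrel shifter is such a basic commodity when you have no
multiplier: multiply then reduces to shift-and-add, one step per bit. A
minimal sketch of the software routine:

```c
#include <stdint.h>

/* Multiply by repeated shift-and-add: add the shifted multiplicand for
   each set bit of the multiplier.  With a barrel shifter each shift is
   one cycle; without one, even this fallback crawls. */
uint32_t shift_add_mul(uint32_t a, uint32_t b)
{
    uint32_t acc = 0;
    while (b) {
        if (b & 1)
            acc += a;     /* this bit of b contributes a shifted copy of a */
        a <<= 1;
        b >>= 1;
    }
    return acc;
}
```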

> >[...]
>
> I still use assembly for computationally complex functions in some types of
> software. I still can't believe the kind of speed I can get out of
> handcrafted assembler sometimes, even on old systems.

Yes. The nanoOS methods must be written in assembly, that's obvious.

> >[...]
>
> Many aspects of Intel's designs are quite good. Other parts I question,

Lacking contrast, everything will appear good. I think Intel chips stink,
yet I run Linux on a Pentium. I almost bought an Alpha, though.

> though. I think a lot of their bad design decisions are done to cater to
> the consumer market that built them.

Right. Customers are Evil, and they don't even realize it. It is a
perfect stampede: the bisons are dumb brutes, yet they will trample you
into the dirt if you happen to stand in their way.

> > [...]
> >Alas, HP RISC is dying, as does Alpha. When speaking of monopolies...
>
> Actually, HP RISC is being kind of absorbed by Intel. I guess either the P7

A nice illustration of Intel's politics: either buy them, sue them to
death, or outmaneuver them through sheer bulk.

> or P8 will include a version which will include a PA-RISC decoder. This is
> one feature of Intel's latest chips that I think is pretty cool. Have an

10 MTransistors complexity, right?

> architecture specific decoder sitting upon a powerful generic RISC core.
> They could probably put a decoder for just about any common architecture on
> top of that. Now if you could select your hardware architecture emulation
> via software, *that* would be *too cool*.

Alpha is microprogrammable. Threaded code machines are equivalent to
infinite microprogramming capability.

> >[...]
>
> Let me see...this would be the TMS320C80, if I am not mistaken.
> Theoretical throughput of 2,000 DSP MIPS, or something like that.

2 GOPs. Not bad, huh? But of course, this happens only five minutes
before midnight, on a Sunday, under a full moon.

[...]

ciao,
'gene