Re: tech snippets

Eugene Leitl (Eugene.Leitl@lrz.uni-muenchen.de)
Mon, 23 Dec 1996 15:07:56 +0100 (MET)


(Sorry for the delay; I am (briefly) back on Ethernet instead of a 14.4
kBaud dial-up link, which is _expensive_ in Krautland. I've heard local
calls are not free in California, is this really true? I thought that was
the case almost throughout the U.S...)

On Thu, 19 Dec 1996, James Rogers wrote:

> [antifloat rant]
> I often do use fixed-point numbers for floating point computation (32-bit
> int for fraction and 32 or 64-bit int for integer portion). For some

Applause! You realize almost nobody is doing this nowadays? A pity, imo.

> applications it is more convenient (and more accurate) to use fixed point
> calculations. However, for some types of floating point intensive

Fixed-point ints map uniformly onto the "real" reals (the resolution is the
same over the entire dynamic range), at the cost of a smaller dynamic range
than that of equally long floats. This can be avoided by using scaled
integers (scaling done by shifts), which diverts a lot of programmer
brainpower away from the actual problem and introduces subtle numerical
bugs, just as floats do. In fact, it would be almost equivalent to
implementing floats in software.
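
To make the fixed-point idea concrete, here is a minimal sketch of my own
(not James's code): Q16.16 arithmetic in C, i.e. 16 integer bits and 16
fraction bits in a 32-bit int. The 32.32 split mentioned above works the
same way, given wider intermediates. All the names are made up for
illustration.

#include <stdint.h>
#include <stdio.h>

typedef int32_t fix16;                  /* Q16.16 fixed-point value */
#define FIX_SHIFT 16
#define FIX_ONE   (1 << FIX_SHIFT)

static fix16  fix_from_double(double d) { return (fix16)(d * FIX_ONE); }
static double fix_to_double(fix16 a)    { return (double)a / FIX_ONE; }

/* add/sub are plain integer ops; mul/div need a rescaling shift */
static fix16 fix_mul(fix16 a, fix16 b) {
    return (fix16)(((int64_t)a * b) >> FIX_SHIFT);
}
static fix16 fix_div(fix16 a, fix16 b) {
    return (fix16)(((int64_t)a << FIX_SHIFT) / b);
}

int main(void) {
    fix16 x = fix_from_double(3.25), y = fix_from_double(0.5);
    printf("%f\n", fix_to_double(fix_mul(x, y)));   /* 1.625 */
    printf("%f\n", fix_to_double(fix_div(x, y)));   /* 6.5   */
    return 0;
}
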

> computation, I doubt this is the _fastest_ method.
>
> > [ Alpha/PA-Risc expensive? ]
>
> Granted, Fast fp RISC machines are expensive, but if you need a workstation
> with high-end fp capability, this is the way to go. "Consumer" CPUs,

If one is stuck with a monolithic application requiring high absolute
FLOPS, you are right. But if your application is suitable for SMP, or for
distribution over a (PVMed) workstation cluster, Intels might be the
choice, simply because they are so cheap.

> especially SMP, will get you more bang for the buck, but SMP is only useful
> for a subset of computational problems. The performance gap is getting
> pretty narrow though. Using specmarks as a reference, the highest-end

Yes, the numerics mainstream is starting to become aware of this.

> workstation CPUs only offer roughly twice the performance of the highest-end
> "consumer" CPUs. For some classes of problems, a 4-way SMP P6 or PowerPC
> system would seriously out-perform many high-end workstations at a fraction
> of the cost. Case in point: An Intergraph P6 SMP graphics workstations

...if only, e.g., numerical QM codes were not Fortran monsters from the
'70s, where porting is equivalent to rewriting them from scratch, which
grad students simply cannot afford, as it would take years.

> (with custom acceleration hardware) will out-perform any SGI graphics system
> costing less than $100k. The Intergraph system will only cost you $25k

I think current accelerator hardware/software is a trifle weak on
realtime graphics horsepower, but this will be mended in the next version.

> because it uses off the shelf components, with the sole exception of the
> hardware accelerator (which is compatible with any NT workstation).

While high-end SGIs use awesome amounts of silicon for their
engines/buffers, they charge far too much, especially because of their
small production volumes. Next-generation 3d accelerators should be
sufficient for mid-range scientific visualization, at a comparatively
negligible price. I'd like to see whether OpenGL will really catch on (it
should, being part of NT; Linux is still very weak on OpenGL
performance).

> I think this convergence of performance will kill a significant number of
> the RISC vendors unless prices converge as well.

Agreed absolutely. PA should fall, Alpha will probably fall, PowerPC
might fall. MIPS by rights should, but they have found their niche in
embedded and consumer markets, as has ARM, and possibly soon StrongARM.

> >> [ mutimedia DSP eats GFlops for breakfast ]
> >
> >The more reason for doing it in 128 bit integers, and with a maspar
> >pipeline (one CPU for each processing stage).
>
> I never really thought about it in this sense. I suppose it WOULD be
> possible to pipeline integer CPUs to get fast floating point performance.
> It probably never occurred to me because this is contrary to conventional
> thinking.

I take that as a compliment ;) Trouble is, we must grow accustomed to the
"Man from Mars" viewpoint, because physical reality will demand novel
solutions, as a rule of thumb mostly parallel ones. We've seen this
coming for decades, but we still haven't done much to school our thinking.

E.g., consider a minor novelty, the L4 nanokernel. It is just 12 kBytes,
which vastly increases its probability of being at least in the 2nd-level
cache when it is needed; the hot-spot 1 kByte of message-passing code (an
order of magnitude more efficient than Mach's) will typically even reside
in the 1st-level cache. The microkernel is not portable, so why not write
it in assembly for each individual CPU? This should take a single person
several months, which is tolerable. But consider the performance increase!

Taking this one step further: why not put the entire microkernel/VM into a
_normal_ on-die SRAM, with an address of its own in the address space (a
cache has none; it shadows the same address space as the addressable
core)? Why not offer a second set of registers/stacks for the OS, making
system subversion much harder? Since we will be using GA techniques, the
entire system must be able to execute randomly generated code (what
crashme does to kill your workstation). And each node of a maspar cluster
runs in a very hostile environment (GA, hacker attacks), so it must offer
lightweight cryptographic authentication methods as part of the _kernel_.

> This could be the basis for a flexible, fast computing architecture.
> Pipeline simple 128-bit (or even 256-bit) ALUs. The ALUs, by nature, would
> be really small, simple, and very fast. You could build a simple floating

Right: without hardware multipliers, ALUs grow really skinny. A lot of
clever algorithms exist which substitute short sequences of shifts and
basic logic for the hardware, and this stuff is not slow.
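
As a trivial illustration (my sketch, nothing from the thread):
multiplication reduced to shifts, adds and bit tests, the sort of
substitution meant above when there is no hardware multiplier. Plain C
stands in for what would really be a shifter and an adder.

#include <stdint.h>
#include <stdio.h>

/* multiply a*b using only shifts, adds and bit tests */
static uint64_t shift_add_mul(uint32_t a, uint32_t b) {
    uint64_t acc = 0;
    uint64_t addend = a;
    while (b) {
        if (b & 1)          /* for every set bit of b ...            */
            acc += addend;  /* ... add the correspondingly shifted a */
        addend <<= 1;
        b >>= 1;
    }
    return acc;
}

int main(void) {
    printf("%llu\n", (unsigned long long)shift_add_mul(12345, 6789)); /* 83810205 */
    return 0;
}
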

> point processor with many parallel pipelines using less than >1 million
> transistors (trivial these days). The fp throughput would be enormous, and

I don't think a 256-bit ALU needs more than 100 kTransistors; pipelining
can be replaced by VLIW/SIMD techniques.

> I suspect that you could build a veritable supercomputer on a chip or MCM

Now imagine a 8" WSI wafer, filled with 50% viable dies, linked by a
redundant on-wafer high-speed (100 MByte/s) hypergrid links... Sounds
like fun, huh?

> this way. And depending on how it was designed, you could have arbitrary
> hardware supported precision, up to the point of the total number of
> pipelines on the chip or MCM.

Another thing which interests me is reconfigurable/evolvable hardware.
We know how to translate source into register machines automagically, and
novel hardware lets us change logic gate connectivity by writing a bit
pattern into an SRAM. One approach might swap building blocks in and out;
another evolves this bit pattern with a GA. The pattern can be changed at
runtime by an autofeedback mechanism (a rough sketch follows below). I
can imagine funky things happening on optically linked WSIs running GAs
on such evolvable hardware dies, featuring kBit buses. Interesting
things. The next step would be hardware CAMs, of course, but we should not
rush things.
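
To be concrete about the GA part, here is a very rough (1+1)-style
evolution loop over a configuration bitstring, in C. The 64-bit "config
SRAM" word and the fitness function are hypothetical stand-ins; in real
evolvable hardware the fitness would come from measuring the configured
circuit's behaviour, i.e. the autofeedback loop.

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

#define CONFIG_BITS 64          /* size of the (made-up) config SRAM */

static int popcount64(uint64_t x) {
    int n = 0;
    while (x) { n += (int)(x & 1); x >>= 1; }
    return n;
}

/* Hypothetical fitness: bits matching some target behaviour. Stands in
 * for "configure the gates, run the circuit, score the result". */
static int fitness(uint64_t config) {
    const uint64_t target = 0xA5A5A5A5A5A5A5A5ULL;
    return CONFIG_BITS - popcount64(config ^ target);
}

int main(void) {
    uint64_t best = 0;                  /* current SRAM bit pattern */
    int best_fit = fitness(best);
    srand(1);

    for (int gen = 0; gen < 10000 && best_fit < CONFIG_BITS; gen++) {
        uint64_t mutant = best ^ (1ULL << (rand() % CONFIG_BITS)); /* flip one bit */
        int f = fitness(mutant);
        if (f >= best_fit) {            /* keep it if no worse: the feedback step */
            best = mutant;
            best_fit = f;
        }
    }
    printf("fitness %d, config %016llx\n", best_fit, (unsigned long long)best);
    return 0;
}
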

ciao,
'gene

>
> -James Rogers
> jamesr@best.com
>