Re: Computer Architectures

James Rogers (jamesr@best.com)
Thu, 19 Dec 1996 15:59:27 -0800


At 05:58 PM 12/19/96 +0100, you wrote:
>I think SMP is a very broken architecture. It speculates on memory
>bandwidth, a scarce commodity, and it requires good caches, which are
>difficult/costly to do, and do not help at all if there is little or no
>data/code locality.

Agreed. I don't think anyone contests the limited nature of SMP. I think
it was designed as a cheap interim solution that has since become popular.

>> software. OSs will have to support more truly parallel architectures before
>> the hardware will become popular. The first large-scale OS to adopt these
>> architectures will probably be one of the quickly evolving ones, like Linux.
>
>Linux is great, but Unix can't be scaled down to nanokernel size, alas.

Exactly what is a nano-kernel required to do? I am familiar with the
concept, but not with the intimate details. Won't you have to run some type
of distributed OS on top of the nano-kernels? Nano-kernels sound like
high-level microcode/firmware.
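
My rough picture of the interface, to frame the question. This is a
hypothetical sketch of my own, not any real nanokernel's API; paging,
drivers, filesystems and the distributed OS proper would all be pushed
up into ordinary tasks:

    /* nanokernel.h -- hypothetical sketch, for discussion only.
     * The kernel does nothing but switch contexts and pass small
     * fixed-size messages; everything else runs as tasks on top. */

    typedef unsigned long nk_task_t;            /* opaque task id       */

    typedef struct {
        nk_task_t     sender;
        unsigned long word[8];                  /* fixed-size payload   */
    } nk_msg_t;

    nk_task_t nk_spawn(void (*entry)(void));    /* create a task        */
    int       nk_send(nk_task_t to, const nk_msg_t *m);  /* queue a msg */
    int       nk_recv(nk_msg_t *m);             /* block for a message  */
    void      nk_yield(void);                   /* give up the CPU      */

If that is roughly right, then yes, the "OS" as users see it would be a
distributed system of servers talking over nk_send/nk_recv.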

>Dedicated OOP DSP OSes are much better candidates for maspar systems,
>imo. It will get really exciting when GA-grown ANNs jump from DSP
>clusters to dedicated ANN chips. Probably, the need for neural DSP will
>pioneer this; other fields (robotics, generic control, ALife AI) will
>gladly accept the torch. Now imagine entire countries encrusted with
>boxes full of ANN chips, mostly idle, locally connected by fiber links...
>Though agoric computing will inhibit that somewhat, the phase transition
>to >web is writ all over the wall, in neon letters light-minutes-large...
>
>> The hardware is already starting to get there. We are starting to see
>> multiple buses becoming available on PC motherboards, and fully integrated
>
>Many DSPs (Sharc, TI, &c) already offer several high-bandwidth links.
>Theoretically, a single macroscopic (albeit short) serial bus can carry
>100 MBytes/s, optical links several orders of magnitude more.
>(High-clocked stuff must be done in optics anyway, for dissipation
>and signal reasons).

DSPs by design can accommodate this much better than general-purpose CPUs.
DSPs tend to be very slick and efficient designs. And personally, I can't
wait until all the interconnects go to optics.

>> L2 caches (like the P6) are a good start towards eliminating resource
>> contention in multiprocessor systems. The one thing that will take the
>
>Caches are evil. Cache transistors provide no extra storage, and take
>4-6 times the transistor resources of an equivalent DRAM. Putting
>caches on die bloats the die enormously, which kills die yield and thus
>makes the result unsuitable for WSI. Cache consistency is a problem.
>Added latency in case of a cache miss is a problem. Increased design
>complexity, prolonged design time and increased probability of a design
>glitch are a problem. Decreased operating frequency due to circuit and
>design complexity is a problem. Lots of ugly & hairy things.

SRAM may be bloated, but at least it is fast. DRAM is simple and cheap,
but a horribly slow memory design. Capacitors will *never* stabilize with
the speed necessary for high-speed memory applications. As long as memory
is dependent on high-speed capacitors, we will never get beyond our
current bottleneck. The whole reason we need caches at all is the
limitations of DRAM.
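
To put rough numbers on it (the latencies here are assumed figures for
illustration, not measurements of any particular part):

    #include <stdio.h>

    int main(void)
    {
        /* Assumed: 10 ns on-chip SRAM hit, 70 ns DRAM access on a miss. */
        double hit_ns  = 10.0;
        double dram_ns = 70.0;
        double miss_rate;

        /* Average access time = hit time + miss rate * miss penalty. */
        for (miss_rate = 0.0; miss_rate <= 0.201; miss_rate += 0.05)
            printf("miss rate %4.0f%% -> %5.1f ns average\n",
                   miss_rate * 100.0, hit_ns + miss_rate * dram_ns);
        return 0;
    }

Even a modest miss rate drags the average a long way toward raw DRAM
speed, which is the whole case for caches, and the whole case against
relying on them.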

Currently, I think our best hope may be optical memory hardware, which is
both fast and very wide (the standard bus on laboratory test models is
1024 bits). Optical memory technologies are also expected to see
short-term performance improvements of at least another order of magnitude
as the technology matures. Unfortunately, we probably won't see anything
like this available for 5 years.
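
Back-of-the-envelope on that bus width (the clock rate here is purely an
assumption of mine, since I haven't seen figures for it):

    #include <stdio.h>

    int main(void)
    {
        double width_bits = 1024.0;   /* reported lab bus width      */
        double clock_hz   = 100e6;    /* ASSUMED 100 MHz, for scale  */
        printf("%.1f GB/s\n", width_bits / 8.0 * clock_hz / 1e9);
        return 0;
    }

Even at a pedestrian clock that is more than an order of magnitude past
what a conventional memory bus delivers today.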

>> longest is breaking out of the shared memory model. Most of the rest of the
>> required technology is available and supported.
>>
>> I am not too sure that the shared memory is really such a bad idea, in terms
>> of efficiency. I think what *really* needs to be improved is the general
>
>Shared memory contradicts the demand for locality. Data/code should
>reside in the utmost immediate vicinity of the ALU, ideally being a single
>entity (CAMs qualify here best). Because of the constraints of our
>spacetime (just 3 dimensions, the curled-up ones, alas, inaccessible),
>and, even worse, of silicon photolitho technology, which is a
>fundamentally 2d technique, the conflict arising from making the same
>data accessible to several processors is unresolvable. Caches are no
>good, and open a wholly new can of worms... (cache consistency in
>shared-memory architectures is a nightmare, see KSR).

I am familiar with the problems associated with cache consistency in
shared-memory architectures. But caches are the product of slow memory
architectures. If we had really fast memory and wide memory bandwidth,
caches would become irrelevant. Memory contention, however, is unavoidable
in shared memory systems.
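
Contention is easy to see even in a toy model of one shared bus and n
identical processors (every number below is assumed for illustration):

    #include <stdio.h>

    int main(void)
    {
        double bus_bytes_s  = 100e6;   /* assumed 100 MB/s shared bus  */
        double need_per_cpu = 30e6;    /* each CPU wants 30 MB/s       */
        int n;

        for (n = 1; n <= 8; n *= 2) {
            double demand = n * need_per_cpu;
            double got    = demand < bus_bytes_s ? need_per_cpu
                                                 : bus_bytes_s / n;
            printf("%d CPU(s): %4.1f MB/s each\n", n, got / 1e6);
        }
        return 0;
    }

Past the saturation point, adding processors just divides the same bus
among more mouths; faster memory raises the ceiling but never removes
the shared resource underneath.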

>> memory architecture currently used. If they used some type of fine grained
>> switching matrix mechanism, maybe something similar to the old Burroughs
>
>If we are to accept the scheme by which our relativistic universe works,
>we must adopt a large-scale asynchronous, locally-coupled, nanograin
>message-passing OO paradigm. Crossbars, whether vanilla or perfect
>shuffle, are excellent for asynchronous OOP, provided the topology allows
>trivial routing. Hypergrid does.
>
>> B5000 series mainframes, a lot of memory contention could be eliminated.
>> This of course in addition to speeding up the entire memory architecture
>> altogether.
>
>Alas, there are physical limits to that, particularly if it comes to
>off-die memory. Locality strikes yet again, aargh.
>
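
On the "trivial routing" claim above: in a binary hypercube, which is one
way to read "hypergrid", routing really is just bit-fixing. A sketch, with
made-up node numbering rather than any particular machine's link layout:

    #include <stdio.h>

    /* Dimension-order routing on a binary hypercube: neighbors differ
     * in exactly one bit of their node id, so a message gets from src
     * to dst by fixing one differing bit (one hop) at a time. */
    static void route(unsigned src, unsigned dst)
    {
        unsigned cur = src;
        printf("%u", cur);
        while (cur != dst) {
            unsigned diff = cur ^ dst;
            unsigned bit  = diff & (~diff + 1u);  /* lowest differing bit */
            cur ^= bit;                           /* take that link       */
            printf(" -> %u", cur);
        }
        printf("\n");
    }

    int main(void)
    {
        route(0, 13);    /* prints 0 -> 1 -> 5 -> 13 in a 4-cube */
        return 0;
    }

No routing tables and no global state: every hop needs only the local
node id and the destination, which fits the locality argument.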

I don't see how a computer could be designed that avoids shared memory yet
still runs a single process across multiple processors while preserving
both data consistency and locality. I suspect there are real theoretical
limitations here. If all data is local to each processor in a
multi-processor system, then for a single process that is effectively
shared memory, unless the data is duplicated at each node. But if the data
is duplicated at each node, you have a potential coherence problem, since
it is a single process. And if the data is distributed among the local
memories of different processors without duplication, you are back to
non-locality. I don't see how it is possible to build a perfect memory
architecture for running a single process on multiple processors while
keeping coherence AND locality, even assuming no limitations in the memory
hardware.
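
To make that trade-off concrete, here is a toy cost model of the three
options; every latency and traffic figure below is invented purely for
illustration:

    #include <stdio.h>

    int main(void)
    {
        /* One logical process spread over p processors; numbers assumed. */
        int    p          = 8;
        double local_ns   = 10.0;   /* access to local memory           */
        double remote_ns  = 200.0;  /* access across the interconnect   */
        double write_frac = 0.2;    /* fraction of accesses that write  */

        /* (a) Shared memory: every access fights over the same memory,
         *     modeled crudely as remote-speed access for everyone.     */
        double shared = remote_ns;

        /* (b) Full duplication: reads are local, but every write must
         *     be propagated to the other p-1 copies for coherence.     */
        double duplicated = (1.0 - write_frac) * local_ns
                          + write_frac * (local_ns + (p - 1) * remote_ns);

        /* (c) Distributed, no duplication: only 1/p of the data happens
         *     to be local; the rest is a remote reference.             */
        double distributed = (1.0 / p) * local_ns
                           + (1.0 - 1.0 / p) * remote_ns;

        printf("shared      %6.1f ns/access\n", shared);
        printf("duplicated  %6.1f ns/access\n", duplicated);
        printf("distributed %6.1f ns/access\n", distributed);
        return 0;
    }

Whichever row you pick, the cost shows up somewhere: contention, coherence
traffic, or non-locality. That is exactly the trilemma I am describing,
however the real constants work out on real hardware.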

-James Rogers
jamesr@best.com