I don't think we were really getting anywhere with the previous line of responses, so I decided to try it again from the beginning. Here goes:
Regarding Current Software
Current PC software is written for hardware that is actually in use, not hypothetical designs that might or might not ever be built. This is perfectly logical, and I don't think it makes sense to blame anyone for it. If a better architecture becomes available, we can expect ordinary market forces to lead software vendors to support it in short order (look at Microsoft's efforts with regard to the only-marginally-superior Alpha chip, for example).
The more sophisticated vendors (and like it or not, that includes Microsoft) have been writing 100% object-oriented, multithreaded code for several years now. They use asynchronous communication anywhere there is a chance that it might be useful, and they take full advantage of what little multiprocessor hardware is actually available. There is also a trend currently underway towards designing applications to run distributed across multiple machines on a network, and this seems likely to become the standard approach for high-performance software in the near future.
Regarding Fine-Grained Parallelism
Parallel processing is not a new idea. The supercomputer industry has been doing it for some time now, and they've done plenty of experimenting with different kinds of architectures. They have apparently decided that it makes more sense to link 1,000 big, fast CPUs with large memory caches than 100,000 small, cheap CPUs with tiny independent memory blocks. That fits perfectly with what I know about parallel computing - the more nodes you have, the higher your overhead tends to be, and tiny nodes can easily end up spending 100% of their resources on system overhead.
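To make that overhead argument concrete, here is a toy model of my own (the numbers and the log-shaped overhead curve are assumptions for illustration, not figures from any vendor). It compares 1,000 big CPUs against 100,000 tiny ones when each node loses a slice of its cycles to coordination that grows with the node count:

```python
import math

def effective_throughput(nodes, per_node_speed, overhead_coeff):
    """Total useful work after subtracting per-node coordination overhead.

    Assumed model: the overhead fraction on each node grows as
    overhead_coeff * log2(nodes), capped at 100% of the node's cycles.
    """
    overhead = min(1.0, overhead_coeff * math.log2(nodes))
    return nodes * per_node_speed * (1.0 - overhead)

# 1,000 big CPUs (speed 100 each) vs. 100,000 tiny CPUs (speed 1 each),
# assuming 5% overhead per doubling of node count:
big = effective_throughput(1_000, 100, 0.05)    # roughly 50,000 units of work
tiny = effective_throughput(100_000, 1, 0.05)   # roughly 17,000 units of work
```

Even though both configurations have the same raw capacity (100,000 speed-units), the fine-grained machine loses most of it to overhead in this model - and with a slightly steeper overhead curve the tiny nodes hit the 100%-overhead wall entirely.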
Now, if someone has found a new technique that changes the picture, great. But if this is something you've thought up yourself, I suggest you do some more research (or at least propose a more complete design). When one of the most competitive (and technically proficient) industries on the planet has already tried something and discarded it as unworkable, it's going to take more than arm-waving to convince me that they are wrong.
Regarding the Applicability of Parallelism
The processes on a normal computer span a vast continuum between the completely serial and the massively parallel, but most of them cluster near the serial end of the spectrum. Yes, you have a few hundred processes in memory on your computer at any given time, but only a few of them are actually doing anything. Once you've allocated two or three fast CPUs (or a dozen or so slow ones) to the OS and any running applications, there isn't much left to do on a typical desktop machine. Even things that in theory should be parallel, like spell checking, don't actually get much benefit from multiple processors (after all, the user only responds to one dialog box at a time).
On servers there is more going on, and thus more opportunity for parallelism. However, the performance bottleneck is usually in the network or disk access, not CPU time. You can solve these problems by introducing more parallelism into the system, but ultimately it isn't cost-effective. For 99% of the applications out there, it makes more sense to buy 5 standardized boxes for <$5,000 each than one $100,000 mega-server (and you get better performance, too).
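The arithmetic behind that claim is simple enough to spell out. The prices come from the paragraph above; the assumption that the mega-server runs 3x faster than a single cheap box is my own, chosen generously:

```python
# Cost-per-unit-of-throughput comparison. Prices are from the post;
# the mega-server's assumed 3x speed advantage is illustrative only.
cheap_boxes = 5
cheap_price = 5_000      # dollars per standardized box
mega_price = 100_000     # dollars for the mega-server

cheap_total = cheap_boxes * cheap_price      # $25,000 for 5 units of capacity
cheap_per_unit = cheap_total / cheap_boxes   # $5,000 per unit of throughput
mega_per_unit = mega_price / 3               # ~$33,333 per unit, even at 3x speed
```

Even granting the big box a 3x speed edge, the farm of cheap machines delivers more total throughput (5 units vs. 3) at a quarter of the price.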
Of course, there are many processes that are highly amenable to being run in a parallel manner (video rendering, simulation of any kind, and lots of other things), but most of them are seldom actually done on PCs. The one example that has become commonplace (video rendering) is usually handled by a specialized board with 1 - 8 fast DSP chips run by custom driver-level software (once again, the vendors have decided that a few fast, expensive chips are more economical than a lot of slow, cheap ones).
Even for those tasks, though, the kind of fine-grained design you're describing runs into two problems:

1) Most parallel tasks require that a large fraction of the data in the system be shared among all of your CPUs. Thus, your system needs to provide for a lot of shared memory if it is going to be capable of tackling molecular CAD, atmospheric simulations, neural networks, etc. That brings up all those issues of caching, inter-node communication and general overhead you were trying to avoid.
2) You also can't get away from context switching. Any reasonably complex task is going to have to be broken down into procedures, and each processor will have to call a whole series of them in order to get any useful work done. This isn't just an artifact of the way we currently write software, either. It is an inevitable result of the fact that any interesting computation requires a long series of distinct operations, each of which may require very different code and/or data from the others.
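To see point 2 concretely, take the spell-checking example from earlier. Here is a deliberately minimal sketch (all the function names and the tiny dictionary are made up for illustration) showing that even this "parallel-friendly" task decomposes into a chain of distinct procedures, each touching different code and data:

```python
def tokenize(text):
    """String-handling code, operating on raw text."""
    return text.split()

def look_up(words, dictionary):
    """Set-lookup code, operating on a dictionary data structure."""
    return [w for w in words if w.lower() not in dictionary]

def report(misses):
    """Formatting code, operating on the result list."""
    return ", ".join(misses)

dictionary = {"the", "cat", "sat"}
unknowns = report(look_up(tokenize("The cat szat"), dictionary))
# unknowns == "szat"
```

Every processor running this still has to walk through three very different routines in sequence - different instructions, different working sets - so a node with a tiny independent memory block would spend its time swapping code and data in and out rather than checking words.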
Billy Brown, MCSE+I