Re: web snapshot

From: Eugene Leitl (Eugene.Leitl@lrz.uni-muenchen.de)
Date: Thu Jan 04 2001 - 05:48:27 MST


On Wed, 3 Jan 2001, Spike Jones wrote:

> You know, a technique for archiving the entire content of
> the web on any given day in history would be valuable as all
> hell. Have we not all read something somewhere on the web
> some time in the past and now we cannot find it? Think of
> the post singularity historians, trying to dig thru all the web
> content trying to reconstruct it all. Looks like an opportunity
> begging for some sharp computer guru to make a ton of
> money by inventing such a thing. spike

We've talked about this before (how I miss Sasha's input), but
it is probably worth reiterating.

Non-hydra (i.e. centralized) web spidering is lousy in multiple ways:

1) it's redundant. Many spiders try getting the same content.

2) it's redundant. You don't know when content changes, so you
   rescan it periodically. And again. And again. And again.
   (A toy calculation after this list makes 2) and 5) concrete.)

3) reading content over a webserver has a much higher overhead
   than going over a file system

4) since you scan from a central location, your own pipe is not
   the bottleneck; in 99% of cases you're pulling remote content
   through a rather slow connection

5) you're pulling content verbatim, instead of just picking
   up the index, which is typically tiny in comparison to
   the total document base

6) most of the web is database-backed or otherwise nontrivially
   structured, so the spider never sees the bulk of the content.
   In many cases this is deliberate, but not in all.
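
To make 2) and 5) concrete, here is a toy back-of-the-envelope
calculation; every number in it (site size, index ratio, scan
frequency) is assumed for illustration, not measured anywhere:

  # Toy numbers, assumed for illustration only:
  SITE_SIZE_MB      = 500.0  # total document base of one site
  INDEX_RATIO       = 0.02   # full-text index as a fraction of the documents
  RESCANS_PER_MONTH = 10     # how often a conventional spider re-pulls the site
  CHANGES_PER_MONTH = 3      # how often the content actually changes

  recrawl   = SITE_SIZE_MB * RESCANS_PER_MONTH                 # verbatim re-pulls
  on_change = SITE_SIZE_MB * INDEX_RATIO * CHANGES_PER_MONTH   # index only, on change

  print("periodic recrawl: %5.0f MB/month" % recrawl)    # 5000 MB/month
  print("index on change:  %5.0f MB/month" % on_change)  #   30 MB/month
  print("savings: %.0fx" % (recrawl / on_change))        # ~167x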

A partial remedy would be to bundle partial spider functionality for a
full-text index into Apache, so that upon submission (inserting a new
leaf into the document tree, a database insert flagged as
world-visible, etc.) the web server automatically refreshes the index,
which is placed in a standard location. If the number of diffs crosses
a threshold, the web server notifies a spider, which picks up the
full-text index, and nothing but the index.
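
A minimal sketch of that server-side hook, in Python rather than as an
actual Apache module; the index path, the diff threshold and the
spider's notification URL are all invented for illustration:

  import urllib.request

  INDEX_PATH     = "/var/www/.fulltext-index"           # the "standard location" (assumed)
  NOTIFY_URL     = "http://spider.example.net/refresh"  # hypothetical spider endpoint
  DIFF_THRESHOLD = 50                                   # made-up threshold

  index = {}           # toy inverted index: word -> set of local documents
  pending_diffs = 0

  def on_publish(doc_path, text):
      """Hook called when a new leaf enters the document tree, or a
      world-visible database insert happens (the hook point is assumed)."""
      global pending_diffs
      for word in set(text.lower().split()):
          index.setdefault(word, set()).add(doc_path)
      pending_diffs += 1
      if pending_diffs > DIFF_THRESHOLD:
          # tell the spider the index is stale; it then fetches the index
          # from INDEX_PATH, and nothing but the index
          urllib.request.urlopen(NOTIFY_URL + "?index=" + INDEX_PATH)
          pending_diffs = 0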

This guarantees that the search engine is always up to date, that only
the relevant parts are indexed, that the server doesn't see robot
hits, and that the network in between sees only a tiny fraction of the
bandwidth use it would otherwise.

This is easily extensible to the point where each web server builds
a map of local nodes (i.e. those reachable within a minimal number
of hops, and/or those with a reasonably fat pipe to them) and picks
up their full-text indexes (or does it the old-fashioned way, by
spidering them). The indexing load is distributed, and by virtue of
locality a) no superfluous traffic is generated and b) the pipe is
reasonably fat, due to the likely absence of bottlenecks. One step
away from this is a distributed search engine, which amplifies
queries across the nodes. Assuming each node indexes ~100 others,
the amplified traffic won't saturate any single node, since the last
step of the pyramid is omitted, saving a couple orders of magnitude
in traffic (see the arithmetic below). The reason Gnutella saturates
is that it doesn't do this, and amplifies every query to the entire
network, much of which is not on xDSL.
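
The fan-out arithmetic, with assumed numbers (the ~100 figure is the
one above, the rest is illustrative):

  FANOUT = 100   # each node holds the full-text indexes of ~100 neighbours

  def nodes_covered(depth):
      # content searchable when a query is forwarded `depth` hops and every
      # queried node also answers for its 100 indexed neighbours
      return FANOUT ** (depth + 1)

  def messages_sent(depth):
      # queries actually transmitted; the last level of the pyramid is never
      # contacted, because its indexes are already held one level up
      return sum(FANOUT ** d for d in range(1, depth + 1))

  for d in (1, 2):
      print(d, nodes_covered(d), messages_sent(d))
  # depth 1:    10,000 nodes searchable for    100 messages
  # depth 2: 1,000,000 nodes searchable for 10,100 messages
  # Flooding a la Gnutella sends roughly one message per node reached,
  # i.e. about two orders of magnitude more traffic.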

A good idea would be to add the option of geographically constrainable
queries (net connectivity will ultimately approach geographic locality)
and a classification scheme which reduces query amplification by
adding semantic information. If you're interested in plush animals,
you probably don't want to query Boeing's database. If it's a pr0n
.png, or a pic of your latest industrial robot model, say so in a tag.
Some of it should even be free-form, for AI which can handle this.
Machine vision will use a *lot* more MIPS and will be a *lot* harder
to program than something which can understand plain text.
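
What such a tag might look like; the field names, categories and geo
notation below are invented, not any existing standard:

  DOC_TAGS = {
      "url":      "http://www.example.com/robots/arm-2000.png",
      "type":     "image",
      "category": ["industrial-robot", "product-photo"],  # could also be free-form
      "geo":      "de.muc",    # rough location of the server, assumed notation
  }

  def worth_forwarding(query_categories, query_region, tags):
      """Forward a query to a node only if its advertised tags overlap,
      so a plush-animal query never reaches Boeing's database."""
      if query_region and tags.get("geo", "").split(".")[0] != query_region:
          return False
      return bool(set(query_categories) & set(tags.get("category", [])))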

Another step away is integrating load levelling and redundancy.
A web server has a fraction of its file system allocated to hosting
others' content. Some of it gets mirrored from neighbour nodes, some
of it moves around, attempting to minimize traffic. If you get
slashdotted, you suddenly see your content amplifying itself and
drifting off along the gradient of the traffic.
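
A sketch of the replication rule, with a made-up threshold and
invented node interfaces (nothing here is a real API):

  HOT_THRESHOLD = 1000   # requests/hour before a document counts as slashdotted

  class Neighbour:
      """Toy stand-in for a peer node; real figures would come off the network."""
      def __init__(self, name, free_mb, hops_to_requesters):
          self.name = name
          self.free_mb = free_mb
          self.hops = hops_to_requesters
      def distance_to(self, src):
          return self.hops       # e.g. hop count toward the requesting host
      def spare_space(self):
          return self.free_mb
      def store_copy(self, doc):
          print("mirroring %s on %s" % (doc["name"], self.name))

  def rebalance(doc, recent_requests, neighbours):
      """Copy a hot document toward the traffic that is asking for it."""
      if len(recent_requests) < HOT_THRESHOLD:
          return []
      ranked = sorted(neighbours,
                      key=lambda n: sum(n.distance_to(src) for src in recent_requests))
      mirrors = [n for n in ranked if n.spare_space() >= doc["size_mb"]][:3]
      for n in mirrors:
          n.store_copy(doc)      # content "amplifies itself" down the traffic gradient
      return mirrors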

I think we'll see web++ skipping several of these stages, should
FreeNet and MojoNation hit criticality. Here's something which
could put Akamai and Google out of business, eventually.

P2P a la MojoNation is how the web should have been designed
right from the start. If broadband flat rate doesn't boost it
into criticality, then it's tragedy-of-the-commons time all
over again.


