
Date: Wed Jun 18 2003 - 23:51:00 MDT


    Have we been limited by Google?

      Thursday, 19 June 2003
    Google Powered Search: How Google Edits the Web
    Google may be an amazing database of webpages, but it certainly is not a
    comprehensive one. I wanted to find out how many webpages have Google
    WebSearch on them. Typically these display text similar to "Powered by
    Google", although this is not always true. While trying to build a
    comprehensive database of sites carrying the "Powered by Google" label, I
    soon learnt how much Google edits out of the web.

    Narrowing the Web World
    First of all, we need to understand what Google Inc. is in fact building.
    In an article about the Grub Project [which seeks to index the whole of
    the web], Peter Norvig, the director of search quality at Google Inc.,
    describes the Google strategy. Norvig does not consider the problem to be
    crawling more of the web and getting more into the Google database. Rather,
    he considers the problem to be fundamentally how to narrow the index. The
    article reads:
    "Going from tens of thousands of machines to hundreds of thousands of
    machines is fundamentally going to change the nature of search," Stechert said.
    "Going to millions of machines allows us to ask, ‘What can we do with all that
    computing power?’"

    But Peter Norvig, director of search quality at Google, said while the Grub
    project is topical and interesting; improving Web searches isn't a problem of
    widening an index, but narrowing it.

    And how they have narrowed it is certainly interesting. Look at what we
    found when we tried to build a comprehensive list of sites bearing the
    words "Powered by Google".

    Our Search for "Powered by Google"
    First we ran a Google search for "Powered by Google", restricted to one
    year and to English.

    At the time of writing this article there were 713,000 results indicated in
    the Google information bar.

    Next we started working our way down the list to see each of the results. We
    wanted to be sure that a Google WebSearch box was on each of the sites listed.
    We excluded any pages in the Google domain and any sites that displayed
    "Translation Powered by Google".

    NOTE THIS: Even though Google reports 713,000 results for this search, you
    and I can only see 998 of them. Google will not serve any more result pages
    than that for this particular search. The other 712,002 results exist
    according to the count, but we cannot see them!

    AllTheWeb Behavior
    We turned to see whether AllTheWeb, a search engine in competition with
    Google, would reveal anything more. Our search in AllTheWeb was a similar search:

    search AND "powered by google" ANDNOT ""

    Even though AllTheWeb reports only 477,231 results in its database, it
    actually displays more of them to us than Google does. You are able to see
    4,000 of the 477,231 possible results. We actually learnt more about the
    "Powered by Google" phrase by researching the results from AllTheWeb than we
    did from the Google results page. There was more to research...
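    To put the two sets of figures side by side, here is a minimal sketch in
    Python. The counts are the ones quoted above; the function name
    visible_share and the engines table are only illustrative names, not
    anything either search engine provides.

```python
def visible_share(reported: int, viewable: int) -> float:
    """Return the percentage of reported results that can actually be viewed."""
    return 100.0 * viewable / reported


# Figures quoted in the article: (results reported, results we could view).
engines = {
    "Google": (713_000, 998),
    "AllTheWeb": (477_231, 4_000),
}

for name, (reported, viewable) in engines.items():
    hidden = reported - viewable  # results counted but never served
    share = visible_share(reported, viewable)
    print(f"{name}: {viewable:,} of {reported:,} viewable "
          f"({share:.2f}%), {hidden:,} hidden")
```

    On these numbers Google shows roughly 0.14% of what it reports and
    AllTheWeb roughly 0.84%, so in both cases well under one percent of the
    reported results can be inspected.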

