Re: Search Engines: Inaccuracy in Basic Information

From: Anders Sandberg (asa@nada.kth.se)
Date: Sun Apr 27 2003 - 03:42:54 MDT


    On Sun, Apr 27, 2003 at 04:07:55AM -0400, Nathanael Allison wrote:
    >
    > I recently searched for just basic world statistics on the net. I searched
    > for things like land mass, dollars per person (GNP), and population for
    > countries. I found that there were major differences in basic numbers. For
    > instance in the top 5 countries for land mass I came across 3 different
    > orders. Economy statistics were different on the 5 different sites I
    > checked out.

    I guess one common source of errors is data from different times. GNP
    and population change. Measuring land mass is also surprisingly
    nonconstant; in coastal areas it can change quite a bit. Even worse, it
    often depends on how you count, and many economic and population
    figures are just estimates - which can be heavily biased by
    governments and organizations selecting the "right" way of measuring (or
    just selecting a particular approach with no real intent).

    And people copy bad statistics. Bjorn Lomborg did a good job of showing
    just how skewed environmental statistics could be by selective
    re-telling, and once you have a dramatic statement like "We are losing
    forest at the rate of X football fields per second" or "We only use 10%
    of our brains" it rapidly spreads. Scientific papers commonly cite
    other papers the authors have never read, copying the references
    from other papers. And so on.

    > What can be done about this? Is there programs that could either change the
    > info or tell the page/page owner that the info is wrong. Is this something
    > that will be around until AI can deal with it?

    If you have a particularly specific or bad factoid, you could always
    search for it and email pages with it. It could even be automated, but
    you would likely be spamming people just referring to it (a search for
    "we only use 10% of our brains" gave 211 hits on google, and all hits on
    the first page debunked it). A more specific number "China GNP 4389
    billion" might work better, but it is still error-prone.
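
    A minimal sketch of why the automated version is error-prone: a page
    that contains the factoid may be repeating it or debunking it. The cue
    list, the window size and the function names below are assumptions
    made for illustration, and the heuristic only looks at the first
    occurrence of the phrase.

```python
import re

# Hypothetical cue words suggesting the page debunks rather than asserts.
DEBUNK_CUES = ("myth", "false", "debunk", "not true", "misconception")


def normalize(text):
    """Lowercase and collapse whitespace so phrase matching is less brittle."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def classify_page(page_text, factoid, window=80):
    """Return 'absent', 'asserts' or 'debunks' for a single page.

    A page counts as debunking if any cue word occurs within `window`
    characters of the first occurrence of the factoid. This is a crude
    heuristic, not real understanding of the text.
    """
    text = normalize(page_text)
    phrase = normalize(factoid)
    pos = text.find(phrase)
    if pos == -1:
        return "absent"
    start = max(0, pos - window)
    context = text[start:pos + len(phrase) + window]
    if any(cue in context for cue in DEBUNK_CUES):
        return "debunks"
    return "asserts"
```

    Only pages classified as "asserts" would be candidates for an email,
    and even then a human should probably look first: sarcasm, quotation
    and loose usage all defeat a keyword window like this one.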

    I wonder if AI could deal with it either. As discussed below, many
    statements are semantically fuzzy. I might state an erroneous number on a
    page, but the exact number doesn't matter ("Even if China has billions
    of citizens, it doesn't make it a democracy" - China does not have more
    than 2 billion inhabitants, but that doesn't change the meaning of the
    reasoning). The AI must understand such loose usage, and when the writer
    is stupid or writes confusedly (perhaps in a second language) it can be
    practically impossible to deal with it.

    I am reminded of Babbage's letter to Tennyson:
    http://www.uh.edu/engines/epi879.htm

    However, a system like the semantic web could make deliberate and
    exact use of information more manageable. Also, if sources are stated
    for everything, the data can be debugged. If we had good
    tools for writing such pages, then I think automated factcheckers could
    be used at least to point out that certain facts are contentious.
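
    As a rough illustration of such a factchecker, assume every claim is
    published as a (subject, property, value, source) tuple. That layout,
    the function name and the 5% tolerance are all assumptions of this
    sketch, not part of any semantic web standard, and values are assumed
    to be positive numbers.

```python
from collections import defaultdict


def find_contentious(claims, rel_tolerance=0.05):
    """Flag facts whose sourced values disagree by more than a tolerance.

    `claims` is an iterable of (subject, property, value, source) tuples.
    A fact is flagged when (max - min) / max exceeds `rel_tolerance`.
    Returns a dict mapping (subject, property) to the sorted list of
    (value, source) pairs, so the disagreement can be traced to sources.
    """
    grouped = defaultdict(list)
    for subject, prop, value, source in claims:
        grouped[(subject, prop)].append((value, source))

    contentious = {}
    for key, entries in grouped.items():
        values = [value for value, _ in entries]
        lo, hi = min(values), max(values)
        if hi != 0 and (hi - lo) / abs(hi) > rel_tolerance:
            contentious[key] = sorted(entries)
    return contentious
```

    Note that this only points out disagreement. It says nothing about
    which source is right, or whether the sources measured the same thing
    at the same time.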

    > Why do search engines work so screwed up? I just know basic programming
    > could someone explain to me why there are all these problems?

    In what way are they screwed up?

    I get the impression you want them to produce correct and
    relevant information. But how do you tell them what is correct and
    relevant? It is not a programming problem, it is a problem definition
    problem. And the things we as users want are usually extremely ill
    defined.

    Looking at my latest google searches, the first six are attempts at
    finding a particular website by using variations of its title. Then
    there is a name of a historical person that I wanted to understand a
    reference to. Then a search for "billion trillion" to remind myself of
    which is which in British and American English. Then there is a search
    for a newspaper I was interviewed in. Then several book titles, where I
    was looking for price estimates and reviews. And a search for another
    journal, where I used a word in the title I knew would produce a correct
    hit.

    Now look at this from the perspective of the search engine. The first
    group was trying to match a title or header consisting of fairly common
    words. The second is a search for a name, which ideally should give a
    bio or homepage. Then there was a search for two terms that should occur
    together, but there was no indication of what kind of page was wanted -
    it was just implicit that I wanted one with a linguistic comparison.
    The newspaper search consisted of little more than "biotechnology
    journal", and should have brought up journals with the literal name
    "Biotechnology" rather than biotechnology journals (it wasn't a very
    precise search). The book title searches had a common context
    (considering buying them), so I wanted only certain kinds of pages - but
    how could the engine know? And finally I asked for a very common word
    ("Register") because I knew it would dig up "The Register" - but from
    the search engine's point of view I might have been looking for the
    National Register, where to register, the registers of the Z80 processor
    or a registrar. And things get even worse when someone searches for
    "How do I make bar graphs in matlab with error bars?" - unless there is
    a page with that sentence it is hard to separate it from pages with
    Matlab manuals listing both bar graphs and errorplots.
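
    The Matlab case can be sketched with a toy keyword scorer. The stop
    word list, the scoring rule and both sample pages are invented for
    illustration; real engines are far more elaborate, but the ambiguity
    is the same.

```python
import re

# A small, hypothetical stop-word list.
STOP_WORDS = {"how", "do", "i", "in", "with", "the", "a", "to", "and", "of", "for"}


def keywords(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {word for word in words if word not in STOP_WORDS}


def keyword_score(query, page):
    """Fraction of the query's keywords that appear anywhere on the page."""
    query_words = keywords(query)
    if not query_words:
        return 0.0
    return len(query_words & keywords(page)) / len(query_words)


query = "How do I make bar graphs in matlab with error bars?"

# Two invented pages: one answers the question, one is just a topic index.
howto_page = ("To make bar graphs in Matlab with error bars, draw the bar "
              "graph first and then add the error bars.")
manual_page = ("Matlab manual index: make plots, bar graphs, error bars, "
               "line styles, axes, legends.")

print(keyword_score(query, howto_page))
print(keyword_score(query, manual_page))
```

    Both pages contain every keyword of the query, so the scorer cannot
    tell the answer from the index. Telling them apart needs some notion
    of what the question means.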

    To sum up, even this kind of simple keyword-based searching is awfully
    messy. I'm amazed by the skill of the people at Google and similar
    places in making their engines able to dig up roughly the right pages.
    But we need engines that can really search for semantics, and not just
    clear semantics like "What is the GNP of China?" but also unclear and
    confused semantics like "Journals with a biotech name that interviewed
    me". That is a programming challenge.

    -- 
    -----------------------------------------------------------------------
    Anders Sandberg                                      Towards Ascension!
    asa@nada.kth.se                            http://www.nada.kth.se/~asa/
    GCS/M/S/O d++ -p+ c++++ !l u+ e++ m++ s+/+ n--- h+/* f+ g+ w++ t+ r+ !y
    

