From: Anders Sandberg (asa@nada.kth.se)
Date: Sun Apr 27 2003 - 03:42:54 MDT
On Sun, Apr 27, 2003 at 04:07:55AM -0400, Nathanael Allison wrote:
>
> I recently searched for just basic world statistics on the net. I searched
> for things like land mass, dollars per person (GNP), and population for
> countries. I found that there were major differences in basic numbers. For
> instance in the top 5 countries for land mass I came across 3 different
> orders. Economy statistics were different on the 5 different sites I
> checked out.
I guess one common source of errors is data from different times: GNP
and population change. Measured land mass is also surprisingly
nonconstant; in coastal areas it can change quite a bit. Even worse, it
often depends on how you count, and many economic and population
figures are just estimates - which can be heavily biased by
governments and organizations selecting the "right" way of measuring (or
just picking a particular approach with no real intent).
And people copy bad statistics. Bjorn Lomborg did a good job of showing
just how skewed environmental statistics can become through selective
re-telling, and once you have a dramatic statement like "We are losing
forest at the rate of X football fields per second" or "We only use 10%
of our brains", it spreads rapidly. Scientific papers commonly refer to
other papers the authors have never read, simply copying the references
from other papers. And so on.
> What can be done about this? Is there programs that could either change the
> info or tell the page/page owner that the info is wrong. Is this something
> that will be around until AI can deal with it?
If you come across a particularly specific or bad factoid, you could
always search for it and email the pages carrying it. This could even be
automated, but you would likely end up spamming people who merely refer
to it (a search for "we only use 10% of our brains" gave 211 hits on
Google, and every hit on the first page debunked it). A more specific
number like "China GNP 4389 billion" might work better, but it is still
error-prone.
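Something like this toy sketch shows the idea of an automated factoid
flagger - the factoid list and correction notes are invented for
illustration, not a curated dataset:

```python
import re

# A couple of known-bad factoids paired with correction notes.
# These entries are made up for illustration.
BAD_FACTOIDS = {
    r"we only use 10% of our brains": "the 10%-of-the-brain myth",
    r"losing forest at the rate of \d+ football fields": "dubious deforestation figure",
}

def flag_factoids(page_text):
    """Return the notes for every known-bad factoid found in the page."""
    lowered = page_text.lower()
    return [note for pattern, note in BAD_FACTOIDS.items()
            if re.search(pattern, lowered)]

print(flag_factoids("Did you know we only use 10% of our brains?"))
```

Of course, this only catches verbatim restatements - which is exactly
the spamming-the-debunkers problem above.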
I wonder whether even AI could deal with it. As discussed below, many
statements are semantically fuzzy. I might cite an erroneous number on a
page where the exact number doesn't matter ("Even if China has billions
of citizens, it doesn't make it a democracy" - China does not have more
than 2 billion inhabitants, but that doesn't change the meaning of the
reasoning). An AI must understand such loose usage, and when the writer
is confused or writes badly (perhaps in a second language) it can be
practically impossible to handle.
I am reminded of Babbage's letter to Tennyson:
http://www.uh.edu/engines/epi879.htm
However, a system like the semantic web could make deliberate and exact
use of information more manageable. Also, by stating sources for
everything, the data can be debugged. If we had good tools for writing
such pages, I think automated factcheckers could at least point out that
certain facts are contentious.
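As a rough sketch of what "stating sources for everything" buys you: if
every claim carries its source, a checker can flag facts whose sourced
values disagree. All the figures and source names below are invented:

```python
# Each claim records a value together with its source; the numbers
# and sources are made up for illustration.
claims = [
    {"fact": "population of X", "value": 10200000, "source": "census 2001"},
    {"fact": "population of X", "value": 9800000,  "source": "UN estimate 2002"},
    {"fact": "land area of X",  "value": 450000,   "source": "national atlas"},
]

def contentious(claims, tolerance=0.02):
    """Flag facts whose sourced values disagree by more than the
    given relative tolerance."""
    by_fact = {}
    for c in claims:
        by_fact.setdefault(c["fact"], []).append(c["value"])
    flagged = []
    for fact, values in by_fact.items():
        lo, hi = min(values), max(values)
        if lo and (hi - lo) / lo > tolerance:
            flagged.append(fact)
    return flagged

print(contentious(claims))
```

The point is not the arithmetic but that source-tagged data makes
disagreement machine-detectable at all.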
> Why do search engines work so screwed up? I just know basic programming
> could someone explain to me why there are all these problems?
In what way are they screwed up?
I get the impression you want them to produce correct and
relevant information. But how do you tell them what is correct and
relevant? It is not a programming problem, it is a problem-definition
problem. And the things we as users want are usually extremely
ill-defined.
Looking at my latest google searches, the first six are attempts at
finding a particular website by using variations of its title. Then
there is a name of a historical person that I wanted to understand a
reference to. Then a search for "billion trillion" to remind myself of
which is which in British and American English. Then there is a search
for a newspaper I was interviewed in. Then several book titles, where I
was looking for price estimates and reviews. And a search for another
journal, where I used a word in the title I knew would produce a correct
hit.
Now look at this from the perspective of the search engine. The first
group was trying to match a title or header consisting of fairly common
words. The second is a search for a name, which ideally should give a
bio or homepage. Then there was a search for two terms that should occur
together, but there was no indication of what kind of page was wanted -
it was just implicit that I wanted one with a linguistic comparison.
The newspaper search consisted of little more than "biotechnology
journal", and would have brought up journals with the literal name
"Biotechnology" rather than biotechnology journals in general (it wasn't
a very precise search). The book title searches had a common context
(considering buying them), so I wanted only certain kinds of pages - but
how could the engine know? And finally I asked for a very common word
("Register") because I knew it would dig up "The Register" - but from
the search engine point of view I might have been looking for the
National Register, where to register, the registers of the Z80 processor
or a registrar. And things get even worse when someone searches for
"How do I make bar graphs in matlab with error bars?" - unless there is
a page containing that exact sentence, it is hard to separate it from
pages with Matlab manuals that mention both bar graphs and error plots.
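The ambiguity of a bare term like "register" is easy to demonstrate with
a toy keyword matcher - the pages below are invented, and real engines
are vastly more sophisticated, but the core problem is the same:

```python
# Tiny made-up corpus: three very different pages, all containing
# the word "register".
pages = {
    "The Register (tech news)": "the register breaking tech news",
    "Z80 manual": "the z80 cpu register set and flags",
    "Town hall": "register here to vote in the election",
}

def keyword_search(query, pages):
    """Rank pages by how many query words they contain."""
    words = set(query.lower().split())
    scores = {title: len(words & set(text.split()))
              for title, text in pages.items()}
    return sorted((t for t, s in scores.items() if s > 0),
                  key=lambda t: -scores[t])

print(keyword_search("register", pages))
```

All three pages score identically, so nothing in the query itself tells
the engine which one I meant.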
To sum up, even this kind of simple keyword-based searching is awfully
messy. I'm amazed at the skill of the people at Google and similar
places in making their engines dig up roughly the right pages.
But we need engines that can really search for semantics, and not just
clear semantics like "What is the GNP of China?" but also unclear and
confused semantics like "Journals with a biotech name that interviewed
me". That is a programming challenge.
--
-----------------------------------------------------------------------
Anders Sandberg                                      Towards Ascension!
asa@nada.kth.se                            http://www.nada.kth.se/~asa/
GCS/M/S/O d++ -p+ c++++ !l u+ e++ m++ s+/+ n--- h+/* f+ g+ w++ t+ r+ !y
This archive was generated by hypermail 2.1.5 : Sun Apr 27 2003 - 03:52:18 MDT