An area of computer science called Computational Learning Theory deals
with these issues, at least in part. Below is a description by
researcher John Case:
http://www.cis.udel.edu/~case/colt.html
Computational Learning Theory (COLT) is a branch of theoretical computer
science which mathematically studies the power of computer programs to
learn (algorithmic) rules for predicting things such as membership in a
concept or, as in the first example above, rules for how to generate a
sequence.
Besides the intrinsic scientific and philosophical interest, the expected
primary applications of COLT are to construction of intelligent
technology, especially technology which learns, and to cognitive
psychology, including understanding human language acquisition (brief
postscript bibliography available) and scientific inductive inference
(brief postscript bibliography available). COLT seeks to provide a
conceptual, mathematical infrastructure for these applied and basic areas.
End excerpt
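
To make "learning a rule for how to generate a sequence" a bit more
concrete, here is a minimal sketch of identification by enumeration, one
of the classic ideas in this area: after each new data point the learner
guesses the first rule in a fixed list that agrees with everything seen
so far. The candidate list HYPOTHESES and the function names are my own
illustrative assumptions, not something taken from Case's page, and a real
hypothesis space would be far larger.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical hypothesis space: (name, rule) pairs in a fixed order.
# Each rule maps an index n to the n-th term of a sequence.
HYPOTHESES: List[Tuple[str, Callable[[int], int]]] = [
    ("constant 0", lambda n: 0),
    ("n", lambda n: n),
    ("2n", lambda n: 2 * n),
    ("n squared", lambda n: n * n),
    ("2 to the n", lambda n: 2 ** n),
]


def guess_rule(prefix: Sequence[int]) -> Optional[str]:
    """Return the name of the first rule consistent with the prefix."""
    for name, rule in HYPOTHESES:
        if all(rule(i) == value for i, value in enumerate(prefix)):
            return name
    return None  # no rule in the list explains the data


def guesses_over_time(sequence: Sequence[int]) -> List[Optional[str]]:
    """Record the learner's guess after each additional term is seen."""
    return [guess_rule(sequence[:k]) for k in range(1, len(sequence) + 1)]


if __name__ == "__main__":
    # The guesses change at first, then settle on one rule.
    print(guesses_over_time([0, 1, 4, 9]))
```

If the true rule is somewhere in the list, the guesses stop changing
after finitely many terms, which is roughly what "learning in the limit"
means. The learner never knows for certain that it has converged.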
A repository containing some web-accessible articles:
eColt
http://ecolt.informatik.uni-dortmund.de/
Information about the COLT 1997 conference:
http://cswww.vuse.vanderbilt.edu/~mlccolt/colt97/index.html
Another conference:
Algorithmic Learning Theory
http://www.maruoka.ecei.tohoku.ac.jp/~alt97/
If you want to bootstrap a system using the "raw" input generated by
cameras and microphones, then I do not think the results in this area
are currently robust and complete enough. However, they may point the
way.
Gregory Sullivan