Multimedia Sensor Fusion for Intelligent Camera Control and Human-Computer Interaction

Steven George Goodridge

A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering, Raleigh, NC, 1997


This dissertation presents a novel technique for pixel-level fusion of spatial sound and color information for the detection and tracking of human beings, coupled with a behavior-based camera control system. Such a system can be used for videoconferencing, surveillance, and human-computer interaction applications that require automatic camera control or machine perception of people. Binaural sound localization is used to assist in extracting faces from video images and to turn the camera toward sound sources outside the current field of view. A digital audio preprocessing technique for developing onset signals from each microphone is introduced. The time-domain cross-correlation between these onset signals provides a sharp peak at the interaural delay that corresponds to the target location. This peak is narrow enough to distinguish multiple speaker locations without the degree of ambiguity ordinarily present with cross-correlation methods. Onset correlation and skin color statistics are used jointly to classify pixels in the camera image as "talking face" or "background". Occasional classification and measurement errors are compensated for by Kalman filtering and reactive fuzzy control behaviors. The test system, which is based on an ordinary personal computer, can automatically point the camera at the person or persons speaking and adjust the camera zoom as appropriate.
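
The onset-correlation idea summarized above can be illustrated with a minimal sketch. The abstract does not specify the preprocessing, so the onset detector here (a half-wave rectified rise of a smoothed amplitude envelope) is an assumption made only for illustration, as are the function names, the 16 kHz sample rate, the 5 ms smoothing constant, and the 1 ms lag window. It is not the dissertation's implementation.

```python
import numpy as np

def onset_signal(x, fs, tau=0.005):
    """Hypothetical onset detector (an assumption, not the author's method):
    keep only the rising part of a smoothed amplitude envelope."""
    env = np.abs(x)
    alpha = 1.0 - np.exp(-1.0 / (tau * fs))  # one-pole low-pass coefficient
    smooth = np.empty(len(env))
    acc = 0.0
    for i, v in enumerate(env):
        acc += alpha * (v - acc)
        smooth[i] = acc
    rise = np.diff(smooth, prepend=smooth[0])
    return np.maximum(rise, 0.0)  # half-wave rectify: onsets only

def interaural_delay(left, right, fs, max_delay_s=0.001):
    """Time-domain cross-correlation of the two onset signals.
    Returns (lag in samples, delay in seconds); a positive lag means
    the sound reached the left microphone first."""
    on_l, on_r = onset_signal(left, fs), onset_signal(right, fs)
    n = len(on_l)
    max_lag = int(round(max_delay_s * fs))
    lags = np.arange(-max_lag, max_lag + 1)
    # corr[k] = sum over n of on_l[n] * on_r[n + k]
    corr = np.array([
        np.dot(on_l[max(0, -k): n - max(0, k)],
               on_r[max(0, k): n - max(0, -k)])
        for k in lags
    ])
    best = int(lags[np.argmax(corr)])
    return best, best / fs

# Synthetic test signal: three noise bursts separated by silence.
fs = 16000
rng = np.random.default_rng(0)
left = np.zeros(8000)
for start in (1000, 3500, 6000):
    left[start:start + 800] = rng.standard_normal(800)

# The right channel hears the same bursts 8 samples later.
true_lag = 8
right = np.concatenate([np.zeros(true_lag), left[:-true_lag]])

lag, delay = interaural_delay(left, right, fs)
print(f"estimated lag: {lag} samples ({delay * 1e3:.3f} ms)")
```

Because the onset signals are nonzero only where the envelope rises, their cross-correlation is concentrated near the true delay, which is the property the abstract relies on to separate multiple talkers. Mapping the delay to a bearing would additionally require the microphone spacing, which is not given here.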


Table of Contents