**Figure 50: Tracking in Three Dimensions**

The vision and sensor fusion techniques described in the previous chapters provide a measurement of target locations for each image frame. In its raw form, this information is of limited use for camera control because it is imprecise due to measurement noise; it may include false-positive detections of people; and it provides no association between new measurements and previous target locations. This makes it difficult to develop smooth camera control motions from the raw measurements. In addition, the data is in polar coordinates which complicates the task of associating the data with other sensors or devices in the room that may not share the same coordinate system. For these reasons, target measurements are converted to a global Cartesian coordinate system, associated with previously tracked targets, and used to update a filter/state estimator for each target track. Figure 50 illustrates the tracking of target positions in Cartesian coordinates. The crosshairs represent position estimates for the targets, and the ellipsoids represent the relative uncertainty of the estimated position in three dimensions.

The coordinates measured by the camera system must be transformed into Cartesian coordinates for tracking and data association. This is important for measuring the distance between targets and measurements, and for using state estimation techniques based on Newton's laws of motion. Each pixel location in a camera image represents a different azimuth and elevation with respect to the camera orientation. Adding these angles to the camera's pan and tilt position defines a line through the real world. One way to estimate the distance to the target is to consider the target's size in the image and incorporate a priori knowledge about the actual size of the object. This approach was used for determining the range to faces, assuming that most heads are about the same size. Since this distance calculation is highly sensitive to sensor noise and to variations in target size, the expected measurement error in depth is much larger than the expected error in the directions orthogonal to depth. If one computes the error covariance and plots a locus of points of equal probability of being the actual target location around the measured target position, one obtains an ellipsoid similar to those in Figure 50, with its major axis aligned along the vector pointing out from the camera. Other methods for calculating depth include using multiple orthogonally mounted cameras, or assuming that all targets lie within the same plane and measuring their positions from an overhead camera.
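As a sketch of this geometry, the conversion from a detection in the image to a global Cartesian position might look like the following. This assumes a pinhole camera with a small-angle pixel-to-angle mapping; every name and parameter here is illustrative, not the system's actual interface:

```python
import math

def measurement_to_cartesian(pan, tilt, px_offset_az, px_offset_el,
                             rad_per_px, head_px, head_m, focal_px):
    """Convert a camera detection to a Cartesian position.

    pan, tilt            -- camera orientation (radians)
    px_offset_az/el      -- target offset from the image center (pixels)
    rad_per_px           -- angular resolution of one pixel (radians)
    head_px, head_m      -- apparent head size (pixels) and assumed true size (m)
    focal_px             -- focal length expressed in pixels
    """
    # Pixel offsets add an azimuth/elevation angle to the pan/tilt position.
    azimuth = pan + px_offset_az * rad_per_px
    elevation = tilt + px_offset_el * rad_per_px
    # Range from apparent size: similar triangles with an assumed head size.
    rng = head_m * focal_px / head_px
    # Spherical-to-Cartesian conversion (x forward, y left, z up).
    x = rng * math.cos(elevation) * math.cos(azimuth)
    y = rng * math.cos(elevation) * math.sin(azimuth)
    z = rng * math.sin(elevation)
    return x, y, z
```

Because the range term depends on the noisy apparent size, its error dominates, which is what elongates the uncertainty ellipsoids of Figure 50 along the camera's viewing direction.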

For the tracking system to perform properly, the most likely of the measured potential target locations should be used to update the target's state estimator. This is generally known as the data association problem. The probability that a given measurement is correct is a function of the distance between the predicted state of the target and the measured state. Note that state is not limited to position; it may also include features such as color. This becomes especially important for targets that may approach or cross one another, such as people. Popular association algorithms for single-target applications are based on the following schemes:

*Nearest Neighbor*: This algorithm always updates the tracking filter with the measurement closest to the predicted state.

*Multi-Hypothesis Track Splitting*: This scheme creates a new hypothesis track for every measurement in the validation region, and prunes unlikely tracks using a likelihood ratio.

*Probabilistic Data Association*: Each measurement affects the tracking filter to a degree based on the probability that it is correct given the predicted state.

*Optimal Bayesian Filter*: This variation of Probabilistic Data Association splits multiple tracks, like the Multi-Hypothesis algorithm, and eliminates unlikely tracks.

For multiple sensors and multiple targets, the problem becomes increasingly complex. Common association algorithms are:

*Joint Likelihood*: This variation on the Multi-Hypothesis Track Splitting algorithm above extends to multiple tracks.

*Joint Probabilistic Data Association*: This algorithm updates the filter for each track based on a joint probability of association between the latest set of measurements and each track.

*Multiple Hypothesis Joint Probabilistic*: This variation of the Optimal Bayesian Filter uses joint probabilities among multiple track associations for multiple hypotheses. It is by far the most computationally complex of these algorithms, and requires intelligent pruning techniques. The underlying assignment problem is NP-complete, which provides considerable incentive to find non-exhaustive ways to search the space of possible associations that maximize the joint probability. One data association optimization technique based on iterative relaxation is presented by Pattipati et al. in [114].

Some comparisons of target tracking data association techniques are provided by Drummond [115], Bar-Shalom and Fortmann [117], and Deb et al. [116].
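As a concrete illustration of the simplest of these schemes, a greedy nearest-neighbor association step with a validation gate might look like the following sketch (the function name, data layout, and gating threshold are all illustrative):

```python
def nearest_neighbor_associate(predictions, measurements, gate):
    """Greedy nearest-neighbor assignment with a validation gate.

    predictions  -- {track_id: (x, y, z)} predicted track positions
    measurements -- list of (x, y, z) measured positions
    gate         -- maximum squared distance for a valid match
    Returns ({track_id: measurement_index}, unmatched measurement indices).
    """
    assignments = {}
    used = set()
    for tid, p in predictions.items():
        best, best_d2 = None, gate
        for i, m in enumerate(measurements):
            if i in used:
                continue
            # Squared Euclidean distance between prediction and measurement.
            d2 = sum((a - b) ** 2 for a, b in zip(p, m))
            if d2 < best_d2:
                best, best_d2 = i, d2
        if best is not None:
            assignments[tid] = best
            used.add(best)
    unmatched = [i for i in range(len(measurements)) if i not in used]
    return assignments, unmatched
```

Unmatched measurements are candidates for new tracks, and tracks that repeatedly fail to match are candidates for deletion; the more elaborate schemes above replace the greedy hard assignment with likelihood-weighted or multi-hypothesis updates.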

For tracking the positions of people walking around a room for applications such as surveillance, image differencing is used to segment the person's body from the background, and the measured position is determined from room and camera geometry. For data association, both object position and color are exploited as state variables. This allows objects to cross directly in front of one another without losing track of which is which after they separate. One may measure the color histogram intersection, H(I,M), between each newly measured object and the previously detected target data using Swain and Ballard's histogram intersection technique [111]:

$$H(I,M) = \frac{\sum_{j=1}^{n} \min(I_j, M_j)}{\sum_{j=1}^{n} M_j}$$

where *Ij* is the *j*th color histogram bin of an object in the current frame, and *Mj* is the *j*th color bin of a tracked object in the previous frame. Sixty-four bins are currently used in each color histogram, which provides sufficient resolution to differentiate most colored objects, such as people's clothing, from one another.
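The intersection measure can be computed directly from the bin counts (a minimal sketch; the function name is illustrative):

```python
def histogram_intersection(I, M):
    """Swain & Ballard histogram intersection, normalized by the model histogram.

    I, M -- color histograms (equal-length sequences of bin counts)
    Returns a value in [0, 1]; 1 means the new histogram fully covers the model.
    """
    if len(I) != len(M):
        raise ValueError("histograms must have the same number of bins")
    total = sum(M)
    # Sum the per-bin overlap; normalize by the tracked object's total count.
    return sum(min(i, m) for i, m in zip(I, M)) / total if total else 0.0
```

In the system described here each histogram would have 64 bins; the measure is insensitive to small shifts in object position, which makes it useful for re-identifying people after occlusions.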

In order to obtain a distance metric for data association that incorporates both the histogram intersection and the position difference, we calculate the joint probability of these two measurements. Let us define *Xi,j* as the event that a detected object *i* is actually the previous object *j*, *Yi,j* as the value of the histogram intersection between objects *i* and *j*, and *Zi,j* as the distance between the position of object *i* and the predicted position of object *j*. One may express the probability of a correct match conditioned on the statistically independent measures of color and position as

$$P(X_{i,j} \mid Y_{i,j}, Z_{i,j}) = \frac{P(Y_{i,j} \mid X_{i,j})\,P(Z_{i,j} \mid X_{i,j})\,P(X_{i,j})}{P(Y_{i,j}, Z_{i,j})}$$

This probability may be incorporated into association/tracking algorithms such as nearest-neighbor, joint probabilistic data association, and multi-hypothesis track splitting. For person-tracking, the color/position metric has proven good enough for a simple winner-take-all nearest-neighbor data association scheme. If one assumes equal prior probabilities for all *Xi,j*, one may simplify the nearest-neighbor decision process to one that seeks to maximize the value *Fy*(*Yi,j*)*Fz*(*Zi,j*), where

and

*Fy*(*Yi,j*) and *Fz*(*Zi,j*) are monotonically decreasing functions of the color histogram mismatch and the position difference, respectively. These functions may be generated by taking statistics on typical color differences and position distances between the same and different objects in successive frames. By fitting the data to a parameterized model, such as a decaying exponential curve, a general matching function with appropriate weights for color and position may be obtained.
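Under the decaying-exponential model suggested above, a combined matching score might be computed as follows. The decay constants here are purely hypothetical placeholders for values that would be fit to recorded statistics:

```python
import math

def match_score(hist_intersection, distance,
                color_decay=4.0, position_decay=1.5):
    """Combined matching score F_y * F_z for winner-take-all association.

    hist_intersection -- color histogram intersection in [0, 1]
    distance          -- distance between measurement and predicted position
    The decay rates are hypothetical; in practice they would be fit to
    statistics of same-object vs. different-object frame pairs.
    """
    # F_y decays as the color match worsens (1 - intersection grows).
    f_y = math.exp(-color_decay * (1.0 - hist_intersection))
    # F_z decays as the position difference grows.
    f_z = math.exp(-position_decay * distance)
    return f_y * f_z
```

The candidate pairing with the highest score wins the assignment; tuning the two decay rates sets the relative weight given to color versus position.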

When tracking faces, similar methods may be used for data association. However, since most faces are quite similar in color, and intensity may vary as the person moves through different lighting angles, position is the most reliable state variable. For this dissertation project, only position information was used for data association between detected faces and targets being tracked. Starting from the estimated measurement error covariance in polar coordinates, the covariance matrix was transformed to Cartesian coordinates (linear assumptions were made because the angle errors are very small), and the Mahalanobis distance was used to measure the distance between new measurements and previously tracked targets. For this project, a simple nearest-neighbor assignment policy was used for target measurement updates.
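The two steps just described, linearized covariance transformation and Mahalanobis distance, can be sketched with NumPy (function names are illustrative; J denotes the Jacobian of the polar-to-Cartesian mapping at the measured position):

```python
import numpy as np

def polar_cov_to_cartesian(J, cov_polar):
    """Linearized transform of a (range, azimuth, elevation) covariance.

    J is the Jacobian of the polar-to-Cartesian mapping evaluated at the
    measured position; the linearization is valid for small angle errors.
    """
    return J @ cov_polar @ J.T

def mahalanobis(measurement, predicted, cov_cartesian):
    """Mahalanobis distance between a new measurement and a predicted
    target position, using the Cartesian measurement error covariance."""
    d = np.asarray(measurement, float) - np.asarray(predicted, float)
    return float(np.sqrt(d @ np.linalg.inv(cov_cartesian) @ d))
```

Because the covariance is elongated along the depth direction, the Mahalanobis distance discounts depth errors relative to errors orthogonal to the viewing direction, matching the ellipsoids of Figure 50.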

In target tracking applications, the most popular methods for updating target positions incorporate variations of the Kalman filter/state estimator [117,118,119]. The Kalman filter assumes that the dynamics of the target can be modeled, and that noise affecting the target dynamics and sensor data is stationary and zero mean. In cases where the target is actively maneuvering, the plant disturbance is not zero mean, and the performance of the Kalman filter degrades. To compensate, it is important to minimize sensor noise, such that the sensor data gains will be higher and the reliance on the model dynamics will be reduced. This is of considerable importance when tracking people, whose erratic movements are poorly matched to any model of more than second order.

Since the face is attached to the rest of the human body, its dynamics are directly related to those of the body. Head movements may be thought of as zero-mean random fluctuations with respect to the body position, so the same state model used for tracking the entire body has also been used for tracking head position. A human being has a complex locomotion system, which poses serious challenges in modeling. Rather than analyzing properties of the human gait, one may assume a drastically simplified model of the target as a mass under the influence of two forces: average leg force and friction. Leg force pushes the body into motion, and friction arises from the inefficiency of the human body in maintaining this motion. Intra-stride dynamics and rotation of the body are neglected; the body is assumed to move in any direction, constrained only by leg force and inertia. A block diagram of the target model is given in Figure 51.

**Figure 51: Model of Target Dynamics**

This model is generic for any target mass with linear friction and force response. Actual forces and movements of the system are in three dimensions, but for now only one dimension of motion, x, will be considered. A target mass of m = 72.57 kg (160 lbs) and a friction coefficient of b = 100 N/(m/s) were assumed. This gives the continuous state-space matrices

$$A = \begin{bmatrix} 0 & 1 \\ 0 & -b/m \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 0 & -1.378 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 \\ 1/m \end{bmatrix} = \begin{bmatrix} 0 \\ 0.01378 \end{bmatrix}$$

where

$$x(t) = \begin{bmatrix} \text{position} \\ \text{velocity} \end{bmatrix}, \qquad \dot{x}(t) = A\,x(t) + B\,u(t)$$

For the discrete computations of the Kalman filter, one needs a discrete state-space representation of the target, calculated for a sample rate of 5 Hz (T = 0.2 s) as:

$$\Phi = \begin{bmatrix} 1 & 0.1748 \\ 0 & 0.7591 \end{bmatrix}, \qquad \Gamma = \begin{bmatrix} 0.000252 \\ 0.002409 \end{bmatrix}, \qquad x(k+1) = \Phi\,x(k) + \Gamma\,u(k)$$
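For reference, this zero-order-hold discretization can be reproduced with a few lines of NumPy, using the closed-form solution available for this two-state model (a sketch; only the mass, friction coefficient, and sample rate are taken from the text):

```python
import numpy as np

m, b, T = 72.57, 100.0, 0.2       # mass (kg), friction (N/(m/s)), sample period (s)
a = b / m                         # friction divided by mass

# Continuous model: xdot = A x + B u, state x = [position, velocity]
A = np.array([[0.0, 1.0],
              [0.0, -a]])
B = np.array([[0.0],
              [1.0 / m]])

# Exact zero-order-hold discretization (closed form for this 2-state model):
# Phi = exp(A T), Gamma = integral_0^T exp(A tau) B dtau
phi22 = np.exp(-a * T)
Phi = np.array([[1.0, (1.0 - phi22) / a],
                [0.0, phi22]])
Gamma = np.array([[(T - (1.0 - phi22) / a) / b],
                  [(1.0 - phi22) / b]])
```

The same matrices could be obtained numerically with a matrix exponential; the closed form is used here only because the model is small enough to integrate by hand.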

The objective of using this model is to remove measurement noise with a Kalman filter/state estimator. The Kalman filter used in this system is identical to a current observer, except that the feedback gain vector G varies with time and is calculated to provide optimally low error. This optimal solution incorporates the target model, the state disturbances, and estimates of the sensor noise variance. Figure 52 shows the model of the target including the state disturbance noise, W(k), and the sensor noise, V(k).

**Figure 52: Noise Entering the Plant**

Note that in this application, the input U(k), or leg force, cannot be measured by the system, and must be treated as part of the disturbance. Disturbance models for the Kalman filter assume stationary, zero-mean, Gaussian noise distributions. While the human leg force may not be stationary, the Kalman filter may still compensate for its effects. For indoor applications, it is rare that a human being will accelerate faster than 0.3 m/s². For the 72.57 kg model, this would require a 21.8 N force. Thus the variance of the input force is estimated to be approximately (21.8)² ≈ 474 N². For visual sensor noise, the sensor covariance changes depending on the position and range of the speaker, and is recalculated for every target during each frame. The covariance of *W*(*k*), assumed to be constant, shall be assigned to the variable *Rw*, and the covariance of *V*(*k*) shall be assigned to the variable *Rv*(*k*).

This Kalman filter is based on a current observer state estimator that provides an estimate, *q*(*k*), of the current system state *x*(*k*), as well as a prediction, *q*~(*k*+1), of the state at sample *k*+1. From [120], the filter equations are

$$q(k) = \tilde{q}(k) + G(k)\left[\,y(k) - H\,\tilde{q}(k)\,\right]$$

$$\tilde{q}(k+1) = \Phi\,q(k)$$

Kalman filter design develops the observer sensor feedback matrix *G*(*k*) such that the values of *G*(*k*) lead to an optimal estimator, in which the expected values of the squared estimation errors are minimized. The determination of *G*(*k*) is recursive, and must be calculated at run-time for this application since the sensor covariance *Rv*(*k*) changes depending on the target position. From [120], the following equations are used to find *G*(*k*):

$$G(k) = M(k)\,H^{T}\left[\,H\,M(k)\,H^{T} + R_v(k)\,\right]^{-1}$$

$$P(k) = \left[\,I - G(k)\,H\,\right]M(k)$$

$$M(k+1) = \Phi\,P(k)\,\Phi^{T} + B_1\,R_w\,B_1^{T}$$

where *M*(*k*) is the covariance of the prediction errors, *P*(*k*) is the covariance of the estimation errors, and *B1* = *B*. When a new target is detected and its tracked path is initialized, the values of *q*(*k*) and *q*~(*k*) are set equal to the current sensor measurement, and *M*(*k*) is set equal to the identity matrix.
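The recursion just described can be sketched in a few lines of NumPy. Variable names follow the text; the measurement matrix H and the treatment of the unmeasured input as pure disturbance are assumptions of the sketch:

```python
import numpy as np

def kalman_step(q_pred, M, y, Phi, H, Rw, Rv, B1):
    """One step of the time-varying Kalman filter / current observer.

    q_pred -- predicted state for this sample, q~(k)
    M      -- covariance of the prediction errors, M(k)
    y      -- new measurement for this frame
    Rw, Rv -- disturbance covariance (constant) and per-frame sensor covariance
    Returns the estimate q(k), the prediction q~(k+1), and M(k+1).
    """
    # Optimal feedback gain G(k), recomputed every frame since Rv varies.
    G = M @ H.T @ np.linalg.inv(H @ M @ H.T + Rv)
    # Current estimate: correct the prediction with the measurement residual.
    q = q_pred + G @ (y - H @ q_pred)
    # Estimation error covariance P(k).
    P = (np.eye(M.shape[0]) - G @ H) @ M
    # Predict ahead; the leg force u is unmeasured and folded into Rw.
    q_next = Phi @ q
    M_next = Phi @ P @ Phi.T + B1 @ Rw @ B1.T
    return q, q_next, M_next
```

Larger *Rv*(*k*) (a distant, poorly measured target) shrinks G and makes the filter lean on the model; smaller *Rv*(*k*) lets the measurements dominate.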

If the measurement error for each dimension of movement (x, y, and z) were statistically independent, then a separate Kalman filter state estimator could be used for each dimension. Unfortunately, the errors are not statistically independent, primarily because errors in depth perception in polar coordinates produce measurement errors that typically lie along a diagonal when transformed to Cartesian coordinates. For this reason, the three dimensions must be combined in a single state vector, increasing the size of the vectors and matrices that make up the filter. The new state representation becomes:

$$x(k) = \begin{bmatrix} x & \dot{x} & y & \dot{y} & z & \dot{z} \end{bmatrix}^{T}, \qquad \Phi_{3} = \begin{bmatrix} \Phi & 0 & 0 \\ 0 & \Phi & 0 \\ 0 & 0 & \Phi \end{bmatrix}$$

With this increase in dimensionality, *P*(*k*) and
*M*(*k*) become 6 x 6 matrices, and *Rw* and
*Rv*(*k*) become 3 x 3 matrices.
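Since each spatial dimension shares the same 1-D dynamics, the enlarged matrices can be assembled with a Kronecker product. This is a sketch: the state ordering [x, ẋ, y, ẏ, z, ż] and the rounded numeric values are assumptions carried over from the 1-D model:

```python
import numpy as np

# 1-D discrete model (position, velocity); rounded values from the 5 Hz model.
Phi1 = np.array([[1.0, 0.175],
                 [0.0, 0.759]])

# Replicate the 1-D model on the block diagonal for x, y, and z,
# assuming the state ordering [x, xdot, y, ydot, z, zdot].
Phi3 = np.kron(np.eye(3), Phi1)                    # 6 x 6 transition matrix
H3 = np.kron(np.eye(3), np.array([[1.0, 0.0]]))    # 3 x 6: positions measured
```

With this structure, *P*(*k*) and *M*(*k*) are 6 x 6 and the 3 x 3 *Rv*(*k*) carries the cross-dimension correlations introduced by the depth error.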

**Figure 53: Tracking a Person Walking**

**Figure 54: Position Estimate while Tracking a Face**

**Figure 55: Target Tracking of a Face in 3D**

Figure 53 shows the Kalman filter tracking a person walking. The rectangle shows the bounding box around the pixels extracted from the background, and the crosshairs show the projection of the estimated target state onto the current view. Here, color histogram data and position are used to associate regions measured by the vision system with active tracks. New tracks are created when new visual targets do not match any existing track with more than an acceptable threshold of probability, and tracks that have not been matched to new target data within an arbitrary time limit are deleted. An overview of applicable track initiation and pruning schemes is given in [117]. Figure 54 shows a screen shot of the computer program while tracking a face. The rectangle around the face indicates the largest region of "noisy face pixels," the red crosshairs mark the position estimate from the Kalman filter, and the green vertical line indicates the angle of peak sound intensity. Kalman filter output results for tracking a face in three dimensions are given in Figure 55. The Kalman filter suppresses sensor noise as well as occasional incorrect associations of targets.

The target tracking methods described in this chapter allow noisy sensor measurements to be combined into more reliable estimates of target positions. The model of human motion dynamics in Cartesian coordinates provides the basis for filtering and smoothing the sensor data. Target position data may then be used for camera control or for intelligent room applications. However, the question of which target to follow, and how to look for targets, still remains. In the next chapter, the foundations for behavior-based control will be presented.