This proposal summary contains excerpts that identify the deliverables for our work. Most of the text is gray, to allow for color annotations that indicate the status of specific items.
This proposal focuses on the concept of ibots, interface agents that interact with software applications through the graphical user interface, in essentially the same way that human users do. The proposal will produce the following results:
This taxonomy will help us build a cognitive framework, with a strong visual component, for understanding tool use. Drawing on research in cognitive modeling and computational vision, we will develop an end-to-end cognitive model that exhibits visually guided tool-using behavior in the user interface [P:2002a.] (We have extended this goal to consideration of agents in simulated physical environments; see Note 1.)
. . .Our work will be applied to two specific problems: the generation of a testbed for the evaluation of AI planners, and the construction of sensors and effectors to allow computational cognitive models to interact directly with off-the-shelf software.
Domains.
. . .We propose to remedy these problems with a set of tools for defining and evaluating domains. Addressing the first and second problems requires building an interactive object specification tool, one that allows a developer to build symbolic and numerical descriptions of the objects in an application interface, based on existing and newly defined features, and to adjust the representation until it reaches an appropriate level of detail. Addressing the third and fourth problems entails internal changes to the image processing module to allow for user input in object identification [software release (pattern/object definition interfaces); Ajay: thesis in preparation (PDDL translator).] The issue here is management of SegMan's object recognition knowledge base, such that it can be extended without disproportionate effort.
Control.
SegMan is effective largely because its design can exploit the strong correspondences between planning assumptions and restrictions on the dynamic behavior of the user interface. Thus if a controller requires that the environment be discrete, deterministic, static, accessible, and so forth, as many theoretically motivated planners do, this does not impose an impossible burden on SegMan; most applications provide such an environment most of the time. These assumptions, however, do not hold universally. An interactive application may break a general guideline for many reasons. User interfaces often have variable response times; this can result in non-deterministic behavior for a controller whose execution cycle is too short. Exogenous events sometimes occur, in the form of notifications that new mail has arrived or a transient system error has been detected. State information, such as whether the paste buffer holds data, may not be directly accessible. (Even worse, some dynamic applications, such as games, work in direct opposition to many of the most common design guidelines for business and productivity applications, but these are beyond the scope of our proposed work.)
We propose to modularize the controller
interface so that its internal interaction strategies can be activated
and deactivated on demand, depending on the capabilities of a given
controller. [Sameer: thesis in
preparation.] This will allow for the selective relaxing
of assumptions a controller makes about its environment.
In addition, rather than following an ad
hoc development process, we further propose to adopt a general
modeling approach to representing interaction with the interface, in
which restrictions on the environment are not assumptions, but rather
parameters of the model.
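As a minimal sketch of what treating restrictions as parameters could look like, the Python fragment below records the environment properties named above as explicit fields and selects which interaction strategies to activate for a given controller. The class name, field names, and strategy names are all hypothetical illustrations; none of them are taken from SegMan.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentModel:
    """Restrictions on the environment, held as parameters (hypothetical)."""
    deterministic: bool = True  # actions always have the same outcome
    static: bool = True         # no exogenous events such as notifications
    accessible: bool = True     # all relevant state is directly observable


# Hypothetical strategy names, each tied to the property it compensates for.
STRATEGIES = {
    "wait_for_stable_screen": "deterministic",
    "dismiss_notifications": "static",
    "probe_hidden_state": "accessible",
}


def active_strategies(controller_requires: EnvironmentModel,
                      interface_provides: EnvironmentModel) -> list[str]:
    """Return the strategies to activate for this controller.

    A strategy is activated only when the controller requires a property
    that the application interface does not provide on its own.
    """
    active = []
    for name, prop in STRATEGIES.items():
        if getattr(controller_requires, prop) and not getattr(interface_provides, prop):
            active.append(name)
    return active
```

Under these assumptions, a classical planner that requires every property would have all compensating strategies switched on for a noisy application, while a controller that tolerates non-determinism and exogenous events on its own would run with them switched off.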
We propose to build an MDP-based representation of agent/user
interface interaction, to support a clear understanding of the
relationship between a controller and SegMan's controller
interface. (See Note 2. Nevertheless, we have made some progress toward a more
restricted model for a related problem: identifying efficient mappings
between interface controls and low-level user actions across platforms
[Clarence: P:2003b].) This work
will add a new chapter to research on formal models of human-computer
interaction, as well as extending past work on the use of Markov
models to describe user interaction in dynamic environments.
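To illustrate what an MDP-based representation of interface interaction might involve, the following Python sketch encodes a toy menu-selection task and solves it with value iteration. The states, actions, transition probabilities, and step cost are invented for illustration only; the 0.1 probability of a click having no visible effect stands in for the variable response times mentioned earlier, and nothing here reflects the actual SegMan controller interface.

```python
# Hypothetical toy MDP: each state maps actions to (probability, next_state) pairs.
TRANSITIONS = {
    "menu_closed": {
        "click_menu": [(0.9, "menu_open"), (0.1, "menu_closed")],  # slow response
        "click_item": [(1.0, "menu_closed")],                      # nothing to click
    },
    "menu_open": {
        "click_item": [(1.0, "done")],
        "click_menu": [(1.0, "menu_closed")],
    },
    "done": {},  # terminal state, no actions
}
STEP_COST = -1.0  # every action costs one time step
GAMMA = 1.0       # undiscounted; the task ends with probability 1


def q_value(state, action, values):
    """Expected return of taking `action` in `state` under `values`."""
    return STEP_COST + GAMMA * sum(
        p * values[nxt] for p, nxt in TRANSITIONS[state][action]
    )


def value_iteration(tol=1e-9):
    """Repeat Bellman backups until the largest change falls below `tol`."""
    values = {s: 0.0 for s in TRANSITIONS}
    while True:
        delta = 0.0
        for state, actions in TRANSITIONS.items():
            if not actions:
                continue  # terminal states keep value 0
            best = max(q_value(state, a, values) for a in actions)
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            return values


def greedy_policy(values):
    """Pick the action with the highest expected return in each state."""
    return {
        s: max(actions, key=lambda a: q_value(s, a, values))
        for s, actions in TRANSITIONS.items() if actions
    }
```

In this toy model the expected cost from the closed-menu state works out to 19/9 steps, slightly more than the two steps a deterministic model would predict, which is the kind of difference a probabilistic representation makes explicit.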
Evaluation.
Our proposed extensions to Aide involve two components:
A planning testbed.
Our proposed work capitalizes on recent movement in the planning community toward common standards for domain and plan representation. Extensive use over a long period of time by planning researchers would be the most compelling evidence for the value of such a testbed. Nevertheless, in the shorter term, informal evaluation methods can give us important feedback as to its effectiveness. Briefly, our plan involves the following:
An interface execution system for cognitive models.
Our proposed work will allow researchers to automatically develop realistic input scenarios, to evaluate the ecological validity of models with respect to real-world applications, and in general to treat cognitive modeling as a tool for user interface exploration, expanding the current boundaries of experimental practice.
A taxonomy of tool use in the interface.
Our goal will be to develop a conceptual framework based on these informal (and even conflicting) characterizations, in which we can describe and differentiate specific agent activities in the interface as examples of tool use. More specifically, we propose to construct a taxonomy to describe tool-related behavior in the user interface [P:2002b, P:2002f, P:2003a.]
. . .Our work will flesh out this brief description to cover a much broader range of activities in the user interface. With widespread consistency in interface controls and their functionality, we believe it is possible to build a relatively comprehensive taxonomy along these lines. This will provide the groundwork for a more difficult task: building a computational model that can represent and reproduce such tool use in the interface.
Cognitive models and tool use in the interface.
We see a natural correspondence between the task-oriented properties of this vision model and the interaction requirements for intelligent tool use. We propose to build a new component in SegMan to replicate the functionality of the current image processing module. This new component will constitute a cognitively plausible vision model, based on the high-level vision concerns briefly laid out above [Kunal: thesis in preparation.] (A spin-off from this effort has had implications for human-robot interaction: [P:2003c.]) The modular structure of the current SegMan will facilitate development; we expect that visual routines, for example, can be constructed from elemental operators based on simple combinations of the existing interpretation rules. One advantage we have over previous work is that SegMan supports the development of an end-to-end model of vision, from early vision processes to high-level vision. The vision models associated with current unified cognitive models mainly address higher-level processing and thus will have limited fidelity in this situation; we will have the chance to explore how cognitive processing depends on lower-level vision results. A visual routines approach is also attractive in that it supports what Chapman calls visually guided activity. We believe this will be key to a model of effective tool use.
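The idea of composing visual routines from elemental operators can be sketched roughly as follows. The Python fragment defines three small operators over a binary pixel grid (indexing a salient location, spreading activation over a connected region, and measuring its extent) and chains them into one routine. The operator names and the grid representation are assumptions made for illustration; they are not SegMan's interpretation rules.

```python
from collections import deque


def index_first(grid):
    """Elemental operator: return the first active pixel in scan order."""
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value:
                return (r, c)
    return None


def bounded_activation(grid, start):
    """Elemental operator: spread activation over the 4-connected region."""
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
            if inside and grid[nr][nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen


def bounding_box(points):
    """Elemental operator: return (top, left, bottom, right) of a region."""
    rows = [r for r, _ in points]
    cols = [c for _, c in points]
    return (min(rows), min(cols), max(rows), max(cols))


def locate_first_object(grid):
    """Visual routine: index a location, activate its region, measure it."""
    start = index_first(grid)
    if start is None:
        return None
    return bounding_box(bounded_activation(grid, start))
```

A controller could call such a routine on demand, for example to find the extent of a button before moving the pointer to it, rather than requesting a full description of the screen.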
In addition to building a new model of vision in the SegMan substrate, we also propose to develop a controller that implements a cognitive model of tool use, relying on visual processing to guide its behavior [P:2002a; Thomas: thesis in preparation (tool simulation); Ergun: thesis in preparation (common-sense reasoning for tool use).] (See Note 4.) This model will need to accommodate a number of novel influences, including task context, work practice, and long-term behavioral policies. It will also require the ability to reason about ecological relationships between effectors, goals, and tools, at a level of detail deeper than usually considered in cognitive modeling (or agent planning) work. We expect to draw on current work with unified cognitive models and their close relatives, task analysis models, plus MDP modeling. . . Research in all of these areas addresses important issues for our work, in particular the relationship between sensing and acting.
Note 1: Our interest in tool use has extended
beyond the user interface to representations of tool use in the real
world. A technical report [TR:2002a]
gives a relatively detailed overview of our current thinking.
Ideally, we would like to build a detailed simulation of a physical
environment in which a simulated robotic agent can learn how to use
tools.
Note 2: We have attempted to develop a more
sophisticated controller interface, and we have experimented more
extensively with the current system. We have found that there is
insufficient variability in standard user interfaces to justify a
probabilistic model of user interaction. Behavior can be largely
deterministic, with fixups only rarely needed. We have considered an
MDP-based model for more dynamic interfaces, such as the driving game,
but a move in this direction means that we abandon almost all of the
"facilitating" properties we associate with user interfaces; it
becomes equivalent to a much-simplified vision and robotics domain.
Note 3: Ajay's preliminary testing has shown
that so-called "primitive-action" planning techniques [Wilkins and
desJardins, AI Magazine, 2001] are not sufficiently powerful to reason
about the hundreds of objects visible on the screen at one time. This
is not set in stone, however; we are continuing to test different
planners. An obvious solution to this potential problem, the addition
of a focusing mechanism, puts too much of the responsibility of
planning on an external system; what's left would not be of interest
in planning research. We might try to develop novel planning
techniques, possibly a knowledge-based planner, but this is too far
beyond the scope of the proposed work. Instead, we are pursuing these
alternatives:
Note 4: We have extended the scope of this
area of our work. Thomas's HabilisDraw work supports a relatively
high-level analysis of tool use, but of course has significant
differences from physical tools. With Thomas and Ergun's new efforts
to build a physical simulation system in which an agent can reason
about the properties of simulated physical tools, we should be able to
make much stronger connections.
Publications