by Mark O. Riedl and Rob St. Amant
In this document you should be able to find enough information to use and to build on the SegMan perceptual substrate. The system described in this document can be downloaded from this URL: http://www.csc.ncsu.edu/faculty/stamant/code/segman-03-10-14.zip.
1. Overview
2. Architecture
3. Segmentation fundamentals
3.1. Pixel neighbors
3.2. Pixel patterns
3.3. Pixel pattern definition
3.4. A pixel pattern definition utility
4. Feature recognition
5. Memory management
6. Description of files
6.1. segman.dll and the *.cpp files in src/
6.2. foreign-interface.lisp
6.3. wrappers.lisp
6.4. segmentation.lisp
6.5. segman.lisp
6.6. state.lisp
SegMan is a perceptual substrate that uses computational vision to "see" the Microsoft Windows graphical direct-manipulation interface. SegMan enables other programs to see the graphical interface screen as a human would see it. This lets programs interact with Microsoft Windows as if they were a user sitting at the console instead of relying on low-level APIs. With SegMan we can create and test more realistic cognitive models of direct-manipulation interface usage, build AI agents that can reason about and use the graphical interface, and write scripts and programs that learn and perform routine tasks in the graphical interface.
SegMan is a substrate because it is a layer of functionality that sits just above the level of the operating system and provides hook functions that other programs can use to perceive and manipulate the graphical user interface. SegMan itself does not perform any functionality except segmenting the screen into well-understood features and widgets that other programs and scripts can utilize for their own ends.
The computational vision routines that SegMan uses are fairly rudimentary -- coming nowhere close to the sophistication of human vision. However, the Microsoft Windows graphical user interface is highly rectilinear and highly standardized, so we can use shortcuts for detecting features and widgets on the screen. Section 3 goes into more detail on how SegMan segments the screen into useful visual components.
SegMan has a layered architecture. On the bottommost level lies the operating system. For the current version of SegMan, the only operating system supported is Microsoft Windows; SegMan's feature detection routines are geared specifically for recognizing screen widgets that are defined by the Microsoft Windows look-and-feel.
Segman.dll is a dynamic-link library that SegMan loads into memory at load time. Segman.dll is a platform-specific piece of C++ code that captures the Windows screen and breaks it into groups of contiguous like-colored pixels. These groups are called pixel-groups. At capture time, the pixel-groups are not recognized or sorted in any fashion.
SegMan, a collection of lisp routines, accesses the DLL and retrieves the pixel-groups from the DLL's memory. Pixel-groups are subjected to predicates that identify their shapes and are categorized. SegMan determines the state of the Windows screen as a list of all pixel-groups and symbolic references for what they might look like and what they might be used for. More complicated routines can be run on the screen-state to identify increasingly more complicated features such as windows, buttons, and text.
Above this is a functional substrate. Programs and scripts can be written that access the data structures and functions representing the screen to solve problems. As of this document's creation, programs that script the graphical direct-manipulation interface can already be written at this level.
Just above this is a state-oriented substrate. The state-level representation is intended to abstract away the procedures for identifying specific types of objects. The relevant functions are designed to generate information for specific states and to create transitions between states. Programs can be constructed based either on the functional or the state-based layer, at the discretion of the designer. State functions are described in Section 6.6.
Finally, an additional, optional level is being built on top of the substrate. It provides a collection of common interface functions that can be used by planners and cognitive models. Planners and cognitive models require a certain amount of robustness, consistency, and predictability of the screen if they are to operate effectively. The controller interface will provide robustness, consistency, and predictability that might not otherwise exist in the direct-manipulation graphical interface. The controller interface is not yet complete.

SegMan uses simple computational vision routines to pick out features of interest in the Microsoft Windows graphical interface. At the base of the SegMan system is a dynamic-link library (DLL) that captures the screen as a bitmap and then processes the bitmap into lists of pixel-groups. A pixel-group is a region of the screen where all pixels are like-colored. All pixels on the screen are assigned to pixel-groups. Pixel-groups are non-overlapping.
[Figure: the text "Fi" segmented into four numbered pixel-groups.]
Group 1 is a pixel-group comprised of the pixels in the letter 'F'. Group 2 is the pixel-group comprised of the pixels in the dot. Group 3 is the pixel-group comprised of the pixels in the stem of the 'i'. Group 4 is the pixel-group consisting of the background pixels.
Pixel-groups can then be examined for specific shapes and for relationships between shapes. For shapes that consist of a single pixel-group, such as the letter 'F', recognition is simple. One could either look at the arrangement of pixels within the group, or one could look at the pixel-neighbor numbers.
Each pixel-group has an array of pixel-neighbor numbers associated with it. The pixel-neighbor numbers are encodings of the relationships between pixels within the group. Each pixel in a group has 0-8 neighbors. Looking at an individual pixel in a pixel-group, there are eight possible positions that a neighbor can be in: west, southwest, south, southeast, east, northeast, north, and northwest. We number these positions 0 through 7 in that order, so the west position is "0" and the northwest position is "7".
[Figure: the eight neighbor positions around a pixel, numbered 0 (west) through 7 (northwest).]
Each of these numbers corresponds to a bit in a single integer. Thus if a pixel is the top-left corner of a box, it has neighbors to the south (position 2), southeast (position 3), and east (position 4). The pixel-neighbor value of that top-left corner pixel is 2^2 + 2^3 + 2^4 = 28.
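The bit encoding can be sketched in a few lines of C++. This is only an illustration of the arithmetic: the `Neighbor` enum and the `neighbor_code` helper are hypothetical names and are not part of segman.dll.

```cpp
#include <cstdint>
#include <initializer_list>

// Neighbor positions in the order given in the text: west is bit 0,
// northwest is bit 7.
enum Neighbor : std::uint8_t {
    West = 0, Southwest = 1, South = 2, Southeast = 3,
    East = 4, Northeast = 5, North = 6, Northwest = 7
};

// Combine a set of occupied neighbor positions into one pixel-neighbor value.
// (Hypothetical helper for illustration only.)
inline int neighbor_code(std::initializer_list<Neighbor> occupied) {
    int code = 0;
    for (Neighbor n : occupied) {
        code |= 1 << n;  // set the bit for this position
    }
    return code;
}
```

Calling `neighbor_code({South, Southeast, East})` reproduces the corner value 28 from the example, and a pixel with all eight neighbors present gets the value 255.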
For an entire pixel-group, we count how many member pixels have each pixel-neighbor value 0, 1, 2, ..., 255. For example, for a pixel-group representing a 5-by-5 solid box, the pixel-neighbor numbers will look like:
pixel_neighbors[0] = 0
pixel_neighbors[1] = 0
pixel_neighbors[2] = 0
pixel_neighbors[3] = 0
pixel_neighbors[4] = 0
pixel_neighbors[5] = 0
pixel_neighbors[6] = 0
pixel_neighbors[7] = 1
....
pixel_neighbors[28] = 1
pixel_neighbors[29] = 0
pixel_neighbors[30] = 0
pixel_neighbors[31] = 3
....
pixel_neighbors[112] = 1
....
pixel_neighbors[124] = 3
....
pixel_neighbors[193] = 1
....
pixel_neighbors[199] = 3
....
pixel_neighbors[241] = 3
....
pixel_neighbors[255] = 9
The four corner pixels generate unique pixel-neighbor values (7, 28, 112, 193). There are three pixels on each edge if you exclude the corners (31, 124, 199, 241). There are nine pixels in the center that are completely surrounded by neighbors (255).
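As a rough sketch of how such a table could be computed, the following C++ builds the 256-entry histogram from a boolean mask of one pixel-group. The `Mask` type and `neighbor_histogram` function are assumptions made for illustration and are not the DLL's actual implementation. Screen coordinates are assumed, so "south" means increasing y.

```cpp
#include <array>
#include <vector>

// Row-major boolean mask for one pixel-group: true where a pixel belongs to it.
using Mask = std::vector<std::vector<bool>>;

// Offsets (dx, dy) for positions 0..7: W, SW, S, SE, E, NE, N, NW.
// y grows downward, so "south" is dy = +1 (an assumption about orientation).
constexpr int kDx[8] = {-1, -1, 0, 1, 1, 1, 0, -1};
constexpr int kDy[8] = {0, 1, 1, 1, 0, -1, -1, -1};

// Count, for each pixel-neighbor value 0..255, how many member pixels have it.
std::array<int, 256> neighbor_histogram(const Mask& mask) {
    std::array<int, 256> counts{};  // zero-initialized
    const int height = static_cast<int>(mask.size());
    for (int y = 0; y < height; ++y) {
        const int width = static_cast<int>(mask[y].size());
        for (int x = 0; x < width; ++x) {
            if (!mask[y][x]) continue;  // not a member of the group
            int code = 0;
            for (int pos = 0; pos < 8; ++pos) {
                const int nx = x + kDx[pos];
                const int ny = y + kDy[pos];
                const bool inside =
                    ny >= 0 && ny < height &&
                    nx >= 0 && nx < static_cast<int>(mask[ny].size());
                if (inside && mask[ny][nx]) code |= 1 << pos;
            }
            ++counts[code];
        }
    }
    return counts;
}
```

For a 5-by-5 mask that is entirely true, this yields one pixel each for values 7, 28, 112, and 193, three each for 31, 124, 199, and 241, and nine for 255, matching the table above.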
We can detect a single-colored box by looking for pixel-groups with the following combination of pixel-neighbor numbers:
pixel_neighbors[7] == 1 AND
pixel_neighbors[28] == 1 AND
pixel_neighbors[112] == 1 AND
pixel_neighbors[193] == 1 AND
pixel_neighbors[31] > 1 AND
pixel_neighbors[124] > 1 AND
pixel_neighbors[199] > 1 AND
pixel_neighbors[241] > 1
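Written as a C++ predicate over the histogram, the test might look like the sketch below. The function name `looks_like_box` is hypothetical. It mirrors the conjunction above rather than reproducing SegMan's original code.

```cpp
#include <array>

// True when the histogram has exactly one pixel of each corner type and
// more than one pixel along each edge, as in the conjunctive test above.
bool looks_like_box(const std::array<int, 256>& pixel_neighbors) {
    const int corners[] = {7, 28, 112, 193};
    const int edges[] = {31, 124, 199, 241};
    for (int code : corners) {
        if (pixel_neighbors[code] != 1) return false;
    }
    for (int code : edges) {
        if (pixel_neighbors[code] <= 1) return false;
    }
    return true;
}
```

Because each edge count must exceed 1, a box needs at least two non-corner pixels per edge, so it must be at least 4 pixels on a side to pass this test.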
Conjunctive tests such as the one described above for finding a box were used in the original C++ code for SegMan and were then ported to the Lisp side for flexibility. Eventually we developed a declarative form in which pixel patterns could be specified. The form allows not only neighboring values to be specified, but also other properties of a group:
We define patterns to capture combinations of group properties. Below is a description of how patterns can be specified.
pattern-def := (DEFINE-PATTERN name () pattern-list)
pattern-list := pattern-form pattern-list || pattern-form
pattern-form := (bool pattern more-patterns) || pattern
more-patterns := pattern more-patterns || pattern
pattern := code ||
(code count) ||
(code count comp) ||
(:NEIGHBOR code count) ||
(:NEIGHBOR code count comp) ||
(accessor count) ||
(accessor count comp)
bool := :AND || :OR
comp := < || <= || = || >= || >
accessor := :COUNT ||
:AREA ||
:HEIGHT ||
:WIDTH ||
:RED ||
:GREEN ||
:BLUE ||
:COLOR ||
:PROPORTION
Terminal types:
name := SYMBOL
code := INTEGER
count := INTEGER
area := INTEGER
The variables count and comp default to 1 and =, respectively, if they
are not specified. An atomic code is equivalent to the form
(:NEIGHBOR code 1 =). In other words, a single instance
of the pixel pattern exists in the pixel group. Unfortunately, these
forms are not mutually exclusive, but confusions are minor.
Patterns can thus be specified either verbosely or concisely (in
both cases somewhat cryptically) as shown in the patterns below. The
single number 20 in the letter-e form expands to a test
equivalent to pixel_neighbors[20] == 1; the form (1
3) expands to pixel_neighbors[1] == 3; the form
(28 1 >=) in the :rectangles form expands to
pixel_neighbors[28] >= 1.
(define-pattern letter-e (:translation #\e)
  (:and (1 3) 20 80 84)      ;upper E
  (:and 18 9 72 2 (17 3)))   ;lower E
(define-pattern :rectangles ()
  (:and (28 1 >=)    ;top-left
        (112 1 >=)   ;bottom-left
        (193 1 >=)   ;bottom-right
        (7 1 >=))    ;top-right
  (:and (28 1 >=)    ;top-left
        (112 1 >=)   ;bottom-left
        (193 1 >=)   ;bottom-right
        (5 1 >=)))
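To make the expansion rules concrete, here is a small C++ model of how a single :AND form of neighbor clauses could be checked against a group's histogram. The types `Comp` and `NeighborClause` and the functions `clause_holds` and `and_holds` are hypothetical. The sketch covers only neighbor clauses, not accessors such as :COUNT or :AREA, and it does not model how several top-level forms in one definition are combined.

```cpp
#include <array>
#include <vector>

enum class Comp { Less, LessEq, Equal, GreaterEq, Greater };

// One (:NEIGHBOR code count comp) clause. As in the text, count and comp
// default to 1 and = when they are not specified.
struct NeighborClause {
    int code;
    int count;
    Comp comp;
    NeighborClause(int code_, int count_ = 1, Comp comp_ = Comp::Equal)
        : code(code_), count(count_), comp(comp_) {}
};

// Compare the histogram entry for the clause's code against its count.
bool clause_holds(const NeighborClause& c,
                  const std::array<int, 256>& pixel_neighbors) {
    const int actual = pixel_neighbors[c.code];
    switch (c.comp) {
        case Comp::Less:      return actual < c.count;
        case Comp::LessEq:    return actual <= c.count;
        case Comp::Equal:     return actual == c.count;
        case Comp::GreaterEq: return actual >= c.count;
        case Comp::Greater:   return actual > c.count;
    }
    return false;  // not reached for valid Comp values
}

// An :AND form holds when every one of its clauses holds.
bool and_holds(const std::vector<NeighborClause>& clauses,
               const std::array<int, 256>& pixel_neighbors) {
    for (const NeighborClause& c : clauses) {
        if (!clause_holds(c, pixel_neighbors)) return false;
    }
    return true;
}
```

Under this model the first letter-e form corresponds to the clauses `NeighborClause(1, 3)`, `NeighborClause(20)`, `NeighborClause(80)`, and `NeighborClause(84)`, and `(28 1 >=)` corresponds to `NeighborClause(28, 1, Comp::GreaterEq)`.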
We have included a few very simple debugging functions,
group-under-cursor, show, and
pixel-pattern, to help the developer construct new pixel
patterns. The function call (group-under-cursor) will
return a list of groups whose centers are nearest the current position
of the cursor. Some experimentation is usually required to figure out
exactly which pixel group is which. The function call (show
group), where group is one of the elements of the list returned
by group-under-cursor, or for that matter any pixel group
at all, will color that pixel group red. Finally, given a pixel
group, the function pixel-pattern will generate a pixel
pattern specification of that group.
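The ranking performed by group-under-cursor can be pictured with the following C++ sketch, which orders group centers by straight-line distance from the cursor. The names `Point` and `nearest_centers` and the `limit` parameter are assumptions for illustration. Euclidean distance is also an assumption, although it is consistent with the distances printed in the listener transcript in this section (for example, 2.828427 is the square root of 8).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Point {
    double x;
    double y;
};

using Ranked = std::pair<std::size_t, double>;  // (index into centers, distance)

// Return the `limit` group centers closest to `cursor`, nearest first.
std::vector<Ranked> nearest_centers(const std::vector<Point>& centers,
                                    Point cursor, std::size_t limit) {
    std::vector<Ranked> ranked;
    ranked.reserve(centers.size());
    for (std::size_t i = 0; i < centers.size(); ++i) {
        const double d = std::hypot(centers[i].x - cursor.x,
                                    centers[i].y - cursor.y);
        ranked.emplace_back(i, d);
    }
    // Stable sort keeps the original order among equally distant groups.
    std::stable_sort(ranked.begin(), ranked.end(),
                     [](const Ranked& a, const Ranked& b) {
                         return a.second < b.second;
                     });
    if (ranked.size() > limit) ranked.resize(limit);
    return ranked;
}
```

Several groups often tie or nearly tie for distance, which is why some trial and error with (show group) is still needed to pick out the intended one.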
For example, here is one way I might define a pattern for the letter
E: I position the cursor over a specific pattern and call
(group-under-cursor). This returns a list of groups.
For each group, I call (show group) and see if the pixel
group I'm interested in is colored. When I get the right one, I call
(pixel-pattern group), wrap a define-pattern
form around the result, and save it in a file.
SEGMAN(56): (segment-screen)
Beginning segmentation of entire screen.
Completed segmentation; 2736 groups found.
((:WIN-LOGOS (217164432)) (:CHECK-BOX (243214072 243216392))
...)
;;; Move cursor to specific location
SEGMAN(57): (group-under-cursor) ; The two forms returned are the
; pixel group list and each group's
; distance from the cursor position.
(244391488 244411208 244414688 244390328 244384528)
(2.828427 3.1622777 3.1622777 3.6055512 3.6055512)
SEGMAN(59): (setf groups *) ; Put the result in a variable.
(244391488 244411208 244414688 244390328 244384528)
;;; A caveat: these fixnums reference structures created and
;;; maintained by the DLL but not on the Lisp side. Further calls to
;;; group-under-cursor will work, but another call to segment-screen
;;; will rebuild the structures, and attempts to access the pixel
;;; groups referenced by these specific pointers/fixnums will result
;;; in an error.
SEGMAN(60): (show 244391488) ; Color one of the groups. This one
; turns out not to be the intended one.
244391488
SEGMAN(61): (show 244411208) ; Color another group, the right one.
244411208
SEGMAN(62): (pixel-pattern 244411208) ; Generate a pattern.
(:AND (:COUNT 34) (:AREA 2/7) (10 2) (17 8) (34 6) (46 2) 48 (49 2)
(81 3) ...)
SEGMAN(63): (print *) ; Oops, can't see it all. . .
(:AND (:COUNT 34) (:AREA 2/7) (10 2) (17 8) (34 6) (46 2) 48 (49 2)
(81 3) 129 (136 5) 142 (145 2) 162)
(:AND (:COUNT 34) (:AREA 2/7) (10 2) (17 8) (34 6) (46 2) 48 (49 2)
(81 3) ...)
;;; Now cut this form, insert it into a define-pattern form, and save
;;; it away in a file for loading in a different context.
(define-pattern new-pattern ()
(:AND (:COUNT 34) (:AREA 2/7) (10 2) (17 8) (34 6) (46 2) 48 (49 2)
(81 3) 129 (136 5) 142 (145 2) 162))
Note that there's no generalization in the result returned by
pixel-pattern, in the sense that there are no "don't
care" neighbor values. The pattern returned is exhaustive over the
256 neighbors and uses equality for its comparisons. It would be an
interesting problem to try to learn minimal representations of
different patterns that changed dynamically based on the input of new
patterns. This is a classic machine learning problem, which
unfortunately we haven't had the time to look at.
;;; Note that the listener starts in the common-graphics-user package.
CG-USER(0): :ld /research/systems/segman-v2/load-system
;;; . . .
At this point a window will pop up, as shown below. This window gives limited access to the functionality described in the previous section. It is a prototype, so not everything works, but it has some useful characteristics.
. . .
CG-USER(1): (capture)
#(PATTERN-SPECIFICATION-WINDOW :CAPTURE in Listener 1 @ #x20f874a2)

Start by clicking the "Grab Screen" button. A dialog box will come up, explaining how to drag over a region of the screen without using the mouse keys (so that you don't select other applications; you end up using the left control key as a substitute for mouse down and mouse up.) Once you select a region, the window will expand to show what you have selected, as shown below. The redisplay process can be somewhat slow, in that it is carrying out the segmentation process in addition to displaying the bitmap. Unfortunately, one of the current limitations of the application is that you can't move the window once you've selected a region to be displayed; if you do, you'll need to grab the screen again.

Once you have a region displayed in the window, the Pixel Groups list box will display the pixel groups in that region. These may not be all of the pixel groups; a text box shows the minimum number of pixels that must be contained in a group for it to be displayed. (Currently this box cannot be edited; this is a bug that we will take care of shortly.)
Clicking on a group in the Pixel Groups list box will cause that group to be colored in the display. If the "Move pointer to selected group" flag is set, the mouse pointer will move automatically to the selected group as well. A selection action will also cause the properties of the group to be displayed in the Group Properties list box. This is not a complete list of the group's properties, in that it leaves out pixel neighbor information, but it should be enough to give an overview of the group's properties.
It would be tedious to click through all of the potentially hundreds of groups in the Pixel Groups list box, searching for a specific visual element. You can also click on groups in the image. This will cause the system to search for every group in the image whose center is within a constant value of the mouse click event. It will display these groups in the Pixel Groups list box. You'll probably have to click through these to find exactly the group you're interested in, but it's a much smaller search. To go back to all the groups in the image, you can click the Reset Groups button.
When you've selected a group for which you are interested in creating a specification, you can click the Show Pattern button. This will print out a Lisp form to the Lisp listener. The specification form for the group shown in the picture above is as follows:
(SEGMAN:DEFINE-PATTERN T1 () (:AND (:COUNT 42) (:AREA 1) (:SIZE 42) (:HEIGHT 6) (:WIDTH 7) (:RED 212) (:GREEN 208) (:BLUE 200) (:COLOR 13947080) (:PROPORTION 5/6) 7 28 (31 5) 112 (124 4) 193 (199 4) (241 5) (255 20)))
You would edit the form slightly, to change the name T1 to a symbol more descriptive of the pattern, such as small-box, and perhaps remove some of the properties that may be overly restrictive, such as the RGB color information. The resulting form can be saved away in a file for later use, or it can be evaluated and tested interactively, as shown below. The first form is obtained by cutting and pasting from the result of the Show Pattern operation above.
CG-USER(5): (SEGMAN:DEFINE-PATTERN small-box ()
(:AND (:COUNT 42) (:AREA 1) (:SIZE 42) (:HEIGHT 6) (:WIDTH 7)
(:RED 212) (:GREEN 208) (:BLUE 200) (:COLOR 13947080)
(:PROPORTION 5/6)
7 28 (31 5) 112 (124 4) 193 (199 4) (241 5) (255 20)))
((:AND (:COUNT 42) (:AREA 1) (:SIZE 42) (:HEIGHT 6) (:WIDTH 7) (:RED 212) (:GREEN 208)
(:BLUE 200) (:COLOR 13947080) ...))
CG-USER(6): (segman::pattern-groups 'small-box)
;;; These are the small boxes visible on the entire screen, at least
;;; the last time segment-screen was called.
(143093736 142246920 142088000 142030000 142020712 141069656 140787768)
CG-USER(7): (segman::show *)
;;; Now all the small boxes should be colored on the screen.
(143093736 142246920 142088000 142030000 142020712 141069656 140787768)
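The :COLOR value in the form above, 13947080, is the :RED, :GREEN, and :BLUE values (212, 208, 200) packed into a single integer, which matches the coding this document gives for (c_get_rgb_color). The C++ sketch below unpacks such a value and also computes the averaged coding described for (c_get_group_color). The names `Rgb`, `unpack_rgb`, and `average_brightness` are hypothetical, and the use of integer division in the average is an assumption.

```cpp
#include <cstdint>

struct Rgb {
    int red;
    int green;
    int blue;
};

// Unpack a color coded as 0xRRGGBB: blue in the low byte, then green, then red.
inline Rgb unpack_rgb(std::uint32_t c) {
    return Rgb{static_cast<int>((c >> 16) & 0xFF),
               static_cast<int>((c >> 8) & 0xFF),
               static_cast<int>(c & 0xFF)};
}

// Average of the three channels, so 255 is white and 0 is black.
// Integer division is an assumption; the DLL may round differently.
inline int average_brightness(const Rgb& rgb) {
    return (rgb.red + rgb.green + rgb.blue) / 3;
}
```

Since the packed value already determines the three components, dropping :RED, :GREEN, :BLUE, and :COLOR together is the simplest way to make a pattern color-independent.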
This application is only appropriate for defining patterns associated with individual pixel groups, rather than combinations of groups, as is often necessary. An additional application is under development, as discussed below.
Simple segmentation of the screen into pixel-groups gives us a lot of power in terms of recognizing features. However, simple segmentation only allows us to see shapes that consist of a single pixel-group. Often it is valuable to recognize features on the screen that are made up of more than one pixel-group. Examples of features made up of multiple pixel-groups are icons, buttons, window borders, and strings of letters.
To recognize features that are not made up of a single pixel-group we must employ a two-step process. The first step is to find the pixel-groups that might be part of the overall feature, by selecting pixel-groups that have the right shape (the correct pixel-neighbor numbers). Not all pixel-groups with the correct shape are necessarily going to be part of the feature we are trying to detect. The second step is to choose from the candidate pixel-groups the ones that are in proximity to each other and in the correct spatial configuration. The SegMan system provides a variety of functions that find pixel-groups based on their spatial relationship to others.
For example, a standard Windows button is a rectilinear feature that appears to be raised out of the screen. This raised effect is created by applying a thin strip of color around the edges: lighter on the top and left, darker on the bottom and right. As far as SegMan is concerned, a button is made up of three pixel-groups: a rectangle and two L-shaped regions. However, these three groups must be in the correct relationship to each other in order to form what looks like a button. The lighter L-shape (upper shading) must be directly above and to the left of the rectangle, and the darker L-shape (lower shading) must be directly below and to the right of the rectangle. When these relationships hold, there is a feature recognizable to the human user as a button.

The following is pseudocode for recognizing a button:
PROCEDURE find_buttons (screen) DO
  rectangles = find_all(rectangles, screen)
  upper_shadings = find_all(upper_shading, screen)
  lower_shadings = find_all(lower_shading, screen)
  buttons = EMPTY_LIST
  FORALL rect IN rectangles DO
    upshade = find_group_containing(rect, upper_shadings)
    lowshade = find_group_containing(rect, lower_shadings)
    IF distance_between(upshade, rect) < 5 AND
       distance_between(lowshade, rect) < 5 AND
       color(upshade) > color(lowshade) THEN DO append(rect, buttons)
  RETURN buttons
In the first stage, we find all the pixel-groups of the shapes we need: rectangles, upper-shading, and lower-shading. Buttons is an empty list into which we will collect all features that look like buttons. In the second stage, we iterate through the rectangles, looking for those in the proper relationship to the other shapes we have indicated. We find a pixel-group in the upper_shadings list that most closely contains the rectangle. We find a pixel-group in the lower_shadings list that most closely contains the rectangle. Containment is a useful relationship because, even though upper_shadings and lower_shadings are L-shaped, their bounding boxes enclose a much larger area that, ideally, will contain a rectangle if the feature is a button. The next check is proximity of the L-shapes to the rectangle. This is very important because a button might be contained in a window, and windows are also bounded by L-shaped shaded areas. But if the shading belonged to a window, one or both shadings would probably be further than five pixels away. Finally, we must make sure that the L-shape above the rectangle is lighter in color than the L-shape below the rectangle. If the upper L-shape were darker than the lower L-shape, the feature would look recessed into the screen instead of raised.
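The button procedure can be sketched in C++ over bounding boxes. Everything here is a hypothetical model rather than SegMan's code: the `Group` struct stands in for a pixel-group, "most closely contains" is interpreted as the smallest containing bounding box, and `distance_between` is taken to be the largest gap between corresponding edges. The actual definitions used by SegMan may differ.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for a pixel-group: bounding box plus brightness.
struct Group {
    int left, top, right, bottom;  // inclusive bounds in screen coordinates
    int color;                     // brightness coding; larger is lighter
};

bool contains(const Group& outer, const Group& inner) {
    return outer.left <= inner.left && outer.top <= inner.top &&
           outer.right >= inner.right && outer.bottom >= inner.bottom;
}

long area(const Group& g) {
    return static_cast<long>(g.right - g.left + 1) * (g.bottom - g.top + 1);
}

// Smallest candidate whose bounding box contains rect, or nullptr if none.
const Group* find_group_containing(const Group& rect,
                                   const std::vector<Group>& candidates) {
    const Group* best = nullptr;
    for (const Group& c : candidates) {
        if (contains(c, rect) && (best == nullptr || area(c) < area(*best))) {
            best = &c;
        }
    }
    return best;
}

// Largest gap between corresponding edges; shade is assumed to contain rect.
int distance_between(const Group& shade, const Group& rect) {
    return std::max({rect.left - shade.left, rect.top - shade.top,
                     shade.right - rect.right, shade.bottom - rect.bottom});
}

std::vector<Group> find_buttons(const std::vector<Group>& rectangles,
                                const std::vector<Group>& upper_shadings,
                                const std::vector<Group>& lower_shadings) {
    std::vector<Group> buttons;
    for (const Group& rect : rectangles) {
        const Group* up = find_group_containing(rect, upper_shadings);
        const Group* low = find_group_containing(rect, lower_shadings);
        if (up == nullptr || low == nullptr) continue;  // no enclosing shading
        if (distance_between(*up, rect) < 5 &&
            distance_between(*low, rect) < 5 &&
            up->color > low->color) {
            buttons.push_back(rect);
        }
    }
    return buttons;
}
```

Unlike the pseudocode, this version skips a rectangle explicitly when no enclosing shading is found, which avoids comparing against a missing group.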
All features that consist of more than one pixel-group can be detected by applying one or more of the following relationships: contains, above, below, to the left, to the right. Additional details such as distance may be required to ensure robust recognition.
Memory management is tricky with SegMan. The pixel-groups are data objects that are stored on the C++ side (in segman.dll) and are not subject to the Lisp interpreter's garbage collection. Pointers to pixel-groups are passed to the lisp interpreter through calls to the built-in iterator. Pixel-groups, on the lisp side, are essentially integer addresses of the corresponding pixel-group objects in the DLL. Thus pixel-groups in lisp are not true objects themselves. Special helper functions are used to access the data objects stored in the DLL.
Segmentation:
In order to ensure that memory is not leaked, SegMan deletes all pixel-groups when segmentation occurs and creates a new list of pixel-groups from scratch. However, this means that any pixel-group pointers that are held as lisp values become dangling pointers; there is no way for the lisp interpreter to know that these values should be invalidated (to the lisp interpreter, pixel-group pointers look like fixnums). When segmentation is performed, all pixel-groups from the previous screen become dangling and any attempt to access the old pixel-group pointers through helper functions will result in seg-faults.
Pixel-groups that are created during segmentation appear red
on the screen when (show) is called.
Pixel-groups that show up red are transient; they will be deleted
automatically at the next screen segmentation. These pixel-groups
are also called "unsafe."
"Safe" pixel-groups:
Pixel-groups can be created "safely." A
"safe" pixel-group is one that will not be deleted when
the next screen segmentation occurs. A safe pixel-group is
created using special functions such as (make-pixel-group) and (make-pixel-group-by-bounds). The pixel-groups created through these function
calls are still kept in the DLL and their addresses are returned,
but these pixel-groups will not be deleted until the user
explicitly asks the DLL to delete them. The pixel-groups can be
deleted using the function (delete-pixel-group). However, the pointers to these pixel-groups only
exist on the lisp side. So, if a variable holding a safe
pixel-group pointer is lost due to garbage collection, there is
no way to recover the address of the safe pixel-group. Its
memory has been effectively leaked. Memory leakage will impact
SegMan's performance over time.
Safe pixel-groups show up blue on the screen when (show) is called.
Other considerations:
Most functions that perform feature detection in SegMan return
non-safe pixel-group pointers. For example, (find-buttons) returns a list of pixel-group
pointers, referring to the unsafe pixel-groups. Some functions,
such as (find-string), return safe pixel-groups. (find-string) returns a list of pixel-groups that
are effectively bounding the sequence of characters making up a
string on the screen. Because a string consists of many
pixel-groups that are not necessarily adjacent, we must make a
new, safe, pixel-group instead of returning unsafe pixel-groups.
It is important, therefore, to know whether you are retrieving
safe or unsafe pixel-groups when you call a function so you know
whether to delete the memory after use or whether the memory is
transient.
Sometimes it is important to remember pixel-groups after the
next screen segmentation. Special functions are provided to
convert unsafe pixel-groups into safe pixel-groups. (make-pixel-group (get-bounds unsafe-group)) will return a safe pixel-group with the same bounded
region as the unsafe pixel-group. (memory-safe
unsafe-group-list) will return a list of safe
pixel-groups given a list of unsafe pixel-groups. However,
converting an unsafe pixel-group to a safe pixel-group means
information is lost. Safe pixel-groups do not store information
about individual pixels inside the group, only the bounded
region. Therefore a safe pixel-group is not equivalent to an
unsafe pixel-group although the bounded regions are equivalent.
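The information loss in that conversion can be illustrated with a small C++ model. The `PixelGroup` struct and `make_safe_copy` function are hypothetical and only mimic the behavior described above. The real CPixelGroup class in segman.dll is not shown in this document.

```cpp
#include <utility>
#include <vector>

struct Bounds {
    int left, top, right, bottom;
};

// Hypothetical model of a pixel-group as held by the DLL.
struct PixelGroup {
    Bounds bounds;
    std::vector<std::pair<int, int>> pixels;  // member pixel coordinates
    bool safe;  // false for groups created by segmentation
};

// Model of converting an unsafe group to a safe one: only the bounded
// region is carried over, so the per-pixel information is dropped.
PixelGroup make_safe_copy(const PixelGroup& unsafe_group) {
    return PixelGroup{unsafe_group.bounds, {}, true};
}
```

In this model the copy keeps the same bounds but has an empty pixel list, so anything that depends on individual pixels, such as pixel-neighbor numbers, has to be read from the unsafe group before the next segmentation.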
The following describes what can be found in each file that makes up the SegMan system. The core system is located in systems/segman. All other directories contain supporting systems such as planners and cognitive models. Lisp source files are contained in systems/segman/segman. The Microsoft Developer Studio files used to build segman.dll are contained within systems/segman/src.
segman.dll is a dynamic-link library created by compiling the
Microsoft Developer Studio C++ project. The DLL contains the code
for capturing the screen and segmenting the captured bitmap into
groups of like-colored pixels called "pixel-groups".
The pixel-groups are stored on the C++ side, but pointers to the
pixel-group objects can be obtained and passed to the lisp side
via the iterator calls (c_is_next), (c_get_next), and (c_reset_iterator). Since pointers are passed to the lisp side, the DLL
also provides functions for manipulating the objects referred to
by the pointers. I have created a convention of prefixing
exported DLL calls with "c_", although I have also
written wrapper functions that make the DLL calls easier to use (see
wrappers.lisp).
See CScreenProcessor::FindBasicGeometry(), CScreenProcessor::FindButtons(), etc. for examples of how
screen objects would be recognized by C++ code, although
these functions are legacy and are not actually called.
Segmentation is performed by calling (c_segment_screen). Once segmentation occurs,
CSegManApp stores a list of all pixel-groups on the
screen. These can be recovered using the built-in
iterator:
c_segment_screen(0, 0, 1024, 768); //initialize the pixel-group list
while (c_is_next()) { //iterate through the list
CPixelGroup* g = c_get_next();
//do something with g.
}
c_reset_iterator(); //reset the list for the next iteration.
Segman.dll also provides support for debugging. Calling c_show(CPixelGroup* p) will cause
the system to color the pixels belonging to the pixel-group on
the screen so you can see a visual representation of the
pixel-group. The pixel-group will show up as red pixels. If the
pixel-group is created by means other than segmentation (e.g. (make-pixel-group)), the
pixel-group will be displayed blue.
The file foreign-interface.lisp contains the foreign-function interface for Allegro CL 5.01. Each def-foreign-call corresponds to a function exported by segman.dll, so that you can call the functions as if they were lisp functions.
Cursor position functions:
(c_get_cursor_x) returns the x-position of the mouse cursor.
(c_get_cursor_y) returns the y-position of the mouse cursor.
Segmentation and iteration functions:
(c_segment_screen left top right bottom) segments the portion of the screen within the given bounds. After the call, the pixel-group iterator will be initialized with the new pixel-group information.
(c_set_base_screen) tells segman.dll to store the current screen's bitmap for a future comparison. This is useful for establishing a baseline screen from which we can poll for changes.
(c_get_screen_difference) returns a CRect* bounding-box that represents a region in which the screen has changed at the bit-level. This is based on the current screen compared to the screen stored by (c_set_base_screen). CRect* values are pointers to data-structures in the DLL. There are helper functions for manipulating the data within.
(c_wait_for_difference) loops until there is a change in the screen at the bit-level. I do not recommend using this function.
(c_reset_iterator) resets the pixel-group iterator so collection can occur. This is automatically done when (c_segment_screen) is called.
(c_is_next) returns a boolean: true if the pixel-group iterator is not at the end of the pixel-group list held by the DLL.
(c_get_next) returns a pointer to a CPixelGroup data structure and advances the pixel-group iterator to the next pixel-group in the pixel-group list held in the DLL. CPixelGroup* is a pointer to a C++ data structure. There are helper functions for manipulating the data within.
(c_get_group_at point) returns the pixel-group located at a given point. The point is passed in as a list of screen coordinates: (x y). This function returns the pixel-group from the last screen segmentation. If there has been no screen segmentation, an error will occur.
Screen manipulation functions:
(c_double_click) causes the mouse to double-click at its current position.
(c_single_click) causes the mouse to single-click at its current position.
(c_move_mouse_to x y) causes the mouse to move to the given screen coordinates.
(c_mouse_down) causes the mouse button to be depressed and remain depressed until (c_mouse_up) is called. Use this for dragging operations.
(c_mouse_up) causes the mouse button to be released if it had been previously depressed.
(c_press_key str) causes a key to be pressed and released. The input argument is a string containing a single character or one of the following: "RETURN", "SHIFT", "CONTROL", "MENU", "WINDOWS", "INSERT", "DELETE", "BACKSPACE", "LEFT", "UP", "RIGHT", "DOWN", "HOME", "END", or "ESCAPE".
(c_key_down str) causes a key to be pressed but not released. Used for depressing special key combinations such as control or shift sequences. The key remains depressed until (c_key_up str) is called.
(c_key_up str) causes a key to be released if it had already been depressed.
Pixel-group creation and deletion:
(c_new_pixel_group left top right bottom color) causes the DLL to create a new CPixelGroup and return a pointer to the data structure. This is a safe pixel-group.
(c_delete_pixel_group pointer) causes a CPixelGroup's memory to be freed. To avoid memory leaks, it is essential that you call this function on every safe pixel-group. However, the DLL manages the memory for unsafe pixel-groups, and deleting any unsafe pixel-group will result in a seg-fault.
Pixel-group helper functions:
(c_get_pixel_neighbors group code) returns the number of pixels in the given pixel-group that have a particular pixel-neighbor value.
(c_get_pixel_count group) returns the number of pixels in a pixel-group.
(c_get_group_color group) returns a coding of the color of the pixels in a pixel-group. The coding used is (red + green + blue)/3. Thus 255 is always white, 0 is always black, and if c1 > c2 then c1 is more luminous than c2 (numerically, not perceptually).
(c_get_rgb_color group) returns a coding of the color of the pixels in a pixel-group. The coding is such that (c & 0xFF) selects blue, (c & 0xFF00) selects green, and (c & 0xFF0000) selects red.
(c_show group) causes the pixels in a pixel-group to be painted to the screen. If the pixel-group was created during segmentation (unsafe), the pixels will be colored red. If the pixel-group is safe, the region within its bounds will be filled blue.
(c_get_left group) returns the left-most bound of the pixel-group.
(c_get_top group) returns the top-most bound of the pixel-group.
(c_get_right group) returns the right-most bound of the pixel-group.
(c_get_bottom group) returns the bottom-most bound of the pixel-group.
CRect* helper functions:
(c_get_rect_left rect) returns the left bound of the rectangle.
(c_get_rect_top rect) returns the top bound of the rectangle.
(c_get_rect_right rect) returns the right bound of the rectangle.
(c_get_rect_bottom rect) returns the bottom bound of the rectangle.
This file (wrappers.lisp) duplicates the functions in foreign-interface.lisp but with Lisp-friendly function names. For example, (c_double_click) is wrapped by a new function called (double-click) with the same parameters. The wrapper functions are superior to the foreign-interface functions in that they perform some pointer error-checking. Pointers in Lisp and pointers in C++ are not always handled the same way, so conversions are made to ensure certain errors do not occur. Other wrapper functions simplify calls to the DLL, such as (get-cursor), which wraps calls to (c_get_cursor_x) and (c_get_cursor_y) and returns the results as a single list value. Some wrapper functions do not correspond to any function in foreign-interface.lisp but provide helper routines for pixel-groups that can be derived from foreign-interface functions, such as (get-height), which wraps (c_get_top) and (c_get_bottom) with some processing.
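The two kinds of wrapper described above can be sketched as follows. This is only an illustration, not the code in wrappers.lisp: the foreign functions are replaced by stand-ins so the sketch runs without segman.dll, a pixel-group is modeled as a plain bounds list instead of a pointer, and the height calculation (bottom minus top) is an assumption about the "processing" involved.

```lisp
;; Stand-ins for the foreign-interface functions (assumptions for this
;; sketch). A "pixel-group" here is a (left top right bottom) list rather
;; than a pointer into the DLL.
(defun c_get_top (group) (second group))
(defun c_get_bottom (group) (fourth group))
(defvar *fake-cursor* '(120 45))
(defun c_get_cursor_x () (first *fake-cursor*))
(defun c_get_cursor_y () (second *fake-cursor*))

;; A wrapper that merges two foreign calls into one list-valued result.
(defun get-cursor ()
  (list (c_get_cursor_x) (c_get_cursor_y)))

;; A derived helper with no direct foreign counterpart. Assumption: height
;; is bottom minus top; the real wrapper may treat the bounds as inclusive.
(defun get-height (group)
  (- (c_get_bottom group) (c_get_top group)))

(get-cursor)                  ; => (120 45)
(get-height '(10 20 110 50))  ; => 30
```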
Cursor position functions:
(get-cursor) returns the screen-coordinates of the mouse cursor in the list form (x y).
Pixel-group iterator and screen difference functions:
(set-base-screen) tells the SegMan system to remember the screen as it was captured during the last segmentation. Use this function in conjunction with (get-screen-difference) to determine if the screen has changed.
(get-screen-difference) returns a bounding box that contains the region of the screen that has changed since the last (set-base-screen) call. The value returned is a list, (left top right bottom).
(reset-iterator) resets SegMan's built-in pixel-group list iterator so pixel-group retrieval can begin from the start of the list.
(get-next) returns a pointer to the next pixel-group in SegMan's pixel-group list. The iterator is then advanced to the following pixel-group, if there is one. The return value is the integer address of the pixel-group. Functions are provided to access member data of the pixel-group.
(is-next) returns t if there is a next pixel-group in SegMan's pixel-group iterator.
(get-group-at point) returns the pixel-group located at a given point. The point is passed in as a list of screen coordinates: (x y). This function returns the pixel-group from the last screen segmentation. If there has been no screen segmentation, an error will occur.
Pixel-group creation and deletion:
(make-pixel-group bounds) creates a new pixel-group with a bounding box given as a list, (left top right bottom). The pointer to the new pixel-group is returned. Pixel-groups created by this function are safe: SegMan will not garbage collect them. The caller must release the data before discarding the pointer or memory will be leaked. Use (delete-pixel-group) to release the data.
(make-pixel-group-by-bounds left top right bottom) creates a new safe pixel-group with the given bounds. See (make-pixel-group).
(delete-pixel-group group) deletes the memory associated with a pixel-group. SegMan cleans up after the pixel-groups created during segmentation but does not clean up safe pixel-groups. To release the memory allocated by (make-pixel-group) and (make-pixel-group-by-bounds), (delete-pixel-group) must be used or memory will be leaked. However, if this call is made on pixel-groups created during segmentation, SegMan will seg-fault on the next segmentation.
Pixel-group helper functions:
(get-pixel-neighbor group code) returns the number of pixels in the given pixel-group that have the given pixel-neighbor value.
(get-pixel-count group) returns the number of pixels in the pixel-group.
(get-group-color group) returns an encoded value representing the color of the pixels in the pixel-group. The coding is (red + green + blue)/3 so that 255 is always white, 0 is always black, and if c1 > c2 then c1 is more luminous (non-perceptually) than c2.
(get-rgb-color group) returns an encoded value representing the color of the pixels in the pixel-group. The coding is such that (c & 0xFF) is blue, (c & 0xFF00) is green, and (c & 0xFF0000) is red. See the helpers (red), (green), and (blue) in this file.
(show group) paints the pixels in the pixel-group on the screen. If the pixel-group was created during segmentation, the pixels will be painted red. If the pixel-group was created with (make-pixel-group), the area within the pixel-group's bounding box will be painted blue.
(get-left group) returns the left-most bound of the pixel-group, or the left-most bound of a list of pixel-groups.
(get-top group) returns the top-most bound of the pixel-group, or the top-most bound of a list of pixel-groups.
(get-right group) returns the right-most bound of the pixel-group, or the right-most bound of a list of pixel-groups.
(get-bottom group) returns the bottom-most bound of the pixel-group, or the bottom-most bound of a list of pixel-groups.
(get-width group) returns the width of the bounding box around the pixel-group.
(get-height group) returns the height of the bounding box around the pixel-group.
(get-distance-to group target) returns the distance from the pixel-group, group, to the pixel-group, target.
(get-right-distance-to group target) returns the distance from the right bound of group to the right bound of target. Both parameters must be pixel-groups.
(get-left-distance-to group target) returns the distance from the left bound of group to the left bound of target. Both parameters must be pixel-groups.
(get-distance-between group target) returns the horizontal distance between the two pixel-groups.
(get-horizontal-distance-between group target) returns the horizontal distance between the two pixel-groups.
(get-vertical-distance-between group target) returns the vertical distance between the two pixel-groups.
(contains-p outer inner) returns t if the bounding box of outer completely encloses the bounding box of inner.
(to-the-right-p group target) returns t if target's left bound is greater than group's right bound. The two pixel-groups' upper and lower bounds must overlap.
(to-the-left-p group target) returns t if target's right bound is less than group's left bound. The two pixel-groups' upper and lower bounds must overlap.
(above-p group target) returns t if target's lower bound is less than group's upper bound. The two pixel-groups' left and right bounds must overlap.
(below-p group target) returns t if target's upper bound is greater than group's lower bound. The two pixel-groups' left and right bounds must overlap.
(get-center group) returns the center point of the bounding box around the pixel-group. The return value is a list, (x y).
(get-area group) returns the area, in pixels, of the bounding box around the pixel-group.
(get-bounds group) returns the bounding box of the pixel-group in list form: (left top right bottom).
(get-bounds-values group) returns the bounding box of the pixel-group as four return values: left, top, right, and bottom.
Screen manipulation functions:
(double-click) causes the mouse to double-click at its current position.
(single-click) causes the mouse to single-click at its current position.
(move-to object) causes the mouse cursor to be moved to the specified location. If object is a list in the form (x y), the mouse moves to those screen-coordinates. If object is a pixel-group pointer, the mouse moves to the center point of that pixel-group's bounding box.
(move-mouse-to point) moves the mouse to the specified screen coordinates. Point should be a list (x y).
(mouse-down) causes the mouse button to be depressed but not released. The mouse button remains depressed until (mouse-up) is called. Use this function for mouse drag operations.
(mouse-up) causes the mouse button to be released if it was depressed.
(jiggle-mouse &optional (times 10) (severity 5)) causes the mouse to jiggle randomly. The times parameter indicates how many moves the cursor should make. Severity indicates the maximum number of pixels the cursor is allowed to move. The cursor is returned to its original position after jiggling is complete to prevent drift.
(press-key key) causes a key to be pressed and released. The parameter must be a string containing a single character or one of the following multi-character strings: "RETURN", "SHIFT", "CONTROL", "MENU", "WINDOWS", "INSERT", "DELETE", "BACKSPACE", "LEFT", "UP", "RIGHT", "DOWN", "HOME", "END", or "ESCAPE".
(key-down key) causes a key to be pressed but not released. The key is not released until (key-up key) is called. See (press-key) for valid parameter values. Use this function for key combination sequences.
(key-up key) causes a key to be released if it was already depressed. See (press-key) for valid parameter values.
(type-string string) causes the given string to be typed by pressing the key corresponding to each character in the string.
RGB color helper functions:
(red integer-value) returns the red component of the color encoding returned by (get-rgb-color) by performing the appropriate bit-operations.
(green integer-value) returns the green component of the color encoding returned by (get-rgb-color) by performing the appropriate bit-operations.
(blue integer-value) returns the blue component of the color encoding returned by (get-rgb-color) by performing the appropriate bit-operations.
The wrapper for (c_segment_screen) is in segmentation.lisp. It combines all the pixel-group iterator routines plus extra processing to identify and record pixel-groups.
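As a rough illustration of how the iterator routines (reset-iterator), (is-next), and (get-next) might be combined to gather pixel-groups into a Lisp list, consider the sketch below. The three iterator functions are stand-ins defined over a fixed list of made-up addresses so that the code runs without segman.dll; collect-pixel-groups is a hypothetical name, not a SegMan function.

```lisp
;; Stand-in iterator over a fixed list of fake pixel-group addresses
;; (assumption for this sketch). The real functions walk the DLL's list.
(defvar *fake-groups* '(93580260 102169832 93302820))
(defvar *fake-position* nil)
(defun reset-iterator () (setf *fake-position* *fake-groups*) t)
(defun is-next () (not (null *fake-position*)))
(defun get-next () (pop *fake-position*))

;; Hypothetical helper: drain the iterator into a list, preserving order.
(defun collect-pixel-groups ()
  (reset-iterator)
  (loop while (is-next)
        collect (get-next)))

(collect-pixel-groups)  ; => (93580260 102169832 93302820)
```

Because the iterator is reset at the start, the helper can be called repeatedly and yields the same list until the underlying list changes.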
This file (segmentation.lisp) contains the functions used to initiate segmentation
of the screen, to collect pixel-groups into Lisp data structures,
and to begin classification of pixel-groups into single-group
features. The function (segment-screen) causes SegMan to capture the screen and break it down
into its constituent pixel-groups. Each pixel-group is collected
and a series of predicates is used to identify each pixel-group.
Pixel-groups are all unknown when they are retrieved from the
built-in iterator. Pixel-groups are categorized and inserted into
an association list according to the predicates that recognize
them. For example, all pixel-groups retrieved from the screen
that cause the predicate, (rectangle-p), to return true are collected into the association
list under the key, :rectangles. If a pixel-group is not recognized by any predicate,
it is categorized under the key, :unknown. The following is an example of the association list
returned by (segment-screen):
((#\5 (93580260))
 (#\8 (102169832))
 (#\y (93302820 102005680 102469236))
 (#\x (93750192 102884240))
 (:down-triangles (93301664 102614892 97748952 98491104))
 (:check-marks (97966280 98642540))
 (:rectangles (122619108 103739436 93224212 93227680 93234616 93245020 93246176 ...))
 ... )
In the example, there is one character 5, one character 8,
three character y's, two character x's, four downward pointing
triangles, two check marks, and a multitude of rectangle shapes.
The size of the association list can be quite large since there
are a lot of pixel-groups on any given screen and a lot of
predicates. The association list returned by (segment-screen) effectively represents the state of
the screen at the time it was captured. All pixel-groups are
enumerated at least once in the data structure. Furthermore, this
is the first pass at recognizing features on the screen. Letters,
check marks, and other features that are represented by a single
contiguous set of like-colored pixels can be found in the
association list.
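Looking up a category in a screen-state of this shape is an ordinary association-list lookup. The sketch below uses an abbreviated copy of the example data; the integers are the addresses from the example, not live pointers, and groups-under is a hypothetical helper name introduced only for illustration.

```lisp
;; An abbreviated screen-state in the shape shown above.
(defparameter *example-state*
  '((#\5 (93580260))
    (#\y (93302820 102005680 102469236))
    (:down-triangles (93301664 102614892 97748952 98491104))
    (:check-marks (97966280 98642540))))

;; Hypothetical helper: return the list of groups stored under KEY, or NIL
;; when the key is absent.
(defun groups-under (key screen-state)
  (second (assoc key screen-state)))

(length (groups-under :check-marks *example-state*))  ; => 2
(groups-under #\y *example-state*)  ; => (93302820 102005680 102469236)
(groups-under :rectangles *example-state*)            ; => NIL
```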
Predicates used for recognition of pixel-groups are listed in
a special global variable called *segmentation-predicates*. This variable lists all segmentation predicates and
the keys that matching pixel-groups should be categorized under.
Without going into details about each predicate, most predicates
are built on the principle of using pixel-neighbor numbers to
detect salient features of the pixel-group. The (segment-screen) function iteratively applies each
predicate to every pixel-group. The segmentation process can thus
be computationally expensive. It should be noted that a
pixel-group may be recognized by more than one predicate and show
up under more than one key entry.
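The classification loop described above can be sketched in a few lines. The layout of the predicate table used here, a list of (predicate key) pairs, is an assumption made for illustration; the actual format of *segmentation-predicates* may differ. Integers and arithmetic tests stand in for pixel-groups and shape predicates so that the sketch runs on its own.

```lisp
;; Assumed layout, for illustration only: a list of (predicate key) pairs.
(defparameter *toy-predicates*
  (list (list #'evenp :even)
        (list (lambda (n) (zerop (mod n 3))) :multiple-of-3)))

;; Apply every predicate to every group. File each group under each key whose
;; predicate accepts it, or under :unknown when no predicate does.
(defun classify-groups (groups predicates)
  (let ((state '()))
    (flet ((file-under (key group)
             (let ((entry (assoc key state)))
               (if entry
                   (push group (second entry))
                   (push (list key (list group)) state)))))
      (dolist (group groups)
        (let ((matched nil))
          (loop for (predicate key) in predicates
                when (funcall predicate group)
                  do (file-under key group)
                     (setf matched t))
          (unless matched
            (file-under :unknown group)))))
    state))

(classify-groups '(1 2 3 6) *toy-predicates*)
;; => ((:MULTIPLE-OF-3 (6 3)) (:EVEN (6 2)) (:UNKNOWN (1)))
```

Note that 6 appears under two keys, mirroring the remark that a pixel-group may be recognized by more than one predicate. The nested loop also shows why cost grows with the number of groups times the number of predicates.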
By default, the (segment-screen) call captures and segments the entire
screen. However, the optional parameter bounds, given as a list
in the form (left top right bottom), can be used to constrain the
screen capture area.
This file (segman.lisp) contains higher-level recognition routines for finding screen features that are made up of more than one pixel-group, for example buttons, windows, and strings of text. Multi-group features are detected by selecting single-group features out of the screen-state association list and by making comparisons between the candidate pixel-groups. For example, a button is found by finding all rectangles that have shading above and below.
Most functions take the screen-state as a parameter because they
must search for pixel-groups that are the right shape and in the
right relationship with other pixel-groups. The screen-state
referred to here is the association list returned by (segment-screen).
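Comparisons between candidate pixel-groups largely come down to tests on bounding boxes. The sketch below models a pixel-group as a (left top right bottom) list and picks, from a list of candidates, the one that most tightly encloses a given group. The function names are hypothetical, and using the smallest bounding-box area as the measure of "tightest" is an assumption, not a description of SegMan's own routines.

```lisp
;; A pixel-group is modeled as a (left top right bottom) bounds list
;; (assumption for this sketch; SegMan uses pointers and accessor functions).
(defun box-contains-p (outer inner)
  (destructuring-bind (l1 t1 r1 b1) outer
    (destructuring-bind (l2 t2 r2 b2) inner
      (and (<= l1 l2) (<= t1 t2) (>= r1 r2) (>= b1 b2)))))

(defun box-area (box)
  (destructuring-bind (left top right bottom) box
    (* (- right left) (- bottom top))))

;; Among CANDIDATES that enclose GROUP, return the one with the smallest
;; area, or NIL if none encloses it.
(defun tightest-enclosing (group candidates)
  (let ((best nil))
    (dolist (candidate candidates best)
      (when (and (box-contains-p candidate group)
                 (or (null best)
                     (< (box-area candidate) (box-area best))))
        (setf best candidate)))))

(tightest-enclosing '(50 50 60 60)
                    '((0 0 800 600) (40 40 100 100) (200 200 300 300)))
;; => (40 40 100 100)
```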
Many of the functions described in this section have been changed; updates to the documentation are in progress.
Debugging functions:
(show-all group-list) recursively calls (show) on each pixel-group in a list.
(all-groups group-list) collects all pixel-groups in the group-list into a single, flat list.
Pixel-group search functions:
(find-group-containing group group-list) returns the pixel-group found in group-list that encloses the given group. If more than one pixel-group encloses the group, the one that most tightly encloses it is returned.
(find-group-to-the-right group group-list) returns the pixel-group found in group-list that is most closely to the right of the given group.
(find-group-to-the-left group group-list) returns the pixel-group found in group-list that is most closely to the left of the given group.
(find-group-above group group-list) returns the pixel-group found in group-list that is most closely above the given group.
(find-group-below group group-list) returns the pixel-group found in group-list that is most closely below the given group.
Widget detection functions:
(raised-p group screen-state) returns t if the given group meets the criteria for a raised object. A raised object has shading directly above and directly below, and the color of the upper shading is lighter than the color of the lower shading.
(lowered-p group screen-state) returns t if the given group meets the criteria for a lowered object. A lowered object has shading directly above and directly below, and the color of the upper shading is darker than the color of the lower shading.
(find-raised screen-state) calls (raised-p) on every rectangle in the screen-state and returns a list of those rectangles that meet the criteria.
(find-lowered screen-state) calls (lowered-p) on every rectangle in the screen-state and returns a list of those rectangles that meet the criteria.
(find-buttons screen-state) returns a list of pixel-groups identified as buttons. A button is a raised pixel-group that does not contain any raised pixel-groups.
(find-windows screen-state) returns a list of pixel-groups identified as windows. A window is a raised pixel-group that has a button in the upper right-hand corner.
(find-text-areas screen-state) returns a list of pixel-groups identified as text areas. A text area is a white, lowered pixel-group.
(find-check-boxes screen-state) returns a list of pixel-groups identified as check boxes. A check box is a pixel-group that is lowered, white, and square. It may or may not have a check mark in it.
(find-vertical-scroll-bars screen-state) returns a list of pixel-groups identified as vertical scroll bars. A vertical scroll bar is made up of a rectangle containing an upward facing triangle, a rectangle containing a downward facing triangle, and the space between. This function returns safe pixel-groups.
(find-horizontal-scroll-bars screen-state) returns a list of pixel-groups identified as horizontal scroll bars. A horizontal scroll bar is made up of a rectangle containing a left facing triangle, a rectangle containing a right facing triangle, and the space between. This function returns safe pixel-groups.
(find-radio-buttons screen-state) returns a list of pixel-groups identified as radio buttons. Radio buttons are single pixel-group features under the screen-state key, :circle-areas.
(find-drop-boxes screen-state) returns a list of pixel-groups identified as drop boxes. A drop box is a pixel-group that is lowered, white, wider than it is tall, and has a downward facing triangle to the right. This function only returns the lowered component.
Word detection functions:
(find-string string screen-state) returns a list of pixel-groups. Each pixel-group represents an instance of the specified string on the screen. This function returns safe pixel-groups.
(find-strings screen-state) returns a list of strings that are words found on the screen.
Adobe Illustrator widget detection functions:
(find-illustrator-color-reverse-widgets screen-state) returns a list of pixel-groups identified as color reverse widgets. The color reverse widget swaps the foreground and background colors in Adobe Illustrator.
(find-illustrator-default-palette-widgets screen-state) returns a list of pixel-groups identified as default palette widgets. The default palette widget resets the Adobe Illustrator palette to white foreground with black background.
The Adobe Illustrator canvas's border can be found using (second (assoc :illustrator-canvases screen-state)). The border around objects that are selected in the Adobe Illustrator canvas can be found using (second (assoc :illustrator-selections screen-state)).
Functions for detecting change:
(wait-for-object find-func &optional initial-screen &key (interval 1) (segment-func #'segment-screen) (test #'groups-equal-p)) returns two values: a list of objects of the specified type that have appeared since the screen was last segmented, and a bounding region around all the areas of the screen that have changed. The find-func parameter is any function that returns a list of pixel-groups. The find-func should take a single argument: the screen representation (association list). Examples of valid find-funcs are (find-buttons) and (find-windows). The initial-screen is the screen-state of the last screen segmentation that has occurred. The function continuously re-segments the screen until it finds new objects that match the given type. This function will loop infinitely if no change is detected. The interval specifies how many seconds the function should sleep between segmentations. This function does not return safe pixel-groups. Additionally, (wait-for-object) makes repeated calls to segment-func, so all unsafe pixel-groups obtained before the call will be dangling pointers after the call completes (except the pixel-groups returned by the function itself). By default, segment-func is the function (segment-screen). However, any function can be substituted that takes a bounds list, (left top right bottom), as an argument and returns an association list of pixel-groups.
Menu navigation functions:
(simple-start name screen-state) causes SegMan to move the mouse cursor to the Microsoft Windows Start Button, click on it, and search the Start Menu for the application with the specified name. This function does not search sub-menus. SegMan then moves the mouse cursor to the application name and clicks on it. This function assumes the Start Button is visible on the screen and that the application name is listed in the Start Menu. This function will cause all unsafe pixel-groups to become dangling pointers.
(complex-start name screen-state) performs the same task as (simple-start) but also searches all sub-menus in depth-first order. This function returns the screen-state just before the application name is clicked on. This is done so that any functions that call (complex-start) will be able to detect changes. All unsafe pixel-groups, except those returned by the function itself, will become dangling pointers.
(new-state) causes the current state to be updated by calling functions to resegment the screen and process its contents. This function has an alternative name, (update-current-state), with equivalent functionality.
(find-all-in-state specification) returns a set of objects that match the specification. The type of the specification determines the objects that the function returns. There are several possibilities:
A symbol specification causes different types of objects to be returned. Any one of the symbols string, :string, :word, or :words causes all words to be returned. A symbol associated with a pattern (e.g., :rectangles, :left-triangles, or letter-a through letter-z) causes the groups for that pattern to be returned. A symbol associated with object types (specifically, window, windows, menu, menus, button, buttons, horizontal-scroll-bar, horizontal-scroll-bars, vertical-scroll-bar, vertical-scroll-bars, drop-box, drop-boxes, warning-dialog, warning-dialogs, radio-button, radio-buttons, check-box, check-boxes, text-area, or text-areas) causes all objects of those types to be returned. (In case of possible "collisions" between symbols, group symbols are tested first, then object symbols.)
A character specification causes characters to be returned. When a specific character is given (e.g., #\D), objects representing all instances of the character on the screen are returned, without regard for case.
A string specification initiates a search for strings (which are constructed of characters). When a specific string is given (e.g., "File"), objects representing all instances of the string on the screen are returned. As with characters, the test of strings is not case-sensitive.
A list specification causes a recursive call to find-all-in-state for each element, and the results are appended and returned.
(find-in-state specification) returns a single object, but otherwise behaves identically to find-all-in-state.
(wait-for-state-change specification) polls the interface until a new object of the given specification appears. Here, unfortunately, specification does not have the generality above; it can only be a string or a symbol associated with an object (i.e., it cannot be a character, list, or pattern symbol).
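The way find-all-in-state branches on the type of its specification can be pictured with a small type dispatch. The sketch below is not the SegMan implementation: it searches a made-up table of named objects instead of the screen, covers only symbol, string, and list specifications, and uses hypothetical names (*toy-objects*, find-all-sketch) throughout.

```lisp
;; Made-up table of (name . object) pairs standing in for the screen state.
(defparameter *toy-objects*
  '(("File" . 101) ("file" . 102) ("Edit" . 103)
    (button . 201) (button . 202)))

(defun find-all-sketch (specification)
  (typecase specification
    ;; NIL is both a symbol and a list, so handle it first.
    (null '())
    ;; List: recurse on each element and append the results.
    (cons (loop for item in specification
                append (find-all-sketch item)))
    ;; String: case-insensitive match, as described for strings above.
    (string (loop for (name . object) in *toy-objects*
                  when (and (stringp name) (string-equal name specification))
                    collect object))
    ;; Symbol: exact match on the symbol.
    (symbol (loop for (name . object) in *toy-objects*
                  when (eq name specification)
                    collect object))
    (t '())))

(find-all-sketch "FILE")            ; => (101 102)
(find-all-sketch 'button)           ; => (201 202)
(find-all-sketch '("edit" button))  ; => (103 201 202)
```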
Ideally, this is the level at which application-specific functions should be built. For example, the logic of the function select-from-menu, if we strip out most of the details, looks something like the form below. (The "details" we've stripped out of this description include directives concerning when a screen segmentation can be dispensed with, the types of objects that should be returned, and so forth.)
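(defun select-from-menu (remaining-items)
  ;; Block until a menu appears on the screen.
  (wait-for-state-change 'menu)
  (let* ((menu (find-in-state 'menu))
         ;; Look for the next item's text within the menu's bounds.
         (string (find-in-state (first remaining-items) (bounds menu))))
    (move-to string)
    (single-click)
    ;; Recurse only if further (cascading) menu items remain.
    (when (rest remaining-items)
      (select-from-menu (rest remaining-items)))))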
This function takes a list of strings, and starts by waiting for a menu to appear, assuming that a click has been made on a menu header (e.g., "File", "Edit", etc.) Once the menu appears, the first item in the list is searched for and clicked on. If there are remaining strings, this means that cascading menus are expected, and the function recurses. The function can then be combined with others to provide more sophisticated behavior.