Luca Giachino - August 1993


CEFRIEL - Via Emanueli, 15 - 20126
Milano - Italy

Work performed while visiting the Telepresence Project at the University of Toronto


Portholes is a system that provides support for general awareness in a distributed work group. It gathers still images from different environments (workspaces and public areas) and distributes them to subscribers of the service, without performing any semantic interpretation. It is up to the subscribers to look at the images and to assign them a meaningful interpretation.

Given this architecture, it is interesting to investigate the automatic extraction of information from the environment through the images collected by Portholes. This information can be used for issuing notifications of events concerning people's presence and availability. In this way the Portholes architecture becomes a means for performing activity sensing on the environment, thereby providing a bridge between passive awareness and active awareness.

In order to provide active awareness, the kind of activity we are generally interested in monitoring is human activity, so that we can issue notifications concerning relevant actions by people, such as leaving their workspace, logging out from the iiif server, or changing the state of their office door. Among these possible actions, a person's presence in front of his or her camera can be detected by properly comparing sequential images grabbed from that camera. Other actions can be detected more easily by querying the iiif server.

In our media space, one of the reasons for using the Portholes images for sensing human activity, instead of other environment sensors, is that the cameras, the wiring and the iiif server are already there. Another reason for using images provided by cameras is that we can expect, in the coming years, an increasing number of personal computers equipped with cameras, together with speakers and microphones.

In order to detect the presence of human activity using the Portholes images, I wrote a software module, called CmpFrames, that, given two consecutive frames of the same scene, guesses whether or not there is human activity. CmpFrames has been designed to provide a guess very quickly, but it provides only a 'guess'. For the sake of execution speed, it assumes that the scene has some specific properties, such as a stable background. If these properties are not met, then the guess can be wrong, and the sensed activity can be of another nature (a screen saver, for instance).

CmpFrames is the fundamental building block upon which I created a personal notification tool, called Monitor, that provides notifications of people's presence and availability. It was mainly written for testing the CmpFrames module, but it turned out to be an interesting personal notification tool in its own right.

At the time of this writing, the CmpFrames module has not yet been integrated into Portholes. The images are gathered by asking the iiif server to connect people's cameras to a frame-grabber. Users' privacy is guaranteed by the access control performed by the iiif server.

The following two sections describe the concept of active awareness and the CmpFrames algorithm. If you are mainly interested in testing and using the Monitor program, you can skip these two sections and read The Monitor Program section, which explains the purposes of the program and how to use it.

A Bridge from Passive Awareness to Active Awareness

Before describing in detail the CmpFrames software module and the Monitor program, it is useful to briefly describe what I mean by a bridge between passive awareness and active awareness. The terminological distinction between passive and active awareness is somewhat new, and needs more investigation than I have given it so far. Since the terminology in the computer-supported awareness field is not yet established, we are in a position to refine and coin terminology, with all the possible debates that can follow.

In terms of communication involving human beings and computers, Portholes can be considered a system that allows HH (human-to-human) communication in the background. The following figure shows a conceptual framework (borrowed from Professor Bill Buxton) for describing the possible kinds of communication we find in systems that involve both human beings and computers (possibly connected by networks). In this framework the Portholes system falls in the HH background communication category, since it allows human-to-human communication performed without explicit user actions.

Figure 1: The Communication Arena

In addition to investigating the individual areas, it is also interesting to explore the possible migrations from one area to another. The CmpFrames software module provides the path for one possible migration, shown in the next figure. There we see that, by means of the CmpFrames module, Portholes detects on behalf of a client that a user is now available (HC communication in the background), alerts the client through the GUI, asking if he or she wants to call a video meeting (HC communication in the foreground), and finally a video meeting is called (HH communication in the foreground).

Figure 2: A Possible Path from Background to Foreground HH Communication

According to this scheme, whilst the Portholes system provides passive awareness, that is, awareness gained in the background, the CmpFrames software module and the Monitor program constitute the means for achieving active awareness, that is, awareness gained in the foreground. But the question is: once we enhance Portholes to provide active awareness, how will we consider it? A more general awareness support tool, or a mix of somewhat unrelated facilities? Moreover, we might encounter people's reluctance to have their images periodically distributed among other users (Portholes), but less so for active awareness purposes that do not imply the distribution and public accessibility of images (CmpFrames and Monitor).

The CmpFrames Software Module

In order to describe the CmpFrames software module and the design issues that drove its realization, we have to describe the scene we consider, which parts of the scene are of interest for the detection of human activity, and how we can gather images of the scene. Moreover, we have to outline the constraints we have to take into account, imposed mainly by the hardware, by the grabbing rate, and finally by the requirement to produce a quick guess at the price of some uncertainty.

The Scene

The scene we consider for detecting human activity is rather complex. We want to monitor people's workspaces, and so we restrict the spectrum of possible scenarios to those you can find in a typical office environment. At the Telepresence Project laboratories we have both closed offices (Toronto) and open offices (Ottawa). Offices can have real or artificial windows with moving scenery behind them (a street, a cloudy and windy sky, a soccer field). If they have real windows, the light inside the office can strongly depend on the weather conditions. In the offices there can be computer monitors with screen savers that modify the displays even when nobody is there. Moreover, in open offices it is likely that people other than the person we want to monitor move in the environment.

The View of the Scene

Besides the peculiarities of the scene we consider, the position and orientation of the camera we use and the quality of the hardware involved also play a crucial role. In an office, the camera from which the images are frame-grabbed can be in different positions. In our media space, for example, some people have only one camera, located in front of them very close to their work desk and oriented towards the person. Others have additional cameras for shooting the whole office space, or for shooting from particular perspectives (like a camera located on the office door). All of these cameras provide different views of the environment, and this augments the complexity of the scene. For our purposes, the farther away the camera is, the better, since we can monitor a larger space. But the drawback is that the chance of having moving objects in the scene, in addition to the person, increases.

The cameras we use provide color video sequences, but for our purposes we consider only gray-level images (256 gray levels). This choice is partially due to the fact that Portholes handles gray-level images in order to reduce the network overhead. These images are quite small (240x160), but convey enough information for human activity detection.

The rate at which the images are provided can range from one every 30 seconds to one every 5 minutes. Portholes generally provides images at a rate of one every 5 minutes. The rate has important implications for the design of the activity detection algorithm, as described later.

The Relevant Information

Our purpose is to detect human activity very quickly. We do not need to recognize the person (a very time-consuming process), since we can assume that in an office most of the activity is performed by the person who actually works in that office (as opposed to colleagues and cleaning staff). We do not even need to extract geometrical descriptions of the image. Any recognition process is generally a complex task, and this is why we decided to avoid any morphological interpretation of the images' contents. Moreover, a human shape recognition system, for example, might require even more constraints on the scene, such as having the person perfectly oriented towards the camera.

The CmpFrames Algorithm

According to the described scene and constraints, I opted for an algorithm that, given two consecutive images, counts the percentage of pixels that can be considered different (according to an error threshold), and then compares this percentage with a quiet threshold. If it is lower than the quiet threshold, the algorithm assumes that there is no activity; otherwise, that there is activity. As long as the images are very similar to one another, we assume that there is no relevant activity. The activity condition is the complementary one, no matter how different the images are. In other words, we actually detect quietness as opposed to human activity.

This approach is quite interesting, since it allows us to detect activity even when the person rotates the camera: the images from one state of the camera to the next will be all the more different, and this is enough to assume that there is activity. Even if the camera is now oriented towards the wall, for at least one more image we can still detect activity, and this guess is correct, since the person had to be there to rotate the camera. The drawback is that we need two frames in order to make a decision, and this leads to a delay in recognizing when the person leaves his or her workspace (this point is explained further in the description of the Monitor program). The fact that we need two consecutive images also means that their histograms must not be altered (for instance, through a normalization process); otherwise the process would easily detect changes that are actually produced by the image processing calculations on the histograms.
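The comparison just described can be sketched in ANSI C as follows. The function name guess_activity and its signature are illustrative, not the actual CmpFrames() interface, and the absolute error threshold is assumed to have been computed beforehand (its derivation from a percentage is described later).

```c
#define WIDTH  240
#define HEIGHT 160

/* Compare two consecutive gray-level frames of the same scene.
   Returns 1 if activity is guessed, 0 if the scene is considered quiet. */
int guess_activity(const unsigned char *prev, const unsigned char *cur,
                   int abs_error_threshold, double quiet_threshold_pct)
{
    long changed = 0;
    long total = (long)WIDTH * HEIGHT;
    long i;

    for (i = 0; i < total; i++) {
        int diff = (int)cur[i] - (int)prev[i];
        if (diff < 0) diff = -diff;
        if (diff > abs_error_threshold)   /* pixels considered different */
            changed++;
    }

    /* Below the quiet threshold we assume quietness; otherwise activity. */
    return (100.0 * changed / total) >= quiet_threshold_pct;
}
```

Note that the sketch detects quietness and reports activity as the complementary condition, exactly as described above.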

In order to make this approach effective, we assume that when a person is not in the view of the camera, consecutive images are very similar to one another, that is, the background does not change. This is obviously a strong limitation, since in the view of the camera, as already mentioned, there are likely to be moving objects other than the person we want to monitor. Moreover, even in the presence of a very stable background, the hardware limitations are likely to produce consecutive images that are not identical pixel by pixel.

The Hardware Limitations

The hardware between the scene and the grabbed image consists of a camera, a wire with possible intermediate analog devices, and a frame-grabber. All these components produce noise on the analog video signal. In our media space, for instance, many cameras are located quite far from the frame-grabber. In between there are meters of wire and sometimes devices for conveying different video signals on the same wire.

The hardware limitations produce two different effects on the images. The first is the presence of spikes, that is, isolated pixels that have a very different gray level from their neighborhood. The second is that, even with the same scene, pixels in the same positions can have different gray levels in two consecutive images.

The algorithm we propose filters the images using a convolution mask that smoothes out the discrepancies. This process reduces the incidence of the spikes, and cleans the image, producing more uniformity between images of the same scene.
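A minimal 3x3 mean filter gives the flavor of such a smoothing convolution; the actual mask used in the system (e.g. by pnmconvol) may well differ.

```c
#include <string.h>

#define W 240
#define H 160

/* Replace each interior pixel with the mean of its 3x3 neighborhood;
   border pixels are copied unchanged. Spikes (isolated outliers) are
   averaged away, so consecutive frames of a still scene become more alike. */
void smooth3x3(const unsigned char *in, unsigned char *out)
{
    int x, y, dx, dy;

    memcpy(out, in, (size_t)W * H);        /* keep the 1-pixel border */
    for (y = 1; y < H - 1; y++) {
        for (x = 1; x < W - 1; x++) {
            int sum = 0;
            for (dy = -1; dy <= 1; dy++)
                for (dx = -1; dx <= 1; dx++)
                    sum += in[(y + dy) * W + (x + dx)];
            out[y * W + x] = (unsigned char)(sum / 9);
        }
    }
}
```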

After this smoothing process is completed, the algorithm begins to compare the images. During the comparison of pixels the algorithm uses a proximity level, or error threshold, for deciding if two pixels are to be considered equal or not. This error threshold helps to reduce the incidence of the hardware limitations.

The error threshold is a parameter of the algorithm that can be tuned in order to accommodate the hardware limitations. If the hardware is of good quality, then the error threshold can be lowered. The algorithm requires an error threshold expressed as a percentage. It then calculates the absolute error threshold by: 1) extracting the maximum gray level from each of the two images, 2) choosing the lower one, 3) applying the percentage to this gray level. In this way we use the worst conditions for calculating the absolute error threshold, and we recalculate it for every pair of images.
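The three steps above could look like this in C (the function name is hypothetical):

```c
/* Derive the absolute error threshold from a percentage, using the
   darker of the two frames as the worst-case reference level. */
int abs_error_threshold(const unsigned char *f1, const unsigned char *f2,
                        long npixels, double error_pct)
{
    int max1 = 0, max2 = 0, lower;
    long i;

    /* 1) maximum gray level of each image */
    for (i = 0; i < npixels; i++) {
        if (f1[i] > max1) max1 = f1[i];
        if (f2[i] > max2) max2 = f2[i];
    }

    /* 2) choose the lower of the two maxima */
    lower = (max1 < max2) ? max1 : max2;

    /* 3) apply the percentage to that gray level */
    return (int)(lower * error_pct / 100.0);
}
```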

The Scene Constraints

The quiet threshold is used to decide if the percentage of changed pixels is high enough to indicate the presence of activity. By choosing a proper level, we can discriminate between small and substantial changes in the scene. In this regard, the grabbing rate becomes very important. If the camera video signal is sampled every 5 minutes, then it is quite likely that, if a person is in front of the camera in two sequential images, he or she will have moved substantially (barring sleeping conditions), even if deeply concentrated on something. In this case, the quiet threshold can be increased enough to let the algorithm consider small changes as absence of activity (a screen saver or a truck moving on the street). On the other hand, if the rate is one frame every 30 seconds, then it can happen that in such a small amount of time the person does not move much, perhaps because he or she is typing on the keyboard. In this case the threshold must be lowered so that even very small movements of the person are enough to detect the presence of activity. But the drawback is that small changes in the background could also trigger such a condition, even if the person is not there.

The distance of the action from the camera is also important. Since we use a percentage of changed pixels for detecting action, the smaller the action appears in the images (because the person is far away), the lower the percentage of changed pixels. In this case the quiet threshold should be lowered accordingly.

Finally, the brightness of the scene affects the amount of noise introduced by the camera. As the scene becomes darker and darker, the noise introduced by the camera increases, and a higher error threshold might be required.

As appears from this description, the thresholds are very important for the algorithm to work properly under different conditions. Even if the best solution would be to have adaptive thresholds, good performance can already be obtained by using personal profiles that store the best empirical thresholds for every person we want to monitor. We describe this idea further in the Future Work section.

How to Fine Tune the Thresholds

Our experience with our hardware and the kinds of scenes we encounter shows that an error threshold of 3% is enough to mask the hardware limitations, and that consequently a good quiet threshold is 8%. Note that the effectiveness of the quiet threshold depends on the error threshold. If the error threshold is changed, then the quiet threshold must be retuned.

A good procedure for setting the thresholds is the following. The first threshold to set is the error threshold. In order to accommodate the hardware limitations, you grab several images of a very stable scene, and tune the error level so that the amount of changed pixels detected by the CmpFrames algorithm is lower than 2%. Then you grab images with a person in the scene who tries to move very little, and set the quiet threshold just under the average percentage of changed pixels that the algorithm detects. In the section that describes the Monitor program this procedure is explained in more detail.

The Complexity of the Algorithm

The proposed algorithm has linear complexity. It scans the two images twice. During the first scan it smooths the image with a proper convolution mask and meanwhile computes the maximum gray level of each of the two images. Then, during the second scan, it computes the percentage of pixels that are different and compares it with the quiet threshold. It can be improved by storing the maximum gray level of the second image so that we do not need to recompute it at the next comparison. In this way one of the scans can be saved.
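The suggested improvement, caching the previous frame's maximum gray level, could be sketched as follows (the names and the use of a static cache are illustrative only):

```c
static int cached_max = -1;   /* max gray level of the previous frame; -1 = none */

/* Return the lower of the two frames' maximum gray levels, scanning
   only the new frame: the old frame's maximum comes from the cache. */
int lower_max(const unsigned char *newframe, long npixels)
{
    int new_max = 0, old_max, lower;
    long i;

    for (i = 0; i < npixels; i++)
        if (newframe[i] > new_max) new_max = newframe[i];

    /* On the very first frame there is no cached value: fall back to new_max. */
    old_max = (cached_max >= 0) ? cached_max : new_max;
    lower = (old_max < new_max) ? old_max : new_max;

    cached_max = new_max;     /* this frame is the "previous" one next time */
    return lower;
}
```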

Technical Details

The CmpFrames algorithm is actually implemented by an ANSI C function called CmpFrames(), defined in the source file chkact.c of the distribution package. This source file depends only on the include file chkact.h, and can be linked with any program that needs to detect the presence of activity by comparing environment images. The CmpFrames() function assumes that the images are stored in memory as two dynamic bi-dimensional arrays of unsigned chars, one pixel per char. This kind of array is defined as a pointer to an array of pointers to unsigned chars (that is, typedef unsigned char **frame), and can be easily handled using the functions defined in the source file array.c. However, the chkact.c source is completely independent of the array.c source. In the current version of the CmpFrames() function the smoothing process is not performed; it is assumed to be performed by the calling code. Finally, if requested, the function can store in a third array an image in which the pixels that have been considered changed are set to black, whilst all the others are set to white. This image can be useful for testing and fine-tuning purposes.
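The frame type described above, together with a minimal allocator in the spirit of the helpers in array.c, could look like this (the helper names alloc_frame and free_frame are assumptions, not the actual array.c API):

```c
#include <stdlib.h>

typedef unsigned char **frame;   /* as defined for CmpFrames(): rows of pixels */

/* Allocate a height x width frame as an array of row pointers. */
frame alloc_frame(int width, int height)
{
    int y;
    frame f = malloc(height * sizeof(unsigned char *));
    if (!f) return NULL;
    for (y = 0; y < height; y++) {
        f[y] = calloc(width, 1);
        if (!f[y]) {                 /* undo partial allocation on failure */
            while (--y >= 0) free(f[y]);
            free(f);
            return NULL;
        }
    }
    return f;
}

void free_frame(frame f, int height)
{
    int y;
    for (y = 0; y < height; y++) free(f[y]);
    free(f);
}
```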

The Fcmp Tool

In order to provide user-level access to the CmpFrames() function, a program called Fcmp has been written. It uses the CmpFrames() function for comparing images stored on disk in the PGM format (see the public domain PbmPlus package). It mainly parses the command line for arguments, reads and stores the images in memory, and calls the CmpFrames() function. If requested (by option -d), the program stores in a file a PBM image in which the black pixels denote the pixels that have been considered different by the CmpFrames() function, whilst the white pixels denote areas in which no changes have been detected. Note that the program does not run the convolution mask on the images. This filtering operation can easily be accomplished at the user level by running the pnmsmooth program, contained in the PbmPlus package, on the images before calling the Fcmp program. The syntax is as follows.

Usage: fcmp frame1 frame2 [-e error_level][-q quiet_level][-d diffFile][-s][-v]

Exit status:

This program can be compiled with an ANSI C compiler, like the gcc compiler.

The Monitor Program

The Monitor program is a personal notification tool that exploits the CmpFrames software module to provide notifications of events concerning people's availability. The tool is intended to be easy to use, and does not require specific skills. It provides an alert (a sound and a text event description) as soon as the monitored person arrives at or leaves his or her workplace. If X Windows is available, you can also ask Monitor to output the grabbed frames in a window, one after the other, or to display the two frames of the last comparison together with the changed-pixels frame (see the Fcmp program). These graphic outputs can be useful for testing the program: with a quick glance you can easily tell if the program has made a good guess. Moreover, the changed-pixels frame can be very effective for checking whether the noise generated by the hardware is properly reduced.

Monitor can also work silently, providing only return codes that can be used by other shell scripts and/or commands for further processing. This mode can be used for building other personal tools on top of the Monitor program. The package includes two examples of this kind of personal tool: Areyouthere detects whether the person is there or not, whilst Whenarrives returns only when the person has returned or activity is detected. These programs are very simple shell scripts that exploit the silent facility provided by Monitor. The main difference between using these programs and using the Monitor program directly is that Areyouthere and Whenarrives exit once they can provide their notifications, whilst the Monitor program can be run for continuous monitoring purposes.

Originally, one of the main purposes of the Monitor program was to allow easy testing of the CmpFrames software module before integrating it into the Portholes system, but it turned out that it can be used as a useful standalone tool, completely independent of the Portholes system.

Monitor relies on the iiif server for connecting users' cameras to a centralized frame-grabber. By means of this interaction with the iiif server, the program is able to detect specific conditions related to the state of the person's Telepresence Stack. This enhances the set of possible notifications that the program can issue.

Event and Status Notifications

A notification is a message delivered to the user who has executed the program. Generally, a notification is issued when a related condition is recognized. Notifications are always time-stamped, but the time-stamp must not be considered the exact moment at which the condition arose, but rather the moment at which the program was able to detect that condition. This is due to the fact that Monitor polls for conditions.

Monitor can provide two main kinds of notifications. Status notifications can be considered the recognition of a state (and not of a change of state). These notifications are issued when there is no memory of a previous state. For instance, when the program is executed to monitor John, and John is in his office, the program reports a status of activity detected. Clearly, this is not an event. As soon as a change of state in the scene is detected, an event notification is issued: John has arrived, or John has closed his door, or John has logged out.

The following is a comprehensive list of all the categories of notifications that Monitor can issue.

EVENT) A change of state has been detected (person has arrived
or has left, has closed or opened his or her door,
has logged in or out from the iiif server)

STATUS) A status is reported only when the available information
is not enough to detect any change of state (for
instance, at the beginning of an execution the first two
frames are enough to say whether or not there is activity,
but not for detecting a change of state, since we do not
have background information yet)

DEBUG) Debug warnings issued only if option -d is indicated
on the command line

ERROR) An error condition occurred. The program aborts

How it Works

Monitor is a shell script that uses the Fcmp program for comparing the frames. Basically, Monitor periodically asks the iiif server to connect the person's or node's camera to the frame-grabber, and uses the grab program for grabbing images (more about the grab program later).

Once a frame has been grabbed, Monitor cleans it using the PbmPlus tool pnmconvol. Then, if a previous frame is available, Monitor calls the Fcmp program described in the previous section. Fcmp performs the comparison between the two frames and detects if there is activity. Monitor uses this information and compares it with the previous result, if any, in order to guess what happened on the scene.

Because of the way the absence of activity is detected by the Fcmp program (that is, we have no activity when two sequential frames are very similar to one another), the Monitor program lags one frame behind in detecting when the person has left (think about it). Anyway, this is not a big problem, since the most used feature should be the notification of people's arrival.


The package contains the following files:

README A readme file
doc The directory containing documentation files, like
monitor.mcw A MacWord version of this report
monitor.rtf A RTF version of this report
monitor.txt An ASCII version of this report
            A PostScript version of this report
fcmp Compares two frames and detects activity
(to be manually compiled on a different architecture)
grab Grabs a frame from the local frame-grabber
(to be manually compiled on a different architecture)
monitor The Monitor tool
areyouthere The Areyouthere tool
whenarrives The Whenarrives tool
pnmconvol The PbmPlus program for applying convolution
filtering to frames
pnmcat The PbmPlus program for merging different frames in one
pnmscale The PbmPlus program for scaling frames
fcmp_src The directory containing the sources for the fcmp program
grab_src The directory containing the sources for the grab program

In order to install the tool, you need a workstation binary-compatible with SunOS 4.1.3. Once you have untarred the package, you might want to edit the header of the shell script 'monitor' and change the default configuration. If you already have an iiif client program, all you have to do is compile the fcmp program, if your workstation is different from the one mentioned. The sources and the Makefile of the fcmp program are stored in the fcmp_src directory. If you do not have an iiif client program, then you have to obtain one. As regards the grab program for controlling the frame-grabber, the binary version (grab) is provided along with the source code (grab.c) and Makefile.

To summarize: