News Bureau Staff
Called Show & Tell, it is the first computer system to combine image processing, natural-language processing, speech recognition and spatial reasoning to store and retrieve images from a database, according to Rohini Srihari, research scientist with UB's Center of Excellence for Document Analysis and Recognition (CEDAR) and research assistant professor of computer science.

The system was developed with funding from ARPA to expedite image analysis at government agencies that index hundreds of surveillance photos taken by satellites for intelligence purposes.

Srihari said potential applications for Show & Tell include any situation where pictures in large databases must be retrieved based on the computer's understanding of what is in the picture. Potential uses include the indexing of medical X-rays. She notes, for example, that a physician would be able to talk to the system about an X-ray he or she is examining. The information the physician provides verbally would be stored through the speech-recognition component, while the computer-vision component would store visual data that corresponds to that verbal input.

"If the data must be sent to another physician, that doctor will have immediate point-and-click access to what his or her colleague has observed," she added.

The UB researchers' goal in developing the system was to exploit linguistic context in computer vision, helping the computer to figure out what it is "seeing" based on textual clues provided by the human user. The result is an extremely user-friendly system. Currently, most image-retrieval systems need precise, three-dimensional information about individual images, down to measurements of buildings and even camera angles.

"The Show & Tell system, on the other hand, detects buildings and other structures and features in an image based on purely qualitative information that is provided by the analyst speaking to the system," said Srihari.

The system also has significant "world knowledge" built in.
For example, it knows that a gymnasium is an athletic facility. If a user asks Show & Tell to "find pictures containing athletic facilities," it will highlight a building that might previously have been identified only as a gymnasium.

Show & Tell uses a speech-recognition engine that the UB researchers customized for the system. As the user speaks, an on-screen text box automatically prints each sentence while the system stores the data. This information may be updated or modified by subsequent users.

"We can say to the system, 'Find the headquarters building in this image,' and the building is highlighted," said Srihari.

Also working on Show & Tell were Zhongfei Zhang and Mahesh Venkatraman, CEDAR research scientists; Rajiv Chopra, Debra Burhans, Charlotte Baltus and Radha Anand, UB computer-science graduate students; Eugene Koontz, linguistics graduate student; and Gongwei Wang, electrical-and-computer-engineering graduate student.
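The kind of "world-knowledge" retrieval described above, where a query for a general category matches images labeled only with more specific concepts, can be sketched with a simple is-a hierarchy. This is an illustrative example only, not the actual Show & Tell implementation; all concept names, image IDs and function names here are hypothetical.

```python
# Illustrative sketch of concept-hierarchy retrieval (hypothetical,
# not the actual Show & Tell code). Each concept maps to its parent
# category; a query matches an image if the query names any concept
# on the path from the image's annotation up to the root.

IS_A = {
    "gymnasium": "athletic facility",
    "stadium": "athletic facility",
    "athletic facility": "building",
    "headquarters": "building",
}

def ancestors(concept):
    """Yield the concept itself and every category above it."""
    while concept is not None:
        yield concept
        concept = IS_A.get(concept)

def find_images(query, annotations):
    """Return IDs of images whose annotated concepts match the query,
    directly or through an is-a ancestor."""
    return [image_id
            for image_id, concepts in annotations.items()
            if any(query in ancestors(c) for c in concepts)]

annotations = {
    "photo_01": ["gymnasium"],
    "photo_02": ["headquarters"],
    "photo_03": ["stadium"],
}

# A query for the general category matches images labeled only
# with the more specific concept:
print(find_images("athletic facility", annotations))
# -> ['photo_01', 'photo_03']
```

In this sketch, asking for "athletic facilities" retrieves the photo annotated only as a gymnasium, mirroring the behavior the article describes.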