Show & Tell Image-Understanding Software "Looks," "Listens," Stores And Retrieves

Release Date: August 25, 1995


CHANTILLY, Va. -- A new image-understanding software system that catalogues, annotates, stores and retrieves images based on its "world-knowledge" and verbal instructions from a human user has been developed by researchers at the University at Buffalo.

Called Show & Tell, it is the first computer system to combine image processing, natural-language processing, speech recognition and spatial reasoning for the purpose of storing and retrieving images from a database, according to Rohini Srihari, Ph.D., research scientist with UB's Center of Excellence for Document Analysis and Recognition (CEDAR) and research assistant professor of computer science.

She said potential applications for Show & Tell include any situation in which pictures must be retrieved from large databases based on the computer's understanding of what is in them. The system is designed to index surveillance photographs; other potential uses include indexing medical X-rays.

A prototype of the system will be demonstrated at the Advanced Research Projects Agency/Software and Intelligent Systems Technology Symposium (ARPA/SISTO) from 6-9 p.m. Aug. 28 and all day Aug. 29 in the Westfields International Conference Center, 14750 Conference Center Drive, Chantilly, Va.

The system was developed with funding from ARPA to expedite image analysis at government agencies that index hundreds of surveillance photos taken by satellites for intelligence purposes.

"The idea behind Show & Tell is to develop a tool that makes the image analyst's life easier by automatically annotating images while allowing the analyst to remain oblivious to all the techniques that have been utilized in the system," Srihari explained.

She noted, for example, that a physician would be able to talk to the system about an X-ray he or she is examining. While the physician's verbal observations would be stored through the speech-recognition component, the computer-vision component would store the corresponding visual data.

"If the data must be sent to another physician, that doctor will have immediate point and click access to what his or her colleague has observed," she added.

The UB researchers' goal in developing the system was to exploit linguistic context in computer vision -- helping the computer figure out what it is "seeing" based on textual clues provided by the human user.

The result is an extremely user-friendly system that overcomes many drawbacks of other image-analysis systems.

In order to index images, Srihari explained, computer systems must be given information about each image that is loaded into them. Currently, most image-retrieval systems depend to a great extent on the image analyst providing the system with very precise, three-dimensional information about individual images, down to the measurements of buildings and even the camera angle from which a picture was taken.

"The Show & Tell system, on the other hand, detects buildings and other structures and features in an image based on purely qualitative information that is provided by the analyst speaking to the system," said Srihari.

The menu-driven system also has significant "world-knowledge" built into it. For example, it knows that a gymnasium is an athletic facility. If a user asks Show & Tell to "find pictures containing athletic facilities," it will highlight a building that may previously have been identified only as a gymnasium.

In this way, the system uses linguistic context to aid the computer's vision.
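
The sketch below illustrates, in broad strokes, how such a world-knowledge hierarchy can broaden a query; the hierarchy, labels and function names are illustrative assumptions, not the actual Show & Tell data structures.

```python
# Minimal sketch of how a world-knowledge hierarchy can broaden a query.
# The hierarchy, annotations and names below are illustrative only.

# Hypernym hierarchy: each label maps to its more general category.
IS_A = {
    "gymnasium": "athletic facility",
    "stadium": "athletic facility",
    "athletic facility": "building",
    "headquarters": "building",
}

def generalizations(label):
    """Yield a label and every broader category above it."""
    while label is not None:
        yield label
        label = IS_A.get(label)

def find_images(annotations, query_label):
    """Return image IDs whose annotated objects match the query label,
    either directly or through a broader category."""
    hits = []
    for image_id, labels in annotations.items():
        for label in labels:
            if query_label in generalizations(label):
                hits.append(image_id)
                break
    return hits

# An image annotated only as containing a "gymnasium" is still
# returned for the query "athletic facilities".
annotations = {"img_001": ["gymnasium", "road"], "img_002": ["headquarters"]}
print(find_images(annotations, "athletic facility"))  # ['img_001']
```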

One of the first processes the system performs is to automatically locate roads in a given picture.

"The task is to label salient things in an image," said Srihari. "We start each image by clicking on the 'Find Roads' menu because roads are important, they partition each image."

After the roads in an on-screen image are highlighted with a yellow beam, the analyst verbally identifies specific buildings based on their shape and location relative to the roads already identified.
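
A minimal sketch of this kind of qualitative spatial grounding appears below; the region representation, the "east of a road" test and the matching logic are assumptions for illustration, not the system's actual spatial-reasoning code.

```python
# Illustrative sketch (not the actual Show & Tell code) of grounding a
# qualitative spoken description against regions already found in an image.

from dataclasses import dataclass

@dataclass
class Region:
    name: str     # e.g. "building-3"
    cx: float     # centroid x (pixels)
    cy: float     # centroid y (pixels)
    area: float   # size in pixels, a rough proxy for "large"

def east_of(region, road_x):
    """True if the region's centroid lies to the right of a (vertical) road."""
    return region.cx > road_x

def ground_description(regions, road_x, want_large=True):
    """Pick the region best fitting 'the large building east of the road'."""
    candidates = [r for r in regions if east_of(r, road_x)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.area) if want_large else candidates[0]

regions = [Region("building-1", 120, 80, 900), Region("building-2", 340, 200, 2400)]
print(ground_description(regions, road_x=200))  # building-2
```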

The system uses a commercial speech-recognition system that the UB researchers customized for Show & Tell.

As the user speaks, an on-screen text box displays each sentence while the system stores the data. This information may be updated or modified by subsequent users who retrieve the same image later.

During future queries, the computer will automatically provide answers based on this initial input.
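
One way to picture this annotate-then-query flow is sketched below; the class, its methods and the toy "take the last word" parsing stand in for the real natural-language processing and are assumptions, not the actual system.

```python
# Hedged sketch of the annotate-then-query flow: recognized sentences are
# stored as (label -> image region) facts, and later queries look them up.
# Class and method names are assumptions for illustration only.

class ImageAnnotations:
    def __init__(self):
        self.facts = {}   # label -> region identifier

    def add_from_transcript(self, sentence, region_id):
        """Store the object named in a recognized sentence against a region.
        A real system would run NLP here; we simply take the last word."""
        label = sentence.lower().rstrip(".").split()[-1]
        self.facts[label] = region_id

    def find(self, label):
        """Answer a later query such as 'Find the headquarters building'."""
        return self.facts.get(label)

db = ImageAnnotations()
db.add_from_transcript("The building north of the road is the headquarters", "region-7")
print(db.find("headquarters"))  # region-7 -> the region to highlight on screen
```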

"We can say to the system, 'Find the headquarters building in this image,' and the building is highlighted," said Srihari.

"Right now, there is no nearly automatic way to correlate text with an image," she explained. "Our research has focused on developing a computational model for 'understanding' pictures based on accompanying, descriptive text."

The research is an outgrowth of another system developed by Srihari and UB colleagues. Called PICTION, it allows newspaper photographs stored in computer databases to be retrieved based on the system's understanding of caption information.

Funding from ARPA for both projects has totaled nearly $1 million.

Also working on the development of Show & Tell were Zhongfei Zhang and Mahesh Venkatraman, CEDAR research scientists; Rajiv Chopra, Debra Burhans, Charlotte Baltus and Radha Anand, UB computer-science graduate students; Eugene Koontz, linguistics graduate student; and Gongwei Wang, electrical-and-computer-engineering graduate student.

Media Contact Information

Ellen Goldbaum
News Content Manager
Medicine
Tel: 716-645-4605
goldbaum@buffalo.edu