This article is from the archives of the UB Reporter.

Social media provide “digital gold”
for web researchers

  • “It’s amazing how much you reveal just through your language.”

    Rohini Srihari
    Associate Professor of Computer Science and Engineering
By CHARLOTTE HSU
Published: July 22, 2010

Through state-of-the-art data mining techniques, researchers are transforming the mess of content exchanged through social media into useful information, Rohini Srihari, associate professor of computer science and engineering, told an audience of about 100 people at her UBThisSummer lecture July 21.

The material that users of Facebook, Twitter, YouTube, blogs, chat rooms, discussion boards and other online forums post each day is like digital gold: The information locked away in those communications can help the government unravel terrorist networks or enable businesses to gauge their reputations and increase customer satisfaction.

“If you are a company, let’s say Rolex or BMW or, heaven forbid, BP,” Srihari said, “what are people saying about me? And when they talk about me, is it positive or negative? And when they talk about me, what kind of adjectives are they using to talk about me?”

These are the questions the nebula of information on the Web can answer. JetBlue Airways, which receives more than 400 pieces of feedback from patrons each day, uses data mining to track performance—to find out which aspects of the company’s operations people like, and which aspects might need improvement. The next step will be to mine social media sites to learn what people are saying about JetBlue.

Google has put its data trove to work, too. By studying the locations of users entering flu-related search terms on its online query system, the Internet giant was able to identify localized outbreaks in early stages.

With the amount of information available on the Web, computers, which can read about 100,000 words per minute, are key to mining data because they can sift through the material so much more quickly than humans can. The challenge for scientists like Srihari is to devise computational techniques that enable machines to “understand” the meaning of an excerpt of text and extract relevant information.

A system that makes too many mistakes, one that misinterprets syntax, is no good. But getting it right isn’t always easy. Srihari pointed out that, on the Internet, people tend to write casually, spelling words incorrectly, employing abbreviations (“IMHO,” for example, in place of “in my humble opinion”), and otherwise playing with language (typing “happy birthday” with several “As” and 10 exclamation points, for instance). Some people switch between languages, creating further complications.

And, of course, computers don’t think like humans, so the algorithms engineers design must enable machines to recognize parts of speech, identify the subjects and objects of written excerpts, and determine the writer’s attitude toward a subject. One technique to classify sentiments involves establishing a set of words that are generally positive and a set that are generally negative, and analyzing how often words in a written excerpt appear in a Google search alongside words in the positive and negative sets.
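The word-set technique described above can be sketched in a few lines of Python. This is only an illustration, not the system presented in the lecture: the seed words, the function names `orientation` and `classify_excerpt`, and every count in `cooccurrence` and `seed_hits` are hypothetical stand-ins for figures a real system would gather from search-engine queries. The score is a log ratio in the spirit of pointwise-mutual-information scoring, which is one common way to turn such counts into a sentiment estimate.

```python
import math
import re

# Hypothetical seed sets; the lecture did not list specific words.
POSITIVE_SEEDS = ["excellent", "good"]
NEGATIVE_SEEDS = ["poor", "bad"]

def orientation(word, cooccurrence, seed_hits, smoothing=0.01):
    """Score a word: above zero leans positive, below zero leans negative.

    cooccurrence maps (word, seed) to how often the pair appears together;
    seed_hits maps each seed word to how often it appears overall.
    """
    pos = sum(cooccurrence.get((word, s), 0) for s in POSITIVE_SEEDS) + smoothing
    neg = sum(cooccurrence.get((word, s), 0) for s in NEGATIVE_SEEDS) + smoothing
    pos_total = sum(seed_hits[s] for s in POSITIVE_SEEDS)
    neg_total = sum(seed_hits[s] for s in NEGATIVE_SEEDS)
    # Dividing by the seed totals keeps a very common seed from dominating.
    return math.log2((pos * neg_total) / (neg * pos_total))

def classify_excerpt(text, cooccurrence, seed_hits):
    """Average the word scores of an excerpt and label the result."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return "neutral"
    total = sum(orientation(w, cooccurrence, seed_hits) for w in words)
    average = total / len(words)
    if average > 0:
        return "positive"
    if average < 0:
        return "negative"
    return "neutral"

# Made-up counts for illustration only; these are not real search results.
cooccurrence = {
    ("reliable", "excellent"): 80, ("reliable", "good"): 120,
    ("reliable", "poor"): 10, ("reliable", "bad"): 15,
    ("delayed", "excellent"): 5, ("delayed", "good"): 20,
    ("delayed", "poor"): 60, ("delayed", "bad"): 90,
}
seed_hits = {"excellent": 1000, "good": 2000, "poor": 1000, "bad": 2000}

print(classify_excerpt("Reliable service", cooccurrence, seed_hits))
print(classify_excerpt("Delayed again", cooccurrence, seed_hits))
```

Words with no recorded co-occurrences score zero in this sketch, so they dilute the average without changing its sign.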

A creative solution for the “IMHO” and “happy birthdaaaaay!!!!!!!!!!” conundrum is to post problem phrases to Amazon Mechanical Turk, an online marketplace through which employers can pay people small sums of money (5 cents, for instance) to complete simple tasks that require human intelligence, such as spelling out “IMHO.” Later, a machine learning system can use the data entered into Mechanical Turk to automatically correct noisy text.
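A rough sketch of the cleanup step, under loud assumptions: a small lookup table stands in for the machine learning system, and its single entry is a hypothetical example of what crowd workers might supply. The names `CROWD_EXPANSIONS` and `normalize` are invented for this illustration, and nothing here calls the Mechanical Turk service itself.

```python
import re

# Hypothetical table standing in for answers collected from crowd workers.
CROWD_EXPANSIONS = {
    "imho": "in my humble opinion",
}

def normalize(text, expansions=CROWD_EXPANSIONS):
    """Clean up casual text: shrink stretched characters, expand abbreviations."""
    lowered = text.lower()
    # Collapse any character repeated three or more times down to one.
    # Caveat: this also turns a stretched "cooool" into "col", not "cool".
    collapsed = re.sub(r"(.)\1{2,}", r"\1", lowered)
    # Swap each word for its expansion when the table has one.
    return re.sub(
        r"[a-z']+",
        lambda match: expansions.get(match.group(0), match.group(0)),
        collapsed,
    )

print(normalize("IMHO happy birthdaaaaay!!!!!!!!!!"))
# prints: in my humble opinion happy birthday!
```

A trained model would generalize to abbreviations and spellings it has not seen before, which a fixed table cannot do; the table only shows where the human-supplied answers would plug in.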

Though researchers have increased the accuracy of data mining in recent years, the field is still growing, with plenty of room for improvement, Srihari said. And as scientists expand their capabilities, the demand for their services is growing.

Governmental agencies involved in everything from health to security are likely to begin investing more in mining social media for information, Srihari said. Politicians and other decision-makers could comb sites like Facebook and Twitter to find out what members of the public are saying about an issue. Small companies that can’t afford to spend lavishly on market research might turn to data mining to figure out what customers want.

“It’s amazing,” Srihari said, “how much you reveal just through your language.”

The UBThisSummer lecture series ends July 28 with a talk by SUNY Distinguished Professor Esther Takeuchi, winner of the National Medal of Technology and Innovation. Her talk will be titled “Toward the Bionic Human: Medical Devices and How They are Powered.”