'The crash:' CIT working to restore email service
By CHRISTINE VIDAL
News Services Editor
For the second time in less than a week, UB's central email server crashed this past weekend, and remained down for about two and a half days, leaving the majority of the university's faculty, staff and students unable to send or receive email.
And despite round-the-clock work to fix the problem and restore the system, its cause has left Computing and Information Technology (CIT) administrators and personnel with as many questions as answers.
There was no way to foresee the crisis, and once the problem has been solved, the university will need to examine how its email system is set up to ensure that this does not occur again, according to Hinrich Martens, associate vice president for computing and information technology.
"We have to live through this week and come to some stable operation again," Martens added.
"Then we need to do some serious thinking and evaluation. The server failure looms as a real worrisome problem because we have a major commitment next fall with Access '99 and don't want to be exposed to this risk again under any circumstances."
The system crashed Saturday afternoon and a temporary server was online by mid-afternoon Tuesday. The previous week, the system went down late Wednesday afternoon and was available again by Thursday evening.
Hardware for UB's central email server is supplied by Sun Microsystems, Inc., a leading provider of network computing systems, including workstations and servers, and UB's preferred vendor for more than 10 years, Martens said. The server's software is supplied by Veritas Software Corp., one of the largest independent suppliers of storage management software, which has partnerships with Sun, Microsoft, Hewlett-Packard and numerous other vendors.
While no other university that he knows of uses this hardware/software installation, Martens said, many companies around the world use it for file storage. "Given UB's performance requirements, there is nothing better out there," he said.
The problem is the quantity of email that UB generates, he noted. Each day, UB faculty, staff and students send and receive 160,000 to 200,000 pieces of email. In addition, the central email server stores more than 8 million files. And those numbers grow daily.
"To provide an email service of this magnitude, in terms of messages-per-day and total volume of files, puts us near the edge of technology," Martens said. "We have the capacity to process those as long as the system remains OK."
UB has had this particular system, hardware and architecture in place for a year, he added. The only difference between now and a year ago is "the volume of the load being processed and the number of files we are storing have increased beyond some sort of point where the system is getting wacky," he said.
Part of the problem may lie with the university's decision to centralize the majority of its email functions.
"We have made a choice to provide a centralized email server, and put ourselves at the leading edge of what's available. Basically, we're using software that is coming out of the development laboratories with the ink drying. (It's) been tested, but we're into the first delivery of the software, and exposed to the problems," Martens said.
So exactly what happened?
CIT staff noticed at about 8 p.m. last Wednesday, Feb. 3, that the central email system was not performing normally, Martens said, although the problems now appear to have begun a few hours earlier. The problem was traced to a faulty disk controller, which did not give a clear trouble signal. In addition, the overall UNIX system failed to alert operators or managers that there was a problem.
Overnight, attempts were made to restore the files, but even at that point, CIT operators did not know that the problem had been caused by a hardware failure. The failure was discovered Thursday morning, Martens said. By early Thursday evening, CIT had replaced the faulty hardware and, after checking and verifying the integrity of the file system, put UB's central email system back into operation. But aspects of that crash remain unexplained, Martens said. "The system should have given a signal to operators and UNIX system administrators that there was a problem and provide a clear diagnostic. That is one of the worrisome aspects of this whole episode," he said.
Operations seemed normal on Friday, "and continued to be such until probably noon on Saturday. We were actually quite satisfied that the problem was traceable to a hardware failure," Martens said.
Since operations seemed stable, on Saturday CIT decided to proceed with a previously planned hardware upgrade. The system was backed up, then powered down. Everything appeared normal, Martens said, and CIT proceeded with the upgrade, which involved inserting a power array and a solid-state disk array.
Because there already were concerns about the capacity of the centralized system, which stored more than 8 million files, the single file system was divided into smaller volumes holding approximately 700,000 to 800,000 files.
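As a rough illustration of the figures above, the arithmetic can be sketched as follows. The function name and the use of exactly 8 million files are assumptions made for the example; the article gives only "more than 8 million" files and an approximate range per volume, and does not state how many volumes CIT actually created.

```python
def volume_count(total_files: int, files_per_volume: int) -> int:
    """Smallest number of volumes needed to hold total_files (integer ceiling)."""
    return -(-total_files // files_per_volume)

# Assumption: 8 million is used as a lower bound for "more than 8 million" files.
TOTAL_FILES = 8_000_000

fewest = volume_count(TOTAL_FILES, 800_000)  # larger volumes, fewer of them
most = volume_count(TOTAL_FILES, 700_000)    # smaller volumes, more of them
print(f"roughly {fewest} to {most} volumes")
```

Under these assumptions the split works out to somewhere around 10 to 12 volumes, and the count would rise as the number of stored files grows.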
Through Friday and Saturday, all signs appeared normal to CIT staff. But when the system was restarted, it turned out that all the files on the system had been destroyed, Martens said. While a backup exists of files that were created before noon on Saturday, and incoming email was being queued, some files created or received between noon and midnight on Saturday may be unrecoverable, he added.
"We don't know at this point what is actually happening. Needless to say, this is of great concern to us," Martens said Monday afternoon.
CIT, in consultation with Veritas, has been working to find out what caused the system to crash. Efforts early this week have been focused on getting a temporary email system up and running, and on restoring the lost files.
On Monday morning, CIT decided to split the university's central email activities into two parts.
The first, which went into effect Tuesday, allows users to read and respond to mail, but does not allow them to store mail in folders or to access previously stored materials. At the same time, CIT is making a copy of all incoming mail and saving it. This way, "at least people can read their mail and respond to it, but you can't store it," Martens said.
CIT also is working to rebuild users' stored files, which is expected to take some time. By the time the restoration is complete, Martens said, CIT staff hopefully will have a definitive answer from the software vendor on what caused the system failure.
All mail that users cannot file will be redelivered next week. "Unfortunately, people will be seeing a repeat of mail, but we feel this is an effective way to provide mail service. Actual storage and sorting will have to wait until next week," Martens said.
The delay in restoring saved files stems partly from the number of files that must be recovered and partly from time constraints.
"A backup (of the central email system) takes more than 12 hours to occur, and it's done in bits and pieces so it doesn't all conglomerate at the same time," Martens said. "When we do a restore, and this is the first time we've had to do a complete file restore, we have to examine and identify electronically each file. It's very time-consuming because the clock on the wall continues to tick.
"Routine operations are on a 24-hour cycle that must be completed at least once during that time period. Otherwise, we fall behind, and that's the position we're in now," Martens said.
There was absolutely no way to foresee this crisis, Martens emphasized. UB is "one of the very few sites in the world using this software and pushing this many pieces of mail and supporting this many files," he said.
The university needs to better educate the UB community about the advantages and disadvantages of central mail service versus distributed services, Martens said.
"We need to educate ourselves about the pros and cons of our long-term strategy, and we need to accelerate our development efforts toward becoming a more 'bulletproof' email system," he said. But, he conceded, a fail-proof central email system may be as hard to find as Utopia. "You can't completely protect yourself, but you can take steps toward becoming more 'bulletproof,'" Martens said.
The problem is scalability, he said. "The people who work in the computing industry run into this from time to time, that you have a hardware/software configuration that's working fine as long as you don't exceed a certain point...it's a constraint in the system, a bug, that you have to identify and fix."
That's difficult to do with a mail system that is constantly being challenged by demands for greater and faster service, Martens said.
And even once the source of the problem is known and has been corrected, the stability of the system will not be certain right away.
"We will know it's stable after two to four days with no 'hiccups,'" Martens said. "We have to subject the system to the real world, the peculiarities of how traffic comes in. At any one time, 1,200 to 1,400 people are working on the system. That kind of load cannot be simulated."
Martens said he hopes that dividing the file server will help to eliminate future problems. In addition, UB may need to establish a larger margin between operation levels and potential need, installing higher-speed servers and disks that would double the capacity, speed and performance of UB's computing environment in order to create a safe and comfortable margin of use.
"We suspect that this happened (Saturday), either during the shutdown or during the restart after the hardware upgrade, that these files got corrupted. Everything was working up to the point we shut the system down," Martens said.
And he said he is absolutely confident that the source of the crash will be found and, more importantly, fixed.