
[802SEC] Re: [RPRWG] Fw: News Article "Beth Israel Deaconess copes with a massive computer crash "



Bob -

This is not a "problem with Spanning Tree", but a problem with *configuring* (or more to the point, mis-configuring) Spanning Tree. The 7-hop "limit" applies to the default "out of the box" parameter settings, which are defined on the assumption that the max number of hops is 7; the Spanning Tree parameters can (and indeed must) be changed if you intend to construct networks that have the potential to create bigger hop counts. The real problem here is likely to be the uncontrolled growth of a network without applying sufficient thought to the consequences.

This is very far from being a "newly discovered feature". It has been known for as long as Spanning Tree has existed that it is necessary to configure the timers appropriately for the number of hops in the network. The 7-hop defaults are simply a reasonable compromise between usefulness (in terms of the number of hops encountered in the majority of networks out there) and the time taken for the protocol to converge (which increases as the number of hops goes up).
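
To make the timer dependence concrete, here is a rough Python sketch of how one might scale the values for a larger diameter. The two inequalities are the relationships 802.1D requires between the bridge timers; the "one extra second of Max Age per hop beyond 7" scaling is purely an illustrative assumption of mine (each bridge adds at least one second to a BPDU's Message Age), not a rule taken from the standard - in practice you would follow the standard's own derivation or your vendor's guidance.

    import math

    def suggest_timers(diameter, hello=2.0):
        """Sketch plausible 802.1D timer values for a given diameter."""
        extra_hops = max(0, diameter - 7)
        # Assumption: one extra second of Max Age per hop beyond 7, since
        # each bridge adds at least 1 second to a BPDU's Message Age.
        max_age = 20.0 + extra_hops
        # 802.1D relationship: Max Age >= 2 * (Hello Time + 1.0s)
        max_age = max(max_age, 2 * (hello + 1.0))
        # 802.1D relationship: 2 * (Forward Delay - 1.0s) >= Max Age
        forward_delay = max(15.0, math.ceil(max_age / 2.0 + 1.0))
        return {"hello_time": hello, "max_age": max_age,
                "forward_delay": forward_delay}

    print(suggest_timers(7))    # reproduces the defaults: 2.0 / 20.0 / 15.0
    print(suggest_timers(10))   # a 10-hop network: Max Age raised to 23.0

With the default diameter this reproduces the out-of-the-box values; for a 10-hop network it suggests raising Max Age, which is exactly the kind of reconfiguration described above.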

Some of the changes already proposed in 802.1y for RSTP would cause a hard partition of one part of the network from another under these conditions, which may make this kind of problem easier to find; however, if you are constructing a large network, there is no substitute for applying some level of thought and planning to the way it will be put together, and configuring Spanning Tree appropriately, before you start connecting it up.

Regards,
Tony

At 13:41 27/11/2002 -0500, Robert D. Love wrote:
Alan Marshall alerted me to a rather interesting story that ran in yesterday's Boston Globe (appended to the bottom of this email note) describing a massive computer system outage caused by a problem with Spanning Tree.  I followed up by writing to the head of the computer center there.  His reply is attached directly below.  Basically, they violated the 7-hop limit for Spanning Tree.  One consequence of that violation is that the system came down on their heads, producing a three-day outage.  If the system simply didn't allow traffic to continue beyond 7 hops, that would be understandable.  However, having the whole system crash is a very undesirable outcome.  Are there any changes that should be considered in 802.1, or in the standards that use Spanning Tree, to minimize the risk of a sudden massive system crash?
 
Tony, you may want to have 802.1 investigate this problem.  
 
Other working group chairs whose standards use spanning tree may want to alert your working groups to this newly discovered feature.
 
Best regards,
 
Robert D. Love
President, Resilient Packet Ring Alliance
President, LAN Connect Consultants
7105 Leveret Circle     Raleigh, NC 27615
Phone: 919 848-6773       Mobile: 919 810-7816
email: rdlove@ieee.org          Fax: 208 978-1187
----- Original Message -----
From: jhalamka@caregroup.harvard.edu
To: rdlove@ieee.org
Sent: Wednesday, November 27, 2002 11:06 AM
Subject: RE: News Article "Beth Israel Deaconess copes with a massive computer crash "

Here's the technical explanation for you.

When TAC was first able to access and assess the network, we found the Layer 2 structure of the network to be unstable and out of specification with respect to the 802.1D standard. The management VLAN (VLAN 1) was, in some locations, 10 Layer 2 hops from the root.

The conservative default values for the Spanning Tree Protocol (STP) impose a maximum network diameter of seven. This means that two distinct bridges in the network should not be more than seven hops away from each other.

Part of this restriction comes from the age field that Bridge Protocol Data Units (BPDUs) carry: when a BPDU is propagated from the root bridge towards the leaves of the tree, the age field is incremented each time it goes through a bridge. Eventually, when the age field of a BPDU exceeds Max Age, the BPDU is discarded. Typically, this occurs when the root is too far away from some bridges of the network. This issue will impact convergence of the spanning tree.
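
[Editor's note: a toy Python model of that aging mechanism, under the simplifying assumptions that each bridge increments Message Age by exactly one second per hop (the standard requires at least that) and that propagation and queuing delays are negligible. A bridge N hops from the root receives root information that is already N seconds old, so it can hold that information for only (Max Age - N) seconds; the deeper the tree, the fewer consecutive lost hello BPDUs a bridge can tolerate before it discards the root's information and forces a reconvergence.

    def hellos_tolerated(hops, max_age=20.0, hello=2.0, age_increment=1.0):
        """How many consecutive lost hello BPDUs a bridge can survive."""
        arrival_age = hops * age_increment   # Message Age when the BPDU arrives
        remaining = max_age - arrival_age    # time left before the info ages out
        if remaining <= 0:
            return None                      # root information discarded on arrival
        return int(remaining // hello)

    for hops in (7, 10, 15, 20):
        print(hops, "hops:", hellos_tolerated(hops), "lost hellos tolerated")
    # prints 6, 5, 2, and None respectively

At the default timers, a bridge 20 hops from the root would discard its BPDUs on arrival; well before that point, the shrinking tolerance for lost or delayed hellos makes the tree unstable.]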

A major contributor to this STP issue was the PACS network and its connection to the CareGroup network. To eliminate its influence on the CareGroup network, we isolated it behind a Layer 3 boundary. All redundancy in the network was removed to ensure no STP loops were possible.

Full connectivity was restored to remote devices and networks that had been disconnected in troubleshooting efforts prior to TAC's involvement. Redundancy was returned between the core campus devices. Spanning Tree was stabilized, and localized issues were pursued.

Thanks for your support.
-----Original Message-----
From: Robert D. Love [mailto:rdlove@nc.rr.com]
Sent: Wednesday, November 27, 2002 9:26 AM
To: jhalamka@caregroup.harvard.edu
Subject: News Article "Beth Israel Deaconess copes with a massive computer crash "

Dear Dr. Halamka,
 
I read the referenced article with great interest since for the past 20 years I have been deeply involved in the creation of IEEE 802 standards.  These standards include Ethernet, WiFi, and other evolving standards, all of which use the Spanning Tree Protocol.  The article raises the question of whether there is any fundamental weakness in the algorithm that those of us creating the standards need to be aware of.  Therefore, I would greatly appreciate any help you can provide in finding out more information to allow me to better understand any potential weaknesses in the Spanning Tree Algorithm.  I would also like to know whether you believe this problem should be addressed by IEEE 802.3, the working group which created the Ethernet standards, or by other working groups within IEEE 802.
 
My concern is more than academic.  I am presently the Vice Chair of the Working Group (IEEE 802.17) that is defining a metropolitan area networking standard, Resilient Packet Ring, which will also use the Spanning Tree Protocol.
 
My full contact information is contained in the signature line.  E-mail is probably the easiest way to reach me.
 
Thank you in advance for any assistance you can provide me.
 
Best regards,
 
Robert D. Love
President, Resilient Packet Ring Alliance
President, LAN Connect Consultants
7105 Leveret Circle     Raleigh, NC 27615
Phone: 919 848-6773       Mobile: 919 810-7816
email: rdlove@ieee.org          Fax: 208 978-1187
 

Beth Israel Deaconess copes with a massive computer crash

By Anne Barnard, Globe Staff, 11/26/2002

Thirteen days ago, as his computer crunched the mountain of data he hoped would be his humble contribution to medical progress, the researcher - he shall remain nameless - got a phone call he'd never forget.

It was Dr. John Halamka, the former emergency-room physician who runs Beth Israel Deaconess Medical Center's gigantic computer network. He told the professor that his flood of numbers was overwhelming the system, threatening to freeze thousands of electronic medical records and grind the hospital's network to a halt.

''He said, `Oh, my God!' and pulled the plug out of the wall,'' Halamka said last week.

It was too late. Somewhere in the web of copper wires and glass fibers that connects the hospital's two campuses and satellite offices, the data was stuck in an endless loop. Halamka's technicians shut down part of the network to contain it, but that created a cascade of new problems.

The entire system crashed, freezing the massive stream of information - prescriptions, lab tests, patient histories, Medicare bills - that shoots through the hospital's electronic arteries every day, touching every aspect of care for hundreds of patients.

Within a few hours, Cisco Systems, the hospital's network provider, was loading thousands of pounds of network equipment onto an airplane in California, bound for a 2 a.m. arrival at Logan International Airport. In North Carolina's Research Triangle area, computer experts were being rousted out of bed to join a battalion of electronic shock troops who would troubleshoot the situation. Closer to home, Cisco technicians were converging on Boston from across Massachusetts.

The crisis began on a Wednesday afternoon, Nov. 13, and lasted nearly four days. Before it was over, the hospital would revert to the paper systems that governed patient care in the 1970s, in some cases reverting to forms printed ''Beth Israel Hospital,'' from before its 1996 merger. Hundreds of employees, from lab technicians to chief executive officer Paul Levy, would work overtime running a quarter-million sheets of paper from one end of the campus to the other.

And hospitals across the country - not to mention investment banks, insurance companies and every other business that relies on a constantly accessible stream of quickly changing information - would get a scary reminder of how dependent they are on their networks, and what would happen if they disappeared.

''It's like the Y2K that never happened,'' said Dianne Anderson, vice president for patient care services at Beth Israel Deaconess.

Now, Halamka - the hospital's chief information officer and a networking addict who answers e-mails on his Blackberry device whether he's at a meeting or a family dinner - is hustling to answer questions from all over the country, from community hospitals in Western Massachusetts and major medical centers such as Johns Hopkins University, and financial-services companies that could lose millions in a crash.

''The message,'' he said, ''is make sure you're ready for a massive disruption of your network - whether it's 9/11 or a natural disaster or whatever.''

As a result of the crash, Beth Israel Deaconess plans to spend $3 million to replace its entire network - creating an entire parallel set of wires and switches, double the capacity the medical center thought it needed.

No other Massachusetts hospital has ever reported such a long-lasting or disruptive network crash, said Elliot Stone, executive director of the Massachusetts Health Data Consortium, a group that brings together chief information officers from hospitals and health plans around the state. He praised Beth Israel Deaconess for being open about the problem and sharing lessons learned, both about technology itself and about policy - such as the need to enforce rules against unauthorized additions of new software onto the network. Not least, Stone said, Halamka's counterparts see the incident as ammunition in their constant quest to convince management to pay for network upgrades.

The crash surprised experts in the field because most disaster planners mainly worry about backing up hard drives and building redundant servers. But in this case, it wasn't those repositories of information that were in trouble. It was the network itself - the ''pipes'' that carry the information from one place to the other. It was like the way e-mail slows down at busy times at the office - only so bad that everything ceased to function.

''Usually, when you think about backup, you're talking about backing up hard drives. You don't think about the network itself,'' said Mark Tuomenoksa, founder and chairman of Woburn-based OpenReach, a network-security consulting company.

Halamka said that was the case at Beth Israel Deaconess: ''We don't just have a backup generator, we have a backup-backup generator, and then we have batteries. Servers are clustered; data writes on five different hard drives.'' There is even a double ''pipeline'' between the computer center on Tremont Street and Beth Israel Deaconess's main campuses - but during the crash, both were clogged.

The crisis had nothing to do with the particular software the researcher was using. The problem had to do with a system called ''spanning tree protocol,'' which finds the most efficient way to move information through the network and blocks alternate routes to prevent data from getting stuck in a loop. The large volume of data the researcher was uploading happened to be the last drop that made the network overflow.

Halamka said Beth Israel Deaconess's recent economic troubles were not behind the problem. In fact, on Oct. 1, hospital officials had approved a consultant's plan to overhaul the network - just not quite in time. ''Now,'' he said, ''we're going to do it faster.''

The crisis also tapped into medicine's ambivalence about computers. Yesterday, doctors at Brigham and Women's Hospital reported in the Archives of Internal Medicine that 73 percent of medication-related mistakes involved in malpractice claims are preventable and probably could be averted through computerized prescription ordering - the latest in a growing pile of evidence that computerization can cut medical errors.

At the same time, clinicians have sometimes been wary of turning over control to a computer, Tuomenoksa said: ''When I enter something into a computer, how do I know it got there?''

That was part of the problem Beth Israel Deaconess had: New information could sometimes be entered, but since network function was fading in and out, clinicians weren't sure whether that information was being delivered. So, the hospital decided to shut down the computers - taping handwritten ''Do Not Use'' notes to monitors - creating an instant generation gap, said Anderson, the hospital's top nurse executive.

''Nurses and doctors over the age of 35 were very much at ease,'' she said. ''The younger nurses and doctors were very uncertain. We were teaching residents how to write orders; we were showing nurses how to do flow sheets.''

Meanwhile, the hospital was figuring out how to run at its usual pace without the 100,000 e-mails it usually sends a day. The lab was dumping 3,000 results a day on paper into plastic bins, to be delivered by runners who came by every 10 to 15 minutes. Microbiologists were ferrying lab results. Cardiac fellows were digging through paper records to find old cardiograms to compare to new ones. People at all levels of the hospital hierarchy had to deal with each other face to face.

''The lab is usually anonymous until something goes wrong,'' said Gina McCormack, technical director of the West Campus lab. ''A lot of people realized we're here. People got to understand each other's jobs.''

Anne Barnard can be reached, when the network is working, at abarnard@globe.com.
This story ran on page C1 of the Boston Globe on 11/26/2002.
© Copyright 2002 Globe Newspaper Company.
