In late March, a lab-wide power outage shut down the lab over the weekend. As it was a weekend, no one really noticed. Coupled with a breakdown in network connectivity the day before, this meant that all those SETI@home clients had finished processing their little sliver of data and were waiting to send back the results and get a new sliver. Typically, because machines process their data at different speeds, not many machines are hitting the SETI@home servers at any one time. Because of the network failure, coupled with the power failure, nearly every single SETI@home client was hitting the central servers at the same time. For nearly a week after the lab at Berkeley recovered, I was unable to get data from the server because it was swamped with requests. This highlights a problem in the approach taken by many distributed computing applications: if the central servers lose connectivity for an extended period, they can be hit by a tidal wave of requests for data once they come back. Nor is Berkeley alone in this little hiccup.
On April 12, 2005, Microsoft announced an update containing five critical updates. Since I am a safe Microsoft user, I have auto updates turned on. Since I didn't think about it too much, I simply let the application choose the default time, which is 3:00 AM. I suspect nearly every user lets the application choose the default time. As a result, when a large series of updates arrived after nearly a month of no updates, the system bogged down. In fact, my machine slowed nearly to a crawl as it repeatedly tried to reach Microsoft's auto update servers. The servers, which I am sure were heavily loaded, simply were not responding. The auto update feature of Windows does not degrade gracefully: the application kept trying and trying to update, consuming most of my CPU.
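One common way for a client to degrade more gracefully when a server stops responding is to wait longer after each failed attempt and to randomize the wait, so that a crowd of clients does not retry in lockstep. Here is a minimal Python sketch of that idea. The names `backoff_delays` and `fetch_with_backoff`, the one-second base, and the one-hour cap are all my own assumptions for illustration; this is not how Windows auto update or the SETI@home client actually works.

```python
import random
import time


def backoff_delays(base=1.0, cap=3600.0, attempts=8, rng=None):
    """Yield randomized wait times (seconds) whose upper bound doubles each attempt."""
    if rng is None:
        rng = random.Random()
    for attempt in range(attempts):
        # The upper bound doubles each time but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Pick a random wait below the bound so clients spread out.
        yield rng.uniform(0, ceiling)


def fetch_with_backoff(fetch, sleep=time.sleep, **backoff_options):
    """Call fetch(); on ConnectionError, sleep for a growing random delay and retry."""
    try:
        return fetch()
    except ConnectionError:
        pass  # fall through to the retry loop
    for delay in backoff_delays(**backoff_options):
        sleep(delay)  # idle wait: the CPU is free while we hold off
        try:
            return fetch()
        except ConnectionError:
            continue
    raise ConnectionError("server still unavailable after all retries")
```

A caller would pass in whatever function contacts the server, for example `fetch_with_backoff(download_updates)`, where `download_updates` is a hypothetical function that raises `ConnectionError` when the server does not answer. Because the client sleeps between attempts instead of retrying immediately, it neither burns CPU locally nor adds to the pile-up on the server.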
A better design would be to randomly choose a time between midnight and 7:00 AM. This way the load on the central servers is spread across the night instead of arriving all at once, while still allowing users to get their updates. What undoubtedly happened is the programmers thought, "3:00 AM, no one uses their computer that early. That's a great time to auto update." Of course, what was missing was the thought, "How can we support having millions of machines hitting us at the same time?" Usually updates are small, and MS has developed sufficient infrastructure to take the pounding. Not so with this update, as it was significantly larger than the typical MS update.
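To make the random-time idea concrete, here is a small Python sketch under my own assumptions: each machine picks its update time once, at install, uniformly at random and to the minute within the midnight to 7:00 AM window. The function name `pick_update_time` and the window constants are hypothetical and are not taken from any Microsoft tool.

```python
import random
from datetime import time

WINDOW_START_HOUR = 0  # midnight
WINDOW_END_HOUR = 7    # 7:00 AM (exclusive)


def pick_update_time(rng=None):
    """Return a random time of day inside the update window, to the minute."""
    if rng is None:
        rng = random.Random()
    window_minutes = (WINDOW_END_HOUR - WINDOW_START_HOUR) * 60
    # Choose one of the 420 minute slots with equal probability.
    offset = rng.randrange(window_minutes)
    hour, minute = divmod(offset, 60)
    return time(hour=WINDOW_START_HOUR + hour, minute=minute)
```

With 420 one-minute slots, a million machines would average roughly 2,400 per slot instead of a million in the same minute.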
Recommended Reading.
Interested in developing a large-scale distributed application? Here's my beginner's reading list to get you started.
- Distributed Systems: Concepts and Design (3rd Edition). Get this one if you get no other.
- Distributed Systems: Principles and Paradigms