In this Scientific American article, Armando Fox and David Patterson discuss an unconventional approach to building reliable computing systems. Instead of focusing on improving software and hardware reliability, they treat failures as inevitable. Their recovery-oriented computing (ROC) approach focuses on bringing the service back more quickly after a failure. They consider methods such as micro-rebooting, which restarts only the failed component rather than the whole system.
Their study also revealed that operator errors cause most system downtime. Perhaps the most important boost to reliability, then, is to improve system usability.
2003.05.31