Document updated on Feb 18, 2011
Problems to avoid when designing HA systems:
- single point of failure
The solution is redundancy in one form or another, included up front in the original design. Historically, the most common method was internally redundant hardware. It's also possible to achieve redundancy by using software along with commodity hardware.
- the "split brain" problem
This is where two live computers lose contact with each other but are both still running. Because they can't contact each other, each assumes that the other one has died and that it is the only remaining server. So each decides to take complete control over the shared resources. In some cases, some resources (particularly hard drives) can't safely be controlled by more than one computer at a time. Node fencing (also called fencing or STONITH; see linux-ha.org) is a solution to prevent this. Quorum is a way to prevent computers from fencing each other, over and over.
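The quorum idea can be sketched in a few lines of Python. This is only an illustration, assuming a simple majority rule with one vote per node. The names has_quorum and decide_action are hypothetical and do not correspond to the API of any particular cluster manager.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """Return True when this side of a split holds a strict majority of all votes."""
    # Strict majority: exactly half is NOT enough, otherwise both halves
    # of an even split would each believe they are in charge.
    return 2 * reachable_votes > total_votes


def decide_action(partition: set, cluster: set) -> str:
    """Decide what the nodes in `partition` should do after losing contact.

    `partition` is the set of nodes that can still see each other;
    `cluster` is the full configured membership (an assumption of this sketch).
    """
    reachable = len(partition & cluster)
    if has_quorum(reachable, len(cluster)):
        # Only the majority side may fence the unreachable nodes
        # and take over the shared resources.
        return "fence-others"
    # The minority side must release shared resources and wait.
    return "stand-down"


if __name__ == "__main__":
    cluster = {"node-a", "node-b", "node-c"}
    print(decide_action({"node-a", "node-b"}, cluster))
    print(decide_action({"node-c"}, cluster))
```

Under this rule, at most one side of any split can hold a majority, so at most one side ever fences. Note that in a two-node cluster neither side has a majority after a split, which is why such setups typically need some extra tie-breaker.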