mirror of https://github.com/sipwise/heartbeat.git
You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
316 lines
15 KiB
316 lines
15 KiB
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
|
|
<html>
|
|
<head>
|
|
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
|
<title>Preliminary Linux HA Hardware Installation Guide</title>
|
|
</head>
|
|
<body>
|
|
|
|
<h1>
|
|
Linux-HA Hardware Installation Guideline</h1>
|
|
<i><font size=-1>This document (c) 1999 Volker Wiegand
|
|
<a href="mailto:Volker.Wiegand@suse.de"><Volker.Wiegand@suse.de></a></font></i>
|
|
<p>This document serves as the starting point to plan, execute, and verify
|
|
your hardware setup for a High Availability (HA) environment.
|
|
<h2>Contents</h2>
|
|
<OL>
|
|
<LI><a href="#Introduction">Introduction</a>
|
|
<LI><a href="#Hardware Requirements">Hardware Requirements</a>
|
|
<OL>
|
|
<LI><a href="#Minimum Installation">Minimum Installation</a>
|
|
<LI><a href="#More Advanced Installation">More Advanced Installation</a>
|
|
<LI><a href="#Fully Redundant Installation">Fully Redundant Installation</a>
|
|
</OL>
|
|
<LI><a href="#Hardware Setup and Test">Hardware Setup and Test</a>
|
|
<OL>
|
|
<LI><a href="#Serial Ports">Serial Ports</a>
|
|
<LI><a href="#LAN Interfaces">LAN Interfaces</a>
|
|
<LI><a href="#Other Devices">Other Devices</a>
|
|
</OL>
|
|
<LI><a href="#Troubleshooting">Troubleshooting</a>
|
|
<LI><a href="#References">References</a>
|
|
</OL>
|
|
<h2>
|
|
<a NAME="Introduction"></a>Introduction</h2>
|
|
With the high stability Linux has reached, this Operating System is well
|
|
suited to be used for HA purposes. The Linux-HA project, based upon
|
|
Harald Milz's
|
|
<a href="http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html">HOWTO</a>
|
|
and Alan Robertson's Heartbeat code, provides the building
|
|
blocks for a professional solution.
|
|
<p>This document provides some advice on the initial planning, the
|
|
installation and cabling, and the test and verification of the overall
|
|
setup. We use the word takeover to mean transferring some kind of
|
|
server functionality from a broken entity to a sane one. In this context,
|
|
entities can be network adapters, computers, or something else. Our current
|
|
focus is to provide HA capabilities among PC's which we will call "nodes"
|
|
from now on.
|
|
<h2>
|
|
<a NAME="Hardware Requirements"></a>Hardware Requirements</h2>
|
|
Since we want to be able to provide failover capabilities on the machine
|
|
level, we need at least two computers. Obvious, isn't it? In our current
|
|
setup, all we require that they are running Linux.
|
|
No particular distribution is preferred (although most tests have been
|
|
carried out on RedHat and SuSE systems).
|
|
The minimum kernel version is
|
|
<B>[TODO: which one is it ???]</B>, although the software makes fairly minimal
|
|
demands on the OS.
|
|
<p>These two nodes have to be connected in some way to exchange status
|
|
information and to monitor each other. The more channels our nodes have
|
|
to talk to each other, the better it is. We will use the term "medium"
|
|
for such a communication channel.
|
|
<p>In general we work from the assumption that we use standard hardware
|
|
where ever possible. This means that we do not modify our PC's other than
|
|
to expand them with components off the shelf. And we use only cabling that
|
|
can be bought without "special orders" such as split serial cables or the
|
|
like. After all we want solutions that can be installed and used by everyone,
|
|
not just some experts.
|
|
<h3>
|
|
<a NAME="Minimum Installation"></a>Minimum Installation</h3>
|
|
In order for the takeover to work, we need at least one medium to exchange
|
|
messages. Given that we use TCP/IP as the basis for our service, some kind
|
|
of LAN is certainly available. Of course having only the LAN provides poor
|
|
monitoring capabilities, but on the other hand this is the minimum chapter
|
|
anyway :-)
|
|
<p>So how will the hardware be planned? Well, straight forward.
|
|
<pre>
|
|
+-------------------+ +-------------------+
|
|
| | | |
|
|
| Node A | | Node B |
|
|
| | | |
|
|
| eth0 | | eth0 |
|
|
+---------+---------+ +---------+---------+
|
|
| |
|
|
| |
|
|
|---------+----------------------+---------| LAN (Ethernet, etc.)</pre>
|
|
As was mentioned before, this design obviously provides
|
|
insufficiently reliable monitoring capabilities.
|
|
In a LAN, there are many different points of failure.
|
|
Another issue is that the LAN is a public medium and that there are several
|
|
levels of possible failures.
|
|
So we would be well-advised to look for more robust options to
|
|
use in addition to the LAN for heartbeats.
|
|
<h3>
|
|
<a NAME="More Advanced Installation"></a>More Advanced Installation</h3>
|
|
So let's see what we can do to provide sound monitoring and good takeover
|
|
capabilities and still not having to purchase excessive hardware or software
|
|
add-ons.
|
|
<p>The main idea is to have a simple private medium like one or more serial
|
|
cables. We can use the standard serial ports, provided they are not already
|
|
occupied by modems, mice, or other vermin.
|
|
If you have a server with a PS/2 mouse, it probably has two such
|
|
ports available.
|
|
<p>So here's what this configuration looks like.
|
|
<pre>
|
|
+----------------------+ (Nullmodem Cable)
|
|
| |
|
|
+---------+---------+ +---------+---------+
|
|
| ttyS0 | | ttyS0 |
|
|
| | | |
|
|
| Node A | | Node B |
|
|
| | | |
|
|
| eth0 | | eth0 |
|
|
+---------+---------+ +---------+---------+
|
|
| |
|
|
| |
|
|
|---------+----------------------+---------| LAN (Ethernet, etc.)</pre>
|
|
What do we gain? We have now two media to exchange the heartbeat.
|
|
This provides greater reliability in the case of failure.
|
|
Of course
|
|
the restriction with the LAN still holds true, but now Node A could use
|
|
the serial line to initiate a takeback of the service. And if just the
|
|
serial connection should fail, we still have the LAN.
|
|
Reliable intracluster communications is very important, and this design
|
|
is a low-cost improvement over the previous one.
|
|
<h3>
|
|
<a NAME="Fully Redundant Installation"></a>Fully Redundant Installation</h3>
|
|
The point of the addition of the serial links to the system is that a single
|
|
failure cannot cause the nodes to become confused about the overall
|
|
system configuration. This is vitally important for many HA systems, because
|
|
the cost of this confusion can be scrambled disks, and other problems
|
|
which are often worse than the cost of an outage.
|
|
With more resources, the
|
|
following provides a general guideline to set up things.
|
|
To illustrate
|
|
the principle, a third node has been included, but we can install any number
|
|
of nodes in this way. Well, almost any.
|
|
<B>Note:</B> The takeover code which is
|
|
part of the <i>heartbeat</i> package will not yet correctly manage takeovers
|
|
for more than two nodes.
|
|
<p>The serial lines are now arranged in a ring structure.
|
|
As you will have noticed,
|
|
this occupies two serial ports on each node as per our discussion in the
|
|
previous chapter. But on the other hand we do now have a general setup
|
|
that can easily be extended and also provides a good level of redundancy.
|
|
We can now send our heartbeat now in both directions over the
|
|
ring, thus reaching every other node even in case of a (single) cable defect
|
|
(or down system) anywhere on the ring.
|
|
<p>Another facet of our high end design covers the LAN access. Having two
|
|
adapters connected to the wire allows us to provide intra-node failover
|
|
capabilities in case of an interface or LAN cable breakdown. Plus it gives
|
|
us the chance to take over the IP address of Node A, eth0 onto Node B,
|
|
eth1 and keeping Node B, eth0 as it is. In fact, this is the primary operation
|
|
mode of several professional systems, including IBM's HACMP or HP's
|
|
MC/ServiceGuard. Which doesn't imply that we are not professional, of course
|
|
:-)
|
|
<p>So, here is the block diagram for this third design.
|
|
<pre>
|
|
(Nullmodem Cables)
|
|
+-----------------------------------------------------+
|
|
| +--------------+ +-------------+ |
|
|
| | | | | |
|
|
+-----+-------+-----+ +-----+--------+----+ +-----+-------+-----+
|
|
| ttyS0 ttyS1 | | ttyS0 ttyS1 | | ttyS0 ttyS1 |
|
|
| | | | | |
|
|
| Node A | | Node B | | Node C |
|
|
| | | | | |
|
|
| eth0 eth1 | | eth0 eth1 | | eth0 eth1 |
|
|
+-----+--------+----+ +-----+--------+----+ +-----+--------+----+
|
|
| | | | | |
|
|
| | | | | |
|
|
|-----+--------+-------------+--------+-------------+--------+----|
|
|
LAN (Ethernet, etc.)
|
|
</pre>
|
|
Future releases of the <i>heartbeat</i> software will support such a
|
|
configuration, but current takeover code restricts the configuration
|
|
to a single interface and two nodes in the network.
|
|
Of course we could also use other media for the heartbeat exchange. Recent
|
|
suggestions include SCSI buses in target mode and IrDA ports "connected"
|
|
with a mirror. Another candidate that comes to mind is the USB found in
|
|
many modern PC's. As I said before, the more (and more different) the better.
|
|
<h2>
|
|
<a NAME="Hardware Setup and Test"></a>Hardware Setup and Test</h2>
|
|
The following chapter deals with the installation and verification of the
|
|
various components within the nodes.
|
|
<h3>
|
|
<a NAME="Serial Ports"></a>Serial Ports</h3>
|
|
First of all, let's recap how a Nullmodem Cable is wired. The pain is that
|
|
you certainly possess the pin assignment a thousand times, but you don't
|
|
have it handy when you need it. So here it is ...
|
|
<pre>
|
|
25-pin 9-pin 9-pin 25-pin
|
|
|
|
2 TxD 3 -------------------------- 2 RxD 3
|
|
3 RxD 2 -------------------------- 3 TxD 2
|
|
4 RTS 7 -------------------------- 8 CTS 5
|
|
5 CTS 8 -------------------------- 7 RTS 4
|
|
7 GND 5 -------------------------- 5 GND 7
|
|
6 DSR 6 ---+---------------------- 4 DTR 20
|
|
8 DCD 1 ---+ +--- 1 DCD 8
|
|
20 DTR 4 ----------------------+--- 6 DSR 6
|
|
</pre>
|
|
|
|
Once you have these cable(s) in place you will want to test them. This
|
|
is fairly easy since the serial ports are usually configured with decent
|
|
default values. On a freshly booted Linux system we can assume the ports
|
|
to be in a "sane" state, with the speed set to 9600 baud. If not, you can
|
|
do a "<tt><b>stty sane 9600 </dev/ttyS0</b></tt>"
|
|
with <i>ttyS0</i> replaced by the actual device. Please note the input
|
|
redirection which selects the device.
|
|
|
|
</P>
|
|
<p>Then you can set up one node as receiver
|
|
("<tt><b>cat </dev/ttyS0</b></tt>") and
|
|
the other one as transmitter ("<tt><b>echo hello >/dev/ttyS0</b></tt>").
|
|
Voila! What you expect is that the "hello" is printed out at the receiver.
|
|
Pressing Ctrl-C on the receiver's keyboard will return you to the prompt.
|
|
Then do the same test with mutually exchanged roles.
|
|
<h3>
|
|
<a NAME="LAN Interfaces"></a>LAN Interfaces</h3>
|
|
Rumor has it that there is work in progress to provide some level of diagnostic
|
|
capabilities for Ethernet adapters and wiring. I don't know the actual
|
|
status, and can only suggest to use a shabby "ping" provided that the interfaces
|
|
are set up correctly with "ifconfig". For more information on Linux ethernet,
|
|
please check the
|
|
<a href="http://metalab.unc.edu/LDP/HOWTO/Ethernet-HOWTO.html">Ethernet
|
|
HOWTO</a>.
|
|
<p>If you are planning to use more than one adapter per node (usually called
|
|
"Standby Adapters"), please make sure to connect them to the same physical
|
|
medium as the primary adapters. Otherwise you will of course not be able
|
|
to takeover the IP address. Having them in a different subnet is perfectly
|
|
okay. More than that: it's preferred. <b>[TODO: this is what I learned with
|
|
HACMP. Can anyone please give the *correct* rationale --- or rephrase the
|
|
whole paragraph?]</b>
|
|
<p><b>Note:</b> The <i>heartbeat</i> software does not yet support this
|
|
kind of configuration.
|
|
<h3>
|
|
<a NAME="Other Devices"></a>Other Devices</h3>
|
|
<b>[TODO: well, to do]</b>
|
|
<h2>
|
|
<a NAME="Troubleshooting"></a>Troubleshooting</h2>
|
|
If things don't work in the first place -- don't panic! Usually it's just
|
|
a trifle. Things to check include:
|
|
<ul>
|
|
<li>
|
|
Check the startup messages of the kernel, e.g. using "dmesg". Is the serial
|
|
driver (either the standard one or the special one for your hardware) compiled
|
|
in or available as a module?</li>
|
|
|
|
<li>
|
|
Check the serial port(s) and cable(s). Do your modem and mouse still work?
|
|
Using a battery, a light bulb or buzzer and some wire you can easily verify
|
|
that all pins are connected and there are no short circuits.
|
|
Inexpensive <i>breakout boxes</i> are available for diagnosing such conditions
|
|
as well. They contain the light bulbs, the connectors and the wire in one
|
|
handy little unit.
|
|
<li>
|
|
For serial ports, the file <tt>/proc/tty/driver/serial</tt> can be very
|
|
helpful for diagnosing serial port problems in Linux.
|
|
It contains lines of this form in it:
|
|
<PRE><tt>1: uart:16550A port:2F8 irq:3 baud:19200 tx:24423 rx:4680 RTS|CTS|DTR|DSR|CD</tt></PRE>
|
|
This particular line corresponds to a working "raw" serial port,
|
|
/dev/ttyS<b>1</b>
|
|
with both sides cabled up correctly, and heartbeat active on both sides.
|
|
The first number on the line is the port number. The built-in serial ports
|
|
on PCs are numbered 0 and 1.
|
|
With heartbeat only active on the local side (and not the far side), it
|
|
looks like this instead:
|
|
<PRE><tt>1: uart:16550A port:2F8 irq:3 baud:19200 tx:43558 rx:12277 RTS|DTR</tt></PRE>
|
|
Note the lack of the
|
|
<b>CTS</b> (Clear To Send),
|
|
<b>DSR</b> (Data Set Ready), and
|
|
<b>CD</b> (Carrier Detect)
|
|
bits on the interface.
|
|
When heartbeat is only running on the far side interface, it looks
|
|
like this:
|
|
<PRE><tt>1: uart:16550A port:2F8 irq:3 baud:19200 tx:55039 rx:12277 CTS|DSR|CD</tt></PRE>
|
|
Note that when the local port isn't active, the
|
|
<b>RTS</b> (Request To Send), and
|
|
<b>DTR</b> (Data Terminal Ready) bits aren't active.
|
|
|
|
When heartbeat isn't running on either interface, the line looks like this:
|
|
<PRE><tt>1: uart:16550A port:2F8 irq:3 baud:19200 tx:55039 rx:12277</tt></PRE>
|
|
This is essentially a software breakout box.
|
|
</li>
|
|
|
|
<li>
|
|
Check that the cables are properly plugged into their sockets.
|
|
For a production
|
|
High-Availability system, it is a very good idea to fasten the screws
|
|
in order to avoid loose contacts.</li>
|
|
|
|
<li>
|
|
For more information on diagnosing ethernet problems, consult the
|
|
<a href="http://metalab.unc.edu/LDP/HOWTO/Ethernet-HOWTO.html">Ethernet
|
|
HOWTO</a>.</li>
|
|
|
|
<li>
|
|
For more information on diagnosing serial port problems, consult the
|
|
<a href="http://metalab.unc.edu/LDP/HOWTO/Serial-HOWTO.html">Serial
|
|
Port HOWTO</a>.</li>
|
|
</ul>
|
|
|
|
<h2>
|
|
<a NAME="References"></a>References</h2>
|
|
The Linux-HA homepage on the internet is:
|
|
<a href="http://linux-ha.org/">http://linux-ha.org/</a>
|
|
<p>Harald Milz' Linux-HA HOWTO that started the whole thing can be found
|
|
at:
|
|
<a href="http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html">http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html</a>
|
|
<p>A comprehensive survey on professional HA solutions is here:
|
|
<a href="http://www.sun.com/clusters/dh.brown.pdf">http://www.sun.com/clusters/dh.brown.pdf</a>
|
|
<br><b>[TODO: should we include links to HACMP, Veritas, Wizard, ... ???]</b>
|
|
<hr>
|
|
</body>
|
|
</html>
|