[LARTC] IP Failover

John Klingler john@jupiter.com
Tue, 07 Oct 2003 12:55:56 -0700



  If anyone is interested: in my search for a networking solution that 
provides IP Failover on heterogeneous redundant networks, I found the 
solutions listed below. I would welcome comments from anyone who is 
familiar with them.

   1. faild - I have included a description below of this daemon,
      which monitors the Ethernet connections and changes the routing
      tables when a failure is detected. IP Failover is all this simple
      program does. Being simple, however, makes it small and easy to port.
   2. High Availability Linux Project (HAL) (http://linux-ha.org/) has
      code available for FreeBSD and Solaris (and is probably reasonably
      portable to other UNIX platforms). It supports virtual (redundant)
      servers and could therefore probably be configured to support
      redundant LANs.
   3. Advanced Network Services (ANS 2.3.x) for Linux* Operating
      Systems, which is available from Intel for both PCs and UNIX OSes.
      ANS provides IP Failover and much more, such as switch failover,
      load leveling, etc. See:
      http://www.intel.com/support/network/adapter/onlineguide/PRO1000/DOCS/SERVER/index.htm
   4. Linux Virtual Server Project (LVS) - VRRPD, Virtual Router
      Redundancy Protocol (http://off.net/~jme/vrrpd/), which also
      provides IP Failover. It implements RFC 2338 and is currently
      available only on Linux, but may be portable. As with HAL, it is
      probably configurable to provide redundant LANs.

It seems the days of industry-wide standards and interoperability are 
becoming casualties of war.


John Klingler


Automatic IP Failover: faild

Figure 1 shows a typical redundant network configuration in which all 
nodes are connected to two separate Ethernet LANs (here referred to as 
Ethernet A and Ethernet B). Each node must have two Ethernet interfaces, 
one for each LAN. A distinct IP address is assigned to each Ethernet 
interface.

                    ________________________________ . . .
                         |                |
                       Host 1           Host 2
                    _____|________________|_________ . . .

                    Figure 1: Typical Redundant Network Configuration


A route monitor daemon is started on all nodes. Each daemon is 
configured to be either a responder or both a requestor and responder. 
Typically the host daemons are requestor/responders.

Requestor daemons broadcast inquiry (INQ) packets on all available 
networks at a specified interval. Upon receiving an INQ, each responder 
daemon sends back an acknowledgment (ACK) via the same route. These 
packets are all sent using UDP (User Datagram Protocol), which has no 
connection setup or retransmission, so the daemons can quickly detect 
whether a route is active.
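
The INQ/ACK exchange described above can be sketched in a few lines of
Python. This is only an illustration, not faild's code: the payloads
"INQ" and "ACK", the function names, and the one-second timeout are all
assumptions, and the probe is sent unicast over the loopback interface
rather than broadcast so that the example runs on a single machine.

```python
import socket
import threading

INQ = b"INQ"  # hypothetical payloads; the real faild packet format is not given
ACK = b"ACK"

def make_responder(bind_addr="127.0.0.1"):
    """Bind a UDP socket on an OS-chosen port and return it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(2.0)  # never block forever if no INQ arrives
    sock.bind((bind_addr, 0))
    return sock

def serve_once(sock):
    """Answer a single INQ with an ACK sent back to the sender."""
    try:
        data, peer = sock.recvfrom(64)
    except OSError:
        return
    if data == INQ:
        sock.sendto(ACK, peer)

def probe(target, timeout=1.0):
    """Send one INQ to target and report whether an ACK arrived in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(INQ, target)
            data, _ = sock.recvfrom(64)
        except OSError:  # timeout or ICMP "port unreachable"
            return False
        return data == ACK

def demo():
    """Run one responder thread and probe it once over loopback."""
    responder = make_responder()
    worker = threading.Thread(target=serve_once, args=(responder,))
    worker.start()
    alive = probe(responder.getsockname())
    worker.join()
    responder.close()
    return alive
```

Because UDP does not retry on its own, a missing ACK shows up as a
simple receive timeout, which is what lets a requestor count missed
replies per interval.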

If the requestor daemon stops getting ACKs from a given node, and the 
responder daemon likewise stops getting the INQs it expects, each daemon 
independently determines that the particular route has become 
unreliable or, more likely, has gone dead. Each daemon then changes its 
local system routing tables so future traffic will be routed over the 
alternate (and presumably healthy) LAN. This detection and failover 
occurs quickly, within a few seconds, depending on how the 
daemon's timing parameters are set.
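
As a rough sketch of the routing decision, the function below picks the
interface to use from a table of per-interface health flags. How faild
actually rewrites the routing tables is platform specific and is not
described here; the second function merely builds (and does not run) a
Linux iproute2 "ip route replace" command as one possible way to apply
the choice. The interface names and the destination network are
hypothetical.

```python
def pick_interface(valid, preferred, alternate):
    """Return the interface to route over, or None if both are invalid."""
    if valid.get(preferred, False):
        return preferred
    if valid.get(alternate, False):
        return alternate
    return None

def route_command(destination, interface):
    """Build (but do not run) a Linux iproute2 command for the chosen route."""
    return ["ip", "route", "replace", destination, "dev", interface]

# Example: LAN A (eth0) has been invalidated, LAN B (eth1) is healthy.
health = {"eth0": False, "eth1": True}
chosen = pick_interface(health, preferred="eth0", alternate="eth1")
command = route_command("192.168.1.0/24", chosen)
```

Preferring the original interface whenever it is valid also gives the
switchback behaviour for free: once it is revalidated, the same call
selects it again.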

When a route fails, network traffic carried by reliable protocols (such 
as X Window traffic via TCP, the Transmission Control Protocol) is held 
in abeyance until the IP stack recognizes that packets are not getting 
through. When the IP stack times out, packets waiting for delivery are 
retransmitted. Since the daemon has changed the routing tables, the 
retransmitted packets go via the new route.

The IP time-out is the critical parameter determining how long it 
takes from the initial route failure to re-established communication 
over the new route. This parameter may or may not be user-settable on 
your system. Field experience so far indicates lag times of 20-40 
seconds before communication resumes.
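
To make the timing concrete, here is a back-of-the-envelope model. It
assumes (this is not stated above) that the time-out in question
behaves like a TCP retransmission timer that doubles after every
unanswered attempt, that the first lost packet was sent at the moment
of failure, and that the daemon needs interval x missed-packet-limit
seconds to switch routes. The initial time-out of 3 seconds and the
64 second cap are illustrative values only; real stacks differ.

```python
def detection_time(interval_s, missed_limit):
    """Seconds of silence before a daemon invalidates an interface."""
    return interval_s * missed_limit

def first_retransmit_after(switch_time_s, initial_rto_s=3.0, max_rto_s=64.0):
    """Time of the first backed-off retransmission at or after the route switch."""
    elapsed = 0.0
    rto = initial_rto_s
    while True:
        elapsed += rto  # next retransmission fires when the timer expires
        if elapsed >= switch_time_s:
            return elapsed
        rto = min(rto * 2, max_rto_s)  # exponential backoff, capped

# Example: probes every 5 s, interface invalidated after 3 misses.
switch_at = detection_time(5, 3)               # 15 s
resume_at = first_retransmit_after(switch_at)  # 3 + 6 + 12 = 21.0 s
```

Under these assumptions the pause is set less by the daemon than by
where the switch falls within the backoff schedule, since traffic
resumes only at the next scheduled retransmission.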

As soon as the original route becomes reliable again, the daemon 
restores the routing tables and communication resumes over the original 
interface. There should be no noticeable delay on the switchback. 
The request packet interval, failover interval, and switchback interval 
are all configurable.
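
The failover and switchback intervals can be pictured as two counters
per interface. The sketch below is a guess at the logic, not faild's
source: it assumes the missed and good packet counts (the -f and -s
options in the usage summary) are counted consecutively, and the class
and method names are made up for illustration.

```python
class InterfaceMonitor:
    """Tracks one interface using missed/good packet thresholds."""

    def __init__(self, missed_limit=3, good_limit=3):
        self.missed_limit = missed_limit  # plays the role of -f
        self.good_limit = good_limit      # plays the role of -s
        self.valid = True
        self.missed = 0
        self.good = 0

    def record(self, received):
        """Record one polling interval; return True if validity changed."""
        if received:
            self.missed = 0
            self.good += 1
            if not self.valid and self.good >= self.good_limit:
                self.valid = True   # switchback: interface revalidated
                return True
        else:
            self.good = 0
            self.missed += 1
            if self.valid and self.missed >= self.missed_limit:
                self.valid = False  # failover: interface invalidated
                return True
        return False

# Two misses invalidate the interface, two good packets restore it.
monitor = InterfaceMonitor(missed_limit=2, good_limit=2)
changes = [monitor.record(ok) for ok in (False, False, True, True)]
```

Requiring several good packets before switching back keeps a flapping
link from bouncing the routing tables on every interval.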

To start a failover daemon on your host system, use the following 
syntax:

    faild [-r] [-t <n>] [-f <n>] [-s <n>] [-p <n>] [-l <p>]
    -r     : launch as a requestor
    -t <n> : timer interval (in secs) for sending pkts
    -f <n> : num missed pkts before the interface is invalidated
    -s <n> : num good pkts before the interface is revalidated
    -p <n> : port number to use
    -l <p> : full path to message log file

    * Note: This daemon currently runs on VxWorks, Digital UNIX and
      Solaris, and is being ported to OpenVMS. Any other platforms would
      require porting the daemon to the target OS.
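
For anyone porting or reimplementing the daemon, the option list above
maps directly onto a standard argument parser. The following Python
sketch mirrors the documented flags only; the attribute names are my
own, no defaults are assumed (the real defaults are not given), and the
port number and log path in the example are hypothetical.

```python
import argparse

def build_parser():
    """Build a parser that mirrors the documented faild options."""
    parser = argparse.ArgumentParser(prog="faild")
    parser.add_argument("-r", dest="requestor", action="store_true",
                        help="launch as a requestor")
    parser.add_argument("-t", dest="interval", type=int, metavar="n",
                        help="timer interval (in secs) for sending pkts")
    parser.add_argument("-f", dest="missed_limit", type=int, metavar="n",
                        help="num missed pkts before interface is invalidated")
    parser.add_argument("-s", dest="good_limit", type=int, metavar="n",
                        help="num good pkts before interface is revalidated")
    parser.add_argument("-p", dest="port", type=int, metavar="n",
                        help="port number to use")
    parser.add_argument("-l", dest="log_path", metavar="p",
                        help="full path to message log file")
    return parser

# Example invocation with made-up values.
args = build_parser().parse_args(
    ["-r", "-t", "2", "-f", "3", "-s", "5", "-p", "5000",
     "-l", "/var/log/faild.log"])
```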

