-  Quit your bellyachin'!  We needed a "catch-all" 
document to supply useful information in a way that was easily referenced 
and would grow without a lot of work.  It's closer to a FAQ than anything 
else.
   - Yes!  There are two public mailing lists
for Linux-HA.  You can find out about them by visiting http://linux-ha.org/contact/.
 
    - HA (High Availability) cluster - A cluster that allows a host (or hosts)
to become highly available.  This means that if one node goes down (or a
service on that node goes down), another node can pick up the service and
take over from the failed machine.  http://linux-ha.org
 Computing cluster - This is what a Beowulf cluster is.  It allows distributed
computing over off-the-shelf components, usually cheap IA32 machines.
http://www.beowulf.org/
 Load balancing cluster - This is what the Linux Virtual Server project does.
In this scenario you have one machine which load balances requests to a
certain service (Apache, for example) over a farm of servers.
www.linuxvirtualserver.org
 All of these sites have HOWTOs etc. on them.  For a general overview of
clustering under Linux, look at the Clustering HOWTO.
    - Resource scripts are basically (extended) System V init scripts.  They
must support the start, stop, and status operations.  In the future we will
also add support for a "monitor" operation for monitoring services, as you
requested.  The IPaddr script already implements this new "monitor" operation
(but heartbeat doesn't use that function of it).  For more information, see
the Resource HOWTO.
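 As a rough illustration, a resource script of this shape could be sketched as
follows.  The service name "myservice" and the STATE_FILE marker are
hypothetical stand-ins, and the strings printed by status are an assumption;
check the Resource HOWTO for what heartbeat actually expects.

```shell
#!/bin/sh
# Hypothetical resource script skeleton.  "myservice" and STATE_FILE are
# illustrative stand-ins, not part of heartbeat.
STATE_FILE="${STATE_FILE:-/tmp/myservice.state}"

myservice_resource() {
    case "$1" in
        start)
            # A real script would launch the daemon here.
            : > "$STATE_FILE"
            ;;
        stop)
            # A real script would shut the daemon down here.
            rm -f "$STATE_FILE"
            ;;
        status)
            if [ -e "$STATE_FILE" ]; then
                echo "running"
            else
                echo "stopped"
            fi
            ;;
        *)
            echo "usage: myservice {start|stop|status}" >&2
            return 1
            ;;
    esac
}
```

 In a real script the last line would be something like
myservice_resource "$1", so that the operation named on the command line is
the one that runs.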
    - Heartbeat itself was not designed for monitoring various resources.  If
you need to monitor some resources (for example, the availability of a WWW
server), you need some third-party software.  Mon is a reasonable solution.
 
    - Get Mon from http://kernel.org/software/mon/.
    - Get all the required modules listed.  You can find them at the nearest
      mirror or in the CPAN archive (www.cpan.org).  I am not very familiar
      with Perl, so I downloaded them from the CPAN archive as .tar.gz
      packages and installed them in the usual way
      (perl Makefile.PL && make && make test && make install).
    - Mon is software for monitoring different network resources.  It can
      ping computers, connect to various ports, and monitor WWW, MySQL, etc.
      If a resource malfunctions, it triggers some scripts.
    - Unpack mon in some directory.  The best starting point is the README
      file.  Complete documentation is in <dir>/doc, where <dir> is the place
      you unpacked the mon package.
    - For a fast start, do the following steps:
        - Copy all subdirectories found in <dir> to /usr/lib/mon.
        - Create the directory /etc/mon.
        - Copy auth.cf from <dir>/etc to /etc/mon.
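 The three quick-start steps above could be scripted roughly as follows.  This
is only a sketch: the function name install_mon and the optional staging
prefix are hypothetical, and it assumes the unpacked tree contains etc/auth.cf
as described.

```shell
# Sketch of the quick-start steps.  The first argument is the directory where
# mon was unpacked; the second is a hypothetical staging root (empty means /).
install_mon() {
    src="$1"
    prefix="${2:-}"
    mkdir -p "$prefix/usr/lib/mon" "$prefix/etc/mon" || return 1
    # Copy every subdirectory of the unpacked tree to /usr/lib/mon.
    for entry in "$src"/*; do
        if [ -d "$entry" ]; then
            cp -R "$entry" "$prefix/usr/lib/mon/" || return 1
        fi
    done
    # Install the auth.cf file shipped with mon.
    cp "$src/etc/auth.cf" "$prefix/etc/mon/" || return 1
}
```

 For example, "install_mon /tmp/mon-unpacked" would install into the real
/usr/lib/mon and /etc/mon, while passing a second argument installs under a
scratch directory so you can inspect the result first.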
 Now mon is ready to work.  You need to create your own mon.cf file, which
points to the resources mon should watch and the actions mon will start when
a resource malfunctions and when it becomes available again.  All monitoring
scripts are in /usr/lib/mon/mon.d/.  At the beginning of every script you can
find an explanation of how to use it.
 All alert scripts are placed in /usr/lib/mon/alert.d/.  Those are the scripts
triggered when something goes wrong.  If you are using ipvs, you can find
scripts for adding and removing servers from an ipvs list on its homepage
(www.linuxvirtualserver.org).
 
 
    - Yes!  Use the ipfail plug-in.  For each interface you wish to monitor,
specify one or more "ping" nodes or "ping groups" in your configuration.
Each node in your cluster will monitor these ping nodes or groups.  Should
one node detect a failure in one of these ping nodes, it will contact the
other node in order to determine whether it or the ping node has the problem.
If the cluster node has the problem, it will try to fail over its resources
(if it has any).  To use ipfail, you will need to add the following to your
/etc/ha.d/ha.cf files:
 respawn hacluster /usr/lib/heartbeat/ipfail
 ping <IPaddr1> <IPaddr2> ... <IPaddrN>
 IPaddr1..N are your ping nodes.  See Kevin's documentation for more details
on the concepts.  NOTE: ipfail requires the auto_failback option to be set to
on or off (not legacy).
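 To make the pieces concrete, here is a small sketch that prints the ha.cf
lines described above and refuses auto_failback legacy.  The function name
ipfail_directives is hypothetical, and the addresses in the usage example are
placeholders for your own ping nodes.

```shell
# Print the ha.cf directives needed for ipfail.
# Usage: ipfail_directives AUTO_FAILBACK PING_NODE [PING_NODE...]
ipfail_directives() {
    auto_failback="$1"
    shift
    case "$auto_failback" in
        on|off) ;;
        *)
            echo "ipfail needs auto_failback on or off, not '$auto_failback'" >&2
            return 1
            ;;
    esac
    if [ "$#" -eq 0 ]; then
        echo "at least one ping node is required" >&2
        return 1
    fi
    echo "auto_failback $auto_failback"
    echo "respawn hacluster /usr/lib/heartbeat/ipfail"
    # All ping nodes go on a single ping line, separated by spaces.
    echo "ping $*"
}
```

 Running "ipfail_directives off 192.168.1.1 192.168.1.254" prints three lines
that can be pasted into ha.cf on each node.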
 
 
    - This isn't a problem with heartbeat, but rather is caused by various
versions of net-tools.  Upgrade to the most recent version of net-tools and
it will go away.  You can test it manually with ifconfig.
    - Instead of failing over many IP addresses, just fail over one router
address.  On your router, do the equivalent of
"route add -net x.x.x.0/24 gw x.x.x.2", where x.x.x.2 is the cluster IP
address controlled by heartbeat.  Then make every address within x.x.x.0/24
that you wish to fail over a permanent alias of the loopback interface (lo)
on BOTH cluster nodes.  This is done via
"ifconfig lo:2 x.x.x.3 netmask 255.255.255.255 -arp" etc.
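 When there are many addresses, generating the alias commands is less
error-prone than typing them.  The sketch below only prints the commands so
you can review them before running anything; the function name
lo_alias_commands is hypothetical, and numbering the aliases from lo:2 upward
simply follows the example above.

```shell
# Print (not run) one loopback alias command per address given.
# Aliases are numbered from lo:2 upward.
lo_alias_commands() {
    n=2
    for addr in "$@"; do
        echo "ifconfig lo:$n $addr netmask 255.255.255.255 -arp"
        n=$((n + 1))
    done
}
```

 The same list of addresses has to be used on both cluster nodes.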
     -  If anything makes your ethernet / IP stack
fail, you may lose both connections. You definitely should run the cables
differently, depending on how important your data is...
     - To make heartbeat work with ipchains, you must accept incoming and
outgoing traffic on UDP port 694.  Add something like
 /sbin/ipchains -A output -i ethN -p udp -s <source_IP> -d <dest_IP> -j ACCEPT
 /sbin/ipchains -A input -i ethN -p udp -s <source_IP> -d <dest_IP> -j ACCEPT
    - This can be caused by one of two things:
        - System under heavy I/O load, or
        - Kernel bug.
 For how to deal with the first cause (heavy load), please read the answer to
the next FAQ item.  If your system was not under moderate to heavy load when
it got this message, you probably have the kernel bug.  The 2.4.18 Linux
kernel had a bug in it which would cause it to not schedule heartbeat for
very long periods of time when the system was idle, or nearly so.  If this is
the case, you need to get a kernel that isn't broken.
    - "No local heartbeat" or "Cluster node returning after partition" under
heavy load is typically caused by too small a deadtime interval.  Here is a
suggestion for how to tune deadtime:
        - Set deadtime to 60 seconds or higher.
        - Set warntime to whatever you *want* your deadtime to be.
        - Run your system under heavy load for a few weeks.
        - Look at your logs for the longest time either system went without
          hearing a heartbeat.
        - Set your deadtime to 1.5-2 times that amount.
        - Set warntime to a little less than that amount.
        - Continue to monitor logs for warnings about long heartbeat times.
          If you don't do this, you may get "Cluster node ... returning after
          partition", which will cause heartbeat to restart on all machines
          in the cluster.  This will almost certainly annoy you.
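 The arithmetic in the last steps can be sketched as below.  This is just one
reading of the advice: the function name suggest_timers is hypothetical, it
picks the upper end (2x) of the 1.5-2x range, and it takes "a little less" to
mean one second less than the longest observed gap.

```shell
# Suggest ha.cf timer values from the longest observed heartbeat gap.
# Usage: suggest_timers LONGEST_GAP_SECONDS
suggest_timers() {
    gap="$1"
    if [ "$gap" -lt 2 ]; then
        echo "gap must be at least 2 seconds for this sketch" >&2
        return 1
    fi
    # Upper end of the suggested 1.5-2x range.
    echo "deadtime $((gap * 2))"
    # "A little less" than the gap; one second is an arbitrary choice.
    echo "warntime $((gap - 1))"
}
```

 For a longest gap of 10 seconds this prints "deadtime 20" and "warntime 9".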
 Adding memory to the machine generally helps.  Limiting the workload on the
machine generally helps.  Newer versions of heartbeat are better about this
than pre-1.0 versions.
    - It's common to get a single mangled packet on your serial interface when
heartbeat starts up.  This message is an indication that we received a mangled
packet.  It's harmless in this scenario.  If it happens continually, there is
probably something else going on.
 
    - It's probably a permissions problem on authkeys.  Heartbeat wants the
file to be accessible only by its owner (mode 400, 600 or 700).  Depending on
where and when it discovers the problem, the message will wind up in
different places.  But it tends to be in
        - stdout/stderr
        - wherever you specified in your setup
        - /var/log/messages
 Newer releases are better about also putting out startup messages to stderr
in addition to wherever you have configured them to go.
    - Use multicast and give each cluster its own multicast group.  If you
need or want to use broadcast, then run each cluster on a different port
number.  An example of a configuration using multicast would be to have the
following line in your ha.cf file:
 mcast eth0 224.1.2.3 694 1 0
 This sets eth0 as the interface over which to send the multicast, 224.1.2.3
as the multicast group (it will be the same on each node in the same cluster),
UDP port 694 (the heartbeat default), a time to live of 1 (which limits
multicast to the local network segment so that it does not propagate through
routers), and multicast loopback disabled (typical).
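 If you maintain several clusters, a tiny helper like the following sketch can
keep the field order straight.  The function name mcast_directive is
hypothetical; the defaults (port 694, TTL 1, loopback 0) are the values from
the example above.

```shell
# Print an ha.cf mcast directive.
# Usage: mcast_directive IFACE GROUP [PORT [TTL [LOOP]]]
mcast_directive() {
    # Port, TTL and loopback fall back to 694, 1 and 0 when omitted.
    echo "mcast $1 $2 ${3:-694} ${4:-1} ${5:-0}"
}
```

 For instance, one cluster could use "mcast_directive eth0 224.1.2.3" and a
second cluster "mcast_directive eth0 224.1.2.4"; the second group address is
only an illustration.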
    - There is a CVS repository for Linux-HA.  You can find it at
cvs.linux-ha.org.  Read-only access is via login guest, password guest,
module name linux-ha.  More details are to be found in the announcement
email.  It is also available through the web using viewcvs at
http://cvs.linux-ha.org/viewcvs/viewcvs.cgi/linux-ha/
    - Heartbeat now uses automake and is generally quite portable at this
point.  Join the Linux-HA-dev mailing list if you want to help port it to
your favorite platform.
    - Due to distribution RPM package name differences, this was unavoidable.
If you're not using STONITH, use the "--nodeps" option with rpm.  Otherwise,
use the heartbeat source to build your own RPMs.  You'll have the added
dependencies of autoconf >= 2.53 and libnet (get it from
http://www.packetfactory.net/libnet).
 Use the heartbeat source RPM (preferred), or unpack the heartbeat source and,
from the top directory, run "./ConfigureMe rpm".  This will build RPMs and
place them where it's customary for your particular distro.  It may even tell
you if you are missing some other required packages!
    - You configure a "meatware" STONITH device in the ha.cf file.  The
meatware STONITH device asks the operator to go power reset the machine which
has gone down.  When the operator has reset the machine, he or she then
issues a command to tell the meatware STONITH plug-in that the reset has
taken place.  Heartbeat will wait indefinitely until the operator
acknowledges the reset has occurred.  During this time, the resources will
not be taken over, and nothing will happen.
    - STONITH is a form of fencing, and is an acronym standing for Shoot The
Other Node In The Head.  It allows one node in the cluster to reset the
other.  Fencing is essential if you're using shared disks, in order to protect
the integrity of the disk data.  Heartbeat supports STONITH fencing, and
resources which are self-fencing.  You need to configure some kind of fencing
whenever you have a cluster resource which might be permanently damaged if
both machines tried to make it active at the same time.  When in doubt, check
with the Linux-HA mailing list.
    - To get the list of supported STONITH devices, issue this command:
 stonith -L
 To get all the gory details on exactly what these STONITH device names mean,
and how to configure them, issue this command:
 stonith -h
 
    - This is not something which heartbeat supports directly.  However,
there are a few kinds of resources which are "self-fencing".  This means that
activating the resource causes it to fence itself off from the other node
naturally.  Since this fencing happens in the resource agent, heartbeat
doesn't know (and doesn't have to know) about it.  Two possible hardware
candidates are IBM's ServeRAID-4 RAID controllers and ICP Vortex RAID
controllers - but do your homework!!!  When in doubt, check with the mailing
list.
    - Yes, heartbeat has supported active/active configurations since its
first release.  The key to configuring active/active clusters is to
understand that each resource group in the haresources file is preceded by
the name of the server which is normally supposed to run that service.  In an
"auto_failback on (or legacy)" (or old-style "nice_failback off")
configuration, when a cluster node comes up, it will take over any resources
for which it is listed as the "normal master" in the haresources file.  Below
is an example of how to do this for an apache/mysql configuration.
 server1 10.10.10.1 mysql
 server2 10.10.10.2 apache
 In this case, the IP address 10.10.10.1 should be replaced with the IP
address you want to contact the mysql server at, and 10.10.10.2 should be
replaced with the IP address you want people to use to contact the web
server.  Any time server1 is up, it will run the mysql service.  Any time
server2 is up, it will run the apache service.  If both server1 and server2
are up, both servers will be active.  Note that this is incompatible with the
old nice_failback on parameter.  With the new release which supports
hb_standby foreign, you can manually fail back into an active/active
configuration if you have auto_failback off.  This allows administrators more
flexibility in failing back in a more customized way at safer or more
convenient times.
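 To check which node is the normal owner of which resource group, you can read
the first field of each haresources line.  The sketch below does that with
awk; the function name preferred_resources is hypothetical, and it assumes
one resource group per line (it does not handle continuation lines).

```shell
# Print the resource groups a node normally owns.
# Usage: preferred_resources NODE HARESOURCES_FILE
preferred_resources() {
    awk -v node="$1" '
        $1 == node {
            $1 = ""              # drop the node name
            sub(/^[ \t]+/, "")   # and the separator left behind
            print
        }
    ' "$2"
}
```

 With the two-line example above saved to a file, asking for server1 prints
"10.10.10.1 mysql" and asking for server2 prints "10.10.10.2 apache".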
 
 
- Heartbeat was written to use ifconfig to manage its interfaces.  That's
nice for portability to other platforms, but for some reason ifconfig
truncates interface names.  If you want to have fewer than 10 aliases, then
you need to limit your interface names to 7 characters, and to 6 characters
for fewer than 100 aliases.
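 The rule can be written down as a small lookup.  This sketch (the function
name max_ifname_len is hypothetical) only encodes the two limits stated above
and deliberately refuses to guess beyond them.

```shell
# Print the longest safe interface name length for a given number of aliases.
# Usage: max_ifname_len NUMBER_OF_ALIASES
max_ifname_len() {
    aliases="$1"
    if [ "$aliases" -lt 10 ]; then
        echo 7
    elif [ "$aliases" -lt 100 ]; then
        echo 6
    else
        echo "no limit is documented for $aliases aliases" >&2
        return 1
    fi
}
```

 So an interface that will carry 20 aliases should have a name of at most 6
characters.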
 
    - The auto_failback parameter is a replacement for the old nice_failback
parameter.  The old value nice_failback on is replaced by auto_failback off.
The old value nice_failback off is logically replaced by the new
auto_failback on parameter.  Unlike the old nice_failback off behavior, the
new auto_failback on allows the use of the ipfail and hb_standby facilities.
 During upgrades from nice_failback to auto_failback, it is sometimes
necessary to set auto_failback to legacy, as described in the upgrade
procedure below.
    - To upgrade from a pre-auto_failback version of heartbeat to one which
supports auto_failback, the following procedure is recommended to avoid a
flash cut on the whole cluster.
        - Stop heartbeat on one node in the cluster.
        - Upgrade this node.  If the other node has nice_failback on in ha.cf,
          then set auto_failback off in the new ha.cf file.  If the other
          node in the cluster has nice_failback off, then set auto_failback
          legacy in the new ha.cf file.
        - Start the new version of heartbeat on this node.
        - Stop heartbeat on the other node in the cluster.
        - Upgrade this second node in the cluster with the new version of
          heartbeat.  Set auto_failback the same as it was set in the
          previous step.
        - Start heartbeat on this second node in the cluster.
        - If you set auto_failback to on or off, then you are done.
          Congratulations!
        - If you set auto_failback legacy in your ha.cf file, then continue
          as described below...
        - Schedule a time to shut down the entire cluster for a few seconds.
        - At the scheduled time, stop both nodes in the cluster, and then
          change the value of auto_failback to on in the ha.cf file on both
          sides.
        - Restart both nodes in the cluster at about the same time.
        - Congratulations, you're done!  You can now use ipfail, and can also
          use the hb_standby command to cause manual resource moves.
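 The choice of auto_failback value at each stage of the procedure above can be
summarized in a few lines.  This is a sketch only: the function name
auto_failback_for and the phase names "rolling" (old and new nodes coexist)
and "final" (whole cluster upgraded) are made up for illustration.

```shell
# Print the auto_failback value that replaces a nice_failback setting.
# Usage: auto_failback_for NICE_FAILBACK [PHASE]
# PHASE is "rolling" while old and new nodes coexist, or "final" (default).
auto_failback_for() {
    phase="${2:-final}"
    case "$1" in
        on)
            # nice_failback on maps to auto_failback off in every phase.
            echo off
            ;;
        off)
            # nice_failback off needs legacy until both nodes are upgraded.
            if [ "$phase" = rolling ]; then
                echo legacy
            else
                echo on
            fi
            ;;
        *)
            echo "nice_failback must be on or off" >&2
            return 1
            ;;
    esac
}
```

 For example, "auto_failback_for off rolling" prints "legacy", and
"auto_failback_for off" prints "on", the value to set after the scheduled
full-cluster restart.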
 
 
 
- Please be sure that you have read all the documentation and searched the
mailing list archives.  If you still can't find a solution, you can post
questions to the mailing list.  Please include the following:
    
  
    - We love to get good patches.  Here's the preferred way:
        - If you have any questions about the patch, please check with the
          linux-ha-dev mailing list for answers before starting.
        - Make your changes against the current CVS source.
        - Test them, and make sure they work ;-)
        - Produce the patch this way:
          cvs -q diff -u >patchname.txt
        - Send an email to the linux-ha-dev mailing list with the patch as a
          [text/plain] attachment.  If your mailer wants to zip it up for
          you, please fix it.