
...

NOTE: This content was created for an earlier version of the Lustre file system and is currently being updated.

Setting Up Cluster Communications

Communication between the nodes of the cluster allows all nodes to “see” each other. In modern clusters, OpenAIS, or more specifically, its communication stack corosync, is used for this task. All communication paths in the cluster should be redundant so that a failure of a single path is not fatal for the cluster.

An introduction to the setup, configuration, and operation of a Pacemaker cluster can be found in the Pacemaker documentation.

Setting Up the corosync Communication Stack

The corosync communication stack, developed as part of the OpenAIS project, supports all the communication needs of the cluster. The package is included in all recent Linux distributions. If it is not included in your distribution, you can find precompiled binaries at www.clusterlabs.org/rpm. It is also possible to compile OpenAIS from source and install it on all HA nodes by running ./configure, make, and make install.

...

totem {
	version: 2
	secauth: off
	threads: 0
	interface {
		ringnumber:  0
		bindnetaddr: 192.168.0.0
		mcastaddr:   226.94.1.1
		mcastport:   5405
	}
}

Corosync uses the bindnetaddr option to determine which interface is to be used for cluster communication. The example above assumes that one of the node's interfaces is configured on the network 192.168.0.0/24. The value of the option is calculated by a bitwise AND of the interface's IP address and its network mask (IP & MASK), which clears the host bits of the address. Because the result is a network address rather than a node address, the configuration file is independent of any particular node and can be copied to all nodes unchanged. If you have multiple interfaces, you can add additional interface{} sections with different ring numbers.
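As a quick way to verify the value, you can compute the network address yourself. The sketch below is illustrative only: it assumes a hypothetical node address of 192.168.0.42 with a /24 netmask and uses the ipcalc utility found on RHEL-style distributions:

# 192.168.0.42 AND 255.255.255.0 clears the host bits of the address
ipcalc -n 192.168.0.42/24
# prints: NETWORK=192.168.0.0   (use this value for bindnetaddr)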

2. Edit the aisexec section of the configuration file to designate which user can start the service. The user must be root:

...

Printing ring status.
Local node ID (...)
RING ID 0
	id     = (...)
	status = ring 0 active with no faults

Setting Up Redundant Communication Using Bonding

It is recommended that you set up the cluster communication via two or more redundant paths. One way to achieve this is to use bonding interfaces. Please consult the documentation for your Linux distribution for information about how to configure bonding interfaces.
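For illustration only, the following is a minimal sketch of an active-backup bond on a RHEL-style distribution; the interface names, addresses, and file paths are assumptions and will differ between distributions:

# /etc/sysconfig/network-scripts/ifcfg-bond0 (hypothetical example)
DEVICE=bond0
IPADDR=192.168.0.10
NETMASK=255.255.255.0
ONBOOT=yes
BONDING_OPTS="mode=active-backup miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat analogously for eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes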

Setting Up Redundant Communication within corosync

The corosync package itself provides a means for redundant communication. If two or more interfaces are available for cluster communication, an administrator can configure multiple interface{ } sections in the configuration file, each with a different ringnumber. The rrp_mode option in the totem section tells the cluster how to use these interfaces: if the value is set to active, corosync uses all interfaces actively; if the value is set to passive, corosync uses the second ring only if the first ring fails.

totem {
	version: 2
	secauth: off
	threads: 0
	rrp_mode: active
	interface {
		ringnumber:  0
		bindnetaddr: 192.168.0.0
		mcastaddr:   226.94.1.0
		mcastport:   5400
	}
	interface {
		ringnumber:  1
		bindnetaddr: 192.168.1.0
		mcastaddr:   226.94.1.1
		mcastport:   5401
	}
}

Setting Up Resource Management

All services managed by the Pacemaker cluster resource manager are called resources. Pacemaker uses resource agents to start, stop, and monitor resources.

Note: The simplest way to configure the cluster is by using the crm subshell, and all examples are given in this notation. Once you understand the syntax of the cluster configuration, you can also use the GUI or XML notation.

Completing a Basic Setup of the Cluster

To test that your cluster manager is running and set global options, complete the steps below.

...

After the global options of the cluster are set up correctly, continue to the following sections to configure resources and constraints.

Configuring Resources

OSTs are represented as Filesystem resources. A Lustre cluster consists of several Filesystem resources along with constraints that determine on which nodes of the cluster the resources can run.
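As an illustrative sketch in crm notation, an OST resource might look like the following; the device path, mount point, and timeouts are hypothetical placeholders to adapt to your setup:

# Hypothetical OST resource using the ocf:heartbeat:Filesystem agent
primitive resMyOST ocf:heartbeat:Filesystem \
	params device="/dev/sda1" directory="/mnt/ost0" fstype="lustre" \
	op monitor interval="120s" timeout="60s"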

...

If mounting the file system depends on another resource like the start of a RAID or multipath driver, you can include this resource in the cluster configuration. This resource is then monitored by the cluster, enabling Pacemaker to react to failures.
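One way to express such a dependency in crm notation is with ordering and colocation constraints, as in the sketch below; resMyRaid is a hypothetical resource representing the RAID driver:

# Start the RAID resource before the file system and keep both on the same node
order ordRaidBeforeOST inf: resMyRaid resMyOST
colocation colOSTWithRaid inf: resMyOST resMyRaid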

Configuring Constraints

In a simple Lustre cluster setup, constraints are not required. However, in a larger cluster setup, you may want to use constraints to establish relationships between resources. For example, to keep the load distributed equally across nodes in your cluster, you may want to control how many OSTs can run on a particular node.
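For example, a location constraint in crm notation assigns a resource a preference score for a particular node; the node name and score below are hypothetical:

# Prefer to run resMyOST on node1 with a score of 100
location locMyOST resMyOST 100: node1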

...

To find out more about how the cluster resource manager calculates points, see the Pacemaker documentation.

Internal Monitoring of the System

In addition to monitoring the resources themselves, the nodes of the cluster must also be monitored. An important parameter to monitor is whether the node is connected to the network. Each node pings one or more hosts and counts the answers it receives. The number of responses determines how “good” the node’s connection to the network is.
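A common way to implement this in recent Pacemaker versions is the ocf:pacemaker:ping resource agent, cloned so that it runs on every node. In the sketch below, the host list and multiplier are hypothetical examples:

# Ping a reference host and record the connectivity as a node attribute
primitive resPing ocf:pacemaker:ping \
	params host_list="192.168.0.1" multiplier="100" \
	op monitor interval="10s"
clone clonePing resPing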

...

Other system checks, such as CPU usage or free RAM, are measured by the Sysinfo resource. The capabilities of the Sysinfo resource are somewhat limited, so it will be replaced by the SystemHealth strategy in future releases of Pacemaker. For more information about the SystemHealth feature, see: www.clusterlabs.org/wiki/SystemHealth

Administering the Cluster

Careful system administration is required to support high availability in a cluster. A primary task of an administrator is to check the cluster for errors or failures of any resources. When a failure occurs, the administrator must search for the cause of the problem, solve it and then reset the corresponding failcounter. This section describes some basic commands useful to an administrator. For more detailed information, see the Pacemaker documentation.

...

This command removes information about the resource called resMyOST on all nodes.

Setting Up Fencing

Fencing is a technique used to isolate a node from the cluster when it is malfunctioning to prevent data corruption. For example, if a “split-brain” condition occurs in which two nodes can no longer communicate and both attempt to mount the same filesystem resource, data corruption can result. (The Multiple Mount Protection (MMP) mechanism in Lustre is designed to protect a file system from being mounted simultaneously by more than one node.)
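In Pacemaker, fencing is implemented with STONITH resources. The sketch below shows what IPMI-based fencing of a single node might look like; the plugin choice, node name, address, and credentials are all hypothetical assumptions:

# Hypothetical STONITH resource using the external/ipmi plugin
primitive resFenceNode1 stonith:external/ipmi \
	params hostname="node1" ipaddr="192.168.1.100" userid="admin" passwd="secret"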

...

When the other nodes are not able to see the node isolated by the firewall, the isolated node should be shut down or rebooted.

Setting Up Monitoring

Any cluster must be monitored to provide the high availability it was designed for. Consider the following scenario, which demonstrates the importance of monitoring:

...

These options are described in the following sections.

Using crm_mon to Send Email Messages

In the simplest setup, the crm_mon program can be used to send an email each time the status of the cluster changes. This approach requires a fully working mail environment and the mail command.
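As a rough sketch (the mail options require a Pacemaker build with SMTP support, and the address is a placeholder), the invocation looks approximately like this:

# Run crm_mon as a daemon and send an email on every status change
crm_mon --daemonize --mail-to admin@example.com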

...

A node failure could prevent that node from sending an email. In this case, the resource is started on another node, and an email about the successful start of the resource is sent from the new node. It is then the administrator's task to search for the cause of the failover.

Using crm_mon to Send SNMP Traps

The crm_mon daemon can be used to send SNMP traps to a network management server. The configuration from the command line is:

...

The MIB for the traps is defined in the PCMK-MIB.txt file.

Polling the Failcounters

If all the nodes of a cluster have problems, pushing out information about events may not be sufficient. An alternative is to periodically check the failcounters of all resources from the network management station (NMS). A simple script that checks for the presence of any failcounters in the output of crm_mon -1f is shown below:

...
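As a rough sketch of such a check, assuming crm_mon -1f prints a fail-count= entry for every resource with a non-zero failcounter, the script could look like this:

#!/bin/sh
# Exit non-zero if any resource currently has a failcounter set
if crm_mon -1f | grep -q "fail-count"; then
	echo "WARNING: cluster resources with failcounters present"
	exit 1
fi
exit 0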