Setting Up Cluster Communications

Communication between the nodes of the cluster allows all nodes to “see” each other. In modern clusters, OpenAIS, or more specifically, its communication stack corosync, is used for this task. All communication paths in the cluster should be redundant so that a failure of a single path is not fatal for the cluster.

An introduction to the setup, configuration and operation of a Pacemaker cluster can be found in:

This document will use the pcs tool to configure the cluster. pcs is fully supported on el7 variants and is also available from ClusterLabs on GitHub.

Setting Up the corosync Communication Stack

The corosync communication stack, developed as part of the OpenAIS project, supports all the communication needs of the cluster. The package is included in all recent Linux distributions. If it is not included in your distribution, you can find precompiled binaries at www.clusterlabs.org/rpm. It is also possible to compile OpenAIS from source and install it on all HA nodes by running ./configure, make, and make install.

1.1 Install Packages

For Red Hat 7 and CentOS 7 distributions, install the following packages (available from the default repositories):

# yum -y install corosync corosync-cli fence-agents pcs

Also download (or build) and install the Lustre RPMs and, if using ZFS, the ZFS RPMs.

1.2 Setup IPs

For the dual-ring setup, each node must have two separate network interface connections. A cross-over cable between the two nodes will work for the second interface.
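As an illustration only (the interface name eth1, the connection name ring1, and the address are assumptions for node 1 on the 10.0.11.0/24 network used later for ring 1), the second interface could be configured persistently with nmcli:

# nmcli connection add type ethernet con-name ring1 ifname eth1 \
        ipv4.method manual ipv4.addresses 10.0.11.10/24
# nmcli connection up ring1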

1.3 Firewall

If you run firewalld on el7 you will need to open the following ports on all servers (an example firewall-cmd sequence is shown after the list):

  • 988/TCP - Lustre communication
  • 2224/TCP - PCSd communication
  • 5400-5401/UDP - Corosync cluster communication via Multicast (or as you define in 3. Setup Cluster Proper)
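A minimal firewalld sketch that opens the ports listed above (run on every server; adjust the UDP range if you choose different corosync ports in 3. Setup Cluster Proper):

# firewall-cmd --permanent --add-port=988/tcp
# firewall-cmd --permanent --add-port=2224/tcp
# firewall-cmd --permanent --add-port=5400-5401/udp
# firewall-cmd --reload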

2. Setup Cluster Auth

On each node, set a password for the hacluster user:

passwd hacluster

On a single node, authenticate the cluster nodes to pcsd (when prompted, log in as hacluster with the password set above):

pcs cluster auth NODE1 NODE2
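Alternatively, the credentials can be supplied non-interactively (the password placeholder below is whatever was set for hacluster above):

pcs cluster auth NODE1 NODE2 -u hacluster -p <password>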


3. Setup Cluster Proper

On a single node, set up the cluster with two redundant rings:

NODE1=10.0.10.10
NODE2=10.0.10.11
RING0=10.0.10.0
RING1=10.0.11.0
# pcs cluster setup --name MDS $NODE1 $NODE2 --transport udp --token 17000 \
        --addr0 $RING0 --mcast0 226.94.1.0 --mcastport0 5400 \
        --addr1 $RING1 --mcast1 226.94.1.1 --mcastport1 5401 --start

This creates /etc/corosync/corosync.conf on each listed node, which should look like this:

totem {
    version: 2
    secauth: off
    cluster_name: MDS
    transport: udp
    rrp_mode: passive
    token: 17000
    interface {
        ringnumber: 0
        bindnetaddr: 10.0.10.0
        mcastaddr: 226.94.1.0
        mcastport: 5400
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.11.0
        mcastaddr: 226.94.1.1
        mcastport: 5401
    }
}
nodelist {
    node {
        ring0_addr: 10.0.10.10
        nodeid: 1
    }
    node {
        ring0_addr: 10.0.10.11
        nodeid: 2
    }
}
quorum {
    provider: corosync_votequorum
    two_node: 1
}
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}

The totem section of the configuration file describes the way corosync communicates between nodes. The token timeout is extended to 17 seconds (from the 1-second default) so that heavily loaded servers do not time out ring communications.

Corosync uses the bindnetaddr option to determine which interface is to be used for cluster communication. The example above assumes each node has interfaces configured on the networks 10.0.10.0 (ring 0) and 10.0.11.0 (ring 1). The value of the option is calculated by ANDing the interface's IP address with its network mask (IP & MASK), which clears the host bits of the address. The configuration file is therefore independent of any particular node and can be copied to all nodes unchanged.
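If the ipcalc utility is available, the network value for a given interface can be checked quickly (shown here for the example ring-0 address 10.0.10.10 with a /24 netmask):

# ipcalc -n 10.0.10.10 255.255.255.0
NETWORK=10.0.10.0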

4. Check Cluster status

# pcs status
Cluster name: MDS
WARNING: no stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: ieel-mds04 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Fri Mar 10 17:14:27 2017        Last change: Fri Mar 10 17:14:25 2017 by root via cibadmin on ieel-mds03

2 nodes and 0 resources configured

Online: [ ieel-mds03 ieel-mds04 ]

No resources

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
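To confirm that both corosync rings are healthy, you can also query corosync directly on each node; every ring should be reported as active with no faults:

# corosync-cfgtool -s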

5. Setup Stonith

See the Red Hat 7 HA Fencing documentation.

The general form of the command is:

pcs stonith create stonith_id stonith_device_type [stonith_device_options]
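As a purely illustrative sketch (the fence agent, device name, BMC address, and credentials below are assumptions and must be replaced with values appropriate to your hardware), an IPMI-based fencing device for one node might look like this:

# pcs stonith create fence-mds03 fence_ipmilan pcmk_host_list="ieel-mds03" \
        ipaddr="10.0.12.13" login="admin" passwd="changeme" \
        op monitor interval=60s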

Fencing can be disabled (though this is strongly discouraged for production environments):

pcs property set stonith-enabled=false

6. If your cluster consists of just two nodes, switch the quorum feature off.

On a single server enter the following:

 pcs property set no-quorum-policy=ignore 

If your Lustre setup comprises more than two nodes, leave the no-quorum-policy option at its default.
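The resulting quorum state can be inspected directly from corosync; on a two-node cluster the 2Node flag should be shown:

# corosync-quorumtool -s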

Setting Up Resource Management

All services that the Pacemaker cluster resource manager will manage are called resources. The Pacemaker cluster resource manager uses resource agents to start, stop or monitor resources.

The following sections walk through creating and configuring a Lustre resource (specifically an MGT), optionally backed by ZFS.

7.1 ZFS Resource (if using ZFS)

The ZFS resource agent is available via GitHub (it should be included in the resource-agents rpm from version 4.0.1 onward).

It should be installed at:

/usr/lib/ocf/resource.d/heartbeat/ZFS
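If your installed resource-agents package does not yet include the agent, one way to install it manually is to copy it from a checkout of the ClusterLabs resource-agents repository (the in-repository path heartbeat/ZFS is an assumption; adjust to the version you need):

# git clone https://github.com/ClusterLabs/resource-agents.git
# install -m 755 resource-agents/heartbeat/ZFS /usr/lib/ocf/resource.d/heartbeat/ZFS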

7.2 Lustre Resource

Lustre resource agents are available in an rpm:

lustre-resource-agents

This RPM is available from Lustre 2.10 onward.

The main resource agent is installed at:

/usr/lib/ocf/resource.d/lustre/Lustre

8.1 Create ZFS Pool (if using ZFS)

On the main server:

# zpool create -o cachefile=none MGS zpool_definition
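For illustration only (the mirrored vdev layout and disk names below are assumptions; use whatever layout suits your storage), zpool_definition might expand as follows. The cachefile=none option keeps the pool out of the local cache file so that only the cluster imports it:

# zpool create -o cachefile=none MGS mirror /dev/disk/by-id/diskA /dev/disk/by-id/diskB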

8.2 Create Lustre Server

Below is an example of creating a Lustre MGT on the main server:

# mkfs.lustre --reformat --servicenode $NID1:$NID2 --backfstype=zfs --mgs MGS/MGT

or for ldiskfs:

# mkfs.lustre --reformat --servicenode $NID1:$NID2 --backfstype=ldiskfs --mgs /dev/mapper/mpatha
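The $NID1 and $NID2 values are the Lustre NIDs of the primary and failover servers. Assuming the example addresses above and TCP networking (use @o2ib instead for InfiniBand), they would look like:

NID1=10.0.10.10@tcp
NID2=10.0.10.11@tcp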

On each MGS server (main and failover), create the mount point:

# mkdir /mnt/MGS

9.1 Configure ZFS Resource (if using ZFS)

Create the ZFS resource, which will be grouped with the Lustre resource below:

  • pool is the zpool created above
  • the timeout=90 settings ensure ZFS has enough time to import the pool
# pcs resource create poolMGS ZFS pool="MGS" op start timeout="90" op stop timeout="90"

9.2 Configure Lustre Resource

Create Lustre resource:

  • target is the device to mount: either MGS/MGT (ZFS) or /dev/mapper/mpatha (ldiskfs)
  • mountpoint is the directory where the target is mounted (created above in 8.2)
# pcs resource create lustreMGS Lustre target=MGS/MGT mountpoint=/mnt/MGS/

9.3 Order resource (if using ZFS)

This ensures the zpool is imported before the MGS is mounted; it also forces both resources onto the same node:

# pcs resource group add group-MGS poolMGS lustreMGS

Resource groups have implicit co-location and ordering.  Ordering is based on the order of resources added to the group. This behavior can also be achieved via "constraints" but the "resource group" method is simpler to administer.
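For reference, the equivalent explicit constraints (not needed when the group above is used) would be:

# pcs constraint order start poolMGS then lustreMGS
# pcs constraint colocation add lustreMGS with poolMGS INFINITY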

A similar tack could be taken with LVM resources for ldiskfs-backed targets; a sketch follows.
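As a sketch only (the resource name lvmMGS and volume group vg_mgs are assumptions), an ldiskfs target on LVM could be managed and grouped in the same way using the ocf:heartbeat:LVM agent:

# pcs resource create lvmMGS LVM volgrpname="vg_mgs" exclusive="true" \
        op start timeout="90" op stop timeout="90"
# pcs resource group add group-MGS lvmMGS lustreMGS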

External Resources

