
RHEL 7 – Pacemaker – Configure Redundant Corosync Links on the Fly – Part 10

Corosync Links - RHEL7

The Corosync cluster engine provides reliable communication between the cluster nodes. It keeps the cluster configuration in sync across the nodes at all times, maintains cluster membership, and sends notifications when quorum is achieved or lost. It also provides the messaging layer the cluster uses to manage system and resource availability. In Veritas Cluster, this functionality is provided by LLT + GAB (Low Latency Transport + Group Membership Services/Atomic Broadcast). Unlike Veritas Cluster, Corosync uses the existing network interfaces to communicate between the cluster nodes.

 

Why do we need redundant Corosync links?

By default, we configure network bonding by aggregating a couple of physical network interfaces for the primary node IP, and Corosync uses this interface as its heartbeat link in the default configuration. If a network issue breaks connectivity between the two nodes, the cluster may face a split-brain situation. To avoid split-brain, we configure an additional network link. This link should go through a different network switch, or a direct network cable can be used between the two nodes.

Note: For tutorial simplicity, we will use unicast (not multicast) for Corosync. The unicast method is fine for a two-node cluster.
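For reference, a multicast configuration would drop the udpu transport and define an interface block inside the totem section instead. A minimal sketch, with illustrative bindnetaddr and mcastaddr values (not taken from this cluster):

totem {
    version: 2
    secauth: off
    cluster_name: UABLR
    transport: udp
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.203.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
    }
}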

 

Configuring additional Corosync links is an online activity and can be done without impacting the running services.

 

Let’s explore the existing configuration:

1. View the corosync configuration using pcs command.

[root@UA-HA ~]# pcs cluster corosync
totem {
    version: 2
    secauth: off
    cluster_name: UABLR
    transport: udpu
}

nodelist {
    node {
        ring0_addr: UA-HA
        nodeid: 1
    }

    node {
        ring0_addr: UA-HA2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}

[root@UA-HA ~]#

 

2. Corosync uses two UDP ports: mcastport (for mcast receives) and mcastport - 1 (for mcast sends). With the default mcastport of 5405, this means UDP ports 5405 and 5404.

[root@UA-HA ~]# netstat -plantu | grep 54 |grep corosync
udp        0      0 192.168.203.134:5405    0.0.0.0:*                           34363/corosync
[root@UA-HA ~]#
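If a firewall sits between the cluster nodes, these UDP ports must be allowed through. On RHEL 7 with firewalld, the predefined high-availability service covers the Pacemaker/Corosync ports; a quick sketch, assuming firewalld is in use on both nodes:

[root@UA-HA ~]# firewall-cmd --permanent --add-service=high-availability
[root@UA-HA ~]# firewall-cmd --reload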

 

3. The Corosync configuration file is located in /etc/corosync.

[root@UA-HA ~]# cat /etc/corosync/corosync.conf
totem {
    version: 2
    secauth: off
    cluster_name: UABLR
    transport: udpu
}

nodelist {
    node {
        ring0_addr: UA-HA
        nodeid: 1
    }

    node {
        ring0_addr: UA-HA2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
[root@UA-HA ~]#

 

4. Verify current ring Status using corosync-cfgtool.

[root@UA-HA ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
        id      = 192.168.203.134
        status  = ring 0 active with no faults
[root@UA-HA ~]# ssh UA-HA2 corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
        id      = 192.168.203.131
        status  = ring 0 active with no faults
[root@UA-HA ~]#

 

As we can see, only one ring has been configured for Corosync, and it uses the following interface on each node.

[root@UA-HA ~]# ifconfig br0
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.203.134  netmask 255.255.255.0  broadcast 192.168.203.255
        

[root@UA-HA ~]# ssh UA-HA2 ifconfig br0
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.203.131  netmask 255.255.255.0  broadcast 192.168.203.255
        
[root@UA-HA ~]#

 

Configure a new ring:

 

5. To add redundancy to the Corosync links, we will use the following interface on both nodes.

[root@UA-HA ~]# ifconfig eno33554984
eno33554984: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.16.0.3  netmask 255.255.255.0  broadcast 172.16.0.255
        
[root@UA-HA ~]# ssh UA-HA2 ifconfig eno33554984
eno33554984: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.16.0.2  netmask 255.255.255.0  broadcast 172.16.0.255
       
[root@UA-HA ~]#

Dedicated private addresses for the Corosync links:
172.16.0.3 – UA-HA-HB2
172.16.0.2 – UA-HA2-HB2
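If these private addresses are not yet configured persistently, they can be assigned through NetworkManager. A hedged example for UA-HA, assuming the connection profile carries the same name as the interface:

[root@UA-HA ~]# nmcli connection modify eno33554984 ipv4.method manual ipv4.addresses 172.16.0.3/24
[root@UA-HA ~]# nmcli connection up eno33554984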

 

6. Before making changes to the Corosync configuration, we need to move the cluster into maintenance mode.

[root@UA-HA ~]# pcs property set maintenance-mode=true
[root@UA-HA ~]# pcs property show maintenance-mode
Cluster Properties:
 maintenance-mode: true
[root@UA-HA ~]#

 

This puts the resources into an unmanaged state.

[root@UA-HA ~]# pcs resource
 Resource Group: WEBRG1
     vgres      (ocf::heartbeat:LVM):   Started UA-HA (unmanaged)
     webvolfs   (ocf::heartbeat:Filesystem):    Started UA-HA (unmanaged)
     ClusterIP  (ocf::heartbeat:IPaddr2):       Started UA-HA (unmanaged)
     webres     (ocf::heartbeat:apache):        Started UA-HA (unmanaged)
 Resource Group: UAKVM2
     UAKVM2_res (ocf::heartbeat:VirtualDomain): Started UA-HA2 (unmanaged)
[root@UA-HA ~]#

 

7. Update /etc/hosts with the following entries on both nodes.

[root@UA-HA corosync]# cat /etc/hosts |grep HB2
172.16.0.3     UA-HA-HB2
172.16.0.2     UA-HA2-HB2
[root@UA-HA corosync]#
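Before editing corosync.conf, it is worth confirming that the new names resolve and that the nodes can reach each other over the private link, for example:

[root@UA-HA ~]# getent hosts UA-HA2-HB2
[root@UA-HA ~]# ping -c 2 UA-HA2-HB2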

 

8. Update corosync.conf with rrp_mode and the ring1_addr entries.

[root@UA-HA corosync]# cat corosync.conf
totem {
    version: 2
    secauth: off
    cluster_name: UABLR
    transport: udpu
    rrp_mode: active
}

nodelist {
    node {
        ring0_addr: UA-HA
        ring1_addr: UA-HA-HB2
        nodeid: 1
    }

    node {
        ring0_addr: UA-HA2
        ring1_addr: UA-HA2-HB2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
[root@UA-HA corosync]#

 

Here is the difference between the previous configuration file and the new one.

[root@UA-HA corosync]# sdiff -s corosync.conf corosync.conf_back
   rrp_mode: active                                           <
        ring1_addr: UA-HA-HB2                                 <
        ring1_addr: UA-HA2-HB2                                <
[root@UA-HA corosync]#
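Since the file was edited by hand, corosync.conf must be made identical on both nodes before restarting the service. A simple way to push it across, assuming the same path on UA-HA2:

[root@UA-HA corosync]# scp /etc/corosync/corosync.conf UA-HA2:/etc/corosync/corosync.conf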

 

9. Restart the corosync service on both nodes.

[root@UA-HA ~]# systemctl restart corosync
[root@UA-HA ~]# ssh UA-HA2 systemctl restart corosync

 

10. Check the corosync service status.

[root@UA-HA ~]# systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2015-10-19 02:38:16 EDT; 16s ago
  Process: 36462 ExecStop=/usr/share/corosync/corosync stop (code=exited, status=0/SUCCESS)
  Process: 36470 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
 Main PID: 36477 (corosync)
   CGroup: /system.slice/corosync.service
           └─36477 corosync

Oct 19 02:38:15 UA-HA corosync[36477]:  [QUORUM] Members[2]: 2 1
Oct 19 02:38:15 UA-HA corosync[36477]:  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 19 02:38:16 UA-HA systemd[1]: Started Corosync Cluster Engine.
Oct 19 02:38:16 UA-HA corosync[36470]: Starting Corosync Cluster Engine (corosync): [  OK  ]
Oct 19 02:38:24 UA-HA corosync[36477]:  [TOTEM ] A new membership (192.168.203.134:3244) was formed. Members left: 2
Oct 19 02:38:24 UA-HA corosync[36477]:  [QUORUM] Members[1]: 1
Oct 19 02:38:24 UA-HA corosync[36477]:  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 19 02:38:25 UA-HA corosync[36477]:  [TOTEM ] A new membership (192.168.203.131:3248) was formed. Members joined: 2
Oct 19 02:38:26 UA-HA corosync[36477]:  [QUORUM] Members[2]: 2 1
Oct 19 02:38:26 UA-HA corosync[36477]:  [MAIN  ] Completed service synchronization, ready to provide service.
[root@UA-HA ~]#
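Membership and quorum can also be cross-checked after the restart; both nodes should be listed as members:

[root@UA-HA ~]# corosync-quorumtool
[root@UA-HA ~]# pcs status corosync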

 

11. Verify the corosync configuration using the pcs command.

[root@UA-HA ~]# pcs cluster corosync
totem {
    version: 2
    secauth: off
    cluster_name: UABLR
    transport: udpu
    rrp_mode: active
}

nodelist {
    node {
        ring0_addr: UA-HA
        ring1_addr: UA-HA-HB2
        nodeid: 1
    }

    node {
        ring0_addr: UA-HA2
        ring1_addr: UA-HA2-HB2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}

[root@UA-HA ~]#

 

12. Verify the ring status.

[root@UA-HA ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
        id      = 192.168.203.134
        status  = ring 0 active with no faults
RING ID 1
        id      = 172.16.0.3
        status  = ring 1 active with no faults
[root@UA-HA ~]# ssh UA-HA2 corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
        id      = 192.168.203.131
        status  = ring 0 active with no faults
RING ID 1
        id      = 172.16.0.2
        status  = ring 1 active with no faults
[root@UA-HA ~]#

 

You can also check the ring status using the following command.

[root@UA-HA ~]# corosync-cmapctl |grep member
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.203.134) r(1) ip(172.16.0.3)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.203.131) r(1) ip(172.16.0.2)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
[root@UA-HA ~]#

We have successfully configured redundant rings for Corosync.

 

13. Clear the cluster maintenance mode.

[root@UA-HA ~]# pcs property unset maintenance-mode

or 

[root@UA-HA ~]#  pcs property set maintenance-mode=false

[root@UA-HA ~]# pcs resource
 Resource Group: WEBRG1
     vgres      (ocf::heartbeat:LVM):   Started UA-HA
     webvolfs   (ocf::heartbeat:Filesystem):    Started UA-HA
     ClusterIP  (ocf::heartbeat:IPaddr2):       Started UA-HA
     webres     (ocf::heartbeat:apache):        Started UA-HA
 Resource Group: UAKVM2
     UAKVM2_res (ocf::heartbeat:VirtualDomain): Started UA-HA2
[root@UA-HA ~]#

 

Let’s break it !!

You can easily test rrp_mode by pulling the network cable from one of the configured interfaces. I have simply used the “ifconfig br0 down” command on the UA-HA2 node to simulate this test, assuming that the application/DB is using a different interface.

[root@UA-HA ~]# ping UA-HA2
PING UA-HA2 (192.168.203.131) 56(84) bytes of data.
^C
--- UA-HA2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1002ms

[root@UA-HA ~]#

 

Check the ring status. We can see that ring 0 has been marked as faulty.

[root@UA-HA ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
        id      = 192.168.203.134
        status  = Marking ringid 0 interface 192.168.203.134 FAULTY
RING ID 1
        id      = 172.16.0.3
        status  = ring 1 active with no faults
[root@UA-HA ~]#
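The same fault should also be recorded in the Corosync log file defined in the logging section (the path comes from corosync.conf above):

[root@UA-HA ~]# grep -i faulty /var/log/cluster/corosync.log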

 

You can see that the cluster keeps running perfectly without any issues.

[root@UA-HA ~]# pcs resource
 Resource Group: WEBRG1
     vgres      (ocf::heartbeat:LVM):   Started UA-HA
     webvolfs   (ocf::heartbeat:Filesystem):    Started UA-HA
     ClusterIP  (ocf::heartbeat:IPaddr2):       Started UA-HA
     webres     (ocf::heartbeat:apache):        Started UA-HA
 Resource Group: UAKVM2
     UAKVM2_res (ocf::heartbeat:VirtualDomain): Started UA-HA2
[root@UA-HA ~]#

 

Bring the br0 interface back up using “ifconfig br0 up”. Ring 0 comes back online.

[root@UA-HA ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
        id      = 192.168.203.134
        status  = ring 0 active with no faults
RING ID 1
        id      = 172.16.0.3
        status  = ring 1 active with no faults
[root@UA-HA ~]#
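If a ring ever stays marked FAULTY after the underlying link has been repaired, the redundant ring state can be reset cluster-wide with the -r option of corosync-cfgtool:

[root@UA-HA ~]# corosync-cfgtool -r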

Hope this article is informative to you. Share it! Comment on it!! Be sociable!!!
