NetApp – Clustered DATA ONTAP – Storage Failover – SFO – Part 10

Cloud_Devops

9 years ago

In NetApp Storage system, RAID feature will provide the fault tolerance in-case of any disk failures. But what will happen if the controller(node) itself fails ? NetApp ships the controllers as HA pairs (Two controllers in one chassis ). If one node fails , automatically other controller will take over the storage. Enabling storage failover (SFO) is done within pairs, regardless of how many nodes are in the cluster. For SFO (storage failover), the HA pairs must be of the same model. The cluster itself can contain a mixture of models, but each HA pair must be homogeneous. The version of the Data ONTAP operating system must be the same on both nodes of the HA pair, except for the short period of time during which the pair is upgraded. Two HA interconnect cables are required to connect the NVRAM cards (except for FAS and V-Series 32×0 models with single-enclosure HA). The storage failover(SFO) can be enabled on either node in the pair. Storage Failover(SFO) can be initiated from any node in the cluster.

Cluster high availability (HA) is activated automatically when you enable storage failover on clusters that consist of two nodes, and you should be aware that automatic giveback is enabled by default. On clusters that consist of more than two nodes, automatic giveback is disabled by default, and cluster HA is disabled automatically.

Let’s have closer look at HA Pairs:

HA pair controllers are connected to each other through an HA interconnect. This allows one node to serve data that
resides on the disks of its failed partner node. Each node continually monitors its partner, mirroring the data for each other’s nonvolatile memory (NVRAM or NVMEM). The interconnect is internal and requires no external cabling if both controllers are in the same chassis.

HA pairs are components of the cluster, and both nodes in the HA pair are connected to other nodes in the cluster
through the data and cluster networks. But only the nodes in the HA pair can take over each other’s storage. Non-HA nodes are not supported in a cluster that contains two or more nodes. Although single-node clusters are supported, joining two single-node clusters to create one cluster is not supported, unless you wipe clean one of the single-node clusters and join it to the other to create a two-node cluster that consists of an HA pair.

Let’s see that what will happen during the unplanned event,

Assume that Node1 and Node 2 own their root and data aggregates.
If Node1 fails ,
Node2 takeover root and data aggregates of Node1 .

When a node fails, an unplanned event or automatic takeover is initiated (8.2 and prior). Ownership of data aggregates is changed to the HA partner. After the ownership is changed, the partner can read and write to the volumes on the partner’s data aggregates. Ownership of aggr0 disks remain with the failed node, but the partner takes over control of the aggregate which can be mounted from the partner for diagnostic purposes.

Giveback :

Automatic or Manual giveback is initiated with storage failover giveback command.
Aggr0 is given back to node 1 to boot the node.
Data aggregate giveback occurs one aggregate at a time.

Giveback is initiated by the storage failover giveback command or by automatic giveback if the system is configured for it. The node must have access to its root volume on aggr0 to fully boot. The CFO HA policy ensures that aggr0 is given back immediately to the allow the node to boot. After the node has fully booted, the partner node returns ownership of the data aggregates one at a time until giveback is complete. You can monitor the progress of the giveback with the storage failover show-giveback command. I/O resumes for each aggregate when giveback is complete for that aggregate, thereby reducing the overall outage window of each aggregate.

Aggregation Relocation: (ARL)

Aggregate relocation operations take advantage of the HA configuration to move the ownership of storage aggregates
within the HA pair. Aggregate relocation occurs automatically during manually initiated takeover and giveback operations to reduce downtime during maintenance. Aggregate relocation can be initiated manually for load balancing. Aggregate relocation cannot move ownership of the root aggregate.

During a manually initiated takeover, before the target controller is taken over, ownership of each aggregate that belongs to the target controller is moved to the partner controller one aggregate at a time. When giveback is initiated, the ownership is automatically moved back to the original node. To suppress aggregate relocation during the takeover, use the -bypass-optimization parameter with the storage failover takeover command.

Planned Event in ONTAP 8.2 with ARL:

When a node takes over its partner, it continues to serve and update data in the partner’s aggregates and volumes. To do this, it takes ownership of the partner’s data aggregates, and the partner’s LIFs migrate according to network interface failover rules.

What is the difference between NetApp CFO and SFO ?

Root Aggregates are always assigned to CFO (controller Failover) policy.
Data Aggregates are assigned to SFO (Storage Failover policy)

Check the HA Pair status :

cluster::> storage failover show
                    Takeover
Node      Partner   Possible State
--------- --------- -------- --------------------------------
A         B         true     Connected to B
B         A         true     Connected to A

Check the aggregate’s failover policy on the cluster nodes.

NetUA::> aggregate show -node NetUA-01 -fields ha-policy
  (storage aggregate show)
aggregate     ha-policy
------------- ---------
NetUA01_aggr1 sfo
aggr0_01      cfo
2 entries were displayed.

NetUA::> aggregate show -node NetUA-02 -fields ha-policy
  (storage aggregate show)
aggregate ha-policy
--------- ---------
aggr0_02  cfo

NetUA::>

Aggr0_xx represents the root volume of controller node.So the failover policy will be set to CFO always. All the data aggregate storage policy has been set to SFO.

Note: We should not store any data volumes on aggr0.

The following commands will help you to identify the failover policy for specific node.

NetUA::> storage failover show -node NetUA-01
NetUA::> storage failover show -node NetUA-02

NetUA-01 & NetUA-02 are HA node names.

To disables auto giveback on the HA nodes, use the following command.

NetUA::> storage failover modify -node NetUA-01 -auto-giveback false
NetUA::> storage failover modify -node NetUA-02 -auto-giveback false

To enables the auto giveback on HA nodes, use the following command.

NetUA::> storage failover modify -node NetUA-01 -auto-giveback true
NetUA::> storage failover modify -node NetUA-02 -auto-giveback true

To initiate the failover, use the following command.

NetUA::storage failover> storage failover takeover -ofnode  NetUA-02

or 

NetUA::storage failover> storage failover takeover -bynode   NetUA-01
NetUA::storage failover>

You can use either one of the above command to take over the NetUA-02 node’s storage.

Please read the failover man page carefully to know the available option .

Note: { -ofnode {|local} - Node to Takeover
This specifies the node that is taken over. It is shut down and its partner takes over its storage.

| -bynode {|local} } - Node Initiating Takeover
This specifies the node that is to take over its partner's storage.

[-option ] - Takeover Option
This optionally specifies the style of takeover operation. Possible values include the following:

[-bypass-optimization {true|false}] - Bypass Takeover Optimization
If this is an operator-initiated planned takeover, this parameter specifies whether the takeover optimization is bypassed. This parameter defaults to false.

[-skip-lif-migration [true]] - Skip LIF Migration
This parameter specifies that LIF migration prior to takeover is skipped. Without this parameter, the command attempts to synchronously migrate data and cluster management LIFs away from the node prior to its takeover. If the migration fails or times out, the takeover is aborted.

o normal - Specifies a normal takeover operation; that is, the partner is given the time to close its storage resources gracefully before
the takeover operation proceeds. This is the default value.

o immediate - Specifies an immediate takeover. In an immediate takeover, the takeover operation is initiated before the partner is given the time to close its storage resources gracefully. The use of this option results in an immediate takeover which does not do a clean shutdown. In case of NDU this can result in a NDU failure.

Attention: If this option is specified, negotiated takeover optimization is bypassed even if the -bypass-optimization option is set to false.

o allow-version-mismatch - If this value is specified, the takeover operation is initiated even if the partner is running a version of software that is incompatible with the version running on the node. In this case, the partner is given the time to close its storage resources gracefully before the takeover operation proceeds. Use this value as part of a non-disruptive upgrade procedure.

o force - If this value is specified, the takeover operation is initiated even if the node detects an error that normally prevents a takeover operation from occurring. This value is available only at the advanced privilege level and higher.

Attention: If this option is specified, negotiated takeover optimization is bypassed even if the -bypass-optimization option is set to false.
Caution: The use of this option can potentially result in data loss. If the HA interconnect is detached or inactive, or the contents of the failover partner's NVRAM cards are unsynchronized, takeover is normally disabled. Using the -force option enables a node to take over its partner's storage despite the unsynchronized NVRAM, which can contain client data that can be lost upon storage takeover.

SFO summary – NetApp

Hope you got the fair idea about the storage failover on NetApp clustered Data ONTAP.