Sun Cluster Cheat sheet

Cluster Configuration Repository (CCR)

 /etc/cluster/ccr (directory)

Important Files

 /etc/cluster/ccr/infrastructure

Global Services

 One node is assigned to each specific global service. All other nodes communicate with the global services (devices, filesystems)

via the cluster interconnect.

Global Naming (DID Devices)

 /dev/did/dsk and /dev/did/rdsk

DID device names are used only for global naming, not for access

DID device names cannot be used in VxVM

DID device names are used in Sun/Solaris Volume Manager

Global Devices

provide global access to devices irrespective of their physical location.

Most commonly, SDS/SVM/VxVM devices are used as global devices. The volume manager software is unaware of the

implementation of the global nature of these devices.

/global/.devices/node@nodeID

nodeID is an integer representing the node in the cluster

Global Filesystems

 # mount -o global,logging /dev/vx/dsk/nfsdg/vol01 /global/nfs

or edit the /etc/vfstab file to contain the following:

/dev/vx/dsk/nfsdg/vol01 /dev/vx/rdsk/nfsdg/vol01 /global/nfs ufs 2 yes global,logging

Global Filesystem is also known as (aka) Cluster Filesystem (CFS) or PxFS (Proxy File system)
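As a quick sanity check, the mount-options field of a vfstab-style line can be tested for the global option with standard shell tools. This is an illustrative sketch, not a Sun Cluster utility; the sample line is the one shown above.

```shell
# Sketch: the 7th vfstab field holds the mount options; check it for "global".
line='/dev/vx/dsk/nfsdg/vol01 /dev/vx/rdsk/nfsdg/vol01 /global/nfs ufs 2 yes global,logging'
opts=$(echo "$line" | awk '{print $7}')
case ",$opts," in
  *,global,*) result="global mount" ;;
  *)          result="local mount"  ;;
esac
echo "$result"
```

The same pattern works against the real /etc/vfstab by reading lines in a while loop.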

Note

Local failover filesystems (i.e. directly attached to a storage device) cannot be used for scalable services; global

filesystems must be used instead.

Console Software

 SUNWccon

There are three variants of the cluster console software:

cconsole (access the node consoles through the TC or other remote console access method)

crlogin (uses rlogin as underlying transport)

ctelnet (uses telnet as underlying transport)

/opt/SUNWcluster/bin/<variant> &   # where <variant> is cconsole, crlogin, or ctelnet

Cluster Control Panel

 /opt/SUNWcluster/bin/ccp [ clustername ] &

All necessary info for cluster admin is stored in the following two files:

/etc/clusters e.g. sc-cluster sc-node1 sc-node2

/etc/serialports

sc-node1 sc-tc 5002 # Connect via TCP port on TC

sc-node2 sc-tc 5003

sc-10knode1 sc10k-ssp 23 # connect via E10K SSP

sc-10knode2 sc10k-ssp 23

sc-15knode1 sf15k-mainsc 23 # Connect via 15K Main SC

e250node1 RSCIPnode1 23 # Connect via LAN RSC on a E250

node1 sc-tp-ws 23 # Connect via a tip launchpad

sf1_node1 sf1_mainsc 5001 # Connect via passthru on midframe
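The /etc/clusters format (cluster name followed by its member nodes) is easy to parse with plain shell. The following is an illustrative sketch using the sample line above, not part of the cluster software:

```shell
# Sketch: split an /etc/clusters-style line into the cluster name and
# its member nodes (field 1 = cluster name, remaining fields = nodes).
clusters_line='sc-cluster sc-node1 sc-node2'
set -- $clusters_line
cluster=$1
shift
nodes="$*"
echo "cluster=$cluster nodes=$nodes"
```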

Sun Cluster Set up

 Don’t mix PCI and SBus SCSI devices

Quorum Device Rules

 A quorum device must be available to both nodes in a 2-node cluster

quorum device info is maintained globally in the CCR db

quorum devices may contain user data

Max and optimal number of votes contributed by quorum devices must be N -1 (where N == number of nodes

in the cluster)

If the number of quorum devices is greater than or equal to the number of nodes, the cluster cannot come up easily if too many

quorum devices have failed or errored

quorum devices are not required in clusters with more than 2 nodes, but recommended for higher cluster

availability

quorum devices are manually configured after Sun Cluster s/w installation is done

quorum devices are configured using DID devices

Quorum Math and Consequences

 A running cluster is always aware of (Math):

Total possible Q votes (number of nodes + disk quorum votes)

Total present Q votes (number of booted nodes + available quorum device votes)

Total needed Q votes (a majority, i.e. more than 50% of possible votes)

Consequences:

Node that cannot find adequate Q votes will freeze, waiting for other nodes to join the cluster

A node that is booted in the cluster but can no longer find the needed number of votes panics its kernel

installmode Flag — allows for cluster nodes to be rebooted after/during initial installation without causing the other

(active) node(s) to panic.
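The vote arithmetic above can be sketched for a two-node cluster with one quorum disk (the disk contributes N-1 = 1 vote), where the needed count is a simple majority of the possible votes:

```shell
# Quorum arithmetic sketch: 2-node cluster, one quorum disk.
# Each node contributes 1 vote; the quorum disk contributes N-1 votes.
nodes=2
quorum_disk_votes=$((nodes - 1))
possible=$((nodes + quorum_disk_votes))   # 2 + 1 = 3
needed=$((possible / 2 + 1))              # majority of 3 = 2
echo "possible=$possible needed=$needed"
```

With 2 needed votes, a surviving node plus the quorum disk can keep the cluster running when its peer fails.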

Cluster status Reporting the cluster membership and quorum vote information

 # /usr/cluster/bin/scstat -q

 Verifying cluster configuration info

 # scconf -p

 Run scsetup to correct any configuration mistakes and/or to:

 add or remove quorum disks

add, remove, enable, disable cluster transport components

register/unregister vxVM device groups

add/remove node access from a VxVM device group

change cluster private host names

change cluster name

Shutting down the cluster on all nodes

 # scshutdown -y -g 15

# scstat #verifies cluster status

Cluster Daemons

 lahirdx@aescib1:/home/../lahirdx > ps -ef|grep cluster|grep -v grep

root 4 0 0 May 07 ? 352:39 cluster

root 111 1 0 May 07 ? 0:00 /usr/cluster/lib/sc/qd_userd

root 120 1 0 May 07 ? 0:00 /usr/cluster/lib/sc/failfastd

root 123 1 0 May 07 ? 0:00 /usr/cluster/lib/sc/clexecd

root 124 123 0 May 07 ? 0:00 /usr/cluster/lib/sc/clexecd

root 1183 1 0 May 07 ? 46:45 /usr/cluster/lib/sc/rgmd

root 1154 1 0 May 07 ? 0:07 /usr/cluster/lib/sc/rpc.fed

root 1125 1 0 May 07 ? 23:49 /usr/cluster/lib/sc/sparcv9/rpc.pmfd

root 1153 1 0 May 07 ? 0:03 /usr/cluster/lib/sc/cl_eventd

root 1152 1 0 May 07 ? 0:04 /usr/cluster/lib/sc/cl_eventlogd

root 1336 1 0 May 07 ? 2:17 /var/cluster/spm/bin/scguieventd -d

root 1174 1 0 May 07 ? 0:03 /usr/cluster/bin/pnmd

root 1330 1 0 May 07 ? 0:01 /usr/cluster/lib/sc/scdpmd

root 1339 1 0 May 07 ? 0:00 /usr/cluster/lib/sc/cl_ccrad

FF Panic rule — failfast will shut down the node (panic the kernel) if the specified daemon is not restarted within

30 seconds

cluster — System process created by the kernel to encapsulate the kernel threads that make up the core kernel

operations. It directly panics the kernel if it is sent a KILL signal (SIGKILL). Other signals have no effect.

clexecd — Used by cluster kernel threads to execute userland commands (such as the run_reserve and dofsck

commands). It is also used to run cluster commands remotely (e.g. scshutdown). A failfast driver panics the kernel if this daemon

is killed and not restarted within 30 seconds.

cl_eventd — Registers and forwards cluster events (e.g. nodes entering and leaving the cluster). As of

SC 3.1 10/03, user applications can register themselves to receive cluster events. The daemon is automatically

respawned by rpc.pmfd if it is killed.

rgmd — The resource group manager, which manages the state of all cluster-unaware applications. A failfast driver

panics the kernel if this daemon is killed and not restarted within 30 seconds.

rpc.fed — The "fork-and-exec" daemon, which handles requests from rgmd to spawn methods for specific data

services. Failfast will panic the node if this is killed and not restarted within 30 seconds.

scguieventd — Processes cluster events for the SunPlex or Sun Cluster Manager GUI, so that the display

can be updated in real time. It is not automatically restarted if it stops. If you are having trouble with SunPlex or Sun

Cluster Manager, you might have to restart the daemon or reboot the specific node.

rpc.pmfd — The process monitoring facility. It is used as a general mechanism to initiate restarts and failure

action scripts for some cluster framework daemons, and for most application daemons and application fault monitors. The FF panic rule

holds good.

pnmd — The public network management daemon; manages network status information received from the local IPMP daemon

(in.mpathd) running on each node in the cluster. It is automatically restarted by rpc.pmfd if it dies.

scdpmd — The multi-threaded disk path monitoring (DPM) daemon; runs on each node and is started by an rc script when a node

boots. It monitors the availability of the logical paths visible through the various multipath drivers (MPxIO, HDLM,

PowerPath, etc.). Automatically restarted by rpc.pmfd if it dies.

Validating basic cluster config

 The sccheck (/usr/cluster/bin/sccheck) cmd validates the cluster configuration:

/var/cluster/sccheck is the repository where the generated reports are stored.

Disk Path Monitoring

 scdpm -p all:all prints all disk paths in the cluster and their status

scinstall -pv checks the cluster installation status — package revisions, patches applied, etc.

Cluster release file: /etc/cluster/release

Shutting down the cluster

 scshutdown -y -g 30

Booting nodes in non-cluster mode

 boot -x

 Placing node in maintenance mode

 scconf -c -q node=<nodename>,maintstate

 Reset the maintenance mode by rebooting the node or running

 scconf -c -q reset

Placing a cluster node in maintenance mode reduces the number of required quorum

votes and ensures that cluster operation is not disrupted as a result.

SunPlex or Sun Cluster Manager is available at https://<nodename>:3000.

VxVM Rootdg requirements for Sun Cluster

 vxio major number has to be identical on all nodes of the cluster (check for vxio entry in /etc/name_to_major)

VxVM must be installed on all nodes physically connected to shared storage; on non-storage nodes, VxVM can be used to

encapsulate and mirror the boot disk. If not using VxVM on a non-storage node, use SVM. All that is required in such a

case is that the vxio major number be identical to that on all other nodes of the cluster (add an entry in the /etc/name_to_major

file).

A VxVM license is required on all nodes not connected to an A5x00 StorEdge array.

A standard rootdg is created on all nodes where VxVM is installed. Options to initialize rootdg on each node are:

Encapsulate the boot disk so it can be mirrored. This preserves all data, creating volumes inside rootdg to encapsulate

/global/.devices/node@#. (If a disk has more than 5 slices on it, it cannot be encapsulated.)

Initialize other local disks into rootdg.

A unique volume name and minor number across the nodes is required for the /global/.devices/node@# file system if

the boot disk is encapsulated: that file system must be on devices with a unique name on each node, because it is

mounted on each node. The normal Solaris OS /etc/mnttab logic predates global file systems and still demands that

each device have a unique major/minor number. VxVM doesn't support changing minor numbers of individual volumes;

the entire disk group has to be re-minored.

Use the following command:

# vxdg [ -g diskgroup ] [ -f ] reminor [diskgroup ] new-base-minor

From the vxdg man pages:

reminor Changes the base minor number for a disk group, and renumbers all devices in

the disk group to a range starting at that number. If the device for a volume is open,

then the old device number remains in effect until the system is rebooted or until the

disk group is deported and re-imported.

Also, if you close an open volume, then the user can execute vxdg reminor again to

cause the renumbering to take effect without rebooting or reimporting.

A new device number may also overlap with a temporary renumbering for a volume device.

This also requires a reboot or reimport for the new device numbering to take effect. A

temporary renumbering can happen in the following situations:

when two volumes (for example, volumes in two different disk groups) share the same

permanently assigned device number, in which case one of the volumes is renumbered

temporarily to use an alternate device number;

or when the persistent device number for a volume was changed, but the active device

number could not be changed to match.

The active number may be left unchanged after a persistent device number change either

because the volume device was open, or because the new number was in use as the active

device number for another volume.

vxdg fails if you try to use a range of numbers that is currently in use as a

persistent (not a temporary) device number. You can force use of the number range with

use of the -f option. With -f, some device renumberings may not take effect until a

reboot or a re-import (just as with open volumes). Also, if you force volumes in two

disk groups to use the same device number, then one of the volumes is temporarily

renumbered on the next reboot. Which volume device is renumbered should be considered

random, except that device numberings in the rootdg disk group take precedence over

all others.

The -f option should be used only when swapping the device number ranges used by two or

more disk groups. To swap the number ranges for two disk groups, you would use -f when

renumbering the first disk group to use the range of the second disk group. Renumbering

the second disk group to the first range does not require the use of -f.

Sun Cluster does not work with Veritas DMP. DMP can be disabled before installing the software by putting in

dummy symlinks, etc.

scvxinstall is a shell script that automates VxVM installation in a Sun Clustered environment

scvxinstall automates the following things:

tries to disable DMP (vxdmp)

installs correct cluster package

automatically negotiates a vxio major number and properly edits /etc/name_to_major

automates rootdg initialization process and encapsulates boot disk

gives different device names for the /global/.devices/node@# volumes on each side

edits the vfstab properly for this same volume. The problem is this particular line has DID device on it,

and VxVM doesn’t understand DID devices.

installs a script to “reminor” the rootdg on the reboot

reboots the node so that VxVM operates properly

Displays existing device group resources in the Cluster

 scstat -D

 Registering VxVM device groups

 scconf -a -D type=vxvm,name=<devgrpname>, \

nodelist=<node1>:<node2>, \

preferenced=true,failback=enabled

nodelist should contain only nodes that are physically connected to the disks of that device group.

preferenced=true/false affects whether nodelist indicates an order of failover preference. On a two-node

cluster, this option is only meaningful if failback is enabled.

failback=disabled/enabled affects whether a preferred node "takes back" its device group when it joins the

cluster. The default value is disabled. When failback is disabled, preferenced is set to false; when it is enabled,

preferenced must also be set to true.

Moving device groups across nodes of a cluster

 When VxVM device groups are registered as Sun Cluster resources, NEVER USE vxdg import/deport commands to

change ownership (node-wise) of the device group. Doing so will cause Sun Cluster to treat the device group as a failed

resource.

Use the following command instead:

# scswitch -z -D <devgrpname> -h <node>

 Resyncing device groups

 scconf -c -D name=<devgrpname>,sync

 Changing device group configuration

 scconf -c -D name=<devgrpname>,preferenced=<true|false>,failback=<enabled|disabled>

 Maintenance mode

 scswitch -m -D <devgrpname>

All volumes in the device group must be unopened or unmounted (not in use) in order to do this.

To come back out of maintenance mode

 scswitch -z -D <devgrpname> -h <node>

 Repairing DID device database after replacing JBOD disks

 Make sure you know which disk to update …

scdidadm -l c1t1d0

returns node1:/dev/rdsk/c1t1d0 /dev/did/rdsk/d7

scdidadm -l d7

returns node1:/dev/rdsk/c1t1d0 /dev/did/rdsk/d7

Then use following cmds to update and verify the DID info:

scdidadm -R d7

scdidadm -l -o diskid d7

returns a large string with disk id.

Replacing a failed disk in an A5200 Array (similar concept with other FC disk arrays)

 vxdisk list #get the failed disk name

vxprint -g dgname #determine state of the volume(s) that might be affected

On the hosting node, replace the failed disk:

luxadm remove enclosure,position

luxadm insert enclosure,position

On either node of the cluster (that hosts the device group):

scdidadm -l c#t#d#

scdidadm -R d#

On the hosting node:

vxdctl enable

vxdiskadm #replace failed disk in vxvm

vxprint -g dgname

vxtask list #ensure that resyncing is completed

Remove any relocated submirrors/plexes (if hot-relocation had to move something out of the way):

vxunreloc repaired-diskname

 Solaris Vol Mgr (SDS) in Sun Clustered Env

 Preferred method of using Soft partitions is to use single slices to create mirrors and then create volumes (soft

partitions) from that (kind of similar to VxVM public region in an initialized disk).

Shared Disksets and Local Disksets

 Only disks that are physically located in multi-ported storage can be members of shared disksets. Disks that

are in the same diskset operate as a unit; they can be used together to build mirrored volumes, and primary ownership

of the diskset transfers as a whole from node to node.

Boot disks belong to the local disksets. This is a prerequisite for having shared disksets.

Replica management

 Add local replicas manually.

Put local state db replicas on slice 7 of the disks (as a convention) in order to maintain uniformity. Shared disksets

must have their replicas on slice 7.

Spread local replicas evenly across disks and controllers.

Support for Shared disksets is provided by Pkg SUNWmdm

Modifying /kernel/drv/md.conf

 nmd == max number of volumes (default 128)

md_nsets == max is 32, default 4.
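An edited /kernel/drv/md.conf might look like the following (the values here are examples, not recommendations; a reconfiguration reboot is needed for the changes to take effect):

```
# /kernel/drv/md.conf excerpt (example values)
nmd=256;
md_nsets=8;
```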

 Creating shared disksets and mediators

 scdidadm -l c1t3d0

Returns d17 as DID device

scdidadm -l d17

metaset -s <setname> -a -h <node1> <node2> # creates the metaset

metaset -s <setname> -a -m <node1> <node2> # adds mediator hosts

metaset -s <setname> -a /dev/did/rdsk/d9 /dev/did/rdsk/d17 # adds drives

metaset # returns values

metadb -s <setname>

medstat -s <setname> # reports mediator status

Remaining syntax vis-a-vis Sun Cluster is identical to that for VxVM.

IPMP and Sun Cluster

 IPMP is cluster-unaware. To work around that, Sun Cluster uses the cluster-specific public network management daemon (pnmd)

to integrate IPMP into the cluster. The pnmd daemon has two capabilities:

populate CCR with public network adapter status

facilitate application failover

When pnmd detects all members of a local IPMP group have failed, it consults a file called

/var/cluster/run/pnm_callbacks. This file contains entries that would have been created by the activation of

LogicalHostname and SharedAddress resources. It is the job of hafoip_ipmp_callback to decide whether

to migrate resources to another node.

scstat -i #view IPMP configuration

Failover file systems: to fail over, a local file system must reside on global device

groups with affinity switchovers enabled.

Data Service Agent — is a specially written software that allows a data service in a cluster to operate properly.

Data Service Agent (or Agent) does the following to a standard application:

stops/starts an application

monitors faults

validates configuration

provides a registration information file that allows Sun Cluster to store all the info about the methods

Sun Cluster 2.x runs fault monitoring components on the failover node and can initiate a takeover. In Sun Cluster 3.x

this is not allowed; the monitor can only restart or fail over the service from the primary (active) node.

Failover resource groups:

Logical host resource — SUNW.LogicalHostname

Data storage resource — SUNW.HAStoragePlus

NFS resource — SUNW.nfs

Shutdown a resource group

 scswitch -F -g <resgrpname>

Turn on a resource group

 scswitch -Z -g <resgrpname>

 Switch a failover group over to another node

 scswitch -z -g <resgrpname> -h <node>

 Restart a resource group

 scswitch -R -h <node> -g <resgrpname>

 Evacuate all resources and rgs from a node

 scswitch -S -h <node>

 Disable a res and its fault monitor

 scswitch -n -j <resource>

 Enable a resource and its fault monitor

 scswitch -e -j <resource>

 Clear the STOP_FAILED flag

 scswitch -c -j <resource> -h <node> -f STOP_FAILED

 How to add a diskgroup and volume to Cluster configuration

  1. Create the disk group and volume.

  2. Register the local disk group with the cluster.

root@aesnsra1:../ # scconf -a -D type=vxvm,name=patroldg2,nodelist=aesnsra2

root@aesnsra2:../ # scswitch -z -h aesnsra2 -D patroldg2

  3. Create your file system.
  4. Update /etc/vfstab to change the '-' boot options.

Example:

/dev/vx/dsk/patroldg2/patroldg02 /dev/vx/rdsk/patroldg2/patroldg02 \

/patrol02 vxfs 3 no suid

  5. Set up a resource group with an HAStoragePlus resource for the local filesystem:

root@aesnsra2:../ # scrgadm -a -g aescib1-hastp-rg -h aescib1

root@aesnsra2:../ # scrgadm -a -g aescib1-hastp-rg -j sapmntdg01-rs \

-t SUNW.HAStoragePlus -x FilesystemMountPoints=/sapmnt

  6. Bring the resource group online, which will mount the specified filesystem:

root@aesnsra2:../ # scswitch -Z -g hastp-aesnsra2-rg

  7. Enable resource.

root@aesnsra2:../# scswitch -e -j osdumps-dev-rs

  8. (Optional) Reboot and test.

 

Fault monitor operations

Disable the fault monitor for a resource

 scswitch -n -M -j <resource>

 Enable the fault monitor for a resource

 scswitch -e -M -j <resource>

scstat -g #shows status of all resource groups

Using scrgadm to register and configure Data service software

 scrgadm -a -t SUNW.nfs

scrgadm -a -t SUNW.HAStoragePlus

scrgadm -p

 Create a failover resource group

 scrgadm -a -g nfs-rg -h node1,node2 -y Pathprefix=/global/nfs/admin

 Add logical host name resource to resource group

 scrgadm -a -L -g nfs-rg -l clustername-nfs

 Create a HAStoragePlus resource

 scrgadm -a -j nfs-stor -g nfs-rg -t SUNW.HAStoragePlus \

-x FilesystemMountPoints=/global/nfs -x AffinityOn=True

 Create SUNW.nfs resource

 scrgadm -a -j nfs-res -g nfs-rg -t SUNW.nfs -y Resource_dependencies=nfs-stor

 Print the various resource/resource group dependencies via scrgadm

 scrgadm -pvv|grep -i depend #And then parse this output

Enable resource and resource monitors, manage resource group and switch resource group to online state

 scswitch -Z -g nfs-rg

scstat -g

 Show current resource group configuration

scrgadm -p[v[v]] [ -t resource_type_name ] [ -g resgrpname ] [ -j resname ]

 Resizing a VxVM/VxFS vol/fs under Sun Cluster

 # vxassist -g aesnfsp growby saptrans 5g

# scconf -c -D name=aesnfsp,sync

root@aesrva1:../ # vxprint -g aesnfsp -v saptrans

TY NAME ASSOC KSTATE LENGTH PLOFFS STATE TUTIL0 PUTIL0

v saptrans fsgen ENABLED 188743680 - ACTIVE - -

root@aesrva1:../ # fsadm -F vxfs -b 188743680 /saptrans

UX:vxfs fsadm: INFO: /dev/vx/rdsk/aesnfsp/saptrans is currently 178257920 sectors - size will be increased

root@aesrva1:../ # scconf -c -D name=aesnfsp,sync
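As a cross-check of the fsadm output above, the 512-byte sector counts convert to gibibytes with plain shell arithmetic (the sector values are the ones reported in the example):

```shell
# Convert 512-byte sector counts to GiB:
# sectors / 2 = KiB; / 1024 = MiB; / 1024 = GiB.
old_sectors=178257920   # size before the grow
new_sectors=188743680   # size after vxassist growby 5g
old_gib=$((old_sectors / 2 / 1024 / 1024))
new_gib=$((new_sectors / 2 / 1024 / 1024))
echo "old=${old_gib}GiB new=${new_gib}GiB"
```

This confirms the volume grew from 85 GiB to 90 GiB, matching the 5g passed to vxassist.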

 Command Quick Reference

 scstat

scconf

scrgadm

scha_

scdidadm

Sun Terminal Concentrator (Annex NTS)

 Enable setup mode by pressing the TC test button until the TC power indicator starts to blink rapidly, then release the button

and press it briefly. On entering setup mode, a monitor:: prompt is displayed.

Set up IP address using:

monitor::addr

Setting up Load source:

monitor::seq

Specifying image:

monitor::image

Telnet into the TC IP address:

Enter cli

Elevate to privileged acct using su

Run admin at the TC OS prompt:

At the admin: subprompt:

show port=1 type mode

set port=<n> type <type> mode <mode> # choose various options

quit (to exit the boot prompt)

boot