QDisk(5)		      Cluster Quorum Disk		      QDisk(5)

NAME
       qdisk - a disk-based quorum daemon for CMAN / Linux-Cluster

1. Overview
1.1. Problem
       In  some	 situations,  it  may  be  necessary or desirable to sustain a
       majority node failure of a cluster without  introducing	the  need  for
       asymmetric  cluster  configurations  (e.g.  client-server,  or heavily-
       weighted voting nodes).

1.2. Design Requirements
       * Ability to sustain the simultaneous failure of up to (n-1) of n
       nodes, without the danger of a simple network partition causing a
       split brain.  That is, we need to be able to ensure that the majority
       failure case is not merely the result of a network partition.

       * Ability to use external reasons for deciding which partition is the
       quorate partition in a partitioned cluster.  For example, a user may
       have a service running on one node, and that node must always be the
       master in the event of a network partition.  Or, a node might lose
       all network connectivity except the cluster communication path - in
       which case, a user may wish that node to be evicted from the cluster.

       * Integration with CMAN.  CMAN must be able to run with us or without
       us.  Linux-Cluster does not normally require a quorum disk -
       introducing new requirements on how Linux-Cluster operates is not
       allowed.

       * Data integrity.  In order to recover from a majority failure, fencing
       is required.  The fencing subsystem is already provided by  Linux-Clus‐
       ter.

       * Non-reliance on hardware- or protocol-specific methods (e.g. SCSI
       reservations).  This ensures the quorum disk algorithm can be used on
       the widest range of hardware configurations possible.

       * Little or no memory allocation after initialization.  In critical
       paths during failover, we do not want to have to worry about being
       killed during a memory-pressure situation because we trigger a page
       fault and the Linux OOM killer responds.

1.3. Hardware Considerations and Requirements
1.3.1. Concurrent, Synchronous, Read/Write Access
       This quorum daemon requires a shared block device with concurrent
       read/write access from all nodes in the cluster.  The shared block
       device can be a multi-port SCSI RAID array, a Fibre Channel RAID SAN,
       a RAIDed iSCSI target, or even GNBD.  The quorum daemon uses O_DIRECT
       to write to the device.

1.3.2. Bargain-basement JBODs need not apply
       There is a minimum performance requirement inherent  when  using	 disk-
       based  cluster  quorum  algorithms, so design your cluster accordingly.
       Using a cheap JBOD with old SCSI2 disks on a multi-initiator  bus  will
       cause problems at the first load spike.	Plan your loads accordingly; a
       node's inability to write to the quorum disk in a  timely  manner  will
       cause  the cluster to evict the node.  Using host-RAID or multi-initia‐
       tor parallel SCSI configurations with the qdisk daemon is  unlikely  to
       work,  and  will	 probably  cause  administrators a lot of frustration.
       That having been said, because  the  timeouts  are  configurable,  most
       hardware should work if the timeouts are set high enough.

1.3.3. Fencing is Required
       In order to maintain data integrity under all failure scenarios, use of
       this quorum daemon requires adequate  fencing,  preferably  power-based
       fencing.	  Watchdog  timers  and software-based solutions to reboot the
       node internally, while possibly sufficient, are not  considered	'fenc‐
       ing' for the purposes of using the quorum disk.

1.4. Limitations
       *  At  this  time, this daemon supports a maximum of 16 nodes.  This is
       primarily a scalability issue:  As  we  increase	 the  node  count,  we
       increase	 the amount of synchronous I/O contention on the shared quorum
       disk.

       * Cluster node IDs must be statically configured	 in  cluster.conf  and
       must be numbered from 1..16 (there can be gaps, of course).

       * Cluster node votes must all be 1.

       *  CMAN	must  be  running before the qdisk program can operate in full
       capacity.  If CMAN is not running, qdisk will wait for it.

       * CMAN's eviction timeout should be at least 2x the quorum daemon's
       to give the quorum daemon adequate time to converge on a master during
       a failure + load spike situation.  See section 3.1.1 for specific
       details.
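
       For example (values assumed for illustration): with interval="1" and
       tko="10", qdiskd's timeout is 1 * 10 = 10 seconds, so CMAN's eviction
       timeout should be at least 20 seconds.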

       *  For  'all-but-one'  failure  operation,  the	total  number of votes
       assigned to the quorum device should be equal to or  greater  than  the
       total  number  of  node-votes  in the cluster.  While it is possible to
       assign only one (or a few) votes to the quorum device, the  effects  of
       doing so have not been explored.

       * For 'tiebreaker' operation in a two-node cluster, unset CMAN's
       two_node flag (or set it to 0), set CMAN's expected votes to '3', set
       each node's vote to '1', and leave qdisk's vote count unset.  This
       allows the cluster to operate if both nodes are online, or if a
       single node and its heuristics are available (see example 3.3.2).

       *  Currently,  the  quorum disk daemon is difficult to use with CLVM if
       the quorum disk resides on a CLVM logical volume.  CLVM requires a quo‐
       rate  cluster  to correctly operate, which introduces a chicken-and-egg
       problem for starting the cluster: CLVM needs  quorum,  but  the	quorum
       daemon  needs  CLVM (if and only if the quorum device lies on CLVM-man‐
       aged storage).  One way to work around this is to *not* set the clus‐
       ter's expected votes to include the quorum daemon's votes.  Bring all
       nodes online, and start the quorum daemon *after* the whole cluster is
       running; this allows the expected votes to increase naturally, as in
       the sketch below.
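
       As a minimal sketch of this workaround (node names, vote counts, and
       the label are illustrative): expected_votes counts only the three
       node votes, and qdiskd is started by hand once the whole cluster is
       up, after which CMAN raises the expected votes on its own.

	<cman expected_votes="3" .../>
	<clusternodes>
	    <clusternode name="node1" votes="1" ... />
	    <clusternode name="node2" votes="1" ... />
	    <clusternode name="node3" votes="1" ... />
	</clusternodes>
	<quorumd interval="1" tko="10" votes="2" label="clvm-qdisk" .../>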

2. Algorithms
2.1. Heartbeating & Liveliness Determination
       Nodes update individual status blocks on the quorum disk at a user-
       defined rate.  Each write of a status block alters the timestamp,
       which is what other nodes use to decide whether a node has hung or
       not.  After a user-defined number of 'misses' (that is, failures to
       update a timestamp), a node is declared offline.  After a certain
       number of 'hits' (changed timestamp + "I am alive" state), the node
       is declared online.
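
       As an illustration (values assumed), the following configuration
       declares a node offline after 10 missed one-second cycles, and online
       again after floor(10/3) = 3 consecutive hits (the tko_up default; see
       section 3.1):

	<quorumd interval="1" tko="10" .../>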

       The status block contains additional information, such as a bitmask
       of the nodes that the node believes are online.  Some of this informa‐
       tion is used by the master, while some is just for performance record‐
       ing and may be used at a later time.  The most important pieces of
       information a node writes to its status block are:

	    - Timestamp
	    - Internal state (available / not available)
	    - Score
	    -  Known  max  score  (may be used in the future to detect invalid
	    configurations)
	    - Vote/bid messages
	    - Other nodes it thinks are online

2.2. Scoring & Heuristics
       The administrator can configure up to 10 purely	arbitrary  heuristics,
       and  must  exercise  caution  in doing so.  At least one administrator-
       defined heuristic is required for operation, but it is generally a good
       idea  to	 have more than one heuristic.	By default, only nodes scoring
       over 1/2 of the total maximum score will claim they are	available  via
       the quorum disk, and a node (master or otherwise) whose score drops too
       low will remove itself (usually, by rebooting).

       The heuristics themselves can be any command  executable	 by  'sh  -c'.
       For example, in early testing the following was used:

	    <heuristic program="[ -f /quorum ]" score="10" interval="2"/>

       This is a literal sh-ism which tests for the existence of a file
       called "/quorum".  Without that file, the node would claim it was
       unavailable.  This is an awful example, and should never, ever be used
       in production, but is provided as an illustration of what one could
       do.

       Typically, the heuristics should be snippets of shell code or  commands
       which  help  determine  a  node's usefulness to the cluster or clients.
       Ideally, you want to add traces for all of  your	 network  paths	 (e.g.
       check  links,  or  ping routers), and methods to detect availability of
       shared storage.
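
       As a hedged sketch (the router addresses and device path below are
       placeholders, not recommendations), a node might ping its two upstream
       routers and check that the shared storage device node still exists:

	    <heuristic program="ping -c1 -w1 192.168.0.1" score="1" interval="2" tko="4"/>
	    <heuristic program="ping -c1 -w1 192.168.0.2" score="1" interval="2" tko="4"/>
	    <heuristic program="test -b /dev/mapper/shared-disk" score="2" interval="5" tko="3"/>

       With these scores, n = 4, so the default min_score (see section 3.1)
       is floor((4+1)/2) = 2.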

2.3. Master Election
       Only one master is present at any one time in the cluster, regardless
       of how many partitions exist within the cluster itself.  The master is
       elected by a simple voting scheme in which the node with the lowest ID
       that believes it is capable of running (i.e. scores high enough) bids
       for master status.  If the other nodes agree, it becomes the master.
       This algorithm is run whenever no master is present.

       If another node comes online with a lower node ID while a node is still
       bidding for master status, it will rescind its bid  and	vote  for  the
       lower  node  ID.	  If  a master dies or a bidding node dies, the voting
       algorithm is started over.  The voting algorithm	 typically  takes  two
       passes to complete.

       Master  deaths  take  marginally longer to recover from than non-master
       deaths, because a new master must be elected before the old master  can
       be evicted & fenced.

2.4. Master Duties
       The master node decides who is or is not in the master partition, and
       handles eviction of dead nodes (both via the quorum disk and via the
       linux-cluster fencing system, using the cman_kill_node() API).

2.5. How it All Ties Together
       When a master is present, and if the  master  believes  a  node	to  be
       online, that node will advertise to CMAN that the quorum disk is avail‐
       able.  The master will only grant a node membership if:

	    (a) CMAN believes the node to be online,
	    (b) the node has made enough consecutive, timely writes to the
	    quorum disk, and
	    (c) the node has a high enough score to consider itself online.
3. Configuration
3.1. The <quorumd> tag
       This tag is a child of the top-level <cluster> tag.

	<quorumd
	 interval="1"
	    This is the frequency of read/write cycles, in seconds.

	 tko="10"
	    This  is  the  number  of  cycles  a node must miss in order to be
	    declared dead.  The default for this number is  dependent  on  the
	    configured token timeout.

	 tko_up="X"
	    This  is  the  number of cycles a node must be seen in order to be
	    declared online.  Default is floor(tko/3).

	 upgrade_wait="2"
	    This is the number of cycles a node must wait before initiating a
	    bid for master status after heuristic scoring becomes sufficient.
	    The default is 2.  This cannot be set to 0, and should not exceed
	    tko.

	 master_wait="X"
	    This is the number of cycles a node must wait for votes before
	    declaring itself master after making a bid.  Default is
	    floor(tko/2).  This cannot be less than 2, must be greater than
	    tko_up, and should not exceed tko.
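
	    For example, with tko="10", tko_up defaults to floor(10/3) = 3
	    cycles, and master_wait to floor(10/2) = 5 cycles.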

	 votes="3"
	    This is the number of votes the quorum daemon advertises  to  CMAN
	    when  it  has  a  high enough score.  The default is the number of
	    nodes in the cluster minus 1.  For example, in a 4	node  cluster,
	    the	 default is 3.	This value may change during normal operation,
	    for example when adding or removing a node from the cluster.

	 log_level="4"
	    This controls the verbosity of the quorum  daemon  in  the	system
	    logs.  0 = emergencies; 7 = debug.	This option is deprecated.

	 log_facility="daemon"
	    This  controls  the syslog facility used by the quorum daemon when
	    logging.  For a complete list of available	facilities,  see  sys‐
	    log.conf(5).  The default value for this is 'daemon'.  This option
	    is deprecated.

	 status_file="/foo"
	    Write internal states out to this file  periodically  ("-"	=  use
	    stdout).  This is primarily used for debugging.  The default value
	    for this attribute is undefined.  This option can be changed while
	    qdiskd is running.

	 min_score="3"
	    Absolute minimum score for a node to consider itself "alive".  If
	    omitted, or set to 0, the default function "floor((n+1)/2)" is
	    used, where n is the sum of all defined heuristics' score
	    attributes.  This must never exceed the sum of the heuristic
	    scores, or else the quorum disk will never be available.
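
	    For example (scores assumed), with three heuristics scored 1, 1,
	    and 2, n is 4 and the default min_score is floor((4+1)/2) = 2.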

	 reboot="1"
	    If set to 0 (off), qdiskd will *not* reboot after a negative tran‐
	    sition as a result of a change in score (see section 2.2).  The
	    default for this value is 1 (on).  This option can be changed
	    while qdiskd is running.

	 master_wins="0"
	    If set to 1 (on), only the qdiskd master will advertise its	 votes
	    to	CMAN.  In a network partition, only the qdisk master will pro‐
	    vide votes to CMAN.	 Consequently, that  node  will	 automatically
	    "win" in a fence race.

	    This option requires careful tuning of the CMAN timeout, the
	    qdiskd timeout, and CMAN's quorum_dev_poll value.  As a rule of
	    thumb, CMAN's quorum_dev_poll value should be equal to Totem's
	    token timeout, and qdiskd's timeout (interval*tko) should be less
	    than half of Totem's token timeout.  See section 3.1.1 for more
	    information.

	    This option only takes effect if there are no heuristics config‐
	    ured, and it is valid only for a two-node cluster.  This option
	    is automatically disabled if heuristics are defined or the clus‐
	    ter has more than two nodes configured.

	    In a two-node cluster with no heuristics and no defined vote
	    count (see above), this mode is turned on by default.  If enabled
	    in this way at startup and a node is later added to the cluster
	    configuration or the vote count is set to a value other than 1,
	    this mode will be disabled.
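
	    A minimal sketch of a master-wins tiebreaker (label and timings
	    are illustrative): two nodes, no heuristics, and no defined vote
	    count, so the mode is enabled automatically at startup.

		<cman two_node="0" expected_votes="3" .../>
		<quorumd interval="1" tko="10" label="testing"/>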

	 allow_kill="1"
	    If set to 0 (off), qdiskd will *not* instruct CMAN to kill nodes
	    it thinks are dead (as a result of not writing to the quorum
	    disk).  The default for this value is 1 (on).  This option can be
	    changed while qdiskd is running.

	 paranoid="0"
	    If set to 1 (on), qdiskd will watch internal timers and reboot the
	    node  if it takes more than (interval * tko) seconds to complete a
	    quorum disk pass.  The default for this value is  0	 (off).	  This
	    option can be changed while qdiskd is running.

	 io_timeout="0"
	    If set to 1 (on), qdiskd will watch internal timers and reboot the
	    node if qdisk is not able to write to disk after (interval *  tko)
	    seconds.  The default for this value is 0 (off).  If io_timeout
	    is active, max_error_cycles is overridden and set to off.

	 scheduler="rr"
	    Valid values are 'rr', 'fifo', and 'other'.	 Selects the  schedul‐
	    ing	 queue	in  the Linux kernel for operation of the main & score
	    threads (does not affect the heuristics; they are  always  run  in
	    the	 'other'  queue).  Default is 'rr'.  See sched_setscheduler(2)
	    for more details.

	 priority="1"
	    Valid values for 'rr' and 'fifo' are 1..100 inclusive.  Valid val‐
	    ues	 for  'other' are -20..20 inclusive.  Sets the priority of the
	    main & score threads.  The default value is 1 (in the RR and  FIFO
	    queues,  higher  numbers  denote  higher priority; in OTHER, lower
	    values denote higher priority).  This option can be changed	 while
	    qdiskd is running.

	 stop_cman="0"
	    Ordinarily,	 cluster membership is left up to CMAN, not qdisk.  If
	    this parameter is set to 1 (on), qdiskd will tell  CMAN  to	 leave
	    the	 cluster  if it is unable to initialize the quorum disk during
	    startup.  This can be used to prevent cluster participation	 by  a
	    node  which	 has  been disconnected from the SAN.  The default for
	    this value is 0 (off).  This option can be changed while qdiskd is
	    running.

	 use_uptime="1"
	    If	this  parameter	 is set to 1 (on), qdiskd will use values from
	    /proc/uptime for internal timings.	This is	 a  bit	 less  precise
	    than  gettimeofday(2), but the benefit is that changing the system
	    clock will not affect qdiskd's behavior  -	even  if  paranoid  is
	    enabled.   If  set to 0, qdiskd will use gettimeofday(2), which is
	    more precise.  The default for this value is 1 (on / use uptime).

	 device="/dev/sda1"
	    This is the device the quorum daemon will use.  This  device  must
	    be the same on all nodes.

	 label="mylabel"
	    This  overrides  the  device  field if present.  If specified, the
	    quorum daemon will read /proc/partitions and check for qdisk  sig‐
	    natures  on	 every block device found, comparing the label against
	    the specified label.  This is useful in configurations  where  the
	    block device name differs on a per-node basis.

	 cman_label="mylabel"
	    This overrides the label advertised to CMAN if present.  If speci‐
	    fied, the quorum daemon will register with this  name  instead  of
	    the actual device name.

	 max_error_cycles="0"
	    If we receive an I/O error during a cycle, we do not poll CMAN and
	    tell it we are alive.  If specified, this value will cause	qdiskd
	    to	exit  after  the specified number of consecutive cycles during
	    which I/O errors occur.  The default  is  0	 (no  maximum).	  This
	    option  can	 be  changed  while qdiskd is running.	This option is
	    ignored if io_timeout is set to 1.

	/>

3.1.1.	Quorum Disk Timings
       Qdiskd should not be used in environments requiring  failure  detection
       times of less than approximately 10 seconds.

       Qdiskd will attempt to automatically configure timings based on the
       totem timeout and the TKO.  If configuring manually, Totem's token
       timeout must be set to a value at least 1 interval greater than the
       following function:

	 interval * (tko + master_wait + upgrade_wait)

       So, if you have an interval of 2, a tko of  7,  master_wait  of	2  and
       upgrade_wait  of	 2,  the  token	 timeout should be at least 24 seconds
       (24000 msec).
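
       Expressed as cluster.conf fragments (Totem's token timeout is in
       milliseconds), the worked example above corresponds to:

	 <totem token="24000"/>
	 <quorumd interval="2" tko="7" master_wait="2" upgrade_wait="2" .../>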

       It is recommended to have at least 3 intervals to reduce the risk of
       quorum loss during heavy I/O load.  As a rule of thumb, using a totem
       timeout more than 2x qdiskd's timeout will result in good behavior.

       An improper timing configuration will cause CMAN to give up on  qdiskd,
       causing a temporary loss of quorum during master transition.

3.2. The <heuristic> tag
       This  tag  is  a	 child	of  the	 <quorumd> tag.	 Heuristics may not be
       changed while qdiskd is running.

	<heuristic
	 program="/test.sh"
	    This is the program used to determine if this heuristic is	alive.
	    This  can  be  anything  which  may	 be executed by /bin/sh -c.  A
	    return value of zero indicates success;  anything  else  indicates
	    failure.  This is required.

	 score="1"
	    This is the weight of this heuristic.  Be careful when determining
	    scores for heuristics.  The default score for each heuristic is 1.

	 interval="2"
	    This is the frequency (in seconds) at which we poll the heuristic.
	    The default interval is determined by the qdiskd timeout.

	 tko="1"
	    After  this	 many failed attempts to run the heuristic, it is con‐
	    sidered DOWN, and its score is removed.  The default tko for  each
	    heuristic is determined by the qdiskd timeout.
	/>

3.3. Examples
3.3.1. 3 cluster nodes & 3 routers
	<cman expected_votes="6" .../>
	<clusternodes>
	    <clusternode name="node1" votes="1" ... />
	    <clusternode name="node2" votes="1" ... />
	    <clusternode name="node3" votes="1" ... />
	</clusternodes>
	<quorumd interval="1" tko="10" votes="3" label="testing">
	    <heuristic program="ping A -c1 -w1" score="1" interval="2" tko="3"/>
	    <heuristic program="ping B -c1 -w1" score="1" interval="2" tko="3"/>
	    <heuristic program="ping C -c1 -w1" score="1" interval="2" tko="3"/>
	</quorumd>

3.3.2. 2 cluster nodes & 1 IP tiebreaker
	<cman two_node="0" expected_votes="3" .../>
	<clusternodes>
	    <clusternode name="node1" votes="1" ... />
	    <clusternode name="node2" votes="1" ... />
	</clusternodes>
	<quorumd interval="1" tko="10" votes="1" label="testing">
	    <heuristic program="ping A -c1 -w1" score="1" interval="2" tko="3"/>
	</quorumd>

3.4. Heuristic score considerations
       *  Heuristic  timeouts  should be set high enough to allow the previous
       run of a given heuristic to complete.

       * Heuristic scripts returning anything except 0 as  their  return  code
       are considered failed.

       *  The worst-case for improperly configured quorum heuristics is a race
       to fence where two partitions simultaneously try to kill each other.

3.5. Creating a quorum disk partition
       The mkqdisk utility can create and  list	 currently  configured	quorum
       disks visible to the local node; see mkqdisk(8) for more details.
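
       For example (the device name is assumed; initializing a device
       destroys any data on it):

	    # mkqdisk -c /dev/sdb1 -l mylabel
	    # mkqdisk -L

       The first command initializes /dev/sdb1 with the label "mylabel"; the
       second lists the quorum disks visible to the local node.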

SEE ALSO
       mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5), gettimeofday(2)

				  12 Oct 2011			      QDisk(5)