OCFS2(7)		      OCFS2 Manual Pages		      OCFS2(7)

NAME
       OCFS2 - A Shared-Disk Cluster File System for Linux

INTRODUCTION
       OCFS2 is a file system. It allows users to store and retrieve data. The
       data is stored in files that are organized in a hierarchical  directory
       tree.  It  is  a POSIX compliant file system that supports the standard
       interfaces and the behavioral semantics as spelled out by that specifi‐
       cation.

       It  is also a shared disk cluster file system, one that allows multiple
       nodes to access the same disk at the same time. This is where  the  fun
       begins  as  allowing  a	file system to be accessible on multiple nodes
       opens a can of worms. What if the nodes are of different architectures?
       What if a node dies while writing to the file system? What data consis‐
       tency can one expect if processes on two nodes are reading and  writing
       concurrently?  What  if one node removes a file while it is still being
       used on another node?

       Unlike most shared file systems where the answer is fuzzy,  the	answer
       in  OCFS2  is very well defined. It behaves on all nodes exactly like a
       local file system. If a file is removed, the directory entry is removed
       but  the inode is kept as long as it is in use across the cluster. When
       the last user closes the descriptor, the inode is marked for deletion.

       The data consistency model follows the same principle. It works as if
       two processes running on two different nodes were running on the same
       node. A read on a node gets the last write irrespective of
       the  IO	mode  used.  The  modes can be buffered, direct, asynchronous,
       splice or memory mapped IOs. It is fully cache coherent.

       Take for example the REFLINK feature that allows a user to create  mul‐
       tiple write-able snapshots of a file. This feature, like all others, is
       fully cluster-aware. A file being written to on multiple nodes  can  be
       safely  reflinked  on  another. The snapshot created is a point-in-time
       image of the file  that	includes  both	the  file  data	 and  all  its
       attributes (including extended attributes).

       It  is  a  journaling  file  system. When a node dies, a surviving node
       transparently replays the journal of the dead node. This	 ensures  that
       the  file  system  metadata  is	always consistent. It also defaults to
       ordered data journaling to ensure the file  data	 is  flushed  to  disk
       before  the  journal  commit,  to remove the small possibility of stale
       data appearing in files after a crash.

       It is architecture and endian neutral. It allows concurrent  mounts  on
       nodes  with  different  processors like x86, x86_64, IA64 and PPC64. It
       handles little and big endian, 32-bit and 64-bit architectures.

       It is feature rich. It supports indexed	directories,  metadata	check‐
       sums,  extended attributes, POSIX ACLs, quotas, REFLINKs, sparse files,
       unwritten extents and inline-data.

       It is fully integrated with the mainline Linux kernel. The file	system
       was merged into Linux kernel 2.6.16 in early 2006.

       It  is quickly installed. It is available with almost all Linux distri‐
       butions.	 The file system is on-disk compatible across all of them.

       It is modular. The file system can be configured to operate with	 other
       cluster stacks like Pacemaker and CMAN along with its own stack, O2CB.

       It  is easily configured. The O2CB cluster stack configuration involves
       editing two files, one for cluster layout and  the  other  for  cluster
       timeouts.

       It  is  very efficient. The file system consumes very little resources.
       It is used to store virtual machine images in limited  memory  environ‐
       ments like Xen and KVM.

       In  summary, OCFS2 is an efficient, easily configured, modular, quickly
       installed, fully integrated and compatible, feature-rich,  architecture
       and endian neutral, cache coherent, ordered data journaling, POSIX-com‐
       pliant, shared disk cluster file system.

OVERVIEW
       OCFS2 is a general-purpose shared-disk cluster file  system  for	 Linux
       capable of providing both high performance and high availability.

       As  it provides local file system semantics, it can be used with almost
       all applications.  Cluster-aware applications can make  use  of	cache-
       coherent	 parallel  I/Os	 from multiple nodes to scale out applications
       easily. Other applications can make use of the clustering facilities to
       fail-over running application in the event of a node failure.

       The notable features of the file system are:

       Tunable Block size
	      The  file	 system	 supports  block  sizes	 of 512, 1K, 2K and 4K
	      bytes. 4KB is almost always recommended. This feature is	avail‐
	      able in all releases of the file system.

       Tunable Cluster size
	      A	 cluster  size	is also referred to as an allocation unit. The
	      file system supports cluster sizes of 4K,	 8K,  16K,  32K,  64K,
	      128K, 256K, 512K and 1M bytes. For most use cases, 4KB is recom‐
	      mended. However, a larger value is recommended for volumes host‐
	      ing mostly very large files like database files, virtual machine
	      images, etc. A large cluster size	 allows	 the  file  system  to
	      store large files more efficiently. This feature is available in
	      all releases of the file system.
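
              For example, a volume meant to hold mostly large virtual machine
              images could be formatted with the default 4KB block size and a
              1MB cluster size (the label and device name are illustrative):

              # mkfs.ocfs2 -b 4K -C 1M -L "vmstore" /dev/sdX1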

       Endian and Architecture neutral
              The file system can be mounted concurrently on nodes having dif‐
              ferent architectures: 32-bit and 64-bit, little-endian (x86,
              x86_64, ia64) and big-endian (ppc64, s390x). This feature is
              available in all releases of the file system.

       Buffered, Direct, Asynchronous, Splice and Memory Mapped I/O modes
	      The  file system supports all modes of I/O for maximum flexibil‐
	      ity and  performance.   It  also	supports  cluster-wide	shared
              writeable mmap(2). The support for buffered, direct and asyn‐
              chronous I/O is available in all releases. The support for
              splice I/O was added in Linux kernel 2.6.20 and for shared
              writeable mmap(2) in 2.6.23.

       Multiple Cluster Stacks
	      The file system includes a flexible framework  to	 allow	it  to
	      function with userspace cluster stacks like Pacemaker (pcmk) and
	      CMAN (cman), its own in-kernel cluster stack o2cb and no cluster
	      stack.

	      The support for o2cb cluster stack is available in all releases.

	      The  support  for no cluster stack, or local mount, was added in
	      Linux kernel 2.6.20.

	      The support for userspace cluster stack was added in Linux  ker‐
	      nel 2.6.26.

       Journaling
	      The  file	 system	 supports both ordered (default) and writeback
	      data journaling modes to provide file system consistency in  the
	      event  of	 power failure or system crash.	 It uses JBD2 in Linux
	      kernel 2.6.28 and later. It used JBD in earlier kernels.

       Extent-based Allocations
	      The file system allocates and tracks space in  ranges  of	 clus‐
	      ters. This is unlike block based file systems that have to track
	      each and every block. This feature allows the file system to  be
	      very  efficient  when  dealing with both large volumes and large
	      files.  This feature is available in all releases	 of  the  file
	      system.

       Sparse files
	      Sparse  files  are files with holes. With this feature, the file
	      system delays allocating space until a  write  is	 issued	 to  a
	      cluster.	This  feature  was  added  in  Linux kernel 2.6.22 and
	      requires enabling on-disk feature sparse.
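
              For example, writing a single 4KB block at a large offset creates
              a file with an apparent size of about 4GB while only one cluster
              is allocated; ls -ls reports both sizes (the file name is
              illustrative):

              $ dd if=/dev/zero of=holeyfile bs=4K count=1 seek=1048576
              $ ls -lsh holeyfile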

       Unwritten Extents
	      An unwritten extent is also referred to as user  pre-allocation.
	      It  allows  an  application to request a range of clusters to be
	      allocated, but not initialized, within a	file.	Pre-allocation
	      allows  the  file system to optimize the data layout with fewer,
	      larger extents. It also provides a performance  boost,  delaying
	      initialization  until the user writes to the clusters. This fea‐
	      ture was added in Linux kernel 2.6.23 and requires enabling  on-
	      disk feature unwritten.
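
              Applications typically request pre-allocation through the
              fallocate(2) system call. From the shell, the util-linux
              fallocate(1) utility, where available, can be used to the same
              effect (the size and file name are illustrative):

              $ fallocate -l 1G prealloc.img
              $ ls -lsh prealloc.img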

       Hole Punching
	      Hole  punching  allows  an application to remove arbitrary allo‐
	      cated regions within a file. Creating holes,  essentially.  This
	      is  more	efficient than zeroing the same extents.  This feature
	      is especially useful in virtualized environments as it allows  a
	      block  discard  in a guest file system to be converted to a hole
	      punch in the host file system thus allowing users to reduce disk
	      space  usage.  This feature was added in Linux kernel 2.6.23 and
	      requires enabling on-disk features sparse and unwritten.
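
              From the shell, a hole can be punched with the util-linux
              fallocate(1) utility, provided it supports the punch-hole option
              (the offset, length and file name are illustrative):

              $ fallocate --punch-hole --offset 4M --length 16M guest.img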

       Inline-data
	      Inline data is also referred to as data-in-inode	as  it	allows
	      storing small files and directories in the inode block. This not
	      only saves space but also has a positive	impact	on  cold-cache
	      directory	 and  file operations. The data is transparently moved
	      out to an extent when it no longer fits inside the inode	block.
	      This  feature  was  added	 in  Linux  kernel 2.6.24 and requires
	      enabling on-disk feature inline-data.

       REFLINK
	      REFLINK is also referred to as fast copy.	 It  allows  users  to
	      atomically  (and	instantly) copy regular files. In other words,
	      create multiple writeable snapshots of  regular  files.	It  is
	      called  REFLINK  because	it  looks and feels more like a (hard)
	      link(2) than a traditional snapshot. Like a link, it is a	 regu‐
	      lar  user	 operation,  subject to the security attributes of the
	      inode being reflinked and not to the super user privileges typi‐
	      cally  required  to  create a snapshot. Like a link, it operates
	      within a file system. But unlike a link, it links the inodes  at
	      the  data	 extent	 level	allowing  each reflinked inode to grow
	      independently as and when written to. Up to four billion	inodes
	      can share a data extent.	This feature was added in Linux kernel
	      2.6.32 and requires enabling on-disk feature refcount.
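
              For example, using the reflink(1) utility described in the
              REFLINK OPERATION section below, a point-in-time writeable
              snapshot can be created with a single command (file names are
              illustrative):

              $ reflink mydata.img mydata.img.snap1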

       Allocation Reservation
	      File contiguity plays an important role in file  system  perfor‐
	      mance. When a file is fragmented on disk, reading and writing to
	      the file involves many seeks, leading to lower throughput.  Con‐
	      tiguous  files,  on the other hand, minimize seeks, allowing the
	      disks to perform IO at the maximum rate.

	      With allocation reservation, the file system reserves  a	window
	      in  the  bitmap for all extending files allowing each to grow as
	      contiguously as possible. As this extra space  is	 not  actually
	      allocated,  it  is  available for use by other files if the need
	      arises.  This feature was added in Linux kernel 2.6.35  and  can
	      be tuned using the mount option resv_level.
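
              A minimal sketch of raising the reservation level at mount time;
              the level value, device and mount point are illustrative, and the
              accepted range is documented in mount.ocfs2(8):

              # mount -o resv_level=4 /dev/sda1 /ocfs2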

       Indexed Directories
	      An  indexed directory allows users to perform quick lookups of a
	      file in very large directories. It also results in  faster  cre‐
	      ates  and	 unlinks and thus provides better overall performance.
	      This feature was added  in  Linux	 kernel	 2.6.30	 and  requires
	      enabling on-disk feature indexed-dirs.

       File Attributes
	      This  refers  to	EXT2-style file attributes, such as immutable,
	      modified using chattr(1) and queried using lsattr(1). This  fea‐
	      ture was added in Linux kernel 2.6.19.
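
              For example, marking a file immutable and verifying the flag (the
              file name is illustrative):

              # chattr +i /ocfs2/golden.img
              # lsattr /ocfs2/golden.img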

       Extended Attributes
              An extended attribute refers to a name:value pair that can be
	      associated with file system objects like regular files, directo‐
	      ries, symbolic links, etc. OCFS2 allows associating an unlimited
	      number of attributes per object. The attribute names can	be  up
	      to  255  bytes in length, terminated by the first NUL character.
	      While it is not required, printable  names  (ASCII)  are	recom‐
	      mended.  The  attribute  values  can be up to 64 KB of arbitrary
	      binary data. These attributes can be modified and	 listed	 using
	      standard	Linux utilities setfattr(1) and getfattr(1). This fea‐
	      ture was added in Linux kernel 2.6.29 and requires enabling  on-
	      disk feature xattr.
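
              For example, attaching a user attribute to a file and dumping it
              back (the attribute name and value are illustrative):

              $ setfattr -n user.origin -v "imported-2010-09" myfile
              $ getfattr -d myfile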

       Metadata Checksums
	      This feature allows the file system to detect silent corruptions
	      in all metadata blocks like inodes and directories. This feature
	      was  added  in Linux kernel 2.6.29 and requires enabling on-disk
	      feature metaecc.

       POSIX ACLs and Security Attributes
              POSIX ACLs allow assigning fine-grained discretionary access
              rights for files and directories. This security scheme is a lot
              more flexible than the traditional file access permissions that
              impose a strict user-group-other model.

	      Security attributes allow the file system to support other secu‐
	      rity regimes like SELinux, SMACK, AppArmor, etc.

	      Both these security extensions were added in Linux kernel 2.6.29
              and require enabling on-disk feature xattr.

       User and Group Quotas
              This feature allows setting up usage quotas on a per-user and
              per-group basis using the standard utilities like quota(1),
	      setquota(8),  quotacheck(8),  and	 quotaon(8).  This feature was
	      added in Linux kernel 2.6.29 and requires enabling on-disk  fea‐
	      tures usrquota and grpquota.
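
              For example, once the quota features are enabled and the volume
              is mounted, block limits for a user can be set with setquota(8)
              and verified with quota(1) (the user name, limits and mount point
              are illustrative):

              # setquota -u jeff 5120000 6144000 0 0 /ocfs2
              $ quota -u jeff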

       Unix File Locking
	      The  Unix	 operating system has historically provided two system
	      calls to lock files.  flock(2) or BSD locking  and  fcntl(2)  or
	      POSIX  locking.  OCFS2  extends  both file locks to the cluster.
	      File locks taken on one node interact with those taken on	 other
	      nodes.

	      The  support  for	 clustered  flock(2) was added in Linux kernel
              2.6.26. All flock(2) options are supported, including the ker‐
              nel's ability to cancel a lock request when an appropriate kill
              signal is received by the user. This feature is supported with
              all cluster stacks including o2cb.
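
              For example, the util-linux flock(1) utility can be used to
              serialize a job across the cluster; as OCFS2 makes flock(2)
              cluster-aware, only one node at a time will run the command (the
              lock file and command are illustrative):

              $ flock -x /ocfs2/locks/nightly.lock -c "run-nightly-batch"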

	      The  support  for	 clustered  fcntl(2) was added in Linux kernel
	      2.6.28.  But because it requires group communication to make the
	      locks  coherent,	it  is	only  supported with userspace cluster
	      stacks, pcmk and cman and not with  the  default	cluster	 stack
	      o2cb.

       Comprehensive Tools Support
	      The  file	 system	 has  a	 comprehensive EXT3-style toolset that
	      tries to use similar parameters  for  ease-of-use.  It  includes
	      mkfs.ocfs2(8)  (format),	tunefs.ocfs2(8)	 (tune), fsck.ocfs2(8)
	      (check), debugfs.ocfs2(8) (debug), etc.

       Online Resize
	      The file system can be dynamically grown using  tunefs.ocfs2(8).
	      This feature was added in Linux kernel 2.6.25.
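
              A sketch of growing a volume after the underlying device has been
              enlarged, assuming the -S (volume size) option as described in
              tunefs.ocfs2(8):

              # tunefs.ocfs2 -S /dev/sda1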

RECENT CHANGES
       The  O2CB cluster stack has a global heartbeat mode. It allows users to
       specify heartbeat regions that are consistent  across  all  nodes.  The
       cluster stack also allows online addition and removal of both nodes and
       heartbeat regions.

       o2cb(8) is the new cluster configuration utility. It is an easy to  use
       utility that allows users to create the cluster configuration on a node
       that is not  part  of  the  cluster.  It	 replaces  the	older  utility
       o2cb_ctl(8) which has been deprecated.

       ocfs2console(8) has been obsoleted.

       o2info(8)  is  a	 new  utility  that can be used to provide file system
       information.  It allows non-privileged users to see the enabled file
       system  features,  block	 and  cluster  sizes, extended file stat, free
       space fragmentation, etc.

       o2hbmonitor(8) is an o2hb heartbeat monitor. It is an extremely
       lightweight utility that logs messages to the system logger once the
       heartbeat delay exceeds the warn threshold. This utility is useful in
       identifying volumes encountering I/O delays.

       debugfs.ocfs2(8)	 has some new commands. net_stats shows the o2net mes‐
       sage times between various nodes. This is useful in identifying nodes
       that are slowing down the cluster operations. stat_sysdir allows the
       user to dump the entire system directory that  can  be  used  to	 debug
       issues.	grpextents  dumps the complete free space fragmentation in the
       cluster group allocator.

       mkfs.ocfs2(8) now enables xattr, indexed-dirs, discontig-bg,  refcount,
       extended-slotmap	 and clusterinfo feature flags by default, in addition
       to the older defaults, sparse, unwritten and inline-data.

       mount.ocfs2(8) allows users to specify the  level  of  cache  coherency
       between	nodes.	 By default the file system operates in full coherency
       mode that also serializes the direct I/Os. While this mode  is  techni‐
       cally correct, it limits the I/O throughput in a clustered database. This
       mount option allows the user to limit the cache coherency to  only  the
       buffered I/Os to allow multiple nodes to do concurrent direct writes to
       the same file. This feature works with Linux kernel 2.6.37 and later.
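
       Assuming the coherency mount option as documented in mount.ocfs2(8),
       relaxing the coherency for a direct I/O heavy workload might look like
       the following (the device and mount point are illustrative):

       # mount -o coherency=buffered /dev/sda1 /ocfs2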

COMPATIBILITY
       The OCFS2 development team goes to great lengths to maintain compati‐
       bility.	It attempts to maintain both on-disk and network protocol com‐
       patibility across all releases of the file  system.  It	does  so  even
       while adding new features that entail on-disk format and network proto‐
       col changes. To do this successfully, it follows a few rules:

	   1. The on-disk format changes are managed by a set of feature flags
	   that	 can  be  turned on and off. The file system in kernel detects
	   these features during mount and continues only  if  it  understands
	   all the features. Users encountering this have the option of either
	   disabling that feature or upgrading the  file  system  to  a	 newer
	   release.

	   2.  The  latest  release of ocfs2-tools is compatible with all ver‐
	   sions of the file system. All utilities detect the features enabled
           on disk and continue only if they understand all the features. Users
	   encountering this have to upgrade the tools to a newer release.

	   3. The network protocol version  is	negotiated  by	the  nodes  to
	   ensure all nodes understand the active protocol version.

       FEATURE FLAGS
	      The  feature flags are split into three categories, namely, Com‐
	      pat, Incompat and RO Compat.

	      Compat, or compatible, is a feature that the  file  system  does
	      not need to fully understand to safely read/write to the volume.
	      An example of this is the backup-super feature  that  added  the
	      capability  to  backup  the super block in multiple locations in
              the file system. As the backup super blocks are typically neither
              read nor written to by the file system, an older file system can
	      safely mount a volume with this feature enabled.

	      Incompat, or incompatible, is a feature  that  the  file	system
	      needs to fully understand to read/write to the volume. Most fea‐
	      tures fall under this category.

	      RO Compat, or read-only compatible, is a feature that  the  file
	      system  needs  to fully understand to write to the volume. Older
	      software can safely read a volume with this feature enabled.  An
	      example  of  this	 would be user and group quotas. As quotas are
	      manipulated only when the file system is written to, older soft‐
	      ware can safely mount such volumes in read-only mode.

	      The  list	 of  feature  flags,  the version of the kernel it was
	      added in, the earliest version of the tools that understands it,
	      etc., is as follows:

      ┌─────────────────────┬────────────────┬─────────────────┬───────────┬───────────┐
      │Feature Flags	    │ Kernel Version │ Tools Version   │ Category  │ Hex Value │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │backup-super	    │	   All	     │ ocfs2-tools 1.2 │  Compat   │	 1     │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │strict-journal-super │	   All	     │	     All       │  Compat   │	 2     │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │local		    │  Linux 2.6.20  │ ocfs2-tools 1.2 │ Incompat  │	 8     │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │sparse		    │  Linux 2.6.22  │ ocfs2-tools 1.4 │ Incompat  │	10     │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │inline-data	    │  Linux 2.6.24  │ ocfs2-tools 1.4 │ Incompat  │	40     │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │extended-slotmap	    │  Linux 2.6.27  │ ocfs2-tools 1.6 │ Incompat  │	100    │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │xattr		    │  Linux 2.6.29  │ ocfs2-tools 1.6 │ Incompat  │	200    │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │indexed-dirs	    │  Linux 2.6.30  │ ocfs2-tools 1.6 │ Incompat  │	400    │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │metaecc		    │  Linux 2.6.29  │ ocfs2-tools 1.6 │ Incompat  │	800    │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │refcount		    │  Linux 2.6.32  │ ocfs2-tools 1.6 │ Incompat  │   1000    │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │discontig-bg	    │  Linux 2.6.35  │ ocfs2-tools 1.6 │ Incompat  │   2000    │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │clusterinfo	    │  Linux 2.6.37  │ ocfs2-tools 1.8 │ Incompat  │   4000    │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │unwritten	    │  Linux 2.6.23  │ ocfs2-tools 1.4 │ RO Compat │	 1     │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │grpquota		    │  Linux 2.6.29  │ ocfs2-tools 1.6 │ RO Compat │	 2     │
      ├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
      │usrquota		    │  Linux 2.6.29  │ ocfs2-tools 1.6 │ RO Compat │	 4     │
      └─────────────────────┴────────────────┴─────────────────┴───────────┴───────────┘

	      To query the features enabled on a volume, do:

	      $ o2info --fs-features /dev/sdf1
	      backup-super strict-journal-super sparse extended-slotmap inline-data xattr
	      indexed-dirs refcount discontig-bg clusterinfo unwritten

       ENABLING AND DISABLING FEATURES

	      The  format  utility, mkfs.ocfs2(8), allows a user to enable and
	      disable specific features using the fs-features option. The fea‐
	      tures  are  provided as a comma separated list. The enabled fea‐
	      tures are listed as is. The disabled features are prefixed  with
	      no.   The	 example  below	 shows the file system being formatted
	      with sparse disabled and inline-data enabled.

	      # mkfs.ocfs2 --fs-features=nosparse,inline-data /dev/sda1

	      After formatting, the users can toggle features using  the  tune
	      utility,	tunefs.ocfs2(8).   This	 is  an offline operation. The
              volume needs to be unmounted across the cluster. The example
	      below  shows  the	 sparse	 feature being enabled and inline-data
	      disabled.

	      # tunefs.ocfs2 --fs-features=sparse,noinline-data /dev/sda1

              Care should be taken before enabling and disabling features.
              Users planning to use a volume with an older version of the file
              system are better off not enabling newer features, as disabling
              them later may not succeed.

	      An  example would be disabling the sparse feature; this requires
	      filling every hole.  The operation can only succeed if the  file
	      system has enough free space.

       DETECTING FEATURE INCOMPATIBILITY

	      Say  one	tries  to mount a volume with an incompatible feature.
	      What happens then? How does one detect the problem? How does one
	      know the name of that incompatible feature?

	      To  begin	 with, one should look for error messages in dmesg(8).
	      Mount failures that are due  to  an  incompatible	 feature  will
	      always result in an error message like the following:

	      ERROR: couldn't mount because of unsupported optional features (200).

              Here the file system is unable to mount the volume due to an
              unsupported optional feature. That means the feature is an
              Incompat feature. By referring to the table above, one can
              deduce that the user failed to mount a volume with the xattr
              feature enabled. (The value in the error message is in
              hexadecimal.)

	      Another example of an error message due to incompatibility is as
	      follows:

	      ERROR: couldn't mount RDWR because of unsupported optional features (1).

              Here the file system is unable to mount the volume in the RW
              mode. That means the feature is a RO Compat feature. Another
              look at the table and it becomes apparent that the volume had
              the unwritten feature enabled.

	      In both cases, the user has the option of disabling the feature.
	      In the second case, the user has the choice of mounting the vol‐
	      ume in the RO mode.
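
              Continuing the second example, the user could either mount the
              volume read-only or disable the unwritten feature from a node
              running newer software (the device and mount point are
              illustrative):

              # mount -o ro /dev/sdf1 /ocfs2
              # tunefs.ocfs2 --fs-features=nounwritten /dev/sdf1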

GETTING STARTED
       The OCFS2 software is split into two  components,  namely,  kernel  and
       tools. The kernel component includes the core file system and the clus‐
       ter stack, and is packaged along with the kernel. The  tools  component
       is  packaged  as ocfs2-tools and needs to be specifically installed. It
       provides utilities to format, tune, mount, debug	 and  check  the  file
       system.

       To install ocfs2-tools, refer to the package handling utility of your
       distribution.
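
       On openSUSE, for example, the tools can typically be installed with
       zypper:

       # zypper install ocfs2-tools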

       The next step is selecting a cluster stack. The options include:

	   A. No cluster stack, or local mount.

	   B. In-kernel o2cb cluster stack with local or global heartbeat.

	   C. Userspace cluster stacks pcmk or cman.

       The  file  system  allows  changing   cluster   stacks	easily	 using
       tunefs.ocfs2(8).	  To list the cluster stacks stamped on the OCFS2 vol‐
       umes, do:

       # mounted.ocfs2 -d
       Device	  Stack	 Cluster     F	UUID				  Label
       /dev/sdb1  o2cb	 webcluster  G	DCDA2845177F4D59A0F2DCD8DE507CC3  hbvol1
       /dev/sdc1  None			23878C320CF3478095D1318CB5C99EED  localmount
       /dev/sdd1  o2cb	 webcluster  G	8AB016CD59FC4327A2CDAB69F08518E3  webvol
       /dev/sdg1  o2cb	 webcluster  G	77D95EF51C0149D2823674FCC162CF8B  logsvol
       /dev/sdh1  o2cb	 webcluster  G	BBA1DBD0F73F449384CE75197D9B7098  scratch

       NON-CLUSTERED OR LOCAL MOUNT

	      To format a OCFS2 volume as a non-clustered (local) volume, do:

	      # mkfs.ocfs2 -L "mylabel" --fs-features=local /dev/sda1

	      To convert an existing clustered volume to a non-clustered  vol‐
	      ume, do:

	      # tunefs.ocfs2 --fs-features=local /dev/sda1

	      Non-clustered  volumes  do  not interact with the cluster stack.
	      One can have both clustered and non-clustered volumes mounted at
	      the same time.

              While formatting a non-clustered volume, users should consider
              the possibility of later converting that volume to a clustered
              one. If there is a possibility of that, then the user should add
              enough node-slots using the -N option. Adding node-slots during
              format creates journals with large extents. Journals added
              later will be fragmented, which is not good for performance.
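
              For example, the following formats a local volume while leaving
              room for a later conversion to an 8-node cluster (the label and
              device name are illustrative):

              # mkfs.ocfs2 -L "mylabel" --fs-features=local -N 8 /dev/sda1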

       CLUSTERED MOUNT WITH O2CB CLUSTER STACK

              Only one of the two heartbeat modes can be active at any one
              time. Changing heartbeat modes is an offline operation.

	      Both  heartbeat  modes   require	 /etc/ocfs2/cluster.conf   and
	      /etc/sysconfig/o2cb  to be populated as described in ocfs2.clus‐
	      ter.conf(5) and o2cb.sysconfig(5) respectively. The only differ‐
              ence in setup between the two modes is that global requires
	      heartbeat devices to be configured whereas local does not.

              Refer to o2cb(7) for more information.

	      LOCAL HEARTBEAT
		     This is the default heartbeat mode.  The  user  needs  to
		     populate	the   configuration   files  as	 described  in
		     ocfs2.cluster.conf(5)  and	 o2cb.sysconfig(5).  In	  this
		     mode,  the	 cluster  stack heartbeats on all mounted vol‐
		     umes. Thus,  one  does  not  have	to  specify  heartbeat
		     devices in cluster.conf.

		     Once  configured,	the  o2cb cluster stack can be onlined
		     and offlined as follows:

		     # service o2cb online
		     Setting cluster stack "o2cb": OK
		     Registering O2CB cluster "webcluster": OK
		     Setting O2CB cluster timeouts : OK

		     # service o2cb offline
		     Clean userdlm domains: OK
		     Stopping O2CB cluster webcluster: OK
		     Unregistering O2CB cluster "webcluster": OK

	      GLOBAL HEARTBEAT
		     The configuration is similar to local heartbeat. The  one
		     additional	 step  in this mode is that it requires heart‐
		     beat devices to be also configured.

		     These heartbeat devices are OCFS2 formatted volumes  with
		     global heartbeat enabled on disk. These volumes can later
		     be mounted and used as clustered file systems.

                     The steps to format a volume with global heartbeat
                     enabled are listed in o2cb(7), as is the procedure for
                     listing all volumes with the cluster stack stamped on
                     disk.

		     In this mode, the heartbeat is started when  the  cluster
		     is onlined and stopped when the cluster is offlined.

		     # service o2cb online
		     Setting cluster stack "o2cb": OK
		     Registering O2CB cluster "webcluster": OK
		     Setting O2CB cluster timeouts : OK
		     Starting global heartbeat for cluster "webcluster": OK

		     # service o2cb offline
		     Clean userdlm domains: OK
		     Stopping global heartbeat on cluster "webcluster": OK
		     Stopping O2CB cluster webcluster: OK
		     Unregistering O2CB cluster "webcluster": OK

		     # service o2cb status
		     Driver for "configfs": Loaded
		     Filesystem "configfs": Mounted
		     Stack glue driver: Loaded
		     Stack plugin "o2cb": Loaded
		     Driver for "ocfs2_dlmfs": Loaded
		     Filesystem "ocfs2_dlmfs": Mounted
		     Checking O2CB cluster "webcluster": Online
		       Heartbeat dead threshold: 31
		       Network idle timeout: 30000
		       Network keepalive delay: 2000
		       Network reconnect delay: 2000
		       Heartbeat mode: Global
		     Checking O2CB heartbeat: Active
		       77D95EF51C0149D2823674FCC162CF8B /dev/sdg1
		     Nodes in O2CB cluster: 92 96

       CLUSTERED MOUNT WITH USERSPACE CLUSTER STACK

	      Configure	 and  online  the  userspace stack pcmk or cman before
	      using tunefs.ocfs2(8) to update the cluster stack on disk.

	      # tunefs.ocfs2 --update-cluster-stack /dev/sdd1
	      Updating on-disk cluster information to match the running cluster.
	      DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS
	      FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
	      Update the on-disk cluster information? y

	      Refer to the cluster  stack  documentation  for  information  on
	      starting and stopping the cluster stack.

FILE SYSTEM UTILITIES
       This section lists the utilities that are used to manage OCFS2 file
       systems. This includes tools to format, tune, check, mount and debug
       the file system. Each utility has a man page that lists its capabili‐
       ties in detail.

       mkfs.ocfs2(8)
	      This is the file system format utility. All volumes have	to  be
              formatted prior to use. As this utility overwrites the vol‐
	      ume, use it with care. Double check to ensure the volume is  not
	      in use on any node in the cluster.

	      As a precaution, the utility will abort if the volume is locally
              mounted. It also detects use across the cluster if the volume is
              in use by OCFS2. But these checks are not comprehensive and can
              be overridden. So use it with care.

	      While it is not always required, the cluster should be online.

       tunefs.ocfs2(8)
	      This is the file system tune utility. It allows users to	change
	      certain  on-disk	parameters  like  label, uuid, number of node-
	      slots, volume size and the size of the journals. It also	allows
	      turning on and off the file system features as listed above.

	      This utility requires the cluster to be online.

       fsck.ocfs2(8)
	      This  is the file system check utility. It detects and fixes on-
	      disk errors. All the check codes and their fixes are  listed  in
	      fsck.ocfs2.checks(8).

	      This  utility  requires  the  cluster to be online to ensure the
	      volume is not in use on another node and to prevent  the	volume
	      from being mounted for the duration of the check.

       mount.ocfs2(8)
	      This  is the file system mount utility. It is invoked indirectly
	      by the mount(8) utility.

	      This utility detects the cluster status and aborts if the	 clus‐
	      ter is offline or does not match the cluster stamped on disk.

       o2cluster(8)
	      This  is the file system cluster stack update utility. It allows
	      the users to update the on-disk cluster stack to	the  one  pro‐
	      vided.

	      This  utility only updates the disk if the utility is reasonably
	      assured that the file system is not in use on any node.

       o2info(1)
	      This is the file system information utility. It provides	infor‐
	      mation  like  the	 features enabled on disk, block size, cluster
	      size, free space fragmentation, etc.

              It can be used by both privileged and non-privileged users.
	      Users  having read permission on the device can provide the path
	      to the device. Other users can provide the path to a file	 on  a
	      mounted file system.

       debugfs.ocfs2(8)
	      This  is the file system debug utility. It allows users to exam‐
	      ine all  file  system  structures	 including  walking  directory
	      structures,  displaying  inodes, backing up files, etc., without
	      mounting the file system.

	      This utility requires the user to have read  permission  on  the
	      device.

       o2image(8)
	      This  is	the file system image utility. It allows users to copy
	      the file system metadata skeleton, including the inodes,	direc‐
	      tories,  bitmaps,	 etc. As it excludes data, it shrinks the size
	      of the file system tremendously.

	      The image file created can be used in debugging on-disk  corrup‐
	      tions.

       mounted.ocfs2(8)
              This is the file system detect utility. It detects all OCFS2
              volumes in the system and lists their label, UUID and cluster
              stack.

O2CB CLUSTER STACK UTILITIES
       This section lists the utilities that are used to manage the O2CB
       cluster stack.  Each utility has a man page that lists its capabilities in
       detail.

       o2cb(8)
	      This  is	the  cluster configuration utility. It allows users to
	      update the cluster configuration by adding  and  removing	 nodes
	      and  heartbeat  regions.	This  utility is used by the o2cb init
	      script to online and offline the cluster.

	      This is a new utility and replaces o2cb_ctl(8)  which  has  been
	      deprecated.

       ocfs2_hb_ctl(8)
	      This  is the cluster heartbeat utility. It allows users to start
	      and  stop	 local	heartbeat.  This   utility   is	  invoked   by
	      mount.ocfs2(8) and should not be invoked directly by the user.

       o2hbmonitor(8)
	      This  is	the disk heartbeat monitor. It tracks the elapsed time
	      since the last  heartbeat	 and  logs  warnings  once  that  time
	      exceeds the warn threshold.

FILE SYSTEM NOTES
       This  section  includes some useful notes that may prove helpful to the
       user.

       BALANCED CLUSTER
	      A cluster is a computer. This is a fact and not a	 slogan.  What
	      this  means is that an errant node in the cluster can affect the
	      behavior of other nodes. If one node is slow, the cluster opera‐
	      tions  will  slow down on all nodes. To prevent that, it is best
	      to have a balanced cluster. This is a cluster that  has  equally
	      powered and loaded nodes.

	      The standard recommendation for such clusters is to have identi‐
	      cal hardware and software across all the nodes. However, that is
	      not a hard and fast rule. After all, we have taken the effort to
	      ensure that OCFS2 works in a mixed architecture environment.

              When using OCFS2 in a mixed architecture environment, try to
	      ensure that the nodes are equally powered and loaded. The use of
	      a load balancer can assist with the latter. Power refers to  the
	      number  of  processors, speed, amount of memory, I/O throughput,
	      network bandwidth, etc. In reality, having equally powered  het‐
	      erogeneous nodes is not always practical. In that case, make the
	      lower node numbers more powerful than the higher	node  numbers.
	      The  O2CB	 cluster stack favors lower node numbers in all of its
	      tiebreaking logic.

              This is not to suggest you should add a single core node to a
	      cluster  of  quad	 cores. No amount of node number juggling will
	      help you there.

       FILE DELETION
	      In Linux, rm(1) removes the directory entry. It does not	neces‐
	      sarily  delete  the  corresponding  inode.  But  by removing the
	      directory entry, it gives the illusion that the inode  has  been
	      deleted.	 This puzzles users when they do not see a correspond‐
	      ing up-tick in the reported free	space.	 The  reason  is  that
	      inode deletion has a few more hurdles to cross.

              First is the hard link count, which indicates the number of
	      directory entries pointing to that inode. As long	 as  an	 inode
	      has  one	or more directory entries pointing to it, it cannot be
	      deleted.	The file system has to wait for	 the  removal  of  all
	      those  directory entries. In other words, wait for that count to
	      drop to zero.

	      The second hurdle is the POSIX semantics allowing	 files	to  be
	      unlinked	even  while they are in-use. In OCFS2, that translates
	      to in-use across the cluster. The file system has	 to  wait  for
	      all processes across the cluster to stop using the inode.

	      Once  these  conditions  are  met,  the inode is deleted and the
	      freed space is visible after the next sync.

	      Now the amount of space freed depends on	the  allocation.  Only
	      space  that  is  actually	 allocated to that inode is freed. The
	      example below shows a sparsely allocated file of	size  51TB  of
	      which only 2.4GB is actually allocated.

	      $ ls -lsh largefile
	      2.4G -rw-r--r-- 1 mark mark 51T Sep 29 15:04 largefile

              Furthermore, for reflinked files, only private extents are
              freed. Shared extents are freed when the last inode accessing
              them is deleted. The example below shows a 4GB file that shares
              3GB with other reflinked files. Deleting it will increase the
              free space by 1GB. However, if it is the only remaining file
              accessing the shared extents, the full 4GB will be freed. (More
              information on the shared-du(1) utility is provided below.)

	      $ shared-du -m -c --shared-size reflinkedfile
	      4000    (3000)  reflinkedfile

	      The  deletion itself is a multi-step process. Once the hard link
	      count falls to zero, the inode is moved to the orphan_dir system
	      directory	 where	it  remains until the last process, across the
	      cluster, stops using the inode. Then the file system  frees  the
	      extents  and adds the freed space count to the truncate_log sys‐
	      tem file where it remains until the next sync.  The freed	 space
	      is made visible to the user only after that sync.

       DIRECTORY LISTING
	      ls(1)  may  be  a	 simple	 command, but it is not cheap. What is
	      expensive is not the part where it reads the directory  listing,
              but the second part where it reads all the inodes, also referred
              to as an inode stat(2). If the inodes are not in cache, this can
	      entail  disk  I/O.   Now,	 while	a  cold cache inode stat(2) is
	      expensive in all file systems, it is especially so  in  a	 clus‐
	      tered  file  system  as  it needs to take a cluster lock on each
	      inode.

              A hot cache stat(2), on the other hand, has been shown to perform
              on OCFS2 like it does on EXT3.

	      In other words, the second ls(1) will be quicker than the first.
	      However, it is not guaranteed. Say you have a million files in a
	      file  system  and	 not  enough  kernel  memory  to cache all the
	      inodes. In that case, each ls(1) will involve  some  cold	 cache
	      stat(2)s.

       ALLOCATION RESERVATION
	      Allocation  reservation  allows  multiple concurrently extending
	      files to grow as contiguously as possible.  One  way  to	demon‐
	      strate  its functioning is to run a script that extends multiple
	      files in a circular order. The script below does that by writing
	      one hundred 4KB chunks to four files, one after another.

	      $ for i in $(seq 0 99);
	      > do
	      >	  for j in $(seq 4);
	      >	  do
	      >	    dd if=/dev/zero of=file$j bs=4K count=1 seek=$i;
	      >	  done;
	      > done;

	      When  run on a system running Linux kernel 2.6.34 or earlier, we
	      end up with files with 100 extents each. That is full fragmenta‐
	      tion. As the files are being extended one after another, the on-
	      disk allocations are fully interleaved.

	      $ filefrag file1 file2 file3 file4
	      file1: 100 extents found
	      file2: 100 extents found
	      file3: 100 extents found
	      file4: 100 extents found

	      When run on a system running Linux kernel 2.6.35	or  later,  we
	      see  files with 7 extents each. That is a lot fewer than before.
	      Fewer extents mean more on-disk contiguity and that always leads
	      to better overall performance.

	      $ filefrag file1 file2 file3 file4
	      file1: 7 extents found
	      file2: 7 extents found
	      file3: 7 extents found
	      file4: 7 extents found

       REFLINK OPERATION
	      This  feature  allows a user to create a writeable snapshot of a
	      regular file. In this operation, the file system creates	a  new
	      inode  with the same extent pointers as the original inode. Mul‐
	      tiple inodes are thus able to share data extents.	 This  adds  a
	      twist in file system administration because none of the existing
	      file system utilities in Linux expect this  behavior.  du(1),  a
              utility used to compute file space usage, simply adds the
              blocks allocated to each inode. As it does not know about shared
              extents, it overestimates the space used. Say, we have a 5GB
	      file in a volume having 42GB free.

	      $ ls -l
	      total 5120000
	      -rw-r--r--  1 jeff jeff	5242880000 Sep 24 17:15 myfile

	      $ du -m myfile*
	      5000    myfile

	      $ df -h .
	      Filesystem	    Size  Used Avail Use% Mounted on
	      /dev/sdd1		    50G	  8.2G	 42G  17% /ocfs2

	      If we were to reflink it 4 times, we would expect the  directory
	      listing  to  report  five	 5GB files, but the df(1) to report no
	      loss of available space. du(1), on the other hand, would	report
	      the disk usage to climb to 25GB.

	      $ reflink myfile myfile-ref1
	      $ reflink myfile myfile-ref2
	      $ reflink myfile myfile-ref3
	      $ reflink myfile myfile-ref4

	      $ ls -l
	      total 25600000
	      -rw-r--r--  1 jeff jeff	5242880000 Sep 24 17:15 myfile
	      -rw-r--r--  1 jeff jeff	5242880000 Sep 24 17:16 myfile-ref1
	      -rw-r--r--  1 jeff jeff	5242880000 Sep 24 17:16 myfile-ref2
	      -rw-r--r--  1 jeff jeff	5242880000 Sep 24 17:16 myfile-ref3
	      -rw-r--r--  1 jeff jeff	5242880000 Sep 24 17:16 myfile-ref4

	      $ df -h .
	      Filesystem	    Size  Used Avail Use% Mounted on
	      /dev/sdd1		    50G	  8.2G	 42G  17% /ocfs2

	      $ du -m myfile*
	      5000    myfile
	      5000    myfile-ref1
	      5000    myfile-ref2
	      5000    myfile-ref3
	      5000    myfile-ref4
	      25000 total

              Enter shared-du(1), a shared extent-aware du. This utility
              reports the shared extents per file in parentheses and the
              overall footprint. As expected, it lists the overall footprint
              at 5GB. One can view the details of the extents using
              shared-filefrag(1). Both these utilities are available at
              http://oss.oracle.com/~smushran/reflink-tools/. We are currently
              in the process of pushing the changes to the upstream
              maintainers of these utilities.

	      $ shared-du -m -c --shared-size myfile*
	      5000    (5000)  myfile
	      5000    (5000)  myfile-ref1
	      5000    (5000)  myfile-ref2
	      5000    (5000)  myfile-ref3
	      5000    (5000)  myfile-ref4
	      25000 total
	      5000 footprint

	      # shared-filefrag -v myfile
	      Filesystem type is: 7461636f
	      File size of myfile is 5242880000 (1280000 blocks, blocksize 4096)
	      ext logical physical expected length flags
	      0		0  2247937	      8448
	      1	     8448  2257921  2256384  30720
	      2	    39168  2290177  2288640  30720
	      3	    69888  2322433  2320896  30720
	      4	   100608  2354689  2353152  30720
	      7	   192768  2451457  2449920  30720
	       . . .
	      37  1073408  2032129  2030592  30720 shared
	      38  1104128  2064385  2062848  30720 shared
	      39  1134848  2096641  2095104  30720 shared
	      40  1165568  2128897  2127360  30720 shared
	      41  1196288  2161153  2159616  30720 shared
	      42  1227008  2193409  2191872  30720 shared
	      43  1257728  2225665  2224128  22272 shared,eof
	      myfile: 44 extents found

       DATA COHERENCY
	      One of the challenges in a shared file system is data  coherency
	      when  multiple  nodes are writing to the same set of files. NFS,
	      for example, provides close-to-open data coherency that  results
	      in  the data being flushed to the server when the file is closed
	      on the client.  This leaves open a wide window  for  stale  data
	      being read on another node.

	      A	 simple test to check the data coherency of a shared file sys‐
              tem involves concurrently appending to the same file, for example
              by running "uname -a >>/dir/file" using a parallel distributed shell like
	      dsh or pconsole. If coherent, the file will contain the  results
	      from all nodes.

	      # dsh -R ssh -w node32,node33,node34,node35 "uname -a >> /ocfs2/test"
	      # cat /ocfs2/test
	      Linux node32 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
	      Linux node35 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
	      Linux node33 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
	      Linux node34 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

	      OCFS2 is a fully cache coherent cluster file system.

       DISCONTIGUOUS BLOCK GROUP
	      Most  file  systems pre-allocate space for inodes during format.
	      OCFS2 dynamically allocates this space when required.

	      However, this dynamic allocation has been problematic  when  the
	      free  space is very fragmented, because the file system required
	      the inode and extent allocators to grow in contiguous fixed-size
	      chunks.

	      The discontiguous block group feature takes care of this problem
	      by allowing the allocators to grow  in  smaller,	variable-sized
	      chunks.

	      This  feature  was  added	 in  Linux  kernel 2.6.35 and requires
	      enabling on-disk feature discontig-bg.

       BACKUP SUPER BLOCKS
	      A file system super block stores critical	 information  that  is
	      hard  to	recreate.  In OCFS2, it stores the block size, cluster
	      size, and the locations of  the  root  and  system  directories,
	      among  other  things. As this block is close to the start of the
	      disk, it is very susceptible to being overwritten by  an	errant
	      write.  Say, dd if=file of=/dev/sda1.

	      Backup  super blocks are copies of the super block. These blocks
	      are dispersed in the volume to minimize  the  chances  of	 being
	      overwritten.  On	the  small  chance that the original gets cor‐
	      rupted, the backups are available to scan and  fix  the  corrup‐
	      tion.

	      mkfs.ocfs2(8) enables this feature by default. Users can disable
	      this by specifying --fs-features=nobackup-super during format.

	      o2info(1) can be used to	view  whether  the  feature  has  been
	      enabled on a device.

	      # o2info --fs-features /dev/sdb1
	      backup-super strict-journal-super sparse extended-slotmap inline-data xattr
	      indexed-dirs refcount discontig-bg clusterinfo unwritten

	      In OCFS2, the super block is on the third block. The backups are
	      located at the 1G, 4G, 16G, 64G, 256G and 1T byte	 offsets.  The
	      actual  number  of  backup  blocks  depends  on  the size of the
	      device. The super block is not backed up on devices smaller than
	      1GB.

	      fsck.ocfs2(8)  refers  to	 these six offsets by numbers, 1 to 6.
	      Users can specify any backup with the -r option to  recover  the
	      volume. The example below uses the second backup. If successful,
	      fsck.ocfs2(8) overwrites the  corrupted  super  block  with  the
	      backup.

	      # fsck.ocfs2 -f -r 2 /dev/sdb1
	      fsck.ocfs2 1.8.0
	      [RECOVER_BACKUP_SUPERBLOCK] Recover superblock information from backup block#1048576? <n> y
	      Checking OCFS2 filesystem in /dev/sdb1:
		Label:		    webhome
		UUID:		    B3E021A2A12B4D0EB08E9E986CDC7947
		Number of blocks:   13107196
		Block size:	    4096
		Number of clusters: 13107196
		Cluster size:	    4096
		Number of slots:    8

	      /dev/sdb1 was run with -f, check forced.
	      Pass 0a: Checking cluster allocation chains
	      Pass 0b: Checking inode allocation chains
	      Pass 0c: Checking extent block allocation chains
	      Pass 1: Checking inodes and blocks.
	      Pass 2: Checking directory entries.
	      Pass 3: Checking directory connectivity.
	      Pass 4a: checking for orphaned inodes
	      Pass 4b: Checking inodes link counts.
	      All passes succeeded.

       SYNTHETIC FILE SYSTEMS
	      The  OCFS2  development  effort included two synthetic file sys‐
	      tems, configfs and dlmfs. It also makes use of a third, debugfs.

	      configfs
		     configfs has since been accepted as a generic kernel com‐
		     ponent  and  is also used by netconsole and fs/dlm. OCFS2
		     tools use it to communicate the  list  of	nodes  in  the
		     cluster,  details	of the heartbeat device, cluster time‐
		     outs, and so on to the in-kernel node manager.  The  o2cb
		     init  script  mounts this file system at /sys/kernel/con‐
		     fig.

	      dlmfs  dlmfs exposes the	in-kernel  o2dlm  to  the  user-space.
		     While  it was developed primarily for OCFS2 tools, it has
		     seen usage by others looking to  add  a  cluster  locking
		     dimension	in  their  applications.  Users	 interested in
		     doing the same should look at the libo2dlm	 library  pro‐
		     vided  by	ocfs2-tools.  The o2cb init script mounts this
		     file system at /dlm.

              debugfs
                     OCFS2 uses debugfs to expose its in-kernel information to
                     user space. For example, listing the file system cluster
                     locks, dlm locks, dlm state, o2net state, etc. Users can
                     access the information by mounting the file system at
                     /sys/kernel/debug. To automount, add the following line
                     to /etc/fstab:

                     debugfs /sys/kernel/debug debugfs defaults 0 0

       DISTRIBUTED LOCK MANAGER
	      One of the key technologies in a cluster is  the	lock  manager,
	      which  maintains	the  locking state of all resources across the
	      cluster. An easy implementation of a lock manager involves  des‐
	      ignating one node to handle everything. In this model, if a node
	      wanted to acquire a lock, it would send the request to the  lock
              manager. However, this model has a weakness: the lock manager's
              death causes the cluster to seize up.

	      A better model is one where all nodes manage  a  subset  of  the
	      lock  resources.	Each node maintains enough information for all
              the lock resources it is interested in. In the event of a node
              death, the remaining nodes pool their information to recon‐
              struct the lock state maintained by the dead node. In this
	      scheme,  the  locking  overhead  is  distributed amongst all the
	      nodes. Hence, the term distributed lock manager.

              O2DLM is a distributed lock manager. It is based on the specifi‐
              cation titled "Programming Locking Applications" written by
              Kristin Thomas and is available at the following link:
              http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf

       DLM DEBUGGING
	      O2DLM has a rich debugging infrastructure that allows it to show
	      the  state  of  the  lock manager, all the lock resources, among
	      other things.  The figure below shows the dlm state of  a	 nine-
	      node  cluster that has just lost three nodes: 12, 32, and 35. It
	      can be ascertained that node 7, the  recovery  master,  is  cur‐
	      rently  recovering  node	12 and has received the lock states of
	      the dead node from all other live nodes.

	      # cat /sys/kernel/debug/o2dlm/45F81E3B6F2B48CCAAD1AE7945AB2001/dlm_state
	      Domain: 45F81E3B6F2B48CCAAD1AE7945AB2001	Key: 0x10748e61
	      Thread Pid: 24542	 Node: 7  State: JOINED
	      Number of Joins: 1  Joining Node: 255
	      Domain Map: 7 31 33 34 40 50
	      Live Map: 7 31 33 34 40 50
	      Lock Resources: 48850 (439879)
	      MLEs: 0 (1428625)
		Blocking: 0 (1066000)
		Mastery: 0 (362625)
		Migration: 0 (0)
	      Lists: Dirty=Empty  Purge=Empty  PendingASTs=Empty  PendingBASTs=Empty
	      Purge Count: 0  Refs: 1
	      Dead Node: 12
	      Recovery Pid: 24543  Master: 7  State: ACTIVE
	      Recovery Map: 12 32 35
	      Recovery Node State:
		      7 - DONE
		      31 - DONE
		      33 - DONE
		      34 - DONE
		      40 - DONE
		      50 - DONE

	      The figure below shows the state of a dlm lock resource that  is
	      mastered	(owned)	 by node 25, with 6 locks in the granted queue
	      and node 26 holding the EX (writelock) lock on that resource.

	      # debugfs.ocfs2 -R "dlm_locks M000000000000000022d63c00000000" /dev/sda1
	      Lockres: M000000000000000022d63c00000000	 Owner: 25    State: 0x0
	      Last Used: 0	ASTs Reserved: 0    Inflight: 0	   Migration Pending: No
	      Refs: 8	 Locks: 6    On Lists: None
	      Reference Map: 26 27 28 94 95
	       Lock-Queue  Node	 Level	Conv  Cookie	       Refs  AST  BAST	Pending-Action
	       Granted	   94	 NL	-1    94:3169409       2     No	  No	None
	       Granted	   28	 NL	-1    28:3213591       2     No	  No	None
	       Granted	   27	 NL	-1    27:3216832       2     No	  No	None
	       Granted	   95	 NL	-1    95:3178429       2     No	  No	None
	       Granted	   25	 NL	-1    25:3513994       2     No	  No	None
	       Granted	   26	 EX	-1    26:3512906       2     No	  No	None

	      The figure below shows a lock from the file system  perspective.
	      Specifically,  it	 shows	a lock that is in the process of being
	      upconverted from NL to EX. Locks in this state are
	      referred to in the file system as busy locks and can be
	      listed using the debugfs.ocfs2 command, "fs_locks -B".

	      # debugfs.ocfs2 -R "fs_locks -B" /dev/sda1
	      Lockres: M000000000000000000000b9aba12ec	Mode: No Lock
	      Flags: Initialized Attached Busy
	      RO Holders: 0  EX Holders: 0
	      Pending Action: Convert  Pending Unlock Action: None
	      Requested Mode: Exclusive	 Blocking Mode: No Lock
	      PR > Gets: 0  Fails: 0	Waits Total: 0us  Max: 0us  Avg: 0ns
	      EX > Gets: 1  Fails: 0	Waits Total: 544us  Max: 544us	Avg: 544185ns
	      Disk Refreshes: 1

	      With this debugging infrastructure in place, users can
	      debug hang issues as follows (a sketch of the
	      corresponding commands follows the list):

		  *  Dump  the	busy fs locks for all the OCFS2 volumes on the
		  node with hanging processes. If no locks are found, then the
		  problem is not related to O2DLM.

		  * Dump the corresponding dlm lock for all the busy fs locks.
		  Note down the owner (master) of all the locks.

		  * Dump the dlm locks on the master node for each lock.
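
	      A minimal sketch of these steps, reusing the commands
	      shown above (device and lock names are illustrative):

	      # debugfs.ocfs2 -R "fs_locks -B" /dev/sda1
	      # debugfs.ocfs2 -R "dlm_locks M000000000000000022d63c00000000" /dev/sda1

	      Run the first command on the node with the hanging
	      processes to list the busy fs locks, note the owner
	      (master) reported by the matching dlm lock, and then
	      repeat the dlm_locks dump for each such lock on its
	      master node.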

	      At this stage, one should note that the hanging node is  waiting
	      to  get  an  AST from the master. The master, on the other hand,
	      cannot send the AST until the current holder has down  converted
	      that  lock, which it will do upon receiving a Blocking AST. How‐
	      ever, a node can only down convert if all the lock holders  have
	      stopped using that lock.	After dumping the dlm lock on the mas‐
	      ter node, identify the current lock holder and dump both the dlm
	      and fs locks on that node.

	      The trick here is to see whether the Blocking AST message
	      has been relayed to the file system. If not, the problem
	      is in the dlm layer. If it has, then the most common
	      reason is that a process is still holding the lock; the
	      holder counts (RO Holders and EX Holders) are maintained
	      in the fs lock.

	      At this stage, printing the list of processes helps.

	      $ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN

	      Make a note of all D state processes. At least one  of  them  is
	      responsible for the hang on the first node.

	      The  challenge  then  is	to  figure out why those processes are
	      hanging. Failing that, at least  get  enough  information	 (like
	      alt-sysrq	 t  output) for the kernel developers to review.  What
	      to do next depends on where the process is  hanging.  If	it  is
	      waiting  for  the I/O to complete, the problem could be anywhere
	      in the I/O subsystem, from the block device  layer  through  the
	      drivers  to  the	disk  array.  If the hang concerns a user lock
	      (flock(2)), the problem could be in the  user’s  application.  A
	      possible	solution  could	 be to kill the holder. If the hang is
	      due to tight or  fragmented  memory,  free  up  some  memory  by
	      killing non-essential processes.
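
	      To capture the alt-sysrq t output mentioned above, one
	      minimal sketch (assuming the magic sysrq facility is
	      enabled on that node) is:

	      # echo t > /proc/sysrq-trigger
	      # dmesg

	      This dumps a stack trace for every task, including the D
	      state processes noted earlier, into the kernel log.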

	      The thing to note is that the symptom for the problem was on one
	      node but the cause is on another. The issue can only be resolved
	      on the node holding the lock. Sometimes, the best
	      solution is to reset that node. Once the node has been
	      reset, the O2DLM recovery process will clear all the
	      locks it owned and let the cluster
	      continue to operate. As harsh as that sounds, at times it is the
	      only  solution.  The  good news is that, by following the trail,
	      you now have enough information to file a bug and get  the  real
	      issue resolved.

       NFS EXPORTING
	      OCFS2  volumes  can  be exported as NFS volumes. This support is
	      limited to NFS version 3, which translates to Linux kernel  ver‐
	      sion 2.4 or later.

	      If  the  version of the Linux kernel on the system exporting the
	      volume is older than 2.6.30, then the NFS clients must mount the
	      volumes  using  the  nordirplus  mount option. This disables the
	      READDIRPLUS RPC call to work around a bug in NFSD, detailed in
	      the following link:

	      http://oss.oracle.com/pipermail/ocfs2-announce/2008-June/000025.html
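
	      As an illustration, an NFS client would pass the option
	      at mount time roughly as follows (the server name and
	      paths are hypothetical):

	      # mount -t nfs -o vers=3,nordirplus server:/export /mnt/ocfs2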

	      Users running NFS version 2 can export the volume after
	      having disabled subtree checking (export option
	      no_subtree_check). Be
	      warned,  disabling  the  check  has security implications (docu‐
	      mented in the exports(5) man page) that users must  evaluate  on
	      their own.
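
	      A minimal sketch of such an exports(5) entry (the export
	      path and client specification are hypothetical):

	      /export/ocfs2    *(rw,sync,no_subtree_check)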

       FILE SYSTEM LIMITS
	      OCFS2  has  no  intrinsic limit on the total number of files and
	      directories in the file system. In general, it is	 only  limited
	      by the size of the device. But there is one limit imposed
	      by the current file system format: it can address at most
	      four billion (2^32) clusters. A file system with a 1MB
	      cluster size can go up to 4PB,
	      while a file system with a 4KB cluster size can  address	up  to
	      16TB.
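
	      A quick sketch of the arithmetic behind those figures
	      (2^32 addressable clusters):

	      2^32 clusters x 4KB/cluster = 16TB
	      2^32 clusters x 1MB/cluster =  4PB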

       SYSTEM OBJECTS
	      The  OCFS2  file system stores its internal meta-data, including
	      bitmaps, journals, etc., as system files. These are grouped in a
	      system directory. These files and directories are not accessible
	      via the file system  interface  but  can	be  viewed  using  the
	      debugfs.ocfs2(8) tool.

	      To list the system directory (referred to as double-slash), do:

	      # debugfs.ocfs2 -R "ls -l //" /dev/sde1
		      66     drwxr-xr-x	 10  0	0	  3896 19-Jul-2011 13:36 .
		      66     drwxr-xr-x	 10  0	0	  3896 19-Jul-2011 13:36 ..
		      67     -rw-r--r--	  1  0	0	     0 19-Jul-2011 13:36 bad_blocks
		      68     -rw-r--r--	  1  0	0      1179648 19-Jul-2011 13:36 global_inode_alloc
		      69     -rw-r--r--	  1  0	0	  4096 19-Jul-2011 14:35 slot_map
		      70     -rw-r--r--	  1  0	0      1048576 19-Jul-2011 13:36 heartbeat
		      71     -rw-r--r--	  1  0	0  53686960128 19-Jul-2011 13:36 global_bitmap
		      72     drwxr-xr-x	  2  0	0	  3896 25-Jul-2011 15:05 orphan_dir:0000
		      73     drwxr-xr-x	  2  0	0	  3896 19-Jul-2011 13:36 orphan_dir:0001
		      74     -rw-r--r--	  1  0	0      8388608 19-Jul-2011 13:36 extent_alloc:0000
		      75     -rw-r--r--	  1  0	0      8388608 19-Jul-2011 13:36 extent_alloc:0001
		      76     -rw-r--r--	  1  0	0    121634816 19-Jul-2011 13:36 inode_alloc:0000
		      77     -rw-r--r--	  1  0	0	     0 19-Jul-2011 13:36 inode_alloc:0001
		      78     -rw-r--r--	  1  0	0    268435456 19-Jul-2011 13:36 journal:0000
		      79     -rw-r--r--	  1  0	0    268435456 19-Jul-2011 13:37 journal:0001
		      80     -rw-r--r--	  1  0	0	     0 19-Jul-2011 13:36 local_alloc:0000
		      81     -rw-r--r--	  1  0	0	     0 19-Jul-2011 13:36 local_alloc:0001
		      82     -rw-r--r--	  1  0	0	     0 19-Jul-2011 13:36 truncate_log:0000
		      83     -rw-r--r--	  1  0	0	     0 19-Jul-2011 13:36 truncate_log:0001

	      The  file	 names that end with numbers are slot specific and are
	      referred to as node-local system files. The  set	of  node-local
	      files  used  by  a  node can be determined from the slot map. To
	      list the slot map, do:

	      # debugfs.ocfs2 -R "slotmap" /dev/sde1
		  Slot#	   Node#
		      0	      32
		      1	      35
		      2	      40
		      3	      31
		      4	      34
		      5	      33
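
	      An individual system file can be inspected using the same
	      double-slash syntax; for example, to view the inode of
	      the journal used by the node in slot 0 (device name
	      illustrative):

	      # debugfs.ocfs2 -R "stat //journal:0000" /dev/sde1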

	      For more information, refer to the OCFS2 support guides
	      available in the Documentation section at
	      http://oss.oracle.com/projects/ocfs2.

       HEARTBEAT, QUORUM, AND FENCING
	      Heartbeat is an  essential  component  in	 any  cluster.	It  is
	      charged  with  accurately	 designating nodes as dead or alive. A
	      mistake here could lead to a cluster hang or to data
	      corruption.

	      o2hb is the disk heartbeat component of  o2cb.  It  periodically
	      updates a timestamp on disk, indicating to others that this node
	      is alive. It also reads all the  timestamps  to  identify	 other
	      live  nodes. Other cluster components, like o2dlm and o2net, use
	      the o2hb service to get node up and down events.
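
	      The on-disk heartbeat area can be dumped with
	      debugfs.ocfs2; a minimal sketch (device name
	      illustrative):

	      # debugfs.ocfs2 -R "hb" /dev/sde1

	      Each populated slot carries the owning node's heartbeat
	      sequence number, which o2hb bumps every two seconds.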

	      The quorum is the group of nodes in a cluster that is allowed to
	      operate  on  the	shared storage. When there is a failure in the
	      cluster, nodes may be split into groups that can communicate  in
	      their groups and with the shared storage but not between groups.
	      o2quo determines which group is allowed to continue  and	initi‐
	      ates fencing of the other group(s).

	      Fencing is the act of forcefully removing a node from a cluster.
	      A node with OCFS2 mounted will fence  itself  when  it  realizes
	      that it does not have quorum in a degraded cluster. It does this
	      so that  other  nodes  won’t  be	stuck  trying  to  access  its
	      resources.

	      o2cb  uses  a machine reset to fence. This is the quickest route
	      for the node to rejoin the cluster.
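
	      On kernels that expose it, the fence method in use can be
	      checked via configfs; a sketch (the cluster name is
	      hypothetical and the attribute may be absent on older
	      kernels):

	      # cat /sys/kernel/config/cluster/mycluster/fence_method
	      reset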

       PROCESSES

	      [o2net]
		     One per node. It is a work-queue thread started when  the
		     cluster  is  brought  on-line and stopped when it is off-
		     lined. It handles network communication for  all  mounts.
		     It	 gets the list of active nodes from O2HB and sets up a
		     TCP/IP communication channel  with	 each  live  node.  It
		     sends  regular keep-alive packets to detect any interrup‐
		     tion on the channels.

	      [user_dlm]
		     One per node. It is  a  work-queue	 thread	 started  when
		     dlmfs is loaded and stopped when it is unloaded (dlmfs is
		     a synthetic file system that allows user space  processes
		     to access the in-kernel dlm).

	      [ocfs2_wq]
		     One  per node. It is a work-queue thread started when the
		     OCFS2 module is loaded and stopped when it	 is  unloaded.
		     It is assigned background file system tasks that may take
		     cluster locks like	 flushing  the	truncate  log,	orphan
		     directory recovery and local alloc recovery. For example,
		     orphan directory recovery runs in the background so  that
		     it does not affect recovery time.

	      [o2hb-14C29A7392]
		     One  per  heartbeat device. It is a kernel thread started
		     when the heartbeat region is populated  in	 configfs  and
		     stopped  when  it is removed. It writes every two seconds
		     to a block in the heartbeat region, indicating that  this
		     node is alive. It also reads the region to maintain a map
		     of live nodes. It notifies	 subscribers  like  o2net  and
		     o2dlm of any changes in the live node map.

	      [ocfs2dc]
		     One  per mount. It is a kernel thread started when a vol‐
		     ume is mounted and stopped when it is unmounted. It down‐
		     grades   locks  in	 response  to  blocking	 ASTs  (BASTs)
		     requested by other nodes.

	      [jbd2/sdf1-97]
		     One per mount. It is part of JBD2, which OCFS2  uses  for
		     journaling.

	      [ocfs2cmt]
		     One  per mount. It is a kernel thread started when a vol‐
		     ume is mounted and stopped when it is unmounted. It works
		     with kjournald2.

	      [ocfs2rec]
		     It	 is  started whenever a node has to be recovered. This
		     thread performs file system  recovery  by	replaying  the
		     journal  of  the  dead node. It is scheduled to run after
		     dlm recovery has completed.

	      [dlm_thread]
		     One per dlm domain. It is a kernel thread started when  a
		     dlm  domain  is created and stopped when it is destroyed.
		     This thread sends ASTs and blocking ASTs in  response  to
		     lock  level  convert  requests. It also frees unused lock
		     resources.

	      [dlm_reco_thread]
		     One per dlm domain. It is a kernel	 thread	 that  handles
		     dlm  recovery when another node dies. If this node is the
		     dlm recovery master, it re-masters	 every	lock  resource
		     owned by the dead node.

	      [dlm_wq]
		     One  per dlm domain. It is a work-queue thread that o2dlm
		     uses to queue blocking tasks.
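
	      To see which of these threads are running on a node with
	      an active mount, one simple sketch is (the exact names
	      vary with the device, heartbeat region and dlm domain):

	      $ ps -e -o pid,comm | egrep 'o2net|o2hb|ocfs2|dlm|jbd2'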

       FUTURE WORK
	      File system development is a never ending cycle. Faster
	      and larger disks, faster and more numerous processors,
	      larger caches, etc. keep changing the sweet spot for
	      performance, forcing developers to rethink long-held
	      beliefs. Add to that new use cases, which force
	      developers to be innovative in providing solutions that
	      meld seamlessly with existing semantics.

	      We  are  currently looking to add features like transparent com‐
	      pression, transparent  encryption,  delayed  allocation,	multi-
	      device support, etc. as well as work on improving performance on
	      newer generation machines.

	      If you are interested in	contributing,  email  the  development
	      team at ocfs2-devel@oss.oracle.com.

ACKNOWLEDGEMENTS
       The  principal  developers  of the OCFS2 file system, its tools and the
       O2CB cluster stack, are Joel Becker, Zach Brown, Mark Fasheh, Jan Kara,
       Kurt Hackel, Tao Ma, Sunil Mushran, Tiger Yang and Tristan Ye.

       Other developers who have contributed to the file system via bug fixes,
       testing, etc.  are Wim Coekaerts, Srinivas Eeda, Coly Li, Jeff Mahoney,
       Marcos Matsunaga, Goldwyn Rodrigues, Manish Singh and Wengang Wang.

       The  members  of	 the Linux Cluster community including Andrew Beekhof,
       Lars Marowsky-Bree, Fabio Massimo Di Nitto and David Teigland.

       The members of the Linux	 File  system  community  including  Christoph
       Hellwig and Chris Mason.

       The  corporations  that	have  contributed  resources  for this project
       including Oracle, SUSE Labs, EMC, Emulex, HP, IBM,  Intel  and  Network
       Appliance.

SEE ALSO
       debugfs.ocfs2(8)	  fsck.ocfs2(8)	  fsck.ocfs2.checks(8)	 mkfs.ocfs2(8)
       mount.ocfs2(8)  mounted.ocfs2(8)	 o2cluster(8)	o2image(8)   o2info(1)
       o2cb(7)	o2cb(8) o2cb.sysconfig(5) o2hbmonitor(8) ocfs2.cluster.conf(5)
       tunefs.ocfs2(8)

AUTHOR
       Oracle Corporation

COPYRIGHT
       Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2			 January 2012			      OCFS2(7)