I would like to share my (perhaps a bit unusual) configuration of a GlusterFS cluster.
Warning: this text should not be treated as a ‘how to’ that precisely and literally describes every step needed to carry out the whole installation. I have tried to describe all the important steps, but I focused more on the idea behind them, so there may be some simplifications in the text.
I had been considering highly available storage for quite a while. I chose gluster first of all because of its low ‘threshold of entry’: compared to the obvious alternative, Ceph, noticeably less equipment is needed to start up a minimal reasonable configuration.
I needed gluster for two reasons:
- as a capacious data store (not necessarily very fast, but highly available), e.g. for installation images of systems/applications and as a repository/archive for builds of software created in the company
- in the longer term, I was about to run a local OKD installation, for which I needed persistent storage for containers.
These requirements drove the design decisions: integration with OKD forced the use of heketi (it is through heketi that OKD integrates with gluster), while applications outside OKD, in turn, forced a ‘non-container’ setup of gluster on machines separate from OKD.
Equipment
I used four machines with similar configurations to build it:
- a 2U chassis with room for twelve 3.5" hard drives, holding eleven 8 TB or 10 TB disks, plus two 2.5" SATA SSDs for the system (in separate slots at the back of the server)
- a Supermicro X11-SSL-F mainboard (I really like this board in the ‘economical server’ class, where many PCI-E lanes and lots of CPU cores are not needed)
- an Intel Xeon E3-12xx v6 processor
- 2x 16 GB DDR4 ECC memory
- a RAID controller with cache and battery/flash backup (various models: Microsemi (Adaptec) or Broadcom (LSI))
- an additional 2x10 Gbit network card
I configured the system drives as a mirror using the mainboard's software RAID; the disks for the main storage I connected to the RAID controller and configured as RAID-10 with one spare disk.
As the operating system I chose CentOS 7 in a fairly minimal configuration, placed on the dedicated SSD drives.
On the two network ports embedded on the mainboard I configured a bond0 interface, used mainly for administrative purposes; on the 10 Gbit interfaces I configured a second bonded interface, bond1, for the SAN/gluster network.
Each machine is visible as glusterN.softax (1 Gbit network) and glusterN.san (10 Gbit network).
Install the necessary cluster software:
# yum install centos-release-gluster6
# yum install glusterfs-server nfs-ganesha-gluster glusterfs-resource-agents
# yum install heketi heketi-client
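Before we can peer the machines later on, glusterd has to be running on each of them – a quick sketch, assuming the standard systemd unit shipped with the glusterfs-server package:
# systemctl enable --now glusterd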
The default clustering solution proposed along with glusterfs is CTDB. I did not quite like it (perhaps because I do not know it very well), and balancing load at the SMB level between the cluster machines (which is where CTDB pays off) is not vital for me either. I definitely preferred to use pacemaker, which I know well and have been using for many years in various applications.
Unfortunately, choosing pacemaker means that we cannot use the ‘automagic’ storhaug tool for configuring the glusterfs cluster, and everything has to be configured by hand.
# yum install corosync pacemaker pacemaker-cli
To (re)configure pacemaker clusters, my favourite tool is crmsh:
# rpm -ihv crmsh-3.0.0-6.2.noarch.rpm python-parallax-1.0.1-29.1.noarch.rpm
Because heketi likes to have the whole device/partition it is given to itself, I created a partition on the main data storage for non-heketi needs. For now I will use only a small part of it, but it is easier to leave some space now than to struggle with resizing the partition later.
The final layout of the main storage (leaving aside the system disks) is as follows:
# parted /dev/sdc
GNU Parted 3.1
Using /dev/sdc
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: ASR8160 storage (scsi)
Disk /dev/sdc: 50.0TB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number  Start   End     Size    File system  Name   Flags
 1      1049kB  137GB   137GB                infra  lvm
 2      137GB   50.0TB  49.8TB               data   lvm
Partition sdc2 will be used for heketi – I do not configure anything on it manually. On partition sdc1 I set up a volume group vgclusterinfoN and a gluster_shared_storage volume, which I then mount at /var/lib/glusterd/ss_brick:
# pvcreate /dev/sdc1
# vgcreate vgclusterinfoN /dev/sdc1
# lvcreate --type thin-pool -l100%free -n thinpool vgclusterinfoN
# lvcreate -V1G -n gluster_shared_storage vgclusterinfoN/thinpool
# mkfs.xfs -i size=512 /dev/vgclusterinfoN/gluster_shared_storage
# mount /dev/vgclusterinfoN/gluster_shared_storage /var/lib/glusterd/ss_brick
(of course, we must also remember to add a new entry to /etc/fstab)
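Just as an illustration, the fstab entry could look more or less like this (options to taste):
/dev/vgclusterinfoN/gluster_shared_storage /var/lib/glusterd/ss_brick xfs defaults 0 0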
After carrying out the above configuration on all four machines, we can move on to configuring gluster.
We have four machines, and an odd number is much better for quorum, which is why I added a fifth machine to the four (the superstorage server, used for other purposes) – I will not run any services/volumes on it; it will only take part in establishing quorum.
We create a gluster cluster (we connect the machines with one another):
# gluster peer probe gluster2.san
# gluster peer probe gluster3.san
# gluster peer probe gluster4.san
# gluster peer probe superstorage.san
The command gluster pool list should at this point show all five connected machines.
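The output looks roughly like this (illustration only – the UUIDs are placeholders):
# gluster pool list
UUID                                  Hostname            State
<uuid>                                gluster2.san        Connected
<uuid>                                gluster3.san        Connected
<uuid>                                gluster4.san        Connected
<uuid>                                superstorage.san    Connected
<uuid>                                localhost           Connected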
We prepare the corosync configuration (/etc/corosync/corosync.conf):
totem {
    version: 2
    cluster_name: gluster
    transport: udpu
    secauth: on
}

nodelist {
    node {
        ring0_addr: gluster1.san
        nodeid: 1
    }
    node {
        ring0_addr: gluster2.san
        nodeid: 2
    }
    node {
        ring0_addr: gluster3.san
        nodeid: 3
    }
    node {
        ring0_addr: gluster4.san
        nodeid: 4
    }
    node {
        ring0_addr: superstorage.san
        nodeid: 99
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
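Because secauth is switched on, corosync also expects an authentication key; a sketch, assuming the stock corosync-keygen tool – the key is generated once and copied to the same path on the remaining nodes:
# corosync-keygen
# scp /etc/corosync/authkey glusterN.san:/etc/corosync/authkey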
We start corosync first, and then pacemaker on all machines.
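On each machine this can be done with the usual systemd units, for example:
# systemctl enable --now corosync
# systemctl enable --now pacemaker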
Having waited long enough for the cluster to form, we take a look at crm and do the initial configuration:
- We have no shared disk resources, so we can safely switch off stonith.
- We do not want any cluster services to be run on superstorage.
# crm
crm(live) configure
crm(live) property stonith-enabled=false
crm(live) commit
crm(live) cd
crm(live) node standby superstorage.san
As a result we should get a cluster ready for services configuration:
# crm node list
gluster1.san(1): normal
        standby=off
gluster2.san(2): normal
        standby=off
gluster3.san(3): normal
        standby=off
gluster4.san(4): normal
        standby=off
superstorage.san(99): normal
        standby=on
Next, we make a volume for cluster configurations:
# gluster volume create gluster_shared_storage replica 4 gluster1.san:/var/lib/glusterd/ss_brick gluster2.san:/var/lib/glusterd/ss_brick gluster3.san:/var/lib/glusterd/ss_brick gluster4.san:/var/lib/glusterd/ss_brick
And next on all machines:
# mount -t glusterfs -o _netdev gluster.san:/gluster_shared_storage /vol/gluster_shared_storage
(plus the corresponding entry in /etc/fstab)
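Again only as an illustration, the corresponding fstab line could be something like:
gluster.san:/gluster_shared_storage /vol/gluster_shared_storage glusterfs defaults,_netdev 0 0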
This way, under the path /vol/gluster_shared_storage we get a filesystem shared between all nodes of the cluster.
I move the following there (replacing the original files/directories with symlinks – see the sketch after this list):
- corosync (/etc/corosync) configuration
- samba (/etc/samba) configuration
- nfs-ganesha (/etc/ganesha) configuration
- heketi (/etc/heketi/heketi.json) configuration
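For corosync, for example, the move could look roughly like this (the data/ subdirectory follows the layout used for heketi below; the other configurations are handled analogously):
# mkdir -p /vol/gluster_shared_storage/data
# mv /etc/corosync /vol/gluster_shared_storage/data/corosync
# ln -s /vol/gluster_shared_storage/data/corosync /etc/corosync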
At this moment we can configure application services for the cluster – in this case they will be:
- virtual IP for both networks
- heketi
- samba and nfs-ganesha services
# crm configure
crm(live) primitive gluster_san_trigger ganesha_trigger
crm(live) primitive gluster_softax_trigger ganesha_trigger
crm(live) primitive heketi systemd:heketi op monitor interval=60
crm(live) primitive nfs-ganesha systemd:nfs-ganesha op monitor interval=61
crm(live) primitive smb systemd:smb op monitor interval=59
crm(live) primitive vip_gluster_san IPaddr2 params ip=1.1.1.2 cidr_netmask=22 nic=bond1 \
op monitor interval=60
crm(live) primitive vip_gluster_softax IPaddr2 params ip=2.2.2.180 cidr_netmask=24 nic=bond0 \
op monitor interval=60
crm(live) group gluster_san vip_gluster_san gluster_san_trigger heketi
crm(live) group gluster_softax vip_gluster_softax gluster_softax_trigger vol_images smb
crm(live) clone nfs-ganesha-clone nfs-ganesha meta notify=true target-role=Started
crm(live) commit
(I ‘borrowed’ the ganesha_trigger resource from storhaug – it ensures the grace period is handled after the NFS server address is switched to another machine)
The last element that requires configuring is heketi.
Heketi is a service that automates the management of a glusterfs cluster – once the resources it is to manage have been configured (cluster machines, disks for storage), operations such as creating or deleting a volume are available (via a REST API and a command-line client) as single actions. All the necessary technical operations (e.g. when creating a new volume: choosing the machines, creating volumes at the local LVM level, formatting filesystems, creating the bricks of a glusterfs volume and finally creating the glusterfs volume itself) happen ‘in the background’.
The main heketi configuration (/etc/heketi/heketi.json) looks like this:
{
    "port": "8080",
    "use_auth": true,
    "jwt": {
        "admin": {
            "key": "<AdminSecret>"
        },
        "user": {
            "key": "<UserSecret>"
        }
    },
    "glusterfs": {
        "executor": "ssh",
        "sshexec": {
            "keyfile": "/vol/gluster_shared_storage/data/heketi/ssh/id_rsa",
            "user": "root"
        },
        "db": "/vol/gluster_shared_storage/data/heketi/heketi.db",
        "_loglevel_comment": [
            "Set log level. Choices are:",
            "  none, critical, error, warning, info, debug",
            "Default is warning"
        ],
        "loglevel": "info"
    }
}
To make heketi work, we also need to generate the ssh key referenced in the configuration and copy the public key to /root/.ssh/authorized_keys on all glusterX machines – using this key heketi will log onto the individual servers and execute the commands that manage their disk resources.
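A sketch of that key setup (the keyfile path comes from heketi.json above; the ssh-copy-id loop is just one way of distributing the public key):
# ssh-keygen -t rsa -N '' -f /vol/gluster_shared_storage/data/heketi/ssh/id_rsa
# for h in gluster1.san gluster2.san gluster3.san gluster4.san; do ssh-copy-id -i /vol/gluster_shared_storage/data/heketi/ssh/id_rsa.pub root@$h; done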
To make using heketi-cli easier, for the user I run this tool as I defined the HEKETI_CLI_USER and HEKETI_CLI_KEY environment variables (pointing at the user and key from the heketi configuration) as well as HEKETI_CLI_SERVER=http://gluster.san:8080.
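For example, in that user's ~/.bash_profile (the admin user from heketi.json, with the secret replaced by a placeholder):
export HEKETI_CLI_SERVER=http://gluster.san:8080
export HEKETI_CLI_USER=admin
export HEKETI_CLI_KEY=<AdminSecret>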
Configuration of the cluster and nodes in heketi:
$ heketi-cli cluster create --block=false --file=true
Id:cc6615faec3607dd6680f637f65ce920
$ heketi-cli node add --zone=1 --cluster=cc6615faec3607dd6680f637f65ce920 --management-host-name=gluster1.softax --storage-host-name=gluster1.san
Id:<...>
$ heketi-cli node add --zone=2 --cluster=cc6615faec3607dd6680f637f65ce920 --management-host-name=gluster2.softax --storage-host-name=gluster2.san
Id:<...>
$ heketi-cli node add --zone=3 --cluster=cc6615faec3607dd6680f637f65ce920 --management-host-name=gluster3.softax --storage-host-name=gluster3.san
Id:<...>
$ heketi-cli node add --zone=4 --cluster=cc6615faec3607dd6680f637f65ce920 --management-host-name=gluster4.softax --storage-host-name=gluster4.san
Id:<...>
$ heketi-cli device add --name=/dev/sdc2 --node=<gluster1 node id>
$ heketi-cli device add --name=/dev/sdc2 --node=<gluster2 node id>
$ heketi-cli device add --name=/dev/sdc2 --node=<gluster3 node id>
$ heketi-cli device add --name=/dev/sdc2 --node=<gluster4 node id>
Initially I tried to use "heketi-cli topology load" instead of the above, but I got discouraged – the documentation of the topology file format is poor, and the errors returned by heketi-cli, to put it gently, did not make diagnosis any easier, so it was easier and faster for me to build the infrastructure configuration in heketi with individual commands.
And this finishes configuration of the whole gluster infrastructure.
From now on we can create new volumes from the heketi-cli level, for example:
$ heketi-cli volume create --replica 3 --size 1T --name=n_images
Id:36e5608e8d85472822f0c60ee19515bf
Check what has happened:
$ gluster volume info n_images
Volume Name: n_images
Type: Replicate
Volume ID: 209975a3-eb56-458e-a702-daf1c66cd37b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster4.san:/var/lib/heketi/mounts/vg_d37a8ea80f7ceb5512427f4aae9c826c/brick_77ecb6b56a12ae7be6d3aed808a0a6f7/brick
Brick2: gluster1.san:/var/lib/heketi/mounts/vg_d511c3f47c665e45449df6c5f3636f95/brick_615c967efe3fe8f77010d4314262b4fb/brick
Brick3: gluster2.san:/var/lib/heketi/mounts/vg_059e336ddf97785018aaa3b5126217db/brick_c73e55a25326dcfa041d44cf2f57a028/brick
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
features.barrier: disable
cluster.enable-shared-storage: enable
auto-delete: enable
Success!
Volumes created like this can be shared directly (by mounting them on the target machine with the glusterfs client), or via samba or NFS running on the glusterfs cluster machines.
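A direct mount on a client machine could look roughly like this (the mount point is arbitrary):
# mount -t glusterfs gluster.san:/n_images /mnt/n_images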
Integrating a gluster cluster configured like this with OKD (OpenShift Origin) brought a few more difficulties and surprises, but I will write about that in the next article, in which I will describe my struggle to get OKD running on the physical infrastructure at Softax.