Case Study: Installation and configuration of the GlusterFS cluster
I would like to share my (perhaps slightly unusual) configuration of a GlusterFS cluster.
Warning: This text should not be treated as a 'how-to' that describes precisely and literally every step needed to carry out the whole installation. I have tried to describe all the important steps, but I focused more on the idea behind them, so there may be certain simplifications in the text.
I had been considering highly available storage for quite a while. I chose gluster mainly because of its low barrier to entry: compared with a potential alternative, Ceph, it needs noticeably less hardware to start up a minimal reasonable configuration.
I needed gluster for 2 reasons:
- as a capacious data store (not necessarily very fast, but highly available), e.g. for installation images of systems/applications and a repository/archive for builds of software created in the company
- in the longer term, I planned to run a local OKD installation, for which I needed persistent storage for containers.
These requirements drove the design decisions: integration with OKD required heketi (OKD integrates with gluster through it), while applications outside OKD required a non-containerized gluster setup on machines separate from OKD.
Equipment
I used 4 machines in similar configurations to build it:
- 2U chassis with room for twelve 3.5" hard drives, holding eleven 8 TB or 10 TB disks, plus two 2.5" SATA SSDs per system (in separate slots at the back of the server)
- Supermicro X11SSL-F mainboard (I really like this board for 'economy servers' that do not need many PCIe lanes or lots of CPU cores)
- Intel Xeon E3-12xx v6 processor
- DDR4 ECC (2x 16GB) memory
- RAID controller with cache and battery backup/flash (different models - Microsemi(Adaptec) or Broadcom(LSI))
- additional network card 2x10Gbit
I configured the system drives as a mirror using the mainboard's software RAID. I connected the disks for the main storage to the RAID controller and configured RAID-10 on them, with one spare disk.
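The capacity of such an array is easy to check by hand. The sketch below assumes a machine with eleven 10 TB disks, one of them a hot spare; the function name is made up for this illustration. With these numbers it prints 50, which agrees with the 50.0TB disk size that parted reports further down.

```shell
# Usable capacity of a RAID-10 array with hot spares, in whole terabytes.
# Arguments: total number of disks, number of hot spares, size of one disk in TB.
raid10_usable_tb() {
    total=$1; spares=$2; disk_tb=$3
    members=$(( total - spares ))
    # RAID-10 mirrors pairs of disks, so it needs an even number of members.
    if [ $(( members % 2 )) -ne 0 ]; then
        echo "RAID-10 needs an even number of member disks" >&2
        return 1
    fi
    # Half of the members hold mirror copies, so only half count as capacity.
    echo $(( members / 2 * disk_tb ))
}

raid10_usable_tb 11 1 10   # 11 disks, 1 spare, 10 TB each (assumed figures)
```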
As the operating system I chose CentOS 7 in a fairly minimal configuration, installed on the dedicated SSD drives.
On the 2 network ports built into the mainboard I configured the bond0 interface, used mainly for administrative purposes; on the 10Gbit ports I configured a second interface, bond1, for the SAN/gluster network.
Each machine is visible as glusterN.softax (1Gbit network) and glusterN.san (10Gbit network).
Install the necessary cluster software:
# yum install centos-release-gluster6
# yum install glusterfs-server nfs-ganesha-gluster glusterfs-resource-agents
# yum install heketi heketi-client
The default clustering solution proposed together with glusterfs is CTDB. I did not much like it (perhaps because I do not know it very well), and balancing the SMB load between the machines of the cluster, which is where CTDB pays off, is not vital for me. I definitely preferred pacemaker, which I know very well and have been using for many years in different applications.
Unfortunately, choosing pacemaker means that we cannot use the 'automagic' storhaug tool for configuring the glusterfs cluster; everything has to be configured manually.
# yum install corosync pacemaker pacemaker-cli
My favourite tool for (re)configuring pacemaker clusters is crmsh:
# rpm -ihv crmsh-3.0.0-6.2.noarch.rpm python-parallax-1.0.1-29.1.noarch.rpm
Because heketi wants exclusive control of any device or partition it is given, I created a separate partition on the main data storage for everything that is not heketi's. For now I will use only a small part of it, but it is easier to leave a bit of space than to struggle with resizing the partition later on.
The final partitioning of the main storage (not counting the system disks) is as follows:
# parted /dev/sdc
GNU Parted 3.1
Using /dev/sdc
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: ASR8160 storage (scsi)
Disk /dev/sdc: 50.0TB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number  Start   End     Size    File system  Name   Flags
 1      1049kB  137GB   137GB                infra  lvm
 2      137GB   50.0TB  49.8TB               data   lvm
Partition sdc2 is reserved for heketi; I do not configure anything manually on it.
On partition sdc1 I set up a volume group vgclusterinfoN and a gluster_shared_storage volume, which I then mount in the directory /var/lib/glusterd/ss_brick:
# pvcreate /dev/sdc1
# vgcreate vgclusterinfoN /dev/sdc1
# lvcreate --type thin-pool -l100%free -n thinpool vgclusterinfoN
# lvcreate -V1G -n gluster_shared_storage vgclusterinfoN/thinpool
# mkfs.xfs -i size=512 /dev/vgclusterinfoN/gluster_shared_storage
# mount /dev/vgclusterinfoN/gluster_shared_storage /var/lib/glusterd/ss_brick
(of course, we must also remember to add a new entry to /etc/fstab)
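A minimal sketch of that fstab step, written so that it can be repeated safely on each machine. The helper name is hypothetical and the mount options in the entry (xfs with defaults) are an assumption, not taken from the actual setup.

```shell
# Append a line to an fstab-style file only if that exact line is not there yet.
# Arguments: the line to add, and the file to edit (defaults to /etc/fstab).
ensure_fstab_entry() {
    entry=$1
    fstab=${2:-/etc/fstab}
    # -x matches whole lines, -F treats the entry as a literal string.
    grep -qxF -- "$entry" "$fstab" 2>/dev/null || printf '%s\n' "$entry" >> "$fstab"
}

# Assumed entry for the brick filesystem; the options are a guess.
BRICK_ENTRY='/dev/vgclusterinfoN/gluster_shared_storage /var/lib/glusterd/ss_brick xfs defaults 0 0'
```

Running `ensure_fstab_entry "$BRICK_ENTRY"` as root would then add the line once, however many times the command is repeated.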
After completing the above configuration on all 4 machines, we can start configuring gluster.
We have 4 machines, and an odd number is much better for establishing quorum. That is why I added a 5th machine (superstorage, a server used for other purposes). I will not run any services or volumes on it; it only takes part in quorum.
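The arithmetic behind the fifth machine can be sketched as follows, assuming every node carries one vote and quorum means a strict majority. The function names are made up for the illustration.

```shell
# Votes needed for a strict majority among n equally weighted nodes.
quorum_votes() { echo $(( $1 / 2 + 1 )); }

# Number of nodes that can fail while the rest still hold a majority.
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 4 5; do
    echo "$n nodes: quorum $(quorum_votes "$n"), tolerates $(tolerated_failures "$n") failure(s)"
done
```

Both sizes need 3 votes, but 4 nodes survive only one failure (and a 2-2 network split leaves neither half with quorum), while 5 nodes survive two.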
We create a gluster cluster (we connect the machines with one another):
# gluster peer probe gluster2.san
# gluster peer probe gluster3.san
# gluster peer probe gluster4.san
# gluster peer probe superstorage.san
At this point the command gluster pool list should show us all 5 connected machines.
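This check can be scripted. The sketch below assumes that gluster pool list prints one header line followed by one row per machine, with the state in the last column; the function name is hypothetical.

```shell
# Count peers reported as Connected.
# Reads `gluster pool list` output on stdin; assumes a header line followed
# by rows of the form "UUID  Hostname  State".
count_connected_peers() {
    awk 'NR > 1 && $NF == "Connected" { n++ } END { print n + 0 }'
}
```

On a cluster node, `gluster pool list | count_connected_peers` should then print 5 once all machines have joined.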
We prepare the configuration for corosync (/etc/corosync/corosync.conf):
totem {
    version: 2
    cluster_name: gluster
    transport: udpu
    secauth: on
}
nodelist {
    node {
        ring0_addr: gluster1.san
        nodeid: 1
    }
    node {
        ring0_addr: gluster2.san
        nodeid: 2
    }
    node {
        ring0_addr: gluster3.san
        nodeid: 3
    }
    node {
        ring0_addr: gluster4.san
        nodeid: 4
    }
    node {
        ring0_addr: superstorage.san
        nodeid: 99
    }
}
quorum {
    provider: corosync_votequorum
}
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
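Because secauth is on, every node also needs the same /etc/corosync/authkey file (corosync-keygen generates one). We then start corosync first, and then pacemaker, on all machines.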
After waiting long enough for the cluster to form, we have a look at crm and make the initial configuration:
- We have no shared disk resources, so we can safely switch off stonith.
- We do not want any cluster services to run on superstorage.
# crm
crm(live)# configure
crm(live)configure# property stonith-enabled=false
crm(live)configure# commit
crm(live)configure# cd
crm(live)# node standby superstorage.san
As a result we should get a cluster ready for services configuration:
# crm node list
gluster1.san(1): normal
    standby=off
gluster2.san(2): normal
    standby=off
gluster3.san(3): normal
    standby=off
gluster4.san(4): normal
    standby=off
superstorage.san(99): normal
    standby=on
Next, we create and start a volume for the cluster configuration:
# gluster volume create gluster_shared_storage replica 4 gluster1.san:/var/lib/glusterd/ss_brick gluster2.san:/var/lib/glusterd/ss_brick gluster3.san:/var/lib/glusterd/ss_brick gluster4.san:/var/lib/glusterd/ss_brick
# gluster volume start gluster_shared_storage
And then, on all machines:
# mount -t glusterfs -o _netdev gluster.san:/gluster_shared_storage /vol/gluster_shared_storage
(plus a corresponding entry in /etc/fstab)
This way the path /vol/gluster_shared_storage holds a filesystem shared between all nodes of the cluster.
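For the fstab side of this mount, the line has the usual six fields, with the device written in the server:/volume form that the glusterfs mount helper expects. The sketch below only builds and prints such a line; the function name is hypothetical and the option set (defaults,_netdev) is an assumption.

```shell
# Build an fstab line for a GlusterFS client mount.
# Arguments: server name, volume name, mount point.
gluster_fstab_line() {
    # _netdev delays the mount until the network is available.
    printf '%s:/%s %s glusterfs defaults,_netdev 0 0\n' "$1" "$2" "$3"
}

gluster_fstab_line gluster.san gluster_shared_storage /vol/gluster_shared_storage
```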
I move the following to it (replacing the original files/directories with symbolic links):
- corosync (/etc/corosync) configuration
- samba (/etc/samba) configuration
- nfs-ganesha (/etc/ganesha) configuration
- heketi (/etc/heketi/heketi.json) configuration
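The move-and-link step for each of these items might look like the sketch below. The function name and the etc subdirectory on the shared volume are hypothetical, and the affected service is assumed to be stopped while its configuration is moved. The first node to run it populates the shared copy; on the other nodes the local copy is set aside and only the link is created.

```shell
# Move a configuration file or directory onto shared storage and leave a
# symlink in its place.
# Arguments: original path, target directory on the shared volume.
relocate_to_shared() {
    src=$1
    dest="$2/$(basename "$src")"
    # Already a symlink: nothing to do, so the function is safe to re-run.
    if [ -L "$src" ]; then
        return 0
    fi
    if [ -e "$dest" ]; then
        # Another node already populated the shared copy; keep ours as a backup.
        mv "$src" "$src.orig"
    else
        mv "$src" "$dest"
    fi
    ln -s "$dest" "$src"
}
```

A call such as `relocate_to_shared /etc/samba /vol/gluster_shared_storage/etc` would be repeated on each node and for each item in the list.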
At this point we can configure the application services for the cluster. In this case they will be:
- virtual IP for both networks
- heketi
- samba and nfs-ganesha services
# crm configure
crm(live)configure# primitive gluster_san_trigger ganesha_trigger
crm(live)configure# primitive gluster_softax_trigger ganesha_trigger
crm(live)configure# primitive heketi systemd:heketi op monitor interval=60
crm(live)configure# primitive nfs-ganesha systemd:nfs-ganesha op monitor interval=61
crm(live)configure# primitive smb systemd:smb op monitor interval=59
crm(live)configure# primitive vip_gluster_san IPaddr2 params ip=1.1.1.2 cidr_netmask=22 nic=bond1 \
    op monitor interval=60
crm(live)configure# primitive vip_gluster_softax IPaddr2 params ip=2.2.2.180 cidr_netmask=24 nic=bond0 \
    op monitor interval=60
crm(live)configure# group gluster_san vip_gluster_san gluster_san_trigger heketi
crm(live)configure# group gluster_softax vip_gluster_softax gluster_softax_trigger vol_images smb
crm(live)configure# clone nfs-ganesha-clone nfs-ganesha meta notify=true target-role=Started
crm(live)configure# commit
(I 'borrowed' the ganesha_trigger resource from storhaug; it takes care of the NFS grace period after the NFS server address moves to another machine)
The last element that requires configuring is heketi.
Heketi is a service that automates the management of a glusterfs cluster. Once the resources it is to manage (cluster machines, disks for storage) have been configured, operations such as creating or deleting a volume become single actions, available through a REST API and a command-line client. All the necessary technical steps (for a new volume: choosing the machines, creating logical volumes in the local LVM, formatting filesystems, creating the bricks and finally creating the glusterfs volume itself) happen 'in the background'.
The main heketi configuration, /etc/heketi/heketi.json, looks like this:
{
    "port": "8080",
    "use_auth": true,
    "jwt": {
        "admin": {
            "key": "<AdminSecret>"
        },
        "user": {
            "key": "<UserSecret>"
        }
    },
    "glusterfs": {
        "executor": "ssh",
        "sshexec": {
            "keyfile": "/vol/gluster_shared_storage/data/heketi/ssh/id_rsa",
            "user": "root"
        },
        "db": "/vol/gluster_shared_storage/data/heketi/heketi.db",
        "_loglevel_comment": [
            "Set log level. Choices are:",
            " none, critical, error, warning, info, debug",
            "Default is warning"
        ],
        "loglevel": "info"
    }
}
For heketi to work, we also need to generate the ssh key referenced in the configuration and copy its public key to /root/.ssh/authorized_keys on all glusterX machines. Heketi uses this key to log in to the individual servers and run the commands that manage their disk resources.
To make heketi-cli easier to use, I defined the environment variables HEKETI_CLI_USER and HEKETI_CLI_KEY for the user account I run the tool from (set to the user and key from the heketi configuration), along with HEKETI_CLI_SERVER=http://gluster.san:8080.
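As a sketch, these settings could go into the shell profile of that user. Choosing the admin account is an assumption here, and the key is the placeholder from the configuration file, not a real value.

```shell
# Connection settings read by heketi-cli, e.g. placed in ~/.bashrc.
export HEKETI_CLI_SERVER=http://gluster.san:8080
# Assumed: the "admin" entry of the jwt section in heketi.json.
export HEKETI_CLI_USER=admin
# Placeholder; substitute the key configured for that user.
export HEKETI_CLI_KEY='<AdminSecret>'
```

With the variables exported, a call such as `heketi-cli cluster list` needs no extra connection options.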
Configuration of the cluster and nodes in heketi:
$ heketi-cli cluster create --block=false --file=true
Id:cc6615faec3607dd6680f637f65ce920
$ heketi-cli node add --zone=1 --cluster=cc6615faec3607dd6680f637f65ce920 --management-host-name=gluster1.softax --storage-host-name=gluster1.san
Id:<...>
$ heketi-cli node add --zone=2 --cluster=cc6615faec3607dd6680f637f65ce920 --management-host-name=gluster2.softax --storage-host-name=gluster2.san
Id:<...>
$ heketi-cli node add --zone=3 --cluster=cc6615faec3607dd6680f637f65ce920 --management-host-name=gluster3.softax --storage-host-name=gluster3.san
Id:<...>
$ heketi-cli node add --zone=4 --cluster=cc6615faec3607dd6680f637f65ce920 --management-host-name=gluster4.softax --storage-host-name=gluster4.san
Id:<...>
$ heketi-cli device add --name=/dev/sdc2 --node=<gluster1 node id>
$ heketi-cli device add --name=/dev/sdc2 --node=<gluster2 node id>
$ heketi-cli device add --name=/dev/sdc2 --node=<gluster3 node id>
$ heketi-cli device add --name=/dev/sdc2 --node=<gluster4 node id>
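Since the four node add commands differ only in the machine number, they can also be generated in a loop. The sketch below only prints the commands so that they can be reviewed before running; the function name is made up, and the cluster id is the one returned by the cluster create step above.

```shell
# Print the "node add" command for each of the four gluster machines.
# Argument: the cluster id returned by "heketi-cli cluster create".
print_node_add_commands() {
    cluster_id=$1
    for n in 1 2 3 4; do
        # Each machine goes into its own zone, as in the commands above.
        echo "heketi-cli node add --zone=$n --cluster=$cluster_id" \
             "--management-host-name=gluster$n.softax --storage-host-name=gluster$n.san"
    done
}

print_node_add_commands cc6615faec3607dd6680f637f65ce920
```

The device add commands still have to be issued afterwards, because they need the node ids that heketi returns.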
Initially I tried to use "heketi-cli topology load" instead of the commands above, but I gave up: the documentation of the topology file format is poor, and the errors returned by heketi-cli did not, to put it mildly, help with diagnosis. It was easier and faster to build the infrastructure configuration in heketi with individual commands.
And this completes the configuration of the whole gluster infrastructure.
From now on we can create new volumes with heketi-cli, for example:
$ heketi-cli volume create --replica 3 --size 1T --name=n_images
Id:36e5608e8d85472822f0c60ee19515bf
Check what has happened:
$ gluster volume info n_images
Volume Name: n_images
Type: Replicate
Volume ID: 209975a3-eb56-458e-a702-daf1c66cd37b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster4.san:/var/lib/heketi/mounts/vg_d37a8ea80f7ceb5512427f4aae9c826c/brick_77ecb6b56a12ae7be6d3aed808a0a6f7/brick
Brick2: gluster1.san:/var/lib/heketi/mounts/vg_d511c3f47c665e45449df6c5f3636f95/brick_615c967efe3fe8f77010d4314262b4fb/brick
Brick3: gluster2.san:/var/lib/heketi/mounts/vg_059e336ddf97785018aaa3b5126217db/brick_c73e55a25326dcfa041d44cf2f57a028/brick
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
features.barrier: disable
cluster.enable-shared-storage: enable
auto-delete: enable
Success!
Volumes created like this can be shared directly (by mounting them on the target machine with the glusterfs client) or through samba or NFS running on the glusterfs cluster machines.
Integrating a gluster cluster configured like this with OKD (OpenShift Origin) brought some more difficulties and surprises, but I will write about that in the next article, in which I will describe my struggles to run OKD on the physical infrastructure at Softax.