Setting up a high availability storage cluster for VMWare with Ubuntu 14.04 LTS

Double, double toil and trouble,
Fire burn, and cauldron bubble.
– Macbeth

As I’ve mentioned before, I’m currently working on a virtualization project. VMWare vSphere Essentials Plus is the virtualization software of choice, and the goal is to build an affordable redundant/clustered storage solution for the VMWare cluster.

After a lot of experimenting with Ubuntu 12.04 and Ubuntu 14.04, DRBD, Heartbeat, LIO and NFS, I finally managed to configure a WORKING high availability storage cluster based upon a daily development build of the forthcoming Ubuntu Server 14.04 LTS x64, DRBD, Heartbeat and NFS.

Why NFS and not a LIO-based iSCSI solution? Because it was much easier to set up and, if you believe certain benchmarks on the web, an NFS-based solution is even a bit faster and more responsive than an iSCSI-based one. At this point, I have not run any benchmarks myself and all I can say is that they feel almost the same.

The following hardware is involved in this test environment:

  • A Cisco Gigabit switch with “system mtu jumbo 9000” configured, so that we can use jumbo frames to improve the data transfer between the storage units and the VMWare hosts.
  • Two non-identical x64 servers with at least three Gigabit Ethernet ports.
  • Two identical x64 servers for the VMWare cluster, each with 24 GB of RAM and 6 Gigabit Ethernet ports.

Like I’ve said, this is a TEST environment. When you can AFFORD (financially, I mean) to be really serious about this, then you would of course have redundant 10 Gig Ethernet switches and 10 Gig Ethernet NICs in place. But in the real world where I live, there are always financial constraints and I have to make do with what I have.

I have three networks in place, each assigned to one NIC on the storage servers:

  • eth0: 192.168.0.0/24 — This is the “management” network
  • eth1: 10.99.99.0/24 — This is the actual storage/data transfer network
  • eth2: 172.16.99.0/24 — This is the synchronization network for the storage servers and the DRBD daemon running on them

The first storage server has the hostname storage01 and uses the IP addresses 192.168.0.34, 10.99.99.31 and 172.16.99.1.

The second storage server has the hostname storage02 and uses the IP addresses 192.168.0.35, 10.99.99.32 and 172.16.99.2.

I won’t cover the actual setup on the VMWare hosts in this little post. I have one VMWare machine using the IP address 10.99.99.11 and the other one uses 10.99.99.12 to communicate with the storage servers. Both VMWare hosts communicate with the virtual/floating IP address of the storage machines that is generated and assigned by the Heartbeat daemon running on the storage servers. That floating IP address is 10.99.99.30.

Both storage servers are using a default Ubuntu Server 14.04 LTS x64 installation with only the OpenSSH service installed on them. I used a “guided use entire disk” partition layout on both machines, in case you wonder. For testing purposes, the only thing that matters is that both servers have sufficient free disk space available to mirror the data between them.

Once Ubuntu is up and running on both machines, open a terminal on both of them and run the following commands (unless explicitly stated otherwise, you will always execute the SAME commands on BOTH machines):

Being the superuser will make life much easier, so enter a superuser shell on both storage servers:

# sudo -s

I grew up with Wordstar on CP/M and MS-DOS, so joe with its Wordstar-compatible commands is my preferred console editor. I install it now along with a few other missing packages:
# apt-get install joe traceroute python-software-properties build-essential ntp

Configure NTP to use the network time servers of your own network. Yes, you should have two of those running in your network. If not, let’s say I didn’t hear that and you are now quietly installing NTP servers before you go on with this setup.

# joe /etc/ntp.conf

server 91.151.144.1
server 91.151.144.2

Now we configure static IP addresses on each node respectively – and make sure that the data and sync NICs use Jumbo frames:

On storage01:

#  joe /etc/network/interfaces

# The primary network interface
auto eth0
iface eth0 inet static
address 192.168.0.34
netmask 255.255.255.0
network 192.168.0.0
broadcast 192.168.0.255
gateway 192.168.0.1
# dns-* options are implemented by the resolvconf package, if installed
dns-nameservers 192.168.0.41
dns-search ce-tel.net
auto eth1
iface eth1 inet static
address 10.99.99.31
netmask 255.255.255.0
network 10.99.99.0
broadcast 10.99.99.255
mtu 9000
auto eth2
iface eth2 inet static
address 172.16.99.1
netmask 255.255.255.0
network 172.16.99.0
broadcast 172.16.99.255
mtu 9000

On storage02:

#  joe /etc/network/interfaces

# The primary network interface
auto eth0
iface eth0 inet static
address 192.168.0.35
netmask 255.255.255.0
network 192.168.0.0
broadcast 192.168.0.255
gateway 192.168.0.1
# dns-* options are implemented by the resolvconf package, if installed
dns-nameservers 192.168.0.41
dns-search ce-tel.net
auto eth1
iface eth1 inet static
address 10.99.99.32
netmask 255.255.255.0
network 10.99.99.0
broadcast 10.99.99.255
mtu 9000
auto eth2
iface eth2 inet static
address 172.16.99.2
netmask 255.255.255.0
network 172.16.99.0
broadcast 172.16.99.255
mtu 9000
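
Once the interfaces are up, it is worth checking that jumbo frames really pass end to end, since a single device with a smaller MTU will quietly spoil the setup. The following is a minimal sketch, assuming the Linux iputils version of ping; the function names are my own invention and not part of any package:

```shell
# Largest ICMP payload that fits into one frame: the MTU minus 20 bytes
# of IPv4 header and 8 bytes of ICMP header.
jumbo_payload() {
    local mtu="$1"
    echo $(( mtu - 28 ))
}

# Send three pings with the "don't fragment" bit set. If any device on
# the path uses a smaller MTU, the pings fail instead of being fragmented.
check_jumbo() {
    local peer="$1" mtu="${2:-9000}"
    ping -M do -c 3 -s "$(jumbo_payload "$mtu")" "$peer"
}
```

On storage01, something like check_jumbo 10.99.99.32 and check_jumbo 172.16.99.2 would then test the data and the sync network respectively.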

On both storage servers, the following modifications need to be done to /etc/hosts:

# joe /etc/hosts

192.168.0.34    storage01.ce-tel.net

192.168.0.35  storage02.ce-tel.net

10.99.99.31     storage01
10.99.99.32     storage02
172.16.99.1     storage01-sync
172.16.99.2     storage02-sync

We won’t be using physical disk drives for DRBD in our setup. Instead, we will be using loopback devices that point to disk image files. In this scenario, I will be using a 300 GB image for our NFS storage. Modify that to your own needs and possibilities:

# mkdir -p /var/mystorage/img /var/mystorage/mnt
# dd if=/dev/zero of=/var/mystorage/img/meta.img bs=1024 count=250000
# dd if=/dev/zero of=/var/mystorage/img/data.img bs=1 seek=300G count=0
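
Note that the second dd command writes no data at all. It creates a sparse file that claims to be 300 GB large but only occupies disk space as DRBD writes to it, so keep an eye on the free space of the underlying file system. The small demo below illustrates the effect in a throwaway directory, assuming GNU dd, stat and du; the 1 GB size and the file name are examples only:

```shell
# Work in a temporary directory so that nothing real is touched.
demo_dir="$(mktemp -d)"
demo_img="$demo_dir/sparse-demo.img"

# Same technique as above: seek to the target size, write nothing.
dd if=/dev/zero of="$demo_img" bs=1 seek=1G count=0 2>/dev/null

apparent_bytes="$(stat -c %s "$demo_img")"      # size the file claims to have
allocated_kb="$(du -k "$demo_img" | cut -f1)"   # space actually used on disk

echo "apparent: $apparent_bytes bytes, allocated: $allocated_kb KB"
rm -rf "$demo_dir"
```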

This script (found somewhere on the web) will be used to bind the image files to the loopback devices after each reboot of the servers:

# joe /etc/init.d/drbdloopbacks

#!/bin/bash
#
#Startup script to create LOFSs for drbd on ubuntu / vps.net
#
#Author: Sid Sidberry <greg@halfgray.com> http://himynameissid.com
#
#Description: This script attaches files from the file system to loopback
#       devices for use as drbd partitions.
#       Two files are required, 1 for drbd meta-data and 1 for drbd data
#
#your partition files
DRBD_METADATA_SRC="/var/mystorage/img/meta.img"
DRBD_FILEDATA_SRC="/var/mystorage/img/data.img"
#loopback devices
DRBD_METADATA_DEVICE="/dev/loop6"
DRBD_FILEDATA_DEVICE="/dev/loop7"
#losetup
LOSETUP_CMD=/sbin/losetup
#make sure the src files exist
[ -x $LOSETUP_CMD ] || exit 0
[ -e "$DRBD_METADATA_SRC" ] || exit 0;
[ -e "$DRBD_FILEDATA_SRC" ] || exit 0;
#includes lsb functions
. /lib/lsb/init-functions
function connect_lofs
{
log_daemon_msg "Connecting loop devices $DRBD_METADATA_DEVICE, $DRBD_FILEDATA_DEVICE"
$LOSETUP_CMD $DRBD_METADATA_DEVICE $DRBD_METADATA_SRC
$LOSETUP_CMD $DRBD_FILEDATA_DEVICE $DRBD_FILEDATA_SRC
}
function release_lofs
{
log_daemon_msg "Releasing loop devices $DRBD_METADATA_DEVICE, $DRBD_FILEDATA_DEVICE"
$LOSETUP_CMD -d $DRBD_METADATA_DEVICE
$LOSETUP_CMD -d $DRBD_FILEDATA_DEVICE
}
case "$1" in
start)
connect_lofs
;;
release)
release_lofs
;;
stop)
release_lofs
;;
*)
echo "Usage: /etc/init.d/drbdloopbacks {start|stop|release}"
exit 1
;;
esac
exit 0

Now we’re going to make that script executable and configure it to be automatically launched on each system startup:

# chmod +x /etc/init.d/drbdloopbacks
# update-rc.d drbdloopbacks defaults 15 15

Now that we have a foundation for DRBD, let’s install and configure DRBD:

# apt-get install drbd8-utils

Make sure that the DRBD kernel module is loaded when the machine boots:

# echo 'drbd' >> /etc/modules

We’ll load the module now in our currently running servers:

# modprobe drbd

Now create a configuration file for DRBD:

# joe /etc/drbd.conf

global {
usage-count no;
}

common {
protocol C;
syncer {
rate 100M;
}
       startup {
                wfc-timeout  10;
                degr-wfc-timeout 8;
                outdated-wfc-timeout 5;
        }

}

resource mystorageres {
device   /dev/drbd0;
disk     /dev/loop7;
meta-disk /dev/loop6[0];
on storage01 {
address  172.16.99.1:7789;
}
on storage02 {
address  172.16.99.2:7789;
}
net {
after-sb-0pri   discard-younger-primary;
after-sb-1pri   consensus;
after-sb-2pri   disconnect;
}
}

In the following step, we will create the meta-data for our resource and then bring the resource “live”:

# drbdadm create-md mystorageres
# drbdadm up all

This command is to be run on storage01 ONLY. It will promote storage01 to be the primary node and it will push its data to the secondary node by overwriting all the data on storage02 (which is currently an empty shell anyway):

# drbdadm -- --overwrite-data-of-peer primary mystorageres

Verify that it’s syncing:

#  cat /proc/drbd

This should yield something like this:

version: 8.4.3 (api:1/proto:86-101)
srcversion: F97798065516C94BE0F27DC
0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
ns:107516 nr:0 dw:0 dr:108244 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:314465284
[>....................] sync'ed:  0.1% (307092/307200)M
finish: 4:52:15 speed: 17,916 (17,916) K/sec

I recommend waiting for the initial synchronization to complete.
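
If you do not feel like staring at /proc/drbd for hours, a small helper can do the waiting. This is only a sketch, assuming a single DRBD device and the status format shown above; the function names and the 30 second polling interval are my own choices:

```shell
# Print the disk states ("local/peer") of the first DRBD device.
# The optional argument lets you try the function on a saved copy.
drbd_disk_state() {
    local status_file="${1:-/proc/drbd}"
    grep -o 'ds:[^ ]*' "$status_file" | head -n 1 | cut -d: -f2
}

# Block until both nodes report UpToDate, polling every 30 seconds.
wait_for_sync() {
    until [ "$(drbd_disk_state)" = "UpToDate/UpToDate" ]; do
        sleep 30
    done
}
```

Calling wait_for_sync on storage01 returns as soon as the initial synchronization is done. It loops forever if DRBD is not running, so treat it as a convenience and not as monitoring.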

Once the synchronization process has finished, we will format the drbd0 device and mount it. Do this on storage01 ONLY, because it is the primary node:

# mkfs.ext4 /dev/drbd0
# mount /dev/drbd0 /var/mystorage/mnt

It is time to configure NFS.

On both nodes:

# apt-get install nfs-kernel-server

Since Heartbeat will be controlling NFS, we need to remove the launcher scripts from our startup configuration.

# update-rc.d -f nfs-kernel-server remove
# update-rc.d nfs-kernel-server stop 20 0 1 2 3 4 5 6 .

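We will move the NFS configuration information to the redundant DRBD device. This prepares storage01 for this (if /dev/drbd0 is still mounted from the previous step, the mount command will merely complain about that and can be skipped):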

# mount /dev/drbd0 /var/mystorage/mnt
# mv /var/lib/nfs/ /var/mystorage/mnt
# ln -s /var/mystorage/mnt/nfs/ /var/lib/nfs
# mv /etc/exports /var/mystorage/mnt
# ln -s /var/mystorage/mnt/exports /etc/exports

# mkdir /var/mystorage/mnt/export
# joe /etc/exports
           /var/mystorage/mnt/export 10.99.99.0/24(rw,async,no_root_squash,no_subtree_check,fsid=1)

On storage02, we only need to prepare this:

# rm -rf /var/lib/nfs
# ln -s /var/mystorage/mnt/nfs/ /var/lib/nfs
# rm /etc/exports
# ln -s /var/mystorage/mnt/exports /etc/exports

Now we will configure Heartbeat on both servers:

# apt-get install heartbeat

Heartbeat needs three configuration files to work properly, of which identical copies need to be placed on both servers.

We will begin with ha.cf:

# joe /etc/ha.d/ha.cf

use_logd yes
autojoin none
bcast eth1
warntime 5
deadtime 10
initdead 30
keepalive 2
logfacility local0
node storage01
node storage02
auto_failback on
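
Heartbeat expects the names on the node lines to match the output of uname -n on the respective machine, so a typo here keeps the cluster from forming. A quick check could look like the sketch below; node_in_hacf is a hypothetical helper of mine, and it only understands one node name per line, as in the file above:

```shell
# Succeeds if the given host name has a matching "node" line in ha.cf.
# Both arguments are optional and default to this machine and the real file.
node_in_hacf() {
    local name="${1:-$(uname -n)}" hacf="${2:-/etc/ha.d/ha.cf}"
    grep -Eq "^node[[:space:]]+${name}[[:space:]]*\$" "$hacf"
}
```

Running node_in_hacf && echo OK on each storage server should print OK on both of them.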

Then we will configure the authkeys file:

# joe /etc/ha.d/authkeys

auth 1
1 sha1 thisisourlittlesecret

Important: authkeys requires special permission settings:

# chmod 600 /etc/ha.d/authkeys
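
Instead of typing a secret by hand, you could let the machine come up with a random one. The following is a sketch under the assumption that sha1sum from coreutils is available; make_authkeys is a name I made up for this example:

```shell
# Write a Heartbeat authkeys file with a random secret.
# The target path is an argument so you can try it on a scratch file first.
make_authkeys() {
    local target="${1:-/etc/ha.d/authkeys}"
    local secret
    # 40 hex characters derived from 512 random bytes
    secret="$(head -c 512 /dev/urandom | sha1sum | cut -d' ' -f1)"
    (
        umask 077                   # a new file is created with mode 600
        printf 'auth 1\n1 sha1 %s\n' "$secret" > "$target"
    )
    chmod 600 "$target"             # also covers a file that already existed
}
```

Run this on ONE server only and copy the resulting file to the other one, because both nodes must share the same secret.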

Finally, we will configure the Heartbeat resources in haresources:

# joe /etc/ha.d/haresources

      storage01 IPaddr::10.99.99.30/24 drbddisk::mystorageres Filesystem::/dev/drbd0::/var/mystorage/mnt::ext4 nfs-kernel-server

In my test environment, after the NFS changes were made, the loopback devices were no longer created after a system restart. So I recommend applying these settings again:

# update-rc.d drbdloopbacks defaults

This will start Heartbeat:

# /etc/init.d/heartbeat start

But maybe you would rather reboot the machines:

# reboot
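
After Heartbeat has started, one of the two nodes should own the floating IP address. A simple way to see which one is sketched below, assuming the ip tool from iproute2; both function names are hypothetical:

```shell
# Succeeds if the given IPv4 address appears in "ip -o -4 addr show"
# style input on stdin. Reading stdin makes it easy to try on saved output.
has_address() {
    local addr="$1"
    grep -qF "inet ${addr}/"
}

# Report whether this machine currently holds the floating address.
floating_ip_status() {
    local addr="${1:-10.99.99.30}"
    if ip -o -4 addr show | has_address "$addr"; then
        echo "$(uname -n) holds $addr"
    else
        echo "$(uname -n) does not hold $addr"
    fi
}
```

With auto_failback on, I would expect floating_ip_status to name storage01 as the holder whenever both nodes are healthy.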

If everything works well, you can now configure the VMWare ESXi hosts to mount the NFS folder /var/mystorage/mnt/export on server 10.99.99.30.

Good luck!

UPDATE, April 24, 2014:

I learned the hard way that using protocol C in DRBD only makes sense when both servers have equally powerful hardware. In my environment, one machine used 15k SAS disks and the other one had 7.2k SATA disks. The result was that the server with the SATA disks severely slowed down the SAS machine, because with protocol C every write has to be confirmed by both nodes. In such cases, it might make more sense to use protocol A or B, because they don’t wait for the actual disk write on the target to be finished. I could also observe that DRBD is very CPU intensive; even the new 8-core Xeon server was running at load averages between 4.5 and 6.0, and the 8-core target server with SATA disks went beyond 8.0.

This resulted in a very simple decision: We no longer use DRBD. High availability is nice to have, but let’s face it, in most cases it’s not a show stopper if you don’t have it. After all, we don’t run life support systems here. So we now use both servers as regular standalone storage servers and use VMWare Replication to replicate the VMs every 24 hours from the main server to the backup server. That has nothing to do with high availability, it does not even replace a backup, but at least it makes a disaster recovery much simpler if that should ever be required.

iSCSI targets on Ubuntu Server 12.04/14.04

At work, I’m currently building a software-based SAN on “commodity hardware” for our forthcoming VMWare cluster. In translation that means that we want to use a regular server instead of an overly expensive, proprietary hardware solution from HP or EMC² or whoever else builds such things.

We’re going to connect the storage server via Gigabit-Ethernet and we are going to use the iSCSI protocol to connect it to the VMWare ESXi hosts.

We first tried the Windows-based StarWind solution, and it worked great. It was easy to install, easy to configure, easy to manage and the HA feature also “just worked”. But somewhere down the road we might want to turn our SAN into a high availability solution with two or three redundant nodes, and StarWind’s price list immediately killed the idea of using their software. At the time of this writing, the StarWind software for a two-node HA cluster with 4TB storage capacity starts at around EUR 4,800. While the software is headache-free and thus absolutely worth it if you have the budget, we decided that it was not for us.

Hence, once again, we began looking for an Open Source-based solution.

I’ve spent the last days testing three options that are available for Ubuntu Linux 12.04/14.04:

#1 SCST

#2 iscsitarget

#3 LIO

SCST takes you back to the early days of Linux: You actually need to patch the Linux kernel sources and then you need to compile a custom Linux kernel in order to get it running. I did this on an older dual core Intel Xeon server with 6 GB RAM and a RAID 5 with six 15k USCSI hard disks. The procedure took something between 6 and 8 hours on this machine. To add insult to injury, I had to do it twice because something didn’t work the first time. Once SCST was compiled, it would be a lie to say that it was easy to configure and that it installed itself properly on the Ubuntu server. It didn’t. I eventually got it to publish an iSCSI target, but I had to launch the necessary daemons manually after each reboot. I didn’t spend any additional time to fix this. SCST worked, yes, it felt quite fast, yes, but we decided not to pursue this option any further. A solution that requires custom Linux kernels to run is not really maintainable when you are a user and not an OEM who distributes its own operating system with its own hardware. When you choose SCST on Ubuntu, I think you are actively choosing to NEVER update your server again unless you really want to build your own new kernel every single time Canonical rolls out a new operating system kernel. That certainly was not what we had in mind for our storage server, so SCST was out.

iscsitarget didn’t work on Ubuntu 12.04.4. It didn’t work on a daily build of Ubuntu 14.04 either. An “iscsi_trgt” kernel module was reported missing and the iscsitarget-dkms package did NOT deliver a proper fix for this. On Ubuntu 12.04 it was impossible to build the required kernel module for various reasons that have been published on several blogs on the web. On Ubuntu 14.04, I did not even try to fix this anymore. It shouldn’t be the user’s job to work around known problems in operating systems, and I would have expected Canonical to ship their forthcoming flagship release 14.04 LTS with a software repository that actually WORKS. So iscsitarget was also out.

LIO. It’s built into the kernel. Unless Linus Torvalds and his hackers screw things up, no new Linux kernel in Ubuntu should ever have the chance of breaking this solution. To get LIO to work, you only need to install a command line administration tool to create and publish your iSCSI targets. targetcli does this job and it responds to commands like create lun0 /var/mystoragefile 10g, which creates a file-backed LUN. The tool lets you navigate through a hierarchical structure of the available configuration options; you use the cd command to move in the hierarchy, ls shows you what is available at the current position and with commands like create or set you can create or modify settings. That sounds manageable. I followed a short guide that I found on www.linuxclustering.net and two minutes later I had an iSCSI target published and accessible from my VMWare hosts. I cannot yet say if LIO is faster, slower or as fast as SCST. I’m not even sure if it really matters – it’s easy to set up and by no means a maintenance nightmare. “It just works.” And it works out of the box. That is the important aspect.

My next challenge will be to use LIO in an HA environment (DRBD and Heartbeat or Pacemaker/Corosync will be involved in this).