Double, double toil and trouble,
Fire burn, and cauldron bubble.
– Macbeth
As I’ve mentioned before, I’m currently working on a virtualization project. VMWare vSphere Essentials Plus is the virtualization software of choice, and the goal is to build an affordable redundant/clustered storage solution for the VMWare cluster.
After a lot of experimenting with Ubuntu 12.04 and Ubuntu 14.04, DRBD, Heartbeat, LIO and NFS, I finally managed to configure a WORKING high availability storage cluster based upon a daily development build of the forthcoming Ubuntu Server 14.04 LTS x64, DRBD, Heartbeat and NFS.
Why NFS and not a LIO-based iSCSI solution? Because it was much easier to set up and, if you believe certain benchmarks on the web, an NFS-based solution is even a bit faster and more responsive than an iSCSI-based one. At this point, I have not run any benchmarks myself, and all I can say is that they feel almost the same.
The following hardware is involved in this test environment:
- A Cisco Gigabit switch with “system mtu jumbo 9000” configured, so that we can use jumbo frames to improve the data transfer between the storage units and the VMWare hosts.
- Two non-identical x64 servers with at least three Gigabit Ethernet ports.
- Two identical x64 servers for the VMWare cluster, each with 24 GB of RAM and 6 Gigabit Ethernet ports.
As I’ve said, this is a TEST environment. If you can AFFORD (financially, I mean) to be really serious about this, you would of course have redundant 10 Gigabit Ethernet switches and 10 Gigabit NICs in place. But in the real world where I live, there are always financial constraints, and I have to make do with what I have.
I have three networks in place, each assigned to one NIC on the storage servers:
- eth0: 192.168.0.0/24 — This is the “management” network
- eth1: 10.99.99.0/24 — This is the actual storage/data transfer network
- eth2: 172.16.99.0/24 — This is the synchronization network for the storage servers and the DRBD daemon running on them
The first storage server has the hostname storage01 and uses the IP addresses 192.168.0.34, 10.99.99.31 and 172.16.99.1.
The second storage server has the hostname storage02 and uses the IP addresses 192.168.0.35, 10.99.99.32 and 172.16.99.2.
I won’t cover the actual setup of the VMWare hosts in this little post. One VMWare machine uses the IP address 10.99.99.11 and the other one uses 10.99.99.12 to communicate with the storage servers. Both VMWare hosts talk to the virtual/floating IP address of the storage machines, which is assigned and managed by the Heartbeat daemon running on the storage servers. That floating IP address is 10.99.99.30.
Both storage servers are using a default Ubuntu Server 14.04 LTS x64 installation with only the OpenSSH service installed on them. I used a “guided use entire disk” partition layout on both machines, in case you wonder. For testing purposes, the only thing that matters is that both servers have sufficient free disk space available to mirror the data between them.
Once Ubuntu is up and running on both machines, open a terminal on both of them and run the following commands (unless explicitly stated otherwise, you will always execute the SAME commands on BOTH machines):
Being the superuser will make life much easier, so enter a superuser shell on both storage servers:
# sudo -s
I grew up with Wordstar on CP/M and MS-DOS, so joe with its Wordstar-compatible commands is my preferred console editor. I install it now along with a few other missing packages:
# apt-get install joe traceroute python-software-properties build-essential ntp
Configure NTP to use the network time servers of your own network. Yes, you should have two of those running in your network. If not, let’s say I didn’t hear that and you are now quietly installing NTP servers before you go on with this setup.
# joe /etc/ntp.conf
server 91.151.144.1
server 91.151.144.2
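If you want to be sure that NTP picks up the new servers right away, you can restart the service and check the peer status:
# service ntp restart
# ntpq -p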
Now we configure static IP addresses on each node respectively – and make sure that the data and sync NICs use Jumbo frames:
On storage01:
# joe /etc/network/interfaces
# The primary (management) network interface
auto eth0
iface eth0 inet static
    address 192.168.0.34
    netmask 255.255.255.0
    network 192.168.0.0
    broadcast 192.168.0.255
    gateway 192.168.0.1
    # dns-* options are implemented by the resolvconf package, if installed
    dns-nameservers 192.168.0.41
    dns-search ce-tel.net

# The storage/data network interface (jumbo frames)
auto eth1
iface eth1 inet static
    address 10.99.99.31
    netmask 255.255.255.0
    network 10.99.99.0
    broadcast 10.99.99.255
    mtu 9000

# The DRBD synchronization network interface (jumbo frames)
auto eth2
iface eth2 inet static
    address 172.16.99.1
    netmask 255.255.255.0
    network 172.16.99.0
    broadcast 172.16.99.255
    mtu 9000
On storage02:
# joe /etc/network/interfaces
# The primary (management) network interface
auto eth0
iface eth0 inet static
    address 192.168.0.35
    netmask 255.255.255.0
    network 192.168.0.0
    broadcast 192.168.0.255
    gateway 192.168.0.1
    # dns-* options are implemented by the resolvconf package, if installed
    dns-nameservers 192.168.0.41
    dns-search ce-tel.net

# The storage/data network interface (jumbo frames)
auto eth1
iface eth1 inet static
    address 10.99.99.32
    netmask 255.255.255.0
    network 10.99.99.0
    broadcast 10.99.99.255
    mtu 9000

# The DRBD synchronization network interface (jumbo frames)
auto eth2
iface eth2 inet static
    address 172.16.99.2
    netmask 255.255.255.0
    network 172.16.99.0
    broadcast 172.16.99.255
    mtu 9000
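After a reboot (or an ifdown/ifup cycle of eth1 and eth2 on both machines), a quick ping with the don't-fragment flag and an 8972-byte payload (9000 bytes minus 28 bytes of IP/ICMP headers) confirms that jumbo frames really make it across the switch. From storage01, for example:
# ping -M do -s 8972 -c 3 10.99.99.32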
On both storage servers, the following modifications need to be done to /etc/hosts:
# joe /etc/hosts
192.168.0.34 storage01.ce-tel.net
192.168.0.35 storage02.ce-tel.net
10.99.99.31 storage01
10.99.99.32 storage02
172.16.99.1 storage01-sync
172.16.99.2 storage02-sync
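A quick sanity check never hurts: from storage01, the sync name of its partner should resolve and answer on the 172.16.99.0/24 network (and the other way around on storage02):
# ping -c 1 storage02-sync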
We won’t be using physical disk drives for DRBD in our setup. Instead, we will be using loopback devices that point to disk image files. In this scenario, I will be using a 300 GB image for our NFS storage. Adjust that to your own needs and available disk space:
# mkdir -p /var/mystorage/img /var/mystorage/mnt
# dd if=/dev/zero of=/var/mystorage/img/meta.img bs=1024 count=250000
# dd if=/dev/zero of=/var/mystorage/img/data.img bs=1 seek=300G count=0
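The second dd call does not actually write 300 GB of zeroes. The seek=300G trick merely creates a sparse file with that apparent size, which only occupies real disk space as DRBD and NFS start writing to it. You can see the difference between apparent and allocated size like this:
# ls -lhs /var/mystorage/img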
This script (found somewhere on the web) will be used to bind the image files to the loopback devices after each reboot of the servers:
# joe /etc/init.d/drbdloopbacks
#!/bin/bash
#
#Startup script to create LOFSs for drbd on ubuntu / vps.net
#
#Author: Sid Sidberry <greg@halfgray.com> http://himynameissid.com
#
#Description: This script attaches files from the file system to loopback
# devices for use as drbd partitions.
# Two files are required, 1 for drbd meta-data and 1 for drbd data
#
# your partition files
DRBD_METADATA_SRC="/var/mystorage/img/meta.img"
DRBD_FILEDATA_SRC="/var/mystorage/img/data.img"
#loopback devices
DRBD_METADATA_DEVICE="/dev/loop6"
DRBD_FILEDATA_DEVICE="/dev/loop7"
#losetup
LOSETUP_CMD=/sbin/losetup
#make sure the src files exist
[ -x $LOSETUP_CMD ] || exit 0
[ -e "$DRBD_METADATA_SRC" ] || exit 0;
[ -e "$DRBD_FILEDATA_SRC" ] || exit 0;
#includes lsb functions
. /lib/lsb/init-functions
function connect_lofs
{
    log_daemon_msg "Connecting loop devices $DRBD_METADATA_DEVICE, $DRBD_FILEDATA_DEVICE"
    $LOSETUP_CMD $DRBD_METADATA_DEVICE $DRBD_METADATA_SRC
    $LOSETUP_CMD $DRBD_FILEDATA_DEVICE $DRBD_FILEDATA_SRC
}

function release_lofs
{
    log_daemon_msg "Releasing loop devices $DRBD_METADATA_DEVICE, $DRBD_FILEDATA_DEVICE"
    $LOSETUP_CMD -d $DRBD_METADATA_DEVICE
    $LOSETUP_CMD -d $DRBD_FILEDATA_DEVICE
}

case "$1" in
    start)
        connect_lofs
        ;;
    release)
        release_lofs
        ;;
    stop)
        release_lofs
        ;;
    *)
        echo "Usage: /etc/init.d/drbdloopbacks {start|stop|release}"
        exit 1
        ;;
esac
exit 0
Now we’re going to make that script executable and configure it to be automatically launched on each system startup:
# chmod +x /etc/init.d/drbdloopbacks
# update-rc.d drbdloopbacks defaults 15 15
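Before moving on, it is worth checking that the script actually attaches both images. losetup -a should list /dev/loop6 and /dev/loop7 pointing at the two image files:
# /etc/init.d/drbdloopbacks start
# losetup -a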
Now that we have a foundation for DRBD, let’s install and configure DRBD:
# apt-get install drbd8-utils
Make sure that the DRBD kernel module is loaded when the machine boots:
# echo 'drbd' >> /etc/modules
We’ll load the module now in our currently running servers:
# modprobe drbd
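A quick check that the module is really loaded (at this point, /proc/drbd only shows the version line, since no resource has been configured yet):
# lsmod | grep drbd
# cat /proc/drbd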
Now create a configuration file for DRBD:
# joe /etc/drbd.conf
global {
    usage-count no;
}

common {
    protocol C;
    syncer {
        rate 100M;
    }
    startup {
        wfc-timeout 10;
        degr-wfc-timeout 8;
        outdated-wfc-timeout 5;
    }
}

resource mystorageres {
    device /dev/drbd0;
    disk /dev/loop7;
    meta-disk /dev/loop6[0];
    on storage01 {
        address 172.16.99.1:7789;
    }
    on storage02 {
        address 172.16.99.2:7789;
    }
    net {
        after-sb-0pri discard-younger-primary;
        after-sb-1pri consensus;
        after-sb-2pri disconnect;
    }
}
In the following step, we will create the meta-data for our resource and then bring the resource “live”:
# drbdadm create-md mystorageres
# drbdadm up all
This command is to be run on storage01 ONLY. It will promote storage01 to the primary node and push its data to the secondary node, overwriting all the data on storage02 (which is currently an empty shell anyway):
# drbdadm -- --overwrite-data-of-peer primary mystorageres
Verify that it’s syncing:
# cat /proc/drbd
This should yield something like this:
version: 8.4.3 (api:1/proto:86-101)
srcversion: F97798065516C94BE0F27DC
0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
ns:107516 nr:0 dw:0 dr:108244 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:314465284
[>....................] sync'ed: 0.1% (307092/307200)M
finish: 4:52:15 speed: 17,916 (17,916) K/sec
I recommend waiting for the initial synchronization to complete.
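If you are as impatient as I am, you can keep an eye on the progress like this:
# watch -n 2 cat /proc/drbd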
Once the synchronization process has finished, we will format the drbd0 device and mount it. Since /dev/drbd0 can only be written to on the primary node, run these two commands on storage01 ONLY:
# mkfs.ext4 /dev/drbd0
# mount /dev/drbd0 /var/mystorage/mnt
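If everything went well, df now shows the new file system and /proc/drbd reports Primary/Secondary with both sides UpToDate:
# df -h /var/mystorage/mnt
# cat /proc/drbd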
It is time to configure NFS.
On both nodes:
# apt-get install nfs-kernel-server
Since Heartbeat will be controlling NFS, we need to remove the launcher scripts from our startup configuration.
# update-rc.d -f nfs-kernel-server remove
# update-rc.d nfs-kernel-server stop 20 0 1 2 3 4 5 6 .
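Since the NFS server was started right after the package installation, I would also stop it now; from this point on, only Heartbeat is supposed to start it:
# service nfs-kernel-server stop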
We will move the NFS state and configuration onto the redundant DRBD device. The following prepares storage01 for this:
# mount /dev/drbd0 /var/mystorage/mnt
# mv /var/lib/nfs/ /var/mystorage/mnt
# ln -s /var/mystorage/mnt/nfs/ /var/lib/nfs
# mv /etc/exports /var/mystorage/mnt
# ln -s /var/mystorage/mnt/exports /etc/exports
# mkdir /var/mystorage/mnt/export
# joe /etc/exports
/var/mystorage/mnt/export 10.99.99.0/24(rw,async,no_root_squash,no_subtree_check,fsid=1)
On storage02, we only need to prepare the corresponding symlinks:
# rm -rf /var/lib/nfs
# ln -s /var/mystorage/mnt/nfs/ /var/lib/nfs
# rm /etc/exports
# ln -s /var/mystorage/mnt/exports /etc/exports
Now we will configure Heartbeat on both servers:
# apt-get install heartbeat
Heartbeat needs three configuration files to work properly, of which identical copies need to be placed on both servers.
We will begin with ha.cf:
# joe /etc/ha.d/ha.cf
use_logd yes
autojoin none
bcast eth1
warntime 5
deadtime 10
initdead 30
keepalive 2
logfacility local0
node storage01
node storage02
auto_failback on
Then we will configure the authkeys file:
# joe /etc/ha.d/authkeys
auth 1
1 sha1 thisisourlittlesecret
Important: authkeys requires special permission settings:
# chmod 600 /etc/ha.d/authkeys
Finally, we will configure the Heartbeat resources in haresources:
# joe /etc/ha.d/haresources
storage01 IPaddr::10.99.99.30/24 drbddisk::mystorageres Filesystem::/dev/drbd0::/var/mystorage/mnt::ext4 nfs-kernel-server
In my test environment, after the NFS changes were made, the loopback devices were no longer created after a system restart. So I recommend applying these settings again:
# update-rc.d drbdloopbacks defaults
This will start Heartbeat:
# /etc/init.d/heartbeat start
But maybe you would rather reboot the machines:
# reboot
If everything works well, you can now configure the VMWare ESXi hosts to mount the NFS folder /var/mystorage/mnt/export on server 10.99.99.30.
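Before pointing the ESXi hosts at it, you can verify the export from any Linux client on the 10.99.99.0/24 network (showmount comes with the nfs-common package):
# showmount -e 10.99.99.30
A simple failover test is also worthwhile: stop Heartbeat on storage01 and, within a few seconds, storage02 should take over the floating IP, become the DRBD primary, mount the file system and start the NFS server. Thanks to auto_failback, everything moves back as soon as Heartbeat on storage01 is running again.
On storage01:
# /etc/init.d/heartbeat stop
On storage02, check the takeover:
# cat /proc/drbd
# ip addr show eth1
And back on storage01:
# /etc/init.d/heartbeat start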
Good luck!
UPDATE, April 24, 2014:
I have learned that using protocol C in DRBD only makes sense when both servers have equally powerful hardware. In my environment, one machine used 15k SAS disks and the other one had 7.2k SATA disks. The result was that the server with the SATA disks brutally slowed down the SAS machine. In such cases, it might make more sense to use protocol A or B, because they do not wait for the actual disk write on the target to finish. I could also observe that DRBD is a very CPU-intensive process; even the new 8-core Xeon server was running at load averages between 4.5 and 6.0, and the 8-core target server with SATA disks went beyond 8.0.
This resulted in a very simple decision: We no longer use DRBD. High availability is nice to have, but let’s face it, in most cases it’s not a showstopper if you don’t have it. After all, we don’t run life support systems here. So we now use both servers as regular standalone storage servers and use VMWare Replication to replicate the VMs every 24 hours from the main server to the backup server. That has nothing to do with high availability, and it does not even replace a backup, but at least it makes disaster recovery much simpler if that should ever be required.