
SAN Storage Design for Xen Virtualization Based on Block Devices

Novell Cool Solutions: Feature
By Ivan Vari

Updated: 28 Sep 2007
 

This document focuses solely on a design where virtual machines run on block devices (logically managed hard drive partitions) and are manually managed and migrated by the administrator via the xm interface.

Scenario: an HP EVA6000 FC SAN and two HP DL360G5 servers, with the requirement to create an environment where the administrator can manually (xm) migrate virtual machines between the two servers. We didn't really need very high availability, but flexibility: for instance, if we wanted to take one server down for maintenance, we could do it without causing too much outage to the services running on the virtual machines.

After browsing the Internet for a week I had to face the truth that there was no information or howto out there which would suit my needs. I was able to find some VMware ESX related papers and Novell's ultimate HA storage foundation. The problem with both is that they run virtual machines on file images, which we cannot do due to the nature of our virtual machines (high I/O).

The question I didn't have an answer for:

Q: What is really needed to be able to migrate a Xen virtual machine (running off a block device) from one host to another?
A: Nothing really; the block device just has to be available on both servers.

Even with the "--live" option the migration can be done without having any kind of cluster-aware file system on the two servers or on the (shared) block devices; I tested it thoroughly. However, we want to use this in a production environment, so for extra safety I do not recommend doing migrations with the "--live" option without a cluster-aware file system, even though Xen is very sophisticated software. There might be situations where the buffer is not yet flushed while the VM has already migrated over, causing race conditions where two dom0s try to write to the same (shared) block device and corrupt your filesystem. According to my tests, a virtual machine with 512 MB of RAM and normal I/O load migrates over in about 6-8 seconds and only 8 ping packets get lost, which is fairly affordable.
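
For reference, this is roughly how I measured that downtime; a minimal sketch, assuming a guest named vm1 and that the relocation settings described later in this document are already in place. Leave a ping running against the guest from a third machine, then migrate it:

    geeko@workstation:~> ping vm1

    host1:~ # xm migrate vm1 host2

The number of unanswered pings gives a rough idea of the outage; add "--live" only if you accept the caveats above.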

I have ended up with 2 possible solutions:

  1. Create a separate LUN for each virtual machine
    • The problem with this is that, due to multipathing, every LUN creates two devices. After a while it becomes unmanageable: the multipath configuration can become fairly big and complex, along with the SAN configuration, since each LUN needs to be presented to all dom0 hosts, not to mention that every time you need a LUN you have to nag the SAN administrator.

    • The advantage is that it doesn't need any further software (except multipathd), which makes this solution very feasible for small systems (5-10 VMs).

  2. Use some sort of cluster volume management
    • The issue with this is that it involves some level of complexity and therefore requires knowledge of several software products. It can also be overwhelming and unnecessary for small systems.

    • The advantage is that you need just one big LUN, which is managed by the Xen administrators, providing ultimate flexibility over your storage.

We need the second option, so this document explains how to achieve that. I assume you already have the LUN created and presented to both (or more) dom0 hosts.

  1. NTP Setup

    The time on the two physical machines needs to be synchronized; several components in the HASF stack require this. I have configured both nodes to use our internal NTP servers (three of them) in addition to the other node, which gives us fairly decent redundancy.

    host1:~ # vi /etc/sysconfig/ntp
    NTPD_INITIAL_NTPDATE="ntp2.domain.co.nz ntp3.domain.co.nz ntp1.domain.co.nz"
    NTPD_ADJUST_CMOS_CLOCK="no"
    NTPD_OPTIONS="-u ntp"
    NTPD_RUN_CHROOTED="yes"
    NTPD_CHROOT_FILES=""
    NTP_PARSE_LINK=""
    NTP_PARSE_DEVICE=""

    Remember that after making changes under the /etc/sysconfig directory you need to run SuSEconfig:

    host1:~ # SuSEconfig

    The server setup:

    host1:~ # vi /etc/ntp.conf
    server 127.127.1.0 
    fudge 127.127.1.0  flag1 0 flag2 0 flag3 0 flag4 0 stratum 5
    driftfile /var/lib/ntp/drift/ntp.drift     
    logfile /var/log/ntp                
    server ntp2.domain.co.nz 
    server ntp3.domain.co.nz 
    server ntp1.domain.co.nz 
    server host2.domain.co.nz

    This was set up with the YaST GUI module and mostly contains the defaults; I added the servers and changed the local clock (127.127.1.0) to be stratum 5.

    Ensure that both nodes can reach each other without DNS:

    host1:~ # vi /etc/hosts
    10.0.0.1  host1.domain.co.nz host1
    10.0.0.2  host2.domain.co.nz host2

    These steps need to be done the same way on the other node as well.
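
    To confirm that a node is actually synchronizing, restart NTP (the service is called ntp on SLES 10) and query its peers; the peer marked with an asterisk in the ntpq output is the server the node is currently synced to (your peer list will obviously differ):

    host1:~ # rcntp restart
    host1:~ # ntpq -p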

  2. Multipathing

    Multipathing has to be set up for proper redundancy, and without it the duplicate devices could also confuse EVMS. There is a nice guide from HP, but it requires the HP drivers to be installed. I prefer using the stock SUSE kernel drivers because they are maintained with the distribution; using the HP drivers would require you to re-install or update them every time you receive a kernel update. The HP HBA drivers take more options; here I present a setup based on the HP guide, modified to suit the stock kernel drivers.

    Tools we need:

    host1:~ # rpm -qa | grep -E 'mapper|multi'
    device-mapper-1.02.13-6.9
    multipath-tools-0.4.7-34.18

    Find out which options the stock kernel driver supports:

    host1:~ # modinfo qla2xxx

    It shows that only one of the options from the HP guide is supported by the stock driver. It's probably not crucial, but since the driver supports it and HP recommends it, we might as well set it:

    host1:~ # echo "options qla2xxx qlport_down_retry=1" >> /etc/modprobe.conf.local

    Update the ramdisk image then reboot the server:

    host1:~ # mkinitrd && reboot

    After reboot ensure that modules for multipathing are loaded:

    host1:~ # lsmod | grep 'dm'
    dm_multipath           24456  0 
    dm_mod                 66384  7 dm_multipath

    Your SAN devices should be visible by now; in my case they are /dev/sda and /dev/sdc. Note: these names may change when you add additional LUNs to the machine!

    Find out your WWID; it's needed for the multipath configuration:

    host1:~ # scsi_id -g -s /block/sda
    3600508b4001046490000700000360000
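
    Both paths to the same LUN must report the same WWID, so it's worth checking the second device as well (sda and sdc here; your device names may differ):

    host1:~ # scsi_id -g -s /block/sdc
    3600508b4001046490000700000360000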

    Configure multipathd according to your WWID:

    host1:~ # vi /etc/multipath.conf
    defaults {
            multipath_tool          "/sbin/multipath -v0"
            udev_dir                /dev
            polling_interval        5
            default_selector        "round-robin 0"
            default_path_grouping_policy    multibus
            default_getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
            default_prio_callout    /bin/true
            default_features        "0"
            rr_min_io               100
            failback                immediate
    }
    
    multipaths {
            multipath {
                    wwid                    3600508b4001046490000700000360000
                    alias                   mpath2
                    path_grouping_policy    multibus
                    path_checker            readsector0
                    path_selector           "round-robin 0"
            }
    }

    devices {
            device {
                    vendor                  "HP"
                    product                 "HSV200"
                    path_grouping_policy    group_by_prio
                    getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                    path_checker            tur
                    path_selector           "round-robin 0"
                    prio_callout            "/sbin/mpath_prio_alua %n"
                    failback                immediate
                    rr_weight               uniform
                    rr_min_io               100
                    no_path_retry           60
            }
    }

    Enable services upon reboot:

    host1:~ # insserv boot.device-mapper boot.multipath multipathd
    host1:~ # reboot

    After reboot, everything should be back. You can check your multipaths:

    host1:~ # multipath -l
    mpath2 (3600508b4001046490000700000360000) dm-0 HP,HSV200
    [size=100G][features=0][hwhandler=0]
    \_ round-robin 0 [prio=0][active]
     \_ 0:0:0:1 sda 8:0   [active][undef]
     \_ 0:0:1:1 sdc 8:32  [active][undef]

    For further information please refer to the original HP guide:

    http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00814876/c00814876.pdf?HPBCMETA::doctype=file

    Do exactly the same on the other node as well. The only difference you may see is in the /dev/sd* device names, but those don't matter. I copied the multipath.conf over to the other host and then set up the services there.
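
    If you need to change multipath.conf later, you don't necessarily have to reboot. Something along these lines should rebuild the maps, assuming the multipath devices are not in use at that moment (this is my own shortcut, not from the HP guide):

    host1:~ # /etc/init.d/multipathd stop
    host1:~ # multipath -F
    host1:~ # multipath -v2
    host1:~ # /etc/init.d/multipathd start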

  3. Heartbeat

    EVMS, which we will configure in a minute, doesn't maintain cluster membership on its own. We need heartbeat to maintain membership and to activate the EVMS volumes at startup on every member node.

    Install heartbeat package first:

    host1:~ # yast2 sw_single &

    Select the filter "Patterns" and install the heartbeat packages.

    Configuration:

    host1:~ # vi /etc/ha.d/ha.cf
    autojoin any
    crm true
    auto_failback off
    ucast eth0 10.0.0.2
    node host1
    node host2
    respawn root /sbin/evmsd
    apiauth evms uid=hacluster,root

    I configured unicast simply because I prefer it over broadcast and multicast. The respawn and apiauth lines get heartbeat to start evmsd on the nodes.

    Configure authentication (generate a hash of your secret by typing it into sha1sum and finishing with Ctrl-D):

    host1:~ # sha1sum
    yoursecretpassword
    7769bf61f294d7bb91dd3583198d2e16acd8cd76  -
    host1:~ # vi /etc/ha.d/authkeys
    auth 1
    1 sha1 7769bf61f294d7bb91dd3583198d2e16acd8cd76
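
    Heartbeat refuses to start if the authkeys file is readable by anyone other than root, so tighten the permissions on it:

    host1:~ # chmod 600 /etc/ha.d/authkeys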

    Set logging:

    host1:~ # vi /etc/ha.d/ha_logd.cf
    logfacility     daemon
    
    host1:~ # ln -s /etc/ha.d/ha_logd.cf /etc/logd.cf

    Start it up:

    host1:~ # rcheartbeat start

    On the other node everything is the same except the unicast destination, which is the IP of host1:

    host2:~ # vi /etc/ha.d/ha.cf
    autojoin any
    crm true
    auto_failback off
    ucast eth0 10.0.0.1
    node host1
    node host2
    respawn root /sbin/evmsd
    apiauth evms uid=hacluster,root

    Configure logging, authentication as above then:

    host2:~ # rcheartbeat start

    Ensure they see each other:

    host1:~ # crmadmin -N
    normal node: host1 (50dfbd69-7a40-484f-b548-4270b6e34251)
    normal node: host2 (8602848c-c8ff-4ee5-b66e-844e998dca48)
    
    host2:~ # crmadmin -N
    normal node: host1 (50dfbd69-7a40-484f-b548-4270b6e34251)
    normal node: host2 (8602848c-c8ff-4ee5-b66e-844e998dca48)

    Guide used: http://wiki.xensource.com/xenwiki/EVMS-HAwSAN-SLES10

  4. Runlevels

    We need to change the startup order of some services. In short: we don't want anything started automatically (by xendomains); we need full control over starting and stopping domains to ensure that only one VM is running at a time on a given shared block device.

    This is the primary reason for not setting up a STONITH device for the Xen domUs: I can't trust HA here, simply because it could never know for sure whether a VM was really shut down cleanly on the other node or not. We do need xendomains for migrating domUs over to the other node at shutdown, though. We achieve this with the following ordering:

    Startup (runlevel 3-5):

    • xend starts before heartbeat (xen changes the networking, must be finished before heartbeat starts)
    • heartbeat starts next (ensures EVMS volume discovery)
    • xendomains starts last (does nothing at startup)

    Shutdown (runlevel 6):

    • xendomains shuts down first because it was started last (we configure it to migrate running domains over to the other node; obviously xend must still be running at this stage)
    • heartbeat stops cleanly before we tear down the networking (remember that eth0, which you use for the keepalive messages, is a virtual interface!)
    • xend stops next which shuts down the xen networking and so forth...

    Service dependencies are set by the comments in the init script headers. Remove xendomains from the "Should-Start" line of heartbeat:

    host1:~ # vi /etc/init.d/heartbeat

    -snip-
    ### BEGIN INIT INFO
    # Provides: heartbeat
    # Required-Start: $network $syslog $named
    # Should-Start: drbd sshd o2cb evms ocfs2 xend
    # Required-Stop:
    # Default-Start:  3 5
    # Default-Stop:   0 1 2 6
    # Description:    Start heartbeat HA services
    ### END INIT INFO
    -snip-

    Insert heartbeat into the "Required-Start" line of xendomains:

    host1:~ # vi /etc/init.d/xendomains

    -snip-
    ### BEGIN INIT INFO
    # Provides:          xendomains
    # Required-Start:    $syslog $remote_fs xend heartbeat
    # Should-Start:
    # Required-Stop:     $syslog $remote_fs xend
    # Should-Stop:
    # Default-Start:     3 5
    # Default-Stop:      0 1 2 4 6
    # Short-Description: Starts and stops Xen VMs
    # Description:       Starts and stops Xen VMs automatically when the
    #                    host starts and stops.
    ### END INIT INFO
    -snip-

    Remove these services from all runlevels in this order then re-activate them:

    host1:~ # insserv -r heartbeat
    host1:~ # insserv -r xendomains
    host1:~ # insserv -r xend
    
    host1:~ # insserv -d xend
    host1:~ # insserv -d heartbeat
    host1:~ # insserv -d xendomains

    Ensure the right order:

    host1:~ # ls -l /etc/init.d/rc3.d | grep -E 'xend|heartbeat|xendomains'
    lrwxrwxrwx 1 root root 13 Aug 31 17:42 K09xendomains -> ../xendomains
    lrwxrwxrwx 1 root root 12 Aug 31 17:42 K10heartbeat -> ../heartbeat
    lrwxrwxrwx 1 root root  7 Aug 31 17:41 K12xend -> ../xend
    lrwxrwxrwx 1 root root  7 Aug 31 17:41 S10xend -> ../xend
    lrwxrwxrwx 1 root root 12 Aug 31 17:42 S12heartbeat -> ../heartbeat
    lrwxrwxrwx 1 root root 13 Aug 31 17:42 S13xendomains -> ../xendomains

    Do the same on the other node as well. Before we proceed with HA we need EVMS to be ready.

  5. EVMS

    EVMS is a great enterprise-class volume manager; it has a feature called the Cluster Segment Manager (CSM). We will use this feature to distribute the block devices between the dom0 nodes. On top of the CSM we use LVM2 volume management, which gives us the ultimate flexibility to create, resize and extend logical volumes.

    I include only the device I want EVMS to manage at this stage; I don't want EVMS to discover other disks that I am not planning to use in this setup. The "multipath -l" output above tells you which device-mapper device you need:

    host1:~ # grep . /etc/evms.conf | grep -v \#
    -snip-
    sysfs_devices {
            include = [ dm-0 ]
            exclude = [ iseries!vcd* ]
    }
    -snip-
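
    If you are unsure which dm-N name belongs to your multipath map, device-mapper can tell you; the minor number is the N in dm-N (the major number, 253 here, may differ on your system):

    host1:~ # dmsetup ls
    mpath2  (253, 0)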

    I also disable LVM2 on the host machine to avoid it interfering with EVMS. I will use other LUNs later on that I plan to manage with LVM2, but from inside a particular VM:

    host1:~ # grep . /etc/lvm/lvm.conf | grep -v \#
    -snip-
    devices {
        dir = "/dev"
        scan = [ "/dev" ]
        filter = [ "r|.*|" ]
        cache = "/etc/lvm/.cache"
        write_cache_state = 1
        sysfs_scan = 1
        md_component_detection = 1
    }
    -snip-

    Now we can create the volumes. Note: I am presenting my configuration here just for reference. If you need a step-by-step guide on how to do this, please read this document: http://wiki.novell.com/images/0/01/CHASF_preview_Nov172006.pdf

    I strongly recommend visiting the project's home page: http://evms.sourceforge.net

    (EVMS GUI screenshots omitted: Disks, Segments, CSM container and LVM2 on top, Regions, Volumes.)

    After you have created your EVMS volumes, save the changes. To activate the changes (create the devices) on all nodes immediately, we need to run evms_activate on every other node, simply because the default behavior of EVMS is to apply changes on the local node only.

    I have 2 nodes at this stage and I want to activate only the other node:

    host1:~ # evms_activate -n host2

    What if I had 20 nodes? That would be a bit overwhelming, so here is one way of doing it on all nodes (there are many other ways):

    host1:~ # for node in `grep node /etc/ha.d/ha.cf | cut -d ' ' -f2`; do evms_activate -n $node; done

  6. Heartbeat and EVMS

    Now we will configure heartbeat so that in case of node failure (a reboot, a network issue, or any occasion when heartbeat stops receiving signals from the other node) EVMS volume discovery happens again on that node when it re-joins the cluster (when HA starts receiving its heartbeat signals again). Our heartbeat is already prepared for the new CRM-style configuration, so we need to create and load the following XML file:

    host1:~ # vi evmscloneset.xml 
    <clone id="evmscloneset" notify="true" globally_unique="false">
    <instance_attributes id="evmscloneset">
     <attributes>
      <nvpair id="evmscloneset-01" name="clone_node_max" value="1"/>
     </attributes>
    </instance_attributes>
    <primitive id="evmsclone" class="ocf" type="EvmsSCC" provider="heartbeat">
    </primitive>
    </clone>

    Load it in:

    host1:~ # cibadmin -C -o resources -x evmscloneset.xml

    Give it a bit of time, then check that it operates properly:

    host1:~ # crm_mon
    ============
    Last updated: Fri Sep 14 09:20:15 2007
    Current DC: host2 (8602848c-c8ff-4ee5-b66e-844e998dca48)
    2 Nodes configured.
    1 Resources configured.
    ============
    
    Node: host1 (50dfbd69-7a40-484f-b548-4270b6e34251): online
    Node: host2 (8602848c-c8ff-4ee5-b66e-844e998dca48): online
    
    Clone Set: evmscloneset
        evmsclone:0 (heartbeat::ocf:EvmsSCC):       Started host1
        evmsclone:1 (heartbeat::ocf:EvmsSCC):       Started host2

    This has to be done only once, on one node. I usually make sure HA configuration changes are done on the "DC" node, but it's not really essential.

    I reckon the configuration is easier with XML files; in fact, complex scenarios can only be done this way. I use the HA GUI just to get an overview of the services. You either have to reset the password for the "hacluster" user or add yourself to the "haclient" group to be able to authenticate:

    host1:~ # hb_gui &

    Reboot the machines several times. Ensure that one node is always up, that the DC (designated controller) role does change over in the HA cluster, and that the volumes get discovered and activated properly on all nodes.
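
    Two quick checks I run after each reboot: crm_mon (as shown above) reports the current DC and the state of the clone set, and the EVMS volumes should reappear under /dev/evms on both nodes (the names there depend on what you called your container and volumes):

    host1:~ # crm_mon
    host1:~ # ls /dev/evms/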

    Novell's ultimate HA solution based on VMs running on file images and OCFS2: http://wiki.novell.com/images/3/37/Exploring_HASF.pdf

    Examples from Brainshare 2007 related to the topic above: http://wiki.novell.com/images/c/c8/Tut323_bs2007.pdf

  7. Xen configuration and xendomains

    I am going to present here what is needed to be able to migrate running domains over to the other node at shutdown; the complete configuration of Xen is beyond the scope of this document. It's quite straightforward:

    host1:~ # grep . /etc/xen/xend-config.sxp | grep -v \#
    (xen-api-server ((unix none)))
    (xend-http-server yes)
    (xend-unix-server yes)
    (xend-relocation-server yes)
    (xend-relocation-hosts-allow '^localhost$ ^localhost\\.localdomain$ ^host2$ ^host2\\.domain\\.co\\.nz$')
    (network-script 'network-bridge netdev=eth0')
    (vif-script vif-bridge)
    (dom0-min-mem 196)
    (dom0-cpus 0)
    (vncpasswd '')

    The other node is the same except for the relocation-hosts setting, which should allow host1 instead; configure that as well.

    Xendomains is configured in a different location:

    host1:~ # grep . /etc/sysconfig/xendomains | grep -v \#
    -snip-
    XENDOMAINS_MIGRATE="10.0.0.2"
    XENDOMAINS_SAVE=""
    XENDOMAINS_AUTO_ONLY="false"
    -snip-

    The important parts: the migration target is the IP of the other node, and we force this behavior on every domain, not just the ones specified in the /etc/xen/auto directory (which must be empty). I also cleared the XENDOMAINS_SAVE setting so the domains are not saved to disk.
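
    A quick way to verify that nothing will be started automatically is to check that the auto directory is indeed empty:

    host1:~ # ls -A /etc/xen/auto
    host1:~ #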

    Restart xend to apply the changes. To test the auto-migration (assuming you have a VM running), just stop xendomains; this works even if the service claims it isn't running:

    host1:~ # rcxend restart
    Restarting xend (old pid 25531 25532 25560)                           done
    
    host1:~ # rcxendomains stop
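
    If everything is set up correctly, the running guests are now migrated away; they should show up in "xm list" on the other node and disappear from this one (assuming a guest called nfs1, as in the next section):

    host2:~ # xm list
    host1:~ # xm list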

    Now you are ready to rock. Create some volumes and VMs. Each of these block devices will carry its own partitioning and filesystem, according to the VM running on it.

  8. Proof of concept

    Here I explain how I tested this setup. I created a VM with 512 MB of memory which runs an NFS service. You may have wondered why the domU pool is called mpath2 and what mpath1 is: it's another LUN, but I am not using it with EVMS. Our VM is going to host user home directories exported via NFS to store user data. The twist is that this LUN is not part of the EVMS configuration nor part of the VM image; I am simply passing the device into the VM as it comes off multipathd. Here is my VM configuration:

    host1:~ # cat /etc/xen/vm/nfs.xm 
    ostype="sles10"
    name="nfs1"
    memory=512
    vcpus=1
    uuid="86636fde-1613-2e12-8f94-093d1e3f962e"
    on_crash="destroy"
    on_poweroff="destroy"
    on_reboot="restart"
    localtime=0
    builder="linux"
    bootloader="/usr/lib/xen/boot/domUloader.py"
    bootargs="--entry=xvda1:/boot/vmlinuz-xenpae,/boot/initrd-xenpae"
    extra="TERM=xterm "
    disk=[ 'phy:/dev/evms/san2/vm2,xvda,w', 'phy:/dev/mapper/mpath1,xvdc,w' ]
    vif=[ 'mac=00:16:3e:1e:11:87', ]
    vfb=["type=vnc,vncunused=1"]

    The VM is running on host1, everything is as presented earlier in this document.

    Test 1: writing a 208 MB file to the NFS export from my workstation:

    geeko@workstation:~> ls -lh /private/ISO/i386cd-3.1.iso 
    -rw-r--r-- 1 geeko geeko 208M Nov  3  2006 /private/ISO/i386cd-3.1.iso 
    
    geeko@workstation:~> md5sum /private/ISO/i386cd-3.1.iso 
    b4d4bb353693e6008f2fc48cd25958ed  /private/ISO/i386cd-3.1.iso 
    
    geeko@workstation:~> mount -t nfs -o rsize=8196,wsize=8196 nfs1:/home/geeko /mnt 
    
    geeko@workstation:~> time cp /private/ISO/i386cd-3.1.iso /mnt
    
    real    0m20.918s 
    user    0m0.015s 
    sys     0m0.737s 

    It wasn't very fast because my uplink was limited to 100 Mbit/s, but that's not what we are concerned about right now.

    Test 2: do the same thing, but migrate the domain while writing to the NFS export:

    geeko@workstation:~> time cp /private/ISO/i386cd-3.1.iso /mnt

    Meanwhile on host1:

    host1:~ # xm migrate nfs1 host2
    host1:~ # xentop
    xentop - 12:02:23   Xen 3.0.4_13138-0.47 
    2 domains: 1 running, 0 blocked, 0 paused, 0 crashed, 0 dying, 1 shutdown 
    Mem: 14677976k total, 1167488k used, 13510488k free    CPUs: 4 @ 3000MHz 
          NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS   VBD_OO   VBD_RD   VBD_WR SSID 
      Domain-0 -----r       2754   47.6     524288    3.6   no limit n/a     4    4  1282795  4132024    0        0        0        0    0 
    migrating-nfs1 -s----          8    0.0     524288    3.6     532480      3.6     1    1    17813   433132    3        0       99    10714 
    
    
    real    0m41.221s 
    user    0m0.020s 
    sys     0m0.772s 

    As you can see it took twice as long, but:

    nfs1:~ # md5sum /home/geeko/i386cd-3.1.iso 
    b4d4bb353693e6008f2fc48cd25958ed  /home/geeko/i386cd-3.1.iso 
    
    The md5sum matches, which is what I wanted to see from the NFS VM. Check out the file system as well, just in case
    (the NFS VM uses LVM2 on top of xvdc (mpath1) with XFS):
    
    nfs1:~ # umount /home
    nfs1:~ # xfs_check /dev/mapper/san1-nfshome 
    nfs1:~ # 

    No corruption found.

Ingredients used:

  • OS: SLES10SP1
  • HW: HP DL360G5
  • SAN: HP EVA6000
  • HBA: QLA2432
  • Heartbeat 2.0.8
  • EVMS: 2.5.5

