Network interface bonding in Linux

Bonding Ethernet interfaces in Linux is pretty straightforward. There are plenty of articles out there on it already, but since this is where I keep some of my notes, I decided to write a post on it. That way I do not have to bother with Google and can come straight here for instructions.
This was done on a PowerEdge running CentOS 5.2. Here is what needs to be done to make this happen:

  • tell the OS to load the bonding.ko module on boot
  • set up configuration files for the members of the bonded interface and for the bonded interface itself
  • restart networking services or reboot

The following is the /etc/modprobe.conf file. To get the OS to load the bonding module on boot, you need to add the alias bond0 bonding line. You can also pass options to the bonding module. In this case I wanted the driver to check for link loss every 100 ms (miimon=100). I also wanted the bond0 interface to perform adaptive load balancing, hence mode=6. Adaptive load balancing does not require any configuration on the switch side. If you choose a different mode, you might have to do additional configuration on the switch.

[root@bigfoot etc]# cat /etc/modprobe.conf
alias eth0 e1000
alias eth1 e1000
alias bond0 bonding
alias scsi_hostadapter qla1280
alias scsi_hostadapter1 megaraid_mbox
alias scsi_hostadapter2 ata_piix
options bond0 miimon=100 mode=6

Next, you need to set up configuration files for the physical interfaces to be included in the bond0 interface. In my case bond0 consists of eth0 and eth1. The configuration files for both interfaces are identical except for the DEVICE= lines.

[root@bigfoot etc]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Intel Corporation 82541GI Gigabit Ethernet Controller
DEVICE=eth0
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
ONBOOT=yes
USERCTL=no

The last step is to configure bond0 interface itself:

[root@bigfoot etc]# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
IPADDR=192.168.11.200
NETMASK=255.255.255.0
NETWORK=192.168.11.0
ONBOOT=yes
USERCTL=no

That is all. You can now either run /etc/init.d/network restart or reboot the box.
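
Once the network is back up, the bonding driver reports its state in /proc/net/bonding/bond0. Below is a minimal sketch of a check that lists each slave with its MII status. The bond_slaves helper name is my own, and the sample text is a trimmed, made-up approximation of the status file rather than output captured from this box, so treat it as a starting point only.

```shell
#!/bin/sh
# bond_slaves [FILE]: print "<slave> <MII status>" for every slave listed in
# a /proc/net/bonding/<bond> style status file (reads stdin if no FILE).
# Hypothetical helper, for illustration only.
bond_slaves() {
    awk -F': ' '
        /^Slave Interface:/ { slave = $2 }
        # The first "MII Status" line belongs to the bond itself, so only
        # report the ones that follow a "Slave Interface" line.
        /^MII Status:/ && slave != "" { print slave, $2; slave = "" }
    ' "$@"
}

# Trimmed, made-up sample of the status file layout.
sample='Bonding Mode: adaptive load balancing
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100

Slave Interface: eth0
MII Status: up
Link Failure Count: 0

Slave Interface: eth1
MII Status: down
Link Failure Count: 1'

printf '%s\n' "$sample" | bond_slaves
```

On a live system the call would be bond_slaves /proc/net/bonding/bond0. A slave that stays "down" here is worth checking against the kernel log messages.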

This time I actually ran into a problem, where the physical interfaces were not being “enslaved” properly:

May 4 11:17:40 bigfoot kernel: ADDRCONF(NETDEV_UP): bond0: link is not ready
May 4 11:17:40 bigfoot kernel: bonding: bond0: Adding slave eth0.
May 4 11:17:40 bigfoot kernel: bonding: bond0: enslaving eth0 as an active interface with a down link.
May 4 11:17:40 bigfoot kernel: bonding: bond0: link status definitely up for interface eth0.
May 4 11:17:40 bigfoot kernel: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
May 4 11:17:40 bigfoot kernel: bonding: bond0: Adding slave eth1.
May 4 11:17:40 bigfoot kernel: bonding: bond0: enslaving eth1 as an active interface with a down link.
May 4 11:17:45 bigfoot kernel: bonding: bond0: Removing slave eth0
May 4 11:17:45 bigfoot kernel: bonding: bond0: Warning: the permanent HWaddr of eth0 - 00:11:43:D8:AF:63 - is still in use by bond0. Set the HWaddr of eth0 to a different address to avoid conflicts.
May 4 11:17:45 bigfoot kernel: bonding: bond0: releasing active interface eth0
May 4 11:17:47 bigfoot kernel: bonding: bond0: Adding slave eth0.
May 4 11:17:48 bigfoot kernel: bonding: bond0: Warning: failed to get speed and duplex from eth0, assumed to be 100Mb/sec and Full.
May 4 11:17:48 bigfoot kernel: bonding: bond0: enslaving eth0 as an active interface with an up link.

I had never had this problem before, and a quick search revealed that I am not alone. I came across someone who had the same problem and who also links to the solution. Basically, it seems Xen is causing the issue. To fix it you need to edit /etc/xen/xend-config.sxp and tell Xen explicitly which network device to use for its network bridge:

(network-script 'network-bridge netdev=bond0')

Once I had that in place everything worked as advertised. Oh, and for thorough documentation check out the Documentation directory included with the kernel source. The file is called bonding.txt. Here is an online version of it.

Getting a handle on log files

Starting with Solaris 9 there is a very handy tool called logadm that makes management of log files a breeze. The syslog and messages files, among others, are managed by logadm, which is called from root’s crontab.
Logadm reads the /etc/logadm.conf file to figure out what it needs to do. By default the following entries are in logadm.conf:

/var/log/syslog -C 8 -P 'Wed Apr 8 02:10:22 2009' -a 'kill -HUP `cat /var/run/syslog.pid`'
/var/adm/messages -C 4 -P 'Fri Apr 10 02:10:15 2009' -a 'kill -HUP `cat /var/run/syslog.pid`'
/var/cron/log -c -s 512k -t /var/cron/olog
/var/lp/logs/lpsched -C 2 -N -t '$file.$N'
/var/fm/fmd/errlog -M '/usr/sbin/fmadm -q rotate errlog && mv /var/fm/fmd/errlog.0- $nfile' -N -s 2m
smf_logs -C 8 -s 1m /var/svc/log/*.log
/var/adm/pacct -C 0 -N -a '/usr/lib/acct/accton pacct' -g adm -m 664 -o adm -p never
/var/log/pool/poold -N -a 'pkill -HUP poold; true' -s 512k

Logadm provides the -w switch, which writes an entry into logadm.conf that reflects the current command line arguments. Of course, logadm.conf can also be edited with a text editor, if that is the preferred method. In that case, the -V option can validate the syntax of logadm.conf for you. Another handy option is -n, which causes logadm to do a dry run without actually performing the log rotation.
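
As a rough sketch of how these three switches fit together: /var/log/myapp.log is a hypothetical log file, and the run wrapper is my own addition so that the commands can be previewed on a machine that has no logadm. The -C and -s options are the same ones used in the default entries above.

```shell
#!/bin/sh
# run CMD...: print the command, or execute it when RUN=1 is set.
# Hypothetical preview wrapper, not part of logadm.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

LOG=/var/log/myapp.log   # hypothetical log file, not from the post

run logadm -w "$LOG" -C 5 -s 10m   # write an entry: keep 5 old copies, rotate at 10 MB
run logadm -V                      # validate the syntax of logadm.conf
run logadm -n "$LOG"               # logadm dry run: show what a rotation would do
```

Run it as is to see the commands, or with RUN=1 on a Solaris box to actually execute them.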

Other useful switches are:

  • -b and -a, which specify commands to execute before and after rotation
  • -e sends error messages to a specific address instead of sending them to the owner of the crontab
  • -r removes the entry for a specific log file from logadm.conf
  • -o sets an owner for the new log file that differs from the original
  • -g sets a group for the new log file that differs from the original
  • -m sets permissions for the new log file that differ from the original

For the whole story on logadm check out the logadm man page.


Solaris Link Aggregation

Link aggregation takes a bunch of network interfaces and creates a big pipe out of them.

Aggregation also provides redundancy. If all interfaces but one go down, the server will remain connected to the network.

Before starting make sure that:

  • interfaces to be aggregated are of one of the following types: xge, e1000g or bge
  • interfaces to be aggregated are not plumbed
  • they run in full duplex mode at the same speed
  • eeprom’s local-mac-address? variable is set to true

The following creates the aggr1 interface (key 1) with bge1 as its first member:

bash-3.00# dladm create-aggr -d bge1 1

Next plumb the aggregate interface, configure an IP address on it and bring it up:

bash-3.00# ifconfig aggr1 plumb 192.168.1.5 netmask 255.255.255.0 up

At this point you can list aggregations:

bash-3.00# dladm show-aggr
key: 1 (0x0001) policy: L4      address: 0:3:ba:56:7f:ba (auto)
           device   address           speed        duplex   link   state
           bge1     0:3:ba:56:7f:ba   0     Mbps   unknown  down   standby

Now add bge0 as a second member of aggr1 aggregation interface and list aggregate interfaces:

bash-3.00# dladm add-aggr -d bge0 1
bash-3.00# dladm show-aggr
key: 1 (0x0001) policy: L4      address: 0:3:ba:56:7f:ba (auto)
           device   address           speed        duplex   link   state
           bge1     0:3:ba:56:7f:ba   0     Mbps   unknown  down   standby
           bge0     0:3:ba:56:7f:b9   1000  Mbps   full     up     attached

To keep the configuration persistent across reboots, create /etc/hostname.aggr1 with the appropriate content (the hostname or IP address for the aggregate) and remove any hostname.* files belonging to the interfaces that are now members of aggr1.

For link aggregation to work properly, the switch the server is connected to must also be properly configured with LACP.

Another thing to consider is the load balancing policy for outgoing traffic. You can load balance on layers 2, 3 and 4. The policy can be changed with the dladm command. Here is a quick example that changes the load balancing policy to a combination of L3 and L4:

bash-3.00# dladm modify-aggr -P L3,L4 1
bash-3.00# dladm show-aggr -L
key: 1 (0x0001) policy: L3,L4 address: 0:3:ba:56:7f:ba (auto)
                LACP mode: off  LACP timer: short
    device    activity timeout aggregatable sync  coll dist defaulted expired
    bge1      passive  short   yes          no    no   no   no        no
    bge0      passive  short   yes          no    no   no   no        no

And finally, the command that lets you see the utilisation of individual links within the aggregation. Note the uneven %ipkts column: I did not have LACP turned on on the switch at that time.

bash-3.00# dladm show-aggr -s
key: 1     ipackets   rbytes       opackets   obytes      %ipkts  %opkts
    Total  2723785    2287233197   1481682    710633551
    bge1   618712     115674760    870443     636559150   22.7    58.7
    bge0   2105073    2171558437   611239     74074401    77.3    41.3
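
The percentage columns are each link's share of the Total row. A small sketch of that arithmetic for the ipackets column, using the numbers from the listing above (the pct helper is my own name, not part of dladm):

```shell
#!/bin/sh
# pct PART TOTAL: print PART as a percentage of TOTAL with one decimal place.
# Hypothetical helper, for illustration only.
pct() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

total_ipackets=2723785            # "Total" row, ipackets column

pct 618712  "$total_ipackets"     # bge1 -> 22.7
pct 2105073 "$total_ipackets"     # bge0 -> 77.3
```

The same calculation against the opackets total (1481682) gives the %opkts column.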

More info on link aggregation is here.

Growing mirrored LUN in RedHat

I was putting a RedHat server onto a SAN and could not find any clear instructions on how to grow a single mirrored LUN on the fly. Anyway, here are some notes on the process. First the setup: two LUNs mirrored across two SANs with an LVM volume on top. I could have easily just presented another set of mirrored LUNs, added them to the VG and gone from there. I wanted to avoid that, as that kind of setup can quickly get out of hand as the number of presented LUNs grows. If there is a more “sensible” and flexible setup, I would most definitely want to know about it.

For the sake of completeness, here are the steps to recreate the initial setup I had:

  1. Create a mirror from two LUNs
  2. Use the mirror as a PV
  3. Create a VG using the PV
  4. Create an LV on top of the VG
  5. Make an ext3 filesystem on top of the LV and mount it

Here are the actual steps with some output:

[root@ultra /]# mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/mapper/mpath4 /dev/mapper/mpath5
mdadm: array /dev/md10 started.
[root@ultra /]# pvcreate /dev/md10
Physical volume "/dev/md10" successfully created
[root@ultra /]# vgcreate testvg /dev/md10
Volume group "testvg" successfully created
[root@ultra /]# lvcreate -l+100%FREE -n testlv testvg
Logical volume "testlv" created
[root@ultra /]# mkfs -t ext3 /dev/testvg/testlv
[root@ultra /]# mount /dev/testvg/testlv /tmp/test

Now the resizing part. There are a few steps, but the upshot is that the filesystem can stay mounted and in use. A high level overview of the steps to take:

  1. Grow the two LUNs using the SAN management software
  2. Fail and remove one of the submirrors
  3. Force the kernel to see the size increase of the submirror
  4. Flush and recreate the multipath device map so multipathing sees the new size
  5. Re-add the submirror to the mirror and let it sync
  6. Repeat 2-5 for the second submirror
  7. Grow the mirror device
  8. Resize the PV
  9. Resize the LV
  10. Resize the filesystem

First, you fail and remove the submirror:

[root@ultra /]# mdadm /dev/md10 -f /dev/mapper/mpath4 -r /dev/mapper/mpath4
mdadm: set /dev/mapper/mpath4 faulty in /dev/md10
mdadm: hot removed /dev/mapper/mpath4

Now, note all paths to the LUN. The kernel sees a separate device at the end of each path to a LUN. In this case they are sdj, sdt, sdg and sdq.

[root@ultra /]# multipath -ll mpath4
mpath4 (3600508b400011c300000f000008d0000)
[size=12 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=100][active]
 \_ 1:0:3:1 sdj 8:144 [active][ready]
 \_ 2:0:3:1 sdt 65:48 [active][ready]
\_ round-robin 0 [prio=20][enabled]
 \_ 1:0:2:1 sdg 8:96 [active][ready]
 \_ 2:0:2:1 sdq 65:0 [active][ready]

At this point the problem is to get the kernel to recognize the new size without a reboot. After a lot of trying and sifting through man pages I found that the blockdev command does the magic. Then I googled “blockdev resize” and found this, confirming my finding. So, the next step is to re-probe every path to the LUN:

[root@ultra /]# blockdev --rereadpt /dev/sdj
[root@ultra /]# blockdev --rereadpt /dev/sdt
[root@ultra /]# blockdev --rereadpt /dev/sdg
[root@ultra /]# blockdev --rereadpt /dev/sdq
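
With more than a handful of paths, typing these out by hand gets tedious. Below is a minimal sketch that pulls the path devices out of multipath -ll style output and prints the matching blockdev commands. The multipath_paths helper is my own name, and the sample is the mpath4 listing from above. The commands are only echoed, so nothing touches a real device.

```shell
#!/bin/sh
# multipath_paths [FILE]: print the sdX device of every path line in
# "multipath -ll" style output (reads stdin if no FILE). A path line is
# recognised by its host:channel:target:lun field, and the device name is
# the field right after it. Hypothetical helper, for illustration only.
multipath_paths() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^[0-9]+:[0-9]+:[0-9]+:[0-9]+$/)
                print $(i + 1)
    }' "$@"
}

# The mpath4 listing shown earlier in this post.
sample='mpath4 (3600508b400011c300000f000008d0000)
[size=12 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=100][active]
 \_ 1:0:3:1 sdj 8:144 [active][ready]
 \_ 2:0:3:1 sdt 65:48 [active][ready]
\_ round-robin 0 [prio=20][enabled]
 \_ 1:0:2:1 sdg 8:96 [active][ready]
 \_ 2:0:2:1 sdq 65:0 [active][ready]'

# Print one blockdev command per path instead of running it.
printf '%s\n' "$sample" | multipath_paths | while read -r dev; do
    echo blockdev --rereadpt "/dev/$dev"
done
```

On a live system you would feed it the output of multipath -ll mpath4 and drop the echo once the printed commands look right.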

You should see messages in /var/log/messages about the kernel seeing the new size on all paths. If you were to issue multipath -ll right now, you would see that multipathing is still reporting the old size. To fix that, flush the device map of the LUN and then recreate it:

[root@ultra /]# multipath -f mpath4
[root@ultra /]# multipath -v2
create: mpath4 (3600508b400011c300000f000008d0000)
[size=13 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [prio=100]
 \_ 1:0:3:1 sdj 8:144 [ready]
 \_ 2:0:3:1 sdt 65:48 [ready]
\_ round-robin 0 [prio=20]
 \_ 1:0:2:1 sdg 8:96 [ready]
 \_ 2:0:2:1 sdq 65:0 [ready]

Multipathing should be reporting the new size. Now you are ready to put back the grown submirror and let the whole mirror sync:

[root@ultra /]# mdadm /dev/md10 -a /dev/mapper/mpath4
mdadm: hot added /dev/mapper/mpath4
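
While the mirror rebuilds, /proc/mdstat shows a recovery (or resync) progress line for the array. Here is a minimal sketch of a check you could poll before moving on to the second submirror. The md_syncing helper is my own name, and both samples are made-up text in the general shape of /proc/mdstat, not output captured during this procedure.

```shell
#!/bin/sh
# md_syncing [FILE]: succeed (exit 0) while an mdstat-style listing contains
# a resync or recovery line (reads stdin if no FILE).
# Hypothetical helper, for illustration only.
md_syncing() {
    grep -Eq 'recovery|resync' "$@"
}

# Made-up samples: one array mid-rebuild, one fully in sync.
syncing='md10 : active raid1 dm-5[2] dm-4[0]
      12582848 blocks [2/1] [U_]
      [=>...................]  recovery =  5.0% (629248/12582848) finish=3.2min speed=62924K/sec'
idle='md10 : active raid1 dm-5[1] dm-4[0]
      12582848 blocks [2/2] [UU]'

printf '%s\n' "$syncing" | md_syncing && echo "still syncing"
printf '%s\n' "$idle"    | md_syncing || echo "in sync"

# On a live system, something like:
#   while md_syncing /proc/mdstat; do sleep 30; done
```

Note that the check looks at the whole file, so a rebuild on any other md array would also keep the loop waiting.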

When the mirror has synced up, repeat the above process for the second submirror and wait for the sync to finish. Time to grow the mirror device itself:

[root@ultra /]# mdadm --grow /dev/md10 --size=max

Once that completes, /proc/mdstat should report the increased size of /dev/md10. Moving on, you need to grow the PV that resides on /dev/md10:

[root@ultra /]# pvresize /dev/md10
Physical volume "/dev/md10" changed
1 physical volume(s) resized / 0 physical volume(s) not resized

And finally, you need to resize the LV:

[root@ultra /]# lvresize -l+100%FREE testvg/testlv
Extending logical volume testlv to 13.00 GB
Logical volume testlv successfully resized

Of course, don’t forget to grow the filesystem itself:

[root@ultra /]# ext2online /dev/testvg/testlv
ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b
[root@ultra /]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-rootlv
                     132304280   5104976 120478588   5% /
/dev/md0                132134     32791     92521  27% /boot
none                   8202920         0   8202920   0% /dev/shm
/dev/mapper/testvg-testlv
                      13413488     63516  12668820   1% /tmp/test

That should be it. The sync time for huge volumes is something to keep in mind. The whole setup is clean and neat, without clutter. I could have opted to mirror using LVM, but there seems to be a strange requirement for a third volume to hold the log. It is possible to keep the log in memory, but that supposedly causes a resync on boot.

Basic IPMP

Finally, I got tired of remembering which network interface is configured on my Netra test box. So that I do not have to remember which interface to plug the cable into, I configured IPMP on the box. IPMP provides link redundancy among multiple network interfaces in a multipathing group. IPMP is not meant to be a full-fledged load balancing solution, though it will spread outgoing traffic across the interfaces.

I have put my two hme interfaces into a multipathing group. The group has a failover IP address assigned to it. Initially this address will be assigned to hme0. If hme0 fails, the address will automatically move to the other interface in the failover group.

First I edited /etc/hostname.hme0:

unreal-hme0 netmask + broadcast + deprecated -failover group unrealgrp1 up addif unreal netmask + broadcast + failover up

This configures the physical hme0 with the IP address 192.168.11.6, which will not fail over, and puts hme0 in the unrealgrp1 multipathing group. It additionally configures a virtual IP address of 192.168.11.5, which will fail over when the hme0 link goes down. Deprecated means the IP address 192.168.11.6 will not be used as a source address for any outgoing packets.

Then I edited /etc/hostname.hme1:

unreal-hme1 netmask + broadcast + deprecated -failover group unrealgrp1 up

Similarly, hme1 will be configured with IP address of 192.168.11.7 and as a member of unrealgrp1 multipathing group. Again, 192.168.11.7 is marked as deprecated so it will not be used for outgoing packets. Finally I made sure my hosts file is correct:

bash-3.00# cat /etc/hosts
127.0.0.1 localhost
192.168.11.5 unreal loghost
192.168.11.6 unreal-hme0
192.168.11.7 unreal-hme1

And here is the result:

bash-3.00# ifconfig -a
lo0: flags=2001000849 mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
hme0: flags=9040843 mtu 1500 index 2
        inet 192.168.11.6 netmask ffffff00 broadcast 192.168.11.255
        groupname unrealgrp1
        ether 8:0:20:d9:ac:c
hme0:1: flags=1000843 mtu 1500 index 2
        inet 192.168.11.5 netmask ffffff00 broadcast 192.168.11.255
hme1: flags=19040803 mtu 1500 index 3
        inet 192.168.11.7 netmask ffffff00 broadcast 192.168.11.255
        groupname unrealgrp1
        ether 8:0:20:d9:ac:d

Essentially IP address 192.168.11.5 “floats” among interfaces. If I were to unplug hme0, 192.168.11.5 would fail over to hme1. Failure is detected on link loss. There are some tunable parameters in /etc/default/mpathd.
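
To see where the floating address currently lives, you can scan the ifconfig -a output for it. Below is a minimal sketch, using the listing above as sample input. The addr_owner helper is my own name, not a Solaris tool.

```shell
#!/bin/sh
# addr_owner IP [FILE]: print the (logical) interface that carries IP in
# "ifconfig -a" style output (reads stdin if no FILE).
# Hypothetical helper, for illustration only.
addr_owner() {
    ip=$1
    shift
    awk -v ip="$ip" '
        # Interface header lines are the ones with a flags= field.
        /flags=/ { ifname = $1; sub(/:$/, "", ifname) }
        $1 == "inet" && $2 == ip { print ifname }
    ' "$@"
}

# The ifconfig -a listing from this post.
sample='lo0: flags=2001000849 mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
hme0: flags=9040843 mtu 1500 index 2
        inet 192.168.11.6 netmask ffffff00 broadcast 192.168.11.255
        groupname unrealgrp1
        ether 8:0:20:d9:ac:c
hme0:1: flags=1000843 mtu 1500 index 2
        inet 192.168.11.5 netmask ffffff00 broadcast 192.168.11.255
hme1: flags=19040803 mtu 1500 index 3
        inet 192.168.11.7 netmask ffffff00 broadcast 192.168.11.255
        groupname unrealgrp1
        ether 8:0:20:d9:ac:d'

printf '%s\n' "$sample" | addr_owner 192.168.11.5    # prints hme0:1
```

On the box itself, ifconfig -a | addr_owner 192.168.11.5 should name a logical interface on hme1 after a failover.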

This is all I needed. Of course, there is much more to IPMP: you can set up target systems that your system will probe for reachability, detection of interfaces missing on boot, etc. Sun has much more info on it here.
