Solaris Containers

Solaris Containers allow a physical server to be partitioned into virtual servers. Containers do not provide virtualization in the sense of Xen or VMware; they are more similar to jails. A zone is a container without resource controls.

A Solaris instance running in a non-global zone shares parts of its filesystem with the global zone. There is only one kernel running, in the global zone. The kernel handles the physical machine on behalf of the non-global zones running on the system. As far as non-global zones are concerned, they appear as separate machines with their own services running, etc. Non-global zones cannot “see” each other, and they cannot see what is going on in the global zone. The global zone, however, can see what is going on inside the non-global zones.

Setting up a zone is fairly trivial. There are two types: full (whole root) and sparse (sparse root) zones. Here is a quick rundown of a sparse zone setup on Solaris 10 08/07:

  1. Change system-wide scheduler to FSS
  2. Create a non-global zone
  3. Install the zone
  4. Boot the created zone

First, set the default scheduler on the system to the Fair Share Scheduler (FSS). This allows you to assign CPU shares to individual zones and prevents a single zone from monopolizing the CPU:

bash-3.00# dispadmin -d FSS
bash-3.00# dispadmin -d
FSS (Fair Share)

You will need to reboot for the system to start using FSS. Note that the default scheduler in Solaris 10 is TS (time sharing). If you want to change the scheduler without a reboot, in addition to using dispadmin, you could try something like this:

bash-3.00# priocntl -s -c FSS -i class TS
bash-3.00# priocntl -s -c FSS -i pid 1

The first command moves all processes from the TS class to the FSS class, and the second moves the init process to the FSS class.

To see if there are any zones installed on the system, use the zoneadm command inside the global zone:

bash-3.00# zoneadm list -v

To start setting up a zone, use the zonecfg command:

bash-3.00# zonecfg -z ns1
ns1: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:ns1>

Continuing with the interactive zonecfg session:

zonecfg:ns1> create
zonecfg:ns1> set zonepath=/export/home/zones/ns1
zonecfg:ns1> set autoboot=true
zonecfg:ns1> set bootargs="-m verbose"
zonecfg:ns1> set scheduling-class=FSS
zonecfg:ns1> set cpu-shares=5
zonecfg:ns1> add attr
zonecfg:ns1:attr> set name=comment
zonecfg:ns1:attr> set type=string
zonecfg:ns1:attr> set value="DNS Server"
zonecfg:ns1:attr> end

Remember, nothing is set in stone until you issue commit at the end of the zonecfg session. The commands above start the configuration of a new zone called ns1. The zone will be installed in /export/home/zones/ns1 and booted automatically when the global zone boots. The next two settings add boot arguments and a scheduling class for the zone. If the scheduling class is not defined, it is inherited from the global zone; since the system scheduling class is already set to FSS, this entry is optional. cpu-shares assigns the zone its FSS shares. Finally, the comment attribute records what the zone is for.

Next, you can cap the memory usage for the zone:

zonecfg:ns1> add capped-memory
zonecfg:ns1:capped-memory> set physical=512M
zonecfg:ns1:capped-memory> set swap=1024M
zonecfg:ns1:capped-memory> end

The physical keyword specifies how much physical memory the zone is allowed to consume. The swap attribute caps the total swap consumed by user processes and tmpfs mounts inside the non-global zone.
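zonecfg reports the swap cap in bytes under the zone.max-swap resource control, so it helps to know what the M values translate to. A minimal sketch of the conversion, assuming the M suffix means multiples of 1024*1024 bytes; mb_to_bytes is a hypothetical helper, not a Solaris command:

```shell
# Convert a size in megabytes (as in "set swap=1024M") to bytes.
# mb_to_bytes is a made-up helper name used only for illustration.
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

mb_to_bytes 1024   # the swap cap from the session above
mb_to_bytes 512    # the physical memory cap from the session above
```

The first call prints 1073741824, which is the limit that zonecfg info lists for zone.max-swap.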

Finally, add a network interface:

zonecfg:ns1> add net
zonecfg:ns1:net> set physical=aggr1
zonecfg:ns1:net> set address=10.1.1.1
zonecfg:ns1:net> end

In this case, the physical interface for the zone will be aggr1 (an aggregate interface in the global zone) with an IP address of 10.1.1.1.

Now you can review the settings configured and then commit them:

zonecfg:ns1> info
zonename: ns1
zonepath: /export/home/zones/ns1
brand: native
autoboot: true
bootargs:
pool:
limitpriv:
scheduling-class: FSS
ip-type: shared
inherit-pkg-dir:
        dir: /lib
inherit-pkg-dir:
        dir: /platform
inherit-pkg-dir:
        dir: /sbin
inherit-pkg-dir:
        dir: /usr
net:
        address: 10.1.1.1
        physical: aggr1
capped-memory:
        physical: 512M
        [swap: 1G]
attr:
        name: comment
        type: string
        value: "DNS Server"
rctl:
        name: zone.cpu-shares
        value: (priv=privileged,limit=5,action=none)
rctl:
        name: zone.max-swap
        value: (priv=privileged,limit=1073741824,action=deny)
zonecfg:ns1> commit
zonecfg:ns1> exit
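The interactive session can also be replayed non-interactively, which is handy when several similar zones have to be built. A sketch, assuming the same ns1 settings as above; zone_commands is a hypothetical helper that only prints the commands, and the capped-memory and attr resources are left out for brevity:

```shell
# Print zonecfg commands for a zone, one per line.
# usage: zone_commands <zonepath> <cpu_shares> <physical_if> <address>
# zone_commands is a made-up name; the values mirror the ns1 session.
zone_commands() {
    cat <<EOF
create
set zonepath=$1
set autoboot=true
set scheduling-class=FSS
set cpu-shares=$2
add net
set physical=$3
set address=$4
end
commit
EOF
}

zone_commands /export/home/zones/ns1 5 aggr1 10.1.1.1
```

Redirect the output to a file (for example /tmp/ns1.cfg, a path picked for illustration) and feed it to zonecfg in the global zone with zonecfg -z ns1 -f /tmp/ns1.cfg.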

At this point, if you list the zones on the system, you will see something similar to this:

bash-3.00# zoneadm list -vc
ID NAME             STATUS     PATH                           BRAND    IP
0 global           running    /                              native   shared
- ns1              configured /export/home/zones/ns1         native   shared

Now you can proceed with installing the zone. The install might take a little while depending on the type of the zone you are installing and the size of the global zone:

bash-3.00# zoneadm -z ns1 install
Preparing to install zone <ns1>.
Creating list of files to copy from the global zone.
Copying files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize packages on the zone.
Initialized packages on zone.
Zone is initialized.
The file contains a log of the zone installation.
bash-3.00#

It might be a good idea to set cpu-shares on the global zone so it’s not rendered unusable by a non-global zone going AWOL. This will require a reboot:

bash-3.00# zonecfg -z global
zonecfg:global> set cpu-shares=10
zonecfg:global> commit
zonecfg:global> exit

Alternatively, in addition to the above, you can use the prctl command to avoid the reboot:

bash-3.00# prctl -n zone.cpu-shares -v 10 -r -i zone global
bash-3.00# prctl -n zone.cpu-shares -i zone global
zone: 0: global
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
zone.cpu-shares
        privileged         10       -   none                                 -
        system          65.5K     max   none
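Shares are relative weights, not percentages: under FSS a zone's entitlement is its own shares divided by the total shares of all zones that are actively competing for CPU. A small sketch of that arithmetic with the values used in this post (global=10, ns1=5); share_pct is a hypothetical helper and relies on awk for the division:

```shell
# Percentage of CPU a zone is entitled to when every listed zone is busy.
# usage: share_pct <zone_shares> <shares_of_every_zone_including_this_one...>
share_pct() {
    mine=$1
    shift
    total=0
    for s in "$@"; do
        total=$(( total + s ))
    done
    awk -v m="$mine" -v t="$total" 'BEGIN { printf "%.1f\n", 100 * m / t }'
}

share_pct 5 10 5    # ns1 when both zones compete for CPU
share_pct 10 10 5   # global zone in the same situation
```

With these numbers ns1 gets 33.3 and the global zone 66.7. The split only matters under contention; when one zone is idle, the other can use the whole CPU.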

Now you can boot the zone and log into it:

bash-3.00# zoneadm -z ns1 boot
bash-3.00# zlogin -C ns1

That’s pretty much it as far as a simple setup is concerned. A few links of interest:

  • Setting zone-wide resource controls
  • System Administration Guide: Solaris Containers-Resource Management and Solaris Zones
  • Fair Share Scheduler Overview
  • How to Set FSS Shares in the Global Zone Using the prctl Command
  • How to control CPU usage in Global Zone
  • resource_controls man page
  • Zone Resource Controls
  • Zone Resource Control in Solaris 10 08/07 OS
  • Jeff Victor’s blog – New Zones Features
  • Short thread on setting memory limits
  • Solaris 10 Scheduling


Network interface bonding in Linux

Bonding Ethernet interfaces in Linux is pretty straightforward. There are a bunch of articles out there on it already, but since this is where I keep some of my notes, I decided to write a post on it. Plus, I do not have to bother with Google and can come straight here for instructions.
This was done on a PowerEdge running CentOS 5.2. Here are the things that need to be done to make this happen:

  • tell the OS to load the bonding.ko module on boot
  • set up configuration files for members of the bonded interface and the bonded interface itself
  • restart networking services or reboot

The following is the /etc/modprobe.conf file. To get the OS to load the bonding module on boot, you need to add the alias bond0 bonding line. You can also pass options to the bonding module. In this case I wanted the driver to check for link loss every 100 ms (miimon=100). I also wanted the bond0 interface to perform adaptive load balancing, hence mode=6. Adaptive load balancing does not require any configuration on the switch side. If you choose a different mode, you might have to do additional configuration on the switch.

[root@bigfoot etc]# cat /etc/modprobe.conf
alias eth0 e1000
alias eth1 e1000
alias bond0 bonding
alias scsi_hostadapter qla1280
alias scsi_hostadapter1 megaraid_mbox
alias scsi_hostadapter2 ata_piix
options bond0 miimon=100 mode=6

Next, you need to set up configuration files for the physical interfaces to be included in the bond0 interface. In my case bond0 consists of eth0 and eth1. The configuration files for both interfaces are identical except for the DEVICE= lines.

[root@bigfoot etc]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Intel Corporation 82541GI Gigabit Ethernet Controller
DEVICE=eth0
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
ONBOOT=yes
USERCTL=no
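Since the member files differ only in the DEVICE= line, they can be generated rather than typed. A minimal sketch; slave_cfg is a hypothetical helper and it only prints the file contents, so nothing on the system is touched:

```shell
# Print an ifcfg file for one member of a bond.
# usage: slave_cfg <device> <bond>
# slave_cfg is a made-up name; the settings are the ones shown above.
slave_cfg() {
    cat <<EOF
DEVICE=$1
BOOTPROTO=none
MASTER=$2
SLAVE=yes
ONBOOT=yes
USERCTL=no
EOF
}

for dev in eth0 eth1; do
    echo "# ifcfg-$dev"
    slave_cfg "$dev" bond0
done
```

To write the files for real, redirect each call to /etc/sysconfig/network-scripts/ifcfg-$dev as root.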

The last step is to configure the bond0 interface itself:

[root@bigfoot etc]# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
IPADDR=192.168.11.200
NETMASK=255.255.255.0
NETWORK=192.168.11.0
ONBOOT=yes
USERCTL=no

That is all. You can now either run /etc/init.d/network restart or reboot the box.

This time I actually ran into a problem, where the physical interfaces were not being “enslaved” properly:

May 4 11:17:40 bigfoot kernel: ADDRCONF(NETDEV_UP): bond0: link is not ready
May 4 11:17:40 bigfoot kernel: bonding: bond0: Adding slave eth0.
May 4 11:17:40 bigfoot kernel: bonding: bond0: enslaving eth0 as an active interface with a down link.
May 4 11:17:40 bigfoot kernel: bonding: bond0: link status definitely up for interface eth0.
May 4 11:17:40 bigfoot kernel: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
May 4 11:17:40 bigfoot kernel: bonding: bond0: Adding slave eth1.
May 4 11:17:40 bigfoot kernel: bonding: bond0: enslaving eth1 as an active interface with a down link.
May 4 11:17:45 bigfoot kernel: bonding: bond0: Removing slave eth0
May 4 11:17:45 bigfoot kernel: bonding: bond0: Warning: the permanent HWaddr of eth0 - 00:11:43:D8:AF:63 - is still in use by bond0. Set the HWaddr of eth0 to a different address to avoid conflicts.
May 4 11:17:45 bigfoot kernel: bonding: bond0: releasing active interface eth0
May 4 11:17:47 bigfoot kernel: bonding: bond0: Adding slave eth0.
May 4 11:17:48 bigfoot kernel: bonding: bond0: Warning: failed to get speed and duplex from eth0, assumed to be 100Mb/sec and Full.
May 4 11:17:48 bigfoot kernel: bonding: bond0: enslaving eth0 as an active interface with an up link.

I have never had this problem before, and a quick googlage revealed that I am not alone. I came across this guy who had the same problem. He also links to the solution. Basically, it seems Xen is causing the issue, and to fix it you need to edit /etc/xen/xend-config.sxp and force the network device to be used for the network bridge in Xen:

(network-script 'network-bridge netdev=bond0')

Once I had that in place, everything worked as advertised. Oh, and for thorough documentation, check out the Documentation directory included with the kernel source. The file is called bonding.txt. Here is an online version of it.

Getting a handle on log files

Starting with Solaris 9, there is a very handy tool called logadm that makes management of any log files a breeze. The syslog and messages files, among others, are managed by logadm, which is called from root’s crontab.
Logadm reads the /etc/logadm.conf file to figure out what it needs to do. By default, there are the following entries in logadm.conf:

/var/log/syslog -C 8 -P 'Wed Apr 8 02:10:22 2009' -a 'kill -HUP `cat /var/run/syslog.pid`'
/var/adm/messages -C 4 -P 'Fri Apr 10 02:10:15 2009' -a 'kill -HUP `cat /var/run/syslog.pid`'
/var/cron/log -c -s 512k -t /var/cron/olog
/var/lp/logs/lpsched -C 2 -N -t '$file.$N'
/var/fm/fmd/errlog -M '/usr/sbin/fmadm -q rotate errlog && mv /var/fm/fmd/errlog.0- $nfile' -N -s 2m
smf_logs -C 8 -s 1m /var/svc/log/*.log
/var/adm/pacct -C 0 -N -a '/usr/lib/acct/accton pacct' -g adm -m 664 -o adm -p never
/var/log/pool/poold -N -a 'pkill -HUP poold; true' -s 512k

Logadm provides the -w switch, which writes an entry into logadm.conf that reflects the current command line arguments. Of course, logadm.conf can also be edited with a text editor, if that is the preferred method. In that case, the -V option can validate the syntax of logadm.conf for you. Another handy option is -n, which causes logadm to do a dry run without actually performing the log rotation.
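Put together, adding a new log to the rotation could look roughly like the sketch below. The log path /var/log/myapp.log and the helper names are made up for illustration. By default the script only prints the logadm command lines, so it can be reviewed on a machine without logadm; setting RUN=1 would execute them.

```shell
# Print each command instead of running it unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "$*"
    fi
}

# usage: add_rotation <logfile> <copies_to_keep> <size_threshold>
# add_rotation is a hypothetical wrapper around three logadm calls.
add_rotation() {
    run logadm -w "$1" -C "$2" -s "$3"   # write the entry to logadm.conf
    run logadm -V                        # validate logadm.conf
    run logadm -n "$1"                   # dry run for this log only
}

add_rotation /var/log/myapp.log 8 10m
```

Here -C is the number of old copies to keep and -s the size the log has to reach before it is rotated.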

Other useful switches are:

  • -b and -a specify commands to execute before and after rotation
  • -e sends error messages to a specific email address instead of to the owner of the crontab
  • -r removes the entry for a specific log file from logadm.conf
  • -o sets the owner of the new log file, if it should differ from the original
  • -g sets the group of the new log file, if it should differ from the original
  • -m sets the permissions of the new log file, if they should differ from the original

For the whole story on logadm, check out the logadm man page.


Solaris Link Aggregation

Link aggregation takes a bunch of network interfaces and creates a big pipe out of them.

Aggregation also provides redundancy. If all interfaces but one go down, the server will remain connected to the network.

Before starting make sure that:

  • interfaces to be aggregated are of one of the following types: xge, e1000g or bge
  • interfaces to be aggregated are not plumbed
  • they run in full duplex mode at the same speeds
  • eeprom’s local-mac-address? variable is set to true

The following will create the aggr1 interface with bge1 as one of its members:

bash-3.00# dladm create-aggr -d bge1 1

Next, plumb the aggregate interface, configure an IP address on it and bring it up:

bash-3.00# ifconfig aggr1 plumb 192.168.1.5 netmask 255.255.255.0 up

At this point you can list aggregations:

bash-3.00# dladm show-aggr
key: 1 (0x0001) policy: L4      address: 0:3:ba:56:7f:ba (auto)
           device       address                 speed           duplex  link    state
           bge1         0:3:ba:56:7f:ba   0     Mbps    unknown down    standby

Now add bge0 as a second member of the aggr1 aggregation and list the aggregate interfaces:

bash-3.00# dladm add-aggr -d bge0 1
bash-3.00# dladm show-aggr
key: 1 (0x0001) policy: L4 address: 0:3:ba:56:7f:ba (auto)
           device       address                 speed           duplex  link    state
           bge1         0:3:ba:56:7f:ba   0     Mbps    unknown down    standby
           bge0         0:3:ba:56:7f:b9   1000  Mbps    full    up      attached

To keep the configuration persistent across reboots, create /etc/hostname.aggr1 with appropriate content and remove any hostname.* files pertaining to the interfaces that are now members of aggr1.

For link aggregation to work properly, the switch to which the server is connected needs to be properly configured with LACP.

Another thing to consider is the load balancing policy for outgoing traffic. You can load balance on layers 2, 3 and 4. The policy can be changed using the dladm command. Here is a quick example that changes the load balancing policy to a combination of L3 and L4:

bash-3.00# dladm modify-aggr -P L3,L4 1
bash-3.00# dladm show-aggr -L
key: 1 (0x0001) policy: L3,L4 address: 0:3:ba:56:7f:ba (auto)
                LACP mode: off  LACP timer: short
    device    activity timeout aggregatable sync  coll dist defaulted expired
    bge1      passive  short   yes          no    no   no   no        no
    bge0      passive  short   yes          no    no   no   no        no

And finally, a command that lets you see the utilisation of individual links within the aggregation. Note the %ipkts column; I did not have LACP turned on on the switch at that time:

bash-3.00# dladm show-aggr -s
key: 1          ipackets  rbytes      opackets  obytes      %ipkts  %opkts
           Total  2723785   2287233197  1481682   710633551
           bge1   618712    115674760   870443    636559150   22.7    58.7
           bge0   2105073   2171558437  611239    74074401    77.3    41.3
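The percentage columns are simply each link's packet count divided by the total for the aggregation. A quick check of the figures above using awk; pkt_pct is a hypothetical helper:

```shell
# Share of the aggregate's packets carried by one link, in percent.
# usage: pkt_pct <link_packets> <total_packets>
pkt_pct() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * n / t }'
}

pkt_pct 618712 2723785    # bge1 inbound
pkt_pct 2105073 2723785   # bge0 inbound
pkt_pct 870443 1481682    # bge1 outbound
```

These print 22.7, 77.3 and 58.7, matching the %ipkts and %opkts values reported by dladm.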

More info on link aggregation is here.

Growing a mirrored LUN in RedHat

I was putting a RedHat server onto a SAN and could not find any clear instructions on how to grow a single mirrored LUN on the fly. Anyway, here are some notes on the process. First the setup: two LUNs mirrored across two SANs with an LVM volume on top. I could have easily just presented another set of mirrored LUNs, added them to the VG and gone from there. I wanted to avoid that, as that kind of setup can quickly get out of hand as the number of presented LUNs grows. If there is a more “sensible” and flexible setup, I would most definitely want to know about it.

For the sake of completeness, here are the steps to recreate the initial setup I had:

  1. Create a mirror from two LUNs
  2. Use the mirror as a PV
  3. Create a VG using the PV
  4. Create an LV on top of the VG
  5. Make an ext3 filesystem on top of the LV and mount it

Here are the actual steps with some output:

[root@ultra /]# mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/mapper/mpath4 /dev/mapper/mpath5
mdadm: array /dev/md10 started.
[root@ultra /]# pvcreate /dev/md10
Physical volume "/dev/md10" successfully created
[root@ultra /]# vgcreate testvg /dev/md10
Volume group "testvg" successfully created
[root@ultra /]# lvcreate -l+100%FREE -n testlv testvg
Logical volume "testlv" created
[root@ultra /]# mkfs -t ext3 /dev/testvg/testlv
[root@ultra /]# mount /dev/testvg/testlv /tmp/test

Now the resizing part. There are a few steps, but the upshot is that the filesystem can stay mounted and in use. A high-level overview of the steps to take:

  1. Grow the two LUNs using SAN management software
  2. Fail and remove one of the submirrors
  3. Force the kernel to see the size increase of the submirror
  4. Flush and recreate the multipath device map so multipathing sees the new size
  5. Re-add the submirror to the mirror and let it sync
  6. Repeat 2-5 for the second submirror
  7. Grow the mirror device
  8. Resize the PV
  9. Resize the LV
  10. Resize the filesystem

First, you fail and remove the submirror:

[root@ultra /]# mdadm /dev/md10 -f /dev/mapper/mpath4 -r /dev/mapper/mpath4
mdadm: set /dev/mapper/mpath4 faulty in /dev/md10
mdadm: hot removed /dev/mapper/mpath4

Now, note all the paths to the LUN. The kernel sees a separate device at the end of each path to a LUN. In this case they are sdj, sdt, sdg and sdq.

[root@ultra /]# multipath -ll mpath4
mpath4 (3600508b400011c300000f000008d0000)
[size=12 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=100][active]
 \_ 1:0:3:1 sdj 8:144 [active][ready]
 \_ 2:0:3:1 sdt 65:48 [active][ready]
\_ round-robin 0 [prio=20][enabled]
 \_ 1:0:2:1 sdg 8:96 [active][ready]
 \_ 2:0:2:1 sdq 65:0 [active][ready]

At this point the problem is getting the kernel to recognize the new size without a reboot. After a lot of trying and sifting through man pages, I found that the blockdev command does the magic. Then I googled “blockdev resize” and found this, confirming my finding. So, the next step is to probe all logical paths to the LUN:

[root@ultra /]# blockdev --rereadpt /dev/sdj
[root@ultra /]# blockdev --rereadpt /dev/sdt
[root@ultra /]# blockdev --rereadpt /dev/sdg
[root@ultra /]# blockdev --rereadpt /dev/sdq
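With more than a handful of paths, typing each device by hand gets error-prone. A sketch that pulls the sd device names out of multipath -ll style output; path_devs is a hypothetical helper, and the sample lines are in the shape of the mpath4 listing, fed in as a here-document so the parsing can be tried without a SAN:

```shell
# Print every sdX device name found on stdin, one per line.
path_devs() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^sd[a-z]+$/) print $i }'
}

path_devs <<'EOF'
\_ round-robin 0 [prio=100][active]
 \_ 1:0:3:1 sdj 8:144 [active][ready]
 \_ 2:0:3:1 sdt 65:48 [active][ready]
\_ round-robin 0 [prio=20][enabled]
 \_ 1:0:2:1 sdg 8:96 [active][ready]
 \_ 2:0:2:1 sdq 65:0 [active][ready]
EOF
```

On the real system this would be something along the lines of multipath -ll mpath4 | path_devs | while read dev; do blockdev --rereadpt /dev/$dev; done, run as root.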

You should see messages in /var/log/messages about the kernel seeing the new size on all paths. If you were to issue multipath -ll right now, you would see that multipathing is still reporting the old size. To fix that, flush the device map of the LUN and then recreate it:

[root@ultra /]# multipath -f mpath4
[root@ultra /]# multipath -v2
create: mpath4 (3600508b400011c300000f000008d0000)
[size=13 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [prio=100]
 \_ 1:0:3:1 sdj 8:144 [ready]
 \_ 2:0:3:1 sdt 65:48 [ready]
\_ round-robin 0 [prio=20]
 \_ 1:0:2:1 sdg 8:96 [ready]
 \_ 2:0:2:1 sdq 65:0 [ready]

Multipathing should be reporting the new size. Now you are ready to put back the grown submirror and let the whole mirror sync:

[root@ultra /]# mdadm /dev/md10 -a /dev/mapper/mpath4
mdadm: hot added /dev/mapper/mpath4
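The re-added submirror has to finish rebuilding before the other one is pulled, otherwise the only complete copy of the data goes away. A sketch of a check that reads /proc/mdstat style text and reports whether a rebuild is still running; md_syncing is a hypothetical helper, and the sample text is illustrative rather than captured from this server:

```shell
# Succeeds (exit 0) while stdin shows a resync or recovery in progress.
md_syncing() {
    grep -Eq '(resync|recovery) *='
}

# Illustrative sample; on a live system use: md_syncing < /proc/mdstat
if md_syncing <<'EOF'
md10 : active raid1 dm-5[2] dm-4[0]
      12582784 blocks [2/1] [U_]
      [=>...................]  recovery =  8.5% (1070400/12582784) finish=3.2min
EOF
then
    echo "still syncing"
else
    echo "in sync"
fi
```

Looping with a sleep until md_syncing < /proc/mdstat fails is one simple way to wait for the mirror.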

When the mirror has synced up, repeat the above process for the second submirror and wait for the sync to finish. Time to grow the mirror device itself:

[root@ultra /]# mdadm --grow /dev/md10 --size=max

After completion, /proc/mdstat should report the increased size of /dev/md10. Moving on, you need to grow the PV that resides on /dev/md10:

[root@ultra /]# pvresize /dev/md10
Physical volume "/dev/md10" changed
1 physical volume(s) resized / 0 physical volume(s) not resized

And finally, you need to resize the LV:

[root@ultra /]# lvresize -l+100%FREE testvg/testlv
Extending logical volume testlv to 13.00 GB
Logical volume testlv successfully resized

Of course, don’t forget to grow the filesystem itself:

[root@ultra /]# ext2online /dev/testvg/testlv
ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b
[root@ultra /]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-rootlv
                     132304280   5104976 120478588   5% /
/dev/md0                132134     32791     92521  27% /boot
none                   8202920         0   8202920   0% /dev/shm
/dev/mapper/testvg-testlv
                      13413488     63516  12668820   1% /tmp/test

That should be it. The sync time for huge volumes is something to keep in mind. The whole setup is clean and neat, without clutter. I could have opted to mirror using LVM, but there seems to be a strange requirement for a third, log volume. It is possible to keep the log in memory, but that supposedly causes a resync on boot.
