SUT Operations
This tutorial demonstrates working with test machines ('System Under Test' or SUT) in Crucible. SUT operations include remotely rebooting a machine, patching and installing its kernel, viewing the console logs, manually running tests, and tips for troubleshooting common problems.
Contents:
- Logging into the Crucible Driver
- Requeuing a Test Run
- Locking a SUT
- Remotely Power Cycling a SUT
- Logging into a SUT
- Rebuilding and Reinstalling the Kernel
- Booting the SUT to a Given Kernel
- Watching a SUT's Console
- Starting Test-Specific Services
Logging into the Crucible Driver
SUT operations are performed on the Driver system, which schedules test runs and provides other services to the SUTs. From the Driver you can perform various operations on the SUTs, as well as SSH into the SUTs themselves.
In order to access the Driver system, you will need to have a login account and your SSH public key on the Driver. If you don't have one or think there may be a problem with your account, contact the Crucible administrator.
If your account is set up correctly, you can log in through ssh. E.g.:
$ ssh my_name@crucible.osdl.org
Crucible's tools are generally installed into a special directory, such as /testing/usr/bin. Check that this is included in your $PATH variable. A quick way to do this is the command:
$ which sut
/testing/usr/bin/sut
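If it's not there, you can add it for your current session; the export below assumes a Bourne-style shell such as bash:
$ export PATH=$PATH:/testing/usr/bin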
Most sut operations require that you have sudo access; if you ever get permission denied errors, that's most likely what's going on.
Requeuing a Test Run
If you are investigating an issue found during a previous test run, you may find it worthwhile to rerun the test. This both verifies that the issue still exists on the system and sets up the right environment for you to investigate.
There are two ways to requeue a test. If you know the Test Run ID, you can requeue it from the test driver:
$ testrun requeue 510
Alternatively, you can queue it using the package name and patch or software name:
$ queue_package nfsv4/linux-2.6.17-rc5-g1a2098e-server-cluster-locking-api.diff
You can look in /testing/packages/ for available packages and patches.
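For example, to list what's available under the nfsv4 subdirectory used above:
$ ls /testing/packages/nfsv4/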
Watch `sut status` and `testrun status` to see what got queued, and when it begins to run on the SUTs.
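If you'd rather not rerun those commands by hand, the standard watch utility works fine for polling:
$ watch -n 60 sut status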
Note that in both cases, it is possible that additional test runs will be queued up. If you like, you can cancel the unneeded runs like this:
$ testrun cancel 1234
While a test is running, you can review its progress via this command:
$ testrun info 1232
Locking a SUT
To take a SUT out of service and make it stop running tests, use the sut script.
$ sudo /testing/usr/bin/sut lock nfs04
You can review the state of all the machines like this:
$ sut status
SUT    RUN   STATE     PKG
amd01  947   finished  patch-2.6.17-git1.bz2
ita01        unknown   unknown
nfs02  932   running   linux-2.6.17-g9eb516f-nfs-server-stable.diff
nfs03  932   running   linux-2.6.17-g9eb516f-nfs-server-stable.diff
nfs04  LOCK  unknown   unknown
nfs05  945   finished  linux-2.6.17-rc1-CITI_NFS4_ALL-1.diff
nfs06  891   finished  linux-2.6.17-g4bee93e-nfs-server-stable.diff
nfs07  891   finished  linux-2.6.17-g4bee93e-nfs-server-stable.diff
nfs08  LOCK  unknown   unknown
nfs09  LOCK  unknown   unknown
nfs10  949   finished  patch-2.6.17-git1.bz2
nfs11  897   finished  cairo-1.2.0.tar.gz
nfs12  929   running   linux-2.6.17-rc5-g1a2098e-server-cluster-locking-api.diff
nfs13  929   running   linux-2.6.17-rc5-g1a2098e-server-cluster-locking-api.diff
ppc01  879   finished  linux-2.6.17-gdbd8524-nfs-server-stable.diff
The SUT will finish up whatever step it was working on for its last testrun, and then become idle.
Don't forget to unlock the SUT when you're done with it:
$ sudo /testing/usr/bin/sut unlock nfs04
Remotely Power Cycling a SUT
Inevitably, a test will eventually put a machine into a bad state and lock it up. You can power cycle a machine using the sut script. For example, to power cycle 'nfs04', you'd do:
$ sudo /testing/usr/bin/sut power nfs04
You can review the power status like this:
$ sudo /testing/usr/bin/sut power nfs04 status
[Note that some systems may not have remote power control set up, either because the machine's owner doesn't want to allow automated reboots, or because of a lack of hardware or software support for power management. In any case, you can see if/how the SUT does power management by looking for a script /testing/suts/
Logging into a SUT
To log into a SUT, first log into the Driver, then from there you can ssh into the SUT directly, as root:
$ ssh my_name@crucible.osdl.org
$ ssh root@nfs04
#
The root password for SUTs is simply 'password'.
If you want to kill off any lingering test processes (in case they would interfere with what you'll be working on), you can find them via:
# ps aux | grep RUNNING
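Once you've spotted the offending PIDs you can kill them individually, or use pkill; the pattern below assumes the test harness keeps 'RUNNING' in the process command line, as the ps invocation above suggests:
# pkill -f RUNNING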
Rebuilding and Reinstalling the Kernel
Kernels are unpacked by Crucible into /usr/src/linux-*, and installed into /boot/kernel-*.
It is good practice to create a separate copy of whatever kernel tree you'll be hacking on, to avoid confusion later:
# cp -r /usr/src/linux-2.6.17-rc6 /usr/src/linux-2.6.17-rc6-bryce-1
It's also good practice to set the EXTRAVERSION so there won't be issues with conflicting module paths.
$ vi /usr/src/linux-2.6.17-rc6-bryce-1/Makefile
SUBLEVEL = 17
EXTRAVERSION = -rc6-bryce
NAME=Crazed Snow-Weasel
If you like, you can manually compile and install the kernel using the normal Linux kernel build process, and update the bootloader configuration files to suit. Just be careful not to change the default boot option! Otherwise, if the new kernel fails to boot, you won't be able to recover remotely.
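For reference, a manual build typically looks something like this (directory name taken from the example above; copying in the config is optional if the tree already has a .config):
# cd /usr/src/linux-2.6.17-rc6-bryce-1
# cp /testing/packages/linux/config.default .config
# make oldconfig
# make
# make modules_install
# make install
Depending on the distribution, make install may itself touch the bootloader configuration, so double-check afterwards that the default entry still points at a known-good kernel.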
If you don't want to do things quite so manually, Crucible's kernel management commands are at your disposal. This can be especially useful if you suspect the issue may be related to how Crucible is building the kernel. In any case, here's an example of how to use them:
# build_kernel [kernel-dir] [kernel-label] [config-file] [kernel-args]
# build_kernel linux-2.6.17-rc6-bryce-1 \
      2.6.17-r6b1 \
      /testing/packages/linux/config.default \
      mem=512M
The arguments to build_kernel are 1) the directory name you gave above, 2) some short tag to use as the bootloader entry's title, 3) the config file to use, and 4) any kernel arguments you wish to use.
You can determine which config file a given testrun used by looking at the log file for that machine; it is printed within the first line or two after "### RUNNING '010-build_kernel' ###". linux/config.default is generally used for x86 systems, but if the test needed special config settings or ran on non-x86 hardware, a different config may be used.
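If you've already located that machine's log file, a quick grep saves some scrolling (the file name below is just a placeholder for wherever your testrun logs end up):
$ grep -A 2 "RUNNING '010-build_kernel'" nfs04.log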
Of course, you can also specify your own config file. ;-)
build_kernel will configure, build, and install the kernel and its modules, update the bootloader, and create an initrd if appropriate.
Booting the SUT to a Given Kernel
Once you've built and installed a kernel, you can boot to it using the boottool command:
# /testing/usr/bin/boottool --boot-once --title 2.6.17-r6b1
# reboot && logout
This is analogous to doing 'lilo -R 2.6.17-r6b1' (in fact, on systems using lilo, that's exactly what it does).
Note that some bootloaders (e.g. elilo & yaboot) don't yet have a boot-once capability, so to boot them you'll have to set the default kernel to your test kernel, reboot, and pray. Contact the administrator if a kernel fails to boot on one of these machines.
Watching a SUT's Console
From the Driver, you can view a SUT's console via 'console':
$ console nfs04
Helpful commands:
  ^E c ?  - help
  ^E c p  - replay log
  ^E c u  - host status
  ^E c .  - disconnect
The consoles are also logged to /var/consoles/ on the Driver.
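If you just want to follow the log rather than attach to the console, something like this works (the per-SUT file name is an assumption; check the directory for the exact naming):
$ tail -f /var/consoles/nfs04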
Starting Test-Specific Services
Some tests start up services on bootup independently of the OS's regular init system. If you manually boot the system, you'll probably need to start these services manually as well. For instance, NFSv4 test runs will require:
# /testing/usr/bin/init_nfsv4_svcs
You can look at `testrun info $run_id` to see what post-boot actions are performed. These steps correspond to scripts you can usually find in the /testing/runs/$run_id/FINISHED/$sut_id/ directory.
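For example, using the testrun and SUT from earlier in this tutorial:
$ ls /testing/runs/1232/FINISHED/nfs04/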