Playing with SystemTap

Published 07-06-2012 00:00:00

From the official website:

   SystemTap is a tracing and probing tool that allows users to study and monitor the activities of the
   computer system (particularly, the kernel) in fine detail. It provides information similar to the output of
   tools like netstat, ps, top, and iostat; however, SystemTap is designed to provide more filtering
   and analysis options for collected information.

Install

Install the systemtap package along with the debug symbols and headers for the currently running kernel.

$ sudo apt-get install systemtap linux-image-`uname -r`-dbg linux-headers-`uname -r`

To check the installation:

$ sudo stap -v -e 'probe vfs.read {printf("read performed\n"); exit()}'
Pass 1: parsed user script and 81 library script(s) using 78600virt/22436res/2512shr kb, in 100usr/0sys/125real ms.
Pass 2: analyzed script: 1 probe(s), 1 function(s), 3 embed(s), 0 global(s) using 276868virt/117456res/7792shr kb, in 1190usr/220sys/7225real ms.
Pass 3: translated to C into "/tmp/stapQvI4gP/stap_347bddcb57970a4c3f9ee4e0705f0f68_1473_src.c" using 267024virt/112808res/5676shr kb, in 10usr/20sys/31real ms.
Pass 4: compiled C into "stap_347bddcb57970a4c3f9ee4e0705f0f68_1473.ko" in 5410usr/840sys/12991real ms.
Pass 5: starting run.
read performed
Pass 5: run completed in 0usr/20sys/499real ms.

How it works

  1. read a script written in the SystemTap language
  2. translate it to C source code
  3. compile that code into a Linux kernel module
  4. load the module and run it
  5. unload the module

The stap command drives all of these steps.
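
The load and unload steps can be observed from inside a script. The sketch below is only an illustration: it uses the begin and end probes, which fire at session startup and shutdown, and a timer probe to end the session. The messages and the two-second delay are arbitrary choices, not part of the original example.

```systemtap
# Fires once, right after the module has been loaded and the session starts.
probe begin {
    println("session started")
}

# Ask SystemTap to stop after roughly two seconds.
probe timer.s(2) {
    exit()
}

# Fires once during shutdown, just before the module is unloaded.
probe end {
    println("session ending")
}
```

Run it with sudo stap -v on the saved file to see the two messages appear between the "Pass 5" lines.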

Scripts

A SystemTap script is of the following form:

function function_name(arguments) {statements}
 
probe event1, event2, ..., eventn {
    function_name(args)
}

Read this as: when any one of the listed events occurs, execute the statements in the probe body, here function_name(args).
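
As a minimal sketch of a probe attached to several events, the script below reports both mkdir and rmdir system calls through one handler. The report function is a hypothetical helper introduced for illustration. It relies on the name variable that the syscall tapset sets for each syscall probe, and on the built-in execname() and pid() functions.

```systemtap
# Hypothetical helper: print which event fired and who triggered it.
function report(what) {
    printf("%s called by %s (pid %d)\n", what, execname(), pid())
}

# The same body runs when either event occurs.
probe syscall.mkdir, syscall.rmdir {
    report(name)
}
```

There is no exit() call here, so the script keeps running until it is interrupted with Ctrl-C.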

First example

Let’s print a message when the mkdir system call is invoked. Save the following as mkdir_probe.stp:

function fancy_print(text) {
    printf("**** %s ****\n", text);
}

probe syscall.mkdir {
    fancy_print("mkdir syscall called with pathname: " . pathname);
    exit();
}

Run the script:

$ sudo stap -v mkdir_probe.stp
Pass 1: parsed user script and 81 library script(s) using 78596virt/22424res/2508shr kb, in 100usr/10sys/104real ms.
Pass 2: analyzed script: 1 probe(s), 2 function(s), 1 embed(s), 0 global(s) using 205108virt/118440res/75184shr kb, in 310usr/50sys/399real ms.
Pass 3: translated to C into "/tmp/stappLnq6v/stap_0de87453a9617d4db8e5f589bf21474f_1770_src.c" using 205108virt/118548res/75292shr kb, in 0usr/0sys/3real ms.
Pass 4: compiled C into "stap_0de87453a9617d4db8e5f589bf21474f_1770.ko" in 1170usr/150sys/1407real ms.
Pass 5: starting run.

Then run a mkdir command in another terminal (for example, mkdir /tmp/foo) to trigger the probe; the script prints its message and exits:

**** mkdir syscall called with pathname: /tmp/foo ****
Pass 5: run completed in 10usr/20sys/14869real ms.

Where does the pathname variable come from? SystemTap ships a library of predefined functions and probe aliases (tapsets), typically under /usr/share/systemtap/tapset/. The file /usr/share/systemtap/tapset/syscalls.stp contains the following mkdir section:

    # mkdir ______________________________________________________
    # long sys_mkdir(const char __user * pathname, int mode)
    probe syscall.mkdir = kernel.function("sys_mkdir").call
    {
            name = "mkdir"
            pathname_uaddr = $pathname
            pathname = user_string($pathname)
            mode = $mode
            argstr = sprintf("%s, %#o", user_string_quoted($pathname), $mode)
    }

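This probe alias defines pathname, along with name, pathname_uaddr, mode, and argstr, so all of them are available in the body of any probe on syscall.mkdir.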
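
The other variables from that alias can be used in the same way. The following sketch, assuming the tapset definition shown above, prints the path and the mode in octal for each mkdir call. The output format is an arbitrary choice.

```systemtap
# name, pathname and mode all come from the syscall.mkdir alias in the tapset.
probe syscall.mkdir {
    printf("%s: pathname=%s mode=%#o\n", name, pathname, mode)
}
```

Running stap -L 'syscall.mkdir' should list the variables available at that probe point without having to open the tapset file.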

iotop with SystemTap (taken from http://sourceware.org/systemtap/examples/io/iotop.stp)

global reads, writes, total_io

probe vfs.read.return {
    reads[execname()] += bytes_read
}

probe vfs.write.return {
    writes[execname()] += bytes_written
}

# print top 10 IO processes every 5 seconds
probe timer.s(5) {
    foreach (name in writes)
        total_io[name] += writes[name]
    foreach (name in reads)
        total_io[name] += reads[name]
    printf ("%16s\t%10s\t%10s\n", "Process", "KB Read", "KB Written")
    foreach (name in total_io- limit 10)
        printf("%16s\t%10d\t%10d\n", name,
               reads[name]/1024, writes[name]/1024)
    delete reads
    delete writes
    delete total_io
    print("\n")
}

  • the .return suffix means the probe fires when the probed function returns, so the number of bytes transferred is known,
  • bytes_read and bytes_written are predefined variables provided by the tapset,
  • timer.s is a probe that fires periodically, in our case every 5 seconds,
  • the minus sign after total_io in the foreach statement sorts the entries by value in descending order, and limit 10 stops the loop after the first 10 entries.
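
The sorting and limiting behaviour can be tried in isolation. This is a small sketch with made-up keys and counts, unrelated to real I/O data: it fills a global array in a begin probe and prints the two largest entries.

```systemtap
global hits

probe begin {
    # Made-up sample values, for illustration only.
    hits["a"] = 3
    hits["b"] = 10
    hits["c"] = 7

    # "hits-" sorts by value, descending; "limit 2" keeps the first two.
    foreach (key in hits- limit 2)
        printf("%s %d\n", key, hits[key])
    # prints "b 10" then "c 7"

    exit()
}
```

Writing hits+ instead sorts in ascending order.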

Running the script periodically prints iotop-like output:

         Process	   KB Read	KB Written
            mocp	       199	         0
            Xorg	       161	         0
         firefox	         2	         5
             psi	         1	         0
          stapio	         0	         0
           urxvt	         0	         0
      parcellite	         0	         0
            tmux	         0	         0
        beam.smp	         0	         0
 plugin-containe	         0	         0