1. Introduction

You're working away, and suddenly your edit session goes to hell because /tmp is full.

It's one thing when it happens on your workstation; it's a much bigger deal on a server with actual paying customers. Here are some scripts to make your life easier.

2. Sudden disk space shortages

2.1. Cron is your friend

Use something like this for your crontab file if you want to check disk space every 5 minutes around the clock:

# Set these environment variables within cron:
CRON=yes
#    Scripts know when they're being run via cron.
#
MAILTO=yourname
#    Who to send mail to.  Leave blank if no mail is to be sent.
#
# Environment variables set by cron:
#   SHELL=/bin/sh
#   USER=yourname
#   PATH=/usr/bin:/bin
#   PWD=/home/yourname
#   SHLVL=1
#   HOME=/home/yourname
#   LOGNAME=yourname
#
# To test, uncomment this line:
## *    *     *   *   *    /bin/env > /tmp/env$$
#===================================================================
# Everything on a line is separated by blanks or tabs.
#
#+----------------------------- Minute (0-59)
#|     +----------------------- Hour   (0-23)
#|     |     +----------------- Day    (1-31)
#|     |     |   +------------- Month  (1-12)
#|     |     |   |   +--------- Day of week (0-6, 0=Sunday)
#|     |     |   |   |    +---- Command to be run
#|     |     |   |   |    |
#v     v     v   v   v    v
#===================================================================
# Keep an eye on drives and disk space.  Run every 5 minutes.
4-59/5 *     *   *   *    $HOME/cron/checkdrives

2.2. checkdrives script

#!/bin/ksh
#
# $Revision: 1.3 $ $Date: 2010-11-10 13:20:42-05 $
# $UUID: ca583930-f781-3100-b878-5542c05bace9 $
#
#<checkdrives: send mail if a filesystem gets too full
# Try to avoid depending on GNU software being installed.

PATH=/bin:/usr/bin
BLOCKSIZE=1m
BLOCK_SIZE=1048576
export PATH BLOCKSIZE BLOCK_SIZE
tag=${0##*/}

# Portability and configuration stuff here.
subject='drives getting full'
to='admin-urgent'
host=$(hostname | cut -f1 -d.)
work=$HOME/var/drives
max=96  # this percent or more == drive is too full.

# What df should we use?
case "$(uname -s)" in
    SunOS)    DF='/usr/xpg4/bin/df -F ufs' ;;
    FreeBSD)  DF='df -t ufs' ;;
    *)        DF='df' ;;
esac

# Real work starts here.  Run df, skip the header, kill %-sign,
# and list filesystems that are too full.
filesys=$($DF | sed -n -e '2,$p' |
          tr -d '%' | awk -v max=$max '$5 >= max {print $6}')

case "X$filesys" in
    X)  exit 0 ;;
    *)  ;;
esac

# Keep current and previous drive status.
if test ! -d $work; then
    mkdir -p $work 2> /dev/null

    if test ! -d $work; then
        echo "$host: $tag: mkdir $work failed" | mailx $to
        exit 1
    fi
fi

# Don't send the same message repeatedly.
cd $work
(echo $host; $DF $filesys) > cur

if test -f prev; then
    cmp -s cur prev || mailx -s "$host $subject" $to < cur
else
    mailx -s "$host $subject" $to < cur
fi

mv cur prev
exit 0

Notice that the script uses mail to tell you about problems; just replace mailx with something to send a popup message if you're running this on the same host that's being checked.

If you have several hosts to keep track of, it's better to set up a mail address that will automatically send you a popup message or alert of some type upon receipt of a message. Procmail will handle that very nicely.
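
If you'd like to try the threshold filter without waiting for a disk to fill up, here's a minimal sketch that runs the same kind of awk test against canned input. The function name overfull and the sample figures are made up for illustration, and the input is assumed to look like "df -P" output (one line per filesystem, percentage in field 5, mount point in field 6):

```shell
#!/bin/sh
# Sketch only: print mount points whose Use% is at or above a limit.
# Reads "df -P"-style output on stdin.  overfull() is a hypothetical name.
overfull () {
    # $1 = percent threshold (defaults to 96, the value checkdrives uses)
    awk -v max="${1:-96}" '
        NR > 1 {                  # skip the header line
            pct = $5
            sub(/%$/, "", pct)    # "97%" -> "97"
            if (pct + 0 >= max + 0) print $6
        }'
}

# Canned df output (invented numbers), so no full disk is needed.
sample='Filesystem 1048576-blocks Used Available Capacity Mounted on
/dev/sda1 15873 9821 5234 66% /
/dev/sda6 335728 325000 3000 97% /rd01
tmpfs 950 912 38 96% /tmp'

printf '%s\n' "$sample" | overfull 96
```

With a limit of 96 this lists /rd01 and /tmp. Like the real script, it assumes mount points contain no blanks.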

2.3. Popup messages

These can be incredibly annoying, so I don't use them unless there's something requiring immediate attention. If you use the X Window System, have a look at the xalarm package. If not, "write" will do the trick:

#!/bin/ksh
#
# $Revision: 1.3 $ $Date: 2011-09-25 19:51:09-04 $
# $UUID: 97362b8a-57af-3b67-b751-ce8712d62c27 $
#
#<popup: send a quick popup message.

export PATH=/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin
export USER=yourname

# If the user isn't taking calls, exit.
test -f "$HOME/.nopopup" && exit 0

# If no message, exit.
case "$#" in
    0)  exit 0 ;;
    *)  str=${1+"$@"} ;;
esac

# If running under X use xalarm, else use write.
case "$DISPLAY" in
    "") set X $(who | grep pts/ | head -1)
        tty="$3"
        echo "$str" | write $USER $tty
        ;;

    *)  set X $(date)
        today="$4 $3 $5"
        msg=$(echo "$today @ $str" | tr '@' '\012')
        export DISPLAY
        xalarm -name xmemo -time +0 -geometry +20-40 -nowarn "$msg"
        ;;
esac

exit 0
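
Both branches of popup lean on the "set X $(command)" idiom to split a line of output into fields. Here's a small sketch of how that works, using fixed strings in place of live "date" and "who" output. The function names today and first_tty and the sample "who" line are invented for the example:

```shell
#!/bin/sh
# Sketch of the "set X $(command)" idiom that popup relies on.
# The leading X is a throwaway first field: without it, "set" would list
# every shell variable when the command prints nothing, and would take
# output starting with "-" as an option.  Real fields start at $2.

# today() mirrors the xalarm branch: pick day, month and time from date.
today () {
    set X $1            # unquoted on purpose, so the shell splits on blanks
    echo "$4 $3 $5"
}

# first_tty() mirrors the write branch: field 2 of a "who" line is the tty.
first_tty () {
    set X $1
    echo "$3"
}

today 'Sun Sep 25 23:13:25 EDT 2011'
first_tty 'yourname pts/3 2011-09-25 19:51 (:0.0)'
```

One caveat: the field positions depend on how "date" lays out its default output, which can vary with the locale. If that bites you, asking for an explicit format such as date '+%d %b %T' avoids the guesswork.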

3. Long-term monitoring

If you know your system was fine a few hours ago, it's handy to have a timeline to see where things started going to hell. The examples below are run under Linux, but you only need trivial changes to use them under Solaris or FreeBSD.

3.1. adm crontab

Since "adm" is usually responsible for accounting stuff, I run these scripts under that userid. Here's the crontab file:

SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=yourname
HOME=/var/log/sa
#
# Need this for performance log archives
PERFLOG=/var/log/perflog
#===================================================================
# Everything on a line is separated by blanks or tabs.
#
#+--------------------------- Minute (0-59)
#|   +----------------------- Hour   (0-23)
#|   |     +----------------- Day    (1-31)
#|   |     |   +------------- Month  (1-12)
#|   |     |   |   +--------- Day of week (0-6, 0=Sunday)
#|   |     |   |   |    +---- Command to be run
#|   |     |   |   |    |
#v   v     v   v   v    v
#===================================================================
# Run performance log every 10 min.
1-51/10 *  *   *   *    /usr/local/cron/perflog $PERFLOG
#-------------------------------------------------------------------
# Summarize files just before midnight.
55   23    *   *   *    run-parts /etc/cron.perflog

3.2. Directory layout

My /var/log/perflog directory looks like this:

/var/log/perflog:
drwxr-s---   3 adm mis    4096 Sep 24 00:01 2011/
drwxrwsr-x   2 adm mis    4096 Sep 23 23:55 2011.n/

    /var/log/perflog/2011:
    drwxr-s--- 145 adm mis    4096 Sep 24 23:51 0924/

        /var/log/perflog/2011/0924:
        drwxr-s---   2 adm mis    4096 Sep 24 00:01 0001/
        drwxr-s---   2 adm mis    4096 Sep 24 00:11 0011/
        drwxr-s---   2 adm mis    4096 Sep 24 00:21 0021/
        [...]
        drwxr-s---   2 adm mis    4096 Sep 24 23:51 2351/

            /var/log/perflog/2011/0924/0001:
            -rw-r-----   1 adm mis     754 Sep 24 00:01 cache
            -rw-r-----   1 adm mis    1424 Sep 24 00:01 df
            -rw-r-----   1 adm mis     848 Sep 24 00:01 ifconfig
            -rw-r-----   1 adm mis     771 Sep 24 00:01 meminfo
            -rw-r-----   1 adm mis     171 Sep 24 00:01 netstat
            -rw-r-----   1 adm mis    1198 Sep 24 00:01 ping
            -rw-r-----   1 adm mis   10724 Sep 24 00:01 ps
            -rw-r-----   1 adm mis    3245 Sep 24 00:01 smbstatus
            -rw-r-----   1 adm mis     104 Sep 24 00:01 swap
            -rw-r-----   1 adm mis      84 Sep 24 00:01 uname
            -rw-r-----   1 adm mis      71 Sep 24 00:01 uptime

            /var/log/perflog/2011/0924/0011:
            -rw-r-----   1 adm mis     754 Sep 24 00:11 cache
            -rw-r-----   1 adm mis    1424 Sep 24 00:11 df
            -rw-r-----   1 adm mis     848 Sep 24 00:11 ifconfig
            -rw-r-----   1 adm mis     771 Sep 24 00:11 meminfo
            -rw-r-----   1 adm mis     171 Sep 24 00:11 netstat
            -rw-r-----   1 adm mis    1197 Sep 24 00:11 ping
            -rw-r-----   1 adm mis   10776 Sep 24 00:11 ps
            -rw-r-----   1 adm mis    3568 Sep 24 00:11 smbstatus
            -rw-r-----   1 adm mis     104 Sep 24 00:11 swap
            -rw-r-----   1 adm mis      84 Sep 24 00:11 uname
            -rw-r-----   1 adm mis      71 Sep 24 00:11 uptime

            /var/log/perflog/2011/0924/0021:
            -rw-r-----   1 adm mis     754 Sep 24 00:21 cache
            -rw-r-----   1 adm mis    1424 Sep 24 00:21 df
            -rw-r-----   1 adm mis     848 Sep 24 00:21 ifconfig
            -rw-r-----   1 adm mis   58362 Sep 24 01:19 iostat
            -rw-r-----   1 adm mis     771 Sep 24 00:21 meminfo
            -rw-r-----   1 adm mis   10752 Sep 24 01:20 mpstat
            -rw-r-----   1 adm mis     171 Sep 24 00:21 netstat
            -rw-r-----   1 adm mis    1197 Sep 24 00:21 ping
            -rw-r-----   1 adm mis   10580 Sep 24 00:21 ps
            -rw-r-----   1 adm mis    3435 Sep 24 00:21 smbstatus
            -rw-r-----   1 adm mis     104 Sep 24 00:21 swap
            -rw-r-----   1 adm mis      84 Sep 24 00:21 uname
            -rw-r-----   1 adm mis      71 Sep 24 00:21 uptime
            -rw-r-----   1 adm mis   10654 Sep 24 01:19 vmstat

            [...]

            /var/log/perflog/2011/0924/2351:
            -rw-r-----   1 adm mis     754 Sep 24 23:51 cache
            -rw-r-----   1 adm mis    1424 Sep 24 23:51 df
            -rw-r-----   1 adm mis     848 Sep 24 23:51 ifconfig
            -rw-r-----   1 adm mis     771 Sep 24 23:51 meminfo
            -rw-r-----   1 adm mis     171 Sep 24 23:51 netstat
            -rw-r-----   1 adm mis    1197 Sep 24 23:51 ping
            -rw-r-----   1 adm mis   10281 Sep 24 23:51 ps
            -rw-r-----   1 adm mis    2651 Sep 24 23:41 smbstatus
            -rw-r-----   1 adm mis     104 Sep 24 23:51 swap
            -rw-r-----   1 adm mis      84 Sep 24 23:51 uname
            -rw-r-----   1 adm mis      72 Sep 24 23:51 uptime

Each file holds output from one specific command.

For example, the file /var/log/perflog/2011/0924/0001/cache holds output from "vmstat -s" at 12:01am, 9/24/2011:

      1943948  total memory
      1892424  used memory
        49772  active memory
      1806208  inactive memory
        51524  free memory
         6592  buffer memory
      1819552  swap cache
      2096472  total swap
        75892  used swap
      2020580  free swap
    137464830 non-nice user cpu ticks
        45415 nice user cpu ticks
      7180476 system cpu ticks
    651831801 idle cpu ticks
     69403053 IO-wait cpu ticks
        66205 IRQ cpu ticks
       962645 softirq cpu ticks
            0 stolen cpu ticks
   1345948021 pages paged in
    967598390 pages paged out
      4522472 pages swapped in
      4535965 pages swapped out
   1806195040 interrupts
   1877621550 CPU context switches
   1312589033 boot time
      2675983 forks

Every 20 minutes, output from iostat and mpstat is included:

Linux ... (server.com)    09/24/11        _i686_  (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          15.86    0.01    0.95    8.01    0.00   75.18

Device:     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda       60.90    2.41   323.08   119.35     6.99     0.59    9.36   1.67  10.60
sdb       28.92    0.31   789.33    20.64    27.71     0.88   30.22   3.53  10.31
sdc        1.22    0.17   173.42    60.09   168.97     0.06   45.14   2.78   0.38
sdd       38.05    0.28   914.60    68.24    25.64     0.20    5.32   3.40  13.03
sde        0.13    0.14    22.63    51.03   271.50     0.05  166.80   3.48   0.09
sdf        6.47    0.14   675.25    55.67   110.54     0.10   14.80   2.60   1.72
sdg       17.62    0.22   705.99    71.35    43.56     0.59   33.05   4.55   8.11
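
The perflog script itself isn't listed here. As a rough idea of what a collector for this layout could look like, here's a minimal sketch. The function name snapshot and the short command list are assumptions for illustration, not the real script, and it writes to a scratch directory rather than /var/log/perflog:

```shell
#!/bin/sh
# Sketch only: collect one snapshot under <top>/YYYY/MMDD/HHMM.
# snapshot() and the command list are assumptions, not the real perflog.
snapshot () {
    top=$1
    dir=$top/$(date '+%Y/%m%d/%H%M')
    mkdir -p "$dir" || return 1

    # One file per command, named after what it holds.
    # Errors go into the same file so a missing command is visible later.
    df        > "$dir/df"     2>&1
    uptime    > "$dir/uptime" 2>&1
    uname -a  > "$dir/uname"  2>&1
    ps -ef    > "$dir/ps"     2>&1

    echo "$dir"
}

# Try it against a scratch directory instead of the live log tree.
scratch=$(mktemp -d) || exit 1
snapshot "$scratch"
```

Called from cron as "snapshot $PERFLOG" every 10 minutes, something like this would build up one directory per run, which is what makes the timeline easy to walk through afterwards.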

3.3. Nightly summaries

Just before midnight, I like to jam the day's entries into one file to reduce storage space. One easy way is to use "head", which prints a "==> filename <==" banner ahead of each file when you give it more than one:

==> 0924/0001/df <==
Filesystem   1M-blocks   Used Available  Use%  Mounted
/dev/sda1        15873   9821      5234   66%  /
/dev/sda2         7933   2628      4896   35%  /var
/dev/sda5         7933    247      7277    4%  /home
/dev/sda6       335728 252096     66303   80%  /rd01
tmpfs              950      0       950    0%  /dev/shm
tmpfs              950     64       887    7%  /tmp
/dev/sdb6       341144 233036     90780   72%  /rd02
/dev/sdc6       341144 217470    106345   68%  /rd03
/dev/sdd6       341144 225758     98058   70%  /rd04
/dev/sdf6       341144 263800     60015   82%  /rd07
/dev/sdg6       341144 244507     79308   76%  /rd08
/dev/sde6       341144  28520    295295    9%  /rd05

Filesystem      Inodes  IUsed     IFree IUse%  Mounted
/dev/sda1      4198176 446524   3751652   11%  /
/dev/sda2      2097152   5429   2091723    1%  /var
/dev/sda5      2097152   1387   2095765    1%  /home
/dev/sda6     88735744 232375  88503369    1%  /rd01
tmpfs           191235      1    191234    1%  /dev/shm
tmpfs           191235     12    191223    1%  /tmp
/dev/sdb6     44367872 187957  44179915    1%  /rd02
/dev/sdc6     44367872  44983  44322889    1%  /rd03
/dev/sdd6     44367872 276423  44091449    1%  /rd04
/dev/sdf6     44367872 196609  44171263    1%  /rd07
/dev/sdg6     44367872 147284  44220588    1%  /rd08
/dev/sde6     44367872   6438  44361434    1%  /rd05

==> 0924/0001/ifconfig <==
[...]
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:539988 errors:0 dropped:0 overruns:0 frame:0
          TX packets:539988 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:44423496 (42.3 MiB)  TX bytes:44423496 (42.3 MiB)

==> 0924/0001/meminfo <==
MemTotal:      1943948 kB
MemFree:         53076 kB
Buffers:          6180 kB
Cached:        1798304 kB
SwapCached:       3256 kB
[...]

These files compress very nicely under a separate directory tree:

/var/log/perflog/2011.n:
-rw-r--r--   1 adm mis  405844 Jun  1 23:55 0601.xz
-rw-r--r--   1 adm mis  398520 Jun  2 23:55 0602.xz
[...]
-rw-r--r--   1 adm mis  394736 Sep 21 23:55 0921.xz
-rw-r--r--   1 adm mis  429668 Sep 22 23:55 0922.xz
-rw-r--r--   1 adm mis 4629315 Sep 23 23:55 0923

To create the summaries and do the cleanup, I have three scripts that run under their own directory:

me% cd /etc/cron.perflog

me% ls -lF
-rwxr-xr-x 1 root mis 1506 Mar 12  2011 100.perf-reduce*
-rwxr-xr-x 1 root mis 1462 Mar 12  2011 110.perf-clean*
-rwxr-xr-x 1 root mis 1161 Mar 12  2011 120.perf-compress*

me% grep '#<' * | cut -f2 -d'<'
100.perf-reduce: merges separate perflog files to save space.
110.perf-clean: removes old perflog directory if "reduce" worked.
120.perf-compress: runs xz on yesterday's logfile.
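
The three scripts aren't reproduced here, but the "reduce" and "clean" steps can be sketched in a few lines. This is a guess at the general shape, not the actual 100.perf-reduce: the function name reduce_day is hypothetical, the line count of 100000 is just a number assumed to be larger than any snapshot file, and the demo runs in a scratch directory with made-up contents:

```shell
#!/bin/sh
# Sketch only: merge one day's snapshot files into a single summary.
# reduce_day() is a hypothetical stand-in for 100.perf-reduce.
#   $1 = year directory, e.g. /var/log/perflog/2011
#   $2 = day directory under it, e.g. 0924
#   $3 = output file, e.g. /var/log/perflog/2011.n/0924
reduce_day () {
    # Given several files, head prints a "==> name <==" banner before
    # each one; the big line count is meant to take each file whole.
    ( cd "$1" && head -n 100000 "$2"/*/* ) > "$3"
}

# Build a tiny fake tree to try it on.
scratch=$(mktemp -d) || exit 1
mkdir -p "$scratch/2011/0924/0001" "$scratch/2011/0924/0011" "$scratch/2011.n"
echo 'load average: 0.10' > "$scratch/2011/0924/0001/uptime"
echo 'load average: 0.25' > "$scratch/2011/0924/0011/uptime"

# Remove the day's directory only if the merge worked.
reduce_day "$scratch/2011" 0924 "$scratch/2011.n/0924" &&
    rm -rf "$scratch/2011/0924"

cat "$scratch/2011.n/0924"
```

Running the file names through the shell glob keeps them in sorted order, so the banners come out in time order. The compress step would then amount to running xz on the previous day's summary file.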

4. Finding disk hogs

If you want to keep an eye on who (or what) uses the most space over time, you can put the script below under /etc/cron.daily.

4.1. dirsize script

This writes du summary output to a file named after the current date.

#!/bin/ksh
#
# $Revision: 1.4 $ $Date: 2010-11-09 15:11:35-05 $
# $UUID: 31a46065-7dd1-3f41-b27c-bc96ce22c12d $
#
#<dirsize: see how big each top-level group directory is.
# usage: dirsize [etc-file [output-file]]

export PATH=/usr/local/bin:/bin:/usr/bin
export BLOCKSIZE=1m
export BLOCK_SIZE=1048576        # BLOCK* sets du output to Mbytes.

umask 022
tag=$(basename $0)
host=$(hostname | cut -f1 -d.)
out='/var/adm/sa/du'

# Format output in consistent-width columns.
# Argument is the number of columns you want.
layout () {
    case "$#" in
        0) k=1 ;;
        *) k=$1 ;;
    esac

    case "$k" in
        [1-9]) ;;
        *) echo 'layout botch'; exit 1 ;;
    esac

    awk '{printf "%6s %s\n", $1, $2}' | pr -o1 -w88 -${k}t | expand
}

say () {
    echo; echo "$(date '+%Y-%m-%d %T'): $*"; echo
}

warn () {
    echo "WARN: $(date '+%Y-%m-%d %T'): $*"
}

logmsg () {
    logger -t $tag -p local1.info "$@"
}

die () {
    logmsg "FATAL: $*"; exit 1
}

# Check the input settings file.  Set an optional output file.
ofile=
case "$#" in
    0) ifile="/usr/local/etc/$tag" ;;
    1) ifile="$1" ;;
    2) ifile="$1"; ofile="$2" ;;
    *) die "usage: $tag [etc-file [output-file]]" ;;
esac
test -f "$ifile" || die "$ifile not found"

# Figure out the date.
logmsg start
set X $(date "+%Y %m%d"); shift
yr=$1
mday=$2

# Set up the output file.
test -d "$out/$yr" || mkdir -p $out/$yr
test -d "$out/$yr" || die "unable to mkdir $out/$yr"

# Redirect all stdout and stderr output.
case "$ofile" in
    "") ofile="$out/$yr/$mday" ;;
    *)  ;;
esac

exec > $ofile
exec 2>&1

# Real work starts here.  Read the directories, columns, etc.
grep '^[1-9]' $ifile | while read depth columns dir
do
    if test -d "$dir"
    then
        say Directory $dir
    else
        warn "$dir: not a directory"
        continue
    fi

    # NOTE: after awk, we could put "sort -nr" or "cat" depending
    # on whether you wanted output sorted by directory size.
    # Ignore anything under 10 MB.
    (
      cd $dir
      find . -mindepth $depth -maxdepth $depth -type d -print |
          sort | tr '\012' '\000' | xargs -0 du -s |
          awk '{ if ($1 > 9) print }' |
          layout $columns |
          sed -e 's! \./! !g'
    )
done

say done
logmsg done
exit 0

4.2. dirsize results

Some sample output from 9/24/2011:

2011-09-24 04:27:56: Directory /fs1b/server5/2008

 ** 935 0104       495 0325       381 0530       407 0807       296 1015
    897 0107       223 0328       230 0602       228 0813       260 1021
    502 0110       441 0331       544 0605       789 0819       646 1024
    435 0116       480 0403       282 0611       197 0822       387 1027
    790 0122       440 0409       245 0617       276 0825       446 1030
    561 0125       138 0415       231 0620       286 0828       177 1105
    599 0128       277 0418       204 0623       308 0903       263 1114
    425 0131       246 0421       131 0626      2602 0909       864 1117
    409 0206       660 0424       461 0702       164 0912       576 1120
    396 0212       283 0430       264 0708       322 0915       352 1126
    556 0215       938 0506       513 0711       435 0918       596 1202
    620 0221       358 0509       713 0714        37 0921       746 1205
    574 0227       554 0512       338 0717       132 0924       431 1208
    503 0304       688 0515       326 0723       625 0930       745 1211
    204 0307        11 0518       252 0729       440 1003       288 1217
    591 0310       355 0521       376 0801       544 1006       118 1223
    307 0313       512 0527       284 0804      1069 1009       126 1229
    435 0319

2011-09-24 04:32:28: Directory /fs1b/server5/2009

    594 0707       174 0812       121 0921       627 1027       113 1202
    536 0715       194 0820       203 0925       186 1104       112 1210
     22 0719       345 0824       280 0929       275 1112       315 1214
    227 0723       267 0828       252 1007       174 1116       104 1218
    311 0727       672 0901       494 1015       322 1120        76 1222
    469 0731       136 0909       188 1019       283 1124        38 1230
    427 0804       303 0917       840 1023 **

For example, the two entries marked with ** show that the directory holding backups for server5 on 1/4/2008 takes up 935 MB, and the one holding backups for server5 on 10/23/2009 takes up 840 MB.

4.3. dirsize config file

"dirsize" reads directory and layout information from the file /usr/local/etc/dirsize:

# $Revision: 1.1 $ $Date: 2011-08-09 18:52:30-04 $
# $UUID: d9f49e1d-111c-35dd-9265-8d81644455c8 $
#
# Expand this list into additional directories to check.
# Field 1: min/max depth of directories to traverse
# Field 2: number of columns to print
# Field 3: starting directory
#
# EXAMPLE:
#   "2 3 /usr" runs "cd /usr; find . -mindepth 2 -maxdepth 2 -type d"
#   and prints 3-column output.

1   5   /fs1b/server5/2008
1   5   /fs1b/server5/2009

The lines for "server5" tell the script to descend one level into the given directory and print the results in 5 columns.

4.4. A better design

If I were doing this over again, I'd divvy up the work a bit differently. Instead of writing the report in one script, I'd store raw du output without any formatting in one directory, and have separate scripts to read that and write something suitable for a webpage display or database import.
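
To make that split concrete, here's a minimal sketch: one function stores unformatted du output, and a separate one reads it back. The names collect and report are hypothetical, sizes are in KB because "du -k" is portable, and the sample raw data reuses a few numbers from the listing above purely as test input:

```shell
#!/bin/sh
# Sketch only: keep collection and reporting separate.
# collect() and report() are hypothetical names.

# Store raw "size<TAB>directory" lines for each top-level subdirectory.
#   $1 = directory to measure, $2 = raw output file
collect () {
    ( cd "$1" && du -k -s ./*/ ) > "$2"
}

# One possible reader: the N largest directories, biggest first.
#   $1 = raw file, $2 = how many lines to show (default 10)
report () {
    sort -rn "$1" | head -n "${2:-10}" |
        awk -F'\t' '{
            d = $2
            sub(/^\.\//, "", d)     # drop the leading "./"
            sub(/\/$/, "", d)       # drop the trailing "/"
            printf "%8d %s\n", $1, d
        }'
}

# Feed the reader some canned raw data.
raw=$(mktemp) || exit 1
printf '935\t./0104/\n11\t./0518/\n2602\t./0909/\n' > "$raw"
report "$raw" 2
```

Since the raw file is plain tab-separated text, a second reader for a webpage or a database import can be written later without touching the collector.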

5. Example scripts

6. Feedback

Feel free to send comments.

$Revision: 1.8 $

Monitoring disk space

Karl Vogel

Sun, 25 Sep 2011 23:13:25