Copyright 1998, Hewlett-Packard Company

A Brief Look at the Performance of Named 8.1.1 and Steps to Improve Same
ftp://ftp.cup.hp.com/dist/networking/briefs/named_performace.txt

Rick Jones
Systems Networking Solutions Laboratory
Hewlett-Packard Company
Cupertino, CA

Revision 0.9; April 9, 1998

Introduction:

BIND 8.1.1 is the current release of software implementing the Domain Name
Service of the Internet. Virtually the entire Internet is dependent on the
robustness and performance of BIND. This paper takes an all too brief look
at the performance of the named from BIND 8.1.1 running under the HP-UX
11.00 operating system. Potential bottlenecks are described, as well as
potential solutions to those bottlenecks. Also discussed are the effects
upon named performance of various optimizations made available in the HP C
compiler.

This paper will demonstrate that through a combination of compiler
optimizations, small makefile and porting changes, and slight code
modifications it is possible to increase the performance of named by as
much as 98%. This is with no major algorithmic or structural changes to the
BIND source.

Test Environment:

The performance measurements described in this paper were taken with the
DNS_RR test of revision 3.0.0alpha (3/8/98) of the netperf benchmark. While
this revision of netperf is not as complete, nor as pretty, as revision
2.1, it was sufficiently robust and ready for this task. There is little
value to a benchmark for which source code is unavailable. Details on
obtaining copies of the netperf sources and a discussion of the netperf
DNS_RR test can be found later in this paper.

Each measurement was of 15 minutes duration. This allowed the puma
profiling tool to collect approximately 5000 samples and also helped ensure
that any unforeseen periodic glitches in the environment were evened-out.
Four streams of DNS requests were directed at the System Under Test (SUT),
two from each of the client systems. Each stream of requests started at a
random point in a list of 1000 names selected from the hostnames on the
Hewlett-Packard site in Cupertino, California, and kept cycling through the
list until the end of the measurement.

The systems generating the DNS requests were a pair of HP 9000 B160L
workstations running HP-UX 10.20. The B160L contains a PA-7300 CPU running
at 160 MHz. Each system contained 128 MB of RAM and an add-on PCI 100BaseT
Network Interface Card (NIC).

The System Under Test was a uniprocessor HP 9000 Model K460. The K460 uses
a PA-8000 processor with 1 MB of instruction cache and 1 MB of data cache
and has a clock rate of 180 MHz. The HP 9000 Model C180 and the HP 9000
Model D380 (uniprocessor) would have similar characteristics. The SUT also
used a dual-port HSC 100BaseT NIC. Both ports on the card were used during
testing. A caching-only nameserver was run on the SUT. The systems were
joined by an eight-port AdvanceStack 800T switch, and the two logical
100BaseT networks ran in full-duplex mode.

The profiles shown in this paper were taken with the "puma" performance
tool, which is bundled with the HP C compiler. Puma can take profiles of
"normal" binaries without requiring special compilation. More information
about the puma tool can be found via http://www.hp.com/lang.

Baseline Performance: [ named.O2 ]

"Out of the box" compilation under HP-UX uses -g and +O2. For the baseline
figures the -g switch was removed from the compilation and the +O2
retained.
Compiled this way, named was able to serve approximately 1,900 requests per
second using all the CPU resources of the SUT. The paper will refer to this
version of named as "named.O2." At the gross system level, 75% of the CPU
time was spent in user space; the remaining 25% was spent in the kernel.
Puma showed that of the user-space time, roughly 60% was spent in libc and
40% in named code:

  Samples
  In Only
  %Total    Image or Library
  --------- ----------------
         62 /usr/lib/libc.2
         38 /home/bind/named.O2
  --------

From this it would seem that a fertile area of exploration would be
reducing the time spent in libc. Puma's breakdown of time on a per-routine
basis showed that a very large fraction of the CPU time in libc was in
string manipulation routines:

  Samples
  In Only
  %Total    Routine
  --------- -------
          7 (libc.2) _sendto_sys
          7 (named.O2) nlookup
          6 (libc.2) _select_sys
          6 (named.O2) dn_find
          6 (libc.2) _recvfrom_sys
          5 (libc.2) _doprnt
          5 (libc.2) gettimeofday
          4 (named.O2) __ns_name_pton
          3 (libc.2) $$remI
          3 (named.O2) __ns_name_pack
          3 (libc.2) strncasecmp
          3 (libc.2) tolower
          2 (libc.2) isascii(hpux_export stub)
          2 (libc.2) __tolower(shlib_import stub)
          2 (libc.2) __towlower_std(shlib_import stub)
          2 (libc.2) asctime_r
          2 (libc.2) malloc
          2 (libc.2) _ltostr
          1 (named.O2) doaddinfo
          1 (named.O2) mklower
          1 (named.O2) make_rr
          1 (libc.2) gmtime_r
          1 (named.O2) ns_debug
          1 (named.O2) __ns_name_ntop
          1 (libc.2) localtime_r
          1 (named.O2) __log_vwrite
          1 (libc.2) _strncasecmp(hpux_export stub)
          1 (libc.2) free
          1 (libc.2) strcpy
          1 (libc.2) getenv
          1 (named.O2) __evGetNext
          1 (named.O2) req_query
          1 (named.O2) __ns_name_compress
          1 (libc.2) _memcpy(hpux_export stub)
          1 (libc.2) $$divI
          1 (named.O2) match
          1 (libc.2) strlen
  --------

While a few routines accounted for a fairly large share of the CPU time,
the time was rather spread-out.
An hierarchical look at the profile was a bit more enlightening: Samples Samples In or Under In Only %Total %Total Level Routine ----------- --------- ----- ------- 100 0 1 $START$ 100 0 2 _start(hpux_export stub) 100 0 3 _start 100 0 4 main(hpux_export stub) 100 0 5 main 90 0 6 __evDispatch 90 0 7 datagram_read 68 0 8 ns_req 31 1 9 req_query 18 0 10 add_data 17 1 11 make_rr 9 0 12 __dn_comp 9 0 13 __ns_name_compress 6 2 14 __ns_name_pack 4 3 15 dn_find 1 1 16 mklower 2 2 14 __ns_name_pton 7 0 12 addname 5 0 13 savestr 5 0 14 strdup(hpux_export stub) 4 0 15 _strdup 3 2 16 malloc 1 1 17 _ltostr 1 1 16 strlen 2 0 13 strcasecmp(hpux_export stub) 2 0 14 _strcasecmp 1 1 15 __tolower(shlib_import stub) 4 1 10 nlookup 3 0 11 nlookup 2 0 12 nlookup 1 0 13 nlookup 1 0 14 _strncasecmp(hpux_export stub) 1 0 15 strncasecmp 1 0 11 _strncasecmp(hpux_export stub) 1 0 12 strncasecmp 3 0 10 finddata 2 0 11 make_rr 2 0 12 __dn_comp 2 0 13 __ns_name_compress 1 0 14 __ns_name_pack 1 1 15 dn_find 1 1 14 __ns_name_pton 1 0 11 wanted 1 0 10 findns 2 0 10 __dn_expand 2 0 11 __ns_name_uncompress 2 1 12 __ns_name_ntop 4 0 9 sin_ntoa 3 0 10 __inet_ntoa 2 0 11 __inet_ntop 2 0 12 inet_ntop4 2 0 13 _sprintf(hpux_export stub) 2 0 14 sprintf 2 2 15 _doprnt 1 0 10 _sprintf(hpux_export stub) 1 0 11 sprintf 1 1 12 _doprnt 25 1 9 doaddinfo 2 0 10 free(hpux_export stub) 2 1 11 free 1 1 12 _ltostr 1 0 10 ns_debug 14 1 10 nlookup 10 1 11 nlookup 6 1 12 nlookup 1 0 13 _strncasecmp(hpux_export stub) 1 0 14 strncasecmp 4 1 13 nlookup 2 0 14 _strncasecmp(hpux_export stub) 2 1 15 strncasecmp 2 0 12 _strncasecmp(hpux_export stub) 2 1 13 strncasecmp 1 1 14 tolower 1 0 12 isascii(hpux_export stub) 2 0 11 _strncasecmp(hpux_export stub) 2 0 12 strncasecmp 1 1 13 __tolower(shlib_import stub) 1 1 11 isascii(hpux_export stub) 5 0 10 make_rr 4 0 11 __dn_comp 4 0 12 __ns_name_compress 1 1 13 __ns_name_pton 3 1 13 __ns_name_pack 2 2 14 dn_find 1 1 10 match 7 0 9 _sendto(hpux_export stub) 7 0 10 sendto 7 7 11 _sendto_sys 1 0 9 nameserIncr 1 0 10 nameserFind 1 0 11 tree_srch 6 0 8 _recvfrom(hpux_export stub) 6 0 9 _recvfrom 6 6 10 _recvfrom_sys 11 0 8 ctimel 11 0 9 checked_ctime 10 0 10 _ctime(hpux_export stub) 10 0 11 _ctime 8 1 12 localtime_r 4 1 13 gmtime_r 2 2 14 $$remI 2 0 13 _tzset 1 1 14 asctime_r 1 1 14 getenv 1 0 13 _bsearch 2 1 12 asctime_r 1 1 13 $$remI 3 0 8 sin_ntoa 2 0 9 __inet_ntoa 2 0 10 __inet_ntop 2 0 11 inet_ntop4 2 0 12 _sprintf(hpux_export stub) 2 0 13 sprintf 1 1 14 _doprnt 1 0 9 _sprintf(hpux_export stub) 1 0 10 sprintf 1 1 11 _doprnt 2 0 8 gettime 2 0 9 _gettimeofday(hpux_export stub) 2 2 10 gettimeofday 10 1 6 __evGetNext 6 0 7 pselect 6 0 8 _select(hpux_export stub) 6 0 9 select 6 6 10 _select_sys 3 0 7 __evNowTime 3 0 8 _gettimeofday(hpux_export stub) 3 3 9 gettimeofday -------- For instance, the profile showed that roughly 11% of the time was spent in or under the routine ctimel() when called from within datagram_read(). Here is the code fragment from datagram_read() in src/bin/named/ns_main.c: gettime(&tt); ns_debug(ns_log_default, 1, "datagram from %s, fd %d, len %d; now %s", sin_ntoa(from), fd, n, ctimel(tt.tv_sec)); if (n < HFIXEDSZ) return; /* Consult database to get the answer. */ ns_req(buf, n, PACKETSZ, NULL, from, fd); The call to ctimel is coming from a parameter to the ns_debug() routine. Looking at the other parameters one also sees a call to sin_ntoa() which the hierarchical profile shows to be another 3% of the CPU time. The routine ns_debug() is defined in src/bin/named/ns_glue.c. 
It is a rather short routine:

    /*
     * Logging Support
     */
    void
    ns_debug(int category, int level, const char *format, ...) {
            va_list args;

            if (!log_ctx_valid)
                    return;
            va_start(args, format);
            log_vwrite(log_ctx, category, log_debug(level), format, args);
            va_end(args);
    }

which merely checks that some logging context is valid and then prepares a
call to log_vwrite(), which is defined in src/lib/isc/logging.c. Here is
some of the code at the beginning of that routine:

            debugging = (lcp->flags & LOG_OPTION_DEBUG);

            /*
             * If not debugging, short circuit debugging messages very early.
             */
            if (level > 0 && !debugging)
                    return;

The routine attempts to determine early-on if something really needs to be
logged. However, this check is placed at the bottom of the call sequence,
which means that any computationally expensive arguments to ns_debug() will
already have been evaluated - perhaps only to have the work discarded, as
it would seem to be in the default case. Layering is a useful architectural
concept which does not always perform well in practice. This suggests the
first tuning opportunity.

First Tune - Earlier Logging Checks: [/home/bind2/DataSet2.puma]

So, we arrive at the first potential performance improvement - pushing the
short-circuiting of logging earlier in the call stack; in this case, before
the arguments datagram_read() passes to ns_debug() are evaluated. I was not
entirely certain as to the best way to go about performing this task, as
some of the logging information checks are kept rather opaque. So, I added
a routine to src/bin/named/ns_glue.c called ns_wouldlog():

    int
    ns_wouldlog(int level) {
            if (log_ctx_valid)
                    return(log_wouldlog(log_ctx, level));
            return(0);
    }

which in turn calls log_wouldlog(), a routine added to
src/lib/isc/logging.c:

    int
    log_wouldlog(log_context lc, int level) {
            log_context_p lcp;
            int debugging = 0;

            lcp = lc.opaque;
            assert(lcp != NULL);

            debugging = (lcp->flags & LOG_OPTION_DEBUG);

            /*
             * If not debugging, short circuit debugging messages very early.
             */
            if (level > 0 && !debugging)
                    return(0);
            return(1);
    }

[ These routines look remarkably similar to the beginnings of ns_debug()
and log_vwrite() respectively :) ]

This call sequence returns 1 if the logging would likely happen and 0 if it
would not. I would have preferred to simply re-define ns_debug as a macro
with all the necessary checks up front - saving the procedure calls - but
left that to those who are more familiar with the code. The issue of making
ns_debug a macro is complicated by the use of varargs.

The ns_debug() call in datagram_read() becomes:

            gettime(&tt);
            if (ns_wouldlog(1)) {
                    ns_debug(ns_log_default, 1,
                             "datagram from %s, fd %d, len %d; now %s",
                             sin_ntoa(from), fd, n, ctimel(tt.tv_sec));
            }

This version of named was labeled "named.O2logck." Named.O2logck was able
to serve approximately 2,215 requests per second. This is roughly a 16.5%
improvement in performance from the addition of perhaps 20 lines of code
which allow two rather expensive routines (ctimel and sin_ntoa) to be
avoided. Since this performance improvement was not necessarily platform
specific, similar results should be possible on other platforms. After this
tune, the user/kernel split was 72/28.
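As an aside on the macro question raised above: with a pre-C99 compiler
(no variadic macros) the varargs problem can be side-stepped with the
"double parentheses" idiom, so that the expensive arguments are only
evaluated when the message would actually be emitted. The following is a
sketch of that idea only; the NS_DEBUG macro name is invented for the
example and is not part of the BIND source.

    /*
     * Sketch: defer evaluation of ns_debug() arguments until
     * ns_wouldlog() says the message would be logged.  The caller
     * supplies the ns_debug() argument list in its own parentheses.
     */
    #define NS_DEBUG(level, args) \
            do { \
                    if (ns_wouldlog(level)) \
                            ns_debug args; \
            } while (0)

    /* The datagram_read() call site would then look something like: */
    NS_DEBUG(1, (ns_log_default, 1,
                 "datagram from %s, fd %d, len %d; now %s",
                 sin_ntoa(from), fd, n, ctimel(tt.tv_sec)));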
At the library level, puma shows the time breakdown to have become:

  Samples
  In Only
  %Total    Image or Library
  --------- ----------------
         57 /usr/lib/libc.2
         43 /home/bind2/named.O2logck
  --------

The updated profile of time spent per-routine becomes:

  Samples
  In Only
  %Total    Routine
  --------- -------
          8 (libc.2) _sendto_sys
          7 (named.O2logck) dn_find
          7 (named.O2logck) nlookup
          6 (libc.2) _select_sys
          6 (libc.2) _recvfrom_sys
          5 (libc.2) gettimeofday
          4 (named.O2logck) __ns_name_pton
          4 (libc.2) _ltostr
          3 (named.O2logck) __ns_name_pack
          3 (libc.2) tolower
          3 (libc.2) __tolower(shlib_import stub)
          3 (libc.2) isascii(hpux_export stub)
          3 (libc.2) _doprnt
          3 (libc.2) strncasecmp
          2 (libc.2) __towlower_std(shlib_import stub)
          2 (named.O2logck) doaddinfo
          2 (libc.2) malloc
          2 (named.O2logck) mklower
          2 (named.O2logck) make_rr
          2 (libc.2) free
          1 (named.O2logck) __log_vwrite
          1 (named.O2logck) ns_debug
          1 (named.O2logck) __evGetNext
          1 (libc.2) memcpy
          1 (libc.2) _memcpy(hpux_export stub)
          1 (libc.2) _strncasecmp(hpux_export stub)
          1 (libc.2) strcpy
          1 (named.O2logck) stale
          1 (named.O2logck) delete_stale
          1 (named.O2logck) match
          1 (named.O2logck) __ns_name_ntop
          1 (named.O2logck) req_query
          1 (named.O2logck) tree_srch
          1 (named.O2logck) findns
          1 (named.O2logck) __ns_name_compress
          1 (named.O2logck) isascii(hpux_import stub)
  --------

Isascii(), and the stubs used to get to it, account for roughly four
percent of the total time. Tolower() is roughly 8% of the time, and the
case-insensitive string comparison routines are roughly 4%. Most
contemporary operating systems have to deal with native language support
and multi-byte character sets. However, DNS is based on 7-bit ASCII. This
means that there is overhead in the libc string routines for functionality
that named does not need. It may be possible to get rid of those calls to
libc.
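To make concrete what "overhead that named does not need" means here, the
sketch below shows a case-insensitive comparison that assumes 7-bit ASCII
and so involves no locale or multi-byte machinery at all. It is purely an
illustration - the ascii_strncasecmp() and ascii_lower() names are invented
for the example; BIND ships its own replacement routines, which are
discussed in the next section.

    #include <stddef.h>

    /* lower-case an assumed-ASCII character without consulting a locale */
    static int ascii_lower(int c)
    {
            return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
    }

    /* compare at most n characters, ignoring ASCII case */
    int ascii_strncasecmp(const char *s1, const char *s2, size_t n)
    {
            while (n-- != 0) {
                    int c1 = ascii_lower((unsigned char)*s1++);
                    int c2 = ascii_lower((unsigned char)*s2++);
                    if (c1 != c2)
                            return (c1 - c2);
                    if (c1 == '\0')
                            return (0);
            }
            return (0);
    }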
The hierarchical profile shows some clues: Samples Samples In or Under In or Under %Total %Total Level Routine ----------- ----------- ----- ------- 100 100 1 $START$ 100 100 2 _start(hpux_export stub) 100 100 3 _start 100 100 4 main(hpux_export stub) 100 100 5 main 89 89 6 __evDispatch 88 88 7 datagram_read 80 80 8 ns_req 30 30 9 req_query 18 18 10 add_data 17 17 11 make_rr 8 8 12 addname 6 6 13 savestr 5 5 14 strdup(hpux_export stub) 5 5 15 _strdup 3 3 16 malloc 1 1 17 _ltostr 2 2 13 strcasecmp(hpux_export stub) 2 2 14 _strcasecmp 8 8 12 __dn_comp 8 8 13 __ns_name_compress 2 2 14 __ns_name_pton 6 6 14 __ns_name_pack 3 3 15 dn_find 1 1 16 mklower 1 1 15 _memcpy(hpux_export stub) 5 5 10 nlookup 1 1 11 isascii(hpux_export stub) 3 3 11 nlookup 2 2 12 nlookup 1 1 13 nlookup 1 1 14 _strncasecmp(hpux_export stub) 1 1 15 strncasecmp 1 1 11 _strncasecmp(hpux_export stub) 1 1 12 strncasecmp 2 2 10 __dn_expand 2 2 11 __ns_name_uncompress 1 1 12 __ns_name_ntop 1 1 12 __ns_name_unpack 3 3 10 finddata 2 2 11 make_rr 1 1 12 __dn_comp 1 1 13 __ns_name_compress 1 1 14 __ns_name_pack 1 1 15 dn_find 1 1 11 wanted 1 1 10 findns 36 36 9 doaddinfo 16 16 10 nlookup 10 10 11 nlookup 6 6 12 nlookup 3 3 13 nlookup 2 2 14 _strncasecmp(hpux_export stub) 2 2 15 strncasecmp 1 1 13 _strncasecmp(hpux_export stub) 1 1 14 strncasecmp 2 2 12 _strncasecmp(hpux_export stub) 2 2 13 strncasecmp 1 1 14 __towlower_std(shlib_import stub) 1 1 14 __tolower(shlib_import stub) 1 1 12 isascii(hpux_export stub) 3 3 11 _strncasecmp(hpux_export stub) 3 3 12 strncasecmp 1 1 13 __tolower(shlib_import stub) 1 1 13 tolower 1 1 13 __towlower_std(shlib_import stub) 1 1 11 isascii(hpux_export stub) 10 10 10 make_rr 9 9 11 __dn_comp 8 8 12 __ns_name_compress 6 6 13 __ns_name_pack 5 5 14 dn_find 1 1 15 mklower 2 2 13 __ns_name_pton 4 4 10 free(hpux_export stub) 4 4 11 free 2 2 12 _ltostr 1 1 10 match 1 1 10 ns_debug 1 1 11 __log_vwrite 1 1 10 delete_stale 4 4 9 sin_ntoa 1 1 10 _sprintf(hpux_export stub) 1 1 11 sprintf 1 1 12 _doprnt 3 3 10 __inet_ntoa 3 3 11 __inet_ntop 3 3 12 inet_ntop4 3 3 13 _sprintf(hpux_export stub) 3 3 14 sprintf 2 2 15 _doprnt 8 8 9 _sendto(hpux_export stub) 8 8 10 sendto 8 8 11 _sendto_sys 1 1 9 nameserIncr 1 1 10 nameserFind 1 1 11 tree_srch 6 6 8 _recvfrom(hpux_export stub) 6 6 9 _recvfrom 6 6 10 _recvfrom_sys 2 2 8 gettime 2 2 9 _gettimeofday(hpux_export stub) 2 2 10 gettimeofday 11 11 6 __evGetNext 6 6 7 pselect 6 6 8 _select(hpux_export stub) 6 6 9 select 6 6 10 _select_sys 3 3 7 __evNowTime 3 3 8 _gettimeofday(hpux_export stub) 3 3 9 gettimeofday -------- The profile shows that all the isascii() calls appear to be coming from named code, not libc code. That is good because there is a #define _SB_CTYPE_MACROS that one can use to have isascii replaced by a a macro from /usr/include/ctype.h. This should save us the call overhead of going into libc. While searching through the BIND source files for calls to strncasecmp, I discovered the file src/lib/bsd/strcasecmp.c which has replacements for strcasecmp and strncasecmp. These are compiled-in if NEED_STRCASECMP is defined. Switching to the BIND-provided strcasecmp and strncasecmp should save the call-overhead of going into libc and may allow for some additional inlining by the compiler at higher optimization levels. The switch should also eliminate all the calls to tolower contained within the libc versions. Searching through the files, I discovered that the way to get NEED_STRCASECMP defined was to use the "port_after.h" file. 
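As a rough sketch, the NEED_STRCASECMP portion of the change amounts to a
single define added to the port_after.h in the hpux11 port directory
created for this exercise; the value and the surrounding contents of the
file are not reproduced here.

    /* src/port/hpux11/port_after.h (sketch of the addition only) */
    #define NEED_STRCASECMP 1
    /* with this defined, src/lib/bsd/strcasecmp.c supplies strcasecmp()
     * and strncasecmp() in place of the libc versions */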
The "port_before.h" file was used to get _SB_CTYPE_MACROS defined. This
leads to the next version of named - named.O2logckascii.

Tune Two - Reducing String/Character Manipulation Overhead:

The combination of _SB_CTYPE_MACROS and NEED_STRCASECMP altered the
user/kernel split to roughly 67/33, with an increase in performance to
approximately 2,600 transactions per second, or an additional 17.4%. This
was a rather greater increase than I initially expected. It is possible
that the use of the isascii macro and the local strcasecmp/strncasecmp led
to better cache locality. The total increase thus far is roughly 36.8% over
baseline.

The library-level profile from puma shows a further reduction in time spent
in libc:

  Samples
  In Only
  %Total    Image or Library
  --------- ----------------
         52 /usr/lib/libc.2
         48 /home/bind2/named.O2logckascii
  --------

The less time spent in libc, the greater the value of higher levels of
compiler optimization. The per-routine puma profile now looks like:

  Samples
  In Only
  %Total    Routine
  --------- -------
         14 (libc.2) _sendto_sys
          8 (named.O2logckascii) dn_find
          7 (libc.2) _recvfrom_sys
          7 (named.O2logckascii) nlookup
          7 (libc.2) _select_sys
          7 (libc.2) gettimeofday
          5 (named.O2logckascii) __ns_name_pton
          4 (libc.2) _doprnt
          3 (named.O2logckascii) __ns_name_pack
          3 (libc.2) _ltostr
          2 (named.O2logckascii) doaddinfo
          2 (libc.2) malloc
          2 (named.O2logckascii) mklower
          2 (named.O2logckascii) make_rr
          2 (named.O2logckascii) ns_debug
          2 (libc.2) free
          1 (named.O2logckascii) __log_vwrite
          1 (named.O2logckascii) req_query
          1 (libc.2) memcpy
          1 (libc.2) strcpy
          1 (named.O2logckascii) match
          1 (named.O2logckascii) stale
          1 (named.O2logckascii) __evGetNext
          1 (libc.2) _memcpy(hpux_export stub)
          1 (named.O2logckascii) __ns_name_ntop
          1 (named.O2logckascii) findns
          1 (named.O2logckascii) strncasecmp
          1 (named.O2logckascii) delete_stale
          1 (named.O2logckascii) __ns_name_compress
  --------

At this point, much of the string manipulation time is gone. There is a
little left in strcpy. We can also see that the "$$" routines have gone
away - these were another four percent in the previous profile. So, we have
found at least part of the reason for the greater than expected performance
boost. Seven percent of the time in the profile is spent in either _doprnt
or _ltostr. It would be interesting to see where those calls come from.
Here is the hierarchical profile: Samples Samples In or Under In or Under %Total %Total Level Routine ----------- ----------- ----- ------- 100 100 1 $START$ 100 100 2 _start(hpux_export stub) 100 100 3 _start 100 100 4 main(hpux_export stub) 100 100 5 main 87 87 6 __evDispatch 87 87 7 datagram_read 76 76 8 ns_req 30 30 9 req_query 18 18 10 add_data 17 17 11 make_rr 9 9 12 __dn_comp 9 9 13 __ns_name_compress 7 7 14 __ns_name_pack 4 4 15 dn_find 1 1 16 mklower 1 1 15 _memcpy(hpux_export stub) 2 2 14 __ns_name_pton 7 7 12 addname 6 6 13 savestr 6 6 14 strdup(hpux_export stub) 6 6 15 _strdup 3 3 16 malloc 1 1 17 _ltostr 1 1 16 strcpy 2 2 10 findns 4 4 10 finddata 2 2 11 make_rr 2 2 12 __dn_comp 2 2 13 __ns_name_compress 1 1 14 __ns_name_pack 1 1 15 dn_find 1 1 11 wanted 2 2 10 nlookup 1 1 11 nlookup 1 1 12 nlookup 1 1 13 nlookup 1 1 10 free_nsp 2 2 10 __dn_expand 2 2 11 __ns_name_uncompress 1 1 12 __ns_name_ntop 14 14 9 _sendto(hpux_export stub) 14 14 10 sendto 14 14 11 _sendto_sys 5 5 9 sin_ntoa 1 1 10 _sprintf(hpux_export stub) 1 1 11 sprintf 1 1 12 _doprnt 4 4 10 __inet_ntoa 3 3 11 __inet_ntop 3 3 12 inet_ntop4 3 3 13 _sprintf(hpux_export stub) 3 3 14 sprintf 3 3 15 _doprnt 1 1 9 nameserIncr 1 1 10 nameserFind 1 1 11 tree_srch 26 26 9 doaddinfo 6 6 10 nlookup 3 3 11 nlookup 2 2 12 nlookup 1 1 13 nlookup 10 10 10 make_rr 8 8 11 __dn_comp 8 8 12 __ns_name_compress 6 6 13 __ns_name_pack 5 5 14 dn_find 1 1 15 mklower 2 2 13 __ns_name_pton 1 1 11 _memcpy(hpux_export stub) 4 4 10 free(hpux_export stub) 4 4 11 free 2 2 12 _ltostr 1 1 10 delete_stale 1 1 11 stale 1 1 10 ns_debug 1 1 11 __log_vwrite 1 1 10 match 8 8 8 _recvfrom(hpux_export stub) 8 8 9 _recvfrom 7 7 10 _recvfrom_sys 2 2 8 gettime 2 2 9 _gettimeofday(hpux_export stub) 2 2 10 gettimeofday 13 13 6 __evGetNext 7 7 7 pselect 7 7 8 _select(hpux_export stub) 7 7 9 select 7 7 10 _select_sys 5 5 7 __evNowTime 5 5 8 _gettimeofday(hpux_export stub) 4 4 9 gettimeofday -------- The time in _ltostr seems to come from malloc and free - however, after speaking with some people familiar withthe HP libc it turns out that the routine malloc and free call is not _ltostr, but some static function in libc. Ltostr just happens to be the exported symbol nearest to that static function. The time in _doprnt comes from sprintf. Those do not seem to be easily removed at the moment. So, we arrive at our next tune - an increase in the compiler optimization level. Tune Three - +03 Optimization. The increase to +O3 optimization was done by editing src/port/hpux11/Makefile.set file. The manpage for the HP-UX 11.00 unbundled cc has this to say about +O: +Olevel Invoke optimizations selected by level. Defined values for level are: 0 Perform no optimizations. This is the default. 1 Perform optimizations within basic blocks only. 2 Perform level 1 and global optimizations. Same as -O. 3 Perform level 2 as well as interprocedural global optimizations. 4 Perform level 3 as well as doing link time optimizations. Also sends +Oprocelim and +Ofastaccess to the linker (see ld(1)). With +O3 compilation, the user/kernel split becomes roughly 65/35 and the requests per second increases to approximately 2775. This is another 6.3% and represents an increase of approximately 46%. 
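For reference, the Tune Three edit amounts to changing the optimization
flag carried in src/port/hpux11/Makefile.set. The CDEBUG variable shown
below is an assumption about where those flags live in the port file; the
surrounding settings are unchanged.

    # src/port/hpux11/Makefile.set (sketch of the change only)
    # baseline, after removing -g:
    #     'CDEBUG=+O2'
    # Tune Three:
    'CDEBUG=+O3'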
The per-libary profile: Samples In Only %Total Image or Library --------- ---------------- 54 /usr/lib/libc.2 46 /home/bind2/named.O3logckascii -------- The per-routine profile: Samples In Only %Total Routine --------- ------- 14 (libc.2) _sendto_sys 9 (named.O3logckascii) __ns_name_pack 8 (libc.2) _recvfrom_sys 8 (libc.2) gettimeofday 7 (named.O3logckascii) nlookup 6 (libc.2) _select_sys 6 (named.O3logckascii) __ns_name_compress 4 (libc.2) _doprnt 3 (libc.2) _ltostr 2 (libc.2) malloc 2 (named.O3logckascii) ns_debug 2 (named.O3logckascii) doaddinfo 2 (named.O3logckascii) req_query 1 (named.O3logckascii) make_rr 1 (libc.2) free 1 (libc.2) memcpy 1 (named.O3logckascii) __log_vwrite 1 (libc.2) _memcpy(hpux_export stub) 1 (named.O3logckascii) 27_1d20_cl_make_rr 1 (named.O3logckascii) match 1 (named.O3logckascii) nameserCompar 1 (named.O3logckascii) tree_srch 1 (libc.2) strcpy 1 (named.O3logckascii) findns 1 (named.O3logckascii) strncasecmp 1 (named.O3logckascii) finddata 1 (named.O3logckascii) stale 1 (named.O3logckascii) __evGetNext 1 (named.O3logckascii) __ns_name_ntop 1 (libc.2) _strdup 1 (libc.2) strlen 1 (named.O3logckascii) nameserIncr -------- We can see that __ns_name_pack increased in percentage over the previous profile. A look at the hierarchical profile shows why: Samples Samples In or Under In or Under %Total %Total Level Routine ----------- ----------- ----- ------- 100 100 1 $START$ 100 100 2 _start(hpux_export stub) 100 100 3 _start 100 100 4 main(hpux_export stub) 100 100 5 main 86 86 6 __evDispatch 86 86 7 datagram_read 75 75 8 ns_req 28 28 9 req_query 16 16 10 add_data 15 15 11 make_rr 6 6 12 savestr 6 6 13 strdup(hpux_export stub) 5 5 14 _strdup 3 3 15 malloc 1 1 16 _ltostr 1 1 15 strlen 1 1 15 strcpy 7 7 12 __dn_comp 7 7 13 __ns_name_compress 4 4 14 __ns_name_pack 1 1 15 _memcpy(hpux_export stub) 1 1 16 memcpy 1 1 11 getname 3 3 10 finddata 2 2 11 make_rr 1 1 12 __dn_comp 1 1 13 __ns_name_compress 1 1 14 __ns_name_pack 1 1 10 __dn_expand 1 1 11 __ns_name_uncompress 1 1 12 __ns_name_ntop 1 1 10 nameserIncr 1 1 11 tree_srch 2 2 10 nlookup 1 1 11 nlookup 1 1 12 nlookup 1 1 13 nlookup 2 2 10 findns 25 25 9 doaddinfo 4 4 10 free(hpux_export stub) 4 4 11 free 2 2 12 _ltostr 6 6 10 nlookup 4 4 11 nlookup 2 2 12 nlookup 1 1 13 nlookup 10 10 10 27_1d20_cl_make_rr 8 8 11 __dn_comp 8 8 12 __ns_name_compress 5 5 13 __ns_name_pack 1 1 11 _memcpy(hpux_export stub) 1 1 11 ns_debug 1 1 10 match 1 1 10 delete_stale 1 1 10 ns_debug 14 14 9 _sendto(hpux_export stub) 14 14 10 sendto 14 14 11 _sendto_sys 5 5 9 sin_ntoa 4 4 10 __inet_ntoa 3 3 11 __inet_ntop 3 3 12 _sprintf(hpux_export stub) 3 3 13 sprintf 3 3 14 _doprnt 1 1 10 _sprintf(hpux_export stub) 1 1 11 sprintf 1 1 12 _doprnt 2 2 9 nameserIncr 2 2 10 tree_srch 1 1 11 tree_srch 8 8 8 _recvfrom(hpux_export stub) 8 8 9 _recvfrom 8 8 10 _recvfrom_sys 3 3 8 gettime 3 3 9 _gettimeofday(hpux_export stub) 3 3 10 gettimeofday 13 13 6 __evGetNext 6 6 7 __evNowTime 6 6 8 _gettimeofday(hpux_export stub) 5 5 9 gettimeofday 6 6 7 _select(hpux_export stub) 6 6 8 select 6 6 9 _select_sys -------- dn_find and mklower have been pulled-up into __ns_name_pack - the compiler has done some inlining of routines to reduce procedure call overhead and to optimize away some code. At this point, time spent in gettimeofday() is becoming rather large. Gettimeofday is being called from two placed in the profile. One is from a call to gettime() in datagram_read(), and the other from __evNowTime. 
Looking at datagram_read(), it appears that the value placed in the "tt"
variable is used only for the logging call, so it would seem that putting
the call to gettime(&tt) inside the new logging check would be a worthwhile
thing. The tt variable is actually a global defined in ns_glob.h:

            /* often set to the current time */
    DECL struct timeval tt;

"Often" does not imply "required to always be the current time," and a very
brief examination of the call tree (from the profile) beneath
datagram_read() seemed to indicate that moving the gettime(&tt) call would
be safe. However, I am not confident enough in this examination of the code
to suggest that tune without further advice from the experts.

Increasing the level of optimization seemed to help before, so we arrive at
our next tune - going to +O4 compilation.

Tune Four - +O4 and Profile Based Optimization: [ named.O4logckasciipbo ]

The +O4 compilation flag will pretty much attempt to optimize the whole
program as a unit. It does this by deferring much of the code generation
and optimization to the link phase. This might not be a good thing to do
with a library, as it might have undesired results (greatly increased
compile times, and a requirement that the "real" compiler be present on
target systems) for those including the library but not using +O4
themselves. So, to allow for a difference in the compilation of programs
versus libraries, a new flags variable was introduced - CPFLAGS. CPFLAGS is
included in CFLAGS for those programs compiled by src/bin/Makefile, but not
for those programs compiled by src/lib/Makefile.

[ Being interested in reasonably quick results, the author did this in a
way that results in the compiler warning about a higher level of
optimization overriding a lower one. This is something that can be dealt
with in productization. ]

Profile Based Optimization (PBO) is an example of feedback-directed
compilation and linking. Building an application with PBO is a two-step
process. Step one is to compile the application with profiling hooks
inserted. This is accomplished with the +I compiler directive. The
application is then run under what is considered a typical workload. Upon
program termination a "flow.data" file is created, which is then given to
the compiler for a recompilation of the application with the +I replaced by
+P.

The stock named compilation did not appear to have a normal "clean" exit
path as required by PBO. So I piggy-backed on the SIGTERM handler installed
with XSTATS compilation. A new define, "PBO_COMPILATION," was added to the
makefile and a check for it was added to ns_main.c:

    #if defined(XSTATS) || defined(BIND_UPDATE) || defined(PBO_COMPILATION)
    static SIG_FN
    onintr(int sig) {
            ns_need(MAIN_NEED_EXIT);
    }
    #endif
    ...
    #if defined(XSTATS) || defined(BIND_UPDATE) || defined(HP_PROFILING)
            set_signal_handler(SIGTERM, onintr);
    #endif

As I was only interested in the performance of named, and the compiler when
given the +P directive will fail if there is no data for a given program in
the flow.data file, the SUBDIRS variable in src/bin/Makefile was truncated
to just named.

With these tunes in place, performance increases to 2,940 requests per
second. This is another 6% over the previous tune, and represents a 55%
increase over baseline.
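For reference, the overall PBO cycle for named looks roughly like the
following. The make invocations and the way the instrumented server is
driven are simplified; only the +I/+P/flow.data flow and the SIGTERM-based
shutdown are taken from the steps described above.

    # Pass one: build named with instrumentation.  CPFLAGS carries
    # "+O4 +I" for the programs under src/bin, as described above.
    make

    # Run the instrumented named under a representative load (here, the
    # netperf DNS_RR workload), then stop it with SIGTERM so the exit
    # path added above runs and flow.data is written.
    kill -TERM <pid-of-named>

    # Pass two: rebuild with +I replaced by +P; the compiler reads
    # flow.data and lays the code out according to the observed profile.
    make clean
    make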
The hierarchical profile shows why: Samples Samples In or Under In Only %Total %Total Level Routine ----------- --------- ----- ------- 100 0 1 $START$ 100 0 2 _start(hpux_export stub) 100 0 3 _start 100 0 4 main(hpux_export stub) 100 0 5 main 80 0 6 __evDispatch 80 1 7 datagram_read 25 10 8 doaddinfo 6 0 9 __dn_comp 6 2 10 __ns_name_compress 4 4 11 __ns_name_pack 6 0 9 free(hpux_export stub) 5 2 10 free 3 3 11 _ltostr 1 1 9 ns_debug 1 1 10 fprintf(hpux_import stub) 1 0 9 _memcpy(hpux_export stub) 1 1 9 strncasecmp 26 5 8 819_4336_$_HLO_ns_req.o##_req_query 1 1 9 971_4336_$_HLO_ns_resp.o##_96f_4336_cl_findns 7 1 9 strdup(hpux_export stub) 7 1 10 _strdup 4 3 11 malloc 2 2 12 _ltostr 1 1 11 _strlen(shlib_import stub) 1 1 11 strcpy 7 0 9 __dn_comp 7 2 10 __ns_name_compress 5 4 11 __ns_name_pack 1 0 12 _memcpy(hpux_export stub) 1 0 9 ns_debug 1 0 9 __dn_expand 1 0 10 __ns_name_uncompress 1 1 11 __ns_name_ntop 2 0 9 968_4336_$_HLO_ns_req.o##_966_4336_cl_make_rr 2 0 10 __dn_comp 2 1 11 __ns_name_compress 1 1 12 __ns_name_pack 8 0 8 _recvfrom(hpux_export stub) 8 0 9 _recvfrom 8 8 10 _recvfrom_sys 5 0 8 __inet_ntoa 4 0 9 __inet_ntop 4 0 10 _sprintf(hpux_export stub) 4 0 11 sprintf 3 3 12 _doprnt 11 0 8 _sendto(hpux_export stub) 10 0 9 sendto 10 10 10 _sendto_sys 1 0 8 tree_srch 3 0 8 _gettimeofday(hpux_export stub) 3 3 9 gettimeofday 1 0 8 _sprintf(hpux_export stub) 1 0 9 sprintf 1 1 10 _doprnt 20 1 6 __evGetNext 4 0 7 __evNowTime 4 0 8 _gettimeofday(hpux_export stub) 4 4 9 gettimeofday 14 0 7 _select(hpux_export stub) 14 0 8 select 14 14 9 _select_sys -------- it is possible to see that the call tree is now flatter than it was before - that is the result of the inlining done by +O4. It is also possible to see where the optimizer has left its mark - the "HLO" entry points. It would be nice to get more from this. While shipping a +O4 compiled library might not be a good thing to do since it would require a full compiler on target systems, there is no reason why full-bore optimization could not be done just to make named. Also, there may be some additional tweaks to be made to the options. So, we arrive at the next tune: Tune Five - Full-bore optimization a la SPECint95 [ named.specint95 ] I have very little experience with optimization settings, but do know of a place where experts have worked to squeeze every bit out of a program they could. That would be SPEC's CPU benchmarks. Perhaps the switches used to get the SPECint95 figures for a similar system would be of benefit. I used the peak SPECint95 figures for the HP 9000 C240 as a guide. Based on essentially a hunch, the settings used to compile the 126.gcc benchmark were selected, with some changes. First, the binary was not compiled archive. This also meant that the +ESfic could not be used (I discovered this the hard way...). However, it was possible to use +ESlit to put literals in text space. No source code changes were made, which meant that the existing flow.data was still valid without re-training. Since PBO data was available to say which way the branches went +Ostatic_prediction was used - this tells the compiler to mark the binary for static branch prediction. Since compilation time was no object, +Onolimit was used to tell the compiler and linker to take all the time it needed to optimize the binary. With these optimizations in place, the user/kernel split became approximately 57/43 and the requests per second increased to 3400. This is another 15.6% incremental performance improvement, or roughly 79% above baseline. 
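Pulling the options together, the flags used for the named binary itself
(the libraries stayed at +O2) look roughly like the following. Grouping
them into the CPFLAGS variable introduced in Tune Four is a reconstruction;
only the individual switches are taken from the text above.

    # Sketch of the program-only flags at the named.specint95 stage.
    #
    #   +O4                  whole-program optimization at link time
    #   +P                   use the PBO flow.data gathered earlier
    #   +ESlit               place literals in text space
    #   +Ostatic_prediction  mark the binary for static branch prediction
    #   +Onolimit            let the optimizer take all the time it needs
    #
    CPFLAGS = +O4 +P +ESlit +Ostatic_prediction +Onolimit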
At about this time, I became curious about all the time still being spent
in sin_ntoa(), called from ns_req(). It turns out that the define DEBUG
appears to be set, which causes some additional ns_debug() calls to be
included:

    #ifdef DEBUG
            if (debug > 3) {
                    ns_debug(ns_log_packet, 3, "ns_req(from %s)",
                             sin_ntoa(from));
                    fp_nquery(msg, msglen, log_get_stream(packet_channel));
            }
    #endif
    ...
    #ifdef DEBUG
            ns_debug(ns_log_default, 1,
                     "ns_req: answer -> %s fd=%d id=%d size=%d",
                     sin_ntoa(from), (qsp == NULL) ? dfd : qsp->s_rfd,
                     ntohs(hp->id), cp - msg);
            if (debug >= 10)
                    fp_nquery(msg, cp - msg, log_get_stream(packet_channel));
    #endif /*DEBUG*/

This caught me by surprise, though given my unfamiliarity with the source
code perhaps it should not have. Comments in the src/bin/named/resolv.h
file implied that DEBUG was supposed to be set, so rather than undefine it,
I chose instead to use the ns_wouldlog() tune.

It also became desirable to reduce the overhead of getting to and from
strdup(). As was the case for strcasecmp, the BIND sources contained a
replacement strdup which could be enabled through the port_after.h file.

Tune Six: Minimize overhead of DEBUG and strdup [ named.specint95dbgstdp ]

Since this was a source-file change, it was necessary to refresh the
flow.data file. After this change, the user/kernel split became 53/47. At
this point, one begins to wonder about time spent in the kernel, but that
is best left to another paper :)

The aggregate performance increased to approximately 3,610 requests per
second. This represents an incremental increase of about 6.1 percent and an
overall increase of 90%. Here is the hierarchical profile from puma:

  Samples     Samples
  In or Under In Only
  %Total      %Total    Level Routine
  ----------- --------- ----- -------
          100         0     1 $START$
          100         0     2 _start(hpux_export stub)
          100         0     3 _start
          100         0     4 main(hpux_export stub)
          100         2     5 main
            8         0     6 _recvfrom(hpux_export stub)
            8         0     7 _recvfrom
            8         8     8 _recvfrom_sys
           13         0     6 _select(hpux_export stub)
           13         0     7 select
           13        13     8 _select_sys
           24        14     6 132_6353_$_HLO_ns_req.o##_req_query
            4         1     7 malloc(hpux_export stub)
            4         2     8 malloc
            2         2     9 _ltostr
            1         1     7 _memcpy(hpux_export stub)
            1         1     8 memcpy
            1         1     7 57a_6353_$_HLO_ns_req.o##_578_6353_cl_make_rr
            1         0     7 _strcpy(hpux_export stub)
            1         1     8 strcpy
            1         0     7 _strlen(hpux_export stub)
            1         1     8 strlen
           12         0     6 _gettimeofday(hpux_export stub)
           12        12     7 gettimeofday
           17         0     6 _sendto(hpux_export stub)
           17         0     7 sendto
           17        17     8 _sendto_sys
           23        17     6 doaddinfo
            4         0     7 free(hpux_export stub)
            4         1     8 free
            3         3     9 _ltostr
            1         1     7 ns_debug
            1         0     7 _memcpy(hpux_export stub)
  --------

It shows that about 8% of the remaining time is spent in either malloc or
free, with the bulk of that in the _ltostr calls each makes. It would seem
that the call to free() in doaddinfo() is there to free memory allocated by
the call to savestr() in addname(). These routines appear to be
manipulating entries in a statically allocated array of addinfo structures
not accessed outside of src/bin/named/ns_req.c:

    static struct addinfo addinfo[NADDRECS];

    struct addinfo {
            char            *a_dname;       /* domain name */
            char            *a_rname;       /* referred by */
            u_int16_t       a_rtype;        /* referred by */
            u_int16_t       a_class;        /* class for address */
    };

Given that there is a bounded length to a domain name, it would seem
possible that the char *'s in the addinfo struct could be replaced with
fixed-length strings. This would indeed increase the memory usage somewhat,
but it could save as much as 8% of the time spent in named.
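As a sketch of the direction such a change might take (this is not code
from BIND), the structure could become something like the following.
MAXDNAME is the resolver's bound on a presentation-format domain name, and
the strncpy() usage shown is only illustrative of what the savestr()/free()
pair would be replaced with.

    /*
     * Sketch only - not implemented for the measurements in this paper.
     * Fixed-length name buffers remove the per-record savestr()/free()
     * (and hence malloc/free and _ltostr) traffic, at the cost of a
     * larger static addinfo[] array.  MAXDNAME is from <arpa/nameser.h>.
     */
    struct addinfo {
            char            a_dname[MAXDNAME];      /* domain name */
            char            a_rname[MAXDNAME];      /* referred by */
            u_int16_t       a_rtype;                /* referred by */
            u_int16_t       a_class;                /* class for address */
    };

    /* addname() would then copy rather than strdup(), for example:
     *         strncpy(ap->a_dname, dname, sizeof(ap->a_dname) - 1);
     *         ap->a_dname[sizeof(ap->a_dname) - 1] = '\0';
     */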
I was unable to pursue the option further at the time of this experiment,
but changes would need to be made in at least doaddinfo(), free_addinfo(),
addname() and savestr(). While it may be possible to find a more efficient
malloc() and/or free(), it is unlikely to be any more efficient than not
making the call in the first place.

Time spent in gettimeofday() is now 12% of the total, which once again
raises the question of whether or not the call to gettime() in
datagram_read() is really required. There is a gettimeofday() replacement
in src/lib/bsd/gettimeofday.c, but it makes a call to time(), which is
unlikely to be any faster.

Select appears to be consuming 13% of the time. It would be interesting to
try to reduce the time spent in that system call - perhaps by reducing the
number of times it is called.

The profile from higher levels of optimization becomes more difficult to
interpret, as the compiler inlines and collapses levels considerably
compared with the baseline. The baseline puma profile showed procedure call
nesting getting 17 calls deep; now it is only 9 calls deep.

While performance has increased, the time spent in libc has grown again:

  Samples     Samples
  In or Under In Only
  %Total      %Total    Image or Library
  ----------- --------- ----------------
          100        63 /usr/lib/libc.2
          100        36 /home/bind2/src/bin/named/named
  --------

With the named code more highly optimized this is not terribly surprising,
as libc has remained unchanged.

Tune Seven: Multiple recvfrom() calls in datagram_read()

Assuming that the file descriptor was marked non-blocking, I basically
wrapped the innards of datagram_read() in a for loop of fixed length (a
sketch of such a loop appears after the netperf overview below). The idea
is to read up to a maximum number of requests on the socket and then return
to allow another socket to be serviced. An EWOULDBLOCK or EAGAIN would
result in an immediate return from datagram_read(). This should have the
effect of reducing the number of calls to select() when the named is under
load. A reasonable initial value for the maximum number of times through
the loop seemed to be five. This should cut down on select() without giving
too much preference to one socket over another.

After training (refreshing the flow.data file), this optimization took the
performance up to approximately 3,760 requests per second, or an increase
of roughly four percent. The total increase over baseline becomes almost
98%. The updated profile showed that this indeed reduced the proportion of
time spent in select().

About netperf:

Netperf is a freely-distributed network performance benchmark. Revision
2.1pl4 of netperf includes a rather simplistic DNS performance benchmark
which simply calls res_query() repeatedly for the same name - the name of
my valiant, former workstation. Netperf revision 2.1 was designed for
single, point-to-point tests. While nothing precludes running several tests
concurrently, only "eye-ball" synchronization is available - the tester
must watch the output of the separate instances to make sure they stopped
within a reasonable time of one another. He also has to have a quick mouse
and fingers to make sure that all the tests start within a reasonable time
of one another.

Netperf revision 3.0.0 was designed to simplify aggregate testing. It
provides for the running of multiple tests either as separate processes on
most any Unix system, or as separate threads on a POSIX-compliant system.
In the future, netperf 3.0.0 will be ported to the Win95/NT platform as was
netperf 2.1. Additional information about netperf, as well as a database of
netperf results and pointers to download sites, can be found via the web at
http://www.cup.hp.com/netperf/NetperfPage.html.
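Returning to Tune Seven, here is a minimal sketch of the bounded read loop
described there. It is not the actual change made to ns_main.c; the
MAX_DGRAMS_PER_CALL name and the datagram_read_sketch() wrapper are
invented for the illustration, the ns_req() call follows the fragment
quoted earlier in the paper, the usual ns_main.c includes (<errno.h>,
<arpa/nameser.h>, and friends) are assumed, and error handling is reduced
to the essentials.

    /*
     * Sketch of the Tune Seven idea: drain up to a fixed number of
     * datagrams from a non-blocking socket before returning, so that
     * select() is consulted less often under load.  The paper used a
     * per-call limit of five.
     */
    #define MAX_DGRAMS_PER_CALL 5

    static void
    datagram_read_sketch(int fd) {
            u_char buf[PACKETSZ];
            struct sockaddr_in from;
            int fromlen, n, i;

            for (i = 0; i < MAX_DGRAMS_PER_CALL; i++) {
                    fromlen = sizeof(from);
                    n = recvfrom(fd, (char *)buf, sizeof(buf), 0,
                                 (struct sockaddr *)&from, &fromlen);
                    if (n < 0) {
                            /* EWOULDBLOCK/EAGAIN: the queue is drained;
                             * return and let select() tell us when there
                             * is more to do.  Other errors are also left
                             * to the next pass in this sketch. */
                            return;
                    }
                    if (n < HFIXEDSZ)
                            continue;       /* runt datagram; ignore it */

                    /* Consult database to get the answer. */
                    ns_req(buf, n, PACKETSZ, NULL, from, fd);
            }
            /* hit the per-call limit; let other sockets be serviced */
    }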
Sizes of the various nameds:

  Tune   Name                      Size (bytes)
  -----  ------------------------  ------------
  Base   named.O2                        473615
  One    named.O2logck                   477719
  Two    named.O2logckascii              477726
  Three  named.O3logckascii              543037
  Four   named.O4logckasciipbo           526811
  Five   named.specint95                 559682
  Six    named.specint95dbgstdp          555593

Anecdotal Results from Production Use:

Some intermediate flavors of the tunes have been used in production in the
HP Internet. While the results will not be as clear, for lack of a
controlled test environment, they encompass areas not covered by the
netperf benchmark.

In one test on a two-CPU K410 running HP-UX 10.10, the non-optimized,
non-PBO version of named took 3 minutes, 20.58 seconds to start up and load
92 zone files containing a total of 1,276,513 resource records:

    # time ndc start
    Name Server Started

    real    3m20.58s
    user    3m12.51s
    sys     0m5.12s

The PBO'd named with +O2 libs and a +O4 named was able to perform the same
work in 1 minute, 50.57 seconds:

    # time ndc start
    Name Server Started

    real    1m50.57s
    user    1m42.24s
    sys     0m4.90s

or a reduction in start-up time of roughly 45%. This is good news for two
reasons. First, it shows that the benefits of PBO and higher optimization
levels are broader than just the simple requests and responses measured by
netperf. Second, it shows that these benefits can be had on earlier
versions of HP-UX, and thus with earlier versions of the HP compilers.

Non-Performance Porting Changes:

Initial porting of BIND 8.1.1 to HP-UX 11.00 for these tests did involve a
little work, but was rather straightforward. First, a src/port/hpux11
directory was created and populated with the files from src/port/hpux. The
probe script was edited to replace "B.10.*" with "B.11.*". Further,
Makefile.set was edited to include -lnsl in the SYSLIBS. No other changes
were required for the initial port.

Acknowledgements:

I would like to thank all those who contributed to this paper and thus made
it possible. Among them are Lori Dutra, Denis Handly, "Cousin" LaMont
Jones, Michael Milligan, Chris Ruemmler and Suzanne Woolf. I am certain I
am forgetting folks and apologize for my sieve-like memory.