Copyright 1998, Hewlett-Packard Company

A Brief Look at the Performance of Named 8.1.1 and Steps to Improve Same
ftp://ftp.cup.hp.com/dist/networking/briefs/named_performace.txt

Rick Jones
Systems Networking Solutions Laboratory
Hewlett-Packard Company
Cupertino, CA

Revision 0.9; April 9, 1998

Introduction:

BIND 8.1.1 is the current release of software implementing the Domain Name
Service of the Internet. Virtually the entire Internet is dependent on the
robustness and performance of BIND. This paper takes an all too brief look
at the performance of the named from BIND 8.1.1 running under the HP-UX
11.00 operating system. Potential bottlenecks are described, as well as
potential solutions to those bottlenecks. Also discussed are the effects
upon named performance of various optimizations made available in the HP C
compiler.

This paper will demonstrate that through a combination of compiler
optimizations, small makefile and porting changes, and slight code
modifications it is possible to increase the performance of named by as
much as 98%. This is with no major algorithmic or structural changes to the
BIND source.

Test Environment:

The performance measurements described in this paper were taken with the
DNS_RR test of revision 3.0.0alpha (3/8/98) of the netperf benchmark. While
this revision of netperf is not as complete, nor as pretty, as revision
2.1, it was sufficiently robust and ready for this task. There is little
value to a benchmark for which source code is unavailable. Details on
obtaining copies of the netperf sources and a discussion of the netperf
DNS_RR test can be found later in this paper.

Each measurement was of 15 minutes duration. This allowed the puma
profiling tool to collect approximately 5000 samples and also helped ensure
that any unforeseen periodic glitches in the environment were evened-out.
Four streams of DNS requests were directed at the System Under Test (SUT),
two from each of the client systems. Each stream of requests started at a
random point in a list of 1000 names selected from the hostnames on the
Hewlett-Packard site in Cupertino, California, and kept cycling through the
list until the end of the measurement.

The systems generating the DNS requests were a pair of HP 9000 B160L
workstations running HP-UX 10.20. The B160L contains a PA-7300 CPU running
at 160 MHz. Each system contained 128 MB of RAM and an add-on PCI 100BaseT
Network Interface Card (NIC).

The System Under Test was a uniprocessor HP 9000 Model K460. The K460 uses
a PA-8000 processor with 1 MB of instruction cache and 1 MB of data cache
and has a clock rate of 180 MHz. The HP 9000 Model C180 and the HP 9000
Model D380 (uniprocessor) would have similar characteristics. The SUT also
used a dual-port HSC 100BaseT NIC. Both ports on the card were used during
testing. A caching-only nameserver was run on the SUT. The systems were
joined by an eight-port AdvanceStack 800T switch, and the two logical
100BaseT networks ran in full-duplex mode.

The profiles shown in this paper were taken with the "puma" performance
tool, which is bundled with the HP C compiler. Puma can take profiles of
"normal" binaries without requiring special compilation. More information
about the puma tool can be found via http://www.hp.com/lang.

Baseline Performance: [ named.O2 ]

"Out of the box" compilation under HP-UX uses -g and +O2. For the baseline
figures the -g switch was removed from the compilation and the +O2
retained.
Compiled this way, named was able to serve approximately 1,900 requests per
second using all the CPU resources of the SUT. The paper will refer to this
version of named as "named.O2." At the gross system level, 75% of the CPU
time was spent in user space; the remaining 25% was spent in the kernel.
Puma showed that of the user-space time, roughly 60% was spent in libc and
40% in named code:

  Samples
  In Only
  %Total    Image or Library
  --------- ----------------
         62 /usr/lib/libc.2
         38 /home/bind/named.O2
  --------

From this it would seem that a fertile area of exploration would be
reducing the time spent in libc. Puma's breakdown of time on a per-routine
basis showed that a very large fraction of the CPU time in libc was in
string manipulation routines:

  Samples
  In Only
  %Total    Routine
  --------- -------
          7 (libc.2) _sendto_sys
          7 (named.O2) nlookup
          6 (libc.2) _select_sys
          6 (named.O2) dn_find
          6 (libc.2) _recvfrom_sys
          5 (libc.2) _doprnt
          5 (libc.2) gettimeofday
          4 (named.O2) __ns_name_pton
          3 (libc.2) $$remI
          3 (named.O2) __ns_name_pack
          3 (libc.2) strncasecmp
          3 (libc.2) tolower
          2 (libc.2) isascii(hpux_export stub)
          2 (libc.2) __tolower(shlib_import stub)
          2 (libc.2) __towlower_std(shlib_import stub)
          2 (libc.2) asctime_r
          2 (libc.2) malloc
          2 (libc.2) _ltostr
          1 (named.O2) doaddinfo
          1 (named.O2) mklower
          1 (named.O2) make_rr
          1 (libc.2) gmtime_r
          1 (named.O2) ns_debug
          1 (named.O2) __ns_name_ntop
          1 (libc.2) localtime_r
          1 (named.O2) __log_vwrite
          1 (libc.2) _strncasecmp(hpux_export stub)
          1 (libc.2) free
          1 (libc.2) strcpy
          1 (libc.2) getenv
          1 (named.O2) __evGetNext
          1 (named.O2) req_query
          1 (named.O2) __ns_name_compress
          1 (libc.2) _memcpy(hpux_export stub)
          1 (libc.2) $$divI
          1 (named.O2) match
          1 (libc.2) strlen
  --------

While a few routines accounted for a fairly large share of the CPU time,
the time was rather spread-out.
An hierarchical look at the profile was a bit more enlightening: Samples Samples In or Under In Only %Total %Total Level Routine ----------- --------- ----- ------- 100 0 1 $START$ 100 0 2 _start(hpux_export stub) 100 0 3 _start 100 0 4 main(hpux_export stub) 100 0 5 main 90 0 6 __evDispatch 90 0 7 datagram_read 68 0 8 ns_req 31 1 9 req_query 18 0 10 add_data 17 1 11 make_rr 9 0 12 __dn_comp 9 0 13 __ns_name_compress 6 2 14 __ns_name_pack 4 3 15 dn_find 1 1 16 mklower 2 2 14 __ns_name_pton 7 0 12 addname 5 0 13 savestr 5 0 14 strdup(hpux_export stub) 4 0 15 _strdup 3 2 16 malloc 1 1 17 _ltostr 1 1 16 strlen 2 0 13 strcasecmp(hpux_export stub) 2 0 14 _strcasecmp 1 1 15 __tolower(shlib_import stub) 4 1 10 nlookup 3 0 11 nlookup 2 0 12 nlookup 1 0 13 nlookup 1 0 14 _strncasecmp(hpux_export stub) 1 0 15 strncasecmp 1 0 11 _strncasecmp(hpux_export stub) 1 0 12 strncasecmp 3 0 10 finddata 2 0 11 make_rr 2 0 12 __dn_comp 2 0 13 __ns_name_compress 1 0 14 __ns_name_pack 1 1 15 dn_find 1 1 14 __ns_name_pton 1 0 11 wanted 1 0 10 findns 2 0 10 __dn_expand 2 0 11 __ns_name_uncompress 2 1 12 __ns_name_ntop 4 0 9 sin_ntoa 3 0 10 __inet_ntoa 2 0 11 __inet_ntop 2 0 12 inet_ntop4 2 0 13 _sprintf(hpux_export stub) 2 0 14 sprintf 2 2 15 _doprnt 1 0 10 _sprintf(hpux_export stub) 1 0 11 sprintf 1 1 12 _doprnt 25 1 9 doaddinfo 2 0 10 free(hpux_export stub) 2 1 11 free 1 1 12 _ltostr 1 0 10 ns_debug 14 1 10 nlookup 10 1 11 nlookup 6 1 12 nlookup 1 0 13 _strncasecmp(hpux_export stub) 1 0 14 strncasecmp 4 1 13 nlookup 2 0 14 _strncasecmp(hpux_export stub) 2 1 15 strncasecmp 2 0 12 _strncasecmp(hpux_export stub) 2 1 13 strncasecmp 1 1 14 tolower 1 0 12 isascii(hpux_export stub) 2 0 11 _strncasecmp(hpux_export stub) 2 0 12 strncasecmp 1 1 13 __tolower(shlib_import stub) 1 1 11 isascii(hpux_export stub) 5 0 10 make_rr 4 0 11 __dn_comp 4 0 12 __ns_name_compress 1 1 13 __ns_name_pton 3 1 13 __ns_name_pack 2 2 14 dn_find 1 1 10 match 7 0 9 _sendto(hpux_export stub) 7 0 10 sendto 7 7 11 _sendto_sys 1 0 9 nameserIncr 1 0 10 nameserFind 1 0 11 tree_srch 6 0 8 _recvfrom(hpux_export stub) 6 0 9 _recvfrom 6 6 10 _recvfrom_sys 11 0 8 ctimel 11 0 9 checked_ctime 10 0 10 _ctime(hpux_export stub) 10 0 11 _ctime 8 1 12 localtime_r 4 1 13 gmtime_r 2 2 14 $$remI 2 0 13 _tzset 1 1 14 asctime_r 1 1 14 getenv 1 0 13 _bsearch 2 1 12 asctime_r 1 1 13 $$remI 3 0 8 sin_ntoa 2 0 9 __inet_ntoa 2 0 10 __inet_ntop 2 0 11 inet_ntop4 2 0 12 _sprintf(hpux_export stub) 2 0 13 sprintf 1 1 14 _doprnt 1 0 9 _sprintf(hpux_export stub) 1 0 10 sprintf 1 1 11 _doprnt 2 0 8 gettime 2 0 9 _gettimeofday(hpux_export stub) 2 2 10 gettimeofday 10 1 6 __evGetNext 6 0 7 pselect 6 0 8 _select(hpux_export stub) 6 0 9 select 6 6 10 _select_sys 3 0 7 __evNowTime 3 0 8 _gettimeofday(hpux_export stub) 3 3 9 gettimeofday -------- For instance, the profile showed that roughly 11% of the time was spent in or under the routine ctimel() when called from within datagram_read(). Here is the code fragment from datagram_read() in src/bin/named/ns_main.c: gettime(&tt); ns_debug(ns_log_default, 1, "datagram from %s, fd %d, len %d; now %s", sin_ntoa(from), fd, n, ctimel(tt.tv_sec)); if (n < HFIXEDSZ) return; /* Consult database to get the answer. */ ns_req(buf, n, PACKETSZ, NULL, from, fd); The call to ctimel is coming from a parameter to the ns_debug() routine. Looking at the other parameters one also sees a call to sin_ntoa() which the hierarchical profile shows to be another 3% of the CPU time. The routine ns_debug() is defined in src/bin/named/ns_glue.c. 
It is a rather short routine:

    /*
     * Logging Support
     */
    void
    ns_debug(int category, int level, const char *format, ...) {
            va_list args;

            if (!log_ctx_valid)
                    return;
            va_start(args, format);
            log_vwrite(log_ctx, category, log_debug(level), format, args);
            va_end(args);
    }

which merely checks that some logging context is valid and then prepares a
call to log_vwrite(), which is defined in src/lib/isc/logging.c. Here is
some of the code at the beginning of that routine:

            debugging = (lcp->flags & LOG_OPTION_DEBUG);

            /*
             * If not debugging, short circuit debugging messages very early.
             */
            if (level > 0 && !debugging)
                    return;

The routine attempts to determine early-on if something really needs to be
logged. However, this check is placed at the bottom of the call sequence,
which means that any computationally expensive arguments to ns_debug() will
already have been evaluated - perhaps only to have the work discarded, as
it would seem to be in the default case. Layering is a useful architectural
concept which does not always perform well in practice. This suggests the
first tuning opportunity.

First Tune - Earlier Logging Checks: [/home/bind2/DataSet2.puma]

So, we arrive at the first potential performance improvement - pushing the
short-circuiting of logging earlier in the call stack; in this case, before
the arguments datagram_read() passes to ns_debug() are evaluated. I was not
entirely certain as to the best way to go about performing this task, as
some of the logging information checks are kept rather opaque. So, I added
a routine to src/bin/named/ns_glue.c called ns_wouldlog():

    int
    ns_wouldlog(int level) {
            if (log_ctx_valid)
                    return(log_wouldlog(log_ctx, level));
            return(0);
    }

which in turn calls log_wouldlog(), a routine added to
src/lib/isc/logging.c:

    int
    log_wouldlog(log_context lc, int level) {
            log_context_p lcp;
            int debugging = 0;

            lcp = lc.opaque;
            assert(lcp != NULL);

            debugging = (lcp->flags & LOG_OPTION_DEBUG);

            /*
             * If not debugging, short circuit debugging messages very early.
             */
            if (level > 0 && !debugging)
                    return(0);
            return(1);
    }

[ These routines look remarkably similar to the beginnings of ns_debug()
and log_vwrite() respectively :) ]

This call sequence returns 1 if the logging would likely happen and 0 if it
would not. I would have preferred to simply re-define ns_debug as a macro
with all the necessary checks up front - saving the procedure calls - but
left that to those who are more familiar with the code. The issue of making
ns_debug a macro is complicated by the use of varargs.

The ns_debug() call in datagram_read() becomes:

            gettime(&tt);
            if (ns_wouldlog(1)) {
                    ns_debug(ns_log_default, 1,
                             "datagram from %s, fd %d, len %d; now %s",
                             sin_ntoa(from), fd, n, ctimel(tt.tv_sec));
            }

This version of named was labeled "named.O2logck." Named.O2logck was able
to serve approximately 2,215 requests per second. This is roughly a 16.5%
improvement in performance from the addition of perhaps 20 lines of code
which allow two rather expensive routines (ctimel and sin_ntoa) to be
avoided. Since this performance improvement was not necessarily platform
specific, similar results should be possible on other platforms. After this
tune, the user/kernel split was 72/28.
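As an aside on the macro question raised above: with a pre-C99 compiler
(no variadic macros) the varargs problem can be side-stepped with the
"double parentheses" idiom, so that the expensive arguments are only
evaluated when the message would actually be emitted. The following is a
sketch of that idea only; the NS_DEBUG macro name is invented for the
example and is not part of the BIND source.

    /*
     * Sketch: defer evaluation of ns_debug() arguments until
     * ns_wouldlog() says the message would be logged.  The caller
     * supplies the ns_debug() argument list in its own parentheses.
     */
    #define NS_DEBUG(level, args) \
            do { \
                    if (ns_wouldlog(level)) \
                            ns_debug args; \
            } while (0)

    /* The datagram_read() call site would then look something like: */
    NS_DEBUG(1, (ns_log_default, 1,
                 "datagram from %s, fd %d, len %d; now %s",
                 sin_ntoa(from), fd, n, ctimel(tt.tv_sec)));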
At the library level, puma shows the time breakdown to have become:

  Samples
  In Only
  %Total    Image or Library
  --------- ----------------
         57 /usr/lib/libc.2
         43 /home/bind2/named.O2logck
  --------

The updated profile of time spent per-routine becomes:

  Samples
  In Only
  %Total    Routine
  --------- -------
          8 (libc.2) _sendto_sys
          7 (named.O2logck) dn_find
          7 (named.O2logck) nlookup
          6 (libc.2) _select_sys
          6 (libc.2) _recvfrom_sys
          5 (libc.2) gettimeofday
          4 (named.O2logck) __ns_name_pton
          4 (libc.2) _ltostr
          3 (named.O2logck) __ns_name_pack
          3 (libc.2) tolower
          3 (libc.2) __tolower(shlib_import stub)
          3 (libc.2) isascii(hpux_export stub)
          3 (libc.2) _doprnt
          3 (libc.2) strncasecmp
          2 (libc.2) __towlower_std(shlib_import stub)
          2 (named.O2logck) doaddinfo
          2 (libc.2) malloc
          2 (named.O2logck) mklower
          2 (named.O2logck) make_rr
          2 (libc.2) free
          1 (named.O2logck) __log_vwrite
          1 (named.O2logck) ns_debug
          1 (named.O2logck) __evGetNext
          1 (libc.2) memcpy
          1 (libc.2) _memcpy(hpux_export stub)
          1 (libc.2) _strncasecmp(hpux_export stub)
          1 (libc.2) strcpy
          1 (named.O2logck) stale
          1 (named.O2logck) delete_stale
          1 (named.O2logck) match
          1 (named.O2logck) __ns_name_ntop
          1 (named.O2logck) req_query
          1 (named.O2logck) tree_srch
          1 (named.O2logck) findns
          1 (named.O2logck) __ns_name_compress
          1 (named.O2logck) isascii(hpux_import stub)
  --------

Isascii(), and the stubs used to get to it, account for roughly four
percent of the total time. Tolower() is roughly 8% of the time, and the
case-insensitive string comparison routines are roughly 4%. Most
contemporary operating systems have to deal with native language support
and multi-byte character sets. However, DNS is based on 7-bit ASCII. This
means that there is overhead in the libc string routines for functionality
that named does not need. It may be possible to get rid of those calls to
libc.
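To make concrete what "overhead that named does not need" means here, the
sketch below shows a case-insensitive comparison that assumes 7-bit ASCII
and so involves no locale or multi-byte machinery at all. It is purely an
illustration - the ascii_strncasecmp() and ascii_lower() names are invented
for the example; BIND ships its own replacement routines, which are
discussed in the next section.

    #include <stddef.h>

    /* lower-case an assumed-ASCII character without consulting a locale */
    static int ascii_lower(int c)
    {
            return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
    }

    /* compare at most n characters, ignoring ASCII case */
    int ascii_strncasecmp(const char *s1, const char *s2, size_t n)
    {
            while (n-- != 0) {
                    int c1 = ascii_lower((unsigned char)*s1++);
                    int c2 = ascii_lower((unsigned char)*s2++);
                    if (c1 != c2)
                            return (c1 - c2);
                    if (c1 == '\0')
                            return (0);
            }
            return (0);
    }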
The hierarchical profile shows some clues: Samples Samples In or Under In or Under %Total %Total Level Routine ----------- ----------- ----- ------- 100 100 1 $START$ 100 100 2 _start(hpux_export stub) 100 100 3 _start 100 100 4 main(hpux_export stub) 100 100 5 main 89 89 6 __evDispatch 88 88 7 datagram_read 80 80 8 ns_req 30 30 9 req_query 18 18 10 add_data 17 17 11 make_rr 8 8 12 addname 6 6 13 savestr 5 5 14 strdup(hpux_export stub) 5 5 15 _strdup 3 3 16 malloc 1 1 17 _ltostr 2 2 13 strcasecmp(hpux_export stub) 2 2 14 _strcasecmp 8 8 12 __dn_comp 8 8 13 __ns_name_compress 2 2 14 __ns_name_pton 6 6 14 __ns_name_pack 3 3 15 dn_find 1 1 16 mklower 1 1 15 _memcpy(hpux_export stub) 5 5 10 nlookup 1 1 11 isascii(hpux_export stub) 3 3 11 nlookup 2 2 12 nlookup 1 1 13 nlookup 1 1 14 _strncasecmp(hpux_export stub) 1 1 15 strncasecmp 1 1 11 _strncasecmp(hpux_export stub) 1 1 12 strncasecmp 2 2 10 __dn_expand 2 2 11 __ns_name_uncompress 1 1 12 __ns_name_ntop 1 1 12 __ns_name_unpack 3 3 10 finddata 2 2 11 make_rr 1 1 12 __dn_comp 1 1 13 __ns_name_compress 1 1 14 __ns_name_pack 1 1 15 dn_find 1 1 11 wanted 1 1 10 findns 36 36 9 doaddinfo 16 16 10 nlookup 10 10 11 nlookup 6 6 12 nlookup 3 3 13 nlookup 2 2 14 _strncasecmp(hpux_export stub) 2 2 15 strncasecmp 1 1 13 _strncasecmp(hpux_export stub) 1 1 14 strncasecmp 2 2 12 _strncasecmp(hpux_export stub) 2 2 13 strncasecmp 1 1 14 __towlower_std(shlib_import stub) 1 1 14 __tolower(shlib_import stub) 1 1 12 isascii(hpux_export stub) 3 3 11 _strncasecmp(hpux_export stub) 3 3 12 strncasecmp 1 1 13 __tolower(shlib_import stub) 1 1 13 tolower 1 1 13 __towlower_std(shlib_import stub) 1 1 11 isascii(hpux_export stub) 10 10 10 make_rr 9 9 11 __dn_comp 8 8 12 __ns_name_compress 6 6 13 __ns_name_pack 5 5 14 dn_find 1 1 15 mklower 2 2 13 __ns_name_pton 4 4 10 free(hpux_export stub) 4 4 11 free 2 2 12 _ltostr 1 1 10 match 1 1 10 ns_debug 1 1 11 __log_vwrite 1 1 10 delete_stale 4 4 9 sin_ntoa 1 1 10 _sprintf(hpux_export stub) 1 1 11 sprintf 1 1 12 _doprnt 3 3 10 __inet_ntoa 3 3 11 __inet_ntop 3 3 12 inet_ntop4 3 3 13 _sprintf(hpux_export stub) 3 3 14 sprintf 2 2 15 _doprnt 8 8 9 _sendto(hpux_export stub) 8 8 10 sendto 8 8 11 _sendto_sys 1 1 9 nameserIncr 1 1 10 nameserFind 1 1 11 tree_srch 6 6 8 _recvfrom(hpux_export stub) 6 6 9 _recvfrom 6 6 10 _recvfrom_sys 2 2 8 gettime 2 2 9 _gettimeofday(hpux_export stub) 2 2 10 gettimeofday 11 11 6 __evGetNext 6 6 7 pselect 6 6 8 _select(hpux_export stub) 6 6 9 select 6 6 10 _select_sys 3 3 7 __evNowTime 3 3 8 _gettimeofday(hpux_export stub) 3 3 9 gettimeofday -------- The profile shows that all the isascii() calls appear to be coming from named code, not libc code. That is good because there is a #define _SB_CTYPE_MACROS that one can use to have isascii replaced by a a macro from /usr/include/ctype.h. This should save us the call overhead of going into libc. While searching through the BIND source files for calls to strncasecmp, I discovered the file src/lib/bsd/strcasecmp.c which has replacements for strcasecmp and strncasecmp. These are compiled-in if NEED_STRCASECMP is defined. Switching to the BIND-provided strcasecmp and strncasecmp should save the call-overhead of going into libc and may allow for some additional inlining by the compiler at higher optimization levels. The switch should also eliminate all the calls to tolower contained within the libc versions. Searching through the files, I discovered that the way to get NEED_STRCASECMP defined was to use the "port_after.h" file. 
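As a rough sketch, the NEED_STRCASECMP portion of the change amounts to a
single define added to the port_after.h in the hpux11 port directory
created for this exercise; the value and the surrounding contents of the
file are not reproduced here.

    /* src/port/hpux11/port_after.h (sketch of the addition only) */
    #define NEED_STRCASECMP 1
    /* with this defined, src/lib/bsd/strcasecmp.c supplies strcasecmp()
     * and strncasecmp() in place of the libc versions */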
The "port_before.h" file was used to get _SB_CTYPE_MACROS defined. This
leads to the next version of named - named.O2logckascii.

Tune Two - Reducing String/Character Manipulation Overhead:

The combination of _SB_CTYPE_MACROS and NEED_STRCASECMP altered the
user/kernel split to roughly 67/33, with an increase in performance to
approximately 2,600 transactions per second, or an additional 17.4%. This
was a rather greater increase than I initially expected. It is possible
that the use of the isascii macro and the local strcasecmp/strncasecmp led
to better cache locality. The total increase thus far is roughly 36.8% over
baseline.

The library-level profile from puma shows a further reduction in time spent
in libc:

  Samples
  In Only
  %Total    Image or Library
  --------- ----------------
         52 /usr/lib/libc.2
         48 /home/bind2/named.O2logckascii
  --------

The less time spent in libc, the greater the value of higher levels of
compiler optimization. The per-routine puma profile now looks like:

  Samples
  In Only
  %Total    Routine
  --------- -------
         14 (libc.2) _sendto_sys
          8 (named.O2logckascii) dn_find
          7 (libc.2) _recvfrom_sys
          7 (named.O2logckascii) nlookup
          7 (libc.2) _select_sys
          7 (libc.2) gettimeofday
          5 (named.O2logckascii) __ns_name_pton
          4 (libc.2) _doprnt
          3 (named.O2logckascii) __ns_name_pack
          3 (libc.2) _ltostr
          2 (named.O2logckascii) doaddinfo
          2 (libc.2) malloc
          2 (named.O2logckascii) mklower
          2 (named.O2logckascii) make_rr
          2 (named.O2logckascii) ns_debug
          2 (libc.2) free
          1 (named.O2logckascii) __log_vwrite
          1 (named.O2logckascii) req_query
          1 (libc.2) memcpy
          1 (libc.2) strcpy
          1 (named.O2logckascii) match
          1 (named.O2logckascii) stale
          1 (named.O2logckascii) __evGetNext
          1 (libc.2) _memcpy(hpux_export stub)
          1 (named.O2logckascii) __ns_name_ntop
          1 (named.O2logckascii) findns
          1 (named.O2logckascii) strncasecmp
          1 (named.O2logckascii) delete_stale
          1 (named.O2logckascii) __ns_name_compress
  --------

At this point, much of the string manipulation time is gone. There is a
little left in strcpy. We can also see that the "$$" routines have gone
away - these were another four percent in the previous profile. So, we have
found at least part of the reason for the greater than expected performance
boost. Seven percent of the time in the profile is spent in either _doprnt
or _ltostr. It would be interesting to see where those calls come from.
Here is the hierarchical profile: Samples Samples In or Under In or Under %Total %Total Level Routine ----------- ----------- ----- ------- 100 100 1 $START$ 100 100 2 _start(hpux_export stub) 100 100 3 _start 100 100 4 main(hpux_export stub) 100 100 5 main 87 87 6 __evDispatch 87 87 7 datagram_read 76 76 8 ns_req 30 30 9 req_query 18 18 10 add_data 17 17 11 make_rr 9 9 12 __dn_comp 9 9 13 __ns_name_compress 7 7 14 __ns_name_pack 4 4 15 dn_find 1 1 16 mklower 1 1 15 _memcpy(hpux_export stub) 2 2 14 __ns_name_pton 7 7 12 addname 6 6 13 savestr 6 6 14 strdup(hpux_export stub) 6 6 15 _strdup 3 3 16 malloc 1 1 17 _ltostr 1 1 16 strcpy 2 2 10 findns 4 4 10 finddata 2 2 11 make_rr 2 2 12 __dn_comp 2 2 13 __ns_name_compress 1 1 14 __ns_name_pack 1 1 15 dn_find 1 1 11 wanted 2 2 10 nlookup 1 1 11 nlookup 1 1 12 nlookup 1 1 13 nlookup 1 1 10 free_nsp 2 2 10 __dn_expand 2 2 11 __ns_name_uncompress 1 1 12 __ns_name_ntop 14 14 9 _sendto(hpux_export stub) 14 14 10 sendto 14 14 11 _sendto_sys 5 5 9 sin_ntoa 1 1 10 _sprintf(hpux_export stub) 1 1 11 sprintf 1 1 12 _doprnt 4 4 10 __inet_ntoa 3 3 11 __inet_ntop 3 3 12 inet_ntop4 3 3 13 _sprintf(hpux_export stub) 3 3 14 sprintf 3 3 15 _doprnt 1 1 9 nameserIncr 1 1 10 nameserFind 1 1 11 tree_srch 26 26 9 doaddinfo 6 6 10 nlookup 3 3 11 nlookup 2 2 12 nlookup 1 1 13 nlookup 10 10 10 make_rr 8 8 11 __dn_comp 8 8 12 __ns_name_compress 6 6 13 __ns_name_pack 5 5 14 dn_find 1 1 15 mklower 2 2 13 __ns_name_pton 1 1 11 _memcpy(hpux_export stub) 4 4 10 free(hpux_export stub) 4 4 11 free 2 2 12 _ltostr 1 1 10 delete_stale 1 1 11 stale 1 1 10 ns_debug 1 1 11 __log_vwrite 1 1 10 match 8 8 8 _recvfrom(hpux_export stub) 8 8 9 _recvfrom 7 7 10 _recvfrom_sys 2 2 8 gettime 2 2 9 _gettimeofday(hpux_export stub) 2 2 10 gettimeofday 13 13 6 __evGetNext 7 7 7 pselect 7 7 8 _select(hpux_export stub) 7 7 9 select 7 7 10 _select_sys 5 5 7 __evNowTime 5 5 8 _gettimeofday(hpux_export stub) 4 4 9 gettimeofday -------- The time in _ltostr seems to come from malloc and free - however, after speaking with some people familiar withthe HP libc it turns out that the routine malloc and free call is not _ltostr, but some static function in libc. Ltostr just happens to be the exported symbol nearest to that static function. The time in _doprnt comes from sprintf. Those do not seem to be easily removed at the moment. So, we arrive at our next tune - an increase in the compiler optimization level. Tune Three - +03 Optimization. The increase to +O3 optimization was done by editing src/port/hpux11/Makefile.set file. The manpage for the HP-UX 11.00 unbundled cc has this to say about +O: +Olevel Invoke optimizations selected by level. Defined values for level are: 0 Perform no optimizations. This is the default. 1 Perform optimizations within basic blocks only. 2 Perform level 1 and global optimizations. Same as -O. 3 Perform level 2 as well as interprocedural global optimizations. 4 Perform level 3 as well as doing link time optimizations. Also sends +Oprocelim and +Ofastaccess to the linker (see ld(1)). With +O3 compilation, the user/kernel split becomes roughly 65/35 and the requests per second increases to approximately 2775. This is another 6.3% and represents an increase of approximately 46%. 
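For reference, the Tune Three edit amounts to changing the optimization
flag carried in src/port/hpux11/Makefile.set. The CDEBUG variable shown
below is an assumption about where those flags live in the port file; the
surrounding settings are unchanged.

    # src/port/hpux11/Makefile.set (sketch of the change only)
    # baseline, after removing -g:
    #     'CDEBUG=+O2'
    # Tune Three:
    'CDEBUG=+O3'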
The per-libary profile: Samples In Only %Total Image or Library --------- ---------------- 54 /usr/lib/libc.2 46 /home/bind2/named.O3logckascii -------- The per-routine profile: Samples In Only %Total Routine --------- ------- 14 (libc.2) _sendto_sys 9 (named.O3logckascii) __ns_name_pack 8 (libc.2) _recvfrom_sys 8 (libc.2) gettimeofday 7 (named.O3logckascii) nlookup 6 (libc.2) _select_sys 6 (named.O3logckascii) __ns_name_compress 4 (libc.2) _doprnt 3 (libc.2) _ltostr 2 (libc.2) malloc 2 (named.O3logckascii) ns_debug 2 (named.O3logckascii) doaddinfo 2 (named.O3logckascii) req_query 1 (named.O3logckascii) make_rr 1 (libc.2) free 1 (libc.2) memcpy 1 (named.O3logckascii) __log_vwrite 1 (libc.2) _memcpy(hpux_export stub) 1 (named.O3logckascii) 27_1d20_cl_make_rr 1 (named.O3logckascii) match 1 (named.O3logckascii) nameserCompar 1 (named.O3logckascii) tree_srch 1 (libc.2) strcpy 1 (named.O3logckascii) findns 1 (named.O3logckascii) strncasecmp 1 (named.O3logckascii) finddata 1 (named.O3logckascii) stale 1 (named.O3logckascii) __evGetNext 1 (named.O3logckascii) __ns_name_ntop 1 (libc.2) _strdup 1 (libc.2) strlen 1 (named.O3logckascii) nameserIncr -------- We can see that __ns_name_pack increased in percentage over the previous profile. A look at the hierarchical profile shows why: Samples Samples In or Under In or Under %Total %Total Level Routine ----------- ----------- ----- ------- 100 100 1 $START$ 100 100 2 _start(hpux_export stub) 100 100 3 _start 100 100 4 main(hpux_export stub) 100 100 5 main 86 86 6 __evDispatch 86 86 7 datagram_read 75 75 8 ns_req 28 28 9 req_query 16 16 10 add_data 15 15 11 make_rr 6 6 12 savestr 6 6 13 strdup(hpux_export stub) 5 5 14 _strdup 3 3 15 malloc 1 1 16 _ltostr 1 1 15 strlen 1 1 15 strcpy 7 7 12 __dn_comp 7 7 13 __ns_name_compress 4 4 14 __ns_name_pack 1 1 15 _memcpy(hpux_export stub) 1 1 16 memcpy 1 1 11 getname 3 3 10 finddata 2 2 11 make_rr 1 1 12 __dn_comp 1 1 13 __ns_name_compress 1 1 14 __ns_name_pack 1 1 10 __dn_expand 1 1 11 __ns_name_uncompress 1 1 12 __ns_name_ntop 1 1 10 nameserIncr 1 1 11 tree_srch 2 2 10 nlookup 1 1 11 nlookup 1 1 12 nlookup 1 1 13 nlookup 2 2 10 findns 25 25 9 doaddinfo 4 4 10 free(hpux_export stub) 4 4 11 free 2 2 12 _ltostr 6 6 10 nlookup 4 4 11 nlookup 2 2 12 nlookup 1 1 13 nlookup 10 10 10 27_1d20_cl_make_rr 8 8 11 __dn_comp 8 8 12 __ns_name_compress 5 5 13 __ns_name_pack 1 1 11 _memcpy(hpux_export stub) 1 1 11 ns_debug 1 1 10 match 1 1 10 delete_stale 1 1 10 ns_debug 14 14 9 _sendto(hpux_export stub) 14 14 10 sendto 14 14 11 _sendto_sys 5 5 9 sin_ntoa 4 4 10 __inet_ntoa 3 3 11 __inet_ntop 3 3 12 _sprintf(hpux_export stub) 3 3 13 sprintf 3 3 14 _doprnt 1 1 10 _sprintf(hpux_export stub) 1 1 11 sprintf 1 1 12 _doprnt 2 2 9 nameserIncr 2 2 10 tree_srch 1 1 11 tree_srch 8 8 8 _recvfrom(hpux_export stub) 8 8 9 _recvfrom 8 8 10 _recvfrom_sys 3 3 8 gettime 3 3 9 _gettimeofday(hpux_export stub) 3 3 10 gettimeofday 13 13 6 __evGetNext 6 6 7 __evNowTime 6 6 8 _gettimeofday(hpux_export stub) 5 5 9 gettimeofday 6 6 7 _select(hpux_export stub) 6 6 8 select 6 6 9 _select_sys -------- dn_find and mklower have been pulled-up into __ns_name_pack - the compiler has done some inlining of routines to reduce procedure call overhead and to optimize away some code. At this point, time spent in gettimeofday() is becoming rather large. Gettimeofday is being called from two placed in the profile. One is from a call to gettime() in datagram_read(), and the other from __evNowTime. 
Looking at datagram_read(), it appears that the value placed in the "tt"
variable is used only for the logging call, so it would seem that putting
the call to gettime(&tt) inside the new logging check would be a worthwhile
thing. The tt variable is actually a global defined in ns_glob.h:

            /* often set to the current time */
    DECL struct timeval tt;

"Often" does not imply "required to always be the current time," and a very
brief examination of the call tree (from the profile) beneath
datagram_read() seemed to indicate that moving the gettime(&tt) call would
be safe. However, I am not confident enough in this examination of the code
to suggest that tune without further advice from the experts.

Increasing the level of optimization seemed to help before, so we arrive at
our next tune - going to +O4 compilation.

Tune Four - +O4 and Profile Based Optimization: [ named.O4logckasciipbo ]

The +O4 compilation flag will pretty much attempt to optimize the whole
program as a unit. It does this by deferring much of the code generation
and optimization to the link phase. This might not be a good thing to do
with a library, as it might have undesired results (greatly increased
compile times, and a requirement that the "real" compiler be present on
target systems) for those including the library but not using +O4
themselves. So, to allow for a difference in the compilation of programs
versus libraries, a new flags variable was introduced - CPFLAGS. CPFLAGS is
included in CFLAGS for those programs compiled by src/bin/Makefile, but not
for those programs compiled by src/lib/Makefile.

[ Being interested in reasonably quick results, the author did this in a
way that results in the compiler warning about a higher level of
optimization overriding a lower one. This is something that can be dealt
with in productization. ]

Profile Based Optimization (PBO) is an example of feedback-directed
compilation and linking. Building an application with PBO is a two-step
process. Step one is to compile the application with profiling hooks
inserted. This is accomplished with the +I compiler directive. The
application is then run under what is considered a typical workload. Upon
program termination a "flow.data" file is created, which is then given to
the compiler for a recompilation of the application with the +I replaced by
+P.

The stock named compilation did not appear to have a normal "clean" exit
path as required by PBO. So I piggy-backed on the SIGTERM handler installed
with XSTATS compilation. A new define, "PBO_COMPILATION," was added to the
makefile and a check for it was added to ns_main.c:

    #if defined(XSTATS) || defined(BIND_UPDATE) || defined(PBO_COMPILATION)
    static SIG_FN
    onintr(int sig) {
            ns_need(MAIN_NEED_EXIT);
    }
    #endif
    ...
    #if defined(XSTATS) || defined(BIND_UPDATE) || defined(HP_PROFILING)
            set_signal_handler(SIGTERM, onintr);
    #endif

As I was only interested in the performance of named, and the compiler when
given the +P directive will fail if there is no data for a given program in
the flow.data file, the SUBDIRS variable in src/bin/Makefile was truncated
to just named.

With these tunes in place, performance increases to 2,940 requests per
second. This is another 6% over the previous tune, and represents a 55%
increase over baseline.
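For reference, the overall PBO cycle for named looks roughly like the
following. The make invocations and the way the instrumented server is
driven are simplified; only the +I/+P/flow.data flow and the SIGTERM-based
shutdown are taken from the steps described above.

    # Pass one: build named with instrumentation.  CPFLAGS carries
    # "+O4 +I" for the programs under src/bin, as described above.
    make

    # Run the instrumented named under a representative load (here, the
    # netperf DNS_RR workload), then stop it with SIGTERM so the exit
    # path added above runs and flow.data is written.
    kill -TERM <pid-of-named>

    # Pass two: rebuild with +I replaced by +P; the compiler reads
    # flow.data and lays the code out according to the observed profile.
    make clean
    make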
The hierarchical profile shows why: Samples Samples In or Under In Only %Total %Total Level Routine ----------- --------- ----- ------- 100 0 1 $START$ 100 0 2 _start(hpux_export stub) 100 0 3 _start 100 0 4 main(hpux_export stub) 100 0 5 main 80 0 6 __evDispatch 80 1 7 datagram_read 25 10 8 doaddinfo 6 0 9 __dn_comp 6 2 10 __ns_name_compress 4 4 11 __ns_name_pack 6 0 9 free(hpux_export stub) 5 2 10 free 3 3 11 _ltostr 1 1 9 ns_debug 1 1 10 fprintf(hpux_import stub) 1 0 9 _memcpy(hpux_export stub) 1 1 9 strncasecmp 26 5 8 819_4336_$_HLO_ns_req.o##_req_query 1 1 9 971_4336_$_HLO_ns_resp.o##_96f_4336_cl_findns 7 1 9 strdup(hpux_export stub) 7 1 10 _strdup 4 3 11 malloc 2 2 12 _ltostr 1 1 11 _strlen(shlib_import stub) 1 1 11 strcpy 7 0 9 __dn_comp 7 2 10 __ns_name_compress 5 4 11 __ns_name_pack 1 0 12 _memcpy(hpux_export stub) 1 0 9 ns_debug 1 0 9 __dn_expand 1 0 10 __ns_name_uncompress 1 1 11 __ns_name_ntop 2 0 9 968_4336_$_HLO_ns_req.o##_966_4336_cl_make_rr 2 0 10 __dn_comp 2 1 11 __ns_name_compress 1 1 12 __ns_name_pack 8 0 8 _recvfrom(hpux_export stub) 8 0 9 _recvfrom 8 8 10 _recvfrom_sys 5 0 8 __inet_ntoa 4 0 9 __inet_ntop 4 0 10 _sprintf(hpux_export stub) 4 0 11 sprintf 3 3 12 _doprnt 11 0 8 _sendto(hpux_export stub) 10 0 9 sendto 10 10 10 _sendto_sys 1 0 8 tree_srch 3 0 8 _gettimeofday(hpux_export stub) 3 3 9 gettimeofday 1 0 8 _sprintf(hpux_export stub) 1 0 9 sprintf 1 1 10 _doprnt 20 1 6 __evGetNext 4 0 7 __evNowTime 4 0 8 _gettimeofday(hpux_export stub) 4 4 9 gettimeofday 14 0 7 _select(hpux_export stub) 14 0 8 select 14 14 9 _select_sys -------- it is possible to see that the call tree is now flatter than it was before - that is the result of the inlining done by +O4. It is also possible to see where the optimizer has left its mark - the "HLO" entry points. It would be nice to get more from this. While shipping a +O4 compiled library might not be a good thing to do since it would require a full compiler on target systems, there is no reason why full-bore optimization could not be done just to make named. Also, there may be some additional tweaks to be made to the options. So, we arrive at the next tune: Tune Five - Full-bore optimization a la SPECint95 [ named.specint95 ] I have very little experience with optimization settings, but do know of a place where experts have worked to squeeze every bit out of a program they could. That would be SPEC's CPU benchmarks. Perhaps the switches used to get the SPECint95 figures for a similar system would be of benefit. I used the peak SPECint95 figures for the HP 9000 C240 as a guide. Based on essentially a hunch, the settings used to compile the 126.gcc benchmark were selected, with some changes. First, the binary was not compiled archive. This also meant that the +ESfic could not be used (I discovered this the hard way...). However, it was possible to use +ESlit to put literals in text space. No source code changes were made, which meant that the existing flow.data was still valid without re-training. Since PBO data was available to say which way the branches went +Ostatic_prediction was used - this tells the compiler to mark the binary for static branch prediction. Since compilation time was no object, +Onolimit was used to tell the compiler and linker to take all the time it needed to optimize the binary. With these optimizations in place, the user/kernel split became approximately 57/43 and the requests per second increased to 3400. This is another 15.6% incremental performance improvement, or roughly 79% above baseline. 
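Pulling the options together, the flags used for the named binary itself
(the libraries stayed at +O2) look roughly like the following. Grouping
them into the CPFLAGS variable introduced in Tune Four is a reconstruction;
only the individual switches are taken from the text above.

    # Sketch of the program-only flags at the named.specint95 stage.
    #
    #   +O4                  whole-program optimization at link time
    #   +P                   use the PBO flow.data gathered earlier
    #   +ESlit               place literals in text space
    #   +Ostatic_prediction  mark the binary for static branch prediction
    #   +Onolimit            let the optimizer take all the time it needs
    #
    CPFLAGS = +O4 +P +ESlit +Ostatic_prediction +Onolimit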
At about this time, I became curious about all the time still being spent
in sin_ntoa(), called from ns_req(). It turns out that the define DEBUG
appears to be set, which causes some additional ns_debug() calls to be
included:

    #ifdef DEBUG
            if (debug > 3) {
                    ns_debug(ns_log_packet, 3, "ns_req(from %s)",
                             sin_ntoa(from));
                    fp_nquery(msg, msglen, log_get_stream(packet_channel));
            }
    #endif
    ...
    #ifdef DEBUG
            ns_debug(ns_log_default, 1,
                     "ns_req: answer -> %s fd=%d id=%d size=%d",
                     sin_ntoa(from), (qsp == NULL) ? dfd : qsp->s_rfd,
                     ntohs(hp->id), cp - msg);
            if (debug >= 10)
                    fp_nquery(msg, cp - msg, log_get_stream(packet_channel));
    #endif /*DEBUG*/

This caught me by surprise, though given my unfamiliarity with the source
code perhaps it should not have. Comments in the src/bin/named/resolv.h
file implied that DEBUG was supposed to be set, so rather than undefine it,
I chose instead to use the ns_wouldlog() tune.

It also became desirable to reduce the overhead of getting to and from
strdup(). As was the case for strcasecmp, the BIND sources contained a
replacement strdup which could be enabled through the port_after.h file.

Tune Six: Minimize overhead of DEBUG and strdup [ named.specint95dbgstdp ]

Since this was a source-file change, it was necessary to refresh the
flow.data file. After this change, the user/kernel split became 53/47. At
this point, one begins to wonder about time spent in the kernel, but that
is best left to another paper :)

The aggregate performance increased to approximately 3,610 requests per
second. This represents an incremental increase of about 6.1 percent and an
overall increase of 90%. Here is the hierarchical profile from puma:

  Samples     Samples
  In or Under In Only
  %Total      %Total    Level Routine
  ----------- --------- ----- -------
          100         0     1 $START$
          100         0     2 _start(hpux_export stub)
          100         0     3 _start
          100         0     4 main(hpux_export stub)
          100         2     5 main
            8         0     6 _recvfrom(hpux_export stub)
            8         0     7 _recvfrom
            8         8     8 _recvfrom_sys
           13         0     6 _select(hpux_export stub)
           13         0     7 select
           13        13     8 _select_sys
           24        14     6 132_6353_$_HLO_ns_req.o##_req_query
            4         1     7 malloc(hpux_export stub)
            4         2     8 malloc
            2         2     9 _ltostr
            1         1     7 _memcpy(hpux_export stub)
            1         1     8 memcpy
            1         1     7 57a_6353_$_HLO_ns_req.o##_578_6353_cl_make_rr
            1         0     7 _strcpy(hpux_export stub)
            1         1     8 strcpy
            1         0     7 _strlen(hpux_export stub)
            1         1     8 strlen
           12         0     6 _gettimeofday(hpux_export stub)
           12        12     7 gettimeofday
           17         0     6 _sendto(hpux_export stub)
           17         0     7 sendto
           17        17     8 _sendto_sys
           23        17     6 doaddinfo
            4         0     7 free(hpux_export stub)
            4         1     8 free
            3         3     9 _ltostr
            1         1     7 ns_debug
            1         0     7 _memcpy(hpux_export stub)
  --------

It shows that about 8% of the remaining time is spent in either malloc or
free, with the bulk of that in the _ltostr calls each makes. It would seem
that the call to free() in doaddinfo() is there to free memory allocated by
the call to savestr() in addname(). These routines appear to be
manipulating entries in a statically allocated array of addinfo structures
not accessed outside of src/bin/named/ns_req.c:

    static struct addinfo addinfo[NADDRECS];

    struct addinfo {
            char            *a_dname;       /* domain name */
            char            *a_rname;       /* referred by */
            u_int16_t       a_rtype;        /* referred by */
            u_int16_t       a_class;        /* class for address */
    };

Given that there is a bounded length to a domain name, it would seem
possible that the char *'s in the addinfo struct could be replaced with
fixed-length strings. This would indeed increase the memory usage somewhat,
but it could save as much as 8% of the time spent in named.
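As a sketch of the direction such a change might take (this is not code
from BIND), the structure could become something like the following.
MAXDNAME is the resolver's bound on a presentation-format domain name, and
the strncpy() usage shown is only illustrative of what the savestr()/free()
pair would be replaced with.

    /*
     * Sketch only - not implemented for the measurements in this paper.
     * Fixed-length name buffers remove the per-record savestr()/free()
     * (and hence malloc/free and _ltostr) traffic, at the cost of a
     * larger static addinfo[] array.  MAXDNAME is from <arpa/nameser.h>.
     */
    struct addinfo {
            char            a_dname[MAXDNAME];      /* domain name */
            char            a_rname[MAXDNAME];      /* referred by */
            u_int16_t       a_rtype;                /* referred by */
            u_int16_t       a_class;                /* class for address */
    };

    /* addname() would then copy rather than strdup(), for example:
     *         strncpy(ap->a_dname, dname, sizeof(ap->a_dname) - 1);
     *         ap->a_dname[sizeof(ap->a_dname) - 1] = '\0';
     */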
I was unable to pursue the option further at the time of this experiment,
but changes would need to be made in at least doaddinfo(), free_addinfo(),
addname() and savestr(). While it may be possible to find a more efficient
malloc() and/or free(), it is unlikely to be any more efficient than not
making the call in the first place.

Time spent in gettimeofday() is now 12% of the total, which once again
raises the question of whether or not the call to gettime() in
datagram_read() is really required. There is a gettimeofday() replacement
in src/lib/bsd/gettimeofday.c, but it makes a call to time(), which is
unlikely to be any faster.

Select appears to be consuming 13% of the time. It would be interesting to
try to reduce the time spent in that system call - perhaps by reducing the
number of times it is called.

The profile from higher levels of optimization becomes more difficult to
interpret, as the compiler inlines and collapses levels considerably
compared with the baseline. The baseline puma profile showed procedure call
nesting getting 17 calls deep; now it is only 9 calls deep.

While performance has increased, the time spent in libc has grown again:

  Samples     Samples
  In or Under In Only
  %Total      %Total    Image or Library
  ----------- --------- ----------------
          100        63 /usr/lib/libc.2
          100        36 /home/bind2/src/bin/named/named
  --------

With the named code more highly optimized this is not terribly surprising,
as libc has remained unchanged.

Tune Seven: Multiple recvfrom() calls in datagram_read()

Assuming that the file descriptor was marked non-blocking, I basically
wrapped the innards of datagram_read() in a for loop of fixed length (a
sketch of such a loop appears after the netperf overview below). The idea
is to read up to a maximum number of requests on the socket and then return
to allow another socket to be serviced. An EWOULDBLOCK or EAGAIN would
result in an immediate return from datagram_read(). This should have the
effect of reducing the number of calls to select() when the named is under
load. A reasonable initial value for the maximum number of times through
the loop seemed to be five. This should cut down on select() without giving
too much preference to one socket over another.

After training (refreshing the flow.data file), this optimization took the
performance up to approximately 3,760 requests per second, or an increase
of roughly four percent. The total increase over baseline becomes almost
98%. The updated profile showed that this indeed reduced the proportion of
time spent in select().

About netperf:

Netperf is a freely-distributed network performance benchmark. Revision
2.1pl4 of netperf includes a rather simplistic DNS performance benchmark
which simply calls res_query() repeatedly for the same name - the name of
my valiant, former workstation. Netperf revision 2.1 was designed for
single, point-to-point tests. While nothing precludes running several tests
concurrently, only "eye-ball" synchronization is available - the tester
must watch the output of the separate instances to make sure they stopped
within a reasonable time of one another. He also has to have a quick mouse
and fingers to make sure that all the tests start within a reasonable time
of one another.

Netperf revision 3.0.0 was designed to simplify aggregate testing. It
provides for the running of multiple tests either as separate processes on
most any Unix system, or as separate threads on a POSIX-compliant system.
In the future, netperf 3.0.0 will be ported to the Win95/NT platform as was
netperf 2.1. Additional information about netperf, as well as a database of
netperf results and pointers to download sites, can be found via the web at
http://www.cup.hp.com/netperf/NetperfPage.html.
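Returning to Tune Seven, here is a minimal sketch of the bounded read loop
described there. It is not the actual change made to ns_main.c; the
MAX_DGRAMS_PER_CALL name and the datagram_read_sketch() wrapper are
invented for the illustration, the ns_req() call follows the fragment
quoted earlier in the paper, the usual ns_main.c includes (<errno.h>,
<arpa/nameser.h>, and friends) are assumed, and error handling is reduced
to the essentials.

    /*
     * Sketch of the Tune Seven idea: drain up to a fixed number of
     * datagrams from a non-blocking socket before returning, so that
     * select() is consulted less often under load.  The paper used a
     * per-call limit of five.
     */
    #define MAX_DGRAMS_PER_CALL 5

    static void
    datagram_read_sketch(int fd) {
            u_char buf[PACKETSZ];
            struct sockaddr_in from;
            int fromlen, n, i;

            for (i = 0; i < MAX_DGRAMS_PER_CALL; i++) {
                    fromlen = sizeof(from);
                    n = recvfrom(fd, (char *)buf, sizeof(buf), 0,
                                 (struct sockaddr *)&from, &fromlen);
                    if (n < 0) {
                            /* EWOULDBLOCK/EAGAIN: the queue is drained;
                             * return and let select() tell us when there
                             * is more to do.  Other errors are also left
                             * to the next pass in this sketch. */
                            return;
                    }
                    if (n < HFIXEDSZ)
                            continue;       /* runt datagram; ignore it */

                    /* Consult database to get the answer. */
                    ns_req(buf, n, PACKETSZ, NULL, from, fd);
            }
            /* hit the per-call limit; let other sockets be serviced */
    }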
Sizes of the various nameds:

  Tune   Name                      Size (bytes)
  -----  ------------------------  ------------
  Base   named.O2                        473615
  One    named.O2logck                   477719
  Two    named.O2logckascii              477726
  Three  named.O3logckascii              543037
  Four   named.O4logckasciipbo           526811
  Five   named.specint95                 559682
  Six    named.specint95dbgstdp          555593

Anecdotal Results from Production Use:

Some intermediate flavors of the tunes have been used in production in the
HP Internet. While the results will not be as clear, for lack of a
controlled test environment, they encompass areas not covered by the
netperf benchmark.

In one test on a two-CPU K410 running HP-UX 10.10, the non-optimized,
non-PBO version of named took 3 minutes, 20.58 seconds to start up and load
92 zone files containing a total of 1,276,513 resource records:

    # time ndc start
    Name Server Started

    real    3m20.58s
    user    3m12.51s
    sys     0m5.12s

The PBO'd named with +O2 libs and a +O4 named was able to perform the same
work in 1 minute, 50.57 seconds:

    # time ndc start
    Name Server Started

    real    1m50.57s
    user    1m42.24s
    sys     0m4.90s

or a reduction in start-up time of roughly 45%. This is good news for two
reasons. First, it shows that the benefits of PBO and higher optimization
levels are broader than just the simple requests and responses measured by
netperf. Second, it shows that these benefits can be had on earlier
versions of HP-UX, and thus with earlier versions of the HP compilers.

Non-Performance Porting Changes:

Initial porting of BIND 8.1.1 to HP-UX 11.00 for these tests did involve a
little work, but was rather straightforward. First, a src/port/hpux11
directory was created and populated with the files from src/port/hpux. The
probe script was edited to replace "B.10.*" with "B.11.*". Further,
Makefile.set was edited to include -lnsl in the SYSLIBS. No other changes
were required for the initial port.

Acknowledgements:

I would like to thank all those who contributed to this paper and thus made
it possible. Among them are Lori Dutra, Denis Handly, "Cousin" LaMont
Jones, Michael Milligan, Chris Ruemmler and Suzanne Woolf. I am certain I
am forgetting folks and apologize for my sieve-like memory.