From: cybergene (活泼的基因), Board: Linux
Title: The C10K problem
Posted at: SMTH BBS (Mon Dec 13 14:03:09 1999)
The C10K problem
It's time for web servers to handle ten thousand clients simultaneously,
don't you think? After all, the web is a big place now.
And computers are big, too. You can buy a 500MHz machine with 1 gigabyte
of RAM and six 100Mbit/sec Ethernet cards for
$3000 or so. Let's see - at 10000 clients, that's 50KHz, 100Kbytes,
and 60Kbits/sec per client. It shouldn't take any more
horsepower than that to take four kilobytes from the disk and send
them to the network once a second for each of ten thousand
clients. (That works out to $0.30 per client, by the way. Those
$100/client licensing fees some operating systems charge are
starting to look a little heavy!) So hardware is no longer the
bottleneck.
One of the busiest ftp sites, cdrom.com, can actually handle 10000
clients simultaneously through a Gigabit Ethernet pipe to its
ISP. (Here's a somewhat dated page about their configuration.) Pipes
this fast aren't common yet, but technology is improving
rapidly.
There's interest in benchmarking this kind of configuration; see the
discussion on June 28th 1999 on linux-kernel, in which people
propose setting up a testbed that can be used for ongoing benchmarks.
There appears to be interest in this direction from the NSF, too; see
the Web100 project.
With that in mind, here are a few notes on how to configure operating
systems and write code to support thousands of clients. The
discussion centers around Unix-like operating systems, for obvious
reasons.
Book to Read First
If you haven't read it already, go out and get a copy of Unix Network
Programming, Volume 1: Networking APIs: Sockets and XTI, by the late
W. Richard Stevens. It describes many of the I/O
strategies and pitfalls related to writing high-performance servers.
It even talks about the 'thundering herd' problem.
I/O Strategies
There seem to be four ways of writing a fast web server to handle many
clients:
1. Build the server code into the kernel. Novell and Microsoft are
both said to have done this at various times, at least one
NFS implementation does this, and khttpd does this for Linux and
static web pages.
The linux-kernel list has been discussing the pros and cons of this
approach, and the consensus seems to be that instead of
moving web servers into the kernel, the kernel should have the
smallest possible hooks added to improve web server
performance. That way, other kinds of servers can benefit. See e.g.
Zach Brown's remarks about userland vs. kernel http
servers.
2. Serve one client with each server thread, and let read() and
write() block. (This is the only model supported by Java.)
3. Serve many clients with each server process or thread, set
nonblocking mode on all network handles, and use select() or
poll() to tell which network handle has data waiting. This is the
traditional favorite. It's still being improved; see e.g. Niels
Provos' benchmarks with hinting poll and thttpd. (A minimal sketch
of this kind of event loop appears below, after this list.)
An important bottleneck in this method is that read() or sendfile()
from disk blocks if the page is not in core at the moment;
setting nonblocking mode on a disk file handle has no effect.
Same thing goes for memory-mapped disk files. The first time
a server needs disk I/O, its process blocks, all clients must wait,
and that raw nonthreaded performance goes to waste.
Worker threads or processes that do the disk I/O can get around
this bottleneck. One approach is to use memory-mapped
files, and if mincore() indicates I/O is needed, ask a worker to do
the I/O, and continue handling network traffic. Jef
Poskanzer mentions that Pai, Druschel, and Zwaenepoel's Flash web
server uses this trick; they gave a talk at Usenix '99
on it. It looks like mincore() is available in BSD-derived Unixes
like FreeBSD and Solaris, but is not part of the Single Unix
Specification.
4. Serve many clients with each server process or thread, and use
asynchronous I/O to avoid blocking. This has not yet
become popular, possibly because of poorly designed asynchronous
I/O interfaces. Zach Brown (author of HoserFTPD)
thinks this might now be the way to go for highest performance; see
his 14 April 1999 post to hftpd-users.
There are several flavors of asynchronous I/O:
the aio_ interface (scroll down from that link to
"Asynchronous input and output"), which associates a signal and
value with each I/O operation. Signals and their values are
queued and delivered efficiently to the user process. This
is from the POSIX 1003.1b realtime extensions, and is also
in the Single Unix Specification, version 2, and in glibc
2.1. However, the generic glibc 2.1 implementation may have
been written for standards compliance rather than
performance.
Much of the overhead of calling poll() can be eliminated by
using fcntl(fd, F_SETSIG, signum), which associates a
signal with each file descriptor, and raises that signal
when a normal I/O function like read() or write() completes.
Available in Linux 2.3.21 or higher.
To use this, you choose a realtime signal number, mask that
signal and SIGIO with sigprocmask(), use fcntl(fd,
F_SETSIG, signum) on each fd, write a normal poll() outer
loop, and inside it, after you've handled all the fd's
noticed by poll(), you loop calling sigwaitinfo().
If sigwaitinfo returns your realtime signal, siginfo.si_fd and
siginfo.si_band give the same information as pollfd.fd and
pollfd.revents would after a call to poll(), so you handle the
i/o, and continue calling sigwaitinfo().
If sigwaitinfo returns a traditional SIGIO, the signal queue
overflowed, so you flush the signal queue by temporarily
changing the signal handler to SIG_DFL, and break back to
the outer poll() loop.
You can support older kernels by surrounding the sigwaitinfo
code with #if defined(LINUX_VERSION_CODE)
&& (LINUX_VERSION_CODE >= KERNEL_VERSION(2,3,21))
See Zach Brown's phhttpd for example code that uses this
feature, and deals with some of the gotchas, e.g. events
queued before an fd is dealt with or closed, but arriving
afterwards. (A sketch of the basic recipe appears below, after
this list.)
SIGIO (see glibc doc or BSD Sockets doc) -- doesn't tell you
which handle needs servicing, so it seems kind of
coarse. Used by the Linux F_SETSIG/aio_ implementation as a
fallback when the realtime signal queue overflows.
Here's an example of its use. (Was partly broken in Linux
kernels 2.2.0 - 2.2.7, fixed in 2.2.8.)
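Here is a minimal sketch of the event loop behind strategy 3 (one
process, nonblocking handles, select()). It is illustrative only, not
taken from any of the servers mentioned here; listen_fd is assumed to
be a bound, listening socket, handle_client() is a placeholder, and
error handling plus removal of closed clients are omitted:

    /* Sketch of strategy 3: one process, nonblocking sockets, select().
     * listen_fd is a bound, listening socket; handle_client() is a
     * placeholder that should read()/write() until it sees EAGAIN.
     * Error handling and removal of closed clients are omitted.
     */
    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <fcntl.h>

    extern void handle_client(int fd);       /* placeholder */

    static void set_nonblocking(int fd)
    {
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    }

    void serve(int listen_fd)
    {
        int client[FD_SETSIZE], nclients = 0, i, maxfd;
        fd_set fds;

        set_nonblocking(listen_fd);
        for (;;) {
            FD_ZERO(&fds);
            FD_SET(listen_fd, &fds);
            maxfd = listen_fd;
            for (i = 0; i < nclients; i++) {
                FD_SET(client[i], &fds);     /* only valid for fd < FD_SETSIZE */
                if (client[i] > maxfd)
                    maxfd = client[i];
            }
            if (select(maxfd + 1, &fds, NULL, NULL, NULL) <= 0)
                continue;
            if (FD_ISSET(listen_fd, &fds)) {
                int fd = accept(listen_fd, NULL, NULL);
                if (fd >= 0 && fd < FD_SETSIZE) {
                    set_nonblocking(fd);
                    client[nclients++] = fd;
                } else if (fd >= 0) {
                    close(fd);               /* select() can't watch this one */
                }
            }
            for (i = 0; i < nclients; i++)
                if (FD_ISSET(client[i], &fds))
                    handle_client(client[i]);
        }
    }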
Richard Gooch has written a paper discussing these options.
Interesting reading.
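And here, equally hedged, is roughly what the F_SETSIG/sigwaitinfo()
recipe described above might look like. This is a sketch under the
assumptions stated in that recipe (Linux 2.3.21 or later, a spare
realtime signal), not code from phhttpd; setup_fd(), handle_io(), and
MY_RTSIG are invented names, and handle_io() must tolerate duplicate
notifications for the reasons mentioned above:

    /* Sketch of the F_SETSIG + sigwaitinfo() scheme described above
     * (Linux 2.3.21 or later).  setup_fd(), handle_io() and MY_RTSIG are
     * invented names; error handling is omitted.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <signal.h>
    #include <unistd.h>
    #include <sys/poll.h>

    #define MY_RTSIG (SIGRTMIN + 1)          /* any unused realtime signal */

    extern void handle_io(int fd, int events);  /* placeholder; must tolerate
                                                   duplicate notifications */

    void setup_fd(int fd)
    {
        fcntl(fd, F_SETOWN, getpid());       /* deliver signals to this process */
        fcntl(fd, F_SETSIG, MY_RTSIG);       /* ...as our realtime signal       */
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_ASYNC | O_NONBLOCK);
    }

    void event_loop(struct pollfd *pfds, int nfds)
    {
        sigset_t sigs;
        siginfo_t info;
        int i;

        sigemptyset(&sigs);
        sigaddset(&sigs, MY_RTSIG);
        sigaddset(&sigs, SIGIO);
        sigprocmask(SIG_BLOCK, &sigs, NULL); /* keep the signals queued */

        for (;;) {
            poll(pfds, nfds, -1);            /* the outer poll() loop */
            for (i = 0; i < nfds; i++)
                if (pfds[i].revents)
                    handle_io(pfds[i].fd, pfds[i].revents);

            for (;;) {                       /* then drain the signal queue */
                if (sigwaitinfo(&info, &sigs) < 0)
                    break;
                if (info.si_signo == SIGIO) {
                    /* queue overflowed: discard what's left and fall back
                       to the outer poll() */
                    signal(MY_RTSIG, SIG_IGN);
                    signal(MY_RTSIG, SIG_DFL);
                    break;
                }
                handle_io(info.si_fd, info.si_band);
            }
        }
    }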
The Apache mailing lists have some interesting posts about why they
prefer not to use select() (basically, they think that makes
plugins harder). Still, they're planning to use
select()/poll()/sendfile() for static requests in Apache 2.0.
Mark Russinovich wrote an editorial and an article discussing I/O
strategy issues in the 2.2 Linux kernel. Worth reading, even though he
seems misinformed on some points. In particular, he seems to think
that Linux 2.2's asynchronous I/O (see F_SETSIG above)
doesn't notify the user process when data is ready, only when new
connections arrive. This seems like a bizarre misunderstanding.
See also comments on an earlier draft, a rebuttal from Mingo,
Russinovich's comments of 2 May 1999, a rebuttal from Alan Cox,
and various posts to linux-kernel. I suspect he was trying to say that
Linux doesn't support asynchronous disk I/O, which is
somewhat true.
There was an interesting discussion on linux-kernel in September 1999
titled "> 15,000 Simultaneous Connections". (second
week, third week) Highlights:
Ed Hall posted a few notes on his experiences; he's achieved
>1000 connects/second on a UP P2/333 running Solaris. His
code used a small pool of threads (1 or 2 per CPU) each managing
a large number of clients using "an event-based model".
Mike Jagdis posted an analysis of poll/select overhead, and said
"The current select/poll implementation can be improved
significantly, especially in the blocking case, but the overhead
will still increase with the number of descriptors because
select/poll does not, and cannot, remember what descriptors are
interesting. This would be easy to fix with a new API.
Suggestions are welcome..."
Mike posted about his work on improving select() and poll().
Mike posted a bit about a possible API to replace poll()/select():
"How about a 'device like' API where you write 'pollfd
like' structs, the 'device' listens for events and delivers 'pollfd
like' structs representing them when you read it? ... "
Rogier Wolff suggested using "the API that the digital guys
suggested", http://www.cs.rice.edu/~gaurav/papers/usenix99.ps
Joerg Pommnitz pointed out that any new API along these lines
should be able to wait for not just file descriptor events, but
also signals and maybe SYSV-IPC. Our synchronization primitives
should certainly be able to do what Win32's
WaitForMultipleObjects can, at least.
Stephen Tweedie asserted that the combination of F_SETSIG, queued
realtime signals, and sigwaitinfo() was a superset of the API
proposed in http://www.cs.rice.edu/~gaurav/papers/usenix99.ps. He
also mentions that you keep the signal blocked at all times if
you're interested in performance; instead of the signal being
delivered asynchronously, the process grabs the next one from the
queue with sigwaitinfo().
Jayson Nordwick compared completion ports with the F_SETSIG
synchronous event model, and concluded they're pretty
similar.
Alan Cox noted that an older rev of SCT's SIGIO patch is included
in 2.3.18ac.
Jordan Mendelson posted some example code showing how to use
F_SETSIG.
Stephen C. Tweedie continued the comparison of completion ports and
F_SETSIG, and noted: "With a signal dequeuing
mechanism, your application is going to get signals destined for
various library components if libraries are using the same
mechanism," but the library can set up its own signal handler, so
this shouldn't affect the program (much).
Doug Royer noted that he'd gotten 100,000 connections on Solaris
2.6 while he was working on the Sun calendar server.
Others chimed in with estimates of how much RAM that would
require on Linux, and what bottlenecks would be hit.
Interesting reading!
Limits on open filehandles
Any Unix: the limits set by ulimit or setrlimit.
Solaris: see the Solaris FAQ, question 3.45.
FreeBSD: use sysctl -w kern.maxfiles=nnnn to raise the limit.
Linux: See Bodo Bauer's /proc documentation. On current 2.2.x
kernels,
echo 32768 > /proc/sys/fs/file-max
echo 65536 > /proc/sys/fs/inode-max
increases the system limit on open files, and
ulimit -n 32768
increases the current process' limit. I verified that a process
on Red Hat 6.0 (2.2.5 or so plus patches) can open at least
31000 file descriptors this way. Another fellow has verified that a
process on 2.2.12 can open at least 90000 file
descriptors this way (with appropriate limits). The upper bound
seems to be available memory.
Stephen C. Tweedie posted about how to set ulimit limits globally
or per-user at boot time using initscript and pam_limit.
In older 2.2 kernels, though, the number of open files per
process is still limited to 1024, even with the above changes.
See also Oskar's 1998 post, which talks about the per-process and
system-wide limits on file descriptors in the 2.0.36
kernel.
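The per-process limit can also be raised from inside the server itself
(up to the hard limit, and only if the system-wide limit above is high
enough); a minimal sketch:

    /* Sketch: raise this process's open-file limit to its hard limit.
     * Raising the hard limit itself requires root (or the pam/initscript
     * setup mentioned above).
     */
    #include <sys/time.h>
    #include <sys/resource.h>
    #include <stdio.h>

    int raise_fd_limit(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
            return -1;
        rl.rlim_cur = rl.rlim_max;           /* soft limit up to the hard limit */
        if (setrlimit(RLIMIT_NOFILE, &rl) < 0) {
            perror("setrlimit");
            return -1;
        }
        return 0;
    }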
Limits on threads
On any architecture, you may need to reduce the amount of stack space
allocated for each thread to avoid running out of virtual
memory. If you're using pthreads, you can set this at thread-creation
time with pthread_attr_init() and pthread_attr_setstacksize(), as
sketched below.
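A minimal sketch of that, assuming pthreads; the 128KB figure is only
an example, and client_thread() is a placeholder:

    /* Sketch: create client threads with a small stack (128KB here, an
     * arbitrary example) so that thousands of them fit in the process's
     * address space.  client_thread() is a placeholder.
     */
    #include <pthread.h>

    extern void *client_thread(void *arg);   /* placeholder thread body */

    int spawn_client_thread(void *arg)
    {
        pthread_t tid;
        pthread_attr_t attr;

        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 128 * 1024);
        return pthread_create(&tid, &attr, client_thread, arg);
    }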
Solaris: it supports as many threads as will fit in memory, I hear.
FreeBSD: ?
Linux: Even the 2.2.13 kernel limits the number of threads, at
least on Intel. I don't know what the limits are on other
architectures. Mingo posted a patch for 2.1.131 on Intel that
removed this limit. It appears to be integrated into 2.3.20.
See also Volano's detailed instructions for raising file, thread,
and FD_SET limits in the 2.2 kernel. Wow. This stunning little
document steps you through a lot of stuff that would be hard to
figure out yourself. This is a must-read, even though some
of its advice is already out of date.
Java: See Volano's detailed benchmark info, plus their info on
how to tune various systems to handle lots of threads.
Other limits/tips
select() is limited to FD_SETSIZE handles. This limit is compiled
into the standard library and user programs. The similar
call poll() does not have a comparable limit, and can have less
overhead than select().
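For comparison, a poll()-based wait uses an ordinary array of struct
pollfd, so nothing is bounded by FD_SETSIZE; a small illustrative
fragment:

    /* Sketch: wait for input on an arbitrary number of descriptors.
     * Unlike select(), nothing here is bounded by FD_SETSIZE; only the
     * per-process and system-wide descriptor limits apply.
     */
    #include <sys/poll.h>
    #include <stdlib.h>

    int wait_for_input(const int *fds, int nfds)
    {
        struct pollfd *pfds = malloc(nfds * sizeof(*pfds));
        int i, n;

        if (pfds == NULL)
            return -1;
        for (i = 0; i < nfds; i++) {
            pfds[i].fd = fds[i];
            pfds[i].events = POLLIN;
        }
        n = poll(pfds, nfds, -1);
        free(pfds);
        return n;
    }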
Old system libraries might use 16 bit variables to hold file
handles, which causes trouble above 32767 handles. glibc2.1
should be ok.
Many systems use 16 bit variables to hold process or thread id's.
It would be interesting to port the Volano scalability
benchmark to C, and see what the upper limit on number of threads
is for the various operating systems.
Too much thread-local memory is preallocated by some operating
systems; if each thread gets 1MB, and total VM space
is 2GB, that creates an upper limit of 2000 threads.
Normally, data gets copied many times on its way from here to
there. mmap() and sendfile() can be used to reduce this
overhead in some cases. IO-Lite is a proposal (already
implemented on FreeBSD) for a set of I/O primitives that gets rid
of the need for many copies. It's sexy; go read it. But see also
Alan Cox's opinion of zero-copy.
The sendfile() function in Linux and FreeBSD lets you tell the
kernel to send part or all of a file. This lets the OS do it as
efficiently as possible. It can be used equally well in servers
using threads or servers using nonblocking I/O. (In Linux, it's
poorly documented at the moment; use _syscall4 to call it. Andi
Kleen is writing new man pages that cover this.) Rumor
has it, ftp.cdrom.com benefited noticeably from sendfile().
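For illustration, here is roughly what serving a whole file with
Linux sendfile() looks like. This sketch assumes a C library that
already provides the wrapper and prototype (otherwise declare the
system call yourself with _syscall4, as noted above); the FreeBSD
sendfile() takes different arguments:

    /* Sketch of serving a whole file with Linux sendfile(); assumes a C
     * library that provides the wrapper (otherwise declare the system
     * call with _syscall4, as noted above).  FreeBSD's sendfile() takes
     * different arguments.  On a nonblocking socket you would retry when
     * the socket becomes writable instead of looping like this.
     */
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    int send_whole_file(int sock, const char *path)
    {
        struct stat st;
        off_t offset = 0;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return -1;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }
        /* the kernel copies straight from the page cache to the socket */
        while (offset < st.st_size)
            if (sendfile(sock, fd, &offset, st.st_size - offset) < 0)
                break;
        close(fd);
        return offset == st.st_size ? 0 : -1;
    }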
A new socket option under Linux, TCP_CORK, tells the kernel to
avoid sending partial frames, which helps a bit e.g.
when there are lots of little write() calls you can't bundle
together for some reason. Unsetting the option flushes the buffer.
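A minimal sketch of using TCP_CORK around a burst of small writes, as
described above (Linux-specific):

    /* Sketch of TCP_CORK (Linux-specific): cork before a burst of small
     * writes, uncork to flush whatever partial frame remains.
     */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    static void set_cork(int sock, int on)
    {
        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
    }

    /* usage:
     *   set_cork(sock, 1);
     *   write(sock, header, header_len);    small writes accumulate...
     *   write(sock, body, body_len);
     *   set_cork(sock, 0);                  ...and go out in full frames
     */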
Not all threads are created equal. The clone() function in Linux
(and its friends in other operating systems) lets you create a
thread that has its own current working directory, for instance,
which can be very helpful when implementing an ftp server.
See Hoser FTPd for an example of the use of native threads rather
than pthreads.
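As an illustration only (the flag set here is an assumption, not taken
from Hoser FTPd or any other server), a clone() call that shares memory
and file descriptors but leaves out CLONE_FS, so each client thread can
chdir() on its own, might look like:

    /* Sketch: a clone() call sharing memory and descriptors but not
     * filesystem state (no CLONE_FS), so each client thread keeps its own
     * current directory.  The flag set is illustrative; ftp_client() is a
     * placeholder.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdlib.h>

    #define CHILD_STACK_SIZE (64 * 1024)

    extern int ftp_client(void *arg);        /* placeholder thread body */

    int spawn_ftp_thread(void *arg)
    {
        char *stack = malloc(CHILD_STACK_SIZE);

        if (stack == NULL)
            return -1;
        /* clone() wants the top of the child's stack (it grows down on x86) */
        return clone(ftp_client, stack + CHILD_STACK_SIZE,
                     CLONE_VM | CLONE_FILES | CLONE_SIGHAND, arg);
    }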
To keep the number of filehandles per process down, servers can
fork() once they reach the desired maximum; the child
finishes serving the existing clients, and the parent accepts and
services new clients. (If the desired maximum is 1, this
degenerates to the classical one-process-per-client model.)
One developer using sendfile() with FreeBSD reports that using
POLLWRBAND instead of POLLOUT makes a big
difference.
Look at the performance comparison graph at the bottom of
http://www.acme.com/software/thttpd/benchmarks.html.
Notice how various servers have trouble above 128 connections, even
on Solaris 2.6? Anyone who figures out why, let me
know.
Note: if the TCP stack has a bug that causes a short (200ms)
delay at SYN or FIN time, as Linux 2.2.0-2.2.6 had, and
the OS or http daemon has a hard limit on the number of connections
open, you would expect exactly this behavior. There
may be other causes.
"Re: fix for hybrid server problems" by Vivek Sadananda Pai
(vivek@cs.rice.edu) on new-httpd, May 9th, notes:
"I've compared the raw performance of a select-based server
with a multiple-process server on both
FreeBSD and Solaris/x86. On microbenchmarks, there's only a
marginal difference in performance stemming
from the software architecture. The big performance win for
select-based servers stems from doing
application-level caching. While multiple-process servers
can do it at a higher cost, it's harder to get the same
benefits on real workloads (vs microbenchmarks). I'll be
presenting those measurements as part of a paper
that'll appear at the next Usenix conference. If you've got
postscript, the paper is available at
http://www.cs.rice.edu/~vivek/flash99/"
Kernel Issues
For Linux, it looks like kernel bottlenecks are being fixed constantly.
See Linux HQ, Kernel Traffic, and the Linux-Kernel mailing
list. (Interesting example posts: one from a user asking how to tune,
and one from Dean Gaudet.)
In March 1999, Microsoft sponsored a benchmark comparing NT to Linux
at serving large numbers of http and smb clients, in
which they failed to see good results from Linux. See also my article on
Mindcraft's April 1999 Benchmarks for more info.
See also The Linux Scalability Project. They're doing interesting work,
including Niels Provos' hinting poll patch.
See also Mike Jagdis' work on improving select() and poll(); here's
Mike's post about it.
Solaris 7 supposedly supports an even more efficient poll() scheme which
involves /dev/poll; apparently, if you like, you can avoid
having to tell the OS constantly what files you're interested in,
which should help throughput.
Mohit Aron (aron@cs.rice.edu) writes that rate-based clocking in TCP can
improve HTTP response time over 'slow' connections
by 80%.
Measuring Server Performance
Two tests in particular are simple, interesting, and hard:
1.raw connections per second (how many 512 byte files per second
can you serve?)
2.total transfer rate on large files with many slow clients (how many
28.8k modem clients can simultaneously download from
your server before performance goes to pot?)
Jef Poskanzer has published benchmarks comparing many web servers. See
http://www.acme.com/software/thttpd/benchmarks.html for his results.
I also have a few old notes about comparing thttpd to Apache that may be
of interest to beginners.
Chuck Lever keeps reminding us about Banga and Druschel's paper on web
server benchmarking. It's worth a read.
Interesting sigio/siginfo-based servers
Zach Brown's phhttpd - "a quick web server that was written to
showcase the sigio/siginfo event model. consider this code
highly experimental and yourself highly mental if you try and use
it in a production environment." Uses the siginfo features of
2.3.21 or later, and includes the needed patches for earlier
kernels. Rumored to be even faster than khttpd. See his post of
31 May 1999 for some notes.
Interesting select()-based servers
thttpd Very simple. Uses a single process. It has good performance,
but doesn't scale with the number of CPUs.
mathopd. Similar to thttpd.
fhttpd
boa
Roxen
Zeus, a commercial server that tries to be the absolute fastest.
See their tuning guide.
The other non-Java servers listed at
http://www.acme.com/software/thttpd/benchmarks.html
BetaFTPd
Flash-Lite - web server using IO-Lite.
Flash: An efficient and portable Web server -- uses select(),
mmap(), mincore()
xitami - uses select() to implement its own thread abstraction
for portability to systems without threads.
Medusa - a server-writing toolkit in Python that tries to deliver
very high performance.
Interesting thread-based servers
Hoser FTPD. See their benchmark page.
Peter Eriksson's phttpd and
pftpd
The Java-based servers listed at
http://www.acme.com/software/thttpd/benchmarks.html
Sun's Java Web Server (which has been reported to handle 500
simultaneous clients)
Interesting in-kernel servers
khttpd
Other interesting links
Novell's FastCache -- claims 10000 hits per second. Quite the
pretty performance graph.
Rik van Riel's Linux Performance Tuning site
Copyright 1999 Dan Kegel
dank@alumni.caltech.edu
Last updated: 24 October 1999