
Scaling memcached at Facebook


If you've read anything about scaling large websites, you've probably heard about memcached.
memcached is a high-performance, distributed memory object caching
system. Here at Facebook, we're likely the world's largest user of
memcached. We use memcached to alleviate database load. memcached is
already fast, but we need it to be faster and more efficient than most
installations. We use more than 800 servers supplying over 28 terabytes
of memory to our users. Over the past year as Facebook's popularity has
skyrocketed, we've run into a number of scaling issues. This
ever-increasing demand has required us to make modifications to both our
operating system and memcached to achieve the performance that provides
the best possible experience for our users.

Because we have thousands and thousands of computers, each running a
hundred or more Apache processes, we end up with hundreds of thousands
of TCP connections open to our memcached processes. The connections
themselves are not a big problem, but the way memcached allocates
memory for each TCP connection is. memcached uses a per-connection
buffer to read and write data out over the network. When you get into
hundreds of thousands of connections, this adds up to gigabytes of
memory: memory that could be better used to store user data. To
reclaim this memory for user data, we implemented a per-thread shared
connection buffer pool for TCP and UDP sockets. This change enabled us
to reclaim multiple gigabytes of memory per server.
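
In rough C terms, the idea looks something like the sketch below. This is a
minimal illustration with hypothetical names, not the actual memcached patch:
each connection borrows a buffer from its worker thread's pool only while it
is actively reading or writing, instead of holding one for its whole lifetime.

    /* Per-thread shared buffer pool (sketch). Because each worker
     * thread owns its pool, no locking is needed on the hot path. */
    #include <stdlib.h>

    #define POOL_BUFS 64
    #define BUF_SIZE  (16 * 1024)

    struct buf_pool {
        char *free_bufs[POOL_BUFS];  /* stack of idle buffers */
        int   nfree;
    };

    static __thread struct buf_pool pool;  /* one pool per thread (GCC __thread) */

    char *pool_acquire(void) {
        if (pool.nfree > 0)
            return pool.free_bufs[--pool.nfree];
        return malloc(BUF_SIZE);  /* pool empty: fall back to the heap */
    }

    void pool_release(char *buf) {
        if (pool.nfree < POOL_BUFS)
            pool.free_bufs[pool.nfree++] = buf;  /* recycle for the next IO */
        else
            free(buf);
    }

With hundreds of thousands of mostly idle connections, only the handful doing
IO at any instant need a buffer, which is where the multi-gigabyte savings
come from.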

Although we improved the memory efficiency with TCP, we moved to UDP
for get operations to reduce network traffic and implement
application-level flow control for multi-gets (gets of hundreds of keys
in parallel). We discovered that under load on Linux, UDP performance
was downright horrible. This is caused by considerable lock contention
on the UDP socket lock when transmitting through a single socket from
multiple threads. Fixing the kernel by breaking up the lock is not
easy. Instead, we used separate UDP sockets for transmitting replies
(with one of these reply sockets per thread). With this change, we were
able to deploy UDP without compromising performance on the backend.
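
The workaround is easy to sketch: give each worker thread its own UDP socket
for transmitting replies, so no two threads ever contend on the same kernel
socket lock. A minimal illustration, with hypothetical names:

    /* One UDP reply socket per worker thread (sketch). Each thread's
     * replies go through its own socket, so the kernel's per-socket
     * transmit lock is never shared between threads. */
    #include <stddef.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    static __thread int reply_fd = -1;  /* thread-local reply socket */

    static int reply_socket(void) {
        if (reply_fd == -1)
            reply_fd = socket(AF_INET, SOCK_DGRAM, 0);  /* lazily, once per thread */
        return reply_fd;
    }

    void send_reply(const void *msg, size_t len,
                    const struct sockaddr_in *client) {
        sendto(reply_socket(), msg, len, 0,
               (const struct sockaddr *)client, sizeof(*client));
    }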

Another issue we saw in Linux is that under load, one core would get
saturated doing network soft interrupt handling, throttling network IO.
In Linux, a network interrupt is delivered to one of the cores, so all
receive soft-interrupt network processing happens on
that one core. Additionally, we saw an excessively high rate of
interrupts for certain network cards. We solved both of these by
introducing “opportunistic” polling of the network interfaces. In this
model, we do a combination of interrupt driven and polling driven
network IO. We poll the network interface anytime we enter the network
driver (typically for transmitting a packet) and from the process
scheduler’s idle loop. In addition, we also take interrupts (to keep
latencies bounded) but we take far fewer network interrupts (typically
by setting interrupt coalescing thresholds aggressively). Since we do
network transmission on every core and since we poll for network IO
from the scheduler’s idle loop, we distribute network processing evenly
across all cores.
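
In rough terms, the hybrid model looks like the following self-contained toy
sketch; the types and functions are simplified stand-ins, not actual kernel
code.

    /* Hybrid interrupt + polling network IO (toy sketch). Receive work
     * is drained (a) whenever a core enters the driver to transmit,
     * (b) from the scheduler's idle loop, and (c) on rare, heavily
     * coalesced interrupts that bound worst-case latency. */
    struct nic { int pending_rx; };

    static void nic_poll_rx(struct nic *dev) {
        while (dev->pending_rx > 0)
            dev->pending_rx--;       /* process one received packet */
    }

    void nic_transmit(struct nic *dev) {
        /* ...place the outgoing packet on the tx ring... */
        nic_poll_rx(dev);            /* opportunistic poll on the transmit path */
    }

    void idle_loop_hook(struct nic *dev) {
        nic_poll_rx(dev);            /* every idle core shares rx processing */
    }

    void nic_irq_handler(struct nic *dev) {
        nic_poll_rx(dev);            /* fires rarely thanks to coalescing */
    }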

Finally, as we started deploying 8-core machines, our testing revealed new
bottlenecks. First, memcached's stat collection relied on a global lock.
While this was a minor nuisance with 4 cores, with 8 cores the lock accounted
for 20-30% of CPU usage. We eliminated this bottleneck by moving stats
collection to per-thread counters and aggregating the results on demand.
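
Per-thread stats are simple to sketch: each worker increments only its own
counters on the hot path, and the occasional stats request sums across
threads. A minimal illustration with hypothetical names, not memcached's
actual code:

    #include <stdint.h>

    #define MAX_THREADS 8

    /* pad each slot to a cache line so threads don't false-share */
    struct thread_stats { uint64_t gets, sets; char pad[48]; };

    static struct thread_stats stats[MAX_THREADS];

    /* hot path: no lock, each thread touches only its own slot */
    void count_get(int tid) { stats[tid].gets++; }

    /* cold path (the "stats" command): aggregate on demand */
    uint64_t total_gets(void) {
        uint64_t sum = 0;
        for (int i = 0; i < MAX_THREADS; i++)
            sum += stats[i].gets;
        return sum;
    }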

Second, we noticed that as we increased the number of threads transmitting
UDP packets, performance decreased. We found significant contention on the
lock that protects each network device's transmit queue. Packets are enqueued
for transmission and dequeued by the device driver. This queue is managed by
Linux's "netdevice" layer, which sits between IP and the device drivers.
Packets are added and removed from the queue one at a time, causing
significant contention. One of our engineers changed the dequeue algorithm to
dequeue packets in batches for transmit, drop the queue lock, and then
transmit the batched packets. This change amortizes the cost of the lock
acquisition over many packets and reduces lock contention significantly,
allowing us to scale memcached to 8 threads on an 8-core system.
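
The shape of that change is easy to sketch: take the queue lock once, pull
off a whole batch, release the lock, and only then hand the batch to the
hardware. The structures below are simplified stand-ins for the kernel's,
with hypothetical names:

    #include <pthread.h>

    #define BATCH 32

    struct pkt { struct pkt *next; };

    struct txq {
        pthread_mutex_t lock;   /* protects the queue head */
        struct pkt *head;
    };

    static void hw_transmit(struct pkt *p) { (void)p; /* hand packet to the NIC */ }

    void tx_drain(struct txq *q) {
        struct pkt *batch[BATCH];
        int n = 0;

        pthread_mutex_lock(&q->lock);
        while (n < BATCH && q->head) {   /* dequeue up to a batch... */
            batch[n++] = q->head;
            q->head = q->head->next;
        }
        pthread_mutex_unlock(&q->lock);  /* ...then drop the lock */

        for (int i = 0; i < n; i++)
            hw_transmit(batch[i]);       /* transmit outside the lock */
    }

One lock round-trip now covers up to a full batch of packets instead of one.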

Since we’ve made all these changes, we have been able to scale
memcached to handle 200,000 UDP requests per second with an average
latency of 173 microseconds. The total throughput achieved is 300,000
UDP requests/s, but the latency at that request rate is too high to be
useful in our system. This is an amazing increase from 50,000 UDP
requests/s using the stock version of Linux and memcached.

We’re hoping to get our changes integrated into the official memcached
repository soon, but until that happens, we’ve decided to release all
our changes to memcached on GitHub.
