
NFS Optimization (Part 2)


5.3. Overflow of Fragmented Packets

Using an rsize or wsize larger than your network's MTU (often set to 1500
in many networks) will cause IP packet fragmentation when using NFS over
UDP. IP packet fragmentation and reassembly require a significant amount
of CPU resource at both ends of a network connection. In addition, packet
fragmentation also exposes your network traffic to greater unreliability,
since a complete RPC request must be retransmitted if a UDP packet
fragment is dropped for any reason. Any increase in RPC retransmissions,
along with the possibility of increased timeouts, is the single worst
impediment to performance for NFS over UDP.
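
As a quick sanity check before choosing rsize and wsize, you can compare
them against your interface's MTU. A minimal sketch, assuming a
hypothetical interface eth0 and a hypothetical export server:/home:

    # /sbin/ifconfig eth0 | grep -i mtu
    # mount -t nfs -o rsize=8192,wsize=8192,udp server:/home /mnt/home

With the MTU at 1500, an 8k read or write is still split into several IP
fragments; the point of the check is simply to know how many fragments
each RPC will generate before you raise the block size further.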

Packets may be dropped for many reasons. If your network topology is
complex, fragment routes may differ, and may not all arrive at the Server
for reassembly. NFS Server capacity may also be an issue, since the kernel
has a limit on how many fragments it can buffer before it starts throwing
away packets. With kernels that support the /proc filesystem, you can
monitor the files /proc/sys/net/ipv4/ipfrag_high_thresh and
/proc/sys/net/ipv4/ipfrag_low_thresh. Once the number of unprocessed,
fragmented packets reaches the number specified by ipfrag_high_thresh (in
bytes), the kernel will simply start throwing away fragmented packets
until the number of incomplete packets reaches the number specified by
ipfrag_low_thresh.
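
If you suspect the server is running out of reassembly buffer space, you
can inspect the current thresholds and, tentatively, raise them; the
values below are illustrative rather than recommended:

    # cat /proc/sys/net/ipv4/ipfrag_high_thresh
    # cat /proc/sys/net/ipv4/ipfrag_low_thresh
    # echo 524288 > /proc/sys/net/ipv4/ipfrag_high_thresh
    # echo 393216 > /proc/sys/net/ipv4/ipfrag_low_thresh

As with the socket buffer settings discussed in Section 5.7, changes made
through /proc do not survive a reboot.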

Another counter to monitor is IP: ReasmFails in the file /proc/net/snmp;
this is the number of fragment reassembly failures. If it goes up too
quickly during heavy file activity, you may have a problem.
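
To watch that counter, you can print the two Ip: lines of /proc/net/snmp
(the first is the header, the second the values) and read off the
ReasmFails column, or let awk match the column by name. A small sketch;
the field position can vary between kernels, which is why the header is
matched rather than hard-coded:

    # grep '^Ip:' /proc/net/snmp
    # awk '/^Ip:/ { if (!n) { for (i = 1; i <= NF; i++) if ($i == "ReasmFails") n = i } else print $n }' /proc/net/snmp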


5.4. NFS Over TCP

A new feature, available for both 2.4 and 2.5 kernels but not yet
integrated into the mainstream kernel at the time of this writing, is
NFS over TCP. Using TCP has a distinct advantage and a distinct
disadvantage over UDP. The advantage is that it works far better than
UDP on lossy networks. When using TCP, a single dropped packet can be
retransmitted, without the retransmission of the entire RPC request,
resulting in better performance on lossy networks. In addition, TCP will
handle network speed differences better than UDP, due to the underlying
flow control at the network level.
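
If your kernel and nfs-utils carry the TCP patches, selecting TCP is just
a mount option. A sketch, with a hypothetical server and mount point:

    # mount -t nfs -o tcp,rsize=32768,wsize=32768 server:/home /mnt/home

The network section of the nfsstat output shows separate udp and tcp
packet counters, which is a convenient way to confirm which transport is
actually in use.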

The disadvantage of using TCP is that it is not a stateless protocol
like UDP. If your server crashes in the middle of a packet transmission,
the client will hang and any shares will need to be unmounted and
remounted.

The overhead incurred by the TCP protocol will result in somewhat
slower performance than UDP under ideal network conditions, but the cost
is not severe, and is often not noticeable without careful measurement.
If you are using gigabit ethernet from end to end, you might also
investigate the usage of jumbo frames, since the high speed network may
allow the larger frame sizes without encountering increased collision
rates, particularly if you have set the network to full duplex.
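
Enabling jumbo frames is a matter of raising the interface MTU, provided
every card and switch on the path supports the larger size. A minimal
sketch, assuming a hypothetical interface eth0 and the common 9000-byte
jumbo frame size:

    # /sbin/ifconfig eth0 mtu 9000

If any device along the path is still limited to 1500-byte frames, the
larger packets will be fragmented or dropped, so change the MTU on both
ends (and on the switch) together.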


5.5. Timeout and Retransmission Values

Two mount command options, timeo and retrans, control the behavior of UDP
requests when encountering client timeouts due to dropped packets, network
congestion, and so forth. The -o timeo option allows designation of the
length of time, in tenths of a second, that the client will wait until it
decides it will not get a reply from the server, and must try to send the
request again. The default value is 7 tenths of a second. The -o retrans
option allows designation of the number of timeouts allowed before the
client gives up and displays the "Server not responding" message. The
default value is 3 attempts. Once the client displays this message, it
will continue to try to send the request, but only once before displaying
the error message if another timeout occurs. When the client reestablishes
contact, it will fall back to using the correct retrans value, and will
display the "Server OK" message.

If you are already encountering excessive retransmissions (see the output
of the nfsstat command), or want to increase the block transfer size
without encountering timeouts and retransmissions, you may want to adjust
these values. The specific adjustment will depend upon your environment,
and in most cases, the current defaults are appropriate.
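
For example, to check the current retransmission counts and then mount
with more forgiving values than the defaults (timeo=7, retrans=3), you
might do something like the following; the chosen numbers and the server
path are only illustrative:

    # nfsstat -rc
    # mount -t nfs -o timeo=14,retrans=5 server:/home /mnt/home

Doubling timeo is a common first step when raising rsize/wsize, since
larger transfers take longer to complete and are therefore more likely to
trip the default timeout.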


5.6. Number of Instances of the NFSD Server Daemon

Most startup scripts, Linux and otherwise, start 8 instances of nfsd. In
the early days of NFS, Sun decided on this number as a rule of thumb, and
everyone else copied. There are no good measures of how many instances are
optimal, but a more heavily-trafficked server may require more. You should
use at the very least one daemon per processor, but four to eight per
processor may be a better rule of thumb. If you are using a 2.4 or higher
kernel and you want to see how heavily each nfsd thread is being used, you
can look at the file /proc/net/rpc/nfsd. The last ten numbers on the th
line in that file indicate the number of seconds that the thread usage was
at that percentage of the maximum allowable. If you have a large number in
the top three deciles, you may wish to increase the number of nfsd
instances. This is done upon starting nfsd, using the number of instances
as the command line option, and is specified in the NFS startup script
(/etc/rc.d/init.d/nfs on Red Hat) as RPCNFSDCOUNT. See the nfsd(8) man
page for more information.
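
A quick way to check thread utilization and then raise the thread count,
either by editing RPCNFSDCOUNT in the Red Hat startup script mentioned
above or by passing the number directly to rpc.nfsd (the value 16 is only
an example):

    # grep ^th /proc/net/rpc/nfsd
    # rpc.nfsd 16

Watch the th line again under load after the change; if the top deciles
stay near zero, the new count is sufficient.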


5.7. Memory Limits on the Input Queue

On 2.2 and 2.4 kernels, the socket input queue, where requests sit while
they are currently being processed, has a small default size limit
(rmem_default) of 64k. This queue is important for clients with heavy read
loads, and servers with heavy write loads. As an example, if you are
running 8 instances of nfsd on the server, each will only have 8k to store
write requests while it processes them. In addition, the socket output
queue - important for clients with heavy write loads and servers with
heavy read loads - also has a small default size (wmem_default).

Several published runs of the NFS benchmark SPECsfs97 specify usage of a
much higher value for both the read and write value sets, [rw]mem_default
and [rw]mem_max. You might consider increasing these values to at least
256k. The read and write limits are set in the proc file system using (for
example) the files /proc/sys/net/core/rmem_default and
/proc/sys/net/core/rmem_max. The rmem_default value can be increased in
three steps; the following method is a bit of a hack but should work and
should not cause any problems:

  1. Increase the size listed in the files:

    # echo 262144 > /proc/sys/net/core/rmem_default
    # echo 262144 > /proc/sys/net/core/rmem_max
  2. Restart NFS via the method described in your distribution's documentation.
  3. You might return the size limits to their normal size in case
    other kernel systems depend on them. This last step may be necessary
    because machines have been reported to crash or have issues when these
    variables are left at the larger values for long periods of time.

    # echo 65536 > /proc/sys/net/core/rmem_default
    # echo 65536 > /proc/sys/net/core/rmem_max
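
An equivalent way to make the change in step 1 is through sysctl; the keys
below are simply the sysctl names of the same /proc entries:

    # sysctl -w net.core.rmem_default=262144
    # sysctl -w net.core.rmem_max=262144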

 


5.8. Turning Off Autonegotiation of NICs and Hubs

If network cards auto-negotiate badly with hubs and switches, and
ports run at different speeds, or with different duplex configurations,
performance will be severely impacted due to excessive collisions,
dropped packets, etc. If you see excessive numbers of dropped packets in
the nfsstat output, or
poor network performance in general, try playing around with the network
speed and duplex settings. If possible, concentrate on establishing a
100BaseT full duplex subnet; the virtual elimination of collisions in
full duplex will remove the most severe performance inhibitor for NFS
over UDP. Be careful when turning off autonegotiation on a card: The hub
or switch that the card is attached to will then resort to other
mechanisms (such as parallel detection) to determine the duplex
settings, and some cards default to half duplex because it is more
likely to be supported by an old hub. The best solution, if the driver
supports it, is to force the card to negotiate 100BaseT full duplex.
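
If the driver supports the standard MII interface, mii-tool can either
restrict what the card advertises during autonegotiation or force a fixed
setting with negotiation disabled. A sketch, assuming the hypothetical
interface eth0; remember that the switch port must be configured to match:

    # mii-tool -A 100baseTx-FD eth0
    # mii-tool -F 100baseTx-FD eth0

The first form keeps autonegotiation enabled but offers only 100BaseT full
duplex, which matches the advice above; the second disables negotiation
outright and should only be used when the switch port is also hard-set.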


5.9. Synchronous vs. Asynchronous Behavior in NFS

The default export behavior for both NFS Version 2 and Version 3
protocols, used by exportfs in nfs-utils versions prior to nfs-utils-1.0.1,
is "asynchronous". This default permits the server to reply to client
requests as soon as it has processed the request and handed it off to the
local file system, without waiting for the data to be written to stable
storage. This is indicated by the async option denoted in the server's
export list. It yields better performance at the cost of possible data
corruption if the server reboots while still holding unwritten data and/or
metadata in its caches. This possible data corruption is not detectable at
the time of occurrence, since the async option instructs the server to lie
to the client, telling the client that all data has indeed been written to
the stable storage, regardless of the protocol used.

In order to conform with "synchronous" behavior, used as the default for
most proprietary systems supporting NFS (Solaris, HP-UX, RS/6000, etc.),
and now used as the default in the latest version of exportfs, the Linux
Server's file system must be exported with the sync option. Note that
specifying synchronous exports will result in no option being seen in the
server's export list:

  • Export a couple file systems to everyone, using slightly different options:

    # /usr/sbin/exportfs -o rw,sync *:/usr/local
    # /usr/sbin/exportfs -o rw *:/tmp
  • Now we can see what the exported file system parameters look like:
    # /usr/sbin/exportfs -v
    /usr/local *(rw)
    /tmp *(rw,async)

 

If your kernel is compiled with the /proc filesystem, then the file
/proc/fs/nfs/exports will also show the full list of export options.
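
To make the choice explicit rather than relying on the default, you can
name sync (or async) in /etc/exports and re-export; the paths and client
specification below are hypothetical:

    # cat /etc/exports
    /usr/local      *(rw,sync)
    /tmp            *(rw,async)
    # /usr/sbin/exportfs -ra
    # cat /proc/fs/nfs/exports

exportfs -ra re-reads /etc/exports and applies any changed options to the
existing exports.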

When synchronous behavior is specified, the server will not complete (that
is, reply to the client) an NFS version 2 protocol request until the local
file system has written all data/metadata to the disk. The server will
complete a synchronous NFS version 3 request without this delay, and will
return the status of the data in order to inform the client as to what
data should be maintained in its caches, and what data is safe to discard.
There are 3 possible status values, defined in an enumerated type,
nfs3_stable_how, in include/linux/nfs.h. The values, along with the
subsequent actions taken due to these results, are as follows:

  • NFS_UNSTABLE: Data/Metadata was not committed to stable storage
    on the server, and must be cached on the client until a subsequent
    client commit request assures that the server does send data to stable
    storage.
  • NFS_DATA_SYNC: Metadata was not sent to stable storage, and must
    be cached on the client. A subsequent commit is necessary, as is
    required above.
  • NFS_FILE_SYNC: No data/metadata need be cached, and a subsequent commit need not be sent for the range covered by this request.

 

In addition to the above definition of synchronous behavior, the client
may explicitly insist on total synchronous behavior, regardless of the
protocol, by opening all files with the O_SYNC option. In this case, all
replies to client requests will wait until the data has hit the server's
disk, regardless of the protocol used (meaning that, in NFS version 3, all
requests will be NFS_FILE_SYNC requests, and will require that the Server
return this status). In that case, the performance of NFS Version 2 and
NFS Version 3 will be virtually identical.

If, however, the old default async behavior is used, the O_SYNC option has
no effect at all in either version of NFS, since the server will reply to
the client without waiting for the write to complete. In that case the
performance differences between versions will also disappear.

Finally, note that, for NFS version 3 protocol requests, a subsequent
commit request from the NFS client at file close time, or at fsync() time,
will force the server to write any previously unwritten data/metadata to
the disk, and the server will not reply to the client until this has been
completed, as long as sync behavior is followed. If async is used, the
commit is essentially a no-op, since the server once again lies to the
client, telling the client that the data has been sent to stable storage.
This again exposes the client and server to data corruption, since cached
data may be discarded on the client due to its belief that the server now
has the data maintained in stable storage.


5.10. Non-NFS-Related Means of Enhancing Server Performance

In general, server performance and server disk access speed will have
an important effect on NFS performance. Offering general guidelines for
setting up a well-functioning file server is outside the scope of this
document, but a few hints may be worth mentioning:

  • If you have access to RAID arrays, use RAID 1/0 for both write
    speed and redundancy; RAID 5 gives you good read speeds but lousy write
    speeds.
  • A journalling filesystem will drastically reduce your reboot time in the event of a system crash. Currently, ext3
    will work correctly with NFS version 3. In addition, Reiserfs version
    3.6 will work with NFS version 3 on 2.4.7 or later kernels (patches are
    available for previous kernels). Earlier versions of Reiserfs did not
    include room for generation numbers in the inode, exposing the
    possibility of undetected data corruption during a server reboot.
  • Additionally, journalled file systems can be configured to
    maximize performance by taking advantage of the fact that journal
    updates are all that is necessary for data protection. One example is
    using ext3 with data=journal so that all updates go first to the
    journal, and later to the main file system (see the sketch after this
    list). Once the journal has been updated, the NFS server can safely
    issue the reply to the clients, and the main file system update can
    occur at the server's leisure. The journal in a journalling file system
    may also reside on a separate device such as a flash memory card so that
    journal updates normally require no seeking. With only rotational delay
    imposing a cost, this gives reasonably good synchronous IO performance.
    Note that ext3 currently supports journal relocation, and ReiserFS will
    (officially) support it soon. The Reiserfs tool package found at
    ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.x.0k.tar.gz
    contains the reiserfstune tool, which will allow journal relocation. It
    does, however, require a kernel patch which has not yet been officially
    released as of January, 2002.
  • Using an automounter (such as autofs or amd) may prevent hangs
    if you cross-mount files on your machines (whether on purpose or by
    oversight) and one of those machines goes down. See the Automount
    Mini-HOWTO for details.
  • Some manufacturers (Network Appliance, Hewlett Packard, and
    others) provide NFS accelerators in the form of Non-Volatile RAM. NVRAM
    will boost access speed to stable storage up to the equivalent of async
    access.
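
As referenced in the data=journal item above, here is a minimal sketch of
mounting an exported ext3 file system in full data-journalling mode; the
device and mount point are hypothetical, and the same option can instead
be placed in the options field of /etc/fstab:

    # mount -t ext3 -o data=journal /dev/sda5 /export/home

Keep in mind that data=journal writes every block twice (once to the
journal, once to its final location), so it trades raw throughput for the
ability to acknowledge synchronous NFS writes as soon as the journal
commit completes.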