
Documentation/block/data-integrity.txt


Chinese translated version of Documentation/block/data-integrity.txt

If you have any comment or update to the content, please contact the
original document maintainer directly.  However, if you have a problem
communicating in English you can also ask the Chinese maintainer for
help.  Contact the Chinese maintainer if this translation is outdated
or if there is a problem with the translation.

Chinese maintainer: huneng <huneng1991@163.com>
---------------------------------------------------------------------
Documentation/block/data-integrity.txt的中文翻译

如果想评论或更新本文的内容,请直接联系原文档的维护者。如果你使用英文
交流有困难的话,也可以向中文版维护者求助。如果本翻译更新不及时或者翻
译存在问题,请联系中文版维护者。

中文版维护者: 胡能  <huneng1991@163.com>
中文版翻译者: 胡能  <huneng1991@163.com>
中文版校译者: 胡能  <huneng1991@163.com>

以下为正文

----------------------------------------------------------------------
1. INTRODUCTION
介绍

Modern filesystems feature checksumming of data and metadata to
protect against data corruption.  However, the detection of the
corruption is done at read time which could potentially be months
after the data was written.  At that point the original data that the
application tried to write is most likely lost.
现代的文件系统会对数据和元数据计算校验和,以防止数据损坏。
然而,损坏的检测是在读取时才进行的,而那可能已经是数据写入后的好几个月之后。
到那时,应用程序当初想要写入的原始数据很可能已经丢失了。

The solution is to ensure that the disk is actually storing what the
application meant it to.  Recent additions to both the SCSI family
protocols (SBC Data Integrity Field, SCC protection proposal) as well
as SATA/T13 (External Path Protection) try to remedy this by adding
support for appending integrity metadata to an I/O.  The integrity
metadata (or protection information in SCSI terminology) includes a
checksum for each sector as well as an incrementing counter that
ensures the individual sectors are written in the right order.  And
for some protection schemes also that the I/O is written to the right
place on disk.
解决方法是确保磁盘上实际存储的正是应用程序想要写入的数据。
最近新加入SCSI系列协议(SBC数据完整性域、SCC保护建议)以及
SATA/T13(外部路径保护)的内容,试图通过支持为I/O附加完整性元数据来弥补这一问题。
完整性元数据(即SCSI术语中的保护信息)包含每个扇区的校验和,
以及一个递增计数器,用来保证各个扇区以正确的顺序写入;
对于某些保护方案,还能保证I/O被写到磁盘上正确的位置。
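
For reference, the 8 bytes of protection information described above are laid
out as three tags.  The sketch below uses illustrative field names (not taken
from any particular header), but the sizes and meanings follow the DIF
description in this section; __be16/__be32 are the usual kernel types from
<linux/types.h>.

    /* One 8-byte T10 DIF tuple appended to each 512-byte sector. */
    struct dif_tuple {
            __be16 guard_tag;       /* checksum of the sector's data       */
            __be16 app_tag;         /* application tag space (section 3)   */
            __be32 ref_tag;         /* incrementing counter; for Type 1    */
                                    /* protection the low 32 bits of the   */
                                    /* target LBA                          */
    };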

Current storage controllers and devices implement various protective
measures, for instance checksumming and scrubbing.  But these
technologies are working in their own isolated domains or at best
between adjacent nodes in the I/O path.  The interesting thing about
DIF and the other integrity extensions is that the protection format
is well defined and every node in the I/O path can verify the
integrity of the I/O and reject it if corruption is detected.  This
allows not only corruption prevention but also isolation of the point
of failure.
目前的存储控制器和设备实现了多种保护措施,例如校验和与数据巡检(scrubbing)。
但是这些技术只在各自孤立的范围内工作,最多也只能在I/O路径上相邻的节点之间工作。
DIF和其他完整性扩展的有趣之处在于:保护格式是明确定义的,
I/O路径上的每个节点都可以验证I/O的完整性,并在检测到损坏时拒绝该I/O。
这不仅可以防止损坏,还可以隔离故障发生的位置。

----------------------------------------------------------------------
2. THE DATA INTEGRITY EXTENSIONS
数据完整性扩展

As written, the protocol extensions only protect the path between
controller and storage device.  However, many controllers actually
allow the operating system to interact with the integrity metadata
(IMD).  We have been working with several FC/SAS HBA vendors to enable
the protection information to be transferred to and from their
controllers.
正如前面所写,协议扩展只保护控制器和存储设备之间的路径。
然而,许多控制器实际上允许操作系统与完整性元数据(IMD)交互。
我们一直在与几家FC/SAS HBA厂商合作,使保护信息能够传入和传出它们的控制器。

The SCSI Data Integrity Field works by appending 8 bytes of protection
information to each sector.  The data + integrity metadata is stored
in 520 byte sectors on disk.  Data + IMD are interleaved when
transferred between the controller and target.  The T13 proposal is
similar.
SCSI数据完整性域的工作方式是在每个扇区后附加8字节的保护信息。
数据加完整性元数据以520字节扇区的形式存储在磁盘上。
在控制器和目标设备之间传输时,数据与IMD是交错排列的。
T13的提案与此类似。

Because it is highly inconvenient for operating systems to deal with
520 (and 4104) byte sectors, we approached several HBA vendors and
encouraged them to allow separation of the data and integrity metadata
scatter-gather lists.
由于让操作系统处理520(以及4104)字节的扇区极其不方便,我们联系了几家HBA厂商,
鼓励他们允许把数据和完整性元数据的散布聚合(scatter-gather)列表分开。
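
To make the sizes concrete: 512 + 8 = 520 and 4096 + 8 = 4104 bytes per sector
on the wire, while with the separated lists the host keeps the 8 bytes per
sector in its own buffer.  A minimal sketch of the arithmetic, assuming
512-byte logical sectors:

    /* Illustrative arithmetic only: bytes of integrity metadata needed
     * for a data buffer, at 8 bytes of protection information per
     * 512-byte sector (520 bytes per sector on the wire; 4104 for 4KB). */
    static inline unsigned int imd_len(unsigned int data_len)
    {
            return (data_len >> 9) * 8;     /* one 8-byte tuple per sector */
    }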

The controller will interleave the buffers on write and split them on
read.  This means that Linux can DMA the data buffers to and from
host memory without changes to the page cache.
控制器在写入时把两个缓冲区交错合并,在读取时再把它们拆分开。
这意味着Linux可以在不改动页缓存的情况下,直接在主机内存和设备之间对数据缓冲区做DMA。

Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
is somewhat heavy to compute in software.  Benchmarks found that
calculating this checksum had a significant impact on system
performance for a number of workloads.  Some controllers allow a
lighter-weight checksum to be used when interfacing with the operating
system.  Emulex, for instance, supports the TCP/IP checksum instead.
The IP checksum received from the OS is converted to the 16-bit CRC
when writing and vice versa.  This allows the integrity metadata to be
generated by Linux or the application at very low cost (comparable to
software RAID5).
另外,SCSI和SATA规范强制要求的16位CRC校验和用软件计算代价较高。
基准测试发现,在多种工作负载下,计算该校验和对系统性能有显著影响。
一些控制器允许在与操作系统对接时使用更轻量的校验和,
例如Emulex就支持改用TCP/IP校验和。
写入时,从操作系统收到的IP校验和会被转换为16位CRC,读取时则反之。
这使得Linux或应用程序能以非常低的开销(与软件RAID5相当)生成完整性元数据。
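
As an illustration of why the lighter checksum matters, the sketch below
open-codes the 16-bit TCP/IP checksum over one sector.  The kernel has
optimized helpers for this (e.g. ip_compute_csum()); this version exists
purely to show how little work is involved compared to a bit-wise CRC.

    /* Open-coded 16-bit one's complement (TCP/IP) checksum over one
     * sector.  'len' is assumed to be even (e.g. 512). */
    static __u16 sector_ip_csum(const void *buf, unsigned int len)
    {
            const __u16 *p = buf;
            __u32 sum = 0;

            for (; len >= 2; len -= 2)
                    sum += *p++;                    /* one's complement sum */

            while (sum >> 16)                       /* fold the carries     */
                    sum = (sum & 0xffff) + (sum >> 16);

            return (__u16)~sum;
    }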

The IP checksum is weaker than the CRC in terms of detecting bit
errors.  However, the strength is really in the separation of the data
buffers and the integrity metadata.  These two distinct buffers must
match up for an I/O to complete.
在检测位错误方面,IP校验和比CRC要弱。
然而,真正的保护能力来自数据缓冲区与完整性元数据的分离:
这两个独立的缓冲区必须相互匹配,I/O才能完成。

The separation of the data and integrity metadata buffers as well as
the choice in checksums is referred to as the Data Integrity
Extensions.  As these extensions are outside the scope of the protocol
bodies (T10, T13), Oracle and its partners are trying to standardize
them within the Storage Networking Industry Association.
数据缓冲区与完整性元数据缓冲区的分离,以及校验和的选择,合称为数据完整性扩展
(Data Integrity Extensions)。由于这些扩展超出了协议标准组织(T10、T13)的范围,
Oracle及其合作伙伴正试图在存储网络工业协会(SNIA)内将其标准化。

----------------------------------------------------------------------
3. KERNEL CHANGES
内核改变

The data integrity framework in Linux enables protection information
to be pinned to I/Os and sent to/received from controllers that
support it.
Linux中的数据完整性框架使保护信息可以附着在I/O上,并与支持该特性的控制器之间收发。

The advantage to the integrity extensions in SCSI and SATA is that
they enable us to protect the entire path from application to storage
device.  However, at the same time this is also the biggest
disadvantage. It means that the protection information must be in a
format that can be understood by the disk.
SCSI和SATA完整性扩展的优点是,它们使我们能够保护从应用程序到存储设备的整条路径。
然而,这同时也是最大的缺点:它意味着保护信息必须采用磁盘能够理解的格式。

Generally Linux/POSIX applications are agnostic to the intricacies of
the storage devices they are accessing.  The virtual filesystem switch
and the block layer make things like hardware sector size and
transport protocols completely transparent to the application.
通常,Linux/POSIX应用程序并不了解它们所访问的存储设备的内部细节。
虚拟文件系统(VFS)和块层使硬件扇区大小、传输协议等对应用程序完全透明。

However, this level of detail is required when preparing the
protection information to send to a disk.  Consequently, the very
concept of an end-to-end protection scheme is a layering violation.
It is completely unreasonable for an application to be aware whether
it is accessing a SCSI or SATA disk.
然而,在准备要发送到磁盘的保护信息时,却需要这种级别的细节。
因此,端到端保护方案这一概念本身就是对分层原则的违背:
要求应用程序知道自己访问的是SCSI磁盘还是SATA磁盘,是完全不合理的。

The data integrity support implemented in Linux attempts to hide this
from the application.  As far as the application (and to some extent
the kernel) is concerned, the integrity metadata is opaque information
that's attached to the I/O.
Linux中实现的数据完整性支持试图对应用程序隐藏这一点。
在应用程序(以及某种程度上内核)看来,完整性元数据只是附着在I/O上的不透明信息。

The current implementation allows the block layer to automatically
generate the protection information for any I/O.  Eventually the
intent is to move the integrity metadata calculation to userspace for
user data.  Metadata and other I/O that originates within the kernel
will still use the automatic generation interface.
目前的实现允许块层为任何I/O自动生成保护信息。
最终的目标是把用户数据的完整性元数据计算移到用户空间;
源自内核的元数据和其他I/O仍将使用自动生成接口。

Some storage devices allow each hardware sector to be tagged with a
16-bit value.  The owner of this tag space is the owner of the block
device.  I.e. the filesystem in most cases.  The filesystem can use
this extra space to tag sectors as they see fit.  Because the tag
space is limited, the block interface allows tagging bigger chunks by
way of interleaving.  This way, 8*16 bits of information can be
attached to a typical 4KB filesystem block.
一些存储设备允许用一个16位的值来标记每个硬件扇区。这个标签空间归块设备的
所有者使用,在大多数情况下就是文件系统。文件系统可以按自己的需要用这块额外
的空间来标记扇区。由于标签空间有限,块接口允许通过交错的方式标记更大的块。
这样,一个典型的4KB文件系统块就可以附带8*16位的信息。
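
The arithmetic behind the 4KB example above, written out as a small helper
(the function and parameter names are illustrative, not part of any API):

    /* With 512-byte hardware sectors and 2 bytes of application tag per
     * sector, a 4KB filesystem block spans 8 sectors and can carry
     * 8 * 16 bits = 16 bytes of tag data interleaved across them. */
    static inline unsigned int fs_block_tag_bytes(unsigned int fs_block,
                                                  unsigned int hw_sector,
                                                  unsigned int tag_per_sector)
    {
            return (fs_block / hw_sector) * tag_per_sector;
    }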

This also means that applications such as fsck and mkfs will need
access to manipulate the tags from user space.  A passthrough
interface for this is being worked on.
这也意味着fsck和mkfs之类的应用程序需要能从用户空间操作这些标签。
用于此目的的直通(passthrough)接口正在开发中。

----------------------------------------------------------------------
4. BLOCK LAYER IMPLEMENTATION DETAILS
块层实现细节

4.1 BIO

The data integrity patches add a new field to struct bio when
CONFIG_BLK_DEV_INTEGRITY is enabled.  bio->bi_integrity is a pointer
to a struct bip which contains the bio integrity payload.  Essentially
a bip is a trimmed down struct bio which holds a bio_vec containing
the integrity metadata and the required housekeeping information (bvec
pool, vector count, etc.)
当CONFIG_BLK_DEV_INTEGRITY被启用时,数据完整性补丁会给struct bio增加一个新字段。
bio->bi_integrity是一个指向struct bip的指针,bip中包含bio的完整性载荷。
本质上,bip是一个精简版的struct bio,它持有一个存放完整性元数据的bio_vec,
以及所需的管理信息(bvec池、向量计数等)。
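
A rough outline of the payload structure described here is shown below.  The
real definition lives behind CONFIG_BLK_DEV_INTEGRITY in <linux/bio.h> and its
exact field names vary between kernel versions, so treat this as a sketch
rather than the actual declaration.

    struct bip_outline {
            sector_t        bip_sector;     /* sector of the first reference tag */
            struct bio_vec  *bip_vec;       /* vectors holding the IMD pages     */
            unsigned short  bip_vcnt;       /* vectors currently in use          */
            unsigned short  bip_pool;       /* bvec pool the vectors came from   */
    };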

A kernel subsystem can enable data integrity protection on a bio by
calling bio_integrity_alloc(bio).  This will allocate and attach the
bip to the bio.
内核子系统可以通过调用bio_integrity_alloc(bio)来为某个bio启用数据完整性保护。
该函数会分配bip并将其挂接到指定的bio上。

Individual pages containing integrity metadata can subsequently be
attached using bio_integrity_add_page().
随后可以用bio_integrity_add_page()把包含完整性元数据的各个页挂接上去。

bio_free() will automatically free the bip.
bio_free()会自动释放bip。

4.2 BLOCK DEVICE
块设备

Because the format of the protection data is tied to the physical
disk, each block device has been extended with a block integrity
profile (struct blk_integrity).  This optional profile is registered
with the block layer using blk_integrity_register().
由于保护数据的格式与具体的物理磁盘绑定,每个块设备都被扩展出一个
块完整性描述(struct blk_integrity)。
这个可选的描述通过blk_integrity_register()向块层注册。

The profile contains callback functions for generating and verifying
the protection data, as well as getting and setting application tags.
The profile also contains a few constants to aid in completing,
merging and splitting the integrity metadata.
该描述包含用于生成和验证保护数据的回调函数,以及读取和设置应用程序标签的回调函数。
它还包含一些常量,用于辅助完整性元数据的完成、合并和拆分。

Layered block devices will need to pick a profile that's appropriate
for all subdevices.  blk_integrity_compare() can help with that.  DM
and MD linear, RAID0 and RAID1 are currently supported.  RAID4/5/6
will require extra work due to the application tag.
层叠的块设备需要选择一个适用于其所有子设备的描述,
blk_integrity_compare()可以在这方面提供帮助。
目前已支持DM以及MD的linear、RAID0和RAID1;
RAID4/5/6由于应用程序标签的缘故还需要额外的工作。

----------------------------------------------------------------------
5.0 BLOCK LAYER INTEGRITY API
块层完整性API

5.1 NORMAL FILESYSTEM
普通的文件系统
    The normal filesystem is unaware that the underlying block device
    is capable of sending/receiving integrity metadata.  The IMD will
    be automatically generated by the block layer at submit_bio() time
    in case of a WRITE.  A READ request will cause the I/O integrity
    to be verified upon completion.
普通的文件系统并不知道底层块设备是否能够发送/接收完整性元数据。
对于写操作(WRITE),块层会在submit_bio()时自动生成IMD;
读请求(READ)则会在完成时对I/O的完整性进行验证。

    IMD generation and verification can be toggled using the
IMD的生成和验证可以用下面的标志来开关:
      /sys/block/<bdev>/integrity/write_generate

    and

      /sys/block/<bdev>/integrity/read_verify

    flags.
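
A userspace sketch of flipping one of these flags follows; "sda" is only an
example device name and error handling is kept minimal.

    #include <fcntl.h>
    #include <unistd.h>

    /* Disable (or re-enable) automatic IMD generation for writes on one
     * device by writing "0" or "1" to the sysfs attribute shown above. */
    static int set_write_generate(const char *val)      /* "0" or "1" */
    {
            int fd = open("/sys/block/sda/integrity/write_generate", O_WRONLY);
            ssize_t n;

            if (fd < 0)
                    return -1;
            n = write(fd, val, 1);
            close(fd);
            return n == 1 ? 0 : -1;
    }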

5.2 INTEGRITY-AWARE FILESYSTEM
能感知完整性的文件系统
    A filesystem that is integrity-aware can prepare I/Os with IMD
    attached.  It can also use the application tag space if this is
    supported by the block device.
能感知完整性的文件系统可以准备附带IMD的I/O。
如果块设备支持,它还可以使用应用程序标签空间。

    int bdev_integrity_enabled(block_device, int rw);

      bdev_integrity_enabled() will return 1 if the block device
      supports integrity metadata transfer for the data direction
      specified in 'rw'.
 如果块设备支持在'rw'指定的数据方向上传输完整性元数据,
 bdev_integrity_enabled()将返回1。
 
 
      bdev_integrity_enabled() honors the write_generate and
      read_verify flags in sysfs and will respond accordingly.
 bdev_integrity_enabled()会遵从sysfs中的write_generate和read_verify标志,
 并做出相应的响应。
 
    int bio_integrity_prep(bio);

      To generate IMD for WRITE and to set up buffers for READ, the
      filesystem must call bio_integrity_prep(bio).
 要为写操作生成IMD、为读操作准备缓冲区,文件系统必须调用bio_integrity_prep(bio)。
 
      Prior to calling this function, the bio data direction and start
      sector must be set, and the bio should have all data pages
      added.  It is up to the caller to ensure that the bio does not
      change while I/O is in progress.
 在调用该函数之前,必须先设置好bio的数据方向和起始扇区,并且bio应已加入所有数据页。
 调用者需保证在I/O进行期间bio不被改动。
 
      bio_integrity_prep() should only be called if
      bio_integrity_enabled() returned 1.
 只有在bio_integrity_enabled()返回1时,才应调用bio_integrity_prep()。
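
Putting the calls of this subsection together, a write path in an
integrity-aware filesystem might look roughly like the sketch below.  The
return-value convention of bio_integrity_prep() (nonzero meaning failure) is
an assumption, and the two-argument submit_bio() form current at the time of
this document is used.

    /* Assumes the bio's direction, start sector and data pages are set. */
    static int fs_submit_write(struct bio *bio)
    {
            if (bio_integrity_enabled(bio) && bio_integrity_prep(bio))
                    return -EIO;            /* could not set up the IMD buffers */

            submit_bio(WRITE, bio);
            return 0;
    }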

    int bio_integrity_tag_size(bio);
 
      If the filesystem wants to use the application tag space it will
      first have to find out how much storage space is available.
      Because tag space is generally limited (usually 2 bytes per
      sector regardless of sector size), the integrity framework
      supports interleaving the information between the sectors in an
      I/O.
 如果文件系统想使用应用程序标签空间,首先必须弄清楚有多少存储空间可用。
 由于标签空间通常很有限(一般每个扇区2字节,与扇区大小无关),
 完整性框架支持在一个I/O内把信息交错分布到各个扇区之间。
 
      Filesystems can call bio_integrity_tag_size(bio) to find out how
      many bytes of storage are available for that particular bio.
 文件系统可以调用bio_integrity_tag_size(bio)来查询某个具体的bio
 有多少字节的标签存储空间可用。
 
      Another option is bdev_get_tag_size(block_device) which will
      return the number of available bytes per hardware sector.
 另一个选项是bdev_get_tag_size(block_device),会返回每个扇区可利用的字节数量。

    int bio_integrity_set_tag(bio, void *tag_buf, len);

      After a successful return from bio_integrity_prep(),
      bio_integrity_set_tag() can be used to attach an opaque tag
      buffer to a bio.  Obviously this only makes sense if the I/O is
      a WRITE.
 在bio_integrity_prep()成功返回后,可以用bio_integrity_set_tag()给bio附加
 一个不透明的标签缓冲区。显然,这只有在I/O是写操作时才有意义。

    int bio_integrity_get_tag(bio, void *tag_buf, len);

      Similarly, at READ I/O completion time the filesystem can
      retrieve the tag buffer using bio_integrity_get_tag().
 类似的,在读I/O完成的时候,文件系统可以使用bio_integrity_get_tag()取回标签缓冲区。
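
A sketch tying the three tag calls together; the helper names and the policy
of skipping the tag when the interleaved space is too small are illustrative,
not part of the API.

    static void fs_attach_tag(struct bio *bio, void *tag, unsigned int len)
    {
            if (bio_integrity_tag_size(bio) >= len)
                    bio_integrity_set_tag(bio, tag, len);   /* WRITE only */
    }

    static void fs_retrieve_tag(struct bio *bio, void *tag, unsigned int len)
    {
            bio_integrity_get_tag(bio, tag, len);           /* at READ completion */
    }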

5.3 PASSING EXISTING INTEGRITY METADATA
传递已有的完整性元数据

    Filesystems that either generate their own integrity metadata or
    are capable of transferring IMD from user space can use the
    following calls:
自己生成完整性元数据、或者能够从用户空间传递IMD的文件系统,
可以使用下面的调用:

    struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);

      Allocates the bio integrity payload and hangs it off of the bio.
      nr_pages indicate how many pages of protection data need to be
      stored in the integrity bio_vec list (similar to bio_alloc()).
 分配bio完整性载荷并将其挂接到bio上。
 nr_pages指明需要在完整性bio_vec列表中存放多少页保护数据(与bio_alloc()类似)。
    
 The integrity payload will be freed at bio_free() time.
 完整性载荷会在bio_free()时被释放。

    int bio_integrity_add_page(bio, page, len, offset);

      Attaches a page containing integrity metadata to an existing
      bio.  The bio must have an existing bip,
      i.e. bio_integrity_alloc() must have been called.  For a WRITE,
      the integrity metadata in the pages must be in a format
      understood by the target device with the notable exception that
      the sector numbers will be remapped as the request traverses the
      I/O stack.  This implies that the pages added using this call
      will be modified during I/O!  The first reference tag in the
      integrity metadata must have a value of bip->bip_sector.
 把一个包含完整性元数据的页挂接到已有的bio上。
 该bio必须已经有bip,也就是说必须已经调用过bio_integrity_alloc()。
 对于写操作,页中的完整性元数据必须是目标设备能理解的格式,但有一个值得注意的例外:
 扇区号会在请求穿过I/O栈的过程中被重新映射。
 这意味着通过此调用加入的页会在I/O期间被修改!
 完整性元数据中第一个引用标签的值必须等于bip->bip_sector。
 
      Pages can be added using bio_integrity_add_page() as long as
      there is room in the bip bio_vec array (nr_pages).
 只要bip的bio_vec数组(nr_pages)中还有空间,就可以继续用bio_integrity_add_page()添加页。
 
      Upon completion of a READ operation, the attached pages will
      contain the integrity metadata received from the storage device.
      It is up to the receiver to process them and verify data
      integrity upon completion.
 当读操作完成时,所挂接的页中包含从存储设备接收到的完整性元数据,
 在完成时由接收方负责处理这些数据并验证数据完整性。
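
A sketch of the calls in this section used together for a WRITE carrying
pre-generated IMD.  The prototypes follow the ones quoted above; the
assumption that bio_integrity_add_page() returns 0 on failure (like
bio_add_page()) is noted in a comment.

    static int fs_attach_imd(struct bio *bio, struct page *imd_page,
                             unsigned int imd_len)
    {
            struct bip *bip = bio_integrity_alloc(bio, GFP_NOIO, 1);

            if (!bip)
                    return -ENOMEM;

            /* The first reference tag in the page must equal bip->bip_sector,
             * and sector numbers may be remapped as the request traverses
             * the I/O stack, so the page contents can change during I/O. */
            if (!bio_integrity_add_page(bio, imd_page, imd_len, 0))
                    return -EIO;    /* assumption: 0 means failure */

            return 0;       /* the payload is freed at bio_free() time */
    }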
 

5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
    METADATA
把块设备注册为能够交换完整性元数据的设备

    To enable integrity exchange on a block device the gendisk must be
    registered as capable:
要在块设备上启用完整性交换,必须把对应的gendisk注册为具备该能力:

    int blk_integrity_register(gendisk, blk_integrity);

      The blk_integrity struct is a template and should contain the
      following:
 blk_integrity结构是一个模板,应包含以下内容:
        static struct blk_integrity my_profile = {
            .name                   = "STANDARDSBODY-TYPE-VARIANT-CSUM",
            .generate_fn            = my_generate_fn,
            .verify_fn              = my_verify_fn,
            .get_tag_fn             = my_get_tag_fn,
            .set_tag_fn             = my_set_tag_fn,
            .tuple_size             = sizeof(struct my_tuple_size),
            .tag_size               = <tag bytes per hw sector>,
        };

      'name' is a text string which will be visible in sysfs.  This is
      part of the userland API so chose it carefully and never change
      it.  The format is standards body-type-variant.
      E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.
 'name'是一个可以在sysfs中看到的文本字符串。
 它是用户空间API的一部分,因此要谨慎选择,并且永远不要改变它。
 其格式为 标准组织-类型-变体,
 例如T10-DIF-TYPE1-IP 或 T13-EPP-0-CRC。
 
      'generate_fn' generates appropriate integrity metadata (for WRITE).
 'generate_fn' 生成合适的完整性元数据(为写操作)。
 
      'verify_fn' verifies that the data buffer matches the integrity
      metadata.
 'verify_fn'验证数据缓冲区与完整性元数据是否匹配。
 
      'tuple_size' must be set to match the size of the integrity
      metadata per sector.  I.e. 8 for DIF and EPP.
 'tuple_size'必须设置为每个扇区完整性元数据的大小。例如DIF和EPP为8。
 
      'tag_size' must be set to identify how many bytes of tag space
      are available per hardware sector.  For DIF this is either 2 or
      0 depending on the value of the Control Mode Page ATO bit.
 'tag_size'必须设置为每个硬件扇区可用的标签空间字节数。
 对DIF来说该值为2或0,取决于控制模式页(Control Mode Page)中ATO位的值。

      See 6.2 for a description of get_tag_fn and set_tag_fn.
 关于get_tag_fn和set_tag_fn的说明参见6.2节。
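
To round off the template above, the sketch below shows roughly what a
generate callback and the registration call could look like.  The descriptor
handed to the callbacks is kernel-version specific, so the my_exg layout here
is an assumption for illustration only; it reuses the tuple layout and
checksum sketches from the earlier sections.

    struct my_exg {                        /* assumed callback argument   */
            void            *data_buf;     /* sectors to protect          */
            void            *prot_buf;     /* tuples to fill or verify    */
            sector_t        sector;        /* first reference tag value   */
            unsigned int    data_size;     /* bytes in data_buf           */
            unsigned short  sector_size;
    };

    static void my_generate_fn(struct my_exg *exg)
    {
            struct dif_tuple *t = exg->prot_buf;  /* layout from the section 1 sketch */
            __u8 *data = exg->data_buf;
            sector_t sec = exg->sector;
            unsigned int i;

            for (i = 0; i < exg->data_size; i += exg->sector_size, t++) {
                    /* the guard tag type (CRC or IP) must match the -CSUM
                     * part of the profile's 'name' string */
                    t->guard_tag = cpu_to_be16(sector_ip_csum(data + i,
                                                              exg->sector_size));
                    t->app_tag   = 0;
                    t->ref_tag   = cpu_to_be32((__u32)sec++);
            }
    }

    /* my_verify_fn() recomputes the same values and returns an error on
     * mismatch so the I/O can be rejected.  Registration then is simply:
     *
     *     blk_integrity_register(disk, &my_profile);
     */
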
----------------------------------------------------------------------
2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>
