
[HBase] Bulk loading data into an HBase table


I. Overview
HBase offers several methods for loading data into tables. The most straightforward is to use TableOutputFormat from a MapReduce job, or to write data through the normal client API; however, these are not always the most efficient approaches.

This document describes how to bulk load data into HBase: a MapReduce job writes the data out as files in HBase's internal storage format, and those files are then loaded directly into a running cluster. (Note: in other words, generate HFiles, then load them into HBase.)


II. Bulk load steps
A bulk load consists of two steps:
1. Preparing the data via a MapReduce job
First, a MapReduce job uses HFileOutputFormat to generate data files in HBase's internal storage format; files in this format can later be loaded into the cluster very efficiently.
To work efficiently, HFileOutputFormat must be configured so that each output HFile fits within a single region. To achieve this, the MapReduce job uses Hadoop's TotalOrderPartitioner class to partition the map output into key ranges corresponding to the regions of the table.
HFileOutputFormat also provides a convenience method, configureIncrementalLoad(), which automatically sets up a TotalOrderPartitioner based on the table's current region boundaries.
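As a hedged illustration of this step, the sketch below wires configureIncrementalLoad() into a job driver. It assumes HBase 0.9x-era APIs (HTable, the mapreduce HFileOutputFormat); the table name "mytable", the command-line paths, and the mapper class TsvToPutMapper (sketched further below) are illustrative, not part of the original article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "prepare-hfiles");
    job.setJarByClass(BulkLoadDriver.class);
    job.setMapperClass(TsvToPutMapper.class);               // hypothetical mapper, sketched below
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // raw input data
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // where HFiles are written
    HTable table = new HTable(conf, "mytable");             // illustrative table name
    // One call sets the output format, a TotalOrderPartitioner matching the
    // table's current region boundaries, and a sorting reducer.
    HFileOutputFormat.configureIncrementalLoad(job, table);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}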
2. Completing the data load
After the data has been prepared with HFileOutputFormat, a command-line tool is used to load it into the cluster. The tool iterates through the prepared data files and determines the region each file belongs to. It then contacts the appropriate region server, which adopts the HFile, moving it into its storage directory and making the data available to clients.
If the region boundaries change while the data is being prepared or loaded, HBase automatically splits the data files to fit the new boundaries. This process is not efficient, especially if other clients are loading data at the same time, so take care to minimize the delay between creating the data files and loading them into the cluster.
3. Preparing a bulk load with importtsv
HBase ships with the importtsv command-line tool, invoked as hadoop jar /path/to/hbase-VERSION.jar importtsv. Running it without arguments prints the following help text:
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
Imports the given input directory of TSV data into the specified table.
The column names of the TSV data must be specified using the -Dimporttsv.columns option.
This option takes the form of comma-separated column names, where each column name is either a simple column family, or a columnfamily:qualifier.
The special column name HBASE_ROW_KEY is used to designate that this column should be used as the row key for each imported record.
You must specify exactly one column to be the row key.
In order to prepare data for a bulk data load, pass the option:
-Dimporttsv.bulk.output=/path/for/output
Other options that may be specified with -D include:
-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
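For example, a run that prepares HFiles for bulk loading might look like this (the column mapping, table name, and paths are illustrative; the output path matches the completebulkload example below):
$ hadoop jar /path/to/hbase-VERSION.jar importtsv \
    -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1 \
    -Dimporttsv.bulk.output=/user/todd/myoutput \
    mytable /user/todd/input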
4. Loading the data with completebulkload
After preparing the data with importtsv, use completebulkload to import it into the running cluster.
completebulkload takes the same output path used by importtsv, plus the name of the target table. For example:
$ hadoop jar hbase-VERSION.jar completebulkload /user/todd/myoutput mytable
This command runs quickly; once it finishes, the new data is visible in the cluster.
5. Advanced usage
Although the importtsv tool is useful, in many cases users may want to generate data programmatically or import data in other formats.
To do so, look at the ImportTsv.java source code and read the Javadoc for HFileOutputFormat.

===============================================================================================================================

Reference (official documentation): http://hbase.apache.org/book.html#arch.bulk.load

9.8. Bulk Loading

9.8.1. Overview

HBase includes several methods of loading data into tables. The most straightforward method is to either use the TableOutputFormat class from a MapReduce job, or use the normal client APIs; however, these are not always the most efficient methods.

The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the generated StoreFiles into a running cluster. Using bulk load will use less CPU and network resources than simply using the HBase API.

9.8.2. Bulk Load Architecture

The HBase bulk load process consists of two main steps.

9.8.2.1. Preparing data via a MapReduce job

The first step of a bulk load is to generate HBase data files (StoreFiles) from a MapReduce job using HFileOutputFormat. This output format writes out data in HBase's internal storage format so that they can be later loaded very efficiently into the cluster.

In order to function efficiently, HFileOutputFormat must be configured such that each output HFile fits within a single region. In order to do this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table.

HFileOutputFormat includes a convenience function, configureIncrementalLoad(), which automatically sets up a TotalOrderPartitioner based on the current region boundaries of a table.
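To complement the driver sketch in the translated section above, here is a hedged sketch of a mapper such a job might use. It assumes tab-separated "rowkey<TAB>value" input and 0.9x-era APIs; the column family cf and qualifier col1 are illustrative, not mandated by HFileOutputFormat.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TsvToPutMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
  private static final byte[] CF  = Bytes.toBytes("cf");   // illustrative family
  private static final byte[] COL = Bytes.toBytes("col1"); // illustrative qualifier

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t", 2);
    if (fields.length < 2) {
      return;                                              // skip malformed lines
    }
    byte[] row = Bytes.toBytes(fields[0]);
    Put put = new Put(row);
    put.add(CF, COL, Bytes.toBytes(fields[1]));
    // The sorting reducer installed by configureIncrementalLoad() turns
    // these Puts into KeyValues ordered for HFile output.
    context.write(new ImmutableBytesWritable(row), put);
  }
}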

9.8.2.2. Completing the data load

After the data has been prepared using HFileOutputFormat, it is loaded into the cluster using completebulkload. This command line tool iterates through the prepared data files, and for each one determines the region the file belongs to. It then contacts the appropriate Region Server which adopts the HFile, moving it into its storage directory and making the data available to clients.

If the region boundaries have changed during the course of bulk load preparation, or between the preparation and completion steps, the completebulkload utility will automatically split the data files into pieces corresponding to the new boundaries. This process is not optimally efficient, so users should take care to minimize the delay between preparing a bulk load and importing it into the cluster, especially if other clients are simultaneously loading data through other means.

9.8.3. Importing the prepared data using the completebulkload tool

After a data import has been prepared, either by using the importtsv tool with the "importtsv.bulk.output" option or by some other MapReduce job using the HFileOutputFormat, the completebulkload tool is used to import the data into the running cluster.

The completebulkload tool simply takes the output path where importtsv or your MapReduce job put its results, and the table name to import into. For example:

$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable

The -c config-file option can be used to specify a file containing the appropriate hbase parameters (e.g., hbase-site.xml) if not supplied already on the CLASSPATH. (In addition, the CLASSPATH must contain the directory that has the zookeeper configuration file if zookeeper is NOT managed by HBase.)

Note: If the target table does not already exist in HBase, this tool will create the table automatically.

This tool will run quickly, after which point the new data will be visible in the cluster.

9.8.4. See Also

For more information about the referenced utilities, see Section 14.1.9, “ImportTsv” and Section 14.1.10, “CompleteBulkLoad”.

9.8.5. Advanced Usage

Although the importtsv tool is useful in many cases, advanced users may want to generate data programmatically, or import data from other formats. To get started doing so, dig into ImportTsv.java and check the JavaDoc for HFileOutputFormat.

The import step of the bulk load can also be done programmatically. See the LoadIncrementalHFiles class for more information.
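A minimal sketch of the programmatic import, assuming 0.9x-era APIs (LoadIncrementalHFiles.doBulkLoad(Path, HTable)); the command-line arguments for the HFile directory and table name are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class ProgrammaticBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Path hfileDir = new Path(args[0]);         // output of HFileOutputFormat
    HTable table = new HTable(conf, args[1]);  // target table name
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    // Walks the prepared HFiles, contacts the owning region servers, and
    // moves each file into the appropriate region's storage directory.
    loader.doBulkLoad(hfileDir, table);
    table.close();
  }
}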
