Hive has four common ways to load data:
(1) Load data into a Hive table from the local file system.
(2) Load data into a Hive table from HDFS.
(3) Query data out of another table and insert it into a Hive table.
(4) Query records from another table at table-creation time and insert them into the newly created table.
I will walk through each of these hands-on, since plain description alone is dry to read and abstract to learn from. Let's get started!
1. Loading data into a Hive table from the local file system
First, create a table in Hive, like so:
hive (hive)> create table wyp
           > (id int, name string,
           > age int, tel string)
           > row format delimited
           > fields terminated by '\t'
           > stored as textfile;
OK
Time taken: 0.563 seconds
The table is simple: just four fields, whose meanings need no explanation. On the local file system there is a file /usr/local/src/wyp.txt with the following content:
[root@liaozhongmin src]# more wyp.txt
1       lavimer         23      13878789088
2       liaozhongmin    24      13787896578
3       liaozemin       25      13409876785
The columns in wyp.txt are separated by tab characters, so the following statement can load the file's data into the wyp table:
hive (hive)> load data local inpath '/usr/local/src/wyp.txt' into table wyp;
Copying data from file:/usr/local/src/wyp.txt
Copying file: file:/usr/local/src/wyp.txt
Loading data to table hive.wyp
OK
Time taken: 1.495 seconds
We can look inside the wyp table's data directory to check, with this command:
hive (hive)> dfs -ls /hive/hive.db/wyp;
Found 1 items
-rw-r--r--   1 root supergroup         82 2015-01-26 15:00 /hive/hive.db/wyp/wyp.txt
The data has indeed been loaded into the wyp table.
Note: unlike the relational databases we are used to, Hive does not yet support supplying a set of literal records directly in an insert statement; in other words, Hive (at the version used here) does not support the INSERT INTO ... VALUES form.
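When you really do need to append a single literal row on a Hive version without VALUES support, a common workaround is to select constants from an existing non-empty table. This is only a sketch; the literal values below and the LIMIT 1 trick are illustrative, not part of the original examples:

```sql
-- Hypothetical workaround: "insert" one literal row by selecting
-- constants from any non-empty table and limiting to a single row.
insert into table wyp
select 4, 'newuser', 26, '13800000000'
from wyp
limit 1;
```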
2. Loading data into a Hive table from HDFS
When data is loaded from the local file system into a Hive table, it is actually first copied to a temporary directory on HDFS (typically under the uploading user's HDFS home directory, e.g. /home/wyp/), and then moved (note: moved, not copied!) from that temporary directory into the Hive table's data directory. Given that, Hive naturally also supports moving data directly from some directory on HDFS into the table's data directory. Suppose HDFS contains the file /hive_load_data/wyp.txt, with the following content:
[root@liaozhongmin src]# hadoop fs -text /hive_load_data/wyp.txt
Warning: $HADOOP_HOME is deprecated.
1       lavimer         23      13878789088
2       liaozhongmin    24      13787896578
3       liaozemin       25      13409876785
That is the data to be inserted, stored in HDFS at /hive_load_data/wyp.txt (unlike in section 1, where the file lived on the local file system). We can load the file's content into the Hive table with the following command:
hive (hive)> load data inpath '/hive_load_data/wyp.txt' into table wyp;
Loading data to table hive.wyp
OK
Time taken: 1.133 seconds
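To see for yourself that this load moves the file rather than copies it, you can list the source directory afterwards from the hive CLI. A sketch, using the paths from the example above (the exact -ls output will depend on your cluster):

```sql
-- The source file should be gone from its original HDFS location...
dfs -ls /hive_load_data/;
-- ...and should now sit under the table's data directory instead.
dfs -ls /hive/hive.db/wyp/;
```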
Note: here, too, the content is moved from its HDFS location into Hive, not copied. One more thing to watch out for: if a file with the same name already exists in the table directory, loading the data again with this command will fail, like so:
15/01/26 15:58:57 ERROR exec.Task: Failed with exception null
java.lang.NullPointerException
	at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:1962)
	at org.apache.hadoop.hive.ql.metadata.Table.copyFiles(Table.java:545)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1261)
	at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:234)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:134)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1326)
	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1118)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:951)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:689)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
One solution is to rename the file before uploading it; the other is to overwrite the existing content, with the following statement:
hive (hive)> load data inpath '/hive_load_data/wyp.txt' overwrite into table wyp;
Loading data to table hive.wyp
Deleted hdfs://liaozhongmin:9000/hive/hive.db/wyp
OK
Time taken: 0.542 seconds
Note: loading from HDFS uses the command load data inpath 'XXX' into table XXX. There is no local keyword here, which is the main difference from the first method.
3. Querying data out of another table and inserting it into a Hive table
Create a test table in the hive database, with this statement:
hive (hive)> create table test(
           > id int, name string,
           > tel string)
           > partitioned by (age int)
           > row format delimited
           > fields terminated by '\t'
           > stored as textfile;
OK
Time taken: 0.488 seconds
It is much like the statement that created wyp, except that test uses age as a partition column. The following statement queries the data out of wyp and inserts it into test:
hive (hive)> insert into table test
           > partition (age='25')
           > select id,name,tel
           > from wyp;
Automatically selecting local only mode for query
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/root/root_20150126161313_569631cb-6661-44fe-a78f-b83682ada03b.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2015-01-26 16:13:07,275 null map = 100%, reduce = 0%
Ended Job = job_local_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Ended Job = 502340710, job is filtered out (removed at runtime).
Moving data to: hdfs://liaozhongmin:9000/tmp/hive-root/hive_2015-01-26_16-13-02_201_3561493006175965099/-ext-10000
Loading data to table hive.test partition (age=25)
Partition hive.test{age=25} stats: [num_files: 1, num_rows: 3, total_size: 73, raw_data_size: 70]
Table hive.test stats: [num_partitions: 1, num_files: 1, num_rows: 3, total_size: 73, raw_data_size: 70]
OK
id      name    tel
Time taken: 6.754 seconds
hive (hive)> select * from test;
OK
id      name    tel     age
1       lavimer         13878789088     25
2       liaozhongmin    13787896578     25
3       liaozemin       13409876785     25
Time taken: 0.137 seconds
From the output above we can see that the rows queried out of wyp were successfully inserted into test. If the target table (test) has no partition column, the partition (age='25') clause can be dropped. We can also let the partition values come from the select statement itself, specifying the partitions dynamically:
hive (hive)> set hive.exec.dynamic.partition.mode=nonstrict;
hive (hive)> insert into table test
           > partition (age)
           > select id,name,
           > tel,age
           > from wyp;
Automatically selecting local only mode for query
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Loading partition {age=25}
Loading partition {age=23}
Loading partition {age=24}
Partition hive.test{age=23} stats: [num_files: 1, num_rows: 1, total_size: 22, raw_data_size: 21]
Partition hive.test{age=24} stats: [num_files: 1, num_rows: 1, total_size: 27, raw_data_size: 26]
Partition hive.test{age=25} stats: [num_files: 2, num_rows: 4, total_size: 97, raw_data_size: 93]
Table hive.test stats: [num_partitions: 3, num_files: 4, num_rows: 6, total_size: 146, raw_data_size: 140]
OK
id      name    tel     age
Time taken: 7.3 seconds
hive (hive)> select * from test;
OK
id      name    tel     age
1       lavimer         13878789088     23
2       liaozhongmin    13787896578     24
1       lavimer         13878789088     25
2       liaozhongmin    13787896578     25
3       liaozemin       13409876785     25
3       liaozemin       13409876785     25
Time taken: 0.246 seconds
This technique is called a dynamic-partition insert. It is disabled by default in Hive, so before using it you need to set hive.exec.dynamic.partition.mode to nonstrict. Hive also supports inserting data with insert overwrite; as the word suggests, overwrite replaces the data under the target directory when the statement runs, whereas insert into appends to it. Keep the difference in mind. For example:
hive> insert overwrite table test
    > PARTITION (age)
    > select id, name, tel, age
    > from wyp;
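For an unpartitioned target table, the same insert simply omits the partition clause. A minimal sketch, assuming a hypothetical unpartitioned table test2 with the same three non-partition columns as test:

```sql
-- test2 is hypothetical: same columns as test, but no partition column.
create table test2 (id int, name string, tel string)
row format delimited fields terminated by '\t'
stored as textfile;

-- No partition clause needed when the target has no partitions.
insert into table test2
select id, name, tel
from wyp;
```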
4. Creating a table by querying records from another table and inserting them into it
In practice, a query's output can be too large to display sensibly on the console. In such cases it is very convenient to store Hive's query result directly in a new table. This is known as CTAS (create table ... as select), as follows:
hive (hive)> create table test4
           > as
           > select id,name,tel
           > from wyp;
Automatically selecting local only mode for query
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/root/root_20150126162727_e1ba78c6-1696-4a7d-b3a9-740657eb2576.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2015-01-26 16:27:28,620 null map = 100%, reduce = 0%
Ended Job = job_local_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Ended Job = 366870212, job is filtered out (removed at runtime).
Moving data to: hdfs://liaozhongmin:9000/tmp/hive-root/hive_2015-01-26_16-27-23_012_214489060288615790/-ext-10001
Moving data to: hdfs://liaozhongmin:9000/hive/hive.db/test4
Table hive.test4 stats: [num_partitions: 0, num_files: 1, num_rows: 3, total_size: 73, raw_data_size: 70]
OK
id      name    tel
Time taken: 6.166 seconds
hive (hive)> select * from test4;
OK
id      name    tel
1       lavimer         13878789088
2       liaozhongmin    13787896578
3       liaozemin       13409876785
Time taken: 0.09 seconds
The data has been inserted into test4. A CTAS operation is atomic, so if the select query fails for some reason, the new table is not created at all!
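CTAS can also specify the new table's storage layout and filter the rows being copied, rather than taking the defaults. A sketch (the table name test5 and the where filter are illustrative, not from the original examples):

```sql
-- Hypothetical: CTAS with explicit storage properties and a row filter.
create table test5
row format delimited fields terminated by '\t'
stored as textfile
as
select id, name, tel
from wyp
where id > 1;
```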