Hive has four common ways to load data:
(1) Load data into a Hive table from the local file system.
(2) Load data into a Hive table from HDFS.
(3) Query data out of another table and insert it into a Hive table.
(4) Query records from another table at table-creation time and insert them into the newly created table.
I will walk through each of these hands-on, since plain description alone is dry to read and abstract to learn from. Let's get started!
1. Loading data into a Hive table from the local file system
First, create a table in Hive, like so:
hive (hive)> create table wyp
           > (id int, name string,
           > age int, tel string)
           > row format delimited
           > fields terminated by '\t'
           > stored as textfile;
OK
Time taken: 0.563 seconds
The table is simple: just four fields, whose meanings need no explanation. On the local file system there is a file /usr/local/src/wyp.txt with the following content:
[root@liaozhongmin src]# more wyp.txt
1       lavimer         23      13878789088
2       liaozhongmin    24      13787896578
3       liaozemin       25      13409876785
The columns in wyp.txt are separated by tab characters, so the following statement can load the file's data into the wyp table:
hive (hive)> load data local inpath '/usr/local/src/wyp.txt' into table wyp;
Copying data from file:/usr/local/src/wyp.txt
Copying file: file:/usr/local/src/wyp.txt
Loading data to table hive.wyp
OK
Time taken: 1.495 seconds
We can look inside the wyp table's data directory to check, with this command:
hive (hive)> dfs -ls /hive/hive.db/wyp;
Found 1 items
-rw-r--r--   1 root supergroup         82 2015-01-26 15:00 /hive/hive.db/wyp/wyp.txt
The data has indeed been loaded into the wyp table.
Note: unlike the relational databases we are used to, Hive does not yet support supplying a set of literal records directly in an insert statement; in other words, Hive (at the version used here) does not support the INSERT INTO ... VALUES form.
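When you really do need to append a single literal row on a Hive version without VALUES support, a common workaround is to select constants from an existing non-empty table. This is only a sketch; the literal values below and the LIMIT 1 trick are illustrative, not part of the original examples:

```sql
-- Hypothetical workaround: "insert" one literal row by selecting
-- constants from any non-empty table and limiting to a single row.
insert into table wyp
select 4, 'newuser', 26, '13800000000'
from wyp
limit 1;
```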
2. Loading data into a Hive table from HDFS
When data is loaded from the local file system into a Hive table, it is actually first copied to a temporary directory on HDFS (typically under the uploading user's HDFS home directory, e.g. /home/wyp/), and then moved (note: moved, not copied!) from that temporary directory into the Hive table's data directory. Given that, Hive naturally also supports moving data directly from some directory on HDFS into the table's data directory. Suppose HDFS contains the file /hive_load_data/wyp.txt, with the following content:
[root@liaozhongmin src]# hadoop fs -text /hive_load_data/wyp.txt
Warning: $HADOOP_HOME is deprecated.
1       lavimer         23      13878789088
2       liaozhongmin    24      13787896578
3       liaozemin       25      13409876785
That is the data to be inserted, stored in HDFS at /hive_load_data/wyp.txt (unlike in section 1, where the file lived on the local file system). We can load the file's content into the Hive table with the following command:
hive (hive)> load data inpath '/hive_load_data/wyp.txt' into table wyp;
Loading data to table hive.wyp
OK
Time taken: 1.133 seconds
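To see for yourself that this load moves the file rather than copies it, you can list the source directory afterwards from the hive CLI. A sketch, using the paths from the example above (the exact -ls output will depend on your cluster):

```sql
-- The source file should be gone from its original HDFS location...
dfs -ls /hive_load_data/;
-- ...and should now sit under the table's data directory instead.
dfs -ls /hive/hive.db/wyp/;
```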
Note: here, too, the content is moved from its HDFS location into Hive, not copied. One more thing to watch out for: if a file with the same name already exists in the table directory, loading the data again with this command will fail, like so:
15/01/26 15:58:57 ERROR exec.Task: Failed with exception null
java.lang.NullPointerException
	at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:1962)
	at org.apache.hadoop.hive.ql.metadata.Table.copyFiles(Table.java:545)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1261)
	at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:234)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:134)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1326)
	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1118)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:951)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:689)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
One solution is to rename the file before uploading it; the other is to overwrite the existing content, with the following statement:
hive (hive)> load data inpath '/hive_load_data/wyp.txt' overwrite into table wyp;
Loading data to table hive.wyp
Deleted hdfs://liaozhongmin:9000/hive/hive.db/wyp
OK
Time taken: 0.542 seconds
Note: loading from HDFS uses the command load data inpath 'XXX' into table XXX. There is no local keyword here, which is the main difference from the first method.
3. Querying data out of another table and inserting it into a Hive table
Create a test table in the hive database, with this statement:
hive (hive)> create table test(
           > id int, name string,
           > tel string)
           > partitioned by (age int)
           > row format delimited
           > fields terminated by '\t'
           > stored as textfile;
OK
Time taken: 0.488 seconds
It is much like the statement that created wyp, except that test uses age as a partition column. The following statement queries the data out of wyp and inserts it into test:
hive (hive)> insert into table test
           > partition (age='25')
           > select id,name,tel
           > from wyp;
Automatically selecting local only mode for query
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/root/root_20150126161313_569631cb-6661-44fe-a78f-b83682ada03b.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2015-01-26 16:13:07,275 null map = 100%, reduce = 0%
Ended Job = job_local_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Ended Job = 502340710, job is filtered out (removed at runtime).
Moving data to: hdfs://liaozhongmin:9000/tmp/hive-root/hive_2015-01-26_16-13-02_201_3561493006175965099/-ext-10000
Loading data to table hive.test partition (age=25)
Partition hive.test{age=25} stats: [num_files: 1, num_rows: 3, total_size: 73, raw_data_size: 70]
Table hive.test stats: [num_partitions: 1, num_files: 1, num_rows: 3, total_size: 73, raw_data_size: 70]
OK
id      name    tel
Time taken: 6.754 seconds
hive (hive)> select * from test;
OK
id      name    tel     age
1       lavimer         13878789088     25
2       liaozhongmin    13787896578     25
3       liaozemin       13409876785     25
Time taken: 0.137 seconds
From the output above we can see that the rows queried out of wyp were successfully inserted into test. If the target table (test) has no partition column, the partition (age='25') clause can be dropped. We can also let the partition values come from the select statement itself, specifying the partitions dynamically:
hive (hive)> set hive.exec.dynamic.partition.mode=nonstrict;
hive (hive)> insert into table test
           > partition (age)
           > select id,name,
           > tel,age
           > from wyp;
Automatically selecting local only mode for query
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Loading partition {age=25}
Loading partition {age=23}
Loading partition {age=24}
Partition hive.test{age=23} stats: [num_files: 1, num_rows: 1, total_size: 22, raw_data_size: 21]
Partition hive.test{age=24} stats: [num_files: 1, num_rows: 1, total_size: 27, raw_data_size: 26]
Partition hive.test{age=25} stats: [num_files: 2, num_rows: 4, total_size: 97, raw_data_size: 93]
Table hive.test stats: [num_partitions: 3, num_files: 4, num_rows: 6, total_size: 146, raw_data_size: 140]
OK
id      name    tel     age
Time taken: 7.3 seconds
hive (hive)> select * from test;
OK
id      name    tel     age
1       lavimer         13878789088     23
2       liaozhongmin    13787896578     24
1       lavimer         13878789088     25
2       liaozhongmin    13787896578     25
3       liaozemin       13409876785     25
3       liaozemin       13409876785     25
Time taken: 0.246 seconds
This technique is called a dynamic-partition insert. It is disabled by default in Hive, so before using it you need to set hive.exec.dynamic.partition.mode to nonstrict. Hive also supports inserting data with insert overwrite; as the word suggests, overwrite replaces the data under the target directory when the statement runs, whereas insert into appends to it. Keep the difference in mind. For example:
hive> insert overwrite table test
    > PARTITION (age)
    > select id, name, tel, age
    > from wyp;
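For an unpartitioned target table, the same insert simply omits the partition clause. A minimal sketch, assuming a hypothetical unpartitioned table test2 with the same three non-partition columns as test:

```sql
-- test2 is hypothetical: same columns as test, but no partition column.
create table test2 (id int, name string, tel string)
row format delimited fields terminated by '\t'
stored as textfile;

-- No partition clause needed when the target has no partitions.
insert into table test2
select id, name, tel
from wyp;
```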
4. Creating a table by querying records from another table and inserting them into it
In practice, a query's output can be too large to display sensibly on the console. In such cases it is very convenient to store Hive's query result directly in a new table. This is known as CTAS (create table ... as select), as follows:
hive (hive)> create table test4
           > as
           > select id,name,tel
           > from wyp;
Automatically selecting local only mode for query
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/root/root_20150126162727_e1ba78c6-1696-4a7d-b3a9-740657eb2576.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2015-01-26 16:27:28,620 null map = 100%, reduce = 0%
Ended Job = job_local_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Ended Job = 366870212, job is filtered out (removed at runtime).
Moving data to: hdfs://liaozhongmin:9000/tmp/hive-root/hive_2015-01-26_16-27-23_012_214489060288615790/-ext-10001
Moving data to: hdfs://liaozhongmin:9000/hive/hive.db/test4
Table hive.test4 stats: [num_partitions: 0, num_files: 1, num_rows: 3, total_size: 73, raw_data_size: 70]
OK
id      name    tel
Time taken: 6.166 seconds
hive (hive)> select * from test4;
OK
id      name    tel
1       lavimer         13878789088
2       liaozhongmin    13787896578
3       liaozemin       13409876785
Time taken: 0.09 seconds
The data has been inserted into test4. A CTAS operation is atomic, so if the select query fails for some reason, the new table is not created at all!
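CTAS can also specify the new table's storage layout and filter the rows being copied, rather than taking the defaults. A sketch (the table name test5 and the where filter are illustrative, not from the original examples):

```sql
-- Hypothetical: CTAS with explicit storage properties and a row filter.
create table test5
row format delimited fields terminated by '\t'
stored as textfile
as
select id, name, tel
from wyp
where id > 1;
```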