1. Overview
A text file, f1.txt, has the following format:
1 tom
2 jame
3 mango
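The first column is the id and the second column is the name. The two columns are separated by whitespace of variable length (spaces, tabs, and so on).
We want to create a user table that can read f1.txt. Specifying a field delimiter when creating the table does not work here, because a fixed delimiter cannot match a run of whitespace whose length varies from line to line. This is where Hive's serialization/deserialization mechanism (SerDe) comes in.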
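To make the problem concrete, here is a small plain-Java sketch of the difference between splitting on one fixed delimiter and splitting on runs of whitespace. It is only an analogy: the class name DelimiterDemo and the sample line are made up for illustration, and this is not Hive's actual parsing code.

```java
import java.util.Arrays;

/** Hypothetical demo class, not part of the original project. */
public class DelimiterDemo {

    /** Splits on exactly one space, similar in spirit to a fixed field delimiter. */
    public static String[] splitOnSingleSpace(String line) {
        return line.split(" ");
    }

    /** Splits on any run of whitespace (spaces, tabs, ...), whatever its length. */
    public static String[] splitOnWhitespaceRuns(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Assumed sample line: the id and the name are separated by a space, a tab, and two spaces.
        String line = "1 \t  tom";

        // A single fixed delimiter leaves stray pieces between the two real columns.
        System.out.println(Arrays.toString(splitOnSingleSpace(line)));

        // Treating the whole whitespace run as one separator yields exactly two columns.
        System.out.println(Arrays.toString(splitOnWhitespaceRuns(line)));
    }
}
```

The second approach is what the custom deserializer in the next step relies on.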
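2. Create a new Maven project and add dependencies on hive-serde 0.11.0 and hadoop-core 1.0.3.
Create a SerdeTest class that implements the Deserializer interface:
- In the initialize() method, describe the table's fields and their types.
- In the deserialize(Writable text) method, parse text into id and name.
- The getObjectInspector() method returns ObjectInspectorFactory.getStandardStructObjectInspector(structFieldNames, structFieldObjectInspectors).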
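package com.renren.hive.tools;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.Deserializer;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.ObjectInspectorOptions;
import org.apache.hadoop.io.Writable;

public class SerdeTest implements Deserializer {

    // Column names and their inspectors, in table order.
    private List<String> structFieldNames = new ArrayList<String>();
    private List<ObjectInspector> structFieldObjectInspectors = new ArrayList<ObjectInspector>();

    @Override
    public ObjectInspector getObjectInspector() throws SerDeException {
        return ObjectInspectorFactory.getStandardStructObjectInspector(
                structFieldNames, structFieldObjectInspectors);
    }

    @Override
    public Object deserialize(Writable text) throws SerDeException {
        List<Object> result = new ArrayList<Object>();
        // StringTokenizer splits on runs of whitespace by default.
        StringTokenizer tokenizer = new StringTokenizer(text.toString());
        int index = 0;
        while (tokenizer.hasMoreTokens()) {
            if (index == 0) {
                // First token is the integer id.
                result.add(Integer.valueOf(tokenizer.nextToken()));
            } else {
                // Remaining tokens are kept as strings (the name).
                result.add(tokenizer.nextToken());
            }
            index++;
        }
        return result;
    }

    @Override
    public void initialize(Configuration arg0, Properties arg1) throws SerDeException {
        // Clear first so that a repeated initialize() call does not duplicate columns.
        structFieldNames.clear();
        structFieldObjectInspectors.clear();

        structFieldNames.add("id");
        structFieldObjectInspectors.add(ObjectInspectorFactory
                .getReflectionObjectInspector(Integer.TYPE, ObjectInspectorOptions.JAVA));
        structFieldNames.add("name");
        structFieldObjectInspectors.add(ObjectInspectorFactory
                .getReflectionObjectInspector(String.class, ObjectInspectorOptions.JAVA));
    }

    @Override
    public SerDeStats getSerDeStats() {
        // No statistics are collected by this deserializer.
        return null;
    }
}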
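Because SerdeTest depends on the Hive and Hadoop jars, it can be handy to check the row-parsing logic on its own first. The following is a minimal sketch under that assumption: RowParser is a hypothetical helper that is not part of the original project. It mirrors the tokenizing done in deserialize(), and additionally rejects lines that do not have exactly two columns, which the original code does not do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Hypothetical standalone helper, not part of the original project. */
public class RowParser {

    /** Parses "id <whitespace> name" into a two-element row: [Integer id, String name]. */
    public static List<Object> parse(String line) {
        // With no explicit delimiters, StringTokenizer splits on space, tab, newline, CR and form feed.
        StringTokenizer tokenizer = new StringTokenizer(line);
        int columns = tokenizer.countTokens();
        if (columns != 2) {
            // Extra check added for this sketch; SerdeTest itself accepts any number of tokens.
            throw new IllegalArgumentException(
                    "expected 2 columns but found " + columns + ": " + line);
        }
        List<Object> row = new ArrayList<Object>();
        row.add(Integer.valueOf(tokenizer.nextToken())); // id
        row.add(tokenizer.nextToken());                  // name
        return row;
    }

    public static void main(String[] args) {
        // Columns separated by a mix of spaces and a tab.
        System.out.println(parse("1 \t  tom")); // prints [1, tom]
    }
}
```

Whether a malformed line should raise an error or produce NULL columns inside Hive is a design choice this sketch does not settle.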
3. Build the jar and add it to hive/lib:
mvn clean package
Copy the generated jar, hive-serde-tool-1.0.1-SNAPSHOT.jar, into hive_home/lib, and add the following to hive-site.xml:
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///home/dp/hive/lib/hive-serde-tool-1.0.1-SNAPSHOT.jar</value>
</property>
4. Create the Hive table and specify the SerDe
hive -e "create table test row format serde 'com.renren.hive.tools.SerdeTest'"
5. Load and query the data
hive -e "load data local inpath 'f1.txt' overwrite into table test"
hive -e "select * from test"