Hive Query Optimization: Implementing DISTINCT with GROUP BY

Posted April 8, 2018 in General

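A colleague wrote a Hive SQL query that ran extremely slowly: after more than an hour, the job had only finished its map phase, and the reduce phase was at 20%.
The Hive query was: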

select count(distinct ip)
from (
  select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
  union all
  select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
  union all
  select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
) d;

Analysis: each of the three subqueries returns roughly 1 billion rows. That is about 1 billion from comprehensive.f_client_boot_daily, about 1 billion from f_app_boot_daily, and about 1 billion from format_log.format_pv1 with url_first_id=1, all filtered to year="2013" and month="10". The union therefore holds about 3 billion rows. With that much data, count(distinct ip) with no group by shuffles every row to a single reducer, which leaves that reducer severely skewed.
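One way to check this before running the job is to ask Hive for the query plan. The sketch below simply prefixes the query from this post with explain; the layout and wording of the plan output depend on the Hive version and execution engine, so no expected output is shown here.

```sql
-- Print the execution plan without running the job.
-- Look at the reduce-side stage that computes count(distinct ip):
-- a global aggregate (no group by) is finished by one reducer.
explain
select count(distinct ip)
from (
  select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
  union all
  select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
  union all
  select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
) d;
```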
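Solution:
First, deduplicate with group by, grouping the rows by ip. Because group by partitions rows by ip, the deduplication can be spread across many reducers, and the outer count(*) only has to count rows that are already distinct. The rewritten SQL is: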

select count(*)
from (
  select ip
  from (
    select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
    union all
    select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
    union all
    select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
  ) d
  group by ip
) b;

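Then, set a sensible number of reducers so the data is spread across multiple machines: set mapred.reduce.tasks=50;
After this optimization the speedup was very noticeable: the whole job finished in a little over 20 minutes.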
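The reducer setting can be written as session configuration ahead of the rewritten query. The fragment below is a sketch, not the author's exact script: the first line is the setting from this post, while the commented-out alternatives and their values are illustrative assumptions that would need tuning for a real cluster.

```sql
-- Setting used in this post: fix the number of reducers for the session.
set mapred.reduce.tasks=50;

-- On Hadoop 2 and later the same property is named mapreduce.job.reduces.
-- set mapreduce.job.reduces=50;

-- Alternative (assumed values): leave the count at -1 so Hive estimates it
-- from the input size, bounded by the two properties below.
-- set mapred.reduce.tasks=-1;
-- set hive.exec.reducers.bytes.per.reducer=256000000;
-- set hive.exec.reducers.max=200;
```

A fixed count is the simplest choice for a one-off job like this one. The estimate-based settings may suit queries whose input size changes from run to run.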

Author: coiffeur

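This post was published by coiffeur 6 years ago in the General category and last updated on April 8, 2018.
Please credit when reposting: Hive Query Optimization: Implementing DISTINCT with GROUP BY | 学步园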
