Common queries and statistics in Impala and Hive SQL


Overview

This post records common queries and analytic functions used in Impala and Hive SQL.

Data source link
Source blog link

Sample data

cookie1 2015-04-10 1
cookie1 2015-04-11 5
cookie1 2015-04-12 7
cookie1 2015-04-13 3
cookie1 2015-04-14 2
cookie1 2015-04-15 4
cookie1 2015-04-16 4
cookie2 2015-04-10 2
cookie2 2015-04-11 3
cookie2 2015-04-12 5
cookie2 2015-04-13 6
cookie2 2015-04-14 3
cookie2 2015-04-15 9
cookie2 2015-04-16 7

1. Use cases for analytic window functions

  • Ranking within partitions

  • Dynamic GROUP BY

  • Top N

  • Cumulative calculations

  • Hierarchical queries

Basic usage of ROW_NUMBER() OVER

row_number() OVER (PARTITION BY COL1 ORDER BY COL2) groups rows by COL1 and sorts each group by COL2; the function returns each row's sequence number within its group (consecutive and unique within the group).

-- keep only each hotel's most recent task_date, then filter to linked rows
SELECT * FROM (
  SELECT partner_hotel_id, hotel_seq, task_date, link_status,
         row_number() OVER(PARTITION BY partner_hotel_id ORDER BY task_date DESC) AS num
  FROM clues_mapping.qunar_mapping_houses
) tmp WHERE tmp.num = 1 AND link_status = 'LINKED'
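
Applied to the lxw1234 sample table above, the same pattern answers Top-N questions. A minimal sketch that keeps each cookie's highest-pv day:

SELECT cookieid, createtime, pv
FROM (
  SELECT cookieid, createtime, pv,
         ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn
  FROM lxw1234
) t
WHERE rn = 1;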

NTILE() OVER

NTILE(n) splits the rows in each partition into n buckets in order and returns the bucket number of the current row. For example, to get the top 1/3 of days by pv for a cookie, compute NTILE(3) ordered by pv DESC and keep rows where rn2 = 1; for the top 2/3, keep rn2 IN (1, 2).

SELECT
cookieid,
createtime,
pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY pv) AS rn1, -- split each partition into 2 buckets
NTILE(3) OVER(PARTITION BY cookieid ORDER BY pv) AS rn2, -- split each partition into 3 buckets
NTILE(4) OVER(ORDER BY pv) AS rn3 -- split all rows into 4 buckets
FROM lxw1234
ORDER BY cookieid,createtime;

cookieid day pv rn1 rn2 rn3
-------------------------------------------------
cookie1 2015-04-10 1 1 1 1
cookie1 2015-04-11 5 1 1 1
cookie1 2015-04-12 7 1 1 2
cookie1 2015-04-13 3 1 2 2
cookie1 2015-04-14 2 2 2 3
cookie1 2015-04-15 4 2 3 3
cookie1 2015-04-16 4 2 3 4
cookie2 2015-04-10 2 1 1 1
cookie2 2015-04-11 3 1 1 1
cookie2 2015-04-12 5 1 1 2
cookie2 2015-04-13 6 1 2 2
cookie2 2015-04-14 3 2 2 3
cookie2 2015-04-15 9 2 3 4
cookie2 2015-04-16 7 2 3 4
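
Following the note above, a sketch that returns the top 1/3 of days by pv for each cookie; ordering by pv DESC puts the highest-pv days into bucket 1:

SELECT cookieid, createtime, pv
FROM (
  SELECT cookieid, createtime, pv,
         NTILE(3) OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn2
  FROM lxw1234
) t
WHERE rn2 = 1;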

RANK and DENSE_RANK

  • RANK() assigns each row a rank within its partition; tied rows share the same rank and leave gaps in the sequence.
  • DENSE_RANK() assigns each row a rank within its partition; tied rows share the same rank without leaving gaps.

SELECT 
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM lxw1234
WHERE cookieid = 'cookie1';

cookieid day pv rn1 rn2 rn3
--------------------------------------------------
cookie1 2015-04-12 7 1 1 1
cookie1 2015-04-11 5 2 2 2
cookie1 2015-04-15 4 3 3 3
cookie1 2015-04-16 4 3 3 4
cookie1 2015-04-13 3 5 4 5
cookie1 2015-04-14 2 6 5 6
cookie1 2015-04-10 1 7 6 7

rn1: 04-15 and 04-16 tie for 3rd, so 04-13 ranks 5th
rn2: 04-15 and 04-16 tie for 3rd, and 04-13 ranks 4th
rn3: ROW_NUMBER() always assigns a unique, consecutive number; when pv values tie, the order among the tied rows is not deterministic

Window aggregate functions

The key is understanding ROWS BETWEEN, also known as the WINDOW clause:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: the boundary; UNBOUNDED PRECEDING means from the start of the partition, UNBOUNDED FOLLOWING means to the end of the partition

Table schema

  • cookieid STRING
  • createtime STRING
  • pv INT

Using SUM as a window aggregate; AVG, MIN, and MAX follow the same usage

SELECT cookieid,
createtime,
pv,
SUM(pv) OVER(PARTITION BY cookieid
ORDER BY createtime) AS pv1, -- default: from partition start to current row
SUM(pv) OVER(PARTITION BY cookieid
ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- from partition start to current row, same as pv1
SUM(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition
SUM(pv) OVER(PARTITION BY cookieid
ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row plus the 3 preceding rows
SUM(pv) OVER(PARTITION BY cookieid
ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- the 3 preceding rows through 1 following row
SUM(pv) OVER(PARTITION BY cookieid
ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row through the end of the partition
FROM lxw1234;
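
Run against the sample data, cookie1's expected output (worked out by hand from its pv sequence 1, 5, 7, 3, 2, 4, 4) is:

cookieid day pv pv1 pv2 pv3 pv4 pv5 pv6
-------------------------------------------------
cookie1 2015-04-10 1 1 1 26 1 6 26
cookie1 2015-04-11 5 6 6 26 6 13 25
cookie1 2015-04-12 7 13 13 26 13 16 20
cookie1 2015-04-13 3 16 16 26 16 18 13
cookie1 2015-04-14 2 18 18 26 17 21 10
cookie1 2015-04-15 4 22 22 26 16 20 8
cookie1 2015-04-16 4 26 26 26 13 13 4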

Row/column conversion functions in Hive

Splitting one row into many rows in Hive

Key functions: explode(), split(), and LATERAL VIEW

SELECT rateable_id AS house_id,
ids
FROM vdayu_production.rates AS h LATERAL VIEW explode(split(sight_photo_ids_str, ",")) adtable AS ids
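
To see what the pattern produces, here is a self-contained sketch with an inline row (the table and column names are made up for illustration):

-- one input row whose ids_str is '1,2,3' explodes into three output rows
SELECT t.id, s.one_id
FROM (SELECT 1 AS id, '1,2,3' AS ids_str) t
LATERAL VIEW explode(split(t.ids_str, ',')) s AS one_id;
-- returns (1, '1'), (1, '2'), (1, '3')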

Merging many rows into one row in Hive

Key functions: concat_ws(), collect_set(), and collect_list().
The difference between collect_set() and collect_list(): collect_set() deduplicates the merged values, while collect_list() does not.

SELECT sale_ord_id,
concat_ws(',', collect_set(item_sku_id)) AS item_sku_id
FROM table1
GROUP BY sale_ord_id
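
The inverse of the explode sketch earlier, again with inline rows (names made up for illustration):

-- collapse the rows of each id back into one delimited string
SELECT id, concat_ws(',', collect_set(one_id)) AS ids_str
FROM (
  SELECT 1 AS id, '1' AS one_id
  UNION ALL SELECT 1, '2'
  UNION ALL SELECT 1, '2'
) t
GROUP BY id;
-- collect_set dedupes, giving '1,2'; collect_list would keep '1,2,2'
-- (element order inside the collected array is not guaranteed)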

Impala: converting a Unix timestamp to a date

from_unixtime(cast(substr(created_time,1, 10) as int), 'yyyy-MM-dd')
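
substr(created_time, 1, 10) keeps the seconds part of a 13-digit millisecond epoch string. A worked sketch with a made-up literal:

SELECT from_unixtime(cast(substr('1560000000123', 1, 10) AS int), 'yyyy-MM-dd');
-- 1560000000 seconds is 2019-06-08 (UTC)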

Parsing JSON strings in Hive

get_json_object(string json_string, string path)

get_json_object(regexp_extract(houses.accept_time, '^\\[(.+)\\]$', 1), '$.start_accept_time') AS start_accept_time,

This handles data like the following:

[{"start_accept_time":"08:00","end_accept_time":"24:00"}]

First strip the enclosing [ ] with regexp_extract, then pull out the field with get_json_object.
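
Putting the two calls together on the literal above:

SELECT get_json_object(
  regexp_extract('[{"start_accept_time":"08:00","end_accept_time":"24:00"}]', '^\\[(.+)\\]$', 1),
  '$.start_accept_time');
-- returns 08:00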

Using sort_array in Hive to fix the unstable ordering of collect_list/collect_set results

SELECT
memberid,
regexp_replace(
  concat_ws('-',
    sort_array(              -- sort entries by their numeric 'legcount:' prefix
      collect_set(           -- dedupe; each entry is a 'legcount:airways' string
        concat_ws(':', cast(legcount as string), airways)
      )
    )
  ), '\\d:', '') hs          -- strip the 'digit:' sort key, keeping only airways
from
(
  select 1 as memberid, 'A' as airways, 2 as legcount
  union ALL
  select 1 as memberid, 'B' as airways, 3 as legcount
  union ALL
  select 2 as memberid, 'C' as airways, 4 as legcount
  union ALL
  select 2 as memberid, 'D' as airways, 1 as legcount
  union ALL
  select 2 as memberid, 'D' as airways, 1 as legcount
) as t
group by memberid
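
Expected result: memberid 1 yields 'A-B' and memberid 2 yields 'D-C', i.e. the airways ordered by leg count. Note that sort_array compares strings, so legcount values of 10 or more would need zero-padding to sort correctly.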

Date conversion in Hive

from_unixtime(unix_timestamp('20190525', 'yyyyMMdd'), 'yyyy-MM-dd')
-- Presto syntax: to_unixtime and format_datetime are not Hive built-ins
SELECT from_unixtime(to_unixtime(cast('2017-12-01' as timestamp)))

-- with explicit output formatting:
-- format_datetime(from_unixtime(to_unixtime(cast('2017-12-01' as timestamp))), 'yyyy-MM-dd HH:mm:ss')

Implementing a pivot table in Hive

Sometimes when writing HQL in Hive, it is convenient to store results as key-value pairs. For example, students' learning behaviors in various courses can be recorded like this:

| student | class_name | index_key | index_value |
| ------- | ---------- | ----------- | ----------- |
| A | c1 | answer_num | 10 |
| B | c2 | homework_num | 10 |

This layout has a few benefits. First, it avoids the pile of joins that would otherwise bloat the HQL; second, the table is easy to amend: if a student's learning feature was extracted incorrectly, you can simply delete that key instead of rewriting and rerunning the HQL. Once data is stored in this form, it needs to be converted back to the ordinary wide format, and the rest of this section records how to do that with Hive's built-in functions.

-- SELECT without FROM requires Hive 0.13+; on older Hive, point these
-- selects at a one-row dual table
create table tmp as
select 'A' as student, 'english' as class_name, 'answer_num' as key, 9 as value
union all
select 'A' as student, 'chinese' as class_name, 'answer_num' as key, 15 as value
union all
select 'B' as student, 'english' as class_name, 'answer_num' as key, 12 as value
union all
select 'A' as student, 'english' as class_name, 'homework_num' as key, 5 as value
union all
select 'B' as student, 'english' as class_name, 'homework_num' as key, 7 as value
union all
select 'B' as student, 'chinese' as class_name, 'homework_num' as key, 10 as value;
---------------------
Author: SK_Lavender
Source: CSDN
Original: https://blog.csdn.net/u010705209/article/details/53164118
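
With the tmp table above in place, one way to pivot using only Hive built-ins is conditional aggregation. A minimal sketch (key is backticked in case your Hive version treats it as a keyword):

SELECT student,
       class_name,
       MAX(CASE WHEN `key` = 'answer_num' THEN value END) AS answer_num,
       MAX(CASE WHEN `key` = 'homework_num' THEN value END) AS homework_num
FROM tmp
GROUP BY student, class_name;

For student A and class_name english this yields answer_num = 9 and homework_num = 5; keys missing for a group come out as NULL.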

Pivot using Hivemall to_map function

SELECT
uid,
kv['c1'] AS c1,
kv['c2'] AS c2,
kv['c3'] AS c3
FROM (
SELECT uid, to_map(key, value) kv
FROM vtable
GROUP BY uid
) t;

uid c1 c2 c3
A 11 10 10
B 12 11 10
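
Note that to_map(key, value) is a Hivemall UDF rather than a Hive built-in, so this query assumes the Hivemall functions are registered in your session.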

Unpivot

SELECT t1.uid, t2.key, t2.value
FROM htable t1
LATERAL VIEW explode (map(
'c1', c1,
'c2', c2,
'c3', c3
)) t2 AS key, value;

uid key value
A c1 11
A c2 10
A c3 10
B c1 12
B c2 11
B c3 10
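
Note that map('c1', c1, ...) requires all of its values to share one type; if the columns differ, cast them first (e.g. cast(c1 as string)) before unpivoting.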
