hive中order by,sort by,distribute by,cluster by作用和用法


hive中order by,sort by,distribute by,cluster by作用和用法

转载

数据准备

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
 --  zxz_5.name       | zxz_5.nid  |  zxz_5.phone  | zxz_5.ntime  | zxz_5.year  | zxz_5.month 
CREATE TABLE (
name string,
nid BIGINT,
phone string,
ntime string,
year string,
month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
STORED AS ORC;

insert into zxz_5 values
('Cleon Haresnape ',100 ,'625-338-3965', '2014-12-04' , '2018','8'),
('Blaine Fawdry ',99 ,'477-109-9014', '2012-07-14' , '2018','8'),
('Zoe Adacot ',98 ,'704-987-0702', '2015-09-29' , '2018','8'),
('Orlando Pallatina ',97 ,'354-125-1208', '2012-07-12' , '2018','8'),
('Alyssa Prydden ',96 ,'963-170-0545', '2014-11-07' , '2018','8'),
('Skipp Spurden ',95 ,'747-133-1382', '2012-03-15' , '2018','8'),
('Auguste Kares ',94 ,'230-184-3438', '2014-03-13' , '2018','8'),
('Ciro Arendt ',93 ,'792-967-0588', '2015-11-07' , '2018','8'),
('Floyd Cano ',92 ,'133-768-6535', '2016-02-27' , '2018','8'),
('Sayre Osbaldeston ',91 ,'340-132-2361', '2011-11-30' , '2018','8'),
('Pearline Marcq ',90 ,'200-835-9497', '2016-02-10' , '2018','8'),
('Zackariah Pietrusiak ',89 ,'810-738-9846', '2012-02-25' , '2018','8'),
('Auguste Kobel ',88 ,'562-494-1360', '2012-02-29' , '2018','8'),
('Jilly Eisikowitz ',87 ,'431-355-2777', '2017-02-18' , '2018','8'),
('Wes Jaffrey ',86 ,'848-465-5131', '2016-02-14' , '2018','8'),
('Casey Greenfield ',85 ,'204-108-7707', '2012-03-18' , '2018','8'),
('Maurise Gallico ',84 ,'546-158-7983', '2011-12-21' , '2018','8'),
('Cilka Heijne ',83 ,'772-704-7366', '2011-08-27' , '2018','8');

order by

hive中的order by 语句和其他的sql方言中定义是一样的,其会对查询结果集执行一个全局排序,这也就是说会有一个所有的数据都通过一个reduce进行处理的过程,对于大数据集,这个过程将消耗很大的时间来执行。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
select * from zxz_5 order by nid desc,year(ntime);
INFO : Compiling command(queryId=hive_20210623113131_2b07dab5-8ce0-44f9-8b29-563567aa2085): select * from zxz_5 order by nid desc,year(ntime)
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:zxz_5.name, type:string, comment:null), FieldSchema(name:zxz_5.nid, type:bigint, comment:null), FieldSchema(name:zxz_5.phone, type:string, comment:null), FieldSchema(name:zxz_5.ntime, type:string, comment:null), FieldSchema(name:zxz_5.year, type:string, comment:null), FieldSchema(name:zxz_5.month, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20210623113131_2b07dab5-8ce0-44f9-8b29-563567aa2085); Time taken: 0.062 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20210623113131_2b07dab5-8ce0-44f9-8b29-563567aa2085): select * from zxz_5 order by nid desc,year(ntime)
INFO : Query ID = hive_20210623113131_2b07dab5-8ce0-44f9-8b29-563567aa2085
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks determined at compile time: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:1
INFO : Submitting tokens for job: job_1576480232731_832153
INFO : The url to track the job: http://emr-master.izhikang.com:8088/proxy/application_1576480232731_832153/
INFO : Starting Job = job_1576480232731_832153, Tracking URL = http://emr-master.izhikang.com:8088/proxy/application_1576480232731_832153/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job -kill job_1576480232731_832153
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO : 2021-06-23 11:31:56,382 Stage-1 map = 0%, reduce = 0%
INFO : 2021-06-23 11:32:00,469 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.88 sec
INFO : 2021-06-23 11:32:05,582 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.6 sec
INFO : MapReduce Total cumulative CPU time: 3 seconds 600 msec
INFO : Ended Job = job_1576480232731_832153
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.6 sec HDFS Read: 15125 HDFS Write: 1027 SUCCESS
INFO : Total MapReduce CPU Time Spent: 3 seconds 600 msec
INFO : Completed executing command(queryId=hive_20210623113131_2b07dab5-8ce0-44f9-8b29-563567aa2085); Time taken: 13.874 seconds
INFO : OK

sort by

hive增加了一个可供选择的方式,也就是sort by ,其实会在每个reduce中对数据进行排序,也就是执行一个局部排序过程。这可以保证每个reduce的输出数据都是有序的(但并非全局有效)。这样就可以提高后面进行的全局排序的效率了。对于这两种情况,语法区别仅仅是,一个关键字是order,另一个关键字是sort。用户可以指定任意期望进行排序的字段,并可以在字段后面加上asc关键字(默认)表示升序,desc关键字是降序排序

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
-- 在sort by之前我们还有配置属性:
set mapreduce.job.reduces=2; -- 配置两个ruduce默认一个,不然sort by是没有用处的。
select * from zxz_5 sort by nid ,year(ntime) desc;
INFO : Compiling command(queryId=hive_20210623115050_cc564610-d28e-40a1-aa4a-8514d16f2d09): select * from zxz_5 sort by nid,year(ntime) desc
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:zxz_5.name, type:string, comment:null), FieldSchema(name:zxz_5.nid, type:bigint, comment:null), FieldSchema(name:zxz_5.phone, type:string, comment:null), FieldSchema(name:zxz_5.ntime, type:string, comment:null), FieldSchema(name:zxz_5.year, type:string, comment:null), FieldSchema(name:zxz_5.month, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20210623115050_cc564610-d28e-40a1-aa4a-8514d16f2d09); Time taken: 0.05 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20210623115050_cc564610-d28e-40a1-aa4a-8514d16f2d09): select * from zxz_5 sort by nid,year(ntime) desc
INFO : Query ID = hive_20210623115050_cc564610-d28e-40a1-aa4a-8514d16f2d09
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks not specified. Defaulting to jobconf value of: 2
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:1
INFO : Submitting tokens for job: job_1576480232731_832167
INFO : The url to track the job: http://emr-master.izhikang.com:8088/proxy/application_1576480232731_832167/
INFO : Starting Job = job_1576480232731_832167, Tracking URL = http://emr-master.izhikang.com:8088/proxy/application_1576480232731_832167/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job -kill job_1576480232731_832167
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
INFO : 2021-06-23 11:50:46,334 Stage-1 map = 0%, reduce = 0%
INFO : 2021-06-23 11:50:50,424 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.98 sec
INFO : 2021-06-23 11:50:55,516 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.57 sec
INFO : MapReduce Total cumulative CPU time: 5 seconds 570 msec
INFO : Ended Job = job_1576480232731_832167
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 2 Cumulative CPU: 5.57 sec HDFS Read: 19684 HDFS Write: 1027 SUCCESS
INFO : Total MapReduce CPU Time Spent: 5 seconds 570 msec
INFO : Completed executing command(queryId=hive_20210623115050_cc564610-d28e-40a1-aa4a-8514d16f2d09); Time taken: 13.929 seconds
INFO : OK

sort by 和 distribute by

distribute by 控制 map的输出在reduer中是如何划分的,mapreduce job 中传输的所有数据都是按照键-值对的方式进行组织的,
因此hive在将用户的查询语句转换成mapreduce job时,其必须在内部使用这个功能。
默认情况下,MapReduce计算框架会依据map输入的键计算相应的哈希值,然后按照得到的哈希值将键-值对均匀分发到多个reducer中去,
不过不幸的是,这也是意味着当我们使用sort by 时,不同reducer的输出内容会有明显的重叠,至少对于排序顺序而已只这样,
即使每个reducer的输出的数据都有序的。如果我们想让同一年的数据一起处理,那么就可以使用distribute by 来保证具有相同年份的数据分发到同一个reducer中进行处理,
然后使用sort by 来安装我们的期望对数据进行排序:

Distribute by 和 group by 在其控制着reducer是如何接受一行行数据进行处理这方面类似的,而sort by 则控制着reducer内的数据是如何进行排序的,需要注意的是,hive需求distribute by 语句写在sort by 语句前

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
set mapreduce.job.reduces=2;
select * from zxz_5 z distribute by year(z.ntime) sort by z.nid ;
INFO : Compiling command(queryId=hive_20210623115757_6d3b292c-1839-402a-bf13-679c698c7e31): select year(z.ntime),* from zxz_5 z distribute by year(z.ntime) sort by z.nid
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:int, comment:null), FieldSchema(name:z.name, type:string, comment:null), FieldSchema(name:z.nid, type:bigint, comment:null), FieldSchema(name:z.phone, type:string, comment:null), FieldSchema(name:z.ntime, type:string, comment:null), FieldSchema(name:z.year, type:string, comment:null), FieldSchema(name:z.month, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20210623115757_6d3b292c-1839-402a-bf13-679c698c7e31); Time taken: 0.068 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20210623115757_6d3b292c-1839-402a-bf13-679c698c7e31): select year(z.ntime),* from zxz_5 z distribute by year(z.ntime) sort by z.nid
INFO : Query ID = hive_20210623115757_6d3b292c-1839-402a-bf13-679c698c7e31
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks not specified. Defaulting to jobconf value of: 2
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:1
INFO : Submitting tokens for job: job_1576480232731_832170
INFO : The url to track the job: http://emr-master.izhikang.com:8088/proxy/application_1576480232731_832170/
INFO : Starting Job = job_1576480232731_832170, Tracking URL = http://emr-master.izhikang.com:8088/proxy/application_1576480232731_832170/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job -kill job_1576480232731_832170
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
INFO : 2021-06-23 11:57:40,857 Stage-1 map = 0%, reduce = 0%
INFO : 2021-06-23 11:57:44,954 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.87 sec
INFO : 2021-06-23 11:57:50,049 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.81 sec
INFO : MapReduce Total cumulative CPU time: 5 seconds 810 msec
INFO : Ended Job = job_1576480232731_832170
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 2 Cumulative CPU: 5.81 sec HDFS Read: 20456 HDFS Write: 1117 SUCCESS
INFO : Total MapReduce CPU Time Spent: 5 seconds 810 msec
INFO : Completed executing command(queryId=hive_20210623115757_6d3b292c-1839-402a-bf13-679c698c7e31); Time taken: 15.431 seconds
INFO : OK

cluster by

在之前列子中,year(ntime)用在distribute by 语句中,而nid列位于sort by语句中,如果这两个语句涉及到列的完全相同,而且采用的是升序的方式,那么就可以用到cluster by ,他等价于 distribute by 和sort by 是一个简写方式

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
select * from zxz_5 z cluster by z.nid ;

(同似select * from zxz_5 z distribute by z.nid sort by z.nid ;)

INFO : Compiling command(queryId=hive_20210623121414_96d861b4-dec0-4e9e-aa2e-3c3fdb8e86d4): select * from zxz_5 z cluster by z.nid
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:z.name, type:string, comment:null), FieldSchema(name:z.nid, type:bigint, comment:null), FieldSchema(name:z.phone, type:string, comment:null), FieldSchema(name:z.ntime, type:string, comment:null), FieldSchema(name:z.year, type:string, comment:null), FieldSchema(name:z.month, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20210623121414_96d861b4-dec0-4e9e-aa2e-3c3fdb8e86d4); Time taken: 0.045 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20210623121414_96d861b4-dec0-4e9e-aa2e-3c3fdb8e86d4): select * from zxz_5 z cluster by z.nid
INFO : Query ID = hive_20210623121414_96d861b4-dec0-4e9e-aa2e-3c3fdb8e86d4
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks not specified. Defaulting to jobconf value of: 2
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:1
INFO : Submitting tokens for job: job_1576480232731_832175
INFO : The url to track the job: http://emr-master.izhikang.com:8088/proxy/application_1576480232731_832175/
INFO : Starting Job = job_1576480232731_832175, Tracking URL = http://emr-master.izhikang.com:8088/proxy/application_1576480232731_832175/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job -kill job_1576480232731_832175
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
INFO : 2021-06-23 12:14:17,886 Stage-1 map = 0%, reduce = 0%
INFO : 2021-06-23 12:14:21,976 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.21 sec
INFO : 2021-06-23 12:14:27,073 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.01 sec
INFO : MapReduce Total cumulative CPU time: 5 seconds 10 msec
INFO : Ended Job = job_1576480232731_832175
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 2 Cumulative CPU: 5.01 sec HDFS Read: 18771 HDFS Write: 1027 SUCCESS
INFO : Total MapReduce CPU Time Spent: 5 seconds 10 msec
INFO : Completed executing command(queryId=hive_20210623121414_96d861b4-dec0-4e9e-aa2e-3c3fdb8e86d4); Time taken: 13.908 seconds
INFO : OK


文章作者: Callable
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Callable !
评论
  目录