hive中order by,sort by,distribute by,cluster by作用和用法

数据准备

 --  zxz_5.name       | zxz_5.nid  |  zxz_5.phone  | zxz_5.ntime  | zxz_5.year  | zxz_5.month 
CREATE TABLE (
  name string,
  nid BIGINT,
  phone string,
  ntime string,
  year string,
  month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
STORED AS ORC;

 insert into zxz_5 values
 ('Cleon Haresnape       ',100        ,'625-338-3965', '2014-12-04'  , '2018','8'),
 ('Blaine Fawdry         ',99         ,'477-109-9014', '2012-07-14'  , '2018','8'),
 ('Zoe Adacot            ',98         ,'704-987-0702', '2015-09-29'  , '2018','8'),
 ('Orlando Pallatina     ',97         ,'354-125-1208', '2012-07-12'  , '2018','8'),
 ('Alyssa Prydden        ',96         ,'963-170-0545', '2014-11-07'  , '2018','8'),
 ('Skipp Spurden         ',95         ,'747-133-1382', '2012-03-15'  , '2018','8'),
 ('Auguste Kares         ',94         ,'230-184-3438', '2014-03-13'  , '2018','8'),
 ('Ciro Arendt           ',93         ,'792-967-0588', '2015-11-07'  , '2018','8'),
 ('Floyd Cano            ',92         ,'133-768-6535', '2016-02-27'  , '2018','8'),
 ('Sayre Osbaldeston     ',91         ,'340-132-2361', '2011-11-30'  , '2018','8'),
 ('Pearline Marcq        ',90         ,'200-835-9497', '2016-02-10'  , '2018','8'),
 ('Zackariah Pietrusiak  ',89         ,'810-738-9846', '2012-02-25'  , '2018','8'),
 ('Auguste Kobel         ',88         ,'562-494-1360', '2012-02-29'  , '2018','8'),
 ('Jilly Eisikowitz      ',87         ,'431-355-2777', '2017-02-18'  , '2018','8'),
 ('Wes Jaffrey           ',86         ,'848-465-5131', '2016-02-14'  , '2018','8'),
 ('Casey Greenfield      ',85         ,'204-108-7707', '2012-03-18'  , '2018','8'),
 ('Maurise Gallico       ',84         ,'546-158-7983', '2011-12-21'  , '2018','8'),
 ('Cilka Heijne          ',83         ,'772-704-7366', '2011-08-27'  , '2018','8');

order by

hive中的order by 语句和其他的sql方言中定义是一样的，其会对查询结果集执行一个全局排序，这也就是说会有一个所有的数据都通过一个reduce进行处理的过程，对于大数据集，这个过程将消耗很大的时间来执行。

select * from zxz_5 order by nid desc,year(ntime);
INFO  : Compiling command(queryId=hive_20210623113131_2b07dab5-8ce0-44f9-8b29-563567aa2085): select * from zxz_5 order by nid desc,year(ntime)
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:zxz_5.name, type:string, comment:null), FieldSchema(name:zxz_5.nid, type:bigint, comment:null), FieldSchema(name:zxz_5.phone, type:string, comment:null), FieldSchema(name:zxz_5.ntime, type:string, comment:null), FieldSchema(name:zxz_5.year, type:string, comment:null), FieldSchema(name:zxz_5.month, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20210623113131_2b07dab5-8ce0-44f9-8b29-563567aa2085); Time taken: 0.062 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20210623113131_2b07dab5-8ce0-44f9-8b29-563567aa2085): select * from zxz_5 order by nid desc,year(ntime)
INFO  : Query ID = hive_20210623113131_2b07dab5-8ce0-44f9-8b29-563567aa2085
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks determined at compile time: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1576480232731_832153
INFO  : The url to track the job: http://emr-master.izhikang.com:8088/proxy/application_1576480232731_832153/
INFO  : Starting Job = job_1576480232731_832153, Tracking URL = http://emr-master.izhikang.com:8088/proxy/application_1576480232731_832153/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job  -kill job_1576480232731_832153
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2021-06-23 11:31:56,382 Stage-1 map = 0%,  reduce = 0%
INFO  : 2021-06-23 11:32:00,469 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.88 sec
INFO  : 2021-06-23 11:32:05,582 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.6 sec
INFO  : MapReduce Total cumulative CPU time: 3 seconds 600 msec
INFO  : Ended Job = job_1576480232731_832153
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.6 sec   HDFS Read: 15125 HDFS Write: 1027 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 3 seconds 600 msec
INFO  : Completed executing command(queryId=hive_20210623113131_2b07dab5-8ce0-44f9-8b29-563567aa2085); Time taken: 13.874 seconds
INFO  : OK

sort by

hive增加了一个可供选择的方式，也就是sort by ,其实会在每个reduce中对数据进行排序，也就是执行一个局部排序过程。这可以保证每个reduce的输出数据都是有序的(但并非全局有效)。这样就可以提高后面进行的全局排序的效率了。对于这两种情况，语法区别仅仅是，一个关键字是order，另一个关键字是sort。用户可以指定任意期望进行排序的字段，并可以在字段后面加上asc关键字(默认)表示升序，desc关键字是降序排序

-- 在sort by之前我们还有配置属性:
set mapreduce.job.reduces=2; -- 配置两个ruduce默认一个，不然sort by是没有用处的。
select * from zxz_5 sort by nid ，year(ntime) desc;
INFO  : Compiling command(queryId=hive_20210623115050_cc564610-d28e-40a1-aa4a-8514d16f2d09): select * from zxz_5 sort by nid,year(ntime) desc
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:zxz_5.name, type:string, comment:null), FieldSchema(name:zxz_5.nid, type:bigint, comment:null), FieldSchema(name:zxz_5.phone, type:string, comment:null), FieldSchema(name:zxz_5.ntime, type:string, comment:null), FieldSchema(name:zxz_5.year, type:string, comment:null), FieldSchema(name:zxz_5.month, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20210623115050_cc564610-d28e-40a1-aa4a-8514d16f2d09); Time taken: 0.05 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20210623115050_cc564610-d28e-40a1-aa4a-8514d16f2d09): select * from zxz_5 sort by nid,year(ntime) desc
INFO  : Query ID = hive_20210623115050_cc564610-d28e-40a1-aa4a-8514d16f2d09
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Defaulting to jobconf value of: 2
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1576480232731_832167
INFO  : The url to track the job: http://emr-master.izhikang.com:8088/proxy/application_1576480232731_832167/
INFO  : Starting Job = job_1576480232731_832167, Tracking URL = http://emr-master.izhikang.com:8088/proxy/application_1576480232731_832167/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job  -kill job_1576480232731_832167
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
INFO  : 2021-06-23 11:50:46,334 Stage-1 map = 0%,  reduce = 0%
INFO  : 2021-06-23 11:50:50,424 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.98 sec
INFO  : 2021-06-23 11:50:55,516 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.57 sec
INFO  : MapReduce Total cumulative CPU time: 5 seconds 570 msec
INFO  : Ended Job = job_1576480232731_832167
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 5.57 sec   HDFS Read: 19684 HDFS Write: 1027 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 5 seconds 570 msec
INFO  : Completed executing command(queryId=hive_20210623115050_cc564610-d28e-40a1-aa4a-8514d16f2d09); Time taken: 13.929 seconds
INFO  : OK

sort by 和 distribute by

distribute by 控制 map的输出在reduer中是如何划分的，mapreduce job 中传输的所有数据都是按照键-值对的方式进行组织的，
因此hive在将用户的查询语句转换成mapreduce job时，其必须在内部使用这个功能。
默认情况下，MapReduce计算框架会依据map输入的键计算相应的哈希值，然后按照得到的哈希值将键-值对均匀分发到多个reducer中去，
不过不幸的是，这也是意味着当我们使用sort by 时，不同reducer的输出内容会有明显的重叠，至少对于排序顺序而已只这样，
即使每个reducer的输出的数据都有序的。如果我们想让同一年的数据一起处理，那么就可以使用distribute by 来保证具有相同年份的数据分发到同一个reducer中进行处理，
然后使用sort by 来安装我们的期望对数据进行排序:

Distribute by 和 group by 在其控制着reducer是如何接受一行行数据进行处理这方面类似的，而sort by 则控制着reducer内的数据是如何进行排序的，需要注意的是，hive需求distribute by 语句写在sort by 语句前

set mapreduce.job.reduces=2;
select * from zxz_5 z distribute by year(z.ntime) sort by z.nid ;
INFO  : Compiling command(queryId=hive_20210623115757_6d3b292c-1839-402a-bf13-679c698c7e31): select year(z.ntime),* from zxz_5 z distribute by year(z.ntime) sort by z.nid
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:int, comment:null), FieldSchema(name:z.name, type:string, comment:null), FieldSchema(name:z.nid, type:bigint, comment:null), FieldSchema(name:z.phone, type:string, comment:null), FieldSchema(name:z.ntime, type:string, comment:null), FieldSchema(name:z.year, type:string, comment:null), FieldSchema(name:z.month, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20210623115757_6d3b292c-1839-402a-bf13-679c698c7e31); Time taken: 0.068 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20210623115757_6d3b292c-1839-402a-bf13-679c698c7e31): select year(z.ntime),* from zxz_5 z distribute by year(z.ntime) sort by z.nid
INFO  : Query ID = hive_20210623115757_6d3b292c-1839-402a-bf13-679c698c7e31
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Defaulting to jobconf value of: 2
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1576480232731_832170
INFO  : The url to track the job: http://emr-master.izhikang.com:8088/proxy/application_1576480232731_832170/
INFO  : Starting Job = job_1576480232731_832170, Tracking URL = http://emr-master.izhikang.com:8088/proxy/application_1576480232731_832170/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job  -kill job_1576480232731_832170
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
INFO  : 2021-06-23 11:57:40,857 Stage-1 map = 0%,  reduce = 0%
INFO  : 2021-06-23 11:57:44,954 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.87 sec
INFO  : 2021-06-23 11:57:50,049 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.81 sec
INFO  : MapReduce Total cumulative CPU time: 5 seconds 810 msec
INFO  : Ended Job = job_1576480232731_832170
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 5.81 sec   HDFS Read: 20456 HDFS Write: 1117 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 5 seconds 810 msec
INFO  : Completed executing command(queryId=hive_20210623115757_6d3b292c-1839-402a-bf13-679c698c7e31); Time taken: 15.431 seconds
INFO  : OK

cluster by

在之前列子中，year（ntime）用在distribute by 语句中，而nid列位于sort by语句中，如果这两个语句涉及到列的完全相同，而且采用的是升序的方式，那么就可以用到cluster by ，他等价于 distribute by 和sort by 是一个简写方式

select * from zxz_5 z cluster by z.nid ;

(同似select * from zxz_5 z distribute by z.nid sort by z.nid ;)

INFO  : Compiling command(queryId=hive_20210623121414_96d861b4-dec0-4e9e-aa2e-3c3fdb8e86d4): select * from zxz_5 z cluster by z.nid
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:z.name, type:string, comment:null), FieldSchema(name:z.nid, type:bigint, comment:null), FieldSchema(name:z.phone, type:string, comment:null), FieldSchema(name:z.ntime, type:string, comment:null), FieldSchema(name:z.year, type:string, comment:null), FieldSchema(name:z.month, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20210623121414_96d861b4-dec0-4e9e-aa2e-3c3fdb8e86d4); Time taken: 0.045 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20210623121414_96d861b4-dec0-4e9e-aa2e-3c3fdb8e86d4): select * from zxz_5 z cluster by z.nid
INFO  : Query ID = hive_20210623121414_96d861b4-dec0-4e9e-aa2e-3c3fdb8e86d4
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Defaulting to jobconf value of: 2
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1576480232731_832175
INFO  : The url to track the job: http://emr-master.izhikang.com:8088/proxy/application_1576480232731_832175/
INFO  : Starting Job = job_1576480232731_832175, Tracking URL = http://emr-master.izhikang.com:8088/proxy/application_1576480232731_832175/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job  -kill job_1576480232731_832175
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
INFO  : 2021-06-23 12:14:17,886 Stage-1 map = 0%,  reduce = 0%
INFO  : 2021-06-23 12:14:21,976 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
INFO  : 2021-06-23 12:14:27,073 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.01 sec
INFO  : MapReduce Total cumulative CPU time: 5 seconds 10 msec
INFO  : Ended Job = job_1576480232731_832175
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 5.01 sec   HDFS Read: 18771 HDFS Write: 1027 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 5 seconds 10 msec
INFO  : Completed executing command(queryId=hive_20210623121414_96d861b4-dec0-4e9e-aa2e-3c3fdb8e86d4); Time taken: 13.908 seconds
INFO  : OK