Hive-偏移/排序窗口函数&随机抽样

1. 偏移函数

偏移函数	备注
LEAD(col,n,DEFAULT)	用于统计窗口内往下第n行值从当前行下移几行的值（最后一行取值为null）
LAG(col,n,DEFAULT)	用于统计窗口内往上第n行值从当前行上移几行的值（第一行取值为null）
first_value(col, DEFAULT)	取分组内排序后，截止到当前行，第一个值
last_value(col, DEFAULT)	取分组内排序后，截止到当前行,最后一个值

2. 偏移函数示例

需求：检验emp表的 userid 是否为连续的 (userid 排序后，差值都为1就表示连续)

先进行排序

select * from emp order by userid

偏移

select *,lead(userid,1) over( order by userid asc) as next_userid
from emp
order by userid

userid 相减，取出不为1的数据，则为不连续的

select * from (
select *,lead(userid,1) over( order by userid asc) as next_userid
from emp
order by userid)
where next_userid-userid>1;

3. 排序函数

排序函数	备注
ROW_NUMBER()	从1开始，按照顺序，生成分组内记录的序列
RANK()	生成数据项在分组中的排名，排名相等会在名次中留下空位 1 2 2 4
DENSE_RANK()	生成数据项在分组中的排名，排名相等在名次中不会留下空位。1223
NTILE(n)	用于将分组数据按照顺序切分成n片，返回当前切片值, 等频切片

4. 排序函数示例

需求：查找每个部门工资最高的前5人

分析：只要5名 → ROW_NUMBER()，RANK()【别是第5名重复 1 2 2 4 5 5 5】

前五名→ dense_rank()

select * from (
select *, row_number() over(partition by department order by salar desc) as row_salary
from emp)
where row_salary<=5;

需求：查看每个部门工资排名前10%的员工

select * from (
select *, ntile(10) over(partition by department order by salar desc) as level_salary
from emp)
where level_salary=1;

5. 随机抽样

5.1 rand()函数

select * from emp order by rand() limit 10; -- 全局抽样，随机抽10条

select * from emp distribute by rand() sort by rand() limit 10;
--distribute和sort关键字可以保证数据在map和reduce阶段是随机分布的，更加随机精确

5.2 tablesample(n percent)

tablesample(n percent) 根据hive表数据的大小（不是行数，而是数据大小）按比例抽取数据，并保存到新的hive表中. 由于在HDFS块层级进行抽样，所以抽样粒度为块的大小，例如如果块大小为128MB，即使输入的 n%仅为50MB，也会得到128MB的数据

百分比抽样

create table sample_table as 
select * from emp tablesample(10 percent)
-- 10%的数据

大小抽样

tablesample(nM) 指定抽样数据的大小，单位为M

SELECT * FROM emp TABLESAMPLE(1M) ;

5. 3 分桶抽样

hive中分桶其实就是根据某一个字段Hash取模，放入指定数据的桶中. 语法：TABLESAMPLE(BUCKET x OUT OF y) y必须是table总bucket数的倍数或者因子。hive根据y的大小，决定抽样的比例

未分桶的表

--将表随机分成10组，抽取其中的第一个桶的数据
select * from emp tablesample(bucket 1 out of 10 on rand())

已分桶的表

-- 对第一个桶抽样一半
select * from emp_bucket tablesample(bucket 2 out of 4 on gender) --总桶数/4 =
0.5 从第一个桶开始取，取0.5个桶的数据
--emp_bucket_id 有6个分桶
select * from emp_bucket_id tablesample(bucket 1 out of 3 on gender) --从第一个桶
开始取，取2个桶的数据，第二个桶是 1+3