Hive - 窗口函数

Hive的窗口函数用法基本和SQL一致，不过为了加强印象，还是在再归纳一下hive的窗口函数

1. 窗口函数语法结构

Group by 普通聚合函数每组只有一条记录，简单来说返回的行数以group by后面的分类为准，一般搭配的聚合函数sum，avg，max，min等；

而窗口函数则可以为窗口中的每行都返回一个值。普通聚合函数聚合的行集是组，开窗函数聚合的行集是窗口。

窗口函数语法结构：

聚合函数+ over (partition by [] order by [] asc/desc rows between [])，用rows between 来限制窗口函数的计算范围

rows between 的相关窗口字句如下：

窗口子句	备注
PRECEDING	往前 n preceding 从当前行向前n行
FOLLOWING	往后 n following 从当前行向后n行
CURRENT ROW	当前行
UNBOUNDED	起点
UNBOUNDED PRECEDING	表示该窗口最前面的行（起点）
UNBOUNDED FOLLOWING	表示该窗口最后面的行（终点）

rows between 10 preceding and current row：当前行的前十行；

rows between current row and 10 following：当前行的后10行；

row between unbounded preceding and current row：当前行到第一行（最前行）。一般求累计求和的时候使用

rows between 必须和order by 使用

2. 示例

现有表格结构如下：

create table emp(
        userid bigint,
        emp_name array<string>, -- 数组类型：外国人的名字包含 名 和 姓
        emp_date map<string,date>, -- 将入职日期和出生日期组合，入职日期和出生日期字段类型（key_value）为string，
        -- 其值本身（value_type)类型为date
        other_info struct<deparment:string,gender:string>
 row format delimited -- 指定分隔符
    fields terminated by '\t' -- 字段分隔为\t，表示一个字段为一个制表位
    collection terminated by ',' -- 元素分隔符：将array/struct中各个字段用,隔开；将map中的key_value用,隔开
    map keys terminated by ':'   -- 将map类型的key和value用:隔开
     
    -- 三类分隔符指令都存在时，顺序不能错

需求：部门人数从多到少排序,计算累计人数及累计占比

部门人数从多到少排序

select other_info.department,count(emp) as emp_count from emp
group by other_info.department
order by other_info.department desc

计算累计人数

select *, sum(emp_count) over(order by emp_count desc rows between unbounded preceding and current row) as add_count
from 
(select other_info.department,count(emp) as emp_count from emp
group by other_info.department
order by other_info.department desc)a

计算占比和累计占比

select *, sum(emp_count) over(order by emp_count desc rows between unbounded preceding and current row) as add_count,
emp_count/sum(emp_count) over() as sum_p ,-- 人数占比
sum(emp_count) over(order by emp_count desc rows between unbounded preceding and current row) /sum(emp_count) over() as add_p -- 累计占比
from 
(select other_info.department,count(emp) as emp_count from emp
group by other_info.department
order by other_info.department desc)a

格式调整，显示为百分比，保留小数两位

select *, sum(emp_count) over(order by emp_count desc rows between unbounded preceding and current row) as add_count,
concat(round((emp_count/sum(emp_count) over() as sum_p)*100,2),'%'),
concat(round((sum(emp_count) over(order by emp_count desc rows between unbounded preceding and current row) /sum(emp_count) over())*100,2),'%') as add_p
from 
(select other_info.department,count(emp) as emp_count from emp
group by other_info.department
order by other_info.department desc)a 
order by emp_count desc -- 人数降序