Selecting random rows from a Hive table

To select 10000 random records from a table:

select * from my_table
order by rand()
limit 10000;

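A faster approach is shown below. order by sends every row through a single reducer to produce a total ordering, whereas distribute by rand() spreads the rows randomly across reducers and sort by rand() only sorts within each reducer: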

select * from my_table
distribute by rand()
sort by rand()
limit 10000;
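
If the table is very large, the amount of data shuffled can likely be reduced by filtering rows before the shuffle. The following is a sketch only: the fraction 0.0001 is an assumed value, chosen for a hypothetical table of about one billion rows so that roughly 100,000 rows survive the filter, comfortably more than the 10000 requested. Adjust it to the actual table size.

```sql
-- Sketch: filter first, then shuffle and sort only the surviving rows.
-- 0.0001 is an assumption for a table of about one billion rows; it must
-- keep noticeably more than 10000 rows, or the result will come up short.
select * from my_table
where rand() <= 0.0001
distribute by rand()
sort by rand()
limit 10000;
```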

To select roughly 20% of the records from a table:

select name, age
from
(
select name, age, rand(12345) rand_v
from my_table
) t
where rand_v between 0 and 0.2;
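
The subquery is not strictly required. As a minimal sketch, assuming the rand_v column is not needed anywhere else, the same filter can be applied directly in the where clause:

```sql
-- Keep each row with probability of about 0.2.
-- rand(12345) is the same seeded call used in the query above.
select name, age
from my_table
where rand(12345) <= 0.2;
```

Either way, the result contains about 20% of the rows rather than exactly 20%, because each row is kept or dropped independently.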
