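HiveQL Data Sampling (Sample) Queries

When the data volume is very large, querying a subset of the data speeds up analysis; this technique is called sampling. Hive supports the following three kinds of data sampling:

Random sampling;

Bucket table sampling;

Block sampling.

1 Random sampling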

Syntax:

select * from <table_name> distribute by rand() sort by rand() limit <number_of_rows>;

Example

hive> select * from ana distribute by rand() sort by rand() limit 3;
+-----------+-------------+-------------+
| ana.name  | ana.depart  | ana.salary  |
+-----------+-------------+-------------+
| mike      | 1001        | 6400        |
| will      | 1000        | 4000        |
| richard   | 1002        | 8000        |
+-----------+-------------+-------------+
3 rows selected (123.999 seconds)

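2 Bucket table sampling

Syntax

Bucket table sampling is a special sampling method optimized for bucketed tables. colname specifies the column on which to sample the data; to sample over entire rows instead, use the rand() function. The tablesample clause is more efficient when the sampled column is the table's clustered by column.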

Syntax:

select * from <table_name> tablesample(bucket <x> out of <y> on [colname|rand()]) table_alias;

Example: omitted.
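As an illustration of the syntax above, here is a minimal sketch. It assumes a hypothetical bucketed copy of the ana table named ana_bucketed, clustered by depart into 4 buckets. The table name, the column types and the bucket count are assumptions made for this example and do not come from the original article.

```sql
-- Hypothetical bucketed table; column types are assumed from the sample output.
create table ana_bucketed (name string, depart int, salary int)
clustered by (depart) into 4 buckets;

-- Populate it from the existing table.
-- (On Hive releases before 2.0, hive.enforce.bucketing must be set to true first.)
insert overwrite table ana_bucketed
select name, depart, salary from ana;

-- Sample on the clustered by column: Hive can read just one of the 4 buckets.
select * from ana_bucketed tablesample(bucket 1 out of 4 on depart) s;

-- Sample over entire rows: roughly a quarter of the rows,
-- but the whole table has to be scanned.
select * from ana_bucketed tablesample(bucket 1 out of 4 on rand()) s;
```

The first query shows the efficient case described above: the sampled column matches the clustered by column, so only the matching bucket needs to be read.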

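3 Block sampling

Syntax

Block sampling lets Hive randomly pick N rows of data, a percentage of the data volume, or N bytes of data. The granularity of this kind of sampling is the HDFS block size.

Syntax: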

select * from <table_name> tablesample(n percent|ByteLengthLiteral|n rows) s;

-- ByteLengthLiteral

-- (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')

Examples

Example 1: sampling by rows

hive> select name from ana  tablesample(4 rows) a;
+----------+
|   name   |
+----------+
| lucy     |
| michael  |
| steven   |
| will     |
+----------+
4 rows selected (0.29 seconds)

Example 2: sampling by percentage of data size

hive> select name from ana  tablesample(10 percent) a;
+----------+
|   name   |
+----------+
| lucy     |
| michael  |
+----------+
2 rows selected (0.345 seconds)

Example 3: sampling by data size

hive> select name from ana  tablesample(2m) a;
+----------+
|   name   |
+----------+
| lucy     |
| michael  |
| steven   |
| will     |
| will     |
| jess     |
| lily     |
| mike     |
| richard  |
| wei      |
| yun      |
+----------+
11 rows selected (0.264 seconds)
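Percentage-based block sampling can be pointed at different blocks by changing the sampling seed. The following is a minimal sketch, assuming the Hive configuration property hive.sample.seednumber is available in the Hive version in use. The seed value 7 is arbitrary and chosen only for illustration.

```sql
-- Assumption: hive.sample.seednumber selects which blocks percent sampling picks.
set hive.sample.seednumber=7;

-- Same query as Example 2; a different seed may return different rows.
select name from ana tablesample(10 percent) a;

-- The byte-length suffix accepts either case: 2m and 2M are equivalent.
select name from ana tablesample(2M) a;
```

Because the sampling granularity is the HDFS block size, a small table that fits in a single block may return the same rows whatever the seed or percentage. With n rows, the row count is understood to apply to each input split, so the total can exceed n on tables that span several splits.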
