（五）直方图（histograms）

2013-02-24

（5）直方图（histograms）1、入门SQL create table t12as3with kilo_row as (4select /*+ materialize */5row

（5）直方图（histograms）
1、入门

SQL> create table t1   2  as  3  with kilo_row as (  4    select /*+ materialize */  5      rownum   6    from all_objects  7    where rownum <= 1000  8  )  9  select  10    trunc(7000 * dbms_random.normal)  normal 11  from 12    kilo_row  k1, 13    kilo_row  k2 14  where 15    rownum <= 1000000 16  ;表已创建。

上面将产生一个包含1 000 000个随机数的表。第一行调用了函数seed()，是为了重复执行该示例产生相同的结果。

SQL> select tenth tenth,  2         min(normal) low_val,  3         max(normal) high_val,  4         max(normal) - min(normal) width,  5         round(100000 / (max(normal) - min(normal)), 2) height  from (select normal, ntile(10) over(order by normal) tenth from t1) group by tenth  8   order by tenth;     TENTH    LOW_VAL   HIGH_VAL      WIDTH     HEIGHT---------- ---------- ---------- ---------- ----------         1     -31491      -8969      22522       4.44         2      -8969      -5884       3085      32.41         3      -5884      -3661       2223      44.98         4      -3661      -1766       1895      52.77         5      -1766          3       1769      56.53         6          3       1774       1771      56.47         7       1775       3674       1899      52.66         8       3674       5904       2230      44.84         9       5904       8970       3066      32.62        10       8970      33974      25004          4已选择10行。

ntile() over()子句对数据进行分类，并将这些类别均匀的分成10个部分（桶）——每一部分对应100 000行数据。从-31491和-8969之间（第1个桶）取出任意一个值，直方图中对应的高度表明表中不会有太多的行与之相匹配（高度为4.44）。类似地，表中又不会有太多的数据行能够月8970和33974之间的值相匹配（第10个桶）。绝大部分数据聚集于直方图的中间部分。实际上，80%的数据（10个桶中的8个桶）聚集在整个区间的27%的空间内（从-8969到+8970）。

SQL> begin  2  dbms_stats.gather_table_stats(  3  user,  4  't1',  5  cascade => true,  6  estimate_percent => null,  7  method_opt => 'for columns normal size 10'  8  );  9  end; 10  /

在列normal上创建一个有10个桶的直方图。

SQL> select rownum tenth,  2         prev low_val,  3         curr high_val,  4         curr - prev width,  5         round(100000 / (curr - prev), 2) height  6    from (select endpoint_value curr,  7                 lag(endpoint_value, 1) over(order by endpoint_number) prev  8            from user_tab_histograms  9           where table_name = 'T1' 10             and column_name = 'NORMAL') 11   where prev is not null 12   order by curr;     TENTH    LOW_VAL   HIGH_VAL      WIDTH     HEIGHT---------- ---------- ---------- ---------- ----------         1     -31491      -8969      22522       4.44         2      -8969      -5884       3085      32.41         3      -5884      -3661       2223      44.98         4      -3661      -1766       1895      52.77         5      -1766          3       1769      56.53         6          3       1774       1771      56.47         7       1774       3674       1900      52.63         8       3674       5904       2230      44.84         9       5904       8970       3066      32.62        10       8970      33974      25004          4

查询user_tab_histograms视图得到的结果与针对原始数据原来的查询得到的结果是一致的。

2、频率直方图

创建一个表，其中定义了一个列skew，其定义方式为数值1出现一次，数值2出现两次，以此类推直到80，因此一共有3240行数据。

SQL> create table t1 (  2    skew not null,    3    padding  4  )  5  as  6  with generator as (  7    select  --+ materialize  8      rownum   id  9    from all_objects  10    where  rownum <= 5000 11  ) 12  select 13    /*+ ordered use_nl(v2) */ 14    v1.id, 15    rpad('x',400) 16  from 17    generator  v1, 18    generator  v2 19  where 20    v1.id <= 80 21  and  v2.id <= 80 22  and  v2.id <= v1.id 23  order by  24    v2.id,v1.id 25  ;表已创建。SQL> select skew, count(*) from t1 group by skew order by skew;      SKEW   COUNT(*)---------- ----------         1          1         2          2         3          3         4          4         5          5         6          6  ... ...        78         78        79         79        80         80已选择80行。

SQL> begin  2  dbms_stats.gather_table_stats(  3  user,  4  't1',  5  cascade => true,  6  estimate_percent => null,  7  method_opt => 'for all columns size 80'  8  );  9  end; 10  /PL/SQL 过程已成功完成。SQL> select endpoint_value row_value, curr_num - nvl(prev_num, 0) row_count  2    from (select endpoint_value,  3                 endpoint_number curr_num,  4                 lag(endpoint_number, 1) over(order by endpoint_number) prev_num  5            from user_tab_histograms  6           where column_name = 'SKEW'  7             and table_name = 'T1')  8   order by endpoint_value; ROW_VALUE  ROW_COUNT---------- ----------         1          1         2          2         3          3         4          4         5          5 ... ...        78         78        79         79        80         80已选择80行。

频率直方图（user_tab_columns.num_buckets）中桶的数目与表中不同值的数目相匹配。

3、“高度均衡”直方图

如果在上面通过的示例数据集基础上创建一个58个桶的直方图。

SQL> begin  2  dbms_stats.gather_table_stats(  3  user,  4  't1',  5  cascade => true,  6  estimate_percent => null,  7  method_opt => 'for all columns size 58'  8  );  9  end; 10  /PL/SQL 过程已成功完成。SQL> select num_distinct, density, num_Buckets  2    from user_tab_columns  3   where table_name = 'T1'  4     and column_name = 'SKEW';NUM_DISTINCT    DENSITY NUM_BUCKETS------------ ---------- -----------          80 .015598596          58SQL> select endpoint_number, endpoint_value  2    from user_tab_histograms  3   where column_name = 'SKEW'  4     and table_name = 'T1'  5   order by endpoint_number;ENDPOINT_NUMBER ENDPOINT_VALUE--------------- --------------              0              1              1             11              2             15... ...             40             67             41             68             43             69             44             70             45             71ENDPOINT_NUMBER ENDPOINT_VALUE--------------- --------------             46             72             48             73             49             74             50             75             52             76             53             77             55             78             56             79             58             80已选择53行。

user_tab_columns中的信息，可以看出oracle已经正确的数出列中有80个不同的值。还注意到oracle要求建立一个58个桶的直方图。可以看出很明显有一些行丢失了——例如，ENDPOINT_NUMBER=42时没有对应的行。

4、总结

直方图能够通过列出每个值有多少行来对数据进行描述。

如果数据库中某一列的数据分布比较奇特，而且还使用了where子句，那么就可能需要针对该列建立直方图。
如果列中的数值非常少（低于255个），那么最好建立频率直方图，但是需要注意数据的变化率，并且必须保证直方图保持更新。
如果列中不同值的数量比较多，则必须对桶的数目进行估计。在绝大部分情况下，使用最大的估计值将是最安全的选择。必须对高频率出现的或者奇异值非常清楚，并确保它们在直方图中可见。

热点排行

其他数据库

（五）直方图（histograms）