手动模拟哈希索引

摘自 PostgreSQL wiki
跳转至导航跳转至搜索

作者:Emanuel Calvo Franco

简介和环境设置

本文不提供新方法,而是一种了解哈希类型索引的工作原理以及如何轻松“模拟”其功能的指南。

因此,我们将使用文本文件中的一些名称,然后通过“copy”工具将其导入。

以下测试在 Postgres 8.4 Beta 1 上进行。

CREATE TABLE hashy (name text, namehash int);
CREATE RULE hashy_rule AS ON insert TO hashy DO update hashy set namehash = hashbpchar(hashy.name) where namehash is NULL; 
-- o utilizar hashtext...
CREATE OR REPLACE RULE hashy_rule AS ON insert TO hashy DO update hashy set namehash = hashtext(hashy.name) where namehash is NULL; 
COPY hashy(name) from '/usr/local/pgsql84/listado.txt';
insert into hashy(name) (select name || '_1' from hashy);
insert into hashy(name) (select name || '_2' from hashy);
...
insert into hashy(name) (select name || '_5' from hashy);
insert into hashy(name) (select '0' || name from hashy);

主要思路是生成一个包含名称哈希的字段。由于整数字段比文本类型字段更快,因此从理论上来说,使用经过“哈希处理”的字段将显著提高搜索速度。

示例

postgres=# explain analyze verbose select * from hashy where namehash = hashbpchar('picolo');
                                            QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Seq Scan on hashy  (cost=0.00..123.78 rows=27 width=36) (actual time=1.522..1.523 rows=1 loops=1)
   Output: name, namehash
   Filter: (namehash = 564982275)
 Total runtime: 1.554 ms
(4 rows)

postgres=# explain analyze verbose select * from hashy where namehash = hashtext('picolo');
                                            QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Seq Scan on hashy  (cost=0.00..123.78 rows=27 width=36) (actual time=1.549..1.551 rows=1 loops=1)
   Output: name, namehash
   Filter: (namehash = 564982275)
 Total runtime: 1.578 ms
(4 rows)

postgres=# explain analyze verbose select * from hashy where name = 'picolo';
                                            QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Seq Scan on hashy  (cost=0.00..123.78 rows=27 width=36) (actual time=1.571..1.573 rows=1 loops=1)
   Output: name, namehash
   Filter: (name = 'picolo'::text)
 Total runtime: 1.605 ms
(4 rows)

我们可以看到,hashtext 和 hashbpchar 在性能方面存在细微差异。至于字段“name”上的顺序搜索,其结果与字段哈希的结果非常相似。从理论上来说,大量数据可能会产生更大差异。

不只是简单模拟

现在,我们使用实际索引来测试其功能。首先,我们在 namehash 列上创建索引

postgres=# create index hashy_namehash on hashy USING hash (namehash);
CREATE INDEX
postgres=# explain analyze verbose select * from hashy where namehash = hashtext('picolo');
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on hashy  (cost=4.45..51.77 rows=26 width=36) (actual time=0.024..0.025 rows=1 loops=1)
   Output: name, namehash
   Recheck Cond: (namehash = 564982275)
   ->  Bitmap Index Scan on hashy_namehash  (cost=0.00..4.45 rows=26 width=0) (actual time=0.013..0.013 rows=1 loops=1)
         Index Cond: (namehash = 564982275)
 Total runtime: 0.062 ms
(6 rows)

已显著提升,这是合乎情理的。但是,从理论上来说,'name' 字段上的索引应该更慢。


postgres=# create index hashy_ix on hashy using btree (name);
CREATE INDEX
postgres=# explain analyze verbose select * from hashy where name = 'picolo';
                                                    QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on hashy  (cost=4.45..51.77 rows=26 width=36) (actual time=0.101..0.102 rows=1 loops=1)
   Output: name, namehash
   Recheck Cond: (name = 'picolo'::text)
   ->  Bitmap Index Scan on hashy_ix  (cost=0.00..4.45 rows=26 width=0) (actual time=0.093..0.093 rows=1 loops=1)
         Index Cond: (name = 'picolo'::text)
 Total runtime: 0.143 ms
(6 rows)

下面,我们对包含名称的列创建 B 树类型索引。为了公平起见,还将对该列创建哈希类型索引

postgres=# drop index hashy_ix;
DROP INDEX
postgres=# create index hashy_ix on hashy using hash (name);
CREATE INDEX
postgres=# explain analyze verbose select * from hashy where name = 'picolo';
                                                    QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on hashy  (cost=4.45..51.77 rows=26 width=36) (actual time=0.022..0.023 rows=1 loops=1)
   Output: name, namehash
   Recheck Cond: (name = 'picolo'::text)
   ->  Bitmap Index Scan on hashy_ix  (cost=0.00..4.45 rows=26 width=0) (actual time=0.011..0.011 rows=1 loops=1)
         Index Cond: (name = 'picolo'::text)
 Total runtime: 0.059 ms
(6 rows)

我们看到,在此特定场景下,哈希索引优于 B 树(默认索引类型)。有一个合理的解释,即“name”列的所有出现都不会在“namehash”列中产生冲突。如果我们采用 CRC 32 位算法,可能就会产生一些冲突(或至少 1 个,也有可能一个都没有)。大量数据中的冲突会产生额外的处理(双哈希处理)。如需更深入了解冲突,请参阅“冲突”一节。

postgres=# explain analyze verbose select * from hashy where name = 'emanuel_1_2_3_4';
                                                    QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on hashy  (cost=4.45..51.77 rows=26 width=36) (actual time=0.027..0.029 rows=1 loops=1)
   Output: name, namehash
   Recheck Cond: (name = 'emanuel_1_2_3_4'::text)
   ->  Bitmap Index Scan on hashy_ix  (cost=0.00..4.45 rows=26 width=0) (actual time=0.013..0.013 rows=1 loops=1)
         Index Cond: (name = 'emanuel_1_2_3_4'::text)
 Total runtime: 0.067 ms
(6 rows)

postgres=# explain analyze verbose select * from hashy where namehash = hashtext('emanuel_1_2_3_4');
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on hashy  (cost=4.45..51.77 rows=26 width=36) (actual time=0.024..0.025 rows=1 loops=1)
   Output: name, namehash
   Recheck Cond: (namehash = (-1849844477))
   ->  Bitmap Index Scan on hashy_namehash  (cost=0.00..4.45 rows=26 width=0) (actual time=0.014..0.014 rows=1 loops=1)
         Index Cond: (namehash = (-1849844477))
 Total runtime: 0.062 ms
(6 rows)

合乎情理的是,由于记录的物理“位置”不同,结果也会有差异。在此场景下,“picolo”位于表末,而“emanuel_1_2_3_4”位于表中。

冲突

冲突是指,当两条或更多字符串通过哈希算法处理后,得到相同的输出结果时发生。就好像两条字符串有相同的“地址”一样。

为了明确说明冲突问题,我将展示一个测试

CREATE TABLE i(i text, hash int);
insert into i (select random()::text from generate_series(1,1000000));
update i set hash = hashtext(i);
select * from i a,i b where a.hash = b.hash and a.i <> b.i;

该查询为我们返回若干冲突……在这个特定情况下,约 133 个冲突(超过 1000001 表示 0.01 %),1 个约占 7518.80。

postgres=# select * from i a right join i b USING (hash) where a.i <> b.i limit 20;
    hash     |         i         |         i         
-------------+-------------------+-------------------
   688880719 | 0.861061272677034 | 0.776890672277659
  1837677207 | 0.436147989239544 | 0.160792944487184
   688880719 | 0.776890672277659 | 0.861061272677034 <----- Se repite debido a que estamos
  1837677207 | 0.160792944487184 | 0.436147989239544        juntando la misma tabla !! (OJO)
   -98745840 | 0.431216648779809 | 0.905636982992291
(...)

postgres=# select count(*) from (select distinct(hash) from i a left join i b USING (hash) where a.i <> b.i) k;
 count 
-------
   133
(1 row)

postgres=# select count(hash) from i group by hash having count(hash) > 2;
 count 
-------
(0 rows) <----- Esto nos indica que no hay colisiones de a 3 ocurrencias.

在这种情况下,我们必须解决冲突,因为如果执行查询,它会为我们带来两个记录

postgres=# select * from i where hash = hashtext('0.545927470549941');
         i         |   hash    
-------------------+-----------
 0.545927470549941 | 994753854
 0.571486297063529 | 994753854
(2 rows)

为此,我们必须区分使用哪两个记录。

postgres=# select * from i where hash = hashtext('0.545927470549941') and i = '0.545927470549941';
         i         |   hash    
-------------------+-----------
 0.545927470549941 | 994753854
(1 row)


稍作进一步的测试

到目前为止,我们只针对两列表执行该操作,但理想情况下应在“宽度”稍大的表上测试此操作。

postgres=# alter table hashy add c3 text;
ALTER TABLE
postgres=# alter table hashy add c4 text;
ALTER TABLE
postgres=# alter table hashy add c5 text;
ALTER TABLE
postgres=# update hashy set c3 = 'columna' || name where namehash > 100000;
UPDATE 2575
postgres=# update hashy set c4 = 'col ' || name where namehash < 100000;
UPDATE 2674

这只是为了至少让这两列中的某个包含数据。

我们甚至在索引中也发现了差异

postgres=# explain analyze verbose select * from hashy where namehash = hashtext('emanuel_1_2_3_4');
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on hashy  (cost=4.45..60.45 rows=26 width=132) (actual time=0.021..0.022 rows=1 loops=1)
   Output: name, namehash, c3, c4, c5
   Recheck Cond: (namehash = (-1849844477))
   ->  Bitmap Index Scan on hashy_hashname  (cost=0.00..4.45 rows=26 width=0) (actual time=0.011..0.011 rows=1 loops=1)
         Index Cond: (namehash = (-1849844477))
 Total runtime: 0.061 ms
(6 rows)

postgres=# explain analyze verbose select * from hashy where name = 'emanuel_1_2_3_4';
                                                    QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on hashy  (cost=4.45..60.46 rows=26 width=132) (actual time=0.027..0.029 rows=1 loops=1)
   Output: name, namehash, c3, c4, c5
   Recheck Cond: (name = 'emanuel_1_2_3_4'::text)
   ->  Bitmap Index Scan on hashy_ix  (cost=0.00..4.45 rows=26 width=0) (actual time=0.013..0.013 rows=1 loops=1)
         Index Cond: (name = 'emanuel_1_2_3_4'::text)
 Total runtime: 0.069 ms
(6 rows)


没有索引

postgres=# explain analyze verbose select * from hashy where name = 'emanuel_1_2_3_4';
                                             QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
 Seq Scan on hashy  (cost=0.00..178.54 rows=38 width=132) (actual time=2.587..2.892 rows=1 loops=1)
   Output: name, namehash, c3, c4, c5
   Filter: (name = 'emanuel_1_2_3_4'::text)
 Total runtime: 2.927 ms
(4 rows)

postgres=# explain analyze verbose select * from hashy where namehash = hashtext('emanuel_1_2_3_4');
                                             QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
 Seq Scan on hashy  (cost=0.00..178.54 rows=38 width=132) (actual time=1.504..1.747 rows=1 loops=1)
   Output: name, namehash, c3, c4, c5
   Filter: (namehash = (-1849844477))
 Total runtime: 1.779 ms
(4 rows)

这里我们会看到一个稍大的差异。那时便可证明该算法的优势。

在此测试中,我们所做的就是添加更多数据负载,以通过造成稍大压力来核实该算法的功能(尽管 PostgreSQL 明显没有受到很大的影响)。

关于 CRC32 算法的更多信息

我们创建了一个名为 crc.c 的文件,并将下面的代码保存到该文件中

/*
 * CYCLIC REDUNDANCY CHECK
 */

#include "postgres.h"
#include "fmgr.h"

#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif

PG_FUNCTION_INFO_V1(pg_crc_simple);


Datum 
pg_crc_simple(PG_FUNCTION_ARGS){

        char *message = PG_GETARG_TEXT_P(0);
        //size_t argumentlen = VARSIZE(message)-VARHDRSZ;

        int i, j;
        unsigned int byte, crc, mask;
        static unsigned int table[256];
        /* Seteamos la tabla de ser necesario. */
        if (table[1] == 0) {
                for (byte = 0; byte <= 255; byte++) {
                        crc = byte;
                        for (j = 7; j >= 0; j--) { // 8veces
                                mask = -(crc & 1);
                                crc = (crc >> 1) ^ (0xEDB88320 & mask);
                        }
                        table[byte] = crc;
                }
        }

        /* tenemos la tabla y calculamos*/
        i = 0;
        crc = 0xFFFFFFFF;
        while (message[i] != 0) {
                byte = message[i]; // Get next byte.
                crc = (crc >> 8) ^ table[(crc ^ byte) & 0xFF];
                i = i + 1;
        }
        
        PG_RETURN_INT32(~crc);
}


然后,我们将其编译为一个共享对象

gcc -I $DIRECTORIO_POSTGRES/include/server/ --shared crc.c

然后,我们必须在引擎中包含该函数

CREATE OR REPLACE FUNCTION pg_crc_simple(text) RETURNS integer
     AS '/usr/local/pgsql84/a.out', 'pg_crc_simple'
     LANGUAGE C STRICT;

这将使我们能够创建自己的哈希算法。

请注意,我在 /usr/local/pgsql84 文件夹中编译代码,默认情况下,“gcc”会将输出文件命名为 a.out

让我们看看这是如何工作的!

postgres=# select pg_crc_simple(random()::text);
 pg_crc_simple_2 
-----------------
      1107002784
(1 row)