手动模拟哈希索引
作者:Emanuel Calvo Franco
简介和环境设置
本文不提供新方法,而是一种了解哈希类型索引的工作原理以及如何轻松“模拟”其功能的指南。
因此,我们将使用文本文件中的一些名称,然后通过“copy”工具将其导入。
以下测试在 Postgres 8.4 Beta 1 上进行。
CREATE TABLE hashy (name text, namehash int);
CREATE RULE hashy_rule AS ON insert TO hashy DO update hashy set namehash = hashbpchar(hashy.name) where namehash is NULL;
-- o utilizar hashtext...
CREATE OR REPLACE RULE hashy_rule AS ON insert TO hashy DO update hashy set namehash = hashtext(hashy.name) where namehash is NULL;
COPY hashy(name) from '/usr/local/pgsql84/listado.txt';
insert into hashy(name) (select name || '_1' from hashy);
insert into hashy(name) (select name || '_2' from hashy);
...
insert into hashy(name) (select name || '_5' from hashy);
insert into hashy(name) (select '0' || name from hashy);
主要思路是生成一个包含名称哈希的字段。由于整数字段比文本类型字段更快,因此从理论上来说,使用经过“哈希处理”的字段将显著提高搜索速度。
示例
postgres=# explain analyze verbose select * from hashy where namehash = hashbpchar('picolo'); QUERY PLAN --------------------------------------------------------------------------------------------------- Seq Scan on hashy (cost=0.00..123.78 rows=27 width=36) (actual time=1.522..1.523 rows=1 loops=1) Output: name, namehash Filter: (namehash = 564982275) Total runtime: 1.554 ms (4 rows) postgres=# explain analyze verbose select * from hashy where namehash = hashtext('picolo'); QUERY PLAN --------------------------------------------------------------------------------------------------- Seq Scan on hashy (cost=0.00..123.78 rows=27 width=36) (actual time=1.549..1.551 rows=1 loops=1) Output: name, namehash Filter: (namehash = 564982275) Total runtime: 1.578 ms (4 rows) postgres=# explain analyze verbose select * from hashy where name = 'picolo'; QUERY PLAN --------------------------------------------------------------------------------------------------- Seq Scan on hashy (cost=0.00..123.78 rows=27 width=36) (actual time=1.571..1.573 rows=1 loops=1) Output: name, namehash Filter: (name = 'picolo'::text) Total runtime: 1.605 ms (4 rows)
我们可以看到,hashtext 和 hashbpchar 在性能方面存在细微差异。至于字段“name”上的顺序搜索,其结果与字段哈希的结果非常相似。从理论上来说,大量数据可能会产生更大差异。
不只是简单模拟
现在,我们使用实际索引来测试其功能。首先,我们在 namehash 列上创建索引
postgres=# create index hashy_namehash on hashy USING hash (namehash); CREATE INDEX postgres=# explain analyze verbose select * from hashy where namehash = hashtext('picolo'); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on hashy (cost=4.45..51.77 rows=26 width=36) (actual time=0.024..0.025 rows=1 loops=1) Output: name, namehash Recheck Cond: (namehash = 564982275) -> Bitmap Index Scan on hashy_namehash (cost=0.00..4.45 rows=26 width=0) (actual time=0.013..0.013 rows=1 loops=1) Index Cond: (namehash = 564982275) Total runtime: 0.062 ms (6 rows)
已显著提升,这是合乎情理的。但是,从理论上来说,'name' 字段上的索引应该更慢。
postgres=# create index hashy_ix on hashy using btree (name); CREATE INDEX postgres=# explain analyze verbose select * from hashy where name = 'picolo'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on hashy (cost=4.45..51.77 rows=26 width=36) (actual time=0.101..0.102 rows=1 loops=1) Output: name, namehash Recheck Cond: (name = 'picolo'::text) -> Bitmap Index Scan on hashy_ix (cost=0.00..4.45 rows=26 width=0) (actual time=0.093..0.093 rows=1 loops=1) Index Cond: (name = 'picolo'::text) Total runtime: 0.143 ms (6 rows)
下面,我们对包含名称的列创建 B 树类型索引。为了公平起见,还将对该列创建哈希类型索引
postgres=# drop index hashy_ix; DROP INDEX postgres=# create index hashy_ix on hashy using hash (name); CREATE INDEX postgres=# explain analyze verbose select * from hashy where name = 'picolo'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on hashy (cost=4.45..51.77 rows=26 width=36) (actual time=0.022..0.023 rows=1 loops=1) Output: name, namehash Recheck Cond: (name = 'picolo'::text) -> Bitmap Index Scan on hashy_ix (cost=0.00..4.45 rows=26 width=0) (actual time=0.011..0.011 rows=1 loops=1) Index Cond: (name = 'picolo'::text) Total runtime: 0.059 ms (6 rows)
我们看到,在此特定场景下,哈希索引优于 B 树(默认索引类型)。有一个合理的解释,即“name”列的所有出现都不会在“namehash”列中产生冲突。如果我们采用 CRC 32 位算法,可能就会产生一些冲突(或至少 1 个,也有可能一个都没有)。大量数据中的冲突会产生额外的处理(双哈希处理)。如需更深入了解冲突,请参阅“冲突”一节。
postgres=# explain analyze verbose select * from hashy where name = 'emanuel_1_2_3_4'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on hashy (cost=4.45..51.77 rows=26 width=36) (actual time=0.027..0.029 rows=1 loops=1) Output: name, namehash Recheck Cond: (name = 'emanuel_1_2_3_4'::text) -> Bitmap Index Scan on hashy_ix (cost=0.00..4.45 rows=26 width=0) (actual time=0.013..0.013 rows=1 loops=1) Index Cond: (name = 'emanuel_1_2_3_4'::text) Total runtime: 0.067 ms (6 rows) postgres=# explain analyze verbose select * from hashy where namehash = hashtext('emanuel_1_2_3_4'); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on hashy (cost=4.45..51.77 rows=26 width=36) (actual time=0.024..0.025 rows=1 loops=1) Output: name, namehash Recheck Cond: (namehash = (-1849844477)) -> Bitmap Index Scan on hashy_namehash (cost=0.00..4.45 rows=26 width=0) (actual time=0.014..0.014 rows=1 loops=1) Index Cond: (namehash = (-1849844477)) Total runtime: 0.062 ms (6 rows)
合乎情理的是,由于记录的物理“位置”不同,结果也会有差异。在此场景下,“picolo”位于表末,而“emanuel_1_2_3_4”位于表中。
冲突
冲突是指,当两条或更多字符串通过哈希算法处理后,得到相同的输出结果时发生。就好像两条字符串有相同的“地址”一样。
为了明确说明冲突问题,我将展示一个测试
CREATE TABLE i(i text, hash int); insert into i (select random()::text from generate_series(1,1000000)); update i set hash = hashtext(i); select * from i a,i b where a.hash = b.hash and a.i <> b.i;
该查询为我们返回若干冲突……在这个特定情况下,约 133 个冲突(超过 1000001 表示 0.01 %),1 个约占 7518.80。
postgres=# select * from i a right join i b USING (hash) where a.i <> b.i limit 20; hash | i | i -------------+-------------------+------------------- 688880719 | 0.861061272677034 | 0.776890672277659 1837677207 | 0.436147989239544 | 0.160792944487184 688880719 | 0.776890672277659 | 0.861061272677034 <----- Se repite debido a que estamos 1837677207 | 0.160792944487184 | 0.436147989239544 juntando la misma tabla !! (OJO) -98745840 | 0.431216648779809 | 0.905636982992291 (...) postgres=# select count(*) from (select distinct(hash) from i a left join i b USING (hash) where a.i <> b.i) k; count ------- 133 (1 row) postgres=# select count(hash) from i group by hash having count(hash) > 2; count ------- (0 rows) <----- Esto nos indica que no hay colisiones de a 3 ocurrencias.
在这种情况下,我们必须解决冲突,因为如果执行查询,它会为我们带来两个记录
postgres=# select * from i where hash = hashtext('0.545927470549941'); i | hash -------------------+----------- 0.545927470549941 | 994753854 0.571486297063529 | 994753854 (2 rows)
为此,我们必须区分使用哪两个记录。
postgres=# select * from i where hash = hashtext('0.545927470549941') and i = '0.545927470549941'; i | hash -------------------+----------- 0.545927470549941 | 994753854 (1 row)
稍作进一步的测试
到目前为止,我们只针对两列表执行该操作,但理想情况下应在“宽度”稍大的表上测试此操作。
postgres=# alter table hashy add c3 text; ALTER TABLE postgres=# alter table hashy add c4 text; ALTER TABLE postgres=# alter table hashy add c5 text; ALTER TABLE postgres=# update hashy set c3 = 'columna' || name where namehash > 100000; UPDATE 2575 postgres=# update hashy set c4 = 'col ' || name where namehash < 100000; UPDATE 2674
这只是为了至少让这两列中的某个包含数据。
我们甚至在索引中也发现了差异
postgres=# explain analyze verbose select * from hashy where namehash = hashtext('emanuel_1_2_3_4'); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on hashy (cost=4.45..60.45 rows=26 width=132) (actual time=0.021..0.022 rows=1 loops=1) Output: name, namehash, c3, c4, c5 Recheck Cond: (namehash = (-1849844477)) -> Bitmap Index Scan on hashy_hashname (cost=0.00..4.45 rows=26 width=0) (actual time=0.011..0.011 rows=1 loops=1) Index Cond: (namehash = (-1849844477)) Total runtime: 0.061 ms (6 rows) postgres=# explain analyze verbose select * from hashy where name = 'emanuel_1_2_3_4'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on hashy (cost=4.45..60.46 rows=26 width=132) (actual time=0.027..0.029 rows=1 loops=1) Output: name, namehash, c3, c4, c5 Recheck Cond: (name = 'emanuel_1_2_3_4'::text) -> Bitmap Index Scan on hashy_ix (cost=0.00..4.45 rows=26 width=0) (actual time=0.013..0.013 rows=1 loops=1) Index Cond: (name = 'emanuel_1_2_3_4'::text) Total runtime: 0.069 ms (6 rows)
没有索引
postgres=# explain analyze verbose select * from hashy where name = 'emanuel_1_2_3_4'; QUERY PLAN ---------------------------------------------------------------------------------------------------- Seq Scan on hashy (cost=0.00..178.54 rows=38 width=132) (actual time=2.587..2.892 rows=1 loops=1) Output: name, namehash, c3, c4, c5 Filter: (name = 'emanuel_1_2_3_4'::text) Total runtime: 2.927 ms (4 rows) postgres=# explain analyze verbose select * from hashy where namehash = hashtext('emanuel_1_2_3_4'); QUERY PLAN ---------------------------------------------------------------------------------------------------- Seq Scan on hashy (cost=0.00..178.54 rows=38 width=132) (actual time=1.504..1.747 rows=1 loops=1) Output: name, namehash, c3, c4, c5 Filter: (namehash = (-1849844477)) Total runtime: 1.779 ms (4 rows)
这里我们会看到一个稍大的差异。那时便可证明该算法的优势。
在此测试中,我们所做的就是添加更多数据负载,以通过造成稍大压力来核实该算法的功能(尽管 PostgreSQL 明显没有受到很大的影响)。
关于 CRC32 算法的更多信息
我们创建了一个名为 crc.c 的文件,并将下面的代码保存到该文件中
/*
* CYCLIC REDUNDANCY CHECK
*/
#include "postgres.h"
#include "fmgr.h"
#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif
PG_FUNCTION_INFO_V1(pg_crc_simple);
Datum
pg_crc_simple(PG_FUNCTION_ARGS){
char *message = PG_GETARG_TEXT_P(0);
//size_t argumentlen = VARSIZE(message)-VARHDRSZ;
int i, j;
unsigned int byte, crc, mask;
static unsigned int table[256];
/* Seteamos la tabla de ser necesario. */
if (table[1] == 0) {
for (byte = 0; byte <= 255; byte++) {
crc = byte;
for (j = 7; j >= 0; j--) { // 8veces
mask = -(crc & 1);
crc = (crc >> 1) ^ (0xEDB88320 & mask);
}
table[byte] = crc;
}
}
/* tenemos la tabla y calculamos*/
i = 0;
crc = 0xFFFFFFFF;
while (message[i] != 0) {
byte = message[i]; // Get next byte.
crc = (crc >> 8) ^ table[(crc ^ byte) & 0xFF];
i = i + 1;
}
PG_RETURN_INT32(~crc);
}
然后,我们将其编译为一个共享对象
gcc -I $DIRECTORIO_POSTGRES/include/server/ --shared crc.c
然后,我们必须在引擎中包含该函数
CREATE OR REPLACE FUNCTION pg_crc_simple(text) RETURNS integer
AS '/usr/local/pgsql84/a.out', 'pg_crc_simple'
LANGUAGE C STRICT;
这将使我们能够创建自己的哈希算法。
请注意,我在 /usr/local/pgsql84 文件夹中编译代码,默认情况下,“gcc”会将输出文件命名为 a.out。
让我们看看这是如何工作的!
postgres=# select pg_crc_simple(random()::text); pg_crc_simple_2 ----------------- 1107002784 (1 row)