Discussion:
Understanding and implementing a GiST Index
Connor Wolf
2014-10-09 07:09:58 UTC
Permalink
Hi there!

I'm trying to implement a custom indexing system using the GiST index
framework, and I have a few questions.
Basically, I'm trying to implement a system that allows me to search
across a set of 64 bit integers by hamming distance. This is intended to
be used in searching for similar images, where the 64 bit integer is
actually a phash
<http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html> of
an image, and similarity between two images is reflected in the hamming
distance between two integers.

Anyways, The appropriate approach here is to use something called a BK
tree
<http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees>,
for which I've written some test code
<https://github.com/fake-name/MangaCMS/blob/master/deduplicator/hamDb.py#L27>
and I think I have a decent grip of (my test code seems to work, in any
event).

That said:

* Is there a /simple/ piece of example-code for implementing a GiST
index for a single index? I've been looking through the /contrib/
directory, particularly the /contrib/btree_gist/ files, but it looks
like 90% of the complexity there is related to supporting all the
column types Postgres has, rather then actually tied to producing a
functional index.
* Once I have something compiling, how can I check to be sure that I'm
actually using the indexing module I created? I can enable my
compiled extension using `CREATE EXTENSION`, but is there an option
for `CREATE INDEX test_index ON testing USING gist (val);` that lets
me specify *which* GiST index is actually employed? Is this even a
valid question?
This is probably something that's available in one of the system
tables somewhere, but I haven't had much luck with google finding
out where.
* Testing: What's the appropriate way to examine the generated tree
structure of the index? I certainly went through a few bugs with my
test tree system (and that was in python!). Is there any way to
examine the index structure for debugging purposes?
* Also, it looks like `ereport()` is the proper way to emit debugging
information. Is this correct?
* In that vein, is there any way to have information that is on the
operations of an entire query? Looking at the number of tree nodes
touched for a scan would be nice (and I would not be surprised if
there is already a facility for it).

Project code is here <https://github.com/fake-name/pg-spgist_hamming> if
anyone is interested, any help would be great. I have very little idea
what I'm doing.

Thanks,
Connor
Sergey Konoplev
2014-10-09 15:19:36 UTC
Permalink
On Thu, Oct 9, 2014 at 12:09 AM, Connor Wolf
Post by Connor Wolf
I'm trying to implement a custom indexing system using the GiST index
framework, and I have a few questions.
Basically, I'm trying to implement a system that allows me to search across
a set of 64 bit integers by hamming distance. This is intended to be used in
searching for similar images, where the 64 bit integer is actually a phash
of an image, and similarity between two images is reflected in the hamming
distance between two integers.
Have you seen the smlar extension?

http://www.pgcon.org/2012/schedule/events/443.en.html
http://sigaev.ru/git/gitweb.cgi?p=smlar.git;a=blob;hb=HEAD;f=README
http://railsware.com/blog/2012/05/10/effective-similarity-search-in-postgresql/
Post by Connor Wolf
Anyways, The appropriate approach here is to use something called a BK tree,
for which I've written some test code and I think I have a decent grip of
(my test code seems to work, in any event).
Is there a simple piece of example-code for implementing a GiST index for a
single index? I've been looking through the /contrib/ directory,
particularly the /contrib/btree_gist/ files, but it looks like 90% of the
complexity there is related to supporting all the column types Postgres has,
rather then actually tied to producing a functional index.
Once I have something compiling, how can I check to be sure that I'm
actually using the indexing module I created? I can enable my compiled
extension using `CREATE EXTENSION`, but is there an option for `CREATE INDEX
test_index ON testing USING gist (val);` that lets me specify *which* GiST
index is actually employed? Is this even a valid question?
This is probably something that's available in one of the system tables
somewhere, but I haven't had much luck with google finding out where.
Testing: What's the appropriate way to examine the generated tree structure
of the index? I certainly went through a few bugs with my test tree system
(and that was in python!). Is there any way to examine the index structure
for debugging purposes?
Also, it looks like `ereport()` is the proper way to emit debugging
information. Is this correct?
In that vein, is there any way to have information that is on the operations
of an entire query? Looking at the number of tree nodes touched for a scan
would be nice (and I would not be surprised if there is already a facility
for it).
Project code is here if anyone is interested, any help would be great. I
have very little idea what I'm doing.
Thanks,
Connor
--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (499) 346-7196, +7 (988) 888-1979
***@gmail.com
--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Connor Wolf
2014-10-10 03:09:07 UTC
Permalink
I had skimmed the presentation slides, but I hadn't looked that closely
because it appeared to be using cosine similarity metrics, and only
operated on matrices, neither of which are what I wanted.

On closer examination, I think I could explode my packed hash values to
a matrix. I'm not sure how the cosine similarity metric would work given
that the arrays would only contain the values either 0 or 1 (as my hash
value is fundamentally a integer of configurable length (you can alter
the hash size, I'm using 64 bits)).

I'll check out the code and poke it a bit.

I'll probably also move this discussion to the hackers mailing list (the
instructions say to ask here first, but Oleg suggested I go there, and I
generally agree).

Connor
Post by Sergey Konoplev
On Thu, Oct 9, 2014 at 12:09 AM, Connor Wolf
Post by Connor Wolf
I'm trying to implement a custom indexing system using the GiST index
framework, and I have a few questions.
Basically, I'm trying to implement a system that allows me to search across
a set of 64 bit integers by hamming distance. This is intended to be used in
searching for similar images, where the 64 bit integer is actually a phash
of an image, and similarity between two images is reflected in the hamming
distance between two integers.
Have you seen the smlar extension?
http://www.pgcon.org/2012/schedule/events/443.en.html
http://sigaev.ru/git/gitweb.cgi?p=smlar.git;a=blob;hb=HEAD;f=README
http://railsware.com/blog/2012/05/10/effective-similarity-search-in-postgresql/
Post by Connor Wolf
Anyways, The appropriate approach here is to use something called a BK tree,
for which I've written some test code and I think I have a decent grip of
(my test code seems to work, in any event).
Is there a simple piece of example-code for implementing a GiST index for a
single index? I've been looking through the /contrib/ directory,
particularly the /contrib/btree_gist/ files, but it looks like 90% of the
complexity there is related to supporting all the column types Postgres has,
rather then actually tied to producing a functional index.
Once I have something compiling, how can I check to be sure that I'm
actually using the indexing module I created? I can enable my compiled
extension using `CREATE EXTENSION`, but is there an option for `CREATE INDEX
test_index ON testing USING gist (val);` that lets me specify *which* GiST
index is actually employed? Is this even a valid question?
This is probably something that's available in one of the system tables
somewhere, but I haven't had much luck with google finding out where.
Testing: What's the appropriate way to examine the generated tree structure
of the index? I certainly went through a few bugs with my test tree system
(and that was in python!). Is there any way to examine the index structure
for debugging purposes?
Also, it looks like `ereport()` is the proper way to emit debugging
information. Is this correct?
In that vein, is there any way to have information that is on the operations
of an entire query? Looking at the number of tree nodes touched for a scan
would be nice (and I would not be surprised if there is already a facility
for it).
Project code is here if anyone is interested, any help would be great. I
have very little idea what I'm doing.
Thanks,
Connor
--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Loading...