Connor Wolf
2014-10-09 07:09:58 UTC
Hi there!
I'm trying to implement a custom indexing system using the GiST index
framework, and I have a few questions.
Basically, I'm trying to implement a system that allows me to search
across a set of 64 bit integers by hamming distance. This is intended to
be used in searching for similar images, where the 64 bit integer is
actually a phash
<http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html> of
an image, and similarity between two images is reflected in the hamming
distance between two integers.
Anyways, The appropriate approach here is to use something called a BK
tree
<http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees>,
for which I've written some test code
<https://github.com/fake-name/MangaCMS/blob/master/deduplicator/hamDb.py#L27>
and I think I have a decent grip of (my test code seems to work, in any
event).
That said:
* Is there a /simple/ piece of example-code for implementing a GiST
index for a single index? I've been looking through the /contrib/
directory, particularly the /contrib/btree_gist/ files, but it looks
like 90% of the complexity there is related to supporting all the
column types Postgres has, rather then actually tied to producing a
functional index.
* Once I have something compiling, how can I check to be sure that I'm
actually using the indexing module I created? I can enable my
compiled extension using `CREATE EXTENSION`, but is there an option
for `CREATE INDEX test_index ON testing USING gist (val);` that lets
me specify *which* GiST index is actually employed? Is this even a
valid question?
This is probably something that's available in one of the system
tables somewhere, but I haven't had much luck with google finding
out where.
* Testing: What's the appropriate way to examine the generated tree
structure of the index? I certainly went through a few bugs with my
test tree system (and that was in python!). Is there any way to
examine the index structure for debugging purposes?
* Also, it looks like `ereport()` is the proper way to emit debugging
information. Is this correct?
* In that vein, is there any way to have information that is on the
operations of an entire query? Looking at the number of tree nodes
touched for a scan would be nice (and I would not be surprised if
there is already a facility for it).
Project code is here <https://github.com/fake-name/pg-spgist_hamming> if
anyone is interested, any help would be great. I have very little idea
what I'm doing.
Thanks,
Connor
I'm trying to implement a custom indexing system using the GiST index
framework, and I have a few questions.
Basically, I'm trying to implement a system that allows me to search
across a set of 64 bit integers by hamming distance. This is intended to
be used in searching for similar images, where the 64 bit integer is
actually a phash
<http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html> of
an image, and similarity between two images is reflected in the hamming
distance between two integers.
Anyways, The appropriate approach here is to use something called a BK
tree
<http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees>,
for which I've written some test code
<https://github.com/fake-name/MangaCMS/blob/master/deduplicator/hamDb.py#L27>
and I think I have a decent grip of (my test code seems to work, in any
event).
That said:
* Is there a /simple/ piece of example-code for implementing a GiST
index for a single index? I've been looking through the /contrib/
directory, particularly the /contrib/btree_gist/ files, but it looks
like 90% of the complexity there is related to supporting all the
column types Postgres has, rather then actually tied to producing a
functional index.
* Once I have something compiling, how can I check to be sure that I'm
actually using the indexing module I created? I can enable my
compiled extension using `CREATE EXTENSION`, but is there an option
for `CREATE INDEX test_index ON testing USING gist (val);` that lets
me specify *which* GiST index is actually employed? Is this even a
valid question?
This is probably something that's available in one of the system
tables somewhere, but I haven't had much luck with google finding
out where.
* Testing: What's the appropriate way to examine the generated tree
structure of the index? I certainly went through a few bugs with my
test tree system (and that was in python!). Is there any way to
examine the index structure for debugging purposes?
* Also, it looks like `ereport()` is the proper way to emit debugging
information. Is this correct?
* In that vein, is there any way to have information that is on the
operations of an entire query? Looking at the number of tree nodes
touched for a scan would be nice (and I would not be surprised if
there is already a facility for it).
Project code is here <https://github.com/fake-name/pg-spgist_hamming> if
anyone is interested, any help would be great. I have very little idea
what I'm doing.
Thanks,
Connor