Discussion:
table versioning approach (not auditing)
Abelard Hoffman
2014-09-29 02:00:32 UTC
Permalink
Hi. I need to maintain a record of all changes to certain tables to assist
in viewing history and reverting changes when necessary (customer service
makes an incorrect edit, etc.).

I have studied these two audit trigger examples:
https://wiki.postgresql.org/wiki/Audit_trigger
https://wiki.postgresql.org/wiki/Audit_trigger_91plus

I've also read about two other approaches to versioning:
1. maintain all versions in one table, with a flag to indicate which is the
current version
2. have a separate versions table for each real table, and insert into the
associated version table whenever an update or insert is done.

My current implementation is based on the wiki trigger examples, using a
single table, and a json column to record the row changes (rather than
hstore). What I like about that, in particular, is that I can get a "global,"
chronological view of all versioned changes very easily.
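
For reference, this is roughly the shape of what I have - a stripped-down
sketch rather than my actual code, with made-up table and column names:

-- Sketch: one global audit table plus a generic row-level trigger
CREATE TABLE audit_log (
    audit_log_id bigserial   PRIMARY KEY,
    changed_at   timestamptz NOT NULL DEFAULT now(),
    user_id      integer,              -- who made the change; how it is filled is app-specific
    table_name   text        NOT NULL,
    operation    text        NOT NULL, -- 'INSERT', 'UPDATE' or 'DELETE'
    row_data     json                  -- the affected row, serialized to json
);

CREATE TABLE users (user_id serial PRIMARY KEY, email text);  -- example table

CREATE OR REPLACE FUNCTION audit_row_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO audit_log (table_name, operation, row_data)
        VALUES (TG_TABLE_NAME, TG_OP, row_to_json(OLD));
        RETURN OLD;
    END IF;
    INSERT INTO audit_log (table_name, operation, row_data)
    VALUES (TG_TABLE_NAME, TG_OP, row_to_json(NEW));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_audit
AFTER INSERT OR UPDATE OR DELETE ON users
FOR EACH ROW EXECUTE PROCEDURE audit_row_change();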

But there are two types of queries I need to run.
1. Find all changes made by a specific user
2. Find all changes related to a specific record

#1 is simple to do. The versioning table has a user_id column recording who
made the change, so I can query on that.
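
For instance, with the example audit table sketched above:

-- Query type #1: all changes made by user 7, newest first
SELECT changed_at, table_name, operation, row_data
FROM audit_log
WHERE user_id = 7
ORDER BY changed_at DESC;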

#2 is more difficult. I may want to fetch all changes to a group of tables
that are all related by foreign keys (e.g., find all changes to "user"
record 849, along with any changes to their "articles," "photos," etc.).
All of the data is in the json column, of course, but it seems like a pain
to try and build a query on the json column that can fetch all those
relationships (and if I mess it up, I probably won't generate any errors,
since the json is so free-form).
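
For example, just to get the changes to user 849's photos I end up with
something like this (again a sketch with made-up names), where a typo in
the key name silently returns zero rows instead of erroring:

-- Query type #2, digging into the free-form json payload
SELECT changed_at, operation, row_data
FROM audit_log
WHERE table_name = 'photos'
  AND (row_data ->> 'user_id') = '849'
ORDER BY changed_at;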

So my question is, do you think using the json approach is wrong for this
case? Does it seem better to have separate versioning tables associated
with each real table? Or another approach?

Thanks
Gavin Flower
2014-09-29 03:00:00 UTC
Permalink
I implemented a 2 table approach over 15 years ago for an insurance
application. I used both an /effective_date/ and an /as_at_date/; no
triggers were involved. I think a 2 table approach gives you more
flexibility.

The /effective_date/ allowed changes to be made to the table in advance
of when they were to become effective.

The /as_at_date/ allowed quotes to be made, valid for a period starting
at the as_at_date.
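
Very roughly, the idea looks like this (a simplified sketch with invented
names; the real schema was more involved):

-- History table keyed by both dates
CREATE TABLE policy_history (
    policy_id      integer     NOT NULL,
    effective_date date        NOT NULL,  -- when this version starts to apply
    as_at_date     timestamptz NOT NULL,  -- when this version was recorded
    premium        numeric     NOT NULL,
    PRIMARY KEY (policy_id, effective_date, as_at_date)
);

-- What did policy 42 look like, as recorded by noon on 2014-09-27,
-- for cover effective on that day?
SELECT *
FROM policy_history
WHERE policy_id = 42
  AND effective_date <= DATE '2014-09-27'
  AND as_at_date <= TIMESTAMPTZ '2014-09-27 12:00'
ORDER BY effective_date DESC, as_at_date DESC
LIMIT 1;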

End users did not query the database directly; all queries were precoded
in a 4GL called Progress backed by an Oracle database. The same could
be done with a WildFly Java Enterprise AppServer (or some other piece of
middleware) and a PostgreSQL backend.

Different use case, but the concept is probably adaptable to your situation.

You may want a change table with a change_number that is carried in each
type of table affected by a change. This would help with query type #2.
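
Something along these lines (a sketch, with invented names):

-- Central change table; every versioned row carries its change_number
CREATE TABLE change (
    change_number bigserial   PRIMARY KEY,
    changed_by    integer     NOT NULL,
    changed_at    timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE articles_history (
    article_id    integer NOT NULL,
    user_id       integer NOT NULL,
    title         text,
    change_number bigint  NOT NULL REFERENCES change (change_number)
);

-- Query type #2: all changes, with the article versions they wrote, for user 849
SELECT c.change_number, c.changed_at, h.article_id, h.title
FROM change c
JOIN articles_history h ON h.change_number = c.change_number
WHERE h.user_id = 849
ORDER BY c.changed_at;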

I would be quite happy to contract to work out the appropriate schema
and develop some SQL scripts to query & update the database, if you were
interested. My approach would be to create a minimal database with
sample data to validate the schema design and SQL scripts.

Using a flag to indicate the current record seems inflexible: some changes
may not take effect until some time in the future, and you can't query the
database to see what the situation was at a particular point in the past.
For example, if somebody complains about something that happened last
Saturday near noon, how would you query the database to see what it was
like then?


Cheers,
Gavin
Felix Kunde
2014-09-29 07:26:53 UTC
Permalink
Hey
 
I've also tried to implement database versioning using JSON to log changes in tables. Here it is: https://github.com/fxku/audit
I've got two versioning tables: one storing information about all transactions that happened, and one where I put the JSON logs of the row changes of each table. I'm only logging old values, not complete rows.
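
In simplified form the two tables look something like this (a rough sketch, not the exact DDL from the repo):

-- One row per logged transaction
CREATE TABLE transaction_log (
    txid        bigint      PRIMARY KEY,  -- txid_current() of the transaction
    stmt_date   timestamptz NOT NULL,
    user_name   text        NOT NULL,
    client_name text
);

-- One row per table event, holding only the OLD values of changed columns
CREATE TABLE table_event_log (
    event_id   bigserial PRIMARY KEY,
    txid       bigint    NOT NULL REFERENCES transaction_log (txid),
    table_name text      NOT NULL,
    audit_id   bigint    NOT NULL,  -- surrogate key added to every audited table
    changes    json                 -- old values of the changed columns only
);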
 
Then I have a function that recreates a database state at a given time in a separate schema - either as VIEWs, MVIEWs or TABLEs. This database state can then be indexed in order to work with it. You can also reset the production state to the recreated past state.
 
Unfortunately I've got no time to work on it further at the moment, and I have not done tests with many changes in the database, so I can't say whether the recreation process scales well. One downside I've realised is that the json_agg function has limits when I've got binary data: it gets too long. So I'm really looking forward to using JSONB.

There are more plans in my mind. With a Transaction_Log table it should be possible to revert only certain transactions. I'm also thinking of parallel versioning, e.g. different users all working with their own version of the database and committing their changes to the production state. As I've got a unique history ID for each table and each row, I should be able to map the affected records.

Have a look and tell me what you think of it.

Cheers
Felix
 

Abelard Hoffman
2014-09-29 21:25:31 UTC
Permalink
Thank you Felix, Gavin, and Jonathan for your responses.

Felix & Jonathan: both of you mention just storing deltas. But if you do
that, how do you associate the delta record with the original row? Where's
the PK stored, if it wasn't part of the delta?

Felix, thank you very much for the example code. I took a look at your
table schemas. I need to study it more, but it looks like the way you're
handling the PK is by adding a separate synthetic key (audit_id) to
each table that's being versioned, and then storing that key along with the
delta.

So then to find all the versions of a given row, you just need to join the
audit row with the schema_name.table_name.audit_id column. Is that right?
The only potential drawback there is there's no referential integrity
between the audit_log.audit_id and the actual table.
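
That is, roughly something like this (using the simplified, made-up names
from the sketches above, not your actual schema):

-- All logged versions of one "articles" row, found via its audit_id
SELECT l.event_id, t.stmt_date, l.changes
FROM articles a
JOIN table_event_log l ON l.table_name = 'articles'
                      AND l.audit_id = a.audit_id
JOIN transaction_log t ON t.txid = l.txid
WHERE a.article_id = 123
ORDER BY t.stmt_date;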

I do like that approach very much though, in that it eliminates the need to
interrogate the json data in order to perform most queries.

AH



Felix Kunde
2014-09-30 11:22:43 UTC
Permalink
Hey
 
Yes, I'm adding an additional key to each of my tables. First I wanted to use the primary key as one column in my audit_log table, but in some of my tables the PK consists of more than one column. Plus it's nice to have one key with the same name across all tables.
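
For each table that gets versioned, that amounts to something like this (a sketch with made-up names):

-- Add the shared surrogate key to a table and backfill it from one global sequence
CREATE SEQUENCE audit_id_seq;
ALTER TABLE articles ADD COLUMN audit_id bigint;
ALTER TABLE articles ALTER COLUMN audit_id SET DEFAULT nextval('audit_id_seq');
UPDATE articles SET audit_id = nextval('audit_id_seq') WHERE audit_id IS NULL;
ALTER TABLE articles ALTER COLUMN audit_id SET NOT NULL;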
 
To get a former state of one row at date x, I need to join the latest delta BEFORE date x with each delta AFTER date x. If I logged complete rows, this joining part would not be necessary, but as I usually work with spatial databases that have complex geometries and also image files, that strategy consumes too much disk space.
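
For a single column, the reconstruction is roughly the following (a sketch using the made-up names from above, not the code in the repo):

-- Value of articles.title for the row with audit_id 42, as it was at 2014-09-01.
-- The log keeps OLD values, so the value at that date is the OLD value of the
-- first later event that touched the column; if none did, the current value holds.
SELECT COALESCE(
  (SELECT l.changes ->> 'title'
   FROM table_event_log l
   JOIN transaction_log t ON t.txid = l.txid
   WHERE l.table_name = 'articles'
     AND l.audit_id = 42
     AND t.stmt_date > TIMESTAMPTZ '2014-09-01'
     AND (l.changes ->> 'title') IS NOT NULL  -- caveat: treats a logged NULL as "not touched"
   ORDER BY t.stmt_date
   LIMIT 1),
  (SELECT a.title FROM articles a WHERE a.audit_id = 42)
) AS title_at_2014_09_01;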
 
If there are more users following a similar approach, I wonder why we don't throw all the good ideas together to have one solution that is tested, maintained and improved by more developers. That would be great.
 
Felix
 

Adam Brusselback
2014-09-30 19:16:09 UTC
Permalink
Felix, I'd love to see a single, well-maintained project. For example, I
just found yours, and gave it a shot today after seeing this post. I found
a bug: when an update command is issued but the old and new values are all
the same, the trigger will blow up. I've got a fix for that, but if we
had one project that more than a handful of people used, stuff like that
would be quashed very quickly.
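
A guard against no-op updates is enough to avoid it; roughly along these
lines (a sketch, not the literal patch):

-- Inside the row-level audit trigger function, before building the delta:
-- IF TG_OP = 'UPDATE' AND NEW IS NOT DISTINCT FROM OLD THEN
--     RETURN NEW;  -- nothing actually changed, so log nothing
-- END IF;

-- Alternatively, PostgreSQL's built-in trigger can drop no-op updates outright
-- (example table name; here it is one of the pgbench tables):
CREATE TRIGGER accounts_skip_noop
BEFORE UPDATE ON pgbench_accounts
FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger();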

I love the design of it by the way. Any idea what it will take to move to
JSONB for 9.4?


Felix Kunde
2014-10-01 09:19:01 UTC
Permalink
Hey there. Thank you very much for that fix! That's why I'd like to have joint development and joint testing. It's way more convincing for users to go for a solution that is tested by some experts than just by a random developer :)

I'm open to creating a new project and pushing the code there. I don't care about the name. Then we might figure out which parts are already good, which parts could be improved and where to go next. I think switching to JSONB, for example, will be easy, as it offers the same functions as JSON, afaik.
 

Adam Brusselback
2014-10-01 15:09:15 UTC
Permalink
I know we're kinda hijacking this thread, so sorry for that. If you'd like
to do that, I'd be more than happy to use it and push any fixes / changes
upstream. I don't have much of a preference on the name either, as long as
it's something that makes sense.

I would consider myself far from an expert though! Either way, more people
using a single solution is a good thing.

As a side note, I did some benchmarking this morning and wanted to share
the results:
pgbench -i -s 140 -U postgres pgbench

pgbench -c 4 -j 4 -T 600 -U postgres pgbench
no auditing tps: 2854
NOTE: Accounts are audited
auditing tps: 1278

pgbench -c 2 -j 2 -N -T 300 -U postgres pgbench
no auditing tps: 2504
NOTE: Accounts are audited
auditing tps: 822

pgbench -c 2 -j 2 -T 300 -U postgres pgbench
no auditing tps: 1836
NOTE: branches and tellers are audited, accounts are not
auditing tps: 505

I'd love to see if there are some easy wins to boost the performance.

Adam Brusselback
2014-10-02 14:27:44 UTC
Permalink
Testing that now. Initial results are not looking too performant.
I have one single table which had 234575 updates done to it, and I am rolling
back 13093 of them. It's been running for 20 min now, using 100% of a single
core and almost no disk. No idea how long it'll run at this point.

This is on an i5 desktop with 16 gigs of ram and an ssd.

This is a pretty good test though, as it's a real world use case (even if
the data was generated with PGBench). We now know that area needs some
work before it can be used for anything more than a toy database.

Thanks,
-Adam

On Thu, Oct 2, 2014 at 7:52 AM, Felix Kunde <felix-***@gmx.de> wrote:

> Hey there
>
> Thanks again for the fix. I was able to merge it into my repo.
> Also thanks for benchmarking audit. Very interesting results.
> I wonder how the recreation of former database states scales when
> processing many deltas.
> Haven’t done a lot of testing in that direction.
>
> I will transfer the code soon to a more public repo on GitHub. As far as I
> see I have to create an organization for that.
>
> Cheers
> Felix
Adam Brusselback
2014-10-02 14:38:20 UTC
Permalink
Ended up running for 28 min, but it did work as expected.

On Thu, Oct 2, 2014 at 10:27 AM, Adam Brusselback <***@gmail.com
> wrote:

> Testing that now. Initial results are not looking too performant.
> I have one single table which had 234575 updates done to it. I am rolling
> back 13093 of them. It's been running 20 min now, using 100% of a single
> core, and almost 0 disk. No idea how long it'll run at this point.
>
> This is on an i5 desktop with 16 gigs of ram and an ssd.
>
> This is a pretty good test though, as it's a real world use case (even if
> the data was generated with PGBench). We now know that area needs some
> work before it can be used for anything more than a toy database.
>
> Thanks,
> -Adam
>
> On Thu, Oct 2, 2014 at 7:52 AM, Felix Kunde <felix-***@gmx.de> wrote:
>
>> Hey there
>>
>> Thanks again for the fix. I was able to merge it into my repo.
>> Also thanks for benchmarking audit. Very interesting results.
>> I wonder how the recreation of former database states scales when
>> processing many deltas.
>> Haven’t done a lot of testing in that direction.
>>
>> I will transfer the code soon to a more public repo on GitHub. As far as
>> I see I have to create an organization for that.
>>
>> Cheers
>> Felix
>>
>> *Gesendet:* Mittwoch, 01. Oktober 2014 um 17:09 Uhr
>>
>> *Von:* "Adam Brusselback" <***@gmail.com>
>> *An:* "Felix Kunde" <felix-***@gmx.de>
>> *Cc:* "pgsql-***@postgresql.org" <pgsql-***@postgresql.org>
>> *Betreff:* Re: [GENERAL] table versioning approach (not auditing)
>> I know we're kinda hijacking this thread, so sorry for that. If you'd
>> like to do that, i'd be more than happy to use it and push any fixes /
>> changes upstream. I don't have much of a preference on the name either, as
>> long as it's something that makes sense.
>>
>> I would consider myself far from an expert though! Either way, more
>> people using a single solution is a good thing.
>>
>> As a side note, I did some benchmarking this morning and wanted to share
>> the results:
>> pgbench -i -s 140 -U postgres pgbench
>>
>> pgbench -c 4 -j 4 -T 600 -U postgres pgbench
>> no auditing tps: 2854
>> NOTE: Accounts are audited
>> auditing tps: 1278
>>
>> pgbench -c 2 -j 2 -N -T 300 -U postgres pgbench
>> no auditing tps: 2504
>> NOTE: Accounts are audited
>> auditing tps: 822
>>
>> pgbench -c 2 -j 2 -T 300 -U postgres pgbench
>> no auditing tps: 1836
>> NOTE: branches and tellers are audited, accounts are not
>> auditing tps: 505
>>
>> I'd love to see if there are some easy wins to boost the performance.
>>
>> On Wed, Oct 1, 2014 at 5:19 AM, Felix Kunde <felix-***@gmx.de> wrote:
>>>
>>> Hey there. Thank you very much for that fix! Thats why I'd like to have
>>> a joint development and joint testing. It's way more convincing for users
>>> to go for a solution that is tested by some experts than just by a random
>>> developer :)
>>>
>>> I'm open to create a new project and push the code there. Don't care
>>> about the name. Then we might figure out which parts are already good,
>>> which parts could be improved and where to go next. I think switching to
>>> JSONB for example will be easy, as it offers the same functions than JSON
>>> afaik.
>>>
>>>
>>> Sent: Tuesday, 30 September 2014 at 21:16
>>> From: "Adam Brusselback" <***@gmail.com>
>>> To: "Felix Kunde" <felix-***@gmx.de>
>>> Cc: "pgsql-***@postgresql.org" <pgsql-***@postgresql.org>
>>> Subject: Re: [GENERAL] table versioning approach (not auditing)
>>>
>>> Felix, I'd love to see a single, well maintained project. For example, I
>>> just found yours, and gave it a shot today after seeing this post. I found
>>> a bug: when an update command is issued but the old and new values are all
>>> the same, the trigger will blow up. I've got a fix for that, but if we
>>> had one project that more than a handful of people used, stuff like that
>>> would be quashed very quickly.
>>>
>>> I love the design of it by the way. Any idea what it will take to move
>>> to JSONB for 9.4?
>>>
>>>
>>> On Tue, Sep 30, 2014 at 7:22 AM, Felix Kunde <felix-***@gmx.de>
>>> wrote:
>>>
>>> Hey
>>>
>>> Yes, I'm adding an additional key to each of my tables. First I wanted to
>>> use the primary key as one column in my audit_log table, but in some of my
>>> tables the PK consists of more than one column. Plus it's nice to have one
>>> key that has the same name across all tables.
>>>
>>> To get a former state for one row at date x I need to join the latest
>>> delta BEFORE date x with each delta AFTER date x. If I logged complete
>>> rows, this joining part would not be necessary, but as I usually work with
>>> spatial databases that have complex geometries and also image files, that
>>> strategy consumes too much disk space.
>>>
>>> If there are more users following a similar approach, I wonder why we don't
>>> throw all the good ideas together, to have one solution that is tested,
>>> maintained and improved by more developers. That would be great.
>>>
>>> Felix
>>>
>>>
>>> Sent: Monday, 29 September 2014 at 23:25
>>> From: "Abelard Hoffman" <***@gmail.com>
>>> To: "Felix Kunde" <felix-***@gmx.de>
>>> Cc: "pgsql-***@postgresql.org" <pgsql-***@postgresql.org>
>>> Subject: Re: [GENERAL] table versioning approach (not auditing)
>>>
>>> Thank you Felix, Gavin, and Jonathan for your responses.
>>>
>>> Felix & Jonathan: both of you mention just storing deltas. But if you do
>>> that, how do you associate the delta record with the original row? Where's
>>> the PK stored, if it wasn't part of the delta?
>>>
>>> Felix, thank you very much for the example code. I took a look at your
>>> table schemas. I need to study it more, but it looks like the way you're
>>> handling the PK is you're adding a separate synthetic key (audit_id) to
>>> each table that's being versioned. And then storing that key along with the
>>> delta.
>>>
>>> So then to find all the versions of a given row, you just need to join
>>> the audit row with the schema_name.table_name.audit_id column. Is that
>>> right? The only potential drawback there is there's no referential
>>> integrity between the audit_log.audit_id and the actual table.
>>>
>>> I do like that approach very much though, in that it eliminates the need
>>> to interrogate the json data in order to perform most queries.
>>>
>>> AH
>>>
>>>
>>>
>>> On Mon, Sep 29, 2014 at 12:26 AM, Felix Kunde <felix-***@gmx.de> wrote:
>>>
>>> Hey
>>>
>>> I've also tried to implement database versioning using JSON to log
>>> changes in tables. Here it is:
>>> https://github.com/fxku/audit
>>> I've got two versioning tables, one storing information about all
>>> transactions that happened and one where I put the JSON logs of row changes
>>> of each table. I'm only logging old values and not complete rows.
>>>
>>> Then I got a function that recreates a database state at a given time
>>> into a separate schema - either to VIEWs, MVIEWs or TABLES. This database
>>> state could then be indexed in order to work with it. You can also reset
>>> the production state to the recreated past state.
>>>
>>> Unfortunately I've got no time to work on it further at the moment, and I
>>> have not done tests with many changes in the database, so I can't say if the
>>> recreation process scales well. One downside I've realised is that using the
>>> json_agg function has limits when I've got binary data. It gets too long.
>>> So I'm really looking forward to using JSONB.
>>>
>>> There are more plans in my mind. By having a Transaction_Log table it
>>> should be possible to revert only certain transactions. I'm also thinking
>>> of parallel versioning, e.g. different users are all working with their
>>> version of the database and commit their changes to the production state.
>>> As I've got a unique history ID for each table and each row, I should be
>>> able to map the affected records.
>>>
>>> Have a look and tell me what you think of it.
>>>
>>> Cheers
>>> Felix
>>>
>>>
>>> Sent: Monday, 29 September 2014 at 04:00
>>> From: "Abelard Hoffman" <***@gmail.com>
>>> To: "pgsql-***@postgresql.org" <pgsql-***@postgresql.org>
>>> Subject: [GENERAL] table versioning approach (not auditing)
>>>
>>> Hi. I need to maintain a record of all changes to certain tables so
>>> assist in viewing history and reverting changes when necessary (customer
>>> service makes an incorrect edit, etc.).
>>>
>>> I have studied these two audit trigger examples:
>>>
>>> https://wiki.postgresql.org/wiki/Audit_trigger
>>> https://wiki.postgresql.org/wiki/Audit_trigger_91plus
>>>
>>> I've also read about two other approaches to versioning:
>>> 1. maintain all versions in one table, with a flag to indicate which is
>>> the current version
>>> 2. have a separate versions table for each real table, and insert into
>>> the associated version table whenever an update or insert is done.
>>>
>>> My current implementation is based on the wiki trigger examples, using a
>>> single table, and a json column to record the row changes (rather than
>>> hstore). What I like about that, in particular, is I can have a "global,"
>>> chronological view of all versioned changes very easily.
>>>
>>> But there are two types of queries I need to run.
>>> 1. Find all changes made by a specific user
>>> 2. Find all changes related to a specific record
>>>
>>> #1 is simple to do. The versioning table has a user_id column of who
>>> made the change, so I can query on that.
>>>
>>> #2 is more difficult. I may want to fetch all changes to a group of
>>> tables that are all related by foreign keys (e.g., find all changes to
>>> "user" record 849, along with any changes to their "articles," "photos,"
>>> etc.). All of the data is in the json column, of course, but it seems like
>>> a pain to try and build a query on the json column that can fetch all those
>>> relationships (and if I mess it up, I probably won't generate any errors,
>>> since the json is so free-form).
>>>
>>> So my question is, do you think using the json approach is wrong for
>>> this case? Does it seem better to have separate versioning tables
>>> associated with each real table? Or another approach?
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
Jim Nasby
2014-10-06 21:47:26 UTC
Permalink
On 10/2/14, 9:27 AM, Adam Brusselback wrote:
> I've also tried to implement database versioning using JSON to log changes in tables. Here it is: https://github.com/fxku/audit
> I've got two versioning tables, one storing information about all transactions that happened and one where i put the JSON logs of row changes of each table. I'm only logging old values and not complete rows.
>
> Then I got a function that recreates a database state at a given time into a separate schema - either to VIEWs, MVIEWs or TABLES. This database state could then be indexed in order to work with it. You can also reset the production state to the recreated past state.
>
> Unfortunately I've got no time to work on it further at the moment, and I have not done tests with many changes in the database, so I can't say if the recreation process scales well. One downside I've realised is that using the json_agg function has limits when I've got binary data. It gets too long. So I'm really looking forward to using JSONB.
>
> There are more plans in my mind. By having a Transaction_Log table it should be possible to revert only certain transactions. I'm also thinking of parallel versioning, e.g. different users are all working with their version of the database and commit their changes to the production state. As I've got a unique history ID for each table and each row, I should be able to map the affected records.
Sorry I'm coming late to this thread. I agree that getting interested people together would be a good idea. Is there another mailing list we can do that with?

Versioning is also something I'm interested in, and have put a lot of thought into (if not much actual code :( ). I'll also make some general comments, if I may...


I think timestamps should be *heavily avoided* in versioning, because they are frequently the wrong way to solve a problem. There are many use cases where you're trying to answer "What values were in place when X happened?", and the simplest, most fool-proof way to answer that is that when you create a record for X, part of that record is a "history ID" that shows you the exact data used. For example, if you're creating an invoicing system that has versioning of customer addresses, you would not try to join an invoice with its address using a timestamp; you would put an actual address_history_id in the invoice table.
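
A minimal sketch of that pointer idea (the table and column names here are invented for illustration, not a prescribed schema):

CREATE TABLE address_history
(
  address_history_id bigserial PRIMARY KEY,
  customer_id        bigint NOT NULL,
  street             text NOT NULL,
  city               text NOT NULL,
  recorded_at        timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE invoice
(
  invoice_id         bigserial PRIMARY KEY,
  customer_id        bigint NOT NULL,
  -- explicit pointer to the exact address version this invoice was built from,
  -- instead of joining addresses by timestamp later
  address_history_id bigint NOT NULL REFERENCES address_history
);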

I thought I saw a reference to versioning sets of information. This is perhaps the trickiest part. You first have to think about the non-versioned sets (i.e., a customer may have many phone numbers) before you think about versioning the set. In this example, you want the history of the *set* of phone numbers, not of each individual number. Design it with full duplication of data first; don't think about normalizing until you have the full set versioning design.
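
With full duplication, a first cut at versioning the whole set could look something like this (again just a sketch, with made-up names):

CREATE TABLE customer_phone_set_history
(
  phone_set_id  bigserial PRIMARY KEY,
  customer_id   bigint NOT NULL,
  recorded_at   timestamptz NOT NULL DEFAULT now(),
  -- the complete set as of this version, duplicated in full;
  -- worry about normalizing only once the set-versioning design holds up
  phone_numbers text[] NOT NULL
);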

I understand the generic appeal of using something like JSON, but in reality I don't see it working terribly well. It's likely to be on the slow side, and it'll also be difficult to query from. Instead, I think it makes more sense to create actual history tables that derive their definition from the base table. I've got code that extracts information (column_name, data type, nullability) from a table (or even a table definition), and it's not that complex. With the work that's been done on capturing DDL changes it shouldn't be too hard to handle that automatically.
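
For example, a derived history table could be sketched like this (not the code referred to above; "customer" is a stand-in base table):

-- LIKE copies column names, data types and NOT NULL constraints from the base table
CREATE TABLE customer_history
(
  LIKE customer,
  history_id bigserial PRIMARY KEY,
  valid_from timestamptz NOT NULL DEFAULT now(),
  changed_by text NOT NULL DEFAULT current_user
);

-- the column metadata mentioned above is available from the catalog:
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'customer'
ORDER BY ordinal_position;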


Jim Nasby
2014-10-08 00:29:22 UTC
Permalink
On 10/6/14, 6:10 PM, Gavin Flower wrote:
> Even if timestamps are used extensively, you'd have to be careful joining on them. You may have information valid at T1 and changing at T3, but the transaction has T2, where T1 < T2 < T3 - the appropriate set of data would be associated with T1, and you would not get anywhere trying to find data with a timestamp of T2 (unless you were very lucky!).

Yeah, this is why I think timestamps need to be shunned in favor of explicit pointers. Anyone that thinks timestamps are good enough hasn't thought the problem through completely. :)

I also think there's potential value to storing full transaction information (presumably in a separate table): txid_current(), txid_current_snapshot(), now(), current_user, maybe some other stuff (client IP address?). That way you can tell exactly what created a history record. With appropriate shenanigans you can theoretically determine exactly what other history data would be visible at that time without using pointers (but man would that be ugly!)
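
Something along these lines, say (only a sketch of the built-ins involved, not an existing extension):

CREATE TABLE transaction_info
(
  txid        bigint PRIMARY KEY,
  snap        txid_snapshot NOT NULL,
  started_at  timestamptz NOT NULL,
  by_user     text NOT NULL,
  client_addr inet   -- NULL for local (unix-socket) connections
);

-- a history trigger could record this once per transaction:
INSERT INTO transaction_info (txid, snap, started_at, by_user, client_addr)
SELECT txid_current(), txid_current_snapshot(), now(), current_user, inet_client_addr()
WHERE NOT EXISTS (SELECT 1 FROM transaction_info WHERE txid = txid_current());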

> Actually things like phone numbers are tricky. Sometimes you may want to use the current phone number, and not the one extant at that time (as you want to phone the contact now), or you may still want the old phone number (was the call to a specific number at date/time legitimate & who do we charge the cost of the call to).

Yeah, I'm pretty convinced at this point that history/versioning should be built on top of a schema that always contains the current information, if for no other reason than so you always have a PK that points to what's current in addition to your history PKs.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


Gavin Flower
2014-10-08 03:40:59 UTC
Permalink
On 08/10/14 13:29, Jim Nasby wrote:
> On 10/6/14, 6:10 PM, Gavin Flower wrote:
>> Even if timestamps are used extensively, you'd have to be careful
>> joining on them. You may have information valid at T1 and changing at
>> T3, but the transaction has T2, where T1 < T2 < T3 - the appropriate
>> set of data would be associated with T1, and you would not get anywhere
>> trying to find data with a timestamp of T2 (unless you were very
>> lucky!).
>
> Yeah, this is why I think timestamps need to be shunned in favor of
> explicit pointers. Anyone that thinks timestamps are good enough
> hasn't thought the problem through completely. :)
>
> I also think there's potential value to storing full transaction
> information (presumably in a separate table): txid_current(),
> txid_current_snapshot(), now(), current_user, maybe some other stuff
> (client IP address?). That way you can tell exactly what created a
> history record. With appropriate shenanigans you can theoretically
> determine exactly what other history data would be visible at that
> time without using pointers (but man would that be ugly!)
>
>> Actually things like phone numbers are tricky. Sometimes you may
>> want to use the current phone number, and not the one extant at that
>> time (as you want to phone the contact now), or you may still want
>> the old phone number (was the call to a specific number at date/time
>> legitimate & who do we charge the cost of the call to).
>
> Yeah, I'm pretty convinced at this point that history/versioning
> should be built on top of a schema that always contains the current
> information, if for no other reason than so you always have a PK that
> points to what's current in addition to your history PKs.
One of the motivations for having an effective_date was being able to
put changes into the database ahead of time.

Finding the current value uses the same logic as finding the value at any
other date/time - so you don't need a special schema to distinguish the
current state from anything else. For example:

DROP TABLE IF EXISTS stock;

CREATE TABLE stock
(
id text,
effective_date timestamptz,
price numeric
);

INSERT INTO stock
(
id,
effective_date,
price
)
VALUES
('y88', '2014-10-01', 12.0),
('x42', '2014-10-01', 12.1),
('x42', '2014-10-08', 12.2),
('x42', '2014-10-10', 12.3),
('x42', '2014-10-16', 12.4),
('z42', '2014-10-19', 12.5),
('z49', '2014-10-01', 12.6),
('z49', '2014-10-30', 12.7),
('z77', '2014-10-01', 12.8);

CREATE UNIQUE INDEX primary_key ON stock (id ASC, effective_date DESC);

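-- latest price for 'x42' as of 2014-10-11: the newest effective_date on or
-- before that date wins (here the 2014-10-10 row, price 12.3)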
SELECT
s.price
FROM
stock s
WHERE
s.id = 'x42'
AND s.effective_date <= '2014-10-11'
ORDER BY
s.effective_date DESC
LIMIT 1;



Cheers,
Gavin
Jim Nasby
2014-10-10 20:44:18 UTC
Permalink
On 10/7/14, 10:40 PM, Gavin Flower wrote:
>> Yeah, I'm pretty convinced at this point that history/versioning should be built on top of a schema that always contains the current information, if for no other reason than so you always have a PK that points to what's current in addition to your history PKs.
> One of the motivations for having an effective_date, was being able to put changes into the database ahead of time.

Yeah, allowing for future data makes things more interesting. My first inclination is that it's a completely separate requirement, and you would track the history of all records that you had at a point in time. Doing that means you can see things like someone changing the effective date from Nov. 1 to Dec. 1. But clearly this is an area where you have to take the business case into account.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


Jonathan Vanasco
2014-09-29 15:08:55 UTC
Permalink
In the past, to accomplish the same thing I've done this:

- store the data in hstore/json. Instead of storing snapshots, I store deltas. I've been using a second table though, because it's improved performance on reads and writes.
- use a "transaction" log. Every write session gets logged into the transaction table (serial, timestamp, user_id). All updates to the recorded tables include the transaction's serial. Then there is a "transactions" table, that is just "transaction_serial, object_id, object_action".

Whenever I need auditing or versioning, I can just query the transaction table for the records I want... then use that to grab the data out of hstore.
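
In table form that might look roughly like this (a sketch; the names are guesses, not the actual schema):

CREATE TABLE transaction_log
(
  transaction_serial bigserial PRIMARY KEY,
  recorded_at        timestamptz NOT NULL DEFAULT now(),
  user_id            bigint NOT NULL
);

CREATE TABLE transactions
(
  transaction_serial bigint NOT NULL REFERENCES transaction_log,
  object_id          bigint NOT NULL,
  object_action      text NOT NULL CHECK (object_action IN ('insert', 'update', 'delete'))
);

-- e.g. "everything a given user touched", before pulling the deltas out of hstore
SELECT t.*
FROM transactions t
JOIN transaction_log l USING (transaction_serial)
WHERE l.user_id = 42;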



On Sep 28, 2014, at 10:00 PM, Abelard Hoffman wrote:

> Hi. I need to maintain a record of all changes to certain tables so assist in viewing history and reverting changes when necessary (customer service makes an incorrect edit, etc.).
>
> I have studied these two audit trigger examples:
> https://wiki.postgresql.org/wiki/Audit_trigger
> https://wiki.postgresql.org/wiki/Audit_trigger_91plus
>
> I've also read about two other approaches to versioning:
> 1. maintain all versions in one table, with a flag to indicate which is the current version
> 2. have a separate versions table for each real table, and insert into the associated version table whenever an update or insert is done.
>
> My current implementation is based on the wiki trigger examples, using a single table, and a json column to record the row changes (rather than hstore). What I like about that, in particular, is I can have a "global," chronological view of all versioned changes very easily.
>
> But there are two types of queries I need to run.
> 1. Find all changes made by a specific user
> 2. Find all changes related to a specific record
>
> #1 is simple to do. The versioning table has a user_id column of who made the change, so I can query on that.
>
> #2 is more difficult. I may want to fetch all changes to a group of tables that are all related by foreign keys (e.g., find all changes to "user" record 849, along with any changes to their "articles," "photos," etc.). All of the data is in the json column, of course, but it seems like a pain to try and build a query on the json column that can fetch all those relationships (and if I mess it up, I probably won't generate any errors, since the json is so free-form).
>
> So my question is, do you think using the json approach is wrong for this case? Does it seem better to have separate versioning tables associated with each real table? Or another approach?
>
> Thanks
Nick Guenther
2014-09-29 20:06:24 UTC
Permalink
On September 29, 2014 11:08:55 AM EDT, Jonathan Vanasco <***@2xlp.com> wrote:
>
>- use a "transaction" log. every write session gets logged into the
>transaction table (serial, timestamp, user_id). all updates to the
>recorded tables include the transaction's serial. then there is a
>"transactions" table, that is just "transaction_serial , object_id ,
>object_action".

A newbie tangent question: how do you access the transaction serial? Is it txid_current() as listed in http://www.postgresql.org/docs/9.3/static/functions-info.html?

And how do you actually make use of that information? I know from Bruce Momjian's excellent MVCC talk <http://momjian.us/main/writings/pgsql/mvcc.pdf> that postgres internally has a secret txid column on each row; can you somehow query on the secret column? And does your implementation worry about multiple timelines?

My use case is dynamically allocated replication. Broadly, my algorithm is that for each client
1) download a full copy of the current table
2) keep the connection open and send deltas (which are just inserts and deletes, for me)

I need 2 to begin *as if immediately* after 1. Txids sound like they are exactly what I need, but without knowing how to handle them, I fudged it by opening the queries for 1 and 2 immediately after each other, so that they should land on the same txid, before reading either. There's definitely a race condition that will show under load, though. I think the correct algorithm is (sketched below):

1) ask the current txid X
2) start buffering deltas with txid > X
3) download the table as of X
4) download the buffer of deltas and listen for future ones
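
One way to pin steps 1-3 to a single point in time, sketched with made-up names (watched_table for the replicated table, delta_buffer for the buffered changes), is a REPEATABLE READ snapshot plus txid visibility checks:

BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT txid_current_snapshot();   -- remember this value, e.g. '1000:1000:'
SELECT * FROM watched_table;      -- the full copy, taken as of that same snapshot
COMMIT;

-- afterwards, apply only the buffered deltas whose recorded txid was not yet
-- visible in the snapshot captured above
SELECT d.*
FROM delta_buffer d
WHERE NOT txid_visible_in_snapshot(d.txid, '1000:1000:'::txid_snapshot);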
Jonathan Vanasco
2014-09-29 21:56:31 UTC
Permalink
On Sep 29, 2014, at 4:06 PM, Nick Guenther wrote:

> A newbie tangent question: how do you access the transaction serial? Is it txid_current() as listed in http://www.postgresql.org/docs/9.3/static/functions-info.html?

My implementations were ridiculously simple/naive in design, and existed entirely with user-defined serials. I'd just create a new record + id on a write operation, and then use it when logging all operations.

I had read up on a lot of (possibly better) ways to handle this using pg internals. They all seemed more advanced than I needed.


> And does your implementation worry about multiple timelines?

Not sure I understand this... but every object is given a revision id. Edits between consecutive revisions are allowed; edits spanning multiple revisions are rejected.
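
A tiny sketch of that check, with illustrative names:

-- the UPDATE carries the revision the editor started from; if the row has
-- already moved past it, no row matches and the edit is rejected
UPDATE article
SET body = 'new text',
    revision = revision + 1
WHERE article_id = 42
  AND revision = 7;   -- the revision the client last saw
-- 0 rows updated => the edit would span more than one revision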


On Sep 29, 2014, at 5:25 PM, Abelard Hoffman wrote:

> Felix & Jonathan: both of you mention just storing deltas. But if you do that, how do you associate the delta record with the original row? Where's the PK stored, if it wasn't part of the delta?

The logic I decided on is this:

Revision 0
• Only the original record is stored
Revision 1
• Copy the original record into the revision store
Revision 1+
• Update the original record, store the deltas in the revision store

The reason I chose this path is that in my system:
• most records are not edited
• the records that are edited are heavily edited

We use an ORM and it was simple to implement this pattern with it, and then write some functions in postgres to ensure it is adhered to.
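
A hedged sketch of how such a guard could look as a trigger, assuming an hstore-based revision store and an "id" column on the versioned table (this is not the actual implementation described above):

CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE revision_store
(
  table_name text NOT NULL,
  row_id     bigint NOT NULL,
  rev        integer NOT NULL,
  data       hstore NOT NULL,   -- full row for revision 1, old-value deltas afterwards
  PRIMARY KEY (table_name, row_id, rev)
);

CREATE OR REPLACE FUNCTION record_revision() RETURNS trigger AS $$
DECLARE
  next_rev integer;
BEGIN
  SELECT coalesce(max(rev), 0) + 1 INTO next_rev
  FROM revision_store
  WHERE table_name = TG_TABLE_NAME AND row_id = OLD.id;

  IF next_rev = 1 THEN
    -- revision 1: copy the original record into the revision store
    INSERT INTO revision_store VALUES (TG_TABLE_NAME, OLD.id, 1, hstore(OLD.*));
  ELSE
    -- revision 1+: store only the columns whose values changed (old values)
    INSERT INTO revision_store
    VALUES (TG_TABLE_NAME, OLD.id, next_rev, hstore(OLD.*) - hstore(NEW.*));
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- attached per versioned table, e.g.:
-- CREATE TRIGGER articles_revisions BEFORE UPDATE ON articles
--   FOR EACH ROW EXECUTE PROCEDURE record_revision();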

When I need to pull data out:

• I can pull exact revisions out of the hstore for a given table/row using the revision ids as a key
• the revisions all contain the transaction id
• if I need to get more info about a given transaction, I can query the transactions table and get a list of all the objects that were edited within that transaction

If I wanted to ensure referential integrity, I could have used a table instead of an hstore (or json). If the application grows much larger, it will probably be migrated to a model like that. This approach just gave a lot of flexibility, minimized tables in the database, and was very easy to pull off. I went with hstore because json didn't allow in-place updates at the time (I think it does now).



