Expected behaviour of \d in regexp with exponent numbers ?

Discussion:

Arnaud Lesauvage

2014-09-01 13:14:24 UTC

Hi all,

I just came accross this trying to upgrade my server from 8.4.8 to 9.3.4 :

SELECT substring('²' FROM E'\\d');

8.4 : NULL
9.3 : "²"

Am I correct to expect NULL in this case ?
Thanks !

--
Arnaud

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Albe Laurenz

2014-09-01 13:42:39 UTC

Permalink

Post by Arnaud Lesauvage
SELECT substring('²' FROM E'\\d');
8.4 : NULL
9.3 : "²"
Am I correct to expect NULL in this case ?

I get a different result on Linux:

test=> SHOW server_encoding;
server_encoding
-----------------
UTF8
(1 row)

test=> SHOW client_encoding;
client_encoding
-----------------
UTF8
(1 row)

test=> SHOW lc_collate;
lc_collate
------------
de_DE.UTF8
(1 row)

test=> SHOW server_version;
server_version
----------------
9.3.5
(1 row)

test=> SELECT substring('²' FROM E'\\d') IS NULL;
?column?
----------
t
(1 row)

Maybe it is a collation problem.
Which operating system and collation are you using on each system?

Maybe there is an encoding problem involved.
What is your server and client encoding on each system?
What do you get for "SELECT '²'::bytea" on each system?

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsq

Arnaud Lesauvage

2014-09-01 14:05:43 UTC

Permalink

Post by Albe Laurenz

Post by Arnaud Lesauvage
SELECT substring('²' FROM E'\\d');
8.4 : NULL
9.3 : "²"
Am I correct to expect NULL in this case ?

OK, first of all the servers are running on Windows 2003 R2. They are on
the same computer, the only difference is they use a different port.

Post by Albe Laurenz
test=> SHOW server_encoding;

UTF8 on both servers

Post by Albe Laurenz
test=> SHOW client_encoding;

UNICODE on both servers

Post by Albe Laurenz
test=> SHOW lc_collate;

I have a small difference here.
On 8.4 I have "French, Belgium"
On 9.3 I have "French_France.1252"
Quite confusing... I left the installer chose the collation for me,
French_France sounded close enough !

Post by Albe Laurenz
Maybe it is a collation problem.
Which operating system and collation are you using on each system?
What do you get for "SELECT '²'::bytea" on each system?

"\302\262" on each system.

--
Arnaud

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Tom Lane

2014-09-01 15:39:48 UTC

Permalink

Post by Arnaud Lesauvage
SELECT substring('²' FROM E'\\d');
8.4 : NULL
9.3 : "²"
Am I correct to expect NULL in this case ?

Not necessarily. \d will match any character that iswdigit() returns true
for. It looks like your new server is using a locale that considers "²"
to be a digit.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Arnaud Lesauvage

2014-09-01 15:49:05 UTC

Permalink

Post by Tom Lane

Post by Arnaud Lesauvage
SELECT substring('²' FROM E'\\d');
8.4 : NULL
9.3 : "²"
Am I correct to expect NULL in this case ?

Not necessarily. \d will match any character that iswdigit() returns true
for. It looks like your new server is using a locale that considers "²"
to be a digit.

Since both PostgreSQL servers run on the same computer, can I assume
that this is a collation problem ?

I tried to create a test table with different collations, but locale
names are a headache on windows, so I gave up for today. I will give it
another try tomorrow.

--
Arnaud

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Tom Lane

2014-09-01 16:11:58 UTC

Permalink

Post by Arnaud Lesauvage

Post by Tom Lane
Not necessarily. \d will match any character that iswdigit() returns true
for. It looks like your new server is using a locale that considers "²"
to be a digit.

Since both PostgreSQL servers run on the same computer, can I assume
that this is a collation problem ?

Ah: after consulting the commit history I realized that the regex
operators only base \d on iswdigit() in 9.2 and later. Before that
it was hardwired as [0-9]. So there might not be any difference
in the locale environment after all.

I wonder whether this was a bad idea. I think it's unsurprising for the
definition of "alphanumeric" to depend on locale, but I bet most people
are not expecting \d to vary that way.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Arnaud Lesauvage

2014-09-02 08:01:21 UTC

Permalink

Post by Tom Lane

Post by Arnaud Lesauvage

Post by Tom Lane
Not necessarily. \d will match any character that iswdigit() returns true
for. It looks like your new server is using a locale that considers "²"
to be a digit.

Since both PostgreSQL servers run on the same computer, can I assume
that this is a collation problem ?

Ah: after consulting the commit history I realized that the regex
operators only base \d on iswdigit() in 9.2 and later. Before that
it was hardwired as [0-9]. So there might not be any difference
in the locale environment after all.
I wonder whether this was a bad idea. I think it's unsurprising for the
definition of "alphanumeric" to depend on locale, but I bet most people
are not expecting \d to vary that way.

I guess the change in the way \d is behaving is OK as long as it is
documented in a changelog. I saw something about syncing the regexp code
with TCL somewhere, but I think there was a mention that this should not
change the regexp behaviour and that it was only to keep the code base
in sync.

The problem in my case is just that I expected any character in the
"digit" class to be a valid integer, which is wrong since the doc states
that numbers should be written with "decimal digits", not "locale digits".
So I guess I should just rewrite this regexp as [0-9]+ instead.

Thanks for the feedback !
Regards

--
Arnaud
--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Vick Khera

2014-09-02 13:25:01 UTC

Permalink

Post by Tom Lane
I wonder whether this was a bad idea. I think it's unsurprising for the
definition of "alphanumeric" to depend on locale, but I bet most people
are not expecting \d to vary that way.

FWIW, tha Perl man page on unicode (perldoc perlunicode) says:

<quote>
It is worth stressing that there are several different sets of digits
in Unicode that are equivalent to 0-9 and are matchable by "\d" in a
regular expression. If they are used in a single language only, they
are in that language's "Script" and "Script_Extension". ...
</quote>

When working with Unicode/UTF8, I do not think it is safe to assume \d
matches only ASCII [0-9].

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general