Discussion:
encoding confusion with \copy command
Martin Waite
2014-09-17 10:03:42 UTC
Permalink
Hi,

I have a postgresql 7.4 server and client on Centos 6.4. The database
server is using UTF-8 encoding.

I have been exploring the use of the \copy command for importing CSV data
generated by SQL Server 2008. SQL Server 2008 export tool does not escape
quotes that are in the content of fields, and so it is useful to be able to
specify obscure characters in the quote option in the \copy command to work
around this issue.

When I run the following commands in psql, I am surprised that QUOTE is
limited to characters in the range 0x01 - 0x7f, and that UTF8 is mentioned
in the error message if characters outside the range are chosen:

\encoding WIN1252
\copy yuml from '/tmp/yuml.csv' WITH CSV HEADER ENCODING 'WIN1252' QUOTE
as E'\xff';
ERROR: invalid byte sequence for encoding "UTF8": 0xff



I thought that if the client (psql) is WIN1252, and the CSV file is
specified as WIN1252, then I could specify any valid WIN1252 character as
the quote character. Instead, I am limited to the range of characters
that can be encoded as a single byte in UTF-8. Actually, 0x00 is not
accepted either, so the range is 0x01 - 0x7F.

Is this a bug or expected behaviour ?

Is it the case that the server does the actual CSV parsing, and that given
that my server is in UTF8, I am therefore limited to single-byte UTF8
characters ?

regards,
Martin
Martin Waite
2014-09-17 13:48:44 UTC
Permalink
Hi Adrian,

I apologise - I meant 9.4

regards,
Martin
Post by Martin Waite
Hi,
I have a postgresql 7.4 server and client on Centos 6.4. The database
server is using UTF-8 encoding.
First I think we need to establish what version of Postgres you using. Are
you really using 7.4?
Post by Martin Waite
I have been exploring the use of the \copy command for importing CSV
data generated by SQL Server 2008. SQL Server 2008 export tool does not
escape quotes that are in the content of fields, and so it is useful to
be able to specify obscure characters in the quote option in the \copy
command to work around this issue.
When I run the following commands in psql, I am surprised that QUOTE is
limited to characters in the range 0x01 - 0x7f, and that UTF8 is
\encoding WIN1252
\copy yuml from '/tmp/yuml.csv' WITH CSV HEADER ENCODING 'WIN1252'
QUOTE as E'\xff';
ERROR: invalid byte sequence for encoding "UTF8": 0xff
If you are actually on Postgres 7.4 the above would not be a viable
command.
Post by Martin Waite
I thought that if the client (psql) is WIN1252, and the CSV file is
specified as WIN1252, then I could specify any valid WIN1252 character
as the quote character. Instead, I am limited to the range of
characters that can be encoded as a single byte in UTF-8. Actually, 0x00
is not accepted either, so the range is 0x01 - 0x7F.
Is this a bug or expected behaviour ?
Is it the case that the server does the actual CSV parsing, and that
given that my server is in UTF8, I am therefore limited to single-byte
UTF8 characters ?
Actually depending on version you may be limited to ASCII.
Post by Martin Waite
regards,
Martin
--
Adrian Klaver
Adrian Klaver
2014-09-17 14:40:49 UTC
Permalink
Post by Martin Waite
Hi Adrian,
I apologise - I meant 9.4
Looks like you will need an intermediate step. A quick search found the
below which might offer a solution:

http://www.excel-sql-server.com/sql-server-export-to-excel-using-bcp-sqlcmd-csv.htm

FYI the good stuff is at the bottom.

There is also this SO answer that refers to SQL Server Management
Studio(not sure if you have):

http://stackoverflow.com/questions/6115054/how-to-get-export-output-in-real-csv-format-in-sql-server-managment-studio
Post by Martin Waite
regards,
Martin
--
Adrian Klaver
***@aklaver.com
--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Martin Waite
2014-09-17 16:52:25 UTC
Permalink
You are right that I need an intermediate step. I will probably use a CSV
parser that is liberal in what it accepts, but writes out strict CSV data
suitable for postgres.

Thanks for the help.
Post by Adrian Klaver
Post by Martin Waite
Hi Adrian,
I apologise - I meant 9.4
Looks like you will need an intermediate step. A quick search found the
http://www.excel-sql-server.com/sql-server-export-to-
excel-using-bcp-sqlcmd-csv.htm
FYI the good stuff is at the bottom.
There is also this SO answer that refers to SQL Server Management
http://stackoverflow.com/questions/6115054/how-to-get-
export-output-in-real-csv-format-in-sql-server-managment-studio
Post by Martin Waite
regards,
Martin
--
Adrian Klaver
John DeSoi
2014-09-18 01:08:04 UTC
Permalink
You are right that I need an intermediate step. I will probably use a CSV parser that is liberal in what it accepts, but writes out strict CSV data suitable for postgres.
If you find such a utility, please share. My clients love Excel, but it takes perfectly valid CSV files and makes them unreadable by Postgres. In particular, Excel saves rows with fewer columns than the header header row if the cells are empty. It also mangles valid UTF-8. I often take Excel CSV files and re-save them from Open Office to fix them for Postgres import.

John DeSoi, Ph.D.
--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Adrian Klaver
2014-09-18 01:53:05 UTC
Permalink
Post by John DeSoi
You are right that I need an intermediate step. I will probably use a CSV parser that is liberal in what it accepts, but writes out strict CSV data suitable for postgres.
If you find such a utility, please share. My clients love Excel, but it takes perfectly valid CSV files and makes them unreadable by Postgres. In particular, Excel saves rows with fewer columns than the header header row if the cells are empty. It also mangles valid UTF-8. I often take Excel CSV files and re-save them from Open Office to fix them for Postgres import.
Have you tried Save As Unicode:

http://www.ablebits.com/office-addins-blog/2014/04/24/convert-excel-csv/#export-csv-utf8

I vaguely remember using Access to pull in an Excel file and saving CSV
from there. I also seem to remember an option when saving a text file to
create an export template that allowed you to specify the csv options
and formatting. This was with the Pro version of Office and the data
tools installed.

What I do now is use the Python xlrd module to read in the Excel file
and then the csv module to output a CSV file.
Post by John DeSoi
John DeSoi, Ph.D.
--
Adrian Klaver
***@aklaver.com
--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
John R Pierce
2014-09-18 02:03:47 UTC
Permalink
Post by John DeSoi
You are right that I need an intermediate step. I will probably use a CSV parser that is liberal in what it accepts, but writes out strict CSV data suitable for postgres.
If you find such a utility, please share. My clients love Excel, but it takes perfectly valid CSV files and makes them unreadable by Postgres. In particular, Excel saves rows with fewer columns than the header header row if the cells are empty. It also mangles valid UTF-8. I often take Excel CSV files and re-save them from Open Office to fix them for Postgres import.
I've used the npgsql .net driver http://npgsql.projects.pgfoundry.org/
with excel... you can directly query tables into spreadsheets via the
'data' feature in excel.
--
john r pierce 37N 122W
somewhere on the middle of the left coast
--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Adrian Klaver
2014-09-17 13:35:57 UTC
Permalink
Post by Martin Waite
Hi,
I have a postgresql 7.4 server and client on Centos 6.4. The database
server is using UTF-8 encoding.
First I think we need to establish what version of Postgres you using.
Are you really using 7.4?
Post by Martin Waite
I have been exploring the use of the \copy command for importing CSV
data generated by SQL Server 2008. SQL Server 2008 export tool does not
escape quotes that are in the content of fields, and so it is useful to
be able to specify obscure characters in the quote option in the \copy
command to work around this issue.
When I run the following commands in psql, I am surprised that QUOTE is
limited to characters in the range 0x01 - 0x7f, and that UTF8 is
\encoding WIN1252
\copy yuml from '/tmp/yuml.csv' WITH CSV HEADER ENCODING 'WIN1252'
QUOTE as E'\xff';
ERROR: invalid byte sequence for encoding "UTF8": 0xff
If you are actually on Postgres 7.4 the above would not be a viable command.
Post by Martin Waite
I thought that if the client (psql) is WIN1252, and the CSV file is
specified as WIN1252, then I could specify any valid WIN1252 character
as the quote character. Instead, I am limited to the range of
characters that can be encoded as a single byte in UTF-8. Actually, 0x00
is not accepted either, so the range is 0x01 - 0x7F.
Is this a bug or expected behaviour ?
Is it the case that the server does the actual CSV parsing, and that
given that my server is in UTF8, I am therefore limited to single-byte
UTF8 characters ?
Actually depending on version you may be limited to ASCII.
Post by Martin Waite
regards,
Martin
--
Adrian Klaver
***@aklaver.com
--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Loading...