Posts Tagged ‘osm’

UNICODE from OSM to PGSQL (part 2)

Monday, April 5th, 2010

There is no problem importing OSM data into PostgreSQL / PostGIS.

In part one of this article we saw that Geofabrik’s shapefiles suffer from a text truncation problem, but with osm2pgsql everything gets into a UTF-8 database without a failure.

It’s as simple as:

$ osm2pgsql -l -c -S default.style africa.osm.bz2 -d osm

The -l switch asks to keep the lat/long projection, -c requests creation of the schema, and -d specifies the database to use. The default.style file is a configuration file specifying what to import and how; I used the default for the sake of this test.
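For reference, entries in the style file look like the excerpt below, one tag per line with the object types it applies to, the column type, and flags. This is in the spirit of the osm2pgsql default.style; the exact contents vary between versions:

```
# OsmType   Tag        DataType   Flags
node,way    highway    text       linear
node,way    name       text       linear
node,way    natural    text       polygon
node,way    z_order    int4       linear
```

Dropping a line (or flagging it delete) keeps that tag out of the resulting tables.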

Resulting relations:

 Schema |        Name        |   Type
--------+--------------------+----------
 public | planet_osm_line    | table
 public | planet_osm_point   | table
 public | planet_osm_polygon | table
 public | planet_osm_roads   | table

And the way we’ve been using for testing has all its characters:

full multibyte name of way 4005333

Let’s see it in Quantum GIS, compared with the one coming from the corrupted shapefile (which I imported into PostGIS after hacking shp2pgsql to discard incomplete multibyte sequences):

qgis screenshot

The difference you may notice is due to left-to-right vs. right-to-left orientation of the text: my terminal ignores orientation, QGIS doesn’t.
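That bidi behaviour is easy to check outside any GIS tool. A quick Python sketch (purely illustrative, not part of the import workflow) inspecting the Unicode bidirectional class of the characters involved:

```python
import unicodedata

name = "Avenue des Nations Unies - شارع الأمم المتحدة"

# Latin letters belong to bidi class 'L' (left-to-right),
# Arabic letters to class 'AL' (Arabic letter, right-to-left).
print(unicodedata.bidirectional("A"))   # L
print(unicodedata.bidirectional("ش"))   # AL

# How many characters of the mixed name are right-to-left:
rtl = sum(1 for ch in name if unicodedata.bidirectional(ch) == "AL")
print(f"{rtl} of {len(name)} characters are right-to-left")
```

A renderer that honours these classes (like QGIS) reorders the Arabic run for display; one that doesn’t (like my terminal) shows the characters in storage order.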

Now, time to see if a shapefile can bear all that UNICODE. Let’s not do anything fancy, just dump the roads table using pgsql2shp:

$ pgsql2shp osm planet_osm_roads

Pretty fast (slightly above 1 second system time, 8 secs real time). And here’s the generated shapefile dataset:

72979148 planet_osm_roads.shp
69555432 planet_osm_roads.dbf
  516268 planet_osm_roads.shx
     257 planet_osm_roads.prj

Do they have the full multibyte strings now? Sure: shp2pgsql doesn’t complain anymore, and you can safely import into PostGIS again, completing the round-trip. You only have to specify UTF-8 as the input encoding, since the new default, as pointed out in the previous post, is that unmentionable one… So:

$ shp2pgsql -W UTF-8 planet_osm_roads planet_osm_roads_roundtrip | psql osm
...
$ psql osm -c 'select name from planet_osm_roads_roundtrip where osm_id = 4005333;'
 Avenue des Nations Unies - شارع الأمم المتحدة

Also, we can open the shapefile itself with qgis and see how it looks:

Green is the pgsql2shp-exported shapefile, red is osm2pgsql-imported planet osm, black is geofabrik-imported shapefile.

All clean and easy :)

Further exercises would include tweaking the osm2pgsql style file, and the import process in general, to better select data of interest, properly clean invalid geometries, and take care of incremental updates of the data.

Good luck and happy hacking!

UNICODE from OSM to PGSQL

Sunday, April 4th, 2010

This week I’ve been presented with a problem importing OpenStreetMap data of Africa from Geofabrik’s shapefile export into a PostgreSQL / PostGIS database.

The problem consisted of a loss of information during the transport, resulting in wrongly encoded strings (road names) ending up in the db. This was during a feasibility study. So, is it feasible? Let’s take a look.

I downloaded the shapefiles and tried to import the roads one using shp2pgsql with no options, and here’s the result:

Unable to convert field value "Place Othman Ibn Affane ساحة عثمان اِ" to UTF-8:
iconv reports "Invalid or incomplete multibyte or wide character"

Why is shp2pgsql trying to convert, and from which encoding? When I left it, the default was to perform no conversion unless -W was given…

Well, it turns out the default is now to convert from the WINDOWS-1252 encoding (why?) and there is no way to request no conversion at all (why?!).

So I patched the loader to give more information about the encoding process and specified UTF-8 as the source encoding. Here’s the result:

Unable to convert field value "Avenue des Nations Unies - شارع الأمم " from UTF-8 to UTF-8:
iconv reports "Invalid argument"

So it’s official: the dbf file contains invalid data. The confusing error message (Invalid argument) means the multibyte sequence is incomplete rather than invalid (errno EINVAL).
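The distinction is easy to reproduce outside iconv; Python’s UTF-8 decoder makes the same two complaints (a sketch for illustration only):

```python
# The Arabic letter sheen (U+0634) encodes to the two bytes D8 B4.
print(b"\xd8\xb4".decode("utf-8"))        # decodes fine: ش

# Cutting the value after the lead byte leaves an *incomplete*
# sequence, the analogue of iconv's EINVAL ("Invalid argument"):
try:
    b"\xd8".decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)                     # unexpected end of data

# A byte that can never start a sequence is *invalid* instead,
# the analogue of iconv's EILSEQ:
try:
    b"\xff".decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)                     # invalid start byte
```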

Adding more debugging code, I can see that many, many rows have values that look truncated, all ending with a single byte being either <D8> or <D9>, which happen to be the lead bytes of two-byte UTF-8 sequences in the Arabic range.

OpenOffice confirms the malformation (I wanted an independent opinion on that, just in case it was shapelib doing the truncation):

OpenOffice shows the truncated multibyte value

Querying openstreetmap for way 4005333 shows the full string, and the full string is also present in the .osm file downloaded from geofabrik:

<tag k="name" v="Avenue des Nations Unies - شارع الأمم المتحدة"/>

So the problem is only with the shapefile, not the OSM data itself, nor with postgis.
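Checking the .osm file for that tag is straightforward with any XML parser. A small self-contained sketch, with the relevant fragment inlined rather than reading the multi-gigabyte download:

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for the relevant part of the Geofabrik .osm file.
fragment = """<osm>
  <way id="4005333">
    <tag k="name" v="Avenue des Nations Unies - شارع الأمم المتحدة"/>
  </way>
</osm>"""

way = ET.fromstring(fragment).find("way[@id='4005333']")
name = way.find("tag[@k='name']").get("v")
print(name)
```

On the real file you would stream with ET.iterparse() instead of parsing the whole document at once.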

Surely the PostGIS loader could be tweaked to allow for some tolerance, in case anyone wants to import the truncated data anyway. In this specific case, discarding the final partial multibyte sequence might be the best you can do: it’s a truncation like any other, only with a multibyte encoding it causes more trouble than with a single-byte one.
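Such a tolerance essentially amounts to decoding while dropping the undecodable tail; a sketch of the idea in Python (an illustration of the approach, not the actual loader patch):

```python
def drop_incomplete_tail(raw: bytes) -> str:
    """Decode UTF-8, discarding bytes that don't form complete
    characters (such as a multibyte sequence cut short by a
    byte-wise DBF field truncation)."""
    return raw.decode("utf-8", errors="ignore")

# A value truncated in the middle of a two-byte Arabic character,
# leaving a lone D9 lead byte at the end:
truncated = "ساحة عثمان اِ".encode("utf-8")[:-1]
print(drop_incomplete_tail(truncated))
```

Note that errors="ignore" drops invalid bytes anywhere in the value, not just at the end; a stricter loader would tolerate only a truncated tail.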

Timely enough, someone submitted a patch aimed at exactly this kind of tolerance handling. I’m going to see how well it copes with this case.

But the bottom line is that we want the good data, so this problem is not solved until the data is in the database, going through shapefiles (if possible) or directly.

Next stop: osm2pgsql -> go there