Handling Text Encodings

Any material you import into Global Mapper (layers, projections, etc.) will include some metadata that holds useful information about it, such as the file name, the geographic extent of the data, or the type of data. However, occasionally some of the metadata may look unreadable when printed to the screen from Python. For example, if an imported layer is located at 40° N, 100° W, you might see something like “40\udcb0 N, 100\udcb0 W” when calling GetLayerInfo on that layer. This is a sign that the text information for your file was written with an encoding other than Python’s standard, UTF-8.

Character encodings are ways of translating ones and zeros into text, so when a file is encoded with one system but then decoded with another, the results can look strange. Most encoding systems use the same translation for ASCII characters (unaccented Latin letters, Arabic numerals, and common punctuation marks), so generally only special characters are susceptible to mistranslation, like the degree sign in the example above. If you notice a character being wrongly decoded, here is one way to fix it:

>>> # open some layer...
>>> info = gm.GetLayerInfo(myLayer)
>>> print("Layer name:", info.mDescription)
Layer name: krakÃ³w_poland.shp

>>> print("Encoding:", info.mCodePage)
Encoding: 0

>>> b = info.mDescription.encode("cp1252")
>>> s = b.decode("utf8")

>>> print("Decoded layer name:", s)
Decoded layer name: kraków_poland.shp

In this example, the two UTF-8 bytes that make up “ó” were interpreted as Windows CP-1252, so each byte was shown as its own character. Encoding the string as CP-1252 recovers the original bytes, and decoding those bytes as UTF-8 gives the intended character. You can tell which code page is associated with a GM_LayerInfo_t object by checking its mCodePage variable. Here, it returned the integer value 0, meaning CP-1252. This corresponds to a standard enumeration from C++ used to identify code pages (encoding systems):

  • 0 = CP-1252 (aka Windows 1252, ANSI)

  • 1 = OEM

  • 2 = Mac-Roman

  • 65000 = UTF-7

  • 65001 = UTF-8
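
Before an mCodePage value can be passed to encode() or decode(), it has to be turned into a Python codec name. The lookup below is a minimal sketch of that step. The names CODE_PAGE_TO_CODEC and codec_for_code_page are hypothetical and not part of the Global Mapper SDK, and the codec chosen for OEM is an assumption, because the actual OEM code page depends on the system locale.

```python
# Hypothetical lookup from mCodePage values to Python codec names.
# Not part of the Global Mapper SDK; the OEM entry is an assumption.
CODE_PAGE_TO_CODEC = {
    0: "cp1252",      # CP-1252 (Windows 1252, ANSI)
    1: "cp437",       # OEM; assumed US default, varies with system locale
    2: "mac_roman",   # Mac-Roman
    65000: "utf_7",   # UTF-7
    65001: "utf_8",   # UTF-8
}


def codec_for_code_page(code_page, default="utf_8"):
    """Return the Python codec name for a code page, or `default` if unknown."""
    return CODE_PAGE_TO_CODEC.get(code_page, default)
```

With this in place, codec_for_code_page(info.mCodePage) would give "cp1252" for the layer in the example above.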

Once you know the encoding used to generate text, you can use the standard Python method encode() to break the string down into raw bytes, which can then be translated back into text using decode().
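
The encode() and decode() round trip can be wrapped in a small helper. This is only a sketch that runs without Global Mapper: the function name redecode is hypothetical, and the garbled string is simulated by decoding UTF-8 bytes as CP-1252 rather than read from a real layer.

```python
def redecode(text, wrong_codec, right_codec):
    """Undo a wrong decode: recover the raw bytes, then decode them properly."""
    raw = text.encode(wrong_codec)   # back to the original bytes
    return raw.decode(right_codec)   # translate with the intended encoding


# Simulate the problem: UTF-8 bytes read as if they were CP-1252.
garbled = "kraków_poland.shp".encode("utf8").decode("cp1252")
# garbled is now "krakÃ³w_poland.shp"

# Reverse the mistake to get the intended name back.
fixed = redecode(garbled, "cp1252", "utf8")
# fixed is now "kraków_poland.shp"
```

The helper raises UnicodeEncodeError or UnicodeDecodeError if the chosen codecs do not match the text, which is a useful signal that a different pair of encodings is needed.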

In some cases, instead of seeing jumbled symbols where a special character should be, you might see an escape sequence of the form “\udcxx”. These are lone surrogates: placeholders for bytes that could not be decoded as UTF-8, where xx is the hexadecimal value of the original byte. When the original text was written in CP-1252, they can be decoded in the following way:

>>> # continued from last example
>>> metadata_list = gm.GM_AttrValue_array_frompointer(info.mMetadataList)
>>> metadata_list[8]
{'mName': 'NORTH LATITUDE', 'mVal': '50\udcb0 03\' 41.0120" N'}

>>> north_lat = str(metadata_list[8].mVal)
>>> b = north_lat.encode("utf8", "surrogateescape")
>>> s = b.decode("cp1252")

>>> print("North latitude:", s)
North latitude: 50° 03' 41.0120" N

The additional argument "surrogateescape" used in the encode() call tells Python to turn each lone surrogate back into the single byte it stands for, instead of raising an error. Notice that in this case, “utf8” is now the encode argument and “cp1252” is the decode argument, reversed from the order in the previous example. This is because the CP-1252 bytes have already been passed through a UTF-8 decode, which is when the surrogates were inserted. For more in-depth information about these functions, refer to the official Python documentation for str.encode() and bytes.decode().
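
To see where the surrogates come from, the sketch below reproduces the situation in plain Python, without Global Mapper. It assumes the text reached Python through a UTF-8 decode that used the surrogateescape error handler, which is consistent with the “\udcb0” output shown above but is not confirmed here. The byte string and the helper name unescape are made up for illustration.

```python
# CP-1252 bytes for a latitude string; 0xB0 is the degree sign.
raw = b"50\xb0 03' 41.0120\" N"

# 0xB0 on its own is not valid UTF-8, so surrogateescape keeps it
# as the lone surrogate U+DCB0 instead of raising an error.
escaped = raw.decode("utf8", "surrogateescape")


def unescape(text, codec="cp1252"):
    """Turn lone surrogates back into bytes, then decode with `codec`."""
    return text.encode("utf8", "surrogateescape").decode(codec)


# The surrogate becomes byte 0xB0 again and is read as a degree sign.
fixed = unescape(escaped)
```

Because ASCII characters pass through both steps unchanged, it is safe to apply unescape() to strings that contain no surrogates at all.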

Unicode characters which require more than one byte, such as characters from non-Latin alphabets, are currently unsupported for use in the Global Mapper SDK in this context, and will appear as a series of question marks when printed. Text data of this type is still stored correctly inside your file and will not be affected. As an alternative, you may wish to use the Global Mapper desktop app, which can correctly display all Unicode characters.