ecto and character encoding


Weblogs are a family of HTML web pages, one of the many possible media for sharing information over the internet. In the old days, you'd take pen and paper, write down what was on your mind, and send it off in a stamped envelope. You never really had to worry about what paper to use or what pen to handle, as long as you wrote clearly and the ink was legible to any reader.

It's not that simple on the web. In digital transmission and storage, data are represented as bytes, strings of eight bits (each being either 0 or 1) that are also known as octets. In the simplest case, each octet represents a single character, but since there are only 256 possible 8-bit strings, this does not allow for many characters. One such mapping (i.e. encoding) of octets to characters is good old ASCII, which uses only the first 128 values. Its printable characters are:

  ! " # $ % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o
p q r s t u v w x y z { | } ~ 
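
The table above is nothing more than octets 32 through 126 printed in order. As a minimal sketch (Python is my choice here, and the helper name ascii_rows is made up for illustration), it can be regenerated like this:

```python
# Rebuild the printable ASCII table: each octet from 32 to 126
# maps to exactly one character.
def ascii_rows(start=32, stop=127, width=16):
    rows = []
    for base in range(start, stop, width):
        octets = bytes(range(base, min(base + width, stop)))
        # Decode the octets as ASCII and separate characters with spaces.
        rows.append(" ".join(octets.decode("ascii")))
    return rows

for row in ascii_rows():
    print(row)
```

The first row starts with the space character itself (octet 32), which is why it looks indented.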

But the most commonly used encoding is ISO 8859-1, which uses all 256 octet values and so adds the accented letters of Western European languages to ASCII. Languages aren't that simple, though. Many non-Western languages, Chinese and Japanese for example, have a much wider range of characters. These languages use different encodings that map octets to characters in a far more complicated way, often with more than one octet per character (Shift-JIS and EUC are examples). Browsers and other kinds of software that deal with data transmission therefore need to know which encoding is being used, so that they can map octets to characters and display them properly.
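
To see why software has to know the encoding, here is a small Python sketch. The sample strings are my own, and the codec names are the ones Python's standard library uses. The same text becomes different octets under different encodings, and decoding with the wrong one garbles it:

```python
text = "café"

# The same four characters become different octet sequences.
latin1 = text.encode("iso-8859-1")   # one octet per character
utf8 = text.encode("utf-8")          # "é" takes two octets here

# Reading UTF-8 octets as if they were ISO 8859-1 garbles the text.
wrong = utf8.decode("iso-8859-1")

print(latin1, utf8, wrong)

# Japanese needs multi-octet encodings, and they disagree with each other.
japanese = "日本語"
sjis = japanese.encode("shift_jis")
euc = japanese.encode("euc_jp")

print(sjis.hex(), euc.hex())
```

Both Japanese encodings use two octets for each of these characters, yet the octets differ, so guessing wrong produces nonsense rather than an error.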

Authoring tools, such as ecto, deal with data sent from a server that is hosting your blog. Usually, these data are sent as XML documents. XML is a special way to represent structured information. Structured information contains both content (words, pictures, etc.) and some indication of what role that content plays (title of your blog entry, categories for your blog entry). Almost all documents have some structure. A markup language is a mechanism to identify structures in a document. The XML specification defines a standard way to add markup to documents.

One way to transmit XML is to use XML-RPC (Remote Procedure Call, using XML as the format). To properly extract the data from an XML document (i.e. to parse it), the encoding must be correctly set so that the parser can separate structure from content. Blogging systems like TypePad or WordPress solve the encoding problem by storing all weblog data as UTF-8. UTF-8 is a widely used encoding of Unicode (ISO 10646), an international standard that provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The IETF Policy on Character Sets and Languages (RFC 2277) favors UTF-8.
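
As a rough illustration of why the declared encoding matters to a parser, here is a Python sketch using the standard library's ElementTree. The payload is a simplified XML-RPC-style response I made up, not what any particular blogging system actually sends, and make_payload and read_title are hypothetical helpers:

```python
import xml.etree.ElementTree as ET

# Simplified, made-up response shape; real XML-RPC responses carry more.
TEMPLATE = (
    '<?xml version="1.0" encoding="{declared}"?>'
    "<methodResponse><params><param><value><string>{title}</string></value>"
    "</param></params></methodResponse>"
)

def make_payload(title, declared, actual=None):
    """Serialize a response; `actual` lets us mislabel the octets on purpose.

    The title is inserted as-is, so it must not contain <, > or &.
    """
    xml = TEMPLATE.format(declared=declared, title=title)
    return xml.encode(actual or declared)

def read_title(payload):
    # The parser reads the encoding from the XML declaration.
    return ET.fromstring(payload).findtext(".//string")

title = "Zoë visits a café"
print(read_title(make_payload(title, "utf-8")))
print(read_title(make_payload(title, "iso-8859-1")))

# Octets written as ISO 8859-1 but declared as UTF-8 are not valid UTF-8.
try:
    read_title(make_payload(title, "utf-8", actual="iso-8859-1"))
except ET.ParseError as err:
    print("parser choked:", err)
```

When the declaration matches the octets, either encoding parses fine. The trouble starts only when the label and the actual octets disagree.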

Other blogging systems assume encodings commonly used for Western languages, such as, again, ISO 8859-1. However, most bloggers create entries unaware of how their text may be encoded. This can be a problem if they use characters that are not in the corresponding encoding table such as, for example, “ and ” or even €. If these characters are sent to authoring tools via XML-RPC, they can create problems for parsers. They choke on them, badly. There are two ways to bypass this encoding problem:

1. Convert characters that are not in the encoding table to HTML entities, which are specially designed strings that refer to a character (see the chart). The three special characters mentioned above would be escaped as “&ldquo;”, “&rdquo;”, and “&euro;”, but the recommended way is to use the numerical equivalents (“&#8220;”, “&#8221;”, and “&#8364;”).

Although ecto does this automatically by default, you may have created some entries via your blogging system's online control panel. MovableType, for example, does not convert the text you enter to the default encoding, but stores it as is. As a result, retrieving recent entries via ecto can produce “Encountered string encoding problem” errors.

2. The recommended way is to follow TypePad's lead and use UTF-8 as the default encoding. This way, you won't have to bother with HTML entities. Any text you type will be properly mapped. Apple's XML-RPC tools also work best with UTF-8. If you use MovableType, changing the default encoding is a bit complicated. It requires editing the mt.cfg file of your MovableType installation. Uncomment the line that says:

NoHTMLEntities 1

and uncomment and change the line that contains “PublishCharset” to read:

PublishCharset utf-8

Then rebuild your site.

If you are using UTF-8 as the default encoding for your blog, then ecto does not have to escape characters as HTML entities. In that case, open Weblog -> Edit Settings..., select the “Preprocessing” tab and uncheck the “Convert HTML entities” option.
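
To make the two options above concrete, here is a minimal Python sketch of the general technique. It is not ecto's actual implementation, and escape_for is a hypothetical helper. It turns any character missing from the target encoding into a numeric entity, and does nothing at all when the target is UTF-8:

```python
import html

def escape_for(text, encoding="iso-8859-1"):
    """Replace characters the encoding cannot represent with &#NNNN; entities."""
    # "xmlcharrefreplace" is a standard Python codec error handler.
    return text.encode(encoding, errors="xmlcharrefreplace").decode(encoding)

entry = "“café” costs 5 €"

# Option 1: an ISO 8859-1 blog needs entities for the curly quotes and
# the euro sign, while "é" is in the table and stays as it is.
escaped = escape_for(entry)
print(escaped)

# A browser (or html.unescape) turns the entities back into characters.
print(html.unescape(escaped))

# Option 2: with UTF-8 every character fits, so nothing is escaped.
print(escape_for(entry, "utf-8"))
```

The numbers in the entities are simply the Unicode numbers of the characters, which is why 8220, 8221 and 8364 show up for the two quotes and the euro sign.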

UPDATE: Sam Ruby's take on encoding.

Posted by Adriaan on March 15, 2004