Some Unicode tips for Ruby

I recently released a Japanese word search builder (see the discussion in the Hacker News post). This was my first foray into the world of multibyte characters, and I quickly discovered the pain involved in simply printing "ありがと" to the screen. Below are some of the details I found to help me through this trial.

Tips to make your Unicode+Ruby experience easier

  1. Read James Edward Gray's Understanding m17n series. Every word. Period.
  2. If at all possible, use Ruby 1.9+. As James describes, Unicode support is far better than in 1.8.x.
  3. Despite Ruby 1.9's superior Unicode support, it still doesn't treat source files as UTF-8 by default. This means that loading any file which includes non-ASCII characters will cause Ruby to lash out at you with complaints about invalid multibyte characters. To get around this, declare the file's encoding in a comment on the first line of the file (or the second line, if the first is a shebang line).
    # encoding: UTF-8
    ... unicode goes here ...
  4. If you can't use 1.9, make sure you use the Jcode library. It's included with Ruby 1.8 and adds lots of useful Unicode-oriented features to String and Regexp. It relies on $KCODE being set to UTF-8, either in code or with the -KU flag. The only disadvantage is that the original methods remain in place (e.g., you'll have to make sure you call String#jsize rather than String#size).
    $KCODE = 'u'      # jcode needs this to recognize UTF-8 characters
    require 'jcode'
    "ありがと".size   # counts bytes
    # => 12
    "ありがと".jsize  # counts characters
    # => 4
  5. If you plan for your Ruby file to be run directly, make sure you add the -KU flag to your shebang line. This tells Ruby to treat the script and its data as UTF-8.
    #!/usr/bin/env ruby -KU
    ... Ruby goes here ...
  6. Familiarize yourself with iconv, both the Ruby library and the command-line tool. It lets you convert data easily between character encodings.
  7. While not strictly required these days, it's still recommended when serving HTML with Unicode data to set the content type both in a meta tag on the page and in the HTTP header.
    HTTP/1.1 200 OK
    Server: nginx/0.6.39
    Date: Wed, 24 Mar 2010 00:39:47 GMT
    Content-Type: text/html; charset=utf-8
    ...
    
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    ...
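
To tie a few of the tips together, here is a minimal sketch of how strings behave on Ruby 1.9 or later. It assumes the file is saved as UTF-8. The variable names are only illustrative, and String#encode is used in place of iconv for the conversion step because it is built into 1.9+ strings; the iconv library or command-line tool should be able to do the same job.

```ruby
# encoding: UTF-8
# Sketch only: assumes Ruby 1.9+ and a UTF-8 source file.

word = "ありがと"

# On 1.9+, String#size counts characters and String#bytesize counts bytes,
# so there is no need for jcode's String#jsize.
char_count = word.size
byte_count = word.bytesize

# Every string carries its encoding along with it.
encoding_name = word.encoding.name

# Convert to another encoding, much as iconv would.
# Shift_JIS stores each hiragana character in two bytes.
sjis = word.encode("Shift_JIS")
sjis_bytes = sjis.bytesize

# Convert back to UTF-8 to confirm nothing was lost along the way.
round_trip = sjis.encode("UTF-8")

puts "characters: #{char_count}"
puts "UTF-8 bytes: #{byte_count}"
puts "source encoding: #{encoding_name}"
puts "Shift_JIS bytes: #{sjis_bytes}"
puts "round trip intact: #{round_trip == word}"
```

The gap between the character count and the byte count is the same one that String#size and String#jsize expose on 1.8; on 1.9+ the string itself knows which encoding its bytes are in, so the character-aware answer is the default.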