Note that this guide covers recent versions of Rails. If you are stuck with an older version, take a look at UnicodeStringsInOldRails.
A short intro
While Ruby doesn’t have any specific facilities for managing Unicode strings, you can store UTF-8 encoded data in its 8-bit strings. However, some of the String methods assume a single-byte encoding and therefore return wrong results. Besides, without proper settings you will get input and output as sequences of bytes, and you might get parse errors in literals.
As of version 1.2, Rails provides ActiveSupport::Multibyte to cope with the problems inherent in Ruby’s poor handling of multibyte strings.
Check your editor
This might seem obvious, but to correctly edit scripts containing Unicode characters you will need a text editor that supports UTF-8.
You will also need to ensure that your UTF-8 aware text editor does not insert the BOM (Byte Order Mark) at the beginning of the files it writes. BBEdit and Windows Notepad are both notorious for this. In BBEdit, select “UTF-8 No BOM” as your default format to avoid the problem altogether.
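To check whether a file your editor saved carries a BOM, you can look at its first three bytes. The following is a minimal sketch; the helper name starts_with_utf8_bom? is made up for illustration and is not part of Ruby or Rails.

```ruby
# The UTF-8 byte order mark is the byte sequence EF BB BF.
UTF8_BOM_BYTES = [0xEF, 0xBB, 0xBF]

# Hypothetical helper: true when the given data begins with a UTF-8 BOM.
# unpack('C3') reads the first three bytes as unsigned integers.
def starts_with_utf8_bom?(data)
  data.unpack('C3') == UTF8_BOM_BYTES
end

with_bom    = [0xEF, 0xBB, 0xBF].pack('C*') + "puts 'hi'"
without_bom = "puts 'hi'"

puts starts_with_utf8_bom?(with_bom)     # true
puts starts_with_utf8_bom?(without_bom)  # false
```

When checking a real file, read it in binary mode so that the bytes reach the helper unchanged.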
Check your KCODE
To tune standard input and output, as well as to enter Unicode literals in scripts, the $KCODE global variable of the interpreter session should be set to “UTF-8”. Rails assumes that a web app is Unicode from the start, so $KCODE is already set for you.
This gives you correct input and output and correct UTF-8 “general” (binary) sorting.
Console configuration
To actually read Unicode-encoded data from your console (for example when monitoring logs) you will need to tune your environment variables for the right locale and font. The variables responsible for locale on Unix are LC_CTYPE and LANG.
On Windows, cmd.exe always uses the system code page, so you will need to reconfigure that. You can check the current code page with
chcp # outputs cp866
and switch it to UTF-8 (65001 is the special MS code for it):
chcp 65001
The only thing left is to change the cmd window to use a monospaced Unicode font, for example ‘Lucida Console’.
The methods of death
Several methods on the String class return wrong results for multibyte strings. Every use of these methods on Unicode text in your application introduces a bug, sometimes including irreversible damage to strings. Among them are:
String#reverse
String#size
String#index
String#[]
String#downcase
String#capitalize
String#upcase
String#strip and friends
String#slice
All of these are heavily used internally by Rails, as well as by other libraries (which also use them to operate on byte sequences and data packets). To mitigate this problem, Rails introduces the String#chars accessor for working with bytestrings as characters. The object that you get from the chars method acts as a mediator between the Unicode rulesets and the underlying data (which is the Ruby string). All of the methods mentioned above are multibyte-safe when you use them through String#chars. You are guaranteed never to damage a Unicode codepoint, provided these methods are used exclusively.
"Кусок текста"[0..2] # => "К\321", one letter got through and the other one got sliced in two
"Кусок текста".chars[0..2] # => "Кус", no codepoint damage
"Café".reverse #=> "\251##aC", Too broken even for a terminal
"Café".chars.reverse # => "éfaC" - works
Methods called on a Chars object return a Chars object again, so you can easily chain calls while staying in the multibyte-safe zone.
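The difference between bytes and codepoints can also be seen with plain pack and unpack, without Rails. This sketch is not how String#chars is implemented; it only illustrates why byte-wise operations break UTF-8 text. It assumes the source file is saved as UTF-8.

```ruby
text = "Café"

bytes      = text.unpack('C*')  # one integer per byte
codepoints = text.unpack('U*')  # one integer per Unicode codepoint

puts bytes.length       # 5, because "é" occupies two bytes in UTF-8
puts codepoints.length  # 4

# Reversing the codepoints keeps every character intact.
reversed = codepoints.reverse.pack('U*')
puts reversed           # éfaC

# Slicing by codepoint never cuts a character in half.
first_three = "Кусок текста".unpack('U*')[0, 3].pack('U*')
puts first_three        # Кус
```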
You can make the library work faster by installing the utf8proc gem. Please note that some functions need the original Unicode tables. These are deserialized from a file in a special format, which has its pros and cons: it’s fast, but it consumes memory. As long as you do not call any methods that need table lookups (such as normalization or casefolding), these tables will not be loaded.
Please examine the documentation of the methods provided by Chars carefully.
Regexp gotchas
Remember that [a-z] is no longer “any letter”. Also please remember that the /i flag (case insensitive matching) does not work on Unicode strings.
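A short illustration of the first point, assuming a UTF-8 source file; the constant name ASCII_LOWERCASE is made up for this sketch:

```ruby
# Matches strings made up only of the ASCII letters a to z.
ASCII_LOWERCASE = /\A[a-z]+\z/

puts !!(ASCII_LOWERCASE =~ "cafe")   # true
puts !!(ASCII_LOWERCASE =~ "café")   # false, "é" lies outside a-z
puts !!(ASCII_LOWERCASE =~ "кусок")  # false, Cyrillic letters are not in a-z either
```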
Case conversion
Rails implements basic Unicode casefolding inasmuch as it is possible without locales (that is, it does casefolding independently of the language). Use String#chars.downcase and String#chars.upcase. The standard String#downcase will not work.
Normalization
Rails provides Unicode 5.0 compliant normalization against tables using the String#chars.normalize(form) method. form is either :kc, :d, or :c. The W3C recommends normalization form C for data sent across the wire, and KC seems to be the common choice for text searches.
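Why normalization matters can be shown without Rails: the same visible character may be stored as different codepoint sequences, and such strings do not compare as equal. The sketch below builds both forms with pack. Its last two lines use String#unicode_normalize, which is an assumption about your interpreter (it is built into Ruby 2.2 and later) and is not the String#chars.normalize method described above.

```ruby
composed   = [0xE9].pack('U*')         # "é" as a single codepoint (U+00E9)
decomposed = [0x65, 0x301].pack('U*')  # "e" followed by a combining acute accent (U+0301)

puts composed == decomposed            # false, although both render as "é"
puts composed.unpack('U*').length      # 1
puts decomposed.unpack('U*').length    # 2

# Assumption: Ruby 2.2 or later, where String#unicode_normalize is available.
puts decomposed.unicode_normalize(:nfc) == composed    # true
puts composed.unicode_normalize(:nfd) == decomposed    # true
```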
Setting up your database properly
Rails creates tables automatically and does not explicitly specify any table creation options when doing so (assuming that the defaults are set beforehand). That means that your database should set the right encoding for tables right when they are being created.
For MySQL use the additional DEFAULT CHARSET clause when creating the DB:
CREATE DATABASE my_web_two_zero_development DEFAULT CHARSET utf8;
If you are running your tests and all characters from the database come out as “????”, while in development everything comes through fine, you will need to recreate your test database with the proper setting.
Configuring the database connection
Do not forget to put the client encoding in your database.yml for all databases involved:
development:
  adapter: mysql
  database: example_development
  encoding: utf8
  username: root
  password:
The value “utf8” shown here is for MySQL. For PostgreSQL replace it with “unicode”.
This setting defines which encoding should be used for the database client driver – it literally tells the database driver to translate strings for you from the encoding the database is in to this one when querying, and back when writing. That is, you can have a database in a different encoding and query it in UTF-8.
This has been supported by PostgreSQL for quite some time and has also been available in MySQL since version 4. However, it’s highly recommended to have your database in Unicode for a simple reason: when you send the database a character that it cannot represent in the actual database encoding, you will get an aborted transaction and an error message.
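Since the setting has to be present for every environment, a quick check over the parsed configuration can catch an entry that was forgotten. This is a minimal sketch: the helper name environments_missing_encoding and the sample configuration are made up for illustration, and in a real application you would pass in the contents of your own database.yml.

```ruby
require 'yaml'

# Sample configuration; the "test" entry deliberately lacks an encoding.
DATABASE_YML = <<CONFIG
development:
  adapter: mysql
  database: example_development
  encoding: utf8
test:
  adapter: mysql
  database: example_test
CONFIG

# Hypothetical helper: names of the environments without an encoding setting.
def environments_missing_encoding(yaml_text)
  config = YAML.load(yaml_text)
  config.keys.select { |env| config[env]['encoding'].nil? }.sort
end

puts environments_missing_encoding(DATABASE_YML).inspect  # ["test"]
```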
To use SQLite with UTF-8 you need to compile your SQLite with UTF-8 support built in. After that you don’t need any special settings in database.yml. SQLite3 compiles as UTF-8 by default.
Please note that as of now schema.rb does not preserve database-specific table creation options such as table encoding and collation. If you need these options to land in your full schema file, consider using the SQL schema format. The statements that you put into your migrations will get executed, though.
For Oracle, see the useful post on Coffee & Gems.
Running tests and storing fixtures
YAML is broken when you want to store your Unicode text back to a file: the text is going to be converted into a binary base64-encoded string (but will be perfectly usable when you read the YAML file back in). This is a known missing piece of functionality in Syck, the Ruby YAML parser-serializer. Funnily enough, Syck supports UTF-8 on the Perl side. To bypass this problem, just edit your fixtures by hand (versus dumping them from a live database).
Using helpers and ActionMailer
Most helper methods that work with text (such as truncate) will use Unicode-aware truncation if you did not change your $KCODE. It is now the policy that a view helper is not allowed to be multibyte-unsafe, so if you stumble upon helpers behaving erratically with Unicode strings, by all means file a bug on the Rails Trac.
Outgoing emails
ActionMailer should be set up for UTF-8 using its charset option.
Configuring your output correctly
It is your responsibility to set the charset of your pages to UTF-8. Rails will automatically send the proper charset parameter along with the Content-Type header (the parameter value will be derived from your $KCODE). However, it’s also considered good manners to add the charset declaration as a meta tag in your page. This is especially handy when a page of your site is saved and opened afterwards.
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
For Apache Users
Tell Apache to return ‘utf-8’ using the AddDefaultCharset directive either in your .htaccess or httpd.conf file. Both Rails and Apache will return a Content-Type, but Apache’s will be first and most user agents seem to take the first over all others.
Unicode in Ajax with older versions of Safari
Older versions of Safari interpret the incoming Ajax response as Latin-1 encoded unless it includes an XML prolog with a charset declaration.
One way to bypass that is to encode all high Unicode characters as hexadecimal numeric entities in an after_filter in your ApplicationController:
class ApplicationController < ActionController::Base
  after_filter :fix_unicode_for_safari

  # Convert all high Unicode characters to their corresponding entities
  def fix_unicode_for_safari
    if headers["Content-Type"] == "text/html; charset=utf-8" &&
       request.env['HTTP_USER_AGENT'].to_s.include?('AppleWebKit') &&
       request.xhr?
      response.body.gsub!(/([^\x00-\xa0])/u) { |s| "&#x%x;" % $1.unpack('U')[0] }
    end
  end
end
Please note that if you use this filter you can no longer send JavaScript along with your render, because it won’t be evaluated. Your best bet would be to advise your Safari users to upgrade.
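The conversion that the filter performs can be tried outside of a controller. The following is a minimal sketch of the same idea written with pack and unpack instead of a regular expression; the helper name encode_high_codepoints is made up for illustration, and a UTF-8 source file is assumed.

```ruby
# Hypothetical helper: replaces every codepoint above U+00A0 with a
# hexadecimal numeric character reference and leaves the rest untouched.
def encode_high_codepoints(text)
  text.unpack('U*').map { |cp|
    cp > 0xA0 ? "&#x%x;" % cp : [cp].pack('U')
  }.join
end

puts encode_high_codepoints("Café")   # Caf&#xe9;
puts encode_high_codepoints("plain")  # plain
```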
Another way to fix UTF-8 handling for AJAX in the old Safari is to prepend an XML prolog to your AJAX output (this will also prevent all the JS code that was sent from being run):
def add_item
  render :text => "<?xml version='1.0' encoding='utf-8'?>" + "<li>" + params[:newitem] + "</li>"
end
Configuring input
If you send your headers properly (and let the browser know that your site outputs pages in UTF8) all modern browsers will send you forms in UTF-8 automatically, so you don’t have to do any input conversions or define “accept-charset” on forms.
Converting between charsets
Use iconv.
require 'iconv'
# will convert from UTF8 to UTF16
Iconv.new('utf-16', 'utf-8').iconv(person.name)
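On Ruby 1.9 and later, iconv is not guaranteed to be available (it was dropped from the standard library in Ruby 2.0), and String#encode covers the same use. A minimal sketch, assuming such a Ruby version and a UTF-8 source file; it is an alternative to the Iconv call above, not a description of it:

```ruby
name = "Café"

utf16 = name.encode('UTF-16BE')   # big-endian UTF-16, no byte order mark
puts utf16.bytesize               # 8, two bytes per character here
puts utf16.unpack('n*').inspect   # [67, 97, 102, 233]

round_trip = utf16.encode('UTF-8')
puts round_trip == name           # true
```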
Configuring ActionWebService
Just let AWS know that you have iconv (otherwise the character detection in SOAP won’t work, and it will presume that each string is dirty and Base64-encode it):
require 'iconv'
and ensure that $KCODE is properly set to “UTF8”.
The safest way to convert existing data is to run your complete database dump through an encoding converter such as iconv, and to check that all the encoding declarations have been either removed or adjusted for your target encoding.
This section should still be extended to cover every step, from taking the database dump to the Rails action that converts the encoding of the rows.