Thursday, September 04, 2008

JRuby's Unicode Regular Expression

In my previous post, I wrote that JRuby's regular expression engine, Joni, didn't handle Unicode regular expressions the way Ruby's does, even though both engines are Oniguruma. The truth is that Joni deliberately turns off the flag that enables the Unicode regular expression syntax described in Oniguruma's document, "since unicode tables would make jruby distribution a bit more boilerplate (lopex)". Lopex, who implemented Joni, commented on my post. I followed what lopex wrote and got Unicode regular expressions running on JRuby 1.1.4 as if it were Ruby 1.9.

Here is what I did to enable the Unicode flag and get correct output.

1. Check out jcodings from http://svn.codehaus.org/jruby/jcodings/ because Joni needs it.
2. cd jcodings; mvn clean install
3. Check out joni-1_0 from http://svn.codehaus.org/jruby/joni/branches/joni-1_0/. (Exactly this version is needed.)
4. cd joni-1_0
5. Edit src/org/joni/Config.java and set USE_UNICODE_PROPERTIES to true.
6. mvn clean package
7. cp target/joni.jar <somewhere>/jruby-1.1.4/build_lib/.
8. cd <somewhere>/jruby-1.1.4
9. ant clean jar

With that, I had a customized build of JRuby that should support Unicode regular expressions. When I ran this UTF-8 encoded Ruby script,


p 'abcアイウαβγ'.scan(/[a-z]/)
p "abcアイウαβγ".scan(/\p{Katakana}/u)
print "abcアイウαβγ".scan(/\p{Katakana}/u), "\n\n"
p "abcアイウαβγ".scan(/\p{^Greek}/u)
print "abcアイウαβγ".scan(/\p{^Greek}/u), "\n\n"
p "abcアイウαβγ".scan(/[\u0370-\u30FF]/u)
print "abcアイウαβγ".scan(/[\u0370-\u30FF]/u), "\n"

$KCODE="utf8"
p "abcアイウαβγ".scan(/\p{Greek}/)


it printed out:


["a", "b", "c"]
["\343\202\242", "\343\202\244", "\343\202\246"]
アイウ

["a", "b", "c", "\343\202\242", "\343\202\244", "\343\202\246"]
abcアイウ

["a", "b", "c"]
abc
["α", "β", "γ"]


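The Unicode codepoint range from Greek to Katakana, [\u0370-\u30FF], didn't work: it matched "a", "b", "c" instead of the Katakana and Greek characters, presumably because \u is not treated as a codepoint escape in this mode. Everything else was correct. (Ruby 1.9 showed readable characters with both p and print, but JRuby's p showed escaped bytes.)
Of course, I got the error "unicode_regex.rb:2: invalid character property name {Katakana}: /\p{Katakana}/u (RegexpError)" when I ran this script on the regular JRuby 1.1.4.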
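A quick way to tell which kind of build is running is to try compiling a property pattern and catch the error. This is only a minimal sketch: unicode_property_supported? is a helper name I made up for illustration, not something JRuby or Joni provides, and I am assuming that Regexp.new raises the same RegexpError as the regexp literal does.

```ruby
# Hypothetical helper (not part of JRuby or Joni): returns true when the
# running interpreter's regexp engine can compile \p{name}, and false when
# compilation raises RegexpError.
def unicode_property_supported?(name)
  Regexp.new("\\p{#{name}}")
  true
rescue RegexpError
  false
end

%w[Katakana Greek].each do |name|
  puts "#{name}: #{unicode_property_supported?(name)}"
end
```

On a build without the Unicode property tables, I would expect both lines to report false.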

Following lopex's comment, I wrote this Ruby script in EUC-JP encoding and ran it on regular JRuby 1.1.4.


p "abcアイウαβγ".scan(/\p{Katakana}/e)
print "abcアイウαβγ".scan(/\p{Katakana}/e),"\n"
print "abcアイウαβγ".scan(/\p{Greek}/e),"\n"


Naturally, the last line caused the error "unicode_regexp_eucjp.rb:6: invalid character property name {Greek}: /\p{Greek}/e (RegexpError)" whatever encoding option the regular expression had. However, the first two lines worked and output:


["\245\242", "\245\244", "\245\246"]
アイウ
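The escaped strings that p prints are just the raw bytes of the matched characters. As a rough check, the sketch below rebuilds the first match from each output and decodes it. It assumes Ruby 1.9 or later, since it relies on force_encoding and encode; the variable names are mine.

```ruby
# encoding: UTF-8

# Bytes copied from the escaped outputs, written as octal integers.
utf8_bytes  = [0343, 0202, 0242]  # "\343\202\242" from the UTF-8 script
eucjp_bytes = [0245, 0242]        # "\245\242" from the EUC-JP script

# Tag the raw bytes with the encoding they were produced in.
utf8_char  = utf8_bytes.pack('C*').force_encoding('UTF-8')
eucjp_char = eucjp_bytes.pack('C*').force_encoding('EUC-JP')

# Transcode the EUC-JP character to UTF-8 so the two can be compared.
eucjp_as_utf8 = eucjp_char.encode('UTF-8')

puts utf8_char
puts eucjp_as_utf8
puts utf8_char == eucjp_as_utf8
```

Both byte sequences decode to the same Katakana character, so the matches were right even though p displayed them as bytes.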


JRuby already has the ability to handle Unicode regular expressions the Ruby way; the feature is just turned off. Since Unicode regular expressions are useful for speakers of non-ASCII languages, I hope this feature will be turned on in the near future.

Wednesday, September 03, 2008

Ruby 1.9's Unicode Regular Expression

Ruby 1.9 has greatly improved its M17N features, and Unicode regular expressions are among the most improved. Ruby 1.9 uses Oniguruma as its regular expression engine, which allows regular expressions to match by Unicode codepoints or property names, as described in Oniguruma's document at http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt.

When I tested some Unicode regular expressions on ruby 1.9.0 (2008-08-26 revision 18849) [i386-darwin9.4.0], they were processed correctly. For example, this Ruby script,

# encoding: UTF-8

p 'abcアイウαβγ'.scan(/[a-z]/) # lower case alphabetical characters
p 'abcアイウαβγ'.scan(/\p{Katakana}/) # Katakana characters
p 'abcアイウαβγ'.scan(/\p{^Greek}/) # negation: other than Greek characters
p 'abcアイウαβγ'.scan(/[\u0370-\u30FF]/) # unicode codepoints from Greek to Katakana blocks

outputs this:

["a", "b", "c"]
["ア", "イ", "ウ"]
["a", "b", "c", "ア", "イ", "ウ"]
["ア", "イ", "ウ", "α", "β", "γ"]
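To see why the last pattern picks up both the Katakana and the Greek characters, it helps to look at the codepoints themselves. The sketch below (Ruby 1.9 or later; the variable names are mine) prints each character's codepoint and checks it against the same range used in the regular expression.

```ruby
# encoding: UTF-8

text  = 'abcアイウαβγ'
range = 0x0370..0x30FF  # same bounds as /[\u0370-\u30FF]/

# Show each character with its codepoint and whether the range covers it.
text.each_char do |c|
  status = range.cover?(c.ord) ? 'in range' : 'outside'
  printf("%s U+%04X %s\n", c, c.ord, status)
end

# Collect the characters the range covers, in string order.
in_range = text.each_char.select { |c| range.cover?(c.ord) }
p in_range
```

The ASCII letters sit below U+0370, while the Greek letters (U+03B1 to U+03B3) and the Katakana (U+30A2 to U+30A6) both fall inside the range.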


The first line of the Ruby script is a magic comment, which specifies the encoding of the script file. We can use any one of

# coding: UTF-8
# encoding: UTF-8
# -*- coding: UTF-8 -*-
# vim:set fileencoding=UTF-8:

to tell Ruby what encoding the script file uses. If the file starts with a shebang (#!), the magic comment goes on the second line of the file. (I found this information at http://i.loveruby.net/svn/rubydoc/doctree/trunk/refm/doc/spec/m17n.rd, which is written in Japanese; I don't know where to find an English version of this document.)
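The placement rule can be sketched in a few lines of Ruby. This is only my simplified approximation of what the interpreter does, not its actual implementation: magic_comment_encoding and MAGIC_COMMENT are names I made up, and the pattern is just loose enough to accept the four forms listed above.

```ruby
# Hypothetical pattern: "coding", then ":" or "=", then the encoding name.
# It also matches inside "encoding" and "fileencoding".
MAGIC_COMMENT = /coding[:=]\s*([\w.-]+)/

# Hypothetical helper: returns the encoding name declared by a magic
# comment, or nil. Only the first line is examined, or the second line
# when the first one is a shebang.
def magic_comment_encoding(source)
  lines = source.lines.first(2)
  candidate = lines[0].to_s.start_with?('#!') ? lines[1] : lines[0]
  return nil unless candidate && candidate.start_with?('#')
  candidate[MAGIC_COMMENT, 1]
end

puts magic_comment_encoding("# -*- coding: UTF-8 -*-\np 1\n")
puts magic_comment_encoding("#!/usr/bin/env ruby\n# encoding: EUC-JP\np 1\n")
```

A comment on any later line is ignored here, which matches the rule that the magic comment must come first (or right after the shebang).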

JRuby 1.1.4 was released recently and has started to support Ruby 1.9; however, Unicode regular expressions are not in the list of supported features. When I tried to run this script with the --1.9 flag on JRuby 1.1.4, I got the error "invalid character property name {Katakana}: /\p{Katakana}/ (RegexpError)". Like Ruby 1.9, JRuby uses Oniguruma as its regular expression engine, but its Java implementation, Joni, doesn't seem to work exactly the same as Ruby's.