It is possible to emit Unicode (say, UTF-8 encoded) strings in a Poly/ML program by emitting the appropriate sequence of bytes:
Poly/ML 5.2 Release
print "\u00e2\u0088\u0080\n";
? val it = () : unit
But as far as Poly is concerned, this is actually a sequence of four characters, not two (the universal quantifier and the newline).
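To make the point concrete, here is a small Poly/ML session sketch (assumed behaviour of the standard Basis functions) showing that the string above is four bytes, not two characters:

```sml
(* Poly/ML treats the literal as four separate 8-bit characters: *)
val s = "\u00e2\u0088\u0080\n";
val n = String.size s;                      (* 4, not 2 *)
val bytes = map Char.ord (String.explode s);
(* [226, 136, 128, 10] - the UTF-8 encoding of U+2200 followed by "\n" *)
```

`String.size` and `String.explode` operate on bytes, so the three-byte UTF-8 encoding of the universal quantifier counts as three characters.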
Are there any plans to implement some sort of sensible WideChar signature? (The existing WideChar signature in the Basis is not really a good base on which to build good support here.)
Michael.
Hi,
On Tue, Aug 19, 2008 at 02:18:36PM +1000, Michael Norrish wrote:
Are there any plans to implement some sort of sensible WideChar signature? (The existing WideChar signature in the Basis is not really a good base on which to build good support here.)
there is a Unicode library (based on the Word32 type) available as part of fxp [1] (an XML parser) that works with various SML systems (I use it with current releases of Poly/ML, SML/NJ, and MLton).
Maybe having a look into the fxp code (especially src/Unicode) could help you - but of course, this depends on the problem you need to solve. fxp only implements the parts needed for parsing UTF-encoded XML documents. I have to admit that I have not looked too deeply into the fxp source myself (only enough to get it running with the above-mentioned SML systems), but I am using it as part of our UML toolchain and it works well.
[1] http://www2.informatik.tu-muenchen.de/~berlea/Fxp/
Achim
On Tue, 19 Aug 2008, Achim D. Brucker wrote:
On Tue, Aug 19, 2008 at 02:18:36PM +1000, Michael Norrish wrote:
Are there any plans to implement some sort of sensible WideChar signature? (The existing WideChar signature in the Basis is not really a good base on which to build good support here.)
there is a Unicode library (based on the Word32 type) available as part of fxp [1] (an XML parser) that works with various SML systems (I use it with current releases of Poly/ML, SML/NJ, and MLton).
Note that Word32 in Poly/ML is rather inefficient, being based on tuples of two smaller word types (even on 64-bit platforms).
In Isabelle we have managed to ignore character encodings beyond plain ASCII, using the <forall> symbol notation that you certainly know of. Back in 1997 we had taken the window of opportunity *not* to convert to the then newly introduced char type, i.e. our view of exploded strings is still that of lists of (small) strings, either single chars "a" or named symbols "<forall>". Luckily a singleton string in Poly/ML is just an unboxed integer, i.e. more efficient than Word32 or Int32. In other words, the original SML90 standard essentially already had chars of arbitrary width.
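A sketch of that exploded-string view (this is an illustration, not Isabelle's actual code; for simplicity it assumes "<" always opens a named symbol):

```sml
(* Explode a string into "symbols": single-character strings or
   named symbols such as "<forall>". *)
fun symExplode s =
  let
    fun name (#">" :: cs) acc = (String.implode (rev acc), cs)
      | name (c :: cs) acc = name cs (c :: acc)
      | name [] acc = (String.implode (rev acc), [])  (* tolerate unterminated *)
    fun scan (#"<" :: cs) =
          let val (n, rest) = name cs []
          in ("<" ^ n ^ ">") :: scan rest end
      | scan (c :: cs) = String.str c :: scan cs
      | scan [] = []
  in scan (String.explode s) end;

val syms = symExplode "<forall>x. P x";
(* ["<forall>", "x", ".", " ", "P", " ", "x"] *)
```

Each element of the result is an ordinary string, so singleton strings and named symbols are handled uniformly.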
The user interface can convert to whatever encoding it needs to render text. For example, the JVM uses UTF-16 with odd "surrogate characters" to represent Unicode characters outside of the "basic multilingual plane", such as blackboard-bold B.
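For reference, the surrogate-pair arithmetic is simple to express; a minimal sketch (names are my own):

```sml
(* UTF-16 surrogate pair for a code point beyond the BMP. *)
fun surrogates cp =
  let val v = cp - 0x10000
  in (0xD800 + v div 0x400, 0xDC00 + v mod 0x400) end;

(* Blackboard-bold B is U+1D539: *)
val (hi, lo) = surrogates 0x1D539;   (* (0xD835, 0xDD39) *)
```

The code point minus 0x10000 leaves 20 bits, split into two 10-bit halves carried by the high (D800-DBFF) and low (DC00-DFFF) surrogates.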
Recently we have introduced one minor change to this encoding-agnostic approach in Isabelle: one line of ML to count character positions, which ignores anything in the range 128..192. This fits well with UTF-8 and also ISO Latin-1 (ignoring the special punctuation in 160..192).
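Something like the following captures the idea (a guess at the shape of that one-liner, not Isabelle's actual source; the ranges are taken as having exclusive upper bounds):

```sml
(* Count positions, skipping bytes in 128..192, i.e. UTF-8
   continuation bytes. *)
fun symbolLength s =
  length (List.filter (fn c => Char.ord c < 128 orelse Char.ord c >= 192)
                      (String.explode s));

val n = symbolLength "\u00e2\u0088\u0080x";   (* 2: U+2200 and "x" *)
```

In UTF-8 every continuation byte falls in that range, so each multi-byte character contributes exactly one position.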
Makarius
2008/8/19 Michael Norrish Michael.Norrish@nicta.com.au:
Are there any plans to implement some sort of sensible WideChar signature? (The existing WideChar signature in the Basis is not really a good base on which to build good support here.) Michael.
There was a long discussion on the mlton list in November 2005, seemingly leading to no conclusion:
http://mlton.org/pipermail/mlton/2005-November/thread.html
See the "[Sml-basis-discuss] Unicode and WideChar support" and the "Unicode / WideChar" threads.
Another starting point could be Aaron Turon's ml-ulex lexer, which handles Unicode input.
- Gergely
Gergely Buday wrote:
There was a long discussion on the mlton list in November 2005, seemingly leading to no conclusion:
See the "[Sml-basis-discuss] Unicode and WideChar support" and the "Unicode / WideChar" threads.
Yes, I remember that thread also. It was a bit disheartening.
For the moment, I am "handling" UTF-8 by using the %full option with my ml-lex lexer and treating characters in the range \128-\255 as valid constituents of "symbols". This is pretty weak, but it makes things look a good deal nicer with minimal effort. If my users don't set out to break it, it should behave reasonably too.
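In ml-lex terms, that approach might look roughly like this (a hypothetical fragment, with illustrative token and class names, not the actual HOL lexer):

```
%%
%full
symchar = [-!%&$#+/:<=>?@~|*\\^\128-\255];
%%
{symchar}+ => (Tokens.SYMBOL (yytext, yypos, yypos + size yytext));
```

%full widens the character set from 7-bit ASCII to all 256 byte values, so the \128-\255 range can appear in a character class; any UTF-8 continuation or lead byte then simply extends the current symbol token.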
Michael.