Hi all,
Please forgive the previous posting without data. I've been doing some get-out-the-vote work in California's famous democratic election, still scheduled for October 7th. As part of this work I have been using Poly/ML to tokenize voter record strings. I thought that we could not only surpass Florida, but even Canada's overnight hand-counted vote. However, I am beginning to have my doubts.
To my shock, some tokenizations of strings with repeated token separators dropped fields which I believe should have come back as empty strings. Are my expectations reasonable, or have I subtly misinterpreted the use of String.tokens?
Note that the field that would convey whether Miss Maxwell's male counterpart was a Sr, Jr, II, III, XIV, etc., has been dropped, making the results unsuitable for reading into an RDBMS, for example. Also, at the end, the token "General" is followed directly by the token "\n", when I believe there should be several empty strings in between.
Thanks, Byron Hale
Sample input/output follows, with data altered to protect the voter. (This time with data.:)
val voter = "Miss\tMaxwell\tMaria\tElaine\t\t2323 Fremont St \tSanta Clara\tCA\t95050\t2323 Fremont St \tSanta Clara CA 95050\t(408)555-3323\tGeneral\t\t\t\t\t\t\n";
String.tokens(fn ch => (#"\t" = ch)) voter;
val it = ["Miss", "Maxwell", "Maria", "Elaine", "2323 Fremont St ", "Santa Clara", "CA", "95050", "2323 Fremont St ", "Santa Clara CA 95050", "(408)555-3323", "General", "\n"] : String.string list
Byron,
I think you want String.fields, not String.tokens. From the SML Basis document I have:
-----
tokens p s
fields p s
These functions return a list of tokens or fields, respectively, derived from s from left to right. A token is a non-empty maximal substring of s not containing any delimiter. A field is a (possibly empty) maximal substring of s not containing any delimiter. In both cases, a delimiter is a character satisfying the predicate p.
Two tokens may be separated by more than one delimiter, whereas two fields are separated by exactly one delimiter. For example, if the only delimiter is the character #"|", then the string "|abc||def" contains two tokens "abc" and "def", whereas it contains the four fields "", "abc", "" and "def".
-----
I gather John Reppy and Emden Gansner's book on the SML Basis is getting close to being published by CUP, so it should be easier to check on these things.
Regards, David.
P.S. That's certainly an original application of Poly/ML.
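To make the difference concrete, here is a minimal sketch that runs both functions on the example string from the Basis text and on a shortened record. The names isBar, isTab and short are only for illustration, and the short record is made up rather than taken from the data above.

```sml
(* Delimiter predicates: isBar for the Basis example, isTab as in the post. *)
val isBar = fn ch => ch = #"|";
val isTab = fn ch => ch = #"\t";

(* The Basis example string. *)
val toks = String.tokens isBar "|abc||def";   (* ["abc", "def"] *)
val flds = String.fields isBar "|abc||def";   (* ["", "abc", "", "def"] *)

(* A made-up short record: an empty field after the surname and an
   empty field before the trailing newline. *)
val short = "Miss\tMaxwell\t\tGeneral\t\t\n";

(* tokens collapses runs of tabs, so the empty fields disappear. *)
val shortToks = String.tokens isTab short;
(* ["Miss", "Maxwell", "General", "\n"] *)

(* fields keeps one entry per column, including the empty ones. *)
val shortFlds = String.fields isTab short;
(* ["Miss", "Maxwell", "", "General", "", "\n"] *)
```

With String.fields the last element is still "\n", so the line terminator would need to be stripped separately before loading the columns anywhere.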
----- Original Message -----
From: "Byron Hale" byron.hale@einfo.com
To: polyml@inf.ed.ac.uk
Sent: Wednesday, September 10, 2003 9:05 AM
Subject: [polyml] Apparent bug in String.tokenize
polyml mailing list polyml@inf.ed.ac.uk http://lists.inf.ed.ac.uk/mailman/listinfo/polyml