[Lispweb] utf-8 manipulations

Pascal Bourguignon pjb at informatimago.com
Tue Jun 28 17:19:08 CDT 2005


Frédéric Gobry writes:
> (please tell me if this is the wrong mailing list, I don't feel like my
> question was welcome)
> 
> > Well, trivial ad-hoc algorithms rarely end in libraries...
> 
> For my information, which of general charset conversion or generic
> string substitution is trivial ad-hoc?
> 
> > But if you're asking for url clean, then instead of reading python
> > docs you'd better read standards:
> 
> I did not have to read python docs, I mostly develop in python these
> days.
> 
> > http://www.w3.org/International/O-URL-code.html
> 
> Thanks, but I don't wish to end up with "Fran%C3%A7ois" but with
> "francois".  The aim is to generate a first version of an url from a
> string, which can then be manually tweaked by the (non technical) user.
> 
> > 
> > (defun encode-for-uri (string)
> >   (flet ((hex (nibble) (+ nibble (if (< nibble 10) 48 55))))
> >     (let ((bytes (make-array (list (length string)) 
> >                              :element-type '(unsigned-byte 8)
> >                              :adjustable t :fill-pointer 0)))
> >       (loop for byte across (encode-string-to-utf-8 string) 
> >             do (if (< byte 128)
> >                  (vector-push-extend byte bytes)
> >                  (progn
> >                    (vector-push-extend 37 bytes) ; ASCII %
> >                    (vector-push-extend (hex (truncate byte 16)) bytes)
> >                    (vector-push-extend (hex (mod      byte 16)) bytes))))
> >       bytes)))
> > 
> > For encode-string-to-utf-8:
> > 
> > (defun ENCODE-STRING-TO-UTF-8 (string)
> >   #+clisp (EXT:CONVERT-STRING-TO-BYTES string CHARSET:UTF-8))
> > 
> 
> Thanks for the code, it's always useful to read nice samples.
> 
> > I suppose something similar exists in SBCL.  (I wonder why SBCL
> > developers did not use the de-facto standard of clisp for these
> > functions ;-)
> 
> As I said, I'd prefer sticking with sarge, so no unicode support in
> sbcl. I planned to do something like utf-8 -> latin-1 and remap some
> characters in order to remove some diacritics.

Well, clearly there are two steps:
1. UTF-8 decoding to unicode characters, then
2. folding unicode characters to ASCII characters.

To read UTF-8, there's a lot of lisp code that does it: you could take
the entrails of sbcl-0.9 or of clisp, and use them at user-code level
in a non-unicode Common Lisp.
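
For illustration, here is a minimal sketch of step 1 written from
scratch rather than lifted from sbcl or clisp.  The name decode-utf-8
is mine, not an existing function.  It returns integer code points
instead of characters, on the assumption that a non-unicode
implementation cannot represent every decoded character anyway.

```lisp
(defun decode-utf-8 (bytes)
  "Decode BYTES, a vector of (unsigned-byte 8), into a list of code points."
  (let ((n (length bytes))
        (i 0)
        (code-points '()))
    (flet ((continuation (index)
             ;; Return the 6 payload bits of the continuation byte at INDEX.
             (when (>= index n)
               (error "Truncated UTF-8 sequence at index ~D" index))
             (let ((byte (aref bytes index)))
               (unless (= (logand byte #xC0) #x80)
                 (error "Bad continuation byte ~2,'0X at index ~D" byte index))
               (logand byte #x3F))))
      (loop while (< i n)
            do (let ((byte (aref bytes i)))
                 (cond
                   ((< byte #x80)                 ; 0xxxxxxx: plain ASCII
                    (push byte code-points)
                    (incf i 1))
                   ((= (logand byte #xE0) #xC0)   ; 110xxxxx: 2-byte sequence
                    (push (logior (ash (logand byte #x1F) 6)
                                  (continuation (+ i 1)))
                          code-points)
                    (incf i 2))
                   ((= (logand byte #xF0) #xE0)   ; 1110xxxx: 3-byte sequence
                    (push (logior (ash (logand byte #x0F) 12)
                                  (ash (continuation (+ i 1)) 6)
                                  (continuation (+ i 2)))
                          code-points)
                    (incf i 3))
                   ((= (logand byte #xF8) #xF0)   ; 11110xxx: 4-byte sequence
                    (push (logior (ash (logand byte #x07) 18)
                                  (ash (continuation (+ i 1)) 12)
                                  (ash (continuation (+ i 2)) 6)
                                  (continuation (+ i 3)))
                          code-points)
                    (incf i 4))
                   (t
                    (error "Bad UTF-8 lead byte ~2,'0X at index ~D" byte i))))))
    (nreverse code-points)))
```

For example, the bytes of "François" in UTF-8,
#(70 114 97 110 195 167 111 105 115), decode to
(70 114 97 110 231 111 105 115), where 231 is the code of c with
cedilla.  This sketch does not reject overlong encodings or surrogate
code points, which a production decoder should.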


The other conversion is quite trivial.  In clisp it could be written
all automatically using char-name and regexp.  A less sophisticated
solution would be to explicitly list the foldings (but of course,
this would work only if the implementation supports these characters
in the source, which should be no problem for iso-8859-1, but might be
more problematic for other unicode characters):


  (defparameter +character-foldings+
    '( ("A" "ÀÁÂÃÄÅ") ("AE" "Æ") ("C" "Ç") ("E" "ÈÉÊË") ("I" "ÌÍÎÏ")
       ("ETH" "Ð") ("N" "Ñ") ("O" "ÒÓÔÕÖØ") ("U" "ÙÚÛÜ") ("Y" "Ý")
       ("TH" "Þ") ("ss" "ß") ("a" "àáâãäå") ("ae" "æ") ("c" "ç")
       ("e" "èéêë") ("i" "ìíîï") ("eth" "ð") ("n" "ñ") ("o" "òóôõöø")
       ("u" "ùúûü") ("y" "ýÿ") ("th" "þ")))

  (defun character-folding (character)
    (car (member (character character) +character-foldings+ 
                 :test (function position) :key (function second))))

  (defun character-fold (character)
    "
RETURN: A string containing the character without accent 
        (for accented characters), or a pure ASCII form of the character.
"
    (car (character-folding character)))

  (defun string-fold (string)
    (apply (function concatenate) 'string
           (map 'list (lambda (ch) (let ((conv (character-folding ch)))
                                (if conv
                                  (first conv)
                                  (list ch)))) string)))
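
The char-name route mentioned before the table could look roughly
like the sketch below.  It uses plain string operations instead of
regexps, and it assumes that char-name returns names shaped like
LATIN_SMALL_LETTER_E_WITH_ACUTE, as clisp does; other implementations
may return something else, or NIL.  The names fold-from-name and
fold-char-by-name are made up for this example.

```lisp
(defun fold-from-name (name)
  "Return the base letter(s) of a name like LATIN_SMALL_LETTER_E_WITH_ACUTE
as a string, or NIL if NAME does not follow that pattern."
  (flet ((strip (prefix)
           ;; Return the part of NAME after PREFIX, or NIL if PREFIX is absent.
           (let ((end (length prefix)))
             (when (and (>= (length name) end)
                        (string= prefix name :end2 end))
               (subseq name end)))))
    (let* ((small (strip "LATIN_SMALL_LETTER_"))
           (tail  (or small (strip "LATIN_CAPITAL_LETTER_"))))
      (when tail
        (let* ((cut  (search "_WITH_" tail))   ; NIL when there is no modifier
               (base (subseq tail 0 cut)))
          ;; Accept simple bases such as E or AE; skip names like SHARP_S.
          (when (and (plusp (length base))
                     (every (function alpha-char-p) base))
            (if small (string-downcase base) base)))))))

(defun fold-char-by-name (ch)
  "Fold CH using its name when possible, else return CH as a string."
  ;; ASSUMPTION: the implementation's CHAR-NAME follows the Unicode names.
  (let ((name (char-name ch)))
    (or (and name
             (fold-from-name (substitute #\_ #\Space (string-upcase name))))
        (string ch))))
```

With this, (fold-from-name "LATIN_SMALL_LETTER_E_WITH_ACUTE") gives
"e" and (fold-from-name "LATIN_CAPITAL_LETTER_AE") gives "AE".  Note
that it yields "thorn" and "eth" rather than the "th" and "eth" of the
table, and it gives up on sharp s, so some explicit entries are still
needed.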

[and use (substitute #\- #\space string) for spaces].
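
Put together, folding, downcasing and the space substitution could be
sketched as below.  This is a self-contained variant, not the code
above: the table is a tiny made-up subset keyed by character code, so
that the source file stays pure ASCII (which sidesteps the caveat
about non-ASCII characters in the source).  The names *code-foldings*,
fold-char and make-slug are hypothetical.

```lisp
(defparameter *code-foldings*
  ;; Each entry is (replacement code...); iso-8859-1 codes only.
  '(("a" 224 225 226 227 228 229)   ; a with grave .. a with ring
    ("c" 231)                       ; c with cedilla
    ("e" 232 233 234 235)))         ; e with grave .. e with diaeresis

(defun fold-char (ch)
  "Return the ASCII replacement for CH, or CH itself as a string."
  (let ((entry (find (char-code ch) *code-foldings*
                     :test (lambda (code entry)
                             (member code (rest entry))))))
    (if entry
        (first entry)
        (string ch))))

(defun make-slug (string)
  "Fold accents, downcase, and replace spaces with hyphens."
  (substitute #\- #\Space
              (string-downcase
               (with-output-to-string (out)
                 (loop for ch across string
                       do (write-string (fold-char ch) out))))))
```

Given "François Dupont" (with the cedilla as character code 231), this
returns "francois-dupont".  Writing to a string stream also avoids
applying concatenate to one argument per character, which could run
into call-arguments-limit on long strings.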


You can also see an example of generating simpler names in my
~/bin/clean-path clisp script.

cvs -z3 -d :pserver:anonymous@cvs.informatimago.com:/usr/local/cvs/public/chrooted-cvs/cvs co bin


> Do you think it would be wiser to move to another CL implementation?
> I've not yet built strong affective links with any of them, having tried
> a bit cmucl and a bit more sbcl. So, clisp would be a better choice?

clisp has strong support for encodings and unicode.

sbcl's support for unicode is more recent, but if you like sbcl, I see
no reason not to upgrade to sbcl-0.9.



-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

This is a signature virus.  Add me to your signature and help me to live



