[Lispweb] utf-8 manipulations

Pascal Bourguignon pjb at informatimago.com
Tue Jun 28 10:31:26 CDT 2005


Frédéric Gobry writes:
> Hi,
> 
> I'm working on my first Lisp-based web application (I'm a Lisp newcomer
> too :-)). It's based on Araneida, and runs in SBCL (0.8.16 from Debian
> sarge, no Unicode support).
> 
> In my app, users can provide names for certain pages, say:
> 
>  Page accentuée
>  
> which should produce a URL that matches the name as
> closely as possible but remains readable, say:
> 
>  page-accentuee
> 
> I get the user input as a UTF-8 string. Is there a library to help me
> process the raw name into the URL-clean version, or should I write
> both a UTF-8 parser and the equivalent of Python's translate() method?
> 
> Any general advice on that topic is welcome, as I don't have much
> experience with lisp.

Well, trivial ad-hoc algorithms rarely end up in libraries...
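
For the readable form itself (Page accentuée -> page-accentuee), such an
ad-hoc algorithm might look like the sketch below. It is only an
illustration: slugify and *plain-letters* are names made up here, the table
covers just a handful of French letters, and it assumes an implementation
whose characters are Unicode code points (which the SBCL 0.8.16 mentioned
above is not) and whose char-downcase handles accented capitals.

```lisp
;; Hypothetical table: code point of an accented letter -> plain ASCII letter.
;; Code points are used so the source file itself stays pure ASCII.
(defparameter *plain-letters*
  '((224 . #\a) (226 . #\a) (231 . #\c)
    (232 . #\e) (233 . #\e) (234 . #\e) (235 . #\e)
    (238 . #\i) (239 . #\i) (244 . #\o)
    (249 . #\u) (251 . #\u) (252 . #\u)))

(defun slugify (name)
  "Lower-case NAME, strip the accents listed in *PLAIN-LETTERS*, and join
the remaining runs of ASCII letters and digits with single dashes."
  (with-output-to-string (out)
    (let ((started nil)   ; true once a character has been written
          (gap nil))      ; true when skipped characters await a dash
      (loop for ch across name
            for low = (char-downcase ch)
            for plain = (or (cdr (assoc (char-code low) *plain-letters*)) low)
            do (if (and (< (char-code plain) 128) (alphanumericp plain))
                   (progn
                     (when (and started gap) (write-char #\- out))
                     (write-char plain out)
                     (setf started t gap nil))
                   (setf gap t))))))
```

Under those assumptions, (slugify "Page accentuée") should give
"page-accentuee"; letters missing from the table are simply dropped.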

But if what you want is URL-clean names, then instead of reading the
Python docs you'd better read the standards:

http://www.w3.org/International/O-URL-code.html

So, to get a vector of ASCII codes that is usable in a URI (the UTF-8
encoding of a string, with the non-ASCII bytes written as %XX escapes),
you'd do:

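(defun encode-for-uri (string)
  "Return a vector of ASCII codes: the UTF-8 bytes of STRING, with every
byte outside the RFC 3986 unreserved set escaped as %XX. Meant for a
single URI component such as a page name, not for a whole URI."
  (flet ((hex (nibble) (+ nibble (if (< nibble 10) 48 55))) ; 0-9 or A-F
         (unreserved-p (byte)
           (or (<= 48 byte 57) (<= 65 byte 90) (<= 97 byte 122) ; 0-9 A-Z a-z
               (member byte '(45 46 95 126)))))                 ; - . _ ~
    (let ((bytes (make-array (length string)
                             :element-type '(unsigned-byte 8)
                             :adjustable t :fill-pointer 0)))
      (loop for byte across (encode-string-to-utf-8 string)
            do (if (unreserved-p byte)
                   (vector-push-extend byte bytes)
                   (progn
                     (vector-push-extend 37 bytes) ; ASCII %
                     (vector-push-extend (hex (truncate byte 16)) bytes)
                     (vector-push-extend (hex (mod      byte 16)) bytes))))
      bytes)))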

For encode-string-to-utf-8:

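(defun encode-string-to-utf-8 (string)
  ;; Returns the UTF-8 bytes of STRING as a vector of (unsigned-byte 8).
  #+clisp (ext:convert-string-to-bytes string charset:utf-8)
  #-clisp (error "encode-string-to-utf-8 is not implemented for this Lisp."))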

I suppose something similar exists in SBCL.  (I wonder why SBCL
developers did not use the de-facto standard of clisp for these
functions ;-)
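
For what it's worth, Unicode-enabled SBCL releases (later than the 0.8.16
mentioned above) export sb-ext:string-to-octets, so a variant with an SBCL
branch might look like this. Treat it as a sketch: only the CLISP and SBCL
branches are filled in, and any other implementation just signals an error.

```lisp
(defun encode-string-to-utf-8 (string)
  "Return the UTF-8 encoding of STRING as a vector of octets."
  #+clisp (ext:convert-string-to-bytes string charset:utf-8)
  ;; Assumes an SBCL built with Unicode support.
  #+sbcl  (sb-ext:string-to-octets string :external-format :utf-8)
  #-(or clisp sbcl)
  (error "encode-string-to-utf-8: no UTF-8 encoder known for this Lisp."))
```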


[14]> (encode-for-uri "François")
#(70 114 97 110 37 67 51 37 65 55 111 105 115)
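
To see the result as text rather than as a vector of codes, the codes can
be mapped back to characters. A small sketch, assuming (as holds on the
common implementations, though the standard does not promise it) that
code-char maps ASCII codes to the corresponding characters;
ascii-codes-to-string is a name made up for this example:

```lisp
(defun ascii-codes-to-string (codes)
  "Turn a vector of ASCII codes, e.g. the result of ENCODE-FOR-URI,
into a string."
  (map 'string #'code-char codes))

(ascii-codes-to-string #(70 114 97 110 37 67 51 37 65 55 111 105 115))
;; => "Fran%C3%A7ois"
```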



-- 
__Pascal Bourguignon__                     http://www.informatimago.com/


