|
Kernel Cousin Links
Kernel Cousins Home Page |
Guile Links
The Guile Homepage |
This week I'd like to introduce you to Guile News, a kernel cousin representing the Free Software Foundation's Guile programming language. If guile is something new to you, please take a look at the links section for more information.
forcer asked about elegent implementation of a concept hinted at in the docs:
Lately, i was wondering why there was the need for (explicitely)
shared substrings at all.
As hinted in guile-ref, a copy-on-write strategy might make
make-shared-substring useless, since normal substring could
return a shared substring which will be copied on the first
write. This would add a simple check on each string mutation
function, but could increase performance drastically otherwise
and retain a very simple interface.
I liked the idea and wanted to implement it in guile. Taking a
look at strings.[ch] (and symbol.h), i noticed that the length of
a string is encoded in it's SCM object, while the SCM obj's CDR
has to point to the char* directly.
If i wanted to change this, i had to change the implementation of
symbols as well (and probably take in alot of incompatibilities
with other parts of guile that use strings)
I assume that such pointer plays - negative indices to store
information, etc. - are common in guile. One question arises:
why?
Is (SCM_CAR(val)>>8) that much faster than SCM_CDR(val)->length?
Is SCM_CDR(val)[-1] that much faster than SCM_CDR(val)->something?
Well, my thoughts about the copy-on-write idea stumbled on one
problem.
A solution to this problem would be to have (substring) mark the
string it's applied to to be a "substring" as well. The problem
then is that:
(define str1 "foo") ; a string is allocated as usual
(define str2 (substring str1 0 1)) ; str2 is now a shared
; substring of str1
(string-set! str1 0 #\b) ; since str1 is not a shared
; substring, it's not copied,
; and thus str2 => "b"
(define str1 "foo")
(define str2 (substring str1 2 3)) ; str1 and str2 are marked as
; shared substrings
(string-set! str1 0 #\b) ; str1 now points to a newly
; malloc'd copy of the string,
; and str2 points to the
; original string[2]
(string-set! str2 0 #\b) ; str2 is now a copy of the
; string as well, and there's
; no pointer to the original
; string left :]
He added that you could use a reference count to solve this, as he didn't s see any way that a circular dependency could arise.
Chistian Lynbech mentioned that having shared substrings was important, as they provide a good mechanism for communications (i.e., all clients see updates to the strings).
Mikael Djurfeldt pointed out that scm_make_shared_substring already gives Guile e an implementation of sharestrings. But that it is not consistently used. He goes on:
To get the simplest possible user interface, all code should produce
shared strings whenever possible, and coerce them into strings when
necessary. All traces of shared substrings in the interface should be
removed.
[I understand Christian Lynbech's point, but the vast majority of
users want to have a simple system. Those who want to use shared
substrings for communication can implement a specialized data
structure for this purpose.]
Mikael also mentions SCM_COERCE_SUBSTR, and recommends that any operation which mutates a string call it. (Look for it in 'symbols.h'.) At the end of his message, he asks if forcer's second code example doesn't do what it is really intended to.
In a later message Mikael added that forcers original concept of substrings marking the string to which they're applied would have to be done carefully. Object headers could cause 'problems' changing addresses on the heap.
Christian Lynbech responded, asking if it would be difficult to provide both a general implementation whereby all substrings would share, but retaining the make-shared-substring functionality.
Mikael replied with the thought that while this would probably be easy
to do, it would probably not be the right thing.
I think it's important to keep Guile's user
interface simple. I don't think the gain of enabling the user
to manipulate with sharedness of strings balances the cost of the
increased conceptual complexity added to the interface.
There are probably extremely few cases when you can write a good
program using shared substrings. My guess is that in most candidate
cases it's a question of "tricky" programming which rather is a kind
of misuse of data structures, exploiting the properties of shared
substrings to achieve something which isn't "stringy" at all.
forcer mentioned that his second code block did do the right thing, but that it seemed to need reference counting to ensure that objects were freed at the right time. Mikael responded saying that once str2 is converted into a real string, GC would no longer find a reference to the original object and could thus free it.
Mikael went on to say that the hard part would be the mutation of str1 into a shared string object on the fly. It seems that if any shared substrings has been created from str1 before this operation, they will end up having an "original string pointer" pointing to a shared string object. Maybe this could be solved by mutating str1 into a shared string object the first time a shared object is created from it? Another idea would be to make it possible for shared string objects to point to other shared string objects: Operations on the shared string would follow the pointers all the way to the original string and possibly mutate the object so that this only needs to be done once. Though he notes that neither of these is terribly efficient.
Greg Harvey notes that ref counting isn't a good strategy for dealing with small strings, as you need to scan the whole heap to find strings pointing in. His recommendation is that you track all of this with a structure (ideally containing the string). When a function modifis the strin, it must also check for shared substrings pointing to it then do any necessary copying. One thing you'd want to look out for here is to copy all very small strings (<=4 or 8 bytes of string data... aren't those already kept on the cdr, tho, or did I dream that?), rather than creating a shared string.
Finally, Ray Lehtiniemi gave us a pointer to a copy-on-write shared
memory implementation he'd done for the gimp.
it only shares whole blocks of memory and might be a bit heavyweight,
but you are welcome to look it over if you want to see the approach i
took. maybe some ideas might be applicable.
the code and design docs are in
ftp.gimp.org/pub/users/rayl/cowmgr-0.0.10.tar.gz
under the memmgmt subdir
Mikael Djurfeldt wrote that there are three fundemental operations used frequently in dealing with lists and vectors:
1. Generation
The values from a generating procedure forms a new sequence.
+------+
| PROC |
+------+
/ / \
_/_/ \_
|_|_| . . . |_|
Examples: Generating a list of the numbers 1 to 10.
Generating a vector of random numbers in [0,1].
2. Mapping
A procedure maps each element in the old sequence to an element in
the new sequence.
_ _ _
|_|_| . . . |_|
\ \ /
\ \ /
+------+
| PROC |
+------+
/ / \
_/_/ \_
|_|_| . . . |_|
Example: Mapping a list of strings to a list of their lengths.
3. Combination
A combinator procedure combines elements from a sequence.
_ _ _
|_|_| . . . |_|
\ \ /
\ \ /
+------+
| PROC |
+------+
Examples: Summing a list of numbers.
Computing the maximum in a vector of numbers.
Mikael goes on to explain that because: 1) It is easier to implement and understand algorithms when the loop constructs don't need to be implemented, and 2) Greater efficiency can be achieved by doing the looping "under the cowling", Guile should support these operations well. (map) provides good support for mapping lists to list, but additional operations are needed.
1. Generation
(map-index N PROC) --> LIST
(vector-map-index N PROC) --> VECTOR
where
LIST = ((PROC 0) (PROC 1) ... (PROC N-1))
VECTOR = similarly
The motivation for the argument order is to improve code layout:
(map-index 10
(lambda (i)
(+ 1 i)))
2. Mapping
(vector-map PROC VECTOR1) --> VECTOR2
(vector version of map)
3. Combination
(foldl PROC INIT LIST) --> COMBINED
(foldr PROC INIT LIST) --> COMBINED
(vector-foldl PROC INIT VECTOR) --> COMBINED
(vector-foldr PROC INIT VECTOR) --> COMBINED
where foldl yields
COMBINED = (PROC En-1
(PROC En-2
...
(PROC E0 INIT)))
and foldr
COMBINED = (PROC E0
(PROC E1
...
(PROC En-1 INIT)))
(The names comes from the fact that foldl starts from the left and
goes to the right, and vice versa for foldr.)
Mikael later added a recommendation that (map-index N
PROC) could be replaced/supplemented with (for-map FROM
TO STEPPER PROC).
This brought about questions and comments about implementation.
First, Kalle Olavi Niemitalo asked FROM and TO needed to be exact, and
whether a similar function would be created for vectors (possibly
allowing multiple calls to STEPPER before calling PROC). Klaus
Schilling thought that Olin Shiver's iota and stock map procedure were
good candidates for use here. Valentin Kamyshenko asked about CL
(loop ...) constructs
(loop for i from 1 to (compute-top-value) ; first clause
while (not (unacceptable i)) ; second clause
collect (square i) ; third clause
do (format t "Working on ~D now" i) ; fourth clause
when (evenp i) ; fifth clause
do (format t "~D is a non-odd number" i)
finally (format t "About to exit!")) ; sixth clause
Lars Arvestad also brought up Olin Shiver's work, recommending that we take look at Olin's list library srfi.schemers.org/srfi-1/srfi-1.html for details. Lars felt that it shouldn't be difficult to implement this library for vectors as well.
One problem Lars mentioned was that the library relies on recieve-values pairs. He felt that a macro implementation of call-with values could be made to work better than a rewrite using call-with-values.
Russ McManus and Greg Harvey also recommended SRFI, with Greg adding:
What might be even cooler would be a generic seq module, which provides all of this stuff either vectors or lists; so, you'd have (seq-for-each proc seqs).
Chris Bitmead asked if anyone had built a guile apache module yet, how much work had been done, and what state the database interface was in.
This question spawned some replies about database interfaces with Aleksy Demakov mentioning www.damtp.cam.ac.uk/user/ig206/guile-pg/ as the site for Ian Grant's and offering to email his own PostgreSQL interface. Aleksy notes in a later email that neither is finished.
forcer jumped in on the question of a gGuile Apache Module stating: It's on the top of my to-do list. I'd love to have mod_guile. Also, i'm considering to have a feature similar to Embedperl/PHP - inline-code in HTML. Everything around would be translated into ``(display "...'', to ``(display "'', and EOF to ``")''. This simple transformation would make the whole stuff very simple and elegant.
forcer wrote that there were three things he felt he needed help with.
He also thought that PHP would be a good platform to look at replacing
with Guile.
Branko Cibej recommended that mod_guile be integrated with mod_include, and forcer thought that this could be done by (display ...)ing the return value. A patch to modify mod_include would be distributed with mod_guile.
Chris Bitmead replied with news that he had implemented a lightweight
scheme interpreter for cgi (tho' not with guile). He recommended
doing away with the <? scm and just embedding the
guile code into the HTML. Some discussion followed between forcer and
Chris about differentiating between
<h4>this is a head of size (display "4")</h4>and<h4>This is a header (a small one)</h4>
<? scm and the
(foo...) notations could be handled by a well written
module.
Jost Boekemeier writes:
A module is a structure used to encapsulate related data and functions with a restricted view from the outside. Typically, modules are used to hide implementation details through a well-defined set of functions called the interface. The module consists of two main parts: the head, where you store the information what and how something should be visible and the body, where you define your functions and data structures.
He then discusses two models for storing metadata. 1) the C model where it is stored in a seperate file, allowing you to hide the implementation but requiring you to maintain two seperate files. and 2) the Java/Oberon model where it is visible to the world as a part of the implementation file.
Guile's new environment implementation allows a programmer to use either style. Jost asks sould meta data (information about what is visible to the world, what package this module belongs to etc.) be stored in a separate file or should this information be part of module itself? How should guile's primary module interface look like?
There were many follow-ups discussing the benefits of either solution, with seeming consensus that while a one file approach is easier to maintain and probably more elegant, the module system ought to support both approaches as there is no great difficulty in doing so.
Jonas Oberg complained about guile startup time and mentioned that it might be a FAQ, but that he didn't see it listed anywhere. Chris Bitmead, Russ McManus, and Jost Boekemeier all replied agreeing that Guile's startup time is too slow, that there were things being done, and that it should get better.
Discussion of a scheme based replacement continued this week, with several people getting in on the conversation. [FINISH ME]
We Hope You Enjoy Kernel Cousins