[svn commit] r434 - trunk/include/minor
jimb at red-bean.com
Fri Apr 22 02:25:20 CDT 2005
Author: jimb
Date: Fri Apr 22 02:25:18 2005
New Revision: 434
Modified:
trunk/include/minor/minor.h
Log:
Clean up rules and conventions for handling text. I haven't yet
changed the code to actually implement the new rules, so things may be
broken, but I expect them to get straightened out quickly, so I'm not
going to cut a branch; use the previous revision until things get
sorted out.
Modified: trunk/include/minor/minor.h
==============================================================================
--- trunk/include/minor/minor.h (original)
+++ trunk/include/minor/minor.h Fri Apr 22 02:25:18 2005
@@ -229,10 +229,11 @@
/* Exceptions. */
-/* The functions in this interface return exceptions in a fashion
- reminiscent of the usual C 'errno' style, except that the
- exceptions are Scheme exception objects, and not integers, and the
- interface is reentrant.
+/* The functions in this interface return exceptions in a way
+ resembling the usual C 'errno' style, except that Minor's
+ exceptions are Scheme exception objects, rather than integers, and
+ the interface is reentrant, without resorting to magical
+ definitions for an 'errno'-like variable.
For each function in this interface that can return an exception,
we document a distinguished `exception return value' --- a null
@@ -242,27 +243,43 @@
Each thread has a `pending exception' object, accessed via the
`mn_get_exception' and `mn_set_exception' functions. When a
function returns its exception return value, the caller can use
- `mn_get_exception' to find the exception object.
+ `mn_get_exception' to find the exception object describing the
+ error.
To throw an exception, C code can make an exception object, call
- `mn_set_exception' to make that the thread's pending exception, and
- return its own exception return value.
+ `mn_set_exception' to make that object the thread's pending
+ exception, and return its own exception return value.
- Where functions in this section take 'char *' strings, they are
- assumed to be in the current C execution character set.
+ Some of these functions take or produce strings; see the comments
+ at the top of the "Characters" section for Minor's conventions for
+ dealing with text.
+
+
+ When Minor Functions Abort Instead Of Returning Exceptions,
+ and Why:
+
+ By convention, Minor C API functions handle type errors, index
+ range errors, and numeric range errors by aborting, instead of
+ returning an exception. These sorts of errors typically indicate
+ bugs in the user's code itself: correct programs usually never
+ encounter them. When this is not the case, the API provides
+ functions to check for the conditions that would cause an abort.
+
+ An interface which handles these sorts of errors by returning
+ exceptions doesn't work well:
+
+ - Users will often not check for exceptional return values in these
+ cases, since they "know" the errors cannot occur. If the API
+ reports them as exceptions which the user's code ignores, then
+ the program behaves unpredictably, instead of failing in a
+ controlled way.
+
+ - For many functions in this API, these classes of errors are the
+ only ones they can ever encounter, so these functions either
+ return successfully, or not at all. This lets the user write
+ terser, more legible code, without leaving errors unchecked.
-
- NOTE: Many of the functions in this interface will typically be
- used in contexts where the caller "knows" that no error will occur.
- Having to check each call to these functions for an exception
- return value is a burden; people probably wouldn't do it, and
- people's experiences with this interface would be unpleasant.
-
- In the cases where we think this might happen, and where the user
- can easily detect the error conditions themselves, we just have the
- function abort, rather than returning an exception. This will
- allow errors to be caught by checks before the call that would
- abort is made. */
+ Each function's description details when it will abort. */
/* Return a new local reference to the calling thread's pending
@@ -287,10 +304,7 @@
null-terminated string.
If MSG cannot be converted to a Minor string, return NULL and set
- the current exception. This may happen if MSG contains invalid
- multi-byte characters, or characters not present in Unicode
- (although that should be rare). Strings containing only ASCII
- characters can always be converted. */
+ the pending exception. */
mn_ref *mn_make_generic_exception (mn_call *, const char *msg);
@@ -444,8 +458,8 @@
always return true. But:
- the bignum support isn't done yet, and our own rules say we
- shouldn't handle errors by aborting unless it's easy to check
- for the error condition, and
+ provide functions that check for each condition that could cause
+ an abort, and
- If this interface is to serve as a model for other Scheme
implementations, it needs to support those that have limits on
@@ -500,40 +514,105 @@
/* Characters. */
-/* Return true if REF refers to a character; otherwise, return false. */
-_Bool mn_character_p (mn_call *, mn_ref *ref);
+/* General Conventions For Handling Text and Character Sets:
-/* Convert between Minor characters and the C execution character set.
+ Minor uses Unicode to represent characters and strings; C uses a
+ representation that varies from one locale to another. Where the
+ functions in this API accept or return 'char' or 'wchar_t' values,
+ or strings made from them, those values use the current C execution
+ character set; the API converts to and from Minor's internal
+ representation as needed. This means that you can use these values
+ with the standard C library functions that operate on text
+ (getchar, printf, atoi, and so on) in the normal way, without
+ worrying about what representation Minor is using.
+
+ Strings of 'char' values are always treated as containing multibyte
+ characters (if the execution character set has any), never as plain
+ byte strings.
+
+ Since the encoding of characters in the current C execution
+ character set is determined by the current locale, the behavior of
+ these functions may depend on the current locale --- specifically,
+ that established for the LC_CTYPE category.
+
+
+ Reporting Conversion Errors:
+
+ Various problems can occur during conversion:
+
+ - A multi-byte C string using a variable-width character encoding
+ scheme might be unparseable as a stream of characters.
+
+ For example, in UTF-8, the byte sequence 0x80 is an ill-formed
+ character: 0x80 may only appear in UTF-8 as part of a multi-byte
+ character, and never as its first byte.
+
+ - A well-formed stream of code points might contain code points
+ that don't correspond to characters.
+
+ For example, the byte sequence 0xed 0xb0 0x80 is a well-formed
+ UTF-8 sequence, but it represents the code point 0xdc00 --- an
+ "isolated surrogate" value reserved for use in UTF-16 encoding
+ forms, and not assigned to any character.
+
+ (The distinction being attempted here is that these errors are
+ due to a code point being unassigned in the given character set,
+ and not due to some syntactic problem in the byte sequence.)
+
+ - A stream of well-formed characters in one character set may
+ contain characters that don't exist in the other. For example,
+ there is no character in ISO Latin-1 corresponding to the Unicode
+ character U+2638 ("Wheel of Dharma").
+
+ The functions in this API return exceptions when they encounter any
+ of the above problems, except in special cases where it is possible
+ to carry through the operation without losing information. If
+ information would be lost, the functions always return an
+ exception.
+
+ For example, Minor characters hold Unicode code points (up to
+ U+ffffff) without regard for whether that code point is actually
+ assigned to any particular character. In locales where the C
+ wchar_t type uses Unicode as well, the wide character L'\xdc00' can
+ be converted to a Minor character and back to a C wide character
+ without loss of information. In this case, the conversion
+ functions may not return an exception, even though L'\xdc00' is not
+ a valid Unicode character.
- (Hah! "Minor characters"??? Get it? Pretty funny, huh!)
- The "C execution character set" is the coding used by 'char',
- 'wchar_t', and 'char *' strings. You can use the following
- functions to exchange data with the standard C library functions
- that operate on text: putchar, getchar, printf, atoi, and so on.
+ Guaranteed Conversions:
ISO C divides the execution character set into the "basic character
- set", which is roughly the upper- and lower-case letters, the
- digits, the graphic symbols used in C syntax, the whitespace
- characters, and "extended characters". The available set of
- extended characters, and their encodings, depends on the current
- locale. So the behavior of these functions may depend on the
- current locale --- specifically that established for the LC_CTYPE
- category. */
+ set" (roughly the upper- and lower-case letters, the digits, the
+ graphic symbols used in C syntax --- that does not include '$',
+ '@', or '`' --- and the whitespace characters), and "extended
+ characters". Characters in the basic character set, and strings
+ containing them, may always be converted to and from Minor values
+ without error. */
+
+
+/* Return true if REF refers to a character; otherwise, return false. */
+_Bool mn_character_p (mn_call *, mn_ref *ref);
+
+
+/* Convert between Minor characters and the C execution character set.
+
+ (Hah! "Minor characters"??? Get it? Pretty funny, huh!) */
/* Return true if CHARACTER is a character, and that character can be
- represented as a C 'char', or a C 'wchar_t', false otherwise. */
+ represented as a C char / wchar_t, false otherwise. */
_Bool mn_is_char (mn_call *, mn_ref *character);
_Bool mn_is_wchar (mn_call *, mn_ref *character);
-/* Return CHARACTER as a C 'char', or a C 'wchar_t'. If CHARACTER
- cannot be represented in the given type, return EOF or WEOF. If
- CHARACTER is not a character, abort. */
+/* Return CHARACTER as a C char / wchar_t. If CHARACTER cannot be
+ represented in the given type, return EOF / WEOF, and set the
+ pending exception. If CHARACTER is not a character, abort. */
int mn_to_char (mn_call *, mn_ref *character);
wint_t mn_to_wchar (mn_call *, mn_ref *character);
/* Return the Minor character corresponding to the 'char' or 'wchar_t'
- value C. */
+ value C. If C cannot be converted to a Minor character, return NULL
+ and set the pending exception. */
mn_ref *mn_from_char (mn_call *, int c);
mn_ref *mn_from_wchar (mn_call *, wchar_t c);
@@ -545,8 +624,8 @@
Minor strings are immutable. (This is a deviation from Scheme.)
- These functions all produce or consume C strings in the current C
- execution character set, which may include multibyte characters.
+ See the comments in the "Characters" section describing the general
+ conventions for handling text and dealing with conversion errors.
These functions all copy the entire string for the user's use. If
it's important to avoid this, then we could introduce a lease-based
@@ -556,8 +635,8 @@
/* Return true if REF refers to a string; otherwise, return false. */
_Bool mn_string_p (mn_call *, mn_ref *ref);
-/* Return the length of STRING. If STRING is not a string object,
- abort. */
+/* Return the length of STRING, in characters. If STRING is not a
+ string object, abort. */
int mn_string_length (mn_call *, mn_ref *string);
/* Return the i'th character of STRING. If STRING is not a string, or
@@ -568,8 +647,8 @@
memory for the string returned is allocated using malloc; the
caller is responsible for freeing it.
- If STRING contains characters that cannot be converted to the C
- execution character set, return NULL.
+ If STRING cannot be fully and accurately converted to the C
+ execution character set, return NULL and set the pending exception.
If STRING contains null characters, truncate it just before the
first one. (Would it be more helpful to just return the entire
@@ -582,8 +661,8 @@
*LENGTH to its length. The memory returned is allocated using
malloc; the caller is responsible for freeing it.
- If STRING contains characters that cannot be converted to the C
- execution character set, return NULL.
+ If STRING cannot be fully and accurately converted to the C
+ execution character set, return NULL and set the pending exception.
If STRING is not a string, abort. */
char *mn_string_to_mem (mn_call *, mn_ref *string, size_t *length);
@@ -592,13 +671,11 @@
null-terminated string STR. This is a copy of STR; the returned
string does not refer to STR's memory.
- If STR cannot be converted to a Minor string, return NULL and set
- the current exception. This may happen if STR contains invalid
- multi-byte characters, or characters not present in Unicode
- (although that should be rare). Strings containing only ASCII
- characters can always be converted. For storing arbitrary
- sequences of bytes, use byte vectors; they are described in
- bytevec.h. */
+ If STR cannot be fully and accurately converted to a Minor string,
+ return NULL and set the pending exception.
+
+ (For storing arbitrary sequences of bytes, use byte vectors; they
+ are described in bytevec.h.) */
mn_ref *mn_string_from_str (mn_call *, const char *str);
/* Return a Minor string object whose contents are the same as the
@@ -606,48 +683,52 @@
does not refer to MEM's memory. MEM need not be null-terminated,
and may contain embedded null characters.
- If MEM cannot be converted to a Minor string, return NULL and set
- the current exception. This may happen if MEM contains invalid
- multi-byte characters, or characters not present in Unicode
- (although that should be rare). Strings containing only ASCII
- characters can always be converted. For storing arbitrary
- sequences of bytes, use byte vectors; they are described in
- bytevec.h. */
+ If MEM cannot be fully and accurately converted to a Minor string,
+ return NULL and set the pending exception.
+
+ (For storing arbitrary sequences of bytes, use byte vectors; they
+ are described in bytevec.h.) */
mn_ref *mn_string_from_mem (mn_call *, const char *mem, size_t length);
/* Symbols. */
+/* See the comments in the "Characters" section describing the general
+ conventions for handling text and dealing with conversion
+ errors. */
+
/* Return true if REF refers to a symbol; otherwise, return false. */
_Bool mn_symbol_p (mn_call *, mn_ref *ref);
-/* Return the symbol whose name is NAME. If NAME is not a Minor
- string, abort. */
+/* Return the symbol whose name is NAME. If NAME is not a string,
+ abort. */
mn_ref *mn_string_to_symbol (mn_call *, mn_ref *name);
-/* The functions here that take or return 'char' values interpret them
- according to the current C execution character set. */
-
-/* Return the symbol whose name is the null-terminated string NAME.
+/* Return the symbol whose name is the null-terminated C string NAME.
- Every symbol's name is a valid string. If NAME cannot be converted
- to a Minor string, return NULL and set the current exception. This
- may happen if NAME contains invalid multi-byte characters, or
- characters not present in Unicode (although that should be rare).
- Strings containing only ASCII characters can always be converted. */
+ Every symbol's name is a valid string. If NAME cannot be fully and
+ accurately converted to a string, return NULL and set the pending
+ exception. */
mn_ref *mn_symbol_from_str (mn_call *, const char *name);
/* Return the name of the symbol SYMBOL, as a malloc'd block of
- characters, and set *LENGTH to its length. The memory for the
- string returned is allocated using malloc; the caller is
- responsible for freeing it. If SYMBOL is not a symbol, abort. */
+ characters, and set *LENGTH to its length.
+
+ The memory for the string returned is allocated using malloc; the
+ caller is responsible for freeing it.
+
+ If SYMBOL's name cannot be fully and accurately converted to a
+ string, return NULL and set the pending exception.
+
+ If SYMBOL is not a symbol, abort. */
char *mn_symbol_to_mem (mn_call *, mn_ref *symbol, size_t *length);
/* Return the name of the symbol SYMBOL as a Minor string. If SYMBOL
is not a symbol, abort. */
mn_ref *mn_symbol_name (mn_call *, mn_ref *symbol);
+
/* Procedures. */
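
The two UTF-8 cases named under "Reporting Conversion Errors" in the diff can be checked with a few lines of standalone C. This is only a sketch: the helper names (utf8_can_start_char, utf8_decode3, is_surrogate) are hypothetical and are not part of minor.h, and nothing here reflects how Minor itself performs conversion. It shows that 0x80 can never begin a UTF-8 character, and that the bytes 0xed 0xb0 0x80 mechanically decode to the surrogate code point 0xdc00. Following the header's own convention, a NULL argument is treated as a caller bug and aborts via assert.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of minor.h: return true if BYTE may
   begin a UTF-8 encoded character.  Continuation bytes (0x80..0xbf)
   may not, and neither may 0xc0, 0xc1, or 0xf5..0xff, which never
   appear in UTF-8 at all.  */
static int
utf8_can_start_char (unsigned char byte)
{
  if (byte < 0x80)
    return 1;                           /* single-byte (ASCII) character */
  return byte >= 0xc2 && byte <= 0xf4;  /* lead byte of a longer sequence */
}

/* Hypothetical helper: mechanically decode the three bytes at P,
   which must match the pattern 1110xxxx 10xxxxxx 10xxxxxx, into a
   code point.  Return -1 if the bytes do not fit that pattern.

   This deliberately checks only the bit pattern; it does not reject
   surrogates or overlong forms, so that the caller can classify the
   resulting code point separately.  Passing NULL is a bug in the
   caller, so abort rather than report an error.  */
static long
utf8_decode3 (const unsigned char *p)
{
  assert (p != NULL);

  if ((p[0] & 0xf0) != 0xe0
      || (p[1] & 0xc0) != 0x80
      || (p[2] & 0xc0) != 0x80)
    return -1;

  return ((long) (p[0] & 0x0f) << 12)
         | ((long) (p[1] & 0x3f) << 6)
         | (long) (p[2] & 0x3f);
}

/* Return true if CP lies in the range reserved for UTF-16
   surrogates, which is not assigned to any character.  */
static int
is_surrogate (long cp)
{
  return cp >= 0xd800 && cp <= 0xdfff;
}
```

Calling utf8_can_start_char (0x80) yields false, which is the first class of error (an unparseable byte stream). Calling utf8_decode3 on { 0xed, 0xb0, 0x80 } yields 0xdc00, for which is_surrogate returns true, which is the second class (a code point that is not a character). Note that current Unicode definitions classify UTF-8-encoded surrogates as ill-formed outright; the sketch keeps the two-step framing used in the comment above, decoding first and classifying afterward.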