2 Environment  -< ANSI C Rationale  -> 3.2 Conversions  -> 4 Library         Index 

3  Language

While more formal methods of language definition were explored, the Committee decided early on to employ the style of the Base Document: Backus-Naur Form for the syntax and prose for the constraints and semantics.  Anything more ambitious was considered to be likely to delay the Standard, and to make it less accessible to its audience. 

3.1  Lexical Elements

The Standard endeavors to bring preprocessing more closely into line with the token orientation of the language proper.  To do so requires that at least some information about white space be retained through the early phases of translation (see §2.1.1.2).  It also requires that an inverse mapping be defined from tokens back to source characters (see §3.8.3). 

3.1.1  Keywords

Several keywords have been added: const, enum, signed, void, and volatile.

As much as possible, however, new features have been added by overloading existing keywords, as, for example, long double instead of extended.  It is recognized that each added keyword will require some existing code that used it as an identifier to be rewritten.  No meaningful programs are known to be quietly changed by adding the new keywords. 

The keywords entry, fortran, and asm have not been included since they were either never used, or are not portable.  Uses of fortran and asm as keywords are noted as common extensions.

3.1.2  Identifiers

While an implementation is not obliged to remember more than the first 31 characters of an identifier for the purpose of name matching, the programmer is effectively prohibited from intentionally creating two different identifiers that are the same in the first 31 characters.  Implementations may therefore store the full identifier; they are not obliged to truncate to 31. 

The decision to extend significance to 31 characters for internal names was made with little opposition, but the decision to retain the old six-character case-insensitive restriction on significance of external names was most painful. While strong sentiment was expressed for making C ``right'' by requiring longer names everywhere, the Committee recognized that the language must, for years to come, coexist with other languages and with older assemblers and linkers.  Rather than undermine support for the Standard, the severe restrictions have been retained. 

The Committee has decided to label as obsolescent the practice of providing different identifier significance for internal and external identifiers, thereby signalling its intent that some future version of the C Standard require 31-character case-sensitive external name significance, and thereby encouraging new implementations to support such significance. 

Three solutions to the external identifier length/case problem were explored, each with its own set of problems:

  1. Label any C implementation without at least 31-character, case-sensitive significance in external identifiers as non-standard.  This is unacceptable since the whole reason for a standard is portability, and many systems today simply do not provide such a name space. 

  2. Require a C implementation which cannot provide 31-character, case-sensitive significance to map long identifiers into the identifier name space that it can provide.  This option quickly becomes very complex for large, multi-source programs, since a program-wide database has to be maintained for all modules to avoid giving two different identifiers the same actual external name.  It also reduces the usefulness of source code debuggers and cross reference programs, which generally work with the short mapped names, since the source-code name used by the programmer would likely bear little resemblance to the name actually generated. 

  3. Require a C implementation which cannot provide 31-character, case-sensitive significance to rewrite the linker, assembler, debugger, any other language translators which use the linker, etc.  This is not always practical, since the C implementor might not be providing the linker, etc.  Indeed, on some systems only the manufacturer's linker can be used, either because the format of the resulting program file is not documented, or because the ability to create program files is restricted to secure programs. 

Because of the decision to restrict significance of external identifiers to six case-insensitive characters, C programmers are faced with these choices when writing portable programs:

  1. Make sure that external identifiers are unique within the first six characters, and use only one case within the name.  A unique six-character prefix could be used, followed by an underscore, followed by a longer, more descriptive name:
            extern int a_xvz_real_long_name;
            extern int a_rwt_real_long_name2;
    

  2. Use the prefix method described above, and then use #define statements to provide a longer, more descriptive name for the unique name, such as:

            #define real_long_name  a_xvz_real_long_name
            #define real_long_name2 a_rwt_real_long_name2
    
    Note that overuse of this technique might result in exceeding the limit on the number of allowed #define macros, or some other implementation limit. 

  3. Use longer and/or multi-case external names, and limit the portability of the programs to systems that support the longer names. 

  4. Declare all exported items (or pointers thereto) in a single data structure and export that structure.  The technique can reduce the number of external identifiers to one per translation unit; member names within the structure are internal identifiers, hence can have full significance.  The principal drawback of this technique is that functions can only be exported by reference, not by name; on many systems this entails a run-time overhead on each function call. 
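
The fourth choice can be sketched as follows.  This is only an illustration under stated assumptions: the module name shapes, the structure tag, and every member and function name are hypothetical and do not come from the Standard.

```c
/* Hypothetical module "shapes": all names here are illustrative only. */

static double area_of_square(double side)
{
    return side * side;
}

static double perimeter_of_square(double side)
{
    return 4.0 * side;
}

/* Member names are internal identifiers, so they may rely on the full
   31 characters of significance. */
struct shapes_exports {
    double (*area_of_square)(double side);
    double (*perimeter_of_square)(double side);
    int    number_of_shapes_supported;
};

/* The only external identifier of this translation unit; it is unique
   within six case-insensitive characters. */
struct shapes_exports shapes = {
    area_of_square,
    perimeter_of_square,
    1
};
```

A client would write shapes.area_of_square(3.0); the call goes through a function pointer, which is the run-time overhead mentioned in the text.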

3.1.2.1  Scopes of identifiers

The Standard has separated from the overloaded keywords for storage classes the various concepts of scope, linkage, name space, and storage duration.  (See §3.1.2.2, §3.1.2.3, §3.1.2.4.)  This has traditionally been a major area of confusion. 

One source of dispute was whether identifiers with external linkage should have file scope even when introduced within a block.  The Base Document is vague on this point, and has been interpreted differently by different implementations.  For example, the following fragment would be valid in the file scope scheme, while invalid in the block scope scheme:

        typedef struct data d_struct ;

        first(){
                extern d_struct func();
                /* ...  */
        }

        second(){
                d_struct n = func();
        }
While it was generally agreed that it is poor practice to take advantage of an external declaration once it had gone out of scope, some argued that a translator had to remember the declaration for checking anyway, so why not acknowledge this?  The compromise adopted was to decree essentially that block scope rules apply, but that a conforming implementation need not diagnose a failure to redeclare an external identifier that had gone out of scope (undefined behavior). 

Although the scope of an identifier in a function prototype begins at its declaration and ends at the end of that function's declarator, this scope is of course ignored by the preprocessor.  Thus an identifier in a prototype having the same name as that of an existing macro is treated as an invocation of that macro.  For example:
        #define status 23
        void exit(int status);
generates an error, since the prototype after preprocessing becomes
        void exit(int 23);
Perhaps more surprising is what happens if status is defined
        #define status []
Then the resulting prototype is
        void exit(int []);
which is syntactically correct but semantically quite different from the intent. 

To protect an implementation's header prototypes from such misinterpretation, the implementor must write them to avoid these surprises.  Possible solutions include not using identifiers in prototypes, or using names (such as __status or _Status) in the reserved name space. 

3.1.2.2  Linkages of identifiers

The Standard requires that the first declaration, implicit or explicit, of an identifier specify (by the presence or absence of the keyword static) whether the identifier has internal or external linkage.  This requirement allows for one-pass compilation in an implementation which must treat internal linkage items differently than external linkage items.  An example of such an implementation is one which produces intermediate assembler code, and which therefore must construct names for internal linkage items to circumvent identifier length and/or case restrictions in the target assembler. 

Existing practice in this area is inconsistent.  Some implementations have avoided the renaming problem simply by restricting internal linkage names by the same rules as for external linkage.  Others have disallowed a static declaration followed later by a defining instance, even though such constructs are necessary to declare mutually recursive static functions.  The requirements adopted in the Standard may call for changes in some existing programs, but allow for maximum flexibility. 
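
The case of mutually recursive static functions might look like the sketch below.  The function names is_even and is_odd are hypothetical; the point is only that the first declaration already carries the keyword static, and the defining instance follows later in the same translation unit.

```c
/* The first declaration fixes the linkage as internal, before any use. */
static int is_odd(unsigned n);

static int is_even(unsigned n)
{
    return n == 0 ? 1 : is_odd(n - 1);
}

/* Defining instance, later in the same translation unit. */
static int is_odd(unsigned n)
{
    return n == 0 ? 0 : is_even(n - 1);
}
```

A one-pass translator that renames internal linkage items already knows, when it reaches the call inside is_even, that is_odd needs an internal name.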

The definition model to be used for objects with external linkage was a major standardization issue.  The basic problem was to decide which declarations of an object define storage for the object, and which merely reference an existing object.  A related problem was whether multiple definitions of storage are allowed, or only one is acceptable.  Existing implementations of C exhibit at least four different models, listed here in order of increasing restrictiveness:

Common
Every object declaration with external linkage (whether or not the keyword extern appears in the declaration)  creates a definition of storage.  When all of the modules are combined together, each definition with the same name is located at the same address in memory.  (The name is derived from common storage in FORTRAN.)  This model was the intent of the original designer of C, Dennis Ritchie.

Relaxed Ref/Def
The appearance of the keyword extern (whether it is used outside of the scope of a function or not)  in a declaration indicates a pure reference (ref), which does not define storage.  Somewhere in all of the translation units, at least one definition (def) of the object must exist.  An external definition is indicated by an object declaration in file scope containing no storage class indication.  A reference without a corresponding definition is an error.  Some implementations also will not generate a reference for items which are declared with the extern keyword, but are never used within the code.  The UNIX operating system C compiler and linker implement this model, which is recognized as a common extension to the C language (F.4.11).  UNIX C programs which take advantage of this model are standard conforming in their environment, but are not maximally portable. 

Strict Ref/Def
This is the same as the relaxed ref/def model, save that only one definition is allowed.  Again, some implementations may decide not to put out references to items that are not used.  This is the model specified in K&R and in the Base Document. 

Initialization
This model requires an explicit initialization to define storage.  All other declarations are references. 

Figure 3.1 demonstrates the differences between the models. 


Model                   File 1                  File 2

common                  extern int i;           extern int i;
                        main() {                second() {
                                i = 1;                  third(i);
                                second();       }
                        }

Relaxed Ref/Def         int i;                  int i;
                        main() {                second() {
                                i = 1;                  third(i);
                                second();       }
                        }

Strict Ref/Def          int i;                  extern int i;
                        main() {                second() {
                                i = 1;                  third(i);
                                second();       }
                        }

Initializer             int i = 0;              int i;
                        main() {                second() {
                                i = 1;                  third(i);
                                second();       }
                        }

The model adopted in the Standard is a combination of features of the strict ref/def model and the initialization model.  As in the strict ref/def model, only a single translation unit contains the definition of a given object --- many environments cannot effectively or efficiently support the ``distributed definition'' inherent in the common or relaxed ref/def approaches.  However, either an initialization, or an appropriate declaration without storage class specifier (see §3.7), serves as the external definition.  This composite approach was chosen to accommodate as wide a range of environments and existing implementations as possible. 
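
The three kinds of declaration involved can be shown side by side.  The sketch below collapses what would normally be several files into a single translation unit for brevity; the names shared_count and read_shared_count are hypothetical.

```c
/* In a real program the extern declaration would usually live in a
   header or in another file; everything is shown together here. */
extern int shared_count;   /* pure reference: defines no storage          */
int shared_count;          /* no storage class: a tentative definition    */
int shared_count = 3;      /* initialization: the one external definition */

int read_shared_count(void)
{
    return shared_count;
}
```

Had the initialized declaration been absent, the declaration without storage class would have served as the definition.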

3.1.2.3  Name spaces of identifiers

Implementations have varied considerably in the number of separate name spaces maintained.  The position adopted in the Standard is to permit as many separate name spaces as can be distinguished by context, except that all tags (struct, union, and enum) comprise a single name space. 
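
As an illustration (not a recommended style), the same spelling can serve in four name spaces at once.  The identifier item and the function name below are hypothetical.

```c
/* One spelling, four name spaces: tag, member, ordinary identifier, label. */
struct item { int item; };          /* tag, and a member of that struct */

int name_space_demo(void)
{
    struct item item;               /* ordinary identifier */

    item.item = 42;
    goto item;                      /* label */
item:
    return item.item;
}
```

Declaring enum item as well would be an error, since struct, union, and enum tags share one name space.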

3.1.2.4  Storage durations of objects

It was necessary to clarify the effect on automatic storage of jumping into a block that declares local storage.  (See §3.6.2.)  While many implementations allocate the maximum depth of automatic storage upon entry to a function, some explicitly allocate and deallocate on block entry and exit.  The latter are required to assure that local storage is allocated regardless of the path into the block (although initializers in automatic declarations are not executed unless the block is entered from the top). 
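
A small sketch of the rule, using a hypothetical function name: the goto enters the block below its top, so the initializer is bypassed, yet the object can still be assigned and read because its storage is guaranteed to exist.

```c
int jump_into_block(void)
{
    goto inside;                /* bypasses the top of the block */
    {
        int value = 5;          /* initializer is skipped on this path */
inside:
        value = 10;             /* storage for value exists nonetheless */
        return value;
    }
}
```

Reading value before the assignment on this path would yield an indeterminate value, which is why the sketch assigns first.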

To effect true reentrancy for functions in the presence of signals raised asynchronously (see §2.2.3), an implementation must assure that the storage for function return values has automatic duration.  This means that the caller must allocate automatic storage for the return value and communicate its location to the called function.  (The typical case of return registers for small types conforms to this requirement: the calling convention of the implementation implicitly communicates the return location to the called function.) 

3.1.2.5  Types

Several new types have been added:

    void
    void *
    signed char
    unsigned char
    unsigned short
    unsigned long
    long double
New designations for existing types have been added:
     signed short  for   short
     signed int    for   int
     signed long   for   long
void is used primarily as the typemark for a function which returns no result.  It may also be used, in any context where the value of an expression is to be discarded, to indicate explicitly that a value is ignored by writing the cast (void).  Finally, a function prototype list that has no arguments is written as f(void), because f() retains its old meaning that nothing is said about the arguments. 

A ``pointer to void,'' void *, is a generic pointer, capable of pointing to any (data) object without truncation.  A pointer to void must have the same representation and alignment as a pointer to character; the intent of this rule is to allow existing programs which call library functions (such as memcpy and free) to continue to work.  A pointer to void may not be dereferenced, although such a pointer may be converted to a normal pointer type which may be dereferenced.  Pointers to other types coerce silently to and from void * in assignments, function prototypes, comparisons, and conditional expressions, whereas other pointer type clashes are invalid.  It is undefined what will happen if a pointer of some type is converted to void *, and then the void * pointer is converted to a type with a stricter alignment requirement. 
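
The silent conversions can be seen in a short sketch.  The function names are hypothetical; memcpy is the standard library function declared in <string.h>.

```c
#include <string.h>

/* Copy a double through generic pointers, as memcpy's prototype expects. */
double copy_via_void(double source)
{
    double destination = 0.0;
    void  *to = &destination;      /* double * converts silently to void * */
    void  *from = &source;

    memcpy(to, from, sizeof source);
    return destination;
}

/* Convert to void * and back to the original type before dereferencing. */
int round_trip(int value)
{
    void *generic = &value;
    int  *back = generic;          /* no cast is needed in C */

    return *back;                  /* *generic itself would be invalid */
}
```

Both functions convert back to the original type only, so the stricter-alignment hazard described above does not arise.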

Three types of char are specified: signed, plain, and unsigned.  A plain char may be represented as either signed or unsigned, depending upon the implementation, as in prior practice.  The type signed char was introduced to make available a one-byte signed integer type on those systems which implement plain char as unsigned.  For reasons of symmetry, the keyword signed is allowed as part of the type name of other integral types. 

Two varieties of the integral types are specified: signed and unsigned.  If neither specifier is used, signed is assumed.  In the Base Document the only unsigned type is unsigned int.

The keyword unsigned is something of a misnomer, suggesting as it does arithmetic that is non-negative but capable of overflow.  The semantics of the C type unsigned is that of modulus, or wrap-around, arithmetic, for which overflow has no meaning.  The result of an unsigned arithmetic operation is thus always defined, whereas the result of a signed operation may (in principle) be undefined.  In practice, on twos-complement machines, both types often give the same result for all operators except division, modulus, right shift, and comparisons.  Hence there has been a lack of sensitivity in the C community to the differences between signed and unsigned arithmetic (see §3.2.1.1). 
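
A brief sketch of the difference, with hypothetical function names: unsigned addition wraps, and comparison is one of the operators for which signedness changes the answer.

```c
#include <limits.h>

/* Wrap-around: the result is reduced modulo UINT_MAX + 1. */
unsigned wrap_increment(unsigned u)
{
    return u + 1u;
}

/* (unsigned)-1 is UINT_MAX, so it is not less than 1u. */
int minus_one_less_than_one_unsigned(void)
{
    return (unsigned)-1 < 1u;
}

/* The same comparison in signed arithmetic gives the opposite answer. */
int minus_one_less_than_one_signed(void)
{
    return -1 < 1;
}
```

Calling wrap_increment(UINT_MAX) yields zero by definition; the corresponding signed computation INT_MAX + 1 would overflow and is undefined.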

The Committee has explicitly restricted the C language to binary architectures, on the grounds that this stricture was implicit in any case:

The restriction to ``binary numeration systems''  rules out such curiosities as Gray code, and makes possible arithmetic definitions of the bitwise operators on unsigned types (see §3.3.3.3, §3.3.7, §3.3.10, §3.3.11, §3.3.12). 

A new floating type long double has been added to C.  The long double type must offer at least as much precision as the type double.  Several architectures support more than two floating types and thus can map a distinct machine type onto this additional C type.  Several architectures which only support two floating point types can also take advantage of the three C types by mapping the less precise type onto float and double, and designating the more precise type long double.  Architectures in which this mapping might be desirable include those in which single-precision floats offer at least as much precision as most other machines' double-precision, or those on which single-precision is considerably more efficient than double-precision.  Thus the common C floating types would map onto an efficient implementation type, but the more precise type would still be available to those programmers who require its use. 

To avoid confusion, long float as a synonym for double has been retired. 

Enumerations permit the declaration of named constants in a more convenient and structured fashion than #define's.  Both enumeration constants and variables behave like integer types for the sake of type checking, however. 

The Committee considered several alternatives for enumeration types in C:

  1. leave them out;
  2. include them as definitions of integer constants;
  3. include them in the weakly typed form of the UNIX C compiler;
  4. include them with strong typing, as, for example, in Pascal.
The Committee adopted the second alternative on the grounds that this approach most clearly reflects common practice.  Doing away with enumerations altogether would invalidate a fair amount of existing code; stronger typing than integer creates problems, for instance, with arrays indexed by enumerations. 
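
The consequence of treating enumerations as integers can be sketched as follows.  The enumeration color and the helper names are hypothetical.

```c
enum color { RED, GREEN, BLUE, COLOR_COUNT };

/* An enumeration constant sizes and indexes an array with no cast,
   because it is simply an int. */
static const char *const color_names[COLOR_COUNT] = {
    "red", "green", "blue"
};

const char *color_name(enum color c)
{
    return color_names[c];
}

/* Enumeration values take part in ordinary integer arithmetic, and the
   int result converts back to the enumeration type without a cast. */
enum color next_color(enum color c)
{
    return (c + 1) % COLOR_COUNT;
}
```

Under a strongly typed scheme in the style of Pascal, both the array subscript and the arithmetic in next_color would need explicit conversions.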

3.1.2.6  Compatible type and composite type

The notions of compatible types and composite type have been introduced to discuss those situations in which type declarations need not be identical. These terms are especially useful in explaining the relationship between an incomplete type and a complete type. 

Structure, union, or enumeration type declarations in two different translation units do not formally declare the same type, even if the text of these declarations comes from the same include file, since the translation units are themselves disjoint.  The Standard thus specifies additional compatibility rules for such types, so that if two such declarations are sufficiently similar they are compatible.

3.1.3  Constants

In folding and converting constants, an implementation must use at least as much precision as is provided by the target environment.  However, it is not required to use exactly the same precision as the target, since this would require a cross compiler to simulate target arithmetic at translation time. 

The Committee considered the introduction of structure constants.  Although it agreed that structure literals would occasionally be useful, its policy has been not to invent new features unless a strong need exists.  Since the language already allows for initialized const structure objects, the need for inline anonymous structured constants seems less than pressing. 

Several implementation difficulties beset structure constants.  All other forms of constants are ``self typing'' --- the type of the constant is evident from its lexical structure.  Structure constants would require either an explicit type mark, or typing by context; either approach is considered to require increased complexity in the design of the translator, and either approach would also require as much, if not more, care on the part of the programmer as using an initialized structure object. 

3.1.3.1  Floating constants

Consistent with existing practice, a floating point constant has been defined to have type double.  Since the Standard now allows expressions that contain only float operands to be performed in float arithmetic (see §3.2.1.5) rather than double, a method of expressing explicit float constants is desirable.  The new long double type raises similar issues. 

Thus the F and L suffixes have been added to convey type information with floating constants, much like the L suffix for long integers.  The default type of floating constants remains double, for compatibility with prior practice.  Lower case f and l are also allowed as suffixes. 
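
The effect of the suffixes on the type of a constant can be observed with sizeof.  The function names in this sketch are hypothetical.

```c
#include <stddef.h>

size_t size_of_unsuffixed(void) { return sizeof 1.0;  }   /* double      */
size_t size_of_f_suffix(void)   { return sizeof 1.0F; }   /* float       */
size_t size_of_l_suffix(void)   { return sizeof 1.0L; }   /* long double */

/* With the suffix no double constant is involved, so the multiplication
   may be carried out in float arithmetic. */
float scale_by_half(float x)
{
    return x * 0.5f;
}
```

Writing x * 0.5 instead would bring a double operand into the expression and force the multiplication into double.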

Note that the run-time selection of the decimal point character by setlocale (§4.4.1) has no effect on the syntax of C source text: the decimal point character is always period. 

3.1.3.2  Integer constants

The rule that the default type of a decimal integer constant is either int, long, or unsigned long, depending on which type is large enough to hold the value without overflow, simplifies the use of constants. 

The suffixes U and u have been added to specify unsigned numbers. 

Unlike decimal constants, octal and hexadecimal constants too large to be ints are typed as unsigned int (if within range of that type), since it is more likely that they represent bit patterns or masks, which are generally best treated as unsigned, rather than ``real'' numbers. 
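
The following sketch shows the practical effect, under the assumption that int is 32 bits wide (on an implementation with a wider int the first function would return 0).  The function names are hypothetical.

```c
/* Assumption: int is 32 bits.  0x80000000 then does not fit in int, so
   the hexadecimal constant is typed unsigned int and the subtraction
   wraps around instead of going negative. */
int hex_constant_is_unsigned(void)
{
    return (0x80000000 - 0x80000000 - 1) > 0;
}

/* A mask written in hexadecimal behaves as a bit pattern: the right
   shift brings in zeros, not copies of a sign bit. */
unsigned long top_bit_shifted_down(void)
{
    return 0x80000000 >> 31;
}
```

The decimal spelling 2147483648 of the same value is typed by the decimal rule instead, so it never becomes unsigned int.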

Little support was expressed for the old practice of permitting the digits 8 and 9 in an octal constant, so it has been dropped. 

A proposal to add binary constants was rejected due to lack of precedent and insufficient utility. 

Despite a concern that a lower-case L could be taken for the numeral one at the end of an integral (or floating) literal, the Committee rejected proposals to remove this usage, primarily on the grounds of sanctioning existing practice. 

The rules given for typing integer constants were carefully worked out in accordance with the Committee's deliberations on integral promotion rules (see §3.2.1.1). 

3.1.3.3  Enumeration constants

Whereas an enumeration variable may have any integer type that correctly represents all its values when widened to int, an enumeration constant is only usable as the value of an expression.  Hence its type is simply int.  (See §3.1.2.5.) 

3.1.3.4  Character constants

The digits 8 and 9 are no longer permitted in octal escape sequences.  (Cf. octal constants, §3.1.3.2.) 

The alert escape sequence has been added (see §2.2.2). 

Hexadecimal escape sequences, beginning with \x, have been adopted, with precedent in several existing implementations.  (Little sentiment was garnered for providing \X as well.)  The escape sequence extends to the first non-hex-digit character, thus providing the capability of expressing any character constant no matter how large the type char is.  String concatenation can be used to specify a hex-digit character following a hexadecimal escape sequence:

        char a[] = "\xff" "f" ;
        char b[] = {'\xff', 'f', '\0'};
These two initializations give a and b the same string value. 
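
The equivalence can be checked mechanically.  In this sketch the array and function names are hypothetical; memcmp is the standard library function declared in <string.h>.

```c
#include <string.h>

/* The hexadecimal escape ends at the closing quote of the first literal;
   written as one literal, "\xfff" would be read as a single escape. */
static const char joined[]  = "\xff" "f";
static const char spelled[] = { '\xff', 'f', '\0' };

/* Returns 1 when both arrays have the same size and the same bytes. */
int same_string_value(void)
{
    return sizeof joined == sizeof spelled
        && memcmp(joined, spelled, sizeof joined) == 0;
}

size_t joined_size(void)
{
    return sizeof joined;       /* two characters plus the terminating null */
}
```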

The Committee has chosen to reserve all lower case letters not currently used for future escape sequences (undefined behavior).  All other characters with no current meaning are left to the implementor for extensions (implementation-defined behavior).  No portable meaning is assigned to multi-character constants or ones containing other than the mandated source character set (implementation-defined behavior). 

The Committee considered proposals to add the character constant '\e' to represent the ASCII ESC ('\033') character.  This proposal was based upon the use of ESC as the initial character of most control sequences in common terminal driving disciplines, such as ANSI X3.64.  However, this usage has no obvious counterpart in other popular character codes, such as EBCDIC.  A programmer merely wishing to avoid having to type \033 to represent the ESC character in an ASCII/X3.64 environment may, instead of writing

        printf("\033[10;10h%d\n", somevalue);
write:
        #define  ESC  "\033"

        printf( ESC "[10;10h%d\n", somevalue);

Notwithstanding the general rule that literal constants are non-negative [Footnote: -3 is an expression: unary minus with operand 3.], a character constant containing one character is effectively preceded with a (char) cast and hence may yield a negative value if plain char is represented the same as signed char.  This simply reflects widespread past practice and was deemed too dangerous to change. 
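
Whether a given implementation produces a negative value can be probed as in the sketch below.  The function names are hypothetical; CHAR_MIN comes from <limits.h>.

```c
#include <limits.h>

/* Value of a one-character constant whose code has the top bit set. */
int high_char_constant(void)
{
    return '\xff';
}

/* 1 if plain char has the same range as signed char here, else 0. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

The constant is negative exactly when plain_char_is_signed() returns 1, so portable code should not assume either sign.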

An L prefix distinguishes wide character constants.  (See §2.2.1.2.) 

3.1.4  String literals

String literals are specified to be unmodifiable.  This specification allows implementations to share copies of strings with identical text, to place string literals in read-only memory, and perform certain optimizations.  However, string literals do not have the type array of const char, in order to avoid the problems of pointer type checking, particularly with library functions, since assigning a pointer to const char to a plain pointer to char is not valid.  Those members of the Committee who insisted that string literals should be modifiable were content to have this practice designated a common extension (see F.5.5). 

Existing code which modifies string literals can be made strictly conforming by replacing the string literal with an initialized static character array.  For instance,

        char *p, *make_temp(char *str);
            /* ... */
        p = make_temp("tempXXX");
                /* make_temp overwrites the literal */
                /* with a unique name */
can be changed to:
        char *p, *make_temp(char *str);
            /* ... */
        {
            static char template[ ] = "tempXXX";
            p = make_temp( template );
        }

A long string can be continued across multiple lines by using the backslash-newline line continuation, but this practice requires that the continuation of the string start in the first position of the next line.  To permit more flexible layout, and to solve some preprocessing problems (see §3.8.3), the Committee introduced string literal concatenation.  Two string literals in a row are pasted together (with no null character in the middle) to make one combined string literal.  This addition to the C language allows a programmer to extend a string literal beyond the end of a physical line without having to use the backslash-newline mechanism and thereby destroying the indentation scheme of the program.  An explicit concatenation operator was not introduced because the concatenation is a lexical construct rather than a run-time operation. 

Without concatenation:

        /* say the column is this wide */
                alpha = "abcdefghijklm\
        nopqrstuvwxyz" ;

With concatenation:

        /* say the column is this wide */
                alpha = "abcdefghijklm"
                        "nopqrstuvwxyz";

An L prefix distinguishes wide string literals.  A prefix (as opposed to suffix) notation was adopted so that a translator can know at the start of the processing of a long string literal whether it is dealing with ordinary or wide characters.  (See §2.2.1.2.) 

3.1.5  Operators

Assignment operators of the form =+, described as old fashioned even in K&R, have been dropped. 

The form += is now defined to be a single token, not two, so no white space is permitted within it; no compelling case could be made for permitting such white space. 

The operator # has been added in preprocessing statements: within a #define it causes the macro argument following to be converted to a string literal. 

The operator ## has also been added in preprocessing statements: within a #define it causes the tokens on either side to be pasted to make a single new token.  See §3.8.3 for further discussion of these preprocessing operators. 
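
A minimal sketch of both operators follows.  The macro names STRINGIZE and PASTE, the variable counter_total, and the function names are all hypothetical.

```c
#include <string.h>

#define STRINGIZE(arg)     #arg          /* argument becomes a string literal */
#define PASTE(left, right) left##right   /* two tokens become one             */

static int counter_total = 7;

/* PASTE(counter_, total) is rescanned as the identifier counter_total. */
int pasted_identifier_value(void)
{
    return PASTE(counter_, total);
}

/* STRINGIZE(a + b) yields the string literal "a + b". */
int stringized_matches(void)
{
    return strcmp(STRINGIZE(a + b), "a + b") == 0;
}
```

Neither a nor b needs to be declared, since the argument of # is never compiled as an expression.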

3.1.6  Punctuators

The punctuator ... (ellipsis) has been added to denote a variable number of trailing arguments in a function prototype.  (See §3.5.4.3.) 
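
A function with a variable number of trailing arguments might be written as in this sketch.  The function sum_ints is hypothetical; va_list, va_start, va_arg, and va_end are the facilities of <stdarg.h>.

```c
#include <stdarg.h>

/* Hypothetical helper: adds up `count` trailing int arguments. */
int sum_ints(int count, ...)
{
    va_list args;
    int total = 0;
    int i;

    va_start(args, count);
    for (i = 0; i < count; i++)
        total += va_arg(args, int);
    va_end(args);
    return total;
}
```

The caller must state how many arguments follow, since the ellipsis itself conveys neither their number nor their types.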

The constraint that certain punctuators must occur in pairs (and the similar constraint on certain operators in §3.1.5) only applies after preprocessing.  Syntactic constraints are checked during syntactic analysis, and this follows preprocessing. 

3.1.7  Header names

Header names in #include directives obey distinct tokenization rules; hence they are identified as distinct tokens.  Attempting to treat quote-enclosed header names as string literals creates a contorted description of preprocessing, and the problems of treating angle-bracket-enclosed header names as a sequence of C tokens are even more severe. 

3.1.8  Preprocessing numbers

The notion of preprocessing numbers has been introduced to simplify the description of preprocessing.  It provides a means of talking about the tokenization of strings that look like numbers, or initial substrings of numbers, prior to their semantic interpretation.  In the interests of keeping the description simple, occasional spurious forms are scanned as preprocessing numbers --- 0x123E+1 is a single token under the rules. The Committee felt that it was better to tolerate such anomalies than burden the preprocessor with a more exact, and exacting, lexical specification.  It felt that this anomaly was no worse than the principle under which the characters a+++++b are tokenized as a ++ ++ + b (an invalid expression), even though the tokenization a ++ + ++ b would yield a syntactically correct expression. In both cases, exercise of reasonable precaution in coding style avoids surprises. 
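
The ``reasonable precaution'' amounts to inserting white space, as in this sketch (the function names are hypothetical):

```c
/* Written without spaces, 0x123E+1 would be scanned as one preprocessing
   number and then rejected as an invalid constant; the spaces make it
   three tokens. */
int hex_plus_one(void)
{
    return 0x123E + 1;
}

/* a+++++b would be tokenized as a ++ ++ + b; spacing selects the
   intended reading. */
int post_plus_pre(int a, int b)
{
    return a++ + ++b;
}
```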

3.1.9  Comments

The Committee considered proposals to allow comments to nest.  The main argument for nesting comments is that it would allow programmers to ``comment out'' code.  The Committee rejected this proposal on the grounds that comments should be used for adding documentation to a program, and that preferable mechanisms already exist for source code exclusion.  For example,

        #if 0
        /* this code is bracketed out because ... */
        code_to_be_excluded();
        #endif
Preprocessing directives such as this prevent the enclosed code from being scanned by later translation phases.  Bracketed material can include comments and other, nested, regions of bracketed code. 

Another way of accomplishing these goals is with an if statement:

        if (0) {
            /* this code is bracketed out because ... */
            code_to_be_excluded();
        }
Many modern compilers will generate no code for this if statement. 