
2 Environment

Because C has seen widespread use as a cross-compiled language, a clear distinction must be made between translation and execution environments. The preprocessor, for instance, is permitted to evaluate the expression in a #if statement using the long integer arithmetic native to the translation environment: these integers must comprise at least 32 bits, but need not match the number of bits in the execution environment. Other translate-time arithmetic, however, such as type casting and floating arithmetic, must more closely model the execution environment regardless of translation environment.
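As a small illustration of the point about #if arithmetic, the following sketch relies only on the guarantee that the preprocessor evaluates such expressions with at least 32 bits. The macro PP_PRODUCT_FITS and the function pp_product_fits are hypothetical names chosen for this example.

```c
/* 30000 * 30000 is 900,000,000, which does not fit in 16 bits.
 * In a #if expression the operands are evaluated with the
 * translator's long arithmetic (at least 32 bits), so the
 * comparison holds even when the execution environment has
 * a 16-bit int. */
#if 30000 * 30000 == 900000000
#define PP_PRODUCT_FITS 1
#else
#define PP_PRODUCT_FITS 0
#endif

/* Hypothetical helper: reports what the preprocessor decided. */
int pp_product_fits(void)
{
    return PP_PRODUCT_FITS;
}
```

The same multiplication written in an ordinary run-time expression could overflow on an implementation with a 16-bit int, which is the distinction the paragraph above draws.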

2.1 Conceptual models

The ``as if'' principle is invoked repeatedly in this Rationale. The Committee has found that describing various aspects of the C language, library, and environment in terms of concrete models best serves discussion and presentation. Every attempt has been made to craft the models so that implementors are constrained only insofar as they must bring about the same result, as if they had implemented the presentation model; often enough the clearest model would make for the worst implementation.

2.1.1 Translation environment

2.1.1.1 Program structure

The terms source file, external linkage, linked, libraries, and executable program all imply a conventional compiler-linker combination. All of these concepts have shaped the semantics of C, however, and are inescapable even in an interpreted environment. Thus, while implementations are not required to support separate compilation and linking with libraries, in some ways they must behave as if they do.

2.1.1.2 Translation phases

Perhaps the greatest undesirable diversity among existing C implementations can be found in preprocessing. Admittedly a distinct and primitive language superimposed upon C, the preprocessing commands accreted over time, with little central direction, and with even less precision in their documentation. This evolution has resulted in a variety of local features, each with its ardent adherents: the Base Document offers little clear basis for choosing one over the other.

The consensus of the Committee is that preprocessing should be simple and overt, that it should sacrifice power for clarity. For instance, the macro invocation f(a, b) should assuredly have two actual arguments, even if b expands to c, d; and the formal definition of f must call for exactly two arguments. Above all, the preprocessing sub-language should be specified precisely enough to minimize or eliminate dialect formation.
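The argument-counting rule can be seen in a short sketch. The names sum2, TWO, CALL, and call_with_two are hypothetical, introduced only for this illustration.

```c
/* An ordinary function taking two arguments. */
static int sum2(int a, int b)
{
    return a + b;
}

#define TWO 10, 20            /* expands to text containing a comma */
#define CALL(f, args) f(args) /* formal definition calls for exactly two arguments */

int call_with_two(void)
{
    /* CALL is invoked with exactly two actual arguments, sum2 and TWO.
     * The arguments are identified before TWO is expanded, so the comma
     * inside TWO does not create a third macro argument. After
     * substitution the text reads sum2(10, 20). */
    return CALL(sum2, TWO);
}
```

Here call_with_two returns 30: the comma hidden in TWO becomes visible only after the arguments of CALL have already been counted.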

To clarify the nature of preprocessing, the translation from source text to tokens is spelled out as a number of separate phases. The separate phases need not actually be present in the translator, but the net effect must be as if they were. The phases need not be performed in a separate preprocessor, although the definition certainly permits this common practice. Since the preprocessor need not know anything about the specific properties of the target, a machine-independent implementation is permissible.

The Committee deemed that it was outside the scope of its mandate to require the output of the preprocessing phases be available as a separate translator output file.

The phases of translation are spelled out to resolve the numerous questions raised about the precedence of different parses. Can a #define begin a comment? (No.) Is backslash/new-line permitted within a trigraph? (No.) Must a comment be contained within one #include file? (Yes.) And so on. The Rationale section on preprocessing (§3.8) discusses the reasons for many of the particular decisions which shaped the specification of the phases of translation.

A backslash immediately before a new-line has long been used to continue string literals, as well as preprocessing command lines. In the interest of easing machine generation of C, and of transporting code to machines with restrictive physical line lengths, the Committee generalized this mechanism to permit any token to be continued by interposing a backslash/new-line sequence.
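A brief sketch of the generalized mechanism follows; the function names are hypothetical. Note that the continuation line must begin in the first column, since any leading white space would become part of the spliced text.

```c
/* A numeric constant continued across a line break: the
 * backslash/new-line pair is deleted before tokenization,
 * so the translator sees the single token 1234. */
int spliced_number(void)
{
    return 12\
34;
}

/* The traditional use: continuing a string literal. */
const char *spliced_string(void)
{
    return "abc\
def";
}
```

In this sketch spliced_number returns 1234 and spliced_string returns the six-character string "abcdef".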

2.1.1.3 Diagnostics

By mandating some form of diagnostic message for any program containing a syntax error or constraint violation, the Standard performs two important services. First, it gives teeth to the concept of erroneous program, since a conforming implementation must distinguish such a program from a valid one. Second, it severely constrains the nature of extensions permissible to a conforming implementation.

The Standard says nothing about the nature of the diagnostic message, which could simply be ``syntax error'', with no hint of where the error occurs. (An implementation must, of course, describe what translator output constitutes a diagnostic message, so that the user can recognize it as such.) The Committee ultimately decided that any diagnostic activity beyond this level is an issue of quality of implementation, and that market forces would encourage more useful diagnostics. Nevertheless, the Committee felt that at least some significant class of errors must be diagnosed, and the class specified should be recognizable by all translators.

The Standard does not forbid extensions, but such extensions must not invalidate strictly conforming programs. The translator must diagnose the use of such extensions, or allow them to be disabled as discussed in (Rationale) §1.7. Otherwise, extensions to a conforming C implementation lie in such realms as defining semantics for syntax to which no semantics is ascribed by the Standard, or giving meaning to undefined behavior.

2.1.2 Execution environments

The definition of program startup in the Standard is designed to permit initialization of static storage by executable code, as well as by data translated into the program image.

2.1.2.1 Freestanding environment

As little as possible is said about freestanding environments, since little is served by constraining them.

2.1.2.2 Hosted environment

The properties required of a hosted environment are spelled out in a fair amount of detail in order to give programmers a reasonable chance of writing programs which are portable among such environments.

The behavior of the arguments to main, and of the interaction of exit, main and atexit (see §4.10.4.2) has been codified to curb some unwanted variety in the representation of argv strings, and in the meaning of values returned by main.

The specification of argc and argv as arguments to main recognizes extensive prior practice. argv[argc] is required to be a null pointer to provide a redundant check for the end of the list, also on the basis of common practice.
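The redundant end-of-list check means that an argument vector can be traversed without consulting argc at all. The helper count_args below is a hypothetical name used only for this sketch.

```c
#include <stddef.h>

/* Counts the strings in an argument vector by walking to the
 * terminating null pointer, i.e. the element argv[argc]. */
int count_args(char **argv)
{
    int n = 0;

    while (argv[n] != NULL)
        ++n;
    return n;
}
```

When called with the argv passed to main, the result should equal argc.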

main is the only function that may portably be declared either with zero or two arguments. (The number of arguments must ordinarily match exactly between invocation and definition.) This special case simply recognizes the widespread practice of leaving off the arguments to main when the program does not access the program argument strings. While many implementations support more than two arguments to main, such practice is neither blessed nor forbidden by the Standard; a program that defines main with three arguments is not strictly conforming. (See Standard Appendix F.5.1.)

Command line I/O redirection is not mandated by the Standard; this was deemed to be a feature of the underlying operating system rather than the C language.

2.1.2.3 Program execution

Because C expressions can contain side effects, issues of sequencing are important in expression evaluation. (See §3.3.) Most operators impose no sequencing requirements, but a few operators impose sequence points upon the evaluation: comma, logical-AND, logical-OR, and conditional. For example, in the expression (i = 1, a[i] = 0) the side effect (alteration to storage) specified by i = 1 must be completed before the expression a[i] = 0 is evaluated.

Other sequence points are imposed by statement execution and completion of evaluation of a full expression. (See §3.6.) Thus in fn(++a), the incrementation of a must be completed before fn is called. In i = 1; a[i] = 0; the side effect of i = 1 must be complete before a[i] = 0 is evaluated.
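The sequencing guarantees described above can be exercised directly. The function names in this sketch are hypothetical.

```c
#include <stddef.h>

/* Comma operator: the store to i is complete before a[i] is
 * evaluated, so the element written is always a[1]. */
int store_after_assign(int a[])
{
    int i = 0;

    (i = 1, a[i] = 0);
    return i;
}

/* Logical-AND: the left operand is completely evaluated before
 * the right one, and the right operand is evaluated only when
 * the left is nonzero, so the pointer is never dereferenced
 * when it is null. */
int is_positive(const int *p)
{
    return p != NULL && *p > 0;
}
```

By contrast, an expression such as a[i] = i++ contains no sequence point between the two uses of i, and so carries no such guarantee.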

The notion of agreement has to do with the relationship between the abstract machine defining the semantics and an actual implementation. An agreement point for some object or class of objects is a sequence point at which the value of the object(s) in the real implementation must agree with the value prescribed by the abstract semantics.

For example, compilers that hold variables in registers can sometimes drastically reduce execution times. In a loop like

        sum = 0;
        for (i = 0; i < N; ++i)
            sum += a[i];

both sum and i might be profitably kept in registers during the execution of the loop. Thus, the actual memory objects designated by sum and i would not change state during the loop.

Such behavior is, of course, too loose for hardware-oriented applications such as device drivers and memory-mapped I/O. The following loop looks almost identical to the previous example, but the specification of volatile ensures that each assignment to *ttyport takes place in the same sequence, and with the same values, as the (hypothetical) abstract machine would have done.

        volatile short *ttyport;
        /* ... */
        for (i = 0; i < N; ++i)
            *ttyport = a[i];

Another common optimization is to pre-compute common subexpressions. In this loop:

        volatile short *ttyport;
        short mask1, mask2;
        /* ... */
        for (i = 0; i < N; ++i)
            *ttyport = a[i] & mask1 & mask2;

evaluation of the subexpression mask1 & mask2 could be performed prior to the loop in the real implementation, assuming that neither mask1 nor mask2 appears as an operand of the address-of (&) operator anywhere in the function. In the abstract machine, of course, this subexpression is re-evaluated at each loop iteration, but the real implementation is not required to mimic this repetitiveness, because the variables mask1 and mask2 are not volatile and the same results are obtained either way.

The previous example shows that a subexpression can be pre-computed in the real implementation. A question sometimes asked regarding optimization is, ``Is the rearrangement still conforming if the pre-computed expression might raise a signal (such as division by zero)?'' Fortunately for optimizers, the answer is ``Yes,'' because any evaluation that raises a computational signal has fallen into an undefined behavior (§3.3), for which any action is allowable.

Behavior is described in terms of an abstract machine to underscore, once again, that the Standard mandates results as if certain mechanisms are used, without requiring those actual mechanisms in the implementation. The Standard specifies agreement points at which the value of an object or class of objects in an implementation must agree with the value ascribed by the abstract semantics.

Appendix B to the Standard lists the sequence points specified in the body of the Standard.

The class of interactive devices is intended to include at least asynchronous terminals, or paired display screens and keyboards. An implementation may extend the definition to include other input and output devices, or even network inter-program connections, provided they obey the Standard's characterization of interactivity.

2.2 Environmental considerations

2.2.1 Character sets

The Committee ultimately came to remarkable unanimity on the subject of character set requirements. There was strong sentiment that C should not be tied to ASCII, despite its heritage and despite the precedent of Ada being defined in terms of ASCII. Rather, an implementation is required to provide a unique character code for each of the printable graphics used by C, and for each of the control codes representable by an escape sequence. (No particular graphic representation for any character is prescribed --- thus the common Japanese practice of using the glyph ¥ for the C character \ is perfectly legitimate.) Translation and execution environments may have different character sets, but each must meet this requirement in its own way. The goal is to ensure that a conforming implementation can translate a C translator written in C.

For this reason, and economy of description, source code is described as if it undergoes the same translation as text that is input by the standard library I/O routines: each line is terminated by some new-line character, regardless of its external representation.

2.2.1.1 Trigraph sequences

Trigraph sequences have been introduced as alternate spellings of some characters to allow the implementation of C in character sets which do not provide a sufficient number of non-alphabetic graphics.

Implementations are required to support these alternate spellings, even if the character set in use is ASCII, in order to allow transportation of code from systems which must use the trigraphs.

The Committee faced a serious problem in trying to define a character set for C. Not all of the character sets in general use have the right number of characters, nor do they support the graphical symbols that C users expect to see. For instance, many character sets for languages other than English resemble ASCII except that codes used for graphic characters in ASCII are instead used for extra alphabetic characters or diacritical marks. C relies upon a richer set of graphic characters than most other programming languages, so the representation of programs in character sets other than ASCII is a greater problem than for most other programming languages.

The International Organization for Standardization (ISO) uses three technical terms to describe character sets: repertoire, collating sequence, and codeset. The repertoire is the set of distinct printable characters. The term abstracts the notion of printable character from any particular representation; the glyph R, whether set in roman, calligraphic, slanted, sans-serif, or any other typeface, represents the same element of the repertoire, upper-case-R, which is distinct from lower-case-r. Having decided on the repertoire to be used (C needs a repertoire of 96 characters), one can then pick a collating sequence which corresponds to the internal representation in a computer. The repertoire and collating sequence together form the codeset.

What is needed for C is to determine the necessary repertoire, ignore the collating sequence altogether (it is of no importance to the language), and then find ways of expressing the repertoire in a way that should give no problems with currently popular codesets.

C derived its repertoire from the ASCII codeset. Unfortunately the ASCII repertoire is not a subset of all other commonly used character sets, and widespread practice in Europe is not to implement all of ASCII either, but to use some parts of its collating sequence for special national characters.

The solution is an internationally agreed-upon repertoire, in terms of which an international representation of C can be defined. The ISO has defined such a standard: ISO 646 describes an invariant subset of ASCII.

The characters in the ASCII repertoire used by C and absent from the ISO 646 repertoire are:

        # [ ] { } \ | ~ ^

Given this repertoire, the Committee faced the problem of defining representations for the absent characters. The obvious idea of defining two-character escape sequences fails because C uses all the characters which are in the ISO 646 repertoire: no single escape character is available. The best that can be done is to use a trigraph --- an escape digraph followed by a distinguishing character.

?? was selected as the escape digraph because it is not used anywhere else in C (except as noted below); it suggests that something unusual is going on. The third character was chosen with an eye to graphical similarity to the character being represented.

The sequence ?? cannot currently occur anywhere in a legal C program except in strings, character constants, comments, or header names. The character escape sequence '\?' (see §3.1.3.4) was introduced to allow two adjacent question-marks in such contexts to be represented as ?\?, a form distinct from the escape digraph.
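A minimal sketch of the '\?' escape follows. The function name is hypothetical. The string is written so that it never contains two adjacent question-mark characters in the source text, and therefore reads the same whether or not a translator performs trigraph replacement.

```c
/* Returns a three-character string: two question marks
 * followed by an exclamation mark. Writing the second
 * question mark as \? keeps the source text from containing
 * the escape digraph, so no trigraph is recognized. */
const char *two_question_marks(void)
{
    return "?\?!";
}
```

Had the literal been spelled with two adjacent question marks before the exclamation mark, a translator that replaces trigraphs would produce a one-character string containing the vertical bar instead.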

The Committee makes no claims that a program written using trigraphs looks attractive. As a matter of style, it may be wise to surround trigraphs with white space, so that they stand out better in program text. Some users may wish to define preprocessing macros for some or all of the trigraph sequences.

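QUIET CHANGE

Programs with character sequences such as ??! in string constants, character constants, or header names will now produce different results.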

2.2.1.2 Multibyte characters

The ``byte = character'' orientation of C works well for text in Western alphabets, where the size of the character set is under 256. The fit is rather uncomfortable for languages such as Japanese and Chinese, where the repertoire of ideograms numbers in the thousands or tens of thousands.

Internally, such character sets can be represented as numeric codes, and it is merely necessary to choose the appropriate integral type to hold any such character.

Externally, whether in the files manipulated by a program, or in the text of the source files themselves, a conversion between these large codes and the various byte media is necessary.

The support in C of large character sets is based on these principles:

- Multibyte encodings of large character sets are necessary in I/O operations, in source text comments, and in source text string and character literals.
- No existing multibyte encoding is mandated in preference to any other; no widespread existing encoding should be precluded.
- The null character ('\0') may not be used as part of a multibyte encoding, except for the one-byte null character itself. This allows existing functions which manipulate strings transparently to work with multibyte sequences.
- Shift encodings (which interpret byte sequences in part on the basis of some state information) must start out in a known (default) shift state under certain circumstances, such as the start of string literals.
- The minimum number of absolutely necessary library functions is introduced. (See §4.10.7.)
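The principle concerning the null character can be illustrated with the ordinary string functions. In the sketch below the byte values are an assumption made for the example (they are the UTF-8 encoding of two ideograms); the Standard mandates no particular encoding, and the names used are hypothetical.

```c
#include <string.h>

/* Two characters in an assumed multibyte encoding, three bytes
 * each. No byte of the sequence is zero, so the only null byte
 * is the string terminator. */
static const char text[] = "\xE6\x97\xA5\xE6\x9C\xAC";

/* strlen counts bytes, not characters, but it finds the end of
 * the string correctly because no multibyte character contains
 * a null byte. */
size_t multibyte_byte_count(void)
{
    return strlen(text);
}

/* strcpy likewise copies the whole sequence transparently.
 * The destination must have room for at least 7 bytes. */
void copy_multibyte(char *dst)
{
    strcpy(dst, text);
}
```

Under these assumptions multibyte_byte_count returns 6, the number of bytes rather than the number of characters; counting characters requires the multibyte library functions.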

2.2.2 Character display semantics

The Standard defines a number of internal character codes for specifying ``format effecting actions on display devices,'' and provides printable escape sequences for each of them. These character codes are clearly modelled after ASCII control codes, and the mnemonic letters used to specify their escape sequences reflect this heritage. Nevertheless, they are internal codes for specifying the format of a display in an environment-independent manner; they must be written to a text file to effect formatting on a display device. The Standard states quite clearly that the external representation of a text file (or data stream) may well differ from the internal form, both in character codes and number of characters needed to represent a single internal code.

The distinction between internal and external codes most needs emphasis with respect to new-line. ANSI X3L2 (Codes and Character Sets) uses the term to refer to an external code used for information interchange whose display semantics specify a move to the next line. Both ANSI X3L2 and ISO 646 deprecate the combination of the motion to the next line with a motion to the initial position on the line. The C Standard, on the other hand, uses new-line to designate the end-of-line internal code represented by the escape sequence '\n'. While this ambiguity is perhaps unfortunate, use of the term in the latter sense is nearly universal within the C community. But the knowledge that this internal code has numerous external representations, depending upon operating system and medium, is equally widespread.

The alert sequence ('\a') has been added by popular demand, to replace, for instance, the ASCII BEL code explicitly coded as '\007'.

Proposals to add '\e' for ASCII ESC ('\033') were not adopted because other popular character sets such as EBCDIC have no obvious equivalent. (See §3.1.3.4.)

The vertical tab sequence ('\v') was added since many existing implementations support it, and since it is convenient to have a designation within the language for all the defined white space characters.

The semantics of the motion control escape sequences carefully avoid the Western language assumptions that printing advances left-to-right and top-to-bottom.

To avoid the issue of whether an implementation conforms if it cannot properly effect vertical tabs (for instance), the Standard emphasizes that the semantics merely describe intent.

2.2.3 Signals and interrupts

Signals are difficult to specify in a system-independent way. The Committee concluded that about the only thing a strictly conforming program can do in a signal handler is to assign a value to a volatile static variable which can be written uninterruptedly, and then promptly return. (The header <signal.h> specifies a type sig_atomic_t which can be so written.) It is further guaranteed that a signal handler will not corrupt the automatic storage of an instantiation of any executing function, even if that function is called within the signal handler.
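A handler restricted in this way might look like the following sketch. The names got_signal, on_signal, and raise_and_check are hypothetical, and raise is used here only to trigger the handler synchronously for demonstration.

```c
#include <signal.h>

/* The only object the handler touches: a volatile static
 * variable of a type that can be written uninterruptedly. */
static volatile sig_atomic_t got_signal = 0;

/* The handler assigns to the flag and returns at once. */
static void on_signal(int sig)
{
    (void)sig;          /* the signal number is not needed here */
    got_signal = 1;
}

/* Installs the handler, raises SIGINT, and reports the flag.
 * Returns -1 if the handler could not be installed or the
 * signal could not be raised. */
int raise_and_check(void)
{
    got_signal = 0;
    if (signal(SIGINT, on_signal) == SIG_ERR)
        return -1;
    if (raise(SIGINT) != 0)
        return -1;
    return (int)got_signal;
}
```

The rest of the program would then test the flag at convenient points outside the handler and do any real work there.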

No such guarantees can be extended to library functions, with the explicit exceptions of longjmp (§4.6.2.1) and signal (§4.7.1.1), since the library functions may be arbitrarily interrelated and since some of them have profound effect on the environment.

Calls to longjmp are problematic, despite the assurances of §4.6.2.1. The signal could have occurred during the execution of some library function which was in the process of updating external state and/or static variables.

A second signal for the same handler could occur before the first is processed, and the Standard makes no guarantees as to what happens to the second signal.

2.2.4 Environmental limits

The Committee agreed that the Standard must say something about certain capacities and limitations, but just how to enforce these treaty points was the topic of considerable debate.

2.2.4.1 Translation limits

The Standard requires that an implementation be able to translate and compile some program that meets each of the stated limits. This criterion was felt to give a useful latitude to the implementor in meeting these limits. While a deficient implementation could probably contrive a program that meets this requirement, yet still succeed in being useless, the Committee felt that such ingenuity would probably require more work than making something useful. The sense of the Committee is that implementors should not construe the translation limits as the values of hard-wired parameters, but rather as a set of criteria by which an implementation will be judged.

Some of the limits chosen represent interesting compromises. The goal was to allow reasonably large portable programs to be written, without placing excessive burdens on reasonably small implementations.

The minimum maximum limit of 257 cases in a switch statement allows coding of lexical routines which can branch on any character (one of at least 256 values) or on the value EOF.
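A lexical routine of the kind the limit is meant to accommodate is sketched below, much abbreviated. The function classify and its return codes are hypothetical.

```c
#include <stdio.h>

/* Classifies one value as returned by getc: either a character
 * or EOF. A full lexer could carry one case label per character
 * value plus one for EOF, 257 labels in all on a machine with
 * 8-bit characters.
 * Returns 0 for EOF, 1 for white space, 2 for a digit, and 3
 * for anything else. */
int classify(int c)
{
    switch (c) {
    case EOF:
        return 0;
    case ' ': case '\t': case '\n':
        return 1;
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
        return 2;
    default:
        return 3;
    }
}
```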

2.2.4.2 Numerical limits

In addition to the discussion below, see §4.1.4.

2.2.4.2.1 Sizes of integral types `<limits.h>`

Such a large body of C code has been developed for 8-bit byte machines that the integer sizes in such environments must be considered normative. The prescribed limits are minima: an implementation on a machine with 9-bit bytes can be conforming, as can an implementation that defines int to be the same width as long. The negative limits have been chosen to accommodate ones-complement or sign-magnitude implementations, as well as the more usual twos-complement. The limits for the maxima and minima of unsigned types are specified as unsigned constants (e.g., 65535u) to avoid surprising widenings of expressions involving these extrema.

The macro CHAR_BIT makes available the number of bits in a char object. The Committee saw little utility in adding such macros for other data types.
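Because the prescribed limits are minima, a program can check at translation time the magnitudes it depends upon. The sketch below is illustrative only; bits_in_int is a hypothetical helper, and the value it computes counts storage bits, which on unusual implementations may include padding.

```c
#include <limits.h>
#include <stddef.h>

/* These conditions hold on every conforming implementation;
 * a program needing more (say, a 32-bit int) would test for
 * that in the same way. */
#if CHAR_BIT < 8 || INT_MAX < 32767 || LONG_MAX < 2147483647
#error "implementation does not meet the minimum limits"
#endif

/* Number of bits of storage occupied by an int, derived from
 * CHAR_BIT since no separate macro is provided for other types. */
size_t bits_in_int(void)
{
    return sizeof(int) * CHAR_BIT;
}
```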

The names associated with the short int types (SHRT_MIN, etc., rather than SHORT_MIN, etc.) reflect prior art rather than obsessive abbreviation on the Committee's part.

2.2.4.2.2 Characteristics of floating types `<float.h>`

The characterization of floating point follows, with minor changes, that of the FORTRAN standardization committee (X3J3). [Footnote: See X3J3 working document S8-112.] The Committee chose to follow the FORTRAN model in some part out of a concern for FORTRAN-to-C translation, and in large part out of deference to the FORTRAN committee's greater experience with fine points of floating point usage. Note that the floating point model adopted permits all common representations, including sign-magnitude and twos-complement, but precludes a logarithmic implementation.

Single precision (32-bit) floating point is considered adequate to support a conforming C implementation. Thus the minimum maxima constraining floating types are extremely permissive.

The Committee has also endeavored to accommodate the IEEE 754 floating point standard by not adopting any constraints on floating point which are contrary to this standard.

The term FLT_MANT_DIG stands for ``float mantissa digits.'' The Standard now uses the more precise term significand rather than mantissa.
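The permissive floating-point minima can be examined through the <float.h> macros. The sketch below uses the minimum magnitudes the Standard prescribes for float (6 decimal digits, a maximum of at least 1E+37, an epsilon of at most 1E-5); the function names are hypothetical.

```c
#include <float.h>

/* Returns 1 if type float meets the minimum characteristics
 * required of it, which every conforming implementation should. */
int float_meets_minimums(void)
{
    return FLT_DIG >= 6
        && FLT_MAX >= 1e37
        && FLT_EPSILON <= 1e-5;
}

/* Number of base-FLT_RADIX digits in the significand of a float. */
int float_significand_digits(void)
{
    return FLT_MANT_DIG;
}
```

On an implementation using IEEE 754 single precision, float_significand_digits would be expected to report 24, but that figure is a property of the implementation and not a requirement of the Standard.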
