A Brief History of Strings

Posted August 5, 2012

Delphi strings in depth

Strings are a bit of a mystery to most Delphi developers. The fact that new features are added in almost every release probably does not help. For example, few developers know the difference between string, ShortString, long strings, AnsiString, WideString, openstring, and so on.

ShortString

We should start at the beginning. In the days of Delphi's direct ancestor, Turbo Pascal, there was only one kind of string: the ShortString type. Though this is what we call it these days, in Turbo Pascal it was the default string type, and was simply referred to as string. ShortStrings "live" on the stack (unless "manually" allocated), and not on the heap. As such, from a memory allocation point of view, the ShortString is no different from other statically-allocated types like integers, booleans, records and enumerated types. Since it is statically allocated, a ShortString's size cannot change at run-time, so operations like insertion and concatenation pose a problem. To circumvent this, Turbo Pascal (and Delphi) preallocate the maximum size of 256 bytes for every string instance. Since the first element (astring[0]) is used to hold the actual length of the string, a ShortString may have a maximum of 255 characters, which, conveniently, is also the largest value the length byte can hold. For example, a variable declared as:

        var s: shortstring;

actually consumes 256 bytes on the stack. ShortString variables, whether as local variables, global variables or within objects or records consume 256 bytes, whatever length the actual "payload" string happens to be. To better understand the structure of a ShortString, we present a "semantic equivalent", which is how a user might declare a ShortString if it were not built into the language. A ShortString declared as above is semantically equivalent to:

        var s: array [0..255] of char;

And the ShortString operations:

        1) s := 'abc';
        2) s := s + t;

are semantically equivalent to:

        1) s[1] := 'a';
           s[2] := 'b';
           s[3] := 'c';
           s[0] := #3;

        2) Move( t[1], s[ord(s[0])+1], ord(t[0]) );  // append t's characters after s's
           s[0] := chr( ord(s[0]) + ord(t[0]) );     // update the length byte

In fact you could actually access s[0] in code and experiment with it. In memory, a ShortString consists of the length byte, which holds n (the number of payload characters), followed by the n payload characters, followed by unused space up to the declared maximum size.

For a string of default length, that is, declared simply as string, max size is always 255. A ShortString is never allowed to grow to more than 255 characters. This limitation is imposed by the size of the count field (1 unsigned byte), and also by the (im)practicality of having strings waste large amounts of unused space. There is a way to address this wastage problem somewhat. Turbo Pascal and Delphi both allow a string to specify its maximum size when it is declared. For example:

        var s: string[20];  // maximum 20 characters

declares a string that can be at most 20 characters long. It still reserves a length byte, so 21 bytes are actually consumed. But doing this presented new problems. Because of its strong type-checking, TP didn't allow a string variable declared with one length to be passed as a var parameter to a function expecting a different string length:

        // 'strict var strings' turned on
        type
            String20 = string[20];
            String30 = string[30];

        procedure Foo( var aString: String20 );
            ...
        var s: String30;
            ...
        Foo(s); // error: incompatible types

A more technically correct explanation of why this is a problem is that, in TP, strings were treated just like any other parameter; that is, they were placed on the stack. A function expecting a string of a certain length then reserved an appropriate amount of space on the stack. Thus, passing strings of a different length can pose problems. To solve this, the openstring type was introduced. It was a kind of "generic" string type that could accept a string declared to be of any length:

        type
            String20 = string[20];
            String30 = string[30];

        procedure Bar( var aString: openstring );
            ...

        var s: String20;
            t: String30;
            ...
        Bar(s); // OK
        Bar(t); // OK

Note that this only applies to variable (var) parameters. For value parameters, a copy of the string parameter was required to be on the stack, and the compiler always allocates a maximum-length string to hold the parameter. There was also the "strict var strings" setting, which when turned off, relaxed string type checking. In Delphi, any string declaration that specifies a maximum length is considered a ShortString. Also, the maximum length specifier may not exceed 255:

        var s: string[256]; // error
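
The sizes and the length byte described in this section can be checked with a small console program. This is only a sketch: the program name and variable names are made up for illustration, and it assumes a desktop Delphi compiler where ShortString is available.

```pascal
program ShortStringDemo;
{$APPTYPE CONSOLE}

var
  s: ShortString;   // default maximum length of 255
  t: string[20];    // declared maximum length of 20

begin
  s := 'abc';
  WriteLn(SizeOf(s));    // 256: length byte + 255 characters
  WriteLn(SizeOf(t));    // 21: length byte + 20 characters
  WriteLn(Ord(s[0]));    // 3: the length byte holds the current length
  s[0] := #2;            // overwriting the length byte truncates the string
  WriteLn(s);            // ab
end.
```

Note that SizeOf reports the reserved storage, not the current length; Length(s) would return the same value as Ord(s[0]).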

PChar

With all of the concessions given to ShortString, it still didn't address the upper limit of 255 characters, which is a very serious limitation for real-world applications. In contrast, the C language, which the Windows API is "friendly" to, had the character pointer (char *). It allowed strings of any length, limited only by how much memory the native pointer type can access, and of course, by available memory. However, it required the user to explicitly manage allocation/deallocation of all strings, spawned subtle bugs, and was inefficient with many operations, since any function for taking the length had to traverse the entire string for every call.

Delphi 1 introduced PChar, which was equivalent to C's char pointer. In 16-bit Delphi 1, it had a limit of 65,535 characters, imposed by 16-bit Windows' memory segmentation. As in C, you also had to explicitly manage PChar allocation/deallocation (surprisingly, the default string type was still ShortString, limited to 255 characters). Unlike ShortStrings, PChars are actually pointers, with all that entails (see the section on assignment semantics). Many developers also misunderstood them. A common error is:

        var s: PChar;
            ...
        s := 'Hello';      // danger: pointer to static data; no memory allocated
        s := s + ' World'; // error: memory corrupted; may be undetected

In this example no memory was allocated; s was simply loaded with the address of the string 'Hello', in whatever place the compiler chose to put it. Writing through that pointer usually has disastrous results. The important thing to keep in mind is that PChar is not a drop-in replacement for the default string type. Unlike a ShortString, a PChar variable is only 4 bytes in size on a 32-bit platform (sizeof(PChar) = 4). It is the array of characters being pointed to that actually holds the string data.

Also, ShortString character indexing starts at 1 (because of the length byte), while PChars are zero-based; there is no "length byte". This convention is copied directly from C. Usually, memory must be allocated for PChars explicitly, using StrAlloc/StrDispose or GetMem/FreeMem. In place of explicit allocation, an array of characters can also serve as a buffer for PChars:

        var s: PChar;
            arr: array [0..20] of char; // space for 20 chars, plus null
        	...
        s := @arr;

But this can lead to subtle errors:

        // this function is FLAWED; do not use

        function Combine( s1, s2: PChar ): PChar;
        (* returns concatenated strings, leaves sources intact *)
        var buf: array [0..1024] of char;
        begin
            Result := @buf;
            StrCopy( Result, s1 );
            StrCat( Result, s2 );
        end; // error: will return random stack data

Depending on what happens to the region of memory that was occupied by the function's stack, this function might even appear to work, while occasionally returning random data. So static allocation is dangerous; with explicit (dynamic) allocation, on the other hand, users might forget to release the allocated memory. Where provided, the null-terminated string routines must always be used when working with PChars. The names of these functions generally start with the 'Str' prefix (e.g. StrCopy, StrCat, StrLen), and work like their C-library namesakes.
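
One way to repair the flawed Combine function is to allocate the result on the heap and make the caller responsible for releasing it. The following is a minimal sketch of that idea, not the only possible fix; it assumes the SysUtils versions of StrAlloc, StrLen, StrCopy, StrCat and StrDispose, and the ownership rule ("caller frees") is a convention chosen here for illustration.

```pascal
program CombineDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Returns a newly allocated buffer holding s1 followed by s2.
// The caller owns the result and must release it with StrDispose.
function Combine(s1, s2: PChar): PChar;
begin
  // room for both payloads plus the terminating null
  Result := StrAlloc(StrLen(s1) + StrLen(s2) + 1);
  StrCopy(Result, s1);
  StrCat(Result, s2);
end;

var
  p: PChar;

begin
  p := Combine('Hello, ', 'World');
  try
    WriteLn(p);        // Hello, World
  finally
    StrDispose(p);     // forgetting this call would leak the buffer
  end;
end.
```

The buffer now outlives the function call, but the cost is exactly the one described above: every caller has to remember the matching StrDispose.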

Again, we present semantic equivalents. For a PChar and associated operations:

        1) var p: PChar;
        2) p := StrAlloc( 4 );
        3) StrCopy( p, 'abc' );
        4) StrCat( p, t );
        5) StrDispose( p );

Have the semantic equivalents:

        1) var p: ^char;
        2) GetMem( p, 4 );

        3) p^[0] := 'a';
           p^[1] := 'b';
           p^[2] := 'c';
           p^[3] := #0;

        4) { somehow find index of first zero char from start of p, store in temporary variable _p }
           { somehow find index of first zero char from start of t, store in temporary variable _t }
           { the buffer at p must have room for _p + _t + 1 characters }
           Move( t^[0], p^[_p], _t );
           p^[_p + _t] := #0;

        5) FreeMem( p );

Here we see that a PChar variable is actually a pointer, and that Delphi simply lets us get away with saying p[1] instead of p^[1]. Note the {somehow find index of first zero char} operations in item 4. These illustrate serious performance penalties that a program incurs for each call of StrLen(). Since there is no simple way to "look up" a PChar's length, the entire string must be stepped through to find the first zero character, whose index would be equivalent to the string's length.
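
The "somehow find index of first zero char" step can be written out by hand to make its cost visible. The function below is a hypothetical stand-in for StrLen, named CountToNull here only for illustration; the real routine is more optimized, but it still has to examine every character.

```pascal
program StrLenDemo;
{$APPTYPE CONSOLE}

// Walks the string until the terminating null is found.
// The work done is proportional to the length of the string.
function CountToNull(p: PChar): Cardinal;
begin
  Result := 0;
  while p[Result] <> #0 do   // one comparison per character
    Inc(Result);
end;

begin
  WriteLn(CountToNull('abc'));           // 3
  WriteLn(CountToNull('Hello World'));   // 11
end.
```

A loop that calls such a function on every iteration repeats this scan each time, which is why caching the length in a local variable is a common habit with PChars.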

In memory, a PChar variable is simply a pointer to an array holding the n payload characters followed by a terminating zero character.

Typically the array part would be allocated on the heap, using some sort of dynamic allocation function (StrAlloc, GetMem), but it may also be an array on the stack. In fact Delphi lets us get away with:

        var
        	arr: array [0..20] of char;
        	p: PChar;
        ...
        	p := arr;

without even using the address-of operator or having to typecast. In some situations Delphi will even allow a char array to stand-in for a PChar; emulating, no doubt, the syntactical permissiveness of C.

Long String (AnsiString)

To provide a "low-maintenance", efficient and fast string type that could accommodate large numbers of characters, 32-bit Delphi 2 introduced the long string. It is also called AnsiString, a reference to the fact that it holds only ANSI 1-byte characters. It is automatically memory-managed, which means that it can be used much like a ShortString, without the need for explicit allocation/deallocation. AnsiString also uses a 32-bit value as a length field, which, thanks to Win32's flat address space, allows strings to grow to a staggering 2 gigabytes, subject to memory limitations of course.

AnsiString's advantages do not stop there. Like C's strings, it is null-terminated: Delphi automatically appends a zero byte at the end (actually 2 zero bytes), so it can be passed to functions requiring a PChar using a simple cast. Furthermore, it is reference-counted, with copy-on-write semantics (more on this later). Aside from the length field, each AnsiString contains a reference count. This allows Delphi to manage the string's lifetime, deleting it when it is no longer needed. Thus, long strings have the best of both worlds: the convenience of short strings and the efficiency and storage capacity of PChars, without the bad PChar aftertaste. Like ShortString, AnsiString character indexing begins at 1. Long strings may also be freely copied, modified and returned from functions, without any danger of returning corrupt data or leaking memory:

        var s: AnsiString;
        	...
        s := 'Hello';
        s := s + 'World'; // OK, memory automatically managed.

In current versions of Delphi, the string type is simply an alias for AnsiString, although this can be changed with a compiler directive ({$H-} makes string mean ShortString again), and in fact string may default to WideString in future versions.

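In memory, an AnsiString consists of a 32-bit reference count, followed by a 32-bit length field holding n (the number of payload characters), followed by the n payload characters and the terminating null.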

An AnsiString variable is actually a pointer to the first character of the string in memory. It is important to emphasize that the variable points to the first character of the string, and not to the first element of the entire structure. This may seem strange, but does not pose any problems as long as the runtime can locate the structure in memory (it can simply use negative offsets to "step back" to the header fields). An empty string (s := '';) is actually a nil pointer value.

This is why attempts to access the contents of an empty string raise an access violation. Since long strings are null-terminated, casting to PChar is very efficient. Delphi can simply pass the variable itself in place of a PChar (although Delphi performs a few additional checks before doing so).

To better understand long strings, we again present some operations and their semantic equivalents. An AnsiString declaration such as:

        var s: AnsiString;

Can be simulated by:

        type
            TAnsiStrHeader = packed record
                RefCount: longint;
                Length: longint;
            end;
            PAnsiStrHeader = ^TAnsiStrHeader;

            TAnsiStrChars = array [1..MaxInt-8] of char;
            TAnsiString = ^TAnsiStrChars;

        var s: TAnsiString;

And the following operations:

        1) s := '';
        2) s := 'abc';
        3) t := u + v;

can be simulated by:

        1) s := nil;

        2) GetMem( s, sizeof(TAnsiStrHeader) + 4 );
           // step past the header so that s points to the first character
           s := TAnsiString( integer(s) + sizeof(TAnsiStrHeader) );
           PAnsiStrHeader(integer(s)-8)^.RefCount := 1;
           PAnsiStrHeader(integer(s)-8)^.Length := 3;
           s^[1] := 'a';
           s^[2] := 'b';
           s^[3] := 'c';
           s^[4] := #0;

        3) // allocate: header + both payloads + terminating null
           GetMem( t,
                   sizeof(TAnsiStrHeader) +
                   PAnsiStrHeader(integer(u)-8)^.Length +
                   PAnsiStrHeader(integer(v)-8)^.Length + 1 );
           t := TAnsiString( integer(t) + sizeof(TAnsiStrHeader) );

           // copy
           Move( u^, t^[1], PAnsiStrHeader(integer(u)-8)^.Length );
           Move( v^,
                 t^[PAnsiStrHeader(integer(u)-8)^.Length + 1],
                 PAnsiStrHeader(integer(v)-8)^.Length );

           // header
           PAnsiStrHeader(integer(t)-8)^.RefCount := 1;
           PAnsiStrHeader(integer(t)-8)^.Length :=
                   PAnsiStrHeader(integer(u)-8)^.Length +
                   PAnsiStrHeader(integer(v)-8)^.Length;

           // terminating null
           t^[PAnsiStrHeader(integer(t)-8)^.Length + 1] := #0;

In fact, the Delphi runtime does something very similar to this "behind the scenes". The code is found in System.pas. In the above examples, to keep things simple, we have deliberately chosen instances where we would not have to manipulate the reference count beyond initializing it. Reference-counting is one of the most important new features of AnsiString, one that differentiates it from the older types. Reference-counting and copy-on-write semantics are discussed later.

(Note: The simulation of operation 2 above has been simplified; in actual fact, Delphi assigns a reference count of -1, which is a value reserved for constant strings. We could make the example more accurate by assigning a string built at runtime, i.e. "StringOfChar('A', 5)" instead of "'abc'", but that could distract from the illustration. This clarification is here for those very astute readers who couldn't let this pass and simply have to point this out.)
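
The header fields can also be inspected on a real AnsiString by stepping back from the character pointer, much as the simulation does. The sketch below assumes a Delphi compiler for Windows, where the length sits 4 bytes and the reference count 8 bytes before the first character; other compilers may lay the header out differently, so treat the offsets as assumptions rather than a documented interface.

```pascal
program AnsiHeaderDemo;
{$APPTYPE CONSOLE}

var
  s: AnsiString;
  base: PAnsiChar;

begin
  s := '';
  WriteLn(Pointer(s) = nil);       // TRUE: an empty long string is a nil pointer

  s := StringOfChar(AnsiChar('A'), 5);   // built at run time, so heap-allocated
  base := PAnsiChar(Pointer(s));         // address of the first character

  WriteLn(PInteger(base - 4)^);    // 5: the length field
  WriteLn(PInteger(base - 8)^);    // the reference count; the exact value
                                   // depends on compiler-generated temporaries
end.
```

Production code should use Length(s) rather than reading the header directly; the point here is only to show that the fields really are stored in front of the characters.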

WideString

WideString is the newest addition to the Delphi string family. It was created to address the increasing adoption of Unicode on many platforms, and it handles Unicode characters exclusively. Windows 2000, Windows XP, Java and the new .NET architecture all have been written from the ground up to work with Unicode, and in fact the newer versions of the Windows API actually convert ANSI-style strings to Unicode before working with them internally. Thus, we can reasonably expect that in the future, Unicode characters and WideStrings would replace all the other character and string types.

Semantically, WideString behaves much like AnsiString: it is automatically managed and null-terminated. Each character of a WideString is a WideChar, a new character type that holds Unicode characters. One notable difference is that on Windows a WideString is a COM BSTR, so it is not reference-counted and an assignment copies the whole string. This is also why WideString is used for working with OLE types. Unicode is one reason why you should not assume that sizeof(char)=1 in your code. Always use the sizeof() operator when doing size arithmetic: when, in the future, you compile your code with a newer version of Delphi where sizeof(char)=2, your code will adapt accordingly.
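
As an illustration of the advice about size arithmetic, the sketch below copies a string into a manually allocated buffer without assuming the size of a character. The program and variable names are invented for the example; the same source should behave identically whether Char is one byte or two.

```pascal
program CharSizeDemo;
{$APPTYPE CONSOLE}

var
  S: string;
  Buf: PChar;
  Count: Integer;

begin
  S := 'Hello';
  Count := Length(S);                        // length in characters, not bytes

  // bytes needed = (characters + terminating null) * size of one character
  GetMem(Buf, (Count + 1) * SizeOf(Char));
  try
    // Move counts bytes, so the same factor is needed here
    Move(PChar(S)^, Buf^, (Count + 1) * SizeOf(Char));
    WriteLn(Buf);                            // Hello
  finally
    FreeMem(Buf);
  end;
end.
```

Writing GetMem(Buf, Count + 1) instead would happen to work while characters are one byte wide and would silently under-allocate once they are not.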

Assignment Semantics

These string types differ in one more crucial aspect: They have different assignment semantics. Understanding these differences is crucial to using them effectively. Assignment semantics deals with the underlying behavior of the (seemingly simple) assignment construct. This is a topic very seldom given thought by developers, and this lack of understanding frequently leads to inefficient and buggy code. Let us begin with ShortString. As mentioned before, ShortStrings are statically allocated, and have assignment semantics identical to integers, booleans and records. With ShortStrings, an assignment always copies the entire contents of one string variable to another:

        var s, t: ShortString;
        	...
        s := 'Foo';
        t := s;     // entire contents copied
        s[1] := 'B'; // t still contains 'Foo'

For a PChar, which is a pointer to an (often dynamically allocated) array of characters, assignment semantics are identical to those of other pointers. An assignment only copies the contents of the pointer, not the array of characters. Thus, after a direct assignment, two PChars point to the same array:

        var s, t: PChar;
            ...
        s := StrNew('Foo'); // heap copy, so that it is safe to modify
        t := s;             // pointer copied
        s[0] := 'B';        // both contain 'Boo'

Most of the time, we actually want the contents of the string, and not the result we get with the code above. This is what the Str* functions are for. For copying PChar contents, the StrCopy function is provided:

        var s, t: array [0..50] of char;
        	...
        StrCopy( s, 'Foo' );
        StrCopy( t, s );  // contents copied
        s[0] := 'B';      // t still contains 'Foo'

One advantage of PChar over ShortString is that when strings are strictly read-only, pointer copying can be used to boost performance, since only four bytes need ever be copied for each string. The disadvantage is that programmers are tasked with remembering which strings are read-only and streamlining their code accordingly.

Long strings, on the other hand, have very different semantics. Since, like PChars, they are implicitly pointers, they are very quickly copied:

        var s, t: AnsiString;
        	...
        s := 'Foo';
        t := s;  // only pointers copied; reference count incremented

But as the comment notes, aside from copying the buffer address, the AnsiString also has its reference count incremented. This count keeps track of how many read-only "clients" an AnsiString has. Then, when one client needs to modify the buffer, a copy of the buffer is silently made, the reference count of the original buffer is decremented, and the change finally applied to the new buffer:

        var s, t: AnsiString;
        	...
        s := 'Foo';
        t := s;      // only pointers copied; reference count incremented
        s[1] := 'B'; // buffer refcount > 1: copy made, original string's refcount decremented

The steps below show a (simplified) view of how long string reference-counting works.

s := 'Foo';

    New string buffer allocated.
    Reference Count set to 1.
    Length set to 3.
    Variable s pointed to address of buffer.

t := s;

    Variable s copied to t (pointer copy).
    Buffer Reference Count incremented.

s[1] := 'B';

    Is Reference Count > 1?
    Yes, new string buffer allocated and contents copied into it.
    New buffer's Reference Count set to 1.
    Old buffer Reference Count decremented.
    Variable s pointed to new buffer.
    Edit performed on new buffer.

This is what is meant by 'reference-counting with copy-on-write semantics'. From this, long strings gain the advantages of both ShortString (convenience, no manual allocation, intuitive copying) and PChar (very fast copying, efficient, capacious storage).
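
Copy-on-write can be observed directly by comparing the buffer addresses of two AnsiString variables before and after a modification. This is a small sketch with invented names; the string is built at run time so that it is an ordinary heap-allocated string rather than a constant with the reserved reference count of -1.

```pascal
program CowDemo;
{$APPTYPE CONSOLE}

var
  s, t: AnsiString;

begin
  // built at run time so that the buffer is reference-counted normally
  s := StringOfChar(AnsiChar('F'), 1) + 'oo';

  t := s;                               // pointer copy; refcount incremented
  WriteLn(Pointer(s) = Pointer(t));     // TRUE: both share one buffer

  s[1] := 'B';                          // write triggers the silent copy
  WriteLn(Pointer(s) = Pointer(t));     // FALSE: s now has its own buffer
  WriteLn(s, ' ', t);                   // Boo Foo
end.
```

The variable that is written to is the one that receives the new buffer; t keeps pointing at the original, untouched 'Foo'.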

C treats you like a consenting adult. Pascal treats you like a naughty child. Ada treats you like a criminal.

--Bruce Powel Douglass

ems ATSIGN codexterity PERIOD com