5. Strings: high-performance strings

The generic package GNATCOLL.Strings_Impl (and its default instantiation in GNATCOLL.Strings) provides a high-performance strings implementation.

It comes in addition to Ada’s own String and Unbounded_String types, although it attempts to find a middle ground in between (flexibility vs performance).

GNATCOLL.Strings therefore provides strings (named XString, as in extended-strings) that can grow as needed (up to Natural’Last, like standard strings), yet are faster than unbounded strings. They also come with an extended API, which includes all primitive operations from unbounded strings, in addition to some subprograms inspired from GNATCOLL.Utils and the python and C++ programming languages.

5.1. Small string optimization

GNATCOLL.Strings uses a number of tricks to improve on the efficiency. The most important one is to limit the number of memory allocations. For this, we use a trick similar to what all C++ implementations do nowadays, namely the small string optimization.

The idea is that when a string is short, we can avoid all memory allocations altogether, while still keeping the string type itself small. We therefore use an Unchecked_Union, where a string can be viewed in two ways:

Small string

  [f][s][ characters of the string 23 bytes               ]
     f  = 1 bit for a flag, set to 0 for a small string
     s  = 7 bits for the size of the string (i.e. number of significant
          characters in the array)

Big string

  [f][c      ][size     ][data      ][first     ][pad    ]
     f = 1 bit for a flag, set to 1 for a big string
     c = 31 bits for half the capacity. This is the size of the buffer
         pointed to by data, and which contains the actual characters of
         the string.
     size = 32 bits for the size of the string, i.e. the number of
         significant characters in the buffer.
     data = a pointer (32 or 64 bits depending on architecture)
     first = 32 bits, see the handling of substrings below
     pad = 32 bits on a 64 bits system, 0 otherwise.
         This is because of alignment issues.

So in the same amount of memory (24 bytes), we can either store a small string of 23 characters or less with no memory allocations, or a big string that requires allocation. In a typical application, most strings are smaller than 23 bytes, so we are saving very significant time here.

This representation has to work on both 32 bits systems and 64 bits systems, so we have careful representation clauses to take this into account. It also needs to work on both big-endian and little-endian systems. Thanks to Ada’s representation clauses, this one in fact relatively easy to achieve (well, okay, after trying a few different approaches to emulate what’s done in C++, and that did not work elegantly). In fact, emulating via bit-shift operations ended up with code that was less efficient than letting the compiler do it automatically because of our representation clauses.

5.2. Character types

Applications should be able to handle the whole set of Unicode characters. In Ada, these are represented as the Wide_Character type, rather than Character, and stored on 2 bytes rather than 1. Of course, for a lot of applications it would be wasting memory to always store 2 bytes per character, so we want to give flexibility to users here.

So the package GNATCOLL.Strings_Impl is a generic. It has several formal parameters, among which:

  • Character_Type is the type used to represent each character. Typically, it will be Character, Wide_Character, or even possibly Wide_Wide_Character. It could really be any scalar type, so for instance we could use this package to represent DNA with its 4-valued nucleobases.
  • Character_String is an array of these characters, as would be represented in Ada. It will typically be a String or a Wide_String. This type is used to make this package work with the rest of the Ada world.

Note about Unicode: we could also always use a Character, and use UTF-8 encoding internally. But this makes all operations (from taking the length to moving the next character) slower, and more fragile. We must make sure not to cut a string in the middle of a multi-byte sequence. Instead, we manipulate a string of code points (in terms of Unicode). A similar choice is made in Ada (String vs Wide_String), Python and C++.

5.3. Configuring the size of small strings

The above is what is done for most C++ implementations nowadays. The maximum 23 characters we mentioned for a small string depends in fact on several criteria, which impact the actual maximum size of a small string:

  • on 32 bits system, the size of the big string is 16 bytes, so the maximum size of a small string is 15 bytes.
  • on 64 bits system, the size of the big string is 24 bytes, so the maximum size of a small string is 23 bytes.
  • If using a Character as the character type, the above are the actual number of characters in the string. But if you are using a Wide_Character, this is double the maximum length of the string, so a small string is either 7 characters or 11 characters long.

This is often a reasonable number, and given that applications mostly use small strings, we are already saving a lot of allocations. However, in some cases we know that the typical length of strings in a particular context is different. For instance, GNATCOLL.Traces builds messages to output in the log file. Such messages will typically be at most 100 characters, although they can of course be much larger sometimes.

We have added one more formal parameter to GNATCOLL.Strings_Impl to control the maximum size of small strings. If for instance we decide that a “small” string is anywhere from 1 to 100 characters long (i.e. we do not want to allocate memory for those strings), it can be done via this parameter.

Of course, in such cases the size of the string itself becomes much larger. In this example it would be 101 bytes long, rather than the 24 bytes. Although we are saving on memory allocations, we are also spending more time copying data when the string is passed around, so you’ll need to measure the performance here.

The maximum size for the small string is 127 bytes however, because this size and the 1-bit flag need to fit in 1 bytes in the representation clauses we showed above. We tried to make this more configurable, but this makes things significantly more complex between little-endian and big-endian systems, and having large “small” strings would not make much sense in terms of performance anyway.

Typical C++ implementations do not make this small size configurable.

5.4. Task safety

Just like unbounded strings, the strings in this package are not thread safe. This means that you cannot access the same string (read or write) from two different threads without somehow protecting the access via a protected type, locks,…

In practice, sharing strings would rarely be done, so if the package itself was doing its own locking we would end up with very bad performance in all cases, for a few cases where it might prove useful.

As we’ll discuss below, it is possible to use two different strings that actually share the same internal buffer, from two different threads. Since this is an implementation detail, this package takes care of guaranteeing the integrity of the shared data in such a case.

5.5. Copy on write

There is one more formal parameter, to configure whether this package should use copy-on-write or not. When copy on write is enabled, you can have multiple strings that internally share the same buffer of characters. This means that assigning a string to another one becomes a reasonably fast operation (copy a pointer and increment a refcount). Whenever the string is modified, a copy of the buffer is done so that other copies of the same string are not impacted.

But in fact, there is one drawback with this scheme: we need reference counting to know when we can free the shared data, or when we need to make a copy of it. This reference counting must be thread safe, since users might be using two different strings from two different threads, but they share data internally.

Thus the reference counting is done via atomic operations, which have some impact on performance. Since multiple threads try to access the same memory addresses, this is also a source of contention in multi-threaded applications.

For this reason, the current C++ standard prevents the use of copy-on-write for strings.

In our case, we chose to make this configurable in the generic, so that users can decide whether to pay the cost of the atomic operations, but save on the number of memory allocations and copy of the characters. Sometimes it is better to share the data, sometimes to systematically copy it. Again, actual measurements of the performance are needed for your specific application.

5.6. Growth strategy

When the current size of the string becomes bigger than the available allocated memory (for instance because you are appending characters), this package needs to reallocate memory. There are plenty of strategies here, from allocating only the exact amount of memory needed (which saves on memory usage, but is very bad in terms of performance), to doubling the current size of the string until we have enough space, as currently done in the GNAT unbounded strings implementation.

The latter approach would therefore allocate space for two characters, then for 4, then 8 and so on.

This package has a slightly different strategy. Remember that we only start allocating memory past the size of small strings, so we will for instance first allocate 24 bytes. When more memory is needed, we multiply this size by 1.5, which some researchers have found to be a good comprise between waste of memory and number of allocations. For very large strings, we always allocate multiples of the memory page size (4096 bytes), since this is what the system will make available anyway. So we will basically allocate the following: 24, 36, 54, 82, 122,…

An additional constraint is that we only ever allocate even number of bytes. This is called the capacity of the string. In the layout of the big string, as shown above, we store half that capacity, which saves one bit that we use for the flag.

5.7. Substrings

One other optimization performed by this package (which is not done for unbounded strings or various C++ implementations) is to optimize substrings when also using copy-on-write.

We simply store the index of the first character of the string within the shared buffer, instead of always starting at the first.

From the user’s point of view, this is an implementation detail. Strings are always indexed from 1, and internally we convert to an actual position in the buffer. This means that if we need to reallocate the buffer, for instance when the string is modified, we transparently change the index of the first character, but the indexes the user was using are still valid.

This results in very significant savings, as shown below in the timings for Trim for instance. Also, we can do an operation like splitting a string very efficiently.

For instance, the following code doesn’t allocate any memory, beside setting the initial value of the string. It parses a file containing some “key=value” lines, with optional spaces, and possibly empty lines:

declare
   S, Key, Value : XString;
   L             : XString_Array (1 .. 2);
   Last          : Natural;
begin
   S.Set (".......");

   --  Get each line
   for Line in S.Split (ASCII.LF) loop

      --  Split into at most two substrings
      Line.Split ('=', Into => L, Last => Last);

      if Last = 2 then
         Key := L (1);
         Key.Trim;    --  Removing leading and trailing spaces

         Value := L (2);
         Value.Trim;

      end if;
   end loop;
end;

5.8. API

This package provides a very extensive set of API that apply to XString, please check the spec in gnatcoll-strings_impl.ads for a fully documented list.