Pay Less for Strings or How Strings Work

by Genady Beryozkin

Abstract: Many applications need to process lots of textual data, and extract some specific piece of textual information. Usually such applications use either simple textual formats or complex XML documents.
For either choice, Java offers many ways to deal with strings by providing the String and StringTokenizer classes. While these classes provide a handy interface, inadvertent use can cause serious memory leaks. This article explains how to parse text without running out of memory.

Consider yourself writing a simple phone book application. Each phone book record contains information such as phone and fax numbers, home address and some other data. Suppose you decided to use textual format for the database, where each record is stored on a separate line, as a colon delimited string.
Imagine a user who wants to sort the phone book by person's fax number, and the application loads fax numbers from all the records. Surprisingly the phone book application runs out of memory, as if it have loaded the entire database (which it was not designed for) and crashes. The user is justly unhappy and quickly switches to another phone book application.

While this scenario is simplistic and exaggerated, I have ran into similar situations in real applications. Such memory leaks are very hard to identify, since the code of the application is perfectly correct. The profiler will tell you that you use too much memory, and even tell you that most memory is used by strings, which is what you expect. But why do the strings take so much memory ? This is the question that this article deals with, and at the end, a solution is suggested.

A downloadable demo program demonstrates all approaches. At the end of this document you can find the instructions on how to use the demo.

However, let's begin at the beginning.

The naive approach

First let's check a naive approach that would have caused the problem. For simplicity, assume we retrieve the first field:

    private static String extractFirstField(String str) {
        int colonPos = str.indexOf(':');
        return str.substring(0, colonPos);
    }

    public String[] extractFirstFields(String[] lines) {
           Vector firstFields = new Vector(lines.length);
           for (int i=0 ; i < lines.length ; i++) {
                firstFields.add(extractFirstField(lines[i]);
           }
           return (String[])firstFields.toArray(new String[0]);
    }
The complete code listing appears in the references section.

Suppose our phone book has 10,000 entries and its total size is 10MB. The fax number is a 10 digits string, so you'd expect to have a memory footprint of about 100KB. What you get is a memory footprint of about 20MB.

"20MB ? What did I do wrong ?", you ask yourself. Indeed, 20MB are twice the size of the database file, and at first sight this may look quite odd. However, we recall that strings are built of characters, and characters in Java occupy two bytes, to handle non-Latin character sets. If our data is plain ASCII text, only one byte is written for each character in a string. When the file is read into a string again, Java classes convert the single byte into the appropriate Java character, which accounts for the difference between the file size and the memory required to hold the strings in the memory.

In the demo application, select the "Naive (indexOf+substring)" method, and observe as the memory goes low, until we get an java.lang.OutOfMemoryError exception after about 75% of the data has been loaded.

Now it appears that the all file contents were loaded, instead of just the first field. To find how it could happen, let's look inside the String class:

public final class String implements ......
{
    private char value[];

    private int offset;

    private int count;

    ......
    ......
    // more fields and methods follow
}

The actual data is stored in an array named value. Since strings are immutable, this array can be shared across String objects. This is exactly what the substring() method does - it shares its characters array with the child sub-string. The method calls a private constructor, and its arguments are the now shared characters buffer and first and the last indices of the substring. This process can repeat itself when we ask for a substring of the child string. In such case, the parent, its child and its grandchild strings, all share the same characters buffer.

As we see, substrings retain the reference to the original characters array. While it certainly saves memory when the original string is referenced by the application, it wastes memory if the original string is no longer needed. In such case, the String object itself can be collected, but its (potentially large) characters array cannot be discarded, since a substring still references it.

Using StringTokenizers


Another approach would be to use the StringTokenizer class:

    private static String extractField(String str, int fieldNum) {
        StringTokenizer t = new StringTokenizer(str, ":");
        for (int i = 0; i < fieldNum - 1; i++) {
            t.nextToken();
        }
        return t.nextToken();
    }


However, the results are still the same. Looking into the source code of the nextToken() method reveals that it invokes the substring() method on the string passed in the tokenizer's constructor. Therefore, the extracted field still references the characters array of the original string.

Using regular expressions


The most powerful (and on the other hand the simplest) way to retrieve fields from a delimited string is to use the split() method, introduced with JDK 1.4. Regular expressions are very popular in the field of text processing, and are one of the exiting features of the PERL language. The split() method is invoked on a String object and receives a regular expression that represents the delimiter string. It returns an array of Strings that are the fields of the original string:

    
    private static String extractFieldRegex(String str, int fieldNum) {
        return str.split(":")[fieldNum];
    }

Neat, isn't it? Yet, it does not solve our problem. The observed memory footprint is still high, which means that the split() method returns substrings that share the characters array too.

This method, as presented here, has one drawback: It operates on the entire string, processing information you didn't ask for. If a phone book record has 15 fields, and you only need the second field, the split(String regex) method may not be the most efficient way to get it. The solution lies in the fact that actually, there are two split() methods, and the second method was silently ignored, until now. The second method receives an additional argument that tells the regular expressions engine, how many many sub-strings the string will be split to. Specifying a number which is greater by two than the field number we want to retrieve, is what we need. Thus we reduce the processing time, as well as number of objects created (recall that each field is represented by a different String object)

The solution

It seems there is no way around. But just before we rush to file a bug report, we should take a look at a not so popular constructor of the String class. While String's constructors are not very popular in our code (we can use string literals or toString() methods), this one looks almost redundant. It receives one argument, which is another String. In C++ we would call it a copy constructor. In Java, where strings are immutable there seems to be no need in a copy constructor for strings. After all, what can I do with the new String that I couldn't have done before? Yet, this constructor will be the one to help us with our memory footprint problem.

Carefully reading the documentation and the source code reveals what this constructor really does. The constructor creates a new String with an unshared characters array which size is exactly the string length! Then it copies the contents of the old string into this new array. Such operation is called "deep copy", as opposed to "shallow copy" that would use the shared array.

If we apply this "copy constructor" on a substring and only use the result, nothing will prevent the parent String object from being collected along with its characters array. The following code is the solution to our memory shortage problem:

    private static String extractFieldRegex(String str, int fieldNum) {
        return str.split(":", fieldNum + 2)[fieldNum];
    }

    public String[] extractFields(String[] lines, int fieldNum) {
           Vector firstFields = new Vector(lines.length);
           for (int i=0 ; i < lines.length ; i++) {
                firstFields.add(new String(extractFieldRegex(lines[i], fieldNum));
           }
           return (String[])firstFields.toArray(new String[]{});
    }

Indeed, the memory consumption is down to about 10% of the amount of memory initially required.

Copying strings is not free

Copying strings is not free. There are three factors that should be considered when using String's copy constructor. First, there is the creation of another object on the heap, which the garbage collector must deal with.

Second, the characters array must be copied to the new array, an operation which complexity is linear in the string's length. The characters are copied using System.arraycopy method, which is implemented as a native method. Under some conditions, calling this method may pose a noticeable performance penalty compared to a simple for loop. However, in most cases this penalty is negligible and in most cases it is faster than copying the characters in a loop.

Last, the String object caches the hash code of the string, to speed up hash table lookups. When a string is copied, the cached hashcode is not copied, and will be recalculated next time the hashCode() method is called. However if the string has been just created, it does not have a hashcode yet, so it is not an issue in our examples.

The conclusion is that we should copy strings only when necessary, and follow the general rules of optimization: Optimize only when you have performance / memory consumption problems. Check the gain in speed and memory after each optimization step. Profile your application before optimizing it.

Internal Strings pool

First, a disclaimer: Everything written in this section is true for Sun's JDK 1.4.0. Other versions of Sun's JDK as well as other providers may implement this method in a totally different way. I did not have the resources to perform a comprehensive survey of all available JDKs.

Another method which is worth mentioning, is the intern() method. This method returns a copy of the string, drawn from a pool of unique strings. That is, two strings that were returned by the intern() method are the same object, and are equal when compared by the "==" operator. It is not guaranteed that strings returned by this method will use only as much memory as necessary, but it is true for Sun's JDKs. Another problem with this method, is that it is native. It means that we need to cross the JNI barrier, which imposes a performance penalty. Besides, the implementation of the intern() method (re)-calculates the string's hash code and performs a hash table lookups. While the complexity of calculating the hash code is linear in the string length, and the complexity of hash table lookup is almost constant, both are not free. In each case, one should carefully examine whether this method can provide an overall better performance or not.

Probably the most tricky thing about the intern() method, is that it hides the memory used by the internalized strings. If you modify the first, and the most inefficient version of the phone book program to internalize all strings, you will be surprised to discover that it no longer aborts with an OutOfMemoryError. If you turn on the verbosity option for the garbage collector (add -verbose:gc flag to the command line), you will see that the program consumes as little as few kilobytes of memory. It may be a nice trick to fool the boss, but the truth is that the internalized strings simply hide in the native heap which is not directly accessible by the garbage collector. The internalized strings are allocated in a native code, on the native heap, and the JVM uses "handles" to reference objects allocated on the native heap. The C language does not provide a way to find the amount of memory allocated for an object. This is why the garbage collector does not report the memory usage of such objects.

Just to be sure we didn't find the panacea for memory consumption problems, we look at the process list, and see that the java executable indeed consumes almost the same amount of memory.

Be careful with XML parsers !

Now you may be saying to yourself, "I don't need this 'primitive' method, my phone book application would use XML, the hottest buzzword in document processing". XML is a standard and highly customizable way to organize our data. The discussion on the benefits and drawbacks of the XML format is not in the scope of this article, where we focus on memory related problems.

XML, as high-level as it sounds, is is still implemented using low-level routines, and to be more specific, string manipulation routines. Therefore, we must not rely on the default implementation and make sure that no memory is wasted. In this article we shall focus on some problems specific to SAX parsers.

First, let's look at a sample XML record of our new phone book application:

    <record>
      <name first="George" middle="Walker" last="Bush" />
      <contact phone="++1-202-456-1414" fax="++1-202-456-2461" 
        email="president@whitehouse.gov" >
        <address>
          The White House
          1600 Pennsylvania Avenue NW
          Washington, DC 20500 
          USA
        </address>
      </contact>
    </record>

Suppose we parse the phone book records with a SAX2 parser. To accomplish that, we define the following content handler:

public class PhonebookHandler extends DefaultHandler {

    public void startElement(String namespaceURI, String localName,
        String qName, Attributes atts) throws SAXException 
    {
        // ...
    }

    public void characters(char[] ch, int start, int length)
        throws SAXException 
    {
        // ...
    }

    public void endElement(String namespaceURI, String localName,
        String qName) throws SAXException 
    {
        // ...
    }
}

In the next sections we examine the ways to handle the parsed data.

Strings as attribute values

First, consider the attributes passed to PhonebookHandler's startElement() method. Parsers usually intern element and attribute names, but not values. Consult your parser's documentation to find out how it handles names interning. For example both in Xerces 2.0.2 and Crimson, interning is controlled by the http://xml.org/sax/features/string-interning which is always set to true. The attribute value strings may be substrings of a string holding a larger chunk of the document text. Yet, most parsers will return strings that use only as much memory as needed, and are not substrings of some larger strings. It is best to check this issue if you feel you have a memory consumption problem.

Reading characters data

Characters data presents us a more serious problem. Characters data is the text between XML elements, like the address text in the example. Characters data is reported to the parser by the characters() callback method. But what happens when the characters data is large ? Will it be reported at once or in small pieces?

SAX parsers make no promise that all characters will be delivered by a single call to the characters() method. Furthermore, usually it is not the case. This requires the programmer to collect the characters, usually in a StringBuffer, and use it later, usually in the endElement() method, after all characters have been read.

A possible implementation may look as follows:

public class PhonebookHandlers extends DefaultHandler {

    StringBuffer buf;

    public void startElement(String namespaceURI, String localName,
        String qName, Attributes atts) throws SAXException 
    {
        if (/* this is an address element */) {
            buf = new StringBuffer();
        }
    }

    public void characters(char[] ch, int start, int length)
        throws SAXException 
    {
        if (buf != null) {
            buf.append(ch, start, length);
        }
    }

    public void endElement(String namespaceURI, String localName, 
        String qName) throws SAXException 
    {
        if (/* this is an address element */) {
            address = buf.toString();
            buf = null;
        }
    }
}

Let's examine the internals of the address string. The initial capacity of a StringBuffer is 16 characters. StringBuffer doubles the size of its storage array each time it runs out of space. It means that buf can reference an array which is twice as large as its actual data. When the storage is being resized the data is being copied into a new location (to save resizing costs, some programmers initialize the StringBuffer objects with large initial storage space). In order to retrieve the string represented by the StringBuffer we use the toString() method. This method creates a new String object which references StringBuffer's internal array (using copy-on-write technique). It indeed saves memory and copying time in many cases, but it means that even when the StringBuffer is collected, we still hold a reference to an array which can be up to two times larger than its actual string length.

There are two solutions to this problem. We have seen the first solution earlier, which is to call new String(buf.toString()) (notice that we pass a String, and not a StringBuffer). The second solution is to call the buf.substring(0) method. This method will create a new String of exact capacity. It is more efficient than the first solution, since it creates less temporary objects. On the other hand, it looks a little odd at first sight.

Conclusion

It is necessary to write text processing applications with memory consumption issues in mind. Java offers a great variety of methods to fit almost any need. In this article you should have learned the design of the strings classes and potential problems of their unwary use. You have also seen a possible solution.

Write your programs with performance in mind, but remember Donald Knuth saying: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.".

Installing and running the demo

Requirements: The demo program requires a 1.4 level JDK, since some methods described here make use of APIs which first appeared in JDK 1.4. If you don't have an 1.4 level JDK, you can download one from Sun's Java site. You will also need about 20MB of free disk space to generate the database for the demo.

First of all, download the jar file and save it on your local drive. The run the demo using a command line similar to what appears below:

java -jar -Xmx20m -Xms20m phonebook.jar

The "-jar" option tells the Java executable to run the main method of the class specified in the manifest file. The "-Xmx20m -Xms20m" options specify the exact size of the heap available to the demo program. In this example we limit the heap to 20MB to demonstrate the phenomena using a relatively small database file. If we would use the defaults (max heap size of 64MB) we would need much larger database file, which sort would require a considerable amount of time.

After the application loads, you need to create the example database. The example database consists of the list of all first names that appear on the www.behindthename.com site, followed by their origin (English, German, Biblical, etc). They are followed by random 10 digits phone and fax number, and by (almost) random email addresses. The "real-world" addresses are generated by picking random street and house numbers in one of the 100 world's largest cities.

To create the database select File -> Create DB... menu option. The generation process may take up to a minute, during which the program will not be responding.

To load the database select File -> Open... menu option.

Use the Options menu to select the field retrieval methods described here. After that you can press the "Sort by fax number" button and observe the memory consumption details in the bottom section of the main window. If the program runs out of memory, an appropriate message appears in the text area.
Notice that even when the application appears to be idle, objects are being created and collected. You may need to press the "Garbage Collection" button to force a garbage collection, or the figures in the messages area will display inaccurate figures. During the sort operation the application internally calls the garbage collector, so the figures displayed during the sort operation do not include temporary objects.

It is recommended to restart the program after testing each field retrieval method, because some (like "Interning") methods can affect latter tests of other methods.

Resources

About the author

Genady Beryozkin is now a graduate student in the Computer Science faculty in The Technion in Haifa, Israel. He is interested in computational linguistics, artificial intelligence and computer networks. Genady is an active Java developer since 1999 and is involved in a large software project written entirely in Java. Genady is also developing two plugins for the Eclipse platform on SourceForge.net

Please send your comments to the author. If you are emailing from outside the Technion, please append ".technion.ac.il" (Sorry, this is my anti-spam protection).