Removing Characters From a String

Programming Interviews Exposed asks us to remove characters from an input string, looking for the characters in a second string. I am unhappy with solution they provide, particularly because of the reasons they use to invalidate what I believe to the better solution. You can check out my solution to the problem on my github repository.

In this problem we are given the following C# signature:

string removeChars( string str, string remove );

‘str’ is the original string we are removing characters from. ‘remove’ is the string that contains that characters that we need to remove from ‘str’. For example if str=”hello world” and remove=” wo” the result should be “hellwrld” with all spaces, w’s, and o’s removed.

Synchronized

My first issue, is they recommend not using the StringBuffer for the wrong reason. The reason you do not want to use StringBuffer, is because it’s synchronized in both Java and C#. Below is the Java source code from grepcode:

    public synchronized StringBuffer append(char c) {
        super.append(c);
        return this;
    }

The Microsoft documentation on StringBuffer says,

The methods are synchronized where necessary so that all the operations on any particular instance behave as if they occur in some serial order that is consistent with the order of the method calls made by each of the individual threads involved."

Since the buffer does not exceed the scope of our method, and we know that only one thread is modifying the buffer, we don’t need the expensive overhead of protecting against concurrent access. We can go with the cheaper solution of using a StringBuilder (in both Java and C#).

Extra Memory

Here the authors recommend building the string in the original character array and using two pointers to the array. One for the next non-deleted character and a second for where the next character is coming from. Ideally this is an excellent solution, if implementing correctly. I have annotated their solution with the issues.

    // str has N characters
    // remove has M characters
    string removeChars( string str, string remove ){
      //NOTE: We use N extra memory because .toCharArray() copies the
      //array in string because strings are immutable
      char[] s = str.toCharArray();
      //Note: Again, we use M extra memory because .toCharArray copies
      char[] r = remove.toCharArray();
      bool[] flags = new bool[128]; // assumes ASCII!
      int len = s.Length;
      int src, dst;
      // Set flags for characters to be removed
      for( src = 0; src < len; ++src ){
        flags[r[src]] = true;
      }
      src = 0;
      dst = 0;
      // Now loop through all the characters,
      // copying only if they aren’t flagged
      while( src < len ){
        if( !flags[ (int)s[src] ] ){
          s[dst++] = s[src];
        }
        ++src;
      }
      // NOTE: Again we use N extra memory because the string constrcutor
      // copies the characters of s, into the new string
      return new string( s, 0, dst );
    }

Because Strings are immutable in Java and C#, toCharArray() in Java and toCharArray() in C# both have to allocate a new character array to prevent the caller from modifying the internal character array. Secondly you cannot force the Java String(char[], int, int) or the C# string(char[],int,int) [String->StringNative->StringObject] methods to use your char[] array. It has to copy it because the caller has a reference to the original character array and potentially modify it.

Overall, we used 2N + M memory, when the best possible solution in Java or C# uses 2N extra memory. We need N for the buffer and N for returning the result as a String. So we might as well have kept a StringBuilder anyways as it tremendously simplifies the code and uses potentially less extra memory because potentially a majority of the characters in the original string might be removed. I think this is an excellent example of what Knuth meant in “premature optimization is the root of all evil” because in this solution we didn’t make it more efficient and we made it more complex.

If you really wanted to use no extra memory, this should have been implemented in C/C++ with the following signature:

    void removeChars( char * str, char * rem );

Keeping the same code as above, but modifying str instead, allows us to incur no extra memory by modifying the contents of char*str in place.

Summary

Given the solution provided in the book and the limitations of immutable Strings in Java and C#, we might as well have used a StringBuilder, incurring potentially less excess memory and a simpler solution. StringBuffer should not be used as it’s synchronized and we’re not using Threads. Finally, we were really wanted to achieve the author’s intended solution, we need to use C/C++ and change the method signature.