After quitting my job in February I now started thinking about all the fun I'll
have interviewing for a new one. It's not all bad of course, but I'm not
particularly looking forward to writing yet another anagram
detector or string reversal function. But this did make me wonder: how does
string reversal work with Unicode? For example, Java's String
is made up of chars
.
The char
data type is 2 bytes. But not all visible characters fit in 2 bytes:
- Anything outside of the Basic Multilingual Plane (BMP)
- Combining characters
char
.
What does all this mean? Reversing a String is not like reversing an array.
Take for example the word
Áfficion𝐚ḍ̇o. This is not exactly the
original spelling; but
let's see how it looks in terms of Unicode code units
(chars
) grouped by Unicode code point (individual "building blocks"
of the word):
A | ́ | ffi | c | i | o | n | 𝐚 | ḋ | ̣ | o |
0x0041 |
0x0301 |
0xFB03 |
0x0063 |
0x0069 |
0x006F |
0x006E |
0xD835 0xDC1A |
0x1E0B |
0x0323 |
0x006F |
Initially you might have thought that
"Áfficion𝐚ḍ̇o".length()
would have returned 11
. But it doesn't; it actually returns
12
. This is because String#length()
returns the number
of chars
, not the number of visible characters. Things to note:
- Two visible characters (Á and ḍ̇) have combining characters, thus needing 2
chars
each - ffi is a ligature; it has 3 visible characters, yet only needs 1
char
- 𝐚 is outside of the BMP, so needs 2
chars
2 + 1 + 1 + 1 + 1 + 1 + 2 + 2 + 1 = 12
.
So how about reversing our example? Here are the options I've identified:
-
Reverse the
char
array. This is what most people do when they're asked to write a string reversal function during an interview. But for all the reasons I mentioned above this approach has severe limitations. Though it has to be said that if you're not reversing a Unicode string, this is probably the most efficient way to do it. But that's not the goal at the moment. -
Use
StringBuilder#reverse
which reverses thechar
array with support for code points outside of the BMP (identifying surrogate pairs, so occurances of more than onechar
to represent a unicode code point). This does not reverse combining characters correctly. -
Normalize, then use
StringBuilder#reverse
. Use theNormalizer
on your String to get rid of as many combining characters as possible before reversing. This helps but does not work for all character combinations. Also this has the effect of modifying your data (e.g. you can't reverse the reverse to get back the original). -
Use
BreakIterator
. This is the best way I have found to reverse a Unicode String and also the only way to correctly reverse our example. Note that there is still a limitation with ligatures; those letters will obviously not get reversed. If this is needed you'll probably want to use the Normalizer after all, pehaps on a subset of your data.
Below you can find an example of using BreakIterator
. This
correctly reverses our example (except the ligature) and is a round-trip
function (meaning you can reverse the reverse and get the original input back).
public static String reverseUnicode(String source, Locale locale) { BreakIterator boundary = BreakIterator.getCharacterInstance(locale); boundary.setText(source); char[] reversedChars = new char[source.length()]; int end = boundary.last(); for (int start = boundary.previous(), index = 0; start != BreakIterator.DONE; end = start, start = boundary.previous()) { for(int i = start; i < end; i++) { reversedChars[index] = source.charAt(i); index++; } } return String.valueOf(reversedChars); }
I can honestly not remember the last time I needed to reverse a
String
, but there you go.