TopCoder problem "UTF8Decode" used in SRM 359 (Division II Level Three)



Problem Statement

    

Unicode is a standardized set of characters which includes the alphabets from many languages as well as mathematical and other symbols. UTF-8 is a scheme to encode Unicode characters into a byte stream. It has a number of nice features, most importantly backwards compatibility with ASCII, but the encoding is a little complicated. Each Unicode character is represented by between 1 and 4 bytes, inclusive.

For this problem, you will be given a sequence of byte values representing a UTF-8 encoded string, and must decode it to obtain the Unicode character values. However, the string is transmitted by the very unreliable RFC 1149 protocol, so it may contain some extra bytes. RFC 1149 is also very slow, so you must decide whether to accept or reject each byte as soon as it is received, without knowing what the following bytes are, or even if there are any. You should accept a byte if and only if the accepted bytes would then form the start of some UTF-8 encoded string. If you reach the end of the string and have accepted an incomplete character (e.g., only 2 bytes of a 3-byte character), it should also be discarded.

UTF-8 is used to represent Unicode characters between 0 and 1114111 (10FFFF in hex), inclusive. The following pseudo-code will convert a Unicode character into UTF-8 encoding (the output function appends one byte to the encoded string):

getbits(int value, int firstbit, int numbits):
  return (value SHIFTRIGHT firstbit) MOD (1 SHIFTLEFT numbits)

convert(int character):
  if character <= 0x7F:
    output(character)
  else if character <= 0x7FF:
    output(0xC0 + getbits(character, 6, 5))
    output(0x80 + getbits(character, 0, 6))
  else if character <= 0xFFFF:
    output(0xE0 + getbits(character, 12, 4))
    output(0x80 + getbits(character, 6, 6))
    output(0x80 + getbits(character, 0, 6))
  else if character <= 0x10FFFF:
    output(0xF0 + getbits(character, 18, 3))
    output(0x80 + getbits(character, 12, 6))
    output(0x80 + getbits(character, 6, 6))
    output(0x80 + getbits(character, 0, 6))
  else:
    abort - character is out of range

Write a class that accepts a sequence of bytes, in the order that they are received, and returns the sequence of Unicode character values decoded according to the rules above.

 

Definition

    
Class:UTF8Decode
Method:decodeString
Parameters:int[]
Returns:int[]
Method signature:int[] decodeString(int[] bytes)
(be sure your method is public)
    
 

Notes

-The UTF-8 representation of a character is never a substring of the UTF-8 representation of any other character.
-Different implementations of UTF-8 may allow characters to be encoded in more than one way or may allow characters beyond the range specified here. For this problem, you should consider a byte sequence to be legal if and only if it could have been produced by the pseudo-code above.
 

Constraints

-bytes will contain between 0 and 50 elements, inclusive.
-Each element of bytes wil be between 0 and 255, inclusive.
 

Examples

0)
    
{10,
 207, 168,
 226, 156, 144,
 240, 152, 154, 160}
Returns: {10, 1000, 10000, 100000 }
One case for each possible encoding length.
1)
    
{10, 255,
 207, 17, 168, 193,
 226, 156, 144,
 240, 143, 152, 154, 160,
 240, 152, 154}
Returns: {10, 1000, 10000, 100000 }
The same as the previous case, but with some invalid bytes added.
2)
    
{191, 192, 193, 245, 255, 128, 129, 130, 189, 190, 191, 1}
Returns: {1 }
None of these bytes except the last can start a UTF-8 encoded character.
3)
    
{76, 95, 228, 250, 1, 100, 167, 59, 165, 27, 87, 32, 49, 22, 100, 236, 1,
 111, 209, 100, 155, 35, 156, 47, 135, 56, 114, 131, 32, 73, 24, 104, 92,
 221, 107, 12, 222, 60, 24, 113, 130, 217, 180, 144, 106, 136, 137, 234, 24, 173}
Returns: 
{76, 95, 18917, 27, 87, 32, 49, 22, 100, 50908, 47, 56, 114, 32, 73, 24, 104, 92, 1858, 1652, 106 }
4)
    
{240, 1}
Returns: { }
240 is valid as the first byte of a four-byte character, so it must be accepted. Since we don't have enough bytes to complete this character, the output is empty.

Problem url:

http://www.topcoder.com/stat?c=problem_statement&pm=7862

Problem stats url:

http://www.topcoder.com/tc?module=ProblemDetail&rd=10770&pm=7862

Writer:

bmerry

Testers:

PabloGilberto , brett1479 , Olexiy , ivan_metelsky

Problem categories:

Encryption/Compression, Simple Search, Iteration