TopCoder problem "Lexer" used in TCI '02 Round 1 A (Division I Level One)

Problem Statement

    A lexer (lexical analyzer) is used in compilers to break input text into pieces called tokens. In this problem you will be given a list of valid tokens. For example:

tokens = {"ab","aba","A"}

Given a list of valid tokens and an input string, your lexer will work as follows:

1) a) If the input doesn't begin with one of the valid tokens, remove the first character from the string.

b) If the input does begin with a valid token, determine the longest valid token the input starts with and remove it from the string. The removed portion is considered CONSUMED.

2) Repeat step 1 until there are no characters left in the input.

The lexer is CASE-SENSITIVE, so a token must exactly match the beginning of the string.

Given a list of valid tokens and an input string, your method will return a list containing the CONSUMED valid tokens in the order they were CONSUMED. For example:

tokens = {"ab","aba","A"}

input = "ababbbaAab"

"ab" and "aba" are found at the beginning of the input but "aba" is longest so it is consumed. Now:

consumed = {"aba"}

input = "bbbaAab"

There are no tokens that start with 'b', so the lexer removes the first three characters from the string, one at a time.

consumed = {"aba"}

input = "aAab"

The 'a' doesn't match the token "A" due to CASE-SENSITIVITY. The lexer removes the 'a' from the beginning of the string.

consumed = {"aba"}

input = "Aab"

The lexer consumes the "A" token.

consumed = {"aba","A"}

input = "ab"

Finally the lexer consumes the "ab" token and completes the process.

consumed = {"aba","A","ab"}

input = ""

The returned list is {"aba","A","ab"}.

Create a class Lexer that contains the method tokenize, which takes a String[] tokens and a String input, and returns a String[] in the form specified above.
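The process described above is a greedy longest-prefix match. A minimal sketch of such a solution (one way to implement it, not an official reference solution) could look like this:

```java
import java.util.ArrayList;
import java.util.List;

public class Lexer {
    public String[] tokenize(String[] tokens, String input) {
        List<String> consumed = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            // Find the longest token that matches at the current position
            // (case-sensitive comparison, as the statement requires).
            String best = null;
            for (String t : tokens) {
                if (input.startsWith(t, pos)
                        && (best == null || t.length() > best.length())) {
                    best = t;
                }
            }
            if (best == null) {
                pos++;                   // rule 1a: no match, drop one character
            } else {
                consumed.add(best);      // rule 1b: consume the longest token
                pos += best.length();
            }
        }
        return consumed.toArray(new String[0]);
    }
}
```

On the worked example, `tokenize(new String[]{"ab","aba","A"}, "ababbbaAab")` yields `{"aba","A","ab"}`, matching the trace above.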


Parameters:String[], String
Method signature:String[] tokenize(String[] tokens, String input)
(be sure your method is public)


-tokens will contain between 0 and 50 elements inclusive
-Each element of tokens will have length between 1 and 50 inclusive
-Each element of tokens will only consist of letters (A-Z,a-z)
-input will have length between 0 and 50 inclusive
-input will only consist of letters (A-Z,a-z)


Returns: { "aba",
  "ab" }
Same as above
Returns: { "aaaaa",
  "aaaaa" }
Make sure to use the longest valid starting token
Returns: { "wo",
  "wo" }
Returns: { "int",
  "boolean" }
Returns: { }
No valid tokens, so nothing is CONSUMED
Returns: { "dEfG",
  "GhIj" }

alexcchan , lbackstrom

Problem categories:

String Parsing