TopCoder problem "Lexer" used in TCI '02 Round 1 A (Division I Level One)

Problem Statement

    A lexer (lexical analyzer) is used in compilers to break input text into pieces called tokens. In this problem you will be given a list of valid tokens. For example:

tokens = {"ab","aba","A"}

Given a list of valid tokens and an input string, your lexer will work as follows:

1) a) If the input doesn't begin with one of the valid tokens, remove the first character from the string.

b) If the input does begin with a valid token, determine the longest valid token the input starts with and remove it from the string. The removed portion is considered CONSUMED.

2) Repeat step 1 until there are no characters left in the input.

The lexer is CASE-SENSITIVE, so a token must exactly match the beginning of the string.

Given a list of valid tokens and an input string, your method will return a list containing the CONSUMED valid tokens in the order they were CONSUMED. For example:

tokens = {"ab","aba","A"}

input = "ababbbaAab"

"ab" and "aba" are found at the beginning of the input but "aba" is longest so it is consumed. Now:

consumed = {"aba"}

input = "bbbaAab"

There are no tokens that start with 'b', so the lexer removes the first three characters from the string, one at a time.

consumed = {"aba"}

input = "aAab"

The 'a' doesn't match the token "A" due to CASE-SENSITIVITY. The lexer removes the 'a' from the beginning of the string.

consumed = {"aba"}

input = "Aab"

The lexer consumes the "A" token.

consumed = {"aba","A"}

input = "ab"

Finally the lexer consumes the "ab" token and completes the process.

consumed = {"aba","A","ab"}

input = ""

The returned list is {"aba","A","ab"}.

Create a class Lexer that contains the method tokenize, which takes a String[] tokens and a String input, and returns a String[] in the form specified above.
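The process described above is a greedy longest-prefix match. A minimal sketch of such a solution (one way to implement it, not an official reference solution) could look like this:

```java
import java.util.ArrayList;
import java.util.List;

public class Lexer {
    public String[] tokenize(String[] tokens, String input) {
        List<String> consumed = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            // Find the longest token that matches at the current position
            // (case-sensitive comparison, as the statement requires).
            String best = null;
            for (String t : tokens) {
                if (input.startsWith(t, pos)
                        && (best == null || t.length() > best.length())) {
                    best = t;
                }
            }
            if (best == null) {
                pos++;                   // rule 1a: no match, drop one character
            } else {
                consumed.add(best);      // rule 1b: consume the longest token
                pos += best.length();
            }
        }
        return consumed.toArray(new String[0]);
    }
}
```

On the worked example, `tokenize(new String[]{"ab","aba","A"}, "ababbbaAab")` yields `{"aba","A","ab"}`, matching the trace above.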


Parameters:String[], String
Method signature:String[] tokenize(String[] tokens, String input)
(be sure your method is public)


-tokens will contain between 0 and 50 elements inclusive
-Each element of tokens will have length between 1 and 50 inclusive
-Each element of tokens will only consist of letters (A-Z,a-z)
-input will have length between 0 and 50 inclusive
-input will only consist of letters (A-Z,a-z)


Returns: { "aba",
  "ab" }
Same as above
Returns: { "aaaaa",
  "aaaaa" }
Make sure to use the longest valid starting token
Returns: { "wo",
  "wo" }
Returns: { "int",
  "boolean" }
Returns: { }
No valid tokens, so nothing is CONSUMED
Returns: { "dEfG",
  "GhIj" }

alexcchan , lbackstrom

Problem categories:

String Parsing