TopCoder problem "SpamDetector" used in SRM 205 (Division I Level One, Division II Level Two)

Problem Statement


You are writing part of a spam detection system. Your job is to analyze the subject lines of e-mail messages and return a count of known spam signalling keywords in the subject lines. Your task is made more difficult by the spammers who try to hide the keywords in several ways. Here we will consider just one obfuscation technique: duplicating characters. Duplicating characters means taking an existing character in a word and inserting more copies of that character into the same place in the word. This process can then be repeated on a different character in the word. The spam signalling keyword "credit" might be modified to "creddiT", "CredittT" or "ccrreeeddiitt", etc., but not "credict".

For the purposes of this problem we will consider subject lines which contain only letters and spaces. The "words" in the subject line are delimited by spaces. A word in the subject line is considered a "match" if the entire word is the same as at least one entire keyword, after possibly removing some duplicated characters from the subject word. A keyword that matches only part of a subject word or a subject word that matches only part of a keyword does not count. Note that if a keyword contains a double letter, the subject word must also contain (at least) a double letter in the same position to match ("double letter" means two consecutive letters in the word that are the same). For this application, all matches (and the use of the term "same") are case insensitive.
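One way to read the matching rule above is in terms of runs of identical letters: duplication can only lengthen a run, so a subject word matches a keyword when both have the same sequence of letters run by run, and each run in the subject word is at least as long as the corresponding run in the keyword. The following is a minimal sketch of that reading, not the reference solution; the class name DuplicateMatcher and the method name matches are hypothetical names chosen for illustration.

```java
import java.util.Locale;

class DuplicateMatcher {
    /**
     * Returns true if the whole word equals the whole keyword after possibly
     * removing some duplicated characters from the word (case insensitive).
     */
    public static boolean matches(String word, String keyword) {
        String w = word.toLowerCase(Locale.ROOT);
        String k = keyword.toLowerCase(Locale.ROOT);
        int i = 0;
        int j = 0;
        while (i < w.length() && j < k.length()) {
            char c = w.charAt(i);
            if (c != k.charAt(j)) {
                return false; // the runs start with different letters
            }
            // Measure the run of c in the subject word.
            int wordRun = 0;
            while (i < w.length() && w.charAt(i) == c) {
                i++;
                wordRun++;
            }
            // Measure the run of c in the keyword.
            int keywordRun = 0;
            while (j < k.length() && k.charAt(j) == c) {
                j++;
                keywordRun++;
            }
            // Duplication only adds copies, so the word's run cannot be shorter.
            if (wordRun < keywordRun) {
                return false;
            }
        }
        // Both strings must be fully consumed: partial matches do not count.
        return i == w.length() && j == k.length();
    }
}
```

Under this sketch, "creddiT" and "ccrreeeddiitt" both match "credit", "credict" does not, and a keyword containing a double letter such as "lotto" is matched by "Lottto" but not by "loto".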

Given a subject line and a list of keywords, return the count of words in the subject line which "match" words in the keyword list. If multiple words in the subject line match the same keyword, they are each counted, but a word in the subject line that matches multiple keywords is only counted once.
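The counting rule above can be sketched with regular expressions: each keyword is turned into a pattern in which every run of n identical letters becomes "that letter, n or more times", and each non-empty word of the subject line is counted once if any pattern matches it entirely. This is only one possible approach, shown under the assumption that keywords contain letters only (so no regex escaping is needed); the helper name toPattern is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

class SpamDetector {
    /** Builds a pattern such as c{1,}r{1,}e{1,}d{1,}i{1,}t{1,} for "credit". */
    static Pattern toPattern(String keyword) {
        String k = keyword.toLowerCase(Locale.ROOT);
        StringBuilder regex = new StringBuilder();
        int i = 0;
        while (i < k.length()) {
            char c = k.charAt(i);
            int run = 0;
            while (i < k.length() && k.charAt(i) == c) {
                i++;
                run++;
            }
            // A run of length n in the keyword needs at least n copies in the word.
            regex.append(c).append('{').append(run).append(",}");
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    public int countKeywords(String subjectLine, String[] keywords) {
        List<Pattern> patterns = new ArrayList<>();
        for (String keyword : keywords) {
            patterns.add(toPattern(keyword));
        }
        int count = 0;
        for (String word : subjectLine.split(" ")) {
            if (word.isEmpty()) {
                continue; // consecutive spaces produce empty tokens
            }
            for (Pattern p : patterns) {
                if (p.matcher(word).matches()) { // the entire word must match
                    count++;
                    break; // a word matching several keywords is counted once
                }
            }
        }
        return count;
    }
}
```

As a usage illustration with an assumed keyword list (chosen here for the example, not taken from the original test data), new SpamDetector().countKeywords("LoooW INTEREST RATES available dont BE slow", new String[] {"low", "interest", "rates", "available"}) returns 4: "slow" is not counted because only entire words may match.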



Definition

Parameters: String, String[]
Returns: int
Method signature: int countKeywords(String subjectLine, String[] keywords)
(be sure your method is public)


Constraints

-subjectLine will contain between 0 and 50 characters, inclusive.
-subjectLine will include only letter ('a' to 'z' and 'A' to 'Z') and space (' ') characters.
-keywords will have between 0 and 50 elements, inclusive.
-each element of keywords will contain between 1 and 50 characters, inclusive.
-each element of keywords will consist of only letters ('a' to 'z' and 'A' to 'Z').
-The same letter (ignoring case) never appears more than twice consecutively in any element of keywords (i.e., "aabbAAbb" is ok, but "aaAbb" is not allowed).


Examples

"LoooW INTEREST RATES available dont BE slow"
Returns: 4
"INTEREST", "RATES", "available", and "LoooW" match. Note that "slow" does not match, even though it contains the substring "low", which is a keyword.
"Dear Richard Get Rich Quick            no risk"
Returns: 2
"Richard" does not count as a match.
"in debbtt againn and aAgain and AGAaiIN"
Returns: 3
"PlAyy ThEE Lottto     get Loottoo feever"
Returns: 3
"                                   "
Returns: 0

PabloGilberto, lbackstrom, brett1479

Problem categories:

String Manipulation