TopCoder problem "LanguageRecognition" used in SRM 250 (Division I Level One , Division II Level Two)



Problem Statement

    

For computers it can be hard to determine in which language a given text is written. A simple way to try to determine the language is the following: for the given text and for some sample texts, for which we know the languages, we determine the letter frequencies and compare these.

The frequency of a letter is the total number of occurrences of that letter divided by the total number of letters in the text. To determine this, we ignore case and non-letter characters.

Once the letter frequencies of the text and of a language are known, we can calculate the difference between the two. This difference we define by the sum of the squared differences of the frequencies:

The lesser this value, the closer text resembles that language. Compare text with each element of languages and return the (0-based) index of the language that has the smallest difference with text. In case of a tie, return the smallest index.

 

Definition

    
Class:LanguageRecognition
Method:whichLanguage
Parameters:String[], String
Returns:int
Method signature:int whichLanguage(String[] languages, String text)
(be sure your method is public)
    
 

Constraints

-languages contains between 1 and 50 elements, inclusive.
-Each element of languages has length between 1 and 50, inclusive.
-text has length between 1 and 50, inclusive.
-Each element of languages and text consists only of characters with ASCII value between 32 and 127, inclusive.
-Each element of languages and text contains at least one letter ('A'-'Z' and 'a'-'z').
 

Examples

0)
    
{"This is an English sentence.",
 "Dieser ist ein Deutscher Satz.",
 "C'est une phrase Francaise.",
 "Dit is een Nederlandse zin."
}
"In welke taal is deze zin geschreven?"
Returns: 3
The differences are 0.0385, 0.0377, 0.0430 and 0.0276, so the sentence is written in language 3, Dutch. Note that Dutch is somewhat similar to German, somewhat less similar to English and not similar to French.
1)
    
{"aaaaa","bbbb","ccc","dd","e"}
"xxx"
Returns: 0
In case of a tie, return the language with the smallest index.
2)
    
{"AABB","AaBb","A? B!","ab!@#$%"}
"ab"
Returns: 0
Ignore case and the non-letter characters.

Problem url:

http://www.topcoder.com/stat?c=problem_statement&pm=4570

Problem stats url:

http://www.topcoder.com/tc?module=ProblemDetail&rd=7225&pm=4570

Writer:

Jan_Kuipers

Testers:

PabloGilberto , lbackstrom , brett1479

Problem categories:

Brute Force