TopCoder problem "OneTimePad" used in MM 39 (Division I Level One)

Problem Statement

Introduction

A Vignere cipher is a simple encryption algorithm that works using a string of characters, the plaintext, and encodes them according the characters of a given encryption key. Each single character of plaintext is encoded by a single character of the key, using a method known as a Caesar Shift. That is, the plaintext character is shifted by the value of the key character (where A = 0 and Z = 25), wrapping between Z and A. Thus, 'D' encoded with 'C' becomes 'F', 'T' encoded with 'M' becomes 'F', and 'J' encoded with 'A' becomes 'J'. This is repeated for each character of the plaintext, resulting in an encrypted text of the same length.

This encryption technique has the benefit that it is simple enough to be performed using only pen and paper, and could theoretically be carried out by a field operative without access to electronic devices for assistance. A major weakness however, is that since all encoding is based upon a [typically fairly short length] repeated key, it is easy enough for someone to exploit patterns in the encrypted text in order to decode a message even without knowing the key. And indeed, even this can often be done with nothing more than pen and paper.

To remedy this issue, the encoding can be carried out in such a way that the length of the encryption key matches the length of the plaintext. This is known as a "one-time pad". Theoretically, this eliminates the risk of creating identifiable patterns when the key is repeated.

You have intercepted several encoded messages, which you suspect were encoded using a one-time pad as described above. However, you also suspect that your enemy was sloppy, and re-used the same key text for each separate message.

Implementation Details

You will be given several encrypted messages in String[] encoded. Your method, decrypt, should return a String containing the text of the one-time pad that has been used for encryption. You are also given access to an English word list, via the Words.getWords() method, should you wish to utilize that information as part of your solution.

Test Cases

Test cases are generated by first selecting a text to use as the source of the plaintext. Then, another source (different from the first source) is selected to use as source of the encoding key. Then, a text length between 200 and 10000 is chosen as LEN = 10000 - Sqrt(Rand * 96059601.0). (The length will tend slightly towards shorter values) Then, a number of source selections, N, is chosen between 2 and 8. Then, N source excerpts are extracted from random locations within the source text (which are not necessarily word boundaries). Finally, an excerpt of the second source to be used is extracted (location chosen at random) to be used as the key. Encoding of the N plaintext selections are all done using the same key.

Scoring

Your score for each test case will be the proportion of correct characters in your return. That is, correct_chars / total_length. If your return is the wrong length, you exceed the time limit, or otherwise generate an error, you score a 0 for that case. Your total score is the sum of your individual scores.

Tools

A simple tool for offline testing is available for download at http://www.topcoder.com/contest/problem/OneTimePad/vis.html.

Definition

Class:	OneTimePad
Method:	decrypt
Parameters:	String[]
Returns:	String
Method signature:	String decrypt(String[] encoded)
(be sure your method is public)

Notes

Several source texts are used for both the messages to be encoded as well as the encryption key.

Each of the encoded messages is a portion of the same source text, however the encryption key is selected from a different source.

The source texts are all generally well-known literary works, either originally in English, or their English translations.

The source texts are available here: http://www.topcoder.com/contest/problem/OneTimePad/docs.zip. They are formed by removing all non-letter characters from the originals.

The unprocessed source texts are available here: http://www.topcoder.com/contest/problem/OneTimePad/rawdocs.zip. They have not had non-letter characters, spaces, etc, removed, and have not been converted to upper-case.

The word list can be found here: http://www.topcoder.com/contest/problem/OneTimePad/TWL06.txt.

The time limit is 20 seconds, and the memory limit is 1GB.

Your source code may not exceed 300KB.

There are 100 test cases for full submissions.

Constraints

encoded will contain between 2 and 8 elements, inclusive (selected at random)

The length of each encoded text will be between 200 and 10000, selected as len = 10000 - Sqrt(Random * 96059601.0)

Each element of encoded will contain only 'A'-'Z' (uppercase) characters.

Examples

"1"

Returns: 
"Length = 3808

Pad Source: thepitandthependulum-conv.txt
Plain Source: wrnpc12-conv.txt
"

"2"

Returns: "Length = 8432

Pad Source: ttlg-conv.txt
Plain Source: ttluts-conv.txt
"

"3"

Returns: "Length = 6610

Pad Source: ttluts-conv.txt
Plain Source: wutheights-conv.txt
"

"4"

Returns: "Length = 5914

Pad Source: ttluts-conv.txt
Plain Source: ttlg-conv.txt
"

"5"

Returns: 
"Length = 7840

Pad Source: ttlg-conv.txt
Plain Source: descartes-discourse-conv.txt
"

"6"

Returns: 
"Length = 2288

Pad Source: ttlg-conv.txt
Plain Source: thepitandthependulum-conv.txt
"

"7"

Returns: 
"Length = 4580

Pad Source: mobydick.txt
Plain Source: thepitandthependulum-conv.txt
"

"8"

Returns: 
"Length = 3029

Pad Source: thepitandthependulum-conv.txt
Plain Source: wutheights-conv.txt
"

"9"

Returns: 
"Length = 1061

Pad Source: thepitandthependulum-conv.txt
Plain Source: tomsawyer-conv.txt
"

"10"

Returns: "Length = 1100

Pad Source: ttlg-conv.txt
Plain Source: wutheights-conv.txt
"

Problem url:

http://www.topcoder.com/stat?c=problem_statement&pm=9906

Problem stats url:

http://www.topcoder.com/tc?module=ProblemDetail&rd=13564&pm=9906

Writer:

Unknown

Testers:

Problem categories:

Encryption/Compression