TopCoder problem "SequenceAlignment" used in Exp 1 - Gr A - Ph 1 (Division I Level One)



Problem Statement

    Sequence alignment is a well-studied task that comes in many flavors. One problem involves determining how various pieces of DNA combine to form longer sequences which are then used for various purposes. In particular, we have three sets of sequences: A, B and C. By selecting sequences a, b, and c from A, B and C, we can append them to achieve a longer sequences abc. However, this process is not without complexities, and during the appending process a number of insertions, deletions and mutations might be made, making it difficult to tell what the original a, b and c were.



Implementation Details

You will be given String[]'s A, B and C representing the sets mentioned above. You will also be given a number of query strings. For each query, you must determine which sequence abc most closely matches the query string, where closely is defined by Levenshtein distance. You should return a int[] with three element for each query string. Thus, ret[i*3] gives the index of the element of A to use for query i (indexed from 0), while element ret[i*3+1] gives the element of B and ret[i*3+2] gives the element of C.



Test Generation

The query strings will be formed by selecting one string from each set. For each of the three selected strings, two independent random integers d1 and d2 between 0 and 10, inclusive will be chosen and d1 characters will be deleted from the front of the string and d2 characters will be deleted from the end. Next, four integers (i1 ... i4) will be chosen by sampling from an exponential distribution with lambda = 0.3, and then taking the floor of the sampled value. i1 random characters will be inserted before a, i2 random characters between a and b, i3 random characters between b and c, and i4 random characters after c. Finally, each character will be changed to a different character with probability 0.1.



Each of the sets A, B and C will be randomly (and independently) selected from one of four biological datasets. Thus, there are 64 possible values for the sets (4 for each A, B and C).

Scoring

Your solution should find abc triples that are close to queries quickly. For a set of queries, we find the total Levenshtein distance summed over all queries, DIST. The total runtime over all queries in the set is TIME. Your score will then be DIST * 0.01 + ln(TIME). Your overall score will be the sum over all test cases of 1,000,000/score.

Tools

An offline tester is available.
 

Definition

    
Class:SequenceAlignment
Method:recover
Parameters:String[], String[], String[], String[]
Returns:int[]
Method signature:int[] recover(String[] A, String[] B, String[] C, String[] queries)
(be sure your method is public)
    
 

Notes

-Character always means a, c, g, or t in this problem.
-If the total number of characters to be deleted from a, b or c exceeds the length of the string, the whole string is deleted.
-Aside from the first 5 tests, each test has N = 100,000 queries.
-The time limit is 30 seconds and the memory limit is 1024M.
-There are 64 tests -- one for each choice of A, B and C.
-The final tests will use the same possibilities for A, B and C, which can be seen in the source code of the offline tester.
 

Examples

0)
    
"1"
Returns: "seed = 1<br>
N = 10<br>
ABC sets = (1, 3, 0)"
The set A used in this test is that with index 1, the set B is that with index 3, and set C has index 0.
1)
    
"2"
Returns: "seed = 2<br>
N = 100<br>
ABC sets = (0, 3, 1)"
2)
    
"3"
Returns: "seed = 3<br>
N = 1000<br>
ABC sets = (2, 0, 1)"
3)
    
"4"
Returns: "seed = 4<br>
N = 10000<br>
ABC sets = (2, 2, 3)"
4)
    
"5"
Returns: "seed = 5<br>
N = 100000<br>
ABC sets = (3, 2, 3)"
5)
    
"6"
Returns: "seed = 6<br>
N = 100000<br>
ABC sets = (2, 1, 3)"
6)
    
"7"
Returns: "seed = 7<br>
N = 100000<br>
ABC sets = (3, 0, 1)"
7)
    
"8"
Returns: "seed = 8<br>
N = 100000<br>
ABC sets = (3, 0, 1)"
8)
    
"9"
Returns: "seed = 9<br>
N = 100000<br>
ABC sets = (2, 0, 2)"
9)
    
"10"
Returns: "seed = 10<br>
N = 100000<br>
ABC sets = (1, 1, 0)"

Problem url:

http://www.topcoder.com/stat?c=problem_statement&pm=10390

Problem stats url:

http://www.topcoder.com/tc?module=ProblemDetail&rd=13796&pm=10390

Writer:

Unknown

Testers:

Problem categories:

Dynamic Programming