TopCoder problem "BioScore" used in TCO04 Semifinal 1 (Division I Level Two)



Problem Statement

     Two equal length sequences of nucleic acids (A,C,T,or G) are judged based on a 4x4 score matrix. The homology score for the sequences is just the sum of the scores at each position in the sequence, looked up in the matrix. For example, if the two sequences were ACTA and GATC then their homology score would be score(A,G) + score(C,A) + score(T,T) + score(A,C).

Our problem is to construct the score matrix. It has the following restrictions:

  • 1) All entries must be integers between -10 and 10 inclusive
  • 2) It must be symmetric ( score(x,y) = score(y,x) )
  • 3) Diagonal entries must be positive ( score(x,x)>0 )
  • 4) The sum of the 16 entries must be 0.
We are given a set of equal length sequences that are known to be homologous. We want to fill in the score matrix so that the average homology score for all the pairs from the set is maximized. If there are n sequences in the set, then there are n*(n-1)/2 pairs to consider.

Create a class BioScore that contains a method maxAvg that is given knownSet, the list of sequences known to be homologous. The method computes the optimum scoring matrix and returns the resulting maximal average homology score for the pairs from knownSet.

 

Definition

    
Class:BioScore
Method:maxAvg
Parameters:String[]
Returns:double
Method signature:double maxAvg(String[] knownSet)
(be sure your method is public)
    
 

Constraints

-knownSet will contain between 2 and 50 elements inclusive
-each element of knownSet will contain only the uppercase letters A,C,T, and G.
-each element of knownSet will contain between 1 and 50 characters inclusive.
-all elements of knownSet will contain the same number of characters.
 

Examples

0)
    
{"AAA","AAA","AAC"}
Returns: 30.0
Make score(A,A) and score(A,C) be 10. It is then easy to satisfy the remaining requirements. All three pairwise comparisons score 30.0.
1)
    
{"ACTGACTGACTG","GACTTGACCTGA"}
Returns: -4.0
Here, there are no positions in which the letters in the two strings match each other. So we should give each exact match (each diagonal entry of the score matrix) the smallest possible score, 1. The remaining 12 pairs are each present at one position in the comparison so the best we can do is to choose those 12 scores so they add up to -4 in any manner. Now the sum of all the matrix entries is 0, and the resulting score for the one pairwise comparison is -4.
2)
    
{"ACTAGAGAC","AAAAAAAAA","TAGTCATAC","GCAGCATTC"}
Returns: 50.5

Problem url:

http://www.topcoder.com/stat?c=problem_statement&pm=3038

Problem stats url:

http://www.topcoder.com/tc?module=ProblemDetail&rd=5882&pm=3038

Writer:

dgoodman

Testers:

PabloGilberto , lbackstrom , Yarin

Problem categories:

Greedy