TopCoder problem "BlockEditDistance" used in MM 18 (Division I Level One)

Problem Statement

A detailed changelog is maintained for each entry in Wikipedia. Each line in this changelog represents one version of the page, made by some user. In this problem, your task is to figure out how each version transitioned to the final version, and to do this as quickly as possible.

To do this, we will use the idea of block edit distance. When considering a past version and the final version, we want to know how cheaply we can transform the past version into the final version using the following procedure. First, we take a sequence of blocks from the past version, and arrange them however we want, where a block is a continuous subsequence, defined by its start and end index in the past version. Once we've done this, we insert and delete single characters (just like regular edit distance). Each block that we use has a cost of B, while each insertion or deletion costs 1. The goal is to tranform each past version into the final version as cheaply as possible. For instance, consider the following example, where B = 2.

past:  ABCDEFGHIJKL
final: GGHIJMACDEFGZ

One way to do this is to pick two blocks from the past version (characters 6 to 9 and characters 0 to 6), and arrange them as:

  GHIJABCDEFG

Next we insert and delete some characters (illustrated below, we add 3 and delete 1)

  -GHIJ-ABCDEFG-
  GGHIJMA-CDEFGZ

The total cost of these operations is 4 for the two blocks, and 4 for the inserts and deletes.

The input will be given to you as a String[], where each element represents one version of the page, in the order of their revisions, so the final element is the final version. You should return a String[] where each element corresponds to one version of the page (excluding the final version). Each element should first give the blocks from the previous version to arrange, as inclusive ranges separated by spaces. Following these should be a string describing how to change the string formed from the sequence of blocks into the final version. This string should be a sequence of 'M', 'I', and 'D' characters, standing for match, insert and delete. A good way to think about this is to imagine a pointer starting at the beginning of each string: the string formed from blocks, and the final version. A 'M' character advances both pointers, an 'I' advances only the final version pointer, while a 'D' advances only the block pointer. At the end of the sequence of 'M's, 'I's and 'D's, the two pointers should both be one character past the last character of the strings. Furthermore, an 'M' character can only be used if the characters at the two pointer positions match. In the example above, the sequence IMMMMIMDMMMMMI would be correct (note that 'I' corresponds to a '-' in the first sequence, and 'D' to a '-' in the second). Thus, the whole return for the example above could be "6-9 0-6 IMMMMIMDMMMMMI". Note that the cost will be B*numBlocks+Ds+Is where Ds is the number of 'D' characters, and Is is the number of 'I' characters.

Your score will be based on your total block edit distance, summed over all versions, and on the time it takes you. For a particular test case, a baseline score is calculated by multiplying (number of versions minus one) times the length of the final version (this is the cost of a solution that used only 'I's). Your percent improvement will be calculated from this as: (baseline-cost)/baseline. Your percent improvement will then be multiplied by a time modifier. This modifier is based on the rate at which you process the data in KB/s. So, if all the versions have 500,000 characters in them combined, and you process this case in 1 second, your rate would be 500. The modifier function is 1.0/(1+exp(4-rate/200)). This is a sigmoidal curve (try plotting it!) which goes to 1 as rate increases. To give you an idea of what to shoot for, a rate of 1000 would give you a modifier of about 0.73, while a rate of 400 would give you a modifier of about 0.12 and a rate of 1500 would give you a modifier of 0.97. To allow a bit more time for small articles, articles with less than 100K characters will be treated as if they had 100K characters.

Your final score will be the sum over all test cases of your score for each test case, multiplied by 100.

Definition

Class:	BlockEditDistance
Method:	transform
Parameters:	String[], int
Returns:	String[]
Method signature:	String[] transform(String[] versions, int B)
(be sure your method is public)

Notes

- You are translating each version into the final version not into the next version.

- Matching is case sensitive.

- The time limit per test case is max(5,min(size/5e5,60)) seconds, where size is the total number of characters in the test. Thus, you have at least 5 seconds for each test case, and for big test cases you have 1 second per 500K characters, up to a max of 60 seconds.

- The memory limit is 1024M.

- Test cases are chosen by using the random page feature on Wikipedia, and discarding pages with less than 20 versions or more than 200 million total characters or a baseline score over 500 million. Some of them (e.g. Martial_arts) are quite large.

- Download the examples.

- If you want to download more data, you can use this Java program. Simply install Java and run "java -Xmx512M WikiDownload <article>" to download a specific article (which will be saved as a .txt file). If you run it with no parameters it will download random pages until you kill it.

- While the overall thrust of the problem will remain unchanged between phases one and two of this contest (phase two occurs a week after phase one ends), we may make small changes to the problem between phases one and two. For instance, the range of B, or the scoring functions might change in their details. Any changes will be explicitly noted in the phase two problem, and may be posted on the forums ahead of time.

- A negative score on a test case will be increased to 0 when computing the final score.

Constraints

- B will be chosen uniformly between 10 and 40, inclusive.

- The total number of characters will be less than 200 million (this excludes a few of the biggest pages)

Examples

"Mork"

Returns: "Article = Mork
B = 24"

20 versions, 5.8K total.

Imagine that you processed this test case in 86ms, and your total cost was 3587. The baseline for this case is 329*19=6251, so you would have an improvement of (6251-3587)/6251=0.4262. Your rate would be 100K/0.086s = 1163. This gives you a modifier of 1/(exp(4-1162/200)) = 0.860, and an overall score of 0.4262*0.860=0.366.

"Garifuna_language"

Returns: "Article = Garifuna_language
B = 11"

25 versions, 25.8K total

"Jahwist"

Returns: "Article = Jahwist
B = 20"

59 versions, 538K total

"Piter_Fattouche"

Returns: "Article = Piter_Fattouche
B = 27"

22 versions, 17K total

"Alpenglow"

Returns: "Article = Alpenglow
B = 11"

31 versions, 31K total

"Cork_South_West_%28D%C3%A1il_%C3%89ireann_constituency%29"

Returns: "Article = Cork_South_West_%28D%C3%A1il_%C3%89ireann_constituency%29
B = 16"

26 versions, 84K total

"Osaka_University"

Returns: "Article = Osaka_University
B = 10"

78 versions, 338K total

"Nicholas_Lanier_the_Elder"

Returns: "Article = Nicholas_Lanier_the_Elder
B = 40"

28 versions, 56K total

"Philippine_Airlines"

Returns: "Article = Philippine_Airlines
B = 16"

1113 versions, 28M total

"Martial_arts"

Returns: "Article = Martial_arts
B = 25"

2716 versions, 102M total

Problem url:

http://www.topcoder.com/stat?c=problem_statement&pm=7874

Problem stats url:

http://www.topcoder.com/tc?module=ProblemDetail&rd=10834&pm=7874

Writer:

Unknown

Testers:

Problem categories:

String Manipulation