TopCoder problem "bloggoProximitySearch" used in SRM 214 (Division I Level Three)



Problem Statement

    

To help end users sift through archival content, bloggo makes it possible for content authors to incorporate a search facility into their websites. Among the query types supported by the bloggo search engine is proximity search, which lets users look for passages in which specified words occur near one another. The syntax of a proximity-search query is prescribed by the following grammar.


  Query -> Word | "(" Query " " Near " " Query ")"
  Near -> "+" Num
  Num -> Digit | Digit Digit | Digit Digit Digit
  Digit -> "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

A passage returned by proximity search is displayed to the user as a substring of a text document. Internally, a passage is represented by a pair of non-negative integers (a,b) such that a<=b, where a and b are the zero-based position numbers respectively of the first and last words in the substring. A word is a sequence of alphabetic characters that is not contained within a longer word. If a passage consists of a single word, then a=b. For example, the passage (9,11) of the following document is "appraising his wheat", while passage (17,17) is "actor". Position numbers have been added for purposes of illustration.


    His expression combined that of a Middle Western farmer appraising
    0   1          2        3    4  5 6      7       8      9

    his wheat crop and that of an actor wondering whether he is
    10  11    12   13  14   15 16 17    18        19      20 21

    observed--the public manner of all good Americans.  (Fitzgerald)
    22        23  24     25     26 27  28   29           30

If a query consists of a word, it matches all passages consisting of just that word. Matching is case-insensitive, so the query "HIS" would match passages (0,0) and (10,10) in the above document.

If a query consists of two queries X and Y joined by the "Near" operator, it matches a passage (p, q) under the following conditions. There must be a passage (a,b) matching query X, and a passage (c,d) matching query Y. Letting n be the number specified by the "Near" operator, it must be true that if the two passages don't overlap, they are separated by at most n words. So if b<c or d<a, we must have (c-b)<=n+1 or (a-d)<=n+1, respectively. Finally, p and q are the leftmost and rightmost endpoints among the two passages, meaning that p=min{a,c} and q=max{b,d}.

You are given a String[], documents, each element of which is a single document that may include, in addition to words and space characters, the punctuation symbols ',', ';', '.', '!', '?', '-', '(', and ')'. You are also given a String, query, containing a proximity-search query. Find the top ten passages matching this query and return them in a String[] formatted according to the specifications below and sorted by descending order of quality, where quality is also defined below. Each passage must be a substring of an element of documents, and must begin and end with a whole word.

The smaller the width q-p of a passage (p,q), the higher its quality. In the event of a tie between two passages, the one occurring in the lower-numbered document takes precedence. If both are in the same document, the one occurring earlier is better. Although every pair of passages in the search results must be different in the sense of covering different spans or occurring in different documents, they may have identical textual content. If there are fewer than ten matching passages among the documents, return the ones that do exist. To make the search results helpful to users, each passage should be presented in the format


  DOC_ID START_INDEX [PASSAGE_TEXT]

where DOC_ID is the zero-based index of the document from which the passage is drawn, START_INDEX is the zero-based character index of the first character in the passage, and PASSAGE_TEXT is the text of the passage. There is one space between DOC_ID and START_INDEX, one space between START_INDEX and the left bracket, and no other spaces except possibly in the PASSAGE_TEXT.

 

Definition

    
Class:bloggoProximitySearch
Method:findPassages
Parameters:String[], String
Returns:String[]
Method signature:String[] findPassages(String[] documents, String query)
(be sure your method is public)
    
 

Constraints

-documents contains between 1 and 50 elements, inclusive
-each element of documents is between 1 and 50 characters long, inclusive
-the only characters allowed in documents are 'a' to 'z', 'A' to 'Z', the space character, ',', ';', '.', '!', '?', '-', '(', and ')'
-query is between 1 and 50 characters long, inclusive
-query consists of a proximity-search query as defined above
 

Examples

0)
    
{"But in a larger sense we can not dedicate --",
 "we can not consecrate -- we can not hallow this",
 "ground. The brave men, living and dead, who",
 "struggled, here, have consecrated it far above",
 "our poor power to add or detract. The world will",
 "little note, nor long remember, what we say here,",
  "but can never forget what they did here. It is",
 "for us, the living, rather to be dedicated here",
 "to the unfinished work which they have, thus",
 "far, so nobly carried on. It is rather for us",
 "to be here dedicated to the great task remaining",
 "before us -- that from these honored dead we take",
 "increased devotion to that cause for which they",
 "here gave the last full measure of devotion --",
 "that we here highly resolve that these dead",
 "shall not have died in vain; that this nation",
 "shall have a new birth of freedom; and that",
 "this government of the people, by the people,",
 "for the people, shall not perish from the earth."}
"(can +000 we)"
Returns: { "0 22 [we can]",  "1 0 [we can]",  "1 25 [we can]" }
These documents are made of fragments of the Gettysburg Address. The words "can" and "we" occur next to each other in three places.
1)
    
{"But in a larger sense we can not dedicate --",
 "we can not consecrate -- we can not hallow this",
 "ground. The brave men, living and dead, who",
 "struggled, here, have consecrated it far above",
 "our poor power to add or detract. The world will",
 "little note, nor long remember, what we say here,",
  "but can never forget what they did here. It is",
 "for us, the living, rather to be dedicated here",
 "to the unfinished work which they have, thus",
 "far, so nobly carried on. It is rather for us",
 "to be here dedicated to the great task remaining",
 "before us -- that from these honored dead we take",
 "increased devotion to that cause for which they",
 "here gave the last full measure of devotion --",
 "that we here highly resolve that these dead",
 "shall not have died in vain; that this nation",
 "shall have a new birth of freedom; and that",
 "this government of the people, by the people,",
 "for the people, shall not perish from the earth."}
"this"
Returns: { "1 43 [this]",  "15 34 [this]",  "17 0 [this]" }
The word "this" appears three times.
2)
    
{"But in a larger sense we can not dedicate --",
 "we can not consecrate -- we can not hallow this",
 "ground. The brave men, living and dead, who",
 "struggled, here, have consecrated it far above",
 "our poor power to add or detract. The world will",
 "little note, nor long remember, what we say here,",
  "but can never forget what they did here. It is",
 "for us, the living, rather to be dedicated here",
 "to the unfinished work which they have, thus",
 "far, so nobly carried on. It is rather for us",
 "to be here dedicated to the great task remaining",
 "before us -- that from these honored dead we take",
 "increased devotion to that cause for which they",
 "here gave the last full measure of devotion --",
 "that we here highly resolve that these dead",
 "shall not have died in vain; that this nation",
 "shall have a new birth of freedom; and that",
 "this government of the people, by the people,",
 "for the people, shall not perish from the earth."}
"((the +099 people) +999 by)"
Returns: 
{ "17 19 [the people, by]",
 "17 23 [people, by the]",
 "17 31 [by the people]",
 "17 19 [the people, by the people]" }
Passages may overlap.
3)
    
{"But in a larger sense we can not dedicate --",
 "we can not consecrate -- we can not hallow this",
 "ground. The brave men, living and dead, who",
 "struggled, here, have consecrated it far above",
 "our poor power to add or detract. The world will",
 "little note, nor long remember, what we say here,",
  "but can never forget what they did here. It is",
 "for us, the living, rather to be dedicated here",
 "to the unfinished work which they have, thus",
 "far, so nobly carried on. It is rather for us",
 "to be here dedicated to the great task remaining",
 "before us -- that from these honored dead we take",
 "increased devotion to that cause for which they",
 "here gave the last full measure of devotion --",
 "that we here highly resolve that these dead",
 "shall not have died in vain; that this nation",
 "shall have a new birth of freedom; and that",
 "this government of the people, by the people,",
 "for the people, shall not perish from the earth."}
"((TO +20 (tO +20 tO)) +20 ((TO +20 tO) +20 To))"
Returns: 
{ "4 15 [to]",
 "7 27 [to]",
 "8 0 [to]",
 "10 0 [to]",
 "10 21 [to]",
 "12 19 [to]",
 "10 0 [to be here dedicated to]" }
Word matching is case-insensitive.
4)
    
{"c B b B A C b b A C A a c a B A b C c b b b b A a ",
 "b A A C c c B c A c c C b c B C B A a B b a b c A ",
 "a C C B A b A a A c A B b b a A A C a B C A A b C ",
 "C A a a c a A B C a c B c B b A c B A A c b b B b ",
 "a A a A b C A b c a B B a c B A a c a b c b C b C ",
 "c C A a C C A c A c b a b A b A c C A C b C C B c ",
 "a B C B c A b C c c a a c a c A A c a A B c c A c ",
 "A C A C c a b C C C c A a b a b b C C a A C b a c ",
 "C A B c a A B b b c b C b B C c A b A B A c b B a ",
 "c b a B a C B C c A c b B c C A A C c C a B C a a ",
 "A C C A a C C B a c C a C a c A A A C c A a B a A ",
 "a b c B C C A b b a c c b B A B b A b b b b a a b ",
 "C B c C B b c c a C B C b B B a C B B a A B B c c ",
 "a B B B B A c c C a b b A c B A A b B B B c C c B ",
 "c A B C b B b c a a C b b c b c b a B A c c A a A ",
 "a B A B c c b C A A b b A a A B b C A A B b c b C ",
 "b b b A A c a c a c a B A B b a a A c B b B B A b ",
 "c a b B a b A c C B b B b c A A C c b C B a c c c ",
 "C A A C b C C A a c B c b B C A c B a c c B c b c ",
 "b A A B b B b b b A C b a b A C b B C A c c A B A ",
 "b a c b B a B B B a a a A C a a A A A A A B a c a ",
 "c a A C c A c b a a C B b c c B C B c b a c C B B ",
 "a b A c B b b B B C c b c B A a a b C b a C b c b ",
 "c b b b a a A b b b b c B b b c c a a c a C A a b ",
 "C B b b A A A b B c b C a c C a B A A C C c C b C ",
 "B C a a B A a B c B a C c B A B b c B a b B B b C ",
 "a A a C C b A C a C b b A a B C C C C A c C c C b ",
 "a c a a A a c C b B b c A c B b A A A B a C c c c ",
 "A c B b c C c a b b C B b c B b b A C c B b a b b ",
 "c a c A A B C C b C b C c C c C c b b a A C c C a ",
 "A b a A b B B C c C c c a b C a C a b c c A A B c ",
 "c C a b A a b A C C a C A a B C c a C b c b B C C ",
 "B C C B b B C A c a C B c a B b c B A A a a a b C ",
 "b a a A A B b b a b A b b a b c A B B c C a a A C ",
 "a c a C A b a A C c a C c b b A A B a C B a A C B ",
 "B A B B a a c c a c b A c c c B C A A B a b B B B ",
 "C c B c A c b C b a c b a c C A c A a c b a b c c ",
 "B C b c c A C A b A B c c c B A C b b C c C a a b ",
 "C a b b C a C B a B C b c C C c b b c A b B B b a ",
 "a B A b a A C C B C B A B a b a a c a c b b b B a ",
 "c b c B A C C c A B a b B a A A C b A C b C a a c ",
 "b b a b a A c c A c A A A a A b c a C C b a A C a ",
 "a B b A b C c c a A b A c A B A b B a B a b A c A ",
 "c B b b C b c c c a B C c c c a b a c B B a C c A ",
 "C A a b c C c b a B A c A C C c B A C b A C c b c ",
 "a c A a b C b C A C A A a c B c a B A B B b A c a ",
 "B c C B a B C C B a a B A b b c b b A c A B b A a ",
 "c c c a C A B a c b c a B c b a C B a B A a b c A ",
 "B c b a C B c A A b a a c c c A c B C B B C b b a ",
 "C a C c a c c c b a C C B C a B a C A b a C a c B "}
"(a +3 (b +2 (a +6 (((b +1 c) +3 a) +0 c))))"
Returns: 
{ "0 6 [B A C]",
 "0 8 [A C b]",
 "0 14 [b A C]",
 "0 24 [c a B]",
 "0 30 [A b C]",
 "1 12 [B c A]",
 "1 30 [C B A]",
 "1 42 [a b c]",
 "1 44 [b c A]",
 "2 4 [C B A]" }

Problem url:

http://www.topcoder.com/stat?c=problem_statement&pm=3026

Problem stats url:

http://www.topcoder.com/tc?module=ProblemDetail&rd=5860&pm=3026

Writer:

Eeyore

Testers:

PabloGilberto , lbackstrom , brett1479

Problem categories:

Search, String Manipulation, String Parsing