TopCoder problem "bloggoSequenceSearch" used in SRM 214 (Division II Level Three)



Problem Statement

    

Bloggo lets content producers integrate a text-search facility into their weblogs, making it easy for end users to sift through archival content. One type of query offered by the bloggo search engine is the sequence search. In such a query, consecutive words are separated by an ellipsis, "...". The query matches strings in which all words of the query appear, in the same order as they are given in the query. Such a string, called a passage, begins with the first word of the query and ends with the last word of the query. For example, the sequence-search query


  with...here...there

matches the following passages.


  With a quack-quack here and a quack-quack there

  With an oink-oink here and an oink-oink there

  With a moo-moo here and a moo-moo there

Observe that matching is not case-sensitive. Also bear in mind that words are defined as sequences of alphabetic characters, 'a' to 'z' and 'A' to 'Z', that are not included in a longer word. Thus, the query shown above does not match either of the following passages in part or in whole.


  without a woof-woof here or a woof-woof there

  with a meow-meow here and a meow-meow thereabouts

In the special case of a single-word query, a matching passage is also one word long. The query


  pudding

therefore matches passages that look like this.


  pudding

In general, a query word must appear in a matching passage at least as many times as it appears in the query. Thus, the query


  a...quack...quack...night

does not match


  a quack in the night

but it does match both of the following.


  a quack in the quack factory at night

  a quack doctor said quack quack last night

You are given a String[], documents, each element of which contains a single document. A document may include, in addition to words and spaces, the punctuation symbols ',', ';', '.', '!', '?', '-', '(', and ')'. You are also given a String, query, consisting of one or more words separated by ellipses. Find the top five passages matching this query and return them in a String[] formatted according to the specifications below and sorted by descending order of quality, where quality is also defined below. Each passage must be a substring of an element of documents, and must begin and end with a whole word.

The shorter a passage, the higher its quality. The length of a passage is counted in characters, and a passage extends from the first character of the first query word to the last character of the last query word. If two passages are equally long, the one drawn from the lower-numbered document takes precedence. In the case of equally long passages in the same document, earlier ones are better. Passages may overlap. If there are fewer than five matching passages among the documents, return the ones that do exist. To make the search results helpful to users, each passage should be presented in the format


  DOC_ID START_INDEX [PASSAGE_TEXT]
  

where DOC_ID is the zero-based index of the document from which the passage is drawn, START_INDEX is the zero-based character index of the first character in the passage, and PASSAGE_TEXT is the text of the passage. There is one space between DOC_ID and START_INDEX, one space between START_INDEX and the left bracket, and no other spaces except possibly in the PASSAGE_TEXT.

 

Definition

    
Class:bloggoSequenceSearch
Method:findPassages
Parameters:String[], String
Returns:String[]
Method signature:String[] findPassages(String[] documents, String query)
(be sure your method is public)
    
 

Constraints

-documents contains between 1 and 50 elements, inclusive
-each element of documents is between 1 and 50 characters long, inclusive
-the only characters allowed in documents are 'a' to 'z', 'A' to 'Z', the space character, ',', ';', '.', '!', '?', '-', '(', and ')'
-query is between 1 and 50 characters long, inclusive
-query consists of a sequence-search query as defined above
 

Examples

0)
    
{"But in a larger sense we can not dedicate --",
 "we can not consecrate -- we can not hallow this",
 "ground. The brave men, living and dead, who",
 "struggled, here, have consecrated it far above",
 "our poor power to add or detract. The world will",
 "little note, nor long remember, what we say here,", 
 "but can never forget what they did here. It is",
 "for us, the living, rather to be dedicated here",
 "to the unfinished work which they have, thus",
 "far, so nobly carried on. It is rather for us",
 "to be here dedicated to the great task remaining", 
 "before us -- that from these honored dead we take",
 "increased devotion to that cause for which they",
 "here gave the last full measure of devotion --",
 "that we here highly resolve that these dead",
 "shall not have died in vain; that this nation",
 "shall have a new birth of freedom; and that",
 "this government of the people, by the people,", 
 "for the people, shall not perish from the earth."}
"the...People"
Returns: 
{ "17 19 [the people]",
 "17 34 [the people]",
 "18 4 [the people]",
 "17 19 [the people, by the people]" }
These documents are fragments of the Gettysburg Address. Observe that word matching is case-insensitive and that passages may overlap.
1)
    
{"But in a larger sense we can not dedicate --",
 "we can not consecrate -- we can not hallow this",
 "ground. The brave men, living and dead, who",
 "struggled, here, have consecrated it far above",
 "our poor power to add or detract. The world will",
 "little note, nor long remember, what we say here,", 
 "but can never forget what they did here. It is",
 "for us, the living, rather to be dedicated here",
 "to the unfinished work which they have, thus",
 "far, so nobly carried on. It is rather for us",
 "to be here dedicated to the great task remaining", 
 "before us -- that from these honored dead we take",
 "increased devotion to that cause for which they",
 "here gave the last full measure of devotion --",
 "that we here highly resolve that these dead",
 "shall not have died in vain; that this nation",
 "shall have a new birth of freedom; and that",
 "this government of the people, by the people,", 
 "for the people, shall not perish from the earth."}
"Shall...The"
Returns: { "18 16 [shall not perish from the]" }
Note that the substring "the people, shall" in the last document does not match this query.
2)
    
{"But in a larger sense we can not dedicate --",
 "we can not consecrate -- we can not hallow this",
 "ground. The brave men, living and dead, who",
 "struggled, here, have consecrated it far above",
 "our poor power to add or detract. The world will",
 "little note, nor long remember, what we say here,", 
 "but can never forget what they did here. It is",
 "for us, the living, rather to be dedicated here",
 "to the unfinished work which they have, thus",
 "far, so nobly carried on. It is rather for us",
 "to be here dedicated to the great task remaining", 
 "before us -- that from these honored dead we take",
 "increased devotion to that cause for which they",
 "here gave the last full measure of devotion --",
 "that we here highly resolve that these dead",
 "shall not have died in vain; that this nation",
 "shall have a new birth of freedom; and that",
 "this government of the people, by the people,", 
 "for the people, shall not perish from the earth."}
"wE...nOT"
Returns: 
{ "0 22 [we can not]",
 "1 0 [we can not]",
 "1 25 [we can not]",
 "1 0 [we can not consecrate -- we can not]" }
3)
    
{"But in a larger sense we can not dedicate --",
 "we can not consecrate -- we can not hallow this",
 "ground. The brave men, living and dead, who",
 "struggled, here, have consecrated it far above",
 "our poor power to add or detract. The world will",
 "little note, nor long remember, what we say here,", 
 "but can never forget what they did here. It is",
 "for us, the living, rather to be dedicated here",
 "to the unfinished work which they have, thus",
 "far, so nobly carried on. It is rather for us",
 "to be here dedicated to the great task remaining", 
 "before us -- that from these honored dead we take",
 "increased devotion to that cause for which they",
 "here gave the last full measure of devotion --",
 "that we here highly resolve that these dead",
 "shall not have died in vain; that this nation",
 "shall have a new birth of freedom; and that",
 "this government of the people, by the people,", 
 "for the people, shall not perish from the earth."}
"we...can...not"
Returns: 
{ "0 22 [we can not]",
 "1 0 [we can not]",
 "1 25 [we can not]",
 "1 0 [we can not consecrate -- we can not]" }
4)
    
{"B A b B B A b a A B b b b a a a B a A b a B B b b ",
 "A B b b A a A b A A a b b A A a A a A b A a A B b ",
 "B B a A a a A A A b A b b a a B B b a A a A b a A ",
 "a b b b b A B a A A A A a b A b A A B a A b b a A ",
 "A b a A b B B a b A A A A B A A B a A B a b b b B ",
 "A A A b b a a a A b b b a A A b a B a a B b B a b ",
 "B A B A a b A B a B b A b B b b A B A A a B A b B ",
 "A A b B a a b B b a b a A A A B A b b b b A b b A ",
 "B B A A B B A b a a b a a A B B A A B B A B b A b ",
 "A a A b B B b A a b b B A a B b b A B b a a a b b ",
 "B B a b A A a b a A a A a B B A B A A a B B b a B ",
 "B b B B a B A B b B B a A a a a b b B a A A b a A ",
 "b B a a a B b a A a A b a a A b A B b a B a b A A ",
 "A A b B b A a B a a a b B A b B B a b B b b b B B ",
 "b B a A A A B B b b b b a b B B a a a b B b A A B ",
 "a B A a a a a A A B A A B b a b A A a A a a B B B ",
 "B B A B b B B a b B A B A b b A b A B A A b A a b ",
 "B A A A b A B a b b b a b b b a a A A b A b a A b ",
 "A a B B B A B b b A B a B a A b B b b a B A B b a ",
 "B a A A B a a b b B b B A A a A b a a b b b a B b ",
 "a a b A B a b B B B A A B a B b B A b B b a b A B ",
 "A a a b a A b B A b B a A B a a A B a A b a b a B ",
 "A b A b b A b b a B b a A a b A a A A A A a A b A ",
 "a B B A a B B a b B b a B b A a B B B a B a a b a ",
 "a a A b B B B a A B a B a A a B a a B b A B b b B ",
 "A a A B A A A b b B A A a A b B b a A b B A b b A ",
 "A A a a b B A b a b a A A A B B b B A a A B A B b ",
 "b a b B a A B a a b B B B A b a A a A A A a a A b ",
 "a b a a B a B b a B B A A b a B A B b A b a A a a ",
 "b a a B a a b A B A a b a B A B b b B A b A a b a ",
 "a B b B a a a A b B B A b b b A A B A b b b B b A ",
 "B B a A b b a B B b A a a A B B A a A b a a B A A ",
 "a A a A b b B B a B B b a b b A B a B A a a b a A ",
 "A B A A a a b A b A B B A b A B b B B a a a A b a ",
 "b A B b b b a b b A A a A B b B B A a b b B a b a ",
 "A B B b b a a a B A B b b a b A a a B A b B A B B ",
 "A b B b B a b a B B B B A a A a a B B b b a b A A ",
 "B B a a b B a A b A A A a A a b b b a B B A a B a ",
 "A B a a B A B B a B B b b a B a b b A A B b A A b ",
 "B b a a A a B b B a b A B B B A b B b A A B a a B ",
 "B b b A B a a b a B a b A A b B a b b a A B B a A ",
 "A A b A a a a a B B b b b b A B A B B a B b A a a ",
 "B B B A b A a a a B a B b b B b a A a b b A B a B ",
 "A B b b B a b A b B b A B A A A a B B a A b b b B ",
 "B a A B A b b A b b a A A b b b A B a b B a b A b ",
 "B a a B a B A b B A b b a b B a b a A b b a A b B ",
 "a b b a A b A a A B B a b B B a A B A a b B a a a ",
 "b b a A a a A a a B b A B a A B B a A a B A b A a ",
 "b b B b B a a b a B B A B b A b A a b b B A B b a ",
 "a b a B a A B a A A A A A a B a A A a B a b b b b "}
"a...b...a...b...a...a...b...b...a...a...b...b"
Returns: 
{ "45 24 [a b B a b a A b b a A b B]",
 "8 12 [A b a a b a a A B B A A B B]",
 "18 18 [A B a B a A b B b b a B A B b]",
 "40 16 [a B a b A A b B a b b a A B B]",
 "47 16 [a B b A B a A B B a A a B A b]" }
5)
    
{ "this government of the people, by the people,", 
 "for the people, shall not perish from the earth."}
"people"
Returns: { "0 23 [people]",  "0 38 [people]",  "1 8 [people]" }

Problem url:

http://www.topcoder.com/stat?c=problem_statement&pm=3023

Problem stats url:

http://www.topcoder.com/tc?module=ProblemDetail&rd=5860&pm=3023

Writer:

Eeyore

Testers:

PabloGilberto , lbackstrom , brett1479

Problem categories:

Search, String Manipulation, String Parsing