TopCoder problem "Smooshing" used in SRM 284 (Division I Level Two)

Problem Statement

    We write programs in a script language which are embedded in HTML files and downloaded over the internet. To speed up our downloads, we would like to "smoosh" our source code, reducing its size without destroying the program.

One smooshing technique that we are interested in analyzing is to replace all the nice meaningful identifiers in our code with shorter names. This script language is case sensitive, and an identifier in this language consists of (a maximal sequence of) letters. In order to avoid smooshing identifiers that are reserved words, we will encourage our programmers to start their identifiers with an uppercase letter (since all the reserved words start with a lowercase letter) and we will only smoosh identifiers that start with an uppercase letter.

Here is the proposed smooshing algorithm. Find the most frequent identifier that begins with an uppercase letter and replace all its occurrences by its first letter. Continue with the next most frequent identifier, etc. Whenever a one letter abbreviation conflicts by matching an already assigned abbreviation, use the first two letters. If that still gives a conflict, then try the first three letters. If even after using all the letters of the identifier there is still a conflict, continue adding letters to the abbreviation by circling around to the front of the identifier. So it is possible that the abbreviation(!?) for "Car" might be "CarCarCa". When several identifiers appear with the same frequency, they are given abbreviations in the order that they first appear in the program.

Create a class Smooshing that contains a method savings that is given a String[] program that contains the lines of source code. It calculates the number of characters by which the source code would be reduced by smooshing it.

Note that lines of source code are distinct -- an identifier cannot cross line boundaries and the smooshed code will have the same number of lines as the original code.



Method signature:int savings(String[] program)
(be sure your method is public)


-program contains between 1 and 50 elements, inclusive.
-Each element of program contains between 0 and 50 characters, inclusive.
-Each character in each element of program is between ASCII 32 and ASCII 126, inclusive.


{"MyInt = YrInt; if(MyInt==2*That) MyInt++;",
    "    MyDouble = MyShort+MyLong;",
    "return MyDouble;",
    "end" }
Returns: 42
MyInt is most frequent and so is smooshed to M. MyDouble is next most frequent and is smooshed to My. YrInt, That, MyShort, and MyLong all have the same frequency and first appear in the order given, so they are smooshed to Y, T, MyS, and MyL. The smooshed program is
    M = Y; if(M==2*T) M++;
        My = MyS+MyL;
    return My;
The length of the smooshed program is 42 characters less than the length of the original program.
{"MyInt3 = MyInt","MyIntMyInt"}
Returns: 16
This smooshes to
   M3 = M
which is 8 characters versus the original 24 (not counting the end-of-line characters which are never affected by this kind of smooshing).
{"C = Cda; CdC=CdCd+2*Cd;"}
Returns: -2
Here there are 5 different identifiers, each appearing once. The code is smooshed as follows, showing the code after each abbreviation has been performed:
   C   abbrev: C     => "C = Cda; CdC=CdCd+2*Cd;"
   Cda abbrev: Cd    => "C = Cd; CdC=CdCd+2*Cd;"
   CdC abbrev: CdC   => "C = Cd; CdC=CdCd+2*Cd;"
   CdCd abbrev: CdCd => "C = Cd; CdC=CdCd+2*Cd;"
   Cd abbrev: CdCdC  => "C = Cd; CdC=CdCd+2*CdCdC;"

Notice that when Cd is assigned its abbreviation the first Cd in the code at that point is not replaced because it was placed there as an abbreviation at an earlier step.

Problem url:

Problem stats url:




PabloGilberto , brett1479 , Olexiy

Problem categories:

String Parsing