We write programs in a script language which are embedded in HTML files and
downloaded over the internet. To speed up our downloads, we would like to
"smoosh" our source code, reducing its size without destroying the program.
One smooshing technique that we are interested in analyzing is to replace all
the nice meaningful identifiers in our code with shorter names. This script
language is case sensitive, and an identifier
in this language consists of (a maximal sequence of) letters. In order to avoid
smooshing identifiers that are reserved words, we will encourage our programmers
to start their identifiers with an uppercase letter (since all the reserved words
start with a lowercase letter) and we will only smoosh identifiers that start
with an uppercase letter.
Here is the proposed smooshing algorithm. Find the most frequent identifier that
begins with an uppercase letter and
replace all its occurrences by its first letter. Continue with the next most
frequent identifier, etc. Whenever a one letter abbreviation conflicts by matching an
already assigned abbreviation, use the first two letters. If that still gives a conflict,
then try the first three letters. If even after using all the letters of the identifier there is still
a conflict, continue adding letters to the abbreviation by circling around to the front
of the identifier. So it is
possible that the abbreviation(!?) for "Car" might be "CarCarCa".
When several identifiers appear with the same frequency, they are given abbreviations in the order that they first appear in the program.
Create a class Smooshing that contains a method savings that is given a String[]
program that contains the lines of source code. It calculates the number of
characters by which the source code would be reduced by smooshing it.
Note that lines of
source code are distinct -- an identifier cannot cross line boundaries and the smooshed code
will have the same number of lines as the original code.
|