Hide

Problem A
Huffman Encoding

Huffman coding is a lossless compression scheme that is commonly used to efficiently transmit multimedia data over networks. For example, JPEG and MP3 files incorporate Huffman coding. The main idea is to assign a binary code to each symbol in the input data, with shorter codewords assigned to more frequently-occuring symbols, and longer codewords to less frequent ones. Its effectiveness lies in its ability to achieve substantial compression ratios while maintaining lossless data compression. In this problem, symbols will be ASCII characters represented by their byte value.

To create a Huffman code for a symbol sequence, count how frequently each symbol occurs in the sequence. Then create a leaf node for each symbol and record the frequency of that symbol in the node. Connect the nodes into a binary tree by repeatedly taking the two parent-less nodes with smallest frequencies and connecting them as children of a new parent node. The frequency of the parent is the sum of the two child frequencies. (If multiple parent-less nodes have the same smallest frequency, choose the one with the smallest-ASCII-value leaf descendent. Break any ties similarly when choosing the second node.)

To encode a symbol using the Huffman tree, follow the path from the root to that symbol. Each time the path descends from a parent to a left child, append a $0$ bit to the encoding; when descending to a right child, append a $1$ bit. Notice a few important properties of this encoding: more frequently occuring symbols are encoded using shorter codewords, due to being closer to the root; and no codeword is a prefix of any other codeword, which allows unambiguous decoding of a stream of bits given the Huffman tree used for encoding.

See the figure below for an example of one possible Huffman tree built from a string containing $60$ As, $30$ Bs, and $10$ Cs. The resulting (symbol, codeword) pairs are (A,0), (B,10), (C, 11).

\includegraphics[width=.5\textwidth ]{huffman-code}

Canonical Huffman Codes

Canonical Huffman codes improve on the above scheme by eliminating the need to transmit the whole codebook or tree structure alongside the compressed data. To turn a Huffman code into a canonical Huffman code, write down the codebook of (symbol, codeword) pairs and sort it: first by codeword length, and then by increasing ASCII value of the symbol. Assign each symbol a new codeword of the same length as the original codeword as follows:

  • The first symbol’s new codeword is a sequence of all zeros (of the same length as the original codeword).

  • To form each subsequent symbol’s new codeword, treat the previous symbol’s new codeword as a binary number and increment it by one (padding on the left with zeros as necessary to maintain the codeword length).

  • As an addendum to the previous rule: if a symbol’s original codeword is longer than the previous symbol’s original codeword, then after performing the previous step, if the new codeword is still too short, append zeros until the new codeword length matches the old length.

The table below shows an example of applying this process to a codebook with six symbols:

Symbol

Huffman Code

Canonical Huffman Code

C

11

00

D

00

01

A

101

100

B

011

101

E

100

110

F

010

111

It can be shown that the above procedure yields a unique canonical Huffman code for any input sequence of symbols. Notice also that the canonical Huffman codebook is completely determined by the sorted list of symbols and the length of the canonical Huffman code for each symbol.

Encoding a Symbol Sequence

An input sequence of symbols can be compressed losslessly as a canonical Huffman code followed by the encoded input sequence. Specifically, to compress a string of ASCII characters as a sequence of bytes:

  • First, write an $8$-bit integer $n$: the number of distinct ASCII symbols in the input sequence (and length of the codebook).

  • Then write $n$ pairs of $8$-bit integers $s_ i$ and $l_ i$, where $s_ i$ is the ASCII value of the $i$-th symbol in the canonical Huffman codebook and $l_ i$ is the length (in bits) of the corresponding codeword.

  • Replace all symbols in the input sequence with their corresponding codewords to yield a $k$-bit binary string $S$. Write $k$ as a $32$-bit integer.

  • Then write $S$ as $\lceil k/8 \rceil $ bytes. (Append zero bits to $S$ until its length is a multiple of $8$ bits).

Write all integers and bytes above in capitalized hexadecimal (using the characters 09, AF). Write integers in big-endian order (most significant $4$ bits first).

Input

The first line of input is a single string, either COMPRESS or DECOMPRESS, and specifies whether your program should compress an ASCII string as a Huffman-coded byte sequence, or decompress an encoded sequence back into ASCII.

If the first line is COMPRESS, the second line of input is a string of printable ASCII characters (ASCII codes $33$ through $126$) and spaces, and will end with a newline that does not count as part of the string to encode. The string does not contain tabs or other non-space whitespace characters, and does not begin or end with a space. The string consists of between $2$ and $10^6$ characters and contains at least two distinct ASCII characters.

If the first line is DECOMPRESS, the second line of input is a string of capitalized hexadecimal digits of length at most $10^6$. It is guaranteed that this hex string is a valid canonical Huffman encoding of a string of ASCII characters satisfying all of the restrictions listed above.

Output

If the first line of input is COMPRESS, output the capitalized hex characters of the canonical-Huffman-encoded input text (followed by a newline). Do not encode the trailing newline in the input string.

If the first line of input is DECOMPRESS, output the decoded ASCII string (followed by a newline). Your solution must reproduce the spaces in the decoded string exactly to be judged correct.

Scoring

Your submission will be judged separately on compression and decompression test cases. You may implement only one of these two modes for 50% of the problem points.

Sample Input 1 Sample Output 1
COMPRESS
MADAM IM MAD-ADAM
06410244024D0220032D0449040000002884B7DA1E12
Sample Input 2 Sample Output 2
DECOMPRESS
0645012003460352034804540400000024B89FC99580
FREE THE REFEREE

Please log in to submit a solution to this problem

Log in