Balancing decoding speed and memory usage for Huffman codes using quaternary tree
 Ahsan Habib^{1} and
 Mohammad Shahidur Rahman^{1}
Received: 13 December 2016
Accepted: 26 December 2016
Published: 7 January 2017
Abstract
In this paper, we focus on the use of a quaternary tree instead of a binary tree to reduce the decoding time for Huffman codes. It is usually difficult to achieve a balance between speed and memory usage using variable-length binary Huffman code. A quaternary tree is used here to produce optimal codewords that speed up searching. We compared the performance of our algorithms with Huffman-based techniques in terms of decoding speed and compression ratio. The proposed decoding algorithm outperforms the Huffman-based techniques in terms of speed while the compression performance remains almost the same.
Keywords
Binary tree, Encoding and decoding, Huffman tree, Quaternary tree, Data compression
Background
Huffman (1952) presented a coding system for data compression at the I.R.E. conference in 1952. He stated that no two messages will consist of the same coding arrangement, and that the codes will be produced in such a way that no additional arrangement is required to specify where a code begins and ends once the starting point is known. Since then, Huffman coding has been popular not only in data compression but also in image and video compression (Chung 1997). Schack (1994) described that the codeword lengths of both Huffman and Shannon–Fano codes have a similar interpretation. Katona and Nemetz (1978) investigated the connection between the self-information of a source symbol and its codeword length.
In other research, Hashemian (1995) introduced a new compression technique based on a clustering algorithm. He claimed that it requires minimum storage while providing a high symbol search speed. He also conducted experiments on video data and found his method very efficient. Chung (1997) introduced an array-based data structure for the Huffman tree where the memory requirement is \(3n - 2\). He also proposed a fast decoding algorithm for this structure and claimed that the memory size can be reduced from \(3n - 2\) to \(2n - 3\), where n is the number of symbols. To attain more decoding speed with compact memory size, Chen et al. (1999) presented a fast decoding algorithm with \(O(\log n)\) time and \(\lceil \frac{3n}{2} \rceil + \lceil \left( \frac{n}{2} \right) \log n \rceil + 1\) memory space.
Bentley et al. (1986) introduced a new compression technique that is quite close to the Huffman technique with some implementation advantages; it requires only one pass over the data to be compressed. Sharma (2010) and Kodituwakku and Amarasinghe (2011) have shown that the Huffman-based technique produces optimal and compact code. However, the decoding speed of this technique is relatively slow. Bahadili and Hussain (2010) presented a new bit-level adaptive data compression technique based on the ACW algorithm, which is shown to perform better than many widely used compression algorithms in terms of compression ratio. Hermassi et al. (2010) showed how a symbol can be coded by more than one codeword having the same length. Chowdhury et al. (2002) presented a new decoding technique for self-styled static Huffman code, where they showed a very efficient representation of the Huffman header. Suri and Goel (2011) focused on the use of the ternary tree, where a new one-pass algorithm for decoding adaptive Huffman codes is implemented.
Fenwick (1995) showed that Huffman codes do not improve the code efficiency at all times. His results show that the performance is always declining when moving from a lower extension to a higher extension. Szpankowski (2011) and Baer (2006) explained the minimum expected length of fixed-to-variable lossless compression without prefix constraint. The Huffman principle, which is well known for fixed-to-variable codes, is used in Kavousianos (2008) as a variable-to-variable code. A new technique for online compression in networks has been presented by Vitter (1987). Habib et al. (2013) introduced Huffman code in the field of database compression. Gallager (1978) explained four properties of Huffman codes: the sibling property, the upper bound property, the codeword length property and the symbol frequency property. He also proposed an adaptive approach to Huffman coding. Lempel and Ziv (1977) and Welch (1984) described a coding technique for any kind of source symbol. Lin et al. (2012) worked on the efficiency of Huffman decoding: the authors first transform the basic Huffman tree into a recursive Huffman tree, and the recursive Huffman algorithm then decodes more than one symbol at a time, which achieves a higher decoding speed. Google Inc. recently released a compression tool named Zopfli (Alakuijala and Vandevenne 2013) and claimed that Zopfli yields the best compression ratio.
In summary, the literature reveals that it is difficult to achieve a balance between speed and memory usage using binary Huffman code. In this paper, we focus on the use of a quaternary tree instead of a binary tree to reduce the decoding time. We employ two algorithms, for encoding and decoding quaternary Huffman codes, to implement the proposed technique. When compared with Huffman-based techniques, the proposed decoding algorithm exhibits excellent performance in terms of speed while the compression performance remains almost the same. In this way, the proposed technique offers a way to balance decoding time and memory usage. The paper is organized as follows. The “Quaternary tree architecture” section presents the traditional binary Huffman decoding technique in data management systems, together with an overview of our proposed architecture and its encoding and decoding techniques. The implementation technique is described in the “Implementation” section. The experimental results are discussed in the “Results and discussion” section, and the “Conclusion” section concludes the paper.
Quaternary tree architecture
The main contribution of this research is the implementation of a new lossless Huffman-based compression technique. The implementation of the algorithms is explained with some mathematical foundations. Finally, the implemented algorithms have been tested using real data.
Tree construction
Huffman codes to binary data
Table 1 Codeword generation using binary Huffman principle
Character  Frequency  Code 

Space  8  000 
A  6  010 
E  5  101 
T  3  1000 
N  3  1001 
F  3  0110 
R  3  0111 
H  2  1101 
I  2  00110 
S  2  00111 
M  2  00100 
U  2  00101 
X  1  11001 
P  1  110000 
L  1  110001 
O  1  11110 
Q  1  11111 
Y  1  11100 
.  1  11101 
Huffman codes to quaternary data
A quaternary tree, or 4-ary tree, is a tree in which each node has 0–4 children (labeled as LEFT child, LEFT MID child, RIGHT MID child, and RIGHT child). To construct codes for a quaternary Huffman tree, we use 00 for the left child, 01 for the left-mid child, 10 for the right-mid child, and 11 for the right child. The tree is built as follows:

List all possible symbols with their probabilities;

Find the four symbols with the smallest probabilities;

Replace these by a single set containing all four symbols, and the probability of the parent is the sum of the individual probabilities.

Repeat the procedure until only one node remains.
Table 2 Codeword generation using quaternary Huffman principle
Character  Frequency  Code 

Space  8  00 
A  6  0100 
E  5  0101 
T  3  0110 
N  3  0111 
F  3  1000 
R  3  1001 
H  2  1010 
I  2  1011 
S  2  1100 
M  2  1101 
U  2  111000 
X  1  111001 
P  1  111010 
L  1  111011 
O  1  111100 
Q  1  111101 
Y  1  111110 
.  1  111111 
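The labelling rule (00, 01, 10, 11 for the four children, left to right) can be illustrated with a short sketch. The nested-list layout `TREE` below is a hypothetical encoding of the tree implied by the quaternary codeword table, and `assign_codes` is an illustrative helper name, not part of the original implementation.

```python
# Child labels from the text: LEFT, LEFT MID, RIGHT MID, RIGHT.
LABELS = ("00", "01", "10", "11")

def assign_codes(node, prefix="", codes=None):
    """Walk a quaternary tree and collect one codeword per leaf.

    A leaf is a string (the symbol); an internal node is a list of
    one to four children, ordered from left to right.
    """
    if codes is None:
        codes = {}
    if isinstance(node, str):
        codes[node] = prefix
        return codes
    for label, child in zip(LABELS, node):
        assign_codes(child, prefix + label, codes)
    return codes

# Hypothetical nested-list form of the tree implied by the table above.
TREE = [
    "Space",
    ["A", "E", "T", "N"],
    ["F", "R", "H", "I"],
    ["S", "M", ["U", "X", "P", "L"], ["O", "Q", "Y", "."]],
]

codes = assign_codes(TREE)
print(codes["Space"], codes["A"], codes["U"], codes["."])
# prints: 00 0100 111000 111111
```

Every codeword has an even number of bits because each tree level contributes exactly two bits.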
Comparison of binary and quaternary tree
Table 3 Comparison of binary and quaternary tree
Parameter  Binary tree  Quaternary tree 

Level  6  3 
Total node  37  25 
Internal node  18  6 
Weighted path length  190  97 
Reduction of time using quaternary tree
The encoding and decoding time of a tree depends on the weighted path length of the tree. If n is the number of distinct characters, \(L_{i}\) is the code length of the ith character, and \(f_{i}\) is the frequency of the ith character, then we can write the required traversing time T as
\[T = \sum_{i = 1}^{n} L_{i} f_{i}\]
Thus, the traversing time also depends on the height of the tree and the frequencies of the different symbols. For the same set of symbols, the height of a quaternary tree is smaller than the height of a binary tree. For this reason, the traversing time is reduced for the shorter tree.
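As a check on this argument, the weighted path length can be computed directly from the two codeword tables given earlier. The sketch below is illustrative only: the names `FREQ`, `BINARY`, `QUATERNARY`, `encoded_bits` and `weighted_path_length` are hypothetical, and it assumes that one tree level consumes one bit in the binary tree and two bits in the quaternary tree.

```python
# Frequencies and codewords copied from the two codeword tables.
FREQ = {"Space": 8, "A": 6, "E": 5, "T": 3, "N": 3, "F": 3, "R": 3,
        "H": 2, "I": 2, "S": 2, "M": 2, "U": 2, "X": 1, "P": 1,
        "L": 1, "O": 1, "Q": 1, "Y": 1, ".": 1}
BINARY = {"Space": "000", "A": "010", "E": "101", "T": "1000",
          "N": "1001", "F": "0110", "R": "0111", "H": "1101",
          "I": "00110", "S": "00111", "M": "00100", "U": "00101",
          "X": "11001", "P": "110000", "L": "110001", "O": "11110",
          "Q": "11111", "Y": "11100", ".": "11101"}
QUATERNARY = {"Space": "00", "A": "0100", "E": "0101", "T": "0110",
              "N": "0111", "F": "1000", "R": "1001", "H": "1010",
              "I": "1011", "S": "1100", "M": "1101", "U": "111000",
              "X": "111001", "P": "111010", "L": "111011",
              "O": "111100", "Q": "111101", "Y": "111110",
              ".": "111111"}

def encoded_bits(freq, codes):
    """Total number of bits needed to encode the whole message."""
    return sum(freq[s] * len(codes[s]) for s in freq)

def weighted_path_length(freq, codes, bits_per_level):
    """Sum of frequency times tree depth, with depth counted in levels."""
    return sum(freq[s] * (len(codes[s]) // bits_per_level) for s in freq)

print(weighted_path_length(FREQ, BINARY, 1))      # 190
print(weighted_path_length(FREQ, QUATERNARY, 2))  # 97
print(encoded_bits(FREQ, BINARY), encoded_bits(FREQ, QUATERNARY))  # 190 194
```

Under these assumptions the quaternary code needs about half as many level traversals (97 against 190), while the encoded length grows only from 190 to 194 bits.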
The structure of the header tree used for decoding is very simple in the proposed technique. According to Fig. 2, the header tree does not need to store the entire codeword for a symbol. The most frequent symbol is stored first in the header, which supports faster decoding. Moreover, retrieving two bits at a time during decoding also speeds up the process. In the decoding phase, matching of the encoded bit string against the header (two bits at a time) starts from level 1 of the header tree. If there is any symbol with a codeword of length 2, it will be found in level 1 of the header tree. Likewise, to match a symbol with a codeword of length 4, both level 1 and level 2 have to be searched. The simplicity of the header tree also contributes to speeding up the decoding process.
Implementation
As mentioned earlier, in a quaternary tree each node has 0–4 children (labeled as LEFT child, LEFT MID child, RIGHT MID child, and RIGHT child). The implementation consists of two algorithms:

Quaternary Huffman encoding

Quaternary Huffman decoding
Encoding algorithm
Encoding is a two-pass problem. The first pass determines the frequencies of the letters. We use this information to create the quaternary Huffman tree. We have used a dictionary to store the frequencies of the symbols. Once a quaternary Huffman code has been generated, each symbol is replaced by its code. This is a modification of the Huffman algorithm (Cormen et al. 2001).
In line 1, we assign the unordered nodes C to the queue Q; we then take the count of nodes in Q and assign it to n, and assign the value of n to a new variable i. In line 4, we start iterating over the nodes in the queue to build the quaternary tree; the loop runs while i is greater than 1, which means that there are still nodes left to be added to a parent. In line 5, a new tree node z is allocated. This node will be the parent of the least frequent nodes. In line 6, we extract the least frequent node v from the queue Q and assign it as the left child of the parent node z. The EXTRACT-MIN(Q) function returns the least frequent node from the queue and removes it from the queue as well. In line 7, we take the next least frequent node w from the queue and assign it as the left-mid child of the parent z.
From line 8 to line 17, we check the value of i, that is, the number of nodes that were in the queue Q at the start of the iteration. If i equals 2, the frequency of the parent node z, \(f[z]\), is the sum of the frequency of node v, \(f[v]\), and the frequency of node w, \(f[w]\). Likewise, if i equals 3, we extract one more least frequent node from the queue, add it as a child and add its frequency to the parent node. If i is greater than 3, we extract two more least frequent nodes, add them as the right-mid and right children of the parent z, and add their frequencies to the parent z as well. In line 18, we insert the new parent node z into the queue Q. In line 19, we take the count of the queue Q and assign it to i again. The loop continues until a single node is left in the queue. Finally, we return the last remaining node from the queue Q as the quaternary Huffman tree.
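The construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' listing: Python's `heapq` stands in for the queue Q and EXTRACT-MIN, ties are broken by insertion order, and the names `Node`, `build_quaternary_tree`, `leaf_depths` and `FREQ` are hypothetical.

```python
import heapq
from itertools import count

class Node:
    """Tree node: leaves carry a symbol, internal nodes carry children."""
    def __init__(self, freq, symbol=None, children=None):
        self.freq = freq
        self.symbol = symbol
        self.children = children or []

def build_quaternary_tree(frequencies):
    """Merge the (up to) four least frequent nodes until one node is left."""
    if not frequencies:
        raise ValueError("at least one symbol is required")
    tie = count()  # tie-breaker so heapq never has to compare Node objects
    queue = [(f, next(tie), Node(f, symbol=s)) for s, f in frequencies.items()]
    heapq.heapify(queue)
    while len(queue) > 1:
        # Take 2, 3 or 4 nodes, depending on how many are left in the queue.
        take = min(4, len(queue))
        children = [heapq.heappop(queue)[2] for _ in range(take)]
        parent = Node(sum(c.freq for c in children), children=children)
        heapq.heappush(queue, (parent.freq, next(tie), parent))
    return queue[0][2]

def leaf_depths(node, depth=0):
    """Return {symbol: level} for every leaf below the given node."""
    if node.symbol is not None:
        return {node.symbol: depth}
    depths = {}
    for child in node.children:
        depths.update(leaf_depths(child, depth + 1))
    return depths

# Symbol frequencies from the codeword tables in this paper.
FREQ = {"Space": 8, "A": 6, "E": 5, "T": 3, "N": 3, "F": 3, "R": 3,
        "H": 2, "I": 2, "S": 2, "M": 2, "U": 2, "X": 1, "P": 1,
        "L": 1, "O": 1, "Q": 1, "Y": 1, ".": 1}

root = build_quaternary_tree(FREQ)
depths = leaf_depths(root)
print(root.freq, sum(FREQ[s] * depths[s] for s in FREQ))  # 48 97
```

Because ties between equal frequencies can be resolved in more than one way, the individual codewords produced by such a sketch may differ from those in the quaternary codeword table, although the weighted path length for this example is the same.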
Decoding algorithm
Decoding is accomplished by reading the encoded data two bits at a time. While iterating over the bit stream of a quaternary tree, the bit pattern 00 means go LEFT, 01 means go LEFT MID, 10 means go RIGHT MID and 11 means go RIGHT. When a bit pattern matches a symbol according to the header tree, the bit pattern is replaced with that symbol, and the process is repeated until the last bit of the stream is reached.
In line 1 of Algorithm 2, we assign the quaternary tree T to the local variable ln. Then, we assign the total count of bits in B to n. In line 3, we initialize a local variable i with 0, which is used as a counter. In line 4, we start iterating over all the bits in B. As it is a quaternary tree, a parent node has at most four children: left, left-mid, right-mid and right, represented by 00, 01, 10 and 11, respectively. We take two bits at a time. EXTRACT-BIT(B) returns a bit from the bit array B and removes it from B as well. In lines 5 and 6, the local variables b1 and b2 are assigned the two bits extracted from the bit array B.
From line 7 to line 15, we check the extracted bits to traverse the tree from the top. If the bits are 00, we take the left child of the parent ln and assign it to ln itself. For 01, we replace the parent ln with its left-mid child; for 10, we replace it with its right-mid child; and for 11, we replace it with its right child. In line 16, we get the key of the new ln and assign it to k. Then, we check whether k has a value. If it does, we write the value of k to the output and reset ln to the root of the quaternary tree T. In line 21, we increase the value of i by 2, and the loop continues with the next two bits.
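The decoding loop described above can be sketched as follows. This is an illustration under assumptions rather than the authors' listing: the header tree is modelled as nested dictionaries keyed by the two-bit patterns, the bit stream is a string of "0" and "1" characters, and `build_header_tree`, `decode` and `CODES` are hypothetical names. The codewords are a subset of the quaternary codeword table, with the Space symbol written as a blank.

```python
def build_header_tree(codes):
    """Build a nested-dict tree whose edges are two-bit patterns."""
    root = {}
    for symbol, code in codes.items():
        node = root
        for i in range(0, len(code), 2):
            node = node.setdefault(code[i:i + 2], {})
        node["key"] = symbol  # a node with a key is a leaf
    return root

def decode(root, bits):
    """Decode a bit string by consuming two bits per tree level."""
    if len(bits) % 2:
        raise ValueError("bit stream must contain an even number of bits")
    output = []
    node = root
    for i in range(0, len(bits), 2):
        pair = bits[i:i + 2]      # 00, 01, 10 or 11
        if pair not in node:
            raise ValueError("bit pattern does not match the header tree")
        node = node[pair]         # move to the selected child
        if "key" in node:         # reached a symbol
            output.append(node["key"])
            node = root           # restart from the top of the tree
    if node is not root:
        raise ValueError("bit stream ends in the middle of a codeword")
    return "".join(output)

# Subset of the quaternary codewords (Space is written as " ").
CODES = {" ": "00", "A": "0100", "E": "0101", "T": "0110", "N": "0111"}

message = "EAT AT TEN"
bits = "".join(CODES[ch] for ch in message)
print(decode(build_header_tree(CODES), bits))  # EAT AT TEN
```

A dictionary lookup per two-bit pattern replaces the four-way comparison of lines 7 to 15; an array of four child pointers would serve equally well.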
This section has discussed the encoding and decoding techniques of the quaternary Huffman architecture. The search time for finding a source symbol using the quaternary Huffman algorithm is \(O(\log_{4} n)\), whereas for the Huffman-based algorithm it is \(O(\log_{2} n)\).
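The relation between the two bounds can be made explicit with the change-of-base rule; this is a standard identity rather than a result of the experiments:

\[\log_{4} n = \frac{\log_{2} n}{\log_{2} 4} = \frac{1}{2}\log_{2} n\]

For the 19-symbol example used earlier, \(\log_{2} 19 \approx 4.25\) and \(\log_{4} 19 \approx 2.12\), so a quaternary search visits roughly half as many levels, although both bounds belong to the same asymptotic class.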
Results and discussion
To verify the applicability and feasibility of the proposed quaternary-based technique, an experimental evaluation has been performed on real data. The experimental results are compared with regular Huffman-based techniques. Our aim was to compare the query time and the storage requirements with those of regular Huffman-based techniques.
Experimental environment
Each query has been executed five times and the average execution time has been counted. The experiments are conducted on a machine with following specifications:
Data set
Table 4 Data set
S/L  File name  Description  File size (bytes) 

1  Quaternarysource.txt  The source code of the quaternary Huffman implementation  9861 
2  Quaternarylicense.txt  The license file of the quaternary Huffman implementation  18,651 
3  Lgpl2.1.txt  The famous lgpl 2.1 license  27,032 
4  Thematrixtranscript.txt  The transcript of the movie matrix  46,836 
Decoding performance
Table 5 Decoding performance of the proposed method and regular Huffman-based technique
S/N  Source file  File size (bytes)  QH time (ms)  RH time (ms)  Enhancement rate over regular binary (%)

1  Quaternarysource.txt  9861  3  7  57.14
2  Quaternarylicense.txt  18,651  6  12  50.00
3  Lgpl2.1.txt  27,032  7  16  56.25
4  Thematrixtranscript.txt  46,836  12  27  55.56
QH: quaternary Huffman. RH: regular Huffman-based techniques (Chowdhury et al. 2002). Enhancement rate over regular binary = ((RH − QH) * 100)/RH
Four source files of different sizes have been used to measure the performance. Table 5 shows that, in every case, the quaternary Huffman technique reduces the decoding time by at least 50% (between 50.00 and 57.14%) compared with the regular Huffman-based techniques.
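The enhancement rates in Table 5 follow directly from the formula ((RH − QH) * 100)/RH. The short sketch below recomputes them from the reported decoding times; `enhancement_rate` and `ROWS` are illustrative names only.

```python
def enhancement_rate(rh, qh):
    """Percentage reduction of QH relative to RH: ((RH - QH) * 100) / RH."""
    return (rh - qh) * 100 / rh

# (source file, QH time in ms, RH time in ms), as reported in Table 5.
ROWS = [
    ("Quaternarysource.txt", 3, 7),
    ("Quaternarylicense.txt", 6, 12),
    ("Lgpl2.1.txt", 7, 16),
    ("Thematrixtranscript.txt", 12, 27),
]

for name, qh, rh in ROWS:
    print(f"{name}: {enhancement_rate(rh, qh):.2f}")
# Quaternarysource.txt: 57.14
# Quaternarylicense.txt: 50.00
# Lgpl2.1.txt: 56.25
# Thematrixtranscript.txt: 55.56
```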
Table 6 Compression performance of the proposed technique and regular Huffman-based technique
Source file  Original size, OS (bytes)  QH size (bytes)  RH size (bytes)  Enhancement rate, quaternary (%)  Enhancement rate, regular (%)

Quaternarysource.txt  9861  6958  6347  29.44  35.64
Quaternarylicense.txt  18,651  13,520  10,930  27.51  41.40
Lgpl2.1.txt  27,032  16,042  15,840  40.66  41.40
Thematrixtranscript.txt  46,836  30,909  27,816  34.01  40.61
QH: quaternary Huffman. RH: Huffman-based technique (Chen et al. 1999). Enhancement rate (quaternary) = ((OS − QH) * 100)/OS; enhancement rate (regular) = ((OS − RH) * 100)/OS
Performance test with renowned corpus and recent Huffman-based techniques
We compare the performance of the proposed technique with the Zopfli (Alakuijala and Vandevenne 2013), WinZip (2016) and PKZip (2016) algorithms. Google claims that Zopfli produces the highest compression ratio among similar techniques. Zopfli uses Huffman coding to replace each value with a string of bits. WinZip and PKZip are the most widely used recent Huffman-based compression tools. In all cases, we took the average output of five runs.
Table 7 Comparison of the proposed technique with recent Huffman-based techniques for the Enwik8 corpus (http://mattmahoney.net/dc/text.html, http://mattmahoney.net/dc/enwik8.zip)
Method/algorithm  Space (MB)  Compression enhancement with respect to original file (%)  Compression–decompression time (s)  Time enhancement with respect to Zopfli (%) 

Quaternary  49.67  47.88  186.88  59.66 
WinZip  35.2  63.06  187.65  59.49 
PKZip  34.5  63.80  195.21  57.86 
Zopfli  33.37  64.98  463.26  – 
The result indicates that the compression ratio is highest for Zopfli, but its compression and decompression speed is very slow. Zopfli requires over 400 s, whereas all other techniques require less than 200 s. If a compromise between time and space is acceptable and speed is the main factor, the quaternary technique may be chosen for this type of large corpus.
Table 8 Comparison of the proposed technique with recent Huffman-based techniques for the Canterbury corpus
Method/algorithm  Space (MB)  Compression enhancement with respect to original file (%)  Compression–decompression time (s)  Time enhancement with respect to Zopfli (%) 

Quaternary  1.71  35.95  1.37  89.78 
WinZip  0.71  73.40  5.61  46.471 
PKZip  0.69  74.15  2.74  21.26 
Zopfli  0.64  76.07  13.36  – 
The result shows that the compression ratio is highest for Zopfli, but its compression and decompression speed is very slow. Zopfli requires over 13 s, whereas all other techniques require less time.
In this section, we have analyzed both techniques with different examples in terms of time and space. For decoding speed, the proposed quaternary technique outperforms the regular Huffman-based techniques. On the other hand, the compression performance is almost similar for most of the files.
Conclusion
A new lossless compression technique based on the Huffman principle is implemented in this paper. We introduced a quaternary tree instead of a binary tree in the Huffman principle. We have shown that representing Huffman code using a quaternary tree is more beneficial than using a binary tree in terms of processing speed, with an insignificant increase in required space. When speed is the main factor, the quaternary tree-based technique performs better than the binary tree-based technique. Thus, the proposed technique provides a way to balance decoding time and memory usage.
Declarations
Authors’ contributions
The authors discussed the problem and the solutions proposed all together. Both authors participated in drafting and revising the final manuscript. Both authors read and approved the final manuscript.
Acknowledgements
The authors are grateful to the Ministry of Posts, Telecommunications and Information Technology, People’s Republic of Bangladesh, for the grant supporting this research work. The authors would like to thank the anonymous experts for their valuable comments and suggestions for improving the quality of this research paper.
Competing interests
The authors declare that they have no competing interests.
Availability of data
The datasets supporting this article are available online at the following links.
The famous lgpl 2.1 license, Accessed at https://www.gnu.org/licenses/lgpl2.1.txt
The transcript of the movie Matrix. Accessed at http://thematrixtruth.remoteviewinglight.com/
The Enwik8 Corpus. Accessed at http://mattmahoney.net/dc/text.html http://mattmahoney.net/dc/enwik8.zip
The Canterbury Corpus. Accessed at http://corpus.canterbury.ac.nz/resources/cantrbry.zip
The WinZip compression tool, version 1.0.220.1, released by WinZip Computing, S.L., A Corel Company. Accessed at: http://www.winzip.com/win/en/downwz.html
The PKZip compression tool, version 14.40.0028, released by PKWARE Inc. Accessed at https://www.pkware.com/pkzip
Funding
All funding was provided by the Ministry of Posts, Telecommunications and Information Technology, People’s Republic of Bangladesh [Order No: 56.00.0000.028.33.007.14 (part1)275, date: 11.05.2014; and Order No: 56.00.0000.028.33.025.14115, date 10.05.2015]. This funding gave financial support for designing the study and conducting the experiments.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
Alakuijala J, Vandevenne L (2013) Data compression using Zopfli. Google Inc. https://zopfli.googlecode.com/file/Data_compression_using_Zopfli.pdf
Baer M (2006) A general framework for codes involving redundancy minimization. IEEE Trans Inf Theory 52:344–349
Bahadili HA, Hussain SM (2010) A bit-level text compression scheme based on the ACW algorithm. Int J Autom Comput 7(1):123–131
Bentley JL, Sleator DD, Tarjan RE, Wei VK (1986) A locally adaptive data compression scheme. Commun ACM 29(4):320–330
Chen HC, Wang YL, Lan YF (1999) A memory-efficient and fast Huffman decoding algorithm. Inf Process Lett 69:119–122
Chowdhury RA, Kaykobad M, King I (2002) An efficient decoding technique for Huffman codes. Inf Process Lett 81:305–308
Chung KL (1997) Efficient Huffman decoding. Inf Process Lett 61:97–99
Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to algorithms. The MIT Press, England
Fenwick PM (1995) Huffman code efficiencies for extensions of sources. IEEE Trans Commun 43:163–165
Gallager RG (1978) Variations on a theme by Huffman. IEEE Trans Inf Theory 24(6):668–674
Habib A, Hoque ASML, Hussain MR (2013) HHIBASE: compression enhancement of HIBASE technique using Huffman coding. J Comput 8(5):1175–1183
Hashemian R (1995) Memory efficient and high-speed search Huffman coding. IEEE Trans Commun 43(10):2576–2581
Hermassi H, Rhouma R, Belghith S (2010) Joint compression and encryption using chaotically mutated Huffman trees. Commun Nonlinear Sci Numer Simulat 15:2987–2999
Huffman DA (1952) A method for construction of minimum redundancy codes. Proc IRE 40:1098–1101
Katona GOH, Nemetz TOH (1978) Huffman codes and self information. IEEE Trans Inf Theory 22(3):337–340
Kavousianos X (2008) Test-data compression based on variable-to-variable Huffman encoding with codeword reusability. IEEE Trans Comput Aided Des Integr Circuits Syst 27:1333–1338
Kodituwakku SR, Amarasinghe US (2011) Comparison of lossless data compression algorithms for text data. Indian J Comput Sci Eng 1(4):416–426
Lempel A, Ziv J (1977) A universal algorithm for sequential data compression. IEEE Trans Inf Theory 23:337–343
Lin YK, Huang SC, Yang CH (2012) A fast algorithm for Huffman decoding based on a recursion Huffman tree. J Syst Softw 85:974–980
Schack R (1994) The length of a typical Huffman codeword. IEEE Trans Inf Theory 40(4):1246–1247
Sharma M (2010) Compression using Huffman coding. Int J Comput Sci Netw Secur 10(5):133–141
Suri PR, Goel M (2011) Ternary tree and memory-efficient Huffman decoding algorithm. Int J Comput Sci Issues 8(1):483–489
Szpankowski W (2011) Minimum expected length of fixed-to-variable lossless compression without prefix constraints. IEEE Trans Inf Theory 57:4017–4025
The PKZip compression tool, version 14.40.0028, released by PKWARE Inc. https://www.pkware.com/pkzip. Accessed 19 July 2016
The WinZip compression tool, version 1.0.220.1, released by WinZip Computing, S.L., A Corel Company. http://www.winzip.com/win/en/downwz.html. Accessed 19 July 2016
Vitter JS (1987) Design and analysis of dynamic Huffman code. J ACM 34(4):825–845
Welch TA (1984) A technique for high-performance data compression. IEEE Comput 17(6):8–19
Wikipedia short history of Huffman coding. http://en.wikipedia.org/wiki/Huffman_coding. Accessed 31 July 2011