Skip to main content

Huffman Tree and Huffman Coding

1. Huffman Tree

1.1 Definition

A Huffman Tree is a binary tree used for data compression, also known as an optimal binary tree. Its distinguishing characteristic is that it minimizes the weighted path length. Path length is the number of edges from the root node to a leaf node, and weighted path length is the product of the path length and the node's weight.

1.2 Construction Method and Process

The process of constructing a Huffman Tree is as follows:

  1. Initialization: Place all characters to be encoded along with their weights (frequencies of occurrence) as individual single-node trees into a priority queue (min-heap).
  2. Tree Construction:
    • Extract the two trees with the smallest weights from the priority queue. Use them as the left and right subtrees of a new tree, where the root node's weight is the sum of the two subtrees' weights.
    • Re-insert the new tree into the priority queue.
    • Repeat the above process until only one tree remains in the priority queue; this tree is the Huffman Tree.

1.3 Huffman Algorithm Implementation (C)

#include <stdio.h>
#include <stdlib.h>

// Define Huffman tree node structure
typedef struct {
char data; // Character
int weight; // Weight (frequency)
struct HuffmanNode *left, *right;
} HuffmanNode;

// Create a new Huffman tree node
HuffmanNode* createNode(char data, int weight) {
HuffmanNode* node = (HuffmanNode*)malloc(sizeof(HuffmanNode));
node->data = data;
node->weight = weight;
node->left = node->right = NULL;
return node;
}

// Define priority queue (min-heap) structure
typedef struct {
int size;
int capacity;
HuffmanNode **array;
} MinHeap;

// Create a min-heap
MinHeap* createMinHeap(int capacity) {
MinHeap* minHeap = (MinHeap*)malloc(sizeof(MinHeap));
minHeap->size = 0;
minHeap->capacity = capacity;
minHeap->array = (HuffmanNode**)malloc(minHeap->capacity * sizeof(HuffmanNode*));
return minHeap;
}

// Swap two Huffman tree nodes
void swapNode(HuffmanNode** a, HuffmanNode** b) {
HuffmanNode* t = *a;
*a = *b;
*b = t;
}

// Heapify a min-heap
void minHeapify(MinHeap* minHeap, int idx) {
int smallest = idx;
int left = 2 * idx + 1;
int right = 2 * idx + 2;

if (left < minHeap->size && minHeap->array[left]->weight < minHeap->array[smallest]->weight)
smallest = left;

if (right < minHeap->size && minHeap->array[right]->weight < minHeap->array[smallest]->weight)
smallest = right;

if (smallest != idx) {
swapNode(&minHeap->array[smallest], &minHeap->array[idx]);
minHeapify(minHeap, smallest);
}
}

// Check if min-heap contains only one node
int isSizeOne(MinHeap* minHeap) {
return (minHeap->size == 1);
}

// Extract root node of min-heap
HuffmanNode* extractMin(MinHeap* minHeap) {
HuffmanNode* temp = minHeap->array[0];
minHeap->array[0] = minHeap->array[minHeap->size - 1];
--minHeap->size;
minHeapify(minHeap, 0);
return temp;
}

// Insert a node into min-heap
void insertMinHeap(MinHeap* minHeap, HuffmanNode* minHeapNode) {
++minHeap->size;
int i = minHeap->size - 1;
while (i && minHeapNode->weight < minHeap->array[(i - 1) / 2]->weight) {
minHeap->array[i] = minHeap->array[(i - 1) / 2];
i = (i - 1) / 2;
}
minHeap->array[i] = minHeapNode;
}

// Build min-heap
void buildMinHeap(MinHeap* minHeap) {
int n = minHeap->size - 1;
int i;
for (i = (n - 1) / 2; i >= 0; --i)
minHeapify(minHeap, i);
}

// Create and build min-heap
MinHeap* createAndBuildMinHeap(char data[], int freq[], int size) {
MinHeap* minHeap = createMinHeap(size);
for (int i = 0; i < size; ++i)
minHeap->array[i] = createNode(data[i], freq[i]);
minHeap->size = size;
buildMinHeap(minHeap);
return minHeap;
}

// Build Huffman tree
HuffmanNode* buildHuffmanTree(char data[], int freq[], int size) {
HuffmanNode *left, *right, *top;
MinHeap* minHeap = createAndBuildMinHeap(data, freq, size);

while (!isSizeOne(minHeap)) {
left = extractMin(minHeap);
right = extractMin(minHeap);
top = createNode('$', left->weight + right->weight);
top->left = left;
top->right = right;
insertMinHeap(minHeap, top);
}
return extractMin(minHeap);
}

// Print Huffman codes
void printCodes(HuffmanNode* root, int arr[], int top) {
if (root->left) {
arr[top] = 0;
printCodes(root->left, arr, top + 1);
}
if (root->right) {
arr[top] = 1;
printCodes(root->right, arr, top + 1);
}
if (!(root->left) && !(root->right)) {
printf("%c: ", root->data);
for (int i = 0; i < top; ++i)
printf("%d", arr[i]);
printf("\n");
}
}

// Huffman coding main function
void HuffmanCodes(char data[], int freq[], int size) {
HuffmanNode* root = buildHuffmanTree(data, freq, size);
int arr[100], top = 0;
printCodes(root, arr, top);
}

int main() {
char arr[] = {'a', 'b', 'c', 'd', 'e', 'f'};
int freq[] = {5, 9, 12, 13, 16, 45};
int size = sizeof(arr) / sizeof(arr[0]);

HuffmanCodes(arr, freq, size);

return 0;
}

2. Huffman Coding

2.1 Definition

Huffman coding is a lossless data compression algorithm. It uses a Huffman Tree to assign a unique binary code to each character. Huffman coding has the prefix property, meaning that no code is a prefix of another code, thereby guaranteeing unique decodability.

2.2 Prefix Codes

A prefix code is an encoding scheme in which no code is a prefix of any other code. Huffman coding is a type of prefix code. A prefix code is an encoding scheme in which no code is a prefix of another code. In other words, in a prefix code, no code can be the beginning portion of another code.

Let us examine each given encoding set to determine whether any code is a prefix of another.

A. 1111

  • 0 is not a prefix of any other code.
  • 10 is not a prefix of any other code.
  • 110 is not a prefix of any other code.
  • 1111 is not a prefix of any other code.

Set A is a prefix code.

B. {11,10,001,101,0001}\{11, 10, 001, 101, 0001\}

  • 11 is not a prefix of any other code.
  • 10 is a prefix of 101.
  • 001 is not a prefix of any other code.
  • 101 is not a prefix of any other code.
  • 0001 is not a prefix of any other code.

Set B is not a prefix code.

C. {00,010,0110,1000}\{00, 010, 0110, 1000\}

  • 00 is not a prefix of any other code.
  • 010 is not a prefix of any other code.
  • 0110 is not a prefix of any other code.
  • 1000 is not a prefix of any other code.

Set C is a prefix code.

Upon inspection, all given encoding sets are prefix codes. Please re-examine the problem conditions to ensure no prefix relationships have been overlooked.

If you find any errors in the problem description or encoding sets, please provide additional information or clarify the question.

2.3 Applications

Huffman coding is widely used in the field of data compression, such as in file compression (e.g., ZIP files), image compression (e.g., JPEG format), and audio compression (e.g., MP3 format). Its primary advantage is the ability to significantly reduce data storage space while ensuring lossless recovery of data.