Analysis

Posted on January 24th, 2008 by hgndgtl

Introduction

This section dig into the world of basic principles. Sometimes the discussed topic is far behind the "basics lecture notes" but anyway: it is a nice area with a lot of interesting details. Concern the term "basic" as background knowledge.

The first section start with the basic design of hash functions. The second section elucidated some background knowledge about Pseudorandom Number Generators (PRNG).

Avalanche

One key indicator of a hash function is the ability to spread message bits throughout all bits of the resultset. The avalanche behaviour states that any single bit change in the input keys will on average flip half the bits in the output.

The following figure demonstrate the effect if one bit toggles. The smaller arrows points to bit flips in the input and output vector (one changed input bit, four changed output bits):

avalanche example

One note: the input as well as the output is 7bit width (upright in the previous figure) the avalanche behaviour is in the figure 0.571429. Therefore 0 means no effect at all and 1 means change all output bits completly. 0.5 is the ambition! ;-) But the avalanche behaviour concentrate on the average situation - one pass like in the figure is next to nothing!

This charactersitic of a hash function is important because keys tend to be "similar". For example, think about your natural language! First, in English there 7bit are sufficient - the 8 bit is 0. Then on the other side think about the frequency of letters. In German E has a relativ occurrence of 17,40 %, N 9,78 % and so on. But that's not the only relations: The most common initial letter in German is D with 14,2 % followed by S with 10,8 %.

Words are often keys, but that no mandatory regulation. Often keys are symbols. Think about the situation when you use a library (e.g. glibc) that resolves the function names (e.g. getline(3)). The names are saved within a hashtable in the ELF file. Now think about your programm, ... are there are similar patterns to note? For example underscores at the begining. Type readelf -s PATHTOPROGRAM | awk '{print $8}'. This will list all symbols in the object file, local and global ones.

Anyway: a good hash function should calculate good results for every type of key! If thats not true for your hashfunction than this is an idicator for other waek spots too!

The following four figures demostrate the avalanche behaviour of four well-known functions, Goulburn, Hsieh, Phong and Korzendorfer1 (from left to right and top to bottom).
Note: not all figures are scaled in the Z axis from 0 to 10. (0: no effect, 5 perfect and 10 all bits touched).

goulburn hsieh
phong korzendorfer1

For an detailed presentation of the serveral hash function see the following sections. This is the introduction section for avalanche.

Timing Impact

The following figure show the timing impact for different hash functions and variable input size hashkeys.

Hash Speed

Pseudorandom Number Generator - PRNG

Many fields in the information technology requires random factor. The main problem is that computers are very predictable devices. This section discuss some random number generator algorithms, whose characteristics and qualities. This section does not elucidated the usage when you should call srand().

Entropy

In the information theory you proceed from the assumption that for an communication a finite set of symbols is available. This set is commonly refereed as the alphabet. Every transmitted symbol appear with a certain probability. Imagine an data stream of your favorite newspaper: some symbols are quit often (e.g. the character A) and other (e.g. the character Q) not so often.

This assumption lead to the following characteristic: a symbol with a high appearing probability are more often received and is therefore a low acquisition of information (an vice versa). Claude Shannon, the "the father of information theory", was the man of the hour. Search the World Wide Web for background information.

To close the circle: on major value to characterize the medial information content is the entropy. The entropy stated which average information content available is. The averaging depends on the probability of occurrence.

/dev/random and /dev/urandom

The pseudo devices /dev/random and /dev/urandom are sources for random numbers provided by Linux and BSD derivates. They gathers environmental noise from a bundle of devices. The difference between these two devices is that /dev/random will block if the entropy fall below a defined threshold. Both devices referring to the same random pool.

Example input sources for example are keyboard interrupts are one source of random. Different types of devices can register their self as a source of randomness. These gained values are shifted into the entropy pool and mixed with a CRC function. And because the generator account how many bytes are mixed into the pool, the pool can determine the amount of entropy (thats the former mentioned threshold). If a user gather random from the pool then the released information is checksummed with a SHA checksum. This prevents malicious users to determine the internal state of the pool.

Glibc - rand(3), lrand48(3) and random(3)

lrand48(3) use the linear congruential algorithm and 48-bit integer arithmetic to produce non-negative long integers 0 and 2^31. The initial seed is 1. lrand48 is declared obsolete rand(3) should be used instead.

MT19937 - Mersenne Twister

The MT19937 algorithm is a variant of the twisted generalized feedback shift-register algorithm. The period of the algorithm is 2^19937 - 1 where the output is distributed in 623 dimensions. The initial seed is 5489.