Introduction
This section digs into the world of basic principles. Sometimes
the discussed topic goes well beyond typical "basics lecture notes", but
anyway: it is a nice area with a lot of interesting details.
Treat the term "basic" here as background knowledge.
The first section starts with the basic design of hash functions.
The second section elucidates some background knowledge about
Pseudorandom Number Generators (PRNGs).
One key indicator of a hash function is its ability to spread message
bits throughout all bits of the result. The avalanche
behaviour states that any single bit change in the input key will, on
average, flip half the bits in the output.
The following figure demonstrates the effect when one bit toggles. The smaller arrows point to
bit flips in the input and output vector (one changed input bit, four changed output bits):
One note: the input as well as the output is 7 bits wide (drawn upright in the previous figure),
so the avalanche value in the figure is 4/7 ≈ 0.571429. Here 0 means no effect at all and 1 means
all output bits change completely. 0.5 is the ambition! ;-)
But the avalanche behaviour describes the average situation - one pass like in the figure
is next to nothing!
This characteristic of a hash function is important because keys tend to be "similar". For example,
think about your natural language! First, for English 7 bits are sufficient - the 8th bit is 0. Then, on the
other side, think about the frequency of letters. In German,
E has a relative occurrence of 17.40%,
N 9.78% and so on. But those are not the only regularities: the most common initial letter in German is
D with 14.2%, followed by
S with 10.8%.
Words are often keys, but that is no mandatory rule. Often keys are symbols. Think about the situation when you
use a library (e.g. glibc) that resolves function names (e.g.
getline(3)). The names are saved within a
hashtable in the
ELF file. Now think about your program, ... are there similar patterns to note? For example,
underscores at the beginning.
Type
readelf -s PATHTOPROGRAM | awk '{print $8}'. This will list all symbols in the object file, local and
global ones.
Anyway: a good hash function should calculate good results for every type of key! If that is not
true for your hash function, then this is an indicator for other weak spots too!
The following four figures demonstrate the avalanche behaviour of four well-known functions:
Goulburn,
Hsieh,
Phong and
Korzendorfer1 (from left to right and top to bottom).
Note: not all figures are scaled in the Z axis from 0 to 10 (0: no effect, 5: perfect, 10: all bits touched).
For a detailed presentation of the several hash functions see the following sections. This is only the
introductory section for avalanche.
The following figure shows the timing impact for different hash
functions and hash keys of variable input size.
Many fields in information technology require a random
factor. The main
problem is that computers are
very predictable devices.
This section discusses some random number generator algorithms, their
characteristics and qualities. This section does not elucidate usage
details such as when you should call
srand().
Entropy
In information theory you proceed from the assumption that for a
communication a finite set of symbols is available. This set is commonly
referred to as the alphabet. Every transmitted symbol appears with a certain
probability. Imagine a data stream of your favorite newspaper: some symbols
are quite frequent (e.g. the character A) and others (e.g. the character Q) not so
frequent.
This assumption leads to the following characteristic: a symbol with a high
probability of appearing is received more often and therefore carries a low
amount of information (and vice versa).
Claude Shannon,
the "father of information theory", was the man of the hour.
Search the World Wide Web for background information.
To close the circle: one major value to characterize the average information
content is the entropy. The entropy states which average information content
is available. The averaging depends on the probability of occurrence.
/dev/random and /dev/urandom
The pseudo devices /dev/random and /dev/urandom are sources for random
numbers provided by Linux and BSD derivatives. They gather environmental noise
from a bundle of devices. The difference between these two devices is that
/dev/random will block if the entropy falls below a defined threshold.
Both devices refer to the same random pool.
Keyboard interrupts, for example, are one source of randomness. Different
types of devices can register themselves as a
source of randomness. The gained values are shifted into the entropy pool
and mixed in with a CRC function. And because the generator accounts for how many bytes
are mixed into the pool, the pool can estimate the amount of entropy (that is
the threshold mentioned before). If a user gathers random bytes from the pool, the
released information is hashed with a SHA checksum. This prevents
malicious users from determining the internal state of the pool.
Glibc - rand(3), lrand48(3) and random(3)
lrand48(3) uses the linear congruential algorithm and 48-bit integer
arithmetic to produce non-negative long integers between 0 and 2^31. The
initial seed is 1.
lrand48 is declared obsolete;
rand(3)
should be used instead.
MT19937 - Mersenne Twister
The MT19937 algorithm is a variant of the twisted generalized feedback shift-register
algorithm. The period of the algorithm is 2^19937 - 1 and the output
is equidistributed in 623 dimensions. The default initial seed is 5489.