-.-
|- FlashHash & BackHash for all-efficient multiple recycled hashes e.g. for
   dispersed virtual hashing in Bloom filters (tutorial + math)
CopyRight (C) 2002 + 2004 , Jan Hajek , NL
NO part of this document may be published, implemented, programmed, copied or communicated by any means without an explicit & FULL REFerence to this author together with the FULL title and the website www.humintel.com/hajek for the freshest version, or www.matheory.info , plus the CopyRight note in texts and in ALL references to this. An implicit, incomplete, indirect, disconnected or unlinked reference (in your text and/or on www) does NOT suffice. All based on 1st-hand experience. ALL rights reserved.
Version 1.23a , 2004-9-9 , lines of < 79+CrLf chars in plain ASCII for easy view, unlikely to be updated soon, submitted to the webmaster of http://www.humintel.com/hajek ( = the freshest version ).
This epaper may read better (more of left margin, PgDn/Up) outside your email. This epaper has new facilities for fast finding and browsing. Please save the last copy and use a file differencer to see only where the versions differ: download Visual Compare VC154.ZIP and run VCOMP vers1 vers2 /k /i , which is the best & the brightest colorful line-by-line comparer for plain .txt files. Your comments are welcome, preferably interlaced into this .txt lines-file.
New terms: FlashHash , BackHash , recycled randomness , rotated/recycled hash , combinatorial implosion , combinatorially imploded , hashcrosstalk , paired hashes , hashpair , dispersed hasher , dispersed virtual hashing , distributed hashing , distributed virtual hasher , monolithic Bloom filter , Bloom filter of the 1st 2nd 3rd kind , virtually stored , bitsharing factor , cumulated false positives , reproducible acyclic pseudo-entropy , imported/injected pseudo-entropy , hashclash , clashhash , and ALL inflections of ALL these new terms are all also subject to my CopyRight conditions above. ALL based on experience. ALL rights reserved.
Browsers may like to search here for the following markers :
!!!  !!  !  ??  ?  { comments }  [ refs ]
-.-  separates sections
|- -------  framed stuff
+Contents:
+Executive summary = the beef at fiesta entropicana
+Mottos
+Basic & new terms, abbreviations & symbols
+Bloom filters (BFs) = a tutorial + a quantitative design study
+A comedy of errors = compact virtual hashing CH or HC vs. distributed / dispersed virtual hashing DH or BF
+The proof of the pudding = Tab. 3
+FlashHashing & BackHashing = (m)eat the beef again
+More explanations and hints for programmers
+Bitwise rotations aka circular shifts aka barrelshifts
+On hashing functions
+Summary of flashhashing
+From mission impossible via combinatorial implosion to mission accomplished
+References
-.-
+Executive summary = the beef at fiesta entropicana :
FlashHash can be easily understood in intuitive terms thus: An item, state, or a string-of-bits S is laboriously and irreversibly reduced by a HashFun(S) to a number called hash h0 viewed as a string of 0's and 1's 0011...01011 (or call them pearls and beads :-). By repeatedly closing the string h0 into a necklace, rotating it by a few bits, and reopening or cutting it, k different open-string designs are easily produced. If the original h0 is remembered i.e. saved/stacked in a store, then all k strings can be almost effortlessly reproduced or regained from h0 by BackHash. The net effect is that k sufficiently stochastically independent hashes are obtained at almost no expense from a single call of a HashFun(S).
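In code, one turn of the necklace is nothing but a barrel shift of the hash word. A minimal sketch in C (my illustration only, not from this paper; h0, k, c, M are as defined further below, assuming unsigned 32-bit arithmetic and 0 < c < 32):

  #include <stdint.h>

  /* one turn of the "necklace": rotate h0 left by c bits (0 < c < 32) */
  static uint32_t rotl32(uint32_t h0, unsigned c)
  {
      return (h0 << c) | (h0 >> (32u - c));
  }

  /* derive k hashes h[0..k-1] from a single h0, each reduced mod M */
  static void flashhash32(uint32_t h0, unsigned k, unsigned c,
                          uint32_t M, uint32_t h[])
  {
      for (unsigned j = 0; j < k; j++) {
          if (j > 0) h0 = rotl32(h0, c);  /* reopen the necklace elsewhere */
          h[j] = h0 % M;
      }
  }

BackHash is the same loop fed with a remembered h0 instead of a fresh HashFun(S).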
This metaphor was inspired by recalling the poems by Hermann Hesse (1877-1962) near the end of his 1946 Nobel Prize winning novel Das Glasperlenspiel, or The Glass Bead Game, aka Magister Ludi (= The Master of the Game), which some folks see as a precursor of computing culture, algorYthmics, and www . But I also designed non-rotated versions based on "injected pseudo-entropy". Doubting Thomases may now like to find the occurrences of "independ" here.
This new class of FlashHash algorithms is all-efficient i.e. extremely fast, space saving (all [re]generations are done by blitzy incremental bitshifting fewliners), short and simple, hence easy to do for both (wo)men and (wo)machines when generating the k > 1 hashes necessary e.g. for well designed & well working Bloom filters. FlashHash was created especially for, but not only for, CPU-bound apps like e.g. Gerard Holzmann's (Bell Labs) SPIN's bitstate model checking.
Dispersed virtual hashers ( DH ) like Bloom filters ( BF ) are explained first and compared with compact virtual hashers CH, all without a full collision resolution which would require full storage of all distinct S's during a hasher's life-cycle. The main purpose of a virtual hasher is to avoid storage of full S's. Both dynamic and static design formulas for (non)optimized Bloom filters are given (find also "IT & TI"):
f(n) = (1 - (1 - 1/M)^(k*n))^k is the instantaneous probability of a false positive AFTER the n-th item was "stored" in a BF[0..M-1] of bits; f(n) may be called a false+rate or a failrate. f(n) is the "general dynamics" formula for BFs.
fo = 1/2^k and k*n = M*ln(2) or k*n = M*2/3 are for a BF filled optimally wrt memory utilization (useful in quasistatic apps when the final n can be estimated or fixed beforehand). At such a load, f has dropped exponentially with k increased linearly. fo is the "optimal statics" formula for BFs.
Fn = Sum_i=1..n( f(i) ) is the cumulated number of false positives (for large n use compiled code on a fast PC);
   = Integral_i=1..n[ f(i) ] for a fast numerically evaluated approximation, which is very good for very large n .
Fn < n*f(n) ; display Fn as a float, as Fn < 1 or Fn > 1 may result.
Fn is directly verifiable or falsifiable empirically (see below) in easy experiments based on simple I/O through a BF and counting the I-O's:
Fn = (number of unique items in) - (number of unique items out)
   = number of unique items lost. (unique = distinct = no duplicate)
Pnoloss = Prod_i=1..n( 1 - f(i) ) < 1 is the probability of NO loss at all; if too near to 1 then its complement reads better:
Ploss = 1 - Prod_i=1..n( 1 - f(i) ) < 1
      = 1 - probability of NO loss of any item or state
      = probability of ONE OR MORE items or states lost
      = probability of AT LEAST ONE item or state lost
      = 1 - exp( Integral_i=1..n[ ln(1 - f(i)) ] ) for a fast numerically evaluated approximation which is very good for very large n.
(all AS IF during insertions of n unique i.e. distinct items)
Ploss is a computable theoretical number which (unlike Fn ) is neither directly verifiable nor falsifiable in the spirit of Karl Popper's philosophy of science. Hence Ploss is a quasi-Platonic quantity. Ploss is introduced only for comparisons with other work (find Pnc , Pno ).
Let's start with a 2-liner version in edu-C of FlashHash for a monolithic BF1 with hashcrosstalk, which is the simplest version. Put overflow checking off, especially if you have to use a signed integer h0 in 2-complement arithmetic, which has asymmetric subranges of absolute values for integers < 0 and > 0.
Choose your moduli M1, M well; they should either be prime numbers, or if M1, M are powers of 2, use M-1, M1-1 as masks i.e. bitwise AND-ed with h0. When trying out FlashHash, do not optimize prematurely; first try with the % M operator with M prime, then with & Mmin1 [ i.e. & (M-1) ], compare and check. Start with a simple version. Don't be pound foolish and penny wise :-)
1st version of FlashHash for a monolithic BF1 employs rotation of bits in h0:
d = w - c;
h0 = HashFun(S);
h = h0 % M;                       /* use h or h[1] = h; */
for j = 2 to k do
begin
  h0 = (h0 << c) | (h0 >> d);
  h = h0 % M;
end;
2nd version of FlashHash for a monolithic BF1. Unlike the 1st version, the if j > 1 then ... allows concentration at one place of an application dependent blitzy additional in-line code following each h = h0 % M; typically the in-line code will test/check or set bits in a BF[0..M-1]. Like the 1st version, this 2nd one also saves precious bits from being skipped and wasted, although a c prime wrt w would finally visit all w bits in h0 . Wasting means that the distance c between bits will have to be smaller for the given w and k , if k <= 1 + (w-1) div c is wanted. A smaller distance c makes hashes less independent.
d = w - c;
h0 = HashFun(S);
for j = 1 to k do
begin
  if j > 1 then   /* may take this line out, but will waste the 1st c bits */
    h0 = (h0 << c) | (h0 >> d);   /* bits rotated by c bits up ie to msb */
  h = h0 % M;     /* Here test/set h-th bit BF1[h], or do h[j] = h; */
end;
3rd version of FlashHash for a monolithic BF1 (I call it Fiesta enTropicana):
/* Initially fill an array ENTROPIC[1..kmax] of integers with arbitrary,
   preferably huge integers typed e.g. by Eddington's monkeys. These huge
   integers must remain CONSTant during the whole lifecycle of a BF as an
   "entropic database" ( which, btw, cannot be read by any 3-letter word
   agency like, say, KGA [ find KGA ] ). In BPascal I just declared &
   banged in these "pseudo-entropic numbers" : here kmax = 11
   Const ENTROPIC : array [ 1..11 ] of LongInt =
     ( 194758204, 293745367, 389486698, 406837398, 585769263, 669402120,
       734068134, 836141173, 928486384, 035590389, 107805073 );
   Note that these are NOT numbers from a random generator, because it is
   desirable to obtain entropic bitstrings by means of a blitzy xor which
   flips bits over the whole width of a LongInt h0 , hence we need huge
   integers within the range of LongInt h0 .
   Quiz: why do my entropic numbers have their first digits 1, 2, 3, .. etc ? */
h0 = HashFun(S);    /* Use this version if c too low, e.g. if k > w/3 */
for j = 1 to k do   /* use h or do h[j] = h; */
begin
  if j > 1 then     /* ..PIC[1] above allows to take THIS one line out */
    h0 = h0 xor ENTROPIC[j];
  h = h0 % M;       /* put inline use of h */
end;
4th version of FlashHash for a monolithic BF1 is just another implementation of the 3rd version (also with IMPORTED/INJECTED pseudo-ENTROPY ):
h0 = HashFun(S);
for j = 1 to k do   /* Use this version if k > w/3 ie if c too low */
begin               /* if j > 1 then... is not needed anymore, but allowed */
                    /* xor with k distinct, arbitrary, huge numbers but */
  case j of         /* all k nrs within the range of integer h0 */
  begin             /* These nrs may be typed by Eddington's monkeys */
    1: h0 = h0 xor 132950380;   2: h0 = h0 xor 293745367;
    3: h0 = h0 xor 389486698;   4: h0 = h0 xor 406815298;
    5: h0 = h0 xor 585021763;   6: h0 = h0 xor 669402120;
    7: h0 = h0 xor 734481234;   8: h0 = h0 xor 836141173;
    /* ... etc up to kmax ever needed */
    else: h0 = h0 xor 007805073;   /* Correctness insurance */
  end case;
  h = h0 % M;       /* or: h = offset + (h0 % M1) for BF3 or BF2 */
                    /* Use h or do h[j] = h; i.e. your in-line code */
end for;
5th version of FlashHash for a monolithic BF1 combines the 2nd & 3rd versions so that a higher number of hashes k is easily obtained without c too low and without too many entropic numbers typed by Eddington's monkeys:
/* c prime is recommended as it will visit all w bits wrap around */
c = 13; d = w - c;  /* may now be constants eg 13 and 19 = 32 - 13 */
h0 = HashFun(S);
i = 0;
for j = 1 to k do   /* no if j > 1 then.., as we have enough entropy now */
begin               /* declare and fill ENTROPIC[1..kmax]; may recycle % Emax */
  if (j & 1) = 0 then h0 = h0 xor ENTROPIC[++i]   /* or: PIC[(++i) % Emax] */
  else h0 = (h0 << c) | (h0 >> d);
  h = h0 % M;       /* Here inline test/set h-th bit BF1[h], or do h[j] = h; */
end;
The 5 versions above are readily adapted to my favorite BF3 versions below.
|--- No overkill & no overhead with blitzy FlashHash & BackHash :
!
! - No more than one call by FlashHash of a HashFun(S) per item, or
!   string, or state S is needed. The received wisdom was k calls.
! - No other call on any other function is needed, not even on a slooow
!   pseudorandom generator, also not desirable wrt the XOR above.
! - No slooow floating point arithmetic needed.
! - No slooow multiplication needed, not even in integer arithmetic!
! - No division or the mod-op % is needed iff M or M1 is a power of 2.
!
! Making k calls of a HashFun(S) looks now hopelessly inefficient.
! Even k calls of a Random( h0 ) would be hopelessly inefficient,
! simply because k is a small number, at most a few dozen
! (because we want to use our bits well), hence we do not need a
! computationally expensive pseudorandom sequence of numbers with a
! long period.
! Non-rotational versions of FH use a small set of entropic numbers.
! Non-rotational and rotational versions were combined into a HYBRID
! version nr. 5 which will RECYCLE (via % Emax) the INJECTED ENTROPY if
! h0 = h0 xor ENTROPIC[ (++i) % Emax ] where Emax < k is allowed.
!
! By now even entrenched fundamentalists will have to admit that their
! traditional computations of k separate HashFuns per S are about as
! efficient as shooting at flies (or bits :-) with a nuclear howitzer,
! at least in the context of BFs and other non-Manichean (find below)
! apps i.e. where no enemies or adversaries are expected (find KGA ).
!
|------------
The latter versions (re)produce REPRODUCIBLE ACYCLIC PSEUDO-ENTROPY, which may become useful when we want a larger k , so that c would become too small and bit-patterns less independent when c < 3 or 4. In fact any reproducible set or sequence of remappings of h0 will do, provided they are sufficiently pseudorandom, pseudo-chaotic, deterministically chaotic i.e. pseudo-entropic (ideally MaxEntropic) w.r.t. the Bloom filter BF[0..M-1].
Perfectionists' unattainable Platonic ideal of k stochastically independent hashing functions laboriously recomputing from S uniformly distributed hashvalues is now not unlike the monuments of the Ideal Society Leaders (read Plato's Republic). Where are those Ideal Ideologists and their monuments now ? On graveyards together with their victims: 100000000 = 100 MegaDead murdered by the Kommies! That's the result of an ideology applied to people. An ideology in computer science does not waste lives, just cycles.
But waiting in a queue for food at a store in a Kommie kantry was also wasting the life-cycles of people who were there buried alive behind the barbed wire and watch towers manned by the subservient opportunists trying hard to obtain favors from their Ideal Leaders. Now with FlashHash you do not have to waste your life-cycles anymore when waiting for a result of, say, a model checker like SPIN, and you do not have to rerun it and wait again and again, as you can now get 1000 times higher quality of a result in a single run at no extra cost of your boringly repeated sequential re-running, re-waiting, re-staring at the crt i.e. your re-wasting of your limited biocycles.
FlashHash in edu-C for monolithic BF1 with hashcrosstalk :
/* 2nd version with comments; it allows concentration of the in-line code
   following the h = h0 % M; */
/* M = k*M1; for comparisons between BF1 and BF3 or BF2 */
/* Mmin1 = M - 1; for faster % iff M is a power of 2 ; see below */
d = w - c;          /* w = nr bits in h0; use max c < w/2 for k <= 1 +(w-1)/c */
h0 = HashFun(S);    /* Store THIS h0 now iff needed */
for j = 1 to k do   /* generates k hashes per item or state S : */
begin               /* max odd or prime c <= (max c allowed) will be the best c */
  if j > 1 then h0 = (h0 << c) | (h0 >> d);   /* rotate bits */
  h = h0 % M ;      /* put HERE inline code using h */
  /* or: h = h0 & Mmin1; this is faster; use it iff M power of 2 */
end;                /* h = might be h[j] = ... ; use h before and in the loop */
FlashHash in edu-C for BFs without hashcrosstalk i.e. for BF3 and BF2 :
/* FlashHashing is likely to work better on partitioned / segmented BFs.
   FlashHashing can work almost as well as k independently computed
   HashFun(S). "Works" means "achieves low Fn", see Tab. 3.
   A huge "addressing bitspace" of M = k*M1 bits may need a larger w in bits,
   which also allows a larger distance c in bits i.e. independence. */
M = k*M1;  M1min1 = M1 - 1;   /* or: M1 = M div k i.e. int/int */
d = w - c;          /* w = nr bits/word; use max c < w/2 for k <= 1 +(w-1)/c */
offset = -M1;
h0 = HashFun(S);    /* Store THIS h0 now iff needed */
for j = 1 to k do   /* generates k hashes per item or state S : */
begin               /* max odd or prime c <= (max c allowed) may be the best c */
  if j > 1 then h0 = (h0 << c) | (h0 >> d);   /* rotate bits */
  offset += M1;     /* For j = 1 is offset = 0 */
  h = offset + (h0 % M1);   /* Put HERE inline code using h */
  /* or: h = offset + (h0 & M1min1), iff M1 power of 2 , is faster */
end;                /* h = might be h[j] = ... ; Use h within loop or store it */
BackHash is always like FlashHash without h0 = HashFun(S) but with h0 = FetchFromStore; (e.g. from a stack) in its place. For easy maintenance BackHash can be integrated with FlashHash by means of a parametrized condition :
if wantFH then h0 = HashFun(S) else h0 = FetchFromStore;
BackHash regenerates from one hash h0 by bitshifts all k hashes per S. As long as h0's are stored & available e.g. on a stack, only one h0 per each S, no HashFun(S) needs to be recomputed for the S's done so far.
HashFun(S) may be your favorite, or this Bytewise 1-liner (or better):
h0 = 1; for ix = loB to upB do h0 = (h0 << 8) + h0 + S[ ix ];
The logically partitioned BF3[0..M-1] of bits, with M = k*M1, is my favorite for the following reasons :
- The declaration & access is as simple as for a monolithic BF1[0..M-1], yet:
- There is no direct hashcrosstalk between pairs of distinct states S1, S2. Alas, there is an indirect hashcrosstalk among 3-tuples of states. Combinatorial implosion works via the (k! - 1) nonmatching permutations as at least an insurance or last ditch defense against direct hashcrosstalk.
- A smaller modulus M1 will distribute more uniformly than M = k*M1.
- The h = h0 % M1 uses fewer bits from h0 than the h = h0 % M does.
- If we wish to use a prime modulus M1 , then there is no need for finding primes M =~ k*M1 for a BF1. With a single prime M1 we get k partitions, each with a prime number of bits. This means easier and less error prone changes during tests and maintenance.
Many technical details are pointed out, quantitative evaluations and comparisons are tabulated, and ready to use algorithmic auxiliaries are worked out, including "hashpairs" i.e. paired hashes as if from a 64 bit machine, probably better than a single 64 bit hash. Clearly, the bigger the M , the more bits w (with w >> c) a hash should have for a lower Fn. This I call the "reversed/inverted stochastic pigeonhole principle". Hasty cognoscenti or unbelieving connoisseurs may now jump to my term "hashcrosstalk" and to the "k!" permutations argumentation for the logically segmented "BF3" of the 3rd kind. For quality-performance numbers find "Tab.", "Fn" and "w = 64" without quotes. The proof of the pudding is in "Tab. 3". The cardinals of fundamental computing science are invited to look through the telescope (translation: run FlashHash before you judge).
-.-
+Mottos :
Simple, few parts, easy to maintain.
  [ Gen. Chuck Yeager, USAF, tested Bell X-1; The Right Stuff ]
"The idea is crazy, but not crazy enough."
  [ Niels Bohr (1885-1962), 1922 Nobel Prize for physics ]
"When a distinguished but elderly scientist (JH: over 30 in physics :-) states that something is possible, he is probably right. When he states that something is impossible, he is very probably wrong."
"The only way to discover the limits of the possible is to go beyond them to the impossible."
"Any sufficiently advanced technology is indistinguishable from magic."
  [ 3 laws of Arthur C. Clarke; invented a geostationary satellite in 1945 ]
- Expect nothing short of a miracle, as only where miracles are expected, miracles are common.
- Never say never.
- Don't say it with flowers. Do IT with Bloom filters!
- Bloom Bit said: Just turn me On.
-.-
+Basic & new terms, abbreviations & symbols (henceforth in edu-C+BP/Ada) :
bag      a multiset (often implemented with hopefully small counters).
item     a chunk of data, e.g. a data vector, record, state i.e. a string S . S may be a state vector of a finite state machine, or a state S in a model checker or a protocol verifier using a combinatorially explosive reachability analysis to systematically generate and check as many reachable states S as feasible (ideally all S's) like e.g. started by [Hajek 1977, 1978a, 1978b, 1979], [Holzmann 1988, 1990].
hashfun  = a short word for a hashing function HashFun(S) which reduces an item i.e. a string-of-bytes-ie-of-bits S to a number called hash. Hashing is an irreversible reduction stronger (= produces fewer bits) than a reversible compression. The original S cannot be recovered from its hash by an inverse operation like a decompression would be. An ideal, perfect hashfun would provide a one-to-one mapping of unique S's into unique i.e. distinct numbers called hashes. Alas ...... :
hashclash = clashhash = my new words for a hashing collision when two or more hashes clash i.e. become hashing synonyma i.e. become equal while they ideally should differ. Hashclashing is an undesirable but generally unavoidable many-to-one mapping. Only for particular, small sets of fixed i.e. a priori known items (typically the reserved words in a programming language) may a perfect hashing i.e. one-to-one mapping be possible.
false positive = an error of the Fn-kind: let after n trials or inserts Fn be called "cumulated false positives". Then (# means number of):
Fn = # wanted (to get or to know) misidentified as unwanted
   = # forecasts which did not come out { who wants non-forecasts? }
   = # irrelevant documents retrieved { who wants irrelevantia? }
   = # healthy persons diagnosed as ill { Docs see ill as + }
   = # foes misidentified as friends { if you want to id foes }, or
   = # friends misjudged as foes { if you want to id friends }
   = # new i.e. unseen states or strings S misidentified as old i.e. as if already present i.e. as if "stored" in a BF, so that consequently they will be lost because discarded as if (pseudo)duplicates.
Bloom filter (BF) = a dispersed virtual hasher for probabilistic set membership testing i.e. for checking if some S almost certainly is/was present or seen before (deja vu). The items or states S are not actually stored in a BF; a new S sets k bits in a BF i.e. it bitmarks a BF-bitmap to look AS IF stored, hence the items are "stored" only VIRTUALLY. A new S sets k bits, but some of them may be set already, hence insertion of n distinct S's into a BF sets less than k*n distinct bits, so that bits become shared i.e. "overloaded". Filtering refers to a BF's discrimination capability to filter out duplicates. Unlike the monolithic BF1, the slotted i.e. segmented BF2 or BF3 are partitioned either "physically" or logically and therefore do not suffer from hashcrosstalk like BF1 does. Unlike compact virtual hashers which keep their bits together in adjacent positions, dispersed virtual hashers distribute their bits over the whole monolithic BF1[0..M-1] or over its segments of M1 bits in the logically partitioned BF3[0..M-1] or in BF2.
Note0: a BF[0..M-1] of bits must be initialized to all bits = 0. Of course we could also initialize all bits to 1 and do resetting of bits to 0 instead of setting bits to 1 .
Note1: ideally the k bitindexes per item, string, or state S should be all distinct, but it is not a must. If they are distinct, less gamble will take place at the cost of some programming and runtime overhead spent on (usually superfluous) checking that they indeed are distinct k bitindexes per S. By distinct per S I do not mean all k*n bitindexes being distinct (which would be the unattainable Platonic ideal of a perfect Ideal Hasher playing in the same league as Maxwell's demon, Laplace's omniscient demon, the Ideal Observer, or Carnap & Bar-Hillel's Ideal Receiver [Kahre 2002]). Moreover even an Ideal Being cannot violate the pigeonhole principle (from combinatorics) in its elementary form, i.e. (s)he cannot put 1001 distinct pigeons into M = 1000 pigeonholes without at least one pair of pigeons sharing a hole. The same applies to setting M+1 bits in a BF[0..M-1], but:
Note2: Due to bitsharing, k hashfuns applied to n distinct states will always set less than k*n bits in a well filled BF. In fact on one hand bitsharing saves precious memory, but on the other hand it increases the chances of faults of the false positive type only, which will lead to some losses (omissions, collisions). See the "optimal statics" formula for the fo . Hence :
- pessimists see negatively sounding clashes :-( , while
- optimists see positively sounding bitsavings due to bitsharing :-)
Note3: not even an Ideal Being can individually address more than (i.e. separately each of all), say, 2^32 bits with a single hash h0 which is an unsigned integer of 32 bits. Therefore BFs with a huge M may need two hashes h0 and h00 forming a hashpair of a width w = 64 bits (find "w = 64"). This is my "reversed/inverted pigeonhole principle", as trivial as the pigeonhole principle (from combinatorics) in its elementary form.
hashcrosstalk = my new term for the nowhere even mentioned hashclashes of the 2nd/3rd kind. There are 3 kinds of clashes between S1, S2 in BFs. Explained in shorthand (here S2 and S1 are distinct items or states) :
If HashFunA(S1) = HashFunA(S2) then hashclash of the 1st kind;
If HashFunA(S1) = HashFunB(S2) then hashclash of the 2nd kind in BF1;
If HashFunA(S1) = HashFunB(S1) then hashclash of the 3rd kind in BF1;
If HashFunA(S1) <> HashFunA(S1) then irreproducible nondeterminism due to a bug (e.g. an uninitialized variable).
There are no other possibilities to be considered when only a single pair of S1, S2 is considered in splendid isolation. Alas, as Dr. Dragan Bosnacki has pointed out, a 3rd state S3 may be lost due to a hashcrosstalk via S1, S2 over the (logical) boundaries of the segments of partitioned BFs i.e. BF2 and BF3. Quiz: how ? Hint: create a case. Hence "BFs without hashcrosstalk" holds only for pairs of states in splendid isolation from the third state S3 . Therefore a partitioning of BFs has more limited, yet non-negligible practical consequences (find "favorite" here).
Clashes between splendidly isolated S1, S2 explained in longhand :
- Clashes of the 1st kind (in BF1, BF2, BF3) : distinct states S1, S2 to which the exactly same hashfun is applied may produce the same hash-address (e.g. due to the % M1 operation);
- Clashes of the 2nd kind (in BF1 only) : different hashfuns on S1, S2 may produce the same hash-address. This cannot happen in BF2 or BF3 which are partitioned, hence different hashfuns work only on their own strictly separate parts of BF[0..M-1].
- Clashes of the 3rd kind (in BF1 only) : k different hashfuns on S1 only may produce fewer than k different hashes. This can happen only on BF1 with a poor hashfun, or too small an M .
Note: by hashcrosstalk I will mean that of the 2nd or 3rd kind in BF1, unless specified otherwise.
FlashHash is a multihashing function employing a single HashFun(S) to produce a single hash h0 from which all k hashes are obtained i.e. derived (e.g. for a BF by e.g. stepwise rotations of hashbits or by injecting entropy ).
BackHash is like FlashHash, but without any hashfun. BH recovers hashes by bitwise rotations from the stored 0-th flashhash h0.
msb   the most significant bit (in a number called hash here).
{ }   embrace comments like /* */ in the C programming language.
^     raises to power neither in C nor in BPascal.
      2^32 = 4294967296; 2^24 = 16777216; 2^20 = 1048576; 2^16 = 65536;
      2^31 = 2147483648 = MaxLongInt + 1 in BPascal ( = a Mersenne prime + 1 )
            = abs(MinLongInt) in 2-complement arithmetic will overflow
ln(2) = 0.6931 is the e-based Napier's logarithmus naturalis e.g. log in C
:=    assignment operator in ADA, BP; i.e. = in C mixes with math
=     equality operator as in math i.e. == in C
<>    not equal relational operator i.e. != in C
%     modulus operator i.e. mod in BP aka BPascal
div   integer division operator i.e. / in C is "overloaded"
|     bitwise inclusive OR operator i.e. OR in BP is "overloaded"
&     bitwise AND operator i.e. AND in BP is "overloaded"
xor   bitwise XOR operator i.e. ^ in C mixes with math
<<    bitwise leftshift operator i.e. shL in BPascal on integers
>>    bitwise rightshift operator i.e. shR in BPascal on integers
Note: the % and << are also used a couple of times each in their usual mathematical meanings of "percent" and "much smaller" respectively.
-.-
+Bloom filters (BFs) = a tutorial + a quantitative study :
BFs typically work as approximate set/bag membership testers/checkers. Membership testing means checking the current/past presence/occurrence, or its opposite i.e. absence/nonoccurrence, by testing/checking k bits. BFs employ a probabilistic bitmapping technique of multiple hashes per item (e.g. a state S) for marking/setting and testing/checking multiple bits scattered (with semantic overload or overlap) over a bitmap i.e. in one or more bitarrays. The three versions of BFs are:
Bloom filter of the 1st kind (BF1) is monolithic: BF1 is a single bitmap BF1[0..M-1] of bits, where all k hashfuns per each state or string S operate i.e. pseudorandomly set and test/check k bits per S over the whole range of a bitmap:
     h2    h1   - - -      hk       h3     all k bits set by one S
      v     v               v        v
    01111001010001....011110010001100100001
         ^     ^               ^  ^
         h3    h1   - - -      h2 hk       all k bits set by another S
Bloom filter of the 2nd kind (BF2) is partitioned / segmented "physically": BF2 consists of k smaller bitmaps, each BF2[0..M1-1] of bits, so that k*M1 = M bits, each reserved for only one j-th hashfun:
    0110010100101   M1 bits used by hashfun nr j = 1 only
    0010010000100   M1 bits used by hashfun nr j = 2 only
    - - - - - - - - - - - -
    0100010100100   M1 bits used by hashfun nr j = k only
BF2 is handy when e.g. a compiler does not allow one to declare, or the OS to allocate, a single huge hashtable as big as needed.
Bloom filter of the 3rd kind (BF3) is partitioned / segmented logically: BF3 is BF2 in a single bitarray BF3[0..M-1] of bits where each hashfun works its own reserved subrange of M1 bits so that k*M1 = M bits.
    |.........|.........|..........|.........|
For BF3 the j-th hash is then finalized thus : Offset is the absolute bitindex of the 0-th bit in a segment of BF3 as-if BF2[0..M1-1]; the 1st segment (for j=1) has offset = 0; the 2nd offset = M1, etc, i.e. :
offset = M1*(j - 1);            { in loops reduce * to repeated + M1 }
h = offset + ( h % M1 );        /* in C */
or faster iff M1 is a power of 2, and M1min1 = M1 - 1 :
h = offset + ( h & M1min1 );    /* in C */
h := offset + (abs(h) AND M1min1);   { in BPascal }
offset := M1*LongInt(j - 1);         { safely typecast in BPascal }
h := offset + (abs(h) mod M1);       { safe for any M1 in BPascal }
The abs(h) must be used and overflow checking must be off if a signed integer h is in 2-complement arithmetic (with asymmetric subranges). An overflow of the offset M1*(j-1) must be avoided even if the overflow checking, if any, is off. This bad bug did hit me when my M was decreased! Quiz: How come that smaller numbers could cause an overflow, but larger did not (in BPascal)? Although I do not make many bugs anymore, or see them instantly, it took me hours to kill this badbug. Hint: M1 was a constant, j was a byte. Despite such badbugs, BP was a love at dz 1st byte.
In all three BF-schemes the same bit can be shared i.e. set/tested by hashes generated from different items or states. However unlike in BF1, there is no "hashcrosstalk" in slotted i.e. partitioned i.e. k-segmented Bloom filters like BF2 and BF3, where a j-th hashfun cannot share a bit with an i-th hashfun if i <> j. Hence BF2 and BF3 are much less semantically bit-mixed than a monolithic BF1. This will turn out to be the conditio sine qua non for quality flashhashing.
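To make the BF3 addressing concrete, here is a minimal test-and-set sketch in C (my own illustration, not code from this paper; K and M1 are example constants, with M1 prime as recommended, and h is assumed already reduced to 0..M1-1):

  enum { K = 10, M1 = 116663 };              /* k segments of M1 bits; M = K*M1 */
  static unsigned char bf[(K * M1 + 7) / 8]; /* all bits start at 0 */

  /* Test-and-set the bit of the j-th hash (j = 1..K), h in 0..M1-1.
     Returns the old bit: 1 for all K hashes of an S means "seen"
     (or a false positive). The j-th hash can only touch its own
     segment, hence no hashcrosstalk of the 2nd or 3rd kind. */
  static int bf3_testset(int j, long h)
  {
      long bi = (long)M1 * (j - 1) + h;      /* offset + h */
      int old = (bf[bi >> 3] >> (bi & 7)) & 1;
      bf[bi >> 3] |= (unsigned char)(1 << (bi & 7));
      return old;
  }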
BFs are resilient (wrt external errors) due to the dispersion of the bits (not unlike holograms) belonging to each particular item. The bits' overlap i.e. bitsharing i.e. semantic overloading of individual bits shared by different hashfuns (originated from different hashed items S) makes BFs memory-efficient but also (internally or inherently) error-prone like any imperfect hashing scheme without a collision resolution backup. However the false positive rate is calculable hence under math control at any moment, and can be updated and reported in-flight. BFs are dispersed bithashers without collision resolution, hence with calculable risks of errors of the false positive kind only.
Approximate set membership testing and multiset/bag membership checking is an operation classically used by spelling checkers without a fully stored dictionary, or by model checking programs like SPIN without explicitly storing the whole generated states of, say, computational communicating processes [Holzmann 1988, 1990, Bell Labs], who calls the Bloom filter bitstate hashing. Nowadays model checking means and does what automated verification or validation of protocols (e.g. for networks, communication, security, passwords, etc) used to mean and to do earlier [Hajek 1977, 1978a, 1978b, 1979].
BFs fall into the category of VIRTUAL HASHing algorithms [Morris 1968], [Murray 1970, p. 180-183], [Bloom 1970, p. 423, Methods 1 & 2] which form a subclass of hashing algorithms. Virtual hashing without any collision resolution was invented and loosely described at the end of the classical paper by [Morris 1968, Bell Labs], where on pp. 42-43 he used the term virtual scatter, and also virtual hash 3 times on p. 43. Virtual scatter or hashing has been mathematically analyzed for the first time by a computer scientist from Cornell University [Murray 1970]. BFs clearly fall, in my view, under the general paradigm of HINTing algorithms. Note that this insight is not a commonly received wisdom as it is hard to find, but see [Lampson & Sproul 1979, pp. 100-102, 3.6].
BFs can be viewed as algorithmic versions of the ancient notched-edge cards manual filing systems i.e. needle-card based manual documentation & retrieval systems where card columns were at a premium, so the ancients had to be extremely inventive. Read on the method of SUPERIMPOSED CODING [Knuth vol.3], or on SUPERIMPOSED CODES [Roberts 1979, Bell Labs], [McIlroy 1982, p. 95, Bell Labs], also called Zatocoding or Peekaboo, no joke!
BFs operate optimally (w.r.t. the memory use under the given load i.e. at their capacity to store items with the required false positive rate) under the maximum entropy condition (for BFs simply the equal chances that a bit is On or Off on average), hence can be understood in terms of (non)classical information theories like e.g. Shannon's classical Mathematical Theory of Communication [Shannon 1948], and Jan Kahre's unifying neoclassical Mathematical Theory of Information [Kahre 2002].
BFs being virtual hashers, they behave virtually i.e. almost AS IF they were really encompassing a much larger "name-space" of a set/bag with a much larger cardinality i.e. number of possible names, states, strings or items. In this respect, like all virtualities, BFs have their philosophical patron in the little known German philosopher of ALS OB (= AS IF) Hans Vaihinger [Vaihinger 1923], [Kahre 2002, see his Indexes].
BFs are predictable: In this section all explanations pertain to the BF1 scheme only, but apply also to the BF2 and BF3 schemes (unless stated otherwise) which behave virtually w.r.t. BF1 i.e. almost AS IF they were BF1. All formulas are derived under the assumption of stochastic independence, as usual. After n items were stored in BF1 by setting k bits (some may have been 1 already and will be shared) per item, the false positive rate is:
f(n) = (1 - (1 - 1/M)^(k*n))^k is the instantaneous probability of a false positive AFTER the n-th item was "stored" in a BF[0..M-1] of bits; f(n) may be called a false+rate or a failrate. f(n) is the "general dynamics" formula for BFs. For large enough M bits, f(n) is very well approximated by
f(n) = (1 - exp(-k*n/M))^k , k*n < M; optimal k*n = M*ln(2) or M*2/3
From the latter formula for f(n) one obtains
fo = 1/2^k and k*n = M*ln(2) or k*n = M*2/3
for a BF filled optimally wrt memory utilization (useful in quasistatic apps when the final n can be estimated or fixed beforehand). At such a load, f has dropped exponentially with k increased linearly. fo is the "optimal statics" formula for BFs. From the instantaneous f(n) one obtains the so far cumulated number of false positives Fn (not explained anywhere and misunderstood everywhere):
Fn = Sum_i=1..n( f(i) ) is the cumulated number of false positives,
   = Integral_i=1..n[ f(i) ] for a fast numerically evaluated approximation which is very good for very large n .
Fn < n*f(n) ; display Fn as a float, as Fn < 1 or Fn > 1 may result.
Fn is directly verifiable or falsifiable empirically (see below) in easy experiments based on simple I/O through a BF and counting the I-O's:
Fn = (number of distinct items In) - (number of distinct items Out)
   = number of distinct items lost. (distinct = unique = no duplicates)
Ploss = 1 - Prod_i=1..n( 1 - f(i) ) < 1 , check it in programs;
      = 1 - probability of NO loss of any item or state
      = probability of a loss of ONE OR MORE items or states
      = probability of a loss of AT LEAST ONE item or state
      = 1 - exp( Integral_i=1..n[ ln(1 - f(i)) ] ) for a fast numerically evaluated approximation which is very good for very large n .
(all AS IF during insertions of n unique i.e. distinct items)
Ploss is a computable theoretical number which (unlike Fn) is neither directly verifiable nor falsifiable in the spirit of Karl Popper's philosophy of science. Hence Ploss is a quasi-Platonic quantity. Ploss is introduced only for comparisons with earlier work by others.
Fn and Ploss are as exact as only reasonably possible and cannot be approximated by nice analytical formulas without making large errors, but they can be easily computed numerically. Fn may be approximated as an integral and also simply counted in practice, so that this theoretical model Fn can be easily verified by empirical "measurement" thus: Input & "store" n unique (i.e. known to be nonduplicate) items (e.g. strings) into a well filled BF with a not too large M and count how many items have NOT passed thru this BFilter. Uniqueness is necessary for a simple & fast yet reliable measurement, otherwise one might be losing all but the 1st from just a few bad-luck "muplicates" if each is reinserted many times. This method was used for creating a reliable Tab. 3. Be aware of the fact that the differences between the theoretical Fn and the empirical Fn are due to the total systemic properties of the "Grey Box = HashFun(S) + items S input", i.e.:
1. The quality of the hashes used, which interacts with:
2. the quality of the unique items (their differences, size and redundancy).
3. The amount of unique items (the more items, the lower the variance of Fn).
Note that from nonunique items any Fn or Ploss may result :-)
The criterion of optimality for BFs is usually memory utilization when the tolerable f is given i.e. specified beforehand. The approximate f-formula (without the optimality condition) is briefly mentioned in [Knuth, vol.3, any ed., just see Bloom]. In the next e-paper I will derive these design formulas in several ways, including my derivation from the very 1st principles. Here it suffices to say that an up to 4-way trade-off is involved among f, M, k, n :
f = false positive rate (AS IF present while absent) i.e. false+rate (errors, omissions, collisions are not specific hence ambiguous).
M = memory space in bits available in a single BF[0..M-1] of bits.
n = number of items or states (virtually AS-IF-)"stored" in a BF.
k = number of hashes and bits (0 or 1) set per item; it proportionally increases the CPU-time spent on hashing, and in some applications like e.g. model checking with SPIN, a larger k may need extra memory for additional data structures. It is this k-dependent CPU-time (and probably an additional space overhead) which FlashHash & BackHash can drastically reduce to almost zero.
A quantitative example:
M = 400 MBytes * 8 bits/Byte = 3200 Megabits are available;
f = 0.001 is acceptable i.e. the false positive rate of a BF AFTER n items were stored is one lost item per 1000 "stored" (the AFTER is important!)
Q: How should we design our BFs ? asks a programmer.
   k = ? i.e. how to choose the number of hashfuns k ?
   n = ? i.e. how many unique items S can we store for the k chosen ?
A: You may repeatedly solve one of the above formulas numerically for different pairs of k and n. Or, if you believe information theoreticians, you just use the formula for an optimally utilized BF i.e. one operating at its "optimally filled" (though not full) capacity: fo = (0.5)^k = 1/2^k where 0.5 is the probability of an average bit being On (this is the true meaning), or Off (because 0.5 happens to mean equal chances of 1 vs. 0). Hence fo is calculable without a calc and even without a piece of paper if you manage powers of 2: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, so that for our task at hand we have fo = 1/1024 = 1/2^10 < 0.001 required, hence our optimized k is k = 10 hashes per item or state S "stored" in a BF. The capacity of n items of our optimally filled (yet far from full) BF may be obtained by solving for n one of the general f-formulas far above, which yields
k = ln(2)*M/n = 0.6931*M/n > 0.5*M/n ; hence
n = ln(2)*M/k = 0.6931*M/k , hence in our case:
n = 0.6931*3200 Megabits /(10 bits/item stored) = 222 MegaItems
  > 0.5*3200/10 = 160 MegaItems (or Mstates in SPIN).
The > is due to the semantic overload or overlap i.e. bitsharing. Note that 222/160 = 1.39, and 160/222 = 0.72. This means that setting 100 bits actually sets i.e. turns On only 72 bits on average.
k*n = M*2/3 is my engineering rule of thumb with a small extra safety margin. This together with fo = 1/2^k allows us to design near optimally operating Bloom filters by heart.
The expected cumulated number of false positives is easily computed numerically, e.g. for the M = 3200 Mega, n = 80 Mega as above, we get:
f  = (1 - (1 - 1/M)^(k*n))^k = 0.0000002804 if the nr. of hashfuns k = 10;
Fn = 2.29 states missed out of 80 MegaStates "stored" with k = 10;
this Fn << 22.43 = f*n = 0.0000002804*80000000 iff 80M random membership tests would be done AFTER 80 MegaStates were "stored" already, the latter being the correct interpretation of what would otherwise be an improper calculation of Fn .
Tab. 1: M = 400 MBytes, n = 80 MegaStates; cumulated losses Fn :
k     =     2    3    4   5  6  7  8     9    10    11    12    13    14
Fn    = 64372 7743 1363 314 89 30 11 4.847 2.288 1.176 0.652 0.386 0.248
Ploss =     1    1    1   1  1  1  1 0.992 0.899 0.692 0.479 0.320 0.215
This table says a lot:
- It desperately cries for k > 2 for BFs (e.g. bitstate hashing in SPIN ).
- It shows that Fn provides much more information than Ploss does, especially when more than 1 item or state is expected to be lost. To tell that bitstate in SPIN will lose 64372 states is about 60k times more informative than to say that at least one state will be lost.
- In fact Ploss approximates Fn, albeit a bit roughly when losses exceed 1 state :-) Then Ploss tells us that 1 or more states will be lost, while Fn quantitatively informs us how many states or items will be lost.
-.-
+A comedy of errors = compact virtual hashing CH or HC vs. distributed / dispersed virtual hashing DH or BF :
Medical warning! This section is for healthy hearts & strong nerves only!
Let's now compare a BF with the classical virtual hasher (which does not disperse individual bits over a bitmap) invented by [Morris 1968], analyzed by [Murray 1970], and reinvented by [Wolper & Leroy 1993] who have called it inventively HASHCOMPACT, not to be confused with the similar though not identical COMPACT HASH by [Cleary 1984].
Caution! The curves in fig. 2 in [Wolper & Leroy 1993] are "approximate" by the factor 17.32 i.e. are 1632 % off the mark. Let's look at one clear point there: for n = 10^7 , k = 10 hashfuns and M = 1 Gigabit, my numerical integration as well as the straight summation both yield Fn , and straight numerical computations yield the product Ploss :
Fn    = 0.0000577302 = expected cumulated number of lost items or states
Ploss = 0.0000577285 = probability of ONE OR MORE losses i.e.
      = probability of AT LEAST ONE loss
      = 1 - probability of NO loss (omission, failure, fault) during insertions of n unique i.e. distinct items.
Ploss is semantically equivalent to 1 - Pnc = 0.001 = 10^-3 proclaimed for fig. 2 in [Wolper & Leroy 1993]. But here they are not equal, since (1 - Pnc)/Ploss = 0.001/0.0000577285 = 17.32 is the factor by which their equations and fig. 2 are "approximate" wrt The Truth of Ploss & Fn. That Ploss would approximate my new Fn was pointed out to me by Dr. Ulrich Stern (ex-Stanford University), when he noticed my misinterpretation of Pnc and a small typo in the previous version of this paper, all after I kept insisting for weeks on the incorrectness of the graphs in fig. 2 in [Wolper & Leroy 1993] which had been accepted as ok for almost 10 years. Due to the continuities and orderings of the curves in their fig. 2, the single point here identified as 1632 % in error makes their fig. 2 invalid. However, Ploss cannot approximate Fn in general because it always holds Ploss <= 1 , while Fn > 1 is quite a common number of lost items. Quiz: Why then is Ploss = 0.0000577 so close to Fn = 0.0000577 above ? Ploss (always <= 1) can be computed theoretically but not easily falsified as the Popperian philosophy of science requires, nor directly counted like Fn > 1 which can always be enforced by decreasing M or by increasing k*n.
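The Fn and Ploss values above are just a straight summation and a straight product; a minimal sketch of that computation in C (double precision; my own code, not the author's, with the k, n, M of the Wolper & Leroy point just discussed):

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      const double M = 1e9;           /* bits in BF[0..M-1]   */
      const long   n = 10000000L;     /* items "stored"       */
      const int    k = 10;            /* hashes per item      */
      double Fn = 0.0, Pnoloss = 1.0;
      for (long i = 1; i <= n; i++) {
          double f = pow(1.0 - pow(1.0 - 1.0/M, (double)k * i), k);
          Fn += f;                    /* cumulated false positives */
          Pnoloss *= 1.0 - f;         /* probability of NO loss    */
      }
      printf("Fn = %.10f  Ploss = %.10f\n", Fn, 1.0 - Pnoloss);
      return 0;
  }

The Integral forms given earlier are the fast alternative for very large n.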
Moreover Ploss contains much less information than Fn does (see Tab. 1). For the same point k = 10 , n = 10^7 , M = 10^9 bits I computed three values which must be semantically equivalent (all under the assumption of stochastic independence) :
Pnc   = probability of NO collision i.e. NOT a single state S lost;
Pno   = probability of NO omission i.e. NOT a single state S lost;
Ploss = probability of AT LEAST ONE state lost i.e. of ONE OR MORE items lost, which is the complement of Pnc as well as of Pno .
Since numbers too close to 1 do not read well (we are used to ignoring chains of 00000's but not of 99999's), the complements are preferably computed and displayed on screen and here :
Ploss   = 0.0000577 =~ 6*10^-5 introduced here above is the True Value, unlikely to be improved upon, hence the claimed 0.001 = 10^-3 is far from the True Value for the given k, n, M ;
1 - Pnc = 0.0009995 =~ 10^-3 from the equation (4) in [Wolper & Leroy 1993] is erroneous but consistent with their fig. 2;
1 - Pno = 0.0000909 =~ 10^-4 from the e-paper [Wolper, Stern, Leroy, Dill ] is 11 times smaller than in their fig. 8, which is identical with fig. 2 in [Wolper & Leroy 1993].
Their equation has improved since 1993 but the graphs have not :-) because:
(1 - Pnc)/Ploss = 17.32 ; (1 - Pnc)/(1 - Pno) = 11 ; (1 - Pno)/Ploss = 1.58
Another clearly identifiable point in their identical figs. 2 and 8 is for k = 5 , n = 10^4 , M = 10^6 bits, which yields:
Fn    = 0.000468157 by numerical integration
        0.000468293 by numerical summation which is the True Value
Ploss = 0.000468047 by numerical integration of the logarithmicized formula
        0.000468183 by numerical multiplication which is the True Value
1 - Pnc = 0.00312 by the approx-approx equation (4) [Wolper, Leroy 1993], which is 312 % of their proclaimed 0.001 for fig. 2.
1 - Pno = 0.00052 by the approx-approx equation after their equation (5) in [Wolper, Stern, Leroy, Dill ], which is 52 % of their proclaimed 0.001 for fig. 8.
The value 0.001 is too far from the True Value for the k, n, M given above. Their equation (3) for Pnc and their identical (5) for Pno have built in a 1st order approximation of i*k bits being On after i items or states were inserted i.e. virtually "stored" in a BF or bitstate hasher. This fundamental weakness cannot be removed by improving their 2nd order approximations. Moreover I could not get a meaningful result from a straight numerical evaluation of their equations (3) or (5), despite my careful logarithmization with overflow checking On, the extended floating point type of 10 bytes in BPascal, and using relatively small numbers like k = 5, n = 10^4 and M = 1 Megabit, which still form a clear point in their fig. 2.
Note that here I tear down the formulas and fig. 2 in [Wolper & Leroy 1993] and fig. 8 in [Wolper, Stern, Leroy, Dill ] concerning Holzmann's "bitstate hashing" [Holzmann 1988, 1990] aka Bloom filtering [Bloom 1970], [Knuth 1973], aka "probabilistic hashing" [Sincovec & Wiener 1986, p.396] aka "multiple hash functions" [Reingold & Hansen 1986, p.410-413]. I have not yet checked the Wolperian theories behind hashcompact HC [Wolper & Leroy 1993] aka compact hashing CH [Cleary 1984] aka virtual hashing since [Morris 1968, Bell Labs], virtual scatter [Murray 1970, Cornell Univ.] and virtual hash [Wiener 1986].
Since one comparison assumed that 400 MBytes of fast memory are available and accessible, saving it makes no sense, and since hashcompact cannot take more than up to 80 MegaStates, only one nontrivial comparison makes sense :
Q: Which fault rates f are achievable for M = 400 MB and n = 80 MegaStates?
A: BF cannot win over HC for the given M, n because for k = 28 BF's lowest here achievable losses are Fn = 0.017 =~ Ploss = 0.017 > 0.0013 = 0.13 % claimed for HC in [Wolper, Stern, Leroy, Dill ]. On the other hand, at the time when HC with k = 40 bits/state has filled all the available bitspace of M = 3200 Megabits, a BF with k = 28 ( k <> 28 would worsen Fn and Ploss ) has used only about M/2 = 1600 Megabits. The optimal bitsharing factor Obsf = 0.5*M/(k*n) = 0.5*3200/(28*80) = 0.71 i.e. of the k*n bits ever set, 29 % were already On hence were set again, so that 29 % of the On-bits are bitshared and thus semantically overloaded. When HC has used (almost) all M bits, HC will start to lose each & every new state simply because there will be no single bit left to store into, while a BF can go on quite comfortably much further, until it starts losing much more often than acceptable. Note the minimal Fn = 0.017 =~ Ploss for k = 28 :
Tab. 2: Fn = the smaller the better; M = 3200 Megabits
    k    Ploss =~ Fn for Fn < 1, but here BF cannot achieve Fn < 0.017
HC: 40   Ploss = BEST 0.0013   n < 80 MegaS = HC's abs.limit, HC used ALL bits
BF: 40   Fn = 0.036            n = 80 MegaS, BF far from full (bitsharing)
BF: 28   Fn = best 0.017       n = 80 MegaS, BF used only M/2 = 50% of bits
BF: 20   Fn = 0.038            n = 80 MegaS, BF far from full (bitsharing)
BF: 10   Fn = 2.26             n = 80 MegaS, BF far from full (bitsharing)
BF:  2   Fn = 64372 lost;      n = 80 MegaS, BF far from full (bitsharing)
Here BF is the loser in general, and for k = 2 a Huge Loser in particular. But again, when HC uses all M bits available, HC with k = 40 will turn into a Henceforth-All-Loser long before BF with k = 28 (or less) will start to lose unacceptably. So there is no clear cut criterion or rule of thumb for situations when n is not known beforehand. One clear advantage of classical virtual COMPACT HASHers ( CH like HC i.e. hashcompact ) over neoclassical virtual DISPERSED HASHers ( DH like BF ) with k > a few has until now been their speed, but this difference is virtually annihilated by the new algorithm called FlashHash, for the first time described here above and in more detail below.
-.-
+The proof of the pudding = Tab. 3 :
Here are some numbers for a BF (operating with my 1-liner hashfun) on n = 63609 real-world strings (all unique!, many similar, ave. length 19), M1 = 116663 bits/segment in a partitioned BF3, and M = k*M1 in BF1, hence (k*n)/M = (k*n)/(k*M1) = n/M1 = 63609/116663 = 0.545, well filled. As said above, Fn = the number of lost strings during 1 to n distinct insertions (i.e. no duplicates [re]inserted).
Tab. 3: BF1 is monolithic; BF3 is partitioned; cumulated losses Fn:
FlashHash Yes/No ;         k =    2    3   4   5  6  7  8  9 10
Theor. num. sum  BF1:  No Fn = 4205 1353 460 162 59 22  8  3  1
Theor. integral  BF1:  No Fn = 4272 1385 474 169 61 23  9  3  1
Really counted   BF1:  No Fn = 4259 1390 434 151 64 17  6  2  1
Really counted   BF1: Yes Fn = 4316 1411 460 159 64 33 25  7  7
  rotated by c < w=32 bits c =   13   13   7   7  5  5  4  3  3
Really counted   BF1: Yes Fn = 4360 1387 503 160 59 33 11  8  3
  = version with h0 := h0 xor EnTroPic[j];
Further for BF3 all really counted cumulated losses Fn:
- k hashfuns/S : NO FlashHash
                       No Fn = 4331 1395 482 178 64 24  8  4  0
- 1 hashfun/S only:   Yes Fn = 4259 1366 465 161 56 24 12  5  3
  = version with h0 := h0 xor EnTroPic[j];
- 1 hashfun/S only: FlashHash w = 32
                      Yes Fn = 4312 1399 488 175 70 30 13 11  5
  rotated by c bits        c =   13   13   7   7  5  5  4  3  3
- 2 hashfuns/S only: FlashHash w = 64
                      Yes Fn = 4354 1354 465 164 70 20 10  2  1
  rotated c <= 13          c =   13   13  13  13 12 10  9  7  7
- 2 hashfuns/S only: FlashHash w = 64
                      Yes Fn = 4311 1390 503 174 65 20 10  2  1
  rotated c <= 11          c =   11   11  11  11 11 10  9  7  7
Number of hashes:          k =    2    3   4   5  6  7  8  9 10
The table empirically validates that a k-flashhashed single h0 := HashFun(S) can compete with k separately (re)computed HashFun(S). The theoretical k!-based reason follows further below (find "k!" in this text). FlashHash with 1 hashfun was competitive up to k = 5 and only then the decreasing c began to take its toll when h0, h had w = 32 bits/word. Of course a word width w > 32 bits would help, and so did "hash pairing" of two 32 bit results h0, h00 of two hashfuns into one final "hashpair" of w = 64 bits (a sketch of such a hashpair follows at the end of this section). The results were close to those of k separate hashfuns and also close to the theoretical cumulated losses Fn for BF1. As it is impossible to address large spaces with too few bits, a larger M may require a larger word width w > 32 bits for the hashes (find "w = 64").
The IT & TI engineering rule of thumb for Fn in a flashhashed BF3 or BF2: For your M, n and k compute my new theoretical summing formula for BF1
Fn = Sum_i=1..n( f(i) ) with f(i) = (1 - (1 - 1/M)^(k*i))^k and M = k*M1
(do not forget to initialize Sum := 0 :-) and multiply this Fn by the following corrective ratio from Tab. 3 :
( flashhashed Fn[k] ) / ( theoretical numerical sum Fn[k] ) .
The safety margin built into this engineering rule is that Tab. 3 is based on my simple (good enough for my apps) 1-liner HashFun(S) :
h0 := (h0 << 5) + h0 + (S[ix] & 31); for subset-of-ASCII strings.
Common technical sense tells us: the better the HashFun(S) and the larger the distance c and the width w in bits of the 0-th hash h0 (as if concatenated and rotated with h00 ), the lower the actual Fn will be. Competent readers using better hashfuns should achieve better results, especially if their S is densely packed i.e. without too many redundant bits contributing no information (hence my << 5, & 31 ; find these below). Competent users of FlashHash will obtain their custom designed corrective ratios based on their own versions of my Tab. 3 obtained for their own hashfuns working on their typical sets of states or items.
Tab. 3 above shows that there is no need to compute k HashFun(S) iff k > 1 hashes are obtained by means of FlashHash :
1. The k > 2 greatly decreases BF's losses Fn and the failrate f ;
2. for k < 6 only 1 HashFun(S) will suffice iff FlashHash is employed;
3. for k > 6 only 1 HashFun(S) may suffice iff FlashHash is employed, but 2 HashFun(S) will further decrease Fn and f .
Inevitably, like with all algorithms, there surely will be some range(s) of parameters for which a compact hash CH or hashcompact HC will be worse or better than a dispersed hash a la BF.
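As for the hashpair sketch promised above: two 32-bit results h0, h00 of two different hashfuns are concatenated into one w = 64 bit word which is then rotated as a whole, keeping the distance c large even for k = 10. A minimal sketch in C (my illustration only; HashFunA/HashFunB stand for any two different 32-bit hashfuns):

  #include <stdint.h>

  /* rotate the 64-bit hashpair left by c bits, 0 < c < 64 */
  static uint64_t rotl64(uint64_t hp, unsigned c)
  {
      return (hp << c) | (hp >> (64u - c));
  }

  /* concatenate h0 = HashFunA(S) and h00 = HashFunB(S) into one
     w = 64 bit word; with w = 64 the rotation distance can stay
     as large as c <= 13 even for k = 10 (see Tab. 3 above) */
  static uint64_t hashpair(uint32_t h0, uint32_t h00)
  {
      return ((uint64_t)h0 << 32) | (uint64_t)h00;
  }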
To determine these ranges is left as an exercise for the spinners (= kind, thoughtful readers in general and the reinventors of compact hash or hashcompact in particular). This paper gives them proper formulas for the cumulated losses Fn. -.- +FlashHashing & BackHashing = (m)eat the beef again : The basic idea of flashhashing is to compute only one hash h0 (as good as possible or available; here good means that its bits are quite independent) and then to derive all k hashes from it by means of one stepwise rotation of h0's bits per each hash. Since the above analysis was done under the assumption of k independent & uniform hashing functions I am now trespassing into the holy ground of those theoreticians who would claim that my flying machine cannot fly, as Professor Simon Newcomb, then the US "astronomer-royal" residing at Harvard, and Professor Lord Kelvin in the UK have claimed [Newcomb 1903]. To take their fear of flying away, I can say more than "Sirs, with all due respect, I have flown my FlashHash". Lets switch from BF1 to BF3 (or BF2), to really appreciate why BF3 is our favorite scheme. As said, BF3 or BF2 do not suffer from hashcrosstalk i.e. no bits are shared by j-th and i-th hashfuns if i <> j. This observation makes FlashHash and BackHash workable. As all k hashes are generated in FIXED ORDER, each works only its own j-th segment of BH3. It is possible that states S1 and S2 will generate the same k hashes, but the higher the k, the more unlikely it is that hashes will be generated in the same order, afterall the number of their permutations is k! (this is factorial, not an exclamation! :-) and only one of the k! permutations will match. The same in still other words: When two different states Si, Sj produce two different 0-th hashes h0i, h0j and FlashHash rotated h0i by Zi steps and h0j by Zj steps, then both hashes may become identical, and they both would match the same bits in a monolithic BF1, but not in a partitioned / segmented BF3 or BF2 where each hash has its own segment or partition. Against such mismatches in BF1 resulting from the rotated h0 works the COMBINATORIALLY IMPLOSIVE nonmatching logic of k! . The flashhashing fewliner is presented in edu-C+BP, the lingua franca of ancient algorithmicians, defined above as a mixture of BP-Sanskrit and C-Latin: Unsigned integers : If you have to use signed integers e.g. in a 2-complement arithmetic i.e. if abs(MinInt) <> abs(MaxInt) then you must put overflow checking (if any) off, and for speed put it off anyway, since flying as fast as possible without a parachute is the right fun, isn't it ? M1 = nr. of bits in a BF[0..M-1]; for BF1 is the modulus; h0 := h0 % M M = nr. of bits in each segment of a BF3 or BF2, where h0 := h0 % M1 j = the order number of the j-th hash produced per one call of HashFun(S) h0 = the 0-th hash produced by HashFun(S) called only once per S h = any hash subsequently (re)generated from h0 ; h could be h[j] w = number of bits in h0 and h i.e. hash-word width in bits; often w = 32 bits, or w = 64 bits for h0 hashpaired with h00. k = total number of hashes (not hashfuns' calls) to be used per item S ; c = number of bits shifted; 0 < c < w , the larger the c the better because mutually more distant bits will be more stochastically independent. Pre-check to make sure that: Use c such that k <= 1 + ((w-1) div c) and odd c < (w div 2) c prime w.r.t. w would guarantee that all w bits will be visited even in the case of i.e. c*k > w - 1. An item S is a chunk of data to be hashed, e.g. 
FlashHash for BF3 or BF2 (i.e. no hashcrosstalk), to be run only once per S
(see also the compilable C sketch at the end of this section). This version
and the one in the executive summary above are functionally equivalent but
coded differently :

d := w - c; { use max c < w/2 for k <= 1 +((w-1) div c), w = #bits in h0 }
M1 := M div k; { i.e. M := k*M1; }  M1min1 := M1 - 1;
h0 := HashFun(S); { store THIS h0 now, iff needed }
offset := 0;
h := h0 % M1; { h := h0 & M1min1; is faster }
{ Use/test/set BF3[h] for j = 1 , or store in h[1] := }
for j := 2 to k do { generates k > 1 hashes per item or state S : }
begin { max prime or odd c <= (max c allowed) will be the best c }
  h0 := (h0 << c) | (h0 >> d); { h0 rotated c bits to the left }
  offset := offset + M1;
  h := offset + (h0 % M1); { Use/set BF3[h] for the j-th hash, or: h[j] := }
  { or: h := offset + (h0 & M1min1); allowed iff M1 is a power of 2 }
end; { h := might be h[j] := ... ; use h before and in the loop }

BackHash is like FlashHash with h0 := HashFun(S) replaced with
h0 := FetchFromStore; (e.g. from a stack). Iff h0's are stored (e.g.
stacked, only one h0 per S), no HashFun(S) has to be redone for the S's
done so far, as BackHash can regenerate all k hashes per S from its single
h0 by bitshifts; or, surprise, even without h0, if flashhashing was done
with k*c = w e.g. 32, and if no final transformation like
h := offset + (h0 % M1) was applied, which would mean that a nonsegmented
BF1 would have to be used. This seductive save-no-h0 trick is not
recommended and therefore not discussed. Quiz: why? Having read so far,
the answer should not be too difficult.

FlashHash for a monolithic BF1 (with hashcrosstalk), to be run only once
per S :

{ M := k*M1; for comparisons of this BF1 with BF3 or BF2 }
{ Mmin1 := M - 1; for faster % iff M is a power of 2 ; see below }
d := w - c; { w = #bits in h0; use max c < w/2 for k <= 1 +((w-1) div c) }
h0 := HashFun(S); { store THIS h0 now, iff needed }
h := h0 % M; { h := h0 & Mmin1; is faster; use iff M is a power of 2 }
{ Use/check/set BF1[h] for j = 1 , or store in h[1] := }
for j := 2 to k do { generates k > 1 hashes per item or state S : }
begin { max prime or odd c <= (max c allowed) will be the best c }
  h0 := (h0 << c) | (h0 >> d); { h0 rotated by c bits to the left }
  h := h0 % M; { here use i.e. test/set BF1[h] for the j-th hash, or: h[j] }
  { or: h := h0 & Mmin1; is faster, but allowed iff M is a power of 2 }
end; { h := might be h[j] := ... ; use h before and in the loop }

FlashHash in edu-C for a monolithic BF1 with hashcrosstalk; run it once
per S :

/* This version allows concentration of the inline code at one place */
/* M = k*M1; for comparisons between BF1 and BF3 or BF2 */
/* Mmin1 = M - 1; for faster % iff M is a power of 2 ; see below */
d = w - c; /* w = nr bits in h0; use max c < w/2 for k <= 1 +(w-1)/c */
h0 = HashFun(S); /* store THIS h0 now, iff needed */
for j = 1 to k do /* generates k hashes per item or state S : */
begin /* max odd or prime c <= (max c allowed) will be the best c */
  if j > 1 then h0 = (h0 << c) | (h0 >> d); /* rotate bits */
  h = h0 % M ; /* put HERE the inline code using h */
  /* or: h = h0 & Mmin1; this is faster; use it iff M is a power of 2 */
end; /* h = might be h[j] = ... ; use h before and in the loop */
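For readers who prefer compilable code over edu-C+BP, here is one possible
C99 rendition of the BF3 version above. It is only a sketch under my own
assumptions: h0 comes from your HashFun(S), bf[] is a byte array holding
the M = k*M1 bits, and set_bit is my helper, not part of the original
fewliner :

#include <stdint.h>

static void set_bit(uint8_t *bf, uint32_t i) /* set the i-th bit of bf[] */
{ bf[i >> 3] |= (uint8_t)(1u << (i & 7)); }

/* Mark one item, whose 0-th hash is h0, in a segmented BF3 made of
   k segments of M1 bits each; w = 32 and d = 32 - c as in the text. */
void flashhash_bf3(uint8_t *bf, uint32_t h0, uint32_t M1, int k, int c)
{ int d = 32 - c;
  uint32_t offset = 0;
  int j;
  set_bit(bf, h0 % M1);              /* j = 1 : the 0-th hash, segment 0 */
  for (j = 2; j <= k; j++)
  { h0 = (h0 << c) | (h0 >> d);      /* rotate h0 left by c bits */
    offset += M1;
    set_bit(bf, offset + (h0 % M1)); /* the j-th hash in its own segment */
  }
}

For a membership test, replace set_bit by a test of the same k bits; an
item is reported as (possibly falsely) present only if all k bits are set.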
-.-
+More explanations and hints for programmers :
HashFun(S) above may be your favorite, or this Bytewise 1-liner (or
better) :
h0 := 1; for ix := loB to upB do h0 := (h0 << 8) + h0 + S[ ix ];
h0 := 0 would be ok, but h0 := 1 may be better. Quiz: why ?
FH & BH employ a repeated barrelshift which cyclically, step by step,
rotates the bits in the last word h0 by c bits to the left, i.e. by
(j-1)*c bits w.r.t. the original hash h0. Each turn of the loop yields one
of the k hashes per item S, so that FlashHash works as an ultraefficient
multihashing function. BackHash is even faster. If you need to remember
but do not want to store all k hashes per item (or per state when
generating too many states, e.g. for model checking in SPIN), then you can
do with storing only the 0-th hash h0 and recover, i.e. reconstruct, all k
hashes with BackHash, which is much faster than FlashHash because no
hashfun ever needs to be recomputed as long as the 0-th hash h0 from
FlashHash is still stored and readily accessible, e.g. on a stack or in a
queue (LIFO / FIFO).
Here is how to generate c and d for the chosen k (see also the C sketch
below). Either precompute c for h0's word width w = 32 bits thus:

for ctry := 1 to w do { huge M=k*M1 may need w = 64 bits i.e. h0 and h00 }
  if k > ( 1 + ((w -1) div ctry) ) then begin c := ctry -1; Goto cLa end;
cLa: { c = 13 or 11 are large enough distances for bits' independence }
if c > 13 then c := 13; { prevents c > 32/2, e.g. c = 31 for k = 2 }
if (c % 2) = 0 then c := c - 1; { odd c is better than even c }
d := w - c;

Or pre-set c as follows, with odd c preferred:

if k > 1 then
  case k of { the best c seems to be c = prime or odd: }
    2, 3: c := 13; 4: c := 9; 5: c := 7; 6, 7: c := 5;
    8: c := 4; 9, 10: c := 3;
    else: { here can be the "Either" loop from above }
  end;
{ c = 13 or 11 are sufficient distances for bits' independence }
if c > 13 then c := 13; { prevents c > 32/2, e.g. c = 31 for k = 2 }
if (c % 2) = 0 then c := c - 1; { odd c is better than even c }
d := w - c;
Check( 1 < k <= 1 + ((w - 1) div c ) );
Check( 1 < c ); { 1 << c makes hashes less dependent }
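The "Either" recipe above packs into one small C function; choose_c is my
name for it, and the fallback c = w - 1 (before capping) is my addition
covering the case where the loop never fires :

/* Pick the largest usable c for k hashes in a w-bit word, then cap it at
   13 and make it odd, exactly as in the recipe above. */
int choose_c(int k, int w)
{ int c = w - 1, ctry;
  for (ctry = 1; ctry <= w; ctry++)
    if (k > 1 + (w - 1) / ctry) { c = ctry - 1; break; }
  if (c > 13) c = 13;            /* 13 or 11 bits of distance suffice */
  if (c % 2 == 0) c = c - 1;     /* odd c is better than even c */
  return c;                      /* the caller then sets d = w - c */
}

E.g. choose_c(2, 32) = 13, choose_c(4, 32) = 9 and choose_c(5, 32) = 7,
matching the pre-set values above.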
By now it should be clear that FlashHash deserves its name: its blitzy
performance in time & space derives from the "incremental" re-computation
based on the computationally simple & blitzy bitshifting, which can be
justified by the not too simplistic reasoning about the k! permutations of
k hashes for k-segmented, hence "combinatorially imploded", Bloom filters
of the 2nd and 3rd kind, i.e. for BF2 and BF3 only. Negative tests on BF1
mentioned above have empirically confirmed this conclusion. Positive
results are in Tab. 3. By negative test results I mean that a flashhashed
BF1 will almost always make more false positives than a flashhashed BF3 or
BF2, but the difference does not need to be large. So far on FlashHash. No
more spinning about how to obtain k hashes in general and how to do it
fast & memory efficient in particular, i.e. how (not to have) to save them
all in some applications like e.g. in SPIN. Despite the "Never say never",
I firmly believe that there never was and never will be a more efficient
multihash algorithm than FlashHash & BackHash, which have reached the
absolute limit of what is possible.
-.-
+Bitwise rotations aka circular bitshifts aka barrelshifts :
The << , >> are the bitwise shiftLeft, shiftRight ( shL , shR in BPascal);
the filling bits are zeroes. The | is a bitwise OR in C, i.e. a bitwise OR
on integers in BP.
BarrelshiftLeft by c bits of an unsigned 32 bit integer h0:
d := 32 - c; h := (h0 << c) | (h0 >> d);
e.g. h := (h0 << 8) | (h0 >> 24);
e.g. h := (h0 << 1) | (h0 >> 31);
BarrelshiftLeft between two 32 bit words h0 and h00, taken AS IF they form
one w = 64 bit word (see its lower losses Fn in Tab. 3 above; a compilable
C version follows at the end of this section):
d := 32 - c;
h0 := HashFun1( S ); h00 := HashFun2( S );
t0 := h0 >> d; { t0 is a temporary aka shunt, like for a swap }
h0 := (h0 << c) | (h00 >> d);
h00 := (h00 << c) | t0;
h := offset + (h0 % M1); { for BF3. For BF1 use h := h0 % M; }
{ h := offset + (h0 & M1min1); iff M1 = 2^i ; is fastest }
BarrelshiftLeft by cL = 5 bits of a w = 20 bits integer bitstring within a
32 bits word h0, so that the 12 msbs are 000000000000 :
offbits := 32 - w; {= 12 bits to be put Off i.e. zeroed }
offvalue := 1048575; {= 2^w - 1 = twelve 0's followed by twenty 1's }
offpluscL := offbits + cL; {= 12+5 = 17 }
h := (((h0 & offvalue) << offpluscL) >> offbits) { safe Version 1 }
     | (h0 >> (32 - offpluscL));
or, iff we know that the 32-w msbits are all 0, then it suffices to do:
h := ((h0 << offpluscL) >> offbits) | (h0 >> (32 - offpluscL)); { V2 }
E.g. when we have computed a h0 of 32 bits and we want to test AS IF h0
had w = 20 bits only, we use Version 1 for the 1st hash j = 1, and V2 for
all further hashes j = 2 to k, per state S. Of course we could use
Version 1 for all j as well. A new S must start with Version 1.
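The two-word barrelshift above translates directly into C; rot_pair_left
is my name for it, and c must satisfy 0 < c < 32, since a shift by 32 bits
is undefined in C :

#include <stdint.h>

/* Rotate the 64-bit concatenation (h0:h00) left by c bits, kept in two
   32-bit halves exactly as in the fragment above. */
void rot_pair_left(uint32_t *h0, uint32_t *h00, int c)
{ int d = 32 - c;
  uint32_t t0 = *h0 >> d;           /* shunt: saves the top c bits of h0 */
  *h0  = (*h0  << c) | (*h00 >> d); /* h0 takes the top c bits of h00 */
  *h00 = (*h00 << c) | t0;          /* h00 takes the saved bits of h0 */
}

On a compiler with a native 64-bit unsigned integer the same effect is a
single rotation of one uint64_t; the two-word form is what a w = 32 bit
machine has to do.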
-.-
+On hashing functions :
If you know how to obtain multiple mutually almost independent hashes,
then Bloom filters are quite easy to program, but nevertheless less easy
to understand, i.e. to derive the BF-formulas and to choose the design
parameters so as to exploit their engineering trade-offs properly for a
near optimal design and operation. Good hashes are the keys to good BFs.
To obtain an excellent HashFun(S) just search the www for: "Bob Jenkins"
and hashing. Bob Jenkins works for Oracle and his hashfun is a funnelless
miracle. Please consult the www on how hashing funnels may harm. To try
out FlashHash you need just one hashfun. If you do not have one handy and
you do not want to study & test hard, you can use the quick & clean (yet
good enough for many not too critical applications) 1-liner operating
Bytewise on an item or a dataslab S[loB..upB] treated as a string or an
array of bytes to be hashed. As words consist of characters or bytes, and
bytes consist of bits, this hashfun is a pretty flexible 1-liner. When
hashing ASCII strings, masking away 1, 2 or 3 msb's will improve the
quality of the hashes and decrease the losses Fn and the failrate f.
h0 is an unsigned integer of, say, 32 bits. Put overflow checking, if any,
off to make sure that no overflow trap happens, and for speed anyway:
h0 := 1; for ix := loB to upB do h0 := (h0 << 8) + h0 + S[ ix ];
h0 := 0 might be worse. Quiz: guess why?
Specialized 1-liners { masking msb's off with & } :
for strings in ASCII < 128 decimal: (h0 << 7) + h0 + (S[ix] & 127);
for smaller subsets of chars, e.g.: (h0 << 6) + h0 + (S[ix] & 63);
for even smaller subsets, e.g.:     (h0 << 5) + h0 + (S[ix] & 31);
where some chars may be pre-mapped (i.e. pre-converted to other,
preferably unused or rarely used characters), which does not harm and
actually is likely to improve the hashes and to lower the faults Fn, f .
On an unsigned integer my 1-liner is just a fast way of repeatedly
computing its mathematical equivalent h0 := h0*257 + S[ix]; note that 257
is a prime number (this is not so important as long as it is an odd
number. Quiz: why odd ?), but do not forget that the goal of hashing is
not math, but an (ir)reversible compression of a chunk S into its much
shorter, hence generally non-unique, signature aka fingerprint called a
hash, i.e. here the final numerical value of h0, which is used for
accessing-by-indexing an array called a hash table (of bits for BFs). If
the range of h0 (e.g. 2^32 - 1 = 4294967295 bits = 536870912 bytes) is
larger than the highest index M-1 of the last bit in the hashtable[0..M-1]
of bits, then: if M is a power of 2, i.e. M = 2^Int where Int > 0, the
final hash value of h0 can be reduced to h0/2 (or h0/4 etc) by zeroing its
msb(s); the AND-masking will do it: h0 := h0 & mask; where
mask = 0..0111..111 ; else clip the final h0 with the post-final
h0 := h0 % M1; where M1 is preferably a prime not too close to such
"round numbers" as the powers of 2, 3, .., 10, etc.
If k different hashfuns are needed, then we may obtain them as follows
(see the C sketch after this recipe):
Precomputations (done only initially, only once per session):
1st: generate a sequence of the numbers loB..upB, stored in
     Seq[1, loB..upB];
2nd: initialize a pseudorandom generator of integers;
3rd: generate k pseudorandomly permuted sequences of the nrs loB..upB:
     for j := 1 to k do pseudorandomly Swap the elements in
     Seq[j, loB..upB] until k pseudorandom sequences (each containing all
     the numbers loB..upB, permuted) are stored in Seq. Note: shake well,
     mix well, i.e. do rather more swaps than less, say,
     10*(upB - loB + 1) or more swaps.
The hashing proper (done repeatedly, k times per hashed item):
4th: the j-th hash is obtained by scanning & hashing the item S[loB..upB]
     in the order given by its precomputed pseudorandom Seq[j, loB..upB].
This simple process yields k fairly independent hashfuns, albeit at the
price of considerable CPU-time spent. Such an overhead may be small for
I/O-bound applications which do lots of reading/writing (e.g. of strings)
from/to a disk, which is 1000s of times slower than a fast memory. However
for CPU-bound "generative" applications, like e.g. model checking in SPIN,
which generate millions of states each hundreds of bytes long, their
(re)computation and storage are the dominating performance factors. For
such apps I developed flashhashing, which of course can be used in
I/O-bound apps too.
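The 4-step recipe above, sketched in C (the names Seq, make_seqs and
hash_perm are mine; rand() stands in for your favorite pseudorandom
generator, and K, LEN are example sizes, not prescriptions) :

#include <stdlib.h>
#include <stdint.h>

enum { K = 4, LEN = 64 };     /* k hashfuns, bytes per item S */
static int Seq[K][LEN];       /* k pseudorandomly permuted scan orders */

void make_seqs(void)          /* precompute only once per session */
{ int j, i, s;
  for (j = 0; j < K; j++)
  { for (i = 0; i < LEN; i++) Seq[j][i] = i;  /* 1st: identity order */
    for (s = 0; s < 10 * LEN; s++)            /* 3rd: shake well, mix well */
    { int a = rand() % LEN, b = rand() % LEN, t;
      t = Seq[j][a]; Seq[j][a] = Seq[j][b]; Seq[j][b] = t;
    }
  }
}

uint32_t hash_perm(const uint8_t S[LEN], int j) /* 4th: the j-th hashfun */
{ uint32_t h0 = 1;
  int i;
  for (i = 0; i < LEN; i++)   /* scan S in the j-th permuted order */
    h0 = (h0 << 8) + h0 + S[ Seq[j][i] ];
  return h0;
}

The 2nd step is a single call of srand(...) before make_seqs; any seed will
do, as long as it is fixed, so that the hashfuns stay reproducible.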
-.-
+Summary of FlashHashing (FH) :
FH requires you to have only one (very) good hashfun producing h0, and it
will (re)generate more decent hashes from it in no measurable time. FH
makes it unnecessary to store more than one hash h0, because all k hashes
per state or item can be recovered from h0, in no measurable time, with a
fewliner called BackHash, which is almost identical to FlashHash but much
faster, as BackHash does not really execute any hashfun; it just recovers
the hashes from the one h0 stored per state S. Of course the hashes will
not be as independent as they could be if generated separately, but this
suboptimality can be partially compensated by increasing the number k of
the hashes used by 1. Just remember the common-sensical engineering rules
of thumb:
"The bigger the M = k*M1, the larger the w needed for a low Fn."
"The larger the constant c, the more independent the hashes will be"
(it rhymes, so our Alzheimers can easily remember it :-)
For example if SPIN model checkers want k = 10 or 11 hashes, then they can
use c = 3 if the unsigned integers h0, h have 32 bits. The maximal c can
be easily precomputed or predefined for k = 2, 3, 4, .., 10, .. up to 32,
in which case it must obviously be the poor c = 1. The general condition
which must be satisfied is k <= 1 + ((w - 1) div c), hence we may do:

if k < 1 + ((w - 1) div c) then
begin Increase k by 1 and use as the last extra hash:
  h0 := (h0 << (w - 1)) | (h0 >> 1); { e.g. h0 := (h0 << 31) | (h0 >> 1); }
end.

In fact c might vary (according to a fixed reproducible pattern!) during
the scan, but constants make our lives easy and computations breezy :-)
-.-
+From mission impossible via combinatorial implosion to mission
accomplished :
It would be no fun to write & read this e-paper in the conventional dull,
historically falsified style using the reductionistic terms "merely" and
"nothing but", or the phrases I hate, like "it obviously/easily follows",
since not much is obvious to anybody but the few genuine geniuses
belonging to the category of Ideal Observers (IO) or Ideal Receivers (IR)
[Kahre 2002]. For such Superior Beings (almost) all mathematics is nothing
but a rather trivial tautology not worth pursuing. Having to rely most of
the time on received wisdoms, I keep being surprised by what finally shows
up to be possible in science and technology in general and in IT in
particular, as IT is less limited by "material" restrictions than most
other technologies. I am writing this section because initially I believed
that for high quality Bloom filtering it is impossible to avoid computing
k separate hashfuns. I believed that because I had had my take of papers
by theoreticians. However, a few times before this, I have done things
deemed impossible:
- Beating down the combinatorial explosions during protocol verification
by APPROVER = the first Automated & Proven Protocol Verifier, introduced
for the first time at the Lake Arrowhead Workshop [Hajek 1977, 1978a].
- In 1988 I beat a combinatorial explosion down again when generating one
solution of the N nonattacking queens problem for all chessboards up to
N = 1800 (by 1800) on an XT PC, while the then world record (on a PC ?)
was N = 96, by the ex-Stanford professor Dr. Harold Stone, then the
manager of the advanced architectures studies at the IBM Thomas J. Watson
Research Center [Stone & Stone 1987], [Hajek 1990]. My bad luck was that
at about the same time two groups of researchers succeeded in turning this
NP?-problem (yes, I have asked professor Stephen Cook at the University of
Toronto, who should be called Dr. NP) into N.P. i.e. into No Problem,
finally for N = 3M [Sosic & Gu 1990], and the NASA researchers working on
the scheduling problem for the Hubble telescope [Minton, Johnston,
Philips, Laird 1990], albeit on fast workstations with lots of memory,
not on an XT with an unexpandable 520 kBytes, which was my absolutely
limiting factor. Again a mission impossible has become a routine (for the
cognoscenti). Unless you are an expert on CSPs, please feel free to try
N-queens for N > 100 to get a feeling for what I am talking about.
- About 1990 I decided to try my hand & head on the problem of in-flight
deletions from open addressed hash tables, which had been described as if
impossible without a complete rehashing: "The situation can be improved,
by periodically rehashing the table" [Standish 1980, pp. 157-9]; "The only
solution to the problem of degraded search times is the ultimate (total)
reconstruction of the table by a process called rehashing"
[Reingold 1986, p. 406 (1983, p. 361)].
Despite being explicitly warned by these good professors, and implicitly
by others who offered no true solution to such true deletions, I succeeded
in constructing a simple but not simplistic algorithm for in-flight
deletions. Weeks later I found its basic idea described, though not worked
out (guess where? :-). Only 12 years later, when adding this section to
this e-paper, I noticed a hint (but no more) in [Reingold 1986 & 1983,
section 7.4, Exercise 13].
- These three examples, all somehow related to the task at hand, suffice.
I am writing up these experiences mainly to show that one (wo)man's
impossibility is another (wo)man's possibility or even probability, and
that especially in algorithmics there are few known boundaries to human
ingenuity. After all, the Laws of InfoProcessing (LIPs :-) still wait to
be uncovered by an info-Ein Stein, but see [Kahre 2002] for a nice try.
Recalling these hands on-&-in experiences, and still believing in the
impossibility of avoiding the computation of k hashfuns as independent as
possible, I searched for an indirect inspiration by re-scanning the works
referred to further in this section. Amazingly, it did not take more than
keeping the problem in mind for a week before I saw the light. There are
impossibilities and "impossibilities"; see [Richards 1975] who starts with
a children's verse: "While one man was proving it couldn't be done,
another man tried and he did IT." See also [Garfield 1977] together with
[Newcomb 1903] reprinted there. More "down to the PC" remarks are in
[Knuth 1977, p. 69]: "interpolation ... works better in spite of the proof
that binary search is best", followed on p. 80 by the Conclusions of this
most eminent professor emeritus (Stanford) of algorithmics: "Even the
'best possible' algorithm can sometimes be improved if we change the
ground rules." FlashHash does change the classical rules; read on to find
out which one. Concerning the classical approaches: "The word 'Classical'
means only one thing in science: it's wrong!" is how [Frieden 1983, 1991,
2001, the 1st page of chap. 16] has paraphrased J. Robert Oppenheimer, the
co-father of our nuclear age. Oppenheimer has also said that "There are
children playing in the street who could solve my top problems in physics,
because they have modes of perception that I lost long ago." His statement
fits perfectly with that of the zen master Shunryu Suzuki in his book Zen
Mind, Beginner's Mind: "In the beginner's mind there are many
possibilities, but in the expert's there are few. ... You should be rather
grateful for the weeds you have in your mind, because eventually they will
enrich your practice. ... True understanding is actual practice itself."
Om Mani Padme Hum. Om Wagi Shori Mum. Om Vajra Pani Hum.
In this e-paper the "classically impossible" was reached :
- by the literally "recycled randomness" (which may seem to be an
oxymoron), i.e. by stepping over the mental threshold of the classical
requirement of k independent hashfuns, by a relativization of their
independence w.r.t. the observer, be it a human, an animal, a machine or
an abstract Gedanken-It. In fact any programmed randomness is an oxymoron
as well, as it is pseudorandom anyway. It is my firm opinion that the
requirement of randomness for applications should be defined w.r.t. the
observer, who does NOT (!) have to be the omniscient Ideal Observer (IO,
like e.g. the well known Maxwell's demon or Laplace's demon), quite
similar to the Ideal Receiver (IR) of Bar-Hillel & Carnap employed by
[Kahre 2002], as long as we do not deal with such 3-letter word
Organizations like e.g. the KGA (= Kommietet of Gruesome Adversaries) and
other Manichean devils [Wiener 1954, chap. II & XI, spelled Manichaean].
By not being paranoid when not necessary, i.e. by assuming only entropic
demons of confusion and not of willful malice, we can compensate for the
weak spots in the "recycled randomness" by employing what I call
"combinatorial implosion", explained next. Manichean demons can be coped
with by employing miniMax strategies which minimize the maximal possible
losses. The price paid for such (over)cautious miniMaxing is average
losses higher than with other, less conservative strategies.
- by incremental computations pushed to the extreme, to cyclicity, which
perfectly fits with my "recycled randomness" and which allows blitzy &
simple constructions of multiple hashes, and their reconstructibility or
recoverability by means of literal recycling.
- by recognizing that there is "hashcrosstalk" in the usual BF (BF1), and
by the key insight that the "no crosstalk" property of the segmented
Bloom filters of the 2nd and 3rd kind (BF2 and BF3) can be exploited
because of the k! permutations of potential matches (of which always only
one will actually match, i.e. the chances of an unwanted pseudo-match
decrease hyperexponentially with k). More often than not the combinatorial
expressions like k! (which is about the worst simple one we can get) are
our enemies which we have to fight in combinatorially explosive tasks
(e.g. in constraint satisfaction problems aka CSPs, like the mentioned
classical N-queens problem). But here the k! is our ally which works
against the weaknesses inherent in the "recycled hashes". Hence my new
terminus technicus "combinatorial implosion". It would be interesting to
try to employ it on a wider basis, i.e. to try to attack algorithmic
problems by an explicit or even systematic reference to "combinatorial
implosion".
The synergistic effect of these elements is that k hashes are recycled,
and if necessary recovered, from a single result h0 of a single call per
S of one HashFun(S) only.
I started this e-paper with the necklace metaphor inspired by Hermann
Hesse's poems, containing such jewels as e.g. the following fragments:
"It is a game whose rules are nice and neat."
.......
"Those hieroglyphs once so significant         { Those progs ... }
 that now are only colored bits of glass,      { bits of chips }
 he lets them roll until their force is spent  { roll i.e. recycle bits }
 and silently they vanish in the sand."        { silicon is made from sand }
.......
"Proof that for every color may be found       { .. for every state S be found }
 in music a proper corresponding key."         { in an algorYthm a ... hashkey }
.......
"The pattern sings like crystal constellations,  { crystal clear bitpatt }
 and when we tell our beads, we serve the whole, ..." { we count our bits }
Telling i.e. counting the beads (or bits :-) is what The Tibetan Book of
the Dead mentions exactly at what I have identified as its culmination
point: "Even at the time that the pebbles are being counted out, be not
frightened, nor terrified; tell no lies; and fear not the Lord of Death.
..... The Lords of Death (aka the Executive Furies) are thine own
hallucinations. ... Apart from one's own hallucinations, in reality there
are no such things existing outside oneself like Lord of Death, or god,
or demon, ... Act so as to recognize this."
-.-
+References [ refs ] :
Q: Which refs are books and which are papers ?
A: Unlike in the titles of the (e)papers listed here, in the titles of
   books and periodicals all words start with a CAP, except for the
   insignificant ones like e.g.: a, and, der, die, das, for, from, in,
   of, or, to, the, with, etc.
Bloom B.H.: Some techniques and trade-offs affecting large data base
  retrieval times, Proc. ACM Nat. Conf., 24, 1969, 83-95.
Bloom B.H.: Space/time trade-offs in hash coding with allowable errors,
  CACM 13/7, July 1970, 422-426.
Carter L., Floyd R., Gill J., Markowsky G., Wegman M.: Exact and
  approximate membership testers, Proc. 10th Annual ACM Symp. on Theory
  of Computation, San Diego, 1978, 59-65; they refer to Bloom 1970.
Cleary J.G.: Compact hash tables using bidirectional linear probing,
  IEEE Trans. on Computers, C-33/9, Sept. 1984, 828-834.
Czech Z.J. et al: Perfect hashing, Theoretical Computer Science, vol. 182,
  1997, 1-143 = a strong but long story for the strong on long weekends :-)
Frieden Roy B.: Probability, Statistical Optics, and Data Testing, 1st ed.
  1983; the 2nd ed. 1991 has a new chap. 17; in the 3rd ed. of 2001 start
  reading the vastly expanded chap. 17 on pp. 413-etc.
Garfield Eugene: Negative science and "The outlook for the flying
  machine", Current Comments in his Current Contents, Nr. 26, June 27,
  1977, 155-166; followed by [Newcomb 1903].
Gremillion L.L.: Designing a Bloom filter for differential access,
  CACM 25/9 (1982), 600-604.
Hajek Jan: the 1st presentation of APPROVER = Automated & Proven PROtocol
  VERifier at the Lake Arrowhead Workshop, California, August 1977. From
  its very beginning APPROVER was sufficiently advanced to perform
  reachability analysis of protocols specified as procedures in Burroughs
  Algol, so that the much more primitive finite state models (FSA) had to
  be translated into Algol and not vice versa. ARPA's (later: DARPA) TCP =
  Transmission Control Program (later: Protocol) was specified by Dr. Carl
  Sunshine, Rand Corp., and design bugs were found in TCP by APPROVER
  [Sunshine 1978]. ARPANET was the mother of the Internet, still running
  on TCP.
Hajek Jan: Progress report on the automatic and proven protocol verifier,
  Computer Communication Review, ACM SigComm 8/1, January 1978, 15-16.
Hajek Jan: Automatically verified data transfer protocols, Proc. 4th Intl.
  Computer Communications Conference, Kyoto, Sept. 1978, 749-756, with 4
  protocols and their 4 Effective Progress/Stagnation graphs aka
  P/S-graphs. On p. 752, left column, 2nd line from below: please fix the
  otherwise nonexistent ":= phaseA" into the correct ":= phasesA".
Hajek Jan: Protocols verified by APPROVER, Computer Communication Review,
  ACM SigComm 9/1, January 1979, 32-34.
Hajek Jan: A new dynamic backtracking algorithm for constraint
  satisfaction problems, technical report CC-Note 50, 1990, 18 pages.
Holzmann Gerard J.: An improved protocol reachability analysis technique,
  Software Practice and Experience, 18/2, Feb. 1988, 137-161; on pp. 144
  and 161 refers to [Morris 1968] and to [McIlroy 1982].
Holzmann Gerard J.: Algorithms for automated protocol verification, AT&T
  Technical Journal, Jan./Feb. 1990, 32-44.
Kahre Jan: The Mathematical Theory of Information, 2002.
Knuth Donald E.: The Art of Computer Programming, vol. 3: Sorting and
  Searching, 1973; see his index for Superimposed coding, and for the
  Bloom method, for which the correct approximate yet sufficiently precise
  formula is on p. 562 mid in the 1st edition, and on p. 573 up in the
  2nd edition.
Knuth D.E.: Algorithms, Scientific American, April 1977, 63-80.
Lampson B.W., Sproul R.F.: An operating system for a single-user machine,
  Proc. 7th Symp. on Operating Systems Principles, Dec. 1979, 98-105.
McIlroy M.D.: Development of a spelling checker, IEEE Trans. on Commun.,
  COM-30/1, Jan. 1982, 91-99; hashing, superimposed codes, Bloom, Morris,
  Floyd & Gill on p. 95 and in the references, where [10] is Knuth vol. 3.
Minton Steven, Johnston Mark D., Philips Andrew B., Laird Philip: Solving
  large scale constraint satisfaction problems using a heuristic repair
  method, International Joint Conference on Artificial Intelligence, 1990,
  IJCAI-90, 17-24.
Morris R.: Scatter storage techniques, CACM 11/1, 1968, 38-44.
Mullin J.K.: A second look at Bloom filters, CACM 26/8 (1983), 570-571.
Murray D.M.: A scatter storage for dictionary lookups, Journal of Library
  Automation, Sept. 1970, 173-201; mathematical analysis on pp. 177-183,
  virtual hashing analyzed on pp. 180-183 by a computer scientist from
  Cornell University.
Newcomb Simon: The outlook for the flying machine, The Independent - A
  Weekly Magazine, Oct. 22, 1903. Reprinted in [Garfield 1977, 167-172].
  Reco: search the www for Newcomb AND flying.
Nix Robert: Experience with a space efficient way to store a dictionary,
  CACM 24/5, 1981, 297-298; refers to Carter et al., describes a Bloom
  filter without directly referring to Bloom (but Carter et al do refer
  to Bloom).
Ramakrishna M.V.: Practical performance of Bloom filters and parallel
  free-text searching, CACM 32/10 (1989), 1237-1239; the reference 3. to
  Gremillion should be Sept. 1982 and not July 1982.
Reingold Ed.M., Hansen W.J.: Data Structures in Pascal, 1986 (there was a
  1983 edition); section 7.4.3 on Multiple hash functions, 410-413; their
  math on pp. 412-413 starts with the assumption that a maximum amount of
  entropy is good (for Bloom filters, but Bloom is not named).
Richards Ian: Impossibility, Mathematics Magazine, Nov.-Dec. 1975, 249-262.
Robert Ch.S.: Partial-match retrieval via the method of superimposed
  codes, Proceedings of the IEEE, 67/12, Dec. 1979, 1624-1642.
Shannon C.E.: The Mathematical Theory of Communication, 1948, 1949.
Sincovec R.F., Wiener R.S.: Data Structures Using Modula-2, 1986,
  pp. 396-408 on probabilistic hashing and on virtual hashing.
Sosic Rok, Gu Jun: A polynomial time algorithm for the N-queens problem,
  ACM Sigart Bulletin, vol. 1, no. 3, 1990, 7-11. Two more articles in
  ACM Sigart Bulletin, vol. 2, no. 2, 1991, 8 and 22-24. Further in
  ACM Sigart Bulletin, vol. 3, no. 1, 1992, 8-12.
Stallings W.: Password generation by Bloom filters, Dr. Dobbs Journal,
  Aug. 1994, 119-etc; the formulas are ok, but fig. 2 is not.
Standish Th.A.: Data Structure Techniques, 1980.
Stone Harold S., Stone Janice M.: Efficient search techniques - An
  empirical study of the N-queens problem, IBM J. Research & Development,
  vol. 31, no. 4, July 1987, 464-474. Private communication by letters
  with Harold Stone in 1988 & 1989.
Sunshine Carl A.: Survey of protocol definition and verification
  techniques, ACM Computer Communication Review, vol. 8, no. 3, July 1978,
  35-41; see Table 1 for APPROVER & Hajek & ARPA TCP on p. 39.
Vaihinger Hans: Die Philosophie des Als Ob - Volks-Ausgabe, 1923.
Wiener Norbert: The Human Use of Human Beings - Cybernetics and Society,
  2nd (expanded) edition 1954 (1st in 1950).
Wiener Richard S.: An efficient virtual hash algorithm for a spelling
  checker, Journal of Pascal, Ada & Modula-2, Jan.-Feb. 1986, 23-29; with
  modules programmed in Modula-2.
Wolper Pierre, Leroy Denis: Reliable hashing without collision detection,
  in Computer Aided Verification, Proc. Int. Workshop, Elounda, Crete,
  Lecture Notes in Computer Science, vol. 697, Springer-Verlag, June 1993,
  59-70.
Wolper Pierre, Stern Ulrich, Leroy Denis, Dill David L.: Reliable
  probabilistic verification using hash compaction, on the www only,
  30 pp. (temporarily withdrawn by Ulrich Stern)
-.-