
Analytics

The current version of FireHose has three analytics:

  1. anomaly1 = anomaly detection in a fixed key range
  2. anomaly2 = anomaly detection in an unbounded key range
  3. anomaly3 = anomaly detection in a two-level key space

Source code for these is in the tarball directory "analytics". Each sub-directory has code for multiple versions of the analytic, developed in different languages and for different streaming frameworks. Instructions for how to build and run the analytics are given in the Running the benchmarks section.

These are the implementations we currently provide. Over time, we hope to add more implementations of the analytics, written for various streaming frameworks, as users contribute them.

As explained below, each analytic is required to perform a specific computation on the streaming data it receives via a UDP port from one of the stream generators described in the Generators section. The format of the stream is defined by the generator.

An implementation of the analytic should follow the rules discussed below. Most importantly, it should be robust to the possibility it will miss packets in the stream which arrive before the analytic is ready to read them. This is because the generators can be run at high rates, which may outpace the analytic.
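As a minimal illustration, a bare-bones receive loop in Python might look like the sketch below. The port number, buffer sizes, and process() helper are placeholders of ours, not part of any reference implementation; enlarging the kernel receive buffer is one common way to reduce (though not eliminate) dropped packets.

  import socket

  UDP_PORT = 55555   # placeholder; must match the port the generator sends to

  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  # Ask the kernel for a large receive buffer; a fast generator can
  # otherwise overflow the default buffer and drop packets unread.
  sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 16 * 1024 * 1024)
  sock.bind(("", UDP_PORT))

  while True:
      packet, _addr = sock.recvfrom(65536)   # one UDP packet of datums
      process(packet)                        # hypothetical per-packet handler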

If you wish to implement an analytic yourself, or to run a benchmark on your hardware, you are free to implement it in a different manner from the provided code. But you need to perform the computations outlined below and adhere to the following guidelines, at least if you wish to contribute your implementation to the code archive and your performance results to the Results section.

Any additional analytic-specific rules are discussed below.



(1) Anomaly detection in a fixed key range

This analytic is designed as a simple test of anomaly detection; keys are flagged as unbiased or biased (anomalous) depending on the values associated with an individual key as it appears multiple times. Only a small fraction of the keys are biased.

It is designed for use with the biased-powerlaw generator, which produces packets of datums, each of which is an ASCII text string in the following format:

1000582694,0,1 

The first field is a "key", which can be represented internally as a 64-bit unsigned integer. The second field is a "value" of 0 or 1. The third field is a "truth" flag, also 0 or 1.
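For illustration, a datum in this format could be parsed in Python along these lines. The assumption that datums are newline-separated within a packet is ours; the split would need to match however the generator actually delimits them.

  def parse_datum(line):
      # "1000582694,0,1" -> (key, value, truth)
      key, value, truth = line.split(",")
      return int(key), int(value), int(truth)

  def parse_packet(packet):
      # assumes one ASCII datum per line within a packet
      return [parse_datum(line) for line in packet.decode("ascii").splitlines()]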

The computational task for this analytic is to identify biased keys. An unbiased key will have 0 versus 1 values appear with equal probability. A biased key will have 0 versus 1 values appear in a skewed 15:1 probability ratio.

Detection can be performed by keeping track of the values seen for each key, e.g. in a hash table of fixed size (see below). When an individual key has been seen 24 times, the key is labeled "biased" if a value of 1 appeared 4 or fewer times; otherwise the key is labeled "unbiased". When the decision is made, the answer is checked against the "truth" flag appended to each datum. The large majority of the results will be keys correctly flagged as unbiased (true negatives). The other 3 cases are rarer; the count for each of the 4 categories is tallied by the analytic:

true anomaly = a biased key flagged as biased
false positive = an unbiased key flagged as biased
false negative = a biased key flagged as unbiased
true negative = an unbiased key flagged as unbiased

The analytic also emits a line of output when a detection event in one of the first 3 categories occurs, in this format, where the last field is the key:

true anomaly = 12345
false positive = 34567
false negative = 67890 

Note that due to statistical variation in the way the generator produces random values, most of the 3 rare outcomes will be true anomalies that are correctly detected, but about 20% of the rare cases will be false positives, with an occasional false negative as well.
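That 20% figure follows from the binomial tails of the two value distributions combined with the rarity of biased keys. Assuming the 24 values for a key are drawn independently, the per-key error probabilities can be checked with a short side calculation in Python (not part of the analytic itself):

  from math import comb

  n, limit = 24, 4   # decide after 24 values; biased if 1 appears <= 4 times

  # unbiased key: p(value == 1) = 1/2, so a false positive needs <= 4 ones
  p_fp = sum(comb(n, k) for k in range(limit + 1)) / 2**n   # ~7.7e-4

  # biased key: 0 vs 1 in a 15:1 ratio, so p(value == 1) = 1/16
  p = 1 / 16
  p_fn = 1 - sum(comb(n, k) * p**k * (1 - p)**(n - k)
                 for k in range(limit + 1))                 # ~0.015

Though the per-key false-positive probability is tiny, unbiased keys vastly outnumber biased ones, so false positives still amount to roughly a fifth of the rare detection events.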

The analytic also keeps track of the following statistics which are printed in the following format at the end of the benchmark:

packets received = 999986 
datums received = 49999300 
unique keys = 85578 
max occurrence of any key = 16709489 
true anomalies = 34
false positives = 6
false negatives = 0 
true negatives = 3220 

A reasonable way to implement this analytic is with a hash table that stores a running tally of the values seen for each unique key. Note that the analytic does not know how many times an individual key will appear in the stream, though it need not retain a key after it has been seen 24 times.
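A minimal Python sketch of this approach follows; the names, and the assumption that a truth flag of 1 marks a biased key, are ours rather than the reference implementation's.

  from collections import defaultdict

  SEEN_LIMIT, ONES_LIMIT = 24, 4

  table = defaultdict(lambda: [0, 0])   # key -> [times seen, count of 1 values]
  stats = {"true anomalies": 0, "false positives": 0,
           "false negatives": 0, "true negatives": 0}

  def observe(key, value, truth):
      counts = table[key]
      counts[0] += 1
      counts[1] += value
      if counts[0] == SEEN_LIMIT:            # decide once, on the 24th occurrence
          flagged = counts[1] <= ONES_LIMIT  # few 1 values -> flag as biased
          if flagged and truth:
              stats["true anomalies"] += 1
          elif flagged:
              stats["false positives"] += 1
          elif truth:
              stats["false negatives"] += 1
          else:
              stats["true negatives"] += 1
          # the tally could now be replaced by a small "decided" marker,
          # since the key's values no longer need to be retained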

Optional command-line switches in the serial reference implementations (C++ and Python):

Additional implementation rules:



(2) Anomaly detection in an unbounded key range

This analytic is also designed as a simple test of anomaly detection; keys are flagged as unbiased or biased depending on the values associated with an individual key as it appears multiple times. However the number of unique keys in the stream is effectively infinite for long benchmark runs, so the analytic must decide which keys to store state for.

It is designed for use with the active set generator, which samples from an unbounded range of keys.

In benchmark runs using this analytic, huge numbers of unique keys will be generated, which means it will be inefficient to use a hash table to store all of them, unless some provision for expiring keys is implemented (see below).

Aside from this change in the distribution of keys in the data stream, the computational task of detecting biased keys is identical to the description for analytic #1, as is the format of output the analytic should generate. The only exception is that the final output for "max occurrence of any key" is no longer a relevant value to tally.

A reasonable way to implement this analytic is with an expiring hash table that attempts to keep track of values for keys in the active set of the generator, e.g. via a least-recently used (LRU) policy, where the key selected for expiration (removal from the hash table) is the one seen longest ago in the stream. As discussed below, the size of the set of active keys is limited by the generator, a fact the analytic can use to age keys efficiently. Note however that the analytic does not know how long an individual key will remain in the generator's active set or how many times it will appear in the stream, though it need not retain a key after it has been seen 24 times.
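One way to sketch such an expiring table in Python is with an OrderedDict treated as an LRU structure; the capacity below is a placeholder of ours that would be sized relative to the generator's active set.

  from collections import OrderedDict

  CAPACITY = 131072      # placeholder; chosen relative to the active-set size

  table = OrderedDict()  # key -> [times seen, count of 1 values], in LRU order

  def lookup(key):
      if key in table:
          table.move_to_end(key)           # key is now the most recently seen
      else:
          if len(table) >= CAPACITY:
              table.popitem(last=False)    # expire the least recently seen key
          table[key] = [0, 0]
      return table[key]

Note that evicting a key before its 24th occurrence discards its partial tally, so sizing the table against the generator's active set limits how often that happens.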

Optional command-line switches in the serial reference implementations (C++ and Python):

Additional implementation rules:



(3) Anomaly detection in a two-level key space

This analytic also performs anomaly detection; a set of keys are flagged as unbiased or biased depending on the values associated with an individual key as it appears multiple times. However, the key/value pairs for which the bias detection is done are derived from state information that must be stored for a different set of keys, namely those which appear in the data stream. This requires two stages of storage and lookup, in different hash tables.

It is designed for use with the two-level generator, which produces packets of datums, each of which is an ASCII text string in the following format:

3929309884956,57364,1 

As explained on the doc page for the two-level generator, the first field is an "outer" key (64-bit unsigned integer), the second field (16-bit unsigned integer) is a portion of an "inner" key, and the third field is a "counter" for how many times the outer key has been emitted.

The first 4 times an outer key is seen, the analytic should store the 4 second-field values. The fifth time the outer key is seen, the 4 stored values are concatenated (right to left, i.e. low-order to high-order) into a new "inner" key, which is also a 64-bit unsigned integer. The value for the inner key is the rightmost digit of the second field of the fifth occurrence of the outer key, which is a 0 or 1.
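For illustration, assuming the first stored value occupies the low-order 16 bits (our reading of "right to left" above), the concatenation could be written in Python as:

  def make_inner_key(values):
      # values: the four 16-bit second fields, in arrival order
      inner = 0
      for i, v in enumerate(values):
          inner |= v << (16 * i)   # first value -> low-order bits, and so on
      return inner                 # fits in a 64-bit unsigned integer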

The inner key/value pairs are treated the same as the key/value pairs in the first analytic and checked for bias. I.e. when an individual inner key has been seen 24 times, the inner key is labeled "biased" if a value of 1 appeared 4 or fewer times; otherwise the inner key is labeled "unbiased". When the decision is made, the answer is checked against the "truth" flag, which was the leftmost digit in the second field of the fifth occurrence of any of the outer keys which generated this inner key.

The four categories (true anomalies, false positives, false negatives, true negatives) of inner keys, and the format of output the analytic should generate, are the same as described above for the anomaly1 analytic. As in the second analytic, the only exception is that the final output of "max occurrence of any key" is no longer a relevant value to tally.

Note that if packets are dropped, the analytic should not try to construct inner keys from incomplete information. The third counter field in the outer key datum can be monitored to ensure a specific outer key is seen all 5 times it is needed.

A reasonable way to implement this analytic is with two expiring hash tables, one for outer keys and another for inner keys, and to expire keys from each via a least-recently used (LRU) strategy, as explained for the anomaly2 analytic. Outer keys will be emitted at most 5 times (but may be emitted fewer); individual inner keys will be encoded in the stream an unknown number of times.
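A compact sketch of the two-level flow is below, with LRU eviction omitted for brevity. The helper names, the assumption that the counter field starts at 1, and the two-digit decoding of the fifth second field are ours, for illustration only.

  outer = {}   # outer key -> list of up to 4 stored 16-bit values
               # (inner keys go in a second expiring table, as in anomaly2)

  def observe(okey, field, counter):
      vals = outer.setdefault(okey, [])
      if counter != len(vals) + 1:          # a packet was dropped somewhere:
          outer.pop(okey, None)             # discard the incomplete state
          return
      if counter <= 4:
          vals.append(field)                # accumulate the inner-key pieces
      else:                                 # fifth occurrence: build the inner key
          ikey = make_inner_key(vals)
          truth, value = divmod(field, 10)  # assumed two digits: truth, then value
          update_inner(ikey, value, truth)  # same 24-occurrence test as anomaly1
          del outer[okey]                   # this outer key is finished

Here update_inner() would apply the tally-and-decide logic sketched for anomaly1, with inner keys stored in an expiring table like the one sketched for anomaly2.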

Optional command-line switches in the serial reference implementations (C++ and Python):

Additional implementation rules: