Engineering & Data Science - Bamieh Tech blog

Detect anomalies in your data with Elasticsearch & Kibana

Ahmad Bamieh — Wed, 02 Feb 2022 19:20:15 +0200

A three-part series covering anomaly detection using Elasticsearch and Kibana. My goal is to provide a digestible introduction to anomaly detection without diving too deep into data science. I'll guide you through creating your first anomaly detection job in Kibana and then adding alerts on top, hoping to excite you to explore the incredible ML capabilities of the Elastic Stack.

Part 1: Get started with anomaly detection (you're here)
Part 2: Create your first anomaly detection job (coming next week)
Part 3: Add alerting to your ML jobs (coming later)

Not so long ago, machine learning (ML) and applications like anomaly detection were only accessible to ML specialists and seasoned data analysts. Lucky for us, solutions like Elasticsearch and Kibana allow data professionals and engineers to gain unique insights from their data through ML quickly and with ease.

I recently read the book Machine Learning with the Elastic Stack, which was the main driver for writing this series. I highly recommend reading it if this article piques your interest, as it goes into great detail explaining Elastic's ML capabilities, configurations, and applications in practice.

Entering anomaly detection

Anomaly detection is simply a way to find data points or patterns in your data that are different from usual. There are two heuristics that we could use to define the different kinds of anomalies:

Temporal: Something is unusual if its behavior diverges significantly from an established pattern in its own behavior over time.
Population: Something is unusual when it is drastically different from its peers in a population.

population vs. temporal anomalies

Elasticsearch uses unsupervised learning to detect anomalies in the data. The gist of unsupervised learning is that the algorithms learn the data patterns independently with no outside guidance or assistance from humans, which is a massive win for us!

Why you should care

It is not uncommon to see companies handle terabytes of critical information through continuous data streams from their solutions. Examples of such data range from purchase logs and user interactions all the way to system logs and network activity.

In general, there are three main approaches to sifting through this plethora of data:

Manually watch the data looking for anomalies through visualizations and analyses.
Define rules or conditions to trigger under specific requirements.
Use machine learning to detect anomalies in data and proactively take action.

Approaches to sifting through data — Icons from Elastic’s EUI library

It goes without saying that manually watching incoming data to detect issues proactively is both costly and error-prone. A less obvious fault of this approach is that a human eye cannot detect all anomalies. There are a few key things to recognize about these less-than-obvious anomalies:

A pattern is not anomalous by itself but is interestingly significant.
Lack of expected values can be anomalous if there's an expectation that events should occur.
An anomaly spans multiple entries rather than a single data point. These are called multi-bucket anomalies.

Setting thresholds or rules to catch anomalies proactively is a lot better than manual labor. However, it is unlikely that anyone can define the entire ruleset needed to get reliable and accurate results. Plus, the velocity of changes in applications and environments can quickly render any static ruleset useless. Analysts find themselves chasing down many false-positive alerts, setting up a boy-who-cried-wolf situation in which the generated results end up being ignored.

Using anomaly detection enables teams to act proactively on early signs. It surfaces only a small set of relevant data points to help identify the root cause, while filtering out the noise of irrelevant behaviors that might distract human analysts from the things that actually matter.

Detecting anomalies through the Elastic stack is fast, scalable, accurate, low-cost, and easy to use.

Many important use cases revolve around detecting anomalous events over time (temporal anomalies), such as:

Detect unusual purchasing behavior of specific customers or a sudden change in overall sales.
Proactively detect unexpected piling up of messages in application log files.
Track down unauthorized access attempts or suspicious user activity.

Finding outliers in a dataset (population anomalies) is critical in several applications such as fraud detection or detecting defects in manufacturing lines.

How does it work?

Anomaly detection works on live data streams by ingesting time series data grouped into discrete time units called buckets. Users specify a detector function, such as average or sum, that is computed on each bucket. The model then builds a probability distribution for the bucket values and continuously updates it as more data is ingested. Each bucket is scored against that distribution: the lower the probability of the observed value, the more likely it is to be flagged as an anomaly.
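To make the bucket-and-score idea concrete, here is a toy sketch in plain JavaScript. It is not how Elastic's models work internally: it assumes the bucket values are roughly normally distributed, uses a made-up one-minute bucket span and an arbitrary 0.001 probability cutoff, and all function names are hypothetical.

```javascript
// Toy sketch only: Elastic's real models are far more sophisticated.
const BUCKET_SPAN_MS = 60000; // assumed 1-minute bucket span

// Group raw events into buckets and apply a "mean" detector function to each.
function bucketMeans(events, bucketSpanMs) {
  const sums = new Map();
  for (const { timestamp, value } of events) {
    const key = Math.floor(timestamp / bucketSpanMs);
    const b = sums.get(key) || { total: 0, count: 0 };
    b.total += value;
    b.count += 1;
    sums.set(key, b);
  }
  return [...sums.entries()]
    .sort((a, b) => a[0] - b[0])
    .map(([bucket, b]) => ({ bucket, mean: b.total / b.count }));
}

// Two-sided tail probability P(|Z| > z) of a standard normal, i.e. erfc(z / sqrt(2)),
// using the Abramowitz-Stegun 7.1.26 polynomial approximation of erf.
function tailProbability(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly = t * (0.254829592 + t * (-0.284496736 +
    t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  return poly * Math.exp(-x * x);
}

// Score each bucket against the distribution learned from earlier buckets,
// then fold the bucket into the running mean/variance (Welford's algorithm).
function scoreBuckets(buckets) {
  let n = 0;
  let mean = 0;
  let m2 = 0;
  const scored = [];
  for (const b of buckets) {
    let probability = 1; // not enough history to judge yet
    if (n >= 2) {
      const std = Math.sqrt(m2 / (n - 1)) || 1e-9;
      probability = tailProbability((b.mean - mean) / std);
    }
    scored.push({ ...b, probability });
    n += 1;
    const delta = b.mean - mean;
    mean += delta / n;
    m2 += delta * (b.mean - mean);
  }
  return scored;
}

// Seven one-minute buckets, two events each; the last bucket is clearly unusual.
const bucketValues = [10, 11, 9, 10, 12, 10, 50];
const events = bucketValues.flatMap((v, minute) => [
  { timestamp: minute * BUCKET_SPAN_MS + 1000, value: v - 1 },
  { timestamp: minute * BUCKET_SPAN_MS + 30000, value: v + 1 },
]);

const scored = scoreBuckets(bucketMeans(events, BUCKET_SPAN_MS));
const anomalies = scored.filter((b) => b.probability < 0.001);
console.log(anomalies.map((b) => b.bucket)); // [ 6 ]
```

Only the last bucket falls below the cutoff, because its mean is far outside anything the running distribution has seen so far.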

A rather complex orchestration occurs to enable the ML models to continuously ingest and learn from live data streams. Elasticsearch automatically handles all the complex logistics required to make it all happen, from maintaining the model states to data ingestion and managing the cluster.

Machine learning jobs process — Icons from Elastic’s EUI library

The machine learning nodes in Elasticsearch are responsible for running the anomaly detection jobs, which analyze the incoming data against the ML model. While the models keep their state in memory, snapshots of the latest states are also synced into Elasticsearch. This allows users to revert a job into a previous state in case something unexpected happens. The analysis results are stored in Elasticsearch to be consumed by Kibana or through API access.

The anomaly job model comes with many powerful out-of-the-box features that are also highly configurable. I'll highlight a few here:

De-trending: Elastic’s ML models automatically factor out trends in the data such as linear and cyclical patterns. De-trending is essential for modeling real-world datasets to account for seasonal cycles and linear growth and shrinkage.

Splitting jobs: Elastic’s ML allows splitting the analysis based on categories in the data. Splitting jobs helps the model find more detailed patterns in each category and run the analysis for each in parallel.

Influencers: Elastic ML automatically identifies relevant fields in the dataset that have contributed significantly to anomalous behavior.
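As a rough illustration of how these features show up in a job definition, the sketch below builds a request body of the kind accepted by the create anomaly detection job API (PUT _ml/anomaly_detectors/<job_id>). The field names response_time, service.name and host.name, the 15m bucket span, and the description are assumptions invented for this example; your own data will differ.

```javascript
// Illustrative job body; the data field names below are hypothetical.
const jobConfig = {
  description: 'Mean response time per service (illustrative)',
  analysis_config: {
    bucket_span: '15m', // size of each time bucket
    detectors: [
      {
        function: 'mean',                       // detector function computed per bucket
        field_name: 'response_time',            // hypothetical numeric field
        partition_field_name: 'service.name',   // split the analysis per service
      },
    ],
    // Candidate fields to be evaluated as influencers of an anomaly.
    influencers: ['service.name', 'host.name'],
  },
  data_description: { time_field: '@timestamp' },
};

// The object would be sent as the JSON body of the PUT request.
const requestBody = JSON.stringify(jobConfig, null, 2);
console.log(requestBody);
```

Partitioning by a field gives each category its own baseline, which is what the splitting feature described above refers to.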

Final thoughts

There are many more exciting details and advanced configurations to explore. Machine Learning with the Elastic Stack goes into detail about Elastic's ML capabilities in both supervised and unsupervised learning.

Machine Learning with the Elastic Stack: Gain valuable insights from your data with Elastic Stack's machine learning features, 2nd Edition, by Rich Collier, Camilla Montonen, and Bahaaldine Azarmi (Kindle edition, Amazon.com)

Next week I will publish the second part of this series, Create your first anomaly detection job: a step-by-step guide to writing your first anomaly job through Kibana within 15 minutes! Stay tuned and follow to get notified.

You’re missing out on ImmutableJS records

Ahmad Bamieh — Sun, 16 Jul 2017 16:26:00 +0300

ImmutableJS records are immutably beautiful!

ImmutableJS records are super simple to use and provide tremendous advantages over the regular Map. This post will introduce you to records: what they are and how to start using them right now.

Record Properties

A Record is similar to an ImmutableJS Map; however, it has the following unique features that make it special:

You cannot add more keys to it once it has been constructed.
You can define default values for new record instances.
The properties of a Record instance can be accessed like those of regular JS objects and can be destructured as well.
You can name a record, for better debugging and error handling.
You can extend the Record, to provide derived data from within the record.

This post will discuss all of the properties above, but first, let us create our first record!

Creating a record

The Record method returns a constructor function from which new instances can be created.

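const Immutable = require('immutable');

// Record() returns a constructor; its keys and default values are fixed here.
const LivingCreature = Immutable.Record({
    name: 'Unknown',
    species: 'Human',
    age: 0,
});

// Keys that are not passed in keep their defaults (species stays 'Human').
// Declared with let because a later section reassigns fooBar.
let fooBar = new LivingCreature({name: 'Foo Bar', age: 24});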

In this code snippet, we’ve created a Record of LivingCreature and made a new instance called fooBar. Note that this instance has the default species of Human.

Descriptive Name

Records accept a second parameter for a descriptive name that appears when converting a Record to a string or in any error messages.

const NamedRecord = new Immutable.Record({ ... }, "[NAME HERE]");

Retrieving Properties

Unlike other ImmutableJS objects, records can be accessed like normal JS objects.

const {name, species} = fooBar;

fooBar.name; // Foo Bar
fooBar['species']; // Human

Replacing Values

To replace the fooBar living creature with another creature, simply *swap records*.

fooBar = new LivingCreature({
    name: 'Foo Bar Junior',
    species: 'Half Blood',
    age: 8,
});

Updating Values

Using records, we can update multiple values at once using merge, in addition to using set to update a single value.

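// set returns a new record; fooBar itself is left unchanged.
const olderFooBar = fooBar.set('age', 20);

// or

// merge updates several values at once and also returns a new record.
const mergedFooBar = fooBar.merge({
 age: 25,
 species: 12,
});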

Adding new keys

The record throws an error if you attempt to add non-initialized keys on it. The following examples will throw an error.

const newFooBar = new LivingCreature({ status: 'its complicated' });
const mergeFooBar = fooBar.merge({ status: 'its complicated' });

Removing Keys

Records always have a value for the keys they define. Removing a key from a record simply resets it to the default value for that key.

const newFooBar = fooBar.remove('name');
console.log(newFooBar.name); // → 'Unknown'

Derived Values

One of the most powerful features that I personally love about records is their ability to derive data from within the record itself.

For example, let's say that we have a 'Cart' of two items and their sum. Normally, every time we update an item value, we have to update the sum as well. Soon you will feel that this is not the best practice.

Records come to the rescue!

class Cart extends Immutable.Record({ itemA: 1, itemB: 2 }) {
  get sum() {
    return this.itemA + this.itemB;
  }
}
var myCart = new Cart();
myCart.sum; // 3

Now we can update any value, and since the sum is derived from the record properties, there is no need to worry about updating it manually.

const updatedCart = myCart.set('itemA', 5);
updatedCart.sum; // 7

Conclusion

Records provide an amazing advantage of allowing your immutable objects to be treated like normal objects, by having standard accessors and object destructuring. Hence, any library or component that does not mutate objects will welcome records like one of their own!

Additionally, since record keys must be specified when the record is created, reading the record will clarify its use and self-document its purpose. It also enforces a stricter code style, since you cannot add any more keys to the record.

My team and I have been using ImmutableJS Records for a while now. I am surprised that they are usually overlooked and less popular than the standard Map.

NodeJS: Constant HashTable Seeds Vulnerability

Ahmad Bamieh — Thu, 13 Jul 2017 00:00:00 +0300

You might have heard about the high-impact security vulnerability issue recently fixed and announced by the NodeJS team. This post will attempt to explain the issue, how and why it happened.

Thanks to the amazing effort of the team at NodeJS, this vulnerability was fixed immediately across releases (4.x, 6.x, 7.x, and 8.x).

Make sure you upgrade your node version as soon as you can (preferably when you finish reading this post!). To upgrade, head to the official NodeJS blog for details.

The vulnerability is rooted in hash tables, so let us start with a quick recap on HashTables, just in case you missed your computer science classes.

HashTables

HashTables fulfill the dream of *constant time* insertions and access.

Some of the places where HashTables are used:

Method lookups.
Sets, Objects, and Maps.
HTTP headers.
JSON representations.
URL-encoded post form data.

A HashTable is a group of linked lists. The linked lists must be short for performance to be good.

The function that creates a HashTable pre-allocates the number of linked lists it can contain. Throughout this post, the number of linked lists the HashTable can hold is denoted by the letter l.

l = 2**n

The best scenario for optimal performance is to have n/l entries in each linked list, where n here is the number of entries in the HashTable (not the exponent in the formula above). So if we have 256 entries in a HashTable with l = 256, the perfect hash function would distribute them across the 256 lists, each holding only 1 entry.

Example: Storing strings one to ten in a HashTable

In this example, our hash function is the following:

H(s) = first byte of s
l = 256

In our “one” to “ten” example, inserting the values into the HashTable will result in something like the following:

{
  "o": ["one"],
  "t": ["two", "three", "ten"],
  "f": ["four", "five"],
  "s": ["six", "seven"],
  "e": ["eight"],
  "g": [],
  "h": [],
  "n": ["nine"],
  "r": [],
  …
}

Notice that we have empty linked lists in the HashTable because no value hashes to them. Additionally, there is a pile-up on the t list. This is a bad hash function, since it does not do a good job of distributing the strings across the linked lists. No matter how many linked lists you have, if the hash function distributes entries poorly among them, you will get bad performance: linked lists are slow to search, and hash tables are fast only because the linked lists in them stay short.
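The example above can be reproduced with a few lines of JavaScript. This is only a sketch: arrays stand in for the linked lists, the function names are made up for illustration, and charCodeAt is used as the "first byte" (which holds for the ASCII strings used here).

```javascript
// H(s) = first byte of s (valid for ASCII input, as in this example).
function firstByteHash(s) {
  return s.charCodeAt(0) & 0xff;
}

// Build a table of l lists and place every string in list H(s) % l.
function buildTable(strings, hash, l) {
  const table = Array.from({ length: l }, () => []);
  for (const s of strings) {
    table[hash(s) % l].push(s);
  }
  return table;
}

const words = ['one', 'two', 'three', 'four', 'five',
  'six', 'seven', 'eight', 'nine', 'ten'];
const table = buildTable(words, firstByteHash, 256);

const usedLists = table.filter((list) => list.length > 0).length;
const longestList = Math.max(...table.map((list) => list.length));

console.log(table['t'.charCodeAt(0)]); // [ 'two', 'three', 'ten' ]
console.log(usedLists);                // 6 of the 256 lists are used
console.log(longestList);              // 3
```

Ten strings end up in only six of the 256 available lists, which is exactly the uneven distribution described above.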

Since there can be more possible H(s) results than the l lists available, the hash value is always reduced modulo l so that it maps to a valid list index:

H(s) % l

H(s) = first byte of s is not a good hashing function. NodeJS implements a very good hash function that quickly looks through the whole string and gives random-looking results that spread entries across the HashTable.

Since we have a good hashing function, normal usage should *never* cause HashTables to over-flood. However, an attacker just might.

Hashing malicious strings.

The attacker provides strings where H(s)%l will result in the same HashTable index, hence all these strings will be stored in the same linked list, making the app go super slow.
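To see how little effort this takes when the hash function is known, the sketch below brute-forces keys that all land in list 0. The toyHash function is a simple djb2-style hash chosen purely for illustration; it is not the hash function NodeJS or V8 uses, and the candidate key format is arbitrary.

```javascript
// Toy, unseeded string hash (djb2-style). Stands in for any publicly known hash.
function toyHash(s) {
  let h = 5381;
  for (let i = 0; i < s.length; i++) {
    h = (Math.imul(h, 33) + s.charCodeAt(i)) >>> 0; // keep it an unsigned 32-bit value
  }
  return h;
}

// Try candidate strings until `count` of them map to the same list index.
function findCollisions(hash, l, targetIndex, count) {
  const found = [];
  for (let i = 0; found.length < count && i < 1e7; i++) {
    const candidate = 'k' + i.toString(36);
    if (hash(candidate) % l === targetIndex) {
      found.push(candidate);
    }
  }
  return found;
}

const colliding = findCollisions(toyHash, 256, 0, 100);
console.log(colliding.length); // 100
console.log(colliding.slice(0, 3));
```

Every key in colliding would be stored in the same linked list, so a table fed with these keys degrades into one long list.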

To solve such issues, programming languages looked at different solutions, some of which are the following:

Solution #1

Replace linked lists with another structure, such as red black trees or any balanced tree structure.

However, most programs go with the simpler approach of using simple linked lists to avoid bugs, heavy re-writes, debugging issues, and implementation complexity.

Solution #2

Stop inserting into the linked list if it gets more than n entries.

This solution looks pleasant, but real-world applications cannot just discard incoming entries, so it is not viable in all situations. It can only be used in caching solutions.

Solution #3

Use crypto hashing functions like sha256.

Crypto hash functions should be collision-resistant: it should be infeasible to find any two strings with the same output.

But this solution is troublesome for two main reasons:

The hash function should be fast, and crypto hashes are not the fastest.
Even if crypto hashes were fast enough, this does not solve the problem. The attacker does not need to find collisions in the crypto hashing function itself. Instead, they only need to find collisions in the output after it is reduced to one of the l linked lists. H(s)%l is not collision-resistant no matter which hashing function is used, since the output is always reduced to one of the 2**n lists. So the attacker might run a few million iterations to find a critical number of collisions.
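The second point can be checked with Node's built-in crypto module. The sketch below assumes a table with l = 256 lists and takes the first four bytes of the SHA-256 digest as the hash value before reducing it; both choices are arbitrary and only serve the illustration.

```javascript
const crypto = require('crypto');

// List index for a key: SHA-256, truncated to 32 bits, reduced modulo l.
function sha256Index(s, l) {
  const digest = crypto.createHash('sha256').update(s).digest();
  return digest.readUInt32BE(0) % l;
}

// Brute-force keys that share one list index, even though SHA-256 itself
// has no known collisions.
function keysForIndex(l, targetIndex, count) {
  const found = [];
  let attempts = 0;
  while (found.length < count && attempts < 1e7) {
    const candidate = 'key' + attempts;
    if (sha256Index(candidate, l) === targetIndex) {
      found.push(candidate);
    }
    attempts += 1;
  }
  return { found, attempts };
}

const result = keysForIndex(256, 0, 50);
console.log(result.found.length); // 50
console.log(result.attempts);     // on the order of 50 * 256 candidates tried
```

None of these keys collide in SHA-256 itself; they only collide after the reduction to 256 lists, which is all the attacker needs.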

Current Solution

Randomizing the hash function with a secret key.

These attacks were repeated many times throughout the years under different names: low bandwidth denial of service attacks, hash flooding, algorithmic complexity attacks, etc.

Attacks were targeted at many programming languages and systems, such as Perl, Redis, and Python. Usually the response to these attacks is to secretly randomize the hash function on each application boot or hashtable initialization. This way, the attacker cannot predict the hashing function's output and hence cannot find collisions, because they do not know its secret key.

NodeJS uses the same approach, by randomizing its “HashTable seed”.
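Here is a small sketch of why a secret seed helps. HMAC-SHA256 from Node's crypto module stands in for a keyed hash function; this is an assumption made for illustration only, since real runtimes use much faster seeded hashes. The seed values and function names are hypothetical.

```javascript
const crypto = require('crypto');

// Keyed hash: the list index depends on the key string and on a secret seed.
function seededIndex(s, seed, l) {
  const digest = crypto.createHmac('sha256', seed).update(s).digest();
  return digest.readUInt32BE(0) % l;
}

// Collect `count` keys that all land in list 0 under the given seed.
function keysCollidingUnder(seed, l, count) {
  const found = [];
  for (let i = 0; found.length < count && i < 1e7; i++) {
    const candidate = 'key' + i;
    if (seededIndex(candidate, seed, l) === 0) {
      found.push(candidate);
    }
  }
  return found;
}

const l = 256;
const knownSeed = 'seed-known-to-attacker'; // hypothetical seed the attacker has learned
const freshSeed = crypto.randomBytes(16);   // new secret, e.g. generated on each boot

// The attacker prepares keys against the seed they know...
const attackKeys = keysCollidingUnder(knownSeed, l, 64);

// ...which only collide if the server really uses that seed.
const listsUnderKnown = new Set(attackKeys.map((k) => seededIndex(k, knownSeed, l)));
const listsUnderFresh = new Set(attackKeys.map((k) => seededIndex(k, freshSeed, l)));

console.log(listsUnderKnown.size); // 1
console.log(listsUnderFresh.size); // many different lists
```

Under the seed the attacker prepared for, all 64 keys share one list. Under a fresh seed the same keys scatter across the table, so the precomputed collisions are worthless.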

Whether this is actually secure or not is a debate for another time.

The NodeJS Constant HashTable vulnerability

The vulnerability was caused by a bug in the randomization of the HashTable seed: the seed was always the same for a given version of node.

NodeJS was susceptible to hash flooding remote DoS attacks as the HashTable seed was constant across a given released version of NodeJS.

For example, an HTTP node server is vulnerable to this attack since the URL-encoded post form data is stored in a HashTable on the heap. The attacker can send small payloads (say 10 bytes each) across multiple post requests, each with a different input that yields the same hash key. Each request accumulates its payload in the same HashTable linked list stored on the heap.

So after accumulating the requests and abusing the same linked list, the server will face huge delays in responses since it is synchronously trying to access the same linked list in the HashTable.
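The slowdown can be quantified without timing anything, by counting how many key comparisons the insertions need. The sketch below is a simplified chained hash table (arrays instead of linked lists) with hypothetical helper names. The "spread" case uses an idealized index function as a stand-in for a well-behaved hash.

```javascript
// Insert keys into a chained table and count key comparisons along the way.
// Each insert scans its list first to check whether the key already exists.
function countComparisons(keys, indexOf, l) {
  const table = Array.from({ length: l }, () => []);
  let comparisons = 0;
  for (const key of keys) {
    const list = table[indexOf(key) % l];
    let exists = false;
    for (const existing of list) {
      comparisons += 1;
      if (existing === key) {
        exists = true;
        break;
      }
    }
    if (!exists) {
      list.push(key);
    }
  }
  return comparisons;
}

const keys = Array.from({ length: 2000 }, (_, i) => 'key' + i);

// Attack scenario: every key lands in list 0.
const flooded = countComparisons(keys, () => 0, 256);

// Idealized scenario: keys are spread evenly over the 256 lists.
const spread = countComparisons(keys, (k) => Number(k.slice(3)), 256);

console.log(flooded); // 1999000, i.e. n * (n - 1) / 2 for n = 2000
console.log(spread);  // 6832
```

The flooded table needs roughly 290 times more comparisons for the same 2000 keys, and the gap grows quadratically as more colliding keys arrive.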

The Cause

The JavaScript specification includes a lot of built-in functionality, from math functions to a full-featured regular expression engine. Every newly-created V8 context has these functions available from the start. For this to work, the global object (for example, the window object in a browser) and all the built-in functionality must be set up and initialized into V8’s heap at the time the context is created. It takes quite some time to do this from scratch.

V8 uses 'snapshots' to solve this issue: A snapshot is basically a saved context that can be re-used in future boots.

Fortunately, V8 uses a shortcut to speed things up: just like thawing a frozen pizza for a quick dinner, we deserialize a previously-prepared snapshot directly into the heap to get an initialized context. On a regular desktop computer, this can bring the time to create a context from 40 ms down to less than 2 ms. On an average mobile phone, this could mean a difference between 270 ms and 10 ms.

Snapshots basically enabled NodeJS to reduce boot time by taking a snapshot of the pre-initialized boot context and using it for the next runs. Additionally, users are allowed to use custom snapshots to build on top of the original context.

The root cause of the vulnerability was that NodeJS was built with the V8 snapshots feature enabled by default. The hash table seed that was supposed to be randomized at startup ended up being the one stored in the snapshot, so it was constant across every installation of a given NodeJS version. This minor error resulted in NodeJS being susceptible to remote DoS attacks via hash flooding.