How to Detect Anomalies in Splunk Using Streamstats (2024)

By Josh Neubecker | Published on: September 16th, 2021

Detecting anomalies is a popular use case for Splunk. Standard deviation, however, isn’t always the best solution despite being commonly used.

In this tutorial we will consider different methods for anomaly detection, including standard deviation and MLTK. I will also walk you through the use of streamstats to detect anomalies by calculating how far a numerical value is from its neighbors.

The problem with standard deviation

Standard deviation measures the amount of spread in a dataset based on each value’s distance from the mean. Using standard deviation to find outliers is generally recommended for data that is normally distributed. In security contexts, user behavior most often follows an exponential distribution: low values are common and high values are rare. Standard deviation can still be used to find outliers, but a certain percentage of the data will always be flagged as outliers. This means more data equals more outliers equals more alerts.

For example, suppose we are looking for users logging in from an anomalous number of sources in an hour. The source count follows an exponential distribution:

[Figure: distribution of source count]

If we apply standard deviation outlier detection to the whole dataset (upperBound=avg+stdev*2), there are 3,306 results across 672 users.
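As a rough sketch of that whole-dataset check, assuming the same Authentication data model and src_count metric used in the full search later in this article (avg_src and stdev_src are illustrative field names, and the filtering and normalization from the full search are left out):

```spl
| tstats count from datamodel=Authentication by Authentication.user Authentication.src _time span=1h
| rename Authentication.* as *
| stats dc(src) as src_count by user _time
| eventstats avg(src_count) as avg_src stdev(src_count) as stdev_src
| eval upperBound=avg_src+stdev_src*2
| where src_count>upperBound
```

Because eventstats has no by clause here, every user is compared against the same upper bound.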

There isn’t that much actual anomalous behavior happening in this example, but what’s normal for one user can be abnormal for another.

Applying a single upper bound to all users doesn’t actually capture anomalies. However, what if we were to have separate upper bounds for each user? Interestingly, this is worse: 8,061 results, and that’s only looking at users with over 30 data points. With a single bound, much of the activity was buried under a high upper bound set by users or accounts that regularly log in from many sources.
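A per-user version might look like the sketch below. It assumes the results already hold src_count by user and _time; avg_src, stdev_src, and data_points are illustrative field names.

```spl
| eventstats avg(src_count) as avg_src stdev(src_count) as stdev_src count as data_points by user
| eval upperBound=avg_src+stdev_src*2
| where data_points>30 AND src_count>upperBound
```

The only changes from a single global bound are the by user clause and the minimum number of data points per user.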

Now let’s look at the hour of day and weekdays/weekends. We’ll need to look at 30 days to have enough data points for grouping.

Requiring more than 15 data points per group, there are 14,298 results. It’s getting even worse because fewer events are being buried by the high counts seen during certain hours of the day.
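One way to sketch this grouping, again assuming src_count by user and _time is already calculated (hour, wday, and day_type are illustrative field names; %w returns the weekday as a number, with 0 for Sunday and 6 for Saturday):

```spl
| eval hour=strftime(_time, "%H"), wday=strftime(_time, "%w")
| eval day_type=if(wday="0" OR wday="6", "weekend", "weekday")
| eventstats avg(src_count) as avg_src stdev(src_count) as stdev_src count as data_points by user hour day_type
| eval upperBound=avg_src+stdev_src*2
| where data_points>15 AND src_count>upperBound
```

Each user now gets a separate upper bound for every hour of the day on weekdays and on weekends, which is why 30 days of data are needed to fill the groups.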

What about MLTK?

Splunk’s Machine Learning Toolkit (MLTK) adds machine learning capabilities to Splunk. One of the included algorithms for anomaly detection is called DensityFunction. This algorithm is meant to detect outliers in this kind of data.

Unfortunately, unless you edit config files and make sure you have enough processing power, DensityFunction is limited to 1,024 groupings and 100,000 events before it starts sampling data.
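For reference, a minimal sketch of what this could look like, assuming MLTK is installed and src_count by user and _time is already calculated. The model name src_count_density_model is hypothetical, and the syntax follows MLTK’s documented fit command as I understand it, so check it against your MLTK version.

```spl
| fit DensityFunction src_count by "user" into src_count_density_model
| where 'IsOutlier(src_count)'=1
```

Grouping by user is exactly where the 1,024 grouping limit becomes a problem in an environment with thousands of users.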

If identity data in Splunk for different types of users is high quality, reflects different usage patterns, and there are fewer than 1,024 user types, then MLTK may be the direction to go.

Using streamstats to get neighboring values

As an alternative to MLTK, I use streamstats to mimic how I, as an analyst, investigate an alert.

For our example of a user being seen logging in from an anomalous number of sources, I would start by looking at historical source counts over the past 30 days. If the source count was significantly higher than any previous source counts I would consider it anomalous.

Using streamstats we can put a number to how much higher a source count is than previous counts:

1. Calculate the metric you want to find anomalies in.

In our case we’re looking at a distinct count of src by user and _time, where _time is in 1-hour spans.
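As a minimal sketch of this step, simplified from the full search later in this article (which also filters and normalizes the data first):

```spl
| tstats count from datamodel=Authentication by Authentication.user Authentication.src _time span=1h
| rename Authentication.* as *
| stats dc(src) as src_count by user _time
```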

2. Sort the metric ascending.

The 0 in sort 0 makes sort work on any number of events; by default it only returns the first 10,000.
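In the full search later in this article, this step is the single line:

```spl
| sort 0 src_count
```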

3. Run streamstats over the data to get the lower values for each value, calculating their sum and how many previous values there were.

Set current=f to only look at the previous values. With window=5 we’re looking at the previous 5 lower values, but the exact number isn’t too important; it just needs to be enough to get a good sample of previous values. Set global=f since we’re using a window and want a separate window for each user. I’m also listing out the previous values for added context.
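The corresponding line in the full search later in this article, with field names chosen to describe what each aggregate holds:

```spl
| streamstats window=5 current=f global=f count as events_with_closest_lower_count sum(src_count) as sum_of_last_five list(src_count) as previous_five_counts by user
```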

4. Sort the metric descending.

As with the ascending sort, we need to use sort 0.
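The descending sort from the full search, for reference:

```spl
| sort 0 -src_count
```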

5. Run streamstats over the descending data to get the higher values for each value, calculating their sum and how many higher values there were.

For this we can look at all higher counts that have been seen so no window is required.
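In the full search this pass looks like the following. Note that there is no window or global option this time, only current=f:

```spl
| streamstats current=f count as events_with_higher_count values(src_count) as higher_counts_seen sum(src_count) as sum_of_higher_count by user
```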

6. Use fillnull to fill in 0 if there were no values found for one of the calculations.
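A sketch of this step using the field names from the two streamstats passes in the full search. With no value argument, fillnull fills with 0 by default:

```spl
| fillnull events_with_higher_count events_with_closest_lower_count sum_of_higher_count sum_of_last_five
```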

7. Calculate the total number of nearby values and their sum.
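Using the same field names, the two totals can be computed in a single eval, as in the full search:

```spl
| eval count_of_nearby_values=events_with_higher_count+events_with_closest_lower_count, sum_of_nearby_values=sum_of_higher_count+sum_of_last_five
```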

8. Calculate a distance metric.
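The distance metric used in the full search later in this article is:

```spl
| eval distance_score=(src_count*count_of_nearby_values)/sum_of_nearby_values
```

Since sum divided by count is the mean, this is the same as dividing src_count by the average of its nearby values. As a made-up example, a src_count of 10 whose five closest lower values are 2, 2, 3, 3, and 3 (sum 13), with no higher values seen, scores 10*5/13, or roughly 3.85. When there are no nearby values at all the division is by zero and distance_score comes out null, which is why a fallback condition is useful in the final filter.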

9. Filter the results on the distance metric.

Adjust the threshold for the distance score based on your results. Add a fallback threshold if you still want results when there is no history.
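A simplified sketch of the filter, using thresholds from the full search later in this article as starting points rather than recommendations:

```spl
| where distance_score>5 OR (count_of_nearby_values=0 AND src_count>3)
```

The second condition is the fallback: it fires when no nearby values were found for the user and the raw count is still high enough to be worth a look.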

Putting it all together

To put this together as a correlation search, we need to make sure we’re pulling in the data we want and that it’s normalized. It can also be useful to add additional metrics to filter on.

In the case of this search, in addition to src_count, I’ve added a new_src_count for a count of sources seen on only a single day in the past 30.

| tstats `summariesonly` count from datamodel=Authentication where Authentication.signature_id=4624 NOT Authentication.user="-" NOT Authentication.user="ANONYMOUS LOGON" NOT Authentication.user="unknown" NOT Authentication.src="unknown" by Authentication.user Authentication.src Authentication.dest_nt_domain _time span=1h
  • Tstats to quickly look at 30 days of data
  • Focusing on Windows authentication 4624 events
  • Removing events with unknown and irrelevant data
  • Grouping by user, src, and dest_nt_domain, which contains the user’s domain
| rename Authentication.* as * dest_nt_domain as user_domain
  • Remove datamodel from field names and rename dest_nt_domain to be more accurate
| `get_asset(src)`
  • Pull in Splunk assets to get src hostname (src_nt_host)
| eval src=lower(if(match(src, "([0-9]{1,3}\.){3}[0-9]{1,3}") AND isnotnull(src_nt_host), mvindex(src_nt_host, 0), src))
  • For src values that are IPs replace src with src_nt_host from asset data if it exists
  • Lowercase for normalization
| eval user=lower(mvindex(split(user, "@"), 0))
  • Normalize user, lowercasing and pulling just user from user@domain
| where lower(src)!=lower(user_domain)
  • Filter out local authentication
| bin _time span=1d as day
  • Create day field
| eventstats dc(day) as day_count by user src
  • Count how many days a user src combination has been seen
| stats dc(src) as src_count dc(eval(if(day_count=1, src, null()))) as new_src_count by user _time
  • Src_count: how many total sources a user has been seen from in an hour
  • New_src_count: how many of those sources have only been seen on a single day in the past 30
| sort 0 src_count
  • Sort src_count ascending
  • Don’t forget “0” or it will only sort 10,000 events
| streamstats window=5 current=f global=f count as events_with_closest_lower_count sum(src_count) as sum_of_last_five list(src_count) as previous_five_counts by user
  • For each event, get the sum and count of the previous 5 values, which are the next smallest values because of the sort
| sort 0 -src_count
  • Sort descending
| streamstats current=f count as events_with_higher_count values(src_count) as higher_counts_seen sum(src_count) as sum_of_higher_count by user
  • For each event, get the sum and count of all the previous values, which are the greater values because of the sort
| fillnull events_with_higher_count events_with_closest_lower_count sum_of_higher_count sum_of_last_five
  • Fill null values with 0 if no higher or lower events were found
| eval count_of_nearby_values=events_with_higher_count+events_with_closest_lower_count, sum_of_nearby_values=sum_of_higher_count+sum_of_last_five
  • Calculate the total number of surrounding values that were seen and their sum
| eval distance_score=(src_count*count_of_nearby_values)/sum_of_nearby_values
  • Calculate the distance metric
| where (((distance_score>2 AND new_src_count/src_count>0.3) OR distance_score>5) OR (count_of_nearby_values=0 AND src_count>3)) AND _time>=relative_time(now(), "-4h")
| rename _time as orig_time
| convert ctime(orig_time)

Alert conditions:

  • Distance score is greater than 2 and more than 30% of sources seen were new
  • Or distance score is greater than 5
  • Or, if there is no history, more than 3 sources were seen
  • Filter to events in the past 4 hours; otherwise we would get all results for the past 30 days every time the search runs

Running this search over 30 days returns 10 results, and even accessing a few new sources can trigger an anomaly.

[Figure: search results]

In Conclusion

What’s anomalous or an outlier depends on context. You need to ask what you think an outlier would be in the data, and then base your detection method around that.

If standard deviation is providing those results, stick with it. But in my experience, standard deviation has provided more noise than actionable results for our use cases in security.

This method has worked well, providing results that we would see as anomalous. For a concrete example, I’ve used this method in a Kerberoasting search that reliably detected activity from pentests.

You can use this method and apply it to your own detections to reduce analyst workload by focusing on what they would already consider abnormal behavior. Or, if this isn’t providing the results you would like, modify it and come up with another method to find what you consider anomalous in your use case.



