The Unique Technical Challenges Inherent in Cybersecurity Observability
Much has been written over the years on the challenges of logging and data lakes in cybersecurity. Most of that material, however, approaches the problem from the point of view of a cybersecurity practitioner, rather than that of a technologist who actually has to implement a solution to the data collection, logging, and observability problems in cybersecurity, and who has to choose an underlying technology stack for that solution.
As technologists, we often like to bake off technology stacks and figure out the best way to solve a problem. That is not what I am going to do in this article, for a number of reasons. Instead, I am going to focus on the factors one needs to consider in this evaluation. I want to dive into what makes the cybersecurity world especially challenging when evaluating observability stacks, so that readers can arm themselves with this information when choosing technologies and building out their solutions.
Let’s align on some key terminology first. When we talk about “observability”, we are referring to the capability of centralized collection and storage of data such as event logs, audit logs, alert logs, traces, and metrics, in a way that allows alerting, querying, reporting, and analytics. In the cybersecurity realm, we are most interested in the event, audit, and alert logs, while in the DevOps and APM spaces, consumers tend to be most interested in the event logs, traces, and metrics.
However, there is a large overlap in this Venn diagram in the application and event logs space, which leads many enterprises down a path of seeking a common data fabric upon which to build their DevOps, APM, and cybersecurity use cases, for many reasons including cost savings and maximization of available expertise. This makes perfect sense and is a goal one should pursue. But as one embarks down this path, it is important to understand that there are three key requirements areas for cybersecurity observability that are a significant superset of those of DevOps and APM, and as such, some technology stacks that are perfectly well suited to those use cases will fall down when cybersecurity use cases are applied to them.
Scale & Volume Of Data Collection and Retention
In a DevOps or APM use case, there is rarely any justification for keeping log data beyond a month; in fact, many teams do not keep it longer than a couple of weeks. The reason is simple: there is little need or desire to go back and troubleshoot an issue from the long-ago past, performance related or otherwise. Once the customer incident has been resolved, the application and performance data surrounding it has little to no value, so why incur the cost of keeping it online? What’s more, keeping this data can actually turn out to be a liability, because, as Bruce Schneier says, “data is a toxic asset”. The longer you keep log data, the greater your risk, as well as your burden when it comes to GDPR, CCPA, and related regulations. As such, common sense dictates that you should get rid of old log data as soon as is feasible.
In the cybersecurity world, however, there are other factors at play. The first and most basic is mandatory retention. In many environments, because it houses all of the audit log data, the cybersecurity log management platform becomes the “system of record”, and as such, there are compliance issues at play. PCI DSS, as one example, requires all audit logs to be maintained for a minimum of one year. Some regulations, such as NYDFS, require certain types of logs to be maintained for up to five years.
While one can meet these regulatory requirements by off-boarding and archiving logs into a separate system (for example, object storage), satisfying auditors that you can subsequently access this data when required and report on its contents can become expensive and non-trivial when said off-boarding is entirely disconnected from your day-to-day log management platform. As such, there is a strong desire among clients for a single solution that can solve this use case.
There are, however, two even more important reasons why cybersecurity use cases require more data to be kept online: incident response and threat hunting. The average time to identify and contain a data breach currently stands at a remarkable 280 days, with a majority of that being attacker dwell time. When a serious breach occurs, there will be great demand to investigate and respond quickly, including looking back historically for the time of incursion into the environment. Responding effectively is greatly hampered by offboarded logs that need to be manually restored in order to investigate, and it can be outright impossible if the logs do not exist at all.
In the threat hunting case, as new attacker IoCs, TTPs, and IoBs are uncovered daily, it is insufficient to look for these zero-day indicators only on a go-forward basis, because while these techniques may be new to the public, attackers have been using them for some time. As such, it is vital to be able to continuously search the past for IoCs and IoBs, and you cannot do that if the data required is kept in offline archival storage, or does not exist at all.
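As a minimal sketch of that workflow (the record format and indicator values here are invented for illustration), a retro-hunt is simply a scan of retained history against indicators that were published only today:

```python
# Hypothetical retro-hunt: indicators published today, scanned against
# months of retained history. Record shape and values are illustrative.
NEW_IOCS = {"198.51.100.23", "evil-updater.example"}

history = [
    {"ts": "2023-01-04T10:00:00Z", "src_ip": "198.51.100.23", "domain": "cdn.example"},
    {"ts": "2023-03-11T02:13:00Z", "src_ip": "203.0.113.9", "domain": "evil-updater.example"},
    {"ts": "2023-06-01T08:45:00Z", "src_ip": "192.0.2.1", "domain": "intranet.local"},
]

def retro_hunt(records, iocs):
    """Return every historical record matching a current indicator."""
    return [r for r in records
            if r["src_ip"] in iocs or r["domain"] in iocs]

hits = retro_hunt(history, NEW_IOCS)
print(len(hits))  # 2 -- the attacker was active months before the IoCs went public
```

If the January and March records had already been aged out to offline archive, this hunt would silently return nothing, which is exactly the risk described above.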
Finally, there is the simple aspect of data sources and protocols. While a typical enterprise may run hundreds of applications, and sometimes even thousands, the number of data sources from which one must collect cybersecurity telemetry vastly outnumbers those. The reason is simple: not only is every application a data source, but so is every single endpoint and employee asset, and these assets scale horizontally with the size of the organization, not simply with the number of applications. It is very common to have hundreds of thousands of data sources in a cybersecurity log store, and clients often have millions.
As well, because of their wider mandate, security solutions have to support a larger number of collection protocols for their observability data; it is insufficient to simply support one or two passive log and metric collection methods. Mature cybersecurity data layers have to support dozens of collection methods, data models, and protocols simultaneously, from the basic (common protocols such as syslog, endpoint agents, and REST) to the more esoteric (JDBC, JMS, Kafka) to the extremely esoteric (OPSEC LEA).
The combination of all of the above factors is why a typical client of IBM Security may not only be ingesting 5 TB of data or more every day across millions of data sources, but also retaining that data for two years or longer, and may have to run a needle-in-a-haystack query through those petabytes of data on a moment’s notice.
Importance Of Dynamic Schema
The second area where cybersecurity has logging requirements above and beyond those of the typical DevOps and APM use case is the set of data types that commonly have to be filtered and faceted during analysis, and when those facets can be known and optimized for.
When troubleshooting an application or user workflow, one normally wants to narrow down to a specific log stream. These log streams are commonly selected based on attributes such as application, user, and/or cluster identifiers. One can see how common this is in a DevOps use case simply by looking at the design of the LogQL query language used by Loki, where a log stream is a mandatory selector: there is a tacit assumption that one is almost always going to want to narrow scope in this way, and that assumption holds true and makes perfect sense in these kinds of use cases. Another methodology commonly used to optimize filtering and querying of data in a DevOps or APM scenario is labeling (or tagging) on ingest, where certain attributes of the data in the log are used to map a label to the log, and this label is used to optimize subsequent queries, either by being leveraged in an index or by being used to shard data on disk.
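To make that design concrete, here is a toy model (the stream labels and log lines are invented) of label-indexed streams in the style that LogQL’s mandatory stream selector encourages:

```python
# Toy model of a Loki-style label-indexed log store: labels are fixed at
# ingest, and every query begins by selecting streams via those labels.
streams = {
    frozenset({"app=checkout", "env=prod"}): ["payment failed id=81", "retry ok id=81"],
    frozenset({"app=search", "env=prod"}): ["timeout shard=3"],
}

def select(label_filters):
    """Return all lines from streams whose label sets contain every filter,
    analogous to a {app="checkout", env="prod"} stream selector in LogQL."""
    return [line
            for labels, lines in streams.items()
            if label_filters <= labels
            for line in lines]

print(select({"app=checkout", "env=prod"}))  # only the two checkout lines
```

The key property is that the label set is small, static, and decided at ingest, which is precisely the property that cybersecurity facets lack.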
In the cybersecurity realm, however, one quickly discovers that none of these assumptions hold. First of all, queries against a cybersecurity observability layer are rarely data-source-centric; they are threat-centric. When performing a threat hunt or incident response, it is very rare that a responder knows in advance which data sources they want to restrict their investigation to; they almost invariably want to investigate across all of them. The data one is looking to filter and pivot against depends on the specific threat type and may include IP addresses, process names or IDs, file names and/or hashes, registry keys, domain names, user names, or any mixture of these and more. Each of these items is highly dynamic and not something that can be treated as a “tag” in order to divide data on disk. As a result, they either need to be indexed, or other data reduction techniques such as Bloom filters need to be used in combination with brute-force mechanisms.
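As a rough sketch of the Bloom-filter approach (the sizes and facet values are chosen arbitrarily for illustration), each on-disk chunk of logs can carry a small filter that lets a query skip chunks that definitely do not contain the indicator being hunted:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: membership tests may return false positives,
    but never false negatives, so a negative answer lets a query safely
    skip brute-force scanning an entire chunk of logs."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item):
        # Derive k positions from salted SHA-256 digests of the item.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

# One filter per chunk, recording the dynamic facet values seen in it.
chunk_filter = BloomFilter()
for value in ("198.51.100.23", "mimikatz.exe", "svchost.exe"):
    chunk_filter.add(value)

print(chunk_filter.might_contain("mimikatz.exe"))  # True: this chunk must be scanned
print(chunk_filter.might_contain("10.0.0.1"))      # a negative answer lets the chunk be skipped
```

Unlike labels, this requires no decision at ingest about which facets matter; any value in the event can be recorded, at the cost of occasional false-positive scans.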
Furthermore, because of the constantly changing and dynamic nature of threats, the data facets that have to be filtered against to locate a specific threat cannot always be known and extracted at ingestion time. This necessitates the ability to perform a schema-on-read transformation when querying, filtering, and pivoting on the data for many cybersecurity use cases.
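A minimal sketch of schema-on-read (the log lines and field pattern are invented): raw events are retained untouched, and a field definition is applied only at query time, when the hunt defines it:

```python
import re

# Raw events are stored as-is; no fields were extracted at ingest.
raw_events = [
    "Jan 10 04:12:33 host1 sshd[411]: Failed password for root from 203.0.113.7",
    "Jan 10 04:12:40 host1 sshd[411]: Failed password for admin from 203.0.113.7",
]

def extract(pattern, events):
    """Schema-on-read: apply a field definition at query time."""
    regex = re.compile(pattern)
    return [m.groupdict() for e in events if (m := regex.search(e))]

# A facet nobody anticipated at ingest: the target account of failed SSH logins.
facets = extract(r"Failed password for (?P<user>\S+) from (?P<src_ip>\S+)", raw_events)
print(facets)  # [{'user': 'root', 'src_ip': '203.0.113.7'}, {'user': 'admin', ...}]
```

Because the extraction lives in the query rather than the ingest pipeline, a newly defined facet applies retroactively to every retained event.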
Analytics & Enrichments
The third area I want to discuss is analytics and enrichments. While being able to do analytics on operational data is increasingly important (especially as we move to AIOps), the need for advanced analytics and data enrichment capabilities is a basic necessity for advanced threat detection, at a level I would argue is beyond what is required for most operational use cases.
Capabilities such as sub-queries, joins, and complex aggregates are required for threat hunters to uncover new threats and discover linkages in threat actor patterns. Capabilities like free-text query of the entire log are required for complex zero-day attack scenarios, where the data you need to look for may not have been known in advance and so could not have been extracted. Custom functions and data enrichments are also required as basic capabilities that can execute both at ingestion time and at query time.
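As a toy illustration of the kind of join a hunter relies on (the data sources and values are invented), correlating authentication events with DNS lookups on a shared source IP attributes suspicious lookups to specific users:

```python
# Two independent security data sources, joined at query time on source IP.
auth_events = [
    {"user": "alice", "src_ip": "10.0.0.5"},
    {"user": "bob",   "src_ip": "10.0.0.9"},
]
dns_events = [
    {"src_ip": "10.0.0.5", "domain": "evil-updater.example"},
    {"src_ip": "10.0.0.5", "domain": "cdn.example"},
]

# Inner join: which users' machines looked up which domains?
joined = [
    {"user": a["user"], "domain": d["domain"]}
    for a in auth_events
    for d in dns_events
    if a["src_ip"] == d["src_ip"]
]

print(joined)  # both domains attributed to alice; bob's IP made no lookups
```

A log store that can only grep individual streams cannot answer this question; it requires relating two sources that were never connected at ingest.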
Why enrichments? Because much of the wider security and risk context is not actually in the log or event… data points such as the risk level of an IP, the email address of a user, or the business applications running on an asset are all stored externally and need to be enriched at query time. These are just a few examples; nearly any query against security data is going to execute many enrichments in order to make use of the data.
Why not simply enrich during ingestion? Because it is, for example, just as important to know the risk level of an IoC at the time the event occurred as it is right now. Both are relevant when doing a threat investigation, which means that enrichments that work only at ingestion time do not solve the use case.
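A small sketch of why query time matters (the intel feed and dates are invented): a time-versioned lookup can return the risk level an indicator had at the event’s own timestamp, not just the level it carries today:

```python
from bisect import bisect_right

# Hypothetical time-versioned threat intel: the risk of an IP as it changed.
RISK_HISTORY = {
    "203.0.113.7": [("2023-01-01", "low"), ("2023-04-01", "high")],
}

def risk_at(ip, date):
    """Query-time enrichment: the risk level as of a given date."""
    history = RISK_HISTORY.get(ip, [])
    idx = bisect_right([d for d, _ in history], date) - 1
    return history[idx][1] if idx >= 0 else "unknown"

# The same IP enriches differently depending on which moment is asked about.
print(risk_at("203.0.113.7", "2023-02-15"))  # low  -- at the time of the event
print(risk_at("203.0.113.7", "2023-05-01"))  # high -- at investigation time
```

An ingest-only enrichment would have stamped the February event “low” forever, hiding the fact that the indicator was later reclassified.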
I hope that people find this post useful, as I have attempted to distill the lessons we have learned, not only from the extensive experience of building our own observability layer, but also from our continuous evaluation of dozens of other technology stacks in this space, both open and closed source. It is a very large and complex area, and one that is changing all the time. Indeed, I admit that some of the points I raise here may be obsolete six months from now!
I hope, however, that at a minimum I have raised awareness of some of the challenges unique to the security arena that make it more difficult (though not impossible) to create a “one size fits all” enterprise observability solution. I am interested in your thoughts and perspectives below.