The Cybersecurity Data Model Proliferation Issue — Part 2 of 3, “How is XKCD so prescient”

Jason Keirstead
Jun 23, 2021


Source: “Standards” — https://xkcd.com/927/

Previously I outlined what a cybersecurity data model is and why it is foundational, not only to be able to execute mature cyber operations in your enterprise, but also in order to enable basic collaboration on intelligence and defense.

Today I am going to take you on a short walk through what data models exist in our industry, their history, and where they are typically used — with a main focus on community-oriented models.

This will lead into Part 3 of this series, which will attempt an objective analysis of the problems we face due to data model proliferation, and how to solve them.

STIX 2 SCO

The STIX 2 SCO data model’s origin dates back to 2016, after development of STIX was migrated from MITRE to OASIS in order to pursue a true international de jure standard. During that work, the Cyber Threat Intelligence Technical Committee (CTI TC) decided that the CybOX standard should be merged into the STIX standard, and the merged result eventually became known as the STIX 2 Cyber Observable (SCO) data model.

STIX 2 SCO also comes with a patterning language called STIX 2 Patterning. This language can be used to write detections against the STIX 2 data model, and projects like STIX Shifter use it to enable portable hunting and detections.
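To give a feel for what a STIX 2 pattern looks like, here is a minimal Python sketch that builds a single-comparison pattern string. The helper name `stix_comparison` is hypothetical and not part of any STIX library; the sketch only covers the simplest form of the language (one equality comparison on one object path) and assumes the value is a string.

```python
def stix_comparison(object_path: str, value: str) -> str:
    """Build a one-comparison STIX 2 pattern such as [ipv4-addr:value = '198.51.100.7']."""
    # STIX pattern string literals are single-quoted and use backslash
    # escaping, so escape backslashes first and then single quotes.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"[{object_path} = '{escaped}']"


if __name__ == "__main__":
    # "ipv4-addr" and "process" are SCO object types; "value" and
    # "command_line" are properties defined on those objects.
    print(stix_comparison("ipv4-addr:value", "198.51.100.7"))
    print(stix_comparison("process:command_line", "cmd.exe /c 'whoami'"))
```

The object path on the left side of the comparison is where the SCO data model shows up: a pattern is only portable if every tool that evaluates it agrees on what `ipv4-addr:value` refers to.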

STIX 2 SCO is rarely used natively for logging or events at present. However, it is the native format for threat intelligence sharing under the STIX 2 standard, so some form of support for ingesting it is actually quite widespread.

Sigma

The Sigma project also dates to December 2016, when it was released by Florian Roth (@cyb3rops) to solve the problem of writing portable alerts and correlation rules that can be used in multiple SIEM systems. The project rapidly grew beyond just SIEMs, and is now widely used throughout the detection engineering community, helping to enable collaboration across the industry without being tied to specific vendors or security stacks. Sigma includes a core data model called the Sigma Taxonomy, which is the data model that rules are usually* written against. Rules written using this taxonomy can be converted to the data models of other cybersecurity systems. Sigma is a very community-driven project, and is not directly tied to any specific major commercial vendor, although some vendors are starting to natively support it.
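The conversion step can be pictured as a field-name translation. The Python sketch below is an illustration only and is not how the Sigma tooling is implemented: the mapping table `SIGMA_TO_ECS` and the function `translate_selection` are hypothetical, and the rule selection is reduced to a flat dictionary. It assumes Sigma-style process creation fields on the input side and Elastic ECS field names on the output side.

```python
# Hypothetical mapping from Sigma taxonomy field names (process creation)
# to Elastic ECS field names. A real converter carries far more than this.
SIGMA_TO_ECS = {
    "Image": "process.executable",
    "CommandLine": "process.command_line",
    "ParentImage": "process.parent.executable",
}


def translate_selection(selection: dict, field_map: dict) -> dict:
    """Rename the fields of a flat rule selection, keeping any |modifiers."""
    translated = {}
    for key, value in selection.items():
        # Sigma keys may carry modifiers, e.g. "Image|endswith".
        name, *modifiers = key.split("|")
        if name not in field_map:
            raise KeyError(f"no mapping for field: {name}")
        translated["|".join([field_map[name], *modifiers])] = value
    return translated


if __name__ == "__main__":
    selection = {
        "Image|endswith": "\\powershell.exe",
        "CommandLine|contains": "-enc",
    }
    print(translate_selection(selection, SIGMA_TO_ECS))
```

A rule that names a field missing from the mapping fails loudly here, which is the same portability problem a rule runs into when it is written directly against a product-specific data model.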

While the Sigma Taxonomy is leveraged internally as a collaborative data model, it is extremely rare to have data actually ingested in this taxonomy natively, and I know of no commercial nor open source products that do that.

* It is worth noting that because usage of the taxonomy is optional due to how Sigma works, some Sigma rules are written directly against other data models. However, these rules are not as portable as ones that leverage the taxonomy, and thus use of the taxonomy is encouraged.

MITRE CAR

The MITRE Cyber Analytics Repository (CAR) project dates to 2016 as well, and is a community effort by MITRE to create a repository of detection analytics for MITRE ATT&CK TTPs, in order to enable defenders to disrupt adversaries. The analytics in CAR are developed using a mixture of systems and data models depending on their origin, including Splunk, Sigma, Zeek, and others. However, CAR also comes with its own data model, into which every analytic is also encoded, along with pseudo-code.

The MITRE CAR data model is inspired by, but not identical to, the CybOX data model, which was used in both STIX 1.x and the Malware Attribute Enumeration and Characterization format (MAEC) 4.x. CybOX has since been retired, as it was superseded by STIX 2 SCO (which is also the basis for MAEC 5.x). MITRE CAR, however, has not yet aligned with STIX 2 SCO, and continues as an independent data model. MITRE continues to release new analytics into CAR on a fairly regular basis.

OSSEM

The Open Source Security Events Metadata (OSSEM) model is a community effort that is now part of the wider Open Threat Research Forge (OTRF) set of community projects. OSSEM was started in 2018 by Roberto Rodriguez (@Cyb3rWard0g) to create a community-led forum where information about security event logs could be shared, and hopefully lead to improved data quality throughout the industry.

OTRF has grown and now has many contributors, and a wide community of threat hunters that leverage its projects — including things such as the Mordor datasets as well as the Hunting ELK (HELK) Elastic stack. Many of these projects leverage the OSSEM model in one way or another. Some commercial vendors are beginning to support this model as well, typically by providing mappings to/from OSSEM in their data model documentation.

Elastic ECS

The Elastic Common Schema (ECS) model was released by Elastic in 2019 as a model that Elastic could both normalize their own product lines to and build enhanced cybersecurity capabilities around. While ECS is a model created by a vendor, it is worth including in this list due both to the widespread usage of open-source Elastic tools in the community and to its community-driven nature (it is open source under the Apache license, and it also has many pull requests that originate outside Elastic). The open source cybersecurity products that Elastic releases (including Elastic Beats) leverage ECS as their primary method of data normalization, which is a large driver of its growth in the community.

Azure ASIM

The Azure Sentinel Information Model (ASIM) was released in 2020 as part of Microsoft’s Azure Sentinel offering. ASIM is fairly new, and is thus under active development. However, it is worth including in this list due to its active effort to map to the OSSEM community model, which is definitely a positive step and something I very much want to encourage.

While ASIM claims to make an effort to align with the OSSEM data model, it is important to point out that the field naming schemes are actually slightly different when it comes to casing and format. This is very significant, because it means that detections written against the OSSEM model cannot be used against ASIM data without some kind of manipulation. Simple misalignments like this can greatly negate the value of leveraging a common data model at all.
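A small Python sketch makes the consequence concrete. The field names are used here for illustration: it assumes an OSSEM-style snake_case name (`src_ip_addr`) on the detection side and an ASIM-style PascalCase name (`SrcIpAddr`) on the data side. The helper `snake_to_pascal` is hypothetical, and a purely mechanical conversion like this should not be expected to work for every field.

```python
def snake_to_pascal(name: str) -> str:
    """Convert a snake_case field name to PascalCase, e.g. src_ip_addr -> SrcIpAddr."""
    return "".join(part.capitalize() for part in name.split("_"))


def field_value(event: dict, detection_field: str):
    """Look up a detection's field in an event, trying the raw name first
    and then the PascalCase form. Returns None if neither is present."""
    if detection_field in event:
        return event[detection_field]
    return event.get(snake_to_pascal(detection_field))


if __name__ == "__main__":
    # Illustrative event using PascalCase field names.
    event = {"SrcIpAddr": "10.0.0.5", "DstIpAddr": "198.51.100.7"}

    # A direct lookup with the snake_case name finds nothing...
    print(event.get("src_ip_addr"))
    # ...while the lookup that normalizes the casing succeeds.
    print(field_value(event, "src_ip_addr"))
```

The detection logic itself is unchanged; only the field name differs, yet without the extra translation layer the rule silently matches nothing.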

Takeaways

There are many, many, MANY more product-specific data models that exist in addition to the ones listed here (Splunk CIM, QRadar LEEF, ArcSight CEF, Google UDM, etc.), and this list is not meant to be comprehensive. The purpose is simply to illustrate that there are a plethora of cybersecurity data models in widespread use, and what’s more, that several of them are mostly community driven and not tied to products. Yet, for the most part, these communities are not aligning, and are continuing to forge separate paths. Meanwhile, new vendors continue to invent their own data models at an alarming pace.

In the final part of this series, I am going to outline the challenges we face as a community and industry due to data model proliferation, and the path I would like to see taken on a journey to solve them.
