The Cybersecurity Data Model Proliferation Issue — Part 1 of 3, “Model me this…”
This series of blog entries is on a topic I am very passionate about: improving collaboration in the cybersecurity industry. However, I am not going to talk about sharing threat intelligence, or about standard APIs, or about collaborative detection engineering. Instead, today I am going to talk about something even more fundamental, something that lies at the heart of all of those topics, and something on which we, as an industry, continue to fail to make real progress. That thing is a standard cybersecurity data model.
I have written about this problem in the past. I have given seminars on it at conferences. I have seen others write about it. Yet we, as an industry, continue not to prioritize it as a real problem to be solved. I am writing this three-part series in the hope of driving some conversation toward real, sustainable improvement on this issue. Maybe you will agree with me, maybe you won't; regardless, let's please engage and continue the conversation.
Originally I tried to fit all of these thoughts into one blog post, but it was far too long and ungainly, so I decided to split it into three parts, which I will publish this week.
What is a data model and why do I care?
A cybersecurity data model is a way to organize ingested log, alert, event, and finding information so that it can reliably support things such as findability, usability, and analytics. A data model is what allows you to look for the "source_ip" field and have it match across many different data sources and products, each of which may represent this data differently in its own native format ("source_ip_address" vs. "source-ip" vs. "srcip" vs. "sourceip", etc.). Without a data model applied to your data, essentially all you can write detections against are raw strings in logs. For a lot more detail on data models and why they are important, I will refer you to this great blog entry from way back in 2017 by Roberto Rodriguez (@Cyb3rWard0g): "Ready to hunt? First Show Me Your Data"
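To make that idea concrete, here is a minimal, hypothetical sketch in Python of what field normalization does for you. The field names and mapping below are illustrative only, not taken from any real product or schema:

```python
# A minimal sketch (field names and mappings are illustrative, not from
# any particular product) of what a data model buys you: each vendor's
# native field name is mapped to a single common-model field, so a
# detection can query one name across every source.
FIELD_MAP = {
    "source_ip_address": "source_ip",
    "source-ip": "source_ip",
    "srcip": "source_ip",
    "sourceip": "source_ip",
}

def normalize(event: dict) -> dict:
    """Rename known vendor fields to their common-model equivalents."""
    return {FIELD_MAP.get(key, key): value for key, value in event.items()}

# Two products, two native formats, one query after normalization.
firewall_event = {"srcip": "10.1.2.3", "action": "deny"}
proxy_event = {"source-ip": "10.1.2.3", "status": "blocked"}

for event in (firewall_event, proxy_event):
    if normalize(event).get("source_ip") == "10.1.2.3":
        print("detection matched:", event)
```

Real data models and the pipelines that apply them are far richer than a field-rename dictionary, of course, but the core value is the same: one field name your detections can rely on, no matter what the source called it.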
Common data models are vital in cybersecurity, most especially in detection engineering and threat hunting, for three key reasons.
The first is composability and maintainability. Detections and analytics built on a normalized data model are easier to combine and remix into new ones than detections that simply search for strings in the data, because they all operate on the same underlying fields. They are also easier to maintain over time, because as new people come onto a project, everyone is "speaking the same language" when it comes to their detections.
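As a rough illustration (the field names here are assumed for the example, not drawn from any specific model), detections expressed against shared fields compose naturally:

```python
# Hypothetical example: two small detections written against common-model
# fields ("source_ip", "event_type", "user_name" are assumed names here)
# can be composed into a new one without touching any native log format.
def from_external_network(event: dict) -> bool:
    return not event.get("source_ip", "").startswith("10.")

def admin_logon(event: dict) -> bool:
    return event.get("event_type") == "logon" and event.get("user_name") == "admin"

def external_admin_logon(event: dict) -> bool:
    # A new detection built purely by combining the two above.
    return from_external_network(event) and admin_logon(event)
```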
Another benefit of common data models is portability. Being able to carry your detections and hunting skills from data source to data source, and from tool to tool, is incredibly valuable and can save you a lot of time. This portability is only achievable if you build your detections against a common data model that can somehow be applied to all of your data sources.
Finally, collaboration. We are seeing a rapidly growing movement around community development of detections and analytics. Projects like MITRE ATT&CK, Sigma, OTRF, and others are galvanizing the community around common goals. However, collaboration around analytics and detections is very difficult as long as everyone in the community is not "speaking the same language," as it were; if all of the community members cannot participate equally, the wisdom of the crowd is diminished.
Takeaways
Hopefully it has become a bit obvious why data models are so important to cybersecurity. So what’s the problem? Is there one?
Later this week I will be publishing the next post in this series, where we will take a bit of a journey through the data models that exist in cybersecurity today, their history, and why they came to be.