
In December 2013, Target's security operations center received an alert from its FireEye deployment. The malware that would eventually exfiltrate 40 million payment cards had been caught -- detected, flagged, and surfaced to analysts. The system worked. The alert was generated. The investigation never happened.
It wasn't a failure of detection. It was a failure of attention.
This is the uncomfortable truth at the center of modern security operations: the attack that breaches your company is, statistically, already in your SIEM. It's sitting somewhere in the 12,000 alerts your team will see this week. The question isn't whether your tools will see it. The question is whether anyone will look.
Every SOC analyst reading this knows the feeling. The queue is bottomless. The high-priority bucket has 400 items. You triage by reflex -- close the obvious false positives, defer the ambiguous ones, escalate the rare clean indicator. Somewhere in that motion, a real attack is being closed as benign, and nobody may know for months: IBM's 2021 Cost of a Data Breach report put the average time to identify and contain a breach at 287 days.
This article isn't about telling SOC engineers their job is hard. They know. It's about the structural reasons the alert flood persists, the mental models that the best detection teams use to escape it, and the engineering practices that separate SOCs that catch breaches from SOCs that file post-mortems about them.

Why the Flood Never Recedes
The instinct of every new SOC manager is the same: tune the rules, reduce the noise, get the alert count down. They try for six months. The queue stays full. Then they realize the alert flood isn't a configuration problem. It's a structural one.
Three forces keep it going:
- Logs grow faster than analysts. Every new SaaS app, every cloud workload, every microservice generates new telemetry. Log ingestion volumes at most enterprises grow 30-50% year over year. Detection rule count grows with it. Analyst headcount doesn't.
- Vendor defaults are tuned for fear, not signal. Out-of-the-box detection content from SIEM vendors is deliberately broad -- fire on anything suspicious, let the customer figure out what's relevant. This protects the vendor from "you missed it" lawsuits. It guarantees alert overload for the customer.
- Cost pressure pushes the wrong direction. Every SOC has a Splunk bill or a Sentinel bill that finance wants reduced. The fastest way to reduce ingestion cost is to drop log sources. The logs that get dropped first are usually the verbose ones -- endpoint detail, DNS, authentication telemetry. These are exactly the data sources where modern attacks live.
The engineers who escape this loop accept a hard truth early: you cannot tune your way out of alert fatigue. You have to change what an alert is.
The shadow tuning problem nobody talks about
In every SOC older than three years, there's a parallel reality. Analysts have developed muscle-memory shortcuts to clear the queue: certain alert names get auto-closed without reading. Certain source IPs get suppressed locally. Certain time-of-day patterns get ignored because "that's just the backup job."
None of this is in the formal rule documentation. It lives in Slack messages, in the analyst's head, in a forked playbook nobody updated. When a new analyst joins, they learn it through osmosis. When that analyst leaves, the knowledge leaves with them.
This is shadow tuning, and it's the reason detection engineering exists as a discipline. The fix isn't training analysts to be more diligent. The fix is making detection rules an engineering artifact -- versioned, tested, owned -- instead of folklore.
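One way to pull shadow tuning into the open is to record every suppression as reviewable data with an owner, a reason, and an expiry date. The sketch below is a minimal illustration, not a feature of any particular SIEM: the `Suppression` fields, the `is_suppressed` helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Suppression:
    """A documented exception that replaces an unwritten 'we always close that one' habit."""
    rule_name: str   # detection the exception applies to
    source_ip: str   # hypothetical match field; real exceptions may key on other fields
    reason: str
    owner: str
    expires: date    # forces periodic re-review instead of permanent silence


SUPPRESSIONS = [
    Suppression(
        rule_name="bulk_file_read",
        source_ip="10.0.5.20",
        reason="Nightly backup job",
        owner="detection-eng",
        expires=date(2025, 6, 30),
    ),
]


def is_suppressed(alert: dict, today: date, suppressions=SUPPRESSIONS) -> bool:
    """Return True only if an unexpired, documented suppression matches the alert."""
    return any(
        s.rule_name == alert["rule_name"]
        and s.source_ip == alert["source_ip"]
        and today <= s.expires
        for s in suppressions
    )


backup_alert = {"rule_name": "bulk_file_read", "source_ip": "10.0.5.20"}
print(is_suppressed(backup_alert, date(2025, 6, 1)))  # suppression still valid
print(is_suppressed(backup_alert, date(2025, 7, 1)))  # expired, so the alert surfaces again
```

Because the list lives in version control, "that's just the backup job" becomes a reviewed change with a name attached, and the expiry date means it has to be justified again rather than silently inherited.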

Severity Is Broken (And Everyone Knows It)
Pull up your SIEM right now. Filter by severity = High. How many alerts are in there?
If the answer is over a hundred, your severity system has lost its meaning. This is the universal failure mode: severity inflation. Every detection author writes their rule as "High" because they want it noticed. Within a year, everything is High, which is functionally the same as nothing being High.
The fix isn't relabeling. The fix is forcing severity to mean something operationally measurable:
- Critical should mean: an analyst must stop other work and respond within 15 minutes. If you can't actually do that for the volume of Critical alerts you have, the rating is wrong.
- High should mean: this gets worked the same business day, no exceptions.
- Medium should mean: this gets worked this week.
- Low is anything else.
Most SOCs that audit their alert volume against these definitions discover they've labeled roughly 30% of alerts at a severity their team has no capacity to respond to. The system has been broken for years; nobody noticed because everyone shared the same blindness.
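The audit is mostly arithmetic: compare the analyst time each severity demands with the time the team actually has. The sketch below assumes made-up weekly volumes, handling times, and staffing; every number and name in it is a placeholder to replace with your own figures.

```python
# Hypothetical weekly figures; substitute your own SIEM counts and staffing.
WEEKLY_ALERTS = {"Critical": 120, "High": 900, "Medium": 2500}
MINUTES_PER_ALERT = {"Critical": 45, "High": 20, "Medium": 10}  # assumed average handling time
ANALYST_MINUTES_PER_WEEK = 5 * 35 * 60  # assumed: 5 analysts x 35 triage hours each


def capacity_audit(alerts, minutes_per_alert, available_minutes):
    """Spend analyst time on severities in priority order and report what is left uncovered."""
    report = {}
    remaining = available_minutes
    for severity in ("Critical", "High", "Medium"):
        needed = alerts[severity] * minutes_per_alert[severity]
        covered = min(needed, remaining)
        remaining -= covered
        report[severity] = {
            "minutes_needed": needed,
            "fraction_workable": round(covered / needed, 2) if needed else 1.0,
        }
    return report


report = capacity_audit(WEEKLY_ALERTS, MINUTES_PER_ALERT, ANALYST_MINUTES_PER_WEEK)
for severity, row in report.items():
    print(severity, row)
```

With these invented inputs, the team can work every Critical alert, a little over a quarter of the High alerts, and none of the Medium ones. A severity label the team can only honor a quarter of the time is the kind of rating the definitions above are meant to expose.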
Risk-based alerting: the actual shift happening in mature SOCs
The emerging escape hatch is risk-based alerting, often abbreviated RBA. Instead of every detection firing a discrete alert, detections contribute risk score to entities -- users, assets, IPs. An alert fires only when an entity accumulates enough cross-signal risk to be worth investigating.
What this looks like in practice: a single failed login no longer pages anyone. A user with three failed logins, a successful login from a new country, an unusual sudo command, and an outbound connection to an uncategorized domain -- within a 4-hour window -- generates one notable event with every contributing signal attached.
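The accumulation logic can be sketched in a few lines. This is a simplified illustration of the idea, not how any specific SIEM implements RBA: the per-signal scores, the threshold of 75, and the `notable_events` function are assumptions chosen so the example above crosses the line only when all of its signals are present.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=4)  # correlation window from the example above
THRESHOLD = 75               # assumed value; choosing it is the real tuning work

# Assumed per-signal risk scores.
RISK_SCORES = {
    "failed_login": 5,
    "new_country_login": 25,
    "unusual_sudo": 20,
    "uncategorized_domain": 15,
}


def notable_events(signals):
    """Return one notable event per entity, raised when its windowed risk first reaches THRESHOLD.

    `signals` is a list of (timestamp, entity, signal_name) tuples sorted by timestamp.
    Simplification: an entity fires at most once per run.
    """
    recent = defaultdict(list)  # entity -> [(timestamp, signal_name)] still inside the window
    fired = set()
    events = []
    for ts, entity, name in signals:
        window = [(t, n) for t, n in recent[entity] if ts - t <= WINDOW]
        window.append((ts, name))
        recent[entity] = window
        score = sum(RISK_SCORES[n] for _, n in window)
        if score >= THRESHOLD and entity not in fired:
            fired.add(entity)
            events.append(
                {"entity": entity, "score": score, "signals": [n for _, n in window]}
            )
    return events


start = datetime(2025, 1, 6, 9, 0)
signals = [
    (start, "alice", "failed_login"),
    (start + timedelta(minutes=1), "alice", "failed_login"),
    (start + timedelta(minutes=2), "alice", "failed_login"),
    (start + timedelta(minutes=5), "bob", "failed_login"),
    (start + timedelta(minutes=30), "alice", "new_country_login"),
    (start + timedelta(minutes=90), "alice", "unusual_sudo"),
    (start + timedelta(minutes=150), "alice", "uncategorized_domain"),
]
events = notable_events(signals)
print(events)
```

In this run, `bob` never generates anything from his single failed login, while `alice` produces exactly one event carrying all of her contributing signals. The same signals spread five hours apart would never accumulate, which is the slow-and-low weakness of a fixed window.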
What goes right with RBA
- Alert volume drops by an order of magnitude. Teams report going from 10,000 weekly alerts to 200 notable events. Analysts can actually read each one.
- Context arrives with the alert. The notable event includes all contributing signals, so the analyst doesn't have to pivot through five tools to assemble the story.
- Attack chains become visible. Individually weak signals (one failed MFA, one unusual file modification) become strong when correlated against a single entity over time. This is where lateral-movement campaigns live.
What goes wrong with RBA
- Threshold tuning is a real project. Set the risk threshold too high and you'll miss real attacks. Set it too low and you'll recreate the alert flood. The first six months of an RBA program are mostly tuning.
- Slow-and-low attacks can stay under the line. A patient attacker who keeps each individual signal weak and spreads activity over weeks may never cross the risk threshold. RBA needs to be paired with longer-window analytics and threat hunting.
- It requires good entity resolution. If your SIEM can't reliably tie an IP, a username, an endpoint, and a cloud identity to the same entity, the scoring math falls apart. Most environments have entity-resolution gaps they didn't know existed until RBA exposed them.
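The entity-resolution dependency from the last point is easy to see in miniature. The sketch below assumes a hand-built lookup table (`IDENTITY_MAP`) and a hypothetical `resolve` helper; a real environment would derive the mapping from HR, CMDB, DHCP, and identity-provider data, and would need time-aware lookups for addresses that change hands.

```python
# Hypothetical identity map; all identifiers and the entity name are invented.
IDENTITY_MAP = {
    "jsmith": "user:jane.smith",
    "jane.smith@example.com": "user:jane.smith",
    "LAPTOP-4821": "user:jane.smith",
    "10.0.8.14": "user:jane.smith",  # assumes a static address assignment
}


def resolve(signals, identity_map=IDENTITY_MAP):
    """Group signals by canonical entity and collect identifiers that could not be resolved."""
    by_entity = {}
    unresolved = []
    for identifier, signal in signals:
        entity = identity_map.get(identifier)
        if entity is None:
            unresolved.append(identifier)  # risk that lands here is never scored
            continue
        by_entity.setdefault(entity, []).append(signal)
    return by_entity, unresolved


raw_signals = [
    ("jsmith", "failed_login"),
    ("LAPTOP-4821", "unusual_sudo"),
    ("svc-backup", "new_country_login"),
]
by_entity, unresolved = resolve(raw_signals)
print(by_entity)
print(unresolved)
```

Tracking the `unresolved` list over time is one cheap way to measure the gap: every identifier in it is risk that cannot be attributed to any entity's score.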

Detection Engineering Is a Job, Not a Side Task
In legacy SOCs, "writing detection rules" is something analysts do in their spare hours, between triage shifts. The result is what every legacy SOC has: a rule library that nobody owns, that's never been tested, where 40% of rules either never fire or fire constantly.
Mature security organizations have separated detection engineering from SOC operations entirely. The detection engineering team is a software team. They write rules as code. They test rules. They version-control rules. They monitor each rule's signal-to-noise ratio over time and retire ones that decay.
Real scenario: the detection-as-code lifecycle
Here's what this looks like at an organization that takes it seriously:
- A new threat is reported -- say, a phishing technique abusing Microsoft device-code authorization. The detection engineer reads the writeup, identifies the telemetry that would be visible (specific event IDs in Azure AD sign-in logs, unusual user-agent patterns).
- They write the detection logic in their detection-as-code repository -- typically using Sigma, KQL, SPL, or a custom DSL.
- They write tests against a corpus of known-good and known-bad log samples. CI runs these tests on every change.
- The rule is deployed first in monitor-only mode. Telemetry is captured: how often does it fire? What's the false-positive rate against a labeled validation set?
- After two weeks of tuning, it goes live with a defined severity, owner, and runbook link.
- Quarterly, the rule's signal quality is reviewed. If it's decayed (the attacker tradecraft shifted), it's revised or retired.
This is the difference between a rule library and a detection program. The former accumulates. The latter is curated.
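The test-and-metadata steps in that lifecycle can be sketched as follows. This is an illustration under stated assumptions, not a real rule: the event field names (`auth_protocol`, `result`, `device_managed`) are simplified stand-ins rather than an actual sign-in log schema, and the owner and runbook path are hypothetical.

```python
# Metadata that ships with the rule; values are hypothetical.
RULE = {
    "name": "device_code_signin_unmanaged_device",
    "severity": "High",
    "owner": "detection-eng",
    "runbook": "runbooks/device-code-phishing.md",
    "mode": "monitor",  # promoted to "live" only after the tuning period
}


def matches(event: dict) -> bool:
    """Flag successful device-code sign-ins from devices not known to be managed."""
    return (
        event.get("auth_protocol") == "deviceCode"
        and event.get("result") == "success"
        and not event.get("device_managed", False)
    )


# Labeled sample corpus; in practice these would be captured log samples.
KNOWN_BAD = [
    {"auth_protocol": "deviceCode", "result": "success", "device_managed": False},
]
KNOWN_GOOD = [
    {"auth_protocol": "deviceCode", "result": "success", "device_managed": True},
    {"auth_protocol": "deviceCode", "result": "failure", "device_managed": False},
    {"auth_protocol": "password", "result": "success", "device_managed": False},
]


def run_rule_tests(rule, known_bad, known_good):
    """Return (missed detections, false positives) for a rule against labeled samples."""
    missed = [e for e in known_bad if not rule(e)]
    false_positives = [e for e in known_good if rule(e)]
    return missed, false_positives


missed, false_positives = run_rule_tests(matches, KNOWN_BAD, KNOWN_GOOD)
print("missed:", missed)
print("false positives:", false_positives)
```

In CI, a non-empty `missed` or `false_positives` list would fail the build, so a change that weakens the rule is caught before it reaches production.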
What goes right with detection engineering
- Coverage becomes measurable. Mapping rules to MITRE ATT&CK tactics and techniques reveals gaps -- you can literally see which techniques your detection doesn't cover.
- Quality is enforced. No rule ships without tests, an owner, a runbook, and defined retirement criteria.
- Knowledge survives turnover. Because rules are code with documentation, the senior analyst leaving doesn't take the SOC's institutional knowledge with them.
- Purple-team feedback closes the loop. Red team writes an attack, detection engineering writes the detection, both teams iterate. This is how detection actually improves over time.
What goes wrong with detection engineering
- It needs real headcount. A detection engineering team is two to five engineers minimum, separate from your SOC analysts. Most mid-size organizations refuse to fund it and try to bolt the function onto analyst time. It doesn't work.
- MITRE ATT&CK coverage gets gamed. Teams optimize for "we cover technique T1078" by writing one weak rule and checking the box. Coverage is binary in the matrix but probabilistic in reality. A single rule for credential abuse doesn't actually mean you'd detect credential abuse.
- Detection logic becomes brittle. Rules tightly coupled to specific tool versions, log formats, or vendor field names break when something upstream changes. Without integration testing, breakage goes unnoticed until an audit.
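One guard against checkbox coverage is to report three states per technique instead of two: covered by a rule that has been validated against the technique, covered only by untested rules, or not covered at all. The rule inventory, the `validated` flag, and the `coverage_status` helper below are hypothetical; the technique IDs are real ATT&CK identifiers used only as labels.

```python
# Hypothetical rule inventory. "validated" means the rule was observed firing
# during a purple-team exercise or replay of the technique.
RULES = [
    {"name": "impossible_travel", "technique": "T1078", "validated": True},
    {"name": "dormant_admin_login", "technique": "T1078", "validated": False},
    {"name": "oauth_consent_grant", "technique": "T1528", "validated": False},
]
TECHNIQUES_IN_SCOPE = ["T1078", "T1528", "T1550"]


def coverage_status(rules, techniques):
    """Classify each technique as 'validated', 'untested', or 'none'."""
    status = {}
    for technique in techniques:
        mapped = [r for r in rules if r["technique"] == technique]
        if any(r["validated"] for r in mapped):
            status[technique] = "validated"
        elif mapped:
            status[technique] = "untested"
        else:
            status[technique] = "none"
    return status


status = coverage_status(RULES, TECHNIQUES_IN_SCOPE)
print(status)
```

Even "validated" here only means one rule fired once under test conditions. It narrows the gap between the matrix and reality without closing it.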

The Identity Blind Spot That Catches Most Modern Breaches
Walk through the post-mortems of major breaches from the past five years -- SolarWinds, Colonial Pipeline, Microsoft (Storm-0558), MGM, the Snowflake customer compromises in 2024. A pattern dominates: identity was either the initial access or the step that turned a foothold into a breach. Credential stuffing, MFA fatigue, stolen session tokens, forged tokens, OAuth consent phishing, compromised service accounts.
If your SIEM is still primarily network-and-endpoint-focused, you have a structural blind spot in the place where modern attacks live.
The hard part is that identity telemetry is verbose and contextually dependent. A single sign-in event in isolation tells you almost nothing. A sign-in correlated with the user's normal location pattern, their typical device, the time of day they usually work, their role in the org, and what they did after logging in -- that tells you whether to care.
What's commonly missed in identity-focused detection
- Service account abuse. Service accounts often have broad permissions, no MFA, no behavioral baseline, and their authentications are dismissed as "automation." Attackers know this. Compromised service accounts are the single most common lateral-movement vector in cloud environments.
- Token theft and replay. A stolen session token bypasses MFA entirely. The detection signal is subtle -- the same token used from two geographic locations in a short window, or used after the user has signed out. Most SIEMs aren't watching for this.
- OAuth consent grants. A user who consents to a malicious third-party app has just handed over persistent access that survives password changes. The Azure AD audit log shows the consent grant. Most detection programs don't have a rule for it.
- Stale privileged access. Accounts that haven't been used in 90 days but still have admin rights. The detection rule is trivial. The remediation is organizational -- and that's why it doesn't get fixed.
The shift mature SOCs are making is treating identity as a first-class data source alongside endpoint and network -- with dedicated detection coverage, dedicated analysts, and explicit ATT&CK coverage for the credential-access and lateral-movement tactics.
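The token-replay signal described above, the same token seen from two locations in a short window, can be sketched like this. The event fields (`token_id`, `country`, `timestamp`), the 30-minute window, and the `find_token_replay` function are assumptions for illustration; real sign-in logs use different field names, and country-level geolocation is noisy when users sit behind VPNs.

```python
from datetime import datetime, timedelta

REPLAY_WINDOW = timedelta(minutes=30)  # assumed window; tune to your environment


def find_token_replay(events):
    """Return token IDs used from two different countries within REPLAY_WINDOW."""
    history = {}  # token_id -> [(timestamp, country)]
    flagged = set()
    for event in sorted(events, key=lambda e: e["timestamp"]):
        seen = history.setdefault(event["token_id"], [])
        for ts, country in seen:
            if country != event["country"] and event["timestamp"] - ts <= REPLAY_WINDOW:
                flagged.add(event["token_id"])
        seen.append((event["timestamp"], event["country"]))
    return flagged


day = datetime(2025, 3, 3, 9, 0)
token_events = [
    {"token_id": "tok-1", "country": "US", "timestamp": day},
    {"token_id": "tok-1", "country": "RO", "timestamp": day + timedelta(minutes=10)},
    {"token_id": "tok-2", "country": "US", "timestamp": day},
    {"token_id": "tok-2", "country": "DE", "timestamp": day + timedelta(days=1)},
    {"token_id": "tok-3", "country": "US", "timestamp": day},
    {"token_id": "tok-3", "country": "US", "timestamp": day + timedelta(minutes=5)},
]
flagged = find_token_replay(token_events)
print(flagged)
```

Only `tok-1` is flagged: `tok-2` changes country a day later, which is plausible travel, and `tok-3` never changes location. In a risk-based setup this would more naturally contribute risk to the user entity than page someone on its own.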

SOAR: What Automation Actually Solves
Every SOC has been sold a SOAR (Security Orchestration, Automation, and Response) platform with the promise that it will eliminate alert fatigue. It won't, but it does solve a real problem if used correctly.
What SOAR is good at:
- Enrichment automation. Every alert needs context -- WHOIS lookups, threat-intel reputation, user details from HR, asset details from CMDB, prior alert history. Doing this manually takes an analyst 8-15 minutes per alert. Automating it takes 30 seconds and arrives with the alert. This is SOAR's most legitimate win.
- Routine containment actions. Disable a compromised account, isolate an endpoint, block an IP at the firewall -- these are well-defined actions with clear reversibility. Automating them shaves response time from hours to seconds.
- Ticket and case management hygiene. Updating Jira, ServiceNow, or whatever the IR system is -- keeping evidence attached, status accurate, timeline complete -- is the kind of busywork SOAR genuinely removes.
What SOAR is bad at:
- Replacing the analyst's judgment. Auto-closing alerts based on enrichment results is how real attacks get auto-dismissed. The classic failure: SOAR auto-closes any alert where the source IP has "low" threat intel reputation, and the attacker uses fresh infrastructure with no reputation yet.
- Justifying its license cost without playbook investment. A SOAR with five playbooks delivers 5% of its value. Most organizations buy the platform, build a handful of integrations, and never staff the playbook engineering needed to make it actually pay off.
The mental model: SOAR is a force multiplier for analysts who already know what to do. It's not a substitute for analysts who don't.
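The difference between enrichment and judgment can be made explicit in a playbook. The sketch below is a hypothetical design, not any SOAR product's API: `enrich` attaches context, `route` decides only who looks at the alert, and neither function can close it. The reputation feed and its labels are invented, and the IPs come from reserved documentation ranges.

```python
# Hypothetical threat-intel feed keyed by IP address.
REPUTATION_FEED = {
    "203.0.113.7": "malicious",
    "198.51.100.9": "low",
}


def enrich(alert: dict, reputation_feed: dict) -> dict:
    """Attach reputation context; an IP absent from the feed is 'unknown', not benign."""
    verdict = reputation_feed.get(alert["source_ip"], "unknown")
    return {**alert, "ip_reputation": verdict}


def route(enriched: dict) -> str:
    """Choose a queue. There is deliberately no code path that returns 'close'."""
    if enriched["ip_reputation"] == "malicious":
        return "escalate"
    return "analyst_review"


fresh_infra = enrich({"rule": "suspicious_login", "source_ip": "192.0.2.55"}, REPUTATION_FEED)
print(fresh_infra["ip_reputation"], route(fresh_infra))
```

An attacker on fresh infrastructure shows up as `unknown` and still reaches an analyst, which is the case the auto-close failure described above gets wrong.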

The Metrics That Actually Tell You If Your SOC Is Working
Most SOC reporting is theater. "We processed 47,000 alerts this quarter" tells you nothing about whether you would have caught a real intrusion. The metrics that matter are different and uncomfortable:
- Mean time to detect (MTTD). From compromise to alert generation. Published industry figures range from days to months depending on the study and on how the intrusion was discovered. World-class operations measure it in hours.
- Mean time to respond (MTTR). From alert generation to containment. This is where SOAR investment shows up.
- True-positive rate per rule. What percentage of each rule's alerts turn out to be real incidents? Rules below 1% TPR are noise generators and should be tuned or retired.
- Coverage map against MITRE ATT&CK. Not "do we have a rule?" -- but "have we tested that the rule actually fires on the technique?" Most teams have never run this test.
- Dwell time of past incidents. When an incident is finally found, how long had it been present? Trending this over time tells you whether your detection program is actually improving.
- Analyst attrition rate. Burnt-out analysts leave. High SOC turnover is a leading indicator of detection program failure. The breach that follows is a lagging indicator.
Reporting these numbers to leadership is uncomfortable because they're usually bad. They're supposed to be uncomfortable. That's the point.
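Three of these numbers fall out of data most SOCs already have in their case-management system. The sketch below uses invented incident timestamps and alert dispositions; the record layout and the `rule_true_positive_rates` helper are assumptions, and the 1% cutoff is the one suggested in the list above.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical closed incidents with the three timestamps that matter.
INCIDENTS = [
    {"compromised": datetime(2025, 1, 1, 8, 0), "alerted": datetime(2025, 1, 3, 8, 0),
     "contained": datetime(2025, 1, 3, 12, 0)},
    {"compromised": datetime(2025, 2, 10, 0, 0), "alerted": datetime(2025, 2, 10, 6, 0),
     "contained": datetime(2025, 2, 10, 8, 0)},
]

# Hypothetical alert dispositions: (rule name, was it a real incident?).
DISPOSITIONS = (
    [("dns_tunnel", True)] * 4 + [("dns_tunnel", False)] * 36
    + [("generic_port_scan", True)] * 1 + [("generic_port_scan", False)] * 199
)


def hours(delta):
    return delta.total_seconds() / 3600


def rule_true_positive_rates(dispositions):
    """Return {rule: true positives / total alerts} for each rule."""
    totals, hits = Counter(), Counter()
    for rule, is_true_positive in dispositions:
        totals[rule] += 1
        hits[rule] += int(is_true_positive)
    return {rule: hits[rule] / totals[rule] for rule in totals}


mttd_hours = mean(hours(i["alerted"] - i["compromised"]) for i in INCIDENTS)
mttr_hours = mean(hours(i["contained"] - i["alerted"]) for i in INCIDENTS)
rates = rule_true_positive_rates(DISPOSITIONS)
noisy_rules = [rule for rule, rate in rates.items() if rate < 0.01]

print(f"MTTD: {mttd_hours:.1f} h, MTTR: {mttr_hours:.1f} h")
print(rates)
print("tune or retire:", noisy_rules)
```

The hard part is not the arithmetic but the inputs: the `compromised` timestamp only exists if incident response reconstructs it, and dispositions are only trustworthy if analysts record them honestly rather than closing by reflex.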

Where This Knowledge Stops Being Enough
Understanding alert fatigue, detection engineering, and risk-based alerting is one piece of operating a real Security Operations Center. The questions that come next are the ones that decide whether your SOC actually catches the breach in progress versus writes the post-mortem about it:
- How do you build a threat hunting program that goes after the slow-and-low attacks that risk-based alerting will never surface?
- How do you architect an incident response runbook that holds together at 2 AM on a Sunday, when the on-call analyst is alone and the executive team is asking when the press release goes out?
- How do real adversaries -- ransomware groups, state-aligned actors, financially motivated intrusion sets -- actually move through compromised environments, and how does that change the detections you prioritize?
Those are the questions practitioners hit in the second year of any serious SOC role, and they're exactly the ground covered in Meritshot's Cyber Security programme. The curriculum walks learners through live case studies of major breaches -- Target, SolarWinds, MGM, the Snowflake customer compromises -- and into hands-on labs where you write detections, run them against simulated attack telemetry, and tune them under realistic alert volumes. The mentorship comes from engineers who've run SOCs, built detection programmes, and led incident response at scale. If you've reached the end of this article and the question forming in your head is "so how do I actually build this in my organisation?" -- that's the conversation Meritshot is built to continue. Explore the Meritshot Cyber Security programme and take the next step toward building detection capabilities that hold up when it matters.





