A production line reports a sensor overheating at 4:47 AM. The operator checks and sees the temperature rose by 2 degrees within an hour. The old threshold-based system didn’t alert because the threshold was set at an 8-degree deviation. Only in the morning, during shift change, someone notices the pattern started two days earlier, and the machine is already operating at the edge of warranty parameters.
This scenario repeats in manufacturing, logistics, and finance. An anomaly exists in the data for days, but no static threshold catches it because "below the limit" isn’t the same as "within normal."
Static thresholds vs. normality model: the fundamental difference
A static threshold answers the question: "Has the value exceeded a set limit?" An anomaly model answers a different question: "Is this value normal for this context, this time of day, this sequence of prior events?"
The difference is concrete. A 47,000 PLN transaction is a normal transfer for a wholesale business but a suspicious anomaly for an account that has only handled amounts under 3,000 PLN for a year. A 50,000 PLN threshold won’t catch the second case. A contextual model handles both correctly: it passes the first and flags the second.
The statistical approach builds a normality distribution: standard deviation, percentiles, seasonality models. It works without labels and is immediately interpretable. The limitation is the assumption that an anomaly is a quantitative outlier. Many real anomalies are sequential patterns: events that individually fall within normal ranges but together form a signal.
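A minimal sketch of the statistical layer, assuming evenly spaced readings and a rolling window of prior values as the "normal" reference. The function name `zscore_flags`, the 24-point window, and the 3-sigma limit are illustrative assumptions, not values from a specific deployment.

```python
from statistics import mean, stdev

def zscore_flags(values, window=24, z_limit=3.0):
    """Return (index, z) for points that deviate more than z_limit
    standard deviations from the mean of the preceding `window` points."""
    flags = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue  # flat history: z-score is undefined, skip the point
        z = (values[i] - mu) / sigma
        if abs(z) > z_limit:
            flags.append((i, round(z, 2)))
    return flags

# Hypothetical sensor: stable around 70 degrees, then a 2-degree rise.
readings = [70.0 + 0.1 * (i % 5) for i in range(48)] + [72.0]
flags = zscore_flags(readings)
```

In this toy series the 2-degree rise is far outside the recent spread, so it is flagged even though it would stay well under an 8-degree static threshold.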
The ML approach builds a classifier or an unsupervised model (isolation forest, autoencoder, One-Class SVM). A supervised classifier needs labeled past incidents. An unsupervised model learns the normality distribution without labels. In practice, both layers are used together: the statistical layer is the first line, and the ML layer catches patterns that statistics don’t describe.
Cost asymmetry: false alarm vs. missed incident
When designing an anomaly detection system, the first decision isn’t choosing the algorithm but determining what costs more: a false alarm or a missed incident.
A false alarm (false positive) means the system flags an anomaly that isn’t one. Cost: analyst time, alert fatigue (operators stop taking alerts seriously), potential unnecessary process halts. In environments with many alerts, precision is the key metric.
A missed incident (false negative) means an anomaly exists, but the system didn’t detect it. Cost: financial loss, equipment damage, security incident, problem escalation. In environments where the cost of missing an incident is high (financial fraud, production line failure, security breach), recall is the key metric.
This asymmetry translates into threshold selection. A lower alerting threshold (higher sensitivity) means more alerts, higher recall, and lower precision. A higher threshold means fewer alerts, lower recall, and higher precision. The correct threshold doesn’t come from the algorithm but from a business decision about cost proportions: the ratio of the cost of a missed incident to the cost of a false alarm in a given process determines where the threshold should sit. The more a miss costs relative to a false alarm, the lower the threshold.
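One way to turn the cost ratio into a number, sketched under a strong assumption: the model's score is a calibrated probability that the event is a real incident. In that case alerting pays off whenever `p * cost_fn > (1 - p) * cost_fp`, which gives the break-even threshold below. Raw anomaly scores are usually not calibrated, so the function names and the cost figures here are purely illustrative.

```python
def cost_threshold(cost_fn, cost_fp):
    """Break-even probability above which alerting is cheaper than staying silent."""
    return cost_fp / (cost_fp + cost_fn)

def total_cost(scores, labels, threshold, cost_fn, cost_fp):
    """Cost of false alarms plus missed incidents on a labeled sample."""
    cost = 0
    for score, is_incident in zip(scores, labels):
        alerted = score >= threshold
        if alerted and not is_incident:
            cost += cost_fp   # false alarm
        elif not alerted and is_incident:
            cost += cost_fn   # missed incident
    return cost

# Hypothetical costs: a miss is nine times as expensive as a false alarm.
threshold = cost_threshold(cost_fn=9000, cost_fp=1000)

scores = [0.05, 0.2, 0.3, 0.6, 0.9]
labels = [0, 0, 1, 0, 1]
cost_at_break_even = total_cost(scores, labels, threshold, 9000, 1000)
cost_at_half = total_cost(scores, labels, 0.5, 9000, 1000)
```

On this small sample the cost-derived threshold accepts two extra false alarms in exchange for not missing the incident scored at 0.3, which is the cheaper trade under the assumed costs.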
The table below shows typical cost proportion ranges for different environments:
| Environment | False negative cost | False positive cost | Recommended prioritization |
|---|---|---|---|
| Financial fraud | High (loss, regulations) | Medium (blocking a legal transaction) | Recall above 0.85, precision accepted from 0.5 |
| Production line monitoring | High (equipment failure) | Low (preventive stop) | Recall above 0.90, alerts triaged by technician |
| IT security logs | Very high (breach) | Medium (false incident) | Maximum recall, SIEM with triage |
| SaaS operational metrics | Medium (service degradation) | Low (unnecessary escalation) | Balanced F1, alerts with P1/P2 priority |
The numbers in the table are indicative ranges from deployments, not guaranteed results for every organization.
Flag explainability: why is this an anomaly?
A flag without explanation is useless. "The system deemed this transaction an anomaly" isn’t information. "The amount is 4.2 standard deviations above the 90-day history for this contractor, and the time (2:13 AM) hasn’t occurred once in the last 180 days" is information a human can act on.
Explainability in anomaly systems depends on architecture. For statistical models, the explanation is native: which dimension deviates and by how many standard deviations. Isolation forest points to features that shortened the isolation path. SHAP values for gradient boosting models show each feature’s contribution. Autoencoders indicate dimensions with the largest reconstruction error.
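For the statistical case, the native explanation can be sketched as a per-feature z-score against the entity's own history. The `explain` function and the feature names are hypothetical; treating hour of day as a plain number ignores that it wraps around midnight, which a production version would need to handle.

```python
from statistics import mean, stdev

def explain(event, history, z_limit=3.0):
    """Return (feature, z) pairs for features of `event` that deviate
    from `history` by at least z_limit sigmas, largest deviation first."""
    reasons = []
    for name, value in event.items():
        past = [h[name] for h in history]
        sigma = stdev(past)
        if sigma == 0:
            continue  # no variation in history: z-score is undefined
        z = (value - mean(past)) / sigma
        if abs(z) >= z_limit:
            reasons.append((name, round(z, 1)))
    reasons.sort(key=lambda r: abs(r[1]), reverse=True)
    return reasons

# Hypothetical history for one contractor: daytime transfers around 1,000 PLN.
amounts = [1000, 1200, 900, 1100, 1000, 1300, 950, 1050]
hours = [10, 11, 9, 14, 13, 10, 12, 11]
history = [{"amount": a, "hour": h} for a, h in zip(amounts, hours)]

reasons = explain({"amount": 4700, "hour": 2}, history)
for feature, z in reasons:
    print(f"{feature}: {z:+} standard deviations from this contractor's history")
```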
In Cashcrown, every flag reaches the analyst with three elements: the deviation value, the historical pattern considered normal, and similar past events with resolution status. The last element is often the most important: if 6 weeks ago an analyst marked a similar alert as a false alarm with a comment, the new alert loads that context immediately.
Model drift: when "normal" changes
An anomaly model learns normality from historical data. Problem: normality changes over time. A company launches a new sales channel or changes production hours, and the old model treats the new normal pattern as an anomaly, generating a wave of false alarms.
Minimum safeguards against drift: monitor the alert rate weekly (a sudden increase without confirmed incidents signals drift; a sudden drop signals the model isn’t detecting new patterns); check recall on a golden test set quarterly (a drop of more than 10 percentage points signals that retraining is needed); and use an incremental training window in rapidly changing environments. The observability architecture for AI systems is discussed in the article monitoring AI agent quality.
Every retraining decision requires human approval. Retraining changes what the system considers normal, thus altering what it escalates to the analyst.
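The alert-rate and golden-set checks can be sketched as two small functions. The 10-percentage-point limit comes from the text above; the factor of 2 for a "sudden" change, the four-week baseline, and the function names are assumptions made for the example. Both functions only report a signal; the retraining decision itself stays with a human.

```python
from statistics import mean

def alert_rate_signal(weekly_alerts, confirmed_latest, ratio=2.0, baseline_weeks=4):
    """Compare the latest week's alert count with the mean of the weeks before it."""
    baseline = mean(weekly_alerts[-(baseline_weeks + 1):-1])
    latest = weekly_alerts[-1]
    if baseline == 0:
        return "no_baseline"
    if latest >= ratio * baseline and confirmed_latest == 0:
        return "possible_drift"       # alert wave without confirmed incidents
    if latest <= baseline / ratio:
        return "possible_blind_spot"  # sudden drop: new patterns may be missed
    return "stable"

def recall_drop_exceeded(baseline_recall, current_recall, max_drop_pp=10):
    """True when golden-set recall fell by more than max_drop_pp percentage points."""
    return (baseline_recall - current_recall) * 100 > max_drop_pp

signal = alert_rate_signal([40, 42, 38, 40, 95], confirmed_latest=0)
propose_retraining = recall_drop_exceeded(0.92, 0.80)
```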
Human gate: humans always decide
An anomaly is a signal for investigation, not an action command. The boundary between automation and human decision must be defined at the process level, not just technically.
Reversible actions can be automated: flagging events in the system, lowering priority, enforcing additional verification at the next process step, notifying via an alerts channel. An analyst can reverse any of these actions in seconds.
Irreversible or costly actions require human oversight: stopping a production line, blocking an account, rejecting a transaction, escalating to a regulator, imposing a penalty, replacing a component. Before any of these actions, the system must include a human approval step with a mandatory justification.
In Cashcrown, we implement this division as an HMAC-signed approval token: the system generates a proposal with a cryptographic signature, the analyst approves with a comment, and the log includes who, when, and with what justification. This is the audit trail required by the AI Act for high-risk systems. Every incident confirmed or rejected by an analyst feeds back into the knowledge base, so the system learns environment-specific patterns, not generic ones.
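The general idea of an HMAC-signed approval can be sketched with Python's standard `hmac` module. This is an illustration of the pattern, not the Cashcrown implementation: the key handling, field names, and the `sign_proposal` and `approve` functions are all assumptions, and a real system would load the key from a secrets manager.

```python
import hashlib
import hmac
import json

SECRET_KEY = b"demo-key-change-me"  # hypothetical key, for illustration only

def sign_proposal(proposal, key=SECRET_KEY):
    """Sign a canonical JSON form of the proposed action."""
    payload = json.dumps(proposal, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def approve(proposal, token, analyst, justification, approved_at, key=SECRET_KEY):
    """Return an audit-log entry, or raise if the proposal was altered
    after signing or the justification is missing."""
    if not hmac.compare_digest(token, sign_proposal(proposal, key)):
        raise ValueError("token does not match the proposal")
    if not justification.strip():
        raise ValueError("justification is mandatory")
    return {
        "proposal": proposal,
        "token": token,
        "analyst": analyst,              # who
        "approved_at": approved_at,      # when
        "justification": justification,  # with what justification
    }

proposal = {"event_id": "evt-1042", "action": "stop_line", "line": "L3"}
token = sign_proposal(proposal)
entry = approve(proposal, token, "a.nowak",
                "Temperature trend confirmed on site", "2024-05-01T05:10:00Z")
```

Because the token is computed over the proposal contents, an approval cannot be silently reused for a different action: changing any field invalidates the signature.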
The same pattern is applied to financial decisions in the article AI for fraud detection. The data architecture is described in the article data governance for AI.
Integration with existing operational data
Data quality determines result quality more than algorithm choice. Missing measurements, inconsistent timestamps, and devices that report irregularly distort the normality distribution. Before the first training, we check: the percentage of missing values per sensor or account, timestamp consistency, and physically impossible values (negative temperature readings, repeating identical sequences indicating a stuck sensor).
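These pre-training checks can be sketched in a few lines, assuming each sensor's readings arrive as a list with `None` for missing values and timestamps as sortable numbers. The function names and the minimum run length of 5 identical readings are illustrative assumptions.

```python
def missing_ratio(values):
    """Fraction of readings that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def stuck_runs(values, min_run=5):
    """Return (start_index, length) for runs of identical consecutive
    readings, which may indicate a stuck sensor."""
    runs, start = [], 0
    for i in range(1, len(values) + 1):
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_run and values[start] is not None:
                runs.append((start, i - start))
            start = i
    return runs

def timestamp_issues(timestamps):
    """Indices where a timestamp is not strictly later than the previous one."""
    return [i for i in range(1, len(timestamps))
            if timestamps[i] <= timestamps[i - 1]]

values = [20.1, 20.3, None, 21.0, 21.0, 21.0, 21.0, 21.0, 20.8, None]
timestamps = [0, 60, 120, 120, 90, 180]

report = {
    "missing_ratio": missing_ratio(values),
    "stuck_runs": stuck_runs(values),
    "timestamp_issues": timestamp_issues(timestamps),
}
```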
Structured output from the anomaly model should include: event identifier, anomaly score (float from 0 to 1), features that determined the result with their weights, the historical window used for calibration, and a field for analyst comments. Standardizing this format allows integrating results from different models in a single analytical interface. Preparing input data is described in the article AI for data analysis and BI, and GDPR (RODO) compliance aspects are covered in the article AI for controlling and finance.
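The output fields listed above map naturally onto a small record type. The class name `AnomalyRecord`, the field names, and the choice of ISO date strings for the calibration window are assumptions made for this sketch.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Tuple

@dataclass
class AnomalyRecord:
    event_id: str
    score: float                         # anomaly score, expected in [0, 1]
    features: Dict[str, float]           # feature name -> weight in the result
    calibration_window: Tuple[str, str]  # (start, end) of the history used
    analyst_comment: str = ""            # filled in during triage

    def __post_init__(self):
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be between 0 and 1")

    def to_json(self):
        return json.dumps(asdict(self))

record = AnomalyRecord(
    event_id="evt-1042",
    score=0.91,
    features={"amount": 0.7, "hour": 0.3},
    calibration_window=("2024-01-01", "2024-03-31"),
)
serialized = record.to_json()
```

Validating the score range at construction time keeps results from different models comparable once they land in the same interface.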
FAQ
Does AI for anomaly detection work without historical data?
A statistical model and threshold rules can be deployed from day one: collect data for 2-4 weeks, build a baseline distribution, and start flagging deviations. An ML model needs history. If you have labeled incidents (downtime dates, service reports, confirmed fraud cases), you can use them as weak labels for the first training. Without labels, start in unsupervised mode: isolation forest or autoencoder on historical data, with manual verification of the first alerts by an analyst over 4-8 weeks.
How to distinguish an anomaly from a seasonal change in normality?
The model must account for seasonality as part of the normality distribution. For daily data, this means seasonal adjustments for days of the week and months. Prophet and ARIMA have built-in seasonality components. For ML models, the "day of the week" feature should be an input; otherwise, every Monday sales increase becomes an anomaly in an environment where Sundays have low traffic. The analyst should be able to mark an event as a "structural change," which updates the system’s normality baseline.
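The simplest form of a seasonal adjustment is a separate baseline per day of the week. The sketch below assumes daily values with dates and is far cruder than the Prophet or ARIMA seasonality components mentioned above; the function names and the synthetic sales figures are invented for the example.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean, stdev

def weekday_baseline(series):
    """series: list of (date, value). Returns {weekday: (mean, stdev)}."""
    buckets = defaultdict(list)
    for day, value in series:
        buckets[day.weekday()].append(value)
    return {wd: (mean(v), stdev(v)) for wd, v in buckets.items() if len(v) >= 2}

def seasonal_z(day, value, baseline):
    """Z-score of `value` against the baseline for that day of the week."""
    mu, sigma = baseline[day.weekday()]
    return (value - mu) / sigma if sigma else 0.0

# Eight weeks of synthetic sales: about 300 on most days, about 100 on Sundays.
start = date(2024, 1, 1)  # a Monday
series = []
for i in range(56):
    day = start + timedelta(days=i)
    level = 100 if day.weekday() == 6 else 300
    series.append((day, level + (i % 3) * 5))

baseline = weekday_baseline(series)
monday_z = seasonal_z(start + timedelta(days=56), 305, baseline)
sunday_z = seasonal_z(start + timedelta(days=62), 300, baseline)
```

A value near 300 is unremarkable on a Monday but a large deviation on a Sunday, which is the distinction a single global baseline cannot make.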
Is an anomaly detection system subject to the AI Act?
It depends on the context. A system monitoring technical data (sensors, IT logs) is typically low-risk. A system evaluating financial or credit behaviors of individuals is listed in Annex III of the AI Act as high-risk, requiring a DPIA, technical documentation, decision explainability, and active post-deployment monitoring. In both cases, the analyst’s decision log (who approved what action, when, with what justification) is a best practice regardless of risk classification.
How many alerts per day can one analyst handle?
This is a question to ask before deployment, not after. The practical limit for an analyst working with anomaly alerts is 20-50 alerts per day with full investigation. For larger alert volumes, automatic prioritization is needed: alerts with an anomaly score above 0.85 and a combination of multiple signals get P1 priority; the rest get P2 or P3 with a longer response window. If the system generates 500 alerts per day with a 30-alert capacity, the problem isn’t the algorithm but the sensitivity threshold calibration.
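The prioritization rule can be sketched as follows. The 0.85 score cutoff and the 30-alert capacity come from the text; reading "multiple signals" as two or more, and the 0.6 cutoff separating P2 from P3, are assumptions made for the example.

```python
def priority(score, signal_count):
    """Assign a priority from the anomaly score and the number of independent signals."""
    if score > 0.85 and signal_count >= 2:
        return "P1"
    if score > 0.6:  # hypothetical P2/P3 boundary
        return "P2"
    return "P3"

def split_queue(alerts, capacity=30):
    """Rank alerts by priority, then by score, and split at the daily capacity."""
    ranked = sorted(
        alerts,
        key=lambda a: (priority(a["score"], a["signals"]), -a["score"]),
    )
    return ranked[:capacity], ranked[capacity:]

alerts = [
    {"id": "a", "score": 0.90, "signals": 3},
    {"id": "b", "score": 0.95, "signals": 1},
    {"id": "c", "score": 0.40, "signals": 1},
]
today, backlog = split_queue(alerts, capacity=2)
```

Note that alert `b` has the highest score but only one signal, so it ranks below `a`. If the backlog grows every day, that points back at threshold calibration rather than at the ranking.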
How to validate that the system actually detects anomalies?
A golden test set of confirmed historical incidents is the only reliable method. Run the model on events analysts confirmed as real incidents and check recall: how many it would have detected before the actual resolution. If you lack historical labels, the first 60-90 days are a shadow mode phase: the system logs alerts, analysts verify them, and you build the set from scratch. Without this set, there’s no reliable answer to the effectiveness question. The evaluation pattern is described in the article monitoring AI agent quality.
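The recall check on a golden set can be sketched as below, counting an incident as detected only if the model's first alert precedes the actual resolution. The function name and the dictionary layout are assumptions made for the example.

```python
from datetime import datetime

def golden_set_recall(resolved_at, first_alert_at):
    """Share of confirmed incidents the model flagged before they were resolved.

    resolved_at:    {incident_id: datetime the incident was actually resolved}
    first_alert_at: {incident_id: datetime of the model's first alert};
                    incidents the model never flagged are simply absent
    """
    if not resolved_at:
        raise ValueError("golden set is empty")
    detected = sum(
        1
        for incident_id, resolved in resolved_at.items()
        if incident_id in first_alert_at and first_alert_at[incident_id] < resolved
    )
    return detected / len(resolved_at)

resolved_at = {
    "inc-1": datetime(2024, 3, 1, 12, 0),
    "inc-2": datetime(2024, 3, 5, 9, 0),
    "inc-3": datetime(2024, 3, 9, 18, 0),
    "inc-4": datetime(2024, 3, 12, 7, 0),
}
first_alert_at = {
    "inc-1": datetime(2024, 2, 28, 4, 0),   # flagged two days early
    "inc-2": datetime(2024, 3, 5, 8, 30),   # flagged just in time
    "inc-3": datetime(2024, 3, 10, 6, 0),   # flagged after resolution: too late
}
recall = golden_set_recall(resolved_at, first_alert_at)
```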