Detect Anomalies at Scale with Random Cut Forest in Amazon SageMaker

Unexpected spikes in transactions, sudden drops in system metrics, or subtle shifts in user behavior: anomalies can take many forms, and they often highlight the events that matter most to a business. From fraud detection to infrastructure monitoring, from identifying unusual patterns in time series data to spotting operational irregularities, detecting outliers early can prevent losses, reduce risk and uncover hidden opportunities.

Anomalies don’t just matter operationally - handling them correctly is a critical step in almost any machine learning workflow. Training predictive models on datasets that include extreme or unusual points can mislead the model, causing it to overfit to rare events. This can result in unnecessarily complex models that struggle to generalize well to typical behavior. By identifying and handling anomalies early, teams can decide whether to remove them, investigate them or treat them separately, leading to cleaner data and more reliable models.

To address this challenge, AWS developed Random Cut Forest (RCF), an unsupervised algorithm purpose-built for anomaly detection. RCF is available as a built-in algorithm in Amazon SageMaker and is integrated into multiple AWS services, allowing teams to detect anomalies at scale while maintaining the flexibility to use the data for downstream analytics and predictive modeling.

Understanding Random Cut Forest for Anomaly Detection

RCF is an unsupervised algorithm used to detect anomalous data points. Because it is unsupervised, it does not require labeled training data. Instead, it learns the structure and patterns of “normal” data and assigns an anomaly score to every observation. Higher scores indicate that a point is more unusual compared to the rest of the dataset.

RCF is commonly used for detecting unexpected spikes in time series data, sudden changes in periodic patterns or isolated data points in multi-dimensional datasets. It works with arbitrary dimensional input and scales well in terms of feature count, dataset size and the number of instances [1].

Conceptually, the algorithm builds a forest of decision trees with each tree trained on a random sample of the data. These samples are obtained using an efficient sampling technique that allows the algorithm to work even when the dataset is large or streamed incrementally. Each sample is then partitioned into multiple subsets, one for each tree in the forest.

Inside each tree, the data is recursively split using randomly chosen dimensions and randomly selected cut points. You can imagine this as drawing random hyperplanes through the data space. Over time, this process isolates individual data points into smaller and smaller bounding boxes until each leaf node represents a single observation.

The intuition behind anomaly scoring is simple. If a data point is very different from the rest, it is more likely to be isolated early in the partitioning process. In other words, it will appear at a shallower depth in the tree. Points that are similar to many others require more cuts to be isolated and therefore end up deeper in the tree. The anomaly score is related to how much the complexity of the tree structure would change if the point were inserted - for example, whether it would require creating a new branch. In practice, this is closely tied to the depth at which the point is placed. The final anomaly score is computed as the average score across all trees in the forest [2].

The averaging across many trees reduces variance and makes the scores more stable.
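This intuition can be sketched in a few lines of plain Python. The toy one-dimensional example below is an illustration of random-cut isolation, not the actual RCF implementation: an obvious outlier is isolated after far fewer random cuts than a typical point.

```python
import random

def isolation_depth(point, data, depth=0):
    """Count random cuts needed before `point` is alone in its partition."""
    if len(data) <= 1:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    cut = random.uniform(lo, hi)
    # Keep only the points that fall on the same side of the cut as `point`
    same_side = [x for x in data if (x <= cut) == (point <= cut)]
    return isolation_depth(point, same_side, depth + 1)

def average_depth(point, data, trees=100):
    """Average the isolation depth over many random trees, as the forest does."""
    return sum(isolation_depth(point, data) for _ in range(trees)) / trees

random.seed(42)
normal = [random.gauss(50, 2) for _ in range(200)]
data = normal + [200.0]  # one obvious outlier

print(average_depth(200.0, data))     # shallow: isolated after a few cuts
print(average_depth(normal[0], data)) # deep: many more cuts needed
```

Averaging over many trees is what turns a noisy per-tree depth into a stable score, which is exactly why `num_trees` matters in the next section.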

Choosing the Right Hyperparameters

Although Random Cut Forest is conceptually simple, a few hyperparameters significantly influence its behavior.

The first key parameter is num_trees, which defines how many trees are built in the forest. Increasing the number of trees reduces noise in the anomaly scores because the final score is an average across more independent models. A common starting point is around 100 trees. However, increasing this number also increases inference time, since each tree contributes to the final score.

The second important parameter is num_samples_per_tree. This determines how many data points are randomly sampled for each tree. A useful rule of thumb is that the inverse of this number should roughly match the expected proportion of anomalies in your dataset. For example, if you believe that about 0.5% of your data is anomalous, choosing around 200 samples per tree would align with that assumption. This parameter controls how sensitive the model is to rare events.
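That rule of thumb is simple enough to encode directly. The helper below is hypothetical (not part of any SDK) and just inverts the expected anomaly fraction:

```python
def suggest_samples_per_tree(expected_anomaly_fraction):
    """Rule of thumb: 1 / num_samples_per_tree ~ expected anomaly fraction."""
    if not 0 < expected_anomaly_fraction < 1:
        raise ValueError("expected_anomaly_fraction must be in (0, 1)")
    return round(1 / expected_anomaly_fraction)

print(suggest_samples_per_tree(0.005))  # 0.5% anomalies -> 200 samples per tree
print(suggest_samples_per_tree(0.01))   # 1% anomalies   -> 100 samples per tree
```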

There is only a single required parameter - feature_dim, which specifies the dimensionality of each data point [3]. When you train through the SageMaker Python SDK, this value is inferred automatically from the shape of the training data you pass in.

Because RCF is relatively lightweight and does not rely on deep neural networks, it runs efficiently on CPU instances. It does not take advantage of GPU hardware, so both training and inference are performed on CPU-based instance types such as the ml.m or ml.c families.

Training a Random Cut Forest Model in SageMaker

To train an RCF model in Amazon SageMaker, you first prepare your input data. The algorithm supports both CSV and RecordIO-Protobuf formats. RecordIO-Protobuf is a compact binary format optimized for performance and is often preferred for large-scale training because it is faster to read than text-based formats, while CSV is convenient and human-readable, especially during development [4].

You can provide input data through File mode or Pipe mode. In File mode, the entire dataset is downloaded to the training instance before training starts. In Pipe mode, data is streamed directly from Amazon S3 to the training container. Pipe mode can reduce startup time and storage requirements, making it a good optimization for large datasets [5].
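As a sketch, Pipe mode can be requested through the SDK's TrainingInput when pointing a built-in algorithm at data already staged in S3. The bucket path below is a placeholder:

```python
from sagemaker.inputs import TrainingInput

# Placeholder S3 location; RecordIO-Protobuf pairs naturally with Pipe mode
train_input = TrainingInput(
    "s3://your-bucket/train/",
    content_type="application/x-recordio-protobuf",
    input_mode="Pipe",  # stream records from S3 instead of downloading the dataset
)
# estimator.fit({"train": train_input})
```

This is a configuration fragment rather than a runnable job; the commented-out fit call assumes an estimator configured as in the training example below.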

Although RCF is unsupervised, SageMaker allows you to provide an optional test channel with labeled data. These labels indicate whether each point is truly anomalous or not. While the model does not use these labels for training, they are used to compute evaluation metrics such as precision, recall and F1-score. This is particularly useful when you have historical knowledge of known anomalies and want to validate the model’s performance.
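A labeled test channel can be supplied through the SDK's record_set by passing labels and a channel name. This sketch assumes an `rcf` estimator and `training_data` array like those in the training example below; the labels here are synthetic stand-ins for known anomalies:

```python
import numpy as np

test_features = np.random.rand(100, 1).astype("float32")
test_labels = np.zeros(100, dtype="float32")
test_labels[::25] = 1.0  # pretend every 25th point is a known anomaly

train_set = rcf.record_set(training_data.astype("float32"))
test_set = rcf.record_set(test_features, labels=test_labels, channel="test")
rcf.fit([train_set, test_set])
```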

Below is a simplified example of how you might configure and train an RCF model using the SageMaker Python SDK [6]:

import numpy as np
import sagemaker
from sagemaker import RandomCutForest

session = sagemaker.Session()
role = "your-execution-role"

# Synthetic one-dimensional training data for illustration
training_data = np.random.rand(1000, 1).astype("float32")

rcf = RandomCutForest(
    role=role,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    num_trees=100,
    num_samples_per_tree=256,
    output_path="s3://your-bucket/output/",
)

# feature_dim is inferred from the shape of the record set
record_set = rcf.record_set(training_data)
rcf.fit(record_set)

Once training is complete, the model artifacts are stored in Amazon S3. You can monitor job status and logs directly from the SageMaker console.

Automatic Hyperparameter Tuning

SageMaker also supports automatic model tuning for RCF. In this setup, you define a range of values for tunable hyperparameters such as num_trees and num_samples_per_tree and select an objective metric to optimize, such as F1-score.
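A tuning job along these lines might be configured as follows. This is a sketch: the ranges and job counts are illustrative, and `rcf` is assumed to be the estimator from the training example:

```python
from sagemaker.tuner import HyperparameterTuner, IntegerParameter

# Illustrative ranges; `rcf` is the RandomCutForest estimator defined earlier
tuner = HyperparameterTuner(
    estimator=rcf,
    objective_metric_name="test:f1",  # F1 computed on the labeled test channel
    hyperparameter_ranges={
        "num_trees": IntegerParameter(50, 200),
        "num_samples_per_tree": IntegerParameter(64, 512),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)
# tuner.fit([train_set, test_set])  # requires a labeled test channel
```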

For RCF, hyperparameter tuning requires a labeled test dataset. During tuning, the algorithm computes anomaly scores for the test data and labels points as anomalous if their scores exceed a threshold based on the three-sigma rule, meaning three standard deviations above the mean anomaly score. The F1-score is then calculated by comparing these predicted labels with the actual labels in the test set. The tuning job searches for the hyperparameter combination that maximizes this F1-score [7].

This approach works well when the three-sigma heuristic matches the actual distribution of anomalies in your data. As with any heuristic-based thresholding, it is important to validate that assumption against real-world behavior.

Deploying and Performing Inference

After training, the model can be deployed as a real-time inference endpoint. SageMaker handles provisioning and scaling of the endpoint on the specified instance type.

predictor = rcf.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.large",
)

For inference, you send new data points to the endpoint and receive anomaly scores in return. As mentioned previously, SageMaker supports multiple data formats such as CSV and RecordIO-Protobuf. Using the SDK’s built-in serializers and deserializers simplifies this process [6]:

import numpy as np
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Mostly steady values around 50, plus one clear outlier
sample_data = np.array([[50], [52], [49], [51], [50], [53], [200]])

predictor.serializer = CSVSerializer()
predictor.deserializer = JSONDeserializer()

results = predictor.predict(sample_data)
scores = [record["score"] for record in results["scores"]]

print("Input Data:", sample_data.flatten())
print("Anomaly Scores:", scores)

The response includes anomaly scores for each input record. Higher scores indicate a higher likelihood of being anomalous.

In many time series use cases, it is common to compute anomaly scores for all observations and then apply a threshold, such as the three-sigma rule, to flag significant outliers. Visualizing both the original metric and the anomaly scores together often reveals that spikes in the score align with real-world events, such as traffic surges, system outages or unusual business activity.
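Applied to a batch of scores, the three-sigma rule is a one-liner. The scores below are synthetic stand-ins for endpoint output, with one clear outlier among values drawn around 1.0:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 "normal" scores around 1.0 plus one outlying score of 5.0
scores = np.concatenate([rng.normal(1.0, 0.1, 200), [5.0]])

# Flag points whose score exceeds mean + 3 * std
threshold = scores.mean() + 3 * scores.std()
anomalies = np.where(scores > threshold)[0]

print("threshold:", threshold)
print("anomalous indices:", anomalies)
```

Note that a handful of extreme scores inflates both the mean and the standard deviation, so on small batches the three-sigma cutoff can be less sensitive than expected; this is part of what the validation step above should check.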

Where Random Cut Forest Fits in Standalone AWS Services

Beyond SageMaker, RCF powers anomaly detection in multiple AWS services, each tailored to the type of data and use case. 

For example, Amazon OpenSearch Service uses RCF to detect anomalies in near-real-time streams of search and log data. Each new data point is scored with an anomaly grade and a confidence value, allowing the service to distinguish unusual events from normal variations automatically [8].

Similarly, Amazon Managed Service for Prometheus leverages RCF to monitor time series metrics. The algorithm models normal behavior patterns in your data, adapts to seasonal trends and flags deviations with a confidence score. This helps reduce false alerts and highlights the anomalies that truly require attention [9].

Even in analytics-focused services like Amazon QuickSight, RCF is used to assign anomaly scores to individual data points. As QuickSight samples and visualizes your dataset, the algorithm evaluates how much each point deviates from expected patterns, giving higher scores to outliers. This enables analysts to spot unusual trends and investigate them directly within dashboards [10].

Because RCF is lightweight and CPU-based, it is practical for a wide range of applications, from streaming metrics to interactive dashboards. Its integration across AWS services allows teams to detect anomalies consistently, without needing to move data out of the platform or build custom models for each use case.

What This Means for You

Random Cut Forest gives you a powerful, scalable way to detect anomalies without needing labeled data or complex models. This makes it ideal for applications like monitoring time series metrics, spotting fraudulent activity or uncovering rare patterns in high-dimensional datasets.

With Amazon SageMaker, you can train, tune and deploy RCF models using fully managed infrastructure. Built-in scaling, integrated monitoring and optional hyperparameter tuning mean you can go from experimentation to production without managing servers or writing complex code. Whether you’re exploring anomaly detection for the first time or rolling it out across critical business systems, RCF provides a simple, reliable foundation.

Anomalies often contain the signals that matter most. With the right combination of algorithm and platform, you can detect them early, surface actionable insights and respond with confidence: turning unusual patterns into real opportunities.

Want to Learn More?

We’re just getting started with anomaly detection using Random Cut Forest, and every organization’s data challenges are unique. If you want to understand how RCF can unlock insights in your own environment, the best next step is to talk with our AWS experts. They can walk you through your datasets, help identify where anomalies matter most and design a tailored approach to integrate RCF into your monitoring systems, dashboards or production pipelines.

Our team can guide you through choosing the right model configuration, setting up training and inference workflows, and ensuring the outputs align with your business goals. Whether it’s reducing false alerts, spotting early signs of fraud or uncovering hidden trends in high-dimensional data, we help turn anomaly detection from a concept into actionable results.

References

[1] “Random Cut Forest (RCF) Algorithm”, AWS Docs,

https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html

[2] “How RCF Works”, AWS Docs,

https://docs.aws.amazon.com/sagemaker/latest/dg/rcf_how-it-works.html

[3] “RCF Hyperparameters”, AWS Docs,

https://docs.aws.amazon.com/sagemaker/latest/dg/rcf_hyperparameters.html

[4] Manasa, “Differences between LibSVM and RecordIO/Protobuf in the context of Machine Learning”, Medium, 10 Aug 2023

https://medium.com/@mansi89mahi/differences-between-libsvm-and-recordio-protobuf-in-the-context-of-machine-learning-99793a4850c2

[5] Can Balioglu, David Arpin, and Ishaaq Chandy, “Using Pipe input mode for Amazon SageMaker algorithms”, AWS Blogs, 23 May 2018,

https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/

[6] “An Introduction to SageMaker Random Cut Forests”, Amazon SageMaker Examples,

https://sagemaker-examples.readthedocs.io/en/latest/index.html

[7] “Tune an RCF Model”, AWS Docs,

https://docs.aws.amazon.com/sagemaker/latest/dg/random-cut-forest-tuning.html

[8] “Anomaly detection in Amazon OpenSearch Service”, AWS Docs,

https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ad.html

[9] “Anomaly detection - Amazon Managed Service for Prometheus”, AWS Docs,

https://docs.aws.amazon.com/prometheus/latest/userguide/prometheus-anomaly-detection.html

[10] “How RCF is applied to detect anomalies - Amazon Quick”, AWS Docs,

https://docs.aws.amazon.com/quick/latest/userguide/how-does-rcf-detect-anomalies.html

Ivan Dimitrov
March 30, 2026
