SMDebug Rules

Rules are the mechanism by which SageMaker Debugger runs a piece of code regularly at different steps of a training job. A rule is assigned to a trial and can be invoked at each new step of the trial. It can also access other trials for its evaluation. You can evaluate a rule using tensors from the current step or any step before the current step. Make sure your logic respects these semantics; otherwise you will get a TensorUnavailableForStep exception, because the data for future steps is not yet available.

Use Built-in Rules Officially Provided by SageMaker

Amazon SageMaker Debugger rules analyze tensors emitted during the training of a model. Debugger offers the Rule API operation, which monitors the progress of a training job and detects errors that affect the training of your model. For example, rules can detect whether gradients are getting too large or too small, whether a model is overfitting or overtraining, and whether a training job fails to decrease the loss function and improve. For the full list of available built-in rules, see List of Debugger Built-in Rules.

Write Custom Rules Within or Outside SageMaker

Writing a rule involves implementing the Rule APIs. Below, let’s start with a simplified version of a custom VanishingGradient rule.

Step 1: Construct a Rule Class

Creating a rule starts with inheriting from the base Rule class provided by smdebug. This example rule does not need to look at any other trials, so we set other_trials to None.

from smdebug.rules import Rule

class VanishingGradientRule(Rule):
    def __init__(self, base_trial, threshold=0.0000001):
        super().__init__(base_trial, other_trials=None)
        self.threshold = float(threshold)

Please note that apart from base_trial and other_trials (if required), all arguments of the rule constructor must accept a string as their value. You can parse each string into the type that you want. For example, to pass a list of strings, pass them as a single comma-separated string. This restriction is enforced so that you can create and invoke rules from JSON using SageMaker’s APIs.
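
As a minimal sketch of this string-parsing convention, the helpers below convert string parameters into a float and a list. The helper names and the tensor_names parameter are hypothetical and not part of smdebug; they only illustrate one way to parse values inside a rule constructor.

```python
def parse_threshold(value, default=1e-7):
    """Convert a string rule parameter to float, falling back to a default."""
    if value is None or value == "":
        return default
    return float(value)


def parse_name_list(value):
    """Split a comma-separated string into a list of stripped, non-empty names."""
    if not value:
        return []
    return [name.strip() for name in value.split(",") if name.strip()]


# Parameters arrive as strings when the rule is configured through JSON.
# "tensor_names" is a hypothetical parameter used only for illustration.
params = {"threshold": "20.0", "tensor_names": "dense/kernel, dense/bias"}
threshold = parse_threshold(params["threshold"])
tensor_names = parse_name_list(params["tensor_names"])
```

Inside a rule's __init__, the same conversions would be applied to each argument before it is stored on self.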

Step 2: Create a Function to Invoke at a Step

In this function you implement the core logic of what you want to do with these tensors. It should return a boolean value, True or False, where True means the rule evaluation condition has been met. When you invoke these rules through SageMaker, the rule evaluation ends when the rule evaluation condition is met. SageMaker creates a CloudWatch event for every rule evaluation job, which you can use to define actions to take based on the state of the rule.

A simplified version of the actual invoke function for VanishingGradientRule is below:

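def invoke_at_step(self, step):
    # Check every tensor in the 'gradients' collection at this step.
    for tensorname in self.base_trial.tensors(collection='gradients'):
        tensor = self.base_trial.tensor(tensorname)
        abs_mean = tensor.reduction_value(step, 'mean', abs=True)
        if abs_mean < self.threshold:
            # Condition met: at least one gradient has vanished.
            return True
    # No gradient fell below the threshold at this step.
    return False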

That’s it, writing a rule is as simple as that.
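
To try this logic without a training job, the sketch below runs the same check against in-memory stand-ins. FakeTensor, FakeTrial, and vanishing_gradient_at_step are hypothetical names invented for this illustration; they imitate only the three calls used above (tensors, tensor, reduction_value) and are not smdebug classes.

```python
import math


class FakeTensor:
    """Hypothetical stand-in for a trial tensor; holds raw values per step."""

    def __init__(self, values_by_step):
        self.values_by_step = values_by_step

    def reduction_value(self, step, reduction_name, abs=False):
        if reduction_name != "mean":
            raise ValueError("only 'mean' is supported in this sketch")
        values = self.values_by_step[step]
        if abs:  # the parameter shadows the builtin, so use math.fabs
            values = [math.fabs(v) for v in values]
        return sum(values) / len(values)


class FakeTrial:
    """Hypothetical stand-in for a trial; maps gradient names to tensors."""

    def __init__(self, gradients):
        self._gradients = gradients

    def tensors(self, collection=None):
        return list(self._gradients)

    def tensor(self, name):
        return self._gradients[name]


def vanishing_gradient_at_step(trial, step, threshold):
    """Same check as invoke_at_step, written as a plain function."""
    for name in trial.tensors(collection="gradients"):
        abs_mean = trial.tensor(name).reduction_value(step, "mean", abs=True)
        if abs_mean < threshold:
            return True
    return False


trial = FakeTrial({
    "grad/w": FakeTensor({0: [0.5, -0.3], 1: [1e-9, -1e-9]}),
    "grad/b": FakeTensor({0: [0.2, 0.1], 1: [0.2, 0.1]}),
})
results = [vanishing_gradient_at_step(trial, step, threshold=1e-7) for step in (0, 1)]
```

Here the condition is not met at step 0 and is met at step 1, where the mean absolute value of grad/w drops below the threshold.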

Step 3: Invoke the Rule

Option 1: Invoking a rule through SageMaker

After you’ve written your rule, you can ask SageMaker to evaluate the rule against your training job by using the SageMaker Python SDK:

estimator = Estimator(
    rules=Rules.custom(
        instance_type='ml.t3.medium', # instance type to run the rule evaluation on
        source='rules/', # path to the rule source file
        rule_to_invoke='VanishingGradientRule', # name of the class to invoke in the rule source file
        volume_size_in_gb=30, # EBS volume size required to be attached to the rule evaluation instance
        collections_to_save=[CollectionConfig("gradients")], # collections to be analyzed by the rule
        rule_parameters={
            "threshold": "20.0" # this will be used to initialize 'threshold' param in your rule constructor
        }
    )
)

If you’re using the SageMaker API directly to evaluate the rule, you can specify the rule configuration DebugRuleConfigurations in the CreateTrainingJob API request as:

"DebugRuleConfigurations": [
    {
        "RuleConfigurationName": "VGRule",
        "InstanceType": "ml.t3.medium",
        "VolumeSizeInGB": 30,
        "RuleEvaluatorImage": "",
        "RuleParameters": {
            "source_s3_uri": "s3://path/to/",
            "rule_to_invoke": "VanishingGradientRule",
            "threshold": "20.0"
        }
    }
]

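If you build the request in Python, a configuration like the one above can be assembled as a dictionary and serialized with the standard json module. The sketch below assumes a hypothetical helper, build_rule_configuration, which is not part of any SageMaker SDK; the image and S3 path are left as the placeholders used above.

```python
import json


def build_rule_configuration(name, rule_class, source_s3_uri, evaluator_image, **params):
    """Assemble one DebugRuleConfigurations entry (hypothetical helper)."""
    rule_parameters = {
        "source_s3_uri": source_s3_uri,
        "rule_to_invoke": rule_class,
    }
    # Every rule constructor argument is passed as a string.
    rule_parameters.update({key: str(value) for key, value in params.items()})
    return {
        "RuleConfigurationName": name,
        "InstanceType": "ml.t3.medium",
        "VolumeSizeInGB": 30,
        "RuleEvaluatorImage": evaluator_image,
        "RuleParameters": rule_parameters,
    }


config = build_rule_configuration(
    name="VGRule",
    rule_class="VanishingGradientRule",
    source_s3_uri="s3://path/to/",  # placeholder path, as in the example above
    evaluator_image="",             # left empty, as in the example above
    threshold=20.0,
)
request_fragment = json.dumps({"DebugRuleConfigurations": [config]}, indent=4)
```

Converting each value with str keeps the configuration consistent with the requirement that rule constructor arguments arrive as strings.
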
Option 2: Invoking a rule outside SageMaker through invoke_rule

You might want to invoke the rule locally during development. We provide a function to invoke rules easily. Refer to smdebug/rules/. The invoke function has the following syntax. It takes an instance of a Rule and invokes it for a series of steps, one after the other.

from smdebug.rules import invoke_rule
from smdebug.trials import create_trial

trial = create_trial('s3://smdebug-dev-test/mnist-job/')
rule_obj = VanishingGradientRule(trial, threshold=0.0001)
invoke_rule(rule_obj, start_step=0, end_step=None)

Rule API

class smdebug.rules.Rule(base_trial, action_str, other_trials=None)

Bases: abc.ABC

The base class for creating a rule evaluator. You can construct a rule class by subclassing it and adding thresholds and criteria to its __init__ function.

Example of a Rule class

from smdebug.rules import Rule

class VanishingGradientRule(Rule):
    def __init__(self, base_trial, threshold=0.0000001):
        super().__init__(base_trial, other_trials=None)
        self.threshold = float(threshold)

    def invoke_at_step(self, step):
        for tensorname in self.base_trial.tensors(collection='gradients'):
            tensor = self.base_trial.tensor(tensorname)
            abs_mean = tensor.reduction_value(step, 'mean', abs=True)
            if abs_mean < self.threshold:
                return True
        return False

abstract invoke_at_step(step)

The abstract method in which you implement the rule's invocation logic against output tensors.

smdebug.rules.invoke_rule(rule_obj, start_step=0, end_step=None, raise_eval_cond=False)

The function that invokes a rule, defined as a subclass of Rule, over a series of steps.

  • rule_obj (Rule) – An instance of a subclass of Rule that you want to invoke.

  • start_step (int) – Global step number to start invoking the rule from. Note that this refers to a global step. The default value is 0.

  • end_step (int or None) – Global step number before which to end the invocation of the rule. To clarify, end_step is an exclusive bound: the rule is not invoked at end_step itself. The default value is None, which means run until the end of the job.

  • raise_eval_cond (bool) – This parameter controls whether to raise the exception RuleEvaluationConditionMet when it is raised by the rule, or to catch it, log the message, and move to the next step. The default value is False, which means that the function catches the exception, logs that the evaluation condition was met for a step, and moves on to evaluate the next step.
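
The parameters above can be illustrated with a simplified loop. This is a sketch of the described semantics only, not the smdebug implementation: invoke_sketch, EvaluationConditionMet, and the last_step argument are hypothetical, and the check is a plain callable rather than a Rule instance.

```python
class EvaluationConditionMet(Exception):
    """Stand-in for RuleEvaluationConditionMet in this sketch."""


def invoke_sketch(invoke_at_step, last_step, start_step=0, end_step=None,
                  raise_eval_cond=False):
    """Call invoke_at_step for each step in [start_step, end_step).

    last_step is a hypothetical argument: the last step available in the job.
    """
    stop = last_step + 1 if end_step is None else min(end_step, last_step + 1)
    met_steps = []
    for step in range(start_step, stop):
        if invoke_at_step(step):
            if raise_eval_cond:
                raise EvaluationConditionMet(f"condition met at step {step}")
            # Default behaviour: record the step and move on to the next one.
            met_steps.append(step)
    return met_steps


def condition(step):
    """Toy evaluation condition: met on every third step."""
    return step % 3 == 0


met = invoke_sketch(condition, last_step=9, start_step=1, end_step=7)
```

With end_step=7 the last step evaluated is 6, which reflects the exclusive bound; passing raise_eval_cond=True would instead stop at the first step where the condition is met.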