TensorFlow

SMDebug for TensorFlow

Amazon SageMaker Debugger and the smdebug client library fully support the TensorFlow framework.

Using Debugger, you can access tensors of any kind for TensorFlow models, from the Keras model zoo to your own custom models, and save them using Debugger built-in or custom tensor collections. You can run your training script on the official AWS Deep Learning Containers, where Debugger automatically captures tensors from your training job. Whether your TensorFlow models use the Keras API or the pure TensorFlow API (in eager or non-eager mode), you can run them directly on the AWS Deep Learning Containers.

Debugger and its client library smdebug also support debugging training jobs on other AWS training containers and custom containers. In this case, you must manually register the hook in your training script. For a full list of AWS TensorFlow containers that work with Debugger, see SageMaker containers to use Debugger with script mode. For a complete guide to using custom containers, see Use Debugger in Custom Training Containers.

Features supported by SMDebug

  • Debug training jobs that use the TensorFlow framework or Keras with the TensorFlow backend

  • Debug training jobs in TensorFlow eager or non-eager mode

  • Extended built-in tensor collections: inputs, outputs, layers, and gradients

  • Hook APIs to save model parameters: save_tensor and save_scalar (see the sketch after this list)
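
The following is a minimal sketch of both APIs, assuming a hook created from the training job's JSON configuration; the names my_scalar and my_tensor and the saved values are hypothetical placeholders:

import numpy as np
import smdebug.tensorflow as smd

hook = smd.KerasHook.create_from_json_file()

# Save a custom scalar; sm_metric=True also adds it to the sm_metrics collection
hook.save_scalar("my_scalar", 0.1, sm_metric=True)

# Save a custom tensor value into a named collection
hook.save_tensor("my_tensor", np.array([1.0, 2.0]), collections_to_write="default")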


Using Debugger on AWS Deep Learning Containers with TensorFlow

The Debugger built-in rules and hook features are fully integrated with the AWS Deep Learning Containers. You can run your training script without any script changes. When you run training jobs on these Deep Learning Containers, Debugger registers its hooks automatically in your training script to retrieve tensors. For a comprehensive guide to using the high-level SageMaker TensorFlow estimator with Debugger, see Amazon SageMaker Debugger with TensorFlow in the Amazon SageMaker Developer Guide.

The following code example provides the base structure for a SageMaker TensorFlow estimator with Debugger.

from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, DebuggerHookConfig, CollectionConfig, rule_configs

tf_estimator = TensorFlow(
    entry_point = "tf-train.py",
    role = "SageMakerRole",
    instance_count = 1,
    instance_type = "ml.p2.xlarge",
    framework_version = "2.2.0",
    py_version = "py37",

    # Debugger-specific Parameters
    rules = [
        Rule.sagemaker(rule_configs.vanishing_gradient()),
        Rule.sagemaker(rule_configs.loss_not_decreasing()),
        ...
    ],
    debugger_hook_config = DebuggerHookConfig(
        collection_configs = [
            CollectionConfig(name="inputs"),
            CollectionConfig(name="outputs"),
            CollectionConfig(name="layers"),
            CollectionConfig(name="gradients"),
            ...
        ]
    )
)
tf_estimator.fit("s3://bucket/path/to/training/data")
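
After fit() starts the training job, you can check the status of the built-in rule evaluations and find where Debugger stored the tensors. The following is a minimal sketch using the SageMaker Python SDK; the printed dictionary keys follow the DescribeTrainingJob API response:

# Print the evaluation status of each attached built-in rule
for summary in tf_estimator.latest_training_job.rule_job_summary():
    print(summary["RuleConfigurationName"], "-", summary["RuleEvaluationStatus"])

# S3 path where Debugger saved the tensor collections
print(tf_estimator.latest_job_debugger_artifacts_path())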

Note

The SageMaker TensorFlow estimator and the Debugger collections in the example are based on the SageMaker Python SDK v2 and smdebug v0.9.2. It is highly recommended that you upgrade the packages by running the following commands:

pip install -U sagemaker
pip install -U smdebug

If you are using a Jupyter notebook, put an exclamation mark (!) at the front of the code lines and restart your kernel. For more information about the SageMaker Python SDK, see Use Version 2.x of the SageMaker Python SDK.
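
For example, in a notebook cell:

! pip install -U sagemaker
! pip install -U smdebug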

Debugger Built-in Tensor Collections for TensorFlow

The following table lists the pre-configured tensor collections for TensorFlow models. You can pick any of these tensor collections by specifying the name parameter of CollectionConfig(), as shown in the previous base code example. SageMaker Debugger saves these tensors to the default out_dir of the hook.

Name                  Description
----                  -----------
all                   Matches all tensors.
default               Includes the metrics, losses, and sm_metrics collections.
metrics               For KerasHook, saves the metrics computed by Keras for the model.
losses                Saves all losses of the model.
sm_metrics            Saves scalars that you want to include in the SageMaker metrics collection.
inputs                Matches all inputs to the model.
outputs               Matches all outputs of the model, such as predictions (logits) and labels.
layers                Matches all inputs and outputs of intermediate layers.
gradients             Matches all gradients of the model.
weights               Matches all weights of the model.
biases                Matches all biases of the model.
optimizer_variables   Matches all optimizer variables; currently supported only for Keras.

For more information about adjusting the tensor collection parameters, see Save Tensors Using Debugger Modified Built-in Collections.

For a full list of available tensor collection parameters, see Configuring Collection using SageMaker Python SDK.
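
For example, you can control how often a built-in collection is saved through the parameters argument of CollectionConfig. The following is a minimal sketch; the save_interval value of 100 is an arbitrary choice for illustration:

from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

debugger_hook_config = DebuggerHookConfig(
    collection_configs = [
        CollectionConfig(
            name = "gradients",
            # Save the gradients collection every 100 steps
            parameters = {"save_interval": "100"}
        )
    ]
)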

Note

The inputs, outputs, gradients, and layers built-in collections are currently available for TensorFlow versions <2.0 and ==2.2.0.


Using Debugger on SageMaker Training Containers and Custom Containers

If you want to run your own training script or custom containers other than the AWS Deep Learning Containers described in the previous section, you can use either of the following options:

  • Option 1 - Use the SageMaker TensorFlow training containers with a modified training script

  • Option 2 - Use your own custom container with a modified training script and push the container to Amazon ECR (see the sketch after this list)
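
For Option 2, the estimator points at your image in Amazon ECR. The following is a brief, hedged sketch using the generic SageMaker Estimator class; the image URI, account number, and region are hypothetical placeholders:

from sagemaker.estimator import Estimator
from sagemaker.debugger import Rule, rule_configs

custom_estimator = Estimator(
    # Hypothetical ECR image URI for your custom training container
    image_uri = "111122223333.dkr.ecr.us-east-1.amazonaws.com/my-tf-container:latest",
    role = "SageMakerRole",
    instance_count = 1,
    instance_type = "ml.p2.xlarge",
    rules = [Rule.sagemaker(rule_configs.loss_not_decreasing())]
)
custom_estimator.fit("s3://bucket/path/to/training/data")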

For both options, you need to manually register the Debugger hook in your training script. Depending on the TensorFlow and Keras API operations used to construct your model, choose the right TensorFlow hook class, register the hook, and then save the tensors:

  1. Create a hook

  2. Wrap the optimizer and the gradient tape with the hook to retrieve gradient tensors

  3. Register the hook to model.fit()

  4. Perform actions using the hook APIs

Step 1: Create a hook

To create a hook, add the following code to your training script. This enables the smdebug tools for TensorFlow and creates a TensorFlow hook object. When you run the fit() API for training, specify the smdebug hook as a callback, as shown in the following subsections.

Depending on the TensorFlow version and the Keras API that you use in your training script, you need to choose the right hook class. The hook constructors for TensorFlow that you can choose from are smd.KerasHook, smd.SessionHook, and smd.EstimatorHook.

KerasHook

If you use the Keras model zoo and the Keras model.fit() API, use KerasHook. KerasHook is available for Keras models with the TensorFlow backend. KerasHook covers the eager execution modes and the gradient tape features introduced in TensorFlow 2.0. You can set the smdebug Keras hook constructor by adding the following code to your training script. Place this line before model.compile():

import smdebug.tensorflow as smd
hook = smd.KerasHook.create_from_json_file()

To learn how to fully implement the hook in your training script, see the Keras with the TensorFlow gradient tape and the smdebug hook example scripts.

Note: If you use the AWS Deep Learning Containers for zero script change, Debugger collects most of the tensors through its high-level API, regardless of the eager execution mode.

SessionHook

If your model is created in TensorFlow 1.x with the low-level approach (not using the Keras API), use SessionHook. SessionHook is for the TensorFlow 1.x monitored training session API, tf.train.MonitoredSession(), as shown following:

import smdebug.tensorflow as smd
hook = smd.SessionHook.create_from_json_file()

To learn how to fully implement the hook into your training script, see the TensorFlow monitored training session with the smdebug hook example script.

Note: The official TensorFlow library deprecated the tf.train.MonitoredSession() API in favor of tf.function() in TensorFlow 2.0 and later. You can use SessionHook for tf.function() in TensorFlow 2.0 and later.

EstimatorHook

If you have a model that uses the tf.estimator API, use EstimatorHook. EstimatorHook is available for any TensorFlow framework version that supports the tf.estimator API, as shown following:

import smdebug.tensorflow as smd
hook = smd.EstimatorHook.create_from_json_file()

To learn how to fully implement the hook into your training script, see the simple MNIST training script with the TensorFlow estimator.

Step 2: Wrap the optimizer and the gradient tape to retrieve gradient tensors

The smdebug TensorFlow hook provides tools to manually retrieve gradient tensors specific to the TensorFlow framework.

If you want to save gradients from an optimizer (for example, the Keras Adam optimizer), wrap it with the hook as shown following:

optimizer = tf.keras.optimizers.Adam(learning_rate=args.lr)
optimizer = hook.wrap_optimizer(optimizer)

If you want to save gradients and output tensors from the TensorFlow GradientTape feature, wrap tf.GradientTape with the smdebug hook.wrap_tape method and save the tensors using the hook.save_tensor function. hook.save_tensor takes its input in the (tensor_name, tensor_value, collections_to_write="default") format. For example:

with hook.wrap_tape(tf.GradientTape(persistent=True)) as tape:
    logits = model(data, training=True)
    loss_value = cce(labels, logits)
hook.save_tensor("y_labels", labels, "outputs")
hook.save_tensor("predictions", logits, "outputs")
grads = tape.gradient(loss_value, model.variables)
hook.save_tensor("grads", grads, "gradients")

These smdebug hook wrapper functions capture the gradient tensors without affecting your optimization logic.

For examples of code structures that you can use to apply the hook wrappers, see the Code Examples section.

Step 3: Register the hook to model.fit()

To collect the tensors from the hooks that you registered, add callbacks=[hook] to the Keras model.fit() API. This passes the SageMaker Debugger hook as a Keras callback. Similarly, add hooks=[hook] to the MonitoredSession(), tf.function(), and tf.estimator() APIs. For example:

model.fit(X_train, Y_train,
          batch_size=batch_size,
          epochs=epoch,
          validation_data=(X_valid, Y_valid),
          shuffle=True,
          # smdebug modification: Pass the hook as a Keras callback
          callbacks=[hook])

Step 4: Perform actions using the hook APIs

For a full list of actions that the hook APIs offer to construct hooks and save tensors, see Common hook API and TensorFlow specific hook API.


Code Examples

The following code examples show the base structures that you can use for hook registration in various TensorFlow training scripts. If you want to use the high-level Debugger features with zero script change on AWS Deep Learning Containers, see Use Debugger in AWS Containers.

Keras API (tf.keras)

The following code example shows how to register the smdebug KerasHook for the Keras model.fit() API. You can also set the hook mode to track tensors in different phases of the training job. For a list of available hook modes, see smdebug modes.

import smdebug.tensorflow as smd

hook = smd.KerasHook.create_from_json_file()

model = tf.keras.models.Sequential([ ... ])
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
)
# Add the hook as a callback
# Use hook.set_mode to choose the phase of the training job in which to store tensors, such as TRAIN and EVAL
hook.set_mode(mode=smd.modes.TRAIN)
model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook])

hook.set_mode(mode=smd.modes.EVAL)
model.evaluate(x_test, y_test, callbacks=[hook])

Keras GradientTape example for TensorFlow 2.0 and later

The following code example shows how to register the smdebug KerasHook by wrapping the TensorFlow GradientTape() with the smdebug hook.wrap_tape() API.

import smdebug.tensorflow as smd

hook = smd.KerasHook.create_from_json_file()

model = tf.keras.models.Sequential([ ... ])

for epoch in range(n_epochs):
    for data, labels in dataset:
        dataset_labels = labels
        # wrap the tape to capture tensors
        with hook.wrap_tape(tf.GradientTape(persistent=True)) as tape:
            logits = model(data, training=True)  # (32,10)
            loss_value = cce(labels, logits)
        grads = tape.gradient(loss_value, model.variables)
        opt.apply_gradients(zip(grads, model.variables))
        acc = train_acc_metric(dataset_labels, logits)
        # manually save metric values
        hook.save_tensor(tensor_name="accuracy", tensor_value=acc, collections_to_write="default")

Monitored Session (tf.train.MonitoredSession)

The following code example shows how to register the smdebug SessionHook.

import smdebug.tensorflow as smd

hook = smd.SessionHook.create_from_json_file()

loss = tf.reduce_mean(tf.matmul(...), name="loss")
optimizer = tf.train.AdamOptimizer(args.lr)

# Wrap the optimizer
optimizer = hook.wrap_optimizer(optimizer)

# Add the hook as a callback
sess = tf.train.MonitoredSession(hooks=[hook])

sess.run([loss, ...])

Estimator (tf.estimator.Estimator)

The following code example shows how to register the smdebug EstimatorHook. You can also set the hook mode to track tensors in different phases of the training job. For a list of available hook modes, see smdebug modes.

import smdebug.tensorflow as smd

hook = smd.EstimatorHook.create_from_json_file()

train_input_fn, eval_input_fn = ...
estimator = tf.estimator.Estimator(...)

# Use hook.set_mode to choose the phase of the training job in which to store tensors, such as TRAIN and EVAL
hook.set_mode(mode=smd.modes.TRAIN)
estimator.train(input_fn=train_input_fn, steps=args.steps, hooks=[hook])

hook.set_mode(mode=smd.modes.EVAL)
estimator.evaluate(input_fn=eval_input_fn, steps=args.steps, hooks=[hook])

References

The smdebug API for saving tensors

See the API for saving tensors page for details about the Hooks, Collection, SaveConfig, and ReductionConfig classes. See the Analysis page for details about analyzing a training job.
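
As a brief illustration of the analysis workflow, the following sketch uses the smdebug create_trial API to read back saved tensors; the S3 path and the tensor name "loss" are hypothetical placeholders:

from smdebug.trials import create_trial

# Point the trial at the out_dir where the hook saved tensors
trial = create_trial("s3://bucket/path/to/tensors")

# List the saved tensors and read one value at a given step
print(trial.tensor_names())
print(trial.tensor("loss").value(step_num=0))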