Trial API

Use the methods of an SMDebug trial object to load output tensors collected by SageMaker Debugger for further analysis.

Note

To use the following methods of the Trial class, you must first create a trial instance as described in the previous topic, SMDebug Trial.

class smdebug.trials.trial.Trial(name, range_steps=None, parallel=True, check=False, index_mode=True, cache=False)

Bases: abc.ABC

The base class for SMDebug trial objects. The helper function create_trial, which creates a trial, was introduced in the previous topic, SMDebug Trial.

After you create an SMDebug trial object, use the following Trial class methods for accessing output tensor information.

tensor(tname)

Retrieves the smdebug.core.tensor.Tensor object by the given name tname. To find available methods that this Tensor object provides, see Tensor API.

If no output tensor is available yet when you call this method, the method keeps refreshing until the first output tensor becomes available.

Parameters

tname (str) – The name of the tensor.

Returns

An output tensor object.

Return type

Tensor object

has_tensor(tname)

Checks if the trial has a tensor of the given tensor name.

Parameters

tname (str) – The name of the tensor.

Returns

True if the trial finds the tensor; otherwise False.

Return type

bool
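The two methods above can be combined so that a tensor is only requested once the trial reports it. The following is a minimal sketch, not part of the library: get_tensor_if_saved is a hypothetical helper name, and trial is assumed to be an SMDebug trial object created as described in the previous topic.

```python
def get_tensor_if_saved(trial, tname):
    """Return the Tensor object named tname, or None if the trial has no such tensor.

    Hypothetical helper: it only relies on the has_tensor() and tensor()
    methods documented above.
    """
    # has_tensor() returns a bool, so the lookup is skipped for unknown names.
    if trial.has_tensor(tname):
        return trial.tensor(tname)
    return None
```

A call such as get_tensor_if_saved(trial, "loss") then yields either a Tensor object or None, which avoids depending on how tensor() behaves for a name the trial has not seen.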

tensor_names(show_prefixed_tensors=False, *, step=None, mode=<ModeKeys.GLOBAL: 4>, regex=None, collection=None) → list

Retrieves the names of saved tensors.

All arguments to this method are optional. If you pass no arguments, the method returns all tensor names.

Parameters
  • step (int) – To retrieve the list of tensors saved at a particular step, pass the step number as an integer. The step number is interpreted as a step of the mode passed below. By default, it is treated as a global step.

  • mode (smdebug.modes enum value) – If you want to retrieve the list of tensors saved for a particular mode, pass the mode here as smd.modes.TRAIN, smd.modes.EVAL, smd.modes.PREDICT, or smd.modes.GLOBAL.

  • regex (str or list[str]) – To filter tensors by name, pass a regular expression as a string, or several as a list of strings. You can pass only one of the regex and collection parameters.

  • collection (Collection or str) – To filter tensors belonging to a collection, pass either a collection object or the name of the collection as a string. You can pass only one of the regex and collection parameters.

Returns

List of strings representing the names of tensors matching the given arguments. The method first gets the tensor names saved for the given step and mode, then keeps only the names that match all of the other given arguments, that is, the intersection of the tensors matching each parameter.

Return type

list[str]

Examples:

  • trial.tensor_names() - Returns all tensors saved for any step or mode.

  • trial.tensor_names(step=10, mode=modes.TRAIN) - Returns tensors saved for training step 10

  • trial.tensor_names(regex='relu') - Returns all tensors matching the regex pattern relu saved for any step or mode.

  • trial.tensor_names(collection='gradients') - Returns tensors from collection “gradients”

  • trial.tensor_names(step=10, mode=modes.TRAIN, regex='softmax') - Returns tensors saved for training step 10 that match the regex softmax.
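Because regex and collection cannot be passed together, a thin wrapper can check this before the call reaches the trial. This is only a sketch under assumptions: filtered_tensor_names is a hypothetical helper, trial is an SMDebug trial object, and any extra keyword arguments (such as step or mode) are forwarded unchanged to tensor_names.

```python
def filtered_tensor_names(trial, regex=None, collection=None, **step_and_mode):
    """Call trial.tensor_names() with at most one of regex or collection.

    Hypothetical helper: step_and_mode holds optional keyword arguments
    such as step=10 or mode=modes.TRAIN, forwarded as-is.
    """
    if regex is not None and collection is not None:
        raise ValueError("pass either regex or collection, not both")
    kwargs = dict(step_and_mode)
    # Only forward the filters that were actually given, so the
    # method's own defaults apply to everything else.
    if regex is not None:
        kwargs["regex"] = regex
    if collection is not None:
        kwargs["collection"] = collection
    return trial.tensor_names(**kwargs)
```

For example, filtered_tensor_names(trial, regex='relu', step=10) forwards both arguments, while passing regex and collection together raises ValueError in the helper itself.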

workers()

Queries for all the worker processes from which smdebug saved data during multi-worker training.

Returns

A sorted list of the names of worker processes from which data was saved. If you use TensorFlow Mirrored Strategy for multi-worker training, these are the names of the different devices in the process. For Horovod, torch.distributed, and similar distributed training approaches, these are names of the form worker_0, where 0 is the rank of the process.

Return type

list[str]

steps(mode=<ModeKeys.GLOBAL: 4>, show_incomplete_steps=False) → list

Retrieves a list of steps collected by SageMaker Debugger.

Parameters
  • mode (smdebug.modes enum value) – Pass a mode to retrieve the list of steps seen by the trial for that mode. If no mode is passed, the method returns steps for all modes.

  • show_incomplete_steps (bool) –

Returns

List of integers representing step numbers. If a mode was passed, this returns steps within that mode, i.e. mode steps. Each of these mode steps has a global step number associated with it. The global step represents the sequence of steps across all modes executed by the job.

Return type

list[int]

global_step(mode, mode_step)

Retrieves the global step number for a given mode and mode_step number.

Parameters
  • mode (smdebug.modes enum value) – The mode, as an enum value.

  • mode_step (int) – The mode step, as an integer.

Returns

An integer representing the global step of the given mode and mode_step.

Return type

int

mode_step(global_step)

Identifies the mode_step for a given global step number.

Parameters

global_step (int) – The global step, as an integer.

Returns

An integer representing the mode_step of the given global step. Typically used in conjunction with the mode method.

Return type

int

mode(global_step)

Identifies the mode for a given global step number.

Parameters

global_step (int) – The global step, as an integer.

Returns

smdebug.modes enum value of the given global step.
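The three methods global_step, mode_step, and mode translate between the two step numberings. The sketch below shows one way they might be used together; locate_global_step is a hypothetical helper, and trial is assumed to be an SMDebug trial object.

```python
def locate_global_step(trial, global_step):
    """Return (mode, mode_step, round_trips) for one global step number.

    Hypothetical helper built only on the mode(), mode_step() and
    global_step() methods documented above.
    """
    mode = trial.mode(global_step)            # mode the global step belongs to
    mode_step = trial.mode_step(global_step)  # step number within that mode
    # Mapping the pair back with global_step() is expected to give the
    # number we started from; the flag reports whether it did.
    round_trips = trial.global_step(mode, mode_step) == global_step
    return mode, mode_step, round_trips
```

The returned mode and mode_step can then be passed to methods that accept step and mode arguments, such as tensor_names.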

modes()

Retrieves a list of modes seen by the trial.

Returns

List of modes for which data was saved, across all steps collected from the training job.

Return type

list[smdebug.modes enum value]
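Since steps accepts a mode and modes lists the modes seen, the two can be combined to get the mode steps for each mode. This is a minimal sketch with a hypothetical helper name, steps_by_mode, and it assumes trial is an SMDebug trial object.

```python
def steps_by_mode(trial):
    """Map each mode seen by the trial to its list of mode steps.

    Hypothetical helper: it calls trial.modes() once and then
    trial.steps(mode=...) for each mode returned.
    """
    return {mode: trial.steps(mode=mode) for mode in trial.modes()}
```

The resulting dictionary is keyed by the enum values that modes() returns, so it can be indexed with the same values that the other methods accept as mode.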

collections()

Lists the collections from the trial.

Note that the tensors that are part of these collections may not necessarily have been saved from the training job. Whether a collection was saved or not depends on the configuration of the Hook during training.

Returns

A dictionary indexed by the name of the collection, with the Collection object as the value. See Tensor Collections for more details.

Return type

dict[str -> Collection]

collection(coll_name)

Gets a specific collection from the trial.

Note that tensors which are part of this collection may not necessarily have been saved from the training job. Whether this collection was saved or not depends on the configuration of the Hook during training.

Parameters

coll_name (str) – Name of the collection

Returns

The requested Collection object. See Tensor Collections for more details.

Return type

Collection
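Because a collection can be listed without any of its tensors having been saved, it can help to count the saved tensor names per collection. The sketch below is one possible way to do that; saved_tensor_counts is a hypothetical helper, and trial is assumed to be an SMDebug trial object.

```python
def saved_tensor_counts(trial):
    """Map each collection name to the number of tensor names saved for it.

    Hypothetical helper: it iterates the keys of trial.collections() and
    passes each name to trial.tensor_names(collection=...).
    """
    counts = {}
    for name in trial.collections():  # iterating a dict yields its keys
        counts[name] = len(trial.tensor_names(collection=name))
    return counts
```

A collection that maps to 0 is one the trial knows about but for which no tensor names were returned.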

wait_for_steps(required_steps, mode=<ModeKeys.GLOBAL: 4>)

Waits for the given steps before proceeding.

Use this method to wait until smdebug has seen the required steps, so that you can then query and analyze the tensors saved at those steps. The method blocks until smdebug has seen all data from the steps.

Parameters
  • required_steps (list[int]) – Step numbers to wait for

  • mode (smdebug.modes enum value) – The mode to which the given step numbers correspond. This defaults to modes.GLOBAL.

Returns

Returns only once it is known definitively whether the steps have been seen.

Return type

None

Exceptions raised:

StepUnavailable and NoMoreData. See the Exceptions section for more details.
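A typical pattern is to block on the steps of interest and then query what was saved at each one. The following is a sketch under assumptions, not a library function: wait_and_list_tensor_names is a hypothetical helper, trial is an SMDebug trial object, and any exception raised by wait_for_steps is left to propagate to the caller.

```python
def wait_and_list_tensor_names(trial, required_steps, mode=None):
    """Wait for required_steps, then return {step: tensor names saved at it}.

    Hypothetical helper. When mode is None, no mode argument is passed,
    so the default of each method (modes.GLOBAL) applies.
    """
    mode_kwargs = {} if mode is None else {"mode": mode}
    # Blocks until the trial has seen the steps, or raises if it cannot.
    trial.wait_for_steps(required_steps, **mode_kwargs)
    return {
        step: trial.tensor_names(step=step, **mode_kwargs)
        for step in required_steps
    }
```

The same mode is passed to both calls so that the step numbers are interpreted consistently.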

has_passed_step(step, mode=<ModeKeys.GLOBAL: 4>) → smdebug.core.tensor.StepState

This function indicates whether a step is complete (AVAILABLE), incomplete (NOT_YET_AVAILABLE), or absent (UNAVAILABLE).

Overview of logic:

  1. If the queried step is greater than all the available steps (complete / incomplete):

    if job is not complete:
        return StepState.NOT_YET_AVAILABLE
    else:
        return StepState.UNAVAILABLE

  2. If the queried step is less than or equal to a step in the available steps (complete / incomplete):

    if the queried step is less than all the available steps:
        if single_worker:
            return UNAVAILABLE (step has been skipped or will not be written)
        else:
            return NOT_YET_AVAILABLE

  3. If the queried step is available:

    if all workers have written the step or job is complete
    or last_complete_step > step (All workers have written a step greater than the step we are checking.
                                  Hence, the step will never be complete.):
        return AVAILABLE
    else:
        return NOT_YET_AVAILABLE
    
Parameters
  • step (int) – The step number to check.

  • mode (smdebug.modes enum value) – The mode to which the given step number corresponds. This defaults to modes.GLOBAL.

Returns

One of the following values: UNAVAILABLE, AVAILABLE, or NOT_YET_AVAILABLE.

Return type

smdebug.core.tensor.StepState enum value
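When several steps need to be checked at once, the returned states can be grouped by name. This is a minimal sketch: group_steps_by_state is a hypothetical helper, trial is assumed to be an SMDebug trial object, and the grouping relies only on the returned value being an enum member with a name attribute.

```python
def group_steps_by_state(trial, steps, mode=None):
    """Group step numbers by the name of the state has_passed_step() returns.

    Hypothetical helper. When mode is None, no mode argument is passed,
    so the method's default (modes.GLOBAL) applies.
    """
    mode_kwargs = {} if mode is None else {"mode": mode}
    groups = {}
    for step in steps:
        state = trial.has_passed_step(step, **mode_kwargs)
        # Enum members expose their name, e.g. "AVAILABLE".
        groups.setdefault(state.name, []).append(step)
    return groups
```

The steps listed under "AVAILABLE" can be queried right away, while those under "NOT_YET_AVAILABLE" are candidates for wait_for_steps.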