Trial API

Use the methods of an SMDebug trial object to load the output tensors collected by SageMaker Debugger for further analysis.

Note

To use the following methods of the Trial class, you must first create a trial instance as described in the previous topic, SMDebug Trial.

class smdebug.trials.trial.Trial(name, range_steps=None, parallel=True, check=False, index_mode=True, cache=False)

Bases: abc.ABC

The base class for SMDebug trial objects. The helper function create_trial, which creates a trial instance, was introduced in the previous topic, SMDebug Trial.

After you create an SMDebug trial object, use the following Trial class methods for accessing output tensor information.

tensor(tname)

Retrieves the smdebug.core.tensor.Tensor object by the given name tname. To find available methods that this Tensor object provides, see Tensor API.

If no output tensor is available yet when you call this method, it keeps refreshing until the first output tensor becomes available.

Parameters: tname (str) – The name of the tensor.
Returns: An output tensor object.
Return type: Tensor object

has_tensor(tname)

Checks if the trial has a tensor of the given tensor name.

Parameters: tname (str) – The name of the tensor.
Returns: True if the trial finds the tensor; otherwise False.
Return type: bool
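
As a minimal sketch, has_tensor can guard a call to tensor so that a lookup for an unknown name returns None. The helper name get_tensor_or_none is hypothetical and not part of smdebug; it relies only on the two methods documented above.

```python
def get_tensor_or_none(trial, tname):
    """Return trial.tensor(tname) if the trial knows the tensor, else None.

    Hypothetical helper, not part of smdebug. `trial` is any object that
    provides the has_tensor() and tensor() methods documented above.
    """
    if trial.has_tensor(tname):
        return trial.tensor(tname)
    return None
```

With a trial created by create_trial, a call such as get_tensor_or_none(trial, "loss") would return a Tensor object or None; the tensor name "loss" is only an illustration.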

tensor_names(show_prefixed_tensors=False, *, step=None, mode=<ModeKeys.GLOBAL: 4>, regex=None, collection=None) → list

Retrieves the names of saved tensors.

All arguments to this method are optional. If you pass none, the method returns all tensor names.

Parameters

step (int) – To retrieve the list of tensors saved at a particular step, pass the step number as an integer. The step number is interpreted relative to the mode passed below; by default, it is treated as a global step.
mode (smdebug.modes enum value) – To retrieve the list of tensors saved for a particular mode, pass the mode as smd.modes.TRAIN, smd.modes.EVAL, smd.modes.PREDICT, or smd.modes.GLOBAL.
regex (str or list[str]) – To filter tensors by name, pass a regular expression as a string or a list of strings. You can pass only one of the regex or collection parameters.
collection (Collection or str) – To filter tensors belonging to a collection, pass either a Collection object or the name of the collection as a string. You can pass only one of the regex or collection parameters.

Returns

A list of strings representing the names of tensors that match the given arguments. The method first gets the tensor names saved for the given step and mode, then keeps only those that match every other argument, that is, the intersection of the tensors matching each parameter.

Return type

list[str]

Examples:

trial.tensor_names() - Returns all tensors saved for any step or mode.
trial.tensor_names(step=10, mode=modes.TRAIN) - Returns tensors saved for training step 10
trial.tensor_names(regex='relu') - Returns all tensors matching the regex pattern relu saved for any step or mode.
trial.tensor_names(collection='gradients') - Returns tensors from collection “gradients”
trial.tensor_names(step=10, mode=modes.TRAIN, regex='softmax') - Returns tensors saved for training step 10 that match the regex softmax.
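
The rule that only one of regex or collection may be passed can be enforced before the call is made. The following is a minimal sketch; list_tensor_names is a hypothetical wrapper, not part of smdebug, and it forwards only the keyword arguments documented above.

```python
def list_tensor_names(trial, step=None, mode=None, regex=None, collection=None):
    """Call trial.tensor_names() with only the filters that were given.

    Hypothetical wrapper, not part of smdebug. It rejects calls that pass
    both regex and collection, since the documentation above says only one
    of them may be used at a time.
    """
    if regex is not None and collection is not None:
        raise ValueError("pass only one of regex or collection")
    filters = {"step": step, "mode": mode, "regex": regex, "collection": collection}
    # Drop unset filters so tensor_names() falls back to its own defaults.
    kwargs = {key: value for key, value in filters.items() if value is not None}
    return trial.tensor_names(**kwargs)
```

Filtering on `is not None` rather than truthiness keeps step=0 as a valid filter.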

workers()

Queries all worker processes from which smdebug saved data during multi-worker training.

Returns: A sorted list of names of worker processes from which data was saved. If you use TensorFlow Mirrored Strategy for multi-worker training, these are the names of the different devices in the process. For Horovod, torch.distributed, and similar distributed training approaches, these are names of the form worker_0, where 0 is the rank of the process.
Return type: list[str]
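
For rank-style names of the form worker_0, the rank can be recovered from the name. The sketch below is illustrative only: worker_rank and sort_by_rank are hypothetical helpers, and the numeric re-sort assumes that a plain string sort could place worker_10 before worker_2, which the text above does not specify.

```python
import re

# Matches names such as "worker_0"; device-style names do not match.
_WORKER_NAME = re.compile(r"^worker_(\d+)$")


def worker_rank(worker_name):
    """Return the integer rank from a name like 'worker_3', else None."""
    match = _WORKER_NAME.match(worker_name)
    return int(match.group(1)) if match else None


def sort_by_rank(worker_names):
    """Order rank-style names numerically; other names go last, sorted as strings."""
    def key(name):
        rank = worker_rank(name)
        return (rank is None, rank if rank is not None else 0, name)

    return sorted(worker_names, key=key)
```

A call such as sort_by_rank(trial.workers()) would then order the workers by rank for per-worker analysis.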

steps(mode=<ModeKeys.GLOBAL: 4>, show_incomplete_steps=False) → list

Retrieves a list of steps collected by SageMaker Debugger.

Parameters

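mode (smdebug.modes enum value) – Pass a mode to retrieve the list of steps seen by the trial for that mode. If you do not pass a mode, the method returns steps for all modes.
show_incomplete_steps (bool) – If True, the list also includes incomplete steps, that is, steps that not all workers have finished writing. Defaults to False.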

Returns

List of integers representing step numbers. If a mode was passed, this returns steps within that mode, i.e. mode steps. Each of these mode steps has a global step number associated with it. The global step represents the sequence of steps across all modes executed by the job.

Return type

list[int]

global_step(mode, mode_step)

Retrieves the global step that corresponds to a given mode and mode_step number.

Parameters

mode (smdebug.modes enum value) – The mode as an enum value.
mode_step (int) – The mode step as an integer.

Returns

An integer representing global_step of the given mode and mode_step.

Return type

int
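
Combining steps with global_step gives a lookup table from mode steps to global steps. This is a minimal sketch; mode_to_global_steps is a hypothetical helper, not part of smdebug, and it uses only the two calls documented above.

```python
def mode_to_global_steps(trial, mode):
    """Map each step recorded for `mode` to its global step number.

    Hypothetical helper, not part of smdebug. `mode` is an smdebug.modes
    enum value such as modes.TRAIN.
    """
    return {
        mode_step: trial.global_step(mode, mode_step)
        for mode_step in trial.steps(mode=mode)
    }
```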

mode_step(global_step)

Identifies the mode_step that corresponds to a given global step number.

Parameters: global_step (int) – The global step as an integer.
Returns: An integer representing the mode_step of the given global step. Typically used in conjunction with the mode method.
Return type: int

mode(global_step)

Identifies the mode that corresponds to a given global step number.

Parameters: global_step (int) – The global step as an integer.
Returns: smdebug.modes enum value of the given global step.
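
Used together, mode and mode_step translate every global step back into its mode and position within that mode. The sketch below is illustrative; steps_by_mode is a hypothetical helper, and it assumes that calling steps() with no arguments returns global step numbers, as the default mode suggests.

```python
def steps_by_mode(trial):
    """Group the trial's global steps into {mode: [mode_step, ...]}.

    Hypothetical helper, not part of smdebug. Assumes trial.steps() with
    no arguments returns global step numbers.
    """
    grouped = {}
    for global_step in trial.steps():
        mode = trial.mode(global_step)
        grouped.setdefault(mode, []).append(trial.mode_step(global_step))
    return grouped
```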

modes()

Retrieves a list of modes seen by the trial.

Returns: A list of modes for which data was saved, across all steps collected from the training job.
Return type: list[smdebug.modes enum value]

collections()

List the collections from the trial.

Note that tensors that are part of these collections may not necessarily have been saved from the training job. Whether a collection was saved or not depends on the configuration of the Hook during training.

Returns: A dictionary indexed by the name of the collection, with the Collection object as the value. See Tensor Collections for more details.
Return type: dict[str -> Collection]
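
Because a collection can exist without its tensors having been saved, it can help to list what was actually saved per collection. The sketch below combines collections with tensor_names(collection=...); saved_tensors_by_collection is a hypothetical helper, and it assumes that tensor_names returns an empty list for a collection with no saved tensors.

```python
def saved_tensors_by_collection(trial):
    """Map each collection name to the tensor names saved for it.

    Hypothetical helper, not part of smdebug. Iterating over the
    dictionary returned by trial.collections() yields collection names.
    """
    return {
        name: trial.tensor_names(collection=name)
        for name in trial.collections()
    }
```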

collection(coll_name)

Get a specific collection from the trial.

Note that tensors which are part of this collection may not necessarily have been saved from the training job. Whether this collection was saved or not depends on the configuration of the Hook during training.

Parameters: coll_name (str) – Name of the collection
Returns: The requested Collection object. See Tensor Collections for more details.
Return type: Collection

wait_for_steps(required_steps, mode=<ModeKeys.GLOBAL: 4>)

Waits for the given steps before proceeding.

Use this method when you want to wait for smdebug to see the required steps so that you can then query and analyze the tensors saved at those steps. The method blocks until smdebug has seen all data from the steps.

Parameters

required_steps (list[int]) – Step numbers to wait for
mode (smdebug.modes enum value) – The mode to which the given step numbers correspond. Defaults to modes.GLOBAL.

Returns

Returns only once it is known definitively whether the steps have been seen.

Return type

None

Exceptions raised:

StepUnavailable and NoMoreData. See Exceptions section for more details.
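
A typical pattern is to block on a step and then query what was saved at it. The following is a minimal sketch; tensor_names_when_ready is a hypothetical helper, not part of smdebug, and it lets StepUnavailable and NoMoreData propagate to the caller.

```python
def tensor_names_when_ready(trial, step, mode):
    """Block until `step` has been seen, then return the tensor names saved at it.

    Hypothetical helper, not part of smdebug. `mode` is an smdebug.modes
    enum value. Any exception raised by wait_for_steps() is not caught
    here and reaches the caller.
    """
    trial.wait_for_steps([step], mode=mode)
    return trial.tensor_names(step=step, mode=mode)
```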

has_passed_step(step, mode=<ModeKeys.GLOBAL: 4>) → smdebug.core.tensor.StepState

Indicates whether a step is complete (AVAILABLE), incomplete (NOT_YET_AVAILABLE), or absent (UNAVAILABLE).

Overview of logic:

If the queried step is greater than all the available steps (complete or incomplete):

    if job is not complete:
        return StepState.NOT_YET_AVAILABLE
    else:
        return StepState.UNAVAILABLE

If the queried step is less than or equal to a step in the available steps (complete or incomplete):

    if the queried step is less than all the available steps:
        if single_worker:
            return UNAVAILABLE  (the step has been skipped or will not be written)
        else:
            return NOT_YET_AVAILABLE

If the queried step is available:

    if all workers have written the step, or the job is complete,
    or last_complete_step > step (all workers have written a step greater than
    the step being checked, hence the step will never be complete):
        return AVAILABLE
    else:
        return NOT_YET_AVAILABLE

Parameters

step (int) – The step number to check.
mode (smdebug.modes enum value) – The mode to which the given step number corresponds. Defaults to modes.GLOBAL.

Returns

One of the following values: UNAVAILABLE, AVAILABLE, or NOT_YET_AVAILABLE.

Return type

smdebug.core.tensor.StepState enum value
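
To see at a glance which steps can be analyzed, the states returned by has_passed_step can be grouped by name. This is a minimal sketch; group_steps_by_state is a hypothetical helper, not part of smdebug, and it relies only on the returned value being an enum member with a name attribute.

```python
def group_steps_by_state(trial, steps, mode):
    """Group step numbers by the name of the state has_passed_step() returns.

    Hypothetical helper, not part of smdebug. The keys are state names
    such as "AVAILABLE"; the values are the steps in that state, in the
    order they were passed in.
    """
    grouped = {}
    for step in steps:
        state = trial.has_passed_step(step, mode=mode)
        grouped.setdefault(state.name, []).append(step)
    return grouped
```

Steps listed under AVAILABLE can be queried right away, while those under NOT_YET_AVAILABLE may be worth checking again later.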