Trial API
Use the methods of an SMDebug trial object to load output tensors collected by SageMaker Debugger for further analysis.
Note
To use the following methods of the Trial class, you must create a trial instance as guided in the previous topic, SMDebug Trial.
-
class smdebug.trials.trial.Trial(name, range_steps=None, parallel=True, check=False, index_mode=True, cache=False)
Bases: abc.ABC
The base class for creating an SMDebug trial object. A trial creation helper function, create_trial, was introduced in the previous topic, SMDebug Trial. After you create an SMDebug trial object, use the following Trial class methods to access output tensor information.
-
tensor(tname)
Retrieves the smdebug.core.tensor.Tensor object with the given name tname. To find the methods that this Tensor object provides, see Tensor API. If the output tensor is not yet available when you call this method, the call keeps retrying until the first output tensor becomes available.
- Parameters
tname (str) – The name of the tensor.
- Returns
An output tensor object.
- Return type
Tensor object
-
has_tensor(tname)
Checks if the trial has a tensor of the given tensor name.
- Parameters
tname (str) – The name of the tensor.
- Returns
True if the tensor is found by the trial; otherwise, False.
- Return type
bool
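As an illustration of how has_tensor and tensor can be combined, the following is a minimal sketch that checks for a name before fetching it. The helper name get_tensor_if_present is hypothetical and not part of smdebug; the sketch assumes only that trial exposes the two methods described above.

```python
def get_tensor_if_present(trial, tname):
    """Return the Tensor object named tname, or None if the trial has no such tensor.

    Hypothetical helper: `trial` is assumed to be an SMDebug trial object
    that provides has_tensor(tname) and tensor(tname).
    """
    # Check membership first so that an unknown name yields None
    # instead of being passed on to tensor().
    if trial.has_tensor(tname):
        return trial.tensor(tname)
    return None
```

A caller could then write something like get_tensor_if_present(trial, "loss") and branch on whether the result is None; the tensor name here is only an example.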
-
tensor_names(show_prefixed_tensors=False, *, step=None, mode=<ModeKeys.GLOBAL: 4>, regex=None, collection=None) → list
Retrieves the names of saved tensors.
All arguments to this method are optional. If you pass no arguments, the method returns all tensor names.
- Parameters
step (int) – To retrieve the list of tensors saved at a particular step, pass the step number as an integer. The step number is interpreted relative to the mode passed in the mode argument. By default, it is treated as a global step.
mode (smdebug.modes enum value) – To retrieve the list of tensors saved for a particular mode, pass the mode as smd.modes.TRAIN, smd.modes.EVAL, smd.modes.PREDICT, or smd.modes.GLOBAL.
regex (str or list[str]) – Filters tensors whose names match the given regular expressions, passed as a string or a list of strings. You can pass only one of regex or collection.
collection (Collection or str) – Filters tensors that belong to a collection, passed either as a Collection object or as the name of the collection as a string. You can pass only one of regex or collection.
- Returns
A list of strings representing the names of tensors that match the given arguments. The arguments are combined as an intersection: the method takes the tensor names saved for the given step and mode, then keeps only those that also match every other argument passed.
- Return type
list[str]
Examples:
trial.tensor_names() - Returns all tensors saved for any step or mode.
trial.tensor_names(step=10, mode=modes.TRAIN) - Returns tensors saved for training step 10.
trial.tensor_names(regex='relu') - Returns all tensors matching the regex pattern relu saved for any step or mode.
trial.tensor_names(collection='gradients') - Returns tensors from the collection "gradients".
trial.tensor_names(step=10, mode=modes.TRAIN, regex='softmax') - Returns tensors saved for the 10th training step that match the regex softmax.
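The step, mode, and regex arguments can also be combined in a loop. The following is a rough sketch, not an smdebug API: the helper name tensor_names_by_step is hypothetical, and it assumes that trial.steps(mode=...) returns the mode steps for which data was saved.

```python
def tensor_names_by_step(trial, mode, regex=None):
    """Map each step of `mode` to the sorted tensor names saved at that step.

    Hypothetical helper: `trial` is assumed to provide steps(mode=...) and
    tensor_names(step=..., mode=..., regex=...) as described in this topic.
    """
    names_by_step = {}
    for step in trial.steps(mode=mode):
        kwargs = {"step": step, "mode": mode}
        # Only forward regex when the caller supplied one, so the
        # method's own default (no filtering) applies otherwise.
        if regex is not None:
            kwargs["regex"] = regex
        names_by_step[step] = sorted(trial.tensor_names(**kwargs))
    return names_by_step
```

The sketch passes regex only. Because regex and collection cannot be passed together, a collection-based variant would replace the regex argument rather than add to it.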
-
workers()
Queries all the worker processes from which smdebug saved data during multi-worker training.
- Returns
A sorted list of names of the worker processes from which data was saved. If you use TensorFlow Mirrored Strategy for multi-worker training, these are the names of the different devices in the process. For Horovod, torch.distributed, and similar distributed training approaches, these are names of the form worker_0, where 0 is the rank of the process.
- Return type
list[str]
-
steps(mode=<ModeKeys.GLOBAL: 4>, show_incomplete_steps=False) → list
Retrieves a list of steps collected by SageMaker Debugger.
- Parameters
mode (smdebug.modes enum value) – Pass a mode to retrieve the list of steps seen by the trial for that mode. If no mode is passed, the method returns steps for all modes.
show_incomplete_steps (bool) –
- Returns
List of integers representing step numbers. If a mode was passed, this returns steps within that mode, i.e. mode steps. Each of these mode steps has a global step number associated with it. The global step represents the sequence of steps across all modes executed by the job.
- Return type
list[int]
-
global_step(mode, mode_step)
Retrieves the global step that corresponds to a given mode and mode step number.
- Parameters
mode (smdebug.modes enum value) – The mode, as an enum value.
mode_step (int) – The mode step, as an integer.
- Returns
An integer representing the global_step of the given mode and mode_step.
- Return type
int
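To illustrate how steps and global_step fit together, the following is a minimal sketch that builds a lookup from each mode step to its global step. The helper name mode_steps_to_global is hypothetical; the sketch assumes only the two methods as documented above.

```python
def mode_steps_to_global(trial, mode):
    """Return a dict mapping each mode step of `mode` to its global step.

    Hypothetical helper: `trial` is assumed to provide steps(mode=...)
    and global_step(mode, mode_step).
    """
    return {
        mode_step: trial.global_step(mode, mode_step)
        for mode_step in trial.steps(mode=mode)
    }
```

Such a lookup can be handy when comparing tensors saved under different modes, because global steps place all modes on one shared sequence.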
-
mode_step(global_step)
Identifies the mode_step that corresponds to a given global step number.
- Parameters
global_step (int) – The global step, as an integer.
- Returns
An integer representing the mode_step of the given global step. Typically used in conjunction with the mode method.
- Return type
int
-
mode(global_step)
Identifies the mode that corresponds to a given global step number.
- Parameters
global_step (int) – The global step, as an integer.
- Returns
The smdebug.modes enum value of the given global step.
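Because mode and mode_step are typically used together, the following sketch pairs them for every global step. The helper name step_timeline is hypothetical, and the sketch assumes that calling steps() without a mode returns global step numbers, as its default mode of GLOBAL suggests.

```python
def step_timeline(trial):
    """Return a list of (global_step, mode, mode_step) tuples.

    Hypothetical helper: `trial` is assumed to provide steps(),
    mode(global_step), and mode_step(global_step). It is an assumption
    that steps() with no arguments yields global step numbers.
    """
    timeline = []
    for global_step in trial.steps():
        timeline.append(
            (global_step, trial.mode(global_step), trial.mode_step(global_step))
        )
    return timeline
```

The resulting list shows, in global order, which mode each step ran in and where it falls within that mode.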
-
modes()
Retrieves a list of modes seen by the trial.
- Returns
A list of the modes for which data was saved, across all steps collected from the training job.
- Return type
list[smdebug.modes enum value]
-
collections()
Lists the collections from the trial.
Note that tensors that are part of these collections may not necessarily have been saved from the training job. Whether a collection was saved depends on the configuration of the Hook during training.
- Returns
A dictionary indexed by the name of the collection, with the Collection object as the value. See Tensor Collections for more details.
- Return type
dict[str -> Collection]
-
collection(coll_name)
Gets a specific collection from the trial.
Note that tensors that are part of this collection may not necessarily have been saved from the training job. Whether this collection was saved depends on the configuration of the Hook during training.
- Parameters
coll_name (str) – The name of the collection.
- Returns
The requested Collection object. See Tensor Collections for more details.
- Return type
Collection
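Since collections returns a dictionary keyed by collection name, ordinary dictionary operations apply to it. The following is a small sketch with hypothetical helper names (collection_names and find_collection); it relies only on the documented return type of collections.

```python
def collection_names(trial):
    """Return the sorted names of the collections listed by the trial."""
    # Iterating a dict yields its keys, which are the collection names.
    return sorted(trial.collections())


def find_collection(trial, coll_name):
    """Return the Collection named coll_name, or None if the trial does not list it.

    Hypothetical helper: uses dict.get on the dictionary that
    trial.collections() is documented to return.
    """
    return trial.collections().get(coll_name)
```

As noted above, finding a collection this way does not by itself guarantee that its tensors were saved during training.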
-
wait_for_steps(required_steps, mode=<ModeKeys.GLOBAL: 4>)
Waits for the given steps before proceeding.
Use this method to wait until smdebug has seen the required steps, so that you can then query and analyze the tensors saved at those steps. The method blocks until smdebug has seen all data from the steps.
- Parameters
required_steps (list[int]) – Step numbers to wait for
mode (smdebug.modes enum value) – The mode to which the given step numbers correspond. Defaults to modes.GLOBAL.
- Returns
Returns only once it is known definitively whether the steps have been seen.
- Return type
None
Exceptions raised: StepUnavailable and NoMoreData. See the Exceptions section for more details.
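A typical pattern is to block on the steps of interest and then query them. The following is a minimal sketch under that assumption; the helper name names_after_waiting is hypothetical, and the sketch deliberately lets StepUnavailable and NoMoreData propagate to the caller rather than handling them.

```python
def names_after_waiting(trial, required_steps, mode):
    """Wait for the given steps, then return {step: tensor names saved at that step}.

    Hypothetical helper: `trial` is assumed to provide
    wait_for_steps(required_steps, mode=...) and
    tensor_names(step=..., mode=...).
    """
    # Blocks until the steps have been seen. Any exception raised here
    # (for example StepUnavailable or NoMoreData) is not caught.
    trial.wait_for_steps(required_steps, mode=mode)
    return {
        step: trial.tensor_names(step=step, mode=mode)
        for step in required_steps
    }
```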
-
has_passed_step(step, mode=<ModeKeys.GLOBAL: 4>) → smdebug.core.tensor.StepState
Indicates whether a step is complete (AVAILABLE), incomplete (NOT_YET_AVAILABLE), or absent (UNAVAILABLE).
Overview of logic:
if the queried step is greater than all the available steps (complete / incomplete):
    if the job is not complete:
        return StepState.NOT_YET_AVAILABLE
    else:
        return StepState.UNAVAILABLE
if the queried step is less than or equal to a step in the available steps (complete / incomplete):
    if the queried step is less than all the available steps:
        if single_worker:
            return UNAVAILABLE (the step has been skipped or will not be written)
        else:
            return NOT_YET_AVAILABLE
the queried step is available:
    if all workers have written the step, or the job is complete, or last_complete_step > step (all workers have written a step greater than the step being checked, so the step will never be complete):
        return AVAILABLE
    else:
        return NOT_YET_AVAILABLE
- Parameters
step (int) – The step number to check.
mode (smdebug.modes enum value) – The mode to which the given step number corresponds. Defaults to modes.GLOBAL.
- Returns
One of the following values: UNAVAILABLE, AVAILABLE, or NOT_YET_AVAILABLE.
- Return type
smdebug.core.tensor.StepState enum value
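When blocking with wait_for_steps is not desirable, has_passed_step can be polled instead. The following is a rough sketch, not an smdebug API: the helper name wait_until_resolved and its poll_seconds and max_polls parameters are hypothetical, and it assumes the returned StepState is a standard Python Enum whose members expose a name attribute.

```python
import time


def wait_until_resolved(trial, step, mode, poll_seconds=1.0, max_polls=60):
    """Poll has_passed_step until the step is AVAILABLE or UNAVAILABLE.

    Returns True if the step became AVAILABLE, and False if it was
    UNAVAILABLE or still NOT_YET_AVAILABLE after max_polls attempts.
    Hypothetical helper; member names are compared as strings so that
    no smdebug import is required (an assumption about StepState).
    """
    for _ in range(max_polls):
        state = trial.has_passed_step(step, mode=mode)
        if state.name != "NOT_YET_AVAILABLE":
            return state.name == "AVAILABLE"
        # The step may still be written, so pause before asking again.
        time.sleep(poll_seconds)
    return False
```

Bounding the loop with max_polls keeps the caller from waiting indefinitely on a step that stays incomplete.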
-