SMDebug Profiler Analysis Utils

class smdebug.profiler.analysis.utils.merge_timelines.MergeUnit(value)

Bases: enum.Enum

Enum to get Merge Unit - time or step.

TIME = 'time'
STEP = 'step'

class smdebug.profiler.analysis.utils.merge_timelines.MergedTimeline(path, file_suffix_filter=None, output_directory=None)

Bases: object

Parameters
  • path – trace root folder that contains framework and system folders

  • file_suffix_filter – list of file suffixes to merge. The available suffixes are:
      PYTHONTIMELINE_SUFFIX = "pythontimeline.json"
      MODELTIMELINE_SUFFIX = "model_timeline.json"
      TENSORBOARDTIMELINE_SUFFIX = "trace.json.gz"
      HOROVODTIMELINE_SUFFIX = "horovod_timeline.json"
      SMDATAPARALLELTIMELINE_SUFFIX = "smdataparallel_timeline.json"
    Default: None (all files will be merged)

  • output_directory – Path where the merged file should be saved. Default: None (writes to the same location as the 'path' argument).

open(file_path)

Open the trace event file

file_name(end_timestamp_in_us)

Since this util will be used from a notebook or local directory, we directly write to the merged file

merge_timeline(start, end, unit=<MergeUnit.TIME: 'time'>, sys_metrics_filter={'lowgpu': ()})

Get all trace files captured and merge them for viewing in the browser
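The merging idea can be illustrated without smdebug. The following is a minimal sketch, assuming each trace source is a list of event dictionaries carrying a `ts` timestamp in microseconds. The helper name `merge_events` and the sample events are hypothetical, and this is not the library's implementation.

```python
import heapq

def merge_events(*event_lists):
    """Merge several event lists into one list ordered by timestamp."""
    # Sort each source first so heapq.merge can interleave them in order.
    sorted_lists = [sorted(events, key=lambda e: e["ts"]) for events in event_lists]
    return list(heapq.merge(*sorted_lists, key=lambda e: e["ts"]))

# Hypothetical events from two sources (framework and system).
framework_events = [{"name": "step", "ts": 30}, {"name": "dataloader", "ts": 10}]
system_events = [{"name": "cpu", "ts": 20}]

merged = merge_events(framework_events, system_events)
print([event["name"] for event in merged])
```

Keeping one time-ordered list is what lets a trace viewer show framework and system activity on a shared time axis.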

close()

class smdebug.profiler.analysis.utils.pandas_data_analysis.StatsBy(value)

Bases: enum.Enum

Enum to get stats by different categories.

TRAINING_PHASE = 'training_phase'
FRAMEWORK_METRICS = 'framework_metric'
PROCESS = 'process'

class smdebug.profiler.analysis.utils.pandas_data_analysis.Resource(value)

Bases: enum.Enum

Enum to specify the device/resource reported in system metrics.

CPU = 'cpu'
GPU = 'gpu'
IO = 'i/o'
NETWORK = 'network'
MEMORY = 'memory'

class smdebug.profiler.analysis.utils.pandas_data_analysis.JobStats

Bases: dict

class smdebug.profiler.analysis.utils.pandas_data_analysis.PandasFrameAnalysis(system_df, framework_df)

Bases: object

This class contains some of the common utils that can be used with the system metrics and framework metrics DataFrames. The functions here only query the DataFrame and return results. The results will then have to be plotted/visualized by the user or other utils.

get_job_statistics()

Returns a dictionary with information about the runtime of the training job, initialization, training loop, and finalization.

get_step_statistics(by=<StatsBy.TRAINING_PHASE: 'training_phase'>)

Get average, minimum, maximum, p50, p95, and p99 stats on step duration.

Parameters
  • by – how to group the stats: by training phase (train/eval/global), by framework metric, or by process. Must be of type StatsBy. The signature default is StatsBy.TRAINING_PHASE.
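To make the reported statistics concrete, here is a minimal sketch of the same summary over a plain list of step durations. It assumes a nearest-rank percentile definition, which may differ from the interpolation the library applies. The function names are hypothetical.

```python
from statistics import mean

def percentile(sorted_values, p):
    """Nearest-rank percentile (integer p) of a sorted, non-empty list."""
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (p * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]

def step_duration_stats(durations):
    """Summarize step durations the way the method description lists them."""
    ordered = sorted(durations)
    return {
        "avg": mean(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
    }

# Hypothetical step durations in microseconds: 1, 2, ..., 100.
stats = step_duration_stats(list(range(1, 101)))
print(stats)
```

A large gap between p50 and p99 usually points at a few slow outlier steps rather than a uniformly slow job.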

get_utilization_stats(resource=None, by=None, phase=None)

Get CPU/GPU utilization stats.

Parameters
  • resource – system resource for which utilization stats have to be computed. Type: Resource

  • by – By default, get overall utilization stats. When by="training_phase", utilization stats are provided per training phase interval. Type: StatsBy

  • phase – List of training phases to find intervals for. If not specified, intervals are determined for all training phases available.

Returns

DataFrame containing utilization stats

get_device_usage_stats(device=None, utilization_ranges=None)

Find the usage spread based on utilization ranges. If ranges are not provided, >90, 10-90, and <10 are used.

Parameters
  • device – List of Resource.CPU, Resource.GPU. Type: Resource

  • utilization_ranges – list of tuples
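The default ranges can be pictured as three buckets. The sketch below counts utilization samples per bucket with plain Python. The function name and the boundary handling (exactly 10 and exactly 90 fall in the middle bucket) are assumptions, not the library's documented behavior.

```python
def usage_spread(samples, low=10, high=90):
    """Count utilization percentages falling below, between, and above the thresholds."""
    counts = {f"<{low}": 0, f"{low}-{high}": 0, f">{high}": 0}
    for value in samples:
        if value > high:
            counts[f">{high}"] += 1
        elif value < low:
            counts[f"<{low}"] += 1
        else:
            # Assumption: boundary values belong to the middle range.
            counts[f"{low}-{high}"] += 1
    return counts

# Hypothetical GPU utilization samples in percent.
spread = usage_spread([5, 10, 50, 90, 95, 100])
print(spread)
```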

get_training_phase_intervals(phase=None)

Split framework data into intervals: before train, train, between train and eval, eval, and after eval.

Parameters
  • phase – List of training phases to find intervals for. If not specified, intervals are determined for all training phases available. Type: string or list of strings

Returns

DataFrame containing the intervals

class smdebug.profiler.analysis.utils.profiler_data_to_pandas.PandasFrame(path, use_in_memory_cache=False, scan_interval=5000000000)

Bases: object

get_all_system_metrics(selected_system_metrics=[])

Get system metrics.

Parameters
  • selected_system_metrics – list of system metrics. If not empty, function will only return system events that are part of this list.

Returns

System metrics DataFrame

get_all_framework_metrics(selected_framework_metrics=[])

Get framework metrics.

Parameters
  • selected_framework_metrics – list of framework metrics. If not empty, function will only return framework events that are part of this list.

Returns

Framework metrics DataFrame

convert_datetime_to_timestamp(timestamp)

A helper function to convert a datetime into a timestamp.

Parameters
  • timestamp – timestamp as a datetime

Returns

timestamp in microseconds
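A conversion of this kind can be written with the standard library alone. This is a minimal sketch, assuming naive datetimes are meant as UTC. The function name is hypothetical and the library's own handling of time zones may differ.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def datetime_to_microseconds(dt):
    """Convert a datetime to integer microseconds since the Unix epoch."""
    if dt.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    # Dividing timedeltas keeps the arithmetic exact (no float rounding).
    return (dt - EPOCH) // timedelta(microseconds=1)

print(datetime_to_microseconds(datetime(2021, 1, 1, 12, 30)))
```

Integer division of timedeltas avoids the precision loss that multiplying a float `timestamp()` by one million can introduce.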

get_framework_metrics_by_timesteps(timestep_list=[], selected_framework_metrics=[])

Get framework metrics for a list of timestamps. This function is useful when we want to correlate framework metrics with system metrics. Framework metrics have a begin and end timestamp. System metrics have only a single timestamp.

Parameters
  • timestep_list – list of timestamps

  • selected_framework_metrics – list of framework metrics which will be stored in the DataFrame

Returns

Framework metrics DataFrame
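The correlation described here amounts to asking which begin/end events cover a given point in time. A minimal sketch with plain dictionaries follows. The field names `begin`, `end`, and `name` and the helper `events_covering` are hypothetical.

```python
def events_covering(timestamps, events):
    """Map each timestamp to the names of events whose [begin, end] contains it."""
    return {
        ts: [e["name"] for e in events if e["begin"] <= ts <= e["end"]]
        for ts in timestamps
    }

# Hypothetical framework events (intervals) and system metric timestamps (points).
framework_events = [
    {"name": "step 1", "begin": 0, "end": 100},
    {"name": "dataloading", "begin": 50, "end": 150},
]
matches = events_covering([25, 75, 200], framework_events)
print(matches)
```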

get_framework_metrics_by_begin_and_end_timesteps(begin_timestep_list, end_timestep_list, selected_framework_metrics=[])

Get framework metrics for a set of given time ranges. This function is useful when we want to correlate framework metrics such as steps with other framework metrics such as dataloading, preprocessing, etc.

Parameters
  • begin_timestep_list – list of interval start times as datetimes

  • end_timestep_list – list of interval end times as datetimes

  • selected_framework_metrics – list of framework metrics which will be stored in the DataFrame

Returns

Framework metrics DataFrame

get_profiler_data_by_time(start_time_us, end_time_us, cache_metrics=False, selected_framework_metrics=[], selected_system_metrics=[], get_framework_metrics=True, get_system_metrics=True)

Get metrics data within a time interval.

Parameters
  • start_time_us – Start of the interval in microseconds

  • end_time_us – End of the interval in microseconds

  • cache_metrics – If True, collect and return all metrics requested so far; otherwise, return only the current request.

  • selected_framework_metrics – list of framework metrics. If not empty, function will only return framework events that are part of this list.

  • selected_system_metrics – list of system metrics. If not empty, function will only return system events that are part of this list.

  • get_framework_metrics – if True, get framework metrics

  • get_system_metrics – if True, get system metrics

Returns

System metrics DataFrame, Framework metrics DataFrame

get_profiler_data_by_step(start_step, end_step, cache_metrics=False)

Get metrics data within a step interval. The step numbers are first mapped to the time interval they cover, because some events may not be associated with a step number yet.

Parameters
  • start_step – Start of the step interval

  • end_step – End of the step interval

  • cache_metrics – If True, collect and return all metrics requested so far; otherwise, return only the current request.

Returns

System metrics DataFrame, Framework metrics DataFrame

get_all_dataloader_metrics(selected_framework_metrics=[])

Get dataloader-related framework metrics.

Parameters
  • selected_framework_metrics – list of framework metrics. If not empty, function will only return framework events that are part of this list.

Returns

Framework metrics DataFrame

class smdebug.profiler.analysis.utils.python_profile_analysis_utils.Metrics(value)

Bases: enum.Enum

Enum to describe the types of metrics recorded in cProfile profiling.

TOTAL_TIME = 'tottime'
CUMULATIVE_TIME = 'cumtime'
PRIMITIVE_CALLS = 'pcalls'
TOTAL_CALLS = 'ncalls'
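The enum values above coincide with sort keys accepted by the standard library's pstats module. The sketch below profiles a small hypothetical function with cProfile and sorts by one of those keys. It uses only the standard library and does not go through smdebug.

```python
import cProfile
import io
import pstats

def busy():
    """Hypothetical workload to profile."""
    return sum(i * i for i in range(1000))

profiler = cProfile.Profile()
profiler.runcall(busy)

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)

# Each string is a valid pstats sort key; an unknown key would raise an error.
for sort_key in ("tottime", "cumtime", "pcalls", "ncalls"):
    stats.sort_stats(sort_key)

# Print the five entries with the largest cumulative time into the buffer.
stats.sort_stats("cumtime").print_stats(5)
report = buffer.getvalue()
print(report)
```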
class smdebug.profiler.analysis.utils.python_profile_analysis_utils.StepPythonProfileStats(framework, profiler_name, node_id, stats_dir, stats_path)

Bases: object

Class that represents the metadata for a single instance of profiling: before step 0, during a step, between steps, end of script, etc. It lets users filter for the exact portion of their session that they want profiling stats for. Printing an instance shows a dictionary of its attributes and their corresponding values.

profiler_name

The name of the profiler used to generate this stats file, cProfile or pyinstrument

Type

str

framework

The machine learning framework used in training.

Type

str

node_id

The node ID of the node used in the session.

Type

str

start_mode

The training phase (TRAIN/EVAL/GLOBAL) at which profiling started.

Type

str

start_phase

The step phase (start of step, end of step, etc.) at which python profiling was started.

Type

str

start_step

The step at which python profiling was started. -1 if profiling before step 0.

Type

float

start_time_since_epoch_in_micros

The UTC time (in microseconds) at which profiling started for this step.

Type

int

end_mode

The training phase (TRAIN/EVAL/GLOBAL) at which profiling was stopped.

Type

str

end_step

The step at which python profiling was stopped. Infinity if end of script.

Type

float

end_phase

The step phase (start of step, end of step, etc.) at which python profiling was stopped.

Type

str

end_time_since_epoch_in_micros

The UTC time (in microseconds) at which profiling finished for this step.

Type

int

stats_path

The path to the dumped python stats or html resulting from profiling this step.

Type

str

has_start_and_end_mode(start_mode, end_mode)

in_time_interval(start_time_since_epoch_in_micros, end_time_since_epoch_in_micros)

Returns whether this step is in the provided time interval. This is defined as whether there is any overlap between the time interval of the step and the provided time interval.
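The overlap condition can be written as a single comparison. This is a minimal sketch with a hypothetical function name. Treating intervals that merely touch at an endpoint as overlapping is an assumption.

```python
def intervals_overlap(start_a, end_a, start_b, end_b):
    """Return True if [start_a, end_a] and [start_b, end_b] share any point."""
    # Two intervals overlap exactly when each one starts before the other ends.
    return start_a <= end_b and start_b <= end_a

# Hypothetical microsecond timestamps.
print(intervals_overlap(100, 200, 150, 300))
print(intervals_overlap(100, 200, 250, 300))
```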

in_step_interval(start_step, end_step, start_phase, end_phase)

Returns whether this is in the provided step interval.

This is defined as:

  1. This start step is greater than the provided start step and this end step is less than the provided end step.

  2. If this start step equals the provided start step, verify that this start phase does not occur before the provided start phase.

  3. If this end step equals the provided end step, verify that this end phase does not occur after the provided end phase.
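One way to read these rules is as a containment check with phase-based tie-breaking. The sketch below follows that reading, assuming the profiled span must lie inside the provided step interval. The phase names, their ordering, and the dictionary layout are hypothetical and are not taken from the library.

```python
# Hypothetical phase ordering: lower value means earlier within a step.
PHASE_ORDER = {"STEP_START": 0, "STEP_END": 1}

def in_step_interval(profile, start_step, end_step, start_phase, end_phase):
    """Return True if the profiled span lies within the provided step interval."""
    # Assumption: the span must not begin before or finish after the interval.
    if profile["start_step"] < start_step or profile["end_step"] > end_step:
        return False
    # Tie on the start step: this start phase must not occur before the provided one.
    if (profile["start_step"] == start_step
            and PHASE_ORDER[profile["start_phase"]] < PHASE_ORDER[start_phase]):
        return False
    # Tie on the end step: this end phase must not occur after the provided one.
    if (profile["end_step"] == end_step
            and PHASE_ORDER[profile["end_phase"]] > PHASE_ORDER[end_phase]):
        return False
    return True

profile = {"start_step": 2, "start_phase": "STEP_START",
           "end_step": 4, "end_phase": "STEP_END"}
print(in_step_interval(profile, 2, 5, "STEP_START", "STEP_END"))
```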

has_pre_step_zero_profile_stats()
has_post_hook_close_profile_stats()
has_node_id(node_id)

class smdebug.profiler.analysis.utils.python_profile_analysis_utils.cProfileStats(ps)

Bases: object

Class used to represent cProfile stats captured, given the pstats.Stats object of the desired interval.

ps

The cProfile stats of Python functions as a pstats.Stats object. Useful for high-level analysis like sorting functions by a desired metric and printing the list of profiled functions.

Type

pstats.Stats

function_stats_list

The cProfile stats of Python functions as a list of cProfileFunctionStats objects, which contain specific metrics corresponding to each function profiled. Parsed from the pstats.Stats object. Useful for more in-depth analysis, as it gives users direct access to the metrics for each function.

Type

list of cProfileFunctionStats

print_top_n_functions(by, n=10)

Print the stats for the top n functions with respect to the provided metric.

Parameters
  • by (Metrics enum) – The metric to sort the functions by. Must be one of the following from the Metrics enum: TOTAL_TIME, CUMULATIVE_TIME, PRIMITIVE_CALLS, TOTAL_CALLS.

  • n (int) – The first n functions and stats to print after sorting.

For example, to print the top 20 functions with respect to cumulative time spent in each function:

from smdebug.profiler.analysis.utils.python_profile_analysis_utils import Metrics
# cprofile_stats is an existing cProfileStats instance
cprofile_stats.print_top_n_functions(Metrics.CUMULATIVE_TIME, n=20)

get_function_stats()

Return the function stats list as a DataFrame, where each row represents a cProfileFunctionStats object.

class smdebug.profiler.analysis.utils.python_profile_analysis_utils.cProfileFunctionStats(key, value)

Bases: object

Class used to represent a single profiled function and parsed cProfile stats pertaining to this function. Processes the stats dictionary’s (key, value) pair to get the function name and the specific stats. Key is a tuple of (filename, lineno, function). Value is a tuple of (prim_calls, total_calls, total_time, cumulative_time, callers). See below for details.

Parameters
  • function_name (str) – The full function name, derived from the key tuple. Defined as filename:lineno(function).

  • prim_calls (int) – The number of primitive (non-recursive) calls to this function.

  • total_calls (int) – The total number of calls to this function.

  • total_time (float) – The total amount of time spent in the scope of this function alone, in seconds.

  • cumulative_time (float) – The total amount of time spent in the scope of this function and in the scope of all other functions that this function calls, in seconds.

  • callers (list of str) – The list of functions that call this function. Organized as a list of function names, which follow the above format for function_name: filename:lineno(function)
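The (key, value) layout described above matches the `stats` dictionary of a standard pstats.Stats object. The sketch below unpacks that dictionary into plain records using only the standard library. The record layout and the profiled function `work` are hypothetical stand-ins, not the class itself.

```python
import cProfile
import pstats

def work():
    """Hypothetical workload to profile."""
    return sorted(range(100), reverse=True)

profiler = cProfile.Profile()
profiler.runcall(work)
stats = pstats.Stats(profiler)

rows = []
for key, value in stats.stats.items():
    filename, lineno, function = key
    prim_calls, total_calls, total_time, cumulative_time, callers = value
    rows.append({
        "function_name": f"{filename}:{lineno}({function})",
        "prim_calls": prim_calls,
        "total_calls": total_calls,
        "total_time": total_time,            # seconds, as a float
        "cumulative_time": cumulative_time,  # seconds, as a float
        # The callers mapping is keyed by the same (filename, lineno, function) tuples.
        "callers": [f"{f}:{n}({name})" for (f, n, name) in callers],
    })

for row in rows:
    print(row["function_name"], row["total_calls"], row["cumulative_time"])
```

A list of flat records like this is straightforward to hand to a DataFrame constructor for further filtering and sorting.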

class smdebug.profiler.analysis.utils.python_profile_analysis_utils.PyinstrumentStepStats(html_file_path, json_stats)

Bases: object

class smdebug.profiler.analysis.utils.pytorch_dataloader_analysis.PT_dataloader_analysis(pandas_frame)

Bases: object

analyze_dataloaderIter_initialization()
analyze_dataloaderWorkers()
analyze_dataloader_getnext()
analyze_batchtime()
plot_the_window(start_timestamp, end_timestamp, select_events=['.*'], select_dimensions=['.*'])