Introduction

RADICAL-Analytics (RA) is a library implemented in Python to support the analysis of traces produced by RADICAL-Cybertools (RCT). Using RA requires knowing the architecture and the event model of the chosen RCT tool. Without that knowledge, you will not be able to choose the events that are relevant to your analysis or understand how the results of your analysis relate to the inner workings of the chosen RCT tool.

Depending on the chosen RCT tool, understanding the following documents is a precondition for using RA:

  1. RP architecture (outdated as of Aug 2020)
  2. RP event model
  3. EnTK architecture
  4. EnTK event model

Note

States are special types of events. Given two states in a sequence <1, 2>, both states are always recorded at runtime and state 1 always precedes state 2.

Using RA

RA supports post-mortem analysis:

  1. Install RA and RADICAL-Pilot (RP) and/or RADICAL-EnTK.
  2. Write an application in Python to execute a workload (RP) or a workflow (EnTK) on a high-performance computing (HPC) platform.
  3. Set the environment variable RADICAL_PROFILE to TRUE with the command export RADICAL_PROFILE="TRUE".
  4. Execute your application.
  5. Both RP and EnTK write traces (i.e., timestamped sequences of events) to a directory called client sandbox. This directory is created inside the directory from which you executed your application. The name of the client sandbox is a session ID, e.g., rp.session.hostname.username.018443.0002 or en.session.hostname.username.018443.0002.
  6. Load the session traces in RA by creating an ra.Session object.
  7. Measure entity-level or session-level durations, concurrency or resource utilization, using the RA API.
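
Before loading traces (step 6), it can help to locate the client sandboxes left behind by earlier runs. The following is a minimal sketch, not part of RA: the helper find_client_sandboxes is hypothetical, and it assumes that session directory names start with rp.session. or en.session., as in the examples of step 5.

```python
import tempfile
from pathlib import Path


def find_client_sandboxes(workdir):
    """Return the session directories found directly under `workdir`.

    Assumption: session IDs start with 'rp.session.' (RP) or
    'en.session.' (EnTK), as in the examples given in step 5.
    """
    prefixes = ("rp.session.", "en.session.")
    return sorted(
        p for p in Path(workdir).iterdir()
        if p.is_dir() and p.name.startswith(prefixes)
    )


# Demonstration on a throwaway directory with made-up session names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("rp.session.hostname.username.018443.0002",
                 "en.session.hostname.username.018443.0002",
                 "unrelated_dir"):
        (Path(tmp) / name).mkdir()
    found = [p.name for p in find_client_sandboxes(tmp)]
```

Each returned path is a candidate source for an ra.Session object; how the session is constructed from it depends on the RA version in use.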

Fundamental Notions

  • Session: set of events generated by a single run of an RP or EnTK application. RA creates a Session object containing all the relevant information about all the events recorded at runtime by RP or EnTK. The Session object also contains information about the execution environment.
  • Entity: object exposed by RP or EnTK. Currently, RP exposes two types of entity (Pilot and Task), while EnTK exposes three (Pipeline, Stage and Task). An instance of an entity type is an actual pilot, pipeline, stage or task.
  • Describing: session and entity instances can be described by listing their properties. For example, a session instance has properties like the list of entities of a given type, the list of events, and the list of timestamps for those events. A task instance has properties like the events of that specific instance and the timestamps of those specific events.
  • Filtering: selecting a subset of the properties of a session. This is particularly important when we want to limit an analysis to a specific type of entity. For example, assume that we want to measure the amount of time the tasks spent waiting to be scheduled. We will filter the session so as to keep only entities of type Task. Then, we will perform our measurement only on those entities.

Warning

It is important to stress that description and filtering are performed on instances of entities. This means that if we filter for, say, the event DONE and all the tasks have failed, RA will return an empty list, as none of the task instances will have the event DONE as a property.
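
The behavior described in this warning can be illustrated without RA. The sketch below uses a hypothetical, simplified stand-in for a session (a list of dictionaries) and a hypothetical filter_entities helper; the event names other than DONE and all timestamps are made up for illustration and are not taken from the RP or EnTK event models.

```python
# Hypothetical stand-in for a session: one dict per entity instance,
# mapping made-up event names to timestamps in seconds.
entities = [
    {"uid": "pilot.0000", "etype": "pilot",
     "events": {"NEW": 0.0, "ACTIVE": 12.0}},
    {"uid": "task.000000", "etype": "task",
     "events": {"NEW": 1.0, "SCHEDULING": 14.0, "FAILED": 20.0}},
    {"uid": "task.000001", "etype": "task",
     "events": {"NEW": 1.5, "SCHEDULING": 15.0, "FAILED": 22.0}},
]


def filter_entities(entities, etype=None, event=None):
    """Keep the instances that match the entity type and own the event."""
    selected = []
    for entity in entities:
        if etype is not None and entity["etype"] != etype:
            continue
        if event is not None and event not in entity["events"]:
            continue
        selected.append(entity)
    return selected


# Filtering by entity type keeps both task instances.
tasks = filter_entities(entities, etype="task")

# Filtering for DONE yields nothing, because every task failed and no
# instance has DONE among its events.
done_tasks = filter_entities(entities, etype="task", event="DONE")

# Filtering for FAILED selects the instances that do have that event.
failed_uids = [e["uid"] for e in filter_entities(entities, event="FAILED")]
```

An empty result therefore does not indicate an error in the analysis; it only says that no instance recorded the requested event.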

Types of Analysis

RA enables both local and global analyses. Local analyses pertain to a single instance of an entity. Currently, RP supports two entities (Pilot and Task) and EnTK supports three entities (Pipeline, Stage and Task).

Global analyses pertain to a set of entities, including all the entities of a run. For example, a very common global analysis consists of measuring the total time all the tasks took to execute. It is fundamental to note that this is NOT the sum of the execution times of all the tasks. Tasks execute with varying degrees of concurrency, depending on resource availability.
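
To see why the two quantities differ, consider the following sketch. It is not RA code: total_time is a hypothetical helper that, under one common interpretation of "total time", measures the time during which at least one task was executing by merging overlapping execution intervals. The intervals are made-up values in seconds.

```python
def total_time(intervals):
    """Length of the union of (start, stop) intervals."""
    total = 0.0
    current_start = current_stop = None
    for start, stop in sorted(intervals):
        if current_stop is None or start > current_stop:
            # A gap begins: account for the merged interval built so far.
            if current_stop is not None:
                total += current_stop - current_start
            current_start, current_stop = start, stop
        else:
            # Overlap: extend the merged interval if needed.
            current_stop = max(current_stop, stop)
    if current_stop is not None:
        total += current_stop - current_start
    return total


# Made-up execution intervals of four tasks; the first three overlap.
executions = [(0.0, 10.0), (2.0, 12.0), (5.0, 8.0), (20.0, 25.0)]

sum_of_durations = sum(stop - start for start, stop in executions)
time_executing = total_time(executions)
```

Here the sum of the individual durations is 28 seconds, while tasks were executing for only 17 seconds of wall-clock time, because three of the four tasks ran concurrently.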

Types of Measure

RA is agnostic towards the tools used to perform the measurements. For example, RA can be used in stand-alone Python scripts or wranglers, or loaded into a Jupyter Notebook. RA offers classes and methods to perform three types of measure:

  1. Duration: measures the time spent by an instance of an entity (local analyses) or a set of instances of an entity (global analyses) between two timestamps. For example, staging, scheduling, pre-execute, execute time of one or more tasks; description, submission and execution time of one or more pipelines or stages; and runtime of one or more pilots.
  2. Concurrency: measures the number of entities of the same type that are between two given events in a time range during the execution. For example, this measures how many tasks were scheduled in a time range. Note that the time range here can be as large as the whole runtime of the application.
  3. Utilization: measures the amount of time a resource has been provided and consumed. In this context, resource indicates a hardware thread, a CPU core or a GPU. When measured for each resource, we can derive the percentage of utilization of all the resources available.
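
As an illustration of the concurrency measure, the sketch below counts how many entities are between two events at regularly spaced sample times. It is a simplified model, not the RA implementation: the concurrency function, its parameters and the sampling approach are assumptions, and each entity is reduced to a made-up (start, stop) pair holding the timestamps of the two chosen events.

```python
def concurrency(intervals, t_start, t_stop, step):
    """Sample the number of entities between their two events.

    An entity counts at time t when start <= t < stop.
    Returns a list of (time, count) pairs.
    """
    n_steps = int((t_stop - t_start) / step)
    samples = []
    for i in range(n_steps + 1):
        t = t_start + i * step
        count = sum(1 for start, stop in intervals if start <= t < stop)
        samples.append((t, count))
    return samples


# Made-up (start, stop) timestamps of the two events for four tasks.
task_intervals = [(0.0, 10.0), (2.0, 12.0), (5.0, 8.0), (20.0, 25.0)]

# Sample the whole runtime every 5 seconds.
profile = concurrency(task_intervals, t_start=0.0, t_stop=25.0, step=5.0)
```

Sampling is only one way to represent concurrency; a coarser step is cheaper but can miss short-lived entities, such as a task that starts and ends between two samples.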

Note

Utilization is available only for RP as EnTK does not directly utilize resources but delegates that to RP.

Warning

Utilization is still under development so, for example, at the moment it does not offer an easy way to discriminate between types of resources.
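
The idea behind the utilization measure can be sketched as the ratio between the time a resource was consumed and the time it was provided. The code below is an assumption-laden illustration, not the RA implementation: the utilization function, the resource identifiers and all timestamps are hypothetical, and it assumes that the consumed intervals of a single resource never overlap.

```python
def utilization(provided, consumed):
    """Compute per-resource and overall utilization as fractions.

    provided: {resource_id: (start, stop)}, when the resource was available.
    consumed: {resource_id: [(start, stop), ...]}, when it was in use.
    Assumption: intervals of the same resource do not overlap.
    """
    per_resource = {}
    total_provided = 0.0
    total_consumed = 0.0
    for rid, (p_start, p_stop) in provided.items():
        available = p_stop - p_start
        busy = sum(stop - start for start, stop in consumed.get(rid, []))
        per_resource[rid] = busy / available
        total_provided += available
        total_consumed += busy
    return per_resource, total_consumed / total_provided


# Two made-up cores, each provided for 100 seconds.
provided = {"core.0": (0.0, 100.0), "core.1": (0.0, 100.0)}
consumed = {
    "core.0": [(10.0, 60.0)],
    "core.1": [(10.0, 30.0), (40.0, 60.0)],
}

per_core, overall = utilization(provided, consumed)
```

In this example the two cores are used for 50% and 40% of the time they were provided, which gives an overall utilization of 45%.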