Documentation/scheduler/sched-energy.txt

   1 Energy cost model for energy-aware scheduling (EXPERIMENTAL)
   2
   3 Introduction
   4 =============
   5
   6 The basic energy model uses platform energy data stored in sched_group_energy
   7 data structures attached to the sched_groups in the sched_domain hierarchy. The
   8 energy cost model offers two functions that can be used to guide scheduling
   9 decisions:
  10
  11 1.      static unsigned int sched_group_energy(struct energy_env *eenv)
  12 2.      static int energy_diff(struct energy_env *eenv)
  13
  14 sched_group_energy() estimates the energy consumed by all cpus in a specific
  15 sched_group including any shared resources owned exclusively by this group of
  16 cpus. Resources shared with other cpus are excluded (e.g. later level caches).
  17
  18 energy_diff() estimates the total energy impact of a utilization change. That
  19 is, adding, removing, or migrating utilization (tasks).
  20
  21 Both functions use a struct energy_env to specify the scenario to be evaluated:
  22
  23         struct energy_env {
  24                 struct sched_group      *sg_top;
  25                 struct sched_group      *sg_cap;
  26                 int                     cap_idx;
  27                 int                     util_delta;
  28                 int                     src_cpu;
  29                 int                     dst_cpu;
  30                 int                     energy;
  31         };
  32
  33 sg_top: sched_group to be evaluated. Not used by energy_diff().
  34
  35 sg_cap: sched_group covering the cpus in the same frequency domain. Set by
  36 sched_group_energy().
  37
  38 cap_idx: Capacity state to be used for energy calculations. Set by
  39 find_new_capacity().
  40
  41 util_delta: Amount of utilization to be added, removed, or migrated.
  42
  43 src_cpu: Source cpu from where 'util_delta' utilization is removed. Should be
  44 -1 if no source (e.g. task wake-up).
  45
  46 dst_cpu: Destination cpu where 'util_delta' utilization is added. Should be -1
  47 if utilization is removed (e.g. terminating tasks).
  48
  49 energy: Result of sched_group_energy().
  50
  51 The metric used to represent utilization is the actual per-entity running time
  52 averaged over time using a geometric series. Very similar to the existing
  53 per-entity load-tracking, but _not_ scaled by task priority and capped by the
  54 capacity of the cpu. The latter property does mean that utilization may
  55 underestimate the compute requirements for task on fully/over utilized cpus.
  56 The greatest potential for energy savings without affecting performance too much
  57 is scenarios where the system isn't fully utilized. If the system is deemed
  58 fully utilized load-balancing should be done with task load (includes task
  59 priority) instead in the interest of fairness and performance.
  60
  61
  62 Background and Terminology
  63 ===========================
  64
  65 To make it clear from the start:
  66
  67 energy = [joule] (resource like a battery on powered devices)
  68 power = energy/time = [joule/second] = [watt]
  69
  70 The goal of energy-aware scheduling is to minimize energy, while still getting
  71 the job done. That is, we want to maximize:
  72
  73         performance [inst/s]
  74         --------------------
  75             power [W]
  76
  77 which is equivalent to minimizing:
  78
  79         energy [J]
  80         -----------
  81         instruction
  82
  83 while still getting 'good' performance. It is essentially an alternative
  84 optimization objective to the current performance-only objective for the
  85 scheduler. This alternative considers two objectives: energy-efficiency and
  86 performance. Hence, there needs to be a user controllable knob to switch the
  87 objective. Since it is early days, this is currently a sched_feature
  88 (ENERGY_AWARE).
  89
  90 The idea behind introducing an energy cost model is to allow the scheduler to
  91 evaluate the implications of its decisions rather than applying energy-saving
  92 techniques blindly that may only have positive effects on some platforms. At
  93 the same time, the energy cost model must be as simple as possible to minimize
  94 the scheduler latency impact.
  95
  96 Platform topology
  97 ------------------
  98
  99 The system topology (cpus, caches, and NUMA information, not peripherals) is
 100 represented in the scheduler by the sched_domain hierarchy which has
 101 sched_groups attached at each level that covers one or more cpus (see
 102 sched-domains.txt for more details). To add energy awareness to the scheduler
 103 we need to consider power and frequency domains.
 104
 105 Power domain:
 106
 107 A power domain is a part of the system that can be powered on/off
 108 independently. Power domains are typically organized in a hierarchy where you
 109 may be able to power down just a cpu or a group of cpus along with any
 110 associated resources (e.g.  shared caches). Powering up a cpu means that all
 111 power domains it is a part of in the hierarchy must be powered up. Hence, it is
 112 more expensive to power up the first cpu that belongs to a higher level power
 113 domain than powering up additional cpus in the same high level domain. Two
 114 level power domain hierarchy example:
 115
 116                 Power source
 117                          +-------------------------------+----...
 118 per group PD             G                               G
 119                          |           +----------+        |
 120                     +--------+-------| Shared   |  (other groups)
 121 per-cpu PD          G        G       | resource |
 122                     |        |       +----------+
 123                 +-------+ +-------+
 124                 | CPU 0 | | CPU 1 |
 125                 +-------+ +-------+
 126
 127 Frequency domain:
 128
 129 Frequency domains (P-states) typically cover the same group of cpus as one of
 130 the power domain levels. That is, there might be several smaller power domains
 131 sharing the same frequency (P-state) or there might be a power domain spanning
 132 multiple frequency domains.
 133
 134 From a scheduling point of view there is no need to know the actual frequencies
 135 [Hz]. All the scheduler cares about is the compute capacity available at the
 136 current state (P-state) the cpu is in and any other available states. For that
 137 reason, and to also factor in any cpu micro-architecture differences, compute
 138 capacity scaling states are called 'capacity states' in this document. For SMP
 139 systems this is equivalent to P-states. For mixed micro-architecture systems
 140 (like ARM big.LITTLE) it is P-states scaled according to the micro-architecture
 141 performance relative to the other cpus in the system.
 142
 143 Energy modelling:
 144 ------------------
 145
 146 Due to the hierarchical nature of the power domains, the most obvious way to
 147 model energy costs is therefore to associate power and energy costs with
 148 domains (groups of cpus). Energy costs of shared resources are associated with
 149 the group of cpus that share the resources, only the cost of powering the
 150 cpu itself and any private resources (e.g. private L1 caches) is associated
 151 with the per-cpu groups (lowest level).
 152
 153 For example, for an SMP system with per-cpu power domains and a cluster level
 154 (group of cpus) power domain we get the overall energy costs to be:
 155
 156         energy = energy_cluster + n * energy_cpu
 157
 158 where 'n' is the number of cpus powered up and energy_cluster is the cost paid
 159 as soon as any cpu in the cluster is powered up.
 160
 161 The power and frequency domains can naturally be mapped onto the existing
 162 sched_domain hierarchy and sched_groups by adding the necessary data to the
 163 existing data structures.
 164
 165 The energy model considers energy consumption from two contributors (shown in
 166 the illustration below):
 167
 168 1. Busy energy: Energy consumed while a cpu and the higher level groups that it
 169 belongs to are busy running tasks. Busy energy is associated with the state of
 170 the cpu, not an event. The time the cpu spends in this state varies. Thus, the
 171 most obvious platform parameter for this contribution is busy power
 172 (energy/time).
 173
 174 2. Idle energy: Energy consumed while a cpu and higher level groups that it
 175 belongs to are idle (in a C-state). Like busy energy, idle energy is associated
 176 with the state of the cpu. Thus, the platform parameter for this contribution
 177 is idle power (energy/time).
 178
 179 Energy consumed during transitions from an idle-state (C-state) to a busy state
 180 (P-state) or going the other way is ignored by the model to simplify the energy
 181 model calculations.
 182
 183
 184         Power
 185         ^
 186         |            busy->idle             idle->busy
 187         |            transition             transition
 188         |
 189         |                _                      __
 190         |               / \                    /  \__________________
 191         |______________/   \                  /
 192         |                   \                /
 193         |  Busy              \    Idle      /        Busy
 194         |  low P-state        \____________/         high P-state
 195         |
 196         +------------------------------------------------------------> time
 197
 198 Busy    |--------------|                          |-----------------|
 199
 200 Wakeup                 |------|            |------|
 201
 202 Idle                          |------------|
 203
 204
 205 The basic algorithm
 206 ====================
 207
 208 The basic idea is to determine the total energy impact when utilization is
 209 added or removed by estimating the impact at each level in the sched_domain
 210 hierarchy starting from the bottom (sched_group contains just a single cpu).
 211 The energy cost comes from busy time (sched_group is awake because one or more
 212 cpus are busy) and idle time (in an idle-state). Energy model numbers account
 213 for energy costs associated with all cpus in the sched_group as a group.
 214
 215         for_each_domain(cpu, sd) {
 216                 sg = sched_group_of(cpu)
 217                 energy_before = curr_util(sg) * busy_power(sg)
 218                                 + (1-curr_util(sg)) * idle_power(sg)
 219                 energy_after = new_util(sg) * busy_power(sg)
 220                                 + (1-new_util(sg)) * idle_power(sg)
 221                 energy_diff += energy_before - energy_after
 222
 223         }
 224
 225         return energy_diff
 226
 227 {curr, new}_util: The cpu utilization at the lowest level and the overall
 228 non-idle time for the entire group for higher levels. Utilization is in the
 229 range 0.0 to 1.0 in the pseudo-code.
 230
 231 busy_power: The power consumption of the sched_group.
 232
 233 idle_power: The power consumption of the sched_group when idle.
 234
 235 Note: It is a fundamental assumption that the utilization is (roughly) scale
 236 invariant. Task utilization tracking factors in any frequency scaling and
 237 performance scaling differences due to difference cpu microarchitectures such
 238 that task utilization can be used across the entire system.
 239
 240
 241 Platform energy data
 242 =====================
 243
 244 struct sched_group_energy can be attached to sched_groups in the sched_domain
 245 hierarchy and has the following members:
 246
 247 cap_states:
 248         List of struct capacity_state representing the supported capacity states
 249         (P-states). struct capacity_state has two members: cap and power, which
 250         represents the compute capacity and the busy_power of the state. The
 251         list must be ordered by capacity low->high.
 252
 253 nr_cap_states:
 254         Number of capacity states in cap_states list.
 255
 256 idle_states:
 257         List of struct idle_state containing idle_state power cost for each
 258         idle-state supported by the system orderd by shallowest state first.
 259         All states must be included at all level in the hierarchy, i.e. a
 260         sched_group spanning just a single cpu must also include coupled
 261         idle-states (cluster states). In addition to the cpuidle idle-states,
 262         the list must also contain an entry for the idling using the arch
 263         default idle (arch_idle_cpu()). Despite this state may not be a true
 264         hardware idle-state it is considered the shallowest idle-state in the
 265         energy model and must be the first entry. cpus may enter this state
 266         (possibly 'active idling') if cpuidle decides not enter a cpuidle
 267         idle-state. Default idle may not be used when cpuidle is enabled.
 268         In this case, it should just be a copy of the first cpuidle idle-state.
 269
 270 nr_idle_states:
 271         Number of idle states in idle_states list.
 272
 273 There are no unit requirements for the energy cost data. Data can be normalized
 274 with any reference, however, the normalization must be consistent across all
 275 energy cost data. That is, one bogo-joule/watt must be the same quantity for
 276 data, but we don't care what it is.
 277
 278 A recipe for platform characterization
 279 =======================================
 280
 281 Obtaining the actual model data for a particular platform requires some way of
 282 measuring power/energy. There isn't a tool to help with this (yet). This
 283 section provides a recipe for use as reference. It covers the steps used to
 284 characterize the ARM TC2 development platform. This sort of measurements is
 285 expected to be done anyway when tuning cpuidle and cpufreq for a given
 286 platform.
 287
 288 The energy model needs two types of data (struct sched_group_energy holds
 289 these) for each sched_group where energy costs should be taken into account:
 290
 291 1. Capacity state information
 292
 293 A list containing the compute capacity and power consumption when fully
 294 utilized attributed to the group as a whole for each available capacity state.
 295 At the lowest level (group contains just a single cpu) this is the power of the
 296 cpu alone without including power consumed by resources shared with other cpus.
 297 It basically needs to fit the basic modelling approach described in "Background
 298 and Terminology" section:
 299
 300         energy_system = energy_shared + n * energy_cpu
 301
 302 for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at
 303 the lowest level. 'energy_shared' is included at the next level which
 304 represents the group of cpus among which the resources are shared.
 305
 306 This model is, of course, a simplification of reality. Thus, power/energy
 307 attributions might not always exactly represent how the hardware is designed.
 308 Also, busy power is likely to depend on the workload. It is therefore
 309 recommended to use a representative mix of workloads when characterizing the
 310 capacity states.
 311
 312 If the group has no capacity scaling support, the list will contain a single
 313 state where power is the busy power attributed to the group. The capacity
 314 should be set to a default value (1024).
 315
 316 When frequency domains include multiple power domains, the group representing
 317 the frequency domain and all child groups share capacity states. This must be
 318 indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at
 319 all levels that share the capacity state must have the list of capacity states
 320 with the power set to the contribution of the individual group.
 321
 322 2. Idle power information
 323
 324 Stored in the idle_states list. The power number is the group idle power
 325 consumption in each idle state as well when the group is idle but has not
 326 entered an idle-state ('active idle' as mentioned earlier). Due to the way the
 327 energy model is defined, the idle power of the deepest group idle state can
 328 alternatively be accounted for in the parent group busy power. In that case the
 329 group idle state power values are offset such that the idle power of the
 330 deepest state is zero. It is less intuitive, but it is easier to measure as
 331 idle power consumed by the group and the busy/idle power of the parent group
 332 cannot be distinguished without per group measurement points.
 333
 334 Measuring capacity states and idle power:
 335
 336 The capacity states' capacity and power can be estimated by running a benchmark
 337 workload at each available capacity state. By restricting the benchmark to run
 338 on subsets of cpus it is possible to extrapolate the power consumption of
 339 shared resources.
 340
 341 ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a
 342 shared L2 cache. TC2 has on-chip energy counters per cluster. Running a
 343 benchmark workload on just one cpu in a cluster means that power is consumed in
 344 the cluster (higher level group) and a single cpu (lowest level group). Adding
 345 another benchmark task to another cpu increases the power consumption by the
 346 amount consumed by the additional cpu. Hence, it is possible to extrapolate the
 347 cluster busy power.
 348
 349 For platforms that don't have energy counters or equivalent instrumentation
 350 built-in, it may be possible to use an external DAQ to acquire similar data.
 351
 352 If the benchmark includes some performance score (for example sysbench cpu
 353 benchmark), this can be used to record the compute capacity.
 354
 355 Measuring idle power requires insight into the idle state implementation on the
 356 particular platform. Specifically, if the platform has coupled idle-states (or
 357 package states). To measure non-coupled per-cpu idle-states it is necessary to
 358 keep one cpu busy to keep any shared resources alive to isolate the idle power
 359 of the cpu from idle/busy power of the shared resources. The cpu can be tricked
 360 into different per-cpu idle states by disabling the other states. Based on
 361 various combinations of measurements with specific cpus busy and disabling
 362 idle-states it is possible to extrapolate the idle-state power.