From 5a64f775691647c242aa40d34f3512e7b179a921 Mon Sep 17 00:00:00 2001 From: Lukasz Luba Date: Tue, 3 Nov 2020 09:05:58 +0000 Subject: PM: EM: Clarify abstract scale usage for power values in Energy Model The Energy Model (EM) can store power values in milli-Watts or in abstract scale. This might cause issues in the subsystems which use the EM for estimating the device power, such as: - mixing of different scales in a subsystem which uses multiple (cooling) devices (e.g. thermal Intelligent Power Allocation (IPA)) - assuming that energy [milli-Joules] can be derived from the EM power values which might not be possible since the power scale doesn't have to be in milli-Watts To avoid misconfiguration add the requisite documentation to the EM and related subsystems: EAS and IPA. Signed-off-by: Lukasz Luba Signed-off-by: Rafael J. Wysocki --- Documentation/driver-api/thermal/power_allocator.rst | 12 +++++++++++- Documentation/power/energy-model.rst | 13 +++++++++++++ Documentation/scheduler/sched-energy.rst | 5 +++++ 3 files changed, 29 insertions(+), 1 deletion(-) diff --git a/Documentation/driver-api/thermal/power_allocator.rst b/Documentation/driver-api/thermal/power_allocator.rst index 67b6a3297238..aa5f66552d6f 100644 --- a/Documentation/driver-api/thermal/power_allocator.rst +++ b/Documentation/driver-api/thermal/power_allocator.rst @@ -71,7 +71,9 @@ to the speed-grade of the silicon. `sustainable_power` is therefore simply an estimate, and may be tuned to affect the aggressiveness of the thermal ramp. For reference, the sustainable power of a 4" phone is typically 2000mW, while on a 10" tablet is around 4500mW (may vary -depending on screen size). +depending on screen size). It is possible to have the power value +expressed in an abstract scale. The sustained power should be aligned +to the scale used by the related cooling devices. If you are using device tree, do add it as a property of the thermal-zone. For example:: @@ -269,3 +271,11 @@ won't be very good. 
Note that this is not particular to this governor, step-wise will also misbehave if you call its throttle() faster than the normal thermal framework tick (due to interrupts for example) as it will overreact. + +Energy Model requirements +========================= + +Another important thing is the consistent scale of the power values +provided by the cooling devices. All of the cooling devices in a single +thermal zone should have power values reported either in milli-Watts +or scaled to the same 'abstract scale'. diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst index a6fb986abe3c..ba7aa581b307 100644 --- a/Documentation/power/energy-model.rst +++ b/Documentation/power/energy-model.rst @@ -20,6 +20,19 @@ possible source of information on its own, the EM framework intervenes as an abstraction layer which standardizes the format of power cost tables in the kernel, hence enabling to avoid redundant work. +The power values might be expressed in milli-Watts or in an 'abstract scale'. +Multiple subsystems might use the EM and it is up to the system integrator to +check that the requirements for the power value scale types are met. An example +can be found in the Energy-Aware Scheduler documentation +Documentation/scheduler/sched-energy.rst. For some subsystems like thermal or +powercap power values expressed in an 'abstract scale' might cause issues. +These subsystems are more interested in estimation of power used in the past, +thus the real milli-Watts might be needed. An example of these requirements can +be found in the Intelligent Power Allocation in +Documentation/driver-api/thermal/power_allocator.rst. +Important thing to keep in mind is that when the power values are expressed in +an 'abstract scale' deriving real energy in milli-Joules would not be possible. 
+ The figure below depicts an example of drivers (Arm-specific here, but the approach is applicable to any architecture) providing power costs to the EM framework, and interested clients reading the data from it:: diff --git a/Documentation/scheduler/sched-energy.rst b/Documentation/scheduler/sched-energy.rst index 001e09c95e1d..afe02d394402 100644 --- a/Documentation/scheduler/sched-energy.rst +++ b/Documentation/scheduler/sched-energy.rst @@ -350,6 +350,11 @@ independent EM framework in Documentation/power/energy-model.rst. Please also note that the scheduling domains need to be re-built after the EM has been registered in order to start EAS. +EAS uses the EM to make a forecasting decision on energy usage and thus it is +more focused on the difference when checking possible options for task +placement. For EAS it doesn't matter whether the EM power values are expressed +in milli-Watts or in an 'abstract scale'. + 6.3 - Energy Model complexity ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -- cgit From f2c90b12e700fff6a0b5a1c32f446f05f9d0890c Mon Sep 17 00:00:00 2001 From: Lukasz Luba Date: Tue, 3 Nov 2020 09:05:59 +0000 Subject: PM: EM: update the comments related to power scale The Energy Model supports power values expressed in milli-Watts or in an 'abstract scale'. Update the related comments in the code to reflect that state. Reviewed-by: Quentin Perret Signed-off-by: Lukasz Luba Signed-off-by: Rafael J. Wysocki --- include/linux/energy_model.h | 11 +++++------ kernel/power/energy_model.c | 2 +- 2 files changed, 6 insertions(+), 7 deletions(-) diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h index 3a33c738d876..9618c0a46ef4 100644 --- a/include/linux/energy_model.h +++ b/include/linux/energy_model.h @@ -13,9 +13,8 @@ /** * em_perf_state - Performance state of a performance domain * @frequency: The frequency in KHz, for consistency with CPUFreq - * @power: The power consumed at this level, in milli-watts (by 1 CPU or - by a registered device). 
It can be a total power: static and - dynamic. + * @power: The power consumed at this level (by 1 CPU or by a registered + * device). It can be a total power: static and dynamic. * @cost: The cost coefficient associated with this level, used during * energy calculation. Equal to: power * max_frequency / frequency */ @@ -58,7 +57,7 @@ struct em_data_callback { /** * active_power() - Provide power at the next performance state of * a device - * @power : Active power at the performance state in mW + * @power : Active power at the performance state * (modified) * @freq : Frequency at the performance state in kHz * (modified) @@ -69,8 +68,8 @@ struct em_data_callback { * and frequency. * * In case of CPUs, the power is the one of a single CPU in the domain, - * expressed in milli-watts. It is expected to fit in the - * [0, EM_MAX_POWER] range. + * expressed in milli-Watts or an abstract scale. It is expected to + * fit in the [0, EM_MAX_POWER] range. * * Return 0 on success. */ diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c index efe2a595988e..1358fa4abfa8 100644 --- a/kernel/power/energy_model.c +++ b/kernel/power/energy_model.c @@ -143,7 +143,7 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd, /* * The power returned by active_state() is expected to be - * positive, in milli-watts and to fit into 16 bits. + * positive and to fit into 16 bits. */ if (!power || power > EM_MAX_POWER) { dev_err(dev, "EM: invalid power: %lu\n", -- cgit From b56a352c0d3ca4640c3c6364e592be360ac0d6d4 Mon Sep 17 00:00:00 2001 From: Lukasz Luba Date: Tue, 3 Nov 2020 09:06:00 +0000 Subject: PM: EM: Update Energy Model with new flag indicating power scale Update description and meaning of a new flag, which indicates the type of power scale used for a registered Energy Model (EM) device. Reviewed-by: Quentin Perret Signed-off-by: Lukasz Luba Signed-off-by: Rafael J. 
Wysocki --- Documentation/power/energy-model.rst | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst index ba7aa581b307..60ac091d3b0d 100644 --- a/Documentation/power/energy-model.rst +++ b/Documentation/power/energy-model.rst @@ -30,6 +30,8 @@ These subsystems are more interested in estimation of power used in the past, thus the real milli-Watts might be needed. An example of these requirements can be found in the Intelligent Power Allocation in Documentation/driver-api/thermal/power_allocator.rst. +Kernel subsystems might implement automatic detection to check whether EM +registered devices have inconsistent scale (based on EM internal flag). Important thing to keep in mind is that when the power values are expressed in an 'abstract scale' deriving real energy in milli-Joules would not be possible. @@ -86,7 +88,7 @@ Drivers are expected to register performance domains into the EM framework by calling the following API:: int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states, - struct em_data_callback *cb, cpumask_t *cpus); + struct em_data_callback *cb, cpumask_t *cpus, bool milliwatts); Drivers must provide a callback function returning tuples for each performance state. The callback function provided by the driver is free @@ -94,6 +96,10 @@ to fetch data from any relevant location (DT, firmware, ...), and by any mean deemed necessary. Only for CPU devices, drivers must specify the CPUs of the performance domains using cpumask. For other devices than CPUs the last argument must be set to NULL. +The last argument 'milliwatts' is important to set with correct value. Kernel +subsystems which use EM might rely on this flag to check if all EM devices use +the same scale. If there are different scales, these subsystems might decide +to: return warning/error, stop working or panic. See Section 3. 
for an example of driver implementing this callback, and kernel/power/energy_model.c for further documentation on this API. @@ -169,7 +175,8 @@ EM framework:: 37 nr_opp = foo_get_nr_opp(policy); 38 39 /* And register the new performance domain */ - 40 em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus); - 41 - 42 return 0; - 43 } + 40 em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus, + 41 true); + 42 + 43 return 0; + 44 } -- cgit From fc51989062138744b56e47190915ce68484e73f3 Mon Sep 17 00:00:00 2001 From: Ulf Hansson Date: Tue, 3 Nov 2020 16:06:25 +0100 Subject: PM: domains: Rename pm_genpd_syscore_poweroff|poweron() To better describe what the pm_genpd_syscore_poweroff|poweron() functions actually do, let's rename them to dev_pm_genpd_suspend|resume() and update the rather few callers of them accordingly (a couple of clocksource drivers). Moreover, let's take the opportunity to add some documentation of these exported functions, as that is currently missing. Signed-off-by: Ulf Hansson Signed-off-by: Rafael J. Wysocki --- drivers/base/power/domain.c | 35 +++++++++++++++++++++-------------- drivers/clocksource/sh_cmt.c | 8 ++++---- drivers/clocksource/sh_mtu2.c | 4 ++-- drivers/clocksource/sh_tmu.c | 8 ++++---- include/linux/pm_domain.h | 8 ++++---- 5 files changed, 35 insertions(+), 28 deletions(-) diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c index 743268996336..9b4881b67683 100644 --- a/drivers/base/power/domain.c +++ b/drivers/base/power/domain.c @@ -1363,14 +1363,7 @@ static void genpd_complete(struct device *dev) genpd_unlock(genpd); } -/** - * genpd_syscore_switch - Switch power during system core suspend or resume. - * @dev: Device that normally is marked as "always on" to switch power for. - * - * This routine may only be called during the system core (syscore) suspend or - * resume phase for devices whose "always on" flags are set. 
- */ -static void genpd_syscore_switch(struct device *dev, bool suspend) +static void genpd_switch_state(struct device *dev, bool suspend) { struct generic_pm_domain *genpd; @@ -1387,17 +1380,31 @@ static void genpd_syscore_switch(struct device *dev, bool suspend) } } -void pm_genpd_syscore_poweroff(struct device *dev) +/** + * dev_pm_genpd_suspend - Synchronously try to suspend the genpd for @dev + * @dev: The device that is attached to the genpd, that can be suspended. + * + * This routine should typically be called for a device that needs to be + * suspended during the syscore suspend phase. + */ +void dev_pm_genpd_suspend(struct device *dev) { - genpd_syscore_switch(dev, true); + genpd_switch_state(dev, true); } -EXPORT_SYMBOL_GPL(pm_genpd_syscore_poweroff); +EXPORT_SYMBOL_GPL(dev_pm_genpd_suspend); -void pm_genpd_syscore_poweron(struct device *dev) +/** + * dev_pm_genpd_resume - Synchronously try to resume the genpd for @dev + * @dev: The device that is attached to the genpd, which needs to be resumed. + * + * This routine should typically be called for a device that needs to be resumed + * during the syscore resume phase. 
+ */ +void dev_pm_genpd_resume(struct device *dev) { - genpd_syscore_switch(dev, false); + genpd_switch_state(dev, false); } -EXPORT_SYMBOL_GPL(pm_genpd_syscore_poweron); +EXPORT_SYMBOL_GPL(dev_pm_genpd_resume); #else /* !CONFIG_PM_SLEEP */ diff --git a/drivers/clocksource/sh_cmt.c b/drivers/clocksource/sh_cmt.c index 760777458a90..7275d95de435 100644 --- a/drivers/clocksource/sh_cmt.c +++ b/drivers/clocksource/sh_cmt.c @@ -658,7 +658,7 @@ static void sh_cmt_clocksource_suspend(struct clocksource *cs) return; sh_cmt_stop(ch, FLAG_CLOCKSOURCE); - pm_genpd_syscore_poweroff(&ch->cmt->pdev->dev); + dev_pm_genpd_suspend(&ch->cmt->pdev->dev); } static void sh_cmt_clocksource_resume(struct clocksource *cs) @@ -668,7 +668,7 @@ static void sh_cmt_clocksource_resume(struct clocksource *cs) if (!ch->cs_enabled) return; - pm_genpd_syscore_poweron(&ch->cmt->pdev->dev); + dev_pm_genpd_resume(&ch->cmt->pdev->dev); sh_cmt_start(ch, FLAG_CLOCKSOURCE); } @@ -760,7 +760,7 @@ static void sh_cmt_clock_event_suspend(struct clock_event_device *ced) { struct sh_cmt_channel *ch = ced_to_sh_cmt(ced); - pm_genpd_syscore_poweroff(&ch->cmt->pdev->dev); + dev_pm_genpd_suspend(&ch->cmt->pdev->dev); clk_unprepare(ch->cmt->clk); } @@ -769,7 +769,7 @@ static void sh_cmt_clock_event_resume(struct clock_event_device *ced) struct sh_cmt_channel *ch = ced_to_sh_cmt(ced); clk_prepare(ch->cmt->clk); - pm_genpd_syscore_poweron(&ch->cmt->pdev->dev); + dev_pm_genpd_resume(&ch->cmt->pdev->dev); } static int sh_cmt_register_clockevent(struct sh_cmt_channel *ch, diff --git a/drivers/clocksource/sh_mtu2.c b/drivers/clocksource/sh_mtu2.c index bfccb31e94ad..169a1fccc497 100644 --- a/drivers/clocksource/sh_mtu2.c +++ b/drivers/clocksource/sh_mtu2.c @@ -297,12 +297,12 @@ static int sh_mtu2_clock_event_set_periodic(struct clock_event_device *ced) static void sh_mtu2_clock_event_suspend(struct clock_event_device *ced) { - pm_genpd_syscore_poweroff(&ced_to_sh_mtu2(ced)->mtu->pdev->dev); + 
dev_pm_genpd_suspend(&ced_to_sh_mtu2(ced)->mtu->pdev->dev); } static void sh_mtu2_clock_event_resume(struct clock_event_device *ced) { - pm_genpd_syscore_poweron(&ced_to_sh_mtu2(ced)->mtu->pdev->dev); + dev_pm_genpd_resume(&ced_to_sh_mtu2(ced)->mtu->pdev->dev); } static void sh_mtu2_register_clockevent(struct sh_mtu2_channel *ch, diff --git a/drivers/clocksource/sh_tmu.c b/drivers/clocksource/sh_tmu.c index d41df9ba3725..b00dec0655cb 100644 --- a/drivers/clocksource/sh_tmu.c +++ b/drivers/clocksource/sh_tmu.c @@ -292,7 +292,7 @@ static void sh_tmu_clocksource_suspend(struct clocksource *cs) if (--ch->enable_count == 0) { __sh_tmu_disable(ch); - pm_genpd_syscore_poweroff(&ch->tmu->pdev->dev); + dev_pm_genpd_suspend(&ch->tmu->pdev->dev); } } @@ -304,7 +304,7 @@ static void sh_tmu_clocksource_resume(struct clocksource *cs) return; if (ch->enable_count++ == 0) { - pm_genpd_syscore_poweron(&ch->tmu->pdev->dev); + dev_pm_genpd_resume(&ch->tmu->pdev->dev); __sh_tmu_enable(ch); } } @@ -394,12 +394,12 @@ static int sh_tmu_clock_event_next(unsigned long delta, static void sh_tmu_clock_event_suspend(struct clock_event_device *ced) { - pm_genpd_syscore_poweroff(&ced_to_sh_tmu(ced)->tmu->pdev->dev); + dev_pm_genpd_suspend(&ced_to_sh_tmu(ced)->tmu->pdev->dev); } static void sh_tmu_clock_event_resume(struct clock_event_device *ced) { - pm_genpd_syscore_poweron(&ced_to_sh_tmu(ced)->tmu->pdev->dev); + dev_pm_genpd_resume(&ced_to_sh_tmu(ced)->tmu->pdev->dev); } static void sh_tmu_register_clockevent(struct sh_tmu_channel *ch, diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h index 1ad0ec481416..a8f93328daec 100644 --- a/include/linux/pm_domain.h +++ b/include/linux/pm_domain.h @@ -280,11 +280,11 @@ static inline int dev_pm_genpd_remove_notifier(struct device *dev) #endif #ifdef CONFIG_PM_GENERIC_DOMAINS_SLEEP -void pm_genpd_syscore_poweroff(struct device *dev); -void pm_genpd_syscore_poweron(struct device *dev); +void dev_pm_genpd_suspend(struct device *dev); +void 
dev_pm_genpd_resume(struct device *dev); #else -static inline void pm_genpd_syscore_poweroff(struct device *dev) {} -static inline void pm_genpd_syscore_poweron(struct device *dev) {} +static inline void dev_pm_genpd_suspend(struct device *dev) {} +static inline void dev_pm_genpd_resume(struct device *dev) {} #endif /* OF PM domain providers */ -- cgit From b9795a3e4e1cbf521bbb5ef240eb47803c303b02 Mon Sep 17 00:00:00 2001 From: Ulf Hansson Date: Tue, 3 Nov 2020 16:06:26 +0100 Subject: PM: domains: Enable dev_pm_genpd_suspend|resume() for suspend-to-idle The dev_pm_genpd_suspend|resume() functions have so far only been used during the syscore suspend/resume phases. However, during suspend-to-idle, where the syscore phases don't exist, similar operations are sometimes needed. Existing examples are the timekeeping_suspend|resume() functions, which are called both through registered syscore ops during the syscore phases and as regular function calls from cpuidle (via tick_freeze()) during suspend-to-idle. For similar reasons, let's enable the dev_pm_genpd_suspend|resume() APIs to be re-used for corresponding CPU devices that are attached to a genpd, during suspend-to-idle. Signed-off-by: Ulf Hansson Signed-off-by: Rafael J. 
Wysocki --- drivers/base/power/domain.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c index 9b4881b67683..4a55f3c949ae 100644 --- a/drivers/base/power/domain.c +++ b/drivers/base/power/domain.c @@ -1366,18 +1366,27 @@ static void genpd_complete(struct device *dev) static void genpd_switch_state(struct device *dev, bool suspend) { struct generic_pm_domain *genpd; + bool use_lock; genpd = dev_to_genpd_safe(dev); if (!genpd) return; + use_lock = genpd_is_irq_safe(genpd); + + if (use_lock) + genpd_lock(genpd); + if (suspend) { genpd->suspended_count++; - genpd_sync_power_off(genpd, false, 0); + genpd_sync_power_off(genpd, use_lock, 0); } else { - genpd_sync_power_on(genpd, false, 0); + genpd_sync_power_on(genpd, use_lock, 0); genpd->suspended_count--; } + + if (use_lock) + genpd_unlock(genpd); } /** @@ -1385,7 +1394,9 @@ static void genpd_switch_state(struct device *dev, bool suspend) * @dev: The device that is attached to the genpd, that can be suspended. * * This routine should typically be called for a device that needs to be - * suspended during the syscore suspend phase. + * suspended during the syscore suspend phase. It may also be called during + * suspend-to-idle to suspend a corresponding CPU device that is attached to a + * genpd. */ void dev_pm_genpd_suspend(struct device *dev) { @@ -1398,7 +1409,8 @@ EXPORT_SYMBOL_GPL(dev_pm_genpd_suspend); * @dev: The device that is attached to the genpd, which needs to be resumed. * * This routine should typically be called for a device that needs to be resumed - * during the syscore resume phase. + * during the syscore resume phase. It may also be called during suspend-to-idle + * to resume a corresponding CPU device that is attached to a genpd. 
*/ void dev_pm_genpd_resume(struct device *dev) { -- cgit From 670c90def03429a228229420fa48a17913fdcc0d Mon Sep 17 00:00:00 2001 From: Ulf Hansson Date: Tue, 3 Nov 2020 16:06:27 +0100 Subject: cpuidle: psci: Enable suspend-to-idle for PSCI OSI mode To select domain idlestates for cpuidle-psci when OSI mode has been enabled, the PM domains are managed via genpd through runtime PM. This works fine for the regular idlepath, but it doesn't during system wide suspend. More precisely, the domain idlestates become temporarily disabled, because the PM core disables runtime PM for devices during system wide suspend. Later in the system suspend phase, genpd intends to deal with this from its ->suspend_noirq() callback, but this doesn't work as expected for a device corresponding to a CPU, because the domain idlestates need to be selected on a per CPU basis (the PM core doesn't invoke the callbacks like that). To address this problem, let's enable the syscore flag for the corresponding CPU device that becomes successfully attached to its PM domain (applicable only in OSI mode). This tells the PM core to skip invoking the system wide suspend/resume callbacks for the device, which also prevents genpd from corrupting its internal state for it. Moreover, to properly select a domain idlestate for the CPUs during suspend-to-idle, let's assign a specific ->enter_s2idle() callback for the corresponding domain idlestate (applicable only in OSI mode). From that callback, let's invoke dev_pm_genpd_suspend|resume(), as this allows a domain idlestate to be selected for the current CPU by genpd. Signed-off-by: Ulf Hansson Signed-off-by: Rafael J. 
Wysocki --- drivers/cpuidle/cpuidle-psci-domain.c | 2 ++ drivers/cpuidle/cpuidle-psci.c | 34 ++++++++++++++++++++++++++++++---- 2 files changed, 32 insertions(+), 4 deletions(-) diff --git a/drivers/cpuidle/cpuidle-psci-domain.c b/drivers/cpuidle/cpuidle-psci-domain.c index 4a031c62f92a..ff2c3f8e4668 100644 --- a/drivers/cpuidle/cpuidle-psci-domain.c +++ b/drivers/cpuidle/cpuidle-psci-domain.c @@ -327,6 +327,8 @@ struct device *psci_dt_attach_cpu(int cpu) if (cpu_online(cpu)) pm_runtime_get_sync(dev); + dev_pm_syscore_device(dev, true); + return dev; } diff --git a/drivers/cpuidle/cpuidle-psci.c b/drivers/cpuidle/cpuidle-psci.c index d928b37718bd..b51b5df08450 100644 --- a/drivers/cpuidle/cpuidle-psci.c +++ b/drivers/cpuidle/cpuidle-psci.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include #include @@ -52,8 +53,9 @@ static inline int psci_enter_state(int idx, u32 state) return CPU_PM_CPU_IDLE_ENTER_PARAM(psci_cpu_suspend_enter, idx, state); } -static int psci_enter_domain_idle_state(struct cpuidle_device *dev, - struct cpuidle_driver *drv, int idx) +static int __psci_enter_domain_idle_state(struct cpuidle_device *dev, + struct cpuidle_driver *drv, int idx, + bool s2idle) { struct psci_cpuidle_data *data = this_cpu_ptr(&psci_cpuidle_data); u32 *states = data->psci_states; @@ -66,7 +68,12 @@ static int psci_enter_domain_idle_state(struct cpuidle_device *dev, return -1; /* Do runtime PM to manage a hierarchical CPU toplogy. */ - RCU_NONIDLE(pm_runtime_put_sync_suspend(pd_dev)); + rcu_irq_enter_irqson(); + if (s2idle) + dev_pm_genpd_suspend(pd_dev); + else + pm_runtime_put_sync_suspend(pd_dev); + rcu_irq_exit_irqson(); state = psci_get_domain_state(); if (!state) @@ -74,7 +81,12 @@ static int psci_enter_domain_idle_state(struct cpuidle_device *dev, ret = psci_cpu_suspend_enter(state) ? 
-1 : idx; - RCU_NONIDLE(pm_runtime_get_sync(pd_dev)); + rcu_irq_enter_irqson(); + if (s2idle) + dev_pm_genpd_resume(pd_dev); + else + pm_runtime_get_sync(pd_dev); + rcu_irq_exit_irqson(); cpu_pm_exit(); @@ -83,6 +95,19 @@ static int psci_enter_domain_idle_state(struct cpuidle_device *dev, return ret; } +static int psci_enter_domain_idle_state(struct cpuidle_device *dev, + struct cpuidle_driver *drv, int idx) +{ + return __psci_enter_domain_idle_state(dev, drv, idx, false); +} + +static int psci_enter_s2idle_domain_idle_state(struct cpuidle_device *dev, + struct cpuidle_driver *drv, + int idx) +{ + return __psci_enter_domain_idle_state(dev, drv, idx, true); +} + static int psci_idle_cpuhp_up(unsigned int cpu) { struct device *pd_dev = __this_cpu_read(psci_cpuidle_data.dev); @@ -170,6 +195,7 @@ static int psci_dt_cpu_init_topology(struct cpuidle_driver *drv, * deeper states. */ drv->states[state_count - 1].enter = psci_enter_domain_idle_state; + drv->states[state_count - 1].enter_s2idle = psci_enter_s2idle_domain_idle_state; psci_cpuidle_use_cpuhp = true; return 0; -- cgit From 7a25759eaa04b8c0ecb3db134922d6641ab2e6d1 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 30 Nov 2020 22:32:02 +0000 Subject: cpuidle: Select polling interval based on a c-state with a longer target residency It was noted that a few workloads that idle rapidly regressed when commit 36fcb4292473 ("cpuidle: use first valid target residency as poll time") was merged. The workloads in question were heavy communicators that idle rapidly and were impacted by the c-state exit latency as the active CPUs were not polling at the time of wakeup. As they were not particularly realistic workloads, it was not considered to be a major problem. Unfortunately, a bug was reported for a real workload in a production environment that relied on large numbers of threads operating in a worker pool pattern. 
These threads would idle for periods of time longer than the C1 target residency and so incurred the c-state exit latency penalty. The application is very sensitive to wakeup latency and indirectly relied on the behaviour prior to commit a37b969a61c1 ("cpuidle: poll_state: Add time limit to poll_idle()") to poll for long enough to avoid the exit latency cost.

The target residency of C1 is typically very short. On some x86 machines, it can be as low as 2 microseconds. In poll_idle(), the clock is checked every POLL_IDLE_RELAX_COUNT iterations of cpu_relax() and even one iteration of that loop can be over 1 microsecond, so the polling interval is very close to the granularity of what poll_idle() can detect. Furthermore, a basic ping-pong workload like perf bench sched pipe has a round-trip time longer than 2 microseconds, meaning that the CPU will almost certainly not be polling when the ping-pong completes.

This patch selects a polling interval based on the first enabled c-state that has a target residency of at least 10usec. If there is no such enabled c-state then polling will be up to TICK_NSEC/16, similar to what it was up until kernel 4.20. Polling for a full tick is unlikely (rescheduling event) and is much longer than the existing target residencies for a deep c-state.

As an example, consider the following c-state information from an Intel CPU:

	residency	exit_latency
C1	2		2
C1E	20		10
C3	100		33
C6	400		133

The polling interval selected is 20usec. If booted with intel_idle.max_cstate=1 then the polling interval is 250usec as the deeper c-states were not available.

On an AMD EPYC machine, the c-state information is more limited and looks like

	residency	exit_latency
C1	2		1
C2	800		400

The polling interval selected is 250usec. While C2 was considered, the polling interval was clamped by CPUIDLE_POLL_MAX.

Note that it is not expected that polling will be a universal win. 
As well as potentially trading power for performance, the performance is not guaranteed if the extra polling prevented a turbo state being reached. Making it a tunable was considered but it's driver-specific, may be overridden by a governor and is not a guaranteed polling interval making it difficult to describe without knowledge of the implementation.

tbench4
                        vanilla                polling
Hmean     1        497.89 (   0.00%)      543.15 *   9.09%*
Hmean     2        975.88 (   0.00%)     1059.73 *   8.59%*
Hmean     4       1953.97 (   0.00%)     2081.37 *   6.52%*
Hmean     8       3645.76 (   0.00%)     4052.95 *  11.17%*
Hmean     16      6882.21 (   0.00%)     6995.93 *   1.65%*
Hmean     32     10752.20 (   0.00%)    10731.53 *  -0.19%*
Hmean     64     12875.08 (   0.00%)    12478.13 *  -3.08%*
Hmean     128    21500.54 (   0.00%)    21098.60 *  -1.87%*
Hmean     256    21253.70 (   0.00%)    21027.18 *  -1.07%*
Hmean     320    20813.50 (   0.00%)    20580.64 *  -1.12%*

Signed-off-by: Mel Gorman Signed-off-by: Rafael J. Wysocki --- drivers/cpuidle/cpuidle.c | 25 +++++++++++++++++++++++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 83af15f77f66..ef2ea1b12cd8 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -368,6 +368,19 @@ void cpuidle_reflect(struct cpuidle_device *dev, int index) cpuidle_curr_governor->reflect(dev, index); } +/* + * Min polling interval of 10usec is a guess. It is assuming that + * for most users, the time for a single ping-pong workload like + * perf bench pipe would generally complete within 10usec but + * this is hardware dependant. Actual time can be estimated with + * + * perf bench sched pipe -l 10000 + * + * Run multiple times to avoid cpufreq effects. 
+ */ +#define CPUIDLE_POLL_MIN 10000 +#define CPUIDLE_POLL_MAX (TICK_NSEC / 16) + /** * cpuidle_poll_time - return amount of time to poll for, * governors can override dev->poll_limit_ns if necessary @@ -382,15 +395,23 @@ u64 cpuidle_poll_time(struct cpuidle_driver *drv, int i; u64 limit_ns; + BUILD_BUG_ON(CPUIDLE_POLL_MIN > CPUIDLE_POLL_MAX); + if (dev->poll_limit_ns) return dev->poll_limit_ns; - limit_ns = TICK_NSEC; + limit_ns = CPUIDLE_POLL_MAX; for (i = 1; i < drv->state_count; i++) { + u64 state_limit; + if (dev->states_usage[i].disable) continue; - limit_ns = drv->states[i].target_residency_ns; + state_limit = drv->states[i].target_residency_ns; + if (state_limit < CPUIDLE_POLL_MIN) + continue; + + limit_ns = min_t(u64, state_limit, CPUIDLE_POLL_MAX); break; } -- cgit From 1080399542075bb0e9d46ea80418d76784d1ece8 Mon Sep 17 00:00:00 2001 From: Pavankumar Kondeti Date: Sat, 28 Nov 2020 07:09:23 +0530 Subject: PM / EM: Micro optimization in em_cpu_energy When the sum of the utilization of CPUs in a power domain is zero, return the energy as 0 without doing any computations. Acked-by: Quentin Perret Reviewed-by: Dietmar Eggemann Signed-off-by: Pavankumar Kondeti Signed-off-by: Rafael J. Wysocki --- include/linux/energy_model.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h index 9618c0a46ef4..757fc60658fa 100644 --- a/include/linux/energy_model.h +++ b/include/linux/energy_model.h @@ -106,6 +106,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, struct em_perf_state *ps; int i, cpu; + if (!sum_util) + return 0; + /* * In order to predict the performance state, map the utilization of * the most utilized CPU of the performance domain to a requested -- cgit
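
Reviewer note on the cpuidle polling-interval patch above: the selection logic added to cpuidle_poll_time() can be sketched in userspace C to check the figures quoted in its changelog. This is a minimal illustration under stated assumptions, not the kernel code. The names sketch_state and sketch_poll_time are hypothetical, HZ=250 is assumed (which gives TICK_NSEC/16 = 250usec, matching the changelog), and the dev->poll_limit_ns early return is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace sketch only. HZ = 250 is an assumption, chosen so that
 * TICK_NSEC / 16 = 250usec as quoted in the changelog.
 */
#define SKETCH_NSEC_PER_SEC 1000000000ULL
#define SKETCH_HZ           250ULL
#define SKETCH_TICK_NSEC    (SKETCH_NSEC_PER_SEC / SKETCH_HZ)

#define SKETCH_POLL_MIN 10000ULL                 /* mirrors CPUIDLE_POLL_MIN */
#define SKETCH_POLL_MAX (SKETCH_TICK_NSEC / 16)  /* mirrors CPUIDLE_POLL_MAX */

/* Compile-time check playing the role of BUILD_BUG_ON() in the patch. */
_Static_assert(SKETCH_POLL_MIN <= SKETCH_POLL_MAX,
	       "minimum poll time must not exceed the maximum");

/* Hypothetical stand-in for the per-state data the kernel consults. */
struct sketch_state {
	uint64_t target_residency_ns;
	bool disabled;
};

/*
 * Return the polling limit in nanoseconds. Index 0 is taken to be the
 * polling state itself and is skipped, as in the patch. The first
 * enabled state whose target residency is at least SKETCH_POLL_MIN
 * decides the limit, clamped to SKETCH_POLL_MAX. If no state
 * qualifies, the limit stays at SKETCH_POLL_MAX.
 */
uint64_t sketch_poll_time(const struct sketch_state *states, size_t count)
{
	uint64_t limit_ns = SKETCH_POLL_MAX;
	size_t i;

	for (i = 1; i < count; i++) {
		uint64_t state_limit;

		if (states[i].disabled)
			continue;

		state_limit = states[i].target_residency_ns;
		if (state_limit < SKETCH_POLL_MIN)
			continue;

		limit_ns = state_limit < SKETCH_POLL_MAX ?
			   state_limit : SKETCH_POLL_MAX;
		break;
	}

	return limit_ns;
}
```

Fed the residencies from the changelog (converted to nanoseconds, with a polling state at index 0), the sketch picks C1E on the Intel table and falls back to the clamp for both the C1-only and the AMD EPYC tables.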