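[Linux Device Driver Development Explained] 19. Linux Power Management Driver Architecture
1. Overall Architecture of Linux Kernel Power Management
Linux power management covers system-level suspend, frequency and voltage scaling, handling of system idle time, each device driver's support for system suspend, and runtime power management of each device.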
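The overall architecture of Linux kernel power management can be roughly broken down into the following parts:
1. CPUFreq, which scales CPU voltage and frequency dynamically at runtime according to system load
2. CPUIdle, which puts the CPU into a low-power mode chosen according to how idle the system is
3. CPU hotplug on multi-core systems
4. PM QoS, through which the system and devices register their specific latency requirements; it influences the concrete CPUIdle policy
5. The set of entry points through which device drivers support system suspend to RAM or to disk
6. The entry point through which the SoC enters the suspended state and SDRAM is put into self-refresh
7. Runtime power management of devices, which switches devices on and off dynamically according to usage
8. The underlying clock, regulator, and frequency/voltage table support (the latter handled by the OPP module), which any driver subsystem may use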
2. CPUFreq Driver
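The CPUFreq subsystem lives in the drivers/cpufreq directory and is responsible for adjusting CPU frequency and voltage dynamically at runtime, i.e. DVFS (Dynamic Voltage and Frequency Scaling). In CMOS circuits, dynamic power is proportional to the square of the voltage and to the frequency (P ∝ fV^2), so lowering the voltage and frequency at runtime reduces power consumption.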
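To make the P ∝ fV^2 relation concrete, here is a minimal sketch that computes the dynamic power after scaling, relative to the power before. The function name and the ratio-based interface are assumptions for illustration only; nothing like this exists in the kernel.

```c
#include <assert.h>

/*
 * Relative dynamic power after scaling, assuming P is proportional
 * to f * V^2. Both arguments are new/old ratios. Hypothetical helper
 * for illustration, not a kernel API.
 */
double relative_dynamic_power(double freq_ratio, double volt_ratio)
{
    assert(freq_ratio > 0.0 && volt_ratio > 0.0);
    return freq_ratio * volt_ratio * volt_ratio;
}
```

For example, halving the frequency while lowering the voltage to 80% of its original value gives 0.5 × 0.8² = 0.32, roughly a third of the original dynamic power. Lowering the frequency alone would only give 0.5, which is why DVFS scales both.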
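The CPUFreq core is in drivers/cpufreq/cpufreq.c. It provides a unified interface for each SoC's CPUFreq driver and implements a notifier mechanism that informs other modules whenever the CPUFreq policy or the frequency changes. In addition, when the CPU's running frequency changes, the kernel's loops_per_jiffy value changes accordingly.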
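The reason loops_per_jiffy has to follow the frequency is that it calibrates busy-wait delay loops: at half the clock rate, the same loop count takes twice as long. The sketch below is a userspace model of that linear relation only. The function name is hypothetical and it is not the kernel's own code path.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model: scale a delay-loop calibration value linearly
 * with CPU frequency. Hypothetical helper, not a kernel function.
 */
unsigned long scale_loops_per_jiffy(unsigned long old_lpj,
                                    unsigned int old_khz,
                                    unsigned int new_khz)
{
    assert(old_khz != 0);
    /* A 64-bit intermediate avoids overflow in old_lpj * new_khz. */
    return (unsigned long)(((uint64_t)old_lpj * new_khz) / old_khz);
}
```

Drivers whose delay loop does not depend on the CPU clock can tell the core so with the CPUFREQ_CONST_LOOPS flag described in the next section.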
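2.1 Implementing an SoC's CPUFreq Driver
Each SoC's CPUFreq driver only needs to provide the voltage and frequency tables and to carry out the changes at the hardware level.
The CPUFreq core provides an API through which an SoC registers its CPUFreq driver: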
int cpufreq_register_driver(struct cpufreq_driver *driver_data);
cpufreq_driver encapsulates the body of a specific SoC's CPUFreq driver. The structure is defined as follows:
struct cpufreq_driver {
    char name[CPUFREQ_NAME_LEN];    // name of the CPUFreq driver, e.g. "imx6q-cpufreq" for imx6q-cpufreq
    u8 flags;                       // hint flags; CPUFREQ_CONST_LOOPS tells the kernel that loops_per_jiffy does not change with CPU frequency
    void *driver_data;

    /* needed by all drivers */
    int (*init)(struct cpufreq_policy *policy);
    int (*verify)(struct cpufreq_policy *policy);

    /* define one out of two */
    int (*setpolicy)(struct cpufreq_policy *policy);

    /*
     * On failure, should always restore frequency to policy->restore_freq
     * (i.e. old freq).
     */
    int (*target)(struct cpufreq_policy *policy,
                  unsigned int target_freq,
                  unsigned int relation);   /* Deprecated */
    int (*target_index)(struct cpufreq_policy *policy,
                        unsigned int index);
    unsigned int (*fast_switch)(struct cpufreq_policy *policy,
                                unsigned int target_freq);

    /*
     * Caches and returns the lowest driver-supported frequency greater than
     * or equal to the target frequency, subject to any driver limitations.
     * Does not set the frequency. Only to be implemented for drivers with
     * target().
     */
    unsigned int (*resolve_freq)(struct cpufreq_policy *policy,
                                 unsigned int target_freq);

    /*
     * Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION
     * unset.
     *
     * get_intermediate should return a stable intermediate frequency
     * platform wants to switch to and target_intermediate() should set CPU
     * to that frequency, before jumping to the frequency corresponding
     * to 'index'. Core will take care of sending notifications and driver
     * doesn't have to handle them in target_intermediate() or
     * target_index().
     *
     * Drivers can return '0' from get_intermediate() in case they don't
     * wish to switch to intermediate frequency for some target frequency.
     * In that case core will directly call ->target_index().
     */
    unsigned int (*get_intermediate)(struct cpufreq_policy *policy,
                                     unsigned int index);
    int (*target_intermediate)(struct cpufreq_policy *policy,
                               unsigned int index);

    /* should be defined, if possible */
    unsigned int (*get)(unsigned int cpu);

    /* optional */
    int (*bios_limit)(int cpu, unsigned int *limit);
    int (*exit)(struct cpufreq_policy *policy);
    void (*stop_cpu)(struct cpufreq_policy *policy);
    int (*suspend)(struct cpufreq_policy *policy);
    int (*resume)(struct cpufreq_policy *policy);

    /* Will be called after the driver is fully initialized */
    void (*ready)(struct cpufreq_policy *policy);

    struct freq_attr **attr;

    /* platform specific boost support code */
    bool boost_enabled;
    int (*set_boost)(int state);
};
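- The init() member is a per-CPU initialization function pointer. It is called whenever a new CPU is registered with the system and receives a pointer to a cpufreq_policy. Inside init(), the following fields can be set:
policy->cpuinfo.min_freq;           // minimum frequency supported by the CPU (kHz)
policy->cpuinfo.max_freq;           // maximum frequency supported by the CPU (kHz)
policy->cpuinfo.transition_latency; // latency of a frequency switch (ns)
policy->cur;                        // current CPU frequency
// default policy for this CPU, and the minimum and maximum CPU frequency allowed under it
policy->policy;
policy->governor;
policy->min;
policy->max;
- The verify() member validates and corrects a user-supplied CPUFreq policy. Whenever the user sets a new policy, this function checks the new settings against the old and new policies and corrects any invalid values. Implementations commonly use the following helper: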
void cpufreq_verify_within_limits(struct cpufreq_policy *policy,
unsigned int min, unsigned int max);
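- The setpolicy() member receives a policy argument (including policy->policy, policy->min, and policy->max). CPUs whose drivers implement it can generally adjust their own frequency automatically within a range (policy->min to policy->max). Only a few drivers currently implement it; most CPUs implement only the target() member, whose argument is directly a specified frequency.
- The target() member adjusts the frequency to a specified value and takes three arguments: policy, target_freq, and relation.
target_freq: the target frequency. The driver must always set the real CPU frequency to the supported value closest to target_freq, and that value must lie between policy->min and policy->max.
relation: CPUFREQ_REL_L hints that the frequency actually set should be greater than or equal to target_freq; CPUFREQ_REL_H hints that it should be less than or equal to target_freq.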
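The verify() clamping and the relation-based lookup described in the list above can be modeled in userspace roughly as follows. This is a sketch under stated assumptions: `struct limits`, `enum relation`, `clamp_limits()`, and `pick_index()` are hypothetical names, the table is a plain array of kHz values, and the fallback to the nearest entry on the other side of the target is a simplification rather than the kernel's exact lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for CPUFREQ_REL_L and CPUFREQ_REL_H. */
enum relation { REL_L, REL_H };

struct limits {
    unsigned int min;   /* kHz */
    unsigned int max;   /* kHz */
};

/* Clamp a requested [min, max] range into the hardware range. */
void clamp_limits(struct limits *req, unsigned int hw_min, unsigned int hw_max)
{
    assert(req != NULL && hw_min <= hw_max);
    if (req->min < hw_min) req->min = hw_min;
    if (req->max < hw_min) req->max = hw_min;
    if (req->min > hw_max) req->min = hw_max;
    if (req->max > hw_max) req->max = hw_max;
    if (req->min > req->max) req->min = req->max;
}

/*
 * Pick a table index for target_khz among entries inside [min, max].
 * REL_L: lowest frequency at or above the target.
 * REL_H: highest frequency at or below the target.
 * If nothing satisfies the relation, fall back to the nearest entry on
 * the other side. Returns -1 if no entry lies inside the limits.
 */
int pick_index(const unsigned int *table_khz, size_t n,
               const struct limits *lim, unsigned int target_khz,
               enum relation rel)
{
    int best = -1, fallback = -1;
    size_t i;

    assert(table_khz != NULL && lim != NULL);
    if (target_khz < lim->min) target_khz = lim->min;
    if (target_khz > lim->max) target_khz = lim->max;

    for (i = 0; i < n; i++) {
        unsigned int f = table_khz[i];

        if (f < lim->min || f > lim->max)
            continue;
        if (rel == REL_L) {
            if (f >= target_khz) {
                if (best < 0 || f < table_khz[best]) best = (int)i;
            } else if (fallback < 0 || f > table_khz[fallback]) {
                fallback = (int)i;
            }
        } else {
            if (f <= target_khz) {
                if (best < 0 || f > table_khz[best]) best = (int)i;
            } else if (fallback < 0 || f < table_khz[fallback]) {
                fallback = (int)i;
            }
        }
    }
    return best >= 0 ? best : fallback;
}
```

With a table containing 266000 and 333000 kHz and a target of 300000 kHz, REL_L selects the 333000 entry and REL_H selects the 266000 entry.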
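The table below summarizes which CPUs setpolicy() and target() are meant for and how they differ in the way they are invoked:
setpolicy() | target() |
---|---|
The CPU can adjust its frequency on its own within a given range | The CPU can only be set to a specified frequency |
The CPUFreq policy calls setpolicy(), and the CPU then adjusts its frequency within the range on its own | The CPUFreq core decides the target frequency from the system load and the policy |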
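Because of how the on-chip PLLs and dividers are related, ARM SoCs generally cannot adjust their frequency independently. An SoC's CPUFreq driver usually provides a frequency table, and the frequency is changed among the entries of that table, so such drivers generally implement target() (or its table-index variant target_index(), as the S3C64xx driver excerpt below does).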
static struct regulator *vddarm;    /* CPU core supply, used by the functions below */
static unsigned long regulator_latency;

struct s3c64xx_dvfs {
    unsigned int vddarm_min;
    unsigned int vddarm_max;
};

static struct s3c64xx_dvfs s3c64xx_dvfs_table[] = {
    [0] = { 1000000, 1150000 },
    [1] = { 1050000, 1150000 },
    [2] = { 1100000, 1150000 },
    [3] = { 1200000, 1350000 },
    [4] = { 1300000, 1350000 },
};

/* Fields: flags, driver_data (index into s3c64xx_dvfs_table), frequency in kHz */
static struct cpufreq_frequency_table s3c64xx_freq_table[] = {
    { 0, 0,  66000 },
    { 0, 0, 100000 },
    { 0, 0, 133000 },
    { 0, 1, 200000 },
    { 0, 1, 222000 },
    { 0, 1, 266000 },
    { 0, 2, 333000 },
    { 0, 2, 400000 },
    { 0, 2, 532000 },
    { 0, 2, 533000 },
    { 0, 3, 667000 },
    { 0, 4, 800000 },
    { 0, 0, CPUFREQ_TABLE_END },
};
// Switches the CPU to the target frequency
static int s3c64xx_cpufreq_set_target(struct cpufreq_policy *policy,
                                      unsigned int index)
{
    struct s3c64xx_dvfs *dvfs;
    unsigned int old_freq, new_freq;
    int ret;

    old_freq = clk_get_rate(policy->clk) / 1000;
    // Look up the frequency for this index in the frequency table
    new_freq = s3c64xx_freq_table[index].frequency;
    dvfs = &s3c64xx_dvfs_table[s3c64xx_freq_table[index].driver_data];

    // Apply the new frequency (the voltage steps that would use dvfs are not shown in this excerpt)
    ret = clk_set_rate(policy->clk, new_freq * 1000);
    if (ret < 0) {
        pr_err("Failed to set rate %dkHz: %d\n",
               new_freq, ret);
        return ret;
    }

    pr_debug("Set actual frequency %lukHz\n",
             clk_get_rate(policy->clk) / 1000);

    return 0;
}
static void s3c64xx_cpufreq_config_regulator(void)
{
    int count, v, i, found;
    struct cpufreq_frequency_table *freq;
    struct s3c64xx_dvfs *dvfs;

    count = regulator_count_voltages(vddarm);
    if (count < 0) {
        pr_err("Unable to check supported voltages\n");
    }

    if (!count)
        goto out;

    cpufreq_for_each_valid_entry(freq, s3c64xx_freq_table) {
        dvfs = &s3c64xx_dvfs_table[freq->driver_data];
        found = 0;

        for (i = 0; i < count; i++) {
            v = regulator_list_voltage(vddarm, i);
            if (v >= dvfs->vddarm_min && v <= dvfs->vddarm_max)
                found = 1;
        }

        if (!found) {
            pr_debug("%dkHz unsupported by regulator\n",
                     freq->frequency);
            freq->frequency = CPUFREQ_ENTRY_INVALID;
        }
    }

out:
    /* Guess based on having to do an I2C/SPI write; in future we
     * will be able to query the regulator performance here. */
    regulator_latency = 1 * 1000 * 1000;
}
static int s3c64xx_cpufreq_driver_init(struct cpufreq_policy *policy)
{
    int ret;
    struct cpufreq_frequency_table *freq;

    if (policy->cpu != 0)
        return -EINVAL;

    if (s3c64xx_freq_table == NULL) {
        pr_err("No frequency information for this CPU\n");
        return -ENODEV;
    }

    policy->clk = clk_get(NULL, "armclk");
    if (IS_ERR(policy->clk)) {
        pr_err("Unable to obtain ARMCLK: %ld\n",
               PTR_ERR(policy->clk));
        return PTR_ERR(policy->clk);
    }

#ifdef CONFIG_REGULATOR
    vddarm = regulator_get(NULL, "vddarm");
    if (IS_ERR(vddarm)) {
        ret = PTR_ERR(vddarm);
        pr_err("Failed to obtain VDDARM: %d\n", ret);
        pr_err("Only frequency scaling available\n");
        vddarm = NULL;
    } else {
        s3c64xx_cpufreq_config_regulator();
    }
#endif

    cpufreq_for_each_entry(freq, s3c64xx_freq_table) {
        unsigned long r;

        /* Check for frequencies we can generate */
        r = clk_round_rate(policy->clk, freq->frequency * 1000);
        r /= 1000;
        if (r != freq->frequency) {
            pr_debug("%dkHz unsupported by clock\n",
                     freq->frequency);
            freq->frequency = CPUFREQ_ENTRY_INVALID;
        }

        /* If we have no regulator then assume startup
         * frequency is the maximum we can support. */
        if (!vddarm && freq->frequency > clk_get_rate(policy->clk) / 1000)
            freq->frequency = CPUFREQ_ENTRY_INVALID;
    }

    /* Datasheet says PLL stabalisation time (if we were to use
     * the PLLs, which we don't currently) is ~300us worst case,
     * but add some fudge.
     */
    ret = cpufreq_generic_init(policy, s3c64xx_freq_table,
                               (500 * 1000) + regulator_latency);
    if (ret != 0) {
        pr_err("Failed to configure frequency table: %d\n",
               ret);
        regulator_put(vddarm);
        clk_put(policy->clk);
    }

    return ret;
}
static struct cpufreq_driver s3c64xx_cpufreq_driver = {
    .flags        = CPUFREQ_NEED_INITIAL_FREQ_CHECK,
    .verify       = cpufreq_generic_frequency_table_verify,
    .target_index = s3c64xx_cpufreq_set_target,
    .get          = cpufreq_generic_get,
    .init         = s3c64xx_cpufreq_driver_init,
    .name         = "s3c",
};

static int __init s3c64xx_cpufreq_init(void)
{
    return cpufreq_register_driver(&s3c64xx_cpufreq_driver);
}
module_init(s3c64xx_cpufreq_init);
As for frequency tables, newer kernels generally use OPP.
2.2 CPUFreq Policies
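An SoC's CPUFreq driver only defines the CPU's frequency parameters and the means of setting the frequency; it does not decide which frequency the CPU should actually run at. The criteria for choosing a frequency, and how the frequency changes, are determined entirely by the CPUFreq policy. The policies and how they work are listed below:
CPUFreq policy | How the policy works |
---|---|
cpufreq_ondemand | Normally runs at low speed and raises the frequency automatically, on demand, when the system load increases |
cpufreq_performance | Runs the CPU at the highest frequency, i.e. scaling_max_freq |
cpufreq_conservative | Similar to ondemand, except that the frequency is changed gradually, step by step |
cpufreq_powersave | Runs the CPU at the lowest frequency, i.e. scaling_min_freq |
cpufreq_userspace | Lets the root user set the frequency through the sysfs node scaling_setspeed |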
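The difference between ondemand and conservative in the table above is easiest to see side by side. The following is a deliberately simplified userspace model, not the kernel's governor code: the function names, the threshold parameters, and the proportional scaling below the threshold are all assumptions made for illustration, and real governors also involve sampling intervals and further tunables.

```c
#include <assert.h>

/*
 * Simplified ondemand-style decision (illustrative only):
 * jump straight to the maximum when load exceeds up_threshold,
 * otherwise pick a frequency proportional to the load.
 */
unsigned int ondemand_next(unsigned int min_khz, unsigned int max_khz,
                           unsigned int load_pct, unsigned int up_threshold)
{
    assert(min_khz <= max_khz && load_pct <= 100);
    if (load_pct > up_threshold)
        return max_khz;
    return min_khz + (unsigned int)
        (((unsigned long long)load_pct * (max_khz - min_khz)) / 100);
}

/*
 * Simplified conservative-style decision (illustrative only):
 * move one step up or down from the current frequency, never
 * leaving [min_khz, max_khz].
 */
unsigned int conservative_next(unsigned int cur_khz,
                               unsigned int min_khz, unsigned int max_khz,
                               unsigned int load_pct,
                               unsigned int up_threshold,
                               unsigned int down_threshold,
                               unsigned int step_khz)
{
    assert(min_khz <= cur_khz && cur_khz <= max_khz && load_pct <= 100);
    if (load_pct > up_threshold)
        return (max_khz - cur_khz < step_khz) ? max_khz : cur_khz + step_khz;
    if (load_pct < down_threshold)
        return (cur_khz - min_khz < step_khz) ? min_khz : cur_khz - step_khz;
    return cur_khz;
}
```

Under a sudden 95% load at 400000 kHz, the ondemand model goes directly to the maximum, while the conservative model only moves up by one step per decision. That gradual ramp is the trade-off the table describes.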
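Android adds an interactive policy, which suits latency-sensitive UI tasks: when a UI interaction task is running, this policy adjusts the CPU frequency more aggressively and more promptly.
The system state and the CPUFreq policy together determine the target of a CPU frequency change. The CPUFreq core passes the target frequency down to the SoC-specific CPUFreq driver, and the driver modifies the hardware to complete the change.
Userspace can generally configure CPUFreq through the nodes under /sys/devices/system/cpu/cpuX/cpufreq. For example, to set the CPU frequency to 700 MHz: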
# echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# echo 700000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
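The same two writes can be issued from a C program. The sketch below assumes the caller passes the cpufreq directory of the CPU in question (for instance the cpu0 path used above) and has permission to write there; `write_attr()` and `set_userspace_speed()` are hypothetical helper names, not part of any library.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write value to <dir>/<attr>, like `echo value > dir/attr`.
 * Returns 0 on success, -1 on failure.
 */
int write_attr(const char *dir, const char *attr, const char *value)
{
    char path[256];
    FILE *fp;
    int n;

    assert(dir != NULL && attr != NULL && value != NULL);
    n = snprintf(path, sizeof(path), "%s/%s", dir, attr);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    if (fputs(value, fp) == EOF) {
        fclose(fp);
        return -1;
    }
    return fclose(fp) == 0 ? 0 : -1;
}

/* Select the userspace governor, then request a frequency in kHz. */
int set_userspace_speed(const char *cpufreq_dir, unsigned int khz)
{
    char buf[32];

    snprintf(buf, sizeof(buf), "%u", khz);
    if (write_attr(cpufreq_dir, "scaling_governor", "userspace") != 0)
        return -1;
    return write_attr(cpufreq_dir, "scaling_setspeed", buf);
}
```

The order matters: scaling_setspeed is only meaningful once the userspace governor has been selected, which is why the governor is written first.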
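2.3 CPUFreq Performance Testing and Tuning
The cpupower tool set lives in tools/power/cpupower in the kernel tree; its cpufreq-bench tool can be used to analyze the impact of CPUFreq on system performance.
cpufreq-bench works by simulating an "idle, busy, idle, busy" pattern at runtime to trigger dynamic frequency changes. For policies such as ondemand, conservative, and interactive, it then reports how the time needed to finish the same computation compares with the time needed under the performance (highest-frequency) policy, as a percentage.
After cross-compiling, the tool can be placed in a directory such as /usr/sbin/ on the target board's file system and run: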
# cpufreq-bench -l 50000 -s 100000 -x 50000 -y 100000 -g ondemand -r 5 -n 5 -v
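In the output, the lines of the form "Round n" are the ones to extract. They give the performance of the policy selected with -g (ondemand here) relative to the performance policy, for example: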
Round 1 - 39.74%
Round 2 - 36.35%
Round 3 - 47.91%
Round 4 - 54.22%
Round 5 - 58.64%
This result is not ideal. With Android's interactive policy, the new results are:
Round 1 - 72.95%
Round 2 - 87.20%
Round 3 - 91.21%
Round 4 - 94.10%
Round 5 - 94.93%
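As a general goal, performance with CPUFreq dynamically adjusting frequency and voltage should reach about 90% of that under the high-performance performance policy.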
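One way to turn the per-round percentages above into a single figure is to average them and compare against the target. The sketch below does only that arithmetic; `mean_ratio()` and `meets_target()` are hypothetical helpers, and averaging all rounds (rather than, say, looking only at the later rounds) is an assumption about how to summarize the data.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of the per-round percentages reported by the benchmark. */
double mean_ratio(const double *rounds_pct, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(rounds_pct != NULL && n > 0);
    for (i = 0; i < n; i++)
        sum += rounds_pct[i];
    return sum / (double)n;
}

/* Returns 1 if the mean reaches target_pct, 0 otherwise. */
int meets_target(const double *rounds_pct, size_t n, double target_pct)
{
    return mean_ratio(rounds_pct, n) >= target_pct;
}
```

Applied to the two result sets above, the ondemand rounds average about 47.4% and the interactive rounds about 88.1%. The interactive figure is pulled down by the first round; its last three rounds are each above 90%.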
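2.4 CPUFreq Notifications
The CPUFreq subsystem sends notifications in two situations: when the CPUFreq policy changes and when the CPU's running frequency changes.
During a policy change, three notifications are sent:
- CPUFREQ_ADJUST: every registered notifier may modify the range (i.e. policy->min and policy->max) according to hardware or thermal conditions
- CPUFREQ_INCOMPATIBLE: registered notifiers must not change the range or other settings unless the settings made so far could cause a hardware fault
- CPUFREQ_NOTIFY: every registered notifier is told that the new policy has been set
During a frequency change, two notifications are sent:
- CPUFREQ_PRECHANGE: the frequency is about to change
- CPUFREQ_POSTCHANGE: the frequency change has completed
The code that sends CPUFREQ_PRECHANGE and CPUFREQ_POSTCHANGE is:
srcu_notifier_call_chain(&cpufreq_transition_notifier_list, CPUFREQ_PRECHANGE, freqs);
srcu_notifier_call_chain(&cpufreq_transition_notifier_list, CPUFREQ_POSTCHANGE, freqs);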
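The bracketing of a frequency change by these two calls can be modeled in userspace with a small callback chain. Everything in this sketch is a stand-in: the `PRECHANGE`/`POSTCHANGE` constants, `struct freqs`, and the `notifier_*` functions only imitate the shape of the kernel's notifier API and are not the real implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for CPUFREQ_PRECHANGE and CPUFREQ_POSTCHANGE. */
enum { PRECHANGE = 0, POSTCHANGE = 1 };

struct freqs {
    unsigned int old_khz;
    unsigned int new_khz;
};

typedef int (*notifier_fn)(unsigned long event, struct freqs *f, void *ctx);

struct notifier {
    notifier_fn fn;
    void *ctx;
    struct notifier *next;
};

/* Add a listener to the head of the chain. */
void notifier_register(struct notifier **head, struct notifier *nb)
{
    assert(head != NULL && nb != NULL);
    nb->next = *head;
    *head = nb;
}

/* Call every listener in the chain with the given event. */
void notifier_call_chain(struct notifier *head, unsigned long event,
                         struct freqs *f)
{
    for (; head != NULL; head = head->next)
        head->fn(event, f, head->ctx);
}

/* Bracket a frequency change with the two notifications. */
void change_frequency(struct notifier *head, unsigned int *cur_khz,
                      unsigned int new_khz)
{
    struct freqs f;

    assert(cur_khz != NULL);
    f.old_khz = *cur_khz;
    f.new_khz = new_khz;
    notifier_call_chain(head, PRECHANGE, &f);
    *cur_khz = new_khz;     /* stands in for the driver's hardware write */
    notifier_call_chain(head, POSTCHANGE, &f);
}

/* Example listener: records the order of events it receives. */
struct event_log {
    unsigned long events[8];
    size_t count;
};

int log_event(unsigned long event, struct freqs *f, void *ctx)
{
    struct event_log *log = ctx;

    (void)f;
    if (log->count < 8)
        log->events[log->count++] = event;
    return 0;
}
```

A listener registered on this chain sees PRECHANGE while the old frequency is still in effect and POSTCHANGE once the new one is, which is what lets a peripheral driver pause before the change and reprogram itself afterwards.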
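A module that cares about the CPUFREQ_PRECHANGE or CPUFREQ_POSTCHANGE events can simply monitor them with the Linux notifier mechanism. For example, drivers/video/sa1100fb.c has to reconfigure its own hardware while the CPU frequency changes, so it registers notifiers and handles CPUFREQ_PRECHANGE and CPUFREQ_POSTCHANGE differently. An example follows: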
static int sa1100fb_probe(struct platform_device *pdev)
{
    ....
    fbi->freq_transition.notifier_call = sa1100fb_freq_transition;
    fbi->freq_policy.notifier_call = sa1100fb_freq_policy;
    cpufreq_register_notifier(&fbi->freq_transition, CPUFREQ_TRANSITION_NOTIFIER);
    cpufreq_register_notifier(&fbi->freq_policy, CPUFREQ_POLICY_NOTIFIER);
    ....
}

/*
 * CPU clock speed change handler. We need to adjust the LCD timing
 * parameters when the CPU clock is adjusted by the power management
 * subsystem.
 */
static int
sa1100fb_freq_transition(struct notifier_block *nb, unsigned long val,
                         void *data)
{
    struct sa1100fb_info *fbi = TO_INF(nb, freq_transition);
    u_int pcd;

    switch (val) {
    case CPUFREQ_PRECHANGE:
        set_ctrlr_state(fbi, C_DISABLE_CLKCHANGE);
        break;
    case CPUFREQ_POSTCHANGE:
        pcd = get_pcd(fbi, fbi->fb.var.pixclock);
        fbi->reg_lccr3 = (fbi->reg_lccr3 & ~0xff) | LCCR3_PixClkDiv(pcd);
        set_ctrlr_state(fbi, C_ENABLE_CLKCHANGE);
        break;
    }
    return 0;
}
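In addition, if the CPU frequency changes during system suspend/resume, the CPUFreq subsystem also sends the CPUFREQ_SUSPENDCHANGE and CPUFREQ_RESUMECHANGE notifications.
Besides the CPU, some non-CPU devices also support multiple operating frequencies and voltages, i.e. they have multiple OPPs. Kernels from Linux 3.2 onward also support DVFS for such non-CPU devices through a subsystem called devfreq. Just as CPUFreq has its drivers/cpufreq directory, the kernel also has a drivers/devfreq directory.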