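OProfile Implementation on i386
I. Introduction
OProfile is an important performance-profiling tool on Linux, useful for optimizing and evaluating a system. For example, suppose one of our boards shows low effective CPU utilization, meaning the tasks doing the real work get little execution time. We then need to find out which tasks are consuming most of the CPU time, and this is where OProfile comes in.
II. Principle
Most modern CPUs provide hardware support for monitoring performance events, such as counting cache misses or branch mispredictions. When such an event occurs, the CPU increments a designated register; once the count reaches a given value (usually configurable by software) it raises an interrupt, and the interrupt handler does the actual bookkeeping. (On the 386 this is the LAPIC performance-monitoring interrupt described in the previous chapter.)
If the hardware has no performance-monitoring support, software has little choice: it can only sample from the timer interrupt, which it does control. As long as enough samples are taken and the system runs in a reasonably steady state, this still gives usable results.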
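The counting scheme described above can be sketched in plain C. This is a user-space simulation rather than kernel code, and `struct sampler`, `sampler_event` and the threshold value are hypothetical names chosen for illustration. Each event bumps a counter; when the counter reaches the configured threshold, the simulated interrupt handler records one sample and counting restarts.

```c
#include <assert.h>

/* Hypothetical model of an event counter that "interrupts" every
 * `threshold` events. Names are illustrative, not taken from OProfile. */
struct sampler {
	unsigned long threshold; /* events per sample, set by software */
	unsigned long count;     /* events seen since the last sample */
	unsigned long samples;   /* number of simulated interrupts so far */
	unsigned long last_pc;   /* address recorded by the latest sample */
};

/* Account for one event at address `pc`.
 * Returns 1 if this event triggered a sample, 0 otherwise. */
int sampler_event(struct sampler *s, unsigned long pc)
{
	assert(s->threshold > 0); /* a zero threshold makes no sense */
	if (++s->count < s->threshold)
		return 0;
	s->count = 0;    /* restart counting, as the handler would */
	s->samples++;
	s->last_pc = pc; /* a real handler would log this address */
	return 1;
}
```

With a threshold of 4, feeding ten events produces two samples and leaves two events pending. A larger threshold lowers the overhead but also the resolution of the profile.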
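III. The i386 implementation
1. Hardware support
The P4 processor has a set of registers for this purpose. They are integrated into the CPU and exposed as Model-Specific Registers (MSRs). Monitoring a given event involves three groups of registers.
(1) The counter configuration control registers (CCCRs).
OProfile's P4 driver uses 8 of these (the processor itself provides 18, one per performance counter). Each CCCR is hard-wired one-to-one to a performance counter: given a CCCR, its counter is fixed, and software cannot change this pairing.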
• Enable flag, bit 12: When set, enables counting; when clear, the counter is disabled. This flag is cleared on reset.
• ESCR select field, bits 13 through 15: Identifies the ESCR to be used to select events to be counted with the counter associated with the CCCR. This is the most important field: it holds the number of the event selection register, which can be thought of as choosing the control policy. The actual working entity (the counter) is fixed by the hardware pairing, so it needs no selection.
• Compare flag, bit 18: When set, enables filtering of the event count; when clear, disables filtering. The filtering method is selected with the threshold, complement, and edge flags.
• Complement flag, bit 19: Selects how the incoming event count is compared with the threshold value. When set, event counts that are less than or equal to the threshold value result in a single count being delivered to the performance counter; when clear, counts greater than the threshold value result in a count being delivered to the performance counter (see Section 30.9.5.2, “Filtering Events”). The complement flag is not active unless the compare flag is set.
• Threshold field, bits 20 through 23: Selects the threshold value to be used for comparisons. The processor examines this field only when the compare flag is set, and uses the complement flag setting to determine the type of threshold comparison to be made. The useful range of values that can be entered in this field depends on the type of event being counted (see Section 30.9.5.2, “Filtering Events”).
• Edge flag, bit 24: When set, enables rising edge (false-to-true) detection of the threshold comparison output for filtering event counts; when clear, rising edge detection is disabled. This flag is active only when the compare flag is set.
• FORCE_OVF flag, bit 25: When set, forces a counter overflow on every counter increment; when clear, overflow only occurs when the counter actually overflows.
• OVF_PMI flag, bit 26: When set, causes a performance monitor interrupt (PMI) to be generated when the counter overflow occurs; when clear, disables PMI generation. Note that the PMI is generated on the next event count after the counter has overflowed. In other words, with this bit set, an overflow raises a PMI, which is a LAPIC interrupt. Linux programs the LAPIC so that this interrupt is delivered to the CPU as an NMI.
• Cascade flag, bit 30: When set, enables counting on one counter of a counter pair when its alternate counter in the other counter pair in the same counter group overflows (see Section 30.9.2, “Performance Counters,” for further details); when clear, disables cascading of counters.
• OVF flag, bit 31: Indicates that the counter has overflowed when set. This flag is a sticky flag that must be explicitly cleared by software.
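To make the field layout concrete, here is a minimal sketch that composes a CCCR value from the bit positions listed above. The macro names and the helper `cccr_for_sampling` are hypothetical, not OProfile's own. The sketch covers only the fields in the list; real hardware may also require reserved bits that are not described here.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions taken from the CCCR field list above. */
#define CCCR_ENABLE          (1u << 12)
#define CCCR_ESCR_SELECT(n)  (((uint32_t)(n) & 0x7u) << 13)
#define CCCR_COMPARE         (1u << 18)
#define CCCR_COMPLEMENT      (1u << 19)
#define CCCR_THRESHOLD(n)    (((uint32_t)(n) & 0xfu) << 20)
#define CCCR_EDGE            (1u << 24)
#define CCCR_FORCE_OVF       (1u << 25)
#define CCCR_OVF_PMI         (1u << 26)
#define CCCR_CASCADE         (1u << 30)
#define CCCR_OVF             (1u << 31)

/* Hypothetical helper: build a CCCR value that counts the events chosen
 * through ESCR number `escr_select` and raises a PMI on overflow. */
uint32_t cccr_for_sampling(unsigned int escr_select)
{
	assert(escr_select <= 7); /* the select field is only 3 bits wide */
	return CCCR_ENABLE | CCCR_ESCR_SELECT(escr_select) | CCCR_OVF_PMI;
}
```

For example, `cccr_for_sampling(5)` sets bit 12, writes 5 into bits 13 through 15, and sets bit 26, leaving all filtering flags clear.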
(2) The performance counter registers.
These registers are simple: they just accumulate a count. Note, however, that each one is 40 bits wide.
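Since the interrupt fires on overflow, a common way to sample every N events is to preload the counter with the 40-bit two's complement of N. The sketch below assumes that technique; the name `counter_preload` is hypothetical and the text above does not say how the counter is initialized.

```c
#include <assert.h>
#include <stdint.h>

#define COUNTER_BITS 40
#define COUNTER_MASK ((UINT64_C(1) << COUNTER_BITS) - 1)

/* Value to load so that a 40-bit counter wraps to zero after exactly
 * `n` further events. Assumes overflow means wrapping past all-ones. */
uint64_t counter_preload(uint64_t n)
{
	assert(n > 0 && n <= COUNTER_MASK);
	return (UINT64_C(0) - n) & COUNTER_MASK;
}
```

Loading `counter_preload(100000)` leaves exactly 100000 increments before the counter reaches 2^40 and wraps.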
(3) ESCR configuration
• USR flag, bit 2: When set, events are counted when the processor is operating at a current privilege level (CPL) of 1, 2, or 3. These privilege levels are generally used by application code and unprotected operating system code.
• OS flag, bit 3: When set, events are counted when the processor is operating at CPL of 0. This privilege level is generally reserved for protected operating system code. (When both the OS and USR flags are set, events are counted at all privilege levels.)
• Tag enable, bit 4: When set, enables tagging of μops to assist in at-retirement event counting; when clear, disables tagging. See Section 30.9.6, “At-Retirement Counting.”
• Tag value field, bits 5 through 8: Selects a tag value to associate with a μop to assist in at-retirement event counting.
• Event mask field, bits 9 through 24: Selects events to be counted from the event class selected with the event select field. This is the event mask.
• Event select field, bits 25 through 30: Selects a class of events to be counted. The events within this class that are counted are selected with the event mask field. This is the event selector.
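The ESCR layout can be sketched the same way. The macro names and `escr_build` are hypothetical, and the event numbers in the usage note are arbitrary illustration values rather than a recommendation for any particular event.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions taken from the ESCR field list above. */
#define ESCR_USR              (1u << 2)
#define ESCR_OS               (1u << 3)
#define ESCR_TAG_ENABLE       (1u << 4)
#define ESCR_TAG_VALUE(v)     (((uint32_t)(v) & 0xfu) << 5)
#define ESCR_EVENT_MASK(m)    (((uint32_t)(m) & 0xffffu) << 9)
#define ESCR_EVENT_SELECT(e)  (((uint32_t)(e) & 0x3fu) << 25)

/* Hypothetical helper: select an event class and mask, and choose the
 * privilege levels at which events are counted. */
uint32_t escr_build(unsigned int event_select, unsigned int event_mask,
		    int count_user, int count_kernel)
{
	uint32_t v;

	assert(event_select <= 0x3f); /* 6-bit field */
	assert(event_mask <= 0xffff); /* 16-bit field */
	v = ESCR_EVENT_SELECT(event_select) | ESCR_EVENT_MASK(event_mask);
	if (count_user)
		v |= ESCR_USR;
	if (count_kernel)
		v |= ESCR_OS;
	return v;
}
```

Calling `escr_build(0x13, 0x1, 1, 1)` puts 0x13 into bits 25 through 30, sets mask bit 0 (register bit 9), and enables counting at all privilege levels.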
IV. Code walkthrough
1. i386-specific configuration
Note that the binding between events and ESCRs is not arbitrary: it is a specific many-to-many mapping.
linux-2.6.21\arch\i386\oprofile\op_model_p4.c
Everything in the kernel is organized around events: the event is the key used to look up all the registers it needs.
struct p4_event_binding {
	/* Goes into the ESCR select field of a CCCR. It picks which ESCR to
	 * use; the event itself is then configured in that ESCR. */
	int escr_select;  /* value to put in CCCR */
	/* Goes into the event select field of the ESCR. This picks an event,
	 * whereas escr_select picks an event selection register. */
	int event_select; /* value to put in ESCR */
	struct {
		/* Virtual counter number: an index into the counter_binding
		 * array, whose entry gives the matching counter and CCCR. */
		int virt_counter; /* for this counter... */
		/* Absolute MSR address of the ESCR to program. */
		int escr_address; /* use this ESCR */
	} bindings[2];
};
The next structure maps a CCCR address to the address of its counter register. Because each CCCR is paired with exactly one counter, knowing one gives the other. See:
Table 30-28. Performance Counter MSRs and Associated CCCR and
ESCR MSRs (Pentium 4 and Intel Xeon Processors) (Contd.)
/* tables to simulate simplified hardware view of p4 registers */
struct p4_counter_binding {
int virt_counter;
int counter_address;
int cccr_address;
};
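The way the two tables combine can be sketched as follows: an event binding names a virtual counter and an ESCR address, and the virtual counter indexes the counter table to obtain the CCCR and counter addresses. The structures are copied from above; the MSR numbers, the `example_event` entry, `struct msr_triple` and `resolve_msrs` are invented for this illustration and are not real register addresses or kernel functions.

```c
#include <assert.h>
#include <stddef.h>

struct p4_counter_binding {
	int virt_counter;
	int counter_address;
	int cccr_address;
};

struct p4_event_binding {
	int escr_select;
	int event_select;
	struct {
		int virt_counter;
		int escr_address;
	} bindings[2];
};

/* Placeholder MSR numbers, invented for this example only. */
static const struct p4_counter_binding counters[] = {
	{ 0, 0x100, 0x200 },
	{ 1, 0x101, 0x201 },
};

static const struct p4_event_binding example_event = {
	.escr_select = 0x06, .event_select = 0x13,
	.bindings = { { 0, 0x300 }, { 1, 0x301 } },
};

/* The three MSRs to program for one event on one counter. */
struct msr_triple { int escr, cccr, counter; };

/* Resolve the registers for `ev` on virtual counter `virt`.
 * Returns 0 on success, -1 if the event cannot use that counter. */
int resolve_msrs(const struct p4_event_binding *ev, int virt,
		 struct msr_triple *out)
{
	assert(ev != NULL && out != NULL);
	if (virt < 0 || (size_t)virt >= sizeof(counters) / sizeof(counters[0]))
		return -1;
	for (size_t i = 0; i < 2; i++) {
		if (ev->bindings[i].virt_counter != virt)
			continue;
		out->escr = ev->bindings[i].escr_address;
		out->cccr = counters[virt].cccr_address;
		out->counter = counters[virt].counter_address;
		return 0;
	}
	return -1;
}
```

The caller would then write `event_select` into the resolved ESCR and `escr_select` into the resolved CCCR.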
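2. NMI handling
(Left empty for now.)
3. The oprofile filesystem
To record large volumes of data, the kernel provides a dedicated oprofile filesystem, which must be mounted before profiling. Once it is mounted, a set of control and data files is created in its root directory. The most important one is the buffer file, which holds the sample data collected by the kernel. The user-space program interacts with the kernel by repeatedly reading from this file.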
const struct file_operations event_buffer_fops = {
.open = event_buffer_open,
.release = event_buffer_release,
.read = event_buffer_read,
};
void oprofile_create_files(struct super_block * sb, struct dentry * root)
oprofilefs_create_file(sb, root, "buffer", &event_buffer_fops);
static ssize_t event_buffer_read(struct file * file, char __user * buf,
size_t count, loff_t * offset)
if (copy_to_user(buf, event_buffer, count))
goto out;
Call chain: add_sample -> add_us_sample -> add_sample_entry -> add_event_entry
/* Add an entry to the event buffer. When we
* get near to the end we wake up the process
* sleeping on the read() of the file.
*/
void add_event_entry(unsigned long value)
{
if (buffer_pos == buffer_size) {
atomic_inc(&oprofile_stats.event_lost_overflow);
return;
}
event_buffer[buffer_pos] = value;
if (++buffer_pos == buffer_size - buffer_watershed) {
atomic_set(&buffer_ready, 1);
wake_up(&buffer_wait);
}
}
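The watershed logic is easy to try out in user space. The sketch below mirrors add_event_entry with a tiny fixed buffer; `struct event_buf`, `buf_add`, `buf_drain` and the sizes are hypothetical, and the wake-up is reduced to a flag because there is no wait queue outside the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_SIZE      8 /* illustrative values, not the kernel defaults */
#define BUF_WATERSHED 3

struct event_buf {
	unsigned long data[BUF_SIZE];
	size_t pos;         /* next free slot */
	int ready;          /* set when the reader should be woken */
	unsigned long lost; /* entries dropped because the buffer was full */
};

/* Writer side: same structure as add_event_entry. */
void buf_add(struct event_buf *b, unsigned long value)
{
	assert(b->pos <= BUF_SIZE);
	if (b->pos == BUF_SIZE) {
		b->lost++; /* full: count the loss and drop the entry */
		return;
	}
	b->data[b->pos] = value;
	if (++b->pos == BUF_SIZE - BUF_WATERSHED)
		b->ready = 1; /* stands in for wake_up(&buffer_wait) */
}

/* Reader side: copy out everything collected so far, then reset. */
size_t buf_drain(struct event_buf *b, unsigned long *out)
{
	size_t n = b->pos;

	for (size_t i = 0; i < n; i++)
		out[i] = b->data[i];
	b->pos = 0;
	b->ready = 0;
	return n;
}
```

The reader is signalled while BUF_WATERSHED slots are still free, which gives it time to drain the buffer before entries start being lost.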
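As shown, the read path and the sample-recording path operate on the same kernel data (event_buffer and buffer_pos).
4. OProfile initialization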
static int __init oprofile_init(void)
{
int err;
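/* Prefer the architecture's hardware support; fall back to the timer if it fails. */
err = oprofile_arch_init(&oprofile_ops);
/* If the arch-specific init failed, or timer mode was explicitly requested
 * at boot, profile using the timer interrupt instead. */
if (err < 0 || timer) {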
printk(KERN_INFO "oprofile: using timer interrupt.\n");
oprofile_timer_init(&oprofile_ops);
}
err = oprofilefs_register();
if (err)
oprofile_arch_exit();
return err;
}
For i386: \linux-2.6.21\arch\i386\oprofile\init.c: oprofile_arch_init
ret = op_nmi_init(ops);
op_nmi_init sets ops->setup = nmi_setup, and nmi_setup in turn calls register_die_notifier(&profile_exceptions_nb).
When an NMI occurs, this notifier callback is executed:
static int profile_exceptions_notify(struct notifier_block *self,
unsigned long val, void *data)
{
struct die_args *args = (struct die_args *)data;
int ret = NOTIFY_DONE;
int cpu = smp_processor_id();
switch(val) {
case DIE_NMI:
if (model->check_ctrs(args->regs, &cpu_msrs[cpu]))
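What a handler such as check_ctrs has to do can be sketched from the register description alone: find the counters whose sticky OVF flag is set, record a sample, and clear the flag. The code below is a user-space model under that assumption, not the kernel's implementation; `struct fake_msrs` and `check_counters` are hypothetical, and plain array accesses stand in for MSR reads and writes.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_COUNTERS 2          /* illustrative */
#define CCCR_OVF     (1u << 31) /* sticky overflow flag, bit 31 */

/* Stand-in for the per-counter CCCR MSRs of one CPU. */
struct fake_msrs {
	uint32_t cccr[NUM_COUNTERS];
	unsigned long samples[NUM_COUNTERS];
};

/* Returns 1 if at least one counter had overflowed, 0 otherwise. */
int check_counters(struct fake_msrs *m)
{
	int handled = 0;

	assert(m != 0);
	for (int i = 0; i < NUM_COUNTERS; i++) {
		if (!(m->cccr[i] & CCCR_OVF))
			continue;
		m->samples[i]++;         /* record one sample for counter i */
		m->cccr[i] &= ~CCCR_OVF; /* sticky flag: software clears it */
		handled = 1;
	}
	return handled;
}
```

A real handler would also reload the counter so that the next overflow arrives after the configured number of events.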
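V. How the kernel and user space interact
1. The core file oprofile/buffer
This file is how user space reads the information collected by the kernel. Its file operations are: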
const struct file_operations event_buffer_fops = {
.open = event_buffer_open,
.release = event_buffer_release,
.read = event_buffer_read,
};
static ssize_t event_buffer_read(struct file * file, char __user * buf,
size_t count, loff_t * offset)
{
int retval = -EINVAL;
size_t const max = buffer_size * sizeof(unsigned long);
/* handling partial reads is more trouble than it's worth */
if (count != max || *offset)
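return -EINVAL;
/* A heavy-handed restriction: the size passed to read() must be exactly
 * max, i.e. buffer_size * sizeof(unsigned long). buffer_size does not
 * change with the amount of sample data, so user space computes the
 * read() size as the product of the buffer_size and pointer_size files
 * in the filesystem. If the kernel holds fewer valid samples than that
 * (the usual case), count is reduced to the available size below. */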
wait_event_interruptible(buffer_wait, atomic_read(&buffer_ready));
if (signal_pending(current))
return -EINTR;
/* can't currently happen */
if (!atomic_read(&buffer_ready))
return -EAGAIN;
mutex_lock(&buffer_mutex);
atomic_set(&buffer_ready, 0);
retval = -EFAULT;
/* Shrink count to the samples actually available; this does not block. */
count = buffer_pos * sizeof(unsigned long);
if (copy_to_user(buf, event_buffer, count))
goto out;
retval = count; /* bytes actually copied: samples * sizeof(unsigned long) */
buffer_pos = 0;
out:
mutex_unlock(&buffer_mutex);
return retval;
}
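From the user-space side, the contract of event_buffer_read can be sketched as below: always ask for the full buffer size, then derive the number of samples from the byte count returned. `full_read_size` and `read_batch` are hypothetical helpers, and the file descriptor is assumed to refer to the already opened buffer file; the values of buffer_size and pointer_size would come from the corresponding files in the oprofile filesystem.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Size that read() must be given: the whole buffer, never less. */
size_t full_read_size(size_t buffer_size, size_t pointer_size)
{
	assert(pointer_size > 0);
	return buffer_size * pointer_size;
}

/* Read one batch from an already opened buffer file descriptor.
 * Returns the number of samples read, or -1 on error (errno is set). */
ssize_t read_batch(int fd, void *buf, size_t buffer_size,
		   size_t pointer_size)
{
	ssize_t n;

	do {
		n = read(fd, buf, full_read_size(buffer_size, pointer_size));
	} while (n < 0 && errno == EINTR); /* retry after a signal */

	return n < 0 ? -1 : n / (ssize_t)pointer_size;
}
```

Because the kernel returns only the samples it actually holds, the result is normally smaller than buffer_size even though the full size was requested.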
2. When user space reads
oprofile-0.9.7\daemon\init.c:static void opd_do_read(char * buf, size_t size)
while (1) {
ssize_t count = -1;
/* loop to handle EINTR */
while (count < 0) {
count = op_read_device(devfd, buf, size);
/* we can lose an alarm or a hup but
* we don't care.
*/
if (signal_alarm) {
signal_alarm = 0;
opd_alarm();
}
int main(int argc, char const * argv[])
/* clean up every 10 minutes */
alarm(60 * 10);
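User space also arms a timer, but its interval is rather long: once every 10 minutes. Judging from the comment, this alarm drives periodic cleanup (opd_alarm) rather than the reads themselves. The read in the loop blocks inside event_buffer_read until the kernel sets buffer_ready; when the alarm interrupts it, the call fails with EINTR and the loop simply retries.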