PCIe AER

Concepts

AER: Advanced Error Reporting

PCIe provides two error reporting mechanisms: the baseline capability and the AER capability.

The baseline capability is required of all PCIe components and provides a minimum defined set of error reporting requirements.

AER is implemented with a PCIe Advanced Error Reporting extended capability structure.

AER is the more capable of the two.

A PCIe AER driver is required to support the AER capability; it attaches to root ports.

 

Background Knowledge

PCIe port

To understand the AER driver, first understand the two kinds of PCIe port: Root ports and Switch ports.

 

 

 

Root port:

originates a PCIe link from a PCIe Root complex

Switch port:

upstream port: its secondary bus represents the switch's internal routing logic

downstream port: bridges from the switch's internal routing bus to the bus representing the downstream PCIe link

Each PCIe port device can support up to four services: native hot plug (HP), power management event (PME), advanced error reporting (AER), and virtual channels (VC).

 

The PCIe AER driver is a Root port AER service driver and attaches to the PCIe Port Bus Driver (PBD).

 

 

 

The basic functions AER provides are:

a. Gathers error information when an error occurs.

b. Reports the error to users.

c. Performs error recovery actions.

 

Error message

Error messages originate from PCIe or legacy endpoints and are sent to the corresponding Root ports. There are three types of error message: ERR_COR, ERR_NONFATAL, and ERR_FATAL.

The Root complex translates these error messages into platform-level events.

 

Error classification

Correctable Errors

Description: errors detected and corrected by hardware, without software intervention. They have no impact on the functionality of the interface.

Examples: receiver errors, bad TLP, bad DLLP, REPLAY_NUM rollover, replay timer time-out

Uncorrectable Errors

Description (non-fatal): errors that make the particular transaction unreliable

Description (fatal): errors that make the PCIe link unreliable

Examples: Training Errors, Data Link Protocol errors, poisoned TLP errors, flow control protocol errors, Completion Time-out errors, Completer Abort errors, Unexpected Completion errors, Receiver Overflow errors, Malformed TLPs, ECRC errors, Unsupported Request errors

 

Uncorrectable Error Severity Programming (Advanced Error Reporting)

For device functions that implement AER, the Uncorrectable Error Severity register allows each uncorrectable error to be programmed as either fatal or non-fatal.
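As a rough illustration of how the severity register combines with the status register, the sketch below classifies logged uncorrectable errors in plain user-space C. The `AER_UNC_*` names and both helper functions are hypothetical; the bit values are intended to match the `PCI_ERR_UNC_*` definitions in Linux's pci_regs.h. It assumes a set severity bit means "fatal" and a clear bit means "non-fatal".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; values follow the AER Uncorrectable Error
 * Status/Severity register layout (same bit in both registers). */
#define AER_UNC_DLP        0x00000010u /* Data Link Protocol error */
#define AER_UNC_POISON_TLP 0x00001000u /* Poisoned TLP */
#define AER_UNC_COMP_TIME  0x00004000u /* Completion time-out */
#define AER_UNC_MALF_TLP   0x00040000u /* Malformed TLP */
#define AER_UNC_UNSUP      0x00100000u /* Unsupported Request */

/* True if at least one logged error is programmed as fatal. */
static bool aer_uncor_is_fatal(uint32_t status, uint32_t severity)
{
    return (status & severity) != 0;
}

/* Logged errors whose severity bit is clear, i.e. non-fatal ones. */
static uint32_t aer_uncor_nonfatal_bits(uint32_t status, uint32_t severity)
{
    return status & ~severity;
}
```

With a severity value of `AER_UNC_DLP | AER_UNC_MALF_TLP`, a logged completion time-out is reported as non-fatal, while a logged malformed TLP is reported as fatal.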

 

Advisory Non-Fatal Error cases

Advisory non-fatal error cases are determined by the role of the detecting agent (Requester, Completer, or Receiver) and by the specific error. In such a case, an agent with AER signals the non-fatal error by sending an ERR_COR message as an advisory to software, instead of sending ERR_NONFATAL. An agent without AER sends no error message for these cases, since software receiving ERR_COR would be unable to distinguish advisory non-fatal error cases from the correctable error cases used to assess link integrity.
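The message choice described above can be sketched as a small decision function. This is a simplified user-space model, not spec or kernel code: all type and function names are hypothetical, and it ignores error masking and the reporting-enable bits that real hardware also checks.

```c
#include <stdbool.h>

enum err_class { ERR_CLASS_CORRECTABLE, ERR_CLASS_UNCORRECTABLE };
enum err_msg { MSG_NONE, MSG_ERR_COR, MSG_ERR_NONFATAL, MSG_ERR_FATAL };

/* Hypothetical description of one detected error. */
struct err_event {
    enum err_class cls;
    bool severity_fatal; /* severity programmed as fatal (uncorrectable only) */
    bool advisory;       /* qualifies as an advisory non-fatal case */
    bool has_aer;        /* detecting agent implements AER */
};

/* Pick the error message the detecting agent would send. */
static enum err_msg select_error_message(const struct err_event *e)
{
    if (e->cls == ERR_CLASS_CORRECTABLE)
        return MSG_ERR_COR;
    if (e->severity_fatal)
        return MSG_ERR_FATAL;
    if (e->advisory)
        /* With AER: ERR_COR as an advisory. Without AER: no message. */
        return e->has_aer ? MSG_ERR_COR : MSG_NONE;
    return MSG_ERR_NONFATAL;
}
```

The only branch that depends on AER support is the advisory one, which mirrors the rule that an agent without AER stays silent in those cases.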

[Figure: error message determination flow]

 

 

AER Driver

Basic Flow

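Through the AER extended capability structure, a PCIe device (endpoint) sends an error message to the PCIe Root port. The Root port receives the error message, decodes it, and logs the error in its AER extended capability structure: it stores the requester ID of the error reporting agent in the Error Source Identification register and sets the corresponding error bits in the Root Error Status register. If reporting of that error is enabled in the Root Error Command register, the Root port generates an interrupt. The AER driver is responsible for handling that interrupt.

[Figure: AER error reporting flow]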
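To make the logged information concrete, here is a minimal user-space sketch that decodes a raw Error Source Identification value into requester IDs. The struct and function names are hypothetical. The layout assumed is the usual one: the low 16 bits hold the ERR_COR source, the high 16 bits hold the ERR_FATAL/ERR_NONFATAL source, and a requester ID packs bus (8 bits), device (5 bits), and function (3 bits).

```c
#include <stdint.h>

/* Root Error Status bits (hypothetical names, values as in pci_regs.h). */
#define ROOT_ERR_COR_RCV   0x00000001u /* ERR_COR received */
#define ROOT_ERR_UNCOR_RCV 0x00000004u /* ERR_FATAL/ERR_NONFATAL received */

struct requester_id {
    uint8_t bus;
    uint8_t dev;
    uint8_t fn;
};

/* Split a 16-bit requester ID into bus/device/function. */
static struct requester_id decode_requester_id(uint16_t id)
{
    struct requester_id r;
    r.bus = (uint8_t)(id >> 8);
    r.dev = (uint8_t)((id >> 3) & 0x1f);
    r.fn = (uint8_t)(id & 0x07);
    return r;
}

/* Source of the first ERR_COR message (low half of the register). */
static uint16_t cor_source(uint32_t err_src)
{
    return (uint16_t)(err_src & 0xffffu);
}

/* Source of the first ERR_FATAL/ERR_NONFATAL message (high half). */
static uint16_t uncor_source(uint32_t err_src)
{
    return (uint16_t)(err_src >> 16);
}
```

A handler would first test the Root Error Status bits to learn which half of the source register is valid, then decode that half to find the agent to query.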

 

Initialization

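At boot, the system assigns interrupt vector(s) to each PCIe Root port, and the PCIe AER driver registers its interrupt service handler with the kernel. When a PCIe Root port receives an error message from a downstream device, it sends an interrupt to the CPU, and the kernel then calls the PCIe AER interrupt service handler to process it.

Most BIOS vendors provide their own non-standard error handling mechanisms. To avoid conflicting with the BIOS, the AER driver must request ownership of AER from the BIOS through ACPI _OSC. If the BIOS does not support ACPI _OSC, the AER driver's probe function fails. Otherwise, once the AER driver has taken over AER, it configures the AER capability registers so that the PCIe Root port and devices support PCIe native AER.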

 

Handling of different error types

  • Correctable Errors

AER Driver will:

1. Decode the error message received at the Root port into an error type and agent ID.

2. Based on the agent ID, read detailed error information from the AER capability of the corresponding agent device.

3. Clear the corresponding bit in the correctable error status register of both the PCIe Root port and the agent device.
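The read-then-clear part of these steps can be modeled in user space as below. AER status registers are write-1-to-clear, so the sketch writes back exactly the bits it read. The `fake_aer_dev` struct and the helpers are hypothetical stand-ins for config-space accesses; the bit values are intended to match the `PCI_ERR_COR_*` definitions in pci_regs.h.

```c
#include <stdint.h>

/* Correctable Error Status bits (hypothetical names). */
#define AER_COR_RCVR      0x00000001u /* Receiver Error */
#define AER_COR_BAD_TLP   0x00000040u /* Bad TLP */
#define AER_COR_BAD_DLLP  0x00000080u /* Bad DLLP */
#define AER_COR_REP_ROLL  0x00000100u /* REPLAY_NUM rollover */
#define AER_COR_REP_TIMER 0x00001000u /* Replay timer time-out */

/* Stand-in for a device's AER capability in config space. */
struct fake_aer_dev {
    uint32_t cor_status;
};

/* Model a write to a write-1-to-clear register. */
static void rw1c_write(uint32_t *reg, uint32_t value)
{
    *reg &= ~value;
}

/* Read the logged correctable errors, then clear only those bits. */
static uint32_t handle_correctable(struct fake_aer_dev *dev)
{
    uint32_t logged = dev->cor_status;
    rw1c_write(&dev->cor_status, logged);
    return logged;
}
```

Writing back the value that was read, rather than all ones, avoids wiping out an error that is logged between the read and the write.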

  • Uncorrectable Non-fatal Errors

AER Driver will:

1. Decode the error message received at the Root port into an error type and agent ID.

2. Based on the agent ID, read detailed error information from the AER capability of the corresponding agent device.

3. Perform the non-fatal error recovery procedure. To take part in error recovery, device drivers need to implement error callbacks.

  • Fatal errors

The PCIe link will be reset. In the figure's topology, if Root port P0 reports a fatal error on itself, the AER driver resets the link between Root port P0 and upstream port P1.

If the endpoint device E1 reports a fatal error, the PCI Express AER driver chooses to reset the upstream link of E1, i.e. the link between P2 and E1.

[Figure: example topology with Root port P0, upstream port P1, port P2, and endpoint E1]

 

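Impact on Device Drivers

  • Enable the device's error reporting capability in the probe function.

int pci_enable_pcie_error_reporting(struct pci_dev *dev);

This lets the device send error messages to the root port when an error is detected. The function returns 0 on success and a negative error code if the device does not support the PCI Express capability.

Correspondingly, pci_disable_pcie_error_reporting() stops the device from sending error messages to the root port.

int pci_disable_pcie_error_reporting(struct pci_dev *dev);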

  • Implement error callbacks as described in pci_error_handlers
struct pci_error_handlers {
    /* PCI bus error detected on this device */
    pci_ers_result_t (*error_detected)(struct pci_dev *dev, enum pci_channel_state error);
    /* MMIO has been re-enabled, but not DMA */
    pci_ers_result_t (*mmio_enabled)(struct pci_dev *dev);
    /* PCI slot has been reset */
    pci_ers_result_t (*slot_reset)(struct pci_dev *dev);
    /* Device driver may resume normal operations */
    void (*resume)(struct pci_dev *dev);
};        

Example:

/* Callbacks this driver provides for error recovery */
static const struct pci_error_handlers xxx_err_handler = {
    .error_detected = xxx_error_detected,
    .slot_reset = xxx_slot_reset,
    .resume = xxx_resume,
};

/* Hook the callbacks into the driver; other fields omitted */
static struct pci_driver xxx_driver = {
    .err_handler = &xxx_err_handler,
};
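To show roughly how such callbacks are sequenced during recovery, the following user-space mock walks a device through a simplified version of the flow. Everything here is a stand-in: the `mock_*` types, the `ERS_*` and `CHANNEL_*` enumerators, and `run_recovery()` are hypothetical and only loosely mirror the kernel's result codes and channel states. It also assumes a missing optional callback simply skips that step, which is a simplification.

```c
#include <stddef.h>

typedef enum {
    ERS_CAN_RECOVER, /* driver can recover without a reset */
    ERS_NEED_RESET,  /* driver wants the slot/link reset */
    ERS_DISCONNECT,  /* device is lost */
    ERS_RECOVERED    /* device is usable again */
} ers_result_t;

enum channel_state { CHANNEL_IO_NORMAL, CHANNEL_IO_FROZEN, CHANNEL_IO_PERM_FAILURE };

struct mock_dev { int resumed; };

struct mock_error_handlers {
    ers_result_t (*error_detected)(struct mock_dev *dev, enum channel_state st);
    ers_result_t (*mmio_enabled)(struct mock_dev *dev);
    ers_result_t (*slot_reset)(struct mock_dev *dev);
    void (*resume)(struct mock_dev *dev);
};

/* Simplified recovery sequence; returns 1 if the device resumed. */
static int run_recovery(struct mock_dev *dev,
                        const struct mock_error_handlers *h,
                        enum channel_state state)
{
    ers_result_t st = h->error_detected(dev, state);

    if (st == ERS_CAN_RECOVER && h->mmio_enabled != NULL)
        st = h->mmio_enabled(dev);
    if (st == ERS_NEED_RESET && h->slot_reset != NULL)
        st = h->slot_reset(dev); /* invoked once the reset has been done */
    if (st != ERS_RECOVERED)
        return 0;
    h->resume(dev);
    return 1;
}

/* Example driver callbacks, in the style of the xxx_ driver above. */
static ers_result_t xxx_error_detected(struct mock_dev *dev, enum channel_state st)
{
    (void)dev;
    return st == CHANNEL_IO_PERM_FAILURE ? ERS_DISCONNECT : ERS_NEED_RESET;
}

static ers_result_t xxx_slot_reset(struct mock_dev *dev)
{
    (void)dev;
    return ERS_RECOVERED;
}

static void xxx_resume(struct mock_dev *dev)
{
    dev->resumed = 1;
}

static const struct mock_error_handlers xxx_err_handler = {
    .error_detected = xxx_error_detected,
    .slot_reset = xxx_slot_reset,
    .resume = xxx_resume,
};
```

In this mock, a frozen channel leads through error_detected, slot_reset, and resume, while a permanent failure stops after error_detected and the device never resumes.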
   
  • Clear the uncorrectable error status register
int pci_cleanup_aer_uncorrect_error_register(struct pci_dev *dev);

The AER driver clears only the correctable error status register when processing errors. For uncorrectable errors, the device driver should clear the status itself, since it may need to do more specific processing first. A driver usually calls this function in its slot_reset or resume callback.

 

Testing

aer-inject tools: https://github.com/jderrick/aer-inject

 

Reference

Enable PCI Express Advanced Error Reporting in the Kernel: https://www.kernel.org/doc/ols/2007/ols2007v2-pages-297-304.pdf

PCIe spec: https://astralvx.com/storage/2020/11/PCI_Express_Base_4.0_Rev0.3_February19-2014.pdf

 

posted @ 2022-12-05 14:48  风一样的liz