ingram14
Original blog: https://blog.csdn.net/wangpeng22

NVMe Protocol Debugging Summary

1. NVMe Q&A

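What is NVMe?

Baidu Baike puts it this way:

       NVMe (Non-Volatile Memory Express) is an AHCI-like protocol designed specifically for flash storage. (Baidu Baike describes it as built on the M.2 interface; strictly speaking it runs over PCIe, and M.2 is only one of the form factors that carry it.) Its specific advantages include:

① several-fold higher performance;

② latency reduced by more than 50%;

③ an NVMe PCIe SSD can deliver ten times the IOPS of a high-end enterprise SATA SSD;

④ automatic power-state switching and dynamic power management greatly reduce power consumption;

⑤ enough extensibility to support the next ten years of technology development.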

 

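How should a programmer understand it?

       It is a storage protocol. Since it is a storage protocol, does it need fast reads and writes?

       A: Yes.

PCIe is the fastest protocol around, so why not just use PCIe?

A: PCIe is very complicated.

So can we simply wrap PCIe in a simpler jacket?

A: NVMe is exactly that: PCIe wearing a jacket.

How does NVMe pull that off?

A: PCIe is an essay question and NVMe is fill-in-the-blank, yet the end result is the same.

How do we fill it in? With what?

A: Follow the tables below and fill in whatever you want to send. A command is 64 bytes in total; fields you do not need are simply set to 0.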

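I/O command (fields listed from the last one down to the first; opcode is the first byte):

appmask | apptag | reftag | dsmgmt | slba | addr | metadata | rsvd | nblocks | control | flags | opcode

Admin command (same ordering):

rsvd11 | numd | offset | lid | prp2 | prp1 | rsvd1 | command_id | flags | opcode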
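To make the fill-in-the-blank idea concrete, here is a minimal sketch of a 64-byte command in C. The names `sqe64` and `sqe64_init` are hypothetical and exist only for this illustration; the layout follows the generic command format of the NVMe specification (opcode in byte 0, command ID in bytes 2-3, NSID in bytes 4-7, PRP1/PRP2 in bytes 24-39) and is not the exact structure of any particular driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name: a generic 64-byte NVMe submission queue entry. */
struct sqe64 {
    uint8_t  opcode;      /* byte 0: which command this is */
    uint8_t  flags;       /* byte 1: fused operation / data pointer type */
    uint16_t command_id;  /* bytes 2-3: echoed back in the completion */
    uint32_t nsid;        /* bytes 4-7: namespace ID */
    uint32_t cdw2[2];     /* bytes 8-15: reserved in most commands */
    uint64_t metadata;    /* bytes 16-23: metadata pointer */
    uint64_t prp1;        /* bytes 24-31: first data pointer */
    uint64_t prp2;        /* bytes 32-39: second data pointer */
    uint32_t cdw10[6];    /* bytes 40-63: command-specific dwords 10..15 */
};

/* Every command is exactly 64 bytes, whatever it does. */
static_assert(sizeof(struct sqe64) == 64, "an NVMe command must be 64 bytes");
static_assert(offsetof(struct sqe64, prp1) == 24, "PRP1 lives at byte 24");

/* Zero every blank first, then fill in only the fields that are needed. */
static void sqe64_init(struct sqe64 *cmd, uint8_t opcode,
                       uint16_t command_id, uint32_t nsid)
{
    memset(cmd, 0, sizeof(*cmd));
    cmd->opcode = opcode;
    cmd->command_id = command_id;
    cmd->nsid = nsid;
}
```

Clearing the whole structure before setting individual fields is what "fill unused fields with 0" amounts to in practice.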

 

 

 

 

 

 

 

 

 

 

 

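Where does NVMe sit?

NVMe is a protocol for communication between the Host and the SSD. In the protocol stack it belongs to the upper layer, running on top of PCIe.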

 

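Which words go into which blanks of an NVMe command?

NVMe defines the commands exchanged between the Host and the SSD, and how those commands are executed.

NVMe has two kinds of commands. Admin Commands let the Host manage and control the SSD; I/O Commands move data between the Host and the SSD. The commands supported by NVMe 1.2 are listed below:

Admin Commands supported by NVMe (table image not preserved)

I/O Commands supported by NVMe (table image not preserved)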


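What if commands arrive faster than I can execute them?

Set up two buffers:

Submission buffer: Submission Queue (SQ).

Completion buffer: Completion Queue (CQ).

Once something has been handled, how do I let the other side know?

Just write this register: the Doorbell Register (DB).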
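As a small illustration of where those doorbells live, the sketch below computes the register offsets inside the controller's BAR0. The formula is the one given in the NVMe specification (doorbells start at offset 0x1000, one SQ Tail and one CQ Head doorbell per queue, spaced by CAP.DSTRD); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NVME_DB_BASE 0x1000u  /* doorbell registers start at BAR0 + 0x1000 */

/* Offset of the Submission Queue Tail doorbell of queue `qid`.
 * `dstrd` is the doorbell stride field read from the CAP register. */
static uint64_t sq_tail_db_offset(uint16_t qid, uint8_t dstrd)
{
    assert(dstrd <= 15);  /* CAP.DSTRD is a 4-bit field */
    return NVME_DB_BASE + (uint64_t)(2u * qid) * (4u << dstrd);
}

/* Offset of the Completion Queue Head doorbell of queue `qid`. */
static uint64_t cq_head_db_offset(uint16_t qid, uint8_t dstrd)
{
    assert(dstrd <= 15);
    return NVME_DB_BASE + (uint64_t)(2u * qid + 1u) * (4u << dstrd);
}
```

The host writes the SQ Tail doorbell after queuing commands and the CQ Head doorbell after consuming completions; queue 0 is the admin queue pair.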

What does the system architecture look like?


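What on earth is a namespace?

A namespace is a region of flash that the host sees as one range of logical blocks. Each namespace has an ID, called the namespace ID (NSID).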

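How do NVMe and the SSD work together?

Example: the Host needs to read nblock = 2 of data starting at flash address 0x02000000, and PRP1 gives the memory address 0x10000000. How is this done?

First we assemble an nvme_cmd packet. It is a read command that contains the address to read from (0x02000000), the length (nblock = 2), and where to put the data (the PRP). We then drop the packet into the SQ and write the doorbell to tell the controller that work has arrived. The controller fetches the command, reads the data at 0x02000000 from flash, and writes it to 0x10000000 in host memory using PCIe Memory Write TLPs. It then writes a completion entry at the tail of the CQ. Once the Host has consumed that entry, it writes the CQ doorbell to tell the controller that it is done. Step by step:

1: The Host places the command in the SQ.
2: The Host writes the SQ Tail DB to tell the SSD to come and fetch the command.
3: The SSD receives the notification and fetches the command from the SQ on the Host side. Over PCIe this is done by sending a Memory Read TLP to the Host's SQ.
4: The SSD executes the read command: it reads the data from flash into its cache, then transfers the data to the Host.
5: The SSD writes the status back into the Host's CQ.
6: The SSD raises an interrupt to tell the Host to process the CQ.
7: The Host processes the corresponding CQ entry, then updates the CQ Head DB so that the SSD knows the entry has been consumed.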
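The steps above can be sketched as a tiny in-memory simulation. This is a toy model under stated assumptions, not driver code: all type and function names are hypothetical, the "device" is an ordinary function, the doorbells are plain variables, and no data is actually moved. It does follow the real mechanism in two respects: queues are ring buffers, and each completion carries a phase bit that flips on every pass so the Host can tell new entries from stale ones.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QDEPTH 4  /* toy queue depth, chosen only for illustration */

struct cmd { uint8_t opcode; uint16_t cid; uint64_t slba; uint16_t nlb; uint64_t prp1; };
struct cqe { uint16_t cid; uint16_t status; uint8_t phase; };

struct queue_pair {
    struct cmd sq[QDEPTH];
    struct cqe cq[QDEPTH];
    uint16_t sq_tail, sq_tail_db;  /* host side: next free slot, doorbell value */
    uint16_t sq_head;              /* device side: next command to fetch */
    uint16_t cq_tail;              /* device side: next completion slot */
    uint16_t cq_head, cq_head_db;  /* host side: next completion, doorbell value */
    uint8_t  dev_phase, host_phase;
};

static void qp_init(struct queue_pair *qp)
{
    assert(qp != NULL);
    memset(qp, 0, sizeof(*qp));
    qp->dev_phase = 1;   /* the first pass through the CQ is tagged with phase 1 */
    qp->host_phase = 1;
}

/* Steps 1-2: put the command in the SQ, then ring the SQ tail doorbell.
 * Simplification: the host peeks at sq_head directly. */
static int host_submit(struct queue_pair *qp, const struct cmd *c)
{
    uint16_t next = (uint16_t)((qp->sq_tail + 1) % QDEPTH);
    if (next == qp->sq_head)
        return -1;                       /* SQ is full */
    qp->sq[qp->sq_tail] = *c;
    qp->sq_tail = next;
    qp->sq_tail_db = qp->sq_tail;        /* the doorbell write */
    return 0;
}

/* Steps 3-6: fetch commands, "execute" them, post completions.
 * The return value stands in for the interrupt: number of new CQ entries. */
static int device_process(struct queue_pair *qp)
{
    int posted = 0;
    while (qp->sq_head != qp->sq_tail_db) {
        uint16_t next_cq = (uint16_t)((qp->cq_tail + 1) % QDEPTH);
        if (next_cq == qp->cq_head_db)
            break;                       /* CQ is full: wait for the host */
        struct cmd c = qp->sq[qp->sq_head];
        qp->sq_head = (uint16_t)((qp->sq_head + 1) % QDEPTH);
        /* A real device would now transfer the data to c.prp1. */
        qp->cq[qp->cq_tail] =
            (struct cqe){ .cid = c.cid, .status = 0, .phase = qp->dev_phase };
        qp->cq_tail = next_cq;
        if (qp->cq_tail == 0)
            qp->dev_phase ^= 1;          /* flip the phase on wrap-around */
        posted++;
    }
    return posted;
}

/* Step 7: consume one completion, then ring the CQ head doorbell. */
static int host_reap(struct queue_pair *qp, struct cqe *out)
{
    if (qp->cq[qp->cq_head].phase != qp->host_phase)
        return 0;                        /* nothing new yet */
    *out = qp->cq[qp->cq_head];
    qp->cq_head = (uint16_t)((qp->cq_head + 1) % QDEPTH);
    if (qp->cq_head == 0)
        qp->host_phase ^= 1;
    qp->cq_head_db = qp->cq_head;        /* the doorbell write */
    return 1;
}
```

For the example in the text, the command would carry `slba = 0x02000000` and `prp1 = 0x10000000`. Note that the NVMe specification encodes the block count as a zero-based value, so a two-block read stores 1 in `nlb`.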

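2. NVMe debugging preparation

       This debugging session uses a third-party NVMe card, with Linux kernel 3.11.10 as the software environment. After the card is inserted, device 1987:5007 shows up in the PCI tree, as shown:


The NVMe card is now recognized as a PCI device, so the next step is to port the driver. Download and unpack linux 3.11.10, extract the three files nvme-core.c, nvme-scsi.c, and nvme.h, then write a makefile as follows:


Then load the driver with # insmod nvme_driver.ko, after which the nvme devices appear:


Note: the nvme0 device is the one for which we register file_operations, while nvme0n1 corresponds to block_device_operations.

Now that both the device and the driver are working, we can use ioctl to debug the command-issuing tool and the command parsing.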

2. Getting the namespace_ID and sg_version

       Getting the namespace_id is the simplest ioctl operation, so the code is not pasted here. The result is as follows:
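For readers who want to see the shape of such a call anyway, here is a minimal sketch. It assumes the Linux NVMe driver's `NVME_IOCTL_ID` request, defined in the kernel headers as `_IO('N', 0x40)`, whose return value is the namespace ID. The helper name `nvme_get_nsid` is hypothetical, and the device path is whatever node the driver created (for example /dev/nvme0n1).

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

#ifndef NVME_IOCTL_ID
#define NVME_IOCTL_ID _IO('N', 0x40)  /* same definition as in the kernel's NVMe header */
#endif

/* Returns the namespace ID of an NVMe block device, or -1 on any error. */
static int nvme_get_nsid(const char *dev_path)
{
    assert(dev_path != NULL);
    int fd = open(dev_path, O_RDONLY);
    if (fd < 0)
        return -1;                        /* no such device, or no permission */
    int nsid = ioctl(fd, NVME_IOCTL_ID);  /* the return value is the NSID itself */
    close(fd);
    return nsid < 0 ? -1 : nsid;
}
```

On a system with one namespace, calling it on the namespace's block device would normally be expected to return 1.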


3. SUBMIT_IO Cmd/Write and Read

       SUBMIT_IO corresponds to reads and writes of the disk. Only issuing the READ and WRITE commands is covered here:

 

 

READ command:

appmask | apptag | reftag | dsmgmt | slba | addr | metadata | rsvd | nblocks | control | flags | opcode
        |        |        | 0xc1   |      | addr |          |      | n       |         |       | 0x02

Opcode: read opcode, 0x02
Flags: cleared to 0
Control: cleared to 0
nblocks: number of blocks to read; must not exceed the maximum
metadata: not used for now
addr: address where the data will be stored; it is best to allocate an array of at least 16 KB
dsmgmt: 0xc1 -> 11000001b: not compressible, sequential read, no latency information provided, typical number of reads and writes expected for this LBA range.
Reftag: This field is only used if the namespace is formatted to use end-to-end protection information.
Apptag: This field is only used if the namespace is formatted to use end-to-end protection information.
Appmask: This field is only used if the namespace is formatted to use end-to-end protection information.
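The dsmgmt value 0xc1 packs four hints into one byte. The sketch below decodes it using the bit positions of the Dataset Management field in the NVMe specification (bits 3:0 access frequency, bits 5:4 access latency, bit 6 sequential request, bit 7 incompressible); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct dsm_hints {
    uint8_t access_frequency;  /* bits 3:0, 1 = typical number of reads and writes */
    uint8_t access_latency;    /* bits 5:4, 0 = no latency information provided */
    uint8_t sequential;        /* bit 6, 1 = part of a sequential access */
    uint8_t incompressible;    /* bit 7, 1 = data is not compressible */
};

static struct dsm_hints dsm_decode(uint32_t dsmgmt)
{
    assert((dsmgmt >> 8) == 0);  /* only the low byte carries hints */
    struct dsm_hints h;
    h.access_frequency = (uint8_t)(dsmgmt & 0x0F);
    h.access_latency   = (uint8_t)((dsmgmt >> 4) & 0x03);
    h.sequential       = (uint8_t)((dsmgmt >> 6) & 0x01);
    h.incompressible   = (uint8_t)((dsmgmt >> 7) & 0x01);
    return h;
}
```

Decoding 0xc1 this way gives frequency 1, latency 0, sequential 1, incompressible 1, which matches the description in the field list.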

WRITE command:

appmask | apptag | reftag | dsmgmt | slba | addr | metadata | rsvd | nblocks | control | flags | opcode
        |        |        | 0xc1   |      | addr |          |      | n       |         |       | 0x01

Opcode: write opcode, 0x01
Flags: cleared to 0
Control: cleared to 0
nblocks: number of blocks to write; must not exceed the maximum
metadata: not used for now
addr: address of the data to be written; it is best to allocate an array of at least 16 KB
dsmgmt: 0xc1 -> 11000001b: not compressible, sequential request, no latency information provided, typical number of reads and writes expected for this LBA range.
Reftag: This field is only used if the namespace is formatted to use end-to-end protection information.
Apptag: This field is only used if the namespace is formatted to use end-to-end protection information.
Appmask: This field is only used if the namespace is formatted to use end-to-end protection information.
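Filling in the two tables can be sketched with one helper. The struct below is a local mirror of the fields named in the tables, in memory order (opcode first); it is meant to resemble the `struct nvme_user_io` that the Linux driver accepts for NVME_IOCTL_SUBMIT_IO, but the names `user_io` and `fill_rw` are hypothetical and the real kernel header should be used in actual code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the table fields, lowest offset first. */
struct user_io {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t control;
    uint16_t nblocks;
    uint16_t rsvd;
    uint64_t metadata;
    uint64_t addr;
    uint64_t slba;
    uint32_t dsmgmt;
    uint32_t reftag;
    uint16_t apptag;
    uint16_t appmask;
};

enum { OP_WRITE = 0x01, OP_READ = 0x02 };

/* Fill a READ or WRITE request the way the field lists describe it. */
static void fill_rw(struct user_io *io, uint8_t opcode,
                    uint64_t slba, uint16_t nblocks, void *buf)
{
    assert(opcode == OP_READ || opcode == OP_WRITE);
    memset(io, 0, sizeof(*io));        /* flags, control, metadata, tags: all 0 */
    io->opcode  = opcode;
    io->nblocks = nblocks;
    io->slba    = slba;
    io->addr    = (uint64_t)(uintptr_t)buf;
    io->dsmgmt  = 0xc1;                /* the hint value used in this article */
}
```

The filled structure would then be handed to the driver with `ioctl(fd, NVME_IOCTL_SUBMIT_IO, &io)`. One caveat worth checking against the driver source: the Linux driver treats nblocks as zero-based, so a value of 0 transfers one block.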

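Pay attention to the cmd argument of ioctl: the cmd value that the driver sees is built in user space by combining the magic number, the command number, and the argument type (its size), each shifted into its own bit field.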
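The combination can be written out by hand. The sketch below assumes the generic Linux ioctl layout used on x86 and ARM (8-bit command number, 8-bit magic, 14-bit size, 2-bit direction); some other architectures use different field widths, so the constants are an assumption. The function name `ioc_encode` is hypothetical; real code should use the `_IO`, `_IOW`, and `_IOWR` macros.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions of the generic Linux ioctl layout (an assumption for x86/ARM). */
enum { NR_SHIFT = 0, TYPE_SHIFT = 8, SIZE_SHIFT = 16, DIR_SHIFT = 30 };
enum { DIR_NONE = 0, DIR_WRITE = 1, DIR_READ = 2 };

/* Build an ioctl cmd from direction, magic number, command number, and
 * the size of the argument type. */
static uint32_t ioc_encode(uint32_t dir, uint8_t magic, uint8_t nr, uint32_t size)
{
    assert(dir <= 3 && size < (1u << 14));  /* 2-bit direction, 14-bit size */
    return (dir << DIR_SHIFT) | (size << SIZE_SHIFT) |
           ((uint32_t)magic << TYPE_SHIFT) | ((uint32_t)nr << NR_SHIFT);
}
```

With the NVMe magic 'N', command number 0x40 and no argument this gives 0x4E40. Command number 0x42 with a 48-byte write argument gives 0x40304E42, and command number 0x41 with a 72-byte read/write argument gives 0xC0484E41. This is why the value arriving in the driver looks nothing like the small number written in user code.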

4. Sending Admin commands

       In testing, a returned status and result that are both 0 indicate that the command succeeded; anything else indicates failure.

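Get Log Page command:

rsvd11 | numd | offset | lid | prp2 | prp1 | rsvd1 | command_id | flags | opcode

Opcode: nvme_admin_get_log_page
Flags: cleared to 0
Command_id: cleared to 0
Prp1: address where the data will be stored; it is best to allocate an array of at least 16 KB
Prp2: datalength; pay attention to the length of datalength
Lid:
Offset: cleared to 0
Numd: cleared to 0

       Note that no namespace_ID is defined here, so it is best to set rsvd1[0] = ~0. In this layout rsvd1[0] appears to occupy the dword where the NSID normally sits, so this amounts to NSID = 0xFFFFFFFF.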
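For comparison, here is a sketch of the same request expressed through a passthrough-style structure, where the log ID and the dword count share command dword 10. Assumptions to check against the specification and kernel headers: the struct is a local mirror modelled on the 72-byte structure the Linux driver accepts for NVME_IOCTL_ADMIN_CMD, the opcode of Get Log Page is 0x02, and NUMD is a zero-based count of dwords in bits 27:16 (the NVMe 1.2 layout). The names `admin_cmd` and `build_get_log_page` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of a 72-byte admin passthrough request. */
struct admin_cmd {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t rsvd1;
    uint32_t nsid, cdw2, cdw3;
    uint64_t metadata, addr;
    uint32_t metadata_len, data_len;
    uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
    uint32_t timeout_ms, result;
};

/* Build a Get Log Page request for log `lid` that returns `len` bytes. */
static void build_get_log_page(struct admin_cmd *c, uint8_t lid,
                               void *buf, uint32_t len)
{
    assert(len >= 4 && len % 4 == 0);  /* the length is counted in dwords */
    uint32_t numd = len / 4 - 1;       /* zero-based dword count */
    assert(numd <= 0xFFF);             /* 12-bit field, assuming NVMe 1.2 */
    memset(c, 0, sizeof(*c));
    c->opcode   = 0x02;                /* Get Log Page */
    c->nsid     = 0xFFFFFFFFu;         /* not tied to one namespace */
    c->addr     = (uint64_t)(uintptr_t)buf;
    c->data_len = len;
    c->cdw10    = (uint32_t)lid | (numd << 16);
}
```

Requesting 512 bytes of log 0x02 (SMART / Health Information) this way yields cdw10 = 0x007F0002.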

Get Log Page: SMART / Health Information

Critical Warning: 00

Composite Temperature: (32 01) 0x0132 = 306 K, about 33 °C

Available Spare: (64) 100%
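Reading those three values out of the returned buffer can be sketched as follows. The byte offsets (0 for Critical Warning, 1-2 for Composite Temperature in little-endian Kelvin, 3 for Available Spare) follow the SMART / Health Information log layout in the NVMe specification; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct smart_summary {
    uint8_t  critical_warning;    /* byte 0 */
    uint16_t temperature_kelvin;  /* bytes 1-2, little-endian */
    uint8_t  available_spare;     /* byte 3, in percent */
};

/* Extract the first few fields of a SMART / Health Information log page. */
static struct smart_summary smart_parse(const uint8_t *log, size_t len)
{
    assert(log != NULL && len >= 4);
    struct smart_summary s;
    s.critical_warning   = log[0];
    s.temperature_kelvin = (uint16_t)(log[1] | (log[2] << 8));
    s.available_spare    = log[3];
    return s;
}

/* Whole-degree conversion; 273 is used instead of 273.15 for simplicity. */
static int kelvin_to_celsius(uint16_t kelvin)
{
    return (int)kelvin - 273;
}
```

Fed the bytes 00 32 01 64 from the dump above, this gives a temperature of 306 K (33 °C) and 100% spare.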

 

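Identify command:

rsvd11 | cns | prp2 | prp1 | rsvd2 | nsid | command_id | flags | opcode

Opcode: nvme_admin_identify
Flags: cleared to 0
Command_id: cleared to 0
Nsid: 0
Prp1: address where the data will be stored; it is best to allocate an array of at least 16 KB
Prp2: datalength; pay attention to the length of datalength
Cns: 0x01

Identify Controller Data Structure: see the appendix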

 

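Set Features command & Get Features command:

rsvd12 | dword11 | fid | prp2 | prp1 | rsvd2 | nsid | command_id | flags | opcode

Opcode: nvme_admin_get_features & nvme_admin_set_features
Flags: cleared to 0
Command_id: cleared to 0
Nsid: 0
Prp1: address where the data will be stored; it is best to allocate an array of at least 16 KB
Prp2: datalength; pay attention to the length of datalength
Fid: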

5. How the driver processes a cmd


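6. NVMe block device file operations interface

The file operations of the NVMe block device are declared when the disk device is allocated. The code is as follows:

disk->fops = &nvme_fops;

static const struct block_device_operations nvme_fops = {
       .owner         = THIS_MODULE,
       .ioctl         = nvme_ioctl,
       .compat_ioctl  = nvme_ioctl,
};

The owner member indicates that this fops belongs to the NVMe block device driver. ioctl and compat_ioctl are the two paths a user ioctl call can take (compat_ioctl serves 32-bit processes on a 64-bit kernel). Normally it is ioctl, but whichever path is taken, both end up in nvme_ioctl.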


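Once inside nvme_ioctl(), the driver parses the cmd type and takes the matching branch. The two to focus on here are NVME_IOCTL_ADMIN_CMD and NVME_IOCTL_SUBMIT_IO.

Note that both handlers eventually call: nvme_submit_sync_cmd(nvmeq, &c, NULL, NVME_IO_TIMEOUT);


It issues the command synchronously and then handles the status that is finally returned. Because this function sleeps, preemption must remain enabled: it may be preempted at any point and rescheduled later.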

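NVMe resources

The latest version of the protocol at the time of writing is the NVMe 1.2.1 Specification, which can be downloaded from http://www.nvmexpress.org/specifications/. Drivers are at http://www.nvmexpress.org/drivers/, which currently offers driver code for Microsoft, Linux, VMware, UEFI, FreeBSD, Solaris, and other systems.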

 

Appendix:

Identify Controller Data Structure,

low byte first, high byte last (little-endian).

87 19: PCI Vendor ID (VID): 0x1987 in little-endian order, matching the 1987:5007 device seen in the PCI tree
87 19: PCI Subsystem Vendor ID (SSVID): 0x1987
36 37 43 45 30 37 36 36 31 30 31 37 30 30 30 30 30 31 38 33: Serial Number (SN): ASCII "67CE0766101700000183"
50 43 49 65 20 53 53 44 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20: Model Number (MN): ASCII "PCIe SSD", padded with spaces
45 37 46 4d 30 31 2e 31: Firmware Revision (FR): ASCII "E7FM01.1"
01: Recommended Arbitration Burst (RAB): reported as a power of two, so 2^1 = 2 commands
00 00 00: IEEE OUI Identifier (IEEE)
00: Controller Multi-Path I/O and Namespace Sharing Capabilities (CMIC): the NVM subsystem contains only a single PCI Express port.
09: Maximum Data Transfer Size (MDTS): the value is in units of the minimum memory page size (CAP.MPSMIN) and is reported as a power of two (2^n), so 2^9 = 512 pages
00 00: Controller ID (CNTLID)
00 02 01 00: Version (VER): 0x00010200 in little-endian order, so Major Version Number 1, Minor Version Number 2 (NVMe 1.2)
80 4f 12 00: RTD3 Resume Latency (RTD3R): 0x00124F80 = 1,200,000 microseconds
60 e3 16 00: RTD3 Entry Latency (RTD3E): 0x0016E360 = 1,500,000 microseconds
07 00: Optional Admin Command Support (OACS): the controller supports the Firmware Commit and Firmware Image Download commands; the controller supports the Format NVM command; the controller supports the Security Send and Security Receive commands.
03: Abort Command Limit (ACL): maximum number of concurrently outstanding Abort commands (0's based value, so 4)
03: Asynchronous Event Request Limit (AERL): maximum number of concurrently outstanding Asynchronous Event Request commands (0's based value, so 4)
02: Firmware Updates (FRMW): the controller requires a reset for firmware to be activated; the number of firmware slots the controller supports (1 to 7) is 1; the first firmware slot (slot 1) is read/write.
03: Log Page Attributes (LPA): the controller supports the Command Effects log page; the controller supports the SMART / Health information log page on a per namespace basis.
3f: Error Log Page Entries (ELPE): the maximum number of Error Information log entries that are stored by the controller (0's based value, so 64)
04: Number of Power States Support (NPSS): the number of NVM Express power states supported by the controller (0's based value, so 5)
01: Admin Vendor Specific Command Configuration (AVSCC): all Admin Vendor Specific Commands use the format defined in Figure 13.
01: Autonomous Power State Transition Attributes (APSTA): the controller supports autonomous power state transitions.
7f 01: Warning Composite Temperature Threshold (WCTEMP): warning temperature, 0x017F = 383 K
93 01: Critical Composite Temperature Threshold (CCTEMP): critical temperature, 0x0193 = 403 K
66: Submission Queue Entry Size (SQES): maximum Submission Queue entry size when using the NVM Command Set: 6 (2^6 = 64 bytes); required Submission Queue entry size when using the NVM Command Set: 6 (64 bytes)
44: Completion Queue Entry Size (CQES): maximum Completion Queue entry size when using the NVM Command Set: 4 (2^4 = 16 bytes); required Completion Queue entry size when using the NVM Command Set: 4 (16 bytes)
01 00 00 00: Number of Namespaces (NN): the number of valid namespaces present for the controller: 1
1e 00: Optional NVM Command Support (ONCS): the controller does not support the Compare command; the controller supports the Write Uncorrectable command, the Dataset Management command, and the Write Zeroes command; the controller supports the Save field in the Set Features command and the Select field in the Get Features command.
00 00: Fused Operation Support (FUSES): the controller does not support the Compare and Write fused operation.
01: Format NVM Attributes (FNA): all namespaces shall be configured with the same attributes, and a format of any namespace results in a format of all namespaces.
01: Volatile Write Cache (VWC, byte 525): a volatile write cache is present.
ff 00: Atomic Write Unit Normal (AWUN): the maximum number of logical blocks in an atomic write (0's based value, so 256)
00 00: Atomic Write Unit Power Fail (AWUPF)
01: NVM Vendor Specific Command Configuration (NVSCC): all NVM Vendor Specific Commands use the format defined in Figure 13.
00 00: Atomic Compare & Write Unit (ACWU)
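Several of the fields above are easy to misread by eye, so here is a small sketch of how they could be decoded in code. The helper names are hypothetical. Multi-byte integers are little-endian, strings are space-padded ASCII, and MDTS is a power of two in units of the minimum page size, which is 2^(12 + CAP.MPSMIN) bytes; CAP.MPSMIN itself is not shown in this article, so treating it as 0 (4 KB pages) is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Little-endian readers for the multi-byte integer fields. */
static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Copy a space-padded ASCII field (SN, MN, FR) and strip the padding. */
static void id_string(char *dst, size_t dst_size, const uint8_t *src, size_t len)
{
    assert(dst_size > len);
    memcpy(dst, src, len);
    while (len > 0 && (dst[len - 1] == ' ' || dst[len - 1] == '\0'))
        len--;
    dst[len] = '\0';
}

/* MDTS in bytes; a raw MDTS of 0 means "no limit" and is not handled here. */
static uint64_t mdts_bytes(uint8_t mdts, uint8_t mpsmin)
{
    assert(mdts > 0 && mdts + 12 + mpsmin < 64);
    return (uint64_t)1 << (mdts + 12 + mpsmin);
}

/* VER: bits 31:16 hold the major version, bits 15:8 the minor version. */
static unsigned ver_major(uint32_t ver) { return ver >> 16; }
static unsigned ver_minor(uint32_t ver) { return (ver >> 8) & 0xFF; }
```

With the bytes from this dump, VID decodes to 0x1987, VER to 1.2, and an MDTS of 9 with 4 KB pages corresponds to 2 MB per transfer.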

16 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 52 1c 40 00 16 03 81 00 00 00 00 00 00 00 00 00: Power State 0 Descriptor (PSD0), broken down as follows:
16 03: the maximum power consumed by the NVM subsystem in this power state: 0x0316 = 790 units of 0.01 W = 7.9 W
00: Reserved
00: the controller processes I/O commands in this power state; the scale of the Maximum Power field is 0.01 W.
00 00 00 00: the maximum entry latency in microseconds associated with entering this power state.
00 00 00 00: the maximum exit latency in microseconds associated with exiting this power state.
00: the relative read throughput associated with this power state.
00: the relative read latency associated with this power state.
00: the relative write throughput associated with this power state.
00: the relative write latency associated with this power state.
52 1c: the typical power consumed by the NVM subsystem over 30 seconds in this power state when idle: 0x1C52 = 7250 units of 0.0001 W = 0.725 W
40: Idle Power Scale (0.0001 W)
00: Reserved
81: Active Power Scale: 0.01 W; the workload used to calculate maximum power for this power state: 001b
f0 00: the largest average power consumed by the NVM subsystem over a 10 second period in this power state with the workload indicated in the Active Power Workload field.

00 00 00 00 00 00 00 00 00: Power State 1 Descriptor (PSD1):
{
be 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 52 1c 40 00 be 00 81 00 00 00 00 00 00 00 00 00: Power State 2 Descriptor (PSD2);
4c 04 00 03 58 02 00 00 58 02 00 00 02 02 02 02 4c 04 40 00 4c 04 41 00 00 00 00 00 00 00 00 00: Power State 3 Descriptor (PSD3):
32 00 00 03 a0 86 01 00 00 71 02 00 03 03 03 03 32 00 40 00 32 00 41 00 00 00 00 00 00 00 00 00: Power State 4 Descriptor (PSD4):
}
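The power state descriptors can also be decoded mechanically. The sketch below is a hedged illustration using the 32-byte descriptor layout of the NVMe specification (bytes 0-1 maximum power, byte 3 bit 0 the maximum power scale and bit 1 the non-operational flag, bytes 4-7 and 8-11 entry and exit latency, bytes 16-17 idle power, byte 18 bits 7:6 idle power scale). The names are hypothetical, and power is kept in integer microwatts to avoid floating point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct psd_power {
    uint64_t max_uw;          /* maximum power, microwatts */
    uint64_t idle_uw;         /* idle power, microwatts (0 if not reported) */
    uint32_t entry_us;        /* entry latency, microseconds */
    uint32_t exit_us;         /* exit latency, microseconds */
    int      non_operational; /* 1 if no I/O is processed in this state */
};

static uint32_t psd_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Idle power scale: 01b -> 0.0001 W (100 uW), 10b -> 0.01 W (10000 uW). */
static uint64_t scale_uw(unsigned code)
{
    return code == 1 ? 100u : code == 2 ? 10000u : 0u;
}

static struct psd_power psd_decode(const uint8_t *d)
{
    assert(d != NULL);  /* d must point at a full 32-byte descriptor */
    struct psd_power p;
    unsigned mp   = (unsigned)d[0] | ((unsigned)d[1] << 8);
    unsigned mxps = d[3] & 0x01;  /* 0: units of 0.01 W, 1: units of 0.0001 W */
    unsigned idlp = (unsigned)d[16] | ((unsigned)d[17] << 8);

    p.max_uw          = (uint64_t)mp * (mxps ? 100u : 10000u);
    p.non_operational = (d[3] >> 1) & 0x01;
    p.entry_us        = psd_le32(d + 4);
    p.exit_us         = psd_le32(d + 8);
    p.idle_uw         = (uint64_t)idlp * scale_uw(d[18] >> 6);
    return p;
}
```

Applied to the dump, PSD0 decodes to 7.9 W maximum and 0.725 W idle, while PSD3 decodes to a non-operational state at 0.11 W with 600 microsecond entry and exit latencies.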

posted on 2017-06-30 09:34  ingram14