RHCA RH442 010: filesystem structure, BDP tuning, NIC driver and bandwidth
Filesystem structure
Users access the underlying filesystem through the virtual filesystem (VFS).
For a disk, the MBR plus the partition table record the disk's layout.
For a partition, its first block is the superblock, which records the partition's metadata.
For a file, the inode records the file's index information (metadata).
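A file's inode metadata can be inspected with stat; a quick sketch on a throwaway file (GNU stat format flags):

```shell
# Inspect a file's inode metadata: inode number, size, allocated blocks.
tmpfile=$(mktemp)
echo hello > "$tmpfile"
stat -c 'inode=%i size=%s blocks=%b' "$tmpfile"
rm -f "$tmpfile"
```

The inode number and block count vary per system; size will be 6 bytes ("hello" plus the newline).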
Viewing the superblock
[root@servera ~]# tune2fs -l /dev/vdd1
tune2fs 1.44.3 (10-July-2018)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: c7af875c-49da-484e-9fb4-ace01447fd2e
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 65536
Block count: 262144
Reserved block count: 13107
Free blocks: 249189
Free inodes: 65525
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Reserved GDT blocks: 127
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Fri Jul 8 14:25:40 2022
Last mount time: n/a
Last write time: Fri Jul 8 14:25:40 2022
Mount count: 0
Maximum mount count: -1
Last checked: Fri Jul 8 14:25:40 2022
Check interval: 0 (<none>)
Lifetime writes: 33 MB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 2be7d43d-444c-4728-9256-35e999817aec
Journal backup: inode blocks
Checksum type: crc32c
Checksum: 0xea3861c5
When a file in the filesystem grows, its data can end up in non-contiguous locations and things get messy, so the filesystem introduces the concept of groups.
The filesystem is split into many block groups: Group 0, Group 1, ...
Each group holds 32768 blocks, so blocks 0-32767 form the first group.
dumpe2fs /dev/vdd1 (full superblock details; here we mainly look at the group information)
Journal size: 32M
Journal length: 8192
Journal sequence: 0x00000001
Journal start: 0
Group 0: (Blocks 0-32767) csum 0xbeb7
Primary superblock at 0, Group descriptors at 1-1
Reserved GDT blocks at 2-128
Block bitmap at 129 (+129), csum 0x1b5a0776
Inode bitmap at 137 (+137), csum 0x12c02b55
Inode table at 145-656 (+145)
28521 free blocks, 8181 free inodes, 2 directories, 8181 unused inodes
Free blocks: 4247-32767
Free inodes: 12-8192
Group 1: (Blocks 32768-65535) csum 0x7b73 [INODE_UNINIT, BLOCK_UNINIT]
Backup superblock at 32768, Group descriptors at 32769-32769
Reserved GDT blocks at 32770-32896
Block bitmap at 130 (bg #0 + 130), csum 0x00000000
Inode bitmap at 138 (bg #0 + 138), csum 0x00000000
Inode table at 657-1168 (bg #0 + 657)
32639 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes
Free blocks: 32897-65535
Free inodes: 8193-16384
Group 2: (Blocks 65536-98303) csum 0xa5af [INODE_UNINIT, BLOCK_UNINIT]
Block bitmap at 131 (bg #0 + 131), csum 0x00000000
Inode bitmap at 139 (bg #0 + 139), csum 0x00000000
Inode table at 1169-1680 (bg #0 + 1169)
32768 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes
Free blocks: 65536-98303
Free inodes: 16385-24576
Group 3: (Blocks 98304-131071) csum 0x1cb0 [INODE_UNINIT, BLOCK_UNINIT]
Keeping a file's blocks inside one group stops its data from scattering across the disk.
Each group has a block bitmap and an inode bitmap recording the group's metadata.
The first block of group 0 is the primary superblock.
Group 1 starts with a backup superblock, as does group 3.
With the sparse_super feature, backups live in group 1 and in groups that are powers of 3, 5, and 7, so groups 1, 3, 5, 7, 9, 25, 27, ... all hold superblock backups.
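As a sketch of how to see these backup locations without touching a real disk (assumes the e2fsprogs tools mkfs.ext4 and dumpe2fs are installed; uses a file-backed image, so no root or spare partition is needed):

```shell
# Build a small ext4 image in an ordinary file and list where the
# primary and backup superblocks were placed.
dd if=/dev/zero of=/tmp/ext4demo.img bs=1M count=64 status=none
mkfs.ext4 -q -F /tmp/ext4demo.img
dumpe2fs /tmp/ext4demo.img 2>/dev/null | grep -i 'superblock at'
rm -f /tmp/ext4demo.img
```

The grep picks out the "Primary superblock at ..." and "Backup superblock at ..." lines from each group's entry.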
Experiment: destroying the superblock
[root@servera ~]# mount /dev/vdd1 /data
[root@servera ~]# cd /data
[root@servera data]# cp -r /etc/ .
[root@servera data]# cp /etc/passwd /etc/group .
[root@servera data]# cd ..
[root@servera /]# umount /data
[root@servera /]# dd if=/dev/zero of=/dev/vdd1 bs=4K count=10
10+0 records in
10+0 records out
40960 bytes (41 kB, 40 KiB) copied, 0.00177638 s, 23.1 MB/s
[root@servera /]# mount /dev/vdd1 /data
mount: /data: wrong fs type, bad option, bad superblock on /dev/vdd1, missing codepage or helper program, or other error.
[root@servera /]#
The first 40 KB of the partition have now been overwritten with zeros.
Repairing the filesystem
[root@servera /]# e2fsck -v /dev/vdd1
e2fsck 1.44.3 (10-July-2018)
ext2fs_open2: Bad magic number in super-block
e2fsck: Superblock invalid, trying backup blocks...
/dev/vdd1 was not cleanly unmounted, check forced.
Resize inode not valid. Recreate<y>? yes
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: +(98304--98432) +(163840--163968)
Fix<y>? yes
Free blocks count wrong for group #0 (28520, counted=28236).
Fix<y>? yes
Free blocks count wrong for group #1 (32639, counted=26806).
Fix<y>? yes
Free blocks count wrong (249188, counted=243071).
Fix<y>? yes
Free inodes count wrong for group #0 (8181, counted=7161).
Fix<y>? yes
Directories count wrong for group #0 (2, counted=284).
Fix<y>? yes
Free inodes count wrong (65525, counted=64505).
Fix<y>? yes
Inode bitmap differences: Group 0 inode bitmap does not match checksum.
FIXED.
/dev/vdd1: ***** FILE SYSTEM WAS MODIFIED *****
1031 inodes used (1.57%, out of 65536)
2 non-contiguous files (0.2%)
1 non-contiguous directory (0.1%)
# of inodes with ind/dind/tind blocks: 0/0/0
Extent depth histogram: 866
19073 blocks used (7.28%, out of 262144)
0 bad blocks
0 large files
580 regular files
284 directories
0 character device files
0 block device files
0 fifos
0 links
158 symbolic links (157 fast symbolic links)
0 sockets
------------
1022 files
[root@servera /]# mount /dev/vdd1 /data/
[root@servera /]# cd /data
[root@servera data]# ls
etc group lost+found passwd
[root@servera data]#
e2fsck rebuilt the superblock from one of the later backup copies.
e2fsck -b 98304 /dev/vdd1
restores using a specific backup superblock (ext family only); backups further into the disk are less likely to have been clobbered, and dumpe2fs lists where they live.
lost+found must not be deleted; e2fsck places recovered files there.
To repair XFS instead:
xfs_repair /dev/vdd
Journaling filesystems
e.g. IBM AIX's JFS (Journaled File System) and JFS2
On Linux, journaling starts with ext3; ext4 and xfs are journaling filesystems.
When a file is written, the index is written into the inode first, then the contents into the data blocks.
Example: the inode for a.txt records a size of 100 MB. If the machine dies after only 50 MB has been written, at the next boot the index still points at a 100 MB file. A non-journaling filesystem then scans the whole partition for the missing data (it cannot be found, which wastes time) before giving up on the half-written, incomplete file.
Journaling:
The inode is first written into the journal area, then the data blocks are written; only then is the journal copy committed to the real inode, and finally the journal entry is deleted.
If the machine still dies halfway, any journal entry that was never fully committed to the real inode is simply discarded (it was incomplete). No searching is needed, and no half-written garbage is left behind.
The cost is that the index is written twice, which adds up with many small files.
The journal area is a temporary staging location for the index.
writeback delays the copy from the journal area into the real inode (with many small files, the delayed index writes can then be flushed in one pass, but this carries a safety risk; the default is to commit immediately).
There are two kinds of journal area:
Internal journal: carved out of the filesystem itself, e.g. 256 MB inside a 100 GB filesystem.
External journal: a partition on another disk used as the journal. The journal work moves to the second disk, so each disk writes the inode only once.
Check what block size the external journal device needs:
the block sizes of the two devices must match.
[root@servera /]# dumpe2fs /dev/vdb1 | grep -i journal
dumpe2fs 1.44.3 (10-July-2018)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Journal inode: 8
Journal backup: inode blocks
Journal features: (none)
Journal size: 64M
Journal length: 16384
Journal sequence: 0x00000001
Journal start: 0
Creating the external journal device
[root@servera /]# mke2fs -O journal_dev -b 4096 /dev/vdc1
mke2fs 1.44.3 (10-July-2018)
Creating filesystem with 131072 4k blocks and 0 inodes
Filesystem UUID: 0976edfc-8353-4d19-83b0-9f1fc399d6ce
Superblock backups stored on blocks:
Zeroing journal device:
Removing vdb1's internal journal
[root@servera /]# tune2fs -O ^has_journal /dev/vdb1
tune2fs 1.44.3 (10-July-2018)
[root@servera /]# dumpe2fs /dev/vdb1 | grep -i journal
dumpe2fs 1.44.3 (10-July-2018)
Journal backup: inode blocks
[root@servera /]#
Attaching vdc1 as vdb1's external journal
[root@servera /]# tune2fs -j -J device=/dev/vdc1 /dev/vdb1
tune2fs 1.44.3 (10-July-2018)
Creating journal on device /dev/vdc1: done
[root@servera /]# dumpe2fs /dev/vdb1 | grep -i journal
dumpe2fs 1.44.3 (10-July-2018)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Journal UUID: 0976edfc-8353-4d19-83b0-9f1fc399d6ce
Journal device: 0xfc21
Journal backup: inode blocks
[root@servera /]#
The journal device number 0xfc21 decodes as major:minor:
high byte 0xfc = 15×16 + 12 = 252 (major), low byte 0x21 = 2×16 + 1 = 33 (minor),
and 252:33 is /dev/vdc1.
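The decode can be checked with shell arithmetic:

```shell
# 0xfc21: high byte = major device number, low byte = minor.
devnum=$(( 0xfc21 ))
printf 'major=%d minor=%d\n' $(( devnum >> 8 )) $(( devnum & 0xff ))
# prints: major=252 minor=33   (252:33 is /dev/vdc1 in the lsblk output)
```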
Reverting to an internal journal
[root@servera /]# tune2fs -O ^has_journal /dev/vdb1
tune2fs 1.44.3 (10-July-2018)
Journal removed
[root@servera /]# dumpe2fs /dev/vdb1 | grep -i journal
dumpe2fs 1.44.3 (10-July-2018)
Journal backup: inode blocks
[root@servera /]# tune2fs -O has_journal /dev/vdb1
tune2fs 1.44.3 (10-July-2018)
Creating journal inode: done
[root@servera /]# dumpe2fs /dev/vdb1 | grep -i journal
dumpe2fs 1.44.3 (10-July-2018)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Journal inode: 8
Journal backup: inode blocks
Journal features: (none)
Journal size: 64M
Journal length: 16384
Journal sequence: 0x00000001
Journal start: 0
[root@servera /]#
XFS with an external log
[root@servera /]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:0 0 10G 0 disk
└─vda1 252:1 0 10G 0 part /
vdb 252:16 0 5G 0 disk
└─vdb1 252:17 0 3G 0 part
vdc 252:32 0 5G 0 disk
└─vdc1 252:33 0 400M 0 part
vdd 252:48 0 5G 0 disk
[root@servera /]# mkfs.xfs -l logdev=/dev/vdc1 /dev/vdb1
meta-data=/dev/vdb1 isize=512 agcount=4, agsize=196608 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1
data = bsize=4096 blocks=786432, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =/dev/vdc1 bsize=4096 blocks=102400, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
[root@servera /]#
[root@servera /]# xfs_info /dev/vdb1
meta-data=/dev/vdb1 isize=512 agcount=4, agsize=196608 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1
data = bsize=4096 blocks=786432, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =external bsize=4096 blocks=102400, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
[root@servera /]#
Mounting (fstab entry):
/dev/vdb1 /data xfs defaults,logdev=/dev/vdc1 0 0
[root@servera /]# mount | tail -n 1
/dev/vdb1 on /data type xfs (rw,relatime,seclabel,attr2,inode64,logdev=/dev/vdc1,noquota)
[root@servera /]#
Networking
Transmit path:
Outgoing data is placed into a buffer; the kernel encapsulates it into PDUs and moves them onto the transmit queue.
The driver reads PDUs from the head of the queue into the NIC, and finally the NIC raises an interrupt and sends them out.*
Receive path:
The NIC receives the data frame and copies it via DMA (direct memory access) into a receive buffer.*
The NIC raises a hard interrupt.* The kernel handles the interrupt and schedules a softirq, and the softirq routes the packet up to the IP layer.
* = tunable
Batching packets into a buffer and transmitting them in one pass is more efficient.
Sizing the buffer with BDP
BDP (Bandwidth-Delay Product) = bandwidth × latency
socket buffer = BDP / number of sockets (when two NICs are bonded together)
With a single NIC:
buffer = BDP
The larger the latency, the larger the buffer; the larger the bandwidth, the larger the buffer, because more data is in flight per burst.
When latency is high (the two sites are far apart), it pays to send more data per burst; when latency is low (the sites are close), smaller bursts are fine.
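As a numeric sketch of the formula (the 100 Mbit/s link speed and 10 ms RTT here are made-up example values):

```shell
# BDP in bytes = bandwidth (bit/s) / 8 * RTT (s).
bw_bps=$((100 * 1000 * 1000))          # example: 100 Mbit/s link
rtt_ms=10                              # example: 10 ms round-trip time
bdp=$(( bw_bps / 8 * rtt_ms / 1000 ))
echo "BDP = $bdp bytes"                # prints: BDP = 125000 bytes
```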
UDP
[root@servera /]# sysctl -a | grep net.core | grep mem
net.core.rmem_default = 212992    ← default read (receive) buffer, for incoming packets
net.core.rmem_max = 212992
net.core.wmem_default = 212992    ← default write (send) buffer, for outgoing packets
net.core.wmem_max = 212992
TCP
TCP tuning = net.core values + the TCP-specific values;
the net.core values above must be raised as well.
[root@servera /]# sysctl -a | grep tcp | grep mem
net.ipv4.tcp_mem = 19779 26373 39558    ← overall TCP memory budget, in pages
net.ipv4.tcp_rmem = 4096 87380 6291456    ← TCP read buffer, in bytes
net.ipv4.tcp_wmem = 4096 16384 4194304    ← TCP write buffer, in bytes
The three fields are minimum, default, maximum;
the rmem/wmem buffers must stay within the overall tcp_mem budget.
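To make tuned values survive a reboot, the usual route is a sysctl drop-in; a sketch with example values only (the file name and numbers are illustrative, sized for a 25 MB BDP):

```shell
# Persist the tuned values in a sysctl drop-in (example values; size to
# your own BDP). Written to /tmp here; use /etc/sysctl.d/ on a real system.
cat <<'EOF' > /tmp/90-bdp.conf
net.core.rmem_max = 25000000
net.core.wmem_max = 25000000
net.ipv4.tcp_rmem = 4096 87380 25000000
net.ipv4.tcp_wmem = 4096 16384 25000000
EOF
# Apply with: sysctl -p /etc/sysctl.d/90-bdp.conf
```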
Experiment
[root@workstation ~]# lab network-latency start
[root@servera /]# ping 192.168.0.254
PING 192.168.0.254 (192.168.0.254) 56(84) bytes of data.
64 bytes from 192.168.0.254: icmp_seq=1 ttl=64 time=2001 ms
64 bytes from 192.168.0.254: icmp_seq=2 ttl=64 time=2000 ms
64 bytes from 192.168.0.254: icmp_seq=3 ttl=64 time=2001 ms
[root@servera html]# time wget http://192.168.0.254/bigfile
--2022-07-08 22:13:29-- http://192.168.0.254/bigfile
Connecting to 192.168.0.254:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5242880 (5.0M)
Saving to: ‘bigfile.1’
bigfile.1 100%[===================================================================================================================>] 5.00M 197KB/s in 26s
2022-07-08 22:13:59 (197 KB/s) - ‘bigfile.1’ saved [5242880/5242880]
real 0m30.027s
user 0m0.004s
sys 0m0.014s
The bandwidth is known (100 Mbit/s), and ping gives the latency (RTT ≈ 2 s).
BDP = 100 Mbit/s bandwidth × 2 s
which gives the buffer size to set.
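The arithmetic behind that buffer size:

```shell
# 100 Mbit/s * 2 s RTT, converted to bytes.
echo $(( 100 * 1000 * 1000 / 8 * 2 ))   # prints: 25000000
```

which is exactly the 25000000 used for rmem in the sysctl settings.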
After tuning:
[root@servera html]# sysctl -p
net.ipv4.tcp_rmem = 4096 12500000 25000000
net.core.rmem_default = 25000000
[root@servera html]# sysctl -w vm.drop_caches=3
vm.drop_caches = 3
[root@servera html]# !time
time wget http://192.168.0.254/bigfile
--2022-07-08 22:23:13-- http://192.168.0.254/bigfile
Connecting to 192.168.0.254:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5242880 (5.0M)
Saving to: ‘bigfile.1’
bigfile.1 100%[===================================================================================================================>] 5.00M 320KB/s in 16s
2022-07-08 22:23:33 (320 KB/s) - ‘bigfile.1’ saved [5242880/5242880]
real 0m20.092s
user 0m0.004s
sys 0m0.021s
In a hybrid cloud
private-cloud and public-cloud workloads need to interoperate,
linked through VPN endpoints: a path between two fixed points
that carries very heavy traffic, so it is worth tuning.
Supplement
How do we work out a network's transmission capacity?
As everyone knows, networks have a bandwidth limit. Bandwidth describes transmission capacity, but it is measured differently from kernel buffers:
Bandwidth is traffic per unit time, i.e. a speed, for example the common 100 MB/s;
Buffers are measured in bytes, and multiplying network speed by time yields bytes. With a maximum bandwidth of 100 MB/s and a network delay (RTT) of 10 ms, the path between client and server can hold 100 MB/s × 0.01 s = 1 MB of data.
That 1 MB is the product of bandwidth and delay, hence the name Bandwidth-Delay Product (BDP). It is also the size of the TCP data "in flight": sitting on the wire, in routers, and in other network devices. Pushing more than 1 MB into flight overloads the network and invites packet loss.
Because the send buffer bounds the send window, and the send window bounds the amount of sent-but-unacknowledged data in flight, the send buffer must not exceed the BDP.
Send buffer vs. BDP:
If the send buffer exceeds the BDP, the excess cannot be transmitted effectively and the network overloads, dropping packets;
If the send buffer is smaller than the BDP, the network's capacity is not fully used.
So the send buffer is best sized close to the BDP.
————————————————
Original author: CrazyZard
Source: https://learnku.com/articles/46249
The buffers above serve the connections; buffers can also be given to the NIC itself.
DMA
DMA buffer (DMA = direct memory access): memory the NIC reads and writes directly, with no OS address translation in the path; the tunable is how much DMA buffer to give the NIC.
[root@foundation0 ~]# ethtool ens160
Settings for ens160:
Supported ports: [ TP ]
Supported link modes: 1000baseT/Full
10000baseT/Full
Supported pause frame use: No
Supports auto-negotiation: No    ← whether auto-negotiation is supported
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 10000Mb/s    ← 10 Gbit/s
Duplex: Full    ← full duplex
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: off
MDI-X: Unknown
Supports Wake-on: uag
Wake-on: d
Link detected: yes
Looking up the driver parameters on the host machine
[root@foundation0 ~]# modinfo -p e1000
TxDescriptors:Number of transmit descriptors (array of int)
RxDescriptors:Number of receive descriptors (array of int)
Speed:Speed setting (array of int)
Duplex:Duplex setting (array of int)
AutoNeg:Advertised auto-negotiation setting (array of int)
FlowControl:Flow Control setting (array of int)
XsumRX:Disable or enable Receive Checksum offload (array of int)
TxIntDelay:Transmit Interrupt Delay (array of int)
TxAbsIntDelay:Transmit Absolute Interrupt Delay (array of int)
RxIntDelay:Receive Interrupt Delay (array of int)
RxAbsIntDelay:Receive Absolute Interrupt Delay (array of int)
InterruptThrottleRate:Interrupt Throttling Rate (array of int)
SmartPowerDownEnable:Enable PHY smart power down (array of int)
copybreak:Maximum size of packet that is copied to a new buffer on receive (uint)
debug:Debug level (0=none,...,16=all) (int)
[root@foundation0 ~]#
Giving the NIC more buffer
means changing driver parameters.
Adjusting the bandwidth requires turning auto-negotiation off first (with auto-negotiation on, the speed cannot be changed).
Verified working on RHEL 7:
ethtool -s eno16777736 autoneg off speed 100
advertise sets which modes are advertised:
advertise N
Sets the speed and duplex advertised by autonegotiation. The argument is a hexadecimal value using one or a combination of the following values:
0x001 10baseT Half
0x002 10baseT Full
0x004 100baseT Half
0x008 100baseT Full
0x010 1000baseT Half (not supported by IEEE standards)
0x020 1000baseT Full
0x20000 1000baseKX Full
0x20000000000 1000baseX Full
0x800000000000 2500baseT Full
This cannot be set inside a virtual machine.
It can be persisted in the interface configuration file:
ETHTOOL_OPTS="-s ${DEVICE} autoneg off speed 1000 duplex full"
On RHEL 8 this can be done with nmcli.
A UDP test may be a bit faster;
it shows the raw bandwidth of the link without application-layer processing.
NIC bonding
Bonding two NICs improves reliability and throughput.
bonding_opts: miimon=100 checks the link state every 100 ms. The check is done either by the NIC driver or via the MII/ethtool ioctls:
use_carrier=1 (the default) asks the driver (netif_carrier); use_carrier=0 uses the MII/ethtool ioctls.
Jumbo frames
If the switch supports them:
the normal MTU is 1500; raising it to 9000 can greatly improve efficiency.
In the interface configuration file:
MTU=9000
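On RHEL 8 the MTU can also be set through NetworkManager; a sketch (the connection name eth0 is a placeholder for your own):

```shell
# Raise the MTU via NetworkManager and re-activate the connection.
nmcli connection modify eth0 802-3-ethernet.mtu 9000
nmcli connection up eth0
# Verify the MTU actually in effect on the device:
ip link show eth0 | grep -o 'mtu [0-9]*'
```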