RHCA RH442 010: filesystem structure, BDP tuning, NIC driver and bandwidth
Filesystem structure
Users access the underlying filesystem through the virtual filesystem (VFS).
For a disk, the MBR plus the partition table record the disk's layout.
For a partition, its first block is the superblock, which records the partition's metadata.
For a file, the inode records the file's index information (metadata).
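A file's inode metadata can be inspected with stat; a quick sketch on a throwaway file (GNU stat format flags):

```shell
# Inspect a file's inode metadata: inode number, size, allocated blocks.
tmpfile=$(mktemp)
echo hello > "$tmpfile"
stat -c 'inode=%i size=%s blocks=%b' "$tmpfile"
rm -f "$tmpfile"
```

The inode number and block count vary per system; size will be 6 bytes ("hello" plus the newline).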
Viewing the superblock
[root@servera ~]# tune2fs -l /dev/vdd1
tune2fs 1.44.3 (10-July-2018)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: c7af875c-49da-484e-9fb4-ace01447fd2e
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 65536
Block count: 262144
Reserved block count: 13107
Free blocks: 249189
Free inodes: 65525
First block: 0
Block size: 4096
Fragment size: 4096
Group descriptor size: 64
Reserved GDT blocks: 127
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Fri Jul 8 14:25:40 2022
Last mount time: n/a
Last write time: Fri Jul 8 14:25:40 2022
Mount count: 0
Maximum mount count: -1
Last checked: Fri Jul 8 14:25:40 2022
Check interval: 0 (<none>)
Lifetime writes: 33 MB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 2be7d43d-444c-4728-9256-35e999817aec
Journal backup: inode blocks
Checksum type: crc32c
Checksum: 0xea3861c5
When a file in the filesystem grows, its data can end up in non-contiguous locations and things get messy, so the filesystem introduces the concept of groups.
The filesystem is split into many block groups: Group 0, Group 1, ...
Each group holds 32768 blocks, so blocks 0-32767 form the first group.
dumpe2fs /dev/vdd1 (full superblock details; here we mainly look at the group information)
Journal size: 32M
Journal length: 8192
Journal sequence: 0x00000001
Journal start: 0
Group 0: (Blocks 0-32767) csum 0xbeb7
Primary superblock at 0, Group descriptors at 1-1
Reserved GDT blocks at 2-128
Block bitmap at 129 (+129), csum 0x1b5a0776
Inode bitmap at 137 (+137), csum 0x12c02b55
Inode table at 145-656 (+145)
28521 free blocks, 8181 free inodes, 2 directories, 8181 unused inodes
Free blocks: 4247-32767
Free inodes: 12-8192
Group 1: (Blocks 32768-65535) csum 0x7b73 [INODE_UNINIT, BLOCK_UNINIT]
Backup superblock at 32768, Group descriptors at 32769-32769
Reserved GDT blocks at 32770-32896
Block bitmap at 130 (bg #0 + 130), csum 0x00000000
Inode bitmap at 138 (bg #0 + 138), csum 0x00000000
Inode table at 657-1168 (bg #0 + 657)
32639 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes
Free blocks: 32897-65535
Free inodes: 8193-16384
Group 2: (Blocks 65536-98303) csum 0xa5af [INODE_UNINIT, BLOCK_UNINIT]
Block bitmap at 131 (bg #0 + 131), csum 0x00000000
Inode bitmap at 139 (bg #0 + 139), csum 0x00000000
Inode table at 1169-1680 (bg #0 + 1169)
32768 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes
Free blocks: 65536-98303
Free inodes: 16385-24576
Group 3: (Blocks 98304-131071) csum 0x1cb0 [INODE_UNINIT, BLOCK_UNINIT]
Keeping a file's blocks inside one group stops its data from scattering across the disk.
Each group has a block bitmap and an inode bitmap recording the group's metadata.
The first block of group 0 is the primary superblock.
Group 1 starts with a backup superblock, as does group 3.
With the sparse_super feature, backups live in group 1 and in groups that are powers of 3, 5, and 7, so groups 1, 3, 5, 7, 9, 25, 27, ... all hold superblock backups.
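As a sketch of how to see these backup locations without touching a real disk (assumes the e2fsprogs tools mkfs.ext4 and dumpe2fs are installed; uses a file-backed image, so no root or spare partition is needed):

```shell
# Build a small ext4 image in an ordinary file and list where the
# primary and backup superblocks were placed.
dd if=/dev/zero of=/tmp/ext4demo.img bs=1M count=64 status=none
mkfs.ext4 -q -F /tmp/ext4demo.img
dumpe2fs /tmp/ext4demo.img 2>/dev/null | grep -i 'superblock at'
rm -f /tmp/ext4demo.img
```

The grep picks out the "Primary superblock at ..." and "Backup superblock at ..." lines from each group's entry.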
Experiment: destroying the superblock
[root@servera ~]# mount /dev/vdd1 /data
[root@servera ~]# cd /data
[root@servera data]# cp -r /etc/ .
[root@servera data]# cp /etc/passwd /etc/group .
[root@servera data]# cd ..
[root@servera /]# umount /data
[root@servera /]# dd if=/dev/zero of=/dev/vdd1 bs=4K count=10
10+0 records in
10+0 records out
40960 bytes (41 kB, 40 KiB) copied, 0.00177638 s, 23.1 MB/s
[root@servera /]# mount /dev/vdd1 /data
mount: /data: wrong fs type, bad option, bad superblock on /dev/vdd1, missing codepage or helper program, or other error.
[root@servera /]#
The first 40 KB of the partition have now been overwritten with zeros.
Repairing the filesystem
[root@servera /]# e2fsck -v /dev/vdd1
e2fsck 1.44.3 (10-July-2018)
ext2fs_open2: Bad magic number in super-block
e2fsck: Superblock invalid, trying backup blocks...
/dev/vdd1 was not cleanly unmounted, check forced.
Resize inode not valid. Recreate<y>? yes
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: +(98304--98432) +(163840--163968)
Fix<y>? yes
Free blocks count wrong for group #0 (28520, counted=28236).
Fix<y>? yes
Free blocks count wrong for group #1 (32639, counted=26806).
Fix<y>? yes
Free blocks count wrong (249188, counted=243071).
Fix<y>? yes
Free inodes count wrong for group #0 (8181, counted=7161).
Fix<y>? yes
Directories count wrong for group #0 (2, counted=284).
Fix<y>? yes
Free inodes count wrong (65525, counted=64505).
Fix<y>? yes
Inode bitmap differences: Group 0 inode bitmap does not match checksum.
FIXED.
/dev/vdd1: ***** FILE SYSTEM WAS MODIFIED *****
1031 inodes used (1.57%, out of 65536)
2 non-contiguous files (0.2%)
1 non-contiguous directory (0.1%)
# of inodes with ind/dind/tind blocks: 0/0/0
Extent depth histogram: 866
19073 blocks used (7.28%, out of 262144)
0 bad blocks
0 large files
580 regular files
284 directories
0 character device files
0 block device files
0 fifos
0 links
158 symbolic links (157 fast symbolic links)
0 sockets
------------
1022 files
[root@servera /]# mount /dev/vdd1 /data/
[root@servera /]# cd /data
[root@servera data]# ls
etc group lost+found passwd
[root@servera data]#
e2fsck rebuilt the superblock from one of the later backup copies.
e2fsck -b 98304 /dev/vdd1
restores using a specific backup superblock (ext family only); backups further into the disk are less likely to have been clobbered, and dumpe2fs lists where they live.
lost+found must not be deleted; e2fsck places recovered files there.
To repair XFS instead:
xfs_repair /dev/vdd
Journaling filesystems
e.g. IBM AIX's JFS (Journaled File System) and JFS2
On Linux, journaling starts with ext3; ext4 and xfs are journaling filesystems.
When a file is written, the index is written into the inode first, then the contents into the data blocks.
Example: the inode for a.txt records a size of 100 MB. If the machine dies after only 50 MB has been written, at the next boot the index still points at a 100 MB file. A non-journaling filesystem then scans the whole partition for the missing data (it cannot be found, which wastes time) before giving up on the half-written, incomplete file.
Journaling:
The inode is first written into the journal area, then the data blocks are written; only then is the journal copy committed to the real inode, and finally the journal entry is deleted.
If the machine still dies halfway, any journal entry that was never fully committed to the real inode is simply discarded (it was incomplete). No searching is needed, and no half-written garbage is left behind.
The cost is that the index is written twice, which adds up with many small files.
The journal area is a temporary staging location for the index.
writeback delays the copy from the journal area into the real inode (with many small files, the delayed index writes can then be flushed in one pass, but this carries a safety risk; the default is to commit immediately).
There are two kinds of journal area:
Internal journal: carved out of the filesystem itself, e.g. 256 MB inside a 100 GB filesystem.
External journal: a partition on another disk used as the journal. The journal work moves to the second disk, so each disk writes the inode only once.
Check what block size the external journal device needs:
the block sizes of the two devices must match.
[root@servera /]# dumpe2fs /dev/vdb1 | grep -i journal
dumpe2fs 1.44.3 (10-July-2018)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Journal inode: 8
Journal backup: inode blocks
Journal features: (none)
Journal size: 64M
Journal length: 16384
Journal sequence: 0x00000001
Journal start: 0
Creating the external journal device
[root@servera /]# mke2fs -O journal_dev -b 4096 /dev/vdc1
mke2fs 1.44.3 (10-July-2018)
Creating filesystem with 131072 4k blocks and 0 inodes
Filesystem UUID: 0976edfc-8353-4d19-83b0-9f1fc399d6ce
Superblock backups stored on blocks:
Zeroing journal device:
Removing vdb1's internal journal
[root@servera /]# tune2fs -O ^has_journal /dev/vdb1
tune2fs 1.44.3 (10-July-2018)
[root@servera /]# dumpe2fs /dev/vdb1 | grep -i journal
dumpe2fs 1.44.3 (10-July-2018)
Journal backup: inode blocks
[root@servera /]#
Attaching vdc1 as vdb1's external journal
[root@servera /]# tune2fs -j -J device=/dev/vdc1 /dev/vdb1
tune2fs 1.44.3 (10-July-2018)
Creating journal on device /dev/vdc1: done
[root@servera /]# dumpe2fs /dev/vdb1 | grep -i journal
dumpe2fs 1.44.3 (10-July-2018)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Journal UUID: 0976edfc-8353-4d19-83b0-9f1fc399d6ce
Journal device: 0xfc21
Journal backup: inode blocks
[root@servera /]#
The journal device number 0xfc21 decodes as major:minor:
high byte 0xfc = 15×16 + 12 = 252 (major), low byte 0x21 = 2×16 + 1 = 33 (minor),
and 252:33 is /dev/vdc1.
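The decode can be checked with shell arithmetic:

```shell
# 0xfc21: high byte = major device number, low byte = minor.
devnum=$(( 0xfc21 ))
printf 'major=%d minor=%d\n' $(( devnum >> 8 )) $(( devnum & 0xff ))
# prints: major=252 minor=33   (252:33 is /dev/vdc1 in the lsblk output)
```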
Reverting to an internal journal
[root@servera /]# tune2fs -O ^has_journal /dev/vdb1
tune2fs 1.44.3 (10-July-2018)
Journal removed
[root@servera /]# dumpe2fs /dev/vdb1 | grep -i journal
dumpe2fs 1.44.3 (10-July-2018)
Journal backup: inode blocks
[root@servera /]# tune2fs -O has_journal /dev/vdb1
tune2fs 1.44.3 (10-July-2018)
Creating journal inode: done
[root@servera /]# dumpe2fs /dev/vdb1 | grep -i journal
dumpe2fs 1.44.3 (10-July-2018)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Journal inode: 8
Journal backup: inode blocks
Journal features: (none)
Journal size: 64M
Journal length: 16384
Journal sequence: 0x00000001
Journal start: 0
[root@servera /]#
XFS with an external log
[root@servera /]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:0 0 10G 0 disk
└─vda1 252:1 0 10G 0 part /
vdb 252:16 0 5G 0 disk
└─vdb1 252:17 0 3G 0 part
vdc 252:32 0 5G 0 disk
└─vdc1 252:33 0 400M 0 part
vdd 252:48 0 5G 0 disk
[root@servera /]# mkfs.xfs -l logdev=/dev/vdc1 /dev/vdb1
meta-data=/dev/vdb1 isize=512 agcount=4, agsize=196608 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1
data = bsize=4096 blocks=786432, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =/dev/vdc1 bsize=4096 blocks=102400, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
[root@servera /]#
[root@servera /]# xfs_info /dev/vdb1
meta-data=/dev/vdb1 isize=512 agcount=4, agsize=196608 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1
data = bsize=4096 blocks=786432, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =external bsize=4096 blocks=102400, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
[root@servera /]#
Mounting (fstab entry):
/dev/vdb1 /data xfs defaults,logdev=/dev/vdc1 0 0
[root@servera /]# mount | tail -n 1
/dev/vdb1 on /data type xfs (rw,relatime,seclabel,attr2,inode64,logdev=/dev/vdc1,noquota)
[root@servera /]#
Networking
Transmit path:
Outgoing data is placed into a buffer; the kernel encapsulates it into PDUs and moves them onto the transmit queue.
The driver reads PDUs from the head of the queue into the NIC, and finally the NIC raises an interrupt and sends them out.*
Receive path:
The NIC receives the data frame and copies it via DMA (direct memory access) into a receive buffer.*
The NIC raises a hard interrupt.* The kernel handles the interrupt and schedules a softirq, and the softirq routes the packet up to the IP layer.
* = tunable
Batching packets into a buffer and transmitting them in one pass is more efficient.
Sizing the buffer with BDP
BDP (Bandwidth-Delay Product) = bandwidth × latency
socket buffer = BDP / number of sockets (when two NICs are bonded together)
With a single NIC:
buffer = BDP
The larger the latency, the larger the buffer; the larger the bandwidth, the larger the buffer, because more data is in flight per burst.
When latency is high (the two sites are far apart), it pays to send more data per burst; when latency is low (the sites are close), smaller bursts are fine.
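As a numeric sketch of the formula (the 100 Mbit/s link speed and 10 ms RTT here are made-up example values):

```shell
# BDP in bytes = bandwidth (bit/s) / 8 * RTT (s).
bw_bps=$((100 * 1000 * 1000))          # example: 100 Mbit/s link
rtt_ms=10                              # example: 10 ms round-trip time
bdp=$(( bw_bps / 8 * rtt_ms / 1000 ))
echo "BDP = $bdp bytes"                # prints: BDP = 125000 bytes
```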
UDP
[root@servera /]# sysctl -a | grep net.core | grep mem
net.core.rmem_default = 212992    ← default read (receive) buffer, for incoming packets
net.core.rmem_max = 212992
net.core.wmem_default = 212992    ← default write (send) buffer, for outgoing packets
net.core.wmem_max = 212992
TCP
TCP tuning = net.core values + the TCP-specific values;
the net.core values above must be raised as well.
[root@servera /]# sysctl -a | grep tcp | grep mem
net.ipv4.tcp_mem = 19779 26373 39558    ← overall TCP memory budget, in pages
net.ipv4.tcp_rmem = 4096 87380 6291456    ← TCP read buffer, in bytes
net.ipv4.tcp_wmem = 4096 16384 4194304    ← TCP write buffer, in bytes
The three fields are minimum, default, maximum;
the rmem/wmem buffers must stay within the overall tcp_mem budget.
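To make tuned values survive a reboot, the usual route is a sysctl drop-in; a sketch with example values only (the file name and numbers are illustrative, sized for a 25 MB BDP):

```shell
# Persist the tuned values in a sysctl drop-in (example values; size to
# your own BDP). Written to /tmp here; use /etc/sysctl.d/ on a real system.
cat <<'EOF' > /tmp/90-bdp.conf
net.core.rmem_max = 25000000
net.core.wmem_max = 25000000
net.ipv4.tcp_rmem = 4096 87380 25000000
net.ipv4.tcp_wmem = 4096 16384 25000000
EOF
# Apply with: sysctl -p /etc/sysctl.d/90-bdp.conf
```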
Experiment
[root@workstation ~]# lab network-latency start
[root@servera /]# ping 192.168.0.254
PING 192.168.0.254 (192.168.0.254) 56(84) bytes of data.
64 bytes from 192.168.0.254: icmp_seq=1 ttl=64 time=2001 ms
64 bytes from 192.168.0.254: icmp_seq=2 ttl=64 time=2000 ms
64 bytes from 192.168.0.254: icmp_seq=3 ttl=64 time=2001 ms
[root@servera html]# time wget http://192.168.0.254/bigfile
--2022-07-08 22:13:29-- http://192.168.0.254/bigfile
Connecting to 192.168.0.254:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5242880 (5.0M)
Saving to: ‘bigfile.1’
bigfile.1 100%[===================================================================================================================>] 5.00M 197KB/s in 26s
2022-07-08 22:13:59 (197 KB/s) - ‘bigfile.1’ saved [5242880/5242880]
real 0m30.027s
user 0m0.004s
sys 0m0.014s
The bandwidth is known (100 Mbit/s), and ping gives the latency (RTT ≈ 2 s).
BDP = 100 Mbit/s bandwidth × 2 s
which gives the buffer size to set.
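The arithmetic behind that buffer size:

```shell
# 100 Mbit/s * 2 s RTT, converted to bytes.
echo $(( 100 * 1000 * 1000 / 8 * 2 ))   # prints: 25000000
```

which is exactly the 25000000 used for rmem in the sysctl settings.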
After tuning:
[root@servera html]# sysctl -p
net.ipv4.tcp_rmem = 4096 12500000 25000000
net.core.rmem_default = 25000000
[root@servera html]# sysctl -w vm.drop_caches=3
vm.drop_caches = 3
[root@servera html]# !time
time wget http://192.168.0.254/bigfile
--2022-07-08 22:23:13-- http://192.168.0.254/bigfile
Connecting to 192.168.0.254:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5242880 (5.0M)
Saving to: ‘bigfile.1’
bigfile.1 100%[===================================================================================================================>] 5.00M 320KB/s in 16s
2022-07-08 22:23:33 (320 KB/s) - ‘bigfile.1’ saved [5242880/5242880]
real 0m20.092s
user 0m0.004s
sys 0m0.021s
In a hybrid cloud
private-cloud and public-cloud workloads need to interoperate,
linked through VPN endpoints: a path between two fixed points
that carries very heavy traffic, so it is worth tuning.
Supplement
How do we work out a network's transmission capacity?
As everyone knows, networks have a bandwidth limit. Bandwidth describes transmission capacity, but it is measured differently from kernel buffers:
Bandwidth is traffic per unit time, i.e. a speed, for example the common 100 MB/s;
Buffers are measured in bytes, and multiplying network speed by time yields bytes. With a maximum bandwidth of 100 MB/s and a network delay (RTT) of 10 ms, the path between client and server can hold 100 MB/s × 0.01 s = 1 MB of data.
That 1 MB is the product of bandwidth and delay, hence the name Bandwidth-Delay Product (BDP). It is also the size of the TCP data "in flight": sitting on the wire, in routers, and in other network devices. Pushing more than 1 MB into flight overloads the network and invites packet loss.
Because the send buffer bounds the send window, and the send window bounds the amount of sent-but-unacknowledged data in flight, the send buffer must not exceed the BDP.
Send buffer vs. BDP:
If the send buffer exceeds the BDP, the excess cannot be transmitted effectively and the network overloads, dropping packets;
If the send buffer is smaller than the BDP, the network's capacity is not fully used.
So the send buffer is best sized close to the BDP.
————————————————
Original author: CrazyZard
Source: https://learnku.com/articles/46249
The buffers above serve the connections; buffers can also be given to the NIC itself.
DMA
DMA buffer (DMA = direct memory access): memory the NIC reads and writes directly, with no OS address translation in the path; the tunable is how much DMA buffer to give the NIC.
[root@foundation0 ~]# ethtool ens160
Settings for ens160:
Supported ports: [ TP ]
Supported link modes: 1000baseT/Full
10000baseT/Full
Supported pause frame use: No
Supports auto-negotiation: No    ← whether auto-negotiation is supported
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 10000Mb/s    ← 10 Gbit/s
Duplex: Full    ← full duplex
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: off
MDI-X: Unknown
Supports Wake-on: uag
Wake-on: d
Link detected: yes
Looking up the driver parameters on the host machine
[root@foundation0 ~]# modinfo -p e1000
TxDescriptors:Number of transmit descriptors (array of int)
RxDescriptors:Number of receive descriptors (array of int)
Speed:Speed setting (array of int)
Duplex:Duplex setting (array of int)
AutoNeg:Advertised auto-negotiation setting (array of int)
FlowControl:Flow Control setting (array of int)
XsumRX:Disable or enable Receive Checksum offload (array of int)
TxIntDelay:Transmit Interrupt Delay (array of int)
TxAbsIntDelay:Transmit Absolute Interrupt Delay (array of int)
RxIntDelay:Receive Interrupt Delay (array of int)
RxAbsIntDelay:Receive Absolute Interrupt Delay (array of int)
InterruptThrottleRate:Interrupt Throttling Rate (array of int)
SmartPowerDownEnable:Enable PHY smart power down (array of int)
copybreak:Maximum size of packet that is copied to a new buffer on receive (uint)
debug:Debug level (0=none,...,16=all) (int)
[root@foundation0 ~]#
Giving the NIC more buffer
means changing driver parameters.
Adjusting the bandwidth requires turning auto-negotiation off first (with auto-negotiation on, the speed cannot be changed).
Verified working on RHEL 7:
ethtool -s eno16777736 autoneg off speed 100
advertise sets which modes are advertised:
advertise N
Sets the speed and duplex advertised by autonegotiation. The argument is a hexadecimal value using one or a combination of the following values:
0x001 10baseT Half
0x002 10baseT Full
0x004 100baseT Half
0x008 100baseT Full
0x010 1000baseT Half (not supported by IEEE standards)
0x020 1000baseT Full
0x20000 1000baseKX Full
0x20000000000 1000baseX Full
0x800000000000 2500baseT Full
This cannot be set inside a virtual machine.
It can be persisted in the interface configuration file:
ETHTOOL_OPTS="-s ${DEVICE} autoneg off speed 1000 duplex full"
On RHEL 8 this can be done with nmcli.
A UDP test may be a bit faster;
it shows the raw bandwidth of the link without application-layer processing.
NIC bonding
Bonding two NICs improves reliability and throughput.
bonding_opts: miimon=100 checks the link state every 100 ms. The check is done either by the NIC driver or via the MII/ethtool ioctls:
use_carrier=1 (the default) asks the driver (netif_carrier); use_carrier=0 uses the MII/ethtool ioctls.
Jumbo frames
If the switch supports them:
the normal MTU is 1500; raising it to 9000 can greatly improve efficiency.
In the interface configuration file:
MTU=9000
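On RHEL 8 the MTU can also be set through NetworkManager; a sketch (the connection name eth0 is a placeholder for your own):

```shell
# Raise the MTU via NetworkManager and re-activate the connection.
nmcli connection modify eth0 802-3-ethernet.mtu 9000
nmcli connection up eth0
# Verify the MTU actually in effect on the device:
ip link show eth0 | grep -o 'mtu [0-9]*'
```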