File I/O

3.1 Introduction

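On UNIX systems, most file I/O can be performed with just five functions: open, read, write, lseek, and close.

These functions are often called unbuffered I/O: each read or write invokes a system call in the kernel. The unbuffered I/O functions are not part of ISO C, but they are part of POSIX.1 and the Single UNIX Specification.

Whenever resources are shared among multiple processes, atomic operations become important; this chapter looks at atomicity in the context of file I/O and the arguments to open. The dup, fcntl, sync, fsync, and ioctl functions are also covered later in the chapter.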

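3.2 File Descriptors

A file descriptor is a non-negative integer that identifies an open file within a process. By convention, UNIX shells and applications use file descriptors 0, 1, and 2 for standard input, standard output, and standard error. For readability, use the symbolic constants STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO, defined in <unistd.h>. Do not casually change what these three descriptors refer to, or some programs may stop working correctly.

File descriptors range from 0 through OPEN_MAX-1. Early UNIX implementations allowed each process at most 20 open files (OPEN_MAX == 20), and many later systems raised the limit to 64. On my CentOS 7 machine, sysconf(_SC_OPEN_MAX) returns 4096 (the _SC_ prefix stands for "system configuration"), so a process can have up to 4096 files open at once.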

3.3 open and openat Functions

A file is opened or created by calling either open or openat.

#include<fcntl.h>
int open(const char* path, int oflag/*, mode_t mode*/);
int openat(int fd, const char* path, int oflag/*, mode_t mode*/);
// Both return : file descriptor if OK, -1 on error

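The last argument, mode, is used only when a new file is being created.

path is the name of the file to open or create; it can be either a relative or an absolute pathname. oflag specifies the options for the call and is formed by ORing together one or more of the constants below.

Exactly one of the following five constants must be specified:

O_RDONLY	Open for reading only.
O_WRONLY	Open for writing only.
O_RDWR	Open for reading and writing.
O_EXEC	Open for execute only.
O_SEARCH	Open for search only (applies to directories).

*For compatibility with older programs, most implementations define O_RDONLY = 0, O_WRONLY = 1, and O_RDWR = 2.

*O_SEARCH is used to evaluate search permissions at the time a directory is opened. None of the UNIX systems covered in APUE supports O_SEARCH.

The following constants are optional: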

O_APPEND	Append to the end of file on each write.
O_CLOEXEC	Set the FD_CLOEXEC file descriptor flag.
O_CREAT	Create the file if it doesn’t exist. This option requires a third argument to the open function (a fourth argument to the openat function): the mode, which specifies the access permission bits of the new file. (When we describe a file’s access permission bits in Section 4.5, we’ll see how to specify the mode and how it can be modified by the umask value of a process.)
O_DIRECTORY	Generate an error if path doesn’t refer to a directory.
O_EXCL	Generate an error if O_CREAT is also specified and the file already exists. This test for whether the file already exists and the creation of the file if it doesn’t exist is an atomic operation. We describe atomic operations in more detail in Section 3.11.
O_NOCTTY	If path refers to a terminal device, do not allocate the device as the controlling terminal for this process. We talk about controlling terminals in Section 9.6.
O_NOFOLLOW	Generate an error if path refers to a symbolic link. We discuss symbolic links in Section 4.17.
O_NONBLOCK	If path refers to a FIFO, a block special file, or a character special file, this option sets the nonblocking mode for both the opening of the file and subsequent I/O. We describe this mode in Section 14.2.
O_SYNC	Have each write wait for physical I/O to complete, including I/O necessary to update file attributes modified as a result of the write. We use this option in Section 3.14.
O_TRUNC	If the file exists and if it is successfully opened for either write-only or read–write, truncate its length to 0.
O_TTY_INIT	When opening a terminal device that is not already open, set the nonstandard termios parameters to values that result in behavior that conforms to the Single UNIX Specification. We discuss the termios structure when we discuss terminal I/O in Chapter 18.

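The following two flags are also optional. They are part of the synchronized input and output option of the Single UNIX Specification and POSIX.1:

O_DSYNC	Have each write wait for physical I/O to complete, but don’t wait for file attributes to be updated if they don’t affect the ability to read the data just written.
O_RSYNC	Have each read operation on the file descriptor wait until any pending writes for the same portion of the file are complete.

*The O_DSYNC and O_SYNC flags are similar but subtly different. With O_DSYNC, the file's attributes are updated synchronously only when they need to be updated to reflect a change in the file's data. With O_SYNC, data and attributes are always updated synchronously. For example, when we overwrite an existing part of a file opened with O_DSYNC, the file times are not updated synchronously. In contrast, if the file was opened with O_SYNC, every write updates the file's times before write returns, regardless of whether we are overwriting existing bytes or appending to the file.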

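The file descriptor returned by open and openat is guaranteed to be the lowest-numbered unused descriptor. Some applications rely on this to open a new file on standard input, standard output, or standard error: for example, closing standard output (descriptor 1) and then opening a file makes that file descriptor 1. The dup2 function, discussed in Section 3.12, offers a better way to guarantee that a file is open on a specific descriptor.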

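The difference between open and openat lies in the fd parameter. There are three cases:

1. path is an absolute pathname. The fd parameter is ignored, and openat behaves like open.

2. path is a relative pathname and fd is a file descriptor that specifies the starting location in the file system from which the relative pathname is evaluated. The fd value is obtained by opening the directory.

3. path is a relative pathname and fd has the special value AT_FDCWD. The pathname is evaluated starting in the current working directory, and openat behaves like open.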

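The openat function was added to POSIX.1 in the 2008 revision to address two problems:

1. It lets threads use relative pathnames to open files in directories other than the current working directory. (All threads in a process share one current working directory.)

2. It provides a way to avoid time-of-check-to-time-of-use (TOCTTOU) errors.

The book describes TOCTTOU errors as follows: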

The basic idea behind TOCTTOU errors is that a program is vulnerable if it makes two file-based function calls where the second call depends on the results of the first call. Because the two calls are not atomic, the file can change between the two calls, thereby invalidating the results of the first call, leading to a program error. TOCTTOU errors in the file system namespace generally deal with attempts to subvert file system permissions by tricking a privileged program into either reducing permissions on a privileged file or modifying a privileged file to open up a security hole. Wei and Pu[2005] discuss TOCTTOU weaknesses in the UNIX file system interface.

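Filename and pathname truncation:

If NAME_MAX is 14 and we try to create a file whose name has 15 characters, historical systems behaved differently: some silently truncated the name to 14 characters, while others returned an error.

In POSIX.1, the constant _POSIX_NO_TRUNC determines whether long filenames and pathnames are truncated or an error is returned. We can use fpathconf or pathconf to query the value of NAME_MAX for a given directory and to find out which of the two behaviors applies there.

When _POSIX_NO_TRUNC is in effect and a filename exceeds NAME_MAX, an error status is returned and errno is set to ENAMETOOLONG.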

3.4 creat Function

#include <fcntl.h>
int creat(const char *path, mode_t mode);
//Returns: file descriptor opened for write-only if OK, −1 on error
// This function is equivalent to:
open(path, O_WRONLY | O_CREAT | O_TRUNC, mode);

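Historically, the second argument to open could only be 0, 1, or 2 (O_RDONLY, O_WRONLY, or O_RDWR), so there was no way to open a file that did not yet exist, and a separate creat call was needed to create it. Because creat opens the file for writing only, a program that wanted to create a file and then read it back had to call creat, then close, then open.

Now that open supports O_CREAT and O_TRUNC, creat is no longer needed. Using open alone is simpler and more direct.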

3.5 close Function

#include <unistd.h>
int close(int fd);
//Returns: 0 if OK, −1 on error

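The close function closes an open file. Closing a file also releases any record locks that the process may have on the file. We’ll discuss this point further in Section 14.3.

When a process terminates, the kernel automatically closes all of its open files. Many programs take advantage of this and do not close their files explicitly.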

3.6 lseek Function

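Every open file has an associated current file offset, normally a non-negative integer that measures the number of bytes from the beginning of the file. Read and write operations normally start at the current file offset and advance it by the number of bytes read or written. By default, the offset is initialized to 0 when a file is opened, unless O_APPEND is specified.

The offset of an open file can be set explicitly by calling lseek.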

#include <unistd.h>
off_t lseek(int fd, off_t offset, int whence);
//Returns: new file offset if OK, −1 on error

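The interpretation of offset depends on the value of the whence argument:

1. If whence is SEEK_SET, the file's offset is set to offset bytes from the beginning of the file.

2. If whence is SEEK_CUR, the file's offset is set to its current value plus offset. The offset can be positive or negative.

3. If whence is SEEK_END, the file's offset is set to the size of the file (that is, the end of the file) plus offset. The offset can be positive or negative.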

off_t currpos;
currpos = lseek(fd, 0, SEEK_CUR);

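The call above returns the current offset. It can also be used to test whether a file is capable of seeking: if the file descriptor refers to a pipe, FIFO, or socket, lseek returns -1 and sets errno to ESPIPE.

The l in lseek stands for "long integer". Before the off_t data type was introduced, the offset argument and the return value were long integers. lseek itself was introduced with Version 7, when long integers were added to C. (Version 6 provided similar functionality with the seek and tell functions.)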

#include "apue.h"
int main(void)
{
    if (lseek(STDIN_FILENO, 0, SEEK_CUR) == -1)
        printf("cannot seek\n");  // the result here
    else
        printf("seek OK\n");
    exit(0);
}

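For regular files the offset must be non-negative, but some devices allow negative offsets. (The /dev/kmem device on FreeBSD for the Intel x86 processor supports negative offsets.) Because a negative offset can therefore be a legitimate return value, test whether the return value of lseek is equal to -1 to detect an error, rather than testing whether it is less than 0.

The file's offset can be greater than the file's current size. In that case the next write extends the file and leaves a hole in it; bytes in the hole that have never been written read back as 0.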

#include"apue.h"
#include<stdio.h>
#include<fcntl.h>
#include<sys/types.h>
#include<sys/stat.h>

int main(void){
    int fd;
    char buf1[] = "abcdefghij";
    char buf2[] = "ABCDEFGHIJ";
    if((fd = open("file.hole", O_CREAT|O_RDWR, S_IRWXU)) < 0){
        err_sys("open error");
    }
    if(write(fd, buf1, 10) != 10){
        err_sys("buf1 write error");
    } // offset is now 10
    if(lseek(fd,16384,SEEK_SET) == -1){
        err_sys("lseek error");
    } // offset is now 16384
    if(write(fd, buf2, 10) != 10){
        err_sys("buf2 write error");
    } // offset is now 16394
    exit(0);
}

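Comparing file.hole with file.nohole, a file of the same size created without a hole, shows that the two occupy different numbers of disk blocks: 8 for file.hole and 20 for file.nohole. Holes in files are covered in more detail in Section 4.12.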

Size of off_t: the number of bytes in off_t differs between platforms. Most platforms today provide two sets of interfaces to manipulate file offsets: one set that uses 32-bit file offsets and another set that uses 64-bit file offsets.

Note that even though you might enable 64-bit file offsets, your ability to create a file larger than 2 GB (2³¹-1 bytes) depends on the underlying file system type.

3.7 read Function

#include <unistd.h>
ssize_t read(int fd, void *buf, size_t nbytes);
//Returns: number of bytes read, 0 if end of file, −1 on error

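read starts reading at the file's current offset, and the offset is advanced by the number of bytes actually read.

There are several cases in which the number of bytes actually read is less than the amount requested:

1. When reading from a regular file, if the end of file is reached before the requested number of bytes has been read.

2. When reading from a terminal device. Normally, up to one line is read at a time. (Chapter 18 shows how to change this default.)

3. When reading from a network. Buffering within the network may cause less than the requested amount to be returned.

4. When reading from a pipe or FIFO. If the pipe contains fewer bytes than requested, read returns only what is available.

5. When reading from a record-oriented device. Some record-oriented devices, such as magnetic tape, can return up to a single record at a time.

6. When interrupted by a signal after a partial amount of data has already been read. We discuss this further in Section 10.5.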

3.8 write Function

#include <unistd.h>
ssize_t write(int fd, const void *buf, size_t nbytes);
//Returns: number of bytes written if OK, −1 on error

The return value is usually equal to the nbytes argument; otherwise, an error has occurred. A common cause for a write error is either filling up a disk or exceeding the file size limit for a given process (Section 7.11 and Exercise 10.11).

3.9 I/O Efficiency

#include "apue.h"
#define BUFFSIZE 4096
int
main(void)
{
    int n;
    char buf[BUFFSIZE];
    while ((n = read(STDIN_FILENO, buf, BUFFSIZE)) > 0)
        if (write(STDOUT_FILENO, buf, n) != n)
            err_sys("write error");
    if (n < 0)
        err_sys("read error");
    exit(0);
}

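The program was timed reading a file from standard input with standard output redirected to /dev/null, using different values of BUFFSIZE. The main observations follow.

Once BUFFSIZE reaches 32, increasing it further has almost no effect on the total clock time. This is because most file systems support some kind of read-ahead to improve performance. When sequential reads are detected, the system tries to read in more data than an application requests, assuming that the application will read it shortly. Once BUFFSIZE reaches 4096, increasing it further does little to reduce the system time, because the test was run on a Linux ext4 file system with a block size of 4096 bytes.

*When measuring with this program, use a different file for each run. The operating system caches file data, so rereading the same file gives misleading timings.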

3.10 File Sharing

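The UNIX System supports the sharing of open files among different processes.

[Figure: kernel data structures for a single process that has two different files open]

[Figure: kernel data structures for a single process with two file descriptors that refer to the same file]

[Figure: kernel data structures for two independent processes that have the same file open]

In these figures, fd flags means the file descriptor flags, which include the close-on-exec flag.

The file status flags describe how the file was opened: the access mode (read, write, or both), the way operations are performed (append, nonblocking, synchronous), and similar properties of the open file.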

3.11 Atomic Operations

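Suppose a single process wants to append to the end of a file. Older versions of the UNIX System did not support the O_APPEND option for open, so the program had to be coded as follows: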

if (lseek(fd, 0L, 2) < 0) /* position to EOF */
    err_sys("lseek error");
if (write(fd, buf, 100) != 100) /* and write */
    err_sys("write error");

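This works fine for a single process, but it causes problems when multiple processes append to the same file, because positioning to the end of the file and writing are two separate function calls and therefore not an atomic operation. If another process extends the file between the two calls, the write overwrites that process's data.

The UNIX System provides an atomic way to do this: setting the O_APPEND flag when the file is opened. The kernel then positions the file to its current end before each write, so we no longer need to call lseek before each write.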

pread and pwrite Functions

The Single UNIX Specification includes two functions that allow applications to seek and perform I/O atomically: pread and pwrite.

#include <unistd.h>
ssize_t pread(int fd, void *buf, size_t nbytes, off_t offset);
//Returns: number of bytes read, 0 if end of file, −1 on error
ssize_t pwrite(int fd, const void *buf, size_t nbytes, off_t offset);
//Returns: number of bytes written if OK, −1 on error

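Calling pread is equivalent to calling lseek followed by a call to read, with two exceptions: there is no way to interrupt the two operations, and the current file offset is not updated. pwrite behaves the same way for writes.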

Creating a File

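Another example of an atomic operation is creating a file. When both O_CREAT and O_EXCL are specified, open fails if the file already exists, and the check for the file's existence and its creation are performed as one atomic operation.

Without this atomic operation, we might try: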

if ((fd = open(path, O_WRONLY)) < 0) {
    if (errno == ENOENT) {
        if ((fd = creat(path, mode)) < 0)
            err_sys("creat error");
    } 
    else {
        err_sys("open error");
    }
}

If another process creates the file between the open and the creat and writes data to it, there is a problem: that data is erased when creat executes.

3.12 dup and dup2 Functions

#include <unistd.h>
int dup(int fd);
int dup2(int fd, int fd2);
//Both return: new file descriptor if OK, −1 on error

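dup and dup2 both return a new file descriptor that is a copy of an existing one. The descriptor returned by dup is guaranteed to be the lowest-numbered available file descriptor. With dup2, we specify the value of the new descriptor with the fd2 argument. If fd2 is already open, it is first closed. If fd equals fd2, dup2 returns fd2 without closing it. Otherwise, the FD_CLOEXEC file descriptor flag is cleared for fd2, so that fd2 is left open if the process calls exec.

After dup or dup2, the kernel data structures look like the figure in Section 3.10 in which two descriptors of one process refer to the same file (Figure 3.9 in the book).

Each descriptor has its own set of file descriptor flags. The close-on-exec file descriptor flag for the new descriptor is always cleared by the dup functions, except when dup2 is called with fd2 equal to fd.

Another way to duplicate a descriptor is with the fcntl function.

dup(fd) is equivalent to fcntl(fd, F_DUPFD, 0).

dup2(fd, fd2) is equivalent to close(fd2) followed by fcntl(fd, F_DUPFD, fd2), but not exactly.

First, dup2 is an atomic operation, whereas the alternative involves two function calls. Second, dup2 and fcntl differ in some of the errno values they set.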

3.13 sync, fsync, and fdatasync Functions

Traditional implementations of the UNIX System have a buffer cache or page cache in the kernel through which most disk I/O passes.

When we write data to a file, the data is normally copied by the kernel into one of its buffers and queued for writing to disk at some later time. This is called delayed write.

The kernel eventually writes all the delayed-write blocks to disk, normally when it needs to reuse the buffer for some other disk block. To ensure consistency of the file system on disk with the contents of the buffer cache, the sync, fsync, and fdatasync functions are provided.

#include <unistd.h>
int fsync(int fd);
int fdatasync(int fd);
//Returns: 0 if OK, −1 on error
void sync(void);

The sync function simply queues all the modified block buffers for writing and returns; it does not wait for the disk writes to take place.

The function sync is normally called periodically (usually every 30 seconds) from a system daemon, often called update.

The fsync function refers only to a single file, specified by the file descriptor fd, and waits for the disk writes to complete before returning. fsync is used when an application, such as a database, needs to be sure that the modified blocks have been written to the disk.

The fdatasync function is similar to fsync, but it affects only the data portions of a file. With fsync, the file’s attributes are also updated synchronously.

3.14 fcntl Function

The fcntl function can change the properties of a file that is already open.

#include <fcntl.h>
int fcntl(int fd, int cmd, ... /* int arg */ );
//Returns: depends on cmd if OK (see following), −1 on error

The fcntl function is used for five different purposes.
1. Duplicate an existing descriptor (cmd = F_DUPFD or F_DUPFD_CLOEXEC)
2. Get/set file descriptor flags (cmd = F_GETFD or F_SETFD)
3. Get/set file status flags (cmd = F_GETFL or F_SETFL)
4. Get/set asynchronous I/O ownership (cmd = F_GETOWN or F_SETOWN)
5. Get/set record locks (cmd = F_GETLK, F_SETLK, or F_SETLKW)

3.15 ioctl Function

#include <unistd.h> /* System V */
#include <sys/ioctl.h> /* BSD and Linux */
int ioctl(int fd, int request, ...);
//Returns: −1 on error, something else if OK