[Design Pattern] Upload big file - 4. Code Design - part 2 & Summary - Zhentiw

Validates the file size
Verifies the file hash
Marks the file status
Generates the file access URL
..

These operations are highly efficient.

How about file access?

Since the server does not perform actual file merging, it needs to handle dynamic processing when subsequent requests for the file are made. The specific approach is as follows:

Receive File Request:
- The server receives a request for the file and retrieves the corresponding file metadata from the database.
Locate All Chunks:
- The server retrieves the list of all chunk IDs for the file from the metadata and locates the corresponding chunk files in storage.
Stream File Using TaskQueue:
- The server utilizes the TaskQueue to control concurrency during file processing.
- Chunks are read sequentially or in parallel as needed, and a continuous read stream is created.
- The stream is piped directly to the network I/O to serve the file to the client.

查看代码

 import fs from 'fs';
import { TaskQueue } from './taskQueue'; // Assume TaskQueue is implemented

const taskQueue = new TaskQueue(4); // Limit to 4 concurrent file reads

// Simulated database with metadata
const fileMetadata = {
	fileId: '12345',
	chunks: ['chunk1.dat', 'chunk2.dat', 'chunk3.dat'],
};

// Serve file dynamically
async function serveFile(req, res) {
	const { fileId } = req.params;

	// Validate and fetch file metadata
	if (fileId !== fileMetadata.fileId) {
		res.status(404).send('File not found');
		return;
	}

	// Create readable stream and pipe to response
	res.setHeader('Content-Type', 'application/octet-stream');
	res.setHeader('Content-Disposition', 'attachment; filename="output-file.dat"');

	for (const chunk of fileMetadata.chunks) {
		await taskQueue.addTask(() => {
			return new Promise((resolve, reject) => {
				const chunkStream = fs.createReadStream(`./storage/${chunk}`);
				chunkStream
					.on('end', resolve)
					.on('error', reject)
					.pipe(res, { end: false });
			});
		});
	}

	res.end(); // End the response after all chunks
}

// Express server setup
import express from 'express';
const app = express();

app.get('/file/:fileId', serveFile);

app.listen(3000, () => {
	console.log('Server running on http://localhost:3000');
});

Summary

Developed the entire upload SDK from scratch, providing comprehensive support for file uploads, particularly large file uploads, across both frontend and backend. The SDK unifies the development approach for file uploads, covering everything from low-level protocols, utility classes, frontend components, to backend middleware.

In terms of implementation, to ensure flexibility, various design patterns were utilized to achieve complete decoupling between the SDK and upper-layer applications. Additionally, the server's storage structure was meticulously designed to ensure the uniqueness of file storage and transmission

Design choice

The common solution for large file uploads is file chunking. File chunking essentially breaks the large file upload process, which is a single large transaction, into multiple smaller chunk upload transactions, thereby reducing the risk of upload failures.

Implementing large file uploads involves numerous technical details. For example, defining the low-level protocol standard is critical as it determines how the frontend and backend interact, which in turn influences how the frontend and backend code are developed. Beyond the protocol, other considerations include how the frontend handles concurrency control, how to efficiently split files into chunks, and how the backend stores chunks, efficiently merges them, and ensures their uniqueness, among other challenges.

There isn’t a universal solution available on the market for these issues. While public cloud services like OSS (Object Storage Service) provide their own implementations, considering that our product may be deployed in customers' private clouds, the most reliable approach is to implement the entire large file upload process ourselves.

Technical implementation

My initial focus was on designing the upload process.

Traditional large file upload processes typically involve the client completing all chunking first, then calculating the hash for each chunk and the entire file. The hash is then used to exchange file information with the server. However, since hash calculation is a CPU-intensive operation, this approach can lead to prolonged client-side blocking. While using Web Workers can accelerate hash computation, my tests showed that even with multithreading, calculating the hash for extremely large files (e.g., files over 10 GB) on less powerful client machines can take more than 30 seconds, which is unacceptable.

To address this, I optimized the upload process. Assuming most uploads involve new files, I modified the workflow to allow users to start uploading chunks before the complete file hash is calculated. This approach achieves near-zero delay for uploads. Once the full file hash is computed, the hash data is supplemented to the server afterward.

Based on this workflow, I designed a standardized file upload protocol

The protocol consists of four communication standards:

File Creation Protocol:
The frontend sends a GET request to the server with the basic file information and receives a unique upload token in response. All subsequent requests must include this token.
Hash Verification Protocol:
The frontend sends the hash of a specific chunk or the entire file to the server to obtain the status of the chunk or file.
Chunk Upload Protocol:
The frontend uploads the binary data of each chunk to the server for storage.
Chunk Merging Protocol:
The frontend notifies the server that all chunks have been uploaded and the server can proceed with merging the chunks.

After designing the protocol, the next step was to implement it in code

For the frontend, the main challenges centered around two areas: how to split files into chunks and how to control the request flow.

File Chunking

Considering different scenarios, various chunking modes might be needed, such as:

Multithreaded chunking
Time-sliced chunking (similar to React Fiber)
Custom chunking modes defined by upper-layer applications.

To address this, I used the template pattern, leveraging TypeScript's abstract classes to define the overall chunking process. Specific subclasses only need to implement chunk hash calculations, enabling maximum flexibility.

Request Flow Control

Since many requests need to be sent, I developed a concurrent request control class to make full use of network bandwidth.

Additionally, the request process required exposing various hooks to the upper layer, such as:

Progress updates
Request state changes

To handle this, I implemented a generic EventEmitter class using the publish-subscribe pattern. This allows the request process to emit various events, which the upper-layer application can handle by listening to these events, enabling seamless integration.

Of course, the most complex part of the system lies in the backend

Since our project includes a BFF (Backend for Frontend) layer, file handling must be done in the BFF, requiring me to write corresponding server-side code.

The biggest challenge for the server is ensuring the uniqueness of each chunk. This uniqueness includes both storage uniqueness and transmission uniqueness:

Storage Uniqueness: Ensures that chunks are not stored redundantly, avoiding data duplication.
Transmission Uniqueness: Ensures that chunks are not uploaded redundantly, avoiding communication overhead.

To ensure that chunks are not stored redundantly, chunks and files must be decoupled. Chunks are stored independently and do not belong to any specific file, while files are independently recorded and point to their respective chunks in order.

This design means that even if two different files share the same chunk, the server avoids duplicate storage because chunks are independent entities.

When a user requests a file, I retrieve the chunk records for the corresponding file from the database and sequentially read the chunk data using file streams. The data is then directly streamed to the client via a pipeline.

This approach ensures extremely high efficiency for both merging and file access, while eliminating any storage redundancy on the server.

Answer1215

[Design Pattern] Upload big file - 4. Code Design - part 2 & Summary

How to Control Requests?

1. How to Maximize Bandwidth Utilization

2. How to Decouple from Upper-Layer Request Libraries

Key issue for Backend

How to isolate different file uploads?

How to ensure chunks are not duplicated?

What exactly does chunk merging do?