Tinkering Notes [16]: Running DeepSeek-R1 Inference in the Browser with transformers.js

Abstract

Running DeepSeek-R1 inference locally in the browser using transformers.js, tested to work on an Intel integrated GPU via WebGPU.

Keywords

transformers.js;deepseek;llm;webgpu;

Key information

  • "@huggingface/transformers": "^3.3.1"
  • Model: DeepSeek-R1-Distill-Qwen-1.5B-ONNX

Background

About transformers.js

[https://github.com/huggingface/transformers.js]

State-of-the-art Machine Learning for the Web

Run 🤗 Transformers directly in your browser, with no need for a server!

Transformers.js is designed to be functionally equivalent to Hugging Face's transformers python library, meaning you can run the same pretrained models using a very similar API. These models support common tasks in different modalities, such as:

📝 Natural Language Processing: text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.
🖼️ Computer Vision: image classification, object detection, segmentation, and depth estimation.
🗣️ Audio: automatic speech recognition, audio classification, and text-to-speech.
🐙 Multimodal: embeddings, zero-shot audio classification, zero-shot image classification, and zero-shot object detection.
Transformers.js uses ONNX Runtime to run models in the browser. The best part about it is that you can easily convert your pretrained PyTorch, TensorFlow, or JAX models to ONNX using 🤗 Optimum.

For more information, check out the full documentation.
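
As a quick illustration of the API, here is the canonical pipeline() example from the transformers.js README; the sentiment-analysis task and its sample output are as documented there:

import { pipeline } from "@huggingface/transformers";

// Allocate a pipeline for sentiment analysis (downloads a default model on first use).
const classifier = await pipeline("sentiment-analysis");

// Run the pipeline on some text.
const output = await classifier("I love transformers!");
// [{ label: 'POSITIVE', score: 0.999... }]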

About ONNX and other model formats

[https://modelscope.cn/models/onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX/summary]
[https://vickiboykis.com/2024/02/28/gguf-the-long-way-around/]
[https://blog.devops.dev/understanding-hugging-face-model-file-formats-ggml-and-gguf-914b0ebd1131?gi=f0bc8c27acd4]
[https://github.com/premAI-io/state-of-open-source-ai/blob/main/model-formats.md]
[https://hackmd.io/@haixuantao/ryS-LAR_a?utm_source=preview-mode&utm_medium=rec]
[https://medium.com/@vimalkansal/understanding-the-gguf-format-a-comprehensive-guide-67de48848256]
Converting a model to ONNX with Transformers.js-compatible weights is a good approach, especially as WebML (machine learning on the web) gains traction. For now, the recommendation is to keep the ONNX weights in a separate subfolder (e.g. onnx), which makes them easy to manage and consume; as WebML matures, this separate layout may change.

If you want to convert a model to ONNX yourself, you can use the 🤗 Optimum tooling and organize your repository with the structure described above, which makes the model easier to deploy and use on the web.

ONNX (Open Neural Network Exchange) is an open-source AI model format that enables interoperability between frameworks by defining an extensible computation-graph model, built-in operators, and standard data types. It is widely supported by frameworks, tools, and hardware, which makes moving models between environments much easier.

Core features of ONNX:

  1. Model interoperability: ONNX bridges AI frameworks, letting models move between them without complex, ad-hoc conversion steps.
  2. Computation-graph model: at its core, ONNX represents a model as a directed graph whose nodes are operations, which provides a lot of flexibility.
  3. Standardized data types: ONNX defines standard data types so that model exchange stays consistent and data-type issues are reduced.
  4. Built-in operators: ONNX ships a rich library of built-in operators for common AI tasks, ensuring consistent computation across frameworks.

The ONNX ecosystem:

  • ONNX Runtime: a cross-platform, high-performance inference engine for ONNX models.
  • ONNX ML Tools: conversion tooling for ONNX models, compatible with frameworks such as TensorFlow and PyTorch.
  • ONNX Models: a repository of pre-trained models already converted to ONNX format.

Typical workflow with ONNX (a browser-side sketch follows this list):

  1. Model conversion: first convert the model to ONNX format; for example, a PyTorch model can be exported with torch.onnx.export.
  2. Inference: an ONNX model can run on different platforms depending on which runtime backends (Execution Providers) are available — currently CPU, GPU, IoT/edge devices, and more.
  3. Quantization: ONNX Runtime provides tools to quantize some ONNX models in order to speed up inference.
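
To make steps 1–3 concrete in a browser context, here is a minimal sketch using onnxruntime-web (the same runtime transformers.js builds on). The model path, input name, and tensor shape are placeholders that depend on the exported model:

import * as ort from "onnxruntime-web";

// Create a session for an exported ONNX model, preferring WebGPU and
// falling back to the WASM (CPU) execution provider.
const session = await ort.InferenceSession.create("model.onnx", {
  executionProviders: ["webgpu", "wasm"],
});

// Build an input tensor; the name "input" and the shape are hypothetical
// and must match the exported model's signature.
const feeds = {
  input: new ort.Tensor("float32", new Float32Array(1 * 3 * 224 * 224), [1, 3, 224, 224]),
};

// Run inference and inspect the named outputs.
const results = await session.run(feeds);
console.log(Object.keys(results));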

Limitations of ONNX:

  • Performance: some users have found inference to be slower after converting a model to ONNX, which suggests the conversion is not a good fit for every model.
  • Protobuf: ONNX stores and reads models with Protobuf, which can impose some limitations.

ONNX vs. GGUF:

  • Purpose: ONNX is a general-purpose AI model format, whereas GGUF is specialized for quantized large language models.
  • Compatibility: ONNX supports a wider range of AI architectures, while GGUF excels at running highly quantized models efficiently on consumer hardware.

ONNX is a powerful tool, particularly for developers who need to move models between frameworks. However, it is not always the right choice for every model, especially where performance tuning matters; for workloads that need to run quantized models efficiently, GGUF may be the better fit.

Whether you are a beginner or an experienced developer, understanding these formats will help you pick the tools that suit your workflow.

Original sources (excerpts)

DeepSeek-R1-Distill-Qwen-1.5B [https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B] with ONNX weights to be compatible with Transformers.js.
Note: Having a separate repo for ONNX weights is intended to be a temporary solution until WebML gains more traction. If you would like to make your models web-ready, we recommend converting to ONNX using 🤗 Optimum and structuring your repo like this one (with ONNX weights located in a subfolder named onnx).

ONNX (Open Neural Network Exchange) provides an open source format for AI models by defining an extensible computation graph model, as well as definitions of built-in operators and standard data types. It is widely supported and can be found in many frameworks, tools, and hardware enabling interoperability between different frameworks. ONNX is an intermediary representation of your model that lets you easily go from one environment to the next.

Model Interoperability: ONNX bridges AI frameworks, allowing seamless model transfer between them, eliminating the need for complex conversions.
Computation Graph Model: ONNX's core is a graph model, representing AI models as directed graphs with nodes for operations, offering flexibility.
Standardised Data Types: ONNX establishes standard data types, ensuring consistency when exchanging models, reducing data type issues.
Built-in Operators: ONNX boasts a rich library of operators for common AI tasks, enabling consistent computation across frameworks.
ONNX Ecosystem:
https://github.com/microsoft/onnxruntime A high-performance inference engine for cross-platform ONNX models.
https://github.com/onnx/onnxmltools Tools for ONNX model conversion and compatibility with frameworks like TensorFlow and PyTorch.
https://github.com/onnx/models A repository of pre-trained models converted to ONNX format for various tasks.
Hub: helps with sharing and collaborating on ONNX models within the community.

Tooling around ONNX is fairly mature and well supported by the community. Let's see how we can export a model directly to ONNX and make use of it.

First, the model needs to be converted to ONNX format using a suitable converter. For example, if the model was created with PyTorch, the conversion can be done with:

  • torch.onnx.export (the same exporter also supports custom operators)
  • optimum by Hugging Face

Many frameworks/tools are supported, with many examples/tutorials at https://github.com/onnx/tutorials#converting-to-onnx-format.

It supports inference runtime binding APIs in a few programming languages (Python, Rust, JavaScript, Java, C#).

An ONNX model's inference depends on which runtime backend, called an Execution Provider, the platform supports. There are currently several, ranging from CPU-based and GPU-based to IoT/edge devices and a few others. A full list can be found here.

ONNX Runtime ships a few example tools that can be used to quantize selected ONNX models. Support is currently based on the operators in the model. Read more here.

There are also a few visualization tools, such as Netron (https://github.com/lutzroeder/Netron), for models converted to ONNX format; they are highly recommended for debugging purposes.

ONNX uses opset (operator set) numbers, which change with each minor/major ONNX release; new opsets usually introduce new operators. The proper opset needs to be used when creating the ONNX model graph.

Also it currently doesn't support 4-bit quantisation (microsoft/onnxruntime#14997).

There are lots of open issues (microsoft/onnxruntime#12880, #10303, #7233, #17116) where users report slower inference after converting their models to ONNX format compared to the original format, which shows that conversion might not be straightforward for all models. A user made similar comments three years ago here; although it's old, a few points still seem relevant. The troubleshooting guide from the ONNX Runtime community can help with commonly faced issues.

The use of Protobuf for storing and reading ONNX models also seems to cause a few limitations, which are discussed here.

Purpose: ONNX is more of a general-purpose AI model format, while GGUF is specialized for quantized large language models. Compatibility: ONNX supports a wider range of AI architectures, but GGUF excels in running highly quantized models efficiently on consumer hardware.

If you’re like me, navigating the world of machine learning can sometimes feel like learning a new language — there are tons of acronyms, file formats, and tools that can get overwhelming pretty quickly! When I first started using large language models (like GPT) from Hugging Face, I was confused by all the different files: model.bin, config.json, .onnx—what do they all do? And how do GGML and GGUF fit into the picture?
In this blog, I’m going to break everything down so that you don’t have to struggle like I did. We’ll take a look at the different file formats Hugging Face uses, talk about newer formats like GGML and GGUF, and figure out their pros and cons. Whether you’re just starting out or you’ve been working with these models for a while, this guide should help you get a better handle on the basics.

Hugging Face Model File Formats: What Are They?
    Hugging Face provides pretrained models in multiple file formats that help developers easily load, fine-tune, and deploy models. Understanding these files is key to using Hugging Face models effectively.
    Key Hugging Face File Formats:
  1. model.bin:
    This is the primary file that contains the model’s weights. These weights are learned parameters of the model that help it make predictions or perform tasks.
    What are weights? Weights are the numerical values that get adjusted during training, enabling the model to learn.
  2. config.json:
    The config.json file contains the model’s architecture details such as the number of layers, hidden units, attention heads, etc. This file is crucial for correctly loading the model’s structure.
    What does it do? It tells the framework (like PyTorch or TensorFlow) how to assemble the model before loading the weights from model.bin.
  3. tokenizer.json:
    This file contains the tokenizer information, which is responsible for breaking down input text into smaller pieces (tokens) that the model can understand.
    Why is this important? Without proper tokenization, the model wouldn’t be able to process textual input correctly.
  4. vocab.txt:
    For some models, especially those based on transformers like BERT, vocab.txt holds the vocabulary list of all possible tokens the model can recognize.
    What’s inside? A list of words or subwords that the model can break text into, helping it convert input text into a form the model can work with.
  5. merges.txt:
    This file is used in Byte-Pair Encoding (BPE) tokenization, which merges smaller units of text into larger ones. It’s common in models like GPT-2.
    Why does it matter? It helps the tokenizer decide how to efficiently break down and merge text tokens.
    Advanced Hugging Face File Formats:
    In addition to the basic files, Hugging Face supports more advanced formats for optimized performance, model sharing, and deployment.
  6. model.onnx:
    ONNX stands for Open Neural Network Exchange, a format designed to be interoperable between different frameworks like PyTorch and TensorFlow.
    Why use ONNX? It allows models trained in one framework (e.g., PyTorch) to be used in another (e.g., TensorFlow), enabling flexibility in model deployment.
    Pros: Lightweight, cross-platform compatibility, and faster inference on CPUs and GPUs.
    Cons: Converting models into ONNX may sometimes lead to performance differences or incompatibilities.
  7. model.safetensors:
    Safetensors is a new, efficient file format designed to store model weights securely and in a smaller size compared to traditional .bin files.
    Why use Safetensors? It’s faster to load, reduces the risk of malicious code (since it doesn’t store code, only data), and is easier to share across systems.
    Pros: Secure, lightweight, and faster loading compared to .bin.
    Cons: Less widely adopted so far, though growing in popularity.
  8. generation_config.json:
    This file holds the generation parameters (like temperature, top-k sampling, max tokens) that control how the model generates text.
    Why is it useful? It allows you to customize how the model generates text, affecting creativity, diversity, or predictability of the generated responses.
  9. GGML: GPT-Generated Model Language
    GGML is a format developed to simplify the use of large language models like GPT, especially for running them on CPUs. It bundles everything into one file for easy sharing and loading.
    Key Features of GGML:
    Single File Format: GGML consolidates the model and configuration into a single file, reducing complexity for sharing.
    CPU-Compatible: GGML is designed to run efficiently on CPUs, making it accessible for those without high-end GPUs.
    Pros of GGML:
    Convenience: No need to manage multiple files like in Hugging Face formats.
    CPU-Friendly: You can run models on standard hardware without GPUs.
    Cons of GGML:
    Limited Metadata: GGML lacks support for storing extra information like model version or configuration, making it less flexible.
    Compatibility Issues: GGML struggled with introducing new features, requiring manual adjustments for older models.
  10. GGUF: GPT-Generated Unified Format
    GGUF is the evolution of GGML, solving many of its limitations. Introduced in 2023, GGUF adds more functionality, better metadata support, and future-proofing for large language models.
    Key Features of GGUF:
    Unified Format: GGUF retains the single-file approach of GGML but introduces more flexibility.
    Metadata Support: Unlike GGML, GGUF allows for storing extra information like model version, architecture, and configuration.
    Backward Compatibility: GGUF is compatible with older GGML models but can handle newer features with ease.
    Pros of GGUF:
    Flexibility: Supports new features and stores metadata.
    Backward Compatibility: Works with older GGML models without breaking them.
    Easier to Use: Less need for manual parameter adjustments, improving the user experience.
    Cons of GGUF:
    Transition Period: Moving from GGML to GGUF may take time for users who already have GGML models.
    Learning Curve: Though easier to use, GGUF introduces new concepts that users need to learn.
    Conclusion
    Hugging Face, GGML, and GGUF are all powerful formats with different use cases depending on your needs. Here’s a quick takeaway:
    Hugging Face models offer flexibility with separate files for weights, configuration, and tokenization, making them ideal for customization and compatibility across platforms like PyTorch and TensorFlow.
    GGML provided a simple single-file solution but lacked flexibility, especially as newer features were introduced.
    GGUF is the latest evolution, offering the best of both worlds with a unified file, backward compatibility, and added metadata for future-proofing.
    Whether you’re a beginner trying to load your first model or an advanced user looking for optimization, understanding these formats will help you choose the right tool for your workflow.

Comparison: ONNX vs. GGUF

  • Purpose: ONNX is a general-purpose AI model format; GGUF is specialized for quantized large language models.
  • Compatibility: ONNX supports a wider range of AI architectures; GGUF excels at running highly quantized models efficiently on consumer hardware.

About WebGPU

[https://huggingface.co/docs/transformers.js/guides/webgpu]

WebGPU is a new web standard for accelerated graphics and compute. The API enables web developers to use the underlying system’s GPU to carry out high-performance computations directly in the browser. WebGPU is the successor to WebGL and provides significantly better performance, because it allows for more direct interaction with modern GPUs. Lastly, it supports general-purpose GPU computations, which makes it just perfect for machine learning!
As of October 2024, global WebGPU support is around 70% (according to caniuse.com), meaning some users may not be able to use the API.
If the following demos do not work in your browser, you may need to enable it using a feature flag:
Firefox: with the dom.webgpu.enabled flag (see here).
Safari: with the WebGPU feature flag (see here).
Older Chromium browsers (on Windows, macOS, Linux): with the enable-unsafe-webgpu flag (see here).
Due to the experimental nature of WebGPU, especially in non-Chromium browsers, you may experience issues when trying to run a model (even if it can run in WASM).
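
Before trying to load a model on WebGPU, it is worth probing for support up front. A small sketch of such a check (essentially what the worker code later in this post does), with an assumed fallback to the WASM backend:

// Pick a transformers.js device based on what the browser supports (sketch).
let device = "wasm"; // safe fallback
if (navigator.gpu) {
  const adapter = await navigator.gpu.requestAdapter();
  if (adapter) {
    device = "webgpu";
    // fp16 support matters if you want to request an fp16/q4f16 dtype.
    console.log("shader-f16 supported:", adapter.features.has("shader-f16"));
  }
}
console.log("Using device:", device);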

About quantized and distilled LLM variants

[https://ai.stackexchange.com/questions/43054/when-to-use-pruning-quantization-distillation-and-others-when-optimizing-spee]
[https://medium.com/aimonks/what-is-quantization-and-distillation-of-models-a67e3a2dc325]
[https://medium.com/@aadityaura_26777/quantization-vs-distillation-in-neural-networks-a-comparison-8ef522e4fbec]

  • Quantization
  • Distillation


Quantization: Precision for Efficiency

Quantization is all about numeric precision. By reducing the bit-width of weights and activations in a model, one can shrink the model size, potentially increasing inference speed.
Neural networks have interconnected neurons, each with weights and biases that are tuned during training. These parameter values, along with neuron activations, are typically stored in 32-bit floats, which provide precision but take up a lot of memory. For example, a 50-layer ResNet requires 168MB to store 26 million 32-bit weight values and 16 million 32-bit activation values.
Quantization aims to reduce this memory footprint by using lower bit-widths, such as 8-bit integers, to represent both weights and activations. This introduces quantization error but allows storing 4x as many values in the same amount of memory. The goal is to balance this tradeoff between precision and memory usage. Advanced techniques like per-channel quantization, stochastic rounding, and re-training can minimize the impact on model accuracy.
The two most common quantization cases are float32 -> float16 and float32 -> int8.
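
To make the savings concrete for the 1.5B-parameter model used later in this post, here is a back-of-the-envelope calculation of weight storage at different precisions (weights only; activations, KV cache, and quantization metadata such as scales are ignored):

// Rough weight-storage estimate for a 1.5B-parameter model at various precisions.
const params = 1.5e9;
const gib = (bitsPerWeight) => (params * bitsPerWeight) / 8 / 1024 ** 3;
console.log("float32:", gib(32).toFixed(2), "GiB"); // ~5.59 GiB
console.log("float16:", gib(16).toFixed(2), "GiB"); // ~2.79 GiB
console.log("int8   :", gib(8).toFixed(2), "GiB");  // ~1.40 GiB
console.log("q4     :", gib(4).toFixed(2), "GiB");  // ~0.70 GiB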

Distillation: From Teacher to Student

Distillation involves training a smaller neural network, called the student, to mimic a larger pre-trained network, the teacher.

In Practice

Quantization often finds its place in hardware-specific deployments, while distillation is sought when one desires a lightweight model with performance close to a larger counterpart. In many scenarios, a combination of both — distilling a model and then quantizing it — can bring forth the benefits of both worlds. It’s essential to align the choice with the deployment needs, available resources, and acceptable trade-offs in terms of accuracy and efficiency.

Resources

A Survey of Quantization Methods for Efficient Neural Network Inference [ https://arxiv.org/pdf/2103.13630.pdf]
Knowledge Distillation: A Survey [https://arxiv.org/pdf/2006.05525.pdf]

About Bun

[https://bun.sh]
Develop, test, run, and bundle JavaScript & TypeScript projects—all with Bun. Bun is an all-in-one JavaScript runtime & toolkit designed for speed, complete with a bundler, test runner, and Node.js-compatible package manager. Bun aims for 100% Node.js compatibility.

Bun's core idea is to simplify the development workflow while improving performance. It is not just a runtime: it also ships a set of tools that help developers build and deploy applications more efficiently, for small projects and large applications alike.

About CORS (cross-origin requests over HTTP/HTTPS) and security

[https://htmlspecs.com/fetch/]
[https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS]
[https://developer.mozilla.org/zh-CN/docs/Web/HTTP/CSP]
[https://developer.mozilla.org/zh-CN/docs/Web/HTTP/Permissions_Policy]
[https://developer.mozilla.org/en-US/observatory]
[https://developer.mozilla.org/zh-CN/docs/Web/Security/Practical_implementation_guides]
[https://developer.mozilla.org/en-US/docs/Web/HTTP/Cross-Origin_Resource_Policy]
[https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers]

  • Cross-Origin Resource Sharing (CORS)

Overview of CORS (Cross-Origin Resource Sharing)

CORS is an HTTP-header-based mechanism that lets a server indicate which origins (domain, scheme, or port) other than its own a browser should permit to load its resources.
Before sending certain cross-origin requests, the browser first sends a "preflight" request to the target server to check whether the actual request will be allowed. The preflight carries headers describing the HTTP method and headers of the actual request.
For security reasons, browsers restrict cross-origin HTTP requests initiated from scripts: fetch() and XMLHttpRequest follow the same-origin policy, and resources can only be requested from another origin if that origin's response includes the correct CORS headers.

Request types that use CORS

Calls to fetch() or XMLHttpRequest: for requesting data across origins.
Web fonts: cross-origin fonts used via @font-face in CSS.
WebGL textures.
Images/video frames drawn onto a canvas via drawImage().
Images used in CSS Shapes.

How CORS works

The server describes which origins are allowed to read its responses from a browser by adding new HTTP headers.
For HTTP request methods that can have side effects on server data (methods other than GET, or POST with certain MIME types), the browser first sends an HTTP OPTIONS request (the preflight) to ask which methods the server supports, and only sends the actual request after the server has "approved" it.
The server can also tell the client whether "credentials" (such as cookies and HTTP authentication) should be sent with the request.

Handling CORS failures

A CORS failure results in an error, but for security reasons the specifics of the error are not exposed to JavaScript code; the details are only visible in the browser console.

CORS access-control scenarios

Simple requests: requests that meet certain conditions do not trigger a CORS preflight. The conditions include: the method is GET, HEAD, or POST; any manually set headers are limited to the CORS-safelisted headers (Accept, Accept-Language, and so on); the Content-Type is limited to application/x-www-form-urlencoded, multipart/form-data, or text/plain; no event listeners are registered on XMLHttpRequest.upload; and no ReadableStream object is used.
Example: suppose a page at https://foo.example wants to fetch JSON from https://bar.other using fetch(). The browser sends the request with an Origin header, and the server returns a response with an Access-Control-Allow-Origin header indicating that cross-origin access is allowed. If the server wants to restrict access to https://foo.example only, it must respond with Access-Control-Allow-Origin: https://foo.example.

Notes on HTTP headers

HTTP headers carry additional information in requests and responses. In HTTP/1.x a header consists of a case-insensitive name, a colon, optional whitespace (which is ignored), and a value. In HTTP/2 and later, headers are displayed in lowercase in developer tools, and a special group of pseudo-headers is prefixed with a colon.
Custom proprietary headers historically used an X- prefix, but this convention was deprecated in 2012 because of the problems it caused when non-standard fields became standard.

Cross-Origin Resource Sharing (CORS) is an HTTP-header based mechanism that allows a server to indicate any origins (domain, scheme, or port) other than its own from which a browser should permit loading resources. CORS also relies on a mechanism by which browsers make a "preflight" request to the server hosting the cross-origin resource, in order to check that the server will permit the actual request. In that preflight, the browser sends headers that indicate the HTTP method and headers that will be used in the actual request.

An example of a cross-origin request: the front-end JavaScript code served from https://domain-a.com uses fetch() to make a request for https://domain-b.com/data.json.

For security reasons, browsers restrict cross-origin HTTP requests initiated from scripts. For example, fetch() and XMLHttpRequest follow the same-origin policy. This means that a web application using those APIs can only request resources from the same origin the application was loaded from unless the response from other origins includes the right CORS headers.

The CORS mechanism supports secure cross-origin requests and data transfers between browsers and servers. Browsers use CORS in APIs such as fetch() or XMLHttpRequest to mitigate the risks of cross-origin HTTP requests.

What requests use CORS?
This cross-origin sharing standard can enable cross-origin HTTP requests for:

Invocations of fetch() or XMLHttpRequest, as discussed above.
Web Fonts (for cross-domain font usage in @font-face within CSS), so that servers can deploy TrueType fonts that can only be loaded cross-origin and used by websites that are permitted to do so.
WebGL textures.
Images/video frames drawn to a canvas using drawImage().
CSS Shapes from images.
This is a general article about Cross-Origin Resource Sharing and includes a discussion of the necessary HTTP headers.

Functional overview
The Cross-Origin Resource Sharing standard works by adding new HTTP headers that let servers describe which origins are permitted to read that information from a web browser. Additionally, for HTTP request methods that can cause side-effects on server data (in particular, HTTP methods other than GET, or POST with certain MIME types), the specification mandates that browsers "preflight" the request, soliciting supported methods from the server with the HTTP OPTIONS request method, and then, upon "approval" from the server, sending the actual request. Servers can also inform clients whether "credentials" (such as Cookies and HTTP Authentication) should be sent with requests.

CORS failures result in errors but for security reasons, specifics about the error are not available to JavaScript. All the code knows is that an error occurred. The only way to determine what specifically went wrong is to look at the browser's console for details.

Subsequent sections discuss scenarios, as well as provide a breakdown of the HTTP headers used.

Examples of access control scenarios
We present three scenarios that demonstrate how Cross-Origin Resource Sharing works. All these examples use fetch(), which can make cross-origin requests in any supporting browser.

Simple requests
Some requests don't trigger a CORS preflight. Those are called simple requests from the obsolete CORS spec, though the Fetch spec (which now defines CORS) doesn't use that term.

The motivation is that the <form> element from HTML 4.0 (which predates cross-site fetch() and XMLHttpRequest) can submit simple requests to any origin, so anyone writing a server must already be protecting against cross-site request forgery (CSRF). Under this assumption, the server doesn't have to opt-in (by responding to a preflight request) to receive any request that looks like a form submission, since the threat of CSRF is no worse than that of form submission. However, the server still must opt-in using Access-Control-Allow-Origin to share the response with the script.

A simple request is one that meets all the following conditions:

One of the allowed methods:
GET
HEAD
POST
Apart from the headers automatically set by the user agent (for example, Connection, User-Agent, or the other headers defined in the Fetch spec as a forbidden header name), the only headers which are allowed to be manually set are those which the Fetch spec defines as a CORS-safelisted request-header, which are:
Accept
Accept-Language
Content-Language
Content-Type (please note the additional requirements below)
Range (only with a simple range header value; e.g., bytes=256- or bytes=127-255)
The only type/subtype combinations allowed for the media type specified in the Content-Type header are:
application/x-www-form-urlencoded
multipart/form-data
text/plain
If the request is made using an XMLHttpRequest object, no event listeners are registered on the object returned by the XMLHttpRequest.upload property used in the request; that is, given an XMLHttpRequest instance xhr, no code has called xhr.upload.addEventListener() to add an event listener to monitor the upload.
No ReadableStream object is used in the request.
Note: WebKit Nightly and Safari Technology Preview place additional restrictions on the values allowed in the Accept, Accept-Language, and Content-Language headers. If any of those headers have "nonstandard" values, WebKit/Safari does not consider the request to be a "simple request". What values WebKit/Safari consider "nonstandard" is not documented, except in the following WebKit bugs:
Require preflight for non-standard CORS-safelisted request headers Accept, Accept-Language, and Content-Language
Allow commas in Accept, Accept-Language, and Content-Language request headers for simple CORS
Switch to a blacklist model for restricted Accept headers in simple CORS requests
No other browsers implement these extra restrictions because they're not part of the spec.
For example, suppose web content at https://foo.example wishes to fetch JSON content from domain https://bar.other. Code of this sort might be used in JavaScript deployed on foo.example:

const fetchPromise = fetch("https://bar.other");

fetchPromise
  .then((response) => response.json())
  .then((data) => {
    console.log(data);
  });

This operation performs a simple exchange between the client and the server, using CORS headers to handle the privileges.
Let's look at what the browser will send to the server in this case:

GET /resources/public-data/ HTTP/1.1
Host: bar.other
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:71.0) Gecko/20100101 Firefox/71.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Connection: keep-alive
Origin: https://foo.example

The request header of note is Origin, which shows that the invocation is coming from https://foo.example.

Now let's see how the server responds:

HTTP/1.1 200 OK
Date: Mon, 01 Dec 2008 00:23:53 GMT
Server: Apache/2
Access-Control-Allow-Origin: *
Keep-Alive: timeout=2, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: application/xml

[…XML Data…]

In response, the server returns an Access-Control-Allow-Origin header with Access-Control-Allow-Origin: *, which means that the resource can be accessed by any origin.

Access-Control-Allow-Origin: *

This pattern of the Origin and Access-Control-Allow-Origin headers is the simplest use of the access control protocol. If the resource owners at https://bar.other wished to restrict access to the resource to requests only from https://foo.example (i.e., no domain other than https://foo.example can access the resource in a cross-origin manner), they would send:

Access-Control-Allow-Origin: https://foo.example

Note: When responding to a credentialed request, the server must specify an origin in the value of the Access-Control-Allow-Origin header, instead of specifying the * wildcard.

The server must not specify the * wildcard for the Access-Control-Allow-Origin response-header value, but must instead specify an explicit origin; for example: Access-Control-Allow-Origin: https://example.com
The server must not specify the * wildcard for the Access-Control-Allow-Headers response-header value, but must instead specify an explicit list of header names; for example, Access-Control-Allow-Headers: X-PINGOTHER, Content-Type
The server must not specify the * wildcard for the Access-Control-Allow-Methods response-header value, but must instead specify an explicit list of method names; for example, Access-Control-Allow-Methods: POST, GET
The server must not specify the * wildcard for the Access-Control-Expose-Headers response-header value, but must instead specify an explicit list of header names; for example, Access-Control-Expose-Headers: Content-Encoding, Kuma-Revision

HTTP headers let the client and the server pass additional information with a message in a request or response. In HTTP/1.X, a header is a case-insensitive name followed by a colon, then optional whitespace which will be ignored, and finally by its value (for example: Allow: POST). In HTTP/2 and above, headers are displayed in lowercase when viewed in developer tools (accept: */*), and prefixed with a colon for a special group of pseudo-headers (:status: 200). You can find more information on the syntax in each protocol version in the HTTP messages page.

Custom proprietary headers have historically been used with an X- prefix, but this convention was deprecated in 2012 because of the inconveniences it caused when nonstandard fields became standard in RFC 6648; others are listed in the IANA HTTP Field Name Registry, whose original content was defined in RFC 4229. The IANA registry lists headers, including information about their status.
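
CORS matters for this project because the model, tokenizer, and wasm files are fetched by the browser; if you serve them from a different origin or port than the page (for example the http://localhost:8008 alternative shown in worker.js below), the responses need an Access-Control-Allow-Origin header. Here is a minimal sketch of a static file server with permissive CORS using Bun.serve, intended for local development only; the file layout and port are assumptions:

// serve-models.js — minimal static file server with permissive CORS (local development only).
const server = Bun.serve({
  port: 8008,
  async fetch(req) {
    const path = new URL(req.url).pathname;
    const file = Bun.file("." + path); // e.g. ./models/deepseek/.../onnx/model_q4f16.onnx
    if (!(await file.exists())) {
      return new Response("Not found", { status: 404 });
    }
    return new Response(file, {
      headers: {
        // Fine for localhost experiments; do not use a wildcard in production.
        "Access-Control-Allow-Origin": "*",
      },
    });
  },
});
console.log(`Serving model files on http://localhost:${server.port}`);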

Further reading

For more details, see the reference links above.

The Fetch Standard defines requests, responses, and the process that binds them: fetching.

Implementation

[https://dev.to/emojiiii/running-deepseek-r1-in-the-browser-a-comprehensive-guide-3j63]
[https://kitt.tools/ai/chat]
[https://www.bilibili.com/video/av113903019233913/]
[https://medium.com/the-web-tub/unlock-ai-power-in-your-hybrid-mobile-app-local-embedding-of-huggingface-model-with-transformers-js-9805a400c924]
[https://github.com/yong-asial/local-model-huggingface]

  1. Install the bun tool
curl -fsSL https://bun.sh/install | bash

or (on Windows):

powershell -c "irm bun.sh/install.ps1 | iex"
  2. Download transformers.js locally
    [https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.2/dist/]
    [https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.2/dist/ort-wasm-simd-threaded.jsep.wasm]
    [https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.2/dist/transformers.min.js]
    [https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.2/dist/transformers.js]
    [https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.2/dist/transformers.js.map]
    [https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.2/dist/ort-wasm-simd-threaded.jsep.mjs]
    [https://huggingface.co/docs/transformers.js/api/env]
bun install @huggingface/transformers
  3. Download the transformers.js DeepSeek example
    [https://github.com/huggingface/transformers.js-examples/tree/main/deepseek-r1-webgpu]
git clone https://github.com/huggingface/transformers.js-examples.git
cd transformers.js-examples/deepseek-r1-webgpu
  4. Download the ONNX model
    [https://modelscope.cn/models/onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX/resolve/master/onnx/model_q4f16.onnx]
    [https://modelscope.cn/models/onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX/resolve/master/config.json]
    [https://modelscope.cn/models/onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX/resolve/master/configuration.json]
    [https://modelscope.cn/models/onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX/resolve/master/generation_config.json]
    [https://modelscope.cn/models/onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX/resolve/master/README.md]
    [https://modelscope.cn/models/onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX/resolve/master/special_tokens_map.json]
    [https://modelscope.cn/models/onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX/resolve/master/tokenizer.json]
    [https://modelscope.cn/models/onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX/resolve/master/tokenizer_config.json]
    Directory structure:
models
├── deepseek
│   └── onnx-community
│       └── DeepSeek-R1-Distill-Qwen-1.5B-ONNX
│           ├── README.md
│           ├── config.json
│           ├── configuration.json
│           ├── generation_config.json
│           ├── onnx
│           │   └── model_q4f16.onnx
│           ├── special_tokens_map.json
│           ├── tokenizer.json
│           └── tokenizer_config.json

Be sure to follow this directory structure exactly; otherwise transformers.js will not be able to find the files.
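
The structure matters because transformers.js joins env.localModelPath with the model id and the per-file paths once remote loading is disabled. Roughly (an assumption based on the env docs linked above; worth verifying in the browser's network tab), the requests resolve like this:

import { env } from "@huggingface/transformers";

env.allowRemoteModels = false;
env.localModelPath = "../models/deepseek/";

// With model_id = "onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX" and dtype "q4f16",
// the files fetched should resolve to paths such as:
//   ../models/deepseek/onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX/config.json
//   ../models/deepseek/onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX/tokenizer.json
//   ../models/deepseek/onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX/tokenizer_config.json
//   ../models/deepseek/onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX/generation_config.json
//   ../models/deepseek/onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX/onnx/model_q4f16.onnx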

  5. Modify the example to load the model locally instead of fetching it online
    [https://huggingface.co/docs/transformers.js/api/env]
    [https://github.com/huggingface/transformers.js/issues/730]
    [https://github.com/huggingface/transformers.js/issues/310]
    Directory structure:
>ls
deepseek-r1-webgpu:
README.md		eslint.config.js	package-lock.json	src
bun.lockb		index.html		package.json		vite.config.js
dist			node_modules		public

>tree
src
├── App.jsx
├── components
│   ├── Chat.css
│   ├── Chat.jsx
│   ├── Progress.jsx
│   └── icons
│       ├── ArrowRightIcon.jsx
│       ├── BotIcon.jsx
│       ├── BrainIcon.jsx
│       ├── StopIcon.jsx
│       └── UserIcon.jsx
├── index.css
├── main.jsx
└── worker.js

The modified worker.js:

import {
  AutoTokenizer,
  AutoModelForCausalLM,
  TextStreamer,
  InterruptableStoppingCriteria,
} from "@huggingface/transformers";

/**
 * Helper function to check for WebGPU support
 */
// let fp16_supported = false;
async function check() {
  try {
    const adapter = await navigator.gpu.requestAdapter();
    if (!adapter) {
      throw new Error("WebGPU is not supported (no adapter found)");
    }
    // fp16_supported = adapter.features.has("shader-f16")
  } catch (e) {
    console.log("WebGPU检测失败:", e.toString()); // 调试信息
    self.postMessage({
      status: "error",
      data: e.toString(),
    });
  }
}

/**
 * Lazily-loaded text-generation pipeline implemented as a singleton
 * Changed to load the model from local files
 */

import { env } from "@huggingface/transformers";
import process from "process";

// Disable loading models from the network; only use local files
env.allowRemoteModels = false;
env.allowLocalModels = true;

env.localModelPath = "../models/deepseek/";
// env.localModelPath = process.env.PUBLIC_URL + "/models/deepseek/";
// env.localModelPath = "http://localhost:8008" + "/models/deepseek/";
let model_path = env.localModelPath;
console.log("模型路径:", model_path); // 调试信息
// 使用本地wasm文件
env.backends.onnx.wasm.wasmPaths = "../static/transformers/";
class TextGenerationPipeline {
  static model_id = "onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX";

  static async getInstance(progress_callback = null) {
    this.tokenizer ??= AutoTokenizer.from_pretrained(this.model_id, {
      progress_callback,
    });

    this.model ??= AutoModelForCausalLM.from_pretrained(this.model_id, {
      dtype: "q4f16",
      device: "webgpu",
      // device: "cpu",
      progress_callback,
    });

    return Promise.all([this.tokenizer, this.model]);
  }
}

const stopping_criteria = new InterruptableStoppingCriteria();

let past_key_values_cache = null;
async function generate(messages) {
  // Retrieve the text-generation pipeline
  const [tokenizer, model] = await TextGenerationPipeline.getInstance();

  const inputs = tokenizer.apply_chat_template(messages, {
    add_generation_prompt: true,
    return_dict: true,
  });

  // 151648: <think>
  // 151649: </think>
  const [START_THINKING_TOKEN_ID, END_THINKING_TOKEN_ID] = tokenizer.encode(
    "<think></think>",
    { add_special_tokens: false }
  );

  let state = "thinking"; // 'thinking' 或 'answering'
  let startTime;
  let numTokens = 0;
  let tps;
  const token_callback_function = (tokens) => {
    startTime ??= performance.now();

    if (numTokens++ > 0) {
      tps = (numTokens / (performance.now() - startTime)) * 1000;
    }
    if (tokens[0] == END_THINKING_TOKEN_ID) {
      state = "answering";
    }
  };
  const callback_function = (output) => {
    console.log("生成进度更新:", output); // 调试信息
    self.postMessage({
      status: "update",
      output,
      tps,
      numTokens,
      state,
    });
  };

  const streamer = new TextStreamer(tokenizer, {
    skip_prompt: true,
    skip_special_tokens: true,
    callback_function,
    token_callback_function,
  });

  // Tell the main thread that generation has started
  self.postMessage({ status: "start" });

  const { past_key_values, sequences } = await model.generate({
    ...inputs,
    // TODO: add back once fixed
    // past_key_values: past_key_values_cache,

    // Sampling
    do_sample: false,
    // repetition_penalty: 1.1,
    // top_k: 3,
    // temperature: 0.2,

    max_new_tokens: 2048,
    streamer,
    stopping_criteria,
    return_dict_in_generate: true,
  });
  past_key_values_cache = past_key_values;

  const decoded = tokenizer.batch_decode(sequences, {
    skip_special_tokens: true,
  });

  // Send the output back to the main thread
  console.log("Generation complete:", decoded); // debug info
  self.postMessage({
    status: "complete",
    output: decoded,
  });
}

async function load() {
  self.postMessage({
    status: "loading",
    data: "加载模型中(需要一些时间热载)...",
  });

  // Load the pipeline and keep it for future use
  const [tokenizer, model] = await TextGenerationPipeline.getInstance((x) => {
    // Progress callback to track model loading
    console.log("Model loading progress:", x); // debug info
    self.postMessage(x);
  });

  self.postMessage({
    status: "loading",
    data: "编译着色器并预热模型...",
  });

  // Run the model with a dummy input to compile the shaders
  const inputs = tokenizer("a");
  await model.generate({ ...inputs, max_new_tokens: 1 });
  self.postMessage({ status: "ready" });
}
// Listen for messages from the main thread
self.addEventListener("message", async (e) => {
  const { type, data } = e.data;
  console.log("收到消息:", type, data); // 调试信息

  switch (type) {
    case "check":
      check();
      break;

    case "load":
      load();
      break;

    case "generate":
      stopping_criteria.reset();
      generate(data);
      break;

    case "interrupt":
      stopping_criteria.interrupt();
      break;

    case "reset":
      past_key_values_cache = null;
      stopping_criteria.reset();
      break;
  }
});
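
The example only needs this worker plus a main thread that talks to it. Below is a minimal sketch of the other side of the message protocol used above; in the real example this logic lives in src/App.jsx and src/components/Chat.jsx, and the prompt here is just an illustration:

// main.js — minimal main-thread sketch of the worker message protocol used above.
const worker = new Worker(new URL("./worker.js", import.meta.url), { type: "module" });

worker.addEventListener("message", (e) => {
  const msg = e.data;
  switch (msg.status) {
    case "ready":
      // Model is loaded and warmed up; send a chat request.
      worker.postMessage({
        type: "generate",
        data: [{ role: "user", content: "Why is the sky blue?" }],
      });
      break;
    case "update":
      // Streamed partial output plus tokens-per-second, if available.
      console.log(msg.output, msg.tps ? `(${msg.tps.toFixed(1)} tok/s)` : "");
      break;
    case "complete":
      console.log("Done:", msg.output);
      break;
    case "error":
      console.error(msg.data);
      break;
  }
});

worker.postMessage({ type: "check" });
worker.postMessage({ type: "load" });

To try it end to end, install dependencies and start the dev server from the example directory, e.g. bun install && bun run dev (assuming the example's default Vite scripts).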

Results

Screenshots in the original post: model loading, model warm-up, generation output, and integrated-GPU utilization.