What do the browser and the server do after you type a URL into the browser and press Enter?

As the title asks. The stock interview answer lists these steps:

DNS Resolution
Establishing a Connection
Sending an HTTP Request
Receiving the HTTP Response
Rendering the Web Page

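Today I'd like to be bold and insert a step 0.9 in front of them: URL Parsing. I call it step 0.9 because it is a small action, yet it cannot be skipped.

A URL (Uniform Resource Locator) is made up of a scheme, a domain (host, optionally with a port), a path, and optionally a query string and a fragment.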

URL Parsing does two things:

parse the URL: only once the domain has been split out can step 1, DNS resolution, happen
url_encode
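
The two actions above can be observed with the standard WHATWG `URL` class, available in browsers and Node.js. This is only a sketch of the idea, not a claim about how any particular browser implements its address bar; the sample URL is the Baidu search used later in this post.

```javascript
// Parse a URL and let the URL class percent-encode the query string.
const url = new URL("https://www.baidu.com/s?wd=博客园马甲哥");

console.log(url.protocol); // "https:"
console.log(url.hostname); // "www.baidu.com" (the part DNS resolution needs)
console.log(url.pathname); // "/s"
console.log(url.search);   // query string, percent-encoded as UTF-8 bytes
console.log(url.searchParams.get("wd")); // decoded value: "博客园马甲哥"
```

Parsing has to come first: the host name is only known once the URL has been split into its parts.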

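In this post I mainly want to talk about the browser's url_encode step and the corresponding URL normalization on the server side.

Browser url_encode

Type https://www.baidu.com/s?wd=博客园马甲哥 into the browser. Before pressing Enter, try copying the address bar and pasting it somewhere else. What you get is https://www.baidu.com/s?wd=%E5%8D%9A%E5%AE%A2%E5%9B%AD%E9%A9%AC%E7%94%B2%E5%93%A5, which is the result of the browser's automatic url_encode. The browser uses this encoded URL for the DNS lookup, the request, and everything that follows.

What the browser actually works with is the url_encoded form.

1. Why does url_encode exist?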

https://zhuanlan.zhihu.com/p/557035152?utm_id=0

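When URLs were first designed, the hope was that they could be transcribed by hand, for example written on a napkin and passed to someone else, so the characters that make up a URI had to be writable ASCII characters.
Chinese characters are not ASCII, so they must be encoded when they appear in a URL. Some writable ASCII characters are also unsafe and need encoding, such as the space (a space is easy to overlook and easy to introduce by accident).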

URL encoding replaces unsafe ASCII characters with a "%" followed by two hexadecimal digits.
URLs cannot contain spaces. URL encoding normally replaces a space with a plus (+) sign or with %20.

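The browser automatically url-encodes the request path and the query string, but it does not url-encode request header values; whether to encode those is up to the developer and the use case.

Encoding uses UTF-8 by default. What does that actually mean?

For example, take the Chinese characters “你好”:

UTF-8 byte stream, printed as signed bytes: -28 -67 -96 -27 -91 -67
The same bytes in hexadecimal: E4 BD A0 E5 A5 BD
After URL encoding: %E4%BD%A0%E5%A5%BD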
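
The three representations above can be reproduced with the standard `TextEncoder` API. This is a minimal sketch for illustration; the variable names are my own.

```javascript
// Reproduce the three views of "你好": signed bytes, hex bytes, percent-encoding.
const text = "你好";
const bytes = new TextEncoder().encode(text); // UTF-8 bytes as a Uint8Array

// Reinterpret each byte as a signed 8-bit value (how Java prints a byte).
const signed = Array.from(Int8Array.from(bytes));
// Two uppercase hex digits per byte.
const hex = Array.from(bytes, (b) => b.toString(16).toUpperCase().padStart(2, "0"));
// Percent-encoding is simply "%" in front of every hex byte.
const percent = hex.map((h) => "%" + h).join("");

console.log(signed.join(" ")); // -28 -67 -96 -27 -91 -67
console.log(hex.join(" "));    // E4 BD A0 E5 A5 BD
console.log(percent);          // %E4%BD%A0%E5%A5%BD
console.log(percent === encodeURIComponent(text)); // true
```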

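The server side has a matching url_decode function, and the number of encode passes must match the number of decode passes.

Note that URL encoding is not idempotent: encoding an already encoded string turns every % into %25. Decoding is idempotent only as long as the decoded text contains no %XX sequence of its own; a value such as %2541 decodes to %41 on the first pass and to A on the second.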
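
A quick check of the behavior with the built-in `encodeURIComponent` and `decodeURIComponent`. The sample values are my own, chosen for illustration.

```javascript
// Encoding twice is not the same as encoding once.
const once = encodeURIComponent("你好");
const twice = encodeURIComponent(once); // every "%" becomes "%25"

console.log(once);  // %E4%BD%A0%E5%A5%BD
console.log(twice); // %25E4%25BD%25A0%25E5%25A5%25BD

// Each decode pass undoes exactly one encode pass.
const decodedOnce = decodeURIComponent(twice);        // back to `once`
const decodedTwice = decodeURIComponent(decodedOnce); // back to "你好"

// Decoding text with no %XX sequences changes nothing...
const plainAgain = decodeURIComponent("你好");
// ...but text that still contains an escape keeps changing.
const step1 = decodeURIComponent("%2541"); // "%41"
const step2 = decodeURIComponent(step1);   // "A"
```

This is why a proxy that decodes once too often, or encodes once too often, hands the application a different value than the browser sent.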

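Every mainstream language ships urlencode/urldecode helpers, and they work on arbitrary strings, not only on complete URLs.

2. `encodeURI()` vs `encodeURIComponent()` in JS

encodeURI is a built-in global function in JS that url-encodes a complete URI. It does not encode the following characters, so that the characters that give a URL its structure keep their meaning: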
A–Z a–z 0–9 - _ . ! ~ * ' ( ) ; / ? : @ & = + $ , #

const uri = 'https://mozilla.org/?x=шеллы';
const encoded = encodeURI(uri);
console.log(encoded);
// Expected output: "https://mozilla.org/?x=%D1%88%D0%B5%D0%BB%D0%BB%D1%8B"

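encodeURIComponent is also a global function, but it is meant for a single URL component, such as one query parameter value. On top of what encodeURI encodes, it also encodes the reserved characters ; / ? : @ & = + $ , # while still leaving A–Z a–z 0–9 - _ . ! ~ * ' ( ) untouched. Use it when a value may contain characters that carry special meaning in a URL.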

// Encodes characters such as ?,=,/,&,:
console.log(`?x=${encodeURIComponent('test?')}`);
// Expected output: "?x=test%3F"
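
To make the difference concrete, here is the same string passed through both functions. The sample strings and the example.com URL are made up for illustration.

```javascript
// A value that contains characters with structural meaning in a URL.
const value = "a&b=c/d?e#f";

const asUri = encodeURI(value);                // reserved characters are kept
const asComponent = encodeURIComponent(value); // reserved characters are escaped

console.log(asUri);       // a&b=c/d?e#f
console.log(asComponent); // a%26b%3Dc%2Fd%3Fe%23f

// Typical use: encode each parameter value, not the whole URL.
const search = `https://example.com/search?q=${encodeURIComponent("C# & Go")}`;
console.log(search); // https://example.com/search?q=C%23%20%26%20Go
```

Running encodeURI on the last value instead would leave `#` and `&` in place, and the server would see a truncated `q` parameter.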

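3. Why did I start caring about this?

Web frameworks usually decode for us automatically, so we can ignore the issue when we handle HTTP requests directly.

But when you build a reverse proxy yourself with an HTTP client, you have to pay attention to it.

When the browser talks to the application directly, the request it sends is url_encoded.

After we put an OpenResty gateway in front, the gateway internally used nginx's built-in $uri variable. That is a normalized URI string, which did not match what the application expected to receive, so the application reported errors.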

$uri: (ngx.var.uri) current URI in request, normalized

The value of $uri may change during request processing, e.g. when doing internal redirects, or when using index files.
$request_uri: (ngx.var.request_uri) full original request URI (with arguments)
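
The difference between the two variables can be imitated in a few lines. This sketch only shows the effect of decoding a path before forwarding it; it is not nginx's actual normalization algorithm, and the request URI below is a made-up example.

```javascript
// Hypothetical raw request line target, as a browser would send it.
const rawRequestUri = "/files/a%2Fb?name=%E4%BD%A0%E5%A5%BD";

// Split off the query string; the raw path keeps its percent-escapes.
const [rawPath, rawQuery] = rawRequestUri.split("?");

// A decoded path is roughly what a normalized variable hands you.
const decodedPath = decodeURIComponent(rawPath);

console.log(rawPath);     // /files/a%2Fb
console.log(decodedPath); // /files/a/b

// The encoded slash was data; after decoding it looks like a path separator.
console.log(rawPath.split("/").length);     // 3 segments: "", "files", "a%2Fb"
console.log(decodedPath.split("/").length); // 4 segments: "", "files", "a", "b"
```

If a proxy forwards the decoded form, the upstream application sees a different path than the one the browser requested, which is the kind of mismatch described above.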

4. Server-side URI normalization

URI normalization

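URI normalization is the server-side counterpart of the browser's URL parsing; servers implement it following the normalization rules in RFC 3986.

Common URL normalization steps include:
① Percent-decoding: decode percent-encoded characters (such as %20) into the characters they stand for (such as a space).
② Case normalization: lowercase the scheme and the host name. For percent-escapes, RFC 3986 normalizes the hex digits to uppercase (%e4 becomes %E4).
③ Path cleanup: resolve the . and .. segments in the path.
④ Default port removal: if the port is the default one (for example 80 for HTTP), drop it.
⑤ Query parameter sorting: some implementations sort the query parameters, but this is optional and many implementations skip it.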
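
Most of these steps can be seen in action with the WHATWG `URL` class. Note the assumption: `URL` follows the WHATWG URL Standard rather than RFC 3986 itself, so this is an approximation of the steps listed above, not a reference implementation. The input URL is made up.

```javascript
// A deliberately messy input URL.
const input = "HTTP://Example.COM:80/a/./b/../c%20d?b=2&a=1";
const url = new URL(input);

// Steps 2-4 happen during parsing: lowercase scheme and host,
// resolve "." and "..", drop the default port.
console.log(url.protocol); // "http:"
console.log(url.host);     // "example.com"
console.log(url.pathname); // "/a/c%20d"

// Step 5 (optional): sort query parameters by name.
url.searchParams.sort();
console.log(url.href); // http://example.com/a/c%20d?a=1&b=2

// Step 1: percent-decoding the path is a separate, explicit call here.
const decodedPath = decodeURIComponent(url.pathname);
console.log(decodedPath); // "/a/c d"
```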

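URI normalization in Nginx

In Nginx, the $uri variable holds the normalized URI without the query string; the query string is available separately through the $args variable.

Every web framework I have seen so far implements URL normalization automatically and has already done the url_decode, so we do not need to decode again by hand.

x-www-form-urlencoded encoding

Closely related to the browser's url_encode behavior is the x-www-form-urlencoded encoding, which is also the default encoding for HTML forms.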

https://dev.to/sidthesloth92/understanding-html-form-encoding-url-encoded-and-multipart-forms-3lpa

    <form action="/urlencoded?firstname=sid&lastname=sloth" method="POST" enctype="application/x-www-form-urlencoded">
        <input type="text" name="username" value="sidthesloth"/>
        <input type="text" name="password" value="slothsecret"/>
        <input type="submit" value="Submit" />
    </form>

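The request body carries the url_encoded value username=sidthesloth&password=slothsecret. Just like the query string in the request URL /urlencoded?firstname=sid&lastname=sloth, it is url_encoded.

5. Do common HTTP clients url_encode automatically?

The HTTP clients in .NET, Go and Lua (and curl) should not be relied on to url_encode a URL the way a browser does; what gets escaped, if anything, differs between clients and between URL components. If your HTTP client needs to mimic a browser's url_encoded request, encode the URL parts yourself.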
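
The same body format can be produced in code with the standard `URLSearchParams` class. This is a small sketch; the field values are variations on the form above, changed so that they contain characters that actually need encoding.

```javascript
// Build an application/x-www-form-urlencoded body.
const form = new URLSearchParams();
form.append("username", "sid the sloth");
form.append("password", "sloth&secret=1");

const body = form.toString();
console.log(body); // username=sid+the+sloth&password=sloth%26secret%3D1

// Parsing the body gives the original values back.
const parsed = new URLSearchParams(body);
console.log(parsed.get("username")); // "sid the sloth"
console.log(parsed.get("password")); // "sloth&secret=1"
```

In this format a space becomes `+`, whereas `encodeURIComponent` would produce `%20`. That is the "plus (+) sign or %20" distinction quoted in section 1.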

[C#] System.Net.WebUtility.UrlEncode
[golang] url.QueryEscape(rawURL)
[lua] ngx.escape_uri(str, 0) https://stackoverflow.com/questions/78225022/is-there-a-lua-equivalent-of-the-javascript-encodeuri-function
[curl] the --data-urlencode option provides url_encoded encoding.

curl --data-urlencode "name=John Doe (Junior)" http://example.com
#   name=John%20Doe%20%28Junior%29
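
For a JavaScript HTTP client, the same manual encoding could look like the sketch below. `buildEncodedUrl` is a hypothetical helper written for this post, not part of any library, and the origin and values are made up.

```javascript
// Hypothetical helper: encode every path segment and query pair separately,
// then join them with the structural characters "/", "?", "&" and "=".
function buildEncodedUrl(origin, segments, query) {
  const path = segments.map((s) => encodeURIComponent(s)).join("/");
  const qs = Object.entries(query)
    .map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
    .join("&");
  return qs ? `${origin}/${path}?${qs}` : `${origin}/${path}`;
}

const target = buildEncodedUrl(
  "http://example.com",
  ["files", "a/b"],              // "a/b" is one segment, so its slash is data
  { name: "John Doe (Junior)" }
);

console.log(target);
// http://example.com/files/a%2Fb?name=John%20Doe%20(Junior)
```

Unlike the curl output above, `encodeURIComponent` leaves the parentheses unescaped, since they are in the set of characters it never encodes. The resulting string can then be handed to whichever HTTP client you use.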

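Summary

This post started from a familiar question and pointed out a stage that is easy to overlook when the browser sends a request: url_encode. On the other side, the server follows the URL normalization rules and performs the url_decode. Client and server complement each other; good software, it turns out, ends up at standard protocols.

From the small topic of url_encode, we went on to how the x-www-form-urlencoded form encoding shows up in the request body, and to how an ordinary HTTP client should url_encode if it wants to mimic a browser request.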

posted @ 2024-04-24 10:33 码甲哥不卷