python 3 open() default encoding

Answer 1

Python 3's UTF-8 default only applies to bytes-to-str conversions such as bytes.decode() and str.encode(). open() instead chooses an encoding based on your environment:

From the Python 3 docs for open():

encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.

In your case, since you're on Windows with a Western European or North American locale, you get the 8-bit Windows-1252 character set. Setting encoding to utf-8 overrides this.
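
A small sketch of the mismatch described above. To keep it reproducible on any platform, the Windows default is simulated by passing encoding="cp1252" explicitly; the file name demo.txt and the sample text are made up for illustration.

```python
import os
import tempfile

text = "café"

# Write the file as UTF-8, the way most editors and tools on macOS/Linux do.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Simulate what open(path) does on a Western European Windows install,
# where the locale encoding is cp1252: the two UTF-8 bytes of "é" are
# decoded as two separate cp1252 characters.
with open(path, encoding="cp1252") as f:
    garbled = f.read()

# Passing the real encoding explicitly gives the same result everywhere.
with open(path, encoding="utf-8") as f:
    correct = f.read()

print(repr(garbled))
print(repr(correct))
```

The garbled read does not raise an error here; it silently produces wrong text, which is why this kind of bug can go unnoticed.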

Fortunately there are recent attempts to end this madness... someday.

– Jeyekomon

Apr 28, 2020 at 14:05

Motivation

Using the default encoding is a common mistake

Developers using macOS or Linux may forget that the default encoding is not always UTF-8.

For example, using long_description = open("README.md").read() in setup.py is a common mistake. Many Windows users cannot install such packages if there is at least one non-ASCII character (e.g. emoji, author names, copyright symbols, and the like) in their UTF-8-encoded README.md file.
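
One way to avoid this in a setup script, as a minimal sketch: the helper name read_long_description is hypothetical, and a temporary file stands in for a real README.md.

```python
import tempfile
from pathlib import Path


def read_long_description(readme_path):
    """Read a README as UTF-8 regardless of the platform's locale encoding."""
    return Path(readme_path).read_text(encoding="utf-8")


# Stand-in for a project's README.md containing non-ASCII characters.
readme = Path(tempfile.mkdtemp()) / "README.md"
readme.write_text("Copyright \u00a9 Jos\u00e9\n", encoding="utf-8")

long_description = read_long_description(readme)
```

The same explicit encoding="utf-8" argument works with the built-in open() if pathlib is not wanted.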

Of the 4000 most downloaded packages from PyPI, 489 use non-ASCII characters in their README, and 82 fail to install from source on non-UTF-8 locales due to not specifying an encoding for a non-ASCII file. [1]

Another example is logging.basicConfig(filename="log.txt"). Some users might expect it to use UTF-8 by default, but the locale encoding is actually what is used. [2]
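
A sketch of one way to pin the log file's encoding, assuming a dedicated logging.FileHandler is acceptable in place of basicConfig(filename=...). The logger name and log message are made up for illustration. On Python 3.9 and later, basicConfig itself also accepts an encoding argument.

```python
import logging
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "log.txt")

# FileHandler accepts an explicit encoding, so the log file no longer
# depends on the locale encoding of the machine that writes it.
handler = logging.FileHandler(log_path, encoding="utf-8")
handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))

logger = logging.getLogger("encoding_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo from echoing through the root logger

logger.info("naïve user: 名前")

# Detach and close the handler so the file is flushed before reading it back.
logger.removeHandler(handler)
handler.close()

with open(log_path, encoding="utf-8") as f:
    logged = f.read()
```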

Even Python experts may assume that the default encoding is UTF-8. This creates bugs that only happen on Windows; see [3], [4], [5], and [6] for example.

Emitting a warning when the encoding argument is omitted will help find such mistakes.
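
The warning can be tried out as follows. This sketch assumes Python 3.10 or later, where the opt-in is spelled -X warn_default_encoding (or PYTHONWARNDEFAULTENCODING=1) and the warning category is EncodingWarning. The helper name warns_about_encoding and the sample file are hypothetical. A child interpreter is used because the option has to be set at startup.

```python
import os
import subprocess
import sys
import tempfile


def warns_about_encoding(snippet, path):
    """Run snippet in a child interpreter with the opt-in warning enabled
    and report whether an EncodingWarning was printed to stderr."""
    result = subprocess.run(
        [sys.executable, "-X", "warn_default_encoding", "-c", snippet, path],
        capture_output=True,
        text=True,
    )
    return "EncodingWarning" in result.stderr


path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("hello\n")

# The first snippet omits the encoding argument; the second passes it.
implicit = "import sys; open(sys.argv[1]).close()"
explicit = "import sys; open(sys.argv[1], encoding='utf-8').close()"

implicit_warns = warns_about_encoding(implicit, path)
explicit_warns = warns_about_encoding(explicit, path)

print("omitted encoding warns: ", implicit_warns)
print("explicit encoding warns:", explicit_warns)
```

On interpreters older than 3.10 the -X option is accepted but has no effect, so neither snippet warns.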

Explicit way to use locale-specific encoding

open(filename) isn’t explicit about which encoding is expected:

If ASCII is assumed, this isn’t a bug, but may result in decreased performance on Windows, particularly with non-Latin-1 locale encodings
If UTF-8 is assumed, this may be a bug or a platform-specific script
If the locale encoding is assumed, the behavior is as expected (but could change if future versions of Python modify the default)

From this point of view, open(filename) is not readable code.

encoding=locale.getpreferredencoding(False) can be used to specify the locale encoding explicitly, but it is too long and easy to misuse (e.g. one can forget to pass False as its argument).

This PEP provides an explicit way to specify the locale encoding.
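
A sketch of how the two spellings compare, assuming the encoding="locale" name that Python 3.10 and later accept. The helper locale_encoding_arg is hypothetical and exists only to keep the example runnable on older interpreters.

```python
import locale
import os
import sys
import tempfile


def locale_encoding_arg():
    """Return a value for open()'s encoding argument that explicitly
    requests the locale encoding (hypothetical helper)."""
    if sys.version_info >= (3, 10):
        # Python 3.10+ accepts the special name "locale".
        return "locale"
    # Older versions: spell it out, remembering to pass False.
    return locale.getpreferredencoding(False)


path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# The intent "I really do want the locale encoding" is now visible in the call.
with open(path, "w", encoding=locale_encoding_arg()) as f:
    f.write("plain ASCII text\n")

with open(path, encoding=locale_encoding_arg()) as f:
    content = f.read()
```

The sample text is deliberately ASCII-only, since a locale encoding is not guaranteed to be able to represent anything else.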

Prepare to change the default encoding to UTF-8

Since UTF-8 has become the de-facto standard text encoding, we might default to it for opening files in the future.

However, such a change will affect many applications and libraries. If we start emitting DeprecationWarning everywhere the encoding argument is omitted, it will be too noisy and painful.

Although this PEP doesn’t propose changing the default encoding, it will help enable that change by:

Reducing the number of omitted encoding arguments in libraries before we start emitting a DeprecationWarning by default.
Allowing users to pass encoding="locale" to suppress the current warning and any DeprecationWarning added in the future, as well as retaining consistent behavior if later Python versions change the default, ensuring support for any Python version >=3.10.

Which encoding should Python open function use?

Answer 1

As clearly stated in Python's open documentation:

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.

Windows defaults to a localized encoding (cp1252 on US and Western European versions). Linux typically defaults to utf-8.

Because it is platform-dependent, use the encoding parameter and specify the encoding of the file explicitly.

https://docs.python.org/3/library/functions.html#open

encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.

locale.getpreferredencoding(do_setlocale=True)

Return the locale encoding used for text data, according to user preferences. User preferences are expressed differently on different systems, and might not be available programmatically on some systems, so this function only returns a guess.

On some systems, it is necessary to invoke setlocale() to obtain the user preferences, so this function is not thread-safe. If invoking setlocale is not necessary or desired, do_setlocale should be set to False.

On Android or if the Python UTF-8 Mode is enabled, always return 'UTF-8', the locale encoding and the do_setlocale argument are ignored.

The Python preinitialization configures the LC_CTYPE locale. See also the filesystem encoding and error handler.

Changed in version 3.7: The function now always returns UTF-8 on Android or if the Python UTF-8 Mode is enabled.
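
To see which values apply on a given machine, something like the following can be run. It is only a sketch: describe_text_defaults is a hypothetical helper, and the output differs between platforms and interpreter settings.

```python
import codecs
import locale
import sys


def describe_text_defaults():
    """Collect the settings that determine open()'s default text encoding."""
    return {
        # 1 when UTF-8 Mode is on (e.g. via -X utf8 or PYTHONUTF8=1).
        "utf8_mode": sys.flags.utf8_mode,
        # False avoids the setlocale() call described above.
        "preferred_encoding": locale.getpreferredencoding(False),
    }


defaults = describe_text_defaults()
# Normalize spellings such as "UTF-8" and "utf8" to one canonical codec name.
canonical = codecs.lookup(defaults["preferred_encoding"]).name

for key, value in defaults.items():
    print(f"{key}: {value}")
print(f"canonical codec name: {canonical}")
```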

posted @ 2022-11-19 10:34 ChuckLu
