Encoding in general
ASCII chars 0-127
https://en.wikipedia.org/wiki/ASCII
ASCII chars 0-127 (decimal), 00-7F (hex):
- Characters 00-1F and 7F (hex): control characters
- Characters 20-7E (hex): printable characters
- All 128 chars fit in 7 bits
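A minimal sketch of the ranges above, assuming the usual convention that DEL (7F hex) counts as a control character. The variable names are illustrative only.

```python
# Split the 128 code points of the 0-127 range into control and printable.
# Assumption: DEL (0x7F) is treated as a control character.
control = [c for c in range(128) if c < 0x20 or c == 0x7F]
printable = [c for c in range(128) if 0x20 <= c <= 0x7E]

print(len(control), len(printable))  # 33 95
print((127).bit_length())            # 7, so the highest code needs 7 bits
print(repr(chr(0x41)), repr(chr(0x0A)))  # 'A' '\n'
```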
Code pages
8-bit code pages can encode chars 0-255 (decimal), 00-FF (hex). In general:
- encoding for 00-7F is the same in nearly all of them (= ASCII encoding)
- encoding for 80-FF is unique to each code page (usually language-specific letters and symbols)
Examples:
- CP-1251 (Cyrillic)
- OEM 437 (US)
- OEM 866 (Cyrillic)
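To illustrate how the same bytes read differently under each of these code pages, here is a small sketch using Python's built-in codecs (`cp1251`, `cp437`, `cp866`). The sample bytes are arbitrary choices for the demonstration.

```python
# One byte from the shared 00-7F range and two from the 80-FF range.
raw = bytes([0x41, 0x80, 0xE0])

# Decode the same bytes under three different code pages.
decoded = {codec: raw.decode(codec) for codec in ("cp1251", "cp437", "cp866")}

for codec, text in decoded.items():
    print(codec, text)
```

The first character is "A" in every case, while the two high bytes map to different letters in each code page.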
OEM vs CP:
- OEM code pages are the older (DOS) ones; the Windows console typically still defaults to one
- CP (Windows) code pages are used today, for example in GUI apps
Unicode
UTF-8, UTF-16, UTF-32 — they all encode the same set of characters. They differ only in the way they encode them.
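ASCII chars (0-127) are encoded as the same single byte in UTF-8 as in ASCII and the 8-bit code pages. In UTF-16 and UTF-32 they keep the same numeric value but take 2 or 4 bytes each, so the byte sequences differ.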
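A short sketch that compares the byte sequences the three Unicode encodings produce for the same character. The helper name `encodings_of` is made up for this example, and the little-endian variants are used so that no byte order mark is added. It needs Python 3.8 or newer for `bytes.hex(" ")`.

```python
def encodings_of(ch: str) -> dict:
    """Return the hex bytes of one character in UTF-8, UTF-16 and UTF-32."""
    codecs = ("utf-8", "utf-16-le", "utf-32-le")
    return {codec: ch.encode(codec).hex(" ") for codec in codecs}

for ch in ("A", "Я", "€"):
    print(ch, encodings_of(ch))
```

For "A" only the UTF-8 form is the single byte 41, identical to its ASCII encoding; the other two pad it with zero bytes.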
Encoding on Windows
How encoding works in Windows cmd
CLI apps write a sequence of bytes (like 11001100-01010101-…) to Windows cmd stdout/stderr. The author of a CLI app can have any encoding in mind. Unless the app specifically inspects the cmd settings itself, cmd interprets these bytes using the currently active code page.
If [author’s encoding] matches [currently active code page], everything is rendered as expected (of course, only if the current font supports all the chars). If not, you get gibberish for all chars outside the ASCII (0-127) range.
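The mismatch can be simulated without a real console. In this sketch the hypothetical `render` function stands in for the terminal: it decodes the app's bytes with whatever code page is "active". The app is assumed to write UTF-8.

```python
def render(data: bytes, active_code_page: str) -> str:
    """Simulate a terminal decoding an app's output bytes (illustrative only)."""
    return data.decode(active_code_page, errors="replace")

text = "Hello, Привет"
data = text.encode("utf-8")      # the encoding the app's author had in mind

print(render(data, "utf-8"))     # matching code page: rendered as expected
print(render(data, "cp866"))     # mismatch: the Cyrillic part turns into gibberish
```

The "Hello, " prefix survives in both cases because it lies in the 0-127 range, which both encodings share.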
New line in Windows cmd
- Windows uses \r\n for a new line, Linux uses \n; most apps handle this difference automatically (\r and \n are ASCII control chars from the 00-1F hex range)
- To enable UTF-8 encoding in Windows cmd (for the terminal's IO), execute “chcp 65001”; apps that output UTF-8 encoded byte sequences will then be rendered correctly
- Coloring in Windows cmd is done by enabling the VT100 (virtual terminal) feature, present on Windows 10 since ~2015:
  - Coloring is encoded via ANSI escape sequences: special byte combinations that start with the ESC control char (1B hex)
  - This is how it has long been done on Linux; Windows cmd started supporting it only recently
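As a rough illustration of such a color sequence, the sketch below wraps text in an SGR ("select graphic rendition") escape sequence. The helper name `colorize` is invented for this example; 31 is the standard code for a red foreground and 0 resets attributes. It assumes the terminal already has VT processing enabled and does not enable it itself.

```python
ESC = "\x1b"  # the ESC control char, 1B hex

def colorize(text: str, sgr_code: int) -> str:
    """Wrap text in an SGR color sequence followed by a reset (illustrative helper)."""
    return f"{ESC}[{sgr_code}m{text}{ESC}[0m"

# On a terminal without VT support the raw escape bytes show up as stray characters.
print(colorize("error", 31))
```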