TL;DR – Specifying the right HTML encoding ensures that the browser displays special characters correctly.
Understanding HTML Character Encoding
The need for character encoding arises from the huge selection of characters available. Apart from your usual Latin letters and Arabic numerals, there are also non-Latin alphabets, mathematical symbols, and other special characters. The same bytes can be displayed as different characters depending on which HTML encoding a document declares.
Incorrectly interpreted text leads to a variety of issues:
- Users can't read the text properly
- Search engines can't find the data
- Machines can't process the information
All the available characters are grouped into specific sets (also called charsets for short). By defining HTML encoding, you let the browser access the particular set and display its characters correctly.
Note: Japanese even has a special term for text that has been garbled by the wrong encoding – mojibake (文字化け).
ASCII: The Most Basic Charset
The first and simplest HTML character encoding is called ASCII. Most modern charsets use it as a standard base.
ASCII stands for the American Standard Code for Information Interchange. It was developed from telegraph code in the early 1960s and contains 128 characters, 95 of which are printable:
- Lowercase Latin letters
- Uppercase Latin letters
- Punctuation symbols
- Numbers from 0 to 9
The 33 unprintable characters are also called control characters. These are invisible characters, e.g., ones that mark a tab or a line break.
However, the popularity of ASCII fell as the Internet grew more and more international. Supporting only Latin characters quickly became insufficient.
Your Best Option: UTF-8
Unicode is the industry standard for consistent character encoding. It was first published in the early 1990s and defines several encodings, such as UTF-8, UTF-16, and UTF-32.
UTF-8 stands for Unicode Transformation Format 8-bit and has held the title of the most popular HTML character encoding since 2008. By 2019, more than 90 percent of all websites used UTF-8. The World Wide Web Consortium (W3C) also recommends it as the default HTML character encoding.
There are multiple compelling reasons to use UTF-8:
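- It can represent every Unicode character, so it supports virtually every written language.
- It is backward compatible with ASCII: any ASCII text is also valid UTF-8.
- It is the default encoding of XML.
- For text that consists mostly of ASCII characters, it uses less space than other Unicode encodings.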
To declare UTF-8 as your preferred HTML character encoding, use the <meta> tag with the charset attribute set to UTF-8.
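A minimal sketch of such a declaration is shown here. The page title and the sample paragraph are assumptions added purely for illustration; only the <meta> line is the declaration itself.

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <!-- Declare the encoding first in <head>, before any text content -->
    <meta charset="UTF-8">
    <title>Encoding example</title>
  </head>
  <body>
    <!-- Hypothetical sample text containing non-ASCII characters -->
    <p>Café, naïve, 文字化け, ∑</p>
  </body>
</html>
```

Placing the <meta> tag at the very top of <head> is the safer choice, because the HTML specification expects the declaration to appear within the first 1024 bytes of the document. The value is case-insensitive, so utf-8 works as well.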
Alternative HTML Encodings
You can find a ton of alternative encodings in the Encoding Living Standard created by the Web Hypertext Application Technology Working Group (WHATWG). However, we'd strongly advise you to stay with UTF-8, as the legacy charsets contain a smaller selection of characters, and that might cause issues in displaying your website.
HTML Encoding: Useful Tips
- The proper display of your characters depends not only on the charset but also on the chosen font: not every font has a glyph for every character. If the font you chose does not have the symbol you need, the browser will either look for a match in other fonts or display a placeholder character (e.g., a question mark).
- Don't forget that you also need to choose the same encoding when saving your document: the declared HTML encoding must match the encoding the file is actually saved in.