Complete Guide to HTML Entities

HTML entities are one of the most fundamental yet frequently misunderstood aspects of web development. They are the mechanism by which browsers can display characters that would otherwise be interpreted as HTML markup, and they play a critical role in web security. Whether you are encoding special characters to prevent cross-site scripting attacks, displaying mathematical symbols, or ensuring your content renders correctly across different encodings, understanding HTML entities is essential for every web developer.

What Are HTML Entities?

An HTML entity is a sequence of characters that represents a single character in HTML. Entities begin with an ampersand (&) and end with a semicolon (;). When the browser encounters an entity, it replaces it with the corresponding character during rendering. This mechanism allows you to include characters in your HTML that would otherwise be impossible or problematic to represent directly.

The need for entities arises from the fact that HTML uses certain characters for its own syntax. The less-than sign (<) starts a tag, the greater-than sign (>) ends a tag, and the ampersand (&) starts an entity reference. If you want to display these characters as visible content rather than markup, you must encode them as entities.

Core Principle: Any character that has special meaning in HTML syntax must be encoded as an entity when it appears as content. This is not optional; it is required for valid HTML.

Common HTML Entities

While HTML defines over 2,000 named entities, a small handful are used constantly in everyday web development. These are the entities you will encounter and use most frequently:

The Five Required Entities

| Character | Named Entity | Decimal Entity | Hex Entity | Usage |
|-----------|--------------|----------------|------------|-------|
| & | &amp; | &#38; | &#x26; | Must always be encoded to avoid ambiguity with entity references |
| < | &lt; | &#60; | &#x3C; | Must be encoded to prevent opening an HTML tag |
| > | &gt; | &#62; | &#x3E; | Must be encoded to prevent closing an HTML tag |
| " | &quot; | &#34; | &#x22; | Must be encoded inside double-quoted attribute values |
| ' | &apos; | &#39; | &#x27; | Must be encoded inside single-quoted attribute values |
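
The table above can be turned into a small encoder. The following is a minimal sketch, not a hardened library: the names ENTITY_MAP and escapeHtml are hypothetical, and the single quote is written in its numeric form, which is equivalent to &apos;.

```javascript
// Hypothetical lookup table: each of the five characters and its entity.
const ENTITY_MAP = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;", // numeric form; equivalent to &apos;
};

// Hypothetical helper: encodes the five characters in one pass,
// so an "&" produced by one replacement is never encoded again.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => ENTITY_MAP[ch]);
}

console.log(escapeHtml('Tom & Jerry say "5 > 3"'));
// Tom &amp; Jerry say &quot;5 &gt; 3&quot;
```

Doing all replacements in a single pass matters: replacing & after the other characters would double-encode the ampersands that those replacements introduce.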

Frequently Used Symbol Entities

| Character | Named Entity | Description |
|-----------|--------------|-------------|
| (invisible) | &nbsp; | Non-breaking space (prevents line break) |
| © | &copy; | Copyright symbol |
| ® | &reg; | Registered trademark symbol |
| ™ | &trade; | Trademark symbol |
| — | &mdash; | Em dash (long dash) |
| – | &ndash; | En dash (short dash, for ranges) |
| « | &laquo; | Left-pointing double angle quotation mark |
| » | &raquo; | Right-pointing double angle quotation mark |
| • | &bull; | Bullet character |
| … | &hellip; | Horizontal ellipsis (three dots) |

Numeric vs Named Entities

HTML entities come in two forms: named and numeric. Understanding the difference helps you choose the right one for each situation.

Named Entities

Named entities use a human-readable word to identify the character. For example, &lt; represents the less-than sign, and &copy; represents the copyright symbol. Named entities are easier to read and remember, which makes your HTML source code more maintainable. However, named entities exist for only a limited set of characters: approximately 2,000 out of the more than 140,000 characters in Unicode.

Numeric Entities

Numeric entities reference a character by its Unicode code point. They come in two forms: decimal and hexadecimal. Decimal numeric entities use the format &#NNN; where NNN is the decimal code point. Hexadecimal entities use the format &#xHHH; where HHH is the hexadecimal code point. For example, the copyright symbol can be written as &#169; (decimal) or &#xA9; (hexadecimal).

Numeric entities can represent any Unicode character, making them essential for characters that do not have named entity references. If you need to write a rare CJK character, a historical script glyph, or an emoji as an entity, the numeric form is the only option, because none of these have names.
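
To make the two numeric formats concrete, here is a short sketch that converts a character to its decimal or hexadecimal entity and back. The function names are hypothetical, and the decoder handles numeric entities only; it does not reproduce every edge case of a browser's parser.

```javascript
// Hypothetical helper: builds the numeric entity for the first character of a string.
function toNumericEntity(char, hex = false) {
  const codePoint = char.codePointAt(0); // also works for characters outside the BMP
  return hex
    ? `&#x${codePoint.toString(16).toUpperCase()};`
    : `&#${codePoint};`;
}

// Hypothetical helper: decodes &#NNN; and &#xHHH; and leaves named entities untouched.
function decodeNumericEntities(text) {
  return text.replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Values beyond the Unicode range are left as written instead of throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(toNumericEntity("©"));       // &#169;
console.log(toNumericEntity("©", true)); // &#xA9;
console.log(decodeNumericEntities("&#169; &#xA9; &lt;")); // © © &lt;
```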

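When to Use Each

Prefer a named entity when one exists, since it is easier to read in source. Fall back to a numeric entity for any character that has no named reference.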

When to Use HTML Entities

Knowing when to use entities is just as important as knowing how. Here are the primary scenarios where entities are necessary or beneficial:

1. Displaying Reserved Characters

The most critical use case is encoding characters that HTML reserves for its own syntax. If you want to display a literal less-than sign in your content, you must write &lt;. If you write a bare <, the browser will attempt to parse it as the start of an HTML tag, which can break your page layout or create security vulnerabilities.

2. Preventing Ambiguous Ampersands

An ampersand followed by a recognized entity name is interpreted as an entity reference. If you have an ampersand in your content that is not part of an entity (such as in "Tom & Jerry"), you should encode it as &amp;. An unencoded ampersand followed by text that happens to match an entity name will be incorrectly decoded. For example, "AT&T" usually renders as intended because &T is not a defined entity name, but a URL such as ?id=1&copy=2 written into page text is shown as ?id=1©=2, because browsers still recognize a few legacy names like &copy without the trailing semicolon. Encoding every literal ampersand avoids having to remember which names are affected.

3. Including Non-Keyboard Characters

Characters like em dashes, copyright symbols, and mathematical operators may be difficult to type on a standard keyboard. Entities provide a reliable way to include these characters without depending on your editor's input method or character map. For example, &mdash; is often easier to type than finding the em dash character on a keyboard.

4. Non-Breaking Spaces

The &nbsp; entity creates a non-breaking space, which prevents the browser from wrapping text at that position. This is essential for keeping related words together on the same line, such as "10 kg", "Chapter 3", or brand names that should not be split across lines. It is also commonly used to create visual spacing in layouts where CSS is not appropriate.
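
As a small illustration of the "10 kg" case, the sketch below replaces the ordinary space between a number and a unit with &nbsp;. The function name and the short unit list are assumptions made for this example, and the input is assumed to be text that has already been entity-encoded.

```javascript
// Hypothetical helper: keeps a number and its unit on the same line.
// The unit list is illustrative only; extend it for real content.
function bindUnits(html) {
  return html.replace(/(\d) (kg|g|km|m|cm|mm)\b/g, "$1&nbsp;$2");
}

console.log(bindUnits("The parcel weighs 10 kg."));
// The parcel weighs 10&nbsp;kg.
```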

Security Implications: XSS Prevention

HTML entity encoding is one of the most important defenses against Cross-Site Scripting (XSS) attacks. XSS occurs when an attacker is able to inject executable code into a web page viewed by other users. Entity encoding prevents this by converting dangerous characters into their safe entity equivalents.

How Entity Encoding Prevents XSS

Consider a search page that displays the user's query in the results heading. If a user searches for <script>alert('xss')</script> and the page renders this without encoding, the browser will execute the script. However, if the output is entity-encoded, it becomes &lt;script&gt;alert('xss')&lt;/script&gt;, which the browser displays as literal text rather than executing as code.
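
The search-page scenario can be sketched with a tagged template that encodes every interpolated value before it reaches the markup. All names here (escapeHtml, html, renderHeading) are hypothetical and stand in for whatever your server-side templating layer provides.

```javascript
// Hypothetical encoder for the five reserved characters.
function escapeHtml(value) {
  const map = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return String(value).replace(/[&<>"']/g, (ch) => map[ch]);
}

// Tagged template: the literal parts are trusted markup written by the developer,
// while every interpolated value is treated as untrusted and encoded.
function html(strings, ...values) {
  return strings.reduce(
    (out, part, i) => out + part + (i < values.length ? escapeHtml(values[i]) : ""),
    ""
  );
}

// Hypothetical render function for the results heading.
function renderHeading(query) {
  return html`<h1>Results for ${query}</h1>`;
}

console.log(renderHeading("<script>alert('xss')</script>"));
// <h1>Results for &lt;script&gt;alert(&#39;xss&#39;)&lt;/script&gt;</h1>
```

Encoding at the point of interpolation, rather than at each call site, makes it harder to forget a value.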

Context-Specific Encoding Rules

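Entity encoding is effective in HTML content and attribute contexts, but other contexts require different encoding strategies. JavaScript, CSS, and URL contexts each have their own escaping rules, and entity encoding does not protect them.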

Security Warning: Never rely solely on client-side entity encoding for security. Always perform output encoding on the server side when rendering user-supplied data into HTML. Client-side encoding can be bypassed if JavaScript is disabled or if the encoding logic has bugs.

Entity Encoding in Different Contexts

In HTML Attributes

Attribute values need different care than HTML content because attributes use quotes as delimiters. A double-quoted attribute value must encode the double quote character, and a single-quoted attribute value must encode the single quote. The ampersand must also be encoded, and encoding angle brackets as well is a sensible habit, especially when the value comes from user input.

<!-- Recommended: entity-encoded attribute value -->
<a title="5 &gt; 3 is true">Example</a>

<!-- Works, but not recommended: unencoded special character -->
<a title="5 > 3 is true">Example</a>

In the second example, browsers parse the > inside the quoted attribute value correctly, and HTML allows it there. Encoding it anyway keeps the markup unambiguous for simpler tools that scan for tag boundaries, and it means one escaping routine can serve both content and attributes.

In JavaScript

HTML entities are not processed inside <script> tags. If you need to include special characters in JavaScript code embedded in HTML, you must use JavaScript's own escape sequences:

<!-- Wrong: entities don't work in script tags -->
<script>
  var msg = "5 &gt; 3"; // This is literal "&gt;", not ">"
</script>

<!-- Correct: use JavaScript escapes -->
<script>
  var msg = "5 \u003E 3"; // Unicode escape
  var msg2 = "5 > 3";     // Or just use the character directly
</script>

When injecting data from the server into inline JavaScript, a good starting point is to use JSON.stringify() on the server side and place the result inside the script. This handles quotes, backslashes, and control characters. It does not escape <, however, so a value containing </script> can still end the script element early; escape < (for example as \u003C) in the serialized output as well.
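
As a rough sketch of that server-side step, the hypothetical helper below wraps JSON.stringify() and rewrites <, >, and & as JavaScript Unicode escapes so the output can never contain </script> or <!--. It assumes a Node.js-style server that builds the page as a string and that the data is JSON-serializable.

```javascript
// Hypothetical helper: JSON-encodes data for safe placement inside a <script> element.
function serializeForScript(data) {
  return JSON.stringify(data)
    .replace(/</g, "\\u003C") // prevents "</script>" and "<!--" from appearing
    .replace(/>/g, "\\u003E")
    .replace(/&/g, "\\u0026")
    .replace(/\u2028/g, "\\u2028") // line separators that older JS engines reject in strings
    .replace(/\u2029/g, "\\u2029");
}

const userData = { name: "</script><script>alert(1)</script>" };
const page = `<script>var user = ${serializeForScript(userData)};</script>`;
console.log(page);
// <script>var user = {"name":"\u003C/script\u003E\u003Cscript\u003Ealert(1)\u003C/script\u003E"};</script>
```

The escapes are decoded by the JavaScript engine when the script runs, so the page sees the original string while the HTML parser never sees a closing tag inside the data.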

In CSS

Similarly, HTML entities are not processed inside <style> tags. CSS has its own escape mechanism using a backslash followed by the character's hexadecimal Unicode code point:

<!-- Wrong: entities don't work in style tags -->
<style>
  .quote::before { content: "&ldquo;"; }
</style>

<!-- Correct: use CSS Unicode escapes -->
<style>
  .quote::before { content: "\201C"; }
</style>

The CSS escape sequence uses a backslash followed by one to six hexadecimal digits. If fewer than six digits are used and the next character could be interpreted as a hex digit, add a space after the escape sequence.
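
The rule above can be sketched in a few lines. The helpers below are hypothetical: toCssEscape always appends the terminating space, which CSS consumes as part of the escape, so it is safe even when the next character is not a hex digit.

```javascript
// Hypothetical helper: CSS escape for the first character of a string.
function toCssEscape(char) {
  const hex = char.codePointAt(0).toString(16).toUpperCase();
  return `\\${hex} `; // the trailing space ends the escape and is not rendered
}

// Hypothetical helper: builds a double-quoted CSS string for the content property.
// Printable ASCII is kept, except the quote and backslash; everything else is escaped.
function toCssString(text) {
  const body = Array.from(text, (ch) =>
    /^[\x20-\x7E]$/.test(ch) && ch !== '"' && ch !== "\\" ? ch : toCssEscape(ch)
  ).join("");
  return `"${body}"`;
}

console.log(toCssString("\u201CHi\u201D")); // "\201C Hi\201D "
```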

Browser Compatibility

HTML entity support is excellent across all modern browsers. Four of the five required entities (&amp;, &lt;, &gt;, &quot;) have been supported since the earliest browsers; &apos; is the exception and is discussed below. Named entities for common symbols like &copy;, &nbsp;, and &reg; also have universal support.

The main compatibility concern involves less common named entities introduced in HTML5 and later specifications. Some older browsers may not recognize entities like &NotEqualTilde;. For maximum compatibility, use numeric entities for uncommon characters, as numeric entity support depends only on Unicode support, which is universal in modern browsers.

One specific note: the &apos; entity was not part of HTML 4 and was only defined in XHTML. While all modern browsers support it, if you need to support very old browsers, use &#39; instead for single quote encoding.

Frequently Asked Questions

What is the difference between named and numeric HTML entities?

Named entities use human-readable names like &amp; for ampersand or &lt; for less-than. Numeric entities use the character's Unicode code point, either in decimal (&#38;) or hexadecimal (&#x26;) form. Named entities are more readable but only exist for common characters. Numeric entities can represent any Unicode character.

Do HTML entities prevent XSS attacks?

HTML entity encoding is a key defense against XSS when outputting user-supplied data in HTML content and attributes. By encoding special characters like <, >, and &, you prevent attackers from injecting executable HTML or JavaScript. However, entity encoding alone is not sufficient in all contexts: you need context-specific encoding for JavaScript, CSS, and URL contexts.

When must I use HTML entities?

You must use HTML entities when you need to display characters that have special meaning in HTML: & (ampersand), < (less-than), > (greater-than), " (double quote in attributes), and ' (single quote in attributes). You should also use entities for characters that cannot be typed directly, such as copyright symbols, em dashes, and non-breaking spaces.

Are HTML entities still necessary with UTF-8?

Yes, HTML entities are still necessary even with UTF-8 encoding. While UTF-8 allows you to include most characters directly, the five reserved characters (&, <, >, ", ') must still be encoded as entities when they appear in HTML content or attributes. Entities are also useful for invisible characters like non-breaking spaces and zero-width spaces.

Can I use HTML entities in CSS or JavaScript?

HTML entities are not processed inside <style> or <script> tags. In CSS, use Unicode escape sequences like \2014 for an em dash. In JavaScript strings, use Unicode escapes like \u2014 or the actual UTF-8 character. HTML entities only work in HTML content and attribute values, not in scripting or styling contexts.