In the last episode of this PHP Tutorial Compilation, we looked at working with links and URLs, and at how special characters can wreak havoc upon our HTML. At best these problems break a link; at worst they expose a vulnerability to the goblins of the internet who wish to bring malice to your web application. It turns out that it is not only our URLs that we must protect and encode. We also need to be aware of the large collection of special characters that may appear in the text of our HTML. For this, PHP gives us convenient ways to encode and protect our markup, so that your website remains upright, intact, working smoothly, and most of all the awesome presence that it is. Let’s do this!
HTML Encoding With PHP
HTML is the markup language of the web, and when it comes down to it, it is the HTML that makes the web tick. All of these fancy programming languages are fun to use in order to facilitate dynamic pages and so on, but without HTML, we wouldn’t have too much now would we? One of the most common things we do with PHP, or any other web based programming language for that matter, is to generate HTML on the fly using logic and conditions. As we know, HTML makes use of angle brackets and other characters to provide a means to the browser to render a web page. HTML has some special reserved characters that you need to watch out for since they have a specific meaning to HTML.
< and >
Here they are, the < and > characters. These two characters surround the HTML tag names in the page. They tell the browser that hey, something really interesting is happening here, and the information between these tags needs to be interpreted as markup. These angled brackets also denote that the data within them is not to be output to the page in human-readable form. Though the web browser sees something like <b>this bold text</b>, the user should see only this bold text, rendered in bold. The same characters mean two different things, depending on whether they are markup or text.
HTML Reserved Characters
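There are four main characters that are reserved in HTML which we need to pay close attention to. Here is a table that outlines them all.
|Character|Name|HTML Entity|
|<|Less than|&lt;|
|>|Greater than|&gt;|
|&|Ampersand|&amp;|
|"|Double quote|&quot;|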
What this means is that when you would like your users to see the literal character, you need to provide the encoded version. Maybe you would like to say something like Twitter is greater than Facebook, and use the greater than sign. The actual HTML will contain Twitter is &gt; Facebook, while the user will see Twitter is > Facebook displayed. If this or any of the other reserved characters is not properly encoded, you run the risk of the browser treating your text as markup, so that part of the page is garbled or not displayed at all.
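To make the mapping concrete, here is a minimal sketch that encodes the reserved characters by hand. It assumes the four characters in question are <, >, & and the double quote, and the function name encodeReserved is made up for this illustration. In real code you would reach for PHP's built-in functions rather than something like this.

```php
<?php
// Hypothetical helper, for illustration only.
function encodeReserved(string $text): string
{
    // strtr() with an array scans the input once and never re-processes
    // its own replacements, so the & inside "&lt;" is not encoded again.
    return strtr($text, [
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
    ]);
}

$encoded = encodeReserved('Twitter is > Facebook');
echo $encoded, PHP_EOL; // Twitter is &gt; Facebook
```

The browser turns &gt; back into > when it renders the page, so the user never sees the encoded form.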
First up we’ll examine the use of htmlspecialchars. It might make sense to observe a broken scenario, and then we’ll look at the solution using htmlspecialchars. Suppose we want to include a link with specific anchor text. The anchor text we’d like to display is <Click Here> & Prosper! So you figure, ok easy enough, we can just place this text we like in between anchor tags and create our link. Let’s try it.
<html>
<head>
  <meta charset="utf-8">
  <title>HTML Encode Like a PRO</title>
  <link href="css/bootstrap.min.css" rel="stylesheet">
  <script src="js/respond.js"></script>
  <script src="http://code.jquery.com/jquery-latest.min.js"></script>
  <script src="js/bootstrap.min.js"></script>
</head>
<body>
  <a href="http://localhost/bootstrapsandbox/encode.php">
    <Click Here> & Prosper!
  </a>
</body>
</html>
Do you see what that is right there? That right there is a big fat fail. The page didn’t display the text we wanted at all. The reason is that the browser comes along, sees those angled brackets, and thinks that it is dealing with an HTML tag. In this case however, it is not HTML at all. It is the actual angled brackets that we want the user to see in the text of the link. It is times like this that htmlspecialchars comes to the rescue! Observe!
<html>
<head>
  <meta charset="utf-8">
  <title>HTML Encode Like a PRO</title>
  <link href="css/bootstrap.min.css" rel="stylesheet">
  <script src="js/respond.js"></script>
  <script src="http://code.jquery.com/jquery-latest.min.js"></script>
  <script src="js/bootstrap.min.js"></script>
</head>
<body>
  <a href="http://localhost/bootstrapsandbox/encode.php">
    <?php echo htmlspecialchars('<Click Here> & Prosper!'); ?>
  </a>
</body>
</html>
Nice! Now that link text is working as designed. Think of htmlspecialchars as a way to disable HTML, so to speak. It neutralizes the markup so that the user sees the literal characters the browser would normally interpret. A key point of note is that htmlspecialchars only handles the reserved characters listed in the table above (plus the single quote, when the ENT_QUOTES flag is in effect). Now you may know that there is a rather large collection of symbols that we might want to display in the text of our HTML, which the browser may not render correctly if the character encoding of the page is not set up properly. Things like trademark symbols, copyright symbols, currency signs, and many more. In this case, you need to bust out the big dog, the htmlentities function.
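As a side note on htmlspecialchars, it is common to pass the flags and the encoding explicitly, since the default flags have changed between PHP versions. The sketch below wraps that in a small helper. The name e() and the sample comment string are assumptions made for this example, not something this tutorial defines.

```php
<?php
// e() is a hypothetical helper name used only for this illustration.
function e(string $value): string
{
    // ENT_QUOTES also encodes single quotes; ENT_SUBSTITUTE replaces invalid
    // byte sequences instead of making the function return an empty string.
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Imagine this came from a form field.
$comment = '<script>alert("hi")</script>';

$safeComment = e($comment);
echo $safeComment, PHP_EOL;
// &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;

// Decoding restores the original text, so nothing is lost by encoding.
$roundTrip = htmlspecialchars_decode($safeComment, ENT_QUOTES);
```

Encoding at output time like this means the browser displays the script tag as text instead of running it.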
The htmlentities function converts every character that has a named HTML entity equivalent, not just the reserved ones. In that sense, htmlentities covers far more ground. To illustrate this, we’ll try to enter some of the more common special characters that you might want to have in your webpage. Let’s try it out.
<?php $text = '© ® ™ £ € ¥'; echo $text; ?>
This might display correctly on some setups. However, if the character encoding the browser assumes does not match the one the text was saved in, you may get just a string of really strange hieroglyphic-looking characters. To be safe, you can wrap any text that has special characters in the htmlentities function like so.
<?php $text = '© ® ™ £ € ¥'; echo htmlentities($text); ?>
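In the page source this produces &copy; &reg; &trade; &pound; &euro; &yen;, which the browser renders as © ® ™ £ € ¥.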
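To see the difference between the two functions side by side, here is a small sketch. It assumes UTF-8 text and PHP 7.0 or later (for the \u{...} escapes), and all of the variable names are made up for the example.

```php
<?php
// Build the symbols from code points so the example does not depend on
// how this file itself is saved (the \u{...} syntax requires PHP 7.0+).
$text = "\u{00A9} \u{00AE} \u{2122} \u{00A3} \u{20AC} \u{00A5}"; // © ® ™ £ € ¥

$viaSpecialChars = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
$viaEntities     = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $viaEntities, PHP_EOL; // &copy; &reg; &trade; &pound; &euro; &yen;

// htmlspecialchars leaves these symbols alone, since none of them is reserved.
var_dump($viaSpecialChars === $text); // bool(true)

// The translation tables show how many characters each function covers.
$specialTable = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES, 'UTF-8');
$entityTable  = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'UTF-8');
printf("%d vs %d characters\n", count($specialTable), count($entityTable));

// html_entity_decode() reverses htmlentities().
$decoded = html_entity_decode($viaEntities, ENT_QUOTES, 'UTF-8');
```

If the page is served as UTF-8 and declares that charset, htmlspecialchars is usually enough. htmlentities earns its keep when the symbols might not survive the page's character encoding.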
URL Encoding Meets HTML Encoding
It’s time to level up friends. Now that we have a good grasp of URL encoding from our last episode, as well as the fundamentals of html encoding via this action packed tutorial, we can put the whole picture together to see how this works. This will sum up our learning of URL as well as HTML encoding.
<html>
<head>
  <meta charset="utf-8">
  <title>HTML Encode Like a PRO</title>
  <link href="css/bootstrap.min.css" rel="stylesheet">
  <script src="js/respond.js"></script>
  <script src="http://code.jquery.com/jquery-latest.min.js"></script>
  <script src="js/bootstrap.min.js"></script>
</head>
<body>
<?php
$page = 'bootstrapsandbox/encode.php';
$variable1 = 'Look out now, < > " and & which are bad!';
$variable2 = 'More bad chars like &#?*$+ and so on!';
$anchortext = '<Click Here> & Prosper!';

$url = 'http://localhost/';

// rawurlencode anything to the left of the ? !!
// Encode each path segment on its own so the / separators are preserved.
$url .= implode('/', array_map('rawurlencode', explode('/', $page)));

// urlencode anything to the right of the ? !!
$url .= '?' . 'variable1=' . urlencode($variable1);
$url .= '&' . 'variable2=' . urlencode($variable2);

// At this point $url is safe to use as a URL.
// It might NOT, however, be safe to output into our HTML!
// Just because a string is now fully URL encoded does not mean
// it is safe for output into page HTML. This is why $url
// must also be run through htmlspecialchars.
?>
  <a href="<?php echo htmlspecialchars($url); ?>">
    <?php echo htmlspecialchars($anchortext); ?>
  </a>
</body>
</html>
The result is a nicely encoded URL with any special characters or entities accounted for, which is epic!
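As an alternative sketch of the same idea, PHP's built-in http_build_query can assemble and urlencode the query string in one step. The host, the path and the variable names below simply mirror the illustrative values used in this tutorial, and the final parse step is only there as a sanity check.

```php
<?php
// Placeholder base URL, matching the local sandbox used in this tutorial.
$base   = 'http://localhost/bootstrapsandbox/encode.php';
$params = [
    'variable1' => 'Look out now, < > " and & which are bad!',
    'variable2' => 'More bad chars like &#?*$+ and so on!',
];

// http_build_query() urlencodes every key and value and joins the pairs with &.
$url = $base . '?' . http_build_query($params, '', '&');

// Encode once more for the HTML context: the & separator becomes &amp;.
$href = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
echo '<a href="' . $href . '">link</a>', PHP_EOL;

// Sanity check: parsing the query string gives back the original values.
parse_str(parse_url($url, PHP_URL_QUERY), $parsed);
var_dump($parsed === $params); // bool(true)
```

The two encoding steps stay separate on purpose: URL encoding makes the values safe inside the URL, and HTML encoding makes the finished URL safe inside the href attribute.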
Handy HTML Entity Table
If you ever need a good reference for all of the HTML entities you can use, here is a list of them. This list is good for being aware of characters that should be run through the htmlentities function as well.
|Various ASCII Character entities.|
|Various ISO-8859-1 HTML entities.|
|HTML entities for Math Symbols.|
|Greek Letters and their HTML Entities.|
|Other Various HTML entities.|