PHP Simple HTML DOM Parser vs FriendsOfPHP Goutte

In this tutorial, we’ll compare the PHP Simple HTML DOM Parser with FriendsOfPHP Goutte. In the early days, the PHP Simple HTML DOM Parser was all we had for extracting data from HTML. Now that we have FriendsOfPHP Goutte, there is a more feature-rich way of doing this type of work. Before you get going, you’ll need to install PHP Simple HTML DOM Parser, FriendsOfPHP Goutte, and the Guzzle PHP HTTP client. This is easy to do thanks to Composer. Create a directory on your computer called guzzle, cd into that directory, and place this composer.json in it.

Run composer install (or composer update) from the command line, and everything will get set up for you. Now you can place an index.php file in this directory and test any of the code we discuss in this tutorial. Your boilerplate in index.php will look something like this. Don’t forget to require the autoload file that Composer creates for you.
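As a minimal sketch of that boilerplate, the following assumes Goutte 3.x (whose client accepts a Guzzle client through setClient()) together with Guzzle 6.x; those version choices, the timeout value, and the decision to disable certificate verification are assumptions made for illustration, not a required setup.

```php
<?php
// index.php: sketch of the boilerplate, assuming fabpot/goutte 3.x
// and guzzlehttp/guzzle 6.x were installed by Composer.
require __DIR__ . '/vendor/autoload.php';

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

// 'verify' => false switches off TLS certificate verification.
// That avoids certificate errors on a local test machine, but it
// should not be carried over to production code.
$guzzle = new GuzzleClient([
    'timeout' => 60,
    'verify'  => false,
]);

// Hand the configured Guzzle client to Goutte (Goutte 3.x API).
$client = new Client();
$client->setClient($guzzle);

// A request would then look like this (needs network access):
// $crawler = $client->request('GET', 'https://httpbin.org');
```

Disabling verification is the quick fix for a development box; pointing Guzzle at a valid CA bundle is the safer long-term option.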

We set this up so you won’t get any errors like “cURL error 60: SSL certificate problem: unable to get local issuer certificate”. If you do not set up your client like above, you may get these errors!


Get HTML Elements


Get HTML Elements (PHP Simple HTML DOM Parser)

The PHP Simple HTML DOM Parser documentation starts you off with something like this. In this case, you load some HTML into a variable, then find the value of the src attribute of every image tag and the value of the href attribute of every link on the page.

Get HTML Elements (FriendsOfPHP Goutte)

As we see above, the PHP Simple HTML DOM Parser Library makes use of a find() method, whereas in FriendsOfPHP Goutte you will typically be making use of a filter() method to find elements in the DOM. Here are the function signatures for both of these methods.

find(string $selector [, int $index])
Return type varies.
Finds elements by CSS selector. Returns the Nth element object if index is set; otherwise returns an array of objects.

filter(string $selector)
Always returns a Crawler instance.
Filters the list of nodes with a CSS selector.

Note: What find() returns depends on the parameters you pass to it: an array of element objects when no index is given, a single element object when one is. This can sometimes lead to confusion. The filter() method, on the other hand, always returns a Symfony Crawler instance.
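To illustrate the filter() side of that comparison, here is a small sketch that uses the Symfony DomCrawler component directly on a string, so it runs without a network request. The markup and variable names are hypothetical; with Goutte, the Crawler would come from $client->request() instead.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup standing in for a fetched page.
$html = '<html><body>'
      . '<img src="/a.png"><img src="/b.png">'
      . '<a href="/one">One</a>'
      . '</body></html>';

$crawler = new Crawler($html);

$images  = $crawler->filter('img');   // two matches
$links   = $crawler->filter('a');     // one match
$nothing = $crawler->filter('video'); // no matches, still a Crawler

// each() calls the closure once per node and returns an array
// of whatever the closure returned.
$sources = $images->each(function (Crawler $node) {
    return $node->attr('src');
});
$hrefs = $links->each(function (Crawler $node) {
    return $node->attr('href');
});

print_r($sources);
print_r($hrefs);
echo count($nothing), PHP_EOL;
```

Because the return type never changes, the same chain of calls works whether the selector matched zero, one, or many nodes.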


Modify HTML Elements


Modify HTML Elements (PHP Simple HTML DOM Parser)

Modify HTML Elements (FriendsOfPHP Goutte)

FriendsOfPHP Goutte actually recommends against modifying the DOM with their software.

While possible, the DomCrawler component is not designed for manipulation of the DOM or re-dumping HTML/XML.

Therefore, we will not try to modify the DOM, but this is how you would fetch the title as above with Goutte.
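A minimal sketch of fetching the title might look like the following. It builds the Crawler from a hypothetical HTML string rather than a live request, which is an assumption made so the example runs offline.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical page; with Goutte the Crawler would come from
// $client->request('GET', $url).
$html = '<html><head><title>httpbin.org</title></head><body></body></html>';

$crawler = new Crawler($html);

// filter() narrows to the <title> node, text() reads its content.
$title = $crawler->filter('title')->text();

echo $title, PHP_EOL;
```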


Extract contents from HTML

Extract contents from HTML (PHP Simple HTML DOM Parser)

Extract contents from HTML (FriendsOfPHP Goutte)

Result of each test:
[Screenshot: testing Goutte with httpbin]
Pretty straightforward stuff here. As we can see, the FriendsOfPHP Goutte versions are typically a little more modern and elegant in their syntax, thanks to the each() method, which makes it easy to iterate over every element with an anonymous function.


How to find HTML elements

The bread and butter of these libraries is their ability to fetch elements from the DOM using standard CSS selectors. Here we test almost all of the CSS selectors available, except for the ones that only make sense in the context of an actual web browser. If you don’t see a particular selector in this table, it does not work in either library. In testing all of these selectors, we found that Goutte has a larger and more feature-rich set of CSS selection options. You can use this table as a reference list of CSS selectors that work with Goutte and Simple HTML DOM.

CSS selector testing on the filter() method of 1. Goutte and the find() method of 2. Simple HTML DOM Parser

| Selector format | Example | Example description | 1 | 2 |
| .class | .bash | Selects all elements with class="bash" | Yes | Yes |
| #id | #manpage | Selects the element with id="manpage" | Yes | Yes |
| * | * | Selects all elements | Yes | Yes |
| element | li | Selects all <li> elements | Yes | Yes |
| element, element | a, h1 | Selects all <a> elements and all <h1> elements | Yes | Yes |
| element element | li a | Selects all <a> elements inside <li> elements | Yes | Yes |
| element > element | p > a | Selects all <a> elements where the parent is a <p> element | Yes | Yes |
| element + element | div + h1 | Selects all <h1> elements that are placed immediately after <div> elements | Yes | Yes |
| element1 ~ element2 | p ~ h2 | Selects every <h2> element that is preceded by a <p> element | Yes | No |
| [attribute] | [href] | Selects all elements with an href attribute | Yes | Yes |
| [attribute=value] | [data-bare-link=true] | Selects all elements with data-bare-link="true" | Yes | Yes |
| [attribute~=value] | [alt~=Fork] | Selects all elements with an alt attribute containing the word “Fork” | Yes | No |
| [attribute\|=value] | [id\|=\-curl] | Selects all elements whose id attribute value is exactly “-curl” or begins with “-curl-” | Yes | No |
| [attribute^=value] | a[href^="https"] | Selects every <a> element whose href attribute value begins with “https” | Yes | Yes |
| [attribute$=value] | a[href$=".org"] | Selects every <a> element whose href attribute value ends with “.org” | Yes | Yes |
| [attribute*=value] | a[href*="bin"] | Selects every <a> element whose href attribute value contains the substring “bin” | Yes | Yes |
| :checked | input:checked | Selects every checked <input> element | Yes | No |
| :disabled | input:disabled | Selects every disabled <input> element | Yes | No |
| :empty | div:empty | Selects every <div> element that has no children (including text nodes) | Yes | No |
| :enabled | input:enabled | Selects every enabled <input> element (that is, one without the disabled attribute) | Yes | No |
| :first-child | li:first-child | Selects every <li> element that is the first child of its parent | Yes | No |
| :first-of-type | p:first-of-type | Selects every <p> element that is the first <p> element of its parent | Yes | No |
| :lang(language) | p:lang(en) | Selects every <p> element with a lang attribute equal to “en” | Yes | No |
| :last-child | li:last-child | Selects every <li> element that is the last child of its parent | Yes | No |
| :last-of-type | li:last-of-type | Selects every <li> element that is the last <li> element of its parent | Yes | No |
| :not(selector) | :not(div) | Selects every element that is not a <div> element | Yes | No |
| :nth-child(n) | span:nth-child(2) | Selects every <span> element that is the second child of its parent | Yes | No |
| :nth-last-child(n) | span:nth-last-child(2) | Selects every <span> element that is the second child of its parent, counting from the last child | Yes | No |
| :nth-last-of-type(n) | span:nth-last-of-type(2) | Selects every <span> element that is the second <span> element of its parent, counting from the last child | Yes | No |
| :nth-of-type(n) | span:nth-of-type(1) | Selects every <span> element that is the first <span> element of its parent | Yes | No |
| :only-child | span:only-child | Selects every <span> element that is the only child of its parent | Yes | No |
| :root | :root | Selects the document’s root element | Yes | No |
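To try a few of these selectors yourself, a sketch along the following lines may help. It runs filter() against a small hypothetical fragment (not the original test markup) using the Symfony DomCrawler component, and picks out several selectors that the table marks as working in Goutte.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical fragment, invented for this illustration.
$html = <<<'HTML'
<div id="wrap">
  <p>Intro</p>
  <h2>Heading</h2>
  <ul><li>first</li><li>second</li><li>third</li></ul>
  <a href="https://example.org/a">secure</a>
  <a href="http://example.com/b">plain</a>
</div>
HTML;

$crawler = new Crawler($html);

// General sibling combinator: <h2> preceded by a <p>.
$afterP = $crawler->filter('p ~ h2')->text();

// Structural pseudo-classes.
$secondLi = $crawler->filter('li:nth-child(2)')->text();
$lastLi   = $crawler->filter('li:last-child')->text();

// Attribute prefix match.
$httpsHref = $crawler->filter('a[href^="https"]')->attr('href');

echo $afterP, PHP_EOL;
echo $secondLi, PHP_EOL;
echo $lastLi, PHP_EOL;
echo $httpsHref, PHP_EOL;
```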

You’ll notice that the testing above references two target URLs. One is https://httpbin.org, a site dedicated to offering this type of testing playground. The other is a file on a local server consisting of custom HTML markup for test purposes. If you would like to run the tests in your own environment as well, here is the markup for http://localhost/guzzle/domtesting.php.

All of the testing we have done so far focuses on using CSS selectors to query the in-memory DOM for the elements we’re looking for. Once you have an element or elements, however, what can you do with them? Oftentimes, you just want the data they contain in the form of text content. For example, you fetch a title tag, but what you really want is the information that the title tag holds. For this you typically apply a method to the DOM element(s) you have captured. In the case of Goutte, this means adding ->text() to the captured element, as in all of the examples so far. In Simple HTML DOM Parser, by contrast, this is done with a property called plaintext. You can do even more with Goutte too, such as logging in to a website, fetching links from reddit, and more.


Other Symfony Crawler Methods

We know that when we pass in a simple selector to filter() we may find many elements in the DOM of that type. In that case, we iterate over them with the each() method like we’ve seen many times.

We can be more specific by using other methods. Let’s see how.


eq()

Here we make use of the eq() method to find an element at a specific position from within the node list. Notice we no longer need to iterate over the collection using each().
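As a rough sketch of eq(), assuming a hypothetical list of fruit as the markup (positions are zero-based):

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical list used for illustration.
$html = '<ul><li>Apples</li><li>Oranges</li><li>Bananas</li>'
      . '<li>Grapes</li><li>Cherries</li></ul>';

$crawler = new Crawler($html);

// eq(2) picks the node at zero-based position 2; no each() needed.
$third = $crawler->filter('li')->eq(2)->text();

echo $third, PHP_EOL;
```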


slice()

In this example, we try out the slice() method before iterating over the results. We pass the integers 3 and 2 to the method: 3 means we start at offset 3, and 2 means we capture 2 items from that starting point.
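A sketch of that call, again using a hypothetical fruit list in place of the original markup:

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical list used for illustration.
$html = '<ul><li>Apples</li><li>Oranges</li><li>Bananas</li>'
      . '<li>Grapes</li><li>Cherries</li><li>Kiwis</li></ul>';

$crawler = new Crawler($html);

// slice(3, 2): start at zero-based offset 3 and keep 2 nodes.
$picked = $crawler->filter('li')->slice(3, 2)->each(function (Crawler $node) {
    return $node->text();
});

print_r($picked);
```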


reduce()

This example shows using the reduce() method to only return strings that are greater than 2 characters in length.
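A sketch of that filter follows, with hypothetical list items. The closure passed to reduce() receives each node, and returning false drops the node from the resulting Crawler.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical list used for illustration.
$html = '<ul><li>Go</li><li>PHP</li><li>C</li><li>Rust</li></ul>';

$crawler = new Crawler($html);

// Keep only the nodes whose text is longer than 2 characters.
$long = $crawler->filter('li')->reduce(function (Crawler $node) {
    return strlen($node->text()) > 2;
});

$texts = $long->each(function (Crawler $node) {
    return $node->text();
});

print_r($texts);
```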


first()
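The first() method narrows a node list to its first node. A minimal sketch, assuming a hypothetical fruit list:

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical list used for illustration.
$html = '<ul><li>Apples</li><li>Oranges</li><li>Bananas</li></ul>';

$crawler = new Crawler($html);

// first() returns a Crawler holding only the first matched node.
$firstItem = $crawler->filter('li')->first()->text();

echo $firstItem, PHP_EOL;
```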


last()
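The last() method does the same for the final node in the list. A minimal sketch, assuming a hypothetical fruit list:

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical list used for illustration.
$html = '<ul><li>Apples</li><li>Oranges</li><li>Bananas</li></ul>';

$crawler = new Crawler($html);

// last() returns a Crawler holding only the last matched node.
$lastItem = $crawler->filter('li')->last()->text();

echo $lastItem, PHP_EOL;
```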


siblings()

This example is pretty cool. First, we reach in and grab the li at position 2 (this is Bananas). Then, we call siblings(), which gives us only the siblings of that element. Finally, we use each() to iterate over the result.
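A sketch of those three steps, assuming a hypothetical list in which Bananas sits at zero-based position 2:

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical list used for illustration.
$html = '<ul><li>Apples</li><li>Oranges</li><li>Bananas</li>'
      . '<li>Grapes</li><li>Cherries</li></ul>';

$crawler = new Crawler($html);

// Grab Bananas, then ask for every other element under the same parent.
$others = $crawler->filter('li')->eq(2)->siblings()->each(function (Crawler $node) {
    return $node->text();
});

print_r($others);
```

The element you started from is left out of the result; only its siblings come back.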


attr()

This is one of those bread and butter methods to make use of. In this example, we find the one form on the page, then retrieve the value of the action attribute. We have been using attr() throughout this tutorial, so it should be fairly second nature by now.
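A minimal sketch of reading the action attribute, using a hypothetical form rather than the page from the original test:

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical page with a single form.
$html = '<html><body>'
      . '<form method="post" action="/post"><input name="q"></form>'
      . '</body></html>';

$crawler = new Crawler($html);

// attr() reads the named attribute from the first matched node.
$action = $crawler->filter('form')->attr('action');

echo $action, PHP_EOL;
```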


nodeName()

Suppose you are filtering by a class name like so, but you need to find what tag this class is assigned to. For this you can use the nodeName() method.
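For instance, assuming a hypothetical element carrying the class bash:

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup; the class name is borrowed from the selector table.
$html = '<div><pre class="bash">ls -la</pre></div>';

$crawler = new Crawler($html);

// nodeName() reports the tag name of the first matched node.
$tag = $crawler->filter('.bash')->nodeName();

echo $tag, PHP_EOL;
```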


html()

If you would like to access all of the HTML inside of a given element, you can use the html() method.
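A short sketch, assuming a hypothetical container with nested markup:

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup used for illustration.
$html = '<div id="manpage"><p>Hello <b>world</b></p></div>';

$crawler = new Crawler($html);

// html() returns the inner HTML of the first matched node,
// while text() would return only the text content.
$inner = $crawler->filter('#manpage')->html();
$plain = $crawler->filter('#manpage')->text();

echo $inner, PHP_EOL;
echo $plain, PHP_EOL;
```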


selectLink()

By making use of the selectLink() method, you can actually navigate to the content you wish to retrieve. In this example, we visit reddit, click on the ‘top’ link, and find the number one post at the current time.
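The following sketch shows only the link-selection half of that flow, offline. The markup and the https://example.com/ base URI are hypothetical stand-ins for a real page; with a Goutte client you would pass the resulting link to $client->click() to perform the request.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical navigation markup and base URI.
$html = '<div><a href="/hot/">hot</a> <a href="/top/">top</a></div>';

// The second argument tells the Crawler which URI the markup came
// from, so relative hrefs can be resolved.
$crawler = new Crawler($html, 'https://example.com/');

// selectLink() finds links by their text; link() wraps the first
// match in a Link object.
$link = $crawler->selectLink('top')->link();
$uri  = $link->getUri();

echo $uri, PHP_EOL;

// With Goutte: $crawler = $client->click($link); // follows the link
```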

PHP Simple HTML DOM Parser vs FriendsOfPHP Goutte Summary

This was a fun little tutorial that looked at how to use both the Simple HTML DOM Parser and Goutte. Although the Simple HTML DOM Parser was a pretty neat tool at the time of its release, it is becoming very dated, and development has ceased. Goutte, on the other hand, is made up of beautiful Symfony components which are actively maintained and developed. In addition, Goutte is much more feature-rich and fun to use.