In this tutorial, we’ll examine how the PHP Simple HTML DOM Parser compares to FriendsOfPHP Goutte. In the early days, the PHP Simple HTML DOM Parser was all we had to work with for extracting data from HTML. Now that we have FriendsOfPHP Goutte, there is a more feature-rich way of doing this type of work.

Before you get going, you’ll need to set up PHP Simple HTML DOM Parser, FriendsOfPHP Goutte, and the Guzzle PHP HTTP client. This is easy to do thanks to Composer. Create a directory on your computer called guzzle, cd into that directory, and place this composer.json in it.
    {
        "require": {
            "guzzlehttp/guzzle": "~6.0",
            "emanueleminotto/simple-html-dom": "^1.5",
            "fabpot/goutte": "^3.1"
        }
    }
Run composer install (or composer update) from the command line, and everything will get set up for you. Now you can place an index.php file in this directory and test out any of the code we discuss in this tutorial. Your boilerplate in index.php will look something like this. Make sure not to forget to require the autoload file that Composer creates for you.
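    <?php
    require 'vendor/autoload.php';

    use Goutte\Client;

    $client = new Client();
    $guzzleclient = new \GuzzleHttp\Client([
        'timeout' => 60,
        'verify' => false,
    ]);
    // Hackery to allow HTTPS
    $client->setClient($guzzleclient);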
We set this up so you won’t get any errors like “cURL error 60: SSL certificate problem: unable to get local issuer certificate”. If you do not set up your client like above, you may get these errors!
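Setting 'verify' => false makes the error go away, but it also switches off certificate checking for every request the client makes. As a possible alternative, Guzzle's verify option also accepts a path to a CA bundle file. The sketch below assumes you have saved such a bundle next to index.php; the makeClient() helper and the cacert.pem file name are hypothetical and only there for illustration.

```php
<?php
require 'vendor/autoload.php';

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

/**
 * Hypothetical helper: build a Goutte client whose Guzzle transport
 * checks HTTPS certificates against the given CA bundle file.
 */
function makeClient($caBundlePath)
{
    $client = new Client();
    $client->setClient(new GuzzleClient([
        'timeout' => 60,
        'verify'  => $caBundlePath, // path to a CA bundle instead of false
    ]));

    return $client;
}

// Assumed location of the bundle; adjust to wherever you saved it.
$client = makeClient(__DIR__ . '/cacert.pem');
```

If the bundle path is wrong, requests to HTTPS URLs should fail with a certificate error again, so treat this as a starting point rather than a drop-in fix.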
Get HTML Elements
Get HTML Elements (PHP Simple HTML DOM Parser)
When you start with PHP Simple HTML DOM Parser, the documentation will have you doing something like this. In this case, you load some HTML into a variable, then print the value of the src attribute of every image tag and the value of the href attribute of every link on the page.
    // Create DOM from URL or file
    $html = file_get_html('https://www.facebook.com');

    // Find all images
    foreach ($html->find('img') as $element) {
        echo $element->src . '<br>';
    }

    // Find all links
    foreach ($html->find('a') as $element) {
        echo $element->href . '<br>';
    }
Get HTML Elements (FriendsOfPHP Goutte)
    // Make a GET request (Create DOM from URL or file)
    $crawler = $client->request('GET', 'https://www.facebook.com');

    // Filter the DOM by calling an anonymous function on each node (Find all images)
    $crawler->filter('img')->each(function ($node) {
        echo $node->attr('src') . '<br>';
    });

    // (Find all links)
    $crawler->filter('a')->each(function ($node) {
        echo $node->attr('href') . '<br>';
    });
As we see above, the PHP Simple HTML DOM Parser library makes use of a find() method, whereas in FriendsOfPHP Goutte you will typically be making use of a filter() method to find elements in the DOM. Here are the function signatures for both of these methods.
| Method | Description |
|---|---|
| find(string $selector [, int $index]) | Find elements by the CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects. Return type may vary. |
| filter(string $selector) | Filters the list of nodes with a CSS selector. Always returns a Crawler instance. |
Note: What the find() method returns depends on the parameters you pass to it: an array of elements when no index is given, or a single element when an index is given. This can sometimes lead to confusion. On the other hand, the filter() method always returns a Symfony Crawler instance.
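To make the difference in return types concrete, here is a small offline sketch. It assumes the Composer setup from the start of this tutorial, and that str_get_html() is available in the same way file_get_html() is in the examples above. The inline markup and variable names are made up for illustration.

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Inline sample markup so the comparison runs without a network request.
$markup = '<ul><li>Apples</li><li>Oranges</li></ul>';

// PHP Simple HTML DOM Parser: the return type depends on the arguments.
$html  = str_get_html($markup);
$all   = $html->find('li');     // no index: array of element objects
$first = $html->find('li', 0);  // index given: one element object
$none  = $html->find('li', 5);  // index out of range: null

// Goutte (Symfony DomCrawler): filter() returns a Crawler either way.
$crawler   = new Crawler($markup);
$items     = $crawler->filter('li');  // Crawler wrapping two nodes
$firstItem = $items->eq(0);           // still a Crawler, wrapping one node

echo count($all) . ' / ' . $first->plaintext . '<br>';
echo $items->count() . ' / ' . $firstItem->text() . '<br>';
```

Because find() with an index can return null, it is worth checking the result before reading a property from it. With filter(), you can call count() on the returned Crawler to see whether anything matched.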
Modify HTML Elements
Modify HTML Elements (PHP Simple HTML DOM Parser)
    $html = file_get_html('https://httpbin.org');

    foreach ($html->find('title') as $element) {
        echo $element->plaintext; // httpbin(1): HTTP Client Testing Service
    }

    $html->find('title', 0)->innertext = 'Made with PHP Simple HTML DOM Parser!';

    foreach ($html->find('title') as $element) {
        echo $element->plaintext; // Made with PHP Simple HTML DOM Parser!
    }
Modify HTML Elements (FriendsOfPHP Goutte)
FriendsOfPHP Goutte actually recommends not to modify the DOM with their software.
While possible, the DomCrawler component is not designed for manipulation of the DOM or re-dumping HTML/XML.
Therefore, we will not try to modify the DOM, but this is how you would fetch the title as above with Goutte.
    $crawler = $client->request('GET', 'https://httpbin.org');

    $crawler->filter('title')->each(function ($node) {
        echo $node->text() . '<br>'; // httpbin(1): HTTP Client Testing Service
    });
Extract contents from HTML
Extract contents from HTML (PHP Simple HTML DOM Parser)
    $html = file_get_html('https://httpbin.org');

    foreach ($html->find('li') as $li) {
        echo $li->plaintext . '<br>';
    }
Extract contents from HTML (FriendsOfPHP Goutte)
    $crawler = $client->request('GET', 'https://httpbin.org');

    $crawler->filter('li')->each(function ($node) {
        echo $node->text() . '<br>';
    });
Result of each test.
Pretty straightforward stuff here. As we can see, the FriendsOfPHP Goutte versions are typically a little more modern and elegant in their syntax, thanks to the each() method, which makes it really easy to iterate over every element with an anonymous function.
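One detail worth knowing about each(): besides running the closure for every node, it collects whatever the closure returns into an array. The sketch below assumes the Composer setup from earlier and uses made-up inline markup instead of a live request.

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Made-up markup so the example runs offline.
$markup  = '<ul><li>Apples</li><li>Oranges</li><li>Bananas</li></ul>';
$crawler = new Crawler($markup);

// each() hands the closure every matched node (wrapped in a Crawler)
// plus its zero-based position, and gathers the return values in an array.
$fruits = $crawler->filter('li')->each(function (Crawler $node, $i) {
    return ($i + 1) . '. ' . $node->text();
});

echo implode('<br>', $fruits);
```

Returning values from the closure and echoing them afterwards keeps the extraction step separate from the output step, which can be handy once you want to store the results instead of printing them.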
How to find HTML elements
Really the bread and butter of how these libraries work is via their ability to fetch elements from the DOM using standard CSS Selectors. Here we test almost all of the CSS selectors available, except for the ones that only make sense in the context of an actual web browser. If you don’t see a particular selector in this table, it means it does not work in either library. In testing all of these selectors, we found that Goutte has a larger and more feature rich set of CSS selection options. You can make use of this reference list of CSS selectors that work with Goutte and Simple HTML DOM.
CSS selector support in Goutte and Simple HTML DOM

| Selector Format | Example | Example description | Goutte | Simple HTML DOM |
|---|---|---|---|---|
| .class | .bash | Selects all elements with class="bash" | Yes | Yes |
| #id | #manpage | Selects the element with id="manpage" | Yes | Yes |
| * | * | Selects all elements | Yes | Yes |
| element | li | Selects all <li> elements | Yes | Yes |
| element, element | a, h1 | Selects all <a> elements and all <h1> elements | Yes | Yes |
| element element | li a | Selects all <a> elements inside <li> elements | Yes | Yes |
| element > element | p > a | Selects all <a> elements where the parent is a <p> element | Yes | Yes |
| element + element | div + h1 | Selects all <h1> elements that are placed immediately after <div> elements | Yes | Yes |
| element1 ~ element2 | p ~ h2 | Selects every <h2> element that is preceded by a <p> element | Yes | No |
| [attribute] | [href] | Selects all elements with an href attribute | Yes | Yes |
| [attribute=value] | [data-bare-link=true] | Selects all elements with data-bare-link="true" | Yes | Yes |
| [attribute~=value] | [alt~=Fork] | Selects all elements with an alt attribute containing the word "Fork" | Yes | No |
| [attribute\|=value] | [id\|=\-curl] | Selects all elements with an id attribute value starting with "-curl" | Yes | No |
| [attribute^=value] | a[href^="https"] | Selects every <a> element whose href attribute value begins with "https" | Yes | Yes |
| [attribute$=value] | a[href$=".org"] | Selects every <a> element whose href attribute value ends with ".org" | Yes | Yes |
| [attribute*=value] | a[href*="bin"] | Selects every <a> element whose href attribute value contains the substring "bin" | Yes | Yes |
| :checked | input:checked | Selects every checked <input> element | Yes | No |
| :disabled | input:disabled | Selects every disabled <input> element | Yes | No |
| :empty | div:empty | Selects every <div> element that has no children (including text nodes) | Yes | No |
| :enabled | input:enabled | Selects every enabled <input> element (that is, one without the disabled attribute) | Yes | No |
| :first-child | li:first-child | Selects every <li> element that is the first child of its parent | Yes | No |
| :first-of-type | p:first-of-type | Selects every <p> element that is the first <p> element of its parent | Yes | No |
| :lang(language) | p:lang(en) | Selects every <p> element with a lang attribute equal to "en" | Yes | No |
| :last-child | li:last-child | Selects every <li> element that is the last child of its parent | Yes | No |
| :last-of-type | li:last-of-type | Selects every <li> element that is the last <li> element of its parent | Yes | No |
| :not(selector) | :not(div) | Selects every element that is not a <div> element | Yes | No |
| :nth-child(n) | span:nth-child(2) | Selects every <span> element that is the second child of its parent | Yes | No |
| :nth-last-child(n) | span:nth-last-child(2) | Selects every <span> element that is the second child of its parent, counting from the last child | Yes | No |
| :nth-last-of-type(n) | span:nth-last-of-type(2) | Selects every <span> element that is the second <span> element of its parent, counting from the last child | Yes | No |
| :nth-of-type(n) | span:nth-of-type(1) | Selects every <span> element that is the first <span> element of its parent | Yes | No |
| :only-child | span:only-child | Selects every <span> element that is the only child of its parent | Yes | No |
| :root | :root | Selects the document's root element | Yes | No |
    //---------------------------------------------------------
    // .bash selector test
    // Goutte
    $crawler = $client->request('GET', 'https://httpbin.org');
    $crawler->filter('.bash')->each(function ($node) {
        echo $node->text() . '<br>';
    });

    // Simple HTML Dom
    $html = file_get_html('https://httpbin.org');
    foreach ($html->find('.bash') as $node) {
        echo $node->plaintext . '<br>';
    }

    //---------------------------------------------------------
    // #manpage selector test
    // Goutte
    $crawler = $client->request('GET', 'https://httpbin.org');
    $crawler->filter('#manpage')->each(function ($node) {
        echo $node->text() . '<br>';
    });

    // Simple HTML Dom
    $html = file_get_html('https://httpbin.org');
    foreach ($html->find('#manpage') as $node) {
        echo $node->plaintext . '<br>';
    }

    //---------------------------------------------------------
    // * all elements selector test
    // Goutte
    $crawler = $client->request('GET', 'https://httpbin.org');
    $crawler->filter('*')->each(function ($node) {
        echo $node->text() . '<br>';
    });

    // Simple HTML Dom
    $html = file_get_html('https://httpbin.org');
    foreach ($html->find('*') as $node) {
        echo $node->plaintext . '<br>';
    }

    //---------------------------------------------------------
    // li selector test
    // Goutte
    $crawler = $client->request('GET', 'https://httpbin.org');
    $crawler->filter('li')->each(function ($node) {
        echo $node->text() . '<br>';
    });

    // Simple HTML Dom
    $html = file_get_html('https://httpbin.org');
    foreach ($html->find('li') as $node) {
        echo $node->plaintext . '<br>';
    }

    //---------------------------------------------------------
    // a,h1 selector test
    // Goutte
    $crawler = $client->request('GET', 'https://httpbin.org');
    $crawler->filter('a,h1')->each(function ($node) {
        echo $node->text() . '<br>';
    });

    // Simple HTML Dom
    $html = file_get_html('https://httpbin.org');
    foreach ($html->find('a,h1') as $node) {
        echo $node->plaintext . '<br>';
    }

    //---------------------------------------------------------
    // li a selector test
    // Goutte
    $crawler = $client->request('GET', 'https://httpbin.org');
    $crawler->filter('li a')->each(function ($node) {
        echo $node->text() . '<br>';
    });

    // Simple HTML Dom
    $html = file_get_html('https://httpbin.org');
    foreach ($html->find('li a') as $node) {
        echo $node->plaintext . '<br>';
    }

    //---------------------------------------------------------
    // p > a selector test
    // Goutte
    $crawler = $client->request('GET', 'https://httpbin.org');
    $crawler->filter('p > a')->each(function ($node) {
        echo $node->text() . '<br>';
    });

    // Simple HTML Dom
    $html = file_get_html('https://httpbin.org');
    foreach ($html->find('p > a') as $node) {
        echo $node->plaintext . '<br>';
    }

    //---------------------------------------------------------
    // div + h1 selector test
    // Goutte
    // note: this test uses chained method calls (->filter('div')->filter('h1')).
    // Be aware that chaining matches <h1> elements found inside <div> elements,
    // which is not the same thing as the adjacent-sibling selector 'div + h1'.
    $crawler = $client->request('GET', 'https://httpbin.org');
    $crawler->filter('div')->filter('h1')->each(function ($node) {
        echo $node->text() . '<br>';
    });

    // Simple HTML Dom
    $html = file_get_html('https://httpbin.org');
    foreach ($html->find('div + h1') as $node) {
        echo $node->plaintext . '<br>';
    }

    //---------------------------------------------------------
    // p ~ h2 selector test
    // Goutte
    $crawler = $client->request('GET', 'https://httpbin.org');
    $crawler->filter('p ~ h2')->each(function ($node) {
        echo $node->text() . '<br>';
    });

    // Simple HTML Dom
    $html = file_get_html('https://httpbin.org');
    foreach ($html->find('p ~ h2') as $node) {
        echo $node->plaintext . '<br>';
    }

    //---------------------------------------------------------
    // [data-bare-link=true] selector test
    // Goutte
    $crawler = $client->request('GET', 'https://httpbin.org');
    $crawler->filter('[data-bare-link=true]')->each(function ($node) {
        echo $node->text() . '<br>';
    });

    // Simple HTML Dom
    $html = file_get_html('https://httpbin.org');
    foreach ($html->find('[data-bare-link=true]') as $node) {
        echo $node->plaintext . '<br>';
    }

    //---------------------------------------------------------
    // [alt~=Fork] selector test
    // Goutte
    $crawler = $client->request('GET', 'https://httpbin.org');
    $crawler->filter('[alt~=Fork]')->each(function ($node) {
        echo $node->attr('alt') . '<br>';
    });

    // Simple HTML Dom
    $html = file_get_html('https://httpbin.org');
    foreach ($html->find('[alt~=Fork]') as $node) {
        echo $node->alt . '<br>';
    }

    //---------------------------------------------------------
    // [id|=\-curl] selector test
    // Goutte
    $crawler = $client->request('GET', 'https://httpbin.org');
    $crawler->filter('[id|=\-curl]')->each(function ($node) {
        echo $node->text() . '<br>';
    });

    // Simple HTML Dom
    $html = file_get_html('https://httpbin.org');
    foreach ($html->find('[id|=\-curl]') as $node) {
        echo $node->plaintext . '<br>';
    }

    //---------------------------------------------------------
    // a[href^="https"] selector test
    // Goutte
    $crawler = $client->request('GET', 'https://httpbin.org');
    $crawler->filter('a[href^="https"]')->each(function ($node) {
        echo $node->text() . '<br>';
    });

    // Simple HTML Dom
    $html = file_get_html('https://httpbin.org');
    foreach ($html->find('a[href^="https"]') as $node) {
        echo $node->plaintext . '<br>';
    }

    //---------------------------------------------------------
    // a[href$=".org"] selector test
    // Goutte
    $crawler = $client->request('GET', 'https://httpbin.org');
    $crawler->filter('a[href$=".org"]')->each(function ($node) {
        echo $node->text() . '<br>';
    });

    // Simple HTML Dom
    $html = file_get_html('https://httpbin.org');
    foreach ($html->find('a[href$=".org"]') as $node) {
        echo $node->plaintext . '<br>';
    }

    //---------------------------------------------------------
    // input:checked selector test
    // Goutte
    $crawler = $client->request('GET', 'http://localhost/guzzle/domtesting.php');
    $crawler->filter('input:checked')->each(function ($node) {
        echo $node->attr('value') . '<br>';
        // Condo
    });

    //---------------------------------------------------------
    // input:disabled selector test
    // Goutte
    $crawler = $client->request('GET', 'http://localhost/guzzle/domtesting.php');
    $crawler->filter('input:disabled')->each(function ($node) {
        echo $node->attr('name') . '<br>';
        // job
    });

    //---------------------------------------------------------
    // div:empty selector test
    // Goutte
    $crawler = $client->request('GET', 'http://localhost/guzzle/domtesting.php');
    $crawler->filter('div:empty')->each(function ($node) {
        echo $node->attr('id') . '<br>';
        // notextbud
    });

    //---------------------------------------------------------
    // input:enabled selector test
    // Goutte
    $crawler = $client->request('GET', 'http://localhost/guzzle/domtesting.php');
    $crawler->filter('input:enabled')->each(function ($node) {
        echo $node->attr('name') . '<br>';
        // shelter
        // shelter
        // name
        //
        // state
        // username
    });

    //---------------------------------------------------------
    // li:first-child selector test
    // Goutte
    $crawler = $client->request('GET', 'http://localhost/guzzle/domtesting.php');
    $crawler->filter('li:first-child')->each(function ($node) {
        echo $node->text() . '<br>';
        // Apples
        // 1
        // one
    });

    //---------------------------------------------------------
    // p:first-of-type selector test
    // Goutte
    $crawler = $client->request('GET', 'http://localhost/guzzle/domtesting.php');
    $crawler->filter('p:first-of-type')->each(function ($node) {
        echo $node->text() . '<br>';
        // This first paragraph has text.
        // Do you speak English?
        // Yum!
    });

    //---------------------------------------------------------
    // p:lang(en) selector test
    // Goutte
    $crawler = $client->request('GET', 'http://localhost/guzzle/domtesting.php');
    $crawler->filter('p:lang(en)')->each(function ($node) {
        echo $node->text() . '<br>';
        // Do you speak English?
    });

    //---------------------------------------------------------
    // li:last-child selector test
    // Goutte
    $crawler = $client->request('GET', 'http://localhost/guzzle/domtesting.php');
    $crawler->filter('li:last-child')->each(function ($node) {
        echo $node->text() . '<br>';
        // Blueberries
        // 4
        // four
    });

    //---------------------------------------------------------
    // li:last-of-type selector test
    // Goutte
    $crawler = $client->request('GET', 'http://localhost/guzzle/domtesting.php');
    $crawler->filter('li:last-of-type')->each(function ($node) {
        echo $node->text() . '<br>';
        // Blueberries
        // 4
        // four
    });

    //---------------------------------------------------------
    // :not(div) selector test
    // Goutte
    $crawler = $client->request('GET', 'http://localhost/guzzle/domtesting.php');
    $crawler->filter(':not(div)')->each(function ($node) {
        echo $node->attr('type') . '<br>';
        // This works, but returns too much data to put in a comment!
    });

    //---------------------------------------------------------
    // span:nth-child(2) selector test
    // Goutte
    $crawler = $client->request('GET', 'http://localhost/guzzle/domtesting.php');
    $crawler->filter('span:nth-child(2)')->each(function ($node) {
        echo $node->text() . '<br>';
        // Lego Dimensions
    });

    //---------------------------------------------------------
    // span:nth-last-child(2) selector test
    // Goutte
    $crawler = $client->request('GET', 'http://localhost/guzzle/domtesting.php');
    $crawler->filter('span:nth-last-child(2)')->each(function ($node) {
        echo $node->text() . '<br>';
        // Minecraft
    });

    //---------------------------------------------------------
    // span:nth-of-type(1) selector test
    // Goutte
    $crawler = $client->request('GET', 'http://localhost/guzzle/domtesting.php');
    $crawler->filter('span:nth-of-type(1)')->each(function ($node) {
        echo $node->text() . '<br>';
        // Star Wars
        // Ha Ha!
        // Contrived Markup!
    });

    //---------------------------------------------------------
    // span:only-child selector test
    // Goutte
    $crawler = $client->request('GET', 'http://localhost/guzzle/domtesting.php');
    $crawler->filter('span:only-child')->each(function ($node) {
        echo $node->text() . '<br>';
        // Contrived Markup!
    });

    //---------------------------------------------------------
    // :root selector test
    // Goutte
    $crawler = $client->request('GET', 'http://localhost/guzzle/domtesting.php');
    $crawler->filter(':root')->each(function ($node) {
        echo $node->text() . '<br>';
        // Works but too much output to comment!
    });
    //---------------------------------------------------------
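The div + h1 test chains two filter() calls on the Goutte side. A small offline check suggests that chaining and the adjacent-sibling selector are not interchangeable. This is a sketch under assumptions: it relies on the Composer setup from the start of the tutorial, and the markup and variable names are invented for the comparison.

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Invented markup: one <h1> directly after a <div>, one nested inside a <div>.
$markup = '<div>Intro</div><h1>Adjacent heading</h1>'
        . '<div><h1>Nested heading</h1></div>';

$crawler = new Crawler($markup);

// Collect the text of every node a Crawler holds.
$texts = function (Crawler $nodes) {
    return $nodes->each(function (Crawler $node) {
        return $node->text();
    });
};

// Adjacent-sibling combinator: <h1> elements placed immediately after a <div>.
$adjacent = $texts($crawler->filter('div + h1'));

// Chained calls: <h1> elements found inside each matched <div>.
$chained = $texts($crawler->filter('div')->filter('h1'));

echo implode(', ', $adjacent) . '<br>';
echo implode(', ', $chained) . '<br>';
```

With this markup, the first call picks up only the heading that follows a div, while the chained version picks up only the heading nested inside one. If a selector appears not to work against a live page, it is worth checking it against a small known fragment like this first.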
Now, you’ll notice that the tests above reference two target URLs. One is https://httpbin.org, a site dedicated to offering this type of testing playground. The other is a file on a local server consisting of custom HTML markup for test purposes. If you would like to run the tests in your own environment as well, here is the markup for http://localhost/guzzle/domtesting.php.
<html>
<head>
<meta charset="utf-8">
<title>HTML DOM Testing</title>
</head>
<body>
<a href="#">This is link one</a> <a class="simplehtmldom" href="#">This is link two</a> <a title="A title!" href="#">This link has a title!</a>
<div>Hello Div One.</div>
<div id="hello">This div has an id of hello.</div>
<div id="friendsofphp">This div has an id of foo.</div>
<div class="simplehtmldom">This div has a class of simplehtmldom</div>
<img src="http://placehold.it/350x150"> <img title="placeholder" src="http://placehold.it/350x150">
<div id="levelone">
<div id="leveltwo">
<div id="levelthree">This div is nested three levels deep.</div>
</div>
</div>
<ul>
<li>Apples</li>
<li>Oranges</li>
<li>Bananas</li>
<li>Pineapples</li>
<li>Blueberries</li>
</ul>
<table width="100%" border="0">
<tr>
<th>Make</th>
<th>Model</th>
</tr>
<tr>
<td>Tesla</td>
<td>Roadster</td>
</tr>
<tr>
<td class="motorcycle">Alta Motors</td>
<td class="motorcycle">Redshift MX</td>
</tr>
</table>
<form action="domtesting.php">
<input type="checkbox" name="shelter" value="House">
I have a House<br>
<input type="checkbox" name="shelter" value="Condo" checked>
I have a Condo<br>
Name:
<input type="text" name="name">
<br>
Job:
<input type="text" name="job" disabled>
<br>
<input type="number" min="4" max="9" value="5">
State:
<input type="text" name="state" value="Massachusetts" readonly>
<br>
Username:
<input type="text" name="username" required>
<br>
<input type="submit" value="Submit">
</form>
<div>
<p></p>
<p>Text in a paragraph</p>
</div>
<div>
<p>This first paragraph has text.</p>
<p>Other text in a paragraph</p>
</div>
<div id="notextbud"></div>
<ul>
<li>1</li>
<li>2</li>
<li>3</li>
<li>4</li>
</ul>
<ul>
<li>one</li>
<li>two</li>
<li>three</li>
<li>four</li>
</ul>
<p lang="en">Do you speak English?</p>
<div> <span>Star Wars</span> <span>Lego Dimensions</span> <span>Minecraft</span> <span>Samsung</span> </div>
<div>
<p>Yum!</p>
<p>Yo!</p>
<span>Ha Ha!</span> </div>
<div> <span>Contrived Markup!</span> </div>
<table width="100%" border="0" cellpadding="3" cellspacing="3" class="table table-bordered">
<tbody>
<tr>
<td width="50%">mixed find(string $selector [, int $index])</td>
<td width="50%">Find elements by the CSS selector. Returns the Nth element <strong>object</strong> if <strong>index</strong> is set, otherwise return an <strong>array</strong> of object. </td>
</tr>
<tr>
<td width="50%">public Crawler filter(string $selector) </td>
<td width="50%">Filters the list of nodes with a CSS selector. </td>
</tr>
</tbody>
</table>
<code>