Scrape Screens with zend-dom

Even in this day and age of readily available APIs and RSS/Atom feeds, many
sites offer neither. How do you get at the data in those cases? Through the
ancient internet art of screen scraping.

The problem then becomes: how do you get at the data you need in a pile of HTML
soup? You could use regular expressions or any of the various string functions
in PHP. All of these are easily subject to error, though, and often require some
convoluted code to get at the data of interest.

Alternatively, you could treat the HTML as XML and use the DOM extension, which
is typically built into PHP. Doing so, however, requires more than a passing
familiarity with XPath, which is something of a black art.
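
To give a sense of what that looks like, here is a minimal sketch using only
PHP's built-in DOM extension. The HTML fragment and its class names are made up
for illustration, and the XPath expression is a hand-written equivalent of the
CSS selector ul.nav li a.

```php
<?php
// Hypothetical HTML fragment; the markup and class names are illustrative only.
$html = <<<'HTML'
<html><body>
  <ul class="nav main">
    <li><a href="/about">About</a></li>
    <li><a href="/contact">Contact</a></li>
  </ul>
  <p><a href="/elsewhere">Not in the nav</a></p>
</body></html>
HTML;

$document = new DOMDocument();
// Real-world HTML is rarely valid XML, so collect parser warnings quietly.
libxml_use_internal_errors(true);
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);

// Matching one class name among several means padding and normalizing
// the class attribute by hand.
$links = $xpath->query(
    "//ul[contains(concat(' ', normalize-space(@class), ' '), ' nav ')]//li//a"
);

foreach ($links as $link) {
    printf("%s => %s\n", $link->getAttribute('href'), $link->textContent);
}
```

The query works, but the class-matching idiom alone is longer than the entire
CSS selector it replaces.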

If you use JavaScript libraries or write CSS fairly often, you may be familiar
with CSS selectors, which allow you to target either specific nodes or groups of
nodes within an HTML document. These are generally rather intuitive:

jQuery('section.slide h2').each(function (index, node) {
  alert(node.textContent);
});

What if you could do that with PHP?

Introducing zend-dom

zend-dom provides CSS selector capabilities for PHP via the Zend\Dom\Query
class, with support for:

  • element types (h2, span, etc.)
  • class attributes (.error, .next, etc.)
  • element identifiers (#nav, #main, etc.)
  • arbitrary element attributes (div[onclick="foo"]), including word matches
    (div[role~="navigation"]) and substring matches (div[role*="complement"])
  • descendants (div .foo span)

While it does not implement the full spectrum of CSS selectors, it does provide
enough to generally allow you to get at the information you need within a page.
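
A short sketch of those selector types in action follows. It assumes zend-dom
was installed with Composer (composer require zendframework/zend-dom) and that
the script runs from the project root so that vendor/autoload.php resolves. The
HTML fragment is invented purely to exercise each selector.

```php
<?php
// Assumes zend-dom was installed with Composer; adjust the path if needed.
require 'vendor/autoload.php';

use Zend\Dom\Query;

// Hypothetical markup, written only to exercise each selector type.
$html = <<<'HTML'
<html><body>
  <div id="main" role="main navigation">
    <h2>Title</h2>
    <p class="error">Something went wrong</p>
    <div class="foo"><span>Nested text</span></div>
  </div>
  <div role="complementary"><span class="next">Next</span></div>
</body></html>
HTML;

$query = new Query($html);

$selectors = [
    'h2',                      // element type
    '.error',                  // class attribute
    '#main',                   // element identifier
    'div[role~="navigation"]', // word match within an attribute
    'div[role*="complement"]', // substring match within an attribute
    'div .foo span',           // descendants
];

foreach ($selectors as $selector) {
    printf("%-26s matched %d node(s)\n", $selector, count($query->execute($selector)));
}
```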

Example: retrieving a navigation list

As an example, let’s fetch the navigation list from the Zend\Dom\Query
documentation page itself:

use Zend\Dom\Query;

$html = file_get_contents('https://docs.zendframework.com/zend-dom/query/');
$query = new Query($html);
$results = $query->execute('ul.bs-sidenav li a');

printf("Received %d results:\n", count($results));
foreach ($results as $result) {
    printf("- [%s](%s)\n", $result->getAttribute('href'), $result->textContent);
}

The above queries for ul.bs-sidenav li a — in other words, all links
within list items of the sidenav unordered list.

When you execute() a query, you get back a Zend\Dom\NodeList instance, which
decorates a DOMNodeList in order to provide features such as Countable, as well
as access to the original query and document. In the example above, we count()
the results, and then loop over them.

Each item in the list is a DOMNode, giving you
access to any attributes, the text content, and any child elements. In our
case, we access the href attribute (the link target), and report the text
content (the link text).

The results are:

Received 3 results:
- [#querying-html-and-xml-documents](Querying HTML and XML Documents)
- [#theory-of-operation](Theory of Operation)
- [#methods-available](Methods Available)
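
Returning to the NodeList described above, the sketch below inspects one a
little more closely. It assumes zend-dom is installed with Composer and uses a
made-up HTML fragment rather than a live page. getCssQuery(), getXpathQuery(),
and getDocument() are the NodeList accessors for the original query and
document.

```php
<?php
// Assumes zend-dom was installed with Composer; adjust the path if needed.
require 'vendor/autoload.php';

use Zend\Dom\Query;

// Hypothetical markup standing in for a fetched page.
$html = '<html><body><ul class="menu">'
    . '<li><a href="#one">One</a></li>'
    . '<li><a href="#two">Two</a></li>'
    . '</ul></body></html>';

$results = (new Query($html))->execute('ul.menu li a');

// NodeList is Countable and iterable, like the DOMNodeList it wraps.
printf("%d links found\n", count($results));

// The selector that was passed in, and the XPath it was translated to.
printf("CSS:   %s\n", $results->getCssQuery());
printf("XPath: %s\n", $results->getXpathQuery());

// The underlying DOMDocument is available as well.
$document = $results->getDocument();
printf("Document class: %s\n", get_class($document));

foreach ($results as $node) {
    // Each item is a DOMNode (here, a DOMElement for the anchor tag).
    printf("%s -> %s\n", $node->getAttribute('href'), $node->textContent);
}
```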

Other uses

Another use case is for testing. When you have classes that return HTML, or if
you want to execute requests and test the generated output, you often don’t want
to test exact contents, but rather look for specific data or fragments within
the document.

We provide these capabilities for zend-mvc
applications via the zend-test component,
which provides a number of CSS selector assertions
for use in querying the content returned in your MVC responses. Having these
capabilities allows testing for dynamic content as well as static content,
providing a number of vectors for ensuring application quality.
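
Outside of zend-test, the same idea can be sketched with zend-dom directly. The
example below checks a rendered fragment for specific nodes instead of comparing
whole strings. renderGreeting() and check() are hypothetical helpers written for
this illustration, and zend-dom is again assumed to be installed with Composer.

```php
<?php
// Assumes zend-dom was installed with Composer; adjust the path if needed.
require 'vendor/autoload.php';

use Zend\Dom\Query;

// Hypothetical function under test; it stands in for any code that returns HTML.
function renderGreeting($name)
{
    return sprintf(
        '<html><body><div class="greeting"><h1>Hello, %s!</h1>'
        . '<p class="date">%s</p></div></body></html>',
        htmlspecialchars($name),
        date('Y-m-d')
    );
}

// Hypothetical helper that fails loudly when a check does not hold.
function check($condition, $message)
{
    if (! $condition) {
        throw new RuntimeException('Failed: ' . $message);
    }
    echo 'OK: ', $message, "\n";
}

$query = new Query(renderGreeting('World'));

$headings = $query->execute('div.greeting h1');
check(count($headings) === 1, 'exactly one greeting heading');

foreach ($headings as $heading) {
    check($heading->textContent === 'Hello, World!', 'heading greets the given name');
}

// The date changes daily, so only assert that the node exists.
check(count($query->execute('p.date')) === 1, 'a date paragraph is present');
check(count($query->execute('.error')) === 0, 'no error markup is rendered');
```

Because the date paragraph changes from day to day, an exact string comparison
of the output would be brittle, while the selector-based checks keep passing.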

Start scraping!

While this post was rather brief, we hope you can appreciate the powerful
capabilities of this component! We have used this functionality in a variety of
ways, from testing applications to creating feeds based on content differences
in web pages, to finding and retrieving image URIs from pages.

Get more information from the zend-dom documentation.

Source: Zend feed