PHP Screen Scraping Tutorial



Screen scraping is a useful skill for any PHP developer. It means fetching the source of a web page into a string and then parsing out the parts you want to use. A simple application would be building a database of all the NFL teams, complete with player details.

What the heck, let’s do it. The first step is to get the page HTML into a PHP variable. This is easy when the page is publicly accessible via a URL, with no login or form post required: file_get_contents can fetch it directly (as long as allow_url_fopen is enabled in php.ini). For more complex scraping you can use cURL to get the HTML source of the page, but the rest of the process stays about the same. Anyway, let’s scrape the site.

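$url = "http://www.nfl.com/teams/sandiegochargers/roster?team=SD";
 
// file_get_contents() returns false on failure, so stop early if the fetch did not work.
$raw = file_get_contents($url);
if ($raw === false) die("Could not fetch {$url}");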
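
For pages that need more than a plain GET, the cURL route mentioned above might look roughly like this. It is a minimal sketch, not the only way to do it: the function name fetch_html, the 10 second timeout, and the user agent string are my own placeholders.

```php
<?php
// Hypothetical helper: fetch a page with cURL and return its HTML, or null on failure.
function fetch_html(string $url, int $timeout = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow redirects
        CURLOPT_TIMEOUT        => $timeout,        // give up after this many seconds
        CURLOPT_USERAGENT      => 'MyScraper/1.0', // placeholder value
    ]);
    $html = curl_exec($ch);

    return $html === false ? null : $html;
}
```

You would then call $raw = fetch_html($url); in place of file_get_contents. For a login or form post, cURL also has options such as CURLOPT_POST, CURLOPT_POSTFIELDS and CURLOPT_COOKIEJAR.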

I have found that pattern matching is easiest when the HTML has no newlines. Here is how I remove them, along with tabs, double spaces, null bytes and vertical tabs, from the raw HTML before I start parsing out the data I want to scrape.

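// Characters to strip: tab, newline, carriage return, double space, null byte, vertical tab.
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
 
// Decode HTML entities, then remove the whitespace so the markup becomes one long line.
$content = str_replace($newlines, "", html_entity_decode($raw));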
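
To see what this step does, here is the same replacement run on a tiny made-up fragment (the markup and the player name are invented for illustration, not taken from the real page):

```php
<?php
// Made-up fragment standing in for the fetched page.
$raw = "<tr>\n\t<td>17</td>\r\n  <td>Sample Player</td>\n</tr>";

// Same list as in the tutorial: tab, newline, carriage return, double space, null byte, vertical tab.
$strip = array("\t", "\n", "\r", "\x20\x20", "\0", "\x0B");
$content = str_replace($strip, "", html_entity_decode($raw));

echo $content; // <tr><td>17</td><td>Sample Player</td></tr>
```

Note that only pairs of spaces are removed, so the single space inside the name survives.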

Now that you have the source code of the page as a string variable, you need to parse out the results. This is where each scraping application will differ: depending on the page structure and which elements you want to retrieve, you will have to alter the pattern matching. If you view the source, you can see that the roster data is in a table with the class name “standard_table”, and that this class name is unique to the page. So the next step is to get the start and end string positions for this table, and then extract just the table from the content:

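// The opening tag is unique on this page, so strpos() finds the roster table directly.
$start = strpos($content,'<table cellpadding="2" class="standard_table"');
if ($start === false) die("Roster table not found");
 
// Add 8, the length of '</table>', so the closing tag is included.
$end = strpos($content,'</table>',$start) + 8;
 
$table = substr($content,$start,$end-$start);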
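
The same start/end extraction can be wrapped in a small reusable function. This is only a sketch: extract_block is a name I made up, and the sample markup below is invented to show that an earlier table on the page is skipped.

```php
<?php
// Hypothetical helper: return the first block that starts with $open and ends with $close,
// or null when either marker is missing.
function extract_block(string $content, string $open, string $close): ?string
{
    $start = strpos($content, $open);
    if ($start === false) {
        return null;
    }
    $end = strpos($content, $close, $start);
    if ($end === false) {
        return null;
    }
    $end += strlen($close); // include the closing tag itself

    return substr($content, $start, $end - $start);
}

// Made-up page content for illustration.
$content = '<div><table class="nav"><tr><td>Home</td></tr></table>'
         . '<table cellpadding="2" class="standard_table"><tr><td>17</td></tr></table></div>';

$table = extract_block($content, '<table cellpadding="2" class="standard_table"', '</table>');
echo $table; // <table cellpadding="2" class="standard_table"><tr><td>17</td></tr></table>
```

Using strlen($close) instead of a hard-coded 8 means the helper keeps working if you change the closing marker.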

Now we have just the table containing the roster data, and we need to parse out the rows and cells. The easiest way to do this is with preg_match_all; the U modifier makes the match ungreedy, so each pattern stops at the first closing tag instead of the last. If this code is not clear, you can print_r the $rows and $cells arrays and die() inside the loop to see what they contain.

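// Grab every <tr>...</tr> block; U makes the match ungreedy.
preg_match_all("|<tr(.*)</tr>|U",$table,$rows);

foreach ($rows[0] as $row){

	// Skip the header row, which uses <th> cells.
	if ((strpos($row,'<th')===false)){

		preg_match_all("|<td(.*)</td>|U",$row,$cells);

		// Skip rows that do not have all three cells we read below.
		if (count($cells[0]) < 3) continue;

		$number = strip_tags($cells[0][0]);
		$name = strip_tags($cells[0][1]);
		$position = strip_tags($cells[0][2]);

		echo "{$position} - {$name} - Number {$number} <br>\n";

	}

}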
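
If you would rather collect the players in an array than echo them, the same loop can be sketched as a function. The name parse_roster and the sample table are my own inventions; the real page may have more columns than the three read here.

```php
<?php
// Hypothetical helper: turn a roster table (newlines already stripped) into an array of players.
function parse_roster(string $table): array
{
    $players = array();
    preg_match_all("|<tr(.*)</tr>|U", $table, $rows);

    foreach ($rows[0] as $row) {
        if (strpos($row, '<th') !== false) {
            continue; // header row
        }
        preg_match_all("|<td(.*)</td>|U", $row, $cells);
        if (count($cells[0]) < 3) {
            continue; // not a full player row
        }
        $players[] = array(
            'number'   => trim(strip_tags($cells[0][0])),
            'name'     => trim(strip_tags($cells[0][1])),
            'position' => trim(strip_tags($cells[0][2])),
        );
    }

    return $players;
}

// Made-up table for illustration.
$table = '<table class="standard_table">'
       . '<tr><th>No</th><th>Name</th><th>Pos</th></tr>'
       . '<tr><td>17</td><td><a href="#">Sample Player</a></td><td>QB</td></tr>'
       . '<tr><td>85</td><td><a href="#">Another Player</a></td><td>TE</td></tr>'
       . '</table>';

print_r(parse_roster($table)); // two players; the <th> row is skipped
```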

So now we have parsed all the data for a given team from the official NFL site. To do all the teams, wrap this code in a loop over each team’s roster URL. As a final step, write all the data to a database table and voilà, you have a database of all team rosters for the NFL.
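
The loop and database step could be sketched like this. Everything here is an assumption on my part: the players table layout, the save_roster function, the 'XYZ' team code and the sample rows are all hypothetical, and I use an in-memory SQLite database through PDO only so the example runs on its own (it needs the pdo_sqlite extension). In a real run, each roster would come from fetching and parsing that team's page, and you would have to look up the roster URL for each team yourself.

```php
<?php
// Hypothetical helper: store one team's players in a "players" table.
function save_roster(PDO $db, string $team, array $players): int
{
    $stmt = $db->prepare(
        'INSERT INTO players (team, number, name, position) VALUES (?, ?, ?, ?)'
    );
    $saved = 0;
    foreach ($players as $p) {
        $stmt->execute(array($team, $p['number'], $p['name'], $p['position']));
        $saved++;
    }

    return $saved;
}

// In-memory SQLite keeps the sketch self-contained; swap the DSN for your own database.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE players (team TEXT, number TEXT, name TEXT, position TEXT)');

// Made-up data standing in for the scraped results of each team page.
$rosters = array(
    'SD'  => array(array('number' => '17', 'name' => 'Sample Player', 'position' => 'QB')),
    'XYZ' => array(array('number' => '85', 'name' => 'Another Player', 'position' => 'TE')),
);

foreach ($rosters as $team => $players) {
    save_roster($db, $team, $players);
}

echo $db->query('SELECT COUNT(*) FROM players')->fetchColumn(); // 2
```

The prepared statement is reused for every row, which also keeps scraped text from being interpreted as SQL.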

This simple scraping example is just meant to illustrate the basic concept. Keep in mind that if the source structure of the page you scrape changes, you will need to adjust your pattern matching. During development, scrape the page once, save the result in a file, and read that file into your code for testing; this minimizes the hits to the live server.

My personal opinion is that anything publicly accessible on the internet should be fair game for scraping. What is the difference from copying and pasting it? That is basically what you are doing, just programmatically. You can definitely get into trouble if you misuse data that you scraped, though; you could violate copyrights, for example. Please scrape responsibly :)
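
Going back to the save-once advice: one way to do it is a small wrapper that only hits the live server when no saved copy exists. This is a rough sketch, and fetch_cached and the file name in the usage note are names I made up.

```php
<?php
// Hypothetical helper: fetch $url once, save it to $cacheFile, and reuse the saved copy afterwards.
function fetch_cached(string $url, string $cacheFile): string
{
    if (is_file($cacheFile)) {
        return file_get_contents($cacheFile); // no request to the live server
    }

    $raw = file_get_contents($url);
    if ($raw === false) {
        throw new RuntimeException("Could not fetch {$url}");
    }
    file_put_contents($cacheFile, $raw);

    return $raw;
}
```

During development you could call $raw = fetch_cached($url, __DIR__ . '/roster.html'); and delete the file whenever you want a fresh copy.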

