Category Archives: Screen Scraping

PHP Screen Scraping Class

After some positive feedback I have decided to continue to develop the PHP Screen Scraping class. This post will serve as the permanent home for the class.

Download PHP Screen Scraping Class


2009-07-30 Added setHeader() function

Screen Scraping Twitter

I got an email today asking for help to scrape Twitter, in particular to be able to log in. So I am going to show everyone how. This is NOT to encourage anyone to violate Twitter’s terms of use; it is an educational blog post about how PHP and cURL can be used to post variables and store cookies.

Again, I am using the cScrape class I wrote, which you can download.

Step 1
First, go to the Twitter login page and look at the source code of the login form to get the form field names and the form post location. You will see where the form posts to, and that the username and password fields are session[username_or_email] and session[password], respectively.

Step 2
Now you are ready to log in. Using the fetch function in the Scrape class, you create an associative array to contain the form values you want to post. The other thing you will need to do is uncomment the lines for CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR. Cookies will be required to stay logged in and scrape around, and the paths to the cookie files need to be writable by your app. You will also need to uncomment the line for CURLOPT_FOLLOWLOCATION.

$data = array('session[username_or_email]' => "bradino", 'session[password]' => "secret");
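
For reference, the lines to uncomment inside the class’s cURL setup look roughly like this (where $ch is the class’s cURL handle; the cookie path is just an example, point it at a file your app can write):

curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/scrape_cookies.txt'); // send cookies stored here with each request
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/scrape_cookies.txt');  // save any cookies the site sets back to the same file
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);                  // follow the redirect Twitter issues after login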

Step 1.5
Oops, that didn’t work. All I got back was 403 Forbidden: the server understood the request, but is refusing to fulfill it. Ahhh, I see another variable called authenticity_token; I bet Twitter was looking for that. So let’s back up, first hit the login page to get the authenticity_token variable, and then make the login post request with that variable included in our array of parameters.

$data = array('session[username_or_email]' => "bradino", 'session[password]' => "secret");
$data['authenticity_token'] = $scrape->fetchBetween('name="authenticity_token" type="hidden" value="','"',$scrape->result);
echo $scrape->result;
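
Putting the whole login flow together, here is a rough sketch. I am assuming the class’s fetch() method takes a URL plus an optional array of post fields (a GET without the array, a POST with it) and stores the response in $scrape->result, and I am using placeholder variables for the login page URL and the form’s post target from Step 1; check cScrape.php if your copy behaves differently.

$scrape = new Scrape();
$scrape->fetch($loginPageUrl); // placeholder: the Twitter login page URL from Step 1; this also starts the cookie jar
$data = array('session[username_or_email]' => "bradino", 'session[password]' => "secret");
$data['authenticity_token'] = $scrape->fetchBetween('name="authenticity_token" type="hidden" value="','"',$scrape->result);
$scrape->fetch($loginFormAction, $data); // placeholder: the URL the login form posts to, from Step 1
echo $scrape->result; // should now show the logged-in page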

So that’s basically it. Now you are logged in and can scrape around and request other pages as you normally would. Sorry it wasn’t a longer post. I really do enjoy this kind of stuff so if anyone has a request, hit me up.

If it doesn’t work for you, here are a few things to check:
1) Make sure that you are properly parsing the token variable
2) Make sure that you uncommented the lines for CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR; those options need to be enabled, and the path set must be writable by your application
3) Make sure that the path to the cookie file is writable and that data is actually getting written to it
4) If you get a message about being redirected, you need to uncomment the line for CURLOPT_FOLLOWLOCATION and set that option to true

PHP Screen Scraping

I had a request recently for help with scraping a little content from a company directory page :)

So here we go. This time I made a quick PHP class with some basic functions to grab the source of the page, as well as fetchBetween, fetchAfter, fetchAll, etc. You can get the latest version of the class from the download link – be sure to rename it to cScrape.php. If there is interest, I can continue to develop this class as a tool for screen scraping with PHP.

Anyway, here we go: scraping all the companies from this page, as well as the details of each company, found by clicking through to each company’s detail page.

Step 1 – Initialize the class and fetch the page:

include ('cScrape.php');
$scrape = new Scrape();
$url = '';
$scrape->fetch($url); // fetch the page source into $scrape->result
$data = $scrape->removeNewlines($scrape->result);

Step 2 – find your anchor and get the chunk of html that contains what you want

$data = $scrape->fetchBetween('<table width="490" border="0" cellpadding="3"','</table>',$data,true);
$rows = $scrape->fetchAllBetween('<TR','</tr>',$data,true);
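
If things do not look right at this point, a quick sanity check saves a lot of head scratching before moving on to Step 3:

echo count($rows) . " rows found\n"; // should roughly match the number of companies listed on the page
print_r($rows[0]); // eyeball the raw HTML of the first row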

Step 3 – parse out the individual values and print out the first record for demo

foreach ($rows as $id => $row){
	$record = array();
	$cells = $scrape->fetchAllBetween('<td','</td>',$row,true);
	$record['company'] = strip_tags($cells[1]);
	$url = '' . $scrape->fetchBetween('<a href="','">',$cells[1],false);
	$url = str_replace(' ','%20',$url);
	$scrape->fetch($url); // fetch the company detail page into $scrape->result
	$data2 = $scrape->removeNewlines($scrape->result);
	$data2 = $scrape->fetchBetween('<div id="tabText">','</div>',$data2,true);
	$data2 = $scrape->fetchAfter('</table>',$data2,false);
	$details = explode('<br />',$data2);
	$record['address'] = $details[0];
	$location = explode(',',$details[1]);
	$record['city'] = trim($location[0]);
	$location = explode(' ',trim($location[1]));
	$record['state'] = trim($location[0]);
	$record['zip'] = trim($location[1]);
	for($i=2; $i<=5; $i++){
		$detail = trim($details[$i]);
		if(substr($detail,0,6)=='Phone:') $record['phone'] = str_replace('Phone: ','',$detail);
		else if(substr($detail,0,4)=='Fax:') $record['fax'] = str_replace('Fax: ','',$detail);
		else if(substr($detail,0,4)=='Web:') $record['web'] = strip_tags(str_replace('Web: ','',$detail));
		else if(substr($detail,0,6)=='Email:') $record['email'] = strip_tags(str_replace('Email: ','',$detail));
	}
	print_r($record); // print out the first record for the demo
	break;
}

PHP Screen Scraping Tutorial

UPDATE: New Screen Scraping Post

Screen Scraping is a great skill that every PHP developer should have experience with. Basically it involves scraping the source code of a web page, getting it into a string, and then parsing out the parts that you want to use. A simple application of screen scraping could be to build a database of all the NFL teams complete with player details.

What the heck, let’s do it… The first step is to get the page HTML into a PHP variable. This is super easy if the page is publicly accessible via a URL – no login or form post required to access it… For more complex scraping you can use cURL to get the html source of the page, but the rest of the process would be about the same. Anyway, let’s scrape the site.

$url = "";
$raw = file_get_contents($url);
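
As mentioned above, cURL works just as well for this step (and is what you would reach for once you need cookies or custom headers). A minimal equivalent looks like this:

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the HTML as a string instead of echoing it
$raw = curl_exec($ch);
curl_close($ch);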

The easiest way to do the pattern matching, I have found, is with the newlines removed. Here is how I remove them from the raw html before I start parsing out the data I want to scrape.

$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B"); // tabs, newlines, double spaces, nulls, vertical tabs
$content = str_replace($newlines, "", html_entity_decode($raw));

So now that you have the source code of the page as a string variable, you need to parse out the results. This is where each scraping application will differ. Depending on the page structure and what elements you want to retrieve, you will have to alter the regular expression matching. You can view the source and see that the roster data you want is in a nice table with class name “standard_table”. I also notice that this class name is unique to the page. So the next step is to get the start and end string positions for this table, and then extract just the table from the content:

$start = strpos($content,'<table cellpadding="2" class="standard_table"');
$end = strpos($content,'</table>',$start) + 8; // + 8 to include the length of '</table>'
$table = substr($content,$start,$end-$start);

Now we have just the table containing the roster data, and we need to parse out the rows and cells. The easiest way to do this is with preg_match_all. If this code is not clear, you can print_r and die() in the loop to see what the rows and cells arrays contain.

preg_match_all('/<tr.*?<\/tr>/i', $table, $rows); // grab each table row
foreach ($rows[0] as $row){
	if ((strpos($row,'<th')===false)){ // skip the header row
		preg_match_all('/<td.*?<\/td>/i', $row, $cells); // grab each cell in the row
		$number = strip_tags($cells[0][0]);
		$name = strip_tags($cells[0][1]);
		$position = strip_tags($cells[0][2]);
		echo "{$position} - {$name} - Number {$number} <br>\n";
	}
}

So now we have parsed all the data for a given team from the official NFL site. To do all the teams, wrap this in a loop over each team’s roster page and, as a final step, write all the data to a database table. Voila, you have a database of all team rosters for the NFL.
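
As a sketch of that final step, assuming a hypothetical rosters table and your own PDO connection details, the database write from inside the parsing loop might look something like this:

$pdo = new PDO('mysql:host=localhost;dbname=nfl', 'dbuser', 'dbpass'); // hypothetical credentials, swap in your own
$stmt = $pdo->prepare('INSERT INTO rosters (team, number, name, position) VALUES (?, ?, ?, ?)');
// ...then, inside the foreach loop above, once $number, $name and $position are parsed:
$stmt->execute(array($team, $number, $name, $position)); // $team would come from your outer per-team loop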

This simple scraping example is just to illustrate the basic concept. Keep in mind that if the source structure of the page you want to scrape changes, you will need to adjust your pattern matching. You should always scrape the page once and save the result to a file, then read that file into your code during development and testing to minimize the hits to the live server. My personal opinion is that anything publicly accessible via the internet should be able to be scraped. What is the difference from copying and pasting it? Basically that is what you are doing, just programmatically. You can definitely get into trouble if you misuse data that you scraped, and you could violate copyrights or whatever. Please scrape responsibly :)
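
One way to do the “scrape once, develop against a local copy” tip from the paragraph above is a simple file cache (the cache file name here is just an example):

$cacheFile = 'scraped_page.html'; // local copy used while developing the parsing code
if (!file_exists($cacheFile)) {
	file_put_contents($cacheFile, file_get_contents($url)); // hit the live server only once
}
$raw = file_get_contents($cacheFile);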