PHP Screen Scraping

I had a request recently for help with scraping a little content from http://www.newyork411.com :)

So here we go. This time I made a quick PHP class with some basic functions to grab the source of the page, as well as fetchBetween, fetchAfter, fetchAll, etc. You can get the latest version of the class at http://www.bradino.com/downloads/cScrape.txt - be sure to rename it to cScrape.php. If there is interest, I can continue to develop this class as a tool for screen scraping with PHP.

In this example we scrape all the companies listed on http://www.newyork411.com/Ad_Agencies_Production_Companies/category-cid-50553.htm, as well as the details of each company, which are found on the page you reach by clicking the company name.

Step 1 - Initialize the class and fetch the page:

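PHP:
  // Load the scraper class (downloaded and renamed as described above)
  include('cScrape.php');

  $scrape = new Scrape();

  $url = 'http://www.newyork411.com/Ad_Agencies_Production_Companies/category-cid-50553.htm';

  // Download the listing page; the raw HTML ends up in $scrape->result
  $scrape->fetch($url);

  // Flatten the HTML so later searches are not broken by line breaks
  $data = $scrape->removeNewlines($scrape->result);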
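
The class source isn't reproduced in this post, so here is a rough idea of what fetch() and removeNewlines() could be doing under the hood. The names sketchFetch and sketchRemoveNewlines are made up for illustration and are not part of cScrape; the real class may well work differently (it might use cURL, or strip newlines instead of replacing them with a space).

```php
<?php
// Hypothetical stand-ins for $scrape->fetch() and $scrape->removeNewlines().
// These are assumptions for illustration, not the code in cScrape.php.

// Read a page (or local file) and return its contents, or '' on failure.
function sketchFetch(string $url): string
{
    // file_get_contents() can read URLs when allow_url_fopen is enabled.
    $html = @file_get_contents($url);
    return $html === false ? '' : $html;
}

// Replace every line break with a single space (an assumed choice) so that
// later string searches can match markup that was split across lines.
function sketchRemoveNewlines(string $html): string
{
    return str_replace(array("\r\n", "\r", "\n"), ' ', $html);
}
```

With helpers like these, Step 1 boils down to calling sketchFetch($url) and passing the result through sketchRemoveNewlines().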

Step 2 - Find your anchor and get the chunk of HTML that contains what you want:

PHP:
  // Cut the page down to the table that holds the company listing
  $data = $scrape->fetchBetween('<table width="490" border="0" cellpadding="3"','</table>',$data,true);

  // Split that table into its rows
  $rows = $scrape->fetchAllBetween('<TR','</tr>',$data,true);
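
To make the anchor idea concrete, here is a minimal sketch of how fetchBetween and fetchAllBetween might be written as plain functions. These are not the methods from cScrape.php: the sketch* names are hypothetical, the case-insensitive matching is my own choice, and I'm assuming the last argument means "keep the start and end markers in the result", which is only inferred from how the calls above are used.

```php
<?php
// Hypothetical re-implementations; the real methods in cScrape.php may differ.
// $include is ASSUMED to mean "keep the start and end markers in the result".

// Return the first chunk of $haystack found between $start and $end.
function sketchFetchBetween(string $start, string $end, string $haystack, bool $include): string
{
    $from = stripos($haystack, $start);
    if ($from === false) {
        return '';
    }
    $to = stripos($haystack, $end, $from + strlen($start));
    if ($to === false) {
        return '';
    }
    if ($include) {
        return substr($haystack, $from, $to + strlen($end) - $from);
    }
    $inner = $from + strlen($start);
    return substr($haystack, $inner, $to - $inner);
}

// Return every non-overlapping chunk found between $start and $end.
function sketchFetchAllBetween(string $start, string $end, string $haystack, bool $include): array
{
    $matches = array();
    $offset = 0;
    while (($from = stripos($haystack, $start, $offset)) !== false) {
        $inner = $from + strlen($start);
        $to = stripos($haystack, $end, $inner);
        if ($to === false) {
            break; // unterminated chunk: stop rather than guess
        }
        $matches[] = $include
            ? substr($haystack, $from, $to + strlen($end) - $from)
            : substr($haystack, $inner, $to - $inner);
        $offset = $to + strlen($end);
    }
    return $matches;
}

// Made-up sample markup, just to show the calls.
$sampleHtml = '<table><TR><td>A</td></tr><TR><td>B</td></tr></table>';
$sampleRows = sketchFetchAllBetween('<TR', '</tr>', $sampleHtml, true);
```

Plain string searching like this is quick to write but brittle: if the site changes the table attributes used as the anchor, the match silently comes back empty.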

Step 3 - Parse out the individual values and print the first record as a demo:

PHP:
  foreach ($rows as $id => $row){

      $record = array();

      // Split the row into cells; the second cell holds the company link
      $cells = $scrape->fetchAllBetween('<td','</td>',$row,true);

      $record['company'] = strip_tags($cells[1]);

      // Build the absolute URL of the company's detail page
      $url = 'http://www.newyork411.com' . $scrape->fetchBetween('<a href="','">',$cells[1],false);

      // Encode spaces so the URL can be requested
      $url = str_replace(' ','%20',$url);

      $scrape->fetch($url);

      $data2 = $scrape->removeNewlines($scrape->result);

      // The details sit inside the tabText div, after an inner table
      $data2 = $scrape->fetchBetween('<div id="tabText">','</div>',$data2,true);

      $data2 = $scrape->fetchAfter('</table>',$data2,false);

      // One detail per <br />-separated line
      $details = explode('<br />',$data2);

      $record['address'] = $details[0];

      // Second line has the form "City, ST 12345"
      $location = explode(',',$details[1]);

      $record['city'] = trim($location[0]);

      $location = explode(' ',trim($location[1]));

      $record['state'] = trim($location[0]);

      $record['zip'] = trim($location[1]);

      // The next few lines are labelled fields, in no guaranteed order
      for($i=2; $i<=5; $i++){

          // Not every company lists all four fields
          if(!isset($details[$i])) continue;

          $detail = trim($details[$i]);

          if(substr($detail,0,6)=='Phone:') $record['phone'] = str_replace('Phone: ','',$detail);

          else if(substr($detail,0,4)=='Fax:') $record['fax'] = str_replace('Fax: ','',$detail);

          else if(substr($detail,0,4)=='Web:') $record['web'] = strip_tags(str_replace('Web: ','',$detail));

          else if(substr($detail,0,6)=='Email:') $record['email'] = strip_tags(str_replace('Email: ','',$detail));

      }

      print_r($record);

      // Stop after the first record; remove this to process every company
      die();
  }
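
If you want to try the parsing logic without hitting the site, the detail-splitting part can be pulled into a small function and run against a sample string. This is only a sketch: sketchParseDetails is a hypothetical name, the sample address and phone number are invented, and it differs slightly from the loop above (it strips tags from every field and scans all remaining lines rather than a fixed range).

```php
<?php
// Hypothetical helper; the sample data below is invented and the real
// detail pages may be laid out differently.
function sketchParseDetails(array $details): array
{
    $record = array('address' => trim($details[0] ?? ''));

    // Second line is assumed to look like "City, ST 12345"
    $location = explode(',', $details[1] ?? '', 2);
    $record['city'] = trim($location[0]);
    $stateZip = preg_split('/\s+/', trim($location[1] ?? ''));
    $record['state'] = $stateZip[0] ?? '';
    $record['zip'] = $stateZip[1] ?? '';

    // Remaining lines are assumed to look like "Label: value"
    $labels = array('Phone' => 'phone', 'Fax' => 'fax', 'Web' => 'web', 'Email' => 'email');
    foreach (array_slice($details, 2) as $line) {
        $line = trim(strip_tags($line));
        foreach ($labels as $label => $key) {
            $prefix = $label . ':';
            if (strpos($line, $prefix) === 0) {
                $record[$key] = trim(substr($line, strlen($prefix)));
            }
        }
    }
    return $record;
}

// Invented sample in the same <br />-separated shape as the scraped block.
$sampleBlock = '123 Example St<br />New York, NY 10001<br />Phone: (212) 555-0100'
    . '<br />Web: <a href="http://www.example.com">www.example.com</a>';
$sampleRecord = sketchParseDetails(explode('<br />', $sampleBlock));
```

Keeping the parsing separate from the fetching like this makes it easier to adjust when the site's layout changes.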
