PHP Screen Scraping
I had a request recently for help with scraping a little content from http://www.newyork411.com :)
So here we go. This time I made a quick PHP class with some basic functions to grab the source of the page, as well as fetchBetween, fetchAfter, fetchAll, etc. You can get the latest version of the class at http://www.bradino.com/downloads/cScrape.txt - be sure to rename it to cScrape.php. If there is interest I can continue to develop this class as a tool for screen scraping with PHP.
Anyway, here we go: we will scrape all the companies from this page http://www.newyork411.com/Ad_Agencies_Production_Companies/category-cid-50553.htm as well as the details of each company, found by clicking on the company.
Step 1 - Initialize the class and fetch the page:
require_once 'cScrape.php'; // the class file downloaded and renamed above
$scrape = new Scrape();
$url = 'http://www.newyork411.com/Ad_Agencies_Production_Companies/category-cid-50553.htm';
$scrape->fetch($url);
$data = $scrape->removeNewlines($scrape->result);
Step 2 - Find your anchor and get the chunk of HTML that contains what you want:
$rows = $scrape->fetchAllBetween('<TR','</tr>',$data,true);
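If you are wondering what a call like fetchAllBetween is doing, here is a rough sketch of the idea. It is not the code from cScrape.php: the function name sketchFetchAllBetween, the case-insensitive matching, and the reading of the last argument as "include the delimiters" are all my assumptions, and the sample HTML is made up.

```php
<?php
// Hypothetical stand-in for fetchAllBetween(); the real class may work differently.
function sketchFetchAllBetween(string $start, string $end, string $html, bool $includeDelimiters = false): array
{
    $matches = [];
    if ($start === '' || $end === '') {
        return $matches; // nothing sensible to search for
    }
    $offset = 0;
    // stripos() is case-insensitive, so '<TR' also matches '<tr'.
    while (($from = stripos($html, $start, $offset)) !== false) {
        $to = stripos($html, $end, $from + strlen($start));
        if ($to === false) {
            break; // unterminated chunk: stop rather than guess
        }
        if ($includeDelimiters) {
            $matches[] = substr($html, $from, $to + strlen($end) - $from);
        } else {
            $innerStart = $from + strlen($start);
            $matches[] = substr($html, $innerStart, $to - $innerStart);
        }
        $offset = $to + strlen($end); // continue after this match
    }
    return $matches;
}

// Made-up sample markup, only to show the shape of the result.
$sampleHtml = '<table><TR><td>Acme</td></tr><tr><td>Bolt</td></tr></table>';
$sampleRows = sketchFetchAllBetween('<TR', '</tr>', $sampleHtml, true);
```

Each element of $sampleRows is one whole table row, which is the kind of chunk the next step picks apart.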
Step 3 - Parse out the individual values and print out the first record for demo:
foreach($rows as $row){
$record = array();
$cells = $scrape->fetchAllBetween('<td','</td>',$row,true);
$record['company'] = strip_tags($cells[1]);
$url = 'http://www.newyork411.com' . $scrape->fetchBetween('<a href="','">',$cells[1],false);
$url = str_replace(' ','%20',$url);
$scrape->fetch($url);
$data2 = $scrape->removeNewlines($scrape->result);
$data2 = $scrape->fetchBetween('<div id="tabText">','</div>',$data2,true);
$data2 = $scrape->fetchAfter('</table>',$data2,false);
$details = explode('<br />',$data2);
$record['address'] = $details[0];
$location = explode(',',$details[1]);
$record['city'] = trim($location[0]);
$location = explode(' ',trim($location[1]));
$record['state'] = trim($location[0]);
$record['zip'] = trim($location[1]);
for($i=2; $i<=5; $i++){
if(!isset($details[$i])) break; // this listing has fewer detail lines
$detail = trim($details[$i]);
if(substr($detail,0,6)=='Phone:') $record['phone'] = str_replace('Phone: ','',$detail);
else if(substr($detail,0,4)=='Fax:') $record['fax'] = str_replace('Fax: ','',$detail);
else if(substr($detail,0,4)=='Web:') $record['web'] = strip_tags(str_replace('Web: ','',$detail));
else if(substr($detail,0,6)=='Email:') $record['email'] = strip_tags(str_replace('Email: ','',$detail));
}
print_r($record);
die();
}
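The parsing in Step 3 assumes every listing has the same layout. As a sketch only, here is a slightly more defensive way to do the same two jobs: splitting a "City, ST ZIP" line and picking up the labeled lines. The helper names sketchParseLocation and sketchParseLabeledLines are hypothetical, and the sample values are invented.

```php
<?php
// Hypothetical helper: splits a line such as "New York, NY 10001".
function sketchParseLocation(string $line): array
{
    $record = ['city' => '', 'state' => '', 'zip' => ''];
    $parts = explode(',', $line, 2);
    $record['city'] = trim($parts[0]);
    if (isset($parts[1])) {
        // Split state and ZIP on any run of whitespace.
        $tokens = preg_split('/\s+/', trim($parts[1]), 2);
        $record['state'] = $tokens[0] ?? '';
        $record['zip'] = $tokens[1] ?? '';
    }
    return $record;
}

// Hypothetical helper: maps "Phone: ...", "Fax: ...", etc. to array keys.
// Unlike the loop above, it strips tags from every line, not just Web and Email.
function sketchParseLabeledLines(array $lines): array
{
    $labels = ['Phone' => 'phone', 'Fax' => 'fax', 'Web' => 'web', 'Email' => 'email'];
    $record = [];
    foreach ($lines as $line) {
        $line = trim(strip_tags($line));
        foreach ($labels as $label => $key) {
            $prefix = $label . ':';
            if (strncmp($line, $prefix, strlen($prefix)) === 0) {
                $record[$key] = trim(substr($line, strlen($prefix)));
                break; // a line carries at most one label
            }
        }
    }
    return $record;
}

// Invented sample data.
$sampleLocation = sketchParseLocation('New York, NY 10001');
$sampleContact = sketchParseLabeledLines([
    'Phone: (212) 555-0100',
    ' Web: <a href="http://example.com">example.com</a>',
    'not a labeled line',
]);
```

Missing pieces come back as empty strings or are simply left out, so a listing with an odd layout does not trigger undefined-index notices.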
You’re currently reading “PHP Screen Scraping,” an entry on BRADINO
- Published: 6.2.08 / 2pm
- Category: PHP, Screen Scraping
Hey, I’m new to PHP and was trying out this tutorial (which is awesome, by the way) and just can’t seem to get it to work. I’ve downloaded the cScrape.php file (and renamed it) and put the other steps into a separate file which I’ve named Scrape.php, but when I upload it, it just gives me a blank page and the page source is blank too! Please help me out as I’m dying to get this working!
Thanks,
Rob!
could you give us both some help.. all i get is …
Parse error: parse error, unexpected T_STRING, expecting T_OLD_FUNCTION or T_FUNCTION or T_VAR or ‘}’ in /home/content/t/e/c/techsavy/html/tgusaw/members/scrape.php on line 10
i would like to use this scrape function
Roman
Rob, if you are getting a blank page, they may have slightly changed the page structure. At the end of step 1 I would echo the $data variable and see what you get.
Roman, what are the contents of your scrape.php file? Looks like you have a problem on line 10. If you email me the files I will check them out.
Interesting. I use file_get_contents, and I like doing what I call a “Scrape and Save”: it works like a kind of cache, so my IP does not always appear in the analytics of the site I am scraping. I save the whole value of file_get_contents in a MySQL column and do all text string operations on the saved copy.
Aside from that, I use ini_set to assign the user_agent a browser string so it looks like a web browser. :)
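A rough sketch of that "Scrape and Save" idea, under a few assumptions: it caches to a file on disk instead of a MySQL column, the function name sketchFetchCached and the user agent string are made up, and the fetcher is passed in as a callable so the caching part can be tried without touching the network.

```php
<?php
// Hypothetical helper: returns the saved copy of $url if there is one,
// otherwise calls $fetcher once and saves the result in $cacheDir.
function sketchFetchCached(string $url, string $cacheDir, callable $fetcher): string
{
    $path = rtrim($cacheDir, '/') . '/' . md5($url) . '.html';
    if (is_file($path)) {
        $saved = file_get_contents($path);
        if ($saved !== false) {
            return $saved; // serve the saved copy, no request made
        }
    }
    $html = $fetcher($url);
    file_put_contents($path, $html);
    return $html;
}

// A fetcher built on file_get_contents() that sends a browser-like user agent.
// The user agent value here is only an example.
$httpFetcher = function (string $url): string {
    ini_set('user_agent', 'Mozilla/5.0 (compatible; ExampleScraper/1.0)');
    $html = file_get_contents($url);
    if ($html === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    return $html;
};

// Usage (commented out so the sketch makes no request when run):
// $html = sketchFetchCached('http://www.example.com/', sys_get_temp_dir(), $httpFetcher);
```

All the string operations can then run against the saved copy, and the remote site only sees the first request for each URL.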
Hi Bradino,
I just wanted to stop back by and tell you I’ve used this on a few of my projects and it is fantastic. I’m currently scraping Twitter (shhhh) and your class makes it super easy. The only thing I have to figure out now is how to use cURL to log in before I do a scrape, which is beyond your class.
Anyway, just wanted to say I appreciate you putting this out there. So much so I tracked your site down again just to write this.
Cheers!
umm.. i wonder why the tags are open in the example?
Hey Brad,
Your tutorial is great. I successfully was able to figure out how to scrape one portion of information from a website, but I am stuck with another portion. I can’t figure out how to accurately scrape its table layout.
I want to scrape the Humidity from this url:
http://www.weather.com/weather/local/95404?lswe=95404&lwsa=WeatherLocalUndeclared&from=searchbox_localwx
But I can’t figure out how to weed out the other data from the table it is in.
Thanks much for your help!
You’ve made screen-scraping incredibly easy, perhaps too easy. :)
I have followed the tutorial, but why can’t I see multiple records? Please help me out.
tq
shakahwat
It could be that they have changed the source code of the page. What site would you like to scrape?
I’m new to PHP. I’m trying to use this source code to scrape item prices from Amazon, but it doesn’t work. Can anyone give me advice on how to do that?
hello
I tried the same source without any changes, but it scrapes only the first record. Can you please tell me what I need to add to scrape all of them?
please share here or mail me
thanks
gull.bird at gmail.com
Just using your script for a project, I wanted to say thank you so much – I owe you big time :)
How does one better target which table they want to pull content from? There are occasions where no unique ID is labelled, and classes overlap. Is there a function where one can query the 1st / 2nd / 3rd instance of a table on the page?
Thanks for your help.
Well you can do a fetchAllBetween where you can grab all the tables and then loop through them to get to the Nth table. I like your idea though, I am going to write a new function to return the Nth match of whatever. Thanks for the feedback!
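Here is a minimal sketch of what such an Nth-match function might look like. The name sketchFetchNthBetween and its signature are purely illustrative (nothing like it is confirmed to exist in the class), and the sample HTML is made up.

```php
<?php
// Hypothetical helper: returns the Nth chunk (1-based) that starts with $start
// and ends with $end, delimiters included, or null if there is no such match.
function sketchFetchNthBetween(string $start, string $end, string $html, int $n): ?string
{
    if ($n < 1 || $start === '' || $end === '') {
        return null;
    }
    $offset = 0;
    $from = 0;
    for ($i = 1; $i <= $n; $i++) {
        $from = stripos($html, $start, $offset); // case-insensitive search
        if ($from === false) {
            return null; // fewer than $n matches on the page
        }
        $to = stripos($html, $end, $from + strlen($start));
        if ($to === false) {
            return null; // opening delimiter without a closing one
        }
        $offset = $to + strlen($end);
    }
    // $from and $offset now bracket the Nth match.
    return substr($html, $from, $offset - $from);
}

// Made-up page with two tables and no useful ids or classes on the second one.
$samplePage = '<table id="a"><tr><td>1</td></tr></table><table><tr><td>2</td></tr></table>';
$secondTable = sketchFetchNthBetween('<table', '</table>', $samplePage, 2);
```

One caveat: this is plain string matching, so nested tables will confuse it, since the first closing tag it finds ends the match.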