PHP Screen Scraping Tutorial
UPDATE: New Screen Scraping Post
Screen scraping is a great skill that every PHP developer should have experience with. Basically it involves fetching the source code of a web page into a string and then parsing out the parts that you want to use. A simple application of screen scraping could be to build a database of all the NFL teams complete with player details.
What the heck, let's do it... The first step is to get the page HTML into a PHP variable. This is super easy if the page is publicly accessible via a URL - no login or form post required to access... For more complex scraping you can use cURL to get the html source of the page but the rest of the process would be about the same. Anyway, let's scrape the site.
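A minimal sketch of that first step, assuming allow_url_fopen is enabled in your PHP configuration. The helper name fetch_page() and the commented-out URL are placeholders invented for this example, not the actual roster address:

```php
<?php
// Fetch the raw HTML of a publicly accessible page into a string.
// fetch_page() is a hypothetical helper name used only in this sketch.
function fetch_page($url)
{
    // file_get_contents() accepts URLs when allow_url_fopen is enabled
    // and returns false on failure; @ suppresses the warning so we can
    // report the failure ourselves.
    $raw = @file_get_contents($url);
    if ($raw === false) {
        throw new RuntimeException("Could not fetch {$url}");
    }
    return $raw;
}

// Placeholder URL: substitute the roster page you actually want to scrape.
// $raw = fetch_page('http://www.example.com/roster?team=SD');
```

Throwing on failure keeps a bad fetch from silently turning into an empty roster later in the script.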
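I have found that pattern matching is easiest when the HTML contains no newlines. Here is how I remove them from the raw HTML before I start parsing out the data I want to scrape.
$newlines = array("\t", "\n", "\r", "\0", "\x0B"); // assumed definition: the whitespace characters to strip
$content = str_replace($newlines, "", html_entity_decode($raw));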
Now that you have the source code of the page as a string variable, you need to parse out the results. This is where each scraping application will differ. Depending on the page structure and the elements you want to retrieve, you will have to alter the regular expression matching. If you view the source, you can see that the roster data you want is in a nice table with class name "standard_table". I also notice that this class name is unique to the page. So the next step is to get the start and end string positions for this table, and then extract just the table from the content.
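One way to do that extraction with strpos and substr is sketched below. The helper name extract_table() is made up for this example, and it assumes the class attribute sits on the table tag itself and that the table contains no nested tables:

```php
<?php
// Return the table whose opening tag carries the given class name,
// or null if it cannot be found. extract_table() is a hypothetical helper.
function extract_table($content, $class)
{
    // Find the class attribute first, since it is unique to the page.
    $classPos = strpos($content, 'class="' . $class . '"');
    if ($classPos === false) {
        return null;
    }
    // Back up to the "<table" that opens the tag holding that attribute.
    $start = strrpos(substr($content, 0, $classPos), '<table');
    if ($start === false) {
        return null;
    }
    // The table ends at the first closing tag after the attribute
    // (assumption: no nested tables inside it).
    $end = strpos($content, '</table>', $classPos);
    if ($end === false) {
        return null;
    }
    $end += strlen('</table>');
    return substr($content, $start, $end - $start);
}

// $table = extract_table($content, 'standard_table');
```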
Now we have just the table containing the roster data, and we need to parse out the rows and cells. The easiest way to do this is with preg_match_all. If this code is not clear, you can print_r and die() in the loop to see what the rows and cells arrays contain.
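// Assumes $table holds just the roster table extracted in the previous step.
// The U modifier makes the match ungreedy, so each <tr>...</tr> is matched separately.
preg_match_all("|<tr(.*)</tr>|U", $table, $rows);
foreach ($rows[0] as $row) {
    // Skip the header row, which uses <th> cells.
    if (strpos($row, '<th') === false) {
        preg_match_all("|<td(.*)</td>|U", $row, $cells);
        $number = strip_tags($cells[0][0]);
        $name = strip_tags($cells[0][1]);
        $position = strip_tags($cells[0][2]);
        echo "{$position} - {$name} - Number {$number} <br>\n";
    }
}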
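To see what the loop above produces without hitting a live site, the same logic can be wrapped in a function and run against a small hand-written table. This is only a sketch: parse_roster() is a name invented here, the sample rows are made up, and the three-column layout (number, name, position) is assumed from the code above:

```php
<?php
// Parse roster rows out of a table string into an array of players.
// parse_roster() is a hypothetical helper name used only in this sketch.
function parse_roster($table)
{
    $players = array();
    // U = ungreedy, s = let "." span newlines, i = ignore tag case.
    preg_match_all('|<tr(.*)</tr>|Usi', $table, $rows);
    foreach ($rows[0] as $row) {
        // Header rows use <th> cells and carry no player data.
        if (stripos($row, '<th') !== false) {
            continue;
        }
        preg_match_all('|<td(.*)</td>|Usi', $row, $cells);
        // Skip rows that do not have the three cells we expect.
        if (count($cells[0]) < 3) {
            continue;
        }
        $players[] = array(
            'number'   => trim(strip_tags($cells[0][0])),
            'name'     => trim(strip_tags($cells[0][1])),
            'position' => trim(strip_tags($cells[0][2])),
        );
    }
    return $players;
}

// Made-up sample data, only to exercise the function.
$sample = '<table class="standard_table">'
        . '<tr><th>No</th><th>Name</th><th>Pos</th></tr>'
        . '<tr><td>17</td><td><a href="/p/1">Doe, John</a></td><td>QB</td></tr>'
        . '<tr><td>21</td><td>Roe, Richard</td><td>RB</td></tr>'
        . '</table>';
$players = parse_roster($sample);
```

Returning an array instead of echoing makes it easier to reuse the result, for example when writing to a database.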
So now we have parsed all the data for a given team from the official NFL site. To do all the teams, wrap this in a loop and, as a final step, write all the data to a database table. Voila, you have a database of all team rosters for the NFL.
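As a rough sketch of that final step, the function below writes one team's players through PDO with a prepared statement. The table name, column names, team code and array keys are all assumptions made for this example; adapt them to your own schema:

```php
<?php
// Store one team's parsed roster and return the number of rows written.
// $db is any PDO connection; the "roster" table and its columns are
// assumptions made for this sketch.
function save_roster(PDO $db, $team, array $players)
{
    $db->exec('CREATE TABLE IF NOT EXISTS roster (
        team VARCHAR(8), jersey VARCHAR(4), name VARCHAR(100), pos VARCHAR(8))');
    // A prepared statement keeps scraped text from being run as SQL.
    $insert = $db->prepare(
        'INSERT INTO roster (team, jersey, name, pos) VALUES (?, ?, ?, ?)'
    );
    foreach ($players as $p) {
        $insert->execute(array($team, $p['number'], $p['name'], $p['position']));
    }
    return count($players);
}

// Usage idea (team codes are placeholders):
// $db = new PDO('sqlite:rosters.db');
// foreach (array('SD', 'DEN') as $team) {
//     $players = array(); // fill with number/name/position for this team
//     save_roster($db, $team, $players);
// }
```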
This simple scraping example is just to illustrate the basic concept. Keep in mind that if the source structure of the page you are scraping changes, you will need to adjust your pattern matching. You should always scrape the page once and save the results in a file, then read that file into your code during development and testing to minimize hits to the live server. My personal opinion is that anything publicly accessible via the internet should be fair game for scraping. How is it different from copying and pasting? That is basically what you are doing, just programmatically. You can definitely get into trouble if you misuse data that you scraped, though; you could violate copyrights, for example. Please scrape responsibly :)
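The scrape-once-then-read-from-a-file advice can be sketched like this. The function name fetch_cached() and the cache file path are made up for this example:

```php
<?php
// Return the page source, hitting the live server only when no cached
// copy exists yet. fetch_cached() is a hypothetical helper name.
function fetch_cached($url, $cacheFile)
{
    // Reuse the saved copy during development and testing.
    if (is_file($cacheFile)) {
        return file_get_contents($cacheFile);
    }
    $raw = @file_get_contents($url);
    if ($raw === false) {
        throw new RuntimeException("Could not fetch {$url}");
    }
    // Save the source so later runs never touch the live server.
    file_put_contents($cacheFile, $raw);
    return $raw;
}

// Placeholder URL and path:
// $raw = fetch_cached('http://www.example.com/roster?team=SD', 'roster_SD.html');
```

Delete the cache file whenever you want a fresh copy of the page.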
You’re currently reading “PHP Screen Scraping Tutorial,” an entry on BRADINO
- Published:
- 11.20.07 / 12am
- Category:
- PHP, Screen Scraping

Screen scraping with regular expressions? Seriously. If you want to do real screen scraping, and screen scraping that won’t break unless the target page’s structure is dramatically altered, use DOM/XPath. http://www.developertutorials.com/tutorials/php/scraping-links-with-php-8-01-05/page1.html has a good tutorial on it.
Thank you for this tutorial – I’m developing a portal, and wanted to create channels without asking all of our offices to develop new RSS feeds on their pages, or manually create content – this is really helpful to us.
Very nice tutorial. Scraping a table is exactly what I’m trying to do. Thank you.
That’s awesome… Exactly what I wanted to do: scrape a table.
Josho, thanks but no thanks! This tutorial was exactly what I needed (went over to your recommended tutorial, did exactly as I was told and nothing worked!). Plz don’t trash good tutorials and recommend bad ones.
I think Josho’s point is that while this is a good ’starter’, the subject (like most) does get more complex – e.g. sites that demand ‘correct’ user-agents to return content.
Readers SHOULD make themselves aware of the more advanced (albeit more complex) options such as cURL and the PHP 5 DOM constructs.
PREG matching is not as bad as some people make it sound here. The PHP DOM functions are good, but they are only worth the trouble IMHO if the HTML that you are dealing with is reasonably well written (note I did not say well-formed). I have scraped some sites that are so disorganized that the DOM functions were painful to use. In that case, some preg_matching to determine search boundaries and then applying more specific searches is much easier, at least for me.
This stuff is awesome, thanks man!
Thank you this helped a lot. :)
Thank you so much for taking the time to write this page!!
It’s just what I needed!
Thanks man, really nice tutorial……
I think what you’re missing is that RegEx has the power to do everything that DOM does, only much much messier.
Is there a sample php page I can download?
It is easier for me to work with a working sample page as I am a newb lol
Good tutorial, worked nicely, am off to try it myself now. As for the doubters and question raisers – there will always be a better way to do something until the something that’s supposed to be improved is obsolete.
Tutorials like this are very much needed and very helpful to a learning PHP programmer.
Thanks!
I have been scraping websites for many years, since back before either the PCRE functions or the DOM functions were available. Both are useful tools, and both have their place.
For example, if we are doing a targeted search, such as looking for just one item, or for many items that are all in a list of the same form (such as a table full of links), then a simple preg_match() based on the closest surrounding tags (perhaps using a [td class="foo"]whatwewant[/td]) may be the fastest both to code and to run. DOM is most useful when you want to collect a variety of different information from a page. However, very complex page structures can make the nested arrays of the DOM so difficult to navigate that it’s easier to just, again, pick what you want using preg_match. In some cases I have found that it is best to use preg_match multiple times: first to split the page into sections, then running preg_matches that are specific to each section.
I have also found that running ‘tidy->repairString()’ over a page with wrap=0 can be very useful in turning bad HTML into something that can improve results from either method. I have had DOM spend literally minutes trying to decipher a page before finally failing.
Also, when using preg_match, don’t try to do too much in one match. The more complex your matching, the more likely it is to break, after a page redesign or a change in display order, or even having some target string be different than you expected.
In the long run, dealing with the intricacies of the actual page fetching will be more work than working with the text. Snoopy is a class that makes a lot of that work easier. The most recent version does work with PHP 5.
Perhaps the most difficult common pages to scrape (until you figure them out) are those ASPX pages that use the javascript POSTBACK method for the links on the page. Use the Web Developer Firefox Add-on to help see what the form variables are – __VIEWSTATE is one of the key variables, I forget the others offhand.
Worked like a charm. Wonderful. Thank you, thank you, thank you!
It brought up a blank screen – what am I missing?
Hi,
It’s a very nice example. How do I enter the data, such as $number, into a database? I have 13 columns.
Regards
Class Library References
# System.Net
# System.IO
private void Page_Load(object sender, System.EventArgs e)
{
    // Retrieve URL from user input box
    if (Page.IsPostBack)
        litHTMLfromScrapedPage.Text = GetHTMLPage(URL.Text);
}
public String GetHTMLPage(string strURL)
{
    // the HTML retrieved from the page
    String strResult;
    WebResponse objResponse;
    WebRequest objRequest = System.Net.HttpWebRequest.Create(strURL);
    objResponse = objRequest.GetResponse();
    // the using keyword will automatically dispose the object
    // once complete
    using (StreamReader sr =
        new StreamReader(objResponse.GetResponseStream()))
    {
        strResult = sr.ReadToEnd();
        // close and clean up the StreamReader
        sr.Close();
    }
    return strResult;
}
page scrape in c# – Hello Kong Vanny Friends
Enter the URL to the page you want to scrape (include the http://)
Beautiful. Thank you for writing this!
I always wondered how this could be done so simply.
:)