Screen Scraping Twitter
I got an email today asking for help scraping Twitter, in particular with logging in. So I am going to show everyone how, NOT to encourage anyone to violate Twitter's terms of use, but as an educational blog post about how PHP and cURL can be used to post variables and store cookies.
Again, I am using the cScrape class I wrote, which you can download.
Step 1
First go to twitter.com and look at the source code of the login form to get the form field names and the form post location. You will see that the form posts to https://twitter.com/sessions and the username and password fields are session[username_or_email] and session[password] respectively.
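If you would rather not read the page source by eye, here is a rough sketch of pulling the post location and field names out of a form with PHP's DOMDocument. The HTML string is a made-up, trimmed-down stand-in for the login form (the real markup will differ), and list_form_fields() is a hypothetical helper, not part of the Scrape class.

```php
<?php
// Hypothetical, trimmed-down copy of a login form; the real markup may differ.
$html = '<form action="https://twitter.com/sessions" method="post">'
      . '<input name="authenticity_token" type="hidden" value="abc123">'
      . '<input name="session[username_or_email]" type="text">'
      . '<input name="session[password]" type="password">'
      . '</form>';

// Illustrative helper (an assumption, not part of the Scrape class):
// returns the action URL and input names of the first form in the HTML.
function list_form_fields($html)
{
    $doc = new DOMDocument();
    // Suppress parser warnings, since real-world HTML is rarely well-formed.
    @$doc->loadHTML($html);

    $form = $doc->getElementsByTagName('form')->item(0);
    if ($form === null) {
        return null; // no form found
    }

    $fields = array();
    foreach ($form->getElementsByTagName('input') as $input) {
        $fields[] = $input->getAttribute('name');
    }

    return array(
        'action' => $form->getAttribute('action'),
        'fields' => $fields,
    );
}

$info = list_form_fields($html);
print_r($info);
```

Hidden inputs show up in the same list, which is a handy way to spot extra variables the form expects.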
Step 2
Now you are ready to log in. Using the fetch function in the Scrape class, create an associative array containing the form values you want to post. You will also need to uncomment the lines for CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR, since cookies are required to stay logged in and scrape around. The paths to the cookie files need to be writable by your app. Finally, uncomment the line for CURLOPT_FOLLOWLOCATION.
$data = array('session[username_or_email]' => "bradino", 'session[password]' => "secret");
$scrape->fetch('https://twitter.com/sessions',$data);
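For anyone curious what those three uncommented options amount to, here is a minimal sketch of the cURL settings involved. This is not the actual code inside the Scrape class; build_login_curl_options() and the cookie file name are assumptions made up for illustration. Only the CURLOPT_* constants are real.

```php
<?php
// Hypothetical helper, NOT the Scrape class internals: collects the cURL
// options a cookie-aware POST request would typically need.
function build_login_curl_options($url, array $postFields, $cookieFile)
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow the redirect after login
        CURLOPT_COOKIEFILE     => $cookieFile, // read cookies from this file
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies back when the handle closes
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $postFields, // an array is sent as multipart/form-data
    );
}

// Assumed cookie location; any path writable by your app will do.
$cookieFile = sys_get_temp_dir() . '/scrape_cookies.txt';

$data = array(
    'session[username_or_email]' => 'bradino',
    'session[password]'          => 'secret',
);

$options = build_login_curl_options('https://twitter.com/sessions', $data, $cookieFile);

// To actually send the request you would do something like:
// $ch = curl_init();
// curl_setopt_array($ch, $options);
// $result = curl_exec($ch);
// curl_close($ch);
```

Using the same file for CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR is what lets later requests reuse the session cookie set by the login.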
Step 1.5
Oops, that didn't work. All I got back was "403 Forbidden: The server understood the request, but is refusing to fulfill it." Ahhh, I see another variable called authenticity_token; I bet Twitter was looking for that. So let's back up and first hit twitter.com to get the authenticity_token variable, and then make the login post request with that variable included in our array of parameters.
$data = array('session[username_or_email]' => "bradino", 'session[password]' => "secret");
$scrape->fetch('https://twitter.com');
$data['authenticity_token'] = $scrape->fetchBetween('name="authenticity_token" type="hidden" value="','"',$scrape->result);
$scrape->fetch('https://twitter.com/sessions',$data);
echo $scrape->result;
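To show what the token extraction above is doing, here is a small stand-alone sketch of a "fetch between" helper. The real fetchBetween in the Scrape class may be implemented differently; fetch_between() and the sample HTML are assumptions for illustration only.

```php
<?php
// Hypothetical stand-in for the class's fetchBetween(): returns the text
// found between $start and the next $end, or false if either is missing.
function fetch_between($start, $end, $haystack)
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return false; // start marker not found
    }
    $from += strlen($start);

    $to = strpos($haystack, $end, $from);
    if ($to === false) {
        return false; // end marker not found
    }

    return substr($haystack, $from, $to - $from);
}

// Made-up sample markup; the attribute order on the real page must match
// the start marker exactly, or the lookup will fail.
$html = '<input name="authenticity_token" type="hidden" value="abc123" />';

$token = fetch_between('name="authenticity_token" type="hidden" value="', '"', $html);
var_dump($token);
```

Because this kind of matching is a plain string search, any change in attribute order or spacing on the page breaks it, so checking for false before posting is worthwhile.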
So that's basically it. Now you are logged in and can scrape around and request other pages as you normally would. Sorry it wasn't a longer post. I really do enjoy this kind of stuff so if anyone has a request, hit me up.
Errors?
1) Make sure that you are properly parsing the token variable
2) Make sure that you uncommented the lines about CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR. Those options need to be enabled, and the path they point to must be writable by your application
3) Make sure that the path to the cookie file is writable and that it is getting data written to it
4) If you get a message about being redirected, you need to uncomment the line about CURLOPT_FOLLOWLOCATION. That option needs to be set to true
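Points 2 and 3 can be checked from PHP itself. Here is a quick sketch, assuming a hypothetical check_cookie_file() helper and a temporary file standing in for your real cookie path.

```php
<?php
// Hypothetical diagnostic helper: reports the state of a cookie file.
// Returns 'ok', 'empty', 'missing' or 'not writable'.
function check_cookie_file($path)
{
    clearstatcache(); // make sure size and permission checks are fresh

    if (file_exists($path)) {
        if (!is_writable($path)) {
            return 'not writable';
        }
        return filesize($path) > 0 ? 'ok' : 'empty';
    }

    // The file does not exist yet: cURL can only create it if the
    // containing directory is writable.
    return is_writable(dirname($path)) ? 'missing' : 'not writable';
}

// Stand-in for your real cookie path (an assumption for this demo).
$cookieFile = tempnam(sys_get_temp_dir(), 'cookies');

$before = check_cookie_file($cookieFile); // freshly created, nothing written yet

// Simulate cURL writing its cookie jar.
file_put_contents($cookieFile, "# Netscape HTTP Cookie File\n");
$after = check_cookie_file($cookieFile);

echo $before . "\n" . $after . "\n";

unlink($cookieFile); // clean up the demo file
```

If the file stays empty after a login attempt, the cookie jar option is probably not being set or the handle is never closed.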
You’re currently reading “Screen Scraping Twitter,” an entry on BRADINO
- Published: 3.28.09 / 1pm
- Category: PHP, Screen Scraping























Hi,
I wrote my own screen scraper logging into Twitter but couldn’t get it to work as I am getting this error:
403 Forbidden: The server understood the request, but is refusing to fulfill it.
So I searched around and found your code, which is giving me the exact same error. Are you still able to use this code to log in to Twitter for scraping? Any ideas what I am doing wrong? I have cookies enabled and I am passing back the token.
By the way, I think for step 1.5, you need the line $scrape = new Scrape(); before the $scrape->fetch('https://twitter.com'); line in order for the code to work.
Turned out that my error was caused by XAMPP. I uploaded the script to an actual server, ran it, and it worked perfectly.
If anyone else is using XAMPP, you will have to post the data using a string instead of an array (don’t ask me why, it just won’t work with an array), basically like this:
$post_data = "authenticity_token=" . $token . "&" . urlencode("session[username_or_email]") . "=$username&" . urlencode("session[password]") . "=$password";
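One possible explanation, offered as a guess rather than a diagnosis: when PHP's cURL is given an array for CURLOPT_POSTFIELDS it sends the body as multipart/form-data, while a string is sent as application/x-www-form-urlencoded, and some setups only cope with the latter. If you go the string route, http_build_query() will build it and also encode the values, which the hand-built string above does not. The credentials below are made-up sample values.

```php
<?php
// Made-up sample values for illustration.
$token    = 'abc123';
$username = 'bradino';
$password = 'p&ss word'; // contains characters that must be encoded

// Build an application/x-www-form-urlencoded body; both keys and
// values are URL-encoded, and '&' is forced as the separator.
$post_data = http_build_query(array(
    'authenticity_token'         => $token,
    'session[username_or_email]' => $username,
    'session[password]'          => $password,
), '', '&');

echo $post_data . "\n";
// authenticity_token=abc123&session%5Busername_or_email%5D=bradino&session%5Bpassword%5D=p%26ss+word
```

Without encoding, a password containing & or = would silently split into extra fields.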
Hey man, thanks for writing this class and making it public! I’m getting an error when trying to include the file containing the class. Here is my PHP file (renamed as txt):
http://is.gd/1ZB4f
The error I get is:
Parse error: parse error, unexpected T_STRING, expecting T_OLD_FUNCTION or T_FUNCTION or T_VAR or ‘}’ in /the_full_path/cScrape.php on line 3
What version of PHP do you run? I wonder if that has anything to do with it?
PHP5+
I have tried using the friendships/create API call, but it requires POST, without any POST data.
I’ve tried rewriting your code a little, but I can’t get it to work.
Would you be able to show me how to do it?
I constantly get the error that I can’t be authenticated, but I do this on the back of other API method calls, which also require AUTH, and which work fine!
Not sure if I may have a boo boo there, or your current version won’t do it!
Cheers.
With the Scrape class I wrote, it takes a data array as a parameter to put it into POST mode. I would imagine that you could use an array with one element, which Twitter would ignore, or else just comment out the conditional where it checks that the data array contains something before switching to POST. Otherwise email me your code example and I will get it to work for you.
BRAD