Techniques for Mastering cURL
cURL is a tool for transferring files and data with URL syntax, supporting many protocols including HTTP, FTP and TELNET. Initially, cURL was designed to be a command line tool. Lucky for us, the cURL library is also supported by PHP. In this article, we will look at some of the advanced features of cURL, and how we can use them in our PHP scripts.
Why cURL?
It’s true that there are other ways of fetching the contents of a web page. Many times, mostly due to laziness, I have just used simple PHP functions instead of cURL:
- $content = file_get_contents("http://www.nettuts.com");
- // or
- $lines = file("http://www.nettuts.com");
- // or
- readfile("http://www.nettuts.com");
However, these functions offer virtually no flexibility and lack proper error handling. There are also certain tasks that they simply cannot do, such as dealing with cookies, authentication, form posts, file uploads and so on.
cURL is a powerful library that supports many different protocols, options, and provides detailed information about the URL requests.
Basic Structure
Before we move on to more complicated examples, let’s review the basic structure of a cURL request in PHP. There are four main steps:
- Initialize
- Set Options
- Execute and Fetch Result
- Free up the cURL handle
- // 1. initialize
- $ch = curl_init();
- // 2. set the options, including the url
- curl_setopt($ch, CURLOPT_URL, "http://www.nettuts.com");
- curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
- curl_setopt($ch, CURLOPT_HEADER, 0);
- // 3. execute and fetch the resulting HTML output
- $output = curl_exec($ch);
- // 4. free up the curl handle
- curl_close($ch);
Step #2 (i.e. curl_setopt() calls) is going to be a big part of this article, because that is where all the magic happens. There is a long list of cURL options that can be set, which can configure the URL request in detail. It might be difficult to go through the whole list and digest it all at once. So today, we are just going to use some of the more common and useful options in various code examples.
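As the option list grows, the repeated curl_setopt() calls get verbose. PHP also provides curl_setopt_array(), which applies a whole set of options in one call; here is a minimal sketch of the same request written that way:

```php
<?php
// set all options in one call instead of repeated curl_setopt() lines
$ch = curl_init();

$options = array(
    CURLOPT_URL            => "http://www.nettuts.com",
    CURLOPT_RETURNTRANSFER => 1,  // return output instead of printing it
    CURLOPT_HEADER         => 0,  // no headers in the output
);

// curl_setopt_array() returns false as soon as one option fails to set
$ok = curl_setopt_array($ch, $options);
var_dump($ok);

curl_close($ch);
```

This also makes it easy to keep a base array of defaults and merge per-request options into it.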
Checking for Errors
Optionally, you can also add error checking:
- // ...
- $output = curl_exec($ch);
- if ($output === FALSE) {
- echo "cURL Error: " . curl_error($ch);
- }
- // ...
Please note that we need to compare with “=== FALSE” instead of “== FALSE”, because we have to distinguish an empty output (an empty string) from the boolean value FALSE, which indicates an error.
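This pattern can be folded into a small helper that returns both the output and any error message. A sketch (the function name curl_fetch is my own); the example URL uses a deliberately unsupported protocol, so the call fails immediately without any network traffic:

```php
<?php
// fetch a URL, returning array(output, error_message)
function curl_fetch($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

    $output = curl_exec($ch);

    // === FALSE distinguishes a genuinely empty response from an error
    $error = ($output === FALSE) ? curl_error($ch) : "";

    curl_close($ch);
    return array($output, $error);
}

// "htp" is not a protocol cURL knows, so this fails right away
list($output, $error) = curl_fetch("htp://www.nettuts.com");
if ($output === FALSE) {
    echo "cURL Error: $error\n";
}
```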
Getting Information
Another optional step is to get information about the cURL request, after it has been executed.
- // ...
- curl_exec($ch);
- $info = curl_getinfo($ch);
- echo 'Took ' . $info['total_time'] . ' seconds for url ' . $info['url'];
- // ...
The following information is included in the returned array:
- “url”
- “content_type”
- “http_code”
- “header_size”
- “request_size”
- “filetime”
- “ssl_verify_result”
- “redirect_count”
- “total_time”
- “namelookup_time”
- “connect_time”
- “pretransfer_time”
- “size_upload”
- “size_download”
- “speed_download”
- “speed_upload”
- “download_content_length”
- “upload_content_length”
- “starttransfer_time”
- “redirect_time”
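The timing fields in particular are handy for spotting a slow step. Here is a small sketch of a formatting helper (the function name is my own) that turns such an array into a one-line summary; it runs on a hand-built sample array here, but works on any curl_getinfo() result:

```php
<?php
// summarize the timing breakdown of a curl_getinfo() array
function timing_summary($info) {
    return sprintf(
        "%s: dns %.3fs, connect %.3fs, first byte %.3fs, total %.3fs",
        $info['url'],
        $info['namelookup_time'],
        $info['connect_time'],
        $info['starttransfer_time'],
        $info['total_time']
    );
}

// sample values, shaped like what curl_getinfo() returns
$info = array(
    'url'                => 'http://www.nettuts.com',
    'namelookup_time'    => 0.012,
    'connect_time'       => 0.045,
    'starttransfer_time' => 0.210,
    'total_time'         => 0.260,
);
echo timing_summary($info) . "\n";
```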
Detect Redirection Based on Browser
In this first example, we will write a script that can detect URL redirections based on different browser settings. For example, some websites redirect cellphone browsers, or even surfers from different countries.
We are going to be using the CURLOPT_HTTPHEADER option to set our outgoing HTTP Headers including the user agent string and the accepted languages. Finally we will check to see if these websites are trying to redirect us to different URLs.
- // test URLs
- $urls = array(
- "http://www.cnn.com",
- "http://www.mozilla.com",
- "http://www.facebook.com"
- );
- // test browsers
- $browsers = array(
- "standard" => array (
- "user_agent" => "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 (.NET CLR 3.5.30729)",
- "language" => "en-us,en;q=0.5"
- ),
- "iphone" => array (
- "user_agent" => "Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A537a Safari/419.3",
- "language" => "en"
- ),
- "french" => array (
- "user_agent" => "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB6; .NET CLR 2.0.50727)",
- "language" => "fr,fr-FR;q=0.5"
- )
- );
- foreach ($urls as $url) {
- echo "URL: $url\n";
- foreach ($browsers as $test_name => $browser) {
- $ch = curl_init();
- // set url
- curl_setopt($ch, CURLOPT_URL, $url);
- // set browser specific headers
- curl_setopt($ch, CURLOPT_HTTPHEADER, array(
- "User-Agent: {$browser['user_agent']}",
- "Accept-Language: {$browser['language']}"
- ));
- // we don't want the page contents
- curl_setopt($ch, CURLOPT_NOBODY, 1);
- // we need the HTTP Header returned
- curl_setopt($ch, CURLOPT_HEADER, 1);
- // return the results instead of outputting it
- curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
- $output = curl_exec($ch);
- curl_close($ch);
- // was there a redirection HTTP header?
- if (preg_match("!^Location:\s*(.*?)\s*$!mi", $output, $matches)) {
- echo "$test_name: redirects to $matches[1]\n";
- } else {
- echo "$test_name: no redirection\n";
- }
- }
- echo "\n\n";
- }
First we have a set of URLs to test, followed by a set of browser settings to test each of these URLs against. Then we loop through these test cases and make a cURL request for each.
Because of the way we set up the cURL options, the returned output will only contain the HTTP headers (saved in $output). With a simple regex, we can check whether a “Location:” header was included.
When you run this script, it will print, for each URL, whether each browser profile was redirected and to where.
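The “Location:” matching can also be pulled into a small helper function, which makes it easy to test without any network access. A minimal sketch (the function name is my own); note that it trims the trailing \r that raw HTTP header lines carry:

```php
<?php
// pull the Location header out of a raw HTTP header block,
// or return NULL when there was no redirection
function get_redirect_url($headers) {
    if (preg_match("!^Location:\s*(.*?)\s*$!mi", $headers, $matches)) {
        return $matches[1];
    }
    return NULL;
}

$sample = "HTTP/1.1 302 Found\r\nLocation: http://www.cnn.com/mobile\r\nContent-Length: 0\r\n\r\n";
var_dump(get_redirect_url($sample));                    // the redirect target
var_dump(get_redirect_url("HTTP/1.1 200 OK\r\n\r\n"));  // NULL
```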
POSTing to a URL
On a GET request, data can be sent to a URL via the “query string”. For example, when you do a search on Google, the search term is located in the query string part of the URL:
- http://www.google.com/search?q=nettuts
You may not need cURL to simulate this in a web script. You can just be lazy and hit that URL with “file_get_contents()” to receive the results.
But some HTML forms are set to the POST method. When these forms are submitted through the browser, the data is sent via the HTTP Request body, rather than the query string. For example, if you do a search on the CodeIgniter forums, you will be POSTing your search query to:
- http://codeigniter.com/forums/do_search/
We can write a PHP script to simulate this kind of URL request. First let’s create a simple file for accepting and displaying the POST data. Let’s call it post_output.php:
- print_r($_POST);
Next we create a PHP script to perform a cURL request:
- $url = "http://localhost/post_output.php";
- $post_data = array (
- "foo" => "bar",
- "query" => "Nettuts",
- "action" => "Submit"
- );
- $ch = curl_init();
- curl_setopt($ch, CURLOPT_URL, $url);
- curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
- // we are doing a POST request
- curl_setopt($ch, CURLOPT_POST, 1);
- // adding the post variables to the request
- curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
- $output = curl_exec($ch);
- curl_close($ch);
- echo $output;
When you run this script, you should see a dump of the $_POST array: the script sent a POST to post_output.php, which printed the incoming data, and we captured that output via cURL.
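One detail worth knowing: when CURLOPT_POSTFIELDS receives an array, cURL sends the request as multipart/form-data. If you pass a string built with http_build_query() instead, it is sent as a regular application/x-www-form-urlencoded form post, which is what most HTML forms produce:

```php
<?php
$post_data = array(
    "foo"    => "bar",
    "query"  => "Nettuts",
    "action" => "Submit"
);

// turn the array into a query string: foo=bar&query=Nettuts&action=Submit
$encoded = http_build_query($post_data);
echo $encoded . "\n";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://localhost/post_output.php");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
// a string payload makes this an application/x-www-form-urlencoded post
curl_setopt($ch, CURLOPT_POSTFIELDS, $encoded);
// $output = curl_exec($ch); // uncomment to actually send the request
curl_close($ch);
```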
File Upload
Uploading files works very similarly to the previous POST example, since all file upload forms have the POST method.
First let’s create a file for receiving the request and call it upload_output.php:
- print_r($_FILES);
And here is the actual script performing the file upload:
- $url = "http://localhost/upload_output.php";
- $post_data = array (
- "foo" => "bar",
- // file to be uploaded (a CURLFile object, since PHP 5.5)
- "upload" => new CURLFile("C:/wamp/www/test.zip")
- );
- $ch = curl_init();
- curl_setopt($ch, CURLOPT_URL, $url);
- curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
- curl_setopt($ch, CURLOPT_POST, 1);
- curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
- $output = curl_exec($ch);
- curl_close($ch);
- echo $output;
When you want to upload a file, all you have to do is pass it just like a post variable: on PHP 5.5 and later as a CURLFile object, and on older versions by prefixing the file path string with the @ symbol. When you run this script, you should see a dump of the $_FILES array showing the uploaded file.
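On PHP 5.5 and later, the CURLFile class (or its curl_file_create() helper) also lets you set the MIME type and the file name the receiving script will see; the path below is just an example:

```php
<?php
// describe the upload: local path, MIME type, and the name
// the receiving script will see in $_FILES
$file = curl_file_create("C:/wamp/www/test.zip", "application/zip", "test.zip");

echo $file->getFilename() . "\n";     // the local path
echo $file->getMimeType() . "\n";     // application/zip
echo $file->getPostFilename() . "\n"; // the name sent to the server

// it is then used exactly like any other post field:
// curl_setopt($ch, CURLOPT_POSTFIELDS, array("upload" => $file));
```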
Multi cURL
One of the more advanced features of cURL is the ability to create a “multi” cURL handle. This allows you to open connections to multiple URLs simultaneously and asynchronously.
On a regular cURL request, the script execution stops and waits for the URL request to finish before it can continue. If you intend to hit multiple URLs, this can take a long time, as you can only request one URL at a time. We can overcome this limitation by using the multi handle.
Let’s look at this sample code from php.net:
- // create both cURL resources
- $ch1 = curl_init();
- $ch2 = curl_init();
- // set URL and other appropriate options
- curl_setopt($ch1, CURLOPT_URL, "http://lxr.php.net/");
- curl_setopt($ch1, CURLOPT_HEADER, 0);
- curl_setopt($ch2, CURLOPT_URL, "http://www.php.net/");
- curl_setopt($ch2, CURLOPT_HEADER, 0);
- //create the multiple cURL handle
- $mh = curl_multi_init();
- //add the two handles
- curl_multi_add_handle($mh,$ch1);
- curl_multi_add_handle($mh,$ch2);
- $active = null;
- //execute the handles
- do {
- $mrc = curl_multi_exec($mh, $active);
- } while ($mrc == CURLM_CALL_MULTI_PERFORM);
- while ($active && $mrc == CURLM_OK) {
- if (curl_multi_select($mh) != -1) {
- do {
- $mrc = curl_multi_exec($mh, $active);
- } while ($mrc == CURLM_CALL_MULTI_PERFORM);
- }
- }
- //close the handles
- curl_multi_remove_handle($mh, $ch1);
- curl_multi_remove_handle($mh, $ch2);
- curl_multi_close($mh);
The idea is that you can open multiple cURL handles and assign them to a single multi handle. Then you can wait for them to finish executing while in a loop.
There are two main loops in this example. The first do-while loop repeatedly calls curl_multi_exec(). This function is non-blocking: it does as little work as possible and returns a status value. As long as the returned value is the constant ‘CURLM_CALL_MULTI_PERFORM’, there is still immediate work to do (for example, sending the HTTP headers for the requests), so we keep calling it until the return value is something else.
In the following while loop, we continue as long as the $active variable is ‘true’. It was passed as the second argument to curl_multi_exec(), and is set to ‘true’ as long as there are active connections within the multi handle. Next, we call curl_multi_select(), which blocks until there is any connection activity, such as a response arriving. When that happens, we go into yet another do-while loop to continue executing.
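The same two-loop pattern can be wrapped into a reusable function. Below is a sketch (the function name multi_fetch is my own) that takes a list of URLs and returns their contents keyed by URL:

```php
<?php
// fetch several URLs in parallel; returns url => content (false on failure)
function multi_fetch($urls) {
    $mh = curl_multi_init();
    $handles = array();

    foreach ($urls as $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // same two-loop structure as in the php.net example
    $active = null;
    do {
        $mrc = curl_multi_exec($mh, $active);
    } while ($mrc == CURLM_CALL_MULTI_PERFORM);

    while ($active && $mrc == CURLM_OK) {
        if (curl_multi_select($mh) != -1) {
            do {
                $mrc = curl_multi_exec($mh, $active);
            } while ($mrc == CURLM_CALL_MULTI_PERFORM);
        }
    }

    // collect the results and clean up
    $results = array();
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $results;
}

// e.g. $pages = multi_fetch(array("http://lxr.php.net/", "http://www.php.net/"));
```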
Let’s see if we can create a working example ourselves, one with a practical purpose.
WordPress Link Checker
Imagine a blog with many posts containing links to external websites. Some of these links might end up dead after a while, for various reasons: maybe the page is no longer there, or the entire website is gone.
We are going to be building a script that analyzes all the links and finds non-loading websites and 404 pages and returns a report to us.
Note that this is not going to be an actual WordPress plug-in. It is only a standalone utility script, and it is just for demonstration purposes.
So let’s get started. First we need to fetch the links from the database:
- // CONFIG
- $db_host = 'localhost';
- $db_user = 'root';
- $db_pass = '';
- $db_name = 'wordpress';
- $excluded_domains = array(
- 'localhost', 'www.mydomain.com');
- $max_connections = 10;
- // initialize some variables
- $url_list = array();
- $working_urls = array();
- $dead_urls = array();
- $not_found_urls = array();
- $active = null;
- // connect to MySQL (the old mysql_* functions were removed in PHP 7,
- // so we use mysqli here)
- $db = mysqli_connect($db_host, $db_user, $db_pass, $db_name);
- if (!$db) {
- die('Could not connect: ' . mysqli_connect_error());
- }
- // get all published posts that have links
- $q = "SELECT post_content FROM wp_posts
- WHERE post_content LIKE '%href=%'
- AND post_status = 'publish'
- AND post_type = 'post'";
- $r = mysqli_query($db, $q) or die(mysqli_error($db));
- while ($d = mysqli_fetch_assoc($r)) {
- // get all links via regex
- if (preg_match_all("!href=\"(.*?)\"!", $d['post_content'], $matches)) {
- foreach ($matches[1] as $url) {
- // exclude some domains
- $tmp = parse_url($url);
- // skip links with no host (e.g. relative URLs) and excluded domains
- if (empty($tmp['host']) || in_array($tmp['host'], $excluded_domains)) {
- continue;
- }
- // store the url
- $url_list []= $url;
- }
- }
- }
- // remove duplicates
- $url_list = array_values(array_unique($url_list));
- if (!$url_list) {
- die('No URL to check');
- }
First we have some database configuration, followed by an array of domain names we will ignore ($excluded_domains). Also we set a number for maximum simultaneous connections we will be using later ($max_connections). Then we connect to the database, fetch posts that contain links, and collect them into an array ($url_list).
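As an aside, the link-harvesting logic in the loop above is self-contained, and pulling it into its own function (the name extract_links is my own) makes it easy to test without a database:

```php
<?php
// pull href values out of a block of HTML, skipping links with
// no host (relative URLs, anchors) and excluded domains
function extract_links($html, $excluded_domains) {
    $found = array();
    if (preg_match_all("!href=\"(.*?)\"!", $html, $matches)) {
        foreach ($matches[1] as $url) {
            $tmp = parse_url($url);
            if (empty($tmp['host']) || in_array($tmp['host'], $excluded_domains)) {
                continue;
            }
            $found[] = $url;
        }
    }
    // remove duplicates and reindex
    return array_values(array_unique($found));
}

$html = '<a href="http://foo.com/a">a</a> <a href="/relative">r</a> '
      . '<a href="http://localhost/x">x</a> <a href="http://foo.com/a">dup</a>';
print_r(extract_links($html, array('localhost')));
```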
The following code might be a little complex, so I will try to explain it in small steps.
- // 1. multi handle
- $mh = curl_multi_init();
- // 2. add multiple URLs to the multi handle
- for ($i = 0; $i < $max_connections; $i++) {
- add_url_to_multi_handle($mh, $url_list);
- }
- // 3. initial execution
- do {
- $mrc = curl_multi_exec($mh, $active);
- } while ($mrc == CURLM_CALL_MULTI_PERFORM);
- // 4. main loop
- while ($active && $mrc == CURLM_OK) {
- // 5. there is activity
- if (curl_multi_select($mh) != -1) {
- // 6. do work
- do {
- $mrc = curl_multi_exec($mh, $active);
- } while ($mrc == CURLM_CALL_MULTI_PERFORM);
- // 7. is there info?
- if ($mhinfo = curl_multi_info_read($mh)) {
- // this means one of the requests was finished
- // 8. get the info on the curl handle
- $chinfo = curl_getinfo($mhinfo['handle']);
- // 9. dead link?
- if (!$chinfo['http_code']) {
- $dead_urls []= $chinfo['url'];
- // 10. 404?
- } else if ($chinfo['http_code'] == 404) {
- $not_found_urls []= $chinfo['url'];
- // 11. working
- } else {
- $working_urls []= $chinfo['url'];
- }
- // 12. remove the handle
- curl_multi_remove_handle($mh, $mhinfo['handle']);
- curl_close($mhinfo['handle']);
- // 13. add a new url and do work
- if (add_url_to_multi_handle($mh, $url_list)) {
- do {
- $mrc = curl_multi_exec($mh, $active);
- } while ($mrc == CURLM_CALL_MULTI_PERFORM);
- }
- }
- }
- }
- // 14. finished
- curl_multi_close($mh);
- echo "==Dead URLs==\n";
- echo implode("\n",$dead_urls) . "\n\n";
- echo "==404 URLs==\n";
- echo implode("\n",$not_found_urls) . "\n\n";
- echo "==Working URLs==\n";
- echo implode("\n",$working_urls);
- // 15. adds a url to the multi handle
- function add_url_to_multi_handle($mh, $url_list) {
- static $index = 0;
- // if we have another url to get
- if (isset($url_list[$index])) {
- // new curl handle
- $ch = curl_init();
- // set the url
- curl_setopt($ch, CURLOPT_URL, $url_list[$index]);
- // to prevent the response from being outputted
- curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
- // follow redirections
- curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
- // do not need the body. this saves bandwidth and time
- curl_setopt($ch, CURLOPT_NOBODY, 1);
- // add it to the multi handle
- curl_multi_add_handle($mh, $ch);
- // increment so next url is used next time
- $index++;
- return true;
- } else {
- // we are done adding new URLs
- return false;
- }
- }
And here is the explanation for the code above. Numbers in the list correspond to the numbers in the code comments.
- Created a multi handle.
- We will be creating the add_url_to_multi_handle() function later on. Every time it is called, it will add a url to the multi handle. Initially, we add 10 (based on $max_connections) URLs to the multi handle.
- We must run curl_multi_exec() for the initial work. As long as it returns CURLM_CALL_MULTI_PERFORM, there is work to do. This is mainly for creating the connections. It does not wait for the full URL response.
- This main loop runs as long as there is some activity in the multi handle.
- curl_multi_select() blocks the script until there is activity on any of the URL requests.
- Again we must let cURL do some work, mainly for fetching response data.
- We check for info. There is an array returned if a URL request was finished.
- There is a cURL handle in the returned array. We use that to fetch info on the individual cURL request.
- If the link was dead or timed out, there will be no http code.
- If the link was a 404 page, the http code will be set to 404.
- Otherwise we assume it was a working link. (You may add additional checks for 500 error codes etc...)
- We remove the cURL handle from the multi handle since it is no longer needed, and close it.
- We can now add another url to the multi handle, and again do the initial work before moving on.
- Everything is finished. We can close the multi handle and print a report.
- This is the function that adds a new url to the multi handle. The static variable $index is incremented every time this function is called, so we can keep track of where we left off.
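As a design note, if you would rather avoid the static variable, the same bookkeeping can be done with a closure capturing an index by reference (PHP 5.3+). A sketch (the function name is my own), equivalent to add_url_to_multi_handle():

```php
<?php
// build a "feeder" closure that adds the next URL to the multi
// handle on each call, returning false once the list is exhausted
function make_url_feeder($mh, $url_list) {
    $index = 0;
    return function () use ($mh, $url_list, &$index) {
        if (!isset($url_list[$index])) {
            return false; // no URLs left
        }
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url_list[$index]);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_NOBODY, 1);
        curl_multi_add_handle($mh, $ch);
        $index++;
        return true;
    };
}

$mh = curl_multi_init();
$feed = make_url_feeder($mh, array("http://www.nettuts.com"));
var_dump($feed()); // true: one URL was added
var_dump($feed()); // false: the list is exhausted
```

The closure carries its own counter, so you can run several independent feeders without them interfering with each other.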
I ran the script on my blog (with some broken links added on purpose, for testing), and it sorted the links into the three lists.
It took less than two seconds to go through about 40 URLs. The performance gains are significant when dealing with even larger sets of URLs: if you open ten connections at the same time, the script can run up to ten times faster. You can also utilize the non-blocking nature of the multi cURL handle to perform URL requests without stalling your web script.
Some Other Useful cURL Options
HTTP Authentication
If there is HTTP based authentication on a URL, you can use this:
- $url = "http://www.somesite.com/members/";
- $ch = curl_init();
- curl_setopt($ch, CURLOPT_URL, $url);
- curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
- // send the username and password
- curl_setopt($ch, CURLOPT_USERPWD, "myusername:mypassword");
- // if you allow redirections
- curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
- // this lets cURL keep sending the username and password
- // after being redirected
- curl_setopt($ch, CURLOPT_UNRESTRICTED_AUTH, 1);
- $output = curl_exec($ch);
- curl_close($ch);
FTP Upload
PHP does have an FTP library, but you can also use cURL:
- // open a file pointer
- $fp = fopen("/path/to/file", "r");
- // the url contains most of the info needed
- $url = "ftp://username:password@mydomain.com:21/path/to/new/file";
- $ch = curl_init();
- curl_setopt($ch, CURLOPT_URL, $url);
- curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
- // upload related options
- curl_setopt($ch, CURLOPT_UPLOAD, 1);
- curl_setopt($ch, CURLOPT_INFILE, $fp);
- curl_setopt($ch, CURLOPT_INFILESIZE, filesize("/path/to/file"));
- // set for ASCII mode (e.g. text files)
- curl_setopt($ch, CURLOPT_FTPASCII, 1);
- $output = curl_exec($ch);
- curl_close($ch);
Using a Proxy
You can perform your URL request through a proxy:
- $ch = curl_init();
- curl_setopt($ch, CURLOPT_URL,'http://www.example.com');
- curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
- // set the proxy address to use
- curl_setopt($ch, CURLOPT_PROXY, '11.11.11.11:8080');
- // if the proxy requires a username and password
- curl_setopt($ch, CURLOPT_PROXYUSERPWD,'user:pass');
- $output = curl_exec($ch);
- curl_close ($ch);
Callback Functions
It is possible to have cURL call given callback functions during the URL request, before it finishes. For example, as the content of the response is being downloaded, you can start using the data without waiting for the whole download to complete.
- $ch = curl_init();
- curl_setopt($ch, CURLOPT_URL,'http://net.tutsplus.com');
- curl_setopt($ch, CURLOPT_WRITEFUNCTION,"progress_function");
- curl_exec($ch);
- curl_close ($ch);
- function progress_function($ch,$str) {
- echo $str;
- return strlen($str);
- }
The callback function MUST return the exact length of the string it received; otherwise cURL aborts the transfer.
As the URL response is being fetched, the callback function is called every time a packet of data is received.
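Because the contract is simply “take the chunk, return its length”, the callback itself can be exercised without any request at all. A sketch (the function name is my own) that accumulates chunks into a buffer:

```php
<?php
// collect response chunks into a global buffer; cURL aborts the
// transfer if the returned length differs from strlen($str)
$buffer = "";
function collect_chunk($ch, $str) {
    global $buffer;
    $buffer .= $str;
    return strlen($str);
}

// simulate cURL delivering two packets (no real handle needed here)
collect_chunk(NULL, "Hello, ");
collect_chunk(NULL, "world");
echo $buffer . "\n"; // Hello, world

// in a real request you would register it like this:
// curl_setopt($ch, CURLOPT_WRITEFUNCTION, "collect_chunk");
```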
Conclusion
We have explored the power and the flexibility of the cURL library today. I hope you enjoyed and learned from this article. Next time you need to make a URL request in your web application, consider using cURL.
Thank you and have a great day!