RSS Feeds for FTP Servers
March 22, 2006
The applications for RSS have extended far beyond a way to distribute news items. RSS is now used for everything from tracking packages to car dealer inventories. These reflect one of the great aspects of RSS: you can use it to tell you when something happens that you care about, rather than having to go check for yourself. In that spirit, this article will show you how to write a PHP script that will monitor an FTP server for you, notifying you of the newest files added or changed.
PHP, FTP, and Thee
With all the emphasis on web-related functionality, the FTP commands in PHP are often overlooked. The good news is that these functions are included with standard PHP 4, so no external libraries are required.
It is important to make sure your PHP install has the FTP functions enabled, however. To do this, use phpinfo() in a simple file to see what has been enabled:
<?php phpinfo(); ?>
When you view the above script in your web browser, you'll see the classic PHP Info page with nearly all the configuration information you'll ever need. Scroll down to the FTP section to see if "FTP support" has been "enabled." It should look something like this:
Figure 1. PHP Info showing FTP is enabled
(If the FTP functions are not enabled, you'll need to arrange either to have them enabled or to host this tutorial's script on a different server.)
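If you would rather check from code than scan the PHP Info page, a minimal sketch along these lines may help. The function name ftpSupportAvailable() is made up for this example; extension_loaded() and function_exists() are standard PHP.

```php
<?php
// Hypothetical helper (not part of ftp_monitor.php): reports whether
// PHP's FTP extension is loaded and its core function is callable.
function ftpSupportAvailable()
{
    return extension_loaded('ftp') && function_exists('ftp_connect');
}

echo ftpSupportAvailable()
    ? "FTP support is enabled\n"
    : "FTP support is NOT enabled\n";
```

This only tells you that the functions exist; it says nothing about whether your host's firewall lets outbound FTP traffic through.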
It is a lot to absorb, but the PHP Manual on FTP functions is a useful reference. You may want to keep it close at hand while going through this tutorial. (And if you have insomnia, you could always try reading RFC 959, the specification for FTP itself.)
Know the Code
For this tutorial, we will create a PHP script called ftp_monitor.php. We will go through the script piece by piece, but you might also want to download the complete source code for reference.
The exploreFtpServer() Function
Let's start with the heart of the FTP functionality, encapsulated in the exploreFtpServer() function. This will take parameters for the FTP hostname, username, password, and initial path. The purpose of this function is to explore the FTP server, recurse through any directories, and then return an associative array of filenames and corresponding file timestamps.
After declaring the function signature, use PHP's ftp_connect() to attempt a connection to the FTP server. If the connection is created, we will keep the connection ID in the variable $cid to be used with all of PHP's FTP-related functions.
    function exploreFtpServer($host, $user, $pass, $path) {
        // Connect
        $cid = ftp_connect($host) or die("Couldn't connect to server");
For the sake of simplicity, the above code will summarily halt script execution if an FTP connection cannot be established. After you have used this script for a while, you may want to add more robust error handling.
Once we have a connection, we will attempt to authenticate with the server by using the PHP function ftp_login(). Like most of PHP's FTP functions, its first argument is the connection ID ($cid). This particular one also takes a username and password.
If the login is successful, we'll use ftp_pasv() to tell the FTP server we are going to use passive mode. In passive mode the script initiates every connection, including the data connections, so it can run from behind a firewall.
Now that the connection is all set up, we can recurse through the FTP server directories, starting at the specified path. We'll use a scanDirectory() function to accomplish this, to be written after this one.
Finally, whether authentication worked or not, we'll need to close the connection to the server using ftp_close(). Again, this example will halt the script with die() if authentication fails, but you may choose to handle the failure in a different way.
        // Login
        if (ftp_login($cid, $user, $pass)) {
            // Passive mode
            ftp_pasv($cid, true);
            // Recurse directory structure
            $fileList = scanDirectory($cid, $path);
            // Disconnect
            ftp_close($cid);
        } else {
            // Disconnect
            ftp_close($cid);
            die("Couldn't authenticate.");
        }
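If die() is too blunt for your purposes, one possible alternative is to throw an exception and let the calling script decide what to do. The following is only a sketch: openFtpConnection() is a hypothetical helper that is not part of ftp_monitor.php, the $port and $timeout defaults are assumptions, and exceptions require PHP 5 or later.

```php
<?php
// Hypothetical helper: opens an authenticated, passive-mode FTP connection,
// or throws a RuntimeException instead of halting the script with die().
function openFtpConnection($host, $user, $pass, $port = 21, $timeout = 10)
{
    if (!function_exists('ftp_connect')) {
        throw new RuntimeException('PHP FTP extension is not enabled');
    }
    // @ suppresses the PHP warning; the failure is reported by the exception.
    $cid = @ftp_connect($host, $port, $timeout);
    if ($cid === false) {
        throw new RuntimeException("Couldn't connect to $host");
    }
    if (!@ftp_login($cid, $user, $pass)) {
        // Always release the connection before reporting the failure.
        ftp_close($cid);
        throw new RuntimeException("Couldn't authenticate as $user");
    }
    ftp_pasv($cid, true);
    return $cid;
}
```

A caller would wrap openFtpConnection() in a try/catch block, log the message, and perhaps emit an empty feed rather than a broken page.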
At this point we have a populated $fileList variable, which is an associative array. The keys are the file names, and the values are the timestamps of the files. This array will be most useful if sorted by timestamp (newest first), so sort it with arsort() and return it.
        // Sort by timestamp, newest (largest number) first
        arsort($fileList);
        // Return the result
        return $fileList;
    }
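To see what arsort() does to an array shaped like $fileList, here is a small standalone illustration. The paths and timestamps are invented sample data, not output from a real server.

```php
<?php
// Invented sample data: path => Unix timestamp, the same shape
// that exploreFtpServer() builds.
$fileList = array(
    '/pub/a.txt' => 1142900000,
    '/pub/c.txt' => 1143000000,
    '/pub/b.txt' => 1142950000,
);

// Sort by value, largest (newest) first, keeping each path attached to its time.
arsort($fileList);

$newestFirst = array_keys($fileList);
print_r($newestFirst);
```

Because arsort() preserves keys, each path stays paired with its timestamp, which is what the feed generator relies on later.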
The scanDirectory() Function
Now we are ready to write the scanDirectory() function, which is called from our exploreFtpServer() described above. The purpose of this function is to scan an FTP directory for files and subdirectories, adding the former to a list and recursing through the latter. The parameters to pass in are the FTP connection ID ($cid) and a starting directory ($dir). We'll also declare a static variable $fileList which will be used to retain our file list across recursive calls to the function.
To get the contents of a given directory in FTP, we'll use the ftp_nlist() function. Unfortunately, this function isn't perfect. On most FTP servers I have tested, it will return a list of file and directory names. But there are a few, like WU-FTPD, which only return a listing of filenames. On such servers our script can monitor only the initial directory provided; no subdirectories will be monitored.
The alternative to ftp_nlist() is ftp_rawlist(), which should provide all directory contents regardless of type. Unfortunately, the format of data returned by ftp_rawlist() does not seem to be standardized, so any attempt to come up with a "universal parser" is a daunting task. (Read the user comments on ftp_rawlist() to see what I mean.) Thus, for the sake of the tutorial, we'll stick with the imperfect but much simpler ftp_nlist().
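For the curious, here is roughly what parsing a single ftp_rawlist() line could look like if your server happens to return Unix-style ("ls -l") listings. This is a sketch under that assumption only: parseUnixListLine() is a hypothetical helper, the regular expression is not a universal parser, and the tutorial script does not use it.

```php
<?php
// Hypothetical parser for ONE line of a Unix-style ("ls -l") listing.
// Returns array('name' => ..., 'isDir' => ...) or null when the line
// does not match the assumed format (for example, a DOS-style listing).
function parseUnixListLine($line)
{
    // Fields: type+permissions, links, owner, group, size,
    // month, day, time-or-year, name
    $pattern = '/^([\-dl])[rwxsStT\-]{9}\s+\d+\s+\S+\s+\S+\s+\d+'
             . '\s+\w{3}\s+\d{1,2}\s+[\d:]{4,5}\s+(.+)$/';
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    return array('name' => $m[2], 'isDir' => ($m[1] === 'd'));
}

// Invented sample line, for illustration only.
print_r(parseUnixListLine('drwxr-xr-x   2 ftp  ftp      4096 Mar 22  2006 snapshots'));
```

Lines the pattern does not recognize come back as null, so a caller can skip them instead of guessing. The tutorial itself continues with ftp_nlist().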
    function scanDirectory($cid, $dir) {
        // Use static value to collect results
        static $fileList = array();
        // Get a listing of directory contents
        $contents = ftp_nlist($cid, $dir);
The $contents variable is now populated with a simple array of file and directory names. Depending on the server, these items may or may not contain the path of the file itself. (We may get "foo.txt", or we may get "/i/pity/the/foo.txt".)
This next section of scanDirectory() will iterate through each name and use ftp_size() to determine whether the name is a file or directory. (This is a cheap trick: directories return a size of -1.) If the item is a file, we'll prepend a leading slash if needed to keep our paths consistent, then use the ftp_mdtm() function to get its modification timestamp. We'll then add the filename as the key in our $fileList associative array, and use its timestamp as the value:
        // Iterate through the directory contents
        if ($contents != null) {
            foreach ($contents as $item) {
                // Is the item a file?
                if (ftp_size($cid, $item) >= 0) {
                    // Prepend slash if not present
                    if ($item[0] != "/") $item = "/" . $item;
                    // Add file and modify timestamp to results
                    $fileList[$item] = ftp_mdtm($cid, $item);
                }
Now we'll need to deal with an item returned by ftp_nlist() that is a directory. We'll be sure to ignore aliases to the same or parent directory. If we have a usable directory name, we can call scanDirectory() to recurse into it. (This requires some extra logic to handle variations among servers that use full or relative paths.) With both files and directories handled, we can return the $fileList containing every file found thus far. Here's how it all looks:
                else // Item is a directory
                {
                    // Exclude self/parent aliases
                    if ($item != "." && $item != ".." && $item != "/") {
                        // Server uses full path names
                        if ($item == strstr($item, $dir)) {
                            scanDirectory($cid, $item);
                        } else {
                            // Server uses relative path names
                            if ($dir == "/") {
                                scanDirectory($cid, $dir . $item);
                            } else {
                                scanDirectory($cid, $dir . "/" . $item);
                            }
                        }
                    }
                }
            }
        }
        // Return the results
        return $fileList;
    }
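The full-path versus relative-path handling above can also be pulled out into a small function, which makes it easier to test without an FTP server. The sketch below is one way to do that; joinFtpPath() is a hypothetical helper, and it assumes the server returns either bare names ("foo.txt") or absolute paths ("/pub/foo.txt"), nothing in between.

```php
<?php
// Hypothetical helper: builds an absolute path for $item, an entry
// returned while listing the directory $dir.
function joinFtpPath($dir, $item)
{
    // Already absolute: the server returned a full path, so use it as-is.
    if ($item !== '' && $item[0] === '/') {
        return $item;
    }
    // Bare name: append it to the directory with exactly one slash between.
    return rtrim($dir, '/') . '/' . $item;
}

echo joinFtpPath('/', 'foo.txt'), "\n";
echo joinFtpPath('/pub', 'foo.txt'), "\n";
echo joinFtpPath('/pub', '/pub/foo.txt'), "\n";
```

Normalizing every name this way before calling ftp_size(), ftp_mdtm(), or scanDirectory() would keep the keys in $fileList consistent regardless of which style the server uses.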
The generateRssFeed() Function
This has started to feel like a PHP tutorial, hasn't it? Good news: the hard part is over, and now all that remains is to write the function that actually generates the RSS feed.
This function is the one that we'll call directly from another PHP script, passing in all the parameters needed to make it work:
$host: The FTP server hostname. Example: "ftp.foo.com".
$user: The FTP username. Example: "anonymous".
$pass: The FTP user password. Example: "guest".
$path: The starting directory on the server. Example: "/pub/crawl".
$itemCount: The number of items to return in the RSS feed.
The first thing we'll do is call our exploreFtpServer() function to get the list of files and their timestamps from the FTP server. Once the list is returned, we can see whether the list is shorter than the $itemCount passed in, and use the smaller number of the two:
    function generateRssFeed($host, $user, $pass, $path, $itemCount) {
        // Get array of file/time arrays
        $fileList = exploreFtpServer($host, $user, $pass, $path);
        // Use user's count or # of files found, whichever is smaller
        if (count($fileList) < $itemCount) $itemCount = count($fileList);
We have a few variables to declare before continuing. First is a $linkPrefix to hold the hostname prefixed with the FTP protocol. Next is a $channelPubDate which will hold the publication date of the RSS feed. We will also create an $items array to hold the RSS items we create.
        // Create link prefix for feed
        $linkPrefix = 'ftp://' . $host;
        // Declare date for channel/pubDate
        $channelPubDate = null;
        // Array for item strings
        $items = array();
We're now ready to grab the most recent files and create RSS items for each one. For this tutorial, we're going to keep it nice and simple: Each item will have a title, link, and date. Feel free, however, to spice up your feed with more information. (One fun idea would be to display a file icon which matches the file's extension.)
As you'll recall, the $fileList returned from exploreFtpServer() is sorted by timestamp, newest file first. As we loop through the array we'll use the timestamp to create the publication date of the RSS item. The first (newest) timestamp will also be used to create the publication date for the RSS feed as a whole.
        // Create array of RSS items from most recent files
        foreach ($fileList as $filePath => $time) {
            // Create item/pubDate according to RFC822
            $itemPubDate = date("r", $time);
            // Also use first item/pubDate as channel/pubDate
            if ($channelPubDate == null) $channelPubDate = $itemPubDate;
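One detail worth knowing here: ftp_mdtm() returns -1 when the server cannot report a modification time, and formatting -1 as a date produces a misleading pubDate. A possible guard is sketched below. formatPubDate() is a hypothetical helper, and it formats in GMT with gmdate() rather than using date("r") as the tutorial does, purely so the result does not depend on the web server's time zone.

```php
<?php
// Hypothetical helper: returns an RFC 822 style date in GMT, or null
// when the timestamp is unusable (ftp_mdtm() returns -1 on failure).
function formatPubDate($time)
{
    if (!is_int($time) || $time < 0) {
        return null;
    }
    return gmdate('D, d M Y H:i:s', $time) . ' GMT';
}

echo formatPubDate(0), "\n"; // Thu, 01 Jan 1970 00:00:00 GMT
```

Inside the loop, an item whose date comes back as null could simply be skipped or emitted without a pubDate element.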
Next we'll create the file's URI, starting with "ftp://". This $fileUri variable will be used to populate both the RSS item title and the link. We should replace any spaces in the filenames with the encoded value of "%20" to ensure the URI is well-formed.
Now that we have all the information we need, it is time to create the XML for each RSS item. When that is done, we'll add it to our $items array for use later on. We will also be sure to end this loop if we have reached the $itemCount threshold.
            // Create URI to ftp file
            // (str_replace() is used here because ereg_replace() was removed in PHP 7)
            $fileUri = str_replace(" ", "%20", $linkPrefix . $filePath);
            // Create item
            $item = '<item>'
                . '<title>' . $fileUri . '</title>'
                . '<link>' . $fileUri . '</link>'
                . '<pubDate>' . $itemPubDate . '</pubDate>'
                . '</item>';
            // Add to item array
            array_push($items, $item);
            // If max items for feed reached, stop
            if (count($items) == $itemCount) break;
        }
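Replacing spaces covers the most common case, but a filename containing characters such as "&" or "<" would still produce a feed that is not well-formed XML. A more defensive variation might look like the sketch below. Both function names are hypothetical, and unlike the tutorial code this version uses the readable path for the title and the percent-encoded path for the link.

```php
<?php
// Hypothetical helper: percent-encodes each path segment but keeps the slashes.
function encodeFtpPath($path)
{
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

// Hypothetical helper: builds one <item> element with XML-escaped text.
function buildRssItem($linkPrefix, $filePath, $pubDate)
{
    // Readable title; htmlspecialchars() escapes &, <, > and quotes.
    $title = htmlspecialchars($linkPrefix . $filePath, ENT_QUOTES, 'UTF-8');
    // Link with every path segment percent-encoded, then XML-escaped.
    $link = htmlspecialchars($linkPrefix . encodeFtpPath($filePath), ENT_QUOTES, 'UTF-8');
    return '<item>'
        . '<title>' . $title . '</title>'
        . '<link>' . $link . '</link>'
        . '<pubDate>' . htmlspecialchars($pubDate, ENT_QUOTES, 'UTF-8') . '</pubDate>'
        . '</item>';
}

echo buildRssItem('ftp://ftp.example.com', '/pub/a & b.txt', 'Thu, 01 Jan 1970 00:00:00 GMT'), "\n";
```

The host name ftp.example.com and the file path are placeholders for illustration.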
Finally we get to create the RSS feed itself. Building XML in PHP using strings is easy, but rarely pretty to look at. Note that we're using join() to add in all of our RSS items, with a line break after each. (The line breaks aren't necessary, but they make the feed easier to read when you're troubleshooting.)
        // Build the RSS feed
        $rss = '<rss version="2.0">'
            . '<channel>'
            . '<title>FTP Monitor: ' . $host . '</title>'
            . '<link>' . $linkPrefix . '</link>'
            . '<description>The ' . $itemCount . ' latest changes on '
            . $host . $path . ' (out of ' . count($fileList) . ' files)</description>'
            . '<pubDate>' . $channelPubDate . '</pubDate>' . "\n"
            . join("\n", $items) . "\n"
            . '</channel>'
            . '</rss>';
The feed is ready to go. All that remains is to set the HTTP header to indicate that we are returning an XML document, then output the feed:
        // Set header for XML mime type
        header("Content-type: text/xml; charset=UTF-8");
        // Display RSS feed
        echo($rss);
    }
And that's the last of the real work. If you haven't done so yet, be sure to download the complete source code of ftp_monitor.php to see it all in one place.
Put It to Work
After placing ftp_monitor.php on your PHP-enabled web server, you can reference it from any other PHP script. Here is an example of how that might look:
    <?php
    // Import the FTP Monitor
    require_once('ftp_monitor.php');
    // Connection params to monitor FreeBSD snapshots
    $host = "ftp.freebsd.org";
    $user = "anonymous";
    $pass = "guest@anon.com";
    $path = "/pub/FreeBSD/snapshots";
    // Generate RSS 2.0 feed showing newest FreeBSD snapshots
    generateRssFeed($host, $user, $pass, $path, 10);
    ?>
Here is a sample output file from the above connection parameters: freebsd.xml. When viewed in SharpReader, the items look like this:
Figure 2. FTP Monitor items for ftp.freebsd.org
Many RSS aggregators will automatically follow an item's link if it does not have a description element. SharpReader is one of these aggregators, and it also supports the ftp:// protocol. Thus, clicking on one of the items from our FTP monitor will start to download it automatically. This usually works just fine if the FTP server allows anonymous connections. If you had to provide a real username and password in ftp_monitor.php, however, your ability to "click and download" will depend on whether your RSS reader can prompt you for FTP credentials.
Enhance Your Performance
There are a few caveats to keep in mind when using this script. First, many FTP servers aren't exactly speedy, so the performance of this script will be bound by the response times of the FTP commands themselves. Simply put, the more directories it has to recurse, the longer it will take. Try to limit the scope of what you need to monitor.
Second, this script is not intended to be hit by a lot of concurrent users. The speed issue is one factor, but the other is FTP connections. For every concurrent hit to this script, a connection is made to the FTP server. It won't take much for the available connections to max out. So, if you want to provide an RSS feed to a lot of users, you should hide this script behind a cache which calls it on a periodic basis. Let the real load be handled by your cache, not the ftp_monitor script.
If you keep these constraints in mind, you can provide a nice service to your users which provides the information they need without frustrating response times.
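As a starting point for the caching idea, here is a minimal file-based sketch. cachedFeed() is a hypothetical helper, the cache file location and maximum age are up to you, and the closure in the usage notes requires PHP 5.3 or later.

```php
<?php
// Hypothetical helper: returns the cached content if it is younger than
// $maxAge seconds; otherwise calls $generate, stores the result, and returns it.
function cachedFeed($cacheFile, $maxAge, $generate)
{
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        return file_get_contents($cacheFile);
    }
    // Cache is missing or stale: regenerate (this is the slow FTP step).
    $content = call_user_func($generate);
    file_put_contents($cacheFile, $content, LOCK_EX);
    return $content;
}
```

Because generateRssFeed() echoes the feed rather than returning it, the $generate callback would need to capture that output, for example by calling ob_start() before generateRssFeed() and returning the result of ob_get_clean(). The calling script then echoes whatever cachedFeed() returns, so only one request per $maxAge interval actually touches the FTP server.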
Give It Back
If you find this script useful, or if you come up with a cool modification to it, I'd love to hear from you. Post a comment and share what you have learned with the xml.com community.