If you're like me, you've probably read a dozen or two articles about PHP performance in your career. Many of them are quite good, but some are simply flat out wrong, or misinformed.
One of the old truisms that has been repeated for as long as I can recall is "don't use readfile() if you have big files, because it reads the whole file into memory and your server will explode." The usual advice is to manually stream a file, like so:
<?php
// The manual approach: stream the file to the browser 1 KB at a time.
$fp = fopen('bigfile.tar', 'rb');
while (!feof($fp)) {
  print fread($fp, 1024);
}
fclose($fp);
?>
There's just one problem with that age-old truism: It's not true.
Recently I was running some benchmarks to compare the performance of various ways to read big files off disk in PHP. PHP has no shortage of tools for shoveling data from disk to browser. I count at least four possible approaches, including the manual approach above:
<?php
// Copy the source stream directly to the output stream.
$src = fopen($file, 'rb');
$dest = fopen('php://output', 'a');
stream_copy_to_stream($src, $dest);
fclose($src);
fclose($dest);
?>
<?php
// Open the file and dump the rest of it straight to the output buffer.
$fp = fopen($file, 'rb');
fpassthru($fp);
fclose($fp);
?>
<?php
// Read the file and write it to the output buffer in a single call.
readfile($file);
?>
So I set up some test scripts to bang away at, each delivering a 20 MB file. I figured I'd see some sort of variation between the different approaches, with some CPU vs. memory trade-off. Instead, averaged over 50 runs, this is what I found:
| Method | Runtime (sec) | Peak Mem (bytes) |
|---|---|---|
| fread() | 0.053484 | 786432 |
| fpassthru() | 0.035846 | 786432 |
| readfile() | 0.036068 | 786432 |
| stream_copy_to_stream() | 0.036272 | 786432 |
(Peak memory numbers are as reported by memory_get_peak_usage(TRUE); memory_get_peak_usage() was slightly lower and varied a bit more, but not by a statistically significant amount.)
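The actual test scripts aren't included here, but a measurement harness along these lines would produce comparable numbers (a rough sketch; the file name is a placeholder, and you'd swap in each delivery method in turn):
<?php
// Time one delivery method and record peak memory for a single request.
$file = 'bigfile.tar'; // the ~20 MB test file

$start = microtime(TRUE);
readfile($file); // swap in the fread() loop, fpassthru(), or stream_copy_to_stream()
$elapsed = microtime(TRUE) - $start;

error_log(sprintf('Runtime: %f sec, peak memory: %d bytes',
  $elapsed, memory_get_peak_usage(TRUE)));
?>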
The "real" memory usage of all four methods was identical. The CPU time was nearly the same for fpassthru(), readfile(), and stream_copy_to_stream(), but about 50% higher with fread(). I even tried cranking my PHP memory limit way down, below the 20 MB file size I was using, to cause it to crash. Nada.
As it turns out, readfile(), fpassthru(), and stream_copy_to_stream() are nearly identical internally. All of them go through the PHP streams API, and in fact end up in the same internal operation, php_stream_passthru(). Depending on your OS, that will either use mmap() or do its own iteration in 8 KB chunks. That is, in the worst case it does exactly the same thing you would have done manually, but all in C code, so it's faster. If your OS supports it, it will use mmap(), an OS-level operation that allows contents on disk to be read as if they were in RAM; the OS takes care of paging the file in and out of physical memory as needed.
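In userland terms, that chunked fallback is essentially the manual loop from the top of this post, just with PHP's internal 8 KB chunk size (a sketch for illustration only; you wouldn't actually write this, since readfile() already does it in C):
<?php
// Roughly what php_stream_passthru() does when mmap() isn't available:
// copy the stream in 8 KB chunks, never holding more than one chunk in memory.
$fp = fopen('bigfile.tar', 'rb');
while (!feof($fp)) {
  echo fread($fp, 8192);
}
fclose($fp);
?>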
So why the stigma around readfile()? There are a couple of possible reasons that I can see.
- top. The top command, available on virtually any *nix, reports, among other things, the memory used by running processes. It's a very old tool, however, and when a process uses memory-mapped (mmap) files it doesn't know how to differentiate between actual memory usage and the mmapped file. That is, it may over-report the memory the file is using.
- Output buffering. PHP optionally supports buffering of output sent by a script. There are various good and bad reasons to do so, but one side effect is that data is not sent until the program says so. If you're sending a large file, that file gets buffered like any other output rather than being sent. That is, it's not PHP's normal working memory that overflows but the output buffer. Since the buffer isn't reported separately, though, that can lead people to misinterpret the root cause. (See the sketch just after this list.)
- "It's common knowledge." Let's face it, most PHP tutorials online either date from the Bill Clinton presidency or are derived from other tutorials that old. Some of the PHP introductory material online is truly god-awful. (If you ever find a tutorial that suggests using the mysql_*() functions, close it immediately.) It's possible (although I did not dig through ancient C code to confirm it) that somewhere in PHP's dark past readfile() was less smart than it is now, or perhaps operating systems were less capable. So warnings about readfile() may have been valid at some point, but they have not been valid since PHP 5.0, at the very least.
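To make the output buffering point concrete, here's a sketch of the failure mode and the fix (the file name is a placeholder). With a buffer active, everything readfile() emits piles up in the buffer instead of going to the client; flushing and disabling the buffers first keeps memory flat.
<?php
// Flush and disable any active output buffers before streaming a large file.
while (ob_get_level() > 0) {
  ob_end_flush();
}
readfile('bigfile.tar');
?>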
Or maybe none of those. I'm not sure. In any event, readfile() is not the enemy. In fact, if you're sending a file to the browser then any of the available automated functions are perfectly fine... except for looping on fread() yourself, which is the slowest option available. Even if you need to do some processing on the data as you go, you're better off implementing a stream filter.
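If you haven't written one before, a userland stream filter looks roughly like this (a minimal sketch; the class name, filter name, and strtoupper() transformation are just placeholders for whatever processing you actually need):
<?php
class UppercaseFilter extends php_user_filter {
  // Called repeatedly as data streams past; transform each bucket and pass it on.
  public function filter($in, $out, &$consumed, $closing) {
    while ($bucket = stream_bucket_make_writeable($in)) {
      $bucket->data = strtoupper($bucket->data);
      $consumed += $bucket->datalen;
      stream_bucket_append($out, $bucket);
    }
    return PSFS_PASS_ON;
  }
}

stream_filter_register('demo.uppercase', 'UppercaseFilter');

$fp = fopen('bigfile.txt', 'rb');
// The filter runs on each chunk as fpassthru() streams it, so memory stays flat.
stream_filter_append($fp, 'demo.uppercase');
fpassthru($fp);
fclose($fp);
?>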
If you're feeling more adventurous, there's also mod_xsendfile for Apache (similar functionality is built into lighttpd), which should be even faster. I haven't worked with it myself, though, so caveat programmer.
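For reference, that hand-off looks something like this (a sketch, assuming mod_xsendfile is installed and enabled; the path and headers are placeholders): PHP does the access checking, then tells the web server which file to send, and the server streams it itself.
<?php
// Check permissions, log the download, etc. here, then hand off to Apache.
header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="bigfile.tar"');
header('X-Sendfile: /var/files/bigfile.tar');
// Send no body; mod_xsendfile replaces the response body with the file itself.
?>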
Larry not considered harmful
Wow. I'm really surprised by this, especially since I'm quite sure that fpassthru() snippet has saved me a few times before. Not so much in Drupal, but in other third-party scripts using the readfile() method. Customers tried to use them to send large (100MB+) files and this almost always failed. Adapting the script to use fpassthru() was then a quick fix which solved the problem. I'm sure these scripts were so simple that they didn't use output buffering. Could there be another explanation for this? (Testing is nice to show the presence of a problem, but not the absence of one.)
Buffering maybe
I asked about this issue on the php-internals mailing list before posting this. Everyone there agreed that they had no idea why readfile() would cause memory issues, except for one person who, we concluded, had output buffering enabled. So that's just about the only issue we could think of.
Thanks!
I have had this "truth" stored at the back of my mind for a long time and would probably not have tested it for a long while more. Thanks for debunking it!
For file handling though, I would recommend looking into using the Gaufrette library: https://github.com/KnpLabs/Gaufrette
It's object oriented and has a special in-memory adaptor, which makes testing your implementation very easy.
Since I'm a sadist, I'd be
Since I'm a sadist, I'd be curious how those numbers would change if you were to add a big old recursive scandir that then calls those functions multiple times. Generally, my rule of thumb is "reading and writing anything is bad, m'kay... ...but sometimes you gotta read and write things."
That seems interesting, but
That seems interesting, but they will of course all display the same peak memory usage if that's all you're doing with your script or server.
A more interesting question might be what happens under load, specifically how much of that memory is marked as retrievable by other processes.
You're testing the wrong use case
I think you're overestimating the usage of read-a-file-and-dump-it-to-screen here.
A lot of times, a file is read to be processed somehow.
The lovely one-liner would be file_get_contents(), not readfile().
And yes, it would eat your memory for breakfast.
So for anything but trivial cases you would have to work on the file one line/chunk at a time.
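For example, something along these lines keeps memory flat regardless of file size (a sketch; the file name is a placeholder):
<?php
// Process a large text file one line at a time instead of file_get_contents().
$fp = fopen('bigfile.log', 'rb');
while (($line = fgets($fp)) !== FALSE) {
  // ... process $line here ...
}
fclose($fp);
?>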
I don't think Larry is
I don't think Larry is missing the use case of loading a file to be processed and what that entails. That's an entirely valid but separate use case.
In Drupal (where Larry is an initiative lead) there are a number of cases where files are read and delivered to users through PHP. For example, in Drupal core there is the ability to have private files (uploads/downloads - you need auth permission to obtain them). In order to handle these private files, they are served through PHP and stored in a folder not in the web root.
Another case: there are a number of contributed modules that track downloads of files. To do this, the download is passed through a PHP file which increments a counter and then passes the file on to the user.
This post is directly relevant to the Drupal internal function file_transfer.
Just different
As Matt said, that's a different use case. There's plenty of use cases for "grab file on disk, give to browser, done". There's also lots of use cases for "grab file on disk, mess with it, then give that to the browser, done." I was only looking at the first case here.
As I note toward the end, though, if possible you're probably better off putting your processing into a stream filter in many cases. It may be a bit weirder if you're not used to it, but that allows PHP to do all of the hard work and memory management for you. You just provide the translation logic in a reusable fashion.
Streams
According to http://php.net/readfile, readfile() supports streams the same way fopen() does, so is that still a reason to avoid fread()? I created an issue to change how Drupal does it.
Processing files is a separate issue, but I think it's similar enough that it's probably why people thought readfile() would also be slow. At least it was for me until I read this. Thanks!
Not quite urban legend
I'm not sure whether output_buffering defaulted to on or off over the years.
Somebody can check the various changes to php.ini* and find out.
But there were a ton of "benchmarks" to prove one or the other was faster, with caveats for filesizes and hardware and networking bandwidth and apache config and...
So, many users picked ON and many picked OFF, based on which benchmark they believed.
Or even figured out that only benchmarking their app/hardware/network would give any real correct answer.
But I digress...
If you set output_buffering "wrong" you get the whole file in RAM, and run out of memory.
If not, readfile() performs just like fpassthru() et al.
This may be a "Duh" to most people reading your blog, but it's not quite so obvious a connection to every PHP scripter.
Except file_get_contents(), which clearly attempts to load the whole file into RAM. If you don't get that one, stop writing code. :-)