Question: Nowadays it is more practical to purchase an eBook than the dead-tree version. But the PDFs frequently contain the blank pages used by the print edition. I typically see between 10 and 30 blank pages (or pages with the text "This page intentionally left blank.") per eBook. Is it possible to programmatically remove these blank pages?
So the hard part is identifying the blank pages. pdftotext would work for the most part, except where a page has only images and no text. Also, even though removing many pages makes the resulting file smaller, after shrinking both the original file and the new version (using various methods found online), the shrunken original is usually smaller by several hundred KB or more. So it appears the method I'm using to remove the blank pages doesn't create an optimal PDF. I've also tried various programs and see the same results in this respect.
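To illustrate the pdftotext idea, here is a minimal sketch in Python. It assumes the text has already been extracted with pdftotext, which separates pages with a form feed character. The function name and the exact wording matched for the "intentionally left blank" notice are my own choices, not part of any tool. As noted, a page that contains only images will also come out as blank here, so the result needs checking before anything is deleted.

```python
import re
from typing import List

# Hypothetical pattern for the usual notice; adjust to the wording in your eBooks.
BLANK_NOTICE = re.compile(
    r"^this page (is )?intentionally left blank\.?$", re.IGNORECASE
)


def find_blank_pages(pdftotext_output: str) -> List[int]:
    """Return 1-based numbers of pages with no text or only the blank-page notice."""
    pages = pdftotext_output.split("\f")
    # pdftotext ends every page with a form feed, so the final piece is not a page.
    if pages and pages[-1].strip() == "":
        pages.pop()

    blank = []
    for number, page in enumerate(pages, start=1):
        text = " ".join(page.split())  # collapse whitespace and line breaks
        if text == "" or BLANK_NOTICE.match(text):
            blank.append(number)
    return blank
```

The returned page numbers could then be handed to whatever tool is used to drop pages from the PDF.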
Answer: I don't know of a free, open source solution that can detect and remove blank pages. However, the commercial VeryDOC HTML Converter can automatically remove blank pages, both vector and scanned. For scanned pages, it can remove scan artifacts such as black edges, hole punches and noise before determining whether a page is blank. While removing the blank pages, the software does not damage or compress the input PDF file. You can try HTML Converter for free and then decide whether to pay for it. Below, I will show you how to use this software.
There are two versions of this software: a GUI version and a command line version. For removing all blank pages from a PDF, the command line version is the better choice. The download is a zip file. Extract it to a folder and you will find the executable file, which you can call from an MS-DOS command prompt window. The software can also be used together with ASP, VB, VC, Delphi, BCB, Java, .NET, COM+ and so on, so you can use it programmatically.
Here is the usage for your reference: htmltools [options] <EMF-WMF-HTML-URL-RTF-file> [<PDF-PS-Image-file>]
To remove all blank pages from a PDF, use the following command line template:
htmltools.exe -noempty -mergepdf C:\test.pdf C:\out.pdf
Run this software from an MS-DOS command prompt window, passing the parameters -noempty -mergepdf followed by the full paths of the input PDF and the output PDF file. This removes all blank pages from the PDF. The related parameters are:
-noempty : Delete empty pages from PDF file
-mergepdf <string> : Merge two PDF files into one PDF file
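Because the command line version can be called from other programs, the template can also be driven from a script. The following is only a sketch in Python. It assumes htmltools.exe is on the PATH and accepts exactly the arguments shown in the template, and that it signals failure through a non-zero exit status. The helper names are hypothetical and not part of the product.

```python
import subprocess
from typing import List


def build_noempty_command(
    input_pdf: str, output_pdf: str, exe: str = "htmltools.exe"
) -> List[str]:
    """Build the argument list from the template: exe -noempty -mergepdf <in> <out>."""
    return [exe, "-noempty", "-mergepdf", input_pdf, output_pdf]


def remove_blank_pages(
    input_pdf: str, output_pdf: str, exe: str = "htmltools.exe"
) -> None:
    """Run the tool once for a single input/output pair."""
    # check=True raises CalledProcessError on a non-zero exit status
    # (assumption: the tool reports errors that way).
    subprocess.run(build_noempty_command(input_pdf, output_pdf, exe), check=True)
```

For example, remove_blank_pages(r"C:\test.pdf", r"C:\out.pdf") would run the same command as the template.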
If you need to know more about the parameters and functions of this software, please visit its homepage. You can now use the command line template to remove all blank pages from a PDF. If you have any questions while using it, please contact us as soon as possible.