Splitting and Merging Pdf Files in C# Using iTextSharp

Posted on March 9 2013 11:44 AM by John Atten in CodeProject, C#   ||   Comments (3)

I recently posted about using PdfBox.net to manipulate Pdf documents in your C# application. This time, I take a quick look at iTextSharp, another library for working with Pdf documents from within the .NET framework.

Some Navigation Aids:

What is iTextSharp?

iTextSharp is a direct .NET port of the open source iText Java library for PDF generation and manipulation. As the project’s summary page on SourceForge states, iText “  . . . can be used to create PDF Documents from scratch, to convert XML to PDF . . . to fill out interactive PDF forms, to stamp new content on existing PDF documents, to split and merge existing PDF documents, and much more.”

iTextSharp presents a formidable set of tools for developers who need to create and/or manipulate Pdf files. This does come with a cost, however. The Pdf file format itself is complex; therefore, programming libraries which seek to provide a flexible interface for working with Pdf files become complex by default. iText is no exception.

I noted in my previous post on PdfBox that PdfBox was a little easier for me to get up and running with, at least for rather basic tasks such as splitting and merging existing Pdf files. I also noted that iText looked to be a little more complex, and I was correct. However, iTextSharp does not suffer some of the performance drawbacks inherent to PdfBox, at least on the .net platform.

Superior Performance vs. PdfBox

Aston-Martin-V8-Sports-Car-For-EveryAs I observed in my previous post, PdfBox.net is NOT a direct port of the PdfBox Java library, but instead is a Java library running within .net using IKVM. While I found it very cool to be able to run Java code in a .NET context, there was a serious performance hit, most notably the first time the PdfBox library was called, and the massive IKVM library spun up what amounts to a .Net implementation of the Java Virtual Machine, within which the Java code of the PdfBox library is then executed.

Needless to say, iTextSharp does not suffer this limitation. the library itself it relatively lightweight, and fast.

Extracting and Merging Pages from an Existing Pdf File

One of the most common tasks we need to do is extract pages from one Pdf into a new file. We’ll take a look at some relatively basic sample code which does just that, and get a feel for using the iTextSharp programming model.

In the following code sample, the primary iTextSharp classes we will be using are the PdfReader, Document, PdfCopy, and PdfImportedPage classes.

My simplified understanding of how this works is as follows: The PdfReader instance contains the content of the source PDF file. The Document class, once initialized with the PdfReader instance and a new output FileStream, essentially becomes a container into which pages extracted from the source file represented in the PdfReader class will be copied. Note that the Document class represents the Pdf content as HTML, which will be used to construct a properly formatted Pdf file. The result is then output to the Filestream, and saved to disk at the location specified by the destination file name.

You can download the iTextSharp source code and binaries as a single package from Files page at the iTextSharp project site. Just click on the “Download itextsharp-all-5.4.0.zip” link. Extract the files from the .zip archive, and stash them somewhere convenient. Next, set a reference in your project to the itextsharp.dll. You will need to browse to the folder where you stashed the extracted contents of the iTextSharp download.

NOTE: The complete example code for this post is available at my Github Repo.

I went ahead and created a project named iTextTools, with a class file named PdfExtractorUtility. Add the following using statements at the top of the file:

Set up references and Using Statements to use iTextSharp

using iTextSharp.text;
using iTextSharp.text.pdf;
using System;
// CLASS DEPENDS ON iTextSharp: http://sourceforge.net/projects/itextsharp/
namespace iTextTools
{
    public class PdfExtractorUtility
    {
    }
}

 

First, I’ll add a simple method to extract a single page from an existing PDF file and save to a new file:

Extract Single Page from Existing PDF to a new File:

public void ExtractPage(string sourcePdfPath, string outputPdfPath, 
    int pageNumber, string password = "")
{
    PdfReader reader = null;
    Document document = null;
    PdfCopy pdfCopyProvider = null;
    PdfImportedPage importedPage = null;
    try
    {
        // Intialize a new PdfReader instance with the contents of the source Pdf file:
        reader = new PdfReader(sourcePdfPath);
 
        // Capture the correct size and orientation for the page:
        document = new Document(reader.GetPageSizeWithRotation(pageNumber));
 
        // Initialize an instance of the PdfCopyClass with the source 
        // document and an output file stream:
        pdfCopyProvider = new PdfCopy(document, 
            new System.IO.FileStream(outputPdfPath, System.IO.FileMode.Create));
        document.Open();
 
        // Extract the desired page number:
        importedPage = pdfCopyProvider.GetImportedPage(reader, pageNumber);
        pdfCopyProvider.AddPage(importedPage);
        document.Close();
        reader.Close();
    }
    catch (Exception ex)
    {
        throw ex;
    }
}

 

As you can see, simply pass in the path to the source document, the page number to be extracted, and an output file path, and you’re done.

If we want to be able to a range of contiguous pages, we might add another method defining a start and end point:

Extract a Range of Pages from Existing PDF to a new File:

public void ExtractPages(string sourcePdfPath, string outputPdfPath, 
    int startPage, int endPage)
{
    PdfReader reader = null;
    Document sourceDocument = null;
    PdfCopy pdfCopyProvider = null;
    PdfImportedPage importedPage = null;
    try
    {
        // Intialize a new PdfReader instance with the contents of the source Pdf file:
        reader = new PdfReader(sourcePdfPath);
 
        // For simplicity, I am assuming all the pages share the same size
        // and rotation as the first page:
        sourceDocument = new Document(reader.GetPageSizeWithRotation(startPage));
 
        // Initialize an instance of the PdfCopyClass with the source 
        // document and an output file stream:
        pdfCopyProvider = new PdfCopy(sourceDocument, 
            new System.IO.FileStream(outputPdfPath, System.IO.FileMode.Create));
 
            sourceDocument.Open();
 
        // Walk the specified range and add the page copies to the output file:
        for (int i = startPage; i <= endPage; i++)
        {
            importedPage = pdfCopyProvider.GetImportedPage(reader, i);
            pdfCopyProvider.AddPage(importedPage);
        }
        sourceDocument.Close();
        reader.Close();
    }
    catch (Exception ex)
    {
        throw ex;
    }
}

 

What if we want non-contiguous pages from the source document? Well, we might override the above method with one which accepts an array of ints representing the desired pages:

Extract multiple non-contiguous pages from Existing PDF to a new File:

public void ExtractPages(string sourcePdfPath, 
    string outputPdfPath, int[] extractThesePages)
{
    PdfReader reader = null;
    Document sourceDocument = null;
    PdfCopy pdfCopyProvider = null;
    PdfImportedPage importedPage = null;
    try
    {
        // Intialize a new PdfReader instance with the 
        // contents of the source Pdf file:
        reader = new PdfReader(sourcePdfPath);
 
        // For simplicity, I am assuming all the pages share the same size
        // and rotation as the first page:
        sourceDocument = new Document(reader.GetPageSizeWithRotation(extractThesePages[0]));
 
        // Initialize an instance of the PdfCopyClass with the source 
        // document and an output file stream:
        pdfCopyProvider = new PdfCopy(sourceDocument,
            new System.IO.FileStream(outputPdfPath, System.IO.FileMode.Create));
        sourceDocument.Open();
 
        // Walk the array and add the page copies to the output file:
        foreach (int pageNumber in extractThesePages)
        {
            importedPage = pdfCopyProvider.GetImportedPage(reader, pageNumber);
            pdfCopyProvider.AddPage(importedPage);
        }
        sourceDocument.Close();
        reader.Close();
    }
    catch (Exception ex)
    {
        throw ex;
    }
}

 

Scratching the Surface

Obviously, the example(s) above are a simplistic first exploration of what appears to be a powerful library. What I notice about iText in general is that, unlike some API’s, the path to achieving your desired result is often not intuitive. I believe this is as much to do with the nature of the PDF file format, and possibly the structure of lower-level libraries upon which iTextSharp is built.

That said, there is without a doubt much to be discerned by exploring the iTextSharp source code. Additionally, there are a number of resources to assist the erstwhile developer in using this library:

Additional Resources for iTextSharp

Lastly, there is a book authored by one of the primary contributors to the iText project, Bruno Lowagie:

 

Posted on March 9 2013 11:44 AM by John Atten     

Comments (3)

Working with Pdf Files in C# Using PdfBox and IKVM

Posted on January 30 2013 07:09 PM by John Atten in Java, C#, Hacks, CodeProject   ||   Comments (2)

I have found two primary libraries for programmatically manipulating PDF files;  PdfBox and iText. These are both Java libraries, but I needed something I could use with C Sharp. Well, as it turns out there is an implementation of each of these libraries for .NET, each with its own strengths and weaknesses:

Some Navigation Links:

PdfBox - .Net version

The .NET implementation of PdfBox is not a direct port - rather, it uses IKVM to run the Java version inter-operably with .NET. IKVM features an actual .net implementation of a Java Virtual Machine, and a .net implementation of Java Class Libraries along with tools which enable Java and .Net interoperability. 

PdfIcon_PngPdfBox’s dependency on IKVK incurs a lot of baggage in performance terms. When the IKVM libraries load, and (I am assuming) the “’Virtual’ Java Virtual Machine” spins up, things slow way down until the load is complete. On the other hand, for some of the more common things one might want to do with a PDF programmatically, the API is (relatively) straightforward, and well documented.

When you run a project which uses PdfBox, you WILL notice a lag the first time PdfBox and IKVM are loaded. After that, things seem to perform sufficiently, at least for what I needed to do.

Side Note: iTextSharp

iTextSharp is a direct port of the Java library to .Net.

iTextSharp looks to be the more robust library in terms of fine-grained control, and is extensively documented in a book by one of the authors of the library, iText in Action (Second Edition). However, the learning curve was a little steeper for iText, and I needed to get a project out the door. I will examine iTextSharp in another post, because it looks really cool, and supposedly does not suffer the performance limitations of PdfBox.

Getting started with PdfBoxVS-References-After_adding-PdfBox-and-IKVM

Before you can use PdfBox, you need to either build the project from source, or download the ready-to-use binaries. I just downloaded the binaries for version 1.2.1 from this helpful gentleman’s site, which, since they depend on IKVM, also includes the IKVM binaries. However, there are detailed instruction for building from source on the PdfBox site. Personally, I would start with the downloaded binaries to see if PdfBox is what you want to use first.

Important to note here: apparently, the PdfBox binaries are dependent upon the exact dependent DLL’s used to build them. See the notes on the PdfBox .Net Version page.

Once you have built or downloaded the binaries, you will need to set references to PdfBox and ALL the included IKVM binaries in your Visual Studio Project. Create a new Visual Studio project named “PdfBoxExamples” and add references to ALL the PdfBox and IKVM binaries. There are a LOT. Deal with it. Your project references folder will look like the picture to the right when you are done.

The PdfBox API is quite dense, but there is a handy reference at the Apache Pdfbox site.  The PDF file format is complex, to say the least, so when you first take a gander at the available classes and methods presented by the PDF box API, it can be difficult to know where to begin. Also, there is the small issue that what you are looking at is a Java API, so some of the naming conventions are a little different. Also, the PdfBox API often returns what appear to be Java classes. This comes back to that .Net implementation of the Java Class libraries I mentioned earlier.

Things to Do with PdfBox

It seems like there are three common things I often want to do with PDF files: Extract text into a string or text file, split the document into one or more parts, or merge pages or documents together. To get started with using PdfBox we will look at extracting text first, since the set up for this is pretty straightforward, and there isn’t any real Java/.Net weirdness here.

Extracting Text from a PDF File

To do this, we will call upon two PdfBox namespaces (“Packages” in Java, loosely), and two Classes:

The namespace org.apache.pdfbox.pdmodel gives us access to the PDDocument class and the namespace  org.apache.pdfbox.util   gives us the PDFTextStripper class.

In your new PdfBoxExamples project, add a new class, name it “PdfTextExtractor," and add the following code:

The PdfTextExtractor Class
using System;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;
namespace PdfBoxExamples
{
    public class pdfTextExtractor
    {
        public static String PDFText(String PDFFilePath)
        {
            PDDocument doc = PDDocument.load(PDFFilePath);
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(doc);
        }
    }
}

 

As you can see, we use the PDDocument class (from the org.apache.pdfbox.pdmodel namespace) and initialize is using the static .load method defined as a class member on PDDocument. As long as we pass it a valid file path, the .load method will return an instance of PDDocument, ready for us to work with.

Once we have the PDDocument instance, we need an instance of the PDFTextStripper class, from the namespace org.apache.pdfbox.util. We pass our instance of PDDocument in as a parameter, and get back a string representing the text contained in the original PDF file.

Be prepared. PDF documents can employ some strange layouts, especially when there are tables and/or form fields involved. The text you get back will tend not to retain the formatting from the document, and in some cases can be bizarre.

However, the ability to strip text in this manner can be very useful, For example, I recently needed to download an individual PDF file for each county in the state of Missouri, and strip some tabular data our of each one. I hacked together an iterator/downloader to pull down the files, and the, using a modified version of the text stripping tool illustrated above and some rather painful Regex, I was able to get what I needed.

Splitting the Pages of a PDF File

At the simplest level, suppose you had a PDF file and you wanted to split it into individual pages. We can use the Splitter Class, again from the org.apache.pdf.util namespace. Add another class to you project, named PDFFileSplitter, and copy the following code into the editor:

The PdfFileSplitter Class
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;
namespace PdfBoxExamples
{
    public class PDFFileSplitter
    {
        public static java.util.List SplitPDFFile(string SourcePath, 
            int splitPageQty = 1)
        {
            var doc = PDDocument.load(SourcePath);
            var splitter = new Splitter();
            splitter.setSplitAtPage(splitPageQty);
            return (java.util.List)splitter.split(doc);
        }
    }
}

 

Notice anything strange in the code above? That’s right. We have declared a static method with a return type of java.util.List. WHAT? This is where working with PdfBox and more importantly, IKVM becomes weird/cool. Cool, because I am using a direct Java class implementation in Visual Studio, in my C# code. Weird, because my method returns a bizarre type (from a C# perspective, anyway) that I was unsure what to do with.

I would probably add to the above class so that the splitter persisted the split documents to disk, or change the return type of my method to object[], and use the .ToArray() method, like so:

The PdfFileSplitter Class (improved?)
public static object[] SplitPDFFile(string SourcePath, 
    int splitPageQty = 1)
{
    var doc = PDDocument.load(SourcePath);
    var splitter = new Splitter();
    splitter.setSplitAtPage(splitPageQty);
    return (object[])splitter.split(doc).toArray();
}

 

In any case, the code in either example loads up the specified PDF file into a PDDocument instance, which is then passed to the org.apache.pdfbox.Splitter, along with an int parameter. The output in the example above is a Java ArrayList containing a single page from your original document in each element. Your original document is not altered by this process, by the way.

The int parameter is telling the Splitter how many pages should be in each split section. In other words, if you start with a six-page PDF file, the output will be three two-page files. If you started with a 5-page file, the output would be two two-page files and one single-page file. You get the idea.

Extract Multiple Pages from a PDF Into a New File

Something slightly more useful might be a method which accepts an array of integers as a parameter, with each integer representing a page number within a group to be extracted into a new, composite document. For example, say I needed pages 1, 6, and 7 from a 44 page PDF pulled out and merged into a new document (in reality, I needed to do this for pages 1, 6, and 7 for each of about 200  individual documents). We might add a method to our PdfFileSplitter Class as follows:

The ExtractToSingleFile Method
public static void ExtractToSingleFile(int[] PageNumbers, 
    string sourceFilePath, string outputFilePath)
{
    var originalDocument = PDDocument.load(sourceFilePath);
    var originalCatalog = originalDocument.getDocumentCatalog();
    java.util.List sourceDocumentPages = originalCatalog.getAllPages();
    var newDocument = new PDDocument();
    foreach (var pageNumber in PageNumbers)
    {
        // Page numbers are 1-based, but PDPages are contained in a zero-based array:
        int pageIndex = pageNumber - 1;
        newDocument.addPage((PDPage)sourceDocumentPages.get(pageIndex));
    }
    newDocument.save(outputFilePath);
}

 

Below is a simple example to illustrate how we might call this method from a client:

Calling the ExtractToSingleFile Method:
public void ExtractAndMergePages()
{
    string sourcePath = @"C:\SomeDirectory\YourFile.pdf";
    string outputPath = @"C:\SomeDirectory\YourNewFile.pdf";
    int[] pageNumbers = { 1, 6, 7 };
    PDFFileSplitter.ExtractToSingleFile(pageNumbers, sourcePath, outputPath);
}

 

Limit Class Dependency on PdfBox

It is always good to limit dependencies within a project. In this case, especially, I would want to keep those odd Java class references constrained to the highest degree possible. In other words, where possible, I would attempt to either return standard .net types from my classes which consume the PdfBox API, or otherwise complete execution so that client code calling upon this class doesn’t need to be aware of IKVM, or funky C#/Java hybrid types.

Or, I would build out my own “PdfUtilities” library project, within which objects are free to depend upon and intermix this Java hybrid. However, I would make sure public methods defined within the library itself accepted and returned only standard C# types.

In fact, that is precisely what I am doing, and I’ll look at that in a following post.

Links to resources:

 

Posted on January 30 2013 07:09 PM by John Atten     

Comments (2)

About the author

My name is John Atten, and my "handle" on many of my online accounts is xivSolutions. I am Fascinated by all things technology and software development. I work mostly with C#, JavaScript/Node, and databases of many flavors. Actively learning always. I dig web development. I am always looking for new information, and value your feedback (especially where I got something wrong!). You can email me at:

jatten at typecastexception dot com

Web Hosting by