How to Convert a Document to PDF in Java?
Software projects often need to convert a file (HTML, TXT, etc.) to PDF and, conversely, to convert a PDF to HTML, TXT, or other formats. PDF pages sometimes also need to be saved as images such as PNG or GIF. Let us see how to do this in a sample Maven project. Because it is a Maven project, the necessary dependencies must be added to pom.xml.
The essential library is PDF2Dom:
<!-- To load the selected PDF file -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox-tools</artifactId>
    <version>2.0.25</version>
</dependency>
<!-- Required for conversion -->
<dependency>
    <groupId>net.sf.cssbox</groupId>
    <artifactId>pdf2dom</artifactId>
    <version>2.0.1</version>
</dependency>
A few more dependencies are also needed. iText is needed to extract the text from a given PDF file. POI is needed to create the .docx document.
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>5.5.10</version>
</dependency>
<dependency>
    <groupId>com.itextpdf.tool</groupId>
    <artifactId>xmlworker</artifactId>
    <version>5.5.10</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.15</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.15</version>
</dependency>
Example Maven Project
Let us start with pom.xml and then look at the source code required to convert from PDF to other formats as well as from other formats to PDF.
pom.xml
XML
<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <artifactId>pdf</artifactId>
    <name>pdf</name>
    <url>http://maven.apache.org</url>

    <parent>
        <groupId>com.gfg</groupId>
        <artifactId>parent-modules</artifactId>
        <version>1.0.0-SNAPSHOT</version>
    </parent>

    <dependencies>
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox-tools</artifactId>
            <version>${pdfbox-tools.version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>commons-logging</artifactId>
                    <groupId>commons-logging</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>net.sf.cssbox</groupId>
            <artifactId>pdf2dom</artifactId>
            <version>${pdf2dom.version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>commons-logging</artifactId>
                    <groupId>commons-logging</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>com.itextpdf</groupId>
            <artifactId>itextpdf</artifactId>
            <version>${itextpdf.version}</version>
        </dependency>
        <dependency>
            <groupId>com.itextpdf.tool</groupId>
            <artifactId>xmlworker</artifactId>
            <version>${xmlworker.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>${poi-scratchpad.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.xmlgraphics</groupId>
            <artifactId>batik-transcoder</artifactId>
            <version>${batik-transcoder.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>${poi-ooxml.version}</version>
        </dependency>
        <dependency>
            <groupId>org.thymeleaf</groupId>
            <artifactId>thymeleaf</artifactId>
            <version>${thymeleaf.version}</version>
        </dependency>
        <dependency>
            <groupId>org.xhtmlrenderer</groupId>
            <artifactId>flying-saucer-pdf</artifactId>
            <version>${flying-saucer-pdf.version}</version>
        </dependency>
        <dependency>
            <groupId>org.xhtmlrenderer</groupId>
            <artifactId>flying-saucer-pdf-openpdf</artifactId>
            <version>${flying-saucer-pdf-openpdf.version}</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>${jsoup.version}</version>
        </dependency>
        <dependency>
            <groupId>com.openhtmltopdf</groupId>
            <artifactId>openhtmltopdf-core</artifactId>
            <version>${open-html-pdf-core.version}</version>
        </dependency>
        <dependency>
            <groupId>com.openhtmltopdf</groupId>
            <artifactId>openhtmltopdf-pdfbox</artifactId>
            <version>${open-html-pdfbox.version}</version>
        </dependency>
    </dependencies>

    <build>
        <finalName>pdf</finalName>
        <resources>
            <resource>
                <directory>src/main/resources</directory>
                <filtering>true</filtering>
            </resource>
        </resources>
    </build>

    <properties>
        <pdfbox-tools.version>2.0.25</pdfbox-tools.version>
        <pdf2dom.version>2.0.1</pdf2dom.version>
        <itextpdf.version>5.5.10</itextpdf.version>
        <xmlworker.version>5.5.10</xmlworker.version>
        <poi-scratchpad.version>3.15</poi-scratchpad.version>
        <batik-transcoder.version>1.8</batik-transcoder.version>
        <poi-ooxml.version>3.15</poi-ooxml.version>
        <thymeleaf.version>3.0.11.RELEASE</thymeleaf.version>
        <flying-saucer-pdf.version>9.1.20</flying-saucer-pdf.version>
        <open-html-pdfbox.version>1.0.6</open-html-pdfbox.version>
        <open-html-pdf-core.version>1.0.6</open-html-pdf-core.version>
        <flying-saucer-pdf-openpdf.version>9.1.22</flying-saucer-pdf-openpdf.version>
        <jsoup.version>1.14.2</jsoup.version>
    </properties>
</project>
Let us look at the key files.
1. PDF and HTML conversion
ConversionOfPDF2HTMLExample.java
The program below handles both directions:
a. generationOfHTMLFromPDF
Note: PDF-to-HTML conversion is not guaranteed to be pixel-perfect. The more complex the PDF file, the less accurate the result.
b. generationOfPDFFromHTML
Note: All tags in the HTML file must be properly closed; otherwise the PDF cannot be generated.
Java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.io.Writer;
import javax.xml.parsers.ParserConfigurationException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.fit.pdfdom.PDFDomTree;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.tool.xml.XMLWorkerHelper;

public class ConversionOfPDF2HTMLExample {
    private static final String PDF = "src/main/resources/pdf.pdf";
    private static final String HTML = "src/main/resources/html.html";

    public static void main(String[] args) {
        try {
            generationOfHTMLFromPDF(PDF);
            generationOfPDFFromHTML(HTML);
        } catch (IOException | ParserConfigurationException | DocumentException e) {
            e.printStackTrace();
        }
    }

    // Loads the PDF with PDFBox and writes it out as HTML through PDFDomTree.
    private static void generationOfHTMLFromPDF(String filename)
            throws ParserConfigurationException, IOException {
        // try-with-resources closes the document and the writer even if conversion fails
        try (PDDocument pdf = PDDocument.load(new File(filename));
             Writer output = new PrintWriter("src/output/pdf.html", "utf-8")) {
            PDFDomTree parser = new PDFDomTree();
            parser.writeText(pdf, output);
        }
    }

    // Parses the (well-formed) HTML file with XMLWorker and writes it into a new PDF.
    private static void generationOfPDFFromHTML(String filename)
            throws ParserConfigurationException, IOException, DocumentException {
        Document document = new Document();
        try (InputStream html = new FileInputStream(filename);
             OutputStream out = new FileOutputStream("src/output/html.pdf")) {
            PdfWriter writer = PdfWriter.getInstance(document, out);
            document.open();
            XMLWorkerHelper.getInstance().parseXHtml(writer, document, html);
            document.close();
        }
    }
}
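The note about properly closed tags can be checked before the conversion is attempted. The following is a minimal sketch that uses only the JDK's XML parser; the class name XhtmlCheck and the method isWellFormed are hypothetical helpers, not part of the example project or of iText. It tests XML well-formedness only, so it says nothing about which tags or entities XMLWorker actually supports.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of the example project.
class XhtmlCheck {

    // Returns true when the markup parses as well-formed XML,
    // i.e. every tag is closed and correctly nested.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler replaces the parser's default stderr reporting;
            // fatal errors still surface as SAXException.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Unclosed or badly nested tags end up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("XML parser unavailable", e);
        }
    }
}
```

A caller could read html.html into a string and skip generationOfPDFFromHTML when isWellFormed returns false, reporting the file as invalid instead of failing inside the converter.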
2. PDF and Image Conversions
A PDF can be converted to images in several ways; one important option is Apache PDFBox. In the other direction, images can be converted to PDF by using iText.
ConversionOfPDF2ImageExample.java
The program below handles the following methods:
- generationOfPDFFromImage
  - Images may be of type JPEG, GIF, TIFF, or PNG and can be loaded from disk or, as in this example, from a URL.
- generationOfImageFromPDF
  - Each page of the PDF is rendered as a BufferedImage by PDFBox's PDFRenderer. ImageIOUtil then writes the image in a format such as JPEG, GIF, or PNG.
Java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.tools.imageio.ImageIOUtil;
import com.itextpdf.text.BadElementException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfWriter;

public class ConversionOfPDF2ImageExample {
    private static final String PDF = "src/main/resources/pdf.pdf";
    private static final String JPG = "http://cdn2.gfg.netdna-cdn.com/wp-content/uploads/2016/05/gfg-rest-widget-main-1.2.0";
    private static final String GIF = "https://media.giphy.com/media/l3V0x6kdXUW9M4ONq/giphy";

    public static void main(String[] args) {
        try {
            generationOfImageFromPDF(PDF, "png");
            generationOfImageFromPDF(PDF, "jpeg");
            generationOfImageFromPDF(PDF, "gif");
            generationOfPDFFromImage(JPG, "jpg");
            generationOfPDFFromImage(GIF, "gif");
        } catch (IOException | DocumentException e) {
            e.printStackTrace();
        }
    }

    // Renders every page at 300 DPI and writes it as src/output/pdf-<page>.<extension>.
    private static void generationOfImageFromPDF(String filename, String extension)
            throws IOException {
        // try-with-resources closes the document even if rendering fails
        try (PDDocument document = PDDocument.load(new File(filename))) {
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            for (int page = 0; page < document.getNumberOfPages(); ++page) {
                BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
                ImageIOUtil.writeImage(bim,
                        String.format("src/output/pdf-%d.%s", page + 1, extension), 300);
            }
        }
    }

    // Downloads the image from <filename>.<extension> and places it in a new PDF.
    private static void generationOfPDFFromImage(String filename, String extension)
            throws IOException, BadElementException, DocumentException {
        Document document = new Document();
        String input = filename + "." + extension;
        String output = "src/output/" + extension + ".pdf";
        try (FileOutputStream fos = new FileOutputStream(output)) {
            PdfWriter.getInstance(document, fos);
            document.open();
            document.add(Image.getInstance(new URL(input)));
            // Closing the document also closes the writer attached to it
            document.close();
        }
    }
}
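The last step of generationOfImageFromPDF, writing the rendered BufferedImage to a file, can also be sketched with the JDK's javax.imageio.ImageIO alone. This is a sketch, not a replacement for ImageIOUtil: unlike the ImageIOUtil.writeImage call above, plain ImageIO.write does not take a DPI value. The class PageImageWriter and its method writePage are hypothetical names introduced here for illustration.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper for illustration; not part of the example project.
class PageImageWriter {

    // Writes one rendered page using the naming pattern from the example,
    // pdf-<pageNumber>.<extension>, and returns the file that was written.
    static File writePage(BufferedImage image, File outputDir,
                          int pageNumber, String extension) throws IOException {
        File target = new File(outputDir,
                String.format("pdf-%d.%s", pageNumber, extension));
        // ImageIO.write returns false when no writer is registered for the format.
        if (!ImageIO.write(image, extension, target)) {
            throw new IOException("No ImageIO writer for format: " + extension);
        }
        return target;
    }
}
```

Checking the boolean result matters because an unsupported format name does not throw by itself; without the check, the loop would finish silently with no usable image.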
3. PDF and Text Conversions
Here too, Apache PDFBox is used to extract the text from PDF files, and iText is used for text-to-PDF conversion.
Note: Formatting cannot be preserved in a plain text file, because it holds text only.
ConversionOfPDF2TextExample.java
Java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Element;
import com.itextpdf.text.Font;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;

public class ConversionOfPDF2TextExample {
    private static final String PDF = "src/main/resources/pdf.pdf";
    private static final String TXT = "src/main/resources/txt.txt";

    public static void main(String[] args) {
        try {
            generationOfTxtFromPDF(PDF);
            generationOfPDFFromTxt(TXT);
        } catch (IOException | DocumentException e) {
            e.printStackTrace();
        }
    }

    // Parses the PDF, extracts its text with PDFTextStripper, and saves it as plain text.
    private static void generationOfTxtFromPDF(String filename) throws IOException {
        File f = new File(filename);
        PDFParser parser = new PDFParser(new RandomAccessFile(f, "r"));
        parser.parse();
        String parsedText;
        // Closing the PDDocument also closes the underlying COSDocument
        try (PDDocument pdDoc = new PDDocument(parser.getDocument())) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            parsedText = pdfStripper.getText(pdDoc);
        }
        try (PrintWriter pw = new PrintWriter("src/output/pdf.txt")) {
            pw.print(parsedText);
        }
    }

    // Reads the text file line by line and adds each line to the PDF as a paragraph.
    private static void generationOfPDFFromTxt(String filename)
            throws IOException, DocumentException {
        Document pdfDoc = new Document(PageSize.A4);
        PdfWriter.getInstance(pdfDoc, new FileOutputStream("src/output/txt.pdf"))
                .setPdfVersion(PdfWriter.PDF_VERSION_1_7);
        pdfDoc.open();
        Font myfont = new Font();
        myfont.setStyle(Font.NORMAL);
        myfont.setSize(11);
        pdfDoc.add(new Paragraph("\n"));
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String strLine;
            while ((strLine = br.readLine()) != null) {
                Paragraph para = new Paragraph(strLine + "\n", myfont);
                para.setAlignment(Element.ALIGN_JUSTIFIED);
                pdfDoc.add(para);
            }
        }
        pdfDoc.close();
    }
}
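The line-by-line read in generationOfPDFFromTxt relies on FileReader, which decodes the file with the platform default charset (UTF-8 by default only from Java 18 onward). If the input file is known to be UTF-8, the read step could be sketched with an explicit charset as below. The class TextLines and the method readUtf8Lines are hypothetical names, and the UTF-8 encoding of txt.txt is an assumption, not something the example states.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the example project.
class TextLines {

    // Reads all lines of a text file, decoding it as UTF-8 regardless of
    // the platform default charset.
    static List<String> readUtf8Lines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader =
                     Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Each returned line could then be wrapped in a Paragraph exactly as in the loop of the example.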
4. PDF and DocX Conversions
Two libraries are needed:
- iText: Extract text from PDF
- POI: To create the .docx document
ConversionOfPDF2WordExample.java
Java
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.xwpf.usermodel.BreakType;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class ConversionOfPDF2WordExample {
    private static final String FILENAME = "src/main/resources/pdf.pdf";

    public static void main(String[] args) {
        try {
            generationOfDocFromPDF(FILENAME);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Extracts the text of each PDF page with iText and writes it to a .docx
    // file with POI, one paragraph per page followed by a page break.
    private static void generationOfDocFromPDF(String filename) throws IOException {
        PdfReader reader = new PdfReader(filename);
        // try-with-resources closes the Word document and the output stream
        try (XWPFDocument doc = new XWPFDocument();
             FileOutputStream out = new FileOutputStream("src/output/pdf.docx")) {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            // iText page numbers start at 1
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                TextExtractionStrategy strategy =
                        parser.processContent(i, new SimpleTextExtractionStrategy());
                String text = strategy.getResultantText();
                XWPFParagraph p = doc.createParagraph();
                XWPFRun run = p.createRun();
                run.setText(text);
                run.addBreak(BreakType.PAGE);
            }
            doc.write(out);
        } finally {
            reader.close();
        }
    }
}
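All of the examples above write their results into src/output/. FileOutputStream and PrintWriter do not create missing parent directories, so a run fails with a FileNotFoundException if that folder does not exist yet. A minimal sketch of preparing the output location first is shown below; the class OutputDirs and the method prepareOutputFile are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of the example project.
class OutputDirs {

    // Creates the output directory (and any missing parents) if needed and
    // returns the path of the file to be written inside it.
    static Path prepareOutputFile(Path outputDir, String fileName) throws IOException {
        // createDirectories does nothing if the directory already exists
        Files.createDirectories(outputDir);
        return outputDir.resolve(fileName);
    }
}
```

For instance, calling prepareOutputFile(Paths.get("src/output"), "pdf.docx") and passing the result's toString() to FileOutputStream would make the first run work on a fresh checkout.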
Conclusion
Software projects often need to convert text and images to PDF and, conversely, to convert PDF content to text, image, and DOCX formats. The examples above show how to do this in Java with PDFBox, PDF2Dom, iText, and POI.