Reading data from a PDF file using Java. This tutorial shows how to perform operations on PDF files such as reading all the content of a PDF file, searching for text in it, and getting the total number of lines.

Pre-requisites
To read a PDF file, first download the Apache PDFBox library, which contains the classes and methods needed to perform various operations on a PDF file.
- Download “pdfbox-app-2.0.7.jar” from the following URL:
- https://pdfbox.apache.org/download.cgi
- Add the jar file to your project's build path.
Step by Step PDF Operations
- Locate the target PDF document using a File object.
- Load the PDF document into memory using a PDDocument object.
- Create a PDFTextStripper object. It extracts the PDF content as plain text, ignoring formatting, and provides methods to control which part of the document is read.
- You can also set the page range on which to operate. This is done with the setStartPage(index) and setEndPage(index) methods of the PDFTextStripper object. Page numbers start at 1.
- To get the total number of lines in a given range of PDF pages, you can write your own method. In this example, we have created the method getTotalLineCount() for this purpose.
- To get all the content from the PDF document for the given range of pages, use the getText(PDDocument doc) method of the PDFTextStripper object.
- In addition to the above methods, we have created a few user-defined functions:
- getTotalLineCount() | Get the total number of non-blank lines in the given range of PDF pages.
- getTextByLineNumber(index) | Get the text at the specified line number. Blank lines are skipped when counting.
- searchText(String searchKey) | Search for text in the given range of PDF pages.
- searchTextIgnoreCase(String searchKey) | Search for text in the given range of PDF pages, ignoring letter case.
- searchTextByLineNumber(String searchKey, int lineNumber) | Search for text in the given line number.
- After performing all the required operations on the PDF file, release the memory by closing the file using the close() method of the PDDocument object.
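The user-defined helpers listed above all work on the plain text returned by getText(), so their line logic can be sketched without PDFBox at all. The following is a minimal sketch under that assumption: the class name PdfTextLines and its method names are hypothetical (they are not part of PDFBox or of the example classes in this article), and the sample string stands in for text extracted from a PDF. It splits on \R so that both Windows and Unix line endings are handled, and it returns null for a line number that is out of range.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: line-based operations on
// text that has already been extracted from a PDF.
final class PdfTextLines {

    // Split on any line break (\n, \r\n, \r) and keep the trimmed non-blank lines.
    static List<String> nonBlankLines(String text) {
        List<String> lines = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    // Number of non-blank lines in the text.
    static int lineCount(String text) {
        return nonBlankLines(text).size();
    }

    // Text of the 1-based non-blank line, or null if lineNumber is out of range.
    static String lineAt(String text, int lineNumber) {
        List<String> lines = nonBlankLines(text);
        if (lineNumber < 1 || lineNumber > lines.size()) {
            return null;
        }
        return lines.get(lineNumber - 1);
    }

    // True if the given 1-based non-blank line exists and contains searchKey.
    static boolean lineContains(String text, String searchKey, int lineNumber) {
        String line = lineAt(text, lineNumber);
        return line != null && line.contains(searchKey);
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by PDFTextStripper.getText(doc).
        String sample = "Title\r\n\r\nThis is a PDF demo.\n   \nLast line";
        System.out.println("Total lines: " + lineCount(sample));
        System.out.println("Line 2: " + lineAt(sample, 2));
        System.out.println("'PDF' in line 2: " + lineContains(sample, "PDF", 2));
    }
}
```

Keeping the line logic separate from the PDF loading step also means the text only has to be extracted once and can then be passed to each helper.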
JAVA Code
This example uses two class files, PDFMethods.java and PDFHandling.java. PDFMethods.java contains all the predefined and user-defined methods for accessing the content of a PDF file. PDFHandling.java calls those methods and controls the flow of the program as required.
PDFMethods.java
package pdf;

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFMethods {

    public static File pdfFile;
    public static PDDocument doc;
    public static PDFTextStripper textStripper;

    public static void getPDFFile(String filePath) throws InvalidPasswordException, IOException {
        // Locate the PDF document to read.
        pdfFile = new File(filePath);
        // Load the PDF file into memory to access its content.
        doc = PDDocument.load(pdfFile);
        // PDFTextStripper extracts the PDF content as text, ignoring all formatting.
        // It also provides methods to access the content within the PDF file.
        textStripper = new PDFTextStripper();
    }

    public static void setPageRange(int startPage, int endPage) {
        // Set the range of pages (1-based) from which content is read.
        textStripper.setStartPage(startPage);
        textStripper.setEndPage(endPage);
    }

    public static int getTotalLineCount() throws IOException {
        // User-defined method to get the total number of lines within the given page range.
        // Blank lines (if any) are ignored while counting.
        String pdfContent = textStripper.getText(doc);
        int counter = 0;
        String[] pdfLines = pdfContent.split("\\n");
        for (String string : pdfLines) {
            if (!string.trim().isEmpty()) {
                counter++;
            }
        }
        return counter;
    }

    public static String getAllPdfContent() throws IOException {
        // Load all the content of the PDF file (according to the page range) into a String.
        String pdfContent = textStripper.getText(doc);
        return pdfContent;
    }

    public static String getTextByLineNumber(int lineNumber) throws IOException {
        // Return the text at the specified line number, or null if there is no such line.
        // Blank lines are not counted.
        String pdfContent = textStripper.getText(doc);
        int pdfLineNumber = 1;
        String[] pdfLines = pdfContent.split("\\n");
        String currentLine = null;
        for (String string : pdfLines) {
            if (!string.trim().isEmpty()) {
                if (pdfLineNumber == lineNumber) {
                    currentLine = string.trim();
                    break;
                }
                pdfLineNumber++;
            }
        }
        return currentLine;
    }

    public static boolean searchText(String searchKey) throws IOException {
        // Search for text in the PDF pages (case-sensitive).
        String pdfContent = textStripper.getText(doc);
        if (pdfContent.contains(searchKey)) {
            System.out.println("Keyword '" + searchKey + "' Found.");
            return true;
        } else {
            System.out.println("Keyword '" + searchKey + "' Not Found.");
            return false;
        }
    }

    public static boolean searchTextIgnoreCase(String searchKey) throws IOException {
        // Search for text in the PDF pages, ignoring letter case.
        String pdfContent = textStripper.getText(doc);
        if (pdfContent.toLowerCase().contains(searchKey.toLowerCase())) {
            System.out.println("Keyword '" + searchKey + "' Found with ignore case.");
            return true;
        } else {
            System.out.println("Keyword '" + searchKey + "' Not Found.");
            return false;
        }
    }

    public static boolean searchTextByLineNumber(String searchKey, int lineNumber) throws IOException {
        // Search for text in the specified line number of the PDF. Blank lines are not counted.
        String pdfContent = textStripper.getText(doc);
        int pdfLineNumber = 1;
        boolean searchResult = false;
        String[] pdfLines = pdfContent.split("\\n");
        for (String string : pdfLines) {
            if (!string.trim().isEmpty()) {
                if (pdfLineNumber == lineNumber) {
                    if (string.trim().contains(searchKey)) {
                        searchResult = true;
                        System.out.println("Keyword '" + searchKey + "' Found in line number " + lineNumber);
                    }
                    break;
                }
                pdfLineNumber++;
            }
        }
        if (!searchResult) {
            // Report the requested line number, even if the document has fewer lines.
            System.out.println("Keyword '" + searchKey + "' Not Found in line number " + lineNumber);
        }
        return searchResult;
    }

    public static void closePDF() throws IOException {
        // Close the PDF file after program execution is finished.
        doc.close();
    }
}
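searchTextIgnoreCase() lower-cases the whole page text and the keyword before comparing them. As an alternative, here is a small sketch that compares in place with String.regionMatches(), which avoids building lower-cased copies of the content. The class name KeywordSearch and the method containsIgnoreCase() are hypothetical, and the content string stands in for the text returned by getText().

```java
// Hypothetical helper, not part of PDFBox or of PDFMethods.
final class KeywordSearch {

    // True if content contains searchKey at any position, ignoring letter case.
    static boolean containsIgnoreCase(String content, String searchKey) {
        int lastStart = content.length() - searchKey.length();
        for (int i = 0; i <= lastStart; i++) {
            // regionMatches(true, ...) compares the two regions case-insensitively.
            if (content.regionMatches(true, i, searchKey, 0, searchKey.length())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a PDF page.
        String content = "This is a PDF demo.";
        System.out.println(containsIgnoreCase(content, "pdf"));
        System.out.println(containsIgnoreCase(content, "xls"));
    }
}
```

For short documents the difference is unlikely to matter, so the toLowerCase() version remains a reasonable choice.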
PDFHandling.java
package pdf;

import java.io.IOException;

import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;

public class PDFHandling {

    public static void main(String[] args) throws InvalidPasswordException, IOException {
        PDFMethods.getPDFFile("input_pdf\\Demo_file.pdf");

        // Specify the range of pages for access in the PDF file.
        int startIndex = 1;
        int endIndex = 1;
        PDFMethods.setPageRange(startIndex, endIndex);
        System.out.println("Page Range:>> from " + startIndex + " to " + endIndex);

        // Get total number of lines in the target range of PDF document.
        int totalLinesCount = PDFMethods.getTotalLineCount();
        System.out.println("Total Lines:>> " + totalLinesCount);

        /*String pdfContent=PDFMethods.getAllPdfContent();
        System.out.println("All content from specified Pages :>>\n "+pdfContent);*/

        // Get the text from a specified line number.
        int targetLineNumber = 4;
        String pdfLine = PDFMethods.getTextByLineNumber(targetLineNumber);
        System.out.println("Text from '" + targetLineNumber + "' Line number:>> " + pdfLine);

        // Search any text/keyword in the PDF Document.
        // Each search method will return boolean value which can be used in further programming.
        String searchKey = "PDF";
        boolean searchResult = PDFMethods.searchText(searchKey);
        boolean searchResult2 = PDFMethods.searchTextIgnoreCase(searchKey);
        boolean searchResult3 = PDFMethods.searchTextByLineNumber(searchKey, 2);
    }
}
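The path "input_pdf\\Demo_file.pdf" passed to getPDFFile() uses a Windows-style separator. On Linux or macOS the backslash is treated as part of the file name, so the file would not be found. If the program needs to run on those systems too, one option is to build the path from its parts and let the JDK choose the separator. A minimal sketch, in which the class name InputPath and the method resolve() are hypothetical:

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: builds a file path without hard-coding the separator.
final class InputPath {

    // Join a folder name and a file name using the platform's separator.
    static File resolve(String folder, String fileName) {
        Path path = Paths.get(folder, fileName);
        return path.toFile();
    }

    public static void main(String[] args) {
        File pdf = resolve("input_pdf", "Demo_file.pdf");
        // isFile() is false if the file does not exist at that location.
        System.out.println(pdf.getPath() + " exists: " + pdf.isFile());
    }
}
```

The resulting path can then be handed to the loader, for example PDFMethods.getPDFFile(pdf.getPath()).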