File Handling | Reading data from PDF file using JAVA.

This tutorial shows how to read data from a PDF file using Java. The operations covered include reading all the content of a PDF file, searching for text in it, and getting the total number of lines.

Reading PDF Content

Pre-requisites

To read a PDF file, first download the Apache PDFBox library, which contains the classes and methods needed to perform various operations on a PDF file.

  1. Download “pdfbox-app-2.0.7.jar” from the following URL: https://pdfbox.apache.org/download.cgi
  2. Add the jar file to your project using the Build Path.



Step by Step PDF Operations

  1. Locate the target PDF document using a File object.
  2. Load the PDF document into memory using a PDDocument object.
  3. Create a PDFTextStripper object, which extracts the PDF content as plain text, ignoring formatting. This object also provides methods to control which part of the document is read.
  4. You can also set the page range on which to operate. This is done using the setStartPage(index) and setEndPage(index) methods of the PDFTextStripper object.
  5. To get the total number of lines in a given range of PDF pages, you can write your own method. In this example, we have created the method getTotalLineCount() for this purpose.
  6. To get all the content from the PDF document for the given range of pages, use the getText(PDDocument doc) method of the PDFTextStripper object.
  7. In addition to the above methods, we have created a few user-defined methods:
    1. getTotalLineCount()  |  Get the total number of lines in the given range of PDF pages. Blank lines are not counted.
    2. getTextByLineNumber(index)  |  Get the text from the specified line number. This method ignores blank lines.
    3. searchText(String searchKey)  |  Search for text in the given range of PDF pages.
    4. searchTextIgnoreCase(String searchKey)  |  Search for text in the given range of PDF pages, ignoring letter case.
    5. searchTextByLineNumber(String searchKey,int lineNumber)  |  Search for text in the given line number.
  8. After performing all the required operations on the PDF file, release the memory by closing the document using the close() method of the PDDocument object.
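
The user-defined helpers in step 7 all work on the plain text that getText returns, so their logic can be tried without opening a PDF. The following is a minimal sketch, assuming the extracted text is already held in a String. The class name TextLineHelpers and its method names are hypothetical and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class: operates on text already extracted from a PDF.
final class TextLineHelpers {

    // Split on Unix or Windows line breaks and keep only the non-blank lines, trimmed.
    static List<String> nonBlankLines(String text) {
        List<String> lines = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            if (!line.trim().isEmpty()) {
                lines.add(line.trim());
            }
        }
        return lines;
    }

    // Number of non-blank lines in the text.
    static int totalLineCount(String text) {
        return nonBlankLines(text).size();
    }

    // 1-based lookup that skips blank lines; returns null when the line does not exist.
    static String textByLineNumber(String text, int lineNumber) {
        List<String> lines = nonBlankLines(text);
        if (lineNumber < 1 || lineNumber > lines.size()) {
            return null;
        }
        return lines.get(lineNumber - 1);
    }

    // Case-insensitive substring search over the whole text.
    static boolean containsIgnoreCase(String text, String searchKey) {
        return text.toLowerCase(Locale.ROOT).contains(searchKey.toLowerCase(Locale.ROOT));
    }
}
```

Splitting the text once in nonBlankLines keeps the blank-line rule in a single place, so the count and the line lookup cannot disagree about how lines are numbered.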


JAVA Code

In this example, we have used two class files: PDFMethods.java and PDFHandling.java. PDFMethods.java contains all the predefined and user-defined methods to access the content of the PDF file. PDFHandling.java calls these methods and controls the flow of the program.

PDFMethods.java
package pdf;

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFMethods 
{
	public static File pdfFile;
	public static PDDocument doc;
	public static PDFTextStripper textStripper;
	
	public static void getPDFFile(String filePath) throws InvalidPasswordException, IOException
	{
		// Locate the PDF document to read.
		pdfFile=new File(filePath);
		
		// Load PDF file to memory to access the content.
		doc=PDDocument.load(pdfFile);
		
		// PDFTextStripper extracts the PDF content as plain text, ignoring all formatting.
		// It also provides methods to control which pages are read.
		textStripper =new PDFTextStripper();
	}
	
	public static void setPageRange(int startPage,int endPage)
	{
		// Set number of pages or range of pages for reading the content from PDF.
		textStripper.setStartPage(startPage);
		textStripper.setEndPage(endPage);
	}
	
	public static int getTotalLineCount() throws IOException
	{
		// User-defined method to get the total number of lines within the given page range.
		// This method ignores blank lines (if any) while counting.
		String pdfContent=textStripper.getText(doc);
		int counter=0;
		String[] pdfLines=pdfContent.split("\\n");
		for (String string : pdfLines) 
		{
			if(!string.trim().isEmpty())
			{
				counter++;
			}
		}
		return counter;
	}
	
	public static String getAllPdfContent() throws IOException
	{
		// Load all the content of the PDF file (within the page range) into a String.
		String pdfContent = textStripper.getText(doc);
		return pdfContent;
	}
	
	public static String getTextByLineNumber(int lineNumber) throws IOException
	{
		// This method returns the text from the specified line number, ignoring blank lines.
		// It returns null if the line number does not exist in the page range.
		String pdfContent = textStripper.getText(doc);
		int pdfLineNumber=1;
		String[] pdfLines=pdfContent.split("\\n");
		String currentLine=null;
		for (String string: pdfLines) 
		{
			if(!string.trim().isEmpty())
			{
				if(pdfLineNumber==lineNumber)
				{
				   currentLine=string.trim();
				   break;
				}
				pdfLineNumber++;
			}
			
		}
		
		return currentLine;
	}
	
	public static boolean searchText(String searchKey) throws IOException
	{
		// This method searches for the given text in the PDF pages.
		String pdfContent = textStripper.getText(doc);
		if(pdfContent.contains(searchKey))
		{
			System.out.println("Keyword '"+searchKey+"' Found.");
			return true;
		}
		else
		{
			System.out.println("Keyword '"+searchKey+"' Not Found.");
			return false;
		}
		
	}
	
	public static boolean searchTextIgnoreCase(String searchKey) throws IOException
	{
		// This method searches for the given text in the PDF pages, ignoring letter case.
		String pdfContent = textStripper.getText(doc);
		if(pdfContent.toLowerCase().contains(searchKey.toLowerCase()))
		{
			System.out.println("Keyword '"+searchKey+"' Found with ignore case.");
			return true;
		}
		else
		{
			System.out.println("Keyword '"+searchKey+"' Not Found.");
			return false;
		}
		
	}
	
	public static boolean searchTextByLineNumber(String searchKey,int lineNumber) throws IOException
	{
		// This method searches for the given text in the specified line number of the PDF.
		String pdfContent = textStripper.getText(doc);
		int pdfLineNumber=1;
		boolean searchResult=false;
		String[] pdfLines=pdfContent.split("\\n");
		String currentLine=null;
		for (String string: pdfLines) 
		{
			if(!string.trim().isEmpty())
			{
				if(pdfLineNumber==lineNumber)
				{
				   currentLine=string.trim();
				   if(currentLine.contains(searchKey))
				   {
					   searchResult=true;
					   System.out.println("Keyword '"+searchKey+"' Found in line number "+pdfLineNumber);
				   }
				   // Target line reached; stop scanning whether or not the keyword matched.
				   break;
				}
				pdfLineNumber++;
			}
			
		}
		if(!searchResult)
		{
			// Report the requested line number; pdfLineNumber can differ when the line does not exist.
			System.out.println("Keyword '"+searchKey+"' Not Found in line number "+lineNumber);
		}
		
		return searchResult;
	}
	
	public static void closePDF() throws IOException
	{
		// Close the PDF file after program execution is finished.
		doc.close();
	}

}
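
The search methods in PDFMethods.java report only whether a keyword is present. As a sketch of one possible extension, the helper below returns every line number on which the keyword appears, using the same numbering rule (blank lines are skipped). The class KeywordLineFinder and its method are hypothetical; they take the extracted text as a String, so they do not depend on PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical extension: not part of PDFMethods or PDFBox.
final class KeywordLineFinder {

    // Returns the 1-based numbers of the non-blank lines that contain searchKey.
    static List<Integer> findLineNumbers(String pdfContent, String searchKey) {
        List<Integer> matches = new ArrayList<>();
        int lineNumber = 0;
        for (String line : pdfContent.split("\\r?\\n")) {
            if (line.trim().isEmpty()) {
                continue; // blank lines are not numbered
            }
            lineNumber++;
            if (line.contains(searchKey)) {
                matches.add(lineNumber);
            }
        }
        return matches;
    }
}
```

The text returned by getAllPdfContent() could be passed as the first argument. An empty list means the keyword was not found in the page range.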

PDFHandling.java
package pdf;

import java.io.IOException;

import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;

public class PDFHandling {

	public static void main(String[] args) throws InvalidPasswordException, IOException 
	{
		// Forward slashes in the path work on both Windows and Unix-like systems.
		PDFMethods.getPDFFile("input_pdf/Demo_file.pdf");
		
		// Specify the range of pages for access in the PDF file.
		int startIndex=1;
		int endIndex=1;
		PDFMethods.setPageRange(startIndex, endIndex);
		System.out.println("Page Range:>> from "+startIndex+" to "+endIndex);
		
		// Get total number of lines in the target range of PDF document.
		int totalLinesCount=PDFMethods.getTotalLineCount();
		System.out.println("Total Lines:>> "+totalLinesCount);
		
		/*String pdfContent=PDFMethods.getAllPdfContent();
		System.out.println("All content from specified Pages :>>\n "+pdfContent);*/
		
		// Get the text from a specified line number.
		int targetLineNumber=4;
		String pdfLine=PDFMethods.getTextByLineNumber(targetLineNumber);
		System.out.println("Text from '"+targetLineNumber+"' Line number:>> "+pdfLine);
		
		// Search any text/keyword in the PDF Document.
		// Each search method will return boolean value which can be used in further programming. 
		String searchKey="PDF";
		boolean searchResult=PDFMethods.searchText(searchKey);
		boolean searchResult2=PDFMethods.searchTextIgnoreCase(searchKey);
		boolean searchResult3=PDFMethods.searchTextByLineNumber(searchKey, 2);
		
	}

}
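
If the location of the PDF has to be assembled from a folder name and a file name, java.nio.file.Paths can join the parts without a hard-coded separator. This is a minimal sketch under that assumption. The class PdfPathExample and the method pdfPath are hypothetical names used only for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical example class: builds a file path in a platform-independent way.
final class PdfPathExample {

    // Joins the folder and file name using the separator of the current platform.
    static Path pdfPath(String folder, String fileName) {
        return Paths.get(folder, fileName);
    }

    public static void main(String[] args) {
        Path path = pdfPath("input_pdf", "Demo_file.pdf");
        // path.toString() can be passed to a method that expects a String file path.
        System.out.println(path);
    }
}
```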


