Tuesday, 3 November 2020

How to Extract Text from a PDF document in Java

In this article, I’ll show how to extract (read) text from a PDF document in a Java program using the free third-party library Free Spire.PDF for Java. The library is a PDF API that lets developers create, manipulate, read, convert and print PDF documents without installing Adobe Acrobat.

The tutorial covers the following three tasks:

1.  Extract all text from a PDF document

2.  Extract text from a particular page of a PDF document

3.  Extract text from a specific rectangular area of a PDF document

Maven dependency:

Create a Maven project in IntelliJ IDEA, add the following configuration to the pom.xml file, and then click the “Import Changes” button. If you don’t use Maven, download the package from the link and manually import the Spire.Pdf.jar found in the “lib” folder into IDEA.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf.free</artifactId>
        <version>3.9.0</version>
    </dependency>
</dependencies>

Using the code

1. Extract all text from a PDF document

Follow these steps to extract text from all pages of a PDF document using Free Spire.PDF for Java.

Step 1: Create a PdfDocument instance and call its loadFromFile() method to load the PDF document you want to extract text from.

Step 2: Create a StringBuilder, loop through the pages, and append the text extracted from each page with the append() method.

Step 3: Create a new .txt file and write the text to it.

import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;

import java.io.FileWriter;
import java.io.IOException;

public class ExtractAllText {

    public static void main(String[] args) {

        //Create a PdfDocument instance
        PdfDocument doc = new PdfDocument();

        //Load the PDF file
        doc.loadFromFile("C:\\Users\\Test1\\Desktop\\Sample.pdf");

        //Create a StringBuilder instance
        StringBuilder sb = new StringBuilder();
        PdfPageBase page;

        //Loop through PDF pages and get text of each page
        for (int i = 0; i < doc.getPages().getCount(); i++) {
            page = doc.getPages().get(i);
            sb.append(page.extractText(true));
        }

        //Write text into a .txt file; try-with-resources closes the writer
        //(the "output" folder must already exist)
        try (FileWriter writer = new FileWriter("output/ExtractText.txt")) {
            writer.write(sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
        doc.close();
    }
}
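
Step 3 above fails if the "output" folder is missing. As a minimal sketch that uses only the JDK (TextFileSaver and its save() method are hypothetical names, not part of Free Spire.PDF for Java), the write step could create the folder first and wrap any IOException in an unchecked exception:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Free Spire.PDF for Java.
final class TextFileSaver {

    // Writes text to target as UTF-8, creating missing parent folders
    // and replacing the content of any existing file.
    static Path save(String text, Path target) {
        try {
            Path parent = target.toAbsolutePath().getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            return Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for sb.toString() from the example above.
        String extracted = "Text extracted from the PDF";
        Path saved = save(extracted, Paths.get("output", "ExtractText.txt"));
        System.out.println("Saved to " + saved.toAbsolutePath());
    }
}
```

In the example above, the try block could then be replaced by a single call such as TextFileSaver.save(sb.toString(), Paths.get("output", "ExtractText.txt")).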

2. Extract text from a particular page of a PDF document

import com.spire.pdf.*;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractTextFromParticularPage {

    public static void main(String[] args) throws IOException {

        //Load the PDF file
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile("C:\\Users\\Test1\\Desktop\\Sample.pdf");

        //Create a new txt file to save the extracted text,
        //deleting any file left over from a previous run
        String result = "output/extractTextFromParticularPage.txt";
        File file = new File(result);
        if (file.exists()) {
            file.delete();
        }
        file.createNewFile();
        FileWriter fw = new FileWriter(file, true);
        BufferedWriter bw = new BufferedWriter(fw);

        //Get the third page (page indexes start at 0)
        PdfPageBase page = pdf.getPages().get(2);

        //Extract text from page keeping white space
        String text = page.extractText(true);

        //Extract text from page without keeping white space
        //String text = page.extractText(false);

        bw.write(text);
        bw.flush();
        //Closing the BufferedWriter also closes the underlying FileWriter
        bw.close();
        pdf.close();
    }
}
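
The example above reaches the third page with get(2) because page indexes start at 0. A small sketch of a guard for that conversion follows. PageIndex and toZeroBased() are hypothetical names, not part of the library, and the page count is a stand-in for pdf.getPages().getCount():

```java
// Hypothetical helper for illustration; not part of Free Spire.PDF for Java.
final class PageIndex {

    // Converts a 1-based page number (as shown in a PDF viewer) to the
    // 0-based index used by a zero-indexed page collection.
    static int toZeroBased(int pageNumber, int pageCount) {
        if (pageNumber < 1 || pageNumber > pageCount) {
            throw new IllegalArgumentException(
                "Page " + pageNumber + " is outside 1.." + pageCount);
        }
        return pageNumber - 1;
    }

    public static void main(String[] args) {
        int pageCount = 5; // stand-in for pdf.getPages().getCount()
        System.out.println(toZeroBased(3, pageCount)); // prints 2
    }
}
```

Checking the page number up front gives a clear error message for documents with fewer pages than expected.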

3. Extract text from a specific rectangular area of a PDF document

In addition to extracting text from all pages or from a particular page, Free Spire.PDF for Java can extract text from a specific rectangular area of a page. The steps are as follows.

Step 1: Create a PdfDocument object and load the PDF file.

Step 2: Get the page from which the text will be extracted.

Step 3: Extract the text from a specific rectangular area within the page, then save it to a .txt file.

import com.spire.pdf.*;

import java.awt.geom.Rectangle2D;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractTextFromSpecificArea {

    public static void main(String[] args) throws IOException {

        //Load the PDF file
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile("C:\\Users\\Test1\\Desktop\\Sample.pdf");

        //Create a new .txt file to save the extracted text,
        //deleting any file left over from a previous run
        File file = new File("output/extractTextFromSpecificArea.txt");
        if (file.exists()) {
            file.delete();
        }
        file.createNewFile();
        FileWriter fw = new FileWriter(file, true);
        BufferedWriter bw = new BufferedWriter(fw);

        //Get the first page
        PdfPageBase page = pdf.getPages().get(0);

        //Extract text from a specific rectangular area within the page;
        //the Rectangle2D.Float arguments are x, y, width and height
        String text = page.extractText(new Rectangle2D.Float(80, 20, 500, 200));
        bw.write(text);
        bw.flush();
        //Closing the BufferedWriter also closes the underlying FileWriter
        bw.close();
        pdf.close();
    }
}
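
The article does not state the unit of the rectangle passed to extractText(). The sketch below assumes PDF points (1/72 inch), the usual unit for PDF page coordinates, and shows how such a rectangle could be built from inches and checked against a page size. ExtractionArea, fromInches() and fitsOnPage() are hypothetical names, and the 595 x 842 page size (roughly A4) is an illustrative value, not one read from the sample document:

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper for illustration; not part of Free Spire.PDF for Java.
// Assumes coordinates are measured in PDF points (1/72 inch).
final class ExtractionArea {

    static final float POINTS_PER_INCH = 72f;

    // Builds a rectangle in points from a position and size given in inches.
    static Rectangle2D.Float fromInches(float x, float y, float width, float height) {
        return new Rectangle2D.Float(
            x * POINTS_PER_INCH, y * POINTS_PER_INCH,
            width * POINTS_PER_INCH, height * POINTS_PER_INCH);
    }

    // Returns true if the area lies completely inside a page of the given size.
    static boolean fitsOnPage(Rectangle2D area, double pageWidth, double pageHeight) {
        return new Rectangle2D.Double(0, 0, pageWidth, pageHeight).contains(area);
    }

    public static void main(String[] args) {
        // The rectangle used in the example above.
        Rectangle2D.Float area = new Rectangle2D.Float(80, 20, 500, 200);
        // Illustrative page size in points, roughly A4.
        System.out.println(fitsOnPage(area, 595, 842)); // prints true
    }
}
```

A check like this can catch a rectangle that extends past the page edge before it is passed to extractText().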
