Tina's blog: How to Extract Text from a PDF document in Java

In this article, I’ll show how to extract (read) text from a PDF document in a Java program using Free Spire.PDF for Java, a free third-party library. The library is a PDF API that enables developers to create, manipulate, read, convert and print PDF documents without installing Adobe Acrobat.

The tutorial covers the following three tasks:

1. Extract all text from a PDF document

2. Extract text from a particular page of a PDF document

3. Extract text from a specific rectangle area of a PDF document

Maven dependency:

Create a Maven project in IntelliJ IDEA, add the following configuration to the pom.xml file, and then click the “Import Changes” button. Non-Maven users can download the package from the link and manually import the Spire.Pdf.jar file found in the “lib” folder into IDEA.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>

<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf.free</artifactId>
        <version>3.9.0</version>
    </dependency>
</dependencies>

Using the code
1. Extract all text from a PDF document
The following steps extract text from all pages of a PDF document using Free Spire.PDF for Java.
Step 1: Create a PdfDocument instance and call its loadFromFile() method to load the PDF document from which to extract text.
Step 2: Create a StringBuilder and add the text extracted from each page to it with the append() method.
Step 3: Create a new .txt file and write the text to it.
import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractAllText {

    public static void main(String[] args) {

        //Create a PdfDocument instance
        PdfDocument doc = new PdfDocument();

        //Load the PDF file
        doc.loadFromFile("C:\\Users\\Test1\\Desktop\\Sample.pdf");

        //Create a StringBuilder instance to collect the text
        StringBuilder sb = new StringBuilder();

        //Loop through PDF pages and get text of each page
        for (int i = 0; i < doc.getPages().getCount(); i++) {
            PdfPageBase page = doc.getPages().get(i);
            sb.append(page.extractText(true));
        }

        //Write text into a .txt file (the output folder must already exist);
        //try-with-resources closes the writer even if the write fails
        try (FileWriter writer = new FileWriter("output/ExtractText.txt")) {
            writer.write(sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }

        doc.close();
    }
}
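The listing above writes to output/ExtractText.txt, and FileWriter does not create missing folders, so the write fails when the output folder does not exist. As a minimal sketch of one way around that, the helper below creates the parent folders first and writes the text as UTF-8 using only the standard java.nio.file API. The class name TextFileSaver and the method save are hypothetical names chosen for this example, and the sample string stands in for the text collected in the StringBuilder.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Free Spire.PDF for Java
class TextFileSaver {

    // Writes text to fileName as UTF-8, creating missing parent folders
    // and replacing any existing file with the same name
    static Path save(String text, String fileName) throws IOException {
        Path target = Paths.get(fileName);
        Path parent = target.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder text standing in for sb.toString() from the listing
        Path saved = save("Sample extracted text", "output/ExtractText.txt");
        System.out.println("Saved to " + saved.toAbsolutePath());
    }
}
```

In the extraction program, the call would be save(sb.toString(), "output/ExtractText.txt") in place of the FileWriter block.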
2. Extract text from a particular page of a PDF document
import com.spire.pdf.*;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractTextFromParticularPage {

    public static void main(String[] args) throws IOException {

        //Load the PDF file
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile("C:\\Users\\Test1\\Desktop\\Sample.pdf");

        //Create a new txt file to save the extracted text, replacing any previous one
        String result = "output/extractTextFromParticularPage.txt";
        File file = new File(result);
        if (file.exists()) {
            file.delete();
        }
        file.createNewFile();

        //Get the third page (page indexes start at 0)
        PdfPageBase page = pdf.getPages().get(2);

        //Extract text from page keeping white space
        String text = page.extractText(true);

        //Extract text from page without keeping white space
        //String text = page.extractText(false);

        //Write the text; closing the BufferedWriter also closes the FileWriter
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(file, true))) {
            bw.write(text);
        }

        pdf.close();
    }
}
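The listing above passes 2 to get() to reach the third page, because the page collection is indexed from 0. When the page number comes from a user, a small conversion with a range check avoids off-by-one mistakes. The following is only a sketch: PageIndex and toZeroBased are hypothetical names, and the page count of 5 is a made-up value that would in practice come from pdf.getPages().getCount().

```java
// Hypothetical helper, not part of Free Spire.PDF for Java
class PageIndex {

    // Converts a 1-based page number to the 0-based index expected by get(),
    // rejecting numbers outside the document
    static int toZeroBased(int pageNumber, int pageCount) {
        if (pageNumber < 1 || pageNumber > pageCount) {
            throw new IllegalArgumentException(
                    "Page " + pageNumber + " is outside 1.." + pageCount);
        }
        return pageNumber - 1;
    }

    public static void main(String[] args) {
        int pageCount = 5; // assumed value for illustration
        System.out.println(toZeroBased(3, pageCount)); // prints 2
    }
}
```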
3. Extract text from a specific rectangle area of a PDF document
Besides extracting text from all pages or a particular page, Free Spire.PDF for Java can extract text from a specific rectangular area of a page. The steps are as follows.
Step 1: Create a PdfDocument object and load the PDF file.
Step 2: Get the page from which the text will be extracted.
Step 3: Extract text from a specific rectangular area within the page, then save the text to a .txt file.
import com.spire.pdf.*;
import java.awt.geom.Rectangle2D;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractTextFromSpecificArea {

    public static void main(String[] args) throws IOException {

        //Load the PDF file
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile("C:\\Users\\Test1\\Desktop\\Sample.pdf");

        //Create a new .txt file to save the extracted text, replacing any previous one
        File file = new File("output/extractTextFromSpecificArea.txt");
        if (file.exists()) {
            file.delete();
        }
        file.createNewFile();

        //Get the first page
        PdfPageBase page = pdf.getPages().get(0);

        //Extract text from a specific rectangular area within the page;
        //the Rectangle2D.Float arguments are x, y, width and height
        String text = page.extractText(new Rectangle2D.Float(80, 20, 500, 200));

        //Write the text; closing the BufferedWriter also closes the FileWriter
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(file, true))) {
            bw.write(text);
        }

        pdf.close();
    }
}
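The rectangle passed to extractText() is a standard java.awt.geom.Rectangle2D.Float, whose four arguments are x, y, width and height. If the area is computed rather than typed in, it may be worth trimming it to the page before extracting. The sketch below shows one way to do that with plain Java. It assumes an A4 page of 595 x 842 units and that the rectangle uses the same units as the page size; both are assumptions for illustration, and ExtractionArea and clampToPage are hypothetical names, not part of Free Spire.PDF for Java.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of Free Spire.PDF for Java
class ExtractionArea {

    // Returns the part of the requested area that lies on the page,
    // or an empty rectangle when the area is entirely off the page
    static Rectangle2D.Float clampToPage(Rectangle2D.Float area,
                                         float pageWidth, float pageHeight) {
        float left = Math.max(area.x, 0f);
        float top = Math.max(area.y, 0f);
        float right = Math.min(area.x + area.width, pageWidth);
        float bottom = Math.min(area.y + area.height, pageHeight);
        if (right <= left || bottom <= top) {
            return new Rectangle2D.Float();
        }
        return new Rectangle2D.Float(left, top, right - left, bottom - top);
    }

    public static void main(String[] args) {
        // Assumed A4 page size; a real program would read it from the page
        float pageWidth = 595f;
        float pageHeight = 842f;

        // Same area as in the listing: x=80, y=20, width=500, height=200
        Rectangle2D.Float area = new Rectangle2D.Float(80, 20, 500, 200);
        System.out.println(clampToPage(area, pageWidth, pageHeight));
    }
}
```

The rectangle from the listing already fits inside the assumed page, so it comes back unchanged; an area that runs past the right or bottom edge would be shortened to end at that edge.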

Tuesday, 3 November 2020
