In this article, I’ll show how to extract (read) text from a PDF document in a Java program using Free Spire.PDF for Java, a free third-party library. The library is a PDF API that lets developers create, manipulate, read, convert and print PDF documents without installing Adobe Acrobat.
The tutorial covers the following three tasks:
1. Extract all text from a PDF document
2. Extract text from a particular page of a PDF document
3. Extract text from a specific rectangular area of a PDF document
Create a Maven project in IDEA, add the following configuration to the pom.xml file, and then click the “Import Changes” button. If you do not use Maven, download the package from the link and manually import the Spire.Pdf.jar file in the “lib” folder into IDEA.
<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf.free</artifactId>
        <version>3.9.0</version>
    </dependency>
</dependencies>
Using the code
1. Extract all text from a PDF document
Here are the steps to extract text from all pages of a PDF document using Free Spire.PDF for Java.
Step 1: Create a PdfDocument instance and use its loadFromFile() method to load the PDF document you want to extract text from.
Step 2: Create a StringBuilder and add the text extracted from each page to it with the append() method.
Step 3: Create a new .txt file and write the text to it.
import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import java.io.*;

public class ExtractAllText {
    public static void main(String[] args) {
        //Create a PdfDocument instance
        PdfDocument doc = new PdfDocument();
        //Load the PDF file
        doc.loadFromFile("C:\\Users\\Test1\\Desktop\\Sample.pdf");
        //Create a StringBuilder instance
        StringBuilder sb = new StringBuilder();
        PdfPageBase page;
        //Loop through PDF pages and get text of each page
        for (int i = 0; i < doc.getPages().getCount(); i++) {
            page = doc.getPages().get(i);
            sb.append(page.extractText(true));
        }
        //Write text into a .txt file (the "output" folder must already exist);
        //try-with-resources closes the writer even if writing fails
        try (FileWriter writer = new FileWriter("output/ExtractText.txt")) {
            writer.write(sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
        doc.close();
    }
}
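The example above writes to output/ExtractText.txt, and FileWriter fails if the "output" folder does not exist. As a rough sketch of one way around that, the helper below creates any missing parent folders before writing the text as UTF-8. The class name TextFileSaver and the method save are hypothetical names chosen for illustration; they use only the standard java.nio.file API and are not part of Free Spire.PDF for Java.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Free Spire.PDF for Java
class TextFileSaver {

    // Writes text to the target path as UTF-8, creating missing parent
    // folders and replacing the content of any existing file
    static Path save(String text, String target) throws IOException {
        Path path = Paths.get(target);
        Path parent = path.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // By default Files.write creates the file or truncates an existing one
        return Files.write(path, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // In the example above, this string would come from sb.toString()
        Path written = save("Sample extracted text", "output/ExtractText.txt");
        System.out.println("Saved to " + written.toAbsolutePath());
    }
}
```

Writing with an explicit charset also avoids depending on the platform default encoding, which can matter when the PDF contains non-ASCII text.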
2. Extract text from a particular page of a PDF document
import com.spire.pdf.*;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractTextFromParticularPage {
    public static void main(String[] args) throws IOException {
        //Load the PDF file
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile("C:\\Users\\Test1\\Desktop\\Sample.pdf");
        //Create a new txt file to save the extracted text,
        //deleting any file left over from a previous run
        String result = "output/extractTextFromParticularPage.txt";
        File file = new File(result);
        if (file.exists()) {
            file.delete();
        }
        file.createNewFile();
        FileWriter fw = new FileWriter(file, true);
        BufferedWriter bw = new BufferedWriter(fw);
        //Get the third page (page indexes start at 0)
        PdfPageBase page = pdf.getPages().get(2);
        // Extract text from page keeping white space
        String text = page.extractText(true);
        // Extract text from page without keeping white space
        //String text = page.extractText(false);
        bw.write(text);
        bw.flush();
        //Closing the BufferedWriter also closes the underlying FileWriter
        bw.close();
        pdf.close();
    }
}
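Because get(2) returns the third page, it is easy to be off by one when a page number comes from a user. The sketch below shows one possible way to convert a 1-based page number into the 0-based index and reject values outside the document. PageIndex and toIndex are hypothetical names, and the page count of 5 is an assumed value; in the example above it would come from pdf.getPages().getCount().

```java
// Hypothetical helper for illustration; not part of Free Spire.PDF for Java
class PageIndex {

    // Converts a 1-based page number to a 0-based index,
    // rejecting numbers that fall outside the document
    static int toIndex(int pageNumber, int pageCount) {
        if (pageNumber < 1 || pageNumber > pageCount) {
            throw new IllegalArgumentException(
                    "Page " + pageNumber + " is outside 1.." + pageCount);
        }
        return pageNumber - 1;
    }

    public static void main(String[] args) {
        int pageCount = 5; // assumed value for this sketch
        System.out.println(toIndex(3, pageCount)); // prints 2
    }
}
```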
3. Extract text from a specific rectangular area of a PDF document
In addition to extracting text from all pages or from a particular page of a PDF document,
Free Spire.PDF for Java can extract text from a specific rectangular area of a page.
The steps are as follows.
Step 1: Create a PdfDocument object and load the PDF file.
Step 2: Get the page to extract text from.
Step 3: Extract the text from a specific rectangular area within the page, then save it to a .txt file.
import com.spire.pdf.*;
import java.awt.geom.Rectangle2D;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractTextFromSpecificArea {
    public static void main(String[] args) throws IOException {
        //Load the PDF file
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile("C:\\Users\\Test1\\Desktop\\Sample.pdf");
        //Create a new .txt file to save the extracted text,
        //deleting any file left over from a previous run
        File file = new File("output/extractTextFromSpecificArea.txt");
        if (file.exists()) {
            file.delete();
        }
        file.createNewFile();
        FileWriter fw = new FileWriter(file, true);
        BufferedWriter bw = new BufferedWriter(fw);
        //Get the first page
        PdfPageBase page = pdf.getPages().get(0);
        //Extract text from a specific rectangular area within the page;
        //the Rectangle2D.Float arguments are x, y, width and height
        String text = page.extractText(new Rectangle2D.Float(80, 20, 500, 200));
        bw.write(text);
        bw.flush();
        //Closing the BufferedWriter also closes the underlying FileWriter
        bw.close();
        pdf.close();
    }
}
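The four arguments of Rectangle2D.Float are x, y, width and height, so a rectangle can easily extend past the edge of the page. As a minimal sketch, assuming the rectangle and the page size are expressed in the same units, the helper below trims a requested area to the page before it is passed to extractText. AreaClipper and clipToPage are hypothetical names, and the 595 x 842 page size (A4 in points) is an assumed value; how the library itself treats an out-of-page rectangle is not covered here.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper for illustration; not part of Free Spire.PDF for Java
class AreaClipper {

    // Returns the part of the requested area that lies on the page,
    // or throws if the area does not overlap the page at all
    static Rectangle2D clipToPage(Rectangle2D area, double pageWidth, double pageHeight) {
        Rectangle2D page = new Rectangle2D.Double(0, 0, pageWidth, pageHeight);
        Rectangle2D clipped = area.createIntersection(page);
        if (clipped.isEmpty()) {
            throw new IllegalArgumentException("The area does not overlap the page");
        }
        return clipped;
    }

    public static void main(String[] args) {
        // This area is 600 wide starting at x = 80, so it overshoots a 595-wide page
        Rectangle2D area = new Rectangle2D.Float(80, 20, 600, 200);
        Rectangle2D clipped = clipToPage(area, 595, 842); // assumed A4 page size
        // The clipped area keeps x = 80, y = 20 and height 200; the width shrinks to 515
        System.out.println(clipped.getX() + ", " + clipped.getY() + ", "
                + clipped.getWidth() + ", " + clipped.getHeight());
    }
}
```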