How to Handle PDF Files Elegantly in Python (Python File Handling)

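1. Introduction

PDF is a file format we run into all the time in day-to-day work, and sometimes we need to edit such documents or extract useful data from them. In this article I will show how to use Python PDF libraries to extract text, tables, images, and other kinds of data from PDF documents. Without further ado, let's get started!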

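2. Extracting Text from a PDF File

Python has several libraries that make it easy to pull text out of a PDF file; one of the most commonly used is PyPDF2. Let's walk through an example to see how its functions are used.

Sample code: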

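    # import the module
    import PyPDF2

    # open the PDF file in binary mode
    pdfFileObj = open('file.pdf', 'rb')

    # create a PDF reader object
    # (called PdfFileReader in older releases; that name was removed in PyPDF2 3.0)
    pdfReader = PyPDF2.PdfReader(pdfFileObj)

    # get a page object for the first page (formerly pdfReader.getPage(0))
    pageObj = pdfReader.pages[0]

    # extract the text from the page (formerly pageObj.extractText())
    print(pageObj.extract_text())

    # close the PDF file object
    pdfFileObj.close()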

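Let's go through the code above line by line:

- First, we import the third-party library PyPDF2.
- Next, we use open() to read the PDF file in binary mode.
- We pass the resulting file object to the reader class (PdfFileReader in PyPDF2 before 3.0, PdfReader in 3.0 and later).
- We fetch the object for one page of the PDF, producing pageObj.
- We call the text-extraction method (extractText() in older releases, extract_text() in newer ones) to get the text.
- Finally, we call close() to close pdfFileObj.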

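Closing the file at the end matters. An open handle keeps holding system resources, and if we leave it open and then try to read another file, we may be greeted with a file-read error. The code above shows the logic for extracting a single page; from there we can use a loop to read all of the pages. Sample code: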

    # import the module
    import PyPDF2

    # open the PDF file in binary mode
    pdfFileObj = open('file.pdf', 'rb')

    # create a PDF reader object (PdfFileReader before PyPDF2 3.0)
    pdfReader = PyPDF2.PdfReader(pdfFileObj)

    # visit every page (formerly range(pdfReader.numPages) with getPage(i))
    for i in range(len(pdfReader.pages)):
        pageObj = pdfReader.pages[i]
        print(pageObj.extract_text())

    # close the PDF file object
    pdfFileObj.close()

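For example, suppose the PDF file we want to read looks like this:

[Figure: the sample PDF file]

Running the code above then produces the following output: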

A Simple PDF File This is a small demonstration .pdf file - just for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Boring, zzzzz. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Even more. Continued on page 2 ...
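The page loop from this section can also be wrapped in a `with` block, so that the file is closed even if extraction raises an error partway through. The sketch below is one way to do that and is not the article's original code: the helper names `join_page_texts` and `extract_all_text` are made up for illustration, and it assumes PyPDF2 3.0 or later, where the reader class is `PdfReader` and pages expose `extract_text()`.

```python
def join_page_texts(pages):
    """Return the text of every page, each preceded by a page-number header.

    `pages` can be any iterable of objects with an extract_text() method,
    such as the `pages` attribute of a PyPDF2 PdfReader.
    """
    parts = []
    for number, page in enumerate(pages, start=1):
        text = page.extract_text() or ""  # treat a missing result as empty text
        parts.append(f"--- page {number} ---\n{text}")
    return "\n".join(parts)


def extract_all_text(path):
    """Open `path`, extract the text of all pages, and close the file."""
    # Imported here so join_page_texts() stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader

    with open(path, "rb") as pdf_file:  # closed automatically when the block exits
        reader = PdfReader(pdf_file)
        return join_page_texts(reader.pages)
```

With this in place, `print(extract_all_text('file.pdf'))` would print the whole document, and there is no separate `close()` call to forget.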

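3. Extracting Tables from a PDF File

Extracting tables with PyPDF2 is inconvenient. To extract a table correctly, the table first has to be detected on the page (for example with computer-vision techniques), then its row and column structure has to be worked out, and only then can the cell contents be pulled out.

For this task, a recommended third-party Python module is Tabula (installed as the tabula-py package, which wraps the Java tool tabula-java and therefore needs a Java runtime). It is built specifically for reading and extracting tables from PDFs, and it can store them in CSV format.

Sample code: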

    import tabula

    # read every table in the PDF into a list of DataFrames
    dfs = tabula.read_pdf("test.pdf", pages='all')
    print(dfs)

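The code above works as follows:

- First, we import the third-party library tabula.
- Next, we call read_pdf to read the PDF file and extract the tables from all pages.
- Finally, we print the extracted tables.

Of course, we can also store the extracted data as CSV. Sample code: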

    import tabula

    # convert the tables in the PDF into a CSV file
    tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')
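`tabula.convert_into` writes everything into a single file. When each table should go into its own CSV file, the list returned by `read_pdf` can be looped over instead. This is a minimal sketch, assuming tabula-py and a Java runtime are installed; the helper names `save_tables` and `pdf_tables_to_csv` and the `table_N.csv` naming scheme are illustrative choices, not part of tabula.

```python
import os


def save_tables(tables, out_dir="."):
    """Write each table to its own CSV file and return the file paths.

    `tables` is any iterable of objects with a pandas-style
    to_csv(path, index=False) method, e.g. the list from tabula.read_pdf().
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for number, table in enumerate(tables, start=1):
        path = os.path.join(out_dir, f"table_{number}.csv")
        table.to_csv(path, index=False)  # index=False leaves out the row index
        paths.append(path)
    return paths


def pdf_tables_to_csv(pdf_path, out_dir="."):
    """Extract all tables from `pdf_path` with tabula and save them."""
    import tabula  # imported lazily; needs tabula-py and Java

    tables = tabula.read_pdf(pdf_path, pages="all")
    return save_tables(tables, out_dir)
```

Keeping the file-writing step separate from the extraction step also makes it easy to inspect or clean each DataFrame before it is saved.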

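4. Extracting Images from a PDF File

To extract images from a PDF file in Python we need other third-party modules. Install the PyMuPDF library together with the image-processing library Pillow as follows: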

pip install PyMuPDF Pillow

Sample code for extracting images from a PDF file:

    import fitz  # PyMuPDF
    import io
    from PIL import Image

    pdf_file = fitz.open("test2.pdf")

    # iterate over the PDF pages
    for page_index in range(len(pdf_file)):
        # get the page itself
        page = pdf_file[page_index]
        # list the images on the page (get_images was formerly getImageList)
        image_list = page.get_images()
        for image_index, img in enumerate(image_list, start=1):
            # get the XREF of the image
            xref = img[0]
            # extract the image bytes (extract_image was formerly extractImage)
            base_image = pdf_file.extract_image(xref)
            image_bytes = base_image["image"]
            # get the image extension
            image_ext = base_image["ext"]
            # load it into PIL
            image = Image.open(io.BytesIO(image_bytes))
            # save it under a name built from the page and image numbers
            image.save(f"image{page_index+1}_{image_index}.{image_ext}")

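Suppose our PDF file has the following content:

[Figure: contents of test2.pdf]

Testing the code above gives the following result:

[Figure: the extracted image files]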
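The same image can be referenced from several pages (a logo in the page header, for instance), in which case the loop in this section saves it once per page. A possible refinement is to remember which xrefs have already been seen. In this sketch, `unique_xrefs` and `extract_unique_images` are hypothetical helper names, and the code assumes a PyMuPDF version that provides the snake_case methods `get_images` and `extract_image`. It also writes the raw image bytes directly instead of passing them through Pillow.

```python
def unique_xrefs(image_lists):
    """Return image xrefs in first-seen order, without duplicates.

    `image_lists` holds one list per page; each entry is a tuple whose
    first element is the image xref, as returned by page.get_images().
    """
    seen = set()
    ordered = []
    for page_images in image_lists:
        for img in page_images:
            xref = img[0]
            if xref not in seen:
                seen.add(xref)
                ordered.append(xref)
    return ordered


def extract_unique_images(pdf_path, prefix="image"):
    """Save every distinct embedded image once; return the file names."""
    import fitz  # PyMuPDF, imported lazily

    names = []
    with fitz.open(pdf_path) as doc:
        xrefs = unique_xrefs(page.get_images() for page in doc)
        for xref in xrefs:
            info = doc.extract_image(xref)
            name = f"{prefix}_{xref}.{info['ext']}"
            with open(name, "wb") as out:
                out.write(info["image"])  # raw bytes, no re-encoding
            names.append(name)
    return names
```

Naming the files after the xref rather than the page number is a deliberate trade-off here: the page information is lost, but each file name is guaranteed to be unique.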

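5. Summary

This article showed how to use powerful third-party Python libraries to extract text, tables, and image data from PDF files, with a code example for each. Did you get all that?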
