Merging PDFs with Ghostscript

Posted on June 30, 2020 by annhe

The use case here is merging several PDFs into a single file and adding a table of contents (bookmarks) to it.

Contents

1 Core command
2 Table of contents
3 PDF validity check
4 References

Core command

timeout 600 gs -q -sDEVICE=pdfwrite -dBATCH -sOUTPUTFILE=$BOOK_PATH -dNOPAUSE $INPUT $TOC

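Here $TOC is the pdfmark file that holds the table of contents. In some cases pdfwrite fails to produce a usable PDF. As a workaround, generate a PostScript file with ps2write first, then convert that file to PDF: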

timeout 600 gs -q -sDEVICE=ps2write -dBATCH -sOUTPUTFILE=/tmp/$TAG.ps -dNOPAUSE $INPUT && r=0 || r=1
[ $r -eq 0 ] && timeout 600 gs -q -sDEVICE=pdfwrite -dBATCH -sOUTPUTFILE=$BOOK_PATH -dNOPAUSE /tmp/$TAG.ps $TOC && r=0 || r=1
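
The same two-step fallback can be driven from a script. The following is a minimal Python sketch, not the author's implementation: the names build_gs_command, run_gs and merge_pdfs are hypothetical, gs is assumed to be on PATH, and falling back automatically whenever the direct pdfwrite pass fails is an assumption of this example. Only the Ghostscript switches already shown above are used.

```python
import subprocess
import tempfile
from pathlib import Path


def build_gs_command(device, output, inputs):
    """Return the gs argument list for one pass (same switches as above)."""
    return ["gs", "-q", f"-sDEVICE={device}", "-dBATCH",
            f"-sOUTPUTFILE={output}", "-dNOPAUSE", *map(str, inputs)]


def run_gs(device, output, inputs, timeout=600):
    """Run one gs pass; True on exit status 0, False on failure or timeout."""
    try:
        result = subprocess.run(build_gs_command(device, output, inputs),
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def merge_pdfs(inputs, toc, book_path, timeout=600):
    """Try pdfwrite directly; fall back to ps2write followed by pdfwrite."""
    if run_gs("pdfwrite", book_path, [*inputs, toc], timeout):
        return True
    # Hypothetical fallback policy: go through an intermediate .ps file.
    with tempfile.TemporaryDirectory() as tmp:
        ps_path = Path(tmp) / "merged.ps"
        return (run_gs("ps2write", ps_path, inputs, timeout)
                and run_gs("pdfwrite", book_path, [ps_path, toc], timeout))
```

Passing the arguments as a list avoids shell word splitting, so input paths containing spaces need no extra quoting.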

Table of contents

The basic structure of a pdfmark table of contents is as follows:

[/ModDate (D:20200411054704) /CreationDate (D:20200411054704) /Creator (Foxit PDF Creator Version 2.0.0 build 0725) /Producer (Foxit PDF Creator Version 2.0.0 build 0725) /DOCINFO pdfmark
[/Title <FEFF7B2C4E007AE0> /Count 0 /Page 1 /OUT pdfmark
[/Title <FEFF7B2C4E8C7AE0> /Count 0 /Page 2 /OUT pdfmark
[/Title <FEFF7B2C4E097AE0> /Count 0 /Page 3 /OUT pdfmark
[/Title <FEFF7B2C56DB7AE0> /Count 0 /Page 4 /OUT pdfmark
[/Title <FEFF7B2C4E947AE0> /Count 0 /Page 5 /OUT pdfmark

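/Title is the bookmark title, written as a PostScript string: either a literal in parentheses, e.g. (Chapter 1), or a hex string in angle brackets as in the example above
/Count is the number of child entries under this entry, or 0 if it has none. If it is greater than 0, that many of the entries that follow are treated as children of this entry
/Page is the page number the entry points to
/OUT marks the line as an outline (bookmark) entry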
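
To illustrate how /Count groups entries, here is a minimal Python sketch that flattens a nested outline into pdfmark lines. The (title, page, children) tuple shape and the names hex_title and outline_lines are assumptions made for this example. Every title is written as a hex string with the FEFF prefix, so no escaping is needed.

```python
def hex_title(title):
    """Encode a title as a PDF hex string: BOM (FEFF) + UTF-16BE bytes."""
    return "<FEFF" + title.encode("utf-16-be").hex().upper() + ">"


def outline_lines(nodes):
    """Yield one /OUT pdfmark line per node, parents before their children.

    Each node is a (title, page, children) tuple. /Count is the number of
    direct children, so the entries that follow nest under the parent.
    """
    for title, page, children in nodes:
        yield (f"[/Title {hex_title(title)} /Count {len(children)} "
               f"/Page {page} /OUT pdfmark")
        yield from outline_lines(children)


# Hypothetical outline: one part with two chapters, then a flat entry.
toc = [
    ("Part I", 1, [("Chapter 1", 1, []), ("Chapter 2", 5, [])]),
    ("Part II", 9, []),
]
for line in outline_lines(toc):
    print(line)
```

The "Part I" line is emitted with /Count 2, so the two chapter lines that follow it become its children, while "Part II" stays at the top level.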

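The titles above are 第一章 through 第五章 (Chapter 1 through Chapter 5). Note that Chinese titles must be encoded as UTF-16BE with a leading BOM (FEFF). In Python, the core code looks like this: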

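def pdfmark(toc, count):
	# toc is assumed to be a parsed HTML element (e.g. a BeautifulSoup tag) whose rel attribute holds the page number
	# Encode the title as UTF-16BE and write the bytes as upper-case hex
	title = toc.getText().encode('utf-16be')
	title = title.hex().upper()
	# FEFF is the byte order mark that must precede the UTF-16BE bytes
	print("[/Title <FEFF" + title + "> /Count " + str(count) + " /Page " + toc.attrs['rel'][0] + " /OUT pdfmark")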
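
As a quick sanity check of the encoding, a hex title can be decoded back to text. This is a small sketch for verification only. The helper name decode_hex_title is hypothetical.

```python
def decode_hex_title(hex_string):
    """Decode a pdfmark hex title such as <FEFF7B2C4E007AE0> back to text."""
    raw = bytes.fromhex(hex_string.strip("<>"))
    # The "utf-16" codec reads the leading BOM (FE FF) to select big-endian
    # byte order and drops the BOM from the decoded result.
    return raw.decode("utf-16")


# Round-trip check against the first sample title shown earlier.
assert decode_hex_title("<FEFF7B2C4E007AE0>") == "第一章"
```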

PDF validity check

Use pdfinfo to check whether a file is a valid PDF.

function verifyPDF() {
        # Only PDF books are checked; other types pass through.
        if [ "$BOOKTYPE" == "PDF" ]; then
                # pdfinfo prints a "PDF version" line only for files it can parse.
                if ! pdfinfo "$1" 2>/dev/null | grep -qi "PDF version"; then
                        _warn "$1 not valid pdf. will delete. TAG:$TAG" "$LOGPRE"
                        rm -f "$1"
                        return 1
                fi
        fi
        return 0
}
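
The same check can be sketched in Python for scripts that already generate the pdfmark file there. This is an illustrative sketch under stated assumptions: the names has_pdf_version and is_valid_pdf are hypothetical, pdfinfo is assumed to be on PATH, and unlike the shell function above it only reports the result and does not delete the file.

```python
import subprocess


def has_pdf_version(pdfinfo_output):
    """True if the text contains a "PDF version" line (case-insensitive)."""
    return any("pdf version" in line.lower()
               for line in pdfinfo_output.splitlines())


def is_valid_pdf(path, timeout=60):
    """Run pdfinfo on path and look for the version line.

    A missing pdfinfo binary, a timeout, or output without a version line
    are all treated as "not a valid PDF".
    """
    try:
        result = subprocess.run(["pdfinfo", str(path)], capture_output=True,
                                text=True, errors="replace", timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return has_pdf_version(result.stdout)
```

Keeping the text check separate from the subprocess call lets the matching logic be exercised without pdfinfo installed.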

References

1. https://blog.csdn.net/u010252464/article/details/88932221
2. https://stackoverflow.com/questions/9188189/wrong-encode-when-update-pdf-meta-data-using-ghostscript-and-pdfmark
