Extract text from PDF (using OCR)

Supported in: Batch, Faster

Extracts text from the pages in a PDF file using optical character recognition (OCR).

Expression categories: Media

Declared arguments

Languages to detect: Languages to detect in the input files.
Set\<Enum\<Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Azerbaijani (Cyrillic), Basque, Belarusian, etc...>>
Media reference: The column containing media references to PDF files in a media set.
Expression\<Media reference>
OCR output format: The output is an array of strings; each entry corresponds to one page of the PDF.
Enum\<Text, hOCR>
Scripts to detect: Scripts to detect in the input files.
Set\<Enum\<Arabic, Armenian, Bengali, Canadian Aboriginal, Cherokee, Cyrillic, Devanagari, Ethiopic, Fraktur, Georgian, etc...>>
optional End page: The end of the page range (inclusive). Negative indexing is supported.
Expression\<Integer>
optional Error handling: Determines the behavior of the pipeline for inputs that fail to process.
Enum\<Fail, Null>
optional Start page: The start of the page range. Defaults to the first page if no value is provided.
Expression\<Integer>
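The page-range arguments can be read as in the following minimal Python sketch. It is an illustration under stated assumptions, not the product's implementation: the source says only that the end page is inclusive, that negative indexing is supported, and that the start defaults to the first page. The sketch additionally assumes that pages are numbered from 1, that a negative value counts back from the last page (-1 being the last page), and that an omitted end page means the last page. `resolve_page_range` is a hypothetical helper name.

```python
from typing import List, Optional


def resolve_page_range(page_count: int,
                       start_page: Optional[int] = None,
                       end_page: Optional[int] = None) -> List[int]:
    """Return the 1-based page numbers selected by a start/end pair.

    Assumptions (not confirmed by the source): pages are numbered from 1,
    a negative value counts back from the last page (-1 is the last page),
    and an omitted end page means the last page.
    """
    def normalize(page: int) -> int:
        # Map a negative index onto its 1-based equivalent.
        return page_count + 1 + page if page < 0 else page

    start = 1 if start_page is None else normalize(start_page)
    end = page_count if end_page is None else normalize(end_page)

    # Clamp to the document so out-of-range values select nothing extra.
    start = max(start, 1)
    end = min(end, page_count)

    # The end page is inclusive, so add 1 for Python's half-open range.
    return list(range(start, end + 1))


print(resolve_page_range(10))                            # every page
print(resolve_page_range(10, start_page=3, end_page=5))  # pages 3 to 5
print(resolve_page_range(10, end_page=-2))               # stops one page before the last
```

Under these assumptions, an end page that resolves to a position before the start page selects no pages at all.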

Output type: Array\<String>
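To illustrate the output shape together with the error-handling option, here is a minimal sketch in plain Python. It does not call a real OCR engine: `ocr_page` and `fake_ocr` are hypothetical stand-ins, `extract_text` is a hypothetical helper, and the `"fail"` and `"null"` mode names are assumed labels for the two error-handling choices. The sketch also assumes that a failing input yields a null for the whole input rather than for a single page.

```python
from typing import Callable, List, Optional


def extract_text(pages: List[bytes],
                 ocr_page: Callable[[bytes], str],
                 error_handling: str = "fail") -> Optional[List[str]]:
    """Run a per-page OCR callable and return one string per page.

    error_handling="fail" lets the error propagate; "null" returns None
    for the whole input (an assumption about the real behavior).
    """
    if error_handling not in ("fail", "null"):
        raise ValueError(f"unknown error_handling: {error_handling!r}")
    try:
        return [ocr_page(page) for page in pages]
    except Exception:
        if error_handling == "null":
            return None
        raise


def fake_ocr(page: bytes) -> str:
    # Hypothetical stand-in: decodes bytes instead of recognizing characters.
    return page.decode("utf-8")


good = [b"This text came from page one.", b"So did this text."]
bad = [b"\xff\xfe"]  # not valid UTF-8, so fake_ocr raises

print(extract_text(good, fake_ocr))                        # list with one string per page
print(extract_text(bad, fake_ocr, error_handling="null"))  # None
```

The null mode lets the remaining inputs continue when one input cannot be processed, while the fail mode surfaces the problem immediately.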

Argument values:

mediaReference	Output
{"mimeType":"application/pdf","reference":{"type":"mediaSetItem","mediaSetItem":{"mediaSetRid":"ri.mio.main.media-set.a", "mediaItemRid":"ri.mio.main.media-item.a"}}}	[ This text came from the PDF document in the media set., So did this text. ]
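The media reference in the example row is a JSON object. As a small illustration, the following Python sketch parses that exact string with the standard `json` module and pulls out the MIME type and the two resource identifiers. `parse_media_reference` is a hypothetical helper name, and only the field names visible in the example row are assumed.

```python
import json
from typing import Dict

# The media reference string copied from the example row above.
EXAMPLE = ('{"mimeType":"application/pdf","reference":{"type":"mediaSetItem",'
           '"mediaSetItem":{"mediaSetRid":"ri.mio.main.media-set.a", '
           '"mediaItemRid":"ri.mio.main.media-item.a"}}}')


def parse_media_reference(raw: str) -> Dict[str, str]:
    """Extract the MIME type and resource identifiers from a media reference."""
    ref = json.loads(raw)
    item = ref["reference"]["mediaSetItem"]
    return {
        "mime_type": ref["mimeType"],
        "media_set_rid": item["mediaSetRid"],
        "media_item_rid": item["mediaItemRid"],
    }


parsed = parse_media_reference(EXAMPLE)
print(parsed["mime_type"])       # application/pdf
print(parsed["media_set_rid"])   # ri.mio.main.media-set.a
print(parsed["media_item_rid"])  # ri.mio.main.media-item.a
```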
