Process multimodal and embedding models

This page discusses some methods you can use to process multimodal and embedding models.

Multimodal models

If you want to answer questions based on diagrams, LLMs with a text-in, text-out architecture will not help, because they never see the image. GPT-4o and GPT-4o mini accept image inputs, and there are also open-source options to consider:

Pix2Struct: Performed quite well in initial question-answering (QA) tests on a table in German. You can try it on Hugging Face.
Microsoft UDOP (Universal Document Processing): Open source, but not available on Hugging Face at the time of writing.

In this setup, the initial text extraction only needs to give you something to run (semantic) search over. Once the search has identified the relevant page, you run the multimodal model on the raw source page (image).
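The two-stage setup described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: `Page`, `find_best_page`, `answer_from_page`, and the `ask_image_model` callable are hypothetical names, the word-overlap score is only a stand-in for real semantic search, and the multimodal model is replaced by a stub.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    number: int
    extracted_text: str  # rough text extraction, used only for retrieval
    image_path: str      # raw page image, handed to the multimodal model


def overlap_score(query: str, text: str) -> int:
    """Placeholder relevance score: number of shared lowercase words.
    A real setup would use semantic (embedding-based) search here."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def find_best_page(query: str, pages: List[Page]) -> Page:
    """Stage 1: use the extracted text only to locate the relevant page."""
    return max(pages, key=lambda page: overlap_score(query, page.extracted_text))


def answer_from_page(
    query: str,
    pages: List[Page],
    ask_image_model: Callable[[str, str], str],
) -> str:
    """Stage 2: run the multimodal model on the raw image of that page."""
    page = find_best_page(query, pages)
    return ask_image_model(page.image_path, query)


def stub_image_model(image_path: str, question: str) -> str:
    """Stub standing in for a real multimodal model (assumption)."""
    return f"(answer from {image_path})"


pages = [
    Page(1, "introduction and company overview", "page_1.png"),
    Page(2, "revenue table by quarter in euro", "page_2.png"),
]
result = answer_from_page("revenue in the third quarter", pages, stub_image_model)
print(result)
```

The point of the split is that extraction quality only has to be good enough to find the right page; the answer itself comes from the image, so tables and diagrams that extract badly can still be read by the model.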

Embedding models

If you are working in English, you can try the MS MARCO models from the sentence-transformers docs.

MS MARCO is a collection of large-scale information retrieval datasets created from real user search queries on the Bing search engine. The provided models can be used for semantic search: given keywords, a search phrase, or a question, the model finds passages that are relevant to the search query.

This means these models were specifically trained to put queries and relevant passages close together in embedding space.
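To make "close together in embedding space" concrete, here is a small sketch that ranks passages by cosine similarity to a query vector. The three-dimensional vectors are invented for illustration; in practice they would come from an embedding model, and `rank_passages` is a hypothetical helper name.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(
    query_vec: Sequence[float],
    passage_vecs: Dict[str, Sequence[float]],
) -> List[str]:
    """Return passage names ordered from most to least similar to the query."""
    return sorted(
        passage_vecs,
        key=lambda name: cosine_similarity(query_vec, passage_vecs[name]),
        reverse=True,
    )


# Toy vectors, made up for this example (not produced by any real model).
query = [0.9, 0.1, 0.0]
passages = {
    "relevant_passage": [0.8, 0.2, 0.1],
    "unrelated_passage": [0.0, 0.1, 0.9],
}
ranking = rank_passages(query, passages)
print(ranking)
```

A model trained on query and passage pairs is one where this ranking tends to put the passages that actually answer the query first.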

For this reason, these models may be a better fit than general-purpose OpenAI Ada for semantic search workflows that start from a user query. When you embed a query directly with Ada and compare it to chunk embeddings, you are not comparing like with like: a short question and a passage that answers it are different kinds of text. Asymmetric embedding models are trained to bridge that gap. Alternatively, you can use an LLM to generate a hypothetical chunk first and embed that instead of the raw query.
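The hypothetical-chunk alternative might look like the sketch below. Everything here is an assumption made for illustration: `search_via_hypothetical_chunk` is a hypothetical helper, `generate` stands for an LLM call and `embed` for an embedding call, and both are replaced by toy stubs so the example runs without any external service.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def search_via_hypothetical_chunk(
    query: str,
    chunk_vectors: Dict[str, Vector],
    generate: Callable[[str], str],
    embed: Callable[[str], Vector],
) -> str:
    """Embed an LLM-written hypothetical chunk instead of the raw query,
    so the comparison is chunk-to-chunk rather than query-to-chunk."""
    hypothetical_chunk = generate(query)
    vector = embed(hypothetical_chunk)
    return max(chunk_vectors, key=lambda cid: cosine(vector, chunk_vectors[cid]))


# --- Toy stand-ins (assumptions, not real model calls) ---
VOCAB = ["warranty", "months", "shipping", "days"]


def toy_embed(text: str) -> List[float]:
    """Bag-of-words counts over a tiny fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def stub_generate(query: str) -> str:
    """Pretend LLM: returns a canned passage that could answer the query."""
    return "the warranty covers 24 months"


chunks = {
    "chunk_warranty": "warranty period is 24 months from purchase",
    "chunk_shipping": "shipping takes 3 to 5 days",
}
chunk_vectors = {cid: toy_embed(text) for cid, text in chunks.items()}
best = search_via_hypothetical_chunk(
    "how long am I covered?", chunk_vectors, stub_generate, toy_embed
)
print(best)
```

In this toy setup the raw query shares no vocabulary with either chunk, while the generated passage does, which is the gap the technique is meant to close. The hypothetical chunk does not need to be factually correct; it only needs to resemble the kind of text stored in the index.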

Ada, in turn, makes more sense when your starting point is a chunk and you are searching for similar chunks. Note that most non-Ada embedding models only support inputs of up to 512 tokens, so you need to adapt your chunking strategy accordingly.
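One way to adapt chunking to such a limit is sketched below. It is a rough illustration, not a recommended implementation: `chunk_by_token_limit` is a hypothetical helper, and whitespace-separated words stand in for tokens. A real tokenizer usually produces more tokens than words, so in practice you would count with the model's own tokenizer or leave a safety margin.

```python
from typing import List


def chunk_by_token_limit(
    text: str, max_tokens: int = 512, overlap: int = 0
) -> List[str]:
    """Split text into chunks of at most max_tokens whitespace tokens.
    Consecutive chunks share `overlap` tokens to preserve some context."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")

    tokens = text.split()  # assumption: one word is roughly one token
    step = max_tokens - overlap
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


# Usage: a synthetic 1200-word document.
document = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_by_token_limit(document)
print([len(chunk.split()) for chunk in chunks])
```

Passing a non-zero `overlap` repeats the tail of each chunk at the start of the next one, which can help when a relevant sentence would otherwise be cut in half at a chunk boundary.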

If you are working in German, for example, GPT is currently the only LLM that performs decently in that language. With a German document corpus, try Ada.
