Import a Hugging Face model

:::callout{theme="neutral"} Users of the now-deprecated Import an open source model functionality can use the language model adapters source code as a starting point for writing adapters that wrap Hugging Face models. The linked adapters are no longer actively maintained by Palantir and are provided for reference only. Instead, publish, maintain, and extend replacement adapters at your discretion, directly from Foundry adapter repositories. :::

In addition to the LLMs natively supported by Palantir AIP, Foundry enables users to integrate with popular language model frameworks such as Hugging Face ↗ or spaCy ↗. Such natural language frameworks are available for use in Foundry as packages distributed through Conda or PyPI.

However, these language model frameworks typically require users to download pre-trained models or corpora from the internet for further fine-tuning. Because Foundry's security architecture denies user-written Python code access to the public internet by default, these downloads often need additional steps before they are fully functional within Foundry. The exact steps depend on the particulars of the language model framework.

Hugging Face

To import a Hugging Face language model into Foundry, two options are available:

  1. Import the model files as a dataset.
  2. Allowlist the Hugging Face domains.

Import the model files as a dataset

You can import any model from the Hugging Face model hub as a dataset. You can use that dataset in Code Repositories or Code Workspaces.

First, download ↗ the model files from Hugging Face. Then, upload the model as a schema-less dataset to bring the files into Foundry. These files can be uploaded either through a frontend upload (New > Dataset > Import > Select all files) or through a Data Connection sync if model files are stored on a shared drive in your private network.

The imported dataset should contain all files from the Files and versions tab of the model details page in Hugging Face. However, only one binary model file is required (for example, pytorch_model.bin or tf_model.h5). In most cases, we recommend using the PyTorch model as the binary model file.
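
Before uploading, it can help to confirm locally that the downloaded folder contains what the loader needs. The following is a minimal sketch and not part of any Palantir or Hugging Face tooling: the function name `check_model_files` and the set of accepted weight file names are assumptions for illustration. `model.safetensors` is included on the assumption that some repositories ship it instead of `pytorch_model.bin`, and sharded checkpoints are not handled.

```python
from pathlib import Path

# Weight file names this sketch accepts (an assumption; extend as needed).
WEIGHT_FILES = ("pytorch_model.bin", "tf_model.h5", "model.safetensors")


def check_model_files(model_dir):
    """Return a list of problems found in a downloaded model folder."""
    root = Path(model_dir)
    # Only look at regular files directly inside the folder.
    names = {p.name for p in root.iterdir() if p.is_file()}
    problems = []
    # from_pretrained reads the architecture settings from config.json.
    if "config.json" not in names:
        problems.append("missing config.json")
    # At least one binary model file must be present.
    if not names.intersection(WEIGHT_FILES):
        problems.append("no binary model file found")
    return problems
```

An empty list means the folder passed both checks and is ready to upload.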

Once the model files are stored in a dataset, you can use the dataset as an input in your transform. Depending on the size of your model, you may need to specify a Spark profile like DRIVER_MEMORY_MEDIUM to load the model into memory. The code below uses a utility from foundry-huggingface-adapters:

from palantir_models.transforms import ModelOutput
from transforms.api import transform, Input
from transformers import AutoTokenizer, AutoModel
from huggingface_adapters.utils import copy_model_to_driver


@transform(
    model_output=ModelOutput("/path/to/output/model_asset"),
    hugging_face_raw=Input("/path/to/input/dataset"),
)
def compute(model_output, hugging_face_raw):
    temp_dir = copy_model_to_driver(hugging_face_raw.filesystem())
    tokenizer = AutoTokenizer.from_pretrained(temp_dir)
    model = AutoModel.from_pretrained(temp_dir)

    # Wrap the model with a model adapter and save as a model
    # model_output.publish(...)
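
The internals of `copy_model_to_driver` are not shown here. As a rough mental model only, a helper of this kind would copy each file from the dataset's file system into a local temporary directory that `from_pretrained` can read. The sketch below assumes the file system object exposes `ls()` (yielding entries with a `path` attribute relative to the dataset root) and `open(path, mode)`, as the `transforms.api` file system does. The function name `copy_files_to_local_dir` is hypothetical, and the actual utility may behave differently.

```python
import os
import shutil
import tempfile


def copy_files_to_local_dir(filesystem):
    """Copy every file exposed by `filesystem` into a new temporary directory."""
    local_dir = tempfile.mkdtemp()
    for entry in filesystem.ls():
        target = os.path.join(local_dir, entry.path)
        # Files may sit in subfolders of the dataset, so create them first.
        os.makedirs(os.path.dirname(target), exist_ok=True)
        # Stream the bytes so large weight files are not held in memory at once.
        with filesystem.open(entry.path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return local_dir
```

The returned path can then be passed to `AutoTokenizer.from_pretrained` and `AutoModel.from_pretrained` in the same way as `temp_dir` above.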

Depending on the use case, you can use one of the language model adapters like EmbeddingAdapter:

# other imports
from huggingface_adapters.embedding_adapter import EmbeddingAdapter
# ...
    model_output.publish(
        model_adapter=EmbeddingAdapter(tokenizer, model),
        change_type=ModelVersionChangeType.MINOR
    )

Allowlist Hugging Face domains

To download models from Hugging Face directly, you can allowlist the relevant domains in your Foundry enrollment by configuring a network egress policy. The relevant domains to allowlist are:

:::callout{theme="warning"} The domains that Hugging Face uses to serve model downloads can change without warning. If you still cannot download a model after allowlisting the above domains, identify which domains remain blocked by logging the outgoing requests. You can reference detailed instructions on Stack Overflow ↗ to help you enable debugging at the httplib level. :::
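
As an illustration of the request logging mentioned in the warning, the sketch below turns on debug output from Python's standard `http.client` module and from the `urllib3` logger. It assumes the installed Hugging Face libraries make their requests through `requests`/`urllib3`; if they use a different HTTP client, the logger name will differ. The function name `enable_http_debug_logging` is hypothetical.

```python
import http.client
import logging


def enable_http_debug_logging():
    """Print outgoing HTTP requests so blocked hosts show up in the logs."""
    # http.client prints request lines and headers to stdout when debuglevel > 0.
    http.client.HTTPConnection.debuglevel = 1
    # Send DEBUG-level log records to the root handler.
    logging.basicConfig(level=logging.DEBUG)
    # urllib3 logs each connection attempt together with the target host.
    urllib3_logger = logging.getLogger("urllib3")
    urllib3_logger.setLevel(logging.DEBUG)
    urllib3_logger.propagate = True
    return urllib3_logger
```

Calling this at the start of the transform, before any `from_pretrained` call, should surface the host names of requests that fail, which can then be compared against the egress policy.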

In addition, the code repository that loads the model from Hugging Face must have the transforms-external-systems library added and be configured accordingly to use the newly created egress policy. Once the configuration is set up, open-source language models can be loaded in a Python transform.

:::callout{theme="neutral"} If you receive the error PermissionError: [Errno 13] Permission denied: '/.cache', you must pass in a cache directory during your model load as shown in the example below. :::

from palantir_models.transforms import ModelOutput
from transforms.api import transform, Input
from transforms.external.systems import use_external_systems, EgressPolicy, ExportControl
from transformers import AutoTokenizer, AutoModel
import tempfile

@use_external_systems(
    export_control=ExportControl(markings=['<marking ID>']),
    egress=EgressPolicy('<policy RID>'),
)
@transform(
    model_output=ModelOutput('/path/to/output/model_asset'),
    text_input=Input('/path/to/input/dataset'),
)
def compute(export_control, egress, model_output, text_input):
    CACHE_DIR = tempfile.mkdtemp()
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", cache_dir=CACHE_DIR)
    model = AutoModel.from_pretrained("bert-base-cased", cache_dir=CACHE_DIR)

    # Wrap the model instance with a model adapter and save it in the Palantir platform
    # model_output.publish(...)

Usage in Foundry

Once you have access to the language model, either through a dataset or through the Hugging Face domains, you can integrate with it as a Palantir model by wrapping the language model with a model adapter as defined in the model training in Code Repositories documentation.

