Model adapter API

The model adapter's api() method specifies the expected inputs and outputs in order to execute this model adapter's inference logic. Inputs and outputs are specified separately.

At runtime, the model adapter's predict() method is called with the specified inputs.

Example api() implementation

The following example shows an API specifying one input, named input_dataframe, and one output, named output_dataframe. Both are specified as Pandas dataframes. The input dataframe has one column of float type named input_feature, and the output dataframe has two columns: (1) a column named input_feature of float type and (2) a column named prediction of float type.

import palantir_models as pm


class ExampleModelAdapter(pm.ModelAdapter):
    ...

    @classmethod
    def api(cls):
        inputs = {
            "input_dataframe": pm.Pandas(columns=[("input_feature", float)])
        }
        outputs = {
            "output_dataframe": pm.Pandas(columns=[("input_feature", float), ("prediction", float)])
        }
        return inputs, outputs

    ...

The API definition can also be extended to support multiple inputs or outputs of arbitrary types:

import palantir_models as pm


class ExampleModelAdapter(pm.ModelAdapter):
    ...

    @classmethod
    def api(cls):
        inputs = {
            "input_dataframe": pm.Pandas(columns=[("input_feature", float)]),
            "input_parameter": pm.Parameter(float, default=1.0)
        }
        outputs = {
            "output_dataframe": pm.Pandas(columns=[("input_feature", float), ("prediction", float)])
        }
        return inputs, outputs

    ...

API types

The types of inputs and outputs for the model adapter API can be specified with the following classes, defined in detail below:

  • pm.Pandas, for Pandas Dataframes (*)
  • pm.Spark, for Spark Dataframes (*)
  • pm.Parameter, for constant, single-valued parameters
  • pm.FileSystem, for Foundry Dataset filesystem access
  • pm.MediaReference, for use with Media References
  • pm.Object, for use with Ontology Objects
  • pm.ObjectSet, for use with Ontology Object Sets

# The following classes are accessible via `palantir_models` or `pm`

class Pandas:
    def __init__(self, columns: List[Union[str, Tuple[str, type]]]):
        """
        Defines a Pandas Dataframe input or output. Column name and type definitions can be specified as a parameter of this type.
        """

class Spark:
    def __init__(self, columns: List[Union[str, Tuple[str, type]]] = []):
        """
        Defines a Spark Dataframe (pyspark.sql.DataFrame) input or output. Column name and type definitions can be specified as a parameter of this type.
        """

class Parameter:
    def __init__(self, type: type = Any, default = None):
        """
        Defines a constant single-valued parameter input or output. The type of this parameter (default Any) and default value can be specified as parameters of this type.
        """

class FileSystem:
    def __init__(self):
        """
        Defines a FileSystem access input or output object. This type is only usable if the model adapter's `transform()` or `transform_write()` method is called with Foundry Dataset objects.

        If used as an input, the FileSystem representation of the dataset is returned.

        If used as an output, an object containing an `open()` method is used to write files to the output dataset.

        Note that FileSystem outputs are only usable via calling `.transform_write()`.
        """

class MediaReference:
    def __init__(self):
        """
        Defines an input object to be of MediaReference type. This input expects either a stringified JSON representation or a dictionary representation of a media reference object.

        This type is not supported as an API output.
        """

class Object:
    def __init__(self):
        """
        Defines an input object to be of Object type. This input expects either a primary key of the specified Object type, or an instance of the Object type itself.

        This type is not supported as an API output.
        """

class ObjectSet:
    def __init__(self):
        """
        Defines an input object to be of ObjectSet type. This input expects either an object set rid for the specified Object type, or an instance of an ObjectSet itself.

        This type is not supported as an API output.
        """

(*) Review the requirements described below on using these types when publishing the model as a function.

Specifying tabular columns

For Pandas or Spark inputs and outputs, columns can be specified either as a list of strings giving the column names, or as a list of two-element tuples in the format (<name>, <type>), where <name> is a string representing the column name and <type> is a Python type representing the type of the data in the column. If a string is provided for a column definition, its type defaults to Any.
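
The two accepted column formats can be illustrated with a small plain-Python sketch. The helper name normalize_columns is hypothetical and is not part of palantir_models; it only shows how a bare string is treated as a column whose type defaults to Any.

```python
from typing import Any, List, Tuple, Union

ColumnSpec = Union[str, Tuple[str, type]]


def normalize_columns(columns: List[ColumnSpec]) -> List[Tuple[str, Any]]:
    """Expand each column spec into a (name, type) pair.

    Hypothetical helper for illustration only; not part of palantir_models.
    """
    normalized = []
    for column in columns:
        if isinstance(column, str):
            # A bare string names the column; its type defaults to Any.
            normalized.append((column, Any))
        else:
            name, column_type = column
            normalized.append((name, column_type))
    return normalized


# Mixed specification: one untyped column and one typed column.
print(normalize_columns(["input_feature", ("prediction", float)]))
```

Mixing both formats in one list is permitted by the type signature shown earlier, which is why the sketch handles each element independently.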

Column types

The following types are supported for tabular columns:

  • str
  • int
  • float
  • bool
  • list (*)
  • dict (*)
  • datetime.date
  • datetime.time
  • datetime.datetime
  • typing.Any (*)
  • MediaReference [Beta]

Column types are generally not enforced for batch inference, unlike live inference, and are mostly meant as documentation for consumers of this model adapter. Refer to the model adapter reference for further details on specifications for API enforcement.

(*) Review the requirements described below on using these types when publishing the model as a function.

Parameter types

For Parameter inputs and outputs, the following types are supported:

  • str
  • int
  • float
  • bool
  • list (*)
  • dict (*)
  • typing.Any (*)
  • pm.NDArray(shape: list[int], dtype: numpy.typing.DTypeLike)

Parameter types are enforced, and any parameter input to model.transform() that does not correspond to the designated type will throw a runtime error.

(*) Review the requirements described below on using these types when publishing the model as a function.

NDArray Type

NumPy ndarray (N-dimensional array) types are supported in the Model API via the pm.NDArray(shape, dtype) parameter type definition. You must specify both a shape and dtype for your ndarray input.

Supported dtypes

The following NumPy dtypes are supported:

  • np.str_
  • np.bool_
  • np.int8
  • np.int16
  • np.int32
  • np.int64
  • np.float16
  • np.float32
  • np.float64

Shape specification

The shape is specified as a list[int] where each element represents a dimension. You can use -1 in any dimension to indicate that the shape check should be ignored for that specific dimension.
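
The wildcard behavior of -1 can be sketched as follows. This is an illustration under assumptions, not the palantir_models implementation: the function name shape_matches is hypothetical, and the sketch assumes that the number of dimensions must still match exactly.

```python
from typing import Sequence


def shape_matches(actual: Sequence[int], expected: Sequence[int]) -> bool:
    """Return True if `actual` satisfies `expected`, treating -1 as a wildcard.

    Illustrative only. Assumes the number of dimensions must be equal.
    """
    if len(actual) != len(expected):
        return False
    # A dimension passes if it is unchecked (-1) or equal to the expected size.
    return all(e == -1 or a == e for a, e in zip(actual, expected))


print(shape_matches((1, 1, 2), [1, 1, 2]))   # exact match
print(shape_matches((7, 1, 2), [-1, 1, 2]))  # first dimension unchecked
print(shape_matches((1, 2), [1, 1, 2]))      # different number of dimensions
```

A specification such as [-1, 1, 2] is a natural fit when the first dimension is a batch size that varies between calls.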

Example usage

import numpy as np
import palantir_models as pm


class NumpyModelAdapter(pm.ModelAdapter):
    ...

    @classmethod
    def api(cls):
        inputs = {
            "ndarray_in": pm.Parameter(
                type=pm.NDArray([1, 1, 2], np.int8),
                default=np.array([[[1, -5]]], dtype=np.int8)
            )
        }
        outputs = {
            "ndarray_out": pm.Parameter(type=pm.NDArray([1, 1, 2], np.int8))
        }
        return inputs, outputs

    def predict(self, ndarray_in):
        # Call your model on the ndarray
        return self.model.process(ndarray_in)

Using models with NDArray in transforms

In transforms, you can provide either np.ndarray objects or lists representing the ndarray as input:

# Using numpy array directly
ndarray = np.array([[[-95, -63]]], dtype=np.int8)

model_output = model.transform(ndarray_in=ndarray)
type(model_output["ndarray_out"])  # <class 'numpy.ndarray'>

# Using list representation (automatically converted)
model_output = model.transform(ndarray_in=ndarray.tolist())
type(model_output["ndarray_out"])  # <class 'numpy.ndarray'>

Using NDArray in Functions

When using models with NDArray inputs in Functions, you must always provide the input as a JSON-compatible nested list. Note that if the output is an np.ndarray, the output will also be returned as a nested list:

import { model } from "<...>/models";

@Function()
public async ndarray_model_example(input: Integer[][][]): Promise<Integer[][][]> {
    return await model({
        "ndarray_in": input
    });
}
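
Because Functions exchange NDArray values as nested lists, it can help to check the nesting of a payload against the declared shape before sending it. The following standard-library sketch computes the shape of a nested list; nested_shape is a hypothetical helper and not part of any Foundry library.

```python
def nested_shape(value) -> list:
    """Return the shape of a rectangular nested list.

    Raises ValueError if sibling sub-lists differ in shape (a ragged list).
    Hypothetical helper for illustration only.
    """
    if not isinstance(value, list):
        return []  # a scalar contributes no dimensions
    if not value:
        return [0]
    child_shapes = [nested_shape(item) for item in value]
    if any(shape != child_shapes[0] for shape in child_shapes):
        raise ValueError("ragged nested list")
    return [len(value)] + child_shapes[0]


# The default value from the adapter example above, as a nested list.
print(nested_shape([[[1, -5]]]))  # [1, 1, 2]
```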

Object and ObjectSet types

For Object or ObjectSet inputs, the object type is specified when defining the input in the model adapter API. This object type will be imported from an Ontology SDK generated for a chosen Ontology:

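import palantir_models as pm
from ontology_sdk.ontology.objects import ExampleRestaurant
from ontology_sdk.ontology.object_sets import ExampleRestaurantObjectSet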


class ExampleModelAdapter(pm.ModelAdapter):
    ...

    @classmethod
    def api(cls):
        inputs = {
            "input_object": pm.Object(ExampleRestaurant),
            "input_object_set": pm.ObjectSet(ExampleRestaurant)
        }
        outputs = {
            "output_dataframe": pm.Pandas(columns=[("input_feature", float), ("prediction", float)])
        }
        return inputs, outputs

    def predict(self, input_object: ExampleRestaurant, input_object_set: ExampleRestaurantObjectSet):
        outputs = ...
        return outputs
    ...

Example predict() implementation

This example uses Pandas dataframes as inputs and outputs, alongside a parameter.

import palantir_models as pm


class ExampleModelAdapter(pm.ModelAdapter):
    ...

    @classmethod
    def api(cls):
        inputs = {
            "input_dataframe": pm.Pandas(columns=[("input_feature", float)]),
            "input_parameter": pm.Parameter(float, default=1.0)
        }
        outputs = {
            "output_dataframe": pm.Pandas(columns=[("input_feature", float), ("prediction", float)])
        }
        return inputs, outputs

    def predict(self, input_dataframe, input_parameter):
        # Copy the input so that all incoming columns are kept in the output.
        outputs = input_dataframe.copy()
        outputs["prediction"] = self.model.predict(input_dataframe) * input_parameter
        return outputs
    ...

  • For tabular outputs, the list of columns specified in the API should only contain the required columns. At runtime, columns that are passed in a dataframe but were not specified in the API remain available in the predict method's input and can be returned as outputs. In the example above, passing a dataframe with columns input_feature and extra_column will return a dataframe with columns input_feature, extra_column and prediction.
  • Some models require that only columns used during training are passed to their prediction method. Therefore, we recommend only extracting the feature columns to pass to the model.
  • Some models require the ordering of columns to be preserved. When inputs are passed via REST API requests as JSON objects, the ordering of columns is not necessarily preserved. Therefore, we recommend reordering the columns before passing them to the model's inference method.
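
The last two recommendations can be combined in one step: select only the feature columns, in the order used at training time. The sketch below uses plain JSON records and the standard library; the column names and the helper to_feature_rows are hypothetical.

```python
import json

# Feature order used at training time (hypothetical column names).
FEATURE_COLUMNS = ["median_income", "housing_median_age", "total_rooms"]


def to_feature_rows(records: list) -> list:
    """Select only the feature columns, in training order, from each record."""
    return [[record[name] for name in FEATURE_COLUMNS] for record in records]


# A JSON payload whose keys arrive in a different order, with an extra column.
payload = json.loads(
    '[{"total_rooms": 880.0, "extra_column": "x", '
    '"median_income": 8.3, "housing_median_age": 41.0}]'
)
print(to_feature_rows(payload))  # [[8.3, 41.0, 880.0]]
```

With a Pandas dataframe, indexing with a list of column names, as in df[FEATURE_COLUMNS], both selects and reorders the columns in the same way.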

API definition requirements

Requirements for Objectives batch deployment

Direct setup of batch deployment and automatic model evaluation in the Modeling Objectives application is only compatible with models that have a single tabular dataset input. If your model adapter requires several inputs, you can set up batch inference in a Python transform.

Requirements for Objectives evaluation

Evaluation in Modeling Objectives expects a single tabular input and a single tabular output. Both should include label columns to enable the computation of metrics based on comparing predictions to the ground truth in the evaluation set. The following rules must be kept in mind to ensure that your model adapter is compatible with evaluation:

  1. Your API must include exactly one tabular input and one tabular output.
  2. Label or ground truth columns should not be included in the list of columns of the input or output tables. Declared column types are enforced during live inference, so declaring the label column would prevent the model from being used for live inference when that column is not present.
  3. The predict method should not drop the label or ground truth columns. If these columns were dropped, the evaluation logic, which runs on the inference output from the predict method, would not have the label available for comparison.

As an example, consider the following adapter for a regression task:

import palantir_models as pm
from palantir_models.serializers import DillSerializer

class LinearRegressionModelAdapter(pm.ModelAdapter):

    @pm.auto_serialize(
        model=DillSerializer()
    )
    def __init__(self, model):
        self.model = model

    @classmethod
    def api(cls):
        input_columns = [
            ('median_income', float),
            ('housing_median_age', float),
            ('total_rooms', float),

            # Do not include the target variable explicitly in the API
            # since the adapter logic will throw an error in the live
            # inference case if the label is not found.
            # That will necessarily happen when the model is applied
            # to anything other than training or evaluation sets.

            # ('house_price', float) # should not be included
        ]
        output_columns = input_columns + [('predicted_house_price', float)]
        return {'df_in': pm.Pandas(input_columns)}, \
               {'df_out': pm.Pandas(output_columns)}

    def predict(self, df_in):
        feature_cols = ['median_income', 'housing_median_age', 'total_rooms']
        cols_to_keep = list(feature_cols)
        # Keep the 'house_price' column when it is present. It represents
        # the 'label' or target variable; selecting only the feature
        # columns would implicitly drop it, making this adapter
        # unsuitable for use with Evaluation in Modeling Objectives.
        if 'house_price' in df_in.columns:
            cols_to_keep.append('house_price')
        df_out = df_in[cols_to_keep].copy()
        # Pass only the feature columns to the model, never the label.
        df_out['predicted_house_price'] = self.model.predict(
            df_in[feature_cols]
        )
        return df_out

Requirements for Direct Function Publishing and Model use in Functions

Note: Changing the API of a model that has a published function requires updating the consumers of that function, as described in the model functions guide.

For the model to support being published as a function, the following must be true:

  • The Model API can only contain Parameter, Tabular, Object, or ObjectSet inputs.
  • The API must not contain the typing.Any type.
  • For tabular inputs and outputs, the API must specify the required columns. Simply using pm.Pandas() or pm.Spark() is not allowed, as it would be implicitly interpreted as a TypeScript Array<any>.
  • Any collection type such as list or dict must specify element types (for example, using list[str] or dict[str, str]). Element types are otherwise interpreted as being of the any type.
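
The typing rules above can be approximated with a short check built on the standard typing module. This is a sketch for reviewing your own API definitions before publishing, not the validation that Foundry performs; the function name is_fully_typed is hypothetical, and the list[str] syntax assumes Python 3.9 or later.

```python
from typing import Any, get_args, get_origin


def is_fully_typed(tp) -> bool:
    """Return True if `tp` contains no Any and no bare list or dict.

    Hypothetical helper for illustration only.
    """
    if tp is Any:
        return False
    if tp in (list, dict):
        # Bare collections carry no element type.
        return False
    if get_origin(tp) in (list, dict):
        args = get_args(tp)
        # Every element type must itself be fully typed.
        return bool(args) and all(is_fully_typed(arg) for arg in args)
    return True


print(is_fully_typed(list[str]))       # element type given
print(is_fully_typed(dict[str, str]))  # key and value types given
print(is_fully_typed(list))            # bare collection
print(is_fully_typed(list[Any]))       # Any nested inside a collection
```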
