Train a binary classification model with scikit-learn in Code Repositories¶

The following documentation provides an example of how to train a scikit-learn binary classification model in the Code Repositories application with the Model Training Template, using the open source UCI ML Breast Cancer Wisconsin (Diagnostic) ↗ dataset.

For a detailed walkthrough of the following steps, including how to author a model adapter and write Python transforms for model training, refer to our documentation on how to train a model in Code Repositories.

1. Author a model adapter¶

First, author a model adapter using the Model Training Template in Code Repositories.

The example logic below assumes the following:

- This model adapter is initialized with a scikit-learn model.
- The data provided to this model is tabular.
- The output of this model is tabular and contains every input column plus prediction, probability_0, and probability_1, where:
  - prediction is 0 or 1. With the labels that scikit-learn's load_breast_cancer provides, 0 is malignant (cancer detected) and 1 is benign (no cancer detected).
  - probability_0 is the predicted probability of class 0 (malignant).
  - probability_1 is the predicted probability of class 1 (benign).
- The following dependencies have been added to the repository: foundry-transforms-lib-python, pandas 1.5.3, scikit-learn 1.3.2, and dill 0.3.7.

import palantir_models as pm
from palantir_models.serializers import DillSerializer


class SklearnClassificationAdapter(pm.ModelAdapter):

    @pm.auto_serialize(
        model=DillSerializer()
    )
    def __init__(self, model):
        self.model = model

    @classmethod
    def api(cls):
        columns = [
            'mean_radius', 'mean_texture', 'mean_perimeter', 'mean_area',
            'mean_smoothness', 'mean_compactness', 'mean_concavity',
            'mean_concave_points', 'mean_symmetry', 'mean_fractal_dimension',
            'radius_error', 'texture_error', 'perimeter_error', 'area_error',
            'smoothness_error', 'compactness_error', 'concavity_error',
            'concave_points_error', 'symmetry_error', 'fractal_dimension_error',
            'worst_radius', 'worst_texture', 'worst_perimeter', 'worst_area',
            'worst_smoothness', 'worst_compactness', 'worst_concavity',
            'worst_concave_points', 'worst_symmetry', 'worst_fractal_dimension'
        ]
        inputs = {"df_in": pm.Pandas(columns=columns)}
        outputs = {"df_out": pm.Pandas(columns=columns + [
            ("prediction", int),
            ("probability_0", float),
            ("probability_1", float)
        ])}
        return inputs, outputs

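    def predict(self, df_in):
        # Write results to a copy so the caller's DataFrame is not mutated.
        df_out = df_in.copy()

        predictions = self.model.predict(df_in)
        probabilities = self.model.predict_proba(df_in)

        df_out['prediction'] = predictions
        # Column idx of predict_proba corresponds to self.model.classes_[idx].
        for idx, label in enumerate(self.model.classes_):
            df_out[f"probability_{label}"] = probabilities[:, idx]

        return df_out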
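
The loop at the end of `predict` names each probability column after the class label found at the same position in `classes_`. The sketch below reproduces that naming in plain Python, without scikit-learn or `palantir_models`. The helper name `probability_columns` and the sample numbers are hypothetical and exist only for illustration.

```python
def probability_columns(classes, probabilities):
    """Map each class label to its column of per-row probabilities.

    Mirrors the adapter loop: column idx of the probability matrix
    belongs to classes[idx] and is named probability_<label>.
    """
    return {
        f"probability_{label}": [row[idx] for row in probabilities]
        for idx, label in enumerate(classes)
    }


# Hypothetical values: two rows, two classes (labels 0 and 1).
classes = [0, 1]
probabilities = [[0.9, 0.1], [0.25, 0.75]]

columns = probability_columns(classes, probabilities)
print(columns)
```

Because the names are derived from `classes_`, a model trained on labels other than 0 and 1 would produce column names that differ from the `probability_0` and `probability_1` declared in `api()`.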

2. Write a Python transform to train a model¶

In the same repository in model_training/model_training.py, author the model training logic.

This example uses the open source UCI ML Breast Cancer Wisconsin (Diagnostic) dataset ↗ provided in the scikit-learn library.

from transforms.api import transform
from palantir_models.transforms import ModelOutput
from main.model_adapters.adapter import SklearnClassificationAdapter
from sklearn.datasets import load_breast_cancer
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier


@transform.using(
    model_output=ModelOutput("/path/to/model_asset"),
)
def compute(model_output):
    X_train, y_train = load_breast_cancer(as_frame=True, return_X_y=True)
    X_train.columns = X_train.columns.str.replace(' ', '_')
    columns = X_train.columns

    numeric_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler())
        ]
    )

    preprocessor = make_column_transformer(
        (numeric_transformer, columns),
        remainder="passthrough"
    )

    model = Pipeline(
        steps=[
            ("preprocessor", preprocessor),
            ("classifier", RandomForestClassifier(n_estimators=50, max_depth=3))
        ]
    )
    model.fit(X_train, y_train)

    foundry_model = SklearnClassificationAdapter(model)
    model_output.publish(model_adapter=foundry_model)
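
The transform above fits on the full dataset and publishes the model without measuring its quality. As a minimal sketch, assuming you want a quick local check before publishing, the snippet below holds out part of the data and reports accuracy for the same imputer, scaler, and classifier settings. The 80/20 split, the `random_state` values, and accuracy as the metric are assumptions made for illustration; the column transformer is omitted because every column is numeric. The snippet also prints the label names that ship with the dataset, so you can confirm what 0 and 1 mean.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X = data.data
X.columns = X.columns.str.replace(' ', '_')
y = data.target

# Label names as shipped with scikit-learn; index i names class i.
label_names = list(data.target_names)
print(label_names)

# Assumed split: 20% held out, stratified so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("classifier", RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)),
    ]
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.3f}")
```

This check runs outside Foundry and does not publish anything; the published model in the transform is still the one fitted on all rows.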

3. Consume the model¶

Run inference in a Python transform¶

You can run inference with your model in a Python transform. For example, once your model has been trained, copy the inference logic below into the model_training/run_inference.py file and select Build.

:::callout{theme="neutral"} To run a model within a transform repository in which the model was not defined, set use_sidecar = True in ModelInput. This automatically imports the model adapter and its dependencies and runs them in a separate environment to prevent dependency conflicts. Review the ModelInput class reference for more details. This feature is available for non-Spark transforms (using the @lightweight or @transform.using decorator) from palantir_models version 0.2010.0 onwards.

If use_sidecar is not set to True, the model adapter and its dependencies must be imported into or defined within the current code repository. :::

from transforms.api import transform, Output, LightweightOutput
from palantir_models.transforms import ModelInput
from palantir_models import ModelAdapter
from sklearn.datasets import load_breast_cancer


@transform.using(
    model=ModelInput(
        "ri.models.main.model.cfc11519-28be-4f3e-9176-9afe91ecf3e1",
        # use_sidecar=True is recommended for models defined outside the current transform repository
    ),
    inference_output=Output("ri.foundry.main.dataset.5dd9907f-79bc-4ae9-a106-1fa87ff021c3"),
)
def compute(model: ModelAdapter, inference_output: LightweightOutput):
    # Only the features are needed for inference; the labels are discarded.
    X, _ = load_breast_cancer(as_frame=True, return_X_y=True)
    X.columns = X.columns.str.replace(' ', '_')

    inference_results = model.transform(X)
    inference_output.write_pandas(inference_results.df_out)
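
Both transforms rename the dataset's columns because scikit-learn ships the feature names with spaces (for example `mean radius`), while the adapter's `api()` declares them with underscores. A small sketch of that check in plain Python follows; the helper `normalize_columns` and the shortened name lists are hypothetical.

```python
def normalize_columns(names):
    """Replace spaces with underscores, as the transforms do with str.replace."""
    return [name.replace(' ', '_') for name in names]


# Shortened, illustrative lists; the real dataset has 30 feature names.
raw_names = ['mean radius', 'mean concave points', 'worst fractal dimension']
api_columns = ['mean_radius', 'mean_concave_points', 'worst_fractal_dimension']

normalized = normalize_columns(raw_names)

# Names the adapter does not declare, and declared names that are absent.
unexpected = sorted(set(normalized) - set(api_columns))
missing = sorted(set(api_columns) - set(normalized))
print(normalized, unexpected, missing)
```

If either `unexpected` or `missing` is non-empty, the input would not line up with the columns declared in the adapter's `api()`.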

Perform live inference in a modeling objective¶

A Palantir model can be submitted to a modeling objective for the following:

After submitting this model to a modeling objective, you can create a release to host this model for live inference. Once the deployment is ready, you can perform live inference and connect this model to an operational application.

The example below shows input for the binary classification model using the single I/O endpoint:

[
  {
    "mean_radius": 15.09,
    "mean_texture": 23.71,
    "mean_perimeter": 92.65,
    "mean_area": 944.07,
    "mean_smoothness": 0.53,
    "mean_compactness": 0.21,
    "mean_concavity": 0.76,
    "mean_concave_points": 0.39,
    "mean_symmetry": 0.08,
    "mean_fractal_dimension": 0.14,
    "radius_error": 0.49,
    "texture_error": 0.82,
    "perimeter_error": 2.51,
    "area_error": 17.22,
    "smoothness_error": 0.07,
    "compactness_error": 0.01,
    "concavity_error": 0.05,
    "concave_points_error": 0.05,
    "symmetry_error": 0.01,
    "fractal_dimension_error": 0.08,
    "worst_radius": 12.95,
    "worst_texture": 20.66,
    "worst_perimeter": 185.41,
    "worst_area": 624.87,
    "worst_smoothness": 0.18,
    "worst_compactness": 0.26,
    "worst_concavity": 0.01,
    "worst_concave_points": 0.05,
    "worst_symmetry": 0.29,
    "worst_fractal_dimension": 0.05
  }
]
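
Before sending a request, it can help to confirm that every row carries the columns the adapter expects and that each value is numeric. The sketch below does this with the standard library only. `validate_rows` is a hypothetical helper, the column list is shortened to three names for brevity, and the code does not call the live endpoint.

```python
import json
from numbers import Real


def validate_rows(payload, expected_columns):
    """Return a list of human-readable problems found in a JSON payload."""
    problems = []
    rows = json.loads(payload)
    for i, row in enumerate(rows):
        missing = [c for c in expected_columns if c not in row]
        if missing:
            problems.append(f"row {i}: missing {missing}")
        for key, value in row.items():
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, Real):
                problems.append(f"row {i}: {key} is not numeric")
    return problems


# Shortened, illustrative column list; the adapter declares 30 columns.
expected_columns = ["mean_radius", "mean_texture", "mean_perimeter"]
good = json.dumps([{"mean_radius": 15.09, "mean_texture": 23.71, "mean_perimeter": 92.65}])
bad = json.dumps([{"mean_radius": "15.09", "mean_texture": 23.71}])

print(validate_rows(good, expected_columns))
print(validate_rows(bad, expected_columns))
```

An empty list means the payload passed both checks; each string in a non-empty list names the row and the problem found.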
