Serialization for models¶
To reuse models across workflows, Palantir needs to be able to deserialize stored weights. Since the serialization process can be specific to model types, the author of the model adapter is expected to detail how this should happen. Note that this is not a concern for container models, where serialization is typically a part of the container lifecycle, or external models, which have externally hosted weights.
Auto serialization¶
To simplify serialization and deserialization for model weights within the platform, Palantir provides a number of default serialization methods for saving and loading models. Each one is described in the Default serializers section below.
How to use a default serializer¶
A default serializer can be used by decorating the __init__ method on a model adapter with the @auto_serialize decorator. By default, Palantir will automatically save and load each of your __init__ method's inputs using the Dill serializer. By providing arguments to the @auto_serialize decorator, you can specify the serializer to be used for each of the arguments to your __init__ method. When providing serializer arguments this way, every argument in your __init__ method must have an equivalent argument in your @auto_serialize definition.
:::callout{theme="information" title="Python Dependencies"}
Note that to avoid possible conflicts, the palantir_models.serializers package does not contain dependencies on the serialization frameworks below, except for dill. You must still add a dependency on the related serialization package when using other serializers from the palantir_models.serializers package.
:::
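The requirement that every __init__ argument has a matching serializer argument can be checked before publishing. The following is a minimal sketch of such a check, not the library's own validation: missing_serializer_args and ExampleAdapter are hypothetical names used only for illustration, and the serializers are represented by placeholder strings.

```python
import inspect


def missing_serializer_args(init_fn, serializers):
    """Return __init__ parameter names (excluding self) that have no serializer.

    Hypothetical helper for illustration; it is not part of palantir_models.
    """
    params = [name for name in inspect.signature(init_fn).parameters if name != "self"]
    return [name for name in params if name not in serializers]


class ExampleAdapter:
    """Hypothetical stand-in for a model adapter with two __init__ inputs."""

    def __init__(self, model, config):
        self.model = model
        self.config = config


# Every argument is covered, so nothing is missing.
complete = missing_serializer_args(
    ExampleAdapter.__init__, {"model": "dill", "config": "json"}
)

# The config argument has no serializer, so it is reported.
incomplete = missing_serializer_args(ExampleAdapter.__init__, {"model": "dill"})
```

A non-empty result means the serializer mapping does not cover every __init__ argument.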
Example model definition using auto_serialization¶
The following is a valid Python transform that trains and publishes a model. Since the AutoSerializationAdapter uses the @auto_serialize decorator without arguments, the model argument to the __init__ method is serialized using Dill.
:::callout{theme="neutral"}
Note that when using versions of palantir_models below 0.1536.0, all serializers will need to be defined as arguments to the @auto_serialize decorator.
:::
from transforms.api import transform
from palantir_models.transforms import ModelOutput
from sklearn.linear_model import LinearRegression
import numpy as np

# AutoSerializationAdapter is the adapter class defined further below;
# import it here if it lives in a separate module.


@transform(
    model_output=ModelOutput("/Foundry/path/to/model_asset"),
)
def compute(model_output):
    # Build a toy dataset following y = 2x.
    x_values = [i for i in range(100)]
    y_values = [2 * i for i in x_values]
    X = np.array(x_values, dtype=np.float32).reshape(-1, 1)
    y = np.array(y_values, dtype=np.float32).reshape(-1, 1)

    model = LinearRegression()
    model.fit(X, y)

    # The adapter's __init__ takes only the model, so only the model is passed.
    model_output.publish(
        model_adapter=AutoSerializationAdapter(model)
    )


import palantir_models as pm


class AutoSerializationAdapter(pm.ModelAdapter):
    # No serializer arguments: each __init__ input is saved and loaded with Dill.
    @pm.auto_serialize
    def __init__(self, model):
        self.model = model

    @classmethod
    def api(cls):
        inputs = {"df_in": pm.Pandas()}
        outputs = {"df_out": pm.Pandas()}
        return inputs, outputs

    def predict(self, df_in):
        df_in['prediction'] = self.model.predict(df_in)
        return df_in
Example model definition using multiple auto_serialization¶
The following model adapter defines different serializer methods for each input of its __init__ method.
import palantir_models as pm
from palantir_models.serializers import DillSerializer, JsonSerializer


class AutoSerializationAdapter(pm.ModelAdapter):
    # One serializer per __init__ argument: Dill for the model, JSON for the config.
    @pm.auto_serialize(
        model=DillSerializer(),
        config=JsonSerializer()
    )
    def __init__(self, model, config={}):
        self.model = model
        # Fall back to 'prediction' when the config does not name a column.
        self.prediction_column = config.get('prediction_column', 'prediction')

    @classmethod
    def api(cls):
        inputs = {"df_in": pm.Pandas()}
        outputs = {"df_out": pm.Pandas()}
        return inputs, outputs

    def predict(self, df_in):
        df_in[self.prediction_column] = self.model.predict(df_in)
        return df_in
Default serializers¶
Dill¶
The palantir_models.serializers.DillSerializer will serialize a Python object with dill ↗ by calling dill.dump and dill.load to save and load your object to disk.
The DillSerializer class can be used to serialize many Python objects, including scikit-learn and statsmodels models.
Cloudpickle¶
The palantir_models.serializers.CloudPickleSerializer will serialize a Python object with Cloudpickle ↗ by calling cloudpickle.dump and cloudpickle.load to save and load your object to disk.
The CloudPickleSerializer class can be used to serialize many Python objects, including scikit-learn and statsmodels models.
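The dump and load round trip described for Dill and Cloudpickle can be sketched as follows. This is an illustration under stated assumptions rather than the serializers' implementation: the fitted_state dictionary is a hypothetical stand-in for a trained model, and the standard-library pickle module is swapped in when dill is not installed, since dill exposes the same dump(obj, file) and load(file) call shape.

```python
import os
import tempfile

try:
    import dill as pickler  # used when dill is installed
except ImportError:
    import pickle as pickler  # stand-in with the same dump/load call shape

# Hypothetical stand-in for a fitted model: plain Python data.
fitted_state = {"coef": [2.0], "intercept": 0.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.bin")

    # Save: write the object to disk as bytes.
    with open(path, "wb") as f:
        pickler.dump(fitted_state, f)

    # Load: rebuild an equal, independent object from the file.
    with open(path, "rb") as f:
        restored_state = pickler.load(f)
```

The restored object compares equal to the original but is a separate copy, which is what allows a model saved in one workflow to be rebuilt in another.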
JSON¶
The palantir_models.serializers.JsonSerializer will serialize a Python dictionary as JSON by calling json.dump and json.load on your Python dictionary.
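As a minimal sketch of what a JSON round trip of a configuration dictionary looks like, the example below uses the standard-library json module directly. It is not the JsonSerializer implementation, and the config contents and file name are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical adapter configuration made of JSON-compatible values.
config = {"prediction_column": "prediction", "threshold": 0.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "config.json")

    # Save the dictionary as JSON text.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f)

    # Load it back into a Python dictionary.
    with open(path, "r", encoding="utf-8") as f:
        restored_config = json.load(f)

# JSON object keys are always strings, so non-string keys change type.
int_keyed = json.loads(json.dumps({1: "a"}))
```

Because JSON only represents strings, numbers, booleans, null, lists, and string-keyed objects, a dictionary intended for JSON serialization should stick to those types.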
YAML¶
The palantir_models.serializers.YamlSerializer will serialize a Python dictionary as YAML by calling yaml.safe_dump and yaml.safe_load on your Python dictionary.
Hugging Face¶
The palantir_models.serializers library currently provides three default serializers for Hugging Face models ↗: HfPipelineSerializer, HfAutoTokenizerSerializer, and HfAutoModelSerializer. All three Hugging Face serializers require that the transformers library be added as a dependency to your Python environment.
HfPipelineSerializer¶
The palantir_models.serializers.HfPipelineSerializer will serialize a transformers.pipeline ↗ object with save_pretrained and reinstantiate your pipeline object at load.
The HfPipelineSerializer has one required string parameter representing the task ↗ of the pipeline. Any additional kwargs will be used for loading the pipeline.
import palantir_models as pm
from palantir_models.serializers import HfPipelineSerializer
from transformers import pipeline
import pandas as pd


class HFNerAdapter(pm.ModelAdapter):
    # "ner" is the pipeline task used to reinstantiate the pipeline at load time.
    @pm.auto_serialize(
        pipeline=HfPipelineSerializer("ner"),
    )
    def __init__(self, pipeline):
        self.pipeline = pipeline

    @classmethod
    def api(cls):
        inputs = {"df_in" : pm.Pandas([("text", str)])}
        outputs = {"df_out" : pm.Pandas([("text", str), ("generation", str)])}
        return inputs, outputs

    def predict(self, df_in: pd.DataFrame):
        # Run the pipeline on every row of the text column.
        result = self.pipeline(df_in["text"].tolist())
        df_in["generation"] = result
        return df_in
HfAutoModelSerializer and HfAutoTokenizerSerializer¶
The palantir_models.serializers.HfAutoModelSerializer and palantir_models.serializers.HfAutoTokenizerSerializer will serialize a transformers.AutoModel and a transformers.AutoTokenizer, respectively, with save_pretrained and from_pretrained.
Note that for some models and tokenizers, it is recommended to use the specific model or tokenizer classes rather than the generic ones. In these cases, the specific class can be passed to the serializer as the first argument, model_class.
import palantir_models as pm
from palantir_models.serializers import HfAutoModelSerializer, HfAutoTokenizerSerializer
from transformers import AutoModelForSeq2SeqLM
import pandas as pd
import torch  # PyTorch backend, needed for return_tensors="pt"


class HFTextGenerationAdapter(pm.ModelAdapter):
    # The specific model class is passed as the first argument to the serializer.
    @pm.auto_serialize(
        model=HfAutoModelSerializer(AutoModelForSeq2SeqLM),
        tokenizer=HfAutoTokenizerSerializer()
    )
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    @classmethod
    def api(cls):
        inputs = {"df_in" : pm.Pandas([("text", str)])}
        outputs = {"df_out" : pm.Pandas([("text", str), ("generation", str)])}
        return inputs, outputs

    def predict(self, df_in: pd.DataFrame):
        # Tokenize the whole text column as one padded batch.
        input_ids = self.tokenizer(
            df_in["text"].tolist(),
            return_tensors="pt",
            padding=True
        ).input_ids
        outputs = self.model.generate(input_ids)
        # Decode the generated token ids back into strings.
        result = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
        df_in["generation"] = result
        return df_in
PyTorch¶
The palantir_models.serializers.PytorchStateSerializer will serialize a torch.nn.Module ↗ with torch.save and torch.load to save and load your object to disk.
TensorFlow¶
The palantir_models.serializers.TensorflowKerasSerializer will serialize a tensorflow.keras.Model ↗ with obj.save and tensorflow.keras.models.load_model to save and load your model to disk. When your model is being deserialized, Foundry will call obj.compile().
XGBoost¶
The palantir_models.serializers.XGBoostSerializer will serialize an xgboost.sklearn.XGBModel ↗ with save_model and load_model to save and load your model to disk. The XGBModel class is the base class for many XGBoost models, including xgboost.XGBClassifier, xgboost.XGBRegressor, and xgboost.XGBRanker.
Spark MLlib¶
The palantir_models.serializers.SparkMLAutoSerializer will serialize a pyspark.ml.pipeline.Pipeline ↗ object using the pyspark.ml.PipelineModel.write().overwrite().save() and pyspark.ml.PipelineModel.read().load() methods. Given that any Spark MLlib model can be wrapped in a Pipeline object, this serializer is intended to serve as a general-purpose serializer for Spark.
Writing your own AutoSerializer¶
See the full implementation of each of the serializers and documentation on how to add a new default serializer.
If you believe there should be an additional default serializer that is not listed above, contact your Palantir representative.
Custom Serialization¶
In cases where the default serializers are insufficient, users can implement the load() and save() methods. Refer to the API reference for additional details and examples.
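The general shape of a custom save() and load() pair can be sketched as follows. This is a self-contained illustration under stated assumptions, not the platform's API: DirStateWriter and DirStateReader are hypothetical stand-ins for whatever reader and writer objects the platform supplies, CustomSerializationAdapter is a plain class rather than a pm.ModelAdapter subclass, and the file name weights.pkl is arbitrary. The actual method signatures are given in the API reference.

```python
import os
import pickle
import tempfile


class DirStateWriter:
    """Hypothetical stand-in: opens files for writing under a directory."""

    def __init__(self, root):
        self.root = root

    def open(self, name, mode="wb"):
        return open(os.path.join(self.root, name), mode)


class DirStateReader:
    """Hypothetical stand-in: opens files for reading under a directory."""

    def __init__(self, root):
        self.root = root

    def open(self, name, mode="rb"):
        return open(os.path.join(self.root, name), mode)


class CustomSerializationAdapter:
    """Plain class for illustration; a real adapter would subclass pm.ModelAdapter."""

    def __init__(self, weights):
        self.weights = weights

    def save(self, state_writer):
        # Persist only the state needed to rebuild the adapter.
        with state_writer.open("weights.pkl", "wb") as f:
            pickle.dump(self.weights, f)

    @classmethod
    def load(cls, state_reader):
        # Rebuild the adapter from the state written by save().
        with state_reader.open("weights.pkl", "rb") as f:
            return cls(pickle.load(f))


with tempfile.TemporaryDirectory() as tmp_dir:
    CustomSerializationAdapter({"coef": [2.0]}).save(DirStateWriter(tmp_dir))
    restored = CustomSerializationAdapter.load(DirStateReader(tmp_dir))
```

The design point is symmetry: load() must read exactly the files that save() writes and pass the result back through the constructor.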