Media set transforms API¶

The media set transforms API provides methods for transforming media sets in Python transforms. This API enables various operations on media sets, such as extracting text using optical character recognition (OCR), resizing images, converting documents to images, and more.

Methods by schema type¶

Schema type	Available methods
Image	`resize` • `crop` • `binarize` • `rotate` • `grayscale` • `equalize` • `rayleigh` • `convert_image_to_document` • `generate_image_embeddings` • `tile` • `ocr` • `encrypt` • `decrypt`
Document	`ocr` • `extract_raw_text` • `convert_document_to_images` • `slice_document` • `extract_form_fields` • `extract_table_of_contents` • `get_pdf_page_dimensions`
Video	`extract_audio` • `extract_scene_frames` • `chunk` • `extract_first_frame` • `extract_frames_at_timestamp` • `transcode` • `get_scene_frame_timestamps`
Audio	`transcribe` • `chunk` • `transcode` • `get_waveform`
DICOM	`render_dicom_layer`
Spreadsheet	`extract_content_from_spreadsheets`
Email	`extract_email_body` • `extract_email_attachments`
Multimodal	`filter_to`

Getting started¶

To use the media set transforms API, access the transform functionality from a media set input as shown in the example below.

from transforms.mediasets.inputs import MediaSetInputParam
from transforms.api import transform, Output, TransformOutput
from transforms.mediasets import MediaSetInput

@transform(
    media_input=MediaSetInput("/path/to/media_set"),
    dataset_output=Output("/path/to/output")
)
def compute(ctx, media_input: MediaSetInputParam, dataset_output: TransformOutput):
    # Create a MediaSetInputTransform instance.
    # Named media_transform so it does not shadow the imported `transform` decorator.
    media_transform = media_input.transform()

    # Apply transformations
    result = media_transform.ocr()

    # Write the result to output
    dataset_output.write_dataframe(result)

transform()¶

def transform(self, deduplicate_by_path=True):

Returns a MediaSetInputTransform instance. This class enables fluent method chaining for media transformations on a media set input.

Parameters:

deduplicate_by_path (bool, optional): If True, only the most recent item at each path will be included. Defaults to True.

Returns:

MediaSetInputTransform: A user-facing class that provides methods for media set transformations.

Example:

df = media_set.transform().ocr()
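
To make the effect of deduplicate_by_path concrete, the following is a minimal plain-Python sketch of "keep only the most recent item at each path". The record fields (`path`, `media_item_rid`, `uploaded_at`) and the helper name `keep_latest_per_path` are hypothetical and only illustrate the semantics; they are not part of the API, and the service may determine recency differently.

```python
from typing import Dict, List


def keep_latest_per_path(items: List[dict]) -> List[dict]:
    """Keep only the most recent record for each path (illustrative only)."""
    latest: Dict[str, dict] = {}
    for item in items:
        current = latest.get(item["path"])
        # Replace the stored record when this one is newer.
        if current is None or item["uploaded_at"] > current["uploaded_at"]:
            latest[item["path"]] = item
    return list(latest.values())


# Hypothetical records: two uploads share the path "a.png".
items = [
    {"path": "a.png", "media_item_rid": "rid1", "uploaded_at": 1},
    {"path": "a.png", "media_item_rid": "rid2", "uploaded_at": 2},
    {"path": "b.png", "media_item_rid": "rid3", "uploaded_at": 1},
]
deduplicated = keep_latest_per_path(items)
```

With deduplication, only the newer upload at `a.png` survives; without it, both uploads would be processed.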

Writing a `MediaSetInputTransform` to a media set output¶

For media set to media set transformations, write the transformed media to a media set as shown in the example below.

from transforms.mediasets.inputs import MediaSetInputParam
from transforms.mediasets.outputs import MediaSetOutputParam
from transforms.api import transform
from transforms.mediasets import MediaSetInput, MediaSetOutput


@transform(
    video_input=MediaSetInput(
        "/path/to/input_media_set"
    ),
    multimodal_output=MediaSetOutput(
        "/path/to/output_media_set"
    ),
)
def compute(
    ctx,
    video_input: MediaSetInputParam,
    multimodal_output: MediaSetOutputParam,
):
    def output_path_tar(input_path, page=None):
        return input_path.replace("mp4", "tar")

    multimodal_output.write(
        video_input.transform().extract_scene_frames(
            scene_sensitivity="LESS_SENSITIVE"
        ),
        output_path_tar,
    )

write()¶

def write(
    self,
    media_transform: MediaSetInputTransform,
    output_path: Callable[[str, Optional[int]], str] = prefix_path,
    suppress_errors: bool = False,
    return_dataframe: bool = False,
) -> Union[str, DataFrame, None]:

Write a MediaSetInputTransform to an output media set.

Parameters:

media_transform (MediaSetInputTransform): A media set transform that has been applied to a media set input. For example: media_transform = img_input.transform().resize()
output_path (Callable[[str, Optional[int]], str], optional): A function that determines the output path for each media item. It takes the original path and an optional page number as input and returns the new path. Defaults to a function that prepends "transformed" to the original path. For media transformations that output multiple items per input media item, an identifying value is automatically appended to each item's path. For example, crop transformations that apply multiple crops to an image will automatically include the crop parameters (x_offset, y_offset, width, height) in the path.
suppress_errors (bool): Specifies error handling behavior. If True, errors are caught and the error message will be returned in the dataframe's media_reference column if return_dataframe is True. If False, any errors will not be caught and the build will fail. Defaults to False. Only applicable to transformations on the entire media set.
return_dataframe (bool, optional): Only applicable to transformations on the entire media set. If True, returns a dataframe that contains media_item_rid, path, and media_reference columns. The dataframe represents the media items written to the output media set. Defaults to False. If True, the media set transformations will be lazily evaluated.

Returns:

None, a DataFrame or a string.
str: Applicable to transformations on an individual media item. Returns the resulting media reference string.
None: Applicable to transformations on an entire media set. Returns None if return_dataframe is set to False.
DataFrame: Applicable to transformations on an entire media set and when return_dataframe is set to True. Columns are media_item_rid, path, media_reference.

Example:

# Media transformation on the whole media set
media_transform = img_input.transform().resize()
img_out.write(media_transform)

# Media transformation on an individual media item
media_transform = img_input.transform().resize(media_item_rid="rid1")
med_ref_str = img_out.write(media_transform)
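
As a sketch of the output_path parameter described above, the function below matches the documented Callable[[str, Optional[int]], str] shape: it receives the original path and an optional page number and returns the new path. The function name `thumbnails_path`, the `thumbnails/` folder, and the `_page<N>` suffix are hypothetical choices made for illustration.

```python
import posixpath
from typing import Optional


def thumbnails_path(input_path: str, page: Optional[int] = None) -> str:
    """Place outputs under 'thumbnails/' and tag the page number when present."""
    stem, extension = posixpath.splitext(posixpath.basename(input_path))
    if page is not None:
        stem = f"{stem}_page{page}"
    return posixpath.join("thumbnails", stem + extension)


# The function would be passed as the second argument to write(), for example:
#   img_out.write(media_transform, thumbnails_path)
single_item = thumbnails_path("raw/cat.png")
paged_item = thumbnails_path("scans/report.pdf", 3)
```

Splitting on the file extension, rather than replacing a substring, avoids accidentally rewriting folder names that happen to contain the extension text.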

API reference¶

ocr()¶

Extracts text from PDFs or images using OCR and returns the extracted text as a string. Recommended for images and scanned documents.

Parameters:

languages (list[str]): List of languages to be used for OCR. Defaults to English. All valid codes can be found in the Tesseract documentation under languages.
scripts (Optional[list[str]]): List of scripts to be used for OCR. Defaults to None. All valid codes can be found in the Tesseract documentation under scripts.
media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
start_page (Optional[int]): The zero-indexed start page for OCR. Only applicable for PDF media sets. Defaults to 0 (the first page).
end_page (Optional[int]): The zero-indexed end page for OCR (exclusive). Only applicable for PDF media sets. Defaults to None (the final page).
return_structure (str): item_per_row or page_per_row. Only applicable to transformations on an entire PDF media set. Defaults to item_per_row.
suppress_errors (bool): Specifies error handling behavior. If True, errors are caught and the error message will be returned in the output. If False, any errors will not be caught and the build will fail. Defaults to True. Only applicable to transformations on the entire media set.

Returns:

A DataFrame, list of strings, or a single string.
str: Transformations on a single image.
list[str]: Transformations on a single PDF.
DataFrame: Transformations on the entire media set.
- For PDF (item_per_row): Columns are media_item_rid, path, media_reference, extracted_text (list[str]).
- For PDF (page_per_row) or image sets: Columns are media_item_rid, path, media_reference, page_number, extracted_text (str).

Example:

df = media_set.transform().ocr()
dataset_output.write_dataframe(df)
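
The difference between the item_per_row and page_per_row return structures can be pictured as an "explode" of the per-item list of page texts. The sketch below uses plain dictionaries rather than a DataFrame; the helper name `explode_pages` is hypothetical, and treating page_number as zero-indexed is an assumption based on the zero-indexed start_page and end_page parameters.

```python
from typing import Dict, List


def explode_pages(rows: List[Dict]) -> List[Dict]:
    """Turn one row per item (list of page texts) into one row per page."""
    exploded = []
    for row in rows:
        for page_number, page_text in enumerate(row["extracted_text"]):
            # Copy every column except the list of page texts.
            page_row = {k: v for k, v in row.items() if k != "extracted_text"}
            page_row["page_number"] = page_number  # assumed zero-indexed
            page_row["extracted_text"] = page_text
            exploded.append(page_row)
    return exploded


# Hypothetical item_per_row result for a two-page PDF.
item_rows = [
    {"media_item_rid": "rid1", "path": "a.pdf", "media_reference": "ref1",
     "extracted_text": ["first page", "second page"]},
]
page_rows = explode_pages(item_rows)
```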

extract_raw_text()¶

Extracts raw text from PDFs. Recommended for documents that have been electronically generated.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
start_page (Optional[int]): The zero-indexed start page for text extraction. Defaults to 0 (the first page).
end_page (Optional[int]): The zero-indexed end page for text extraction (exclusive). Defaults to None (the final page).
return_structure (str): item_per_row or page_per_row. Only applicable to transformations on an entire PDF media set. Defaults to item_per_row.
suppress_errors (bool): Specifies error handling behavior. If True, errors are caught and the error message will be returned in the output. If False, any errors will not be caught and the build will fail. Defaults to True. Only applicable to transformations on the entire media set.

Returns:

A DataFrame or list of strings.
list[str]: Applicable to transformations on a single PDF.
DataFrame: Applicable to transformations on the entire media set.
- PDF (item_per_row): Columns are media_item_rid, path, media_reference, extracted_text (list[str]).
- PDF (page_per_row): Columns are media_item_rid, path, media_reference, page_number, extracted_text (str).

Example:

df = media_set.transform().extract_raw_text()
dataset_output.write_dataframe(df)

resize()¶

Resizes images to the specified dimensions.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
width (Optional[int]): The target width for the resized images. Defaults to 1024. Must be provided if height is not provided.
height (Optional[int]): The target height for the resized images. Defaults to 1024. Must be provided if width is not provided.
maintain_aspect_ratio (bool): Specifies whether to maintain the original aspect ratio of the images. If True, images will be resized to fit within the specified dimensions while preserving the aspect ratio. Defaults to True.
output_format (str): The format of the output images, for example PNG or JPEG. Defaults to PNG.

Returns:

An instance of MediaSetInputTransform containing the resize transformation, allowing for further transformations.

Example:

transform = image_input.transform().resize()
image_output.write(transform)
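
To illustrate what "fit within the specified dimensions while preserving the aspect ratio" means for maintain_aspect_ratio, here is a minimal sketch of the usual fit-within calculation. The helper name `fit_within` is hypothetical, and the rounding and upscaling behavior shown here are assumptions; the service may round or clamp differently.

```python
from typing import Tuple


def fit_within(src_width: int, src_height: int,
               max_width: int = 1024, max_height: int = 1024) -> Tuple[int, int]:
    """Scale a source size so it fits inside the target box, keeping proportions."""
    # The tighter of the two constraints decides the scale factor.
    scale = min(max_width / src_width, max_height / src_height)
    return max(1, round(src_width * scale)), max(1, round(src_height * scale))


landscape = fit_within(2048, 1024)  # limited by width
portrait = fit_within(500, 1000)    # limited by height
```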

convert_document_to_images()¶

Converts document pages to images with the specified dimensions.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
start_page (Optional[int]): The zero-indexed start page for conversion. Defaults to 0 (the first page).
end_page (Optional[int]): The zero-indexed end page for conversion (exclusive). Defaults to None (the last page).
width (Optional[int]): The width of the output images. Defaults to 1024.
height (Optional[int]): The height of the output images. Defaults to 1024.
output_format (str): The format of the output images, for example PNG or JPEG. Defaults to PNG.

Returns:

An instance of MediaSetInputTransform containing the document to image transformation, allowing for further transformations.

Example:

transform = pdf_input.transform().convert_document_to_images()
image_output.write(transform)

slice_document()¶

Slices documents to a specified range of pages.

Parameters:

start_page (int): The zero-indexed start page for the slice operation.
end_page (int): The zero-indexed end page for the slice operation (exclusive).
media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
strictly_enforce_end_page (bool): Specifies behavior if the end_page exceeds the number of pages in the document. If True, an error will be raised. If False, the last page of the document is used instead. Defaults to True.

Returns:

An instance of MediaSetInputTransform containing the slice transformation, allowing for further transformations.

Example:

transform = pdf_input.transform().slice_document(0, 5)
pdf_output.write(transform)
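
The interaction between the zero-indexed, exclusive end_page and strictly_enforce_end_page can be sketched as follows. This is an illustration of the documented behavior only: the helper name `resolve_page_range` is hypothetical, and raising ValueError is an assumption about the error type.

```python
def resolve_page_range(start_page: int, end_page: int, page_count: int,
                       strictly_enforce_end_page: bool = True) -> range:
    """Return the zero-indexed pages selected by a slice (end_page is exclusive)."""
    if start_page < 0 or end_page <= start_page:
        raise ValueError("expected 0 <= start_page < end_page")
    if end_page > page_count:
        if strictly_enforce_end_page:
            raise ValueError(
                f"end_page {end_page} exceeds document length {page_count}"
            )
        # Lenient mode: stop at the last page of the document instead.
        end_page = page_count
    return range(start_page, end_page)


first_five = list(resolve_page_range(0, 5, page_count=10))
clamped = list(resolve_page_range(2, 8, page_count=5,
                                  strictly_enforce_end_page=False))
```

Under these assumptions, `slice_document(0, 5)` selects pages 0 through 4, which is the first five pages of the document.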

tile()¶

Generates Slippy map tiles (EPSG 3857) from images. Only supports geo-embedded images in TIFF or NITF format, with a maximum size of 100 million square pixels.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
zoom (Union[int, Column]): Zoom level of the tile. Must be a non-negative integer. At zoom level 0, the entire world fits into a single tile. Each increment doubles the spatial resolution and quadruples the number of tiles. Defaults to 0.
x (Union[int, Column]): Tile column index at the specified zoom level. Increases from west to east. Valid range: 0 <= x < 2**zoom. Defaults to 0.
y (Union[int, Column]): Tile row index at the specified zoom level. Increases from north to south. Valid range: 0 <= y < 2**zoom. Defaults to 0.
df (Optional[DataFrame]): Specifies the DataFrame to join on when passing column inputs. Tiling will only be applied to media items present in both the input media set and provided DataFrame. Must be provided together with the on parameter.
on (Optional[Literal["media_item_rid", "media_reference"]]): The column name to join on when df is specified. This aligns the tiling operation with the correct media item.
output_format (str): The format of the output images, for example PNG or JPEG. Defaults to PNG.

Returns:

An instance of MediaSetInputTransform containing the tile transformation, allowing for further transformations.

Example:

# Dynamically select tiling parameters from the input_df columns.
# Only tiles media items in both the media set and provided DataFrame.
transform1 = image_input.transform().tile(input_df.zoom, input_df.x, input_df.y, df=input_df, on="media_item_rid")

# All tiles will be generated with the same parameters.
# Generates a tile for all media items in the media set.
transform2 = image_input.transform().tile(zoom=2, x=1, y=1)

# Write the transformation to output media set
image_output.write(transform1)
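
For choosing zoom, x, and y values, the standard Slippy map formula converts a latitude and longitude into tile indices. The sketch below is independent of the media set API; the helper name `lat_lon_to_tile` is hypothetical, and the clamping to the valid range 0 <= index < 2**zoom is added for safety near the poles and the antimeridian.

```python
import math
from typing import Tuple


def lat_lon_to_tile(lat_deg: float, lon_deg: float, zoom: int) -> Tuple[int, int]:
    """Return the (x, y) Slippy map tile containing a WGS 84 coordinate."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    lat_rad = math.radians(lat_deg)
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Keep the indices inside the valid range for this zoom level.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


world_tile = lat_lon_to_tile(0.0, 0.0, zoom=0)
london_tile = lat_lon_to_tile(51.5, -0.12, zoom=2)
```

The resulting indices can then be passed as the x and y arguments, either as constants or as DataFrame columns.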

encrypt()¶

Encrypts specified regions of images using the provided cipher image key.

Parameters:

polygons (Union[list[api.Polygon], Column]): The regions to encrypt, specified as polygons.
cipher_license_rid (Union[str, Column]): The cipher license RID to use for encryption.
media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
df (Optional[DataFrame]): Specifies the DataFrame to join on when passing column inputs. Encryption will only be applied to media items present in both the input media set and provided DataFrame. Must be provided together with the on parameter.
on (Optional[Literal["media_item_rid", "media_reference"]]): The column name to join on when df is specified. This aligns the encryption operation with the correct media item.
output_format (str): The format of the output images, for example PNG or JPEG. Defaults to PNG.

Returns:

An instance of MediaSetInputTransform containing the encrypt transformation, allowing for further transformations.

Example:

polygon = [api.Coordinate(x=10, y=10), api.Coordinate(x=100, y=10),
           api.Coordinate(x=100, y=100), api.Coordinate(x=10, y=100)]
transform = image_input.transform().encrypt([polygon], "cipher_license_rid_123", media_item_rid="rid1")
image_output.write(transform)

decrypt()¶

Decrypts specified regions of images using the provided cipher image key.

Parameters:

polygons (Union[list[api.Polygon], Column]): The regions to decrypt, specified as a list of polygons.
cipher_license_rid (Union[str, Column]): The resource identifier for the cipher license to use for decryption.
media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
df (Optional[DataFrame]): Specifies the DataFrame to join on when passing column inputs. Decryption will only be applied to media items present in both the input media set and provided DataFrame. Must be provided together with the on parameter.
on (Optional[Literal["media_item_rid", "media_reference"]]): The column name to join on when df is specified. This aligns the decryption operation with the correct media item.
output_format (str): The format of the output images, for example PNG or JPEG. Defaults to PNG.

Returns:

An instance of MediaSetInputTransform containing the decrypt transformation, allowing for further transformations.

Example:

polygon = [api.Coordinate(x=10, y=10), api.Coordinate(x=100, y=10),
           api.Coordinate(x=100, y=100), api.Coordinate(x=10, y=100)]
transform = image_input.transform().decrypt([polygon], "cipher_license_rid_123", media_item_rid="rid1")
image_output.write(transform)

crop()¶

Crops images using specified dimensions and offsets.

Parameters:

width (Union[int, Column]): The width of the cropped image.
height (Union[int, Column]): The height of the cropped image.
x_offset (Union[int, Column]): The x-coordinate of the top-left corner of the crop area. Defaults to 0.
y_offset (Union[int, Column]): The y-coordinate of the top-left corner of the crop area. Defaults to 0.
media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
df (Optional[DataFrame]): Specifies the DataFrame to join on when passing column inputs. Cropping will only be applied to media items present in both the input media set and provided DataFrame. Must be provided together with the on parameter.
on (Optional[Literal["media_item_rid", "media_reference"]]): The column name to join on when df is specified. This aligns the cropping operation with the correct media item.
output_format (str): The format of the output images, for example PNG or JPEG. Defaults to PNG.

Returns:

An instance of MediaSetInputTransform containing the crop transformation, allowing for further transformations.

Examples:

# All media items will be cropped with the same parameters.
# Crops all media items in the media set.
transform = image_input.transform().crop(100, 100, 10, 10)
image_output.write(transform)

# Dynamically select cropping parameters from the input_df columns.
# Only crops media items in both the media set and provided DataFrame.
transform1 = image_input.transform().crop(input_df.x2 - input_df.x1, input_df.y2 - input_df.y1,
    input_df.x1, input_df.y1, df=input_df, on="media_item_rid")

# Width and height are dynamically selected from the input_df columns, while the offsets are static.
# Only crops media items in both the media set and provided DataFrame.
transform2 = image_input.transform().crop(30, 50,
    input_df.x1, input_df.y1, df=input_df, on="media_item_rid")

# All media items will be cropped with the same parameters.
# Only crops media items in both the media set and provided DataFrame.
transform3 = image_input.transform().crop(30, 50,
    20, 60, df=input_df, on="media_item_rid")

# All media items will be cropped with the same parameters.
# Crops all media items in the media set.
transform4 = image_input.transform().crop(30, 50, 20, 60)

# Write the transformation to output media set
image_output.write(transform1)
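
The relationship between bounding-box corners and the crop parameters used in the examples above (width = x2 - x1, height = y2 - y1, offsets at the top-left corner) can be sketched on a plain nested list of pixel values. Both helper names are hypothetical, and the nested list stands in for a real image.

```python
from typing import List, Tuple


def box_to_crop_args(x1: int, y1: int, x2: int, y2: int) -> Tuple[int, int, int, int]:
    """Convert corner coordinates to (width, height, x_offset, y_offset)."""
    return x2 - x1, y2 - y1, x1, y1


def crop_pixels(pixels: List[List[int]], width: int, height: int,
                x_offset: int = 0, y_offset: int = 0) -> List[List[int]]:
    """Crop a row-major pixel grid; offsets refer to the top-left corner."""
    return [row[x_offset:x_offset + width]
            for row in pixels[y_offset:y_offset + height]]


# A 4x4 grid where each value encodes its position as row * 10 + column.
image = [[row * 10 + col for col in range(4)] for row in range(4)]
cropped = crop_pixels(image, *box_to_crop_args(1, 1, 3, 3))
```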

binarize()¶

Converts images to binary using the specified threshold.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
threshold (Optional[int]): Values above or equal to the threshold will be assigned a value of 255 and values below will be assigned a value of 0. Defaults to computing the threshold based on the input image.
output_format (str): The format of the output images, for example PNG or JPEG. Defaults to PNG.

Returns:

An instance of MediaSetInputTransform containing the binarize transformation, allowing for further transformations.

Example:

transform = image_input.transform().binarize(threshold=150)
image_output.write(transform)
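
The thresholding rule described for the threshold parameter (values at or above the threshold become 255, values below become 0) can be sketched on a nested list of grayscale values. The helper name `binarize_pixels` is hypothetical, and the automatic threshold selection used when no threshold is given is not modeled here.

```python
from typing import List


def binarize_pixels(pixels: List[List[int]], threshold: int) -> List[List[int]]:
    """Map each grayscale value to 255 (>= threshold) or 0 (< threshold)."""
    return [[255 if value >= threshold else 0 for value in row]
            for row in pixels]


binary = binarize_pixels([[0, 149, 150], [151, 255, 100]], threshold=150)
```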

rotate()¶

Rotates images by the specified angle.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
angle (Literal["DEGREE_90", "DEGREE_180", "DEGREE_270"]): The angle to rotate the images. Defaults to DEGREE_90.
output_format (str): The format of the output images, for example PNG or JPEG. Defaults to PNG.

Returns:

An instance of MediaSetInputTransform containing the rotate transformation, allowing for further transformations.

Example:

transform = image_input.transform().rotate(angle="DEGREE_180")
image_output.write(transform)

grayscale()¶

Converts images to grayscale.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
output_format (str): The format of the output images, for example PNG or JPEG. Defaults to PNG.

Returns:

An instance of MediaSetInputTransform containing the grayscale transformation, allowing for further transformations.

Example:

transform = image_input.transform().grayscale()
image_output.write(transform)

equalize()¶

Improves the clarity of low-contrast images by performing histogram equalization on grayscale images.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
output_format (str): The format of the output images, for example PNG or JPEG. Defaults to PNG.

Returns:

An instance of MediaSetInputTransform containing the equalize transformation, allowing for further transformations.

Example:

transform = image_input.transform().equalize()
image_output.write(transform)
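
For readers unfamiliar with histogram equalization, the following is a minimal sketch of the textbook algorithm on 8-bit grayscale values: each value is remapped through the image's cumulative histogram so that the output spans the full 0 to 255 range. The helper name `equalize_pixels` is hypothetical, and this is not necessarily the exact variant used by the service.

```python
from typing import List


def equalize_pixels(pixels: List[List[int]]) -> List[List[int]]:
    """Textbook histogram equalization for 8-bit grayscale values."""
    flat = [value for row in pixels for value in row]
    histogram = [0] * 256
    for value in flat:
        histogram[value] += 1

    # Cumulative distribution of pixel values.
    cdf, running = [], 0
    for count in histogram:
        running += count
        cdf.append(running)

    total = len(flat)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Constant image: nothing to stretch.
        return [row[:] for row in pixels]

    lookup = [round(max(c - cdf_min, 0) * 255 / (total - cdf_min)) for c in cdf]
    return [[lookup[value] for value in row] for row in pixels]


# A low-contrast image whose values are squeezed into 100..103.
stretched = equalize_pixels([[100, 101], [102, 103]])
```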

rayleigh()¶

Adjusts the grayscale intensity values so the image's histogram (the distribution of pixel brightness) matches the Rayleigh distribution (roughly a right-skewed bell curve that is defined only for non-negative values). This can improve clarity in low-contrast images.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
sigma (float): The scaling parameter for the Rayleigh distribution. Must be a floating-point number between 0 and 1. Defaults to 0.5.
output_format (str): The format of the output images, for example PNG or JPEG. Defaults to PNG.

Returns:

An instance of MediaSetInputTransform containing the rayleigh transformation, allowing for further transformations.

Example:

transform = image_input.transform().rayleigh(sigma=0.7)
image_output.write(transform)
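
To show how sigma shapes the target distribution, the sketch below implements the Rayleigh density and its inverse cumulative distribution function. Histogram matching in general maps each pixel's rank u (between 0 and 1) through an inverse CDF like this one; whether the service uses exactly this mapping, and how it rescales the result to pixel values, is an assumption. Both helper names are hypothetical.

```python
import math


def rayleigh_pdf(x: float, sigma: float) -> float:
    """Rayleigh density: zero for x < 0, peaking at x == sigma."""
    if x < 0:
        return 0.0
    return (x / sigma ** 2) * math.exp(-(x ** 2) / (2 * sigma ** 2))


def rayleigh_inverse_cdf(u: float, sigma: float) -> float:
    """Value below which a fraction u of the distribution lies (0 <= u < 1)."""
    if not 0.0 <= u < 1.0:
        raise ValueError("u must satisfy 0 <= u < 1")
    return sigma * math.sqrt(-2.0 * math.log(1.0 - u))


# A larger sigma pushes the bulk of the distribution toward brighter values.
median_low = rayleigh_inverse_cdf(0.5, sigma=0.5)
median_high = rayleigh_inverse_cdf(0.5, sigma=0.7)
```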

convert_image_to_document()¶

Converts images to PDFs.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.

Returns:

An instance of MediaSetInputTransform containing the convert image to document transformation, allowing for further transformations.

Example:

transform = image_input.transform().convert_image_to_document()
pdf_output.write(transform)

transcode()¶

Transcodes audio or video to the specified format.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
encode_format (Optional[str]): Specifies the format in which the output media will be encoded. Defaults to MP4 for video and MP3 for audio.

Returns:

An instance of MediaSetInputTransform containing the transcode transformation, allowing for further transformations.

Example:

transform = video_input.transform().transcode(encode_format="mov")
video_output.write(transform)

extract_audio()¶

Extracts audio from video files.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
output_format (str): The format of the output audio (e.g., mp3, wav). Defaults to mp3.

Returns:

An instance of MediaSetInputTransform containing the audio extraction transformation, allowing for further transformations.

Example:

transform = video_input.transform().extract_audio(output_format="wav")
audio_output.write(transform)

extract_scene_frames()¶

Extracts all scene frames from videos as images. A scene frame is a video frame that marks the beginning of a new scene or a significant visual transition in the video content.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
scene_sensitivity (Literal["MORE_SENSITIVE", "STANDARD", "LESS_SENSITIVE"]): The sensitivity level for scene detection. Defaults to STANDARD.
output_format (str): Specifies the encoding format for extracted frames. Defaults to PNG.

Returns:

An instance of MediaSetInputTransform containing the scene frame extraction transformation. The output images for each video will be stored in a TAR archive file.

Example:

transform = video_input.transform().extract_scene_frames(scene_sensitivity="MORE_SENSITIVE")
multimodal_output.write(transform)

chunk()¶

Chunks audio or video files into smaller segments of the specified duration.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
chunk_duration_milliseconds (int): The duration of each chunk in milliseconds. Defaults to 10000 (10 seconds). Must be a positive integer.
output_format (Optional[str]): The format of the output chunks. Defaults to MP4 for video and TS for audio. Note that audio only supports TS as the output format.

Returns:

An instance of MediaSetInputTransform containing the chunking transformation, allowing for further transformations.

Example:

transform = video_input.transform().chunk(chunk_duration_milliseconds=5000)
video_output.write(transform)
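
To estimate how many chunks a given chunk_duration_milliseconds produces, the following sketch computes chunk boundaries for a media item of known duration. The helper name `chunk_boundaries` is hypothetical, and keeping a shorter final chunk (rather than dropping or padding it) is an assumption about how the service handles the remainder.

```python
from typing import List, Tuple


def chunk_boundaries(duration_ms: int,
                     chunk_duration_milliseconds: int = 10000) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) pairs covering the whole duration."""
    if chunk_duration_milliseconds <= 0:
        raise ValueError("chunk_duration_milliseconds must be a positive integer")
    return [
        (start, min(start + chunk_duration_milliseconds, duration_ms))
        for start in range(0, duration_ms, chunk_duration_milliseconds)
    ]


# A 25-second clip with the default 10-second chunk duration.
boundaries = chunk_boundaries(25000)
```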

extract_first_frame()¶

Extracts the first full scene frame from videos as an image with the specified dimensions, or the original dimensions if not provided.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
width (Optional[int]): The target width for the extracted frames. If None, width will be scaled based on the provided height. Defaults to None.
height (Optional[int]): The target height for the extracted frames. If None, height will be scaled based on the provided width. Defaults to None.
output_format (str): The format of the output images. Defaults to PNG.

Returns:

An instance of MediaSetInputTransform containing the first frame extraction transformation, allowing for further transformations.

Example:

transform = video_input.transform().extract_first_frame(width=800, height=600)
image_output.write(transform)

extract_frames_at_timestamp()¶

Extracts frames from videos at a specified timestamp, using the specified dimensions, or the original dimensions if not provided.

Parameters:

timestamps (Union[float, Column]): The timestamp in seconds at which to extract frames.
media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
df (Optional[DataFrame]): Specifies the DataFrame to join on when passing column inputs. Frame extraction will only be applied to media items present in both the input media set and provided DataFrame. Must be provided together with the on parameter.
on (Optional[Literal["media_item_rid", "media_reference"]]): The column name to join on when df is specified. This aligns the frame extraction operation with the correct media item.
width (Optional[int]): The target width for the extracted frames. If None, scales width based on the provided height. Defaults to None.
height (Optional[int]): The target height for the extracted frames. If None, scales height based on provided width. Defaults to None.
output_format (str): The format of the output images. Defaults to PNG.

Returns:

An instance of MediaSetInputTransform containing the frame extraction transformation, allowing for further transformations.

Example:

# Dynamically select timestamp parameters from the input_df columns.
# Only extracts frames from media items in both the media set and provided DataFrame.
transform1 = video_input.transform().extract_frames_at_timestamp(input_df.timestamp, df=input_df, on="media_item_rid")

# All frames will be extracted at the same timestamp for all items in the media set.
transform2 = video_input.transform().extract_frames_at_timestamp(30)

# Write the transformation to output media set
image_output.write(transform1)
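
The rule that frames are only extracted from media items present in both the media set and the provided DataFrame behaves like an inner join on the column named by on. The sketch below models that with plain Python collections; the helper name `plan_frame_extractions` and the sample RIDs are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def plan_frame_extractions(media_item_rids: Iterable[str],
                           timestamp_rows: List[Dict]) -> List[Tuple[str, float]]:
    """Pair each DataFrame row with its media item, dropping unmatched rows."""
    available = set(media_item_rids)
    return [
        (row["media_item_rid"], row["timestamp"])
        for row in timestamp_rows
        if row["media_item_rid"] in available
    ]


# rid3 is not in the media set, and rid2 has no row, so neither yields a frame.
planned = plan_frame_extractions(
    ["rid1", "rid2"],
    [
        {"media_item_rid": "rid1", "timestamp": 1.5},
        {"media_item_rid": "rid1", "timestamp": 30.0},
        {"media_item_rid": "rid3", "timestamp": 4.0},
    ],
)
```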

render_dicom_layer()¶

Renders a frame of a DICOM file as an image, using the specified dimensions, or the original dimensions if not provided.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
layer_number (Optional[int]): The layer number to render from the DICOM image. Defaults to the middle layer.
width (Optional[int]): The target width for the rendered image. If None, and height is provided, the aspect ratio will be preserved. Must be provided if height is not provided.
height (Optional[int]): The target height for the rendered image. If None, and width is provided, the aspect ratio will be preserved. Must be provided if width is not provided.
output_format (str): The format of the output images, for example PNG or JPEG. Defaults to PNG.

Returns:

An instance of MediaSetInputTransform containing the render DICOM layer transformation, allowing for further transformations.

Example:

transform = dicom_input.transform().render_dicom_layer(layer_number=2)
image_output.write(transform)

extract_form_fields()¶

Extracts form fields from documents.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
suppress_errors (bool): Specifies error handling behavior. If True, errors are caught and the error message will be returned in the output. If False, any errors will not be caught and the build will fail. Defaults to True. Only applicable to transformations on the entire media set.

Returns:

A DataFrame or a single string.
str: JSON object containing form fields. Applicable for transformations on a single item.
DataFrame: Columns are media_item_rid, path, media_reference, form_fields (str). Applicable for transformations on the entire media set.

Example:

df = media_set.transform().extract_form_fields()
dataset_output.write_dataframe(df)

extract_table_of_contents()¶

Extracts the table of contents from documents.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
suppress_errors (bool): Specifies error handling behavior. If True, errors are caught and the error message will be returned in the output. If False, any errors will not be caught and the build will fail. Defaults to True. Only applicable to transformations on the entire media set.

Returns:

A DataFrame or a single string.
str: JSON object containing table of contents. Applicable for transformations on a single item.
DataFrame: Columns are media_item_rid, path, media_reference, table_of_contents (str). Applicable for transformations on the entire media set.

Example:

df = media_set.transform().extract_table_of_contents()
dataset_output.write_dataframe(df)

get_pdf_page_dimensions()¶

Returns PDF page dimensions.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
suppress_errors (bool): Specifies error handling behavior. If True, errors are caught and the error message will be returned in the output. If False, any errors will not be caught and the build will fail. Defaults to True. Only applicable to transformations on the entire media set.

Returns:

A DataFrame or a single string.
str: JSON object containing a list of dictionaries with keys width and height. Applicable for transformations on a single item.
DataFrame: Columns are media_item_rid, path, media_reference, page_dimensions (str). Applicable for transformations on the entire media set.

Example:

df = media_set.transform().get_pdf_page_dimensions()
dataset_output.write_dataframe(df)
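
When get_pdf_page_dimensions() runs on a single item, the result is a JSON string rather than a DataFrame. The sketch below shows one way to parse such a string. It assumes the payload is a bare JSON array of objects with width and height keys; the real payload may nest the list differently. The helper name parse_page_dimensions and the sample string are hypothetical, and this reference does not state the units of width and height.

```python
import json


def parse_page_dimensions(payload: str) -> list[tuple[float, float]]:
    """Parse a page-dimensions JSON string into (width, height) tuples.

    Assumes the payload is a JSON array of objects with "width" and
    "height" keys; adjust the lookup if the real payload is nested.
    """
    pages = json.loads(payload)
    return [(float(page["width"]), float(page["height"])) for page in pages]


# Hypothetical sample payload, for illustration only.
sample = '[{"width": 612, "height": 792}, {"width": 792, "height": 612}]'
dimensions = parse_page_dimensions(sample)

# Collect the indices of landscape pages (wider than tall).
landscape_pages = [i for i, (w, h) in enumerate(dimensions) if w > h]
print(dimensions)
print(landscape_pages)
```

A list of landscape pages like this could help when choosing the width and height to pass to convert_document_to_images().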

generate_image_embeddings()¶

Generates embeddings for images.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
model_id (Optional[str]): The model to use to generate image embeddings. Defaults to GOOGLE_SIGLIP_2.
suppress_errors (bool): Specifies error handling behavior. If True, errors are caught and the error message will be returned in the output. If False, any errors will not be caught and the build will fail. Defaults to True. Only applicable to transformations on the entire media set.

Returns:

A DataFrame or a single string.
str: JSON object containing vector embeddings. Applicable for transformations on a single item.
DataFrame: Columns are media_item_rid, path, media_reference, embeddings (str). Applicable for transformations on the entire media set.

Example:

df = media_set.transform().generate_image_embeddings()
dataset_output.write_dataframe(df)
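
Embeddings are commonly compared with cosine similarity. The sketch below is a minimal, standard-library illustration of that comparison. It assumes each embeddings string parses to a flat JSON array of numbers, which this reference does not confirm. The three-element vectors are hypothetical and far shorter than real model output.

```python
import json
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings strings, assumed to be flat JSON arrays.
embedding_a = json.loads("[1.0, 0.0, 0.0]")
embedding_b = json.loads("[1.0, 1.0, 0.0]")
similarity = cosine_similarity(embedding_a, embedding_b)
print(round(similarity, 4))
```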

get_waveform()¶

Returns waveform amplitudes for audio files.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
peaks_per_second (Optional[int]): Number of peaks per second to return. Must be a positive integer no greater than 1000. Defaults to 100.
suppress_errors (bool): Specifies error handling behavior. If True, errors are caught and the error message will be returned in the output. If False, any errors will not be caught and the build will fail. Defaults to True. Only applicable to transformations on the entire media set.

Returns:

A DataFrame or a single string.
str: JSON object containing a list of doubles that represent audio amplitudes, normalized between 0 and 1. Applicable for transformations on a single item.
DataFrame: Columns are media_item_rid, path, media_reference, waveform (str). Applicable for transformations on the entire media set.

Example:

df = media_set.transform().get_waveform()
dataset_output.write_dataframe(df)
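
The sketch below checks the documented peaks_per_second range before a call and then locates the loudest peak in a waveform string. It rests on two assumptions that this reference does not confirm: the waveform parses to a flat JSON array of amplitudes, and peaks are evenly spaced so that peak i falls at i / peaks_per_second seconds. Both helper names and the sample payload are hypothetical.

```python
import json


def validate_peaks_per_second(peaks_per_second: int) -> int:
    """Enforce the documented range: a positive integer up to 1000."""
    if isinstance(peaks_per_second, bool) or not isinstance(peaks_per_second, int):
        raise TypeError("peaks_per_second must be an integer")
    if not 1 <= peaks_per_second <= 1000:
        raise ValueError("peaks_per_second must be between 1 and 1000")
    return peaks_per_second


def loudest_peak_seconds(payload: str, peaks_per_second: int) -> float:
    """Return the approximate time, in seconds, of the largest amplitude.

    Assumes evenly spaced peaks; ties resolve to the earliest peak.
    """
    rate = validate_peaks_per_second(peaks_per_second)
    amplitudes = json.loads(payload)
    if not amplitudes:
        raise ValueError("Waveform is empty")
    loudest_index = max(range(len(amplitudes)), key=amplitudes.__getitem__)
    return loudest_index / rate


# Hypothetical waveform sampled at 2 peaks per second.
sample = "[0.0, 0.2, 0.9, 0.4]"
loudest_at = loudest_peak_seconds(sample, peaks_per_second=2)
print(loudest_at)
```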

transcribe()¶

Transcribes audio.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
language (Optional[str]): The language to use for transcription. Defaults to None, in which case the language is auto-detected. Valid languages are listed under LANGUAGES in the Whisper GitHub repository.
performance_mode (Literal["more_economical", "more_performant"]): The performance mode to use for transcription. Defaults to more_economical.
output_format (Literal["text", "segments"]): The format of the output. Defaults to text.
add_timestamps (Optional[bool]): Controls whether timestamps are added to the transcription. Defaults to False. Only applicable when output_format is text.
suppress_errors (bool): Specifies error handling behavior. If True, errors are caught and the error message will be returned in the output. If False, any errors will not be caught and the build will fail. Defaults to True. Only applicable to transformations on the entire media set.

Returns:

A DataFrame or a single string.
str: Applicable for transformations on a single item.
- text: The transcribed text.
- segments: JSON object containing the transcribed segments, including timestamps, segment confidence, and other details.
DataFrame: Columns are media_item_rid, path, media_reference, transcription (str). Applicable for transformations on the entire media set.

Example:

df = media_set.transform().transcribe(output_format="segments")
dataset_output.write_dataframe(df)
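
The parameter rules above can be checked before a call. The sketch below is a hypothetical helper, not part of the API, that validates the documented literals and builds keyword arguments for transcribe(). Rejecting add_timestamps=True together with output_format="segments" is this sketch's own choice; this reference only says the flag applies to text output and does not say whether the API raises an error or ignores the flag.

```python
from typing import Optional

PERFORMANCE_MODES = {"more_economical", "more_performant"}
OUTPUT_FORMATS = {"text", "segments"}


def build_transcribe_kwargs(
    language: Optional[str] = None,
    performance_mode: str = "more_economical",
    output_format: str = "text",
    add_timestamps: bool = False,
) -> dict:
    """Validate transcribe() options and return them as keyword arguments."""
    if performance_mode not in PERFORMANCE_MODES:
        raise ValueError(f"Unknown performance_mode: {performance_mode!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"Unknown output_format: {output_format!r}")
    if add_timestamps and output_format != "text":
        # Sketch-level choice: fail early instead of passing an unused flag.
        raise ValueError("add_timestamps only applies when output_format is 'text'")
    kwargs = {
        "performance_mode": performance_mode,
        "output_format": output_format,
        "add_timestamps": add_timestamps,
    }
    if language is not None:
        # Omit language entirely so that auto-detection stays in effect.
        kwargs["language"] = language
    return kwargs


kwargs = build_transcribe_kwargs(output_format="segments")
print(kwargs)
```

The resulting dictionary could then be unpacked into the call, for example media_set.transform().transcribe(**kwargs).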

get_scene_frame_timestamps()¶

Returns timestamps for scene frames from video files.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
scene_sensitivity (Literal["MORE_SENSITIVE", "STANDARD", "LESS_SENSITIVE"]): The sensitivity level for scene detection. Defaults to STANDARD.
suppress_errors (bool): Specifies error handling behavior. If True, errors are caught and the error message will be returned in the output. If False, any errors will not be caught and the build will fail. Defaults to True. Only applicable to transformations on the entire media set.

Returns:

A DataFrame or a single string.
str: JSON object containing a list of scene frames with keys timestamp and sceneScore in the frames field. Applicable for transformations on a single item.
DataFrame: Columns are media_item_rid, path, media_reference, scene_frames (str). Applicable for transformations on the entire media set.

Example:

df = media_set.transform().get_scene_frame_timestamps()
dataset_output.write_dataframe(df)
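
The scene frame output can be filtered by score before extracting frames. The sketch below parses a single-item result, relying on the documented frames field and the timestamp and sceneScore keys. The sample payload, the helper name strong_scene_timestamps, and the 0.5 threshold are hypothetical; this reference does not state the range of sceneScore or the unit of timestamp.

```python
import json


def strong_scene_timestamps(payload: str, min_score: float) -> list[float]:
    """Return timestamps of scene frames scoring at least min_score."""
    frames = json.loads(payload)["frames"]
    return [
        frame["timestamp"]
        for frame in frames
        if frame["sceneScore"] >= min_score
    ]


# Hypothetical payload following the documented field names.
sample = json.dumps(
    {
        "frames": [
            {"timestamp": 0.0, "sceneScore": 0.95},
            {"timestamp": 12.5, "sceneScore": 0.40},
            {"timestamp": 31.0, "sceneScore": 0.80},
        ]
    }
)
timestamps = strong_scene_timestamps(sample, min_score=0.5)
print(timestamps)
```

The selected timestamps could then serve as input to extract_frames_at_timestamp(), provided the units match.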

extract_email_body()¶

Extracts the body of emails.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
output_format (Literal["TEXT", "HTML"]): Specifies the output format for the extracted content. Defaults to TEXT.
suppress_errors (bool): Specifies error handling behavior. If True, errors are caught and the error message will be returned in the output. If False, any errors will not be caught and the build will fail. Defaults to True. Only applicable to transformations on the entire media set.

Returns:

A DataFrame or a single string.
str: Email body in the specified format. Applicable for transformations on a single item.
DataFrame: Columns are media_item_rid, path, media_reference, email_body (str). Applicable for transformations on the entire media set.

Example:

df = media_set.transform().extract_email_body()
dataset_output.write_dataframe(df)

extract_email_attachments()¶

Extracts attachments from emails.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
schema_type (Optional[Literal["audio", "image", "video", "document", "spreadsheet", "dicom", "email"]]): Only extract attachments of the specified schema type. Defaults to None, which extracts all attachments.

Returns:

An instance of MediaSetInputTransform containing the email attachment extraction transformation, allowing for further transformations.

Example:

transform = email_input.transform().extract_email_attachments()
multimodal_output.write(transform)

filter_to()¶

Filters the media set to include only items of the specified schema type and, optionally, specific formats.

Parameters:

schema_type (Literal["audio", "image", "video", "document", "spreadsheet", "dicom", "email"]): The schema type on which to filter.
formats (Optional[list[str]]): Optional list of specific formats on which to filter. If not provided, all formats within the schema type are included. Available formats by schema type:
document: pdf, docx, pptx, txt
image: png, jpeg, jpg, tiff, bmp, webp, jp2, jp2k, nitf
audio: mp3, wav, flac, ogg, opus, m4a, webm_audio
video: mp4, ts, mov, mkv
spreadsheet: xlsx
email: eml
dicom: Does not support format narrowing

Returns:

An instance of MediaSetInputTransform containing the filter transformation, allowing for further transformations.

Example:

# Filter to all documents, then slice each one to its first five pages
sliced = media_set.transform().filter_to("document").slice_document(0, 5)

# Filter to only PDFs and perform OCR
ocr_df = media_set.transform().filter_to("document", formats=["pdf"]).ocr()

# Filter to PDF and DOCX files, then write them to a media set output
documents = media_set.transform().filter_to("document", formats=["pdf", "docx"])
pdf_output.write(documents)
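
The format table above can be encoded as a lookup for a client-side check before calling filter_to(). The sketch below is a hypothetical helper, not part of the API. It mirrors the formats listed in this section and treats format names as case-sensitive, which is an assumption. The API may perform its own validation.

```python
from typing import Optional

# Formats per schema type, copied from the list in this section.
SUPPORTED_FORMATS = {
    "document": {"pdf", "docx", "pptx", "txt"},
    "image": {"png", "jpeg", "jpg", "tiff", "bmp", "webp", "jp2", "jp2k", "nitf"},
    "audio": {"mp3", "wav", "flac", "ogg", "opus", "m4a", "webm_audio"},
    "video": {"mp4", "ts", "mov", "mkv"},
    "spreadsheet": {"xlsx"},
    "email": {"eml"},
    "dicom": set(),  # dicom does not support format narrowing
}


def check_filter_args(schema_type: str, formats: Optional[list[str]] = None) -> None:
    """Raise ValueError if the arguments do not match the documented table."""
    if schema_type not in SUPPORTED_FORMATS:
        raise ValueError(f"Unknown schema type: {schema_type!r}")
    if formats is None:
        return  # No narrowing: every format in the schema type is included.
    unsupported = sorted(set(formats) - SUPPORTED_FORMATS[schema_type])
    if unsupported:
        raise ValueError(f"Unsupported formats for {schema_type}: {unsupported}")


check_filter_args("document", formats=["pdf", "docx"])  # passes silently
check_filter_args("video")  # passes silently
```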

extract_content_from_spreadsheets()¶

Extracts content from spreadsheet files.

Parameters:

media_item_rid (Optional[str]): If specified, will run the transformation on the specified item instead of the entire media set. Defaults to None.
suppress_errors (bool): Specifies error handling behavior. If True, errors are caught and the error message will be returned in the output. If False, any errors will not be caught and the build will fail. Defaults to True. Only applicable to transformations on the entire media set.

Returns:

A DataFrame or a single string.
str: JSON object with one key per sheet name; each value has the fields table and merged_cells. Applicable for transformations on a single item.
DataFrame: Columns are media_item_rid, path, media_reference, extracted_content (str). Applicable for transformations on the entire media set.

Example:

df = media_set.transform().extract_content_from_spreadsheets()
dataset_output.write_dataframe(df)
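
The sketch below shows one way to inspect a single-item result, using the documented layout of one key per sheet with table and merged_cells fields. It assumes that table is a list of rows, which this reference does not specify. The sample payload and the helper name summarize_sheets are hypothetical.

```python
import json


def summarize_sheets(payload: str) -> dict[str, int]:
    """Map each sheet name to the number of rows in its table.

    Assumes each sheet's "table" field is a list of rows.
    """
    workbook = json.loads(payload)
    return {name: len(sheet["table"]) for name, sheet in workbook.items()}


# Hypothetical payload following the documented field names.
sample = json.dumps(
    {
        "Budget": {
            "table": [["Item", "Cost"], ["Rent", 1200]],
            "merged_cells": [],
        },
        "Notes": {"table": [], "merged_cells": []},
    }
)
row_counts = summarize_sheets(sample)
print(row_counts)
```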
