Chunk string¶
Supported in: Batch, Streaming
Splits a string into chunks of approximately the specified size, breaking on the specified separators.
Expression categories: String
Declared arguments¶
- Expression: The column containing the document text to chunk.
  Expression\<String>
- optional Chunk overlap: Allow consecutive chunks to overlap by approximately this many characters. Must be greater than or equal to 0 and less than the selected chunk size.
  Literal\<Integer>
- optional Chunk size: Create chunks of approximately this many characters. Must be greater than 0.
  Literal\<Integer>
- optional Keep separator: Include the separator in the output chunks.
  Literal\<Boolean>
- optional Separators: Chunk the string on these separators, tried in the order given. The default separators have the effect of keeping paragraphs, then sentences, and then words together for as long as possible.
  List\<Literal\<String>>
Output type: Array\<String>
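The constraints on Chunk size and Chunk overlap can be written down as a small check. This is a minimal Python sketch, not the product's implementation: the function name `validate_chunk_args` is hypothetical, and treating a missing overlap as 0 is an assumption.

```python
def validate_chunk_args(chunk_size: int, chunk_overlap: int = 0) -> None:
    """Raise ValueError if the arguments break the documented constraints.

    Hypothetical helper; defaulting chunk_overlap to 0 is an assumption.
    """
    if chunk_size <= 0:
        raise ValueError("Chunk size must be greater than 0")
    if chunk_overlap < 0 or chunk_overlap >= chunk_size:
        raise ValueError(
            "Chunk overlap must be >= 0 and less than the chunk size"
        )


# Accepted: overlap 5 with size 10, the values used in Example 3.
validate_chunk_args(chunk_size=10, chunk_overlap=5)
```

An overlap equal to the chunk size is rejected because each new chunk must contain at least some new text.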
Examples¶
Example 1: Base case¶
Argument values:
- Expression: string
- Chunk overlap: null
- Chunk size: 10
- Keep separator: null
- Separators: null
| string | Output |
|---|---|
| hello | [ hello ] |
| hello world. the quick brown fox jumps over the fence. | [ hello, world., the quick, brown fox, jumps, over the, fence. ] |
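The outputs in this example are consistent with a greedy packing of words in which each word after the first carries its leading space, and that space counts toward the chunk length before the chunk is trimmed. This would explain why "jumps" stands alone even though "jumps over" is exactly 10 characters. The Python sketch below illustrates that reading only. The function name `pack_words` is hypothetical, only the space separator is handled, and the actual implementation may differ.

```python
def pack_words(text: str, chunk_size: int) -> list[str]:
    # Split on spaces; every piece after the first keeps its leading space,
    # so that space is counted in the chunk length (an inferred rule).
    words = text.split(" ")
    pieces = [words[0]] + [" " + w for w in words[1:]]

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        # Close the current chunk when adding this piece would exceed the size.
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current.strip())
            current = ""
        current += piece
    if current.strip():
        chunks.append(current.strip())
    return chunks


print(pack_words("hello world. the quick brown fox jumps over the fence.", 10))
# ['hello', 'world.', 'the quick', 'brown fox', 'jumps', 'over the', 'fence.']
```

A single word longer than the chunk size is left whole in this sketch, since there is no smaller separator to fall back on.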
Example 2: Base case¶
Argument values:
- Expression: A quick-brown-fox-jumps over the lazy dog
- Chunk overlap: null
- Chunk size: 10
- Keep separator: false
- Separators: [
,
, ]
Output: [ A, quick-brown-fox-jumps, over the, lazy dog ]
Example 3: Base case¶
Argument values:
- Expression: A quick brown fox jumps over the lazy dog
- Chunk overlap: 5
- Chunk size: 10
- Keep separator: null
- Separators: null
Output: [ A quick, brown fox, fox jumps, over the, the lazy, lazy dog ]
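One way to read this output: when a chunk is closed, its trailing words are carried into the next chunk as long as they fit within the overlap and still leave room for the incoming word. The Python sketch below reproduces the output above under that reading. The function name `pack_with_overlap` is hypothetical, only the space separator is handled, and counting each word's leading space toward the length is an inference from the examples rather than documented behavior.

```python
def pack_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    words = text.split(" ")
    pieces = [words[0]] + [" " + w for w in words[1:]]

    chunks: list[str] = []
    window: list[str] = []  # pieces in the chunk being built
    total = 0               # combined length of the pieces in `window`
    for piece in pieces:
        if window and total + len(piece) > chunk_size:
            chunks.append("".join(window).strip())
            # Drop pieces from the front until the remainder fits the overlap
            # budget and leaves room for the incoming piece.
            while window and (total > chunk_overlap
                              or total + len(piece) > chunk_size):
                total -= len(window.pop(0))
        window.append(piece)
        total += len(piece)
    if window:
        chunks.append("".join(window).strip())
    return chunks


print(pack_with_overlap("A quick brown fox jumps over the lazy dog", 10, 5))
# ['A quick', 'brown fox', 'fox jumps', 'over the', 'the lazy', 'lazy dog']
```

Note that "A quick" and "brown fox" share no text: " quick" is 6 characters with its leading space, which is more than the overlap of 5, so nothing is carried over.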
Example 4: Base case¶
Argument values:
- Expression: Text1|Text2||Text3
- Chunk overlap: null
- Chunk size: 10
- Keep separator: false
- Separators: [||, |]
Output: [ Text1, Text2, Text3 ]
Example 5: Base case¶
Argument values:
- Expression: Text1|Text2||Text3
- Chunk overlap: null
- Chunk size: 10
- Keep separator: null
- Separators: [||, |]
Output: [ Text1, |Text2, ||Text3 ]
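Examples 4 and 5 differ only in Keep separator, and both outputs fit a scheme where the first listed separator found in the text is tried first, and any piece that is still too long is split again with the remaining separators. The Python sketch below reproduces both outputs under that assumption. The names `split_keep` and `chunk` are hypothetical, overlap and whitespace trimming are left out, and treating a null Keep separator as true is inferred from Example 5.

```python
def split_keep(text: str, sep: str, keep: bool) -> list[str]:
    parts = text.split(sep)
    if keep:
        # Re-attach the separator to the front of every part after the first.
        parts = parts[:1] + [sep + p for p in parts[1:]]
    return [p for p in parts if p]


def chunk(text: str, separators: list[str], chunk_size: int, keep: bool) -> list[str]:
    # Use the first listed separator that occurs in the text.
    for i, sep in enumerate(separators):
        if sep in text:
            break
    else:
        return [text]  # nothing left to split on

    joiner = "" if keep else sep  # kept separators already sit inside the parts
    chunks: list[str] = []
    current = ""
    for part in split_keep(text, sep, keep):
        if len(part) > chunk_size:
            # Too long on its own: flush, then retry with later separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(chunk(part, separators[i + 1:], chunk_size, keep))
        elif current and len(current) + len(joiner) + len(part) > chunk_size:
            chunks.append(current)
            current = part
        else:
            current = current + joiner + part if current else part
    if current:
        chunks.append(current)
    return chunks


print(chunk("Text1|Text2||Text3", ["||", "|"], 10, keep=False))
# ['Text1', 'Text2', 'Text3']
print(chunk("Text1|Text2||Text3", ["||", "|"], 10, keep=True))
# ['Text1', '|Text2', '||Text3']
```

In both runs the first split on `||` leaves `Text1|Text2` at 11 characters, which is over the chunk size of 10, so that piece alone is split again on `|`.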
Example 6: Base case¶
Argument values:
- Expression: Text1, Text2
Text3
Text4
- Chunk overlap: null
- Chunk size: 256
- Keep separator: null
- Separators: null
Output: [ Text1, Text2
Text3
Text4 ]
Example 7: Base case¶
Argument values:
- Expression: Text1 Text2
Text3
Text4
- Chunk overlap: null
- Chunk size: 10
- Keep separator: null
- Separators: null
Output: [ Text1, Text2, Text3, Text4 ]