In knowledge pipelines, the Knowledge Base node supports input in two multimodal data formats: multimodal-Parent-Child and multimodal-General. For the Knowledge Base node to recognize and embed a tool plugin’s multimodal output (such as text, images, audio, or video), complete two configurations:
  • In the tool code file: Call the tool session interface to upload files and construct the files object.
  • In the tool provider YAML file: Declare the output_schema as either multimodal-Parent-Child or multimodal-General.

Upload Files and Construct File Objects

When processing multimodal data such as images, first upload the file through Dify’s tool session to obtain the file metadata. The following example, taken from the official Dify Extractor plugin, shows how to upload a file and construct a files object.
# Upload the file using the tool session
file_res = self._tool.session.file.upload(
    file_name,   # filename
    file_blob,   # file binary data
    mime_type,   # MIME type, e.g., "image/png"
)

# Generate a Markdown image reference using the file preview URL
image_markdown = f"![image]({file_res.preview_url})"
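The upload call above expects a MIME type alongside the file name and binary data. When the source does not provide one, it can be inferred from the file extension. The following is a minimal sketch using Python's standard `mimetypes` module; the helper name `guess_mime_type` and the fallback value are assumptions for illustration and are not part of the Dify SDK.

```python
import mimetypes


def guess_mime_type(file_name: str, default: str = "application/octet-stream") -> str:
    """Infer a MIME type from the file extension, falling back to a generic default.

    Hypothetical helper: not part of the Dify plugin SDK.
    """
    mime_type, _encoding = mimetypes.guess_type(file_name)
    return mime_type or default


if __name__ == "__main__":
    print(guess_mime_type("figure-1.png"))
    print(guess_mime_type("no_extension"))
```

The returned string can then be passed as the third argument to the upload call.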
The upload interface returns an UploadFileResponse object containing the file information:
from enum import Enum
from pydantic import BaseModel

class UploadFileResponse(BaseModel):
    class Type(str, Enum):
        DOCUMENT = "document"
        IMAGE = "image"
        VIDEO = "video"
        AUDIO = "audio"

        @classmethod
        def from_mime_type(cls, mime_type: str):
            if mime_type.startswith("image/"):
                return cls.IMAGE
            if mime_type.startswith("video/"):
                return cls.VIDEO
            if mime_type.startswith("audio/"):
                return cls.AUDIO
            return cls.DOCUMENT

    id: str
    name: str
    size: int
    extension: str
    mime_type: str
    type: Type | None = None
    preview_url: str | None = None
Map the file information (name, size, extension, mime_type, and so on) to the files field in the multimodal output structure. The multimodal-Parent-Child schema below lists the fields that each file entry requires:
{
    "$id": "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json",
    "$schema": "http://json-schema.org/draft-07/schema#",
    "version": "1.0.0",
    "type": "object",
    "title": "Multimodal Parent-Child Structure",
    "description": "Schema for multimodal parent-child structure (v1)",
    "properties": {
        "parent_mode": {
            "type": "string",
            "description": "The mode of parent-child relationship"
        },
        "parent_child_chunks": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "parent_content": {
                        "type": "string",
                        "description": "The parent content"
                    },
                    "files": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {
                                    "type": "string",
                                    "description": "file name"
                                },
                                "size": {
                                    "type": "number",
                                    "description": "file size"
                                },
                                "extension": {
                                    "type": "string",
                                    "description": "file extension"
                                },
                                "type": {
                                    "type": "string",
                                    "description": "file type"
                                },
                                "mime_type": {
                                    "type": "string",
                                    "description": "file mime type"
                                },
                                "transfer_method": {
                                    "type": "string",
                                    "description": "file transfer method"
                                },
                                "url": {
                                    "type": "string",
                                    "description": "file url"
                                },
                                "related_id": {
                                    "type": "string",
                                    "description": "file related id"
                                }
                            },
                            "required": ["name", "size", "extension", "type", "mime_type", "transfer_method", "url", "related_id"]
                        },
                        "description": "List of files"
                    },
                    "child_contents": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        },
                        "description": "List of child contents"
                    }
                },
                "required": ["parent_content", "child_contents"]
            },
            "description": "List of parent-child chunk pairs"
        }
    },
    "required": ["parent_mode", "parent_child_chunks"]
}
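To make the mapping concrete, the sketch below builds one file entry and one parent-child chunk in the shape the schema describes. It is illustrative only. `UploadedFile` is a hypothetical stand-in that mirrors the fields of UploadFileResponse so the example runs without the plugin SDK. The `"tool_file"` transfer method, the `"paragraph"` parent mode, and the choice to map `preview_url` to `url` and `id` to `related_id` are assumptions, so check them against the official Dify Extractor plugin before relying on them.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class UploadedFile:
    """Hypothetical stand-in with the same fields as UploadFileResponse."""
    id: str
    name: str
    size: int
    extension: str
    mime_type: str
    type: Optional[str] = None
    preview_url: Optional[str] = None


def to_file_entry(uploaded: UploadedFile, transfer_method: str = "tool_file") -> dict[str, Any]:
    """Map upload metadata to one item of the schema's `files` array."""
    return {
        "name": uploaded.name,
        "size": uploaded.size,
        "extension": uploaded.extension,
        "type": uploaded.type or "document",     # assumption: default to "document"
        "mime_type": uploaded.mime_type,
        "transfer_method": transfer_method,      # assumption: value not confirmed by the source
        "url": uploaded.preview_url or "",       # assumption: preview URL used as the file URL
        "related_id": uploaded.id,               # assumption: upload ID used as the related ID
    }


def build_result(parent_mode: str, parent_content: str,
                 child_contents: list[str], files: list[UploadedFile]) -> dict[str, Any]:
    """Assemble a payload with a single parent-child chunk."""
    return {
        "parent_mode": parent_mode,
        "parent_child_chunks": [
            {
                "parent_content": parent_content,
                "child_contents": child_contents,
                "files": [to_file_entry(f) for f in files],
            }
        ],
    }


if __name__ == "__main__":
    uploaded = UploadedFile(
        id="file-123", name="figure-1.png", size=2048, extension=".png",
        mime_type="image/png", type="image",
        preview_url="https://example.com/preview/file-123",
    )
    result = build_result(
        "paragraph",  # placeholder parent mode
        "Section text ![image](https://example.com/preview/file-123)",
        ["Section text"],
        [uploaded],
    )
    print(result["parent_child_chunks"][0]["files"][0])
```

Keeping the mapping in one small function means the eight required file fields are set in a single place, which makes it easier to adjust if the expected values differ.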

Declare Multimodal Output Structure

Dify’s official JSON schemas define the structure of multimodal data. To let the Knowledge Base node recognize the plugin’s multimodal output type, point the result field under output_schema in the plugin’s provider YAML file to the corresponding official schema URL.
output_schema:
  type: object
  properties:
    result:
      # multimodal-Parent-Child
      $ref: "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json"
      
      # multimodal-General
      # $ref: "https://dify.ai/schemas/v1/multimodal_general_structure.json"
For example, a complete YAML configuration using multimodal-Parent-Child looks like this:
identity:
  name: multimodal_tool
  author: langgenius
  label:
    en_US: multimodal tool
    zh_Hans: 多模态提取器
    pt_BR: multimodal tool
description:
  human:
    en_US: Process documents into multimodal-Parent-Child chunk structures
    zh_Hans: 将文档处理为多模态父子分块结构
    pt_BR: Processar documentos em estruturas de divisão pai-filho
  llm: Processes documents into hierarchical multimodal-Parent-Child chunk structures

parameters:
  - name: input_text
    human_description:
      en_US: The text you want to chunk.
      zh_Hans: 输入文本
      pt_BR: Conteúdo de Entrada
    label:
      en_US: Input Content
      zh_Hans: 输入文本
      pt_BR: Conteúdo de Entrada
    llm_description: The text you want to chunk.
    required: true
    type: string
    form: llm

output_schema:
  type: object
  properties:
    result:
      $ref: "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json"

extra:
  python:
    source: tools/parent_child_chunk.py
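Because the result field is declared against the multimodal-Parent-Child schema, it can help to check the payload before the tool returns it. The sketch below is a lightweight pre-flight check that covers only the required lists from the schema shown earlier. The function name `find_schema_problems` is hypothetical, and the check does not replace full JSON Schema validation.

```python
from typing import Any

# Required keys copied from the multimodal-Parent-Child schema.
REQUIRED_TOP_LEVEL = ("parent_mode", "parent_child_chunks")
REQUIRED_CHUNK_KEYS = ("parent_content", "child_contents")
REQUIRED_FILE_KEYS = {
    "name", "size", "extension", "type",
    "mime_type", "transfer_method", "url", "related_id",
}


def find_schema_problems(result: dict[str, Any]) -> list[str]:
    """Return a human-readable message for each missing required field."""
    problems: list[str] = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in result:
            problems.append(f"missing top-level field: {key}")
    for i, chunk in enumerate(result.get("parent_child_chunks", [])):
        for key in REQUIRED_CHUNK_KEYS:
            if key not in chunk:
                problems.append(f"chunk {i}: missing {key}")
        # `files` is optional per chunk, but each entry must be complete.
        for j, file_entry in enumerate(chunk.get("files", [])):
            missing = sorted(REQUIRED_FILE_KEYS - set(file_entry))
            if missing:
                problems.append(f"chunk {i}, file {j}: missing {', '.join(missing)}")
    return problems


if __name__ == "__main__":
    incomplete = {
        "parent_child_chunks": [
            {"parent_content": "p", "files": [{"name": "a.png"}]}
        ]
    }
    for problem in find_schema_problems(incomplete):
        print(problem)
```

An empty list means that every required field is present; it does not verify field types or values.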