A custom model is an LLM that you deploy or configure yourself. This guide uses Xinference as an example to show how to integrate a custom model into your model plugin. By default, a custom model automatically includes two parameters, its model type and model name, so the provider YAML file needs no additional definitions for them. You also do not need to implement validate_provider_credentials in your provider code. At runtime, Dify calls the validate_credentials method of the corresponding model layer, based on the model type and model name the user selects.

Integrate a Custom Model Plugin

Integrating a custom model takes four steps:
  1. Create a model provider file: Identify the model types your custom model will include.
  2. Create code files by model type: Create separate code files for each model type (e.g., llm or text_embedding). Keeping each model type in its own logical layer simplifies maintenance and future expansion.
  3. Develop the model invocation logic: Within each model-type module, create a Python file named for that model type (for example, llm.py). Define a class in the file that implements the model logic, conforming to the system’s model interface specifications.
  4. Debug the plugin: Write unit and integration tests for the new provider functionality, ensuring that all components work as intended.
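Taken together, steps 1 to 3 produce one provider YAML file plus one module per model type. The sketch below creates that skeleton in a temporary directory so the layout is easy to see. The exact paths (provider/xinference.yaml and models/<type>/<type>.py) are an assumption pieced together from the directories this guide mentions; a real plugin contains additional files that are not shown here.

```python
import tempfile
from pathlib import Path

# Hypothetical layout pieced together from this guide: the provider YAML in
# provider/, and one directory per model type under models/, each holding a
# module named after that type (for example, models/llm/llm.py).
MODEL_TYPES = ["llm", "text_embedding", "rerank"]


def scaffold(root: Path) -> list[Path]:
    """Create empty placeholder files for the plugin skeleton and return their paths."""
    files = [root / "provider" / "xinference.yaml"]
    files += [root / "models" / t / f"{t}.py" for t in MODEL_TYPES]
    for path in files:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    return files


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold(Path(tmp)):
            print(path.relative_to(tmp).as_posix())
```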

1. Create the Model Provider File

In your plugin’s /provider directory, create a xinference.yaml file. The Xinference family of models supports LLM, Text Embedding, and Rerank model types, so your xinference.yaml must include all three. Example:
provider: xinference  # Identifies the provider
label:                # Display name; can set both en_US (English) and zh_Hans (Chinese). If zh_Hans is not set, en_US is used by default.
  en_US: Xorbits Inference
icon_small:           # Small icon; store in the _assets folder of this provider’s directory. The same multi-language logic applies as with label.
  en_US: icon_s_en.svg
icon_large:           # Large icon
  en_US: icon_l_en.svg
help:                 # Help information
  title:
    en_US: How to deploy Xinference
    zh_Hans: 如何部署 Xinference
  url:
    en_US: https://github.com/xorbitsai/inference

supported_model_types:  # Model types Xinference supports: LLM/Text Embedding/Rerank
- llm
- text-embedding
- rerank

configurate_methods:     # Xinference is deployed locally and has no predefined models; users choose which model to deploy (see the Xinference documentation), so use the customizable-model method.
- customizable-model

provider_credential_schema:
  credential_form_schemas:
Next, define the provider_credential_schema. Since Xinference supports text-generation, embeddings, and reranking models, you can configure it as follows:
provider_credential_schema:
  credential_form_schemas:
  - variable: model_type
    type: select
    label:
      en_US: Model type
      zh_Hans: 模型类型
    required: true
    options:
    - value: text-generation
      label:
        en_US: Language Model
        zh_Hans: 语言模型
    - value: embeddings
      label:
        en_US: Text Embedding
    - value: reranking
      label:
        en_US: Rerank
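The option values in this selector (text-generation, embeddings, reranking) are spelled differently from the supported_model_types entries above (llm, text-embedding, rerank). A minimal sketch of translating between the two, assuming a one-to-one correspondence in the order listed; FORM_VALUE_TO_MODEL_TYPE and resolve_model_type are hypothetical names, not part of the Dify SDK:

```python
# Hypothetical helper: maps the model_type values offered in the credential
# form to the identifiers used in supported_model_types.
FORM_VALUE_TO_MODEL_TYPE = {
    "text-generation": "llm",
    "embeddings": "text-embedding",
    "reranking": "rerank",
}


def resolve_model_type(form_value: str) -> str:
    """Return the supported_model_types identifier for a form value."""
    try:
        return FORM_VALUE_TO_MODEL_TYPE[form_value]
    except KeyError:
        raise ValueError(f"unsupported model_type: {form_value!r}") from None


print(resolve_model_type("embeddings"))  # → text-embedding
```

Raising ValueError for unknown values surfaces a misconfigured form early instead of silently routing the request to the wrong model layer.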
Every model in Xinference requires a model_name:
  - variable: model_name
    type: text-input
    label:
      en_US: Model name
      zh_Hans: 模型名称
    required: true
    placeholder:
      zh_Hans: 填写模型名称
      en_US: Input model name
Because Xinference is locally deployed, users must also supply the server address (server_url) and model UID:
  - variable: server_url
    label:
      zh_Hans: 服务器 URL
      en_US: Server URL
    type: text-input
    required: true
    placeholder:
      zh_Hans: 在此输入 Xinference 的服务器地址,如 https://example.com/xxx
      en_US: Enter the URL of your Xinference server, for example https://example.com/xxx

  - variable: model_uid
    label:
      zh_Hans: 模型 UID
      en_US: Model UID
    type: text-input
    required: true
    placeholder:
      zh_Hans: 在此输入你的 Model UID
      en_US: Enter the model UID
This completes the YAML configuration for your custom model provider. Next, create the code files for each model defined in the configuration.
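The values a user enters in this form reach your model code in a credentials dictionary, assumed here to be keyed by each variable name. As a rough illustration of what the four required fields imply, the sketch below checks such a dictionary against a simplified copy of the schema. The CREDENTIAL_FORM_SCHEMAS structure, the missing_or_invalid helper, and the model name qwen-chat are hypothetical; this is not how Dify itself validates forms.

```python
# Hypothetical, simplified mirror of the credential_form_schemas above:
# each entry records the variable name, whether it is required, and
# (for select fields) the allowed option values.
CREDENTIAL_FORM_SCHEMAS = [
    {"variable": "model_type", "required": True,
     "options": ["text-generation", "embeddings", "reranking"]},
    {"variable": "model_name", "required": True},
    {"variable": "server_url", "required": True},
    {"variable": "model_uid", "required": True},
]


def missing_or_invalid(credentials: dict) -> list[str]:
    """Return the names of fields that are missing, empty, or not an allowed option."""
    problems = []
    for field in CREDENTIAL_FORM_SCHEMAS:
        name = field["variable"]
        value = credentials.get(name)
        if field["required"] and not value:
            problems.append(name)
        elif value and "options" in field and value not in field["options"]:
            problems.append(name)
    return problems


creds = {
    "model_type": "text-generation",
    "model_name": "qwen-chat",
    "server_url": "https://example.com/xxx",
    "model_uid": "",
}
print(missing_or_invalid(creds))  # → ['model_uid']
```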

2. Develop the Model Code

Xinference can serve llm, rerank, speech2text, and tts models as well as text embeddings. Create a directory under /models for each model type you declare in supported_model_types (llm, text_embedding, and rerank in the configuration above), each containing the code for that type. The example below covers the llm type: create a file named llm.py and define a class, such as XinferenceAILargeLanguageModel, that extends __base.large_language_model.LargeLanguageModel. The class must implement the following methods.

LLM Invocation

The core method for invoking the LLM, supporting both streaming and synchronous responses:
def _invoke(
    self,
    model: str,
    credentials: dict,
    prompt_messages: list[PromptMessage],
    model_parameters: dict,
    tools: Optional[list[PromptMessageTool]] = None,
    stop: Optional[list[str]] = None,
    stream: bool = True,
    user: Optional[str] = None
) -> Union[LLMResult, Generator]:
    """
    Invoke the large language model.

    :param model: model name
    :param credentials: model credentials
    :param prompt_messages: prompt messages
    :param model_parameters: model parameters
    :param tools: tools for tool calling
    :param stop: stop words
    :param stream: whether to stream the response
    :param user: unique user id
    :return: full response or a chunk generator
    """
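Implement streaming and synchronous responses as separate functions. Python treats any function whose body contains yield as a generator function, so calling it always returns a generator, even on a branch that would otherwise return an LLMResult. Splitting the two paths keeps each return type correct: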
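def _invoke(self, stream: bool, **kwargs) -> Union[LLMResult, Generator]:
    # Dispatch to the streaming or synchronous handler.
    if stream:
        return self._handle_stream_response(**kwargs)
    return self._handle_sync_response(**kwargs)

def _handle_stream_response(self, **kwargs) -> Generator:
    # _request_model is a hypothetical helper that calls the model API.
    response = self._request_model(stream=True, **kwargs)
    for chunk in response:
        yield chunk

def _handle_sync_response(self, **kwargs) -> LLMResult:
    # Call the model once and wrap the complete response.
    response = self._request_model(stream=False, **kwargs)
    return LLMResult(**response)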
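The pitfall that motivates this split can be reproduced without the Dify SDK. In this self-contained sketch, FakeLLMResult is a hypothetical stand-in for LLMResult, and mixed, split, _stream, and _sync are illustrative names:

```python
import inspect
from collections.abc import Generator
from dataclasses import dataclass
from typing import Union


@dataclass
class FakeLLMResult:
    """Stand-in for LLMResult so the example runs without the Dify SDK."""
    text: str


def mixed(stream: bool) -> Union[FakeLLMResult, Generator]:
    # Because this body contains `yield`, calling it ALWAYS returns a
    # generator, even when stream is False; the returned value is only
    # attached to StopIteration and never reaches the caller directly.
    if stream:
        yield "chunk"
    else:
        return FakeLLMResult("full text")


def _stream() -> Generator:
    yield "chunk-1"
    yield "chunk-2"


def _sync() -> FakeLLMResult:
    return FakeLLMResult("full text")


def split(stream: bool) -> Union[FakeLLMResult, Generator]:
    # Dispatching to two separate functions keeps each return type honest.
    return _stream() if stream else _sync()


print(inspect.isgenerator(mixed(stream=False)))  # → True
print(split(stream=False))                        # → FakeLLMResult(text='full text')
print(list(split(stream=True)))                   # → ['chunk-1', 'chunk-2']
```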

Pre-calculate Input Tokens

If your model doesn’t provide a token-counting interface, return 0:
def get_num_tokens(
    self,
    model: str,
    credentials: dict,
    prompt_messages: list[PromptMessage],
    tools: Optional[list[PromptMessageTool]] = None
) -> int:
    """
    Get the number of tokens for the given prompt messages.
    """
    return 0
Alternatively, call self._get_num_tokens_by_gpt2(text: str), provided by the AIModel base class, which counts tokens with a GPT-2 tokenizer. The result is an approximation and may not match your model's own tokenizer.
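The two options can be combined: count with a tokenizer when one is available and fall back to 0 otherwise. A minimal sketch, assuming each message's content is plain text and ignoring tool definitions; Message and estimate_prompt_tokens are hypothetical, and the whitespace-splitting lambda in the usage lines is only a placeholder for a real counter with the same (text: str) -> int shape, such as self._get_num_tokens_by_gpt2.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Message:
    """Hypothetical stand-in for a prompt message with plain-text content."""
    role: str
    content: str


def estimate_prompt_tokens(
    messages: list[Message],
    tokenizer: Optional[Callable[[str], int]] = None,
) -> int:
    """Sum token counts over all message contents.

    Returns 0 when no tokenizer is available, matching the fallback above.
    The result is only an estimate if the tokenizer differs from the model's own.
    """
    if tokenizer is None:
        return 0
    return sum(tokenizer(m.content) for m in messages)


msgs = [Message("system", "You are helpful."), Message("user", "Hi there")]
print(estimate_prompt_tokens(msgs))                            # → 0
print(estimate_prompt_tokens(msgs, lambda s: len(s.split())))  # → 5
```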

Validate Model Credentials

Similar to the provider-level credential check, but scoped to a single model. The method returns nothing on success and should raise an exception if validation fails:
def validate_credentials(self, model: str, credentials: dict) -> None:
    """
    Validate model credentials.
    """

Dynamic Model Parameters Schema

Unlike predefined models, no YAML file defines which parameters a model supports, so you must generate the parameter schema dynamically. For example, Xinference supports max_tokens, temperature, and top_p. Other providers (e.g., OpenLLM) may support parameters like top_k only for certain models, so the schema must adapt to each model’s capabilities:
def get_customizable_model_schema(self, model: str, credentials: dict) -> AIModelEntity | None:
    """
        used to define customizable model schema
    """
    rules = [
        ParameterRule(
            name='temperature', type=ParameterType.FLOAT,
            use_template='temperature',
            label=I18nObject(
                zh_Hans='温度', en_US='Temperature'
            )
        ),
        ParameterRule(
            name='top_p', type=ParameterType.FLOAT,
            use_template='top_p',
            label=I18nObject(
                zh_Hans='Top P', en_US='Top P'
            )
        ),
        ParameterRule(
            name='max_tokens', type=ParameterType.INT,
            use_template='max_tokens',
            min=1,
            default=512,
            label=I18nObject(
                zh_Hans='最大生成长度', en_US='Max Tokens'
            )
        )
    ]

    # 'A' is a placeholder model name: add top_k only for models that support it
    if model == 'A':
        rules.append(
            ParameterRule(
                name='top_k', type=ParameterType.INT,
                use_template='top_k',
                min=1,
                default=50,
                label=I18nObject(
                    zh_Hans='Top K', en_US='Top K'
                )
            )
        )

    # ... additional ParameterRule entries omitted for brevity ...

    entity = AIModelEntity(
        model=model,
        label=I18nObject(
            en_US=model
        ),
        fetch_from=FetchFrom.CUSTOMIZABLE_MODEL,
        model_type=ModelType.LLM,  # this schema belongs to the llm module
        model_properties={
            ModelPropertyKey.MODE: LLMMode.CHAT.value,  # "chat" or "completion"
        },
        parameter_rules=rules
    )

    return entity
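When several models need extra parameters, the if model == 'A' branch can grow unwieldy. One alternative, sketched here with hypothetical Rule, BASE_RULES, EXTRA_RULES, and rules_for names standing in for ParameterRule and the method above, keeps the per-model extras in a lookup table:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """Hypothetical stand-in for ParameterRule (name, type, optional bounds)."""
    name: str
    type: str
    min: Optional[int] = None
    default: Optional[float] = None


# Rules every model gets, mirroring temperature/top_p/max_tokens above.
BASE_RULES = [
    Rule("temperature", "float"),
    Rule("top_p", "float"),
    Rule("max_tokens", "int", min=1, default=512),
]

# Hypothetical capability table: extra rules keyed by model name.
EXTRA_RULES = {
    "A": [Rule("top_k", "int", min=1, default=50)],
}


def rules_for(model: str) -> list[Rule]:
    """Return the shared rules plus any extras registered for this model."""
    return BASE_RULES + EXTRA_RULES.get(model, [])


print([r.name for r in rules_for("A")])  # → ['temperature', 'top_p', 'max_tokens', 'top_k']
print([r.name for r in rules_for("B")])  # → ['temperature', 'top_p', 'max_tokens']
```

Because rules_for concatenates into a new list, the shared base rules are never mutated between calls, and supporting another model means adding a table entry rather than another branch.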

Error Mapping

When an error occurs during model invocation, map it to one of the runtime’s InvokeError types so Dify can handle different errors consistently:
  • InvokeConnectionError
  • InvokeServerUnavailableError
  • InvokeRateLimitError
  • InvokeAuthorizationError
  • InvokeBadRequestError
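@property
def _invoke_error_mapping(self) -> dict[type[InvokeError], list[type[Exception]]]:
    """
    Map model invocation errors to unified error types.
    The key is the unified error type raised to the caller.
    The value lists the exception types raised by the model or its client
    library that should be converted into that unified Dify error.
    """
    # Example values only (assumes `import requests` at module level);
    # list the exceptions your client library actually raises.
    return {
        InvokeConnectionError: [requests.exceptions.ConnectionError, requests.exceptions.Timeout],
        InvokeServerUnavailableError: [],
        InvokeRateLimitError: [],
        InvokeAuthorizationError: [],
        InvokeBadRequestError: [KeyError, ValueError],
    }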
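Conceptually, a mapping like this is used by walking its entries and wrapping the first matching exception in the corresponding unified type. A minimal sketch of that lookup, with local stand-in error classes, built-in exceptions in place of client-library ones, and a hypothetical to_unified_error function; it is not the runtime's actual implementation:

```python
class InvokeError(Exception):
    """Hypothetical stand-in for the runtime's base unified error."""


class InvokeConnectionError(InvokeError):
    """Stand-in for the unified connection error."""


class InvokeBadRequestError(InvokeError):
    """Stand-in for the unified bad-request error."""


ERROR_MAPPING: dict[type[InvokeError], list[type[Exception]]] = {
    InvokeConnectionError: [ConnectionError, TimeoutError],
    InvokeBadRequestError: [KeyError, ValueError],
}


def to_unified_error(error: Exception) -> Exception:
    """Return the unified error for `error`, or `error` itself if it is unmapped."""
    for unified, sources in ERROR_MAPPING.items():
        if isinstance(error, tuple(sources)):
            return unified(str(error))
    return error


print(type(to_unified_error(TimeoutError("read timed out"))).__name__)  # → InvokeConnectionError
print(type(to_unified_error(RuntimeError("boom"))).__name__)           # → RuntimeError
```

Because isinstance also matches subclasses, listing a base exception class in the mapping covers its subclasses too.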
For more details on interface methods, see the Model Documentation. For the complete code files discussed in this guide, see the GitHub repository.

3. Debug the Plugin

After development, test the plugin to make sure it runs correctly. For details, see Debug Plugin.

4. Publish the Plugin

To list the plugin on the Dify Marketplace, see Publish to Dify Marketplace.
