Environment

Install Conda

The minimum required CUDA version is 11.3.

# Download the installer script

wget -c 'https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh'

# Run the installer

sh ./Anaconda3-2024.06-1-Linux-x86_64.sh

# The installer then prompts as follows

Please, press ENTER to continue    (press Enter)

Do you accept the license terms? [yes|no]    (type yes)

>>> yes

Anaconda3 will now be installed into this location:

/root/anaconda3

  - Press ENTER to confirm the location

  - Press CTRL-C to abort the installation

  - Or specify a different location below

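[/root/anaconda3] >>> (type an install path here, or press Enter to accept the default)

You can undo this by running `conda init --reverse $SHELL`? [yes|no]

[no] >>> (press Enter)

Configure environment variables

export PATH=$PATH:/root/anaconda3/bin # /root/anaconda3 is the install path chosen above

Commands

conda init # initialize conda; if the commands below fail, close the terminal and open a new one

conda create -n ENV_NAME python=3.10 -y # with python=3.12, the later lmdeploy install fails

conda init # if the next command fails, close the terminal and open a new one

conda activate ENV_NAME # enter the conda environment

conda deactivate # leave the conda environment

# Example

conda create -n lmdeploy python=3.10 -y # with python=3.12, the later lmdeploy install fails

conda activate lmdeploy # enter the conda environment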

Install lmdeploy

pip install lmdeploy

Install vLLM

Requires CUDA version >= 12.1.

pip install vllm
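The two CUDA minimums quoted in this section (11.3 and, for vLLM, 12.1) can be checked before installing. A minimal sketch, assuming the installed CUDA version is already available as a string (for example, copied from the `nvidia-smi` header); `parse_version` and `meets_minimum` are hypothetical helpers, not part of any library:

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '12.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed: str, required: str) -> bool:
    """Return True when `installed` is at least `required`."""
    return parse_version(installed) >= parse_version(required)


# Minimums taken from this document.
BASE_MIN_CUDA = "11.3"
VLLM_MIN_CUDA = "12.1"

print(meets_minimum("11.8", BASE_MIN_CUDA))  # True
print(meets_minimum("11.8", VLLM_MIN_CUDA))  # False
```

Comparing integer tuples avoids the string-comparison trap where "11.10" sorts before "11.3".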

Install Ollama

Install with the install script

curl -fsSL https://ollama.com/install.sh | sh

Manual install

curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz

sudo tar -C /usr -xzf ollama-linux-amd64.tgz

ollama serve

ollama -v

GPU status

nvidia-smi

Pull a model

Pull from ModelScope

# Pull the model

pip install modelscope

modelscope download --model qwen/Qwen-VL-Chat --local_dir /mnt/workspace/qwen-vl

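LMDeploy

Deployment

Chat with an LLM from the command line

This helps verify that LMDeploy supports the given model, that the chat template is applied correctly, and that the inference output is correct.

lmdeploy chat MODEL_PATH --backend ENGINE # ENGINE is turbomind or pytorch

# Example

lmdeploy chat ./Qwen2-1.5B-Instruct --backend pytorch

Deploy as an OpenAI-compatible API

LLM models

lmdeploy serve api_server --backend ENGINE --model-name MODEL_NAME --server-name SERVER_NAME --server-port SERVER_PORT --max-batch-size MAX_BATCH_SIZE --log-level LOG_LEVEL --adapters mylora=LORA_CHECKPOINT_PATH MODEL_PATH

# Example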

lmdeploy serve api_server --backend pytorch --model-name Qwen2-0.5B --server-name 127.0.0.1 --server-port 23333 --max-batch-size 128 --log-level INFO --adapters mylora=/mnt/workspace/output/Qwen2-0.5B/checkpoint-124 ./Qwen2-0.5B-Instruct/
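When the same server is started from a script, it can help to assemble the command from named parameters instead of editing one long line. A minimal sketch using only the flags shown above; `build_api_server_cmd` is a hypothetical helper, and placing the positional model path first is an assumption made here so that a multi-value option such as --adapters cannot absorb it:

```python
import shlex


def build_api_server_cmd(model_path, backend="pytorch", model_name=None,
                         server_name="127.0.0.1", server_port=23333,
                         max_batch_size=128, log_level="INFO", adapters=None):
    """Assemble an `lmdeploy serve api_server` argument list (hypothetical helper)."""
    # Assumption: the positional model path goes first so that a
    # multi-value option such as --adapters cannot swallow it.
    cmd = ["lmdeploy", "serve", "api_server", model_path,
           "--backend", backend,
           "--server-name", server_name,
           "--server-port", str(server_port),
           "--max-batch-size", str(max_batch_size),
           "--log-level", log_level]
    if model_name:
        cmd += ["--model-name", model_name]
    if adapters:
        # Each adapter is written as name=path, matching the example above.
        cmd += ["--adapters"] + [f"{name}={path}" for name, path in adapters.items()]
    return cmd


cmd = build_api_server_cmd(
    "./Qwen2-0.5B-Instruct/",
    model_name="Qwen2-0.5B",
    adapters={"mylora": "/mnt/workspace/output/Qwen2-0.5B/checkpoint-124"},
)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd)`; `shlex.join` is only used to print a copy-pasteable line.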

Test case

Test with cURL

curl http://127.0.0.1:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2-0.5B",
    "messages": [{"role": "user", "content": "Hello, have you eaten yet?"}]
  }'
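The same request can be built from Python with only the standard library. A minimal sketch, assuming the server from the example above is listening on 127.0.0.1:23333 under the model name Qwen2-0.5B; `build_chat_request` is a hypothetical helper, and the actual send is left commented out so the snippet runs without a server:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("http://127.0.0.1:23333", "Qwen2-0.5B", "Hello")

# Sending requires the server to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```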

VLM models (currently not working)

lmdeploy serve api_server --backend pytorch --model-name qwen-vl --server-name 0.0.0.0 --server-port 23333 --max-batch-size 8 --cache-max-entry-count 0.1 ./qwen-vl

Test case

Code test

from modelscope import (

    snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig

)

import torch

# model_id = 'qwen/Qwen-VL-Chat'

# revision = 'v1.1.0'

# model_dir = snapshot_download(model_id, revision=revision)

model_dir = '/mnt/workspace/Qwen-VL-Chat'

# torch.manual_seed(12)

# Note: the tokenizer's default behavior has changed; protection against special-token injection attacks is now off by default.

print(model_dir, "1")

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

# Enable bf16 precision; recommended on GPUs such as A100, H100, RTX 3060, and RTX 3070 to save GPU memory

print("2")

model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()

# Enable fp16 precision; recommended on GPUs such as V100, P100, and T4 to save GPU memory

# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval()

# Run inference on the CPU; needs about 32 GB of RAM

# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()

# Default: auto mode, which picks the precision based on the device

# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()

# Generation length, top_p, and other hyperparameters can be set here

print("3")

model.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=True)

# 1st dialogue turn

print("4")

query = tokenizer.from_list_format([

    {'image': '/mnt/workspace/ocr6.jpg'},

    {'text': 'Extract the text in the image'},

])

print("5")

response, history = model.chat(tokenizer, query=query, history=None)

print(response)

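# The image shows a young woman playing with her dog on a beach; the dog is a Labrador. They are sitting on the sand, and the dog has raised its front paws to interact with her.

# 2nd dialogue turn

# response, history = model.chat(tokenizer, 'Output the bounding box of the high five', history=history)

# print(response)

# <ref>"high five"</ref><box>(211,412),(577,891)</box>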

# image = tokenizer.draw_bbox_on_latest_picture(response, history)

# image.save('output_chat.jpg')

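CLI documentation

The lmdeploy command-line interface (CLI) provides a unified API for managing large language models (LLMs), including converting, compressing, and deploying them.

General usage

lmdeploy [-h] [-v] {lite,serve,convert,list,check_env,chat} ...

Options

-h, --help: show the help message and exit.
-v, --version: show the program version and exit.

Commands

lmdeploy provides the following commands:

lite: compress and accelerate LLMs with the lmdeploy.lite module.
serve: serve LLMs through Gradio, an OpenAI API, or a Triton server.
convert: convert LLMs to TurboMind format.
list: list the supported model names.
check_env: check environment information.
chat: chat with the PyTorch or TurboMind engine.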

Serving models

To serve LLMs, use the following command:

lmdeploy serve [-h] {gradio,api_server,api_client,triton_client} ...

Serve options

-h, --help: show the help message and exit.

Available subcommands

gradio: serve LLMs with a web UI using Gradio.
api_server: serve LLMs through a RESTful API using FastAPI.
api_client: interact with the RESTful API server from the terminal.
triton_client: interact with a Triton server over gRPC.

Using the API server

To serve LLMs through a RESTful API, use:

lmdeploy serve api_server [-h] [--server-name SERVER_NAME] [--server-port SERVER_PORT] ... model_path

Positional arguments

model_path: the path of the model, which can be:
- The local directory of a TurboMind model converted with the lmdeploy convert command.
- The model ID of a quantized model on Hugging Face (for example, "internlm/internlm-chat-20b-4bit").
- The model ID of a model hosted on Hugging Face (for example, "internlm/internlm-chat-7b").

API server options

--server-name SERVER_NAME: host IP to serve on (default: 0.0.0.0).
--server-port SERVER_PORT: server port (default: 23333).
--allow-origins ALLOW_ORIGINS [...]: list of allowed origins for CORS (default: ['*']).
--allow-credentials: whether to allow credentials for CORS (default: False).
--allow-methods ALLOW_METHODS [...]: list of allowed HTTP methods for CORS (default: ['*']).
--allow-headers ALLOW_HEADERS [...]: list of allowed HTTP headers for CORS (default: ['*']).
--qos-config-path QOS_CONFIG_PATH: QoS policy config path (default: .).
--backend {pytorch,turbomind}: set the inference backend (default: turbomind).
--log-level {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET}: set the log level (default: ERROR).
--api-keys [API_KEYS ...]: optional space-separated list of API keys (default: None).
--ssl: enable SSL. Requires the environment variables 'SSL_KEYFILE' and 'SSL_CERTFILE' (default: False).
--meta-instruction META_INSTRUCTION: deprecated. Use --chat-template instead.
--chat-template CHAT_TEMPLATE: a JSON file or string specifying the chat template configuration.
--cap {completion,infilling,chat,python}: deprecated. Use --chat-template instead.
--revision REVISION: the specific model version to use (branch name, tag name, or commit ID).
--download-dir DOWNLOAD_DIR: directory to download and load the weights (default: the Hugging Face cache directory).

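PyTorch engine arguments

--adapters [ADAPTERS ...]: paths of LoRA adapters; several adapters can be given as key-value pairs in xxx=yyy format. Default: None.
--tp: number of GPUs used for tensor parallelism; must be a power of 2. Default: 1.
--model-name MODEL_NAME: name of the model to deploy, such as llama-7b, llama-13b, or vicuna-7b. Run lmdeploy list to get the supported model names. Default: None.
--session-len SESSION_LEN: maximum session length of a sequence. Default: None.
--max-batch-size MAX_BATCH_SIZE: maximum batch size. Default: 128.
--cache-max-entry-count CACHE_MAX_ENTRY_COUNT: percentage of GPU memory occupied by the k/v cache, excluding weights. Default: 0.8.
--cache-block-seq-len CACHE_BLOCK_SEQ_LEN: length of the token sequence in a k/v block. For the TurboMind engine, it should be a multiple of 32 when the GPU compute capability is >= 8.0, otherwise a multiple of 64. For the PyTorch engine, this argument is ignored if LoRA adapters are specified. Default: 64.
--enable-prefix-caching: enable caching and matching of prefixes. Default: False.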

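TurboMind engine arguments

--tp: number of GPUs used for tensor parallelism; must be a power of 2. Default: 1.
--model-name MODEL_NAME: name of the model to deploy, such as llama-7b, llama-13b, or vicuna-7b. Run lmdeploy list to get the supported model names. Default: None.
--session-len SESSION_LEN: maximum session length of a sequence. Default: None.
--max-batch-size MAX_BATCH_SIZE: maximum batch size. Default: 128.
--cache-max-entry-count CACHE_MAX_ENTRY_COUNT: percentage of GPU memory occupied by the k/v cache, excluding weights. Default: 0.8.
--cache-block-seq-len CACHE_BLOCK_SEQ_LEN: length of the token sequence in a k/v block. For the TurboMind engine, it should be a multiple of 32 when the GPU compute capability is >= 8.0, otherwise a multiple of 64. For the PyTorch engine, this argument is ignored if LoRA adapters are specified. Default: 64.
--enable-prefix-caching: enable caching and matching of prefixes. Default: False.
--model-format {hf,llama,awq}: format of the input model. hf means hf_llama, llama means meta_llama, and awq means a model quantized with awq. Default: None.
--quant-policy {0,4,8}: whether to quantize k/v. 0: no quantization; 4: 4-bit k/v; 8: 8-bit k/v. Default: 0.
--rope-scaling-factor ROPE_SCALING_FACTOR: RoPE scaling factor. Default: 0.0.
--num-tokens-per-iter NUM_TOKENS_PER_ITER: number of tokens processed in one forward pass. Default: 0.
--max-prefill-iters MAX_PREFILL_ITERS: maximum number of forward passes in the prefill stage. Default: 1.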
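Two of the constraints listed above are easy to get wrong on the command line: --tp must be a power of 2, and for TurboMind --cache-block-seq-len must be a multiple of 32 or 64 depending on GPU compute capability. A minimal sketch of a pre-flight check, assuming the compute capability is supplied by the caller as a number such as 8.0 or 7.5; `check_turbomind_args` is a hypothetical helper, not an LMDeploy function:

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ... (the values the --tp description allows)."""
    return n > 0 and (n & (n - 1)) == 0


def check_turbomind_args(tp: int, cache_block_seq_len: int,
                         compute_capability: float) -> list[str]:
    """Return a list of problems; an empty list means both rules are satisfied."""
    problems = []
    if not is_power_of_two(tp):
        problems.append(f"--tp {tp} is not a power of 2")
    # Rule from the option description: multiple of 32 on compute
    # capability >= 8.0, otherwise multiple of 64.
    required_multiple = 32 if compute_capability >= 8.0 else 64
    if cache_block_seq_len % required_multiple != 0:
        problems.append(
            f"--cache-block-seq-len {cache_block_seq_len} is not a multiple of {required_multiple}"
        )
    return problems


print(check_turbomind_args(tp=2, cache_block_seq_len=64, compute_capability=8.0))  # []
print(check_turbomind_args(tp=3, cache_block_seq_len=32, compute_capability=7.5))
```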

Vision model arguments

--vision-max-batch-size VISION_MAX_BATCH_SIZE: batch size of the vision model. Default: 1.

vLLM

Deploy as an OpenAI-compatible API

LLM models

python -m vllm.entrypoints.openai.api_server --model qwen2-0.5B/ --host 0.0.0.0 --port 23334 --dtype auto --served-model-name qwen2-0.5B

Test with cURL

curl http://127.0.0.1:23334/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-0.5B",
    "messages": [{"role": "user", "content": "Hello, have you eaten yet?"}]
  }'

Python test code

import requests

import json

url = "http://127.0.0.1:23334/v1/chat/completions"

headers = {"Content-Type": "application/json"}

data = {

    "model": "qwen2-0.5B",

    "messages": [

        {"role": "system", "content": "You are an intelligent assistant"},

        {"role": "user", "content": "What can you do for me?"}

    ],

    "temperature": 0.7,

    "top_p": 0.8,

    "repetition_penalty": 1.05,

    "max_tokens": 512

}

response = requests.post(url, headers=headers, data=json.dumps(data))

print(response.json())
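Printing the whole JSON body is fine for a smoke test, but usually only the assistant text is wanted. A minimal sketch, assuming the response follows the OpenAI chat-completions shape (`choices[0].message.content`) and that errors come back as a body with `"object": "error"`, as in the error response recorded later in this document; `extract_reply` is a hypothetical helper:

```python
def extract_reply(body: dict) -> str:
    """Return the assistant text from a chat-completions body, or raise on an error body."""
    if body.get("object") == "error":
        raise RuntimeError(f"server returned an error: {body.get('message')}")
    return body["choices"][0]["message"]["content"]


# Hand-written sample bodies, not real server output.
ok_body = {"choices": [{"message": {"role": "assistant", "content": "Hi there"}}]}
error_body = {"object": "error", "message": "Unknown model type", "code": 400}

print(extract_reply(ok_body))
```

With the test script above this would be called as `extract_reply(response.json())`.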

VLM models (currently not working)

# Deploy the model

python -m vllm.entrypoints.openai.api_server --model qwen-vl/ --host 0.0.0.0 --port 23334 --dtype auto --max-model-len 2048 --served-model-name qwen-vl --gpu-memory-utilization 0.5 --trust-remote-code

Test with curl

curl http://localhost:23334/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen-vl",
    "messages": [
        {"role": "system", "content": "You are an intelligent assistant"},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
            {"type": "text", "text": "Analyze this image"}
        ]}
    ]
}'

# Response

{"object":"error","message":"Unknown model type: {model_type}","type":"BadRequestError","param":null,"code":400}

Python code test

from openai import OpenAI

openai_api_key = "EMPTY"

openai_api_base = "http://localhost:23334/v1"

client = OpenAI(

    api_key=openai_api_key,

    base_url=openai_api_base,

)

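# Single-image input inference

# NOTE: this is a local file path rather than a URL; the server may require an http(s) URL or a base64 data URL instead.

image_url = "/mnt/workspace/ocr4.jpg"

chat_response = client.chat.completions.create(

    model="qwen-vl",  # must match the --served-model-name used when starting the server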

    messages=[{

        "role": "user",

        "content": [

            # NOTE: The prompt formatting with the image token `<image>` is not needed

            # since the prompt will be processed automatically by the API server.

            {"type": "text", "text": "What’s in this image?"},

            {"type": "image_url", "image_url": {"url": image_url}},

        ],

    }],

)

print("Chat completion output:", chat_response.choices[0].message.content)
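The client above passes a local file path in the `image_url` field, which a remote server cannot open as a URL. One common workaround is to embed the image as a base64 `data:` URL. A minimal sketch, assuming the server accepts data URLs in `image_url` (not verified against this deployment); `to_data_url` is a hypothetical helper:

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(image_path: str) -> str:
    """Encode a local image file as a base64 `data:` URL."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        # Fall back to a generic type when the extension is unknown.
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Usage with the client code above (requires the file to exist):
# image_url = to_data_url("/mnt/workspace/ocr4.jpg")
```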

Multi-GPU distributed deployment (untested)

Pass the --tensor-parallel-size argument to serve on multiple GPUs:

python -m vllm.entrypoints.api_server \
    --model Qwen/Qwen2-72B-Instruct \
    --tensor-parallel-size 4

Engine arguments

vllm serve [-h] [--model MODEL] [--tokenizer TOKENIZER]

                  [--skip-tokenizer-init] [--revision REVISION]

                  [--code-revision CODE_REVISION]

                  [--tokenizer-revision TOKENIZER_REVISION]

                  [--tokenizer-mode {auto,slow,mistral}] [--trust-remote-code]

                  [--download-dir DOWNLOAD_DIR]

                  [--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,sharded_state,gguf,bitsandbytes}]

                  [--dtype {auto,half,float16,bfloat16,float,float32}]

                  [--kv-cache-dtype {auto,fp8,fp8_e5m2,fp8_e4m3}]

                  [--quantization-param-path QUANTIZATION_PARAM_PATH]

                  [--max-model-len MAX_MODEL_LEN]

                  [--guided-decoding-backend {outlines,lm-format-enforcer}]

                  [--distributed-executor-backend {ray,mp}] [--worker-use-ray]

                  [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]

                  [--tensor-parallel-size TENSOR_PARALLEL_SIZE]

                  [--max-parallel-loading-workers MAX_PARALLEL_LOADING_WORKERS]

                  [--ray-workers-use-nsight] [--block-size {8,16,32}]

                  [--enable-prefix-caching] [--disable-sliding-window]

                  [--use-v2-block-manager]

                  [--num-lookahead-slots NUM_LOOKAHEAD_SLOTS] [--seed SEED]

                  [--swap-space SWAP_SPACE] [--cpu-offload-gb CPU_OFFLOAD_GB]

                  [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]

                  [--num-gpu-blocks-override NUM_GPU_BLOCKS_OVERRIDE]

                  [--max-num-batched-tokens MAX_NUM_BATCHED_TOKENS]

                  [--max-num-seqs MAX_NUM_SEQS] [--max-logprobs MAX_LOGPROBS]

                  [--disable-log-stats]

                  [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,squeezellm,compressed-tensors,bitsandbytes,qqq,experts_int8,neuron_quant,None}]

                  [--rope-scaling ROPE_SCALING] [--rope-theta ROPE_THETA]

                  [--enforce-eager]

                  [--max-context-len-to-capture MAX_CONTEXT_LEN_TO_CAPTURE]

                  [--max-seq-len-to-capture MAX_SEQ_LEN_TO_CAPTURE]

                  [--disable-custom-all-reduce]

                  [--tokenizer-pool-size TOKENIZER_POOL_SIZE]

                  [--tokenizer-pool-type TOKENIZER_POOL_TYPE]

                  [--tokenizer-pool-extra-config TOKENIZER_POOL_EXTRA_CONFIG]

                  [--limit-mm-per-prompt LIMIT_MM_PER_PROMPT] [--enable-lora]

                  [--max-loras MAX_LORAS] [--max-lora-rank MAX_LORA_RANK]

                  [--lora-extra-vocab-size LORA_EXTRA_VOCAB_SIZE]

                  [--lora-dtype {auto,float16,bfloat16,float32}]

                  [--long-lora-scaling-factors LONG_LORA_SCALING_FACTORS]

                  [--max-cpu-loras MAX_CPU_LORAS] [--fully-sharded-loras]

                  [--enable-prompt-adapter]

                  [--max-prompt-adapters MAX_PROMPT_ADAPTERS]

                  [--max-prompt-adapter-token MAX_PROMPT_ADAPTER_TOKEN]

                  [--device {auto,cuda,neuron,cpu,openvino,tpu,xpu}]

                  [--num-scheduler-steps NUM_SCHEDULER_STEPS]

                  [--scheduler-delay-factor SCHEDULER_DELAY_FACTOR]

                  [--enable-chunked-prefill [ENABLE_CHUNKED_PREFILL]]

                  [--speculative-model SPECULATIVE_MODEL]

                  [--speculative-model-quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,squeezellm,compressed-tensors,bitsandbytes,qqq,experts_int8,neuron_quant,None}]

                  [--num-speculative-tokens NUM_SPECULATIVE_TOKENS]

                  [--speculative-draft-tensor-parallel-size SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE]

                  [--speculative-max-model-len SPECULATIVE_MAX_MODEL_LEN]

                  [--speculative-disable-by-batch-size SPECULATIVE_DISABLE_BY_BATCH_SIZE]

                  [--ngram-prompt-lookup-max NGRAM_PROMPT_LOOKUP_MAX]

                  [--ngram-prompt-lookup-min NGRAM_PROMPT_LOOKUP_MIN]

                  [--spec-decoding-acceptance-method {rejection_sampler,typical_acceptance_sampler}]

                  [--typical-acceptance-sampler-posterior-threshold TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD]

                  [--typical-acceptance-sampler-posterior-alpha TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA]

                  [--disable-logprobs-during-spec-decoding [DISABLE_LOGPROBS_DURING_SPEC_DECODING]]

                  [--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG]

                  [--ignore-patterns IGNORE_PATTERNS]

                  [--preemption-mode PREEMPTION_MODE]

                  [--served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]]

                  [--qlora-adapter-name-or-path QLORA_ADAPTER_NAME_OR_PATH]

                  [--otlp-traces-endpoint OTLP_TRACES_ENDPOINT]

                  [--collect-detailed-traces COLLECT_DETAILED_TRACES]

                  [--disable-async-output-proc]

                  [--override-neuron-config OVERRIDE_NEURON_CONFIG]

Argument reference

--model

Name or path of the huggingface model to use.

Default: "facebook/opt-125m"

--tokenizer

Name or path of the huggingface tokenizer to use. If unspecified, model name or path will be used.

--skip-tokenizer-init

Skip initialization of tokenizer and detokenizer.

--revision

The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.

--code-revision

The specific revision to use for the model code on Hugging Face Hub. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.

--tokenizer-revision

Revision of the huggingface tokenizer to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.

--tokenizer-mode

Possible choices: auto, slow, mistral

The tokenizer mode.

"auto" will use the fast tokenizer if available.

"slow" will always use the slow tokenizer.

"mistral" will always use the mistral_common tokenizer.

Default: "auto"

--trust-remote-code

Trust remote code from huggingface.

--download-dir

Directory to download and load the weights, default to the default cache dir of huggingface.

--load-format

Possible choices: auto, pt, safetensors, npcache, dummy, tensorizer, sharded_state, gguf, bitsandbytes

The format of the model weights to load.

"auto" will try to load the weights in the safetensors format and fall back to the pytorch bin format if safetensors format is not available.

"pt" will load the weights in the pytorch bin format.

"safetensors" will load the weights in the safetensors format.

"npcache" will load the weights in pytorch format and store a numpy cache to speed up the loading.

"dummy" will initialize the weights with random values, which is mainly for profiling.

"tensorizer" will load the weights using tensorizer from CoreWeave. See the Tensorize vLLM Model script in the Examples section for more information.

"bitsandbytes" will load the weights using bitsandbytes quantization.

Default: "auto"

--dtype

Possible choices: auto, half, float16, bfloat16, float, float32

Data type for model weights and activations.

"auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.

"half" for FP16. Recommended for AWQ quantization.

"float16" is the same as "half".

"bfloat16" for a balance between precision and range.

"float" is shorthand for FP32 precision.

"float32" for FP32 precision.

Default: "auto"
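The "auto" rule described above is a three-row lookup. A minimal sketch of that rule as stated in the option description, not vLLM's actual implementation; `resolve_auto_dtype` is a hypothetical helper and the dtype names are the ones from the choices list:

```python
def resolve_auto_dtype(checkpoint_dtype: str) -> str:
    """Map a checkpoint dtype to the dtype that "auto" is described as choosing."""
    rule = {
        "float32": "float16",    # FP32 checkpoints run in FP16
        "float16": "float16",    # FP16 checkpoints stay in FP16
        "bfloat16": "bfloat16",  # BF16 checkpoints stay in BF16
    }
    if checkpoint_dtype not in rule:
        raise ValueError(f"unsupported checkpoint dtype: {checkpoint_dtype}")
    return rule[checkpoint_dtype]


for dtype in ("float32", "float16", "bfloat16"):
    print(dtype, "->", resolve_auto_dtype(dtype))
```

The practical consequence is that an FP32 checkpoint is served in half precision unless `--dtype float32` is passed explicitly.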

--kv-cache-dtype

Possible choices: auto, fp8, fp8_e5m2, fp8_e4m3

Data type for kv cache storage. If "auto", will use model data type. CUDA 11.8+ supports fp8 (=fp8_e4m3) and fp8_e5m2. ROCm (AMD GPU) supports fp8 (=fp8_e4m3).

Default: "auto"

--quantization-param-path

Path to the JSON file containing the KV cache scaling factors. This should generally be supplied when KV cache dtype is FP8. Otherwise, KV cache scaling factors default to 1.0, which may cause accuracy issues. FP8_E5M2 (without scaling) is only supported on CUDA versions greater than 11.8. On ROCm (AMD GPU), FP8_E4M3 is instead supported for common inference criteria.

--max-model-len

Model context length. If unspecified, will be automatically derived from the model config.

--guided-decoding-backend

Possible choices: outlines, lm-format-enforcer

Which engine will be used for guided decoding (JSON schema / regex etc) by default. Currently support outlines-dev/outlines and noamgat/lm-format-enforcer. Can be overridden per request via guided_decoding_backend parameter.

Default: "outlines"

--distributed-executor-backend

Possible choices: ray, mp

Backend to use for distributed serving. When more than 1 GPU is used, will be automatically set to "ray" if installed or "mp" (multiprocessing) otherwise.

--worker-use-ray

Deprecated, use --distributed-executor-backend=ray.

--pipeline-parallel-size, -pp

Number of pipeline stages.

Default: 1

--tensor-parallel-size, -tp

Number of tensor parallel replicas.

Default: 1

--max-parallel-loading-workers

Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel and large models.

--ray-workers-use-nsight

If specified, use nsight to profile Ray workers.

--block-size

Possible choices: 8, 16, 32

Token block size for contiguous chunks of tokens. This is ignored on neuron devices and set to max-model-len.

Default: 16

--enable-prefix-caching

Enables automatic prefix caching.

--disable-sliding-window

Disables sliding window, capping to sliding window size.

--use-v2-block-manager

Use BlockSpaceManagerV2.

--num-lookahead-slots

Experimental scheduling config necessary for speculative decoding. This will be replaced by speculative config in the future; it is present to enable correctness tests until then.

Default: 0

--seed

Random seed for operations.

Default: 0

--swap-space

CPU swap space size (GiB) per GPU.

Default: 4

--cpu-offload-gb

The space in GiB to offload to CPU, per GPU. Default is 0, which means no offloading. Intuitively, this argument can be seen as a virtual way to increase the GPU memory size. For example, if you have one 24 GB GPU and set this to 10, virtually you can think of it as a 34 GB GPU. Then you can load a 13B model with BF16 weights, which requires at least 26 GB of GPU memory. Note that this requires a fast CPU-GPU interconnect, as part of the model is loaded from CPU memory to GPU memory on the fly in each model forward pass.

Default: 0
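The 13B example above is simple arithmetic: BF16 stores each parameter in 2 bytes, so 13 billion parameters need about 26 GB for weights alone. A minimal sketch of that estimate; `weight_memory_gb` and `fits_with_offload` are hypothetical helpers, and the calculation deliberately ignores the KV cache and activations:

```python
def weight_memory_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size in GB: billions of parameters x bytes per parameter."""
    return num_params_billion * bytes_per_param


def fits_with_offload(weight_gb: float, gpu_gb: float, cpu_offload_gb: float = 0) -> bool:
    """True when the weights fit in GPU memory plus the offloaded space."""
    return weight_gb <= gpu_gb + cpu_offload_gb


weights = weight_memory_gb(13)                                   # 13B parameters in BF16
print(weights)                                                   # 26
print(fits_with_offload(weights, gpu_gb=24))                     # False
print(fits_with_offload(weights, gpu_gb=24, cpu_offload_gb=10))  # True
```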

--gpu-memory-utilization

The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9.

Default: 0.9

--num-gpu-blocks-override

If specified, ignore GPU profiling result and use this number of GPU blocks. Used for testing preemption.

--max-num-batched-tokens

Maximum number of batched tokens per iteration.

--max-num-seqs

Maximum number of sequences per iteration.

Default: 256

--max-logprobs

Max number of log probs to return when logprobs is specified in SamplingParams.

Default: 20

--disable-log-stats

Disable logging statistics.

--quantization, -q

Possible choices: aqlm, awq, deepspeedfp, tpu_int8, fp8, fbgemm_fp8, marlin, gguf, gptq_marlin_24, gptq_marlin, awq_marlin, gptq, squeezellm, compressed-tensors, bitsandbytes, qqq, experts_int8, neuron_quant, None

Method used to quantize the weights. If None, we first check the quantization_config attribute in the model config file. If that is None, we assume the model weights are not quantized and use dtype to determine the data type of the weights.

--rope-scaling

RoPE scaling configuration in JSON format. For example, {"type":"dynamic","factor":2.0}
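Because --rope-scaling takes a JSON string, quoting it by hand in a shell is error-prone. A minimal sketch that builds the argument from a dict and shell-quotes it; the dict values are the ones from the example above, and nothing here calls vLLM itself:

```python
import json
import shlex

rope_scaling = {"type": "dynamic", "factor": 2.0}

# Compact separators reproduce the form shown in the option description.
arg = json.dumps(rope_scaling, separators=(",", ":"))
print(arg)  # {"type":"dynamic","factor":2.0}

# shlex.quote wraps the JSON in single quotes so the shell passes it through unchanged.
print("--rope-scaling", shlex.quote(arg))
```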

--rope-theta

RoPE theta. Use with rope_scaling. In some cases, changing the RoPE theta improves the performance of the scaled model.

--enforce-eager

Always use eager-mode PyTorch. If False, will use eager mode and CUDA graph in hybrid for maximal performance and flexibility.

--max-context-len-to-capture

Maximum context length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode. (DEPRECATED. Use --max-seq-len-to-capture instead.)

--max-seq-len-to-capture

Maximum sequence length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode.

Default: 8192

--disable-custom-all-reduce

See ParallelConfig.

--tokenizer-pool-size

Size of tokenizer pool to use for asynchronous tokenization. If 0, will use synchronous tokenization.

Default: 0

--tokenizer-pool-type

Type of tokenizer pool to use for asynchronous tokenization. Ignored if tokenizer_pool_size is 0.

Default: "ray"

--tokenizer-pool-extra-config

Extra config for tokenizer pool. This should be a JSON string that will be parsed into a dictionary. Ignored if tokenizer_pool_size is 0.

--limit-mm-per-prompt

For each multimodal plugin, limit how many input instances to allow for each prompt. Expects a comma-separated list of items, e.g.: image=16,video=2 allows a maximum of 16 images and 2 videos per prompt. Defaults to 1 for each modality.
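The `image=16,video=2` format described above is a comma-separated list of `modality=count` pairs. A minimal sketch of how such a value can be parsed and validated before it is passed on the command line; `parse_limit_mm_per_prompt` is a hypothetical helper and does not reproduce vLLM's own parser:

```python
def parse_limit_mm_per_prompt(value: str) -> dict[str, int]:
    """Parse 'image=16,video=2' into {'image': 16, 'video': 2}."""
    limits = {}
    for item in value.split(","):
        modality, _, count = item.strip().partition("=")
        if not modality or not count:
            raise ValueError(f"expected modality=count, got {item!r}")
        limits[modality] = int(count)
    return limits


print(parse_limit_mm_per_prompt("image=16,video=2"))  # {'image': 16, 'video': 2}
```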

--enable-lora

If True, enable handling of LoRA adapters.

--max-loras

Max number of LoRAs in a single batch.

Default: 1

--max-lora-rank

Max LoRA rank.

Default: 16

--lora-extra-vocab-size

Maximum size of extra vocabulary that can be present in a LoRA adapter (added to the base model vocabulary).

Default: 256

--lora-dtype

Possible choices: auto, float16, bfloat16, float32

Data type for LoRA. If auto, will default to base model dtype.

Default: "auto"

--long-lora-scaling-factors

Specify multiple scaling factors (which can be different from base model scaling factor - see e.g. Long LoRA) to allow for multiple LoRA adapters trained with those scaling factors to be used at the same time. If not specified, only adapters trained with the base model scaling factor are allowed.

--max-cpu-loras

Maximum number of LoRAs to store in CPU memory. Must be >= max_num_seqs. Defaults to max_num_seqs.

--fully-sharded-loras

By default, only half of the LoRA computation is sharded with tensor parallelism. Enabling this will use the fully sharded layers. At high sequence length, max rank or tensor parallel size, this is likely faster.

--enable-prompt-adapter

If True, enable handling of PromptAdapters.

--max-prompt-adapters

Max number of PromptAdapters in a batch.

Default: 1

--max-prompt-adapter-token

Max number of PromptAdapters tokens.

Default: 0

--device

Possible choices: auto, cuda, neuron, cpu, openvino, tpu, xpu

Device type for vLLM execution.

Default: "auto"

--num-scheduler-steps

Maximum number of forward steps per scheduler call.

Default: 1

--scheduler-delay-factor

Apply a delay (of delay factor multiplied by previous prompt latency) before scheduling next prompt.

Default: 0.0

--enable-chunked-prefill

If set, the prefill requests can be chunked based on the max_num_batched_tokens.

--speculative-model

The name of the draft model to be used in speculative decoding.

--speculative-model-quantization

Possible choices: aqlm, awq, deepspeedfp, tpu_int8, fp8, fbgemm_fp8, marlin, gguf, gptq_marlin_24, gptq_marlin, awq_marlin, gptq, squeezellm, compressed-tensors, bitsandbytes, qqq, experts_int8, neuron_quant, None

Method used to quantize the weights of speculative model. If None, we first check the quantization_config attribute in the model config file. If that is None, we assume the model weights are not quantized and use dtype to determine the data type of the weights.

--num-speculative-tokens

The number of speculative tokens to sample from the draft model in speculative decoding.

--speculative-draft-tensor-parallel-size, -spec-draft-tp

Number of tensor parallel replicas for the draft model in speculative decoding.

--speculative-max-model-len

The maximum sequence length supported by the draft model. Sequences over this length will skip speculation.

--speculative-disable-by-batch-size

Disable speculative decoding for new incoming requests if the number of enqueue requests is larger than this value.

--ngram-prompt-lookup-max

Max size of window for ngram prompt lookup in speculative decoding.

--ngram-prompt-lookup-min

Min size of window for ngram prompt lookup in speculative decoding.

--spec-decoding-acceptance-method

Possible choices: rejection_sampler, typical_acceptance_sampler

Specify the acceptance method to use during draft token verification in speculative decoding. Two types of acceptance routines are supported: 1) RejectionSampler, which does not allow changing the acceptance rate of draft tokens; 2) TypicalAcceptanceSampler, which is configurable, allowing for a higher acceptance rate at the cost of lower quality, and vice versa.

Default: "rejection_sampler"

--typical-acceptance-sampler-posterior-threshold

Set the lower bound threshold for the posterior probability of a token to be accepted. This threshold is used by the TypicalAcceptanceSampler to make sampling decisions during speculative decoding. Defaults to 0.09.

--typical-acceptance-sampler-posterior-alpha

A scaling factor for the entropy-based threshold for token acceptance in the TypicalAcceptanceSampler. Typically defaults to the square root of --typical-acceptance-sampler-posterior-threshold, i.e. 0.3.
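
To make the threshold and alpha options concrete, here is a small sketch of an entropy-based acceptance test. It assumes the Medusa-style rule, where a draft token is accepted when its target probability exceeds min(threshold, alpha * exp(-entropy)); the function name is hypothetical and vLLM's sampler may differ in detail. The default alpha is computed as the square root of the threshold, which gives 0.3 for 0.09.

```python
import math
from typing import Optional, Sequence


def typical_accept(
    target_probs: Sequence[float],
    token_id: int,
    posterior_threshold: float = 0.09,
    posterior_alpha: Optional[float] = None,
) -> bool:
    """Return True if the draft token passes the entropy-based test (illustrative)."""
    if posterior_alpha is None:
        # sqrt(0.09) = 0.3, matching the documented default relationship.
        posterior_alpha = math.sqrt(posterior_threshold)
    # Shannon entropy of the target model's distribution at this position.
    entropy = -sum(p * math.log(p) for p in target_probs if p > 0)
    # Flat (high-entropy) distributions lower the bar; peaked ones keep it at the threshold.
    threshold = min(posterior_threshold, posterior_alpha * math.exp(-entropy))
    return target_probs[token_id] > threshold
```

On a peaked distribution the hard threshold of 0.09 applies, so unlikely tokens are rejected; on a flat distribution the entropy term shrinks the bar and more draft tokens get through, which is the quality-for-acceptance trade-off described above.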

--disable-logprobs-during-spec-decoding

If set to True, token log probabilities are not returned during speculative decoding. If set to False, log probabilities are returned according to the settings in SamplingParams. If not specified, it defaults to True. Disabling log probabilities during speculative decoding reduces latency by skipping logprob calculation in proposal sampling, target sampling, and after accepted tokens are determined.

--model-loader-extra-config

Extra config for the model loader. This will be passed to the model loader corresponding to the chosen load_format. This should be a JSON string that will be parsed into a dictionary.

--ignore-patterns

The pattern(s) to ignore when loading the model. Defaults to 'original/**/*' to avoid repeated loading of llama's checkpoints.

Default: []

--preemption-mode

If 'recompute', the engine performs preemption by recomputing; if 'swap', the engine performs preemption by block swapping.

--served-model-name

The model name(s) used in the API. If multiple names are provided, the server will respond to any of them. The model name in the model field of a response will be the first name in this list. If not specified, the model name will be the same as the --model argument. Note that the name(s) will also be used in the model_name tag of Prometheus metrics; if multiple names are provided, the metrics tag takes the first one.

--qlora-adapter-name-or-path

Name or path of the QLoRA adapter.

--otlp-traces-endpoint

Target URL to which OpenTelemetry traces will be sent.

--collect-detailed-traces

Valid choices are model, worker, all. It makes sense to set this only if --otlp-traces-endpoint is set. If set, detailed traces are collected for the specified modules. This involves possibly costly and/or blocking operations and hence might have a performance impact.

--disable-async-output-proc

Disable async output processing. This may result in lower performance.

--override-neuron-config

Override or set the neuron device configuration.

Ollama

Starting the service

# Start the service

ollama serve

# Run in the background

nohup ollama serve &

Common commands

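run
- Description: Run the specified model and generate output. The prompt is passed as a positional argument.
- Example:
```
ollama run <model_name> "<your_input>"
```
pull
- Description: Download the specified model.
- Example:
```
ollama pull <model_name>
```
list
- Description: List the downloaded models.
- Example:
```
ollama list
```
rm
- Description: Remove the specified model.
- Example:
```
ollama rm <model_name>
```
show
- Description: Show detailed information about the specified model.
- Example:
```
ollama show <model_name>
```
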
Testing

curl test: LLM

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2:7b",
  "prompt": "你好啊"
}'

# Equivalent request through the chat endpoint

curl http://localhost:11434/api/chat -d '{
  "model": "qwen2:7b",
  "messages": [
    { "role": "user", "content": "你好啊" }
  ]
}'
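
The same request can be issued from Python. This is a minimal sketch using only the standard library; it assumes the server streams newline-delimited JSON objects that carry a `response` text fragment and a final `done` flag, which is how `/api/generate` behaves by default. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import Iterable

# Same endpoint as the curl example above.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build the POST request carrying the model name and prompt as JSON."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the text fragments from a streamed response."""
    parts = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate(model: str, prompt: str) -> str:
    """Send the request to a running Ollama server and return the full reply."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return join_stream(resp)
```

With `ollama serve` running and the model pulled, `generate("qwen2:7b", "你好啊")` returns the reply as one string.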

curl test: VLM

curl http://localhost:11434/api/generate -d '{

  "model": "llava:7b",

  "prompt": "What's in this image?",

  "images": ["iVBORw0KGgoAAAANSUhEUgAAAG0AAABmCAYAAADBPx+VAAAACXBIWXMAAAsTAAALEwEAmpwYAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAA3VSURBVHgB7Z27r0zdG8fX743i1bi1ikMoFMQloXRpKFFIqI7LH4BEQ+NWIkjQuSWCRIEoULk0gsK1kCBI0IhrQVT7tz/7zZo888yz1r7MnDl7z5xvsjkzs2fP3uu71nNfa7lkAsm7d++Sffv2JbNmzUqcc8m0adOSzZs3Z+/XES4ZckAWJEGWPiCxjsQNLWmQsWjRIpMseaxcuTKpG/7HP27I8P79e7dq1ars/yL4/v27S0ejqwv+cUOGEGGpKHR37tzJCEpHV9tnT58+dXXCJDdECBE2Ojrqjh071hpNECjx4cMHVycM1Uhbv359B2F79+51586daxN/+pyRkRFXKyRDAqxEp4yMlDDzXG1NPnnyJKkThoK0VFd1ELZu3TrzXKxKfW7dMBQ6bcuWLW2v0VlHjx41z717927ba22U9APcw7Nnz1oGEPeL3m3p2mTAYYnFmMOMXybPPXv2bNIPpFZr1NHn4HMw0KRBjg9NuRw95s8PEcz/6DZELQd/09C9QGq5RsmSRybqkwHGjh07OsJSsYYm3ijPpyHzoiacg35MLdDSIS/O1yM778jOTwYUkKNHWUzUWaOsylE00MyI0fcnOwIdjvtNdW/HZwNLGg+sR1kMepSNJXmIwxBZiG8tDTpEZzKg0GItNsosY8USkxDhD0Rinuiko2gfL/RbiD2LZAjU9zKQJj8RDR0vJBR1/Phx9+PHj9Z7REF4nTZkxzX4LCXHrV271qXkBAPGfP/atWvu/PnzHe4C97F48eIsRLZ9+3a3f/9+87dwP1JxaF7/3r17ba+5l4EcaVo0lj3SBq5kGTJSQmLWMjgYNei2GPT1MuMqGTDEFHzeQSP2wi/jGnkmPJ/nhccs44jvDAxpVcxnq0F6eT8h4ni/iIWpR5lPyA6ETkNXoSukvpJAD3AsXLiwpZs49+fPn5ke4j10TqYvegSfn0OnafC+Tv9ooA/JPkgQysqQNBzagXY55nO/oa1F7qvIPWkRL12WRpMWUvpVDYmxAPehxWSe8ZEXL20sadYIozfmNch4QJPAfeJgW3rNsnzphBKNJM2KKODo1rVOMRYik5ETy3ix4qWNI81qAAirizgMIc+yhTytx0JWZuNI03qsrgWlGtwjoS9XwgUhWGyhUaRZZQNNIEwCiXD16tXcAHUs79co0vSD8rrJCIW98pzvxpAWyyo3HYwqS0+H0BjStClcZJT5coMm6D2LOF8TolGJtK9fvyZpyiC5ePFi9nc/oJU4eiEP0jVoAnHa9wyJycITMP78+eMeP37sXrx44d6+fdt6f82aNdkx1pg9e3Zb5W+RSRE+n+VjksQWifvVaTKFhn5O8my63K8Qabdv33b379/PiAP//vuvW7BggZszZ072/+TJk91YgkafPn166zXB1rQHFvouAWHq9z3SEevSUerqCn2/dDCeta2jxYbr69evk4MHDyY7d+7MjhMnTiTPnz9Pfv/+nfQT2ggpO2dMF8cghuoM7Ygj5iWCqRlGFml0QC/ftGmTmzt3rmsaKDsgBSPh0/8yPeLLBihLkOKJc0jp8H8vUzcxIA1k6QJ/c78tWEyj5P3o4u9+jywNPdJi5rAH9x0KHcl4Hg570eQp3+vHXGyrmEeigzQsQsjavXt38ujRo44LQuDDhw+TW7duRS1HGgMxhNXHgflaNTOsHyKvHK5Ijo2jbFjJBQK9YwFd6RVMzfgRBmEfP37suBBm/p49e1qjEP2mwTViNRo0VJWH1deMXcNK08uUjVUu7s/zRaL+oLNxz1bpANco4npUgX4G2eFbpDFyQoQxojBCpEGSytmOH8qrH5Q9vuzD6ofQylkCUmh8DBAr+q8JCyVNtWQIidKQE9wNtLSQnS4jDS
sxNHogzFuQBw4cyM61UKVsjfr3ooBkPSqqQHesUPWVtzi9/vQi1T+rJj7WiTz4Pt/l3LxUkr5P2VYZaZ4URpsE+st/dujQoaBBYokbrz/8TJNQYLSonrPS9kUaSkPeZyj1AWSj+d+VBoy1pIWVNed8P0Ll/ee5HdGRhrHhR5GGN0r4LGZBaj8oFDJitBTJzIZgFcmU0Y8ytWMZMzJOaXUSrUs5RxKnrxmbb5YXO9VGUhtpXldhEUogFr3IzIsvlpmdosVcGVGXFWp2oU9kLFL3dEkSz6NHEY1sjSRdIuDFWEhd8KxFqsRi1uM/nz9/zpxnwlESONdg6dKlbsaMGS4EHFHtjFIDHwKOo46l4TxSuxgDzi+rE2jg+BaFruOX4HXa0Nnf1lwAPufZeF8/r6zD97WK2qFnGjBxTw5qNGPxT+5T/r7/7RawFC3j4vTp09koCxkeHjqbHJqArmH5UrFKKksnxrK7FuRIs8STfBZv+luugXZ2pR/pP9Ois4z+TiMzUUkUjD0iEi1fzX8GmXyuxUBRcaUfykV0YZnlJGKQpOiGB76x5GeWkWWJc3mOrK6S7xdND+W5N6XyaRgtWJFe13GkaZnKOsYqGdOVVVbGupsyA/l7emTLHi7vwTdirNEt0qxnzAvBFcnQF16xh/TMpUuXHDowhlA9vQVraQhkudRdzOnK+04ZSP3DUhVSP61YsaLtd/ks7ZgtPcXqPqEafHkdqa84X6aCeL7YWlv6edGFHb+ZFICPlljHhg0bKuk0CSvVznWsotRu433alNdFrqG45ejoaPCaUkWERpLXjzFL2Rpllp7PJU2a/v7Ab8N05/9t27Z16KUqoFGsxnI9EosS2niSYg9SpU6B4JgTrvVW1flt1sT+0ADIJU2maXzcUTraGCRaL1Wp9rUMk16PMom8QhruxzvZIegJjFU7LLCePfS8uaQdPny4jTTL0dbee5mYokQsXTIWNY46kuMbnt8Kmec+LGWtOVIl9cT1rCB0V8WqkjAsRwta93TbwNYoGKsUSChN44lgBNCoHLHzquYKrU6qZ8lolCIN0Rh6cP0Q3U6I6IXILYOQI513hJaSKAorFpuHXJNfVlpRtmYBk1Su1obZr5dnKAO+L10Hrj3WZW+E3qh6IszE37F6EB+68mGpvKm4eb9bFrlzrok7fvr0Kfv727dvWRmdVTJHw0qiiCUSZ6wCK+7XL/AcsgNyL74DQQ730sv78Su7+t/A36MdY0sW5o40ahslXr58aZ5HtZB8GH64m9EmMZ7FpYw4T6QnrZfgenrhFxaSiSGXtPnz57e9TkNZLvTjeqhr734CNtrK41L40sUQckmj1lGKQ0rC37x544r8eNXRpnVE3ZZY7zXo8NomiO0ZUCj2uHz58rbXoZ6gc0uA+F6ZeKS/jhRDUq8MKrTho9fEkihMmhxtBI1DxKFY9XLpVcSkfoi8JGnToZO5sU5aiDQIW716ddt7ZLYtMQlhECdBGXZZMWldY5BHm5xgAroWj4C0hbYkSc/jBmggIrXJWlZM6pSETsEPGqZOndr2uuuR5rF169a2HoHPdurUKZM4CO1WTPqaDaAd+GFGKdIQkxAn9RuEWcTRyN2KSUgiSgF5aWzPTeA/lN5rZubMmR2bE4SIC4nJoltgAV/dVefZm72AtctUCJU2CMJ327hxY9t7EHbkyJFseq+EJSY16RPo3Dkq1kkr7+q0bNmyDuLQcZBEPYmHVdOBiJyIlrRDq41YPWfXOxUysi5fvtyaj+2BpcnsUV/oSoEMOk2CQGlr4ckhBwaetBhjCwH0ZHtJROPJkyc7UjcYLDjmrH7ADTEBXFfOYmB0k9oYBOjJ8b4aOYSe7QkKcYhFlq3QYLQhSidNmtS2RATwy8YOM3EQJsUjKiaWZ+vZToUQgzhkHXudb/PW5YMHD9yZM2faPsMwoc7RciYJXbGuBqJ1UIGKKLv915jsvgtJxCZDubdXr165mzdvtr1Hz5LONA8jrUwKPq
smVesKa49S3Q4WxmRPUEYdTjgiUcfUwLx589ySJUva3oMkP6IYddq6HMS4o55xBJBUeRjzfa4Zdeg56QZ43LhxoyPo7Lf1kNt7oO8wWAbNwaYjIv5lhyS7kRf96dvm5Jah8vfvX3flyhX35cuX6HfzFHOToS1H4BenCaHvO8pr8iDuwoUL7tevX+b5ZdbBair0xkFIlFDlW4ZknEClsp/TzXyAKVOmmHWFVSbDNw1l1+4f90U6IY/q4V27dpnE9bJ+v87QEydjqx/UamVVPRG+mwkNTYN+9tjkwzEx+atCm/X9WvWtDtAb68Wy9LXa1UmvCDDIpPkyOQ5ZwSzJ4jMrvFcr0rSjOUh+GcT4LSg5ugkW1Io0/SCDQBojh0hPlaJdah+tkVYrnTZowP8iq1F1TgMBBauufyB33x1v+NWFYmT5KmppgHC+NkAgbmRkpD3yn9QIseXymoTQFGQmIOKTxiZIWpvAatenVqRVXf2nTrAWMsPnKrMZHz6bJq5jvce6QK8J1cQNgKxlJapMPdZSR64/UivS9NztpkVEdKcrs5alhhWP9NeqlfWopzhZScI6QxseegZRGeg5a8C3Re1Mfl1ScP36ddcUaMuv24iOJtz7sbUjTS4qBvKmstYJoUauiuD3k5qhyr7QdUHMeCgLa1Ear9NquemdXgmum4fvJ6w1lqsuDhNrg1qSpleJK7K3TF0Q2jSd94uSZ60kK1e3qyVpQK6PVWXp2/FC3mp6jBhKKOiY2h3gtUV64TWM6wDETRPLDfSakXmH3w8g9Jlug8ZtTt4kVF0kLUYYmCCtD/DrQ5YhMGbA9L3ucdjh0y8kOHW5gU/VEEmJTcL4Pz/f7mgoAbYkAAAAAElFTkSuQmCC"]

}'
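
The `images` field expects each image as a single base64 string with no line breaks. A minimal sketch of producing such a payload from raw image bytes, assuming the same `llava:7b` request shape as above (the function name is hypothetical):

```python
import base64
import json


def build_vision_payload(model: str, prompt: str, image_bytes: bytes) -> str:
    """Return the JSON request body with the image embedded as base64 text."""
    # b64encode emits one unbroken ASCII string, which is what the API expects.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"model": model, "prompt": prompt, "images": [encoded]})
```

Reading the bytes with `open(path, "rb").read()` and sending the result as the request body to `/api/generate` reproduces the curl call without pasting the encoded image by hand.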

Configuration

Ollama provides a number of environment variables for configuration:

OLLAMA_DEBUG: whether to enable debug output; defaults to false.

OLLAMA_FLASH_ATTENTION: whether to enable Flash Attention. Listed here as defaulting to true, although the Ollama FAQ describes it as opt-in, so check the default for your installed version.

OLLAMA_HOST: address the Ollama server binds to; defaults to 127.0.0.1:11434.

OLLAMA_KEEP_ALIVE: how long a model stays loaded in memory after a request; defaults to 5m.

OLLAMA_LLM_LIBRARY: LLM library to use, bypassing autodetection; unset by default.

OLLAMA_MAX_LOADED_MODELS: maximum number of loaded models; defaults to 1.

OLLAMA_MAX_QUEUE: maximum number of queued requests; unset by default.

OLLAMA_MAX_VRAM: maximum GPU memory (VRAM) to use; unset by default.

OLLAMA_MODELS: path to the models directory; unset by default.

OLLAMA_NOHISTORY: whether to disable saving of readline history; defaults to false.

OLLAMA_NOPRUNE: whether to disable pruning of model blobs on startup; defaults to false.

OLLAMA_NUM_PARALLEL: maximum number of parallel requests; defaults to 1.

OLLAMA_ORIGINS: allowed origins, comma-separated; unset by default.

OLLAMA_RUNNERS_DIR: directory for runners; unset by default.

OLLAMA_SCHED_SPREAD: whether to always schedule a model across all GPUs; unset by default.

OLLAMA_TMPDIR: directory for temporary files; unset by default.
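
These variables are read by the server process, so they must be set in the environment that launches `ollama serve`. A minimal sketch of preparing that environment from Python; the helper name is hypothetical and the values in the usage note are illustrative, not recommendations.

```python
import os
from typing import Dict, Mapping, Optional


def build_ollama_env(
    overrides: Mapping[str, str],
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with the given OLLAMA_* variables applied."""
    env = dict(os.environ if base is None else base)
    for key, value in overrides.items():
        # Guard against typos leaking unrelated variables into the server environment.
        if not key.startswith("OLLAMA_"):
            raise ValueError(f"not an Ollama variable: {key}")
        env[key] = value
    return env
```

Passing the result as `env=` to `subprocess.Popen(["ollama", "serve"], ...)`, for example with `{"OLLAMA_HOST": "0.0.0.0:11434", "OLLAMA_KEEP_ALIVE": "10m"}`, starts the server with those settings without touching the parent shell.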

Model fine-tuning

1. LoRA

Python dependencies

pip install modelscope transformers streamlit sentencepiece accelerate transformers_stream_generator datasets peft tf_keras swanlab

Fine-tuning script

# Qwen2-0.5B-train.py
import json
import os

import pandas as pd
import swanlab
import torch
from datasets import Dataset
from modelscope import AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model
from swanlab.integration.huggingface import SwanLabCallback
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq

# Root directory for weights and data
BASE_DIR = '/mnt/workspace'

# Device name
device = 'cuda' if torch.cuda.is_available() else 'cpu'

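# Dataset conversion function, used for both the training and test datasets
def dataset_jsonl_transfer(origin_path, new_path):
    """
    Convert the raw dataset into a new dataset in the format required for fine-tuning.
    """
    messages = []
    # Read the original JSONL file
    with open(origin_path, "r", encoding="utf-8") as file:
        for line in file:
            # Parse each raw line (every line is one JSON object)
            data = json.loads(line)
            # Adjust these field names to match your training set
            # (the key spellings below follow the dataset used here)
            text = data["prompt"]
            category = data["catagory"]
            output = data["respond"]
            message = {
                # Prompt template: "Text: ..., list of category options: ..."
                "input": f"文本:{text},分类选项列表:{category}",
                "output": output,
            }
            messages.append(message)
    # Save the processed JSONL file, again one JSON object per line
    with open(new_path, "w", encoding="utf-8") as file:
        for message in messages:
            file.write(json.dumps(message, ensure_ascii=False) + "\n")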

# Preprocess every row before the dataset is used for training
def process_func(example):
    """
    Tokenize one example and build input_ids, attention_mask and labels.
    """
    MAX_LENGTH = 384
    input_ids, attention_mask, labels = [], [], []
    # add_special_tokens=False: do not prepend special tokens
    # (the system prompt reads "You are an intelligent assistant")
    instruction = tokenizer(f"<|im_start|>system\n你是一个智能助手<|im_end|>\n<|im_start|>user\n{example['input']}<|im_end|>\n<|im_start|>assistant\n", add_special_tokens=False)
    response = tokenizer(f"{example['output']}", add_special_tokens=False)
    # The pad token is appended as the end-of-sequence marker
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    # The closing token must be attended to as well, so its mask is 1
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]
    # -100 excludes the instruction tokens from the loss
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
    if len(input_ids) > MAX_LENGTH:  # Truncate overly long samples
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }

# Load the pretrained model and tokenizer
model_dir = os.path.join(BASE_DIR, 'Qwen2-0.5B-Instruct')
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=device, torch_dtype=torch.bfloat16)
model.enable_input_require_grads()  # Required when gradient checkpointing is enabled

# Load and convert the training set (test set lines are left commented out)
train_dataset_path = os.path.join(BASE_DIR, 'Muice-Dataset', 'train.jsonl')
# test_dataset_path = os.path.join(BASE_DIR, 'zh_cls_fudan-news', 'test.jsonl')
train_jsonl_new_path = os.path.join(BASE_DIR, 'Muice-Dataset', 'train.json')
# test_jsonl_new_path = os.path.join(BASE_DIR, 'test.jsonl')
if not os.path.exists(train_jsonl_new_path):
    dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)
# if not os.path.exists(test_jsonl_new_path):
#     dataset_jsonl_transfer(test_dataset_path, test_jsonl_new_path)

# Build the fine-tuning dataset
train_df = pd.read_json(train_jsonl_new_path, lines=True)
train_ds = Dataset.from_pandas(train_df)
train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)

# Create the LoRA configuration
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    inference_mode=False,  # Training mode
    r=8,  # LoRA rank
    lora_alpha=32,  # LoRA alpha; the update is scaled by lora_alpha / r
    lora_dropout=0.1,  # Dropout ratio
)

# Apply LoRA to the model
model = get_peft_model(model, config)

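# Create the training arguments
args = TrainingArguments(
    # Output path
    output_dir=os.path.join(BASE_DIR, 'output', 'Qwen2-0.5B'),
    # Training batch size per GPU/TPU or CPU core
    per_device_train_batch_size=4,
    # Number of steps over which gradients are accumulated before an update
    gradient_accumulation_steps=4,
    # Number of update steps between two log entries
    logging_steps=10,
    # Number of training epochs to run
    num_train_epochs=2,
    # Number of update steps between two checkpoint saves
    save_steps=100,
    # Initial learning rate for the optimizer
    learning_rate=1e-4,
    # Whether to save the model on every node in distributed training
    save_on_each_node=False,
    # Whether to use gradient checkpointing
    gradient_checkpointing=True,
    # Reporting integrations; "none" disables the built-in ones
    report_to="none",
)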

# SwanLab callback that records the fine-tuning run
swanlab_callback = SwanLabCallback(project="Qwen2-FineTuning", experiment_name="Qwen2-0.5B")
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
    callbacks=[swanlab_callback],
)

# Start fine-tuning
trainer.train()

# Evaluate the model output
def predict(messages, model, tokenizer):
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512
    )
    # Drop the prompt tokens so that only the newly generated tokens remain
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Evaluation: take the first 10 rows of the test set (disabled, no test set is configured above)
# test_df = pd.read_json(test_jsonl_new_path, lines=True)[:10]
# test_text_list = []
# for index, row in test_df.iterrows():
#     instruction = row['你是一个文本分类领域的专家，你会接收到一段文本和几个潜在的分类选项列表，请输出文本内容的正确分类']
#     input_value = row['input']
#     messages = [
#         {"role": "system", "content": f"{instruction}"},
#         {"role": "user", "content": f"{input_value}"}
#     ]
#     response = predict(messages, model, tokenizer)
#     messages.append({"role": "assistant", "content": f"{response}"})
#     result_text = f"{messages[0]}\n\n{messages[1]}\n\n{messages[2]}"
#     test_text_list.append(swanlab.Text(result_text, caption=response))
# swanlab.log({"Prediction": test_text_list})
# swanlab.finish()

print("Fine-tuning complete")
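
The label construction in `process_func` is the part of the script that most often goes wrong, so here it is in isolation. This is a sketch with toy integer token ids and no tokenizer; the function name and the `end_id` parameter are hypothetical stand-ins for what the script does with `tokenizer.pad_token_id`.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def build_features(
    instruction_ids: List[int],
    response_ids: List[int],
    end_id: int,
    max_length: int,
) -> Dict[str, List[int]]:
    """Concatenate prompt and answer, masking the prompt out of the loss."""
    input_ids = instruction_ids + response_ids + [end_id]
    # Every real token is attended to, including the closing token.
    attention_mask = [1] * len(input_ids)
    # Only the response and the closing token contribute to the loss.
    labels = [IGNORE_INDEX] * len(instruction_ids) + response_ids + [end_id]
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }
```

For a three-token instruction and a two-token response, the labels come out as three ignored positions followed by the response and the closing token, so the model is trained only to produce the answer.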

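Fine-tuning parameters

output_dir: output path for the model.

num_train_epochs: number of training epochs to run (if not an integer, the decimal part is the fraction of the last epoch performed before training stops).

lr_scheduler_type: the type of learning-rate scheduler used to update the model's learning rate.

Possible values:

"linear"

"cosine"

"cosine_with_restarts"

"polynomial"

"constant"

"constant_with_warmup"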

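load_best_model_at_end: whether to load the best model found during training once training ends. When set to True, save_strategy must match evaluation_strategy, and when both are "steps", save_steps must be a round multiple of eval_steps.

save_strategy: the checkpoint saving strategy during training.

Possible values:

"no": no checkpoints are saved during training

"epoch": a checkpoint is saved at the end of every epoch

"steps": a checkpoint is saved every save_steps

save_steps: if save_strategy="steps", the number of update steps between two checkpoint saves.

evaluation_strategy: the evaluation strategy used during training.

Possible values:

"no": no evaluation during training

"steps": evaluation runs every eval_steps

"epoch": evaluation runs at the end of every epoch

metric_for_best_model: used together with load_best_model_at_end to specify the metric for comparing two models. It must be the name of a metric returned by evaluation, with or without the "eval_" prefix. If unset while load_best_model_at_end=True, it defaults to "loss". If you set this value, greater_is_better defaults to True; remember to set it to False if a lower value of your metric is better.

greater_is_better: used together with load_best_model_at_end and metric_for_best_model to state whether a better model should have a greater metric value.

Default value:

True: if metric_for_best_model is set and is neither "loss" nor "eval_loss"

False: if metric_for_best_model is not set, or is "loss" or "eval_loss".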
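
The default rule for greater_is_better is easy to misread, so the following sketch restates it as code. It mirrors only the rule as described above and is not taken from the transformers source; the function name is hypothetical.

```python
from typing import Optional


def resolve_greater_is_better(
    metric_for_best_model: Optional[str],
    greater_is_better: Optional[bool] = None,
) -> bool:
    """Apply the documented default when greater_is_better is left unset."""
    if greater_is_better is not None:
        return greater_is_better  # an explicit setting always wins
    if metric_for_best_model is None:
        return False  # falls back to loss, where lower is better
    return metric_for_best_model not in ("loss", "eval_loss")
```

A metric such as accuracy therefore resolves to True automatically, while an error-style metric needs an explicit False.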

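optim: the optimizer to use. Options include:

adamw_hf

adamw_torch

adamw_apex_fused

adafactor

group_by_length: whether to group samples of roughly the same length together in the training dataset (to minimize the padding applied and improve efficiency). Only useful when dynamic padding is applied.

length_column_name: column name for precomputed lengths. If the column exists, grouping by length uses these values instead of computing them at training startup. Ignored unless group_by_length is True and the dataset is an instance of Dataset.

per_device_train_batch_size: training batch size per GPU/TPU or CPU core.

gradient_accumulation_steps: number of update steps over which gradients are accumulated before a backward/update pass is performed. If GPU memory is limited, set the batch size lower and raise gradient accumulation instead.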
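
The trade-off between batch size and gradient accumulation comes down to one product. A minimal sketch of the arithmetic, with a hypothetical helper name; the single-device default is an assumption about the setup used here.

```python
def effective_batch_size(
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    num_devices: int = 1,
) -> int:
    """Number of samples that contribute to each optimizer update."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices
```

The script's settings (4 and 4) give 16 samples per update on one device; a batch size of 1 with 16 accumulation steps reaches the same 16 with a smaller memory footprint per forward pass.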

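logging_steps: if logging_strategy="steps", the number of update steps between two log entries.

gradient_checkpointing: if True, use gradient checkpointing to save memory at the cost of a slower backward pass. Once this is enabled, the model must call model.enable_input_require_grads().

learning_rate: initial learning rate for the AdamW optimizer.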
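
The learning rate given here is only the starting point; the scheduler type decides how it changes per step. The sketch below uses the usual linear and cosine decay formulas with optional linear warmup. It is an illustration rather than the library's code, the function name is hypothetical, and the step counts in the usage note are made up.

```python
import math


def lr_at(
    step: int,
    total_steps: int,
    base_lr: float,
    kind: str = "linear",
    warmup_steps: int = 0,
) -> float:
    """Learning rate at a given optimizer step for a few common schedules."""
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    if kind in ("constant", "constant_with_warmup"):
        return base_lr
    # Fraction of the post-warmup schedule that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if kind == "linear":
        return base_lr * max(0.0, 1.0 - progress)
    if kind == "cosine":
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unsupported schedule in this sketch: {kind}")
```

With a base rate of 1e-4 and 124 total steps, both the linear and the cosine schedule are at half the base rate at step 62 and reach zero at step 124; they differ in how quickly they fall in between.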

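save_on_each_node (bool, optional, defaults to False): in multi-node distributed training, whether to save models and checkpoints on every node or only on the main node. Do not enable this when the nodes share the same storage, because the files would be saved under the same name for each node.

report_to (str or List[str], optional, defaults to "all"): the list of integrations to report results and logs to. Supported platforms are "azure_ml", "comet_ml", "mlflow", "tensorboard" and "wandb". Use "all" to report to every installed integration and "none" to report to no integration.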

Model merging

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from peft import PeftModel

model_path = '/mnt/workspace/qwen/Qwen2-0.5B-Instruct'
lora_path = '/mnt/workspace/output/Qwen2-0.5B/checkpoint-124'  # Change this to the checkpoint path of your LoRA output

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Load the base model
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True).eval()

# Load the LoRA weights on top of the base model
model = PeftModel.from_pretrained(model, model_id=lora_path)

prompt = "你是谁？"  # "Who are you?"
messages = [
    # Optional role-play prompts ("You are now playing Zhen Huan, the woman at the emperor's side")
    #{"role": "system", "content": "现在你要扮演皇帝身边的女人--甄嬛"},
    #{"role": "user", "content": "假设你是皇帝身边的女人--甄嬛。"},
    {"role": "user", "content": prompt}
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_tensors="pt", return_dict=True).to('cuda')
gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    # Keep only the newly generated tokens
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
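
The script above attaches the adapter for inference and leaves the base weights untouched. Folding the adapter into the base weights follows the standard LoRA formulation W' = W + (lora_alpha / r) * B A, which the sketch below works through on tiny pure-Python matrices. The helper names are hypothetical; in PEFT the corresponding operation is `merge_and_unload()`, applied per adapted layer.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix, lora_alpha: float, r: int) -> Matrix:
    """Return W + (lora_alpha / r) * B @ A.

    Shapes: weight is d_out x d_in, lora_a is r x d_in, lora_b is d_out x r.
    """
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # low-rank update, d_out x d_in
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]
```

With the training configuration used earlier (r=8, lora_alpha=32) the scaling factor is 4, so the learned low-rank update is added at four times its raw magnitude.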
