Qwen3.6's First Open-Source Release: Packing Capability into Fewer Active Parameters

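Following the release of Qwen3.6-Plus, the Qwen team has open-sourced Qwen3.6-35B-A3B, a sparse mixture-of-experts (MoE) model with 35B total parameters and only 3B active parameters, released under the Apache 2.0 license. It is the first open-weight model in the Qwen3.6 series. On agentic coding it substantially outperforms its predecessor, Qwen3.5-35B-A3B, and is competitive with dense models such as Qwen3.5-27B and Gemma-31B. The model is multimodal and supports both thinking and non-thinking modes.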

GitHub: https://github.com/QwenLM/Qwen3.6

Blog: https://qwen.ai/blog?id=qwen3.6-35b-a3b

Model: https://modelscope.cn/collections/Qwen/Qwen36

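According to the official blog, Qwen3.6 was built on community feedback and prioritizes stability and practical usefulness for developers. The main upgrades fall into two areas:

First, agentic coding is markedly stronger: the model handles front-end workflows and repository-level reasoning more smoothly and more accurately.

Second, a new thinking preservation mechanism lets the model keep the reasoning context from earlier turns of a multi-turn conversation, which reduces repeated overhead during iterative development.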

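Qwen3.6-35B-A3B is a causal language model with a vision encoder, trained in two stages: pre-training and post-training. Architecturally, it has 40 layers and a hidden dimension of 2048, with a hybrid attention design: 10 repeating units, each containing 3 gated DeltaNet (linear attention) layers and 1 gated attention layer, for 10 × 4 = 40 layers in total. Every layer feeds an MoE feed-forward network. The MoE part has 256 experts; each token activates 8 routed experts plus 1 shared expert, and the expert intermediate dimension is 512. The model is trained with multi-token prediction (MTP), natively supports a context of 262,144 tokens, and can be extended to 1,010,000 tokens.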
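
The layer arithmetic in this description can be checked with a short sketch. It reads the layout as 10 repeating units of 3 gated DeltaNet layers plus 1 gated attention layer, which matches the stated total of 40 layers. The ordering inside a unit, the labels, and the helper name `build_layer_layout` are illustrative assumptions, not identifiers from the Qwen codebase; only the counts come from the text.

```python
# Illustrative sketch only: names and in-unit ordering are assumptions;
# the counts are taken from the article.
NUM_UNITS = 10            # repeating units
DELTANET_PER_UNIT = 3     # gated DeltaNet (linear attention) layers per unit
ATTENTION_PER_UNIT = 1    # gated attention layers per unit
ROUTED_EXPERTS_ACTIVE = 8
SHARED_EXPERTS = 1


def build_layer_layout():
    """Return the token-mixer type of each layer, in order."""
    unit = (["gated_deltanet"] * DELTANET_PER_UNIT
            + ["gated_attention"] * ATTENTION_PER_UNIT)
    return unit * NUM_UNITS


layout = build_layer_layout()
print("total layers:", len(layout))
print("gated DeltaNet layers:", layout.count("gated_deltanet"))
print("gated attention layers:", layout.count("gated_attention"))
print("experts evaluated per token per layer:",
      ROUTED_EXPERTS_ACTIVE + SHARED_EXPERTS)
```

Under this reading, three quarters of the layers use linear attention and only one layer in four uses gated attention.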

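In agentic coding, Qwen3.6-35B-A3B uses 3B active parameters to reach results close to, and in some cases better than, a 27B dense model. It scores 73.4 on SWE-bench Verified (Qwen3.5-27B: 75.0; the previous-generation Qwen3.5-35B-A3B: 70.0) and 51.5 on Terminal-Bench 2.0, ahead of every comparison model in its class. On NL2Repo it scores 29.4, again above Qwen3.5-27B's 27.3. On QwenWebBench, a front-end code generation benchmark, its Elo rating of 1397 puts it well ahead.

On knowledge and reasoning, it scores 85.2 on MMLU-Pro, 86.0 on GPQA, 92.7 on the full AIME 2026 set, and 80.4 on LiveCodeBench v6, all on par with 27B dense models.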

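Qwen3.6-35B-A3B is natively multimodal. On most vision-language benchmarks it matches Claude Sonnet 4.5, and on some tasks it does better. Specifically, it scores 81.7 on MMMU, 86.4 on MathVista, 85.3 on RealWorldQA, and 89.9 on OmniDocBench, each higher than the corresponding Claude Sonnet 4.5 score.

Its advantage is most pronounced in spatial intelligence: 92.0 on RefCOCO, 50.8 on ODInW13, and 84.3 on EmbSpatialBench. In video understanding, it scores 86.6 on VideoMME (with subtitles) and 83.7 on VideoMMMU, the latter ahead of Claude Sonnet 4.5's 77.6.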

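The model weights are available on ModelScope and are compatible with mainstream inference frameworks including Transformers, vLLM, SGLang, and KTransformers.

The officially recommended sampling parameters are:

Thinking mode, general tasks: temperature=1.0, top_p=0.95, presence_penalty=1.5;
Thinking mode, precise coding tasks: temperature=0.6, presence_penalty=0.0;
Non-thinking mode, general tasks: temperature=0.7, top_p=0.8.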
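
One way to keep these recommendations in a single place is a small preset table that is merged into each request. The sketch below assumes an OpenAI-style chat completion body; the preset names and the helper `build_chat_request` are hypothetical. Parameters the recommendations do not state for a given mode (for example top_p for precise coding) are left unset instead of guessed.

```python
import json

# Presets copied from the recommendations above. Values the article does not
# state for a mode are deliberately left out.
SAMPLING_PRESETS = {
    "thinking_general": {"temperature": 1.0, "top_p": 0.95, "presence_penalty": 1.5},
    "thinking_precise_coding": {"temperature": 0.6, "presence_penalty": 0.0},
    "non_thinking_general": {"temperature": 0.7, "top_p": 0.8},
}


def build_chat_request(prompt, preset, model="Qwen/Qwen3.6-35B-A3B"):
    """Build an OpenAI-style chat completion body using the chosen preset."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    body.update(SAMPLING_PRESETS[preset])
    return body


request_body = build_chat_request(
    "Write a binary search in Python.", "thinking_precise_coding"
)
print(json.dumps(request_body, indent=2))
```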

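SGLang

A fresh environment with sglang>=0.5.10 is recommended for running Qwen3.6. Install it with:

uv pip install "sglang[all]>=0.5.10"

The following commands create an API endpoint at http://localhost:8000/v1:

Standard version: the following command uses tensor parallelism across 8 GPUs to create an API endpoint with a maximum context length of 262,144 tokens.

SGLANG_USE_MODELSCOPE=true python -m sglang.launch_server --model-path Qwen/Qwen3.6-35B-A3B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3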

Tool calling: to enable tool calling, use the following command.

SGLANG_USE_MODELSCOPE=true python -m sglang.launch_server --model-path Qwen/Qwen3.6-35B-A3B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --tool-call-parser qwen3_coder

Multi-token prediction (MTP): the following command is recommended for enabling MTP:

SGLANG_USE_MODELSCOPE=true python -m sglang.launch_server --model-path Qwen/Qwen3.6-35B-A3B --port 8000 --tp-size 8 --mem-fraction-static 0.8 --context-length 262144 --reasoning-parser qwen3 --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
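
Once a server is listening on http://localhost:8000/v1, a client can talk to it over plain HTTP. The following is a minimal sketch using only the Python standard library, assuming the server exposes the OpenAI-compatible /chat/completions route under that base URL. The helper names `make_chat_request` and `send` and the `max_tokens` value are hypothetical choices for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # endpoint created by the launch commands


def make_chat_request(prompt, model="Qwen/Qwen3.6-35B-A3B", max_tokens=256):
    """Build (but do not send) a POST request for a single-turn chat."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON reply; needs a running server."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


req = make_chat_request("Explain what a mixture-of-experts model is in one sentence.")
print(req.get_method(), req.full_url)
# reply = send(req)  # uncomment once the server is up
# print(reply["choices"][0]["message"]["content"])
```

Building the request separately from sending it makes the payload easy to inspect before any GPU time is spent.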

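vLLM

A fresh environment with vllm>=0.19.0 is recommended for running Qwen3.6. Install it with:

uv pip install vllm --torch-backend=auto

The following commands create an API endpoint at http://localhost:8000/v1:

Standard version: the following command uses tensor parallelism across 8 GPUs to create an API endpoint with a maximum context length of 262,144 tokens.

VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3.6-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3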

Tool calling: to enable tool calling, use the following command.

VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3.6-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder
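
With a tool-call parser enabled, a client would typically describe its tools in the request and read structured calls back from the reply. The sketch below assumes the server follows the OpenAI chat completions tool schema. The `get_weather` tool, the helper names, and the sample assistant message are hypothetical and hand-written for illustration; the message is not real model output.

```python
import json

# Hypothetical tool definition in the OpenAI "tools" schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(prompt, tools, model="Qwen/Qwen3.6-35B-A3B"):
    """Build a chat completion body that advertises the given tools."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "tool_choice": "auto",
    }


def parse_tool_calls(message):
    """Return (name, arguments) pairs from an assistant message dict."""
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # In this schema the arguments arrive as a JSON-encoded string.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


request_body = build_tool_request("What is the weather in Hangzhou?", [WEATHER_TOOL])

# Hand-written example of the shape of an assistant message with a tool call.
sample_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": json.dumps({"city": "Hangzhou"}),
            },
        }
    ],
}
print(parse_tool_calls(sample_message))
```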

Multi-token prediction (MTP): the following command is recommended for enabling MTP:

VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3.6-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

Text-only mode: the following command skips the vision encoder and multimodal profiling, which frees more memory for the KV cache:

VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3.6-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --language-model-only

Transformers

Running Qwen3.6 requires the latest version of transformers:

pip install "transformers[serving]"

Also make sure torchvision and pillow are installed.

Then run transformers serve to start a server with an API endpoint at http://localhost:8000/v1; it loads the model onto an accelerator if one is available:

transformers serve Qwen/Qwen3.6-35B-A3B --port 8000 --continuous-batching

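For more hands-on deployment and inference cookbooks, see the model page:

https://modelscope.cn/models/Qwen/Qwen3.6-35B-A3B

ms-swift supports training the Qwen3.6 MoE model with either the transformers or the Megatron backend. The ms-swift repository is at https://github.com/modelscope/ms-swift

Because the Megatron backend supports MTP training, sequence packing, FP8 training, and more, this section shows how to fine-tune Qwen3.6 and run reinforcement learning on it with Megatron. For training with the transformers backend, see: https://swift.readthedocs.io/zh-cn/latest/BestPractices/Qwen3_5-Best-Practice.html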

pip install -U ms-swift
pip install -U "transformers==5.2.*" "qwen_vl_utils>=0.0.14" peft liger-kernel
pip install -U "flash-linear-attention>=0.4.2" --no-build-isolation
pip install -U git+https://github.com/Dao-AILab/causal-conv1d --no-build-isolation
pip install "flash-attn==2.8.3" --no-build-isolation
pip install deepspeed
# For Megatron environment setup, see: https://swift.readthedocs.io/zh-cn/latest/Megatron-SWIFT/Quick-start.html
# vllm (torch2.10) for RL
pip install -U "vllm>=0.17.0"
# For reinforcement learning (RL) training, override the version that vLLM installs by default
pip install -U "transformers==5.2.*"

The training script is as follows:

# 4 * 32GiB, 7min
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
IMAGE_MAX_TOKEN_NUM=1024 \
VIDEO_MAX_TOKEN_NUM=128 \
FPS_MAX_FRAMES=12 \
megatron sft \
    --model Qwen/Qwen3.6-35B-A3B \
    --save_safetensors true \
    --merge_lora true \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
              'AI-ModelScope/alpaca-gpt4-data-en#500' \
              'swift/self-cognition#500' \
              'AI-ModelScope/LaTeX_OCR:human_handwrite#2000' \
    --load_from_cache_file true \
    --add_non_thinking_prefix true \
    --loss_scale ignore_empty_think \
    --split_dataset_ratio 0.01 \
    --tuner_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --tensor_model_parallel_size 2 \
    --expert_model_parallel_size 4 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-6 \
    --micro_batch_size 1 \
    --global_batch_size 2 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --num_train_epochs 1 \
    --finetune true \
    --freeze_llm false \
    --freeze_vit true \
    --freeze_aligner true \
    --cross_entropy_loss_fusion true \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --output_dir megatron_output/Qwen3.6-35B-A3B \
    --eval_steps 200 \
    --save_steps 200 \
    --max_length 4096 \
    --mtp_num_layers 1 \
    --packing true \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --attention_backend flash \
    --padding_free false \
    --model_author swift \
    --model_name swift-robot

After training finishes, run inference on the validation set with the following script:

PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
CUDA_VISIBLE_DEVICES=0,1 \
IMAGE_MAX_TOKEN_NUM=1024 \
VIDEO_MAX_TOKEN_NUM=128 \
FPS_MAX_FRAMES=12 \
swift infer \
    --model megatron_output/Qwen3.6-35B-A3B/vx-xxx/checkpoint-xxx-merged \
    --stream true \
    --enable_thinking false \
    --max_new_tokens 512 \
    --load_data_args true

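To fine-tune on a custom dataset, prepare the data in the following format, one JSON record per line, and pass `--dataset train.jsonl --val_dataset val.jsonl` on the command line. The validation set is optional.

{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "The capital of Zhejiang is Hangzhou."}]}
{"messages": [{"role": "user", "content": "   What is the difference between the two images?"}, {"role": "assistant", "content": "The first is a kitten and the second is a puppy."}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
{"messages": [{"role": "system", "content": "You are a helpful and harmless assistant"}, {"role": "user", "content": "  What is in the image,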
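
Before launching a long training run, it can help to write the JSONL file programmatically and sanity-check each record. The sketch below is a minimal example with its own assumptions: the helper names, the accepted role set, and the rule that the last message comes from the assistant are illustrative checks, not ms-swift's actual validation logic.

```python
import json
import os
import tempfile

# Assumed role set for this sketch; the real format may accept more roles.
VALID_ROLES = {"system", "user", "assistant"}


def validate_record(record):
    """Return a list of problems found in one training record (empty if none)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, message in enumerate(messages):
        if message.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i}: 'content' must be a string")
    if messages[-1].get("role") != "assistant":
        problems.append("last message should come from the assistant")
    images = record.get("images", [])
    if not all(isinstance(path, str) for path in images):
        problems.append("'images' must be a list of path strings")
    return problems


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


records = [
    {"messages": [
        {"role": "user", "content": "Where is the capital of Zhejiang?"},
        {"role": "assistant", "content": "The capital of Zhejiang is Hangzhou."},
    ]},
    {"messages": [
        {"role": "user", "content": "What is the difference between the two images?"},
        {"role": "assistant", "content": "The first is a kitten and the second is a puppy."},
    ], "images": ["/xxx/x.jpg", "/xxx/x.png"]},
]

for index, record in enumerate(records):
    for problem in validate_record(record):
        print(f"record {index}: {problem}")

train_path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(records, train_path)
print("wrote", train_path)
```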

The following runs GRPO LoRA training of the Qwen3.6-35B-A3B MoE model with the Megatron backend on the DAPO-Math-17k dataset, using swift's built-in accuracy function as the reward.

SYSTEM_PROMPT="""You are a helpful math assistant. Solve the problem step by step and put your final answer within \boxed{}."""

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
megatron rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen3.6-35B-A3B \
    --save_safetensors true \
    --enable_thinking false \
    --merge_lora true \
    --context_parallel_size 1 \
    --tensor_model_parallel_size 1 \
    --expert_model_parallel_size 8 \
    --pipeline_model_parallel_size 1 \
    --moe_permute_fusion true \
    --dataset open-r1/DAPO-Math-17k-Processed \
    --system "$SYSTEM_PROMPT" \
    --num_train_epochs 1 \
    --global_batch_size 64 \
    --micro_batch_size 1 \
    --steps_per_generation 2 \
    --num_generations 8 \
    --reward_funcs accuracy \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_gpu_memory_utilization 0.5 \
    --vllm_tensor_parallel_size 2 \
    --vllm_max_model_len 9192 \
    --max_length 1000 \
    --max_completion_length 8192 \
    --tuner_type lora \
    --target_modules all-linear \
    --lr 5e-5 \
    --bf16 true \
    --beta 0.00 \
    --epsilon 0.2 \
    --epsilon_high 0.28 \
    --dynamic_sample false \
    --overlong_filter true \
    --loss_type grpo \
    --sleep_level 1 \
    --offload_model true \
    --offload_bridge false \
    --offload_optimizer true \
    --logging_steps 1 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --finetune \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim \
    --no_save_rng \
    --save_steps 20 \
    --attention_backend flash \
    --moe_expert_capacity_factor 2 \
    --temperature 1.0 \
    --padding_free false \
    --sequence_parallel true \
    --log_completions true \
    --report_to tensorboard swanlab
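
To make the reward setup more concrete, here is a simplified sketch of the idea behind an accuracy reward combined with GRPO's group-relative advantages. It is not swift's implementation: the functions `extract_boxed`, `accuracy_reward`, and `group_advantages` are hypothetical, the answer check is a plain string comparison, and the advantage uses the population standard deviation, which real implementations may compute differently. The sample completions are hand-written, and a group of 4 is used for brevity where the command above samples 8 generations per prompt.

```python
import re
import statistics


def extract_boxed(text):
    """Return the content of the last boxed answer, or None (no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def accuracy_reward(completion, reference):
    """Give 1.0 if the boxed answer equals the reference string, else 0.0."""
    answer = extract_boxed(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


# Hand-written completions for a single prompt whose reference answer is 42.
completions = [
    r"6 * 7 = 42, so the answer is \boxed{42}.",
    r"I think it is \boxed{41}.",
    r"The result is 42.",           # correct value, but not boxed
    r"Adding up gives \boxed{24}.",
]
rewards = [accuracy_reward(text, "42") for text in completions]
advantages = group_advantages(rewards)
print(rewards)
print([round(value, 3) for value in advantages])
```

Only the completion that boxes the correct answer gets a positive advantage; when every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.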

Model collection

https://modelscope.cn/collections/Qwen/Qwen36
