This looks like a bug: with two cards and 32 GB of VRAM total, an 8K–64K context should definitely be runnable.
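Quick sanity check on that claim (a rough sketch; the layer count, KV-head count, and head dim below are assumed values for a ~27B dense model, not published Qwen3.6-27B specs):

n_layers, n_kv_heads, head_dim = 48, 8, 128  # ASSUMED GQA config, not official specs
bytes_per_elem = 2                           # fp16 K and V cache
weights_gb = 15.7                            # Q4_K_M file size from the log below

for ctx in (8_192, 16_384, 32_768, 65_536):
    # KV bytes per token = 2 (K and V) * layers * kv_heads * head_dim * elem size
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx / 1024**3
    print(f"ctx={ctx // 1024}K: {weights_gb:.1f} GB weights + {kv_gb:.1f} GB KV "
          f"= {weights_gb + kv_gb:.1f} GB of 32 GB")

Under these assumptions, 8K needs about 18.7 GB and 32K about 27.7 GB, both well within 32 GB (whether 64K fits depends on the real KV-head count), so the 8K OOM in the log below looks like a launcher problem, not a hardware limit.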
.\kaiwu.exe run Qwen3.6-27B-Q4_K_M.gguf
██╗  ██╗ █████╗ ██╗██╗    ██╗██╗   ██╗
██║ ██╔╝██╔══██╗██║██║    ██║██║   ██║
█████╔╝ ███████║██║██║ █╗ ██║██║   ██║
██╔═██╗ ██╔══██║██║██║███╗██║██║   ██║
██║  ██╗██║  ██║██║╚███╔███╔╝╚██████╔╝
╚═╝  ╚═╝╚═╝  ╚═╝╚═╝ ╚══╝╚══╝  ╚═════╝
Local LLM deployer v0.1.2 · llama.cpp b8864
by
llmbbs.ai · local AI tech community
[1/6] Probing hardware...
GPU: NVIDIA GeForce RTX 4060 Ti × 2 (SM89, 16380 MB VRAM each, 0 GB/s)
RAM: 61 GB DDR5
OS: windows amd64
[2/6] Selecting configuration...
Model: Qwen3.6-27B (dense, 28B)
Quant: Q4_K_M (15.7 GB)
Mode: full_gpu
Accel: Flash Attention
[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Model: Qwen3.6-27B-Q4_K_M.gguf [cached]
[4/6] Preflight check...
✓ VRAM sufficient
[5/6] Warmup benchmark...
Probe 1: ctx=256K ... OOM
Probe 2: ctx=128K ... OOM
Probe 3: ctx=64K ... OOM
Probe 4: ctx=32K ... OOM
Probe 5: ctx=16K ... OOM
Probe 6: ctx=8K ... OOM
⚠️ Warmup failed: all ctx probes failed (tried down to 4K)
Using default parameters
[6/6] Starting server...
llama-server does not support iso3; falling back to q8_0/q4_0
Waiting for llama-server to be ready (port 11434)...
⚠️ Insufficient VRAM, lowering context to 64K and retrying...
Waiting for llama-server to be ready (port 11434)...
⚠️ Insufficient VRAM, lowering context to 32K and retrying...
Waiting for llama-server to be ready (port 11434)...
Error: failed to start llama-server: all 3 launch attempts failed; consider choosing a smaller model
Usage:
  kaiwu run <model> [flags]
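To rule out kaiwu itself, it may be worth launching the bundled server directly with an explicit context size and an even tensor split across the two cards (standard llama.cpp server flags; binary and model names are the ones from the log above):

.\llama-server-cuda.exe -m Qwen3.6-27B-Q4_K_M.gguf -c 8192 -ngl 99 --tensor-split 1,1 --port 11434

If that comes up cleanly at 8K, raise -c step by step (16384, 32768, ...) until it genuinely runs out of memory; that gives the real context ceiling and would confirm whether the probe logic, rather than the hardware, is at fault.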