Streaming output from Google Cloud Vertex AI is very slow: with the gemini-3-pro-preview model, the first streamed chunk takes more than 17 s. Is there a good solution?


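Problem description: When making streaming prediction calls from infrastructure located in Silicon Valley, USA, to the Vertex AI API (aiplatform.googleapis.com) with the model gemini-3-pro-preview, we observe abnormally high latency for the first token in the response stream. Time to first token (TTFT) consistently exceeds 17 seconds, whereas it should normally be under 2 seconds.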

server address: 142.250.191.42

1. Basic ping tests (connectivity and baseline latency)
Run these commands from the affected server/client in Silicon Valley.

(base) [root@usa-gg-test01 ~]# ping aiplatform.googleapis.com
PING aiplatform.googleapis.com (142.250.191.42) 56(84) bytes of data.
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=1 ttl=118 time=2.67 ms
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=2 ttl=118 time=2.62 ms
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=3 ttl=118 time=2.64 ms
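Ping only measures the ICMP round trip. A minimal sketch, assuming Python 3 and only the standard library, that times the TCP connect and the TLS handshake separately, so connection setup can be ruled in or out before looking at the model itself. The helper names `tcp_connect_ms` and `tls_handshake_ms` are hypothetical and not part of any Google SDK.

```python
import socket
import ssl
import time


def tcp_connect_ms(host: str, port: int = 443, timeout: float = 5.0) -> float:
    """Return the time in milliseconds needed to complete a TCP connect."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000


def tls_handshake_ms(host: str, port: int = 443, timeout: float = 5.0) -> float:
    """Return the time in milliseconds for TCP connect plus TLS handshake."""
    context = ssl.create_default_context()
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        # wrap_socket performs the TLS handshake on an already connected socket.
        with context.wrap_socket(raw, server_hostname=host):
            elapsed = time.perf_counter() - start
    return elapsed * 1000
```

Calling `tls_handshake_ms("aiplatform.googleapis.com")` from the affected server gives the setup cost; if it is in the tens of milliseconds while TTFT is above 17 s, the delay is unlikely to be on the network path.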

2. Python code test, using the model gemini-3-pro-preview

import json
import time

import requests


def stream_gemini_content():
    api_key = 'xxx'
    url = "https://aiplatform.googleapis.com/v1/publishers/google/models/gemini-3-pro-preview:streamGenerateContent?alt=sse"

    headers = {
        "x-goog-api-key": api_key,
        "Content-Type": "application/json"
    }

    data = {
        "contents": [{
            "role": "user",
            "parts": [{
                # Prompt: "Tell a 200-character story; do not reason, answer directly."
                "text": "请讲一个 200 字的故事，不要用推理，直接回答。"
            }]
        }],
        "generationConfig": {
            "thinkingConfig": {
                # Controls whether thought summaries are returned in the response.
                "includeThoughts": False
            }
        }
    }

    print(f"begin request: {url} ...")

    start_time = time.time()
    first_token_time = None
    last_chunk_time = None

    try:
        with requests.post(url, headers=headers, json=data, stream=True) as response:

            if response.status_code != 200:
                print(f"status: {response.status_code}")
                print(response.text)
                return

            print("-" * 50)

            for line in response.iter_lines():
                if not line:
                    continue

                # SSE payload lines look like "data: {...}".
                decoded_line = line.decode('utf-8').strip()
                if not decoded_line.startswith("data: "):
                    continue

                json_str = decoded_line[6:]
                if json_str == "[DONE]":
                    break

                try:
                    now = time.time()

                    # The first data line marks the time to first token (TTFT).
                    if first_token_time is None:
                        first_token_time = now
                        print(f"\n[total] first token TTFT: {(now - start_time) * 1000:.2f} ms")
                        print("-" * 50)
                        last_chunk_time = now

                    chunk_data = json.loads(json_str)
                    candidates = chunk_data.get("candidates", [])

                    # Per-chunk timings (computed but not printed in this test).
                    total_elapsed = (now - start_time) * 1000
                    chunk_gap = (now - last_chunk_time) * 1000 if last_chunk_time else 0
                    last_chunk_time = now

                    if candidates:
                        content = candidates[0].get("content", {})
                        parts = content.get("parts", [])
                        if parts:
                            text_chunk = parts[0].get("text", "")
                            print(text_chunk, end="", flush=True)

                except Exception as e:
                    # Report malformed chunks instead of silently dropping them.
                    print(f"\n[chunk error] {e}")

    except Exception as e:
        print(f"request failed: {e}")

    end_time = time.time()
    print("\n\n" + "-" * 50)
    print(f"total time: {(end_time - start_time) * 1000:.2f} ms")


if __name__ == "__main__":
    stream_gemini_content()

The code test is very slow: a 200-character story takes more than 17 s.

Streaming output

Latency

Vertex AI

8 replies • 2025-12-12 17:10:09 +08:00

heqing

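1 day ago

Is there a significant difference in first-token latency between the first and second calls? Does the same thing happen when you switch to other models?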
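One way to answer the first question is to repeat the call a few times and compare the first run against the rest. A minimal sketch, assuming a hypothetical `measure_ttft` callable that performs one streaming request and returns its TTFT in milliseconds; `compare_first_vs_rest` is likewise a hypothetical helper.

```python
import statistics
from typing import Callable, Dict, List


def compare_first_vs_rest(measure_ttft: Callable[[], float], runs: int = 5) -> Dict[str, float]:
    """Run the measurement several times and separate the first call from later ones."""
    if runs < 2:
        raise ValueError("need at least 2 runs to compare")
    samples: List[float] = [measure_ttft() for _ in range(runs)]
    rest = samples[1:]
    return {
        "first_ms": samples[0],
        "rest_median_ms": statistics.median(rest),
        "rest_min_ms": min(rest),
        "rest_max_ms": max(rest),
    }
```

If `first_ms` is much larger than `rest_median_ms`, a warm-up effect is plausible; if they are similar, the delay is more likely paid on every request.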

xuliang12187

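1 day ago

With gemini-2.0-flash, the first token arrives in about 300 ms and the full 200-character story comes back in 3-4 s. With gemini-2.5-flash, the first token takes more than 3 s, which is slow, and the total time exceeds 5 s. With gemini-3-pro-preview, the first token takes more than 12 s. We are using the Vertex AI API through Google Cloud's enterprise service.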

chenluo0429

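1 day ago via Android

You never actually told it not to think. Gemini 3 defaults to the high thinking level, so it has to think first before it gives you an answer.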
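If thinking is the cause, it should show up as a gap between the first thought chunk and the first answer chunk. A sketch of that check, assuming the same SSE `data: ` line format as the original script, and assuming that with `includeThoughts` set to true the thought-summary parts carry a boolean `thought` field (my understanding of the response format; verify against the current docs). `first_part_times` and the injectable `clock` are hypothetical.

```python
import json
import time
from typing import Callable, Dict, Iterable, Optional


def first_part_times(
    lines: Iterable[bytes],
    start: float,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Optional[float]]:
    """Return ms from `start` to the first thought part and the first answer part."""
    result: Dict[str, Optional[float]] = {"first_thought_ms": None, "first_answer_ms": None}
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data: "):
            continue
        body = line[len("data: "):]
        if body == "[DONE]":
            break
        try:
            chunk = json.loads(body)
        except json.JSONDecodeError:
            continue
        now = clock()
        for cand in chunk.get("candidates", []):
            for part in cand.get("content", {}).get("parts", []):
                if not part.get("text"):
                    continue
                # Assumption: thought summaries are flagged with "thought": true.
                key = "first_thought_ms" if part.get("thought") else "first_answer_ms"
                if result[key] is None:
                    result[key] = (now - start) * 1000
        if result["first_answer_ms"] is not None:
            break
    return result
```

To use it with the script above, record `start = time.perf_counter()` just before `requests.post(...)` and pass `response.iter_lines()` as `lines`.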

xuliang12187

1 day ago

@chenluo0429 I tried adjusting that and got the same result: still very slow, over 17 s every time.

fov6363

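1 day ago

+1. Without thinking the model is too dumb, and with it you get 10 s+, even for simple Q&A. I asked ChatGPT, which said that on Vertex you need to enable something like a dedicated endpoint instance, otherwise there is a cold start: the first chunk itself should only take a few hundred ms, but waiting for the first response takes 10 s+.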

xuliang12187

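1 day ago

@fov6363 Vertex currently has no concept of a dedicated endpoint instance; right now there is only the global endpoint. I'm told there are different paid tiers, but those are aimed at workloads with high concurrency and do not solve the API latency problem.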

GXD

1 day ago

For Gemini 3 you probably need to use the `thinking_level` parameter to set the reasoning depth; the default is high.
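A sketch of what the request body might look like with the thinking level lowered. The REST field name `thinkingLevel` under `generationConfig.thinkingConfig` and the value `"low"` are assumptions based on my reading of the Gemini 3 documentation; check the current Vertex AI reference before relying on them. `build_payload` is a hypothetical helper.

```python
from typing import Any, Dict


def build_payload(prompt: str, thinking_level: str = "low") -> Dict[str, Any]:
    """Build a streamGenerateContent request body with an explicit thinking level."""
    return {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "generationConfig": {
            "thinkingConfig": {
                # Assumed field name and value; not confirmed by this thread.
                "thinkingLevel": thinking_level,
                "includeThoughts": False,
            }
        },
    }
```

In the original script this would replace the `data` dict, for example `requests.post(url, headers=headers, json=build_payload("..."), stream=True)`.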

fov6363

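19 hours 33 minutes ago

@xuliang12187 #6 Thanks, I haven't tried that approach either.

Have you tried skipping Vertex and calling the Gemini API directly? I tried it and it didn't feel much faster; it seems to be slow whenever thinking is enabled.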
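To compare the two backends with the same payload, only the URL needs to change. A sketch, where the Vertex URL is the one from the original post and the Gemini API URL (`generativelanguage.googleapis.com`, `v1beta`) is my assumption of the equivalent public endpoint; `stream_url` is a hypothetical helper, and an API key issued for one service may not be accepted by the other.

```python
def stream_url(backend: str, model: str) -> str:
    """Return the SSE streaming URL for the given backend and model name."""
    if backend == "vertex":
        # Same URL pattern as the original test script.
        return (
            "https://aiplatform.googleapis.com/v1/publishers/google/models/"
            f"{model}:streamGenerateContent?alt=sse"
        )
    if backend == "gemini":
        # Assumed Gemini API endpoint; verify the API version before use.
        return (
            "https://generativelanguage.googleapis.com/v1beta/models/"
            f"{model}:streamGenerateContent?alt=sse"
        )
    raise ValueError(f"unknown backend: {backend}")
```

Running the same timing loop against `stream_url("vertex", ...)` and `stream_url("gemini", ...)` would show whether the delay follows the model or the serving platform.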
