How I Used AI to Fix Our E2E Test Architecture

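When I took over, our E2E test suite looked like this:

2000 test cases
60% pass rate
3-hour run time
Nobody trusted the results

Every time a CI report came out, a developer's first reaction was "another flaky test," not "what is this failure telling us?"

At that point it was no longer a technical problem. It was a trust problem.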

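I spent 8 weeks rebuilding the entire E2E test architecture with help from an AI agent. The numbers one year later:

94% pass rate
40-minute run time (down from 3 hours)
Flaky rate down from 15% to 2%
Developers are willing to read test reports again

This article is about that process: not how to write tests, but how to build a test architecture people are willing to trust.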

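Diagnosis: Why the E2E Tests Had Stopped Working

Before touching anything, I spent 2 weeks on diagnosis. The goal was to answer one question: why did nobody trust the E2E tests?

Problem 1: No test layering

The existing suite was flat: 2000 test cases with no layers and no priorities.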

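// Old architecture: every test is treated the same
describe('All Tests', function() {
    // Critical business flows
    test('user can checkout', async () => { /* ... */ })
    test('payment processing', async () => { /* ... */ })

    // UI details
    test('button has correct color on hover', async () => { /* ... */ })
    test('modal has correct padding', async () => { /* ... */ })

    // Edge cases
    test('edge case: negative quantity', async () => { /* ... */ })
    test('edge case: very long email address', async () => { /* ... */ })

    // All mixed together, with no priorities
})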

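The problem: critical business-flow tests and UI detail tests consumed the same CI time, and when something failed, developers had no idea which failure to look at first.

Problem 2: Tightly coupled tests

Tests were coupled through global state, so one test could change the outcome of the next.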

// Example of coupled tests
test('user login', async () => {
    await page.fill('[name="email"]', 'test@example.com')
    await page.fill('[name="password"]', 'password123')
    await page.click('[type="submit"]')

    // Store the login state in a global variable
    global.loggedInUser = await getCurrentUser()
})

test('user can see dashboard', async () => {
    // Depends on global state left behind by the previous test
    // If the previous test fails, this one fails too
    expect(global.loggedInUser.dashboardUrl).toBeTruthy()
})

test('user can place order', async () => {
    // Depends on both previous tests having succeeded
    await dashboardPage.navigateTo(global.loggedInUser.dashboardUrl)
    // ...
})

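The problem: one failing test triggers a cascade of failures in the tests after it, even though the root cause lies in the first test and not in the later ones.

Problem 3: No stable test data management

Test data was manipulated directly through the API, with each test creating and cleaning up its own data.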

# Chaotic test data management
def test_create_order():
    # Create a user
    user = api.create_user(email="test1@example.com")
    # Create a product
    product = api.create_product(name="Test Product")
    # Create an order
    order = api.create_order(user_id=user.id, product_id=product.id)

    # Cleanup: delete the user (but the product and the order record are left behind)
    api.delete_user(user.id)
    # The next run may hit duplicate emails, duplicate product names, and so on

def test_create_order_similar():
    # Creates another product with the same name, which may fail
    product = api.create_product(name="Test Product")  # name conflict
    # ...

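The problem: tests competed for the same data, and incomplete cleanup made results unstable.

Problem 4: Arbitrary wait times

The wait times in the tests were picked by guesswork.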

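// A wait time picked by guesswork
test('user login', async () => {
    await page.fill('[name="email"]', 'test@example.com')
    await page.click('[type="submit"]')

    // Wait 3 seconds. Why 3? Because 3 seconds worked the last time it ran
    await page.waitForTimeout(3000)

    // Sometimes the API is slow and needs 5 seconds, so the test fails
    // Sometimes the network is fast and 1 second is enough, so 2 seconds are wasted
})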

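The problem: a fixed wait cannot adapt to different environments. It fails in slow ones and wastes time in fast ones.

Rebuilding the Architecture: A Four-Layer Test Model

With the diagnosis done, I designed the new test architecture as a four-layer model: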

┌─────────────────────────────────────┐
│ Layer 4: Critical Path (smoke tests)│ ← CI gate, must pass
├─────────────────────────────────────┤
│ Layer 3: Integration Flows          │ ← Core business flows
├─────────────────────────────────────┤
│ Layer 2: Feature Coverage           │ ← What unit tests cannot cover
├─────────────────────────────────────┤
│ Layer 1: UI Components              │ ← Optional, low priority
└─────────────────────────────────────┘

Layer 1: UI Components

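import pytest
from playwright.sync_api import Page, expect

class TestButtonComponent:
    """Button component tests (pytest only collects classes named Test*)."""

    def test_button_renders_correctly(self, page: Page):
        """The button renders correctly."""
        page.goto("/components/button")
        button = page.locator('[data-testid="primary-button"]')
        expect(button).to_be_visible()
        expect(button).to_have_text("Submit")

    def test_button_hover_state(self, page: Page):
        """Hover state."""
        page.goto("/components/button")
        button = page.locator('[data-testid="primary-button"]')

        button.hover()
        # Do not assert on a specific color (brittle); assert that the state changed
        expect(button).to_have_attribute("data-hovered", "true")

    def test_button_disabled_state(self, page: Page):
        """Disabled state."""
        page.goto("/components/button")
        button = page.locator('[data-testid="primary-button"]')
        # Locator has no set_disabled(); disable the element through the DOM instead
        button.evaluate("el => { el.disabled = true }")

        expect(button).to_be_disabled()
        # A normal click would wait for the button to become enabled and time out,
        # so force the click to skip actionability checks
        url_before = page.url
        button.click(force=True)
        # No navigation happened
        expect(page).to_have_url(url_before)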

The value of this layer is documenting how UI components behave, not verifying styling. Test names describe behavior rather than appearance.

Layer 2: Feature Coverage

def test_shopping_cart_add_item(page: Page, test_user):
    """Add an item to the shopping cart."""
    page.goto("/shop")

    # Search for a product
    page.fill('[data-testid="search-input"]', test_user.search_term)
    page.click('[data-testid="search-button"]')

    # Add it to the cart
    first_product = page.locator('[data-testid="product-card"]').first
    first_product.locator('[data-testid="add-to-cart"]').click()

    # Verify the cart badge updated
    cart_badge = page.locator('[data-testid="cart-badge"]')
    expect(cart_badge).to_have_text("1")

    # Verify the cart sidebar is shown
    cart_sidebar = page.locator('[data-testid="cart-sidebar"]')
    expect(cart_sidebar).to_be_visible()
    expect(cart_sidebar.locator('.cart-item')).to_have_count(1)

This layer checks that features work as intended, and uses the Page Object Pattern to reduce the impact of UI changes.
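
The test above drives selectors directly, so here is a minimal sketch of what a page object for the same flow could look like. `CartPage` and its method names are hypothetical and not taken from the original suite; only the `data-testid` values come from the test above. The `page` argument is assumed to behave like a Playwright sync `Page`.

```python
from typing import Any


class CartPage:
    """Hypothetical page object for the shop and cart UI.

    `page` is expected to offer goto / fill / click / locator like a
    Playwright sync Page; it is typed as Any to keep the sketch standalone.
    """

    # Selectors live in one place, so a UI change touches only this class
    SEARCH_INPUT = '[data-testid="search-input"]'
    SEARCH_BUTTON = '[data-testid="search-button"]'
    PRODUCT_CARD = '[data-testid="product-card"]'
    ADD_TO_CART = '[data-testid="add-to-cart"]'
    CART_BADGE = '[data-testid="cart-badge"]'

    def __init__(self, page: Any) -> None:
        self.page = page

    def open(self) -> "CartPage":
        self.page.goto("/shop")
        return self

    def search(self, term: str) -> "CartPage":
        self.page.fill(self.SEARCH_INPUT, term)
        self.page.click(self.SEARCH_BUTTON)
        return self

    def add_first_result_to_cart(self) -> "CartPage":
        first_card = self.page.locator(self.PRODUCT_CARD).first
        first_card.locator(self.ADD_TO_CART).click()
        return self

    def badge(self) -> Any:
        """Return the cart badge locator so the test can assert on it."""
        return self.page.locator(self.CART_BADGE)
```

A test would then read `cart = CartPage(page).open().search(term).add_first_result_to_cart()` followed by `expect(cart.badge()).to_have_text("1")`, keeping assertions in the test and selectors in the class.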

Layer 3: Integration Flows

@pytest.fixture
def authenticated_browser(page: Page, test_user):
    """A page that is already logged in."""
    # Log in through the UI to obtain a persistent session
    page.goto("/login")
    page.fill('[name="email"]', test_user.email)
    page.fill('[name="password"]', test_user.password)
    page.click('[type="submit"]')
    page.wait_for_url("**/dashboard")

    # Capture the authenticated state (cookies and local storage)
    storage = page.context.storage_state()

    yield page

    # Teardown (state is kept only within this fixture's scope)

import re

@pytest.mark.integration
class TestCheckoutFlow:
    """Integration test for the checkout flow."""

    # test_email_service is assumed to be a fixture defined elsewhere (e.g. conftest.py)
    def test_complete_checkout_flow(
        self, authenticated_browser: Page, test_cart_with_items, test_email_service
    ):
        """The complete checkout flow."""
        page = authenticated_browser

        # 1. Open the cart
        page.goto("/cart")
        cart_items = page.locator('[data-testid="cart-item"]')
        expect(cart_items).to_have_count(len(test_cart_with_items))

        # 2. Go to the checkout page
        page.click('[data-testid="checkout-button"]')
        page.wait_for_url("**/checkout")

        # 3. Fill in the address
        page.fill('[data-testid="address-line1"]', test_cart_with_items.address)
        page.fill('[data-testid="postal-code"]', test_cart_with_items.postal_code)
        page.click('[data-testid="continue-to-payment"]')

        # 4. Choose a payment method
        page.click('[data-testid="payment-method-card"]')

        # 5. Place the order
        page.click('[data-testid="place-order"]')

        # 6. Verify success
        expect(page.locator('[data-testid="order-confirmation"]')).to_be_visible()
        # expect() takes a locator, not a string, so match the text on the element
        expect(page.locator('[data-testid="order-id"]')).to_have_text(re.compile(r'^ORD-\d+$'))

        # 7. Verify the confirmation email was sent
        test_email_service.assert_email_sent(
            to=test_cart_with_items.email,
            subject="Order Confirmation"
        )

Integration tests manage their dependencies through fixtures, which keeps tests isolated from one another.

Layer 4: Critical Path (Smoke Tests)

@pytest.mark.critical
class TestCriticalPath:
    """Critical path tests: every one of them must pass."""

    @pytest.mark.critical
    def test_user_can_signup_and_login(self, page: Page):
        """User sign-up and login."""
        # Critical path 1: sign up -> log in -> log out works end to end
        pass

    @pytest.mark.critical
    def test_core_search_flow(self, page: Page):
        """Core search flow."""
        # Critical path 2: search -> view results -> view details works end to end
        pass

    @pytest.mark.critical
    def test_basic_purchase_flow(self, page: Page):
        """Basic purchase flow."""
        # Critical path 3: pick product -> add to cart -> checkout -> pay -> confirm works end to end
        pass

    @pytest.mark.critical
    def test_critical_backend_health(self):
        """Health of critical backend services."""
        # Non-UI test: verify that the critical APIs are available
        pass

This layer is the CI gate: the remaining layers run only if it passes.
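
One way to wire up that gate is a small runner that invokes pytest once per marker and stops as soon as the gate layer fails. This is a sketch under assumptions: the `critical` and `integration` markers appear in the code above, while `feature` and `ui` are assumed marker names for Layers 2 and 1, and `run_layers` is a hypothetical helper rather than the project's actual CI script.

```python
import subprocess
import sys
from typing import Callable, Sequence, Tuple

# (marker, is_gate). "feature" and "ui" are assumed marker names.
LAYERS: Sequence[Tuple[str, bool]] = (
    ("critical", True),
    ("integration", False),
    ("feature", False),
)
OPTIONAL_LAYERS: Sequence[Tuple[str, bool]] = (("ui", False),)


def run_marker(marker: str) -> int:
    """Run pytest for a single marker and return its exit code."""
    return subprocess.call([sys.executable, "-m", "pytest", "-m", marker])


def run_layers(
    run: Callable[[str], int] = run_marker,
    include_optional: bool = False,
) -> int:
    """Run layers in order; abort immediately if a gate layer fails."""
    layers = list(LAYERS)
    if include_optional:
        layers += list(OPTIONAL_LAYERS)

    first_failure = 0
    for marker, is_gate in layers:
        code = run(marker)
        if code == 0:
            continue
        if is_gate:
            print(f"Gate layer '{marker}' failed; skipping remaining layers")
            return code
        # Non-gate failures are remembered, but later layers still run
        first_failure = first_failure or code
    return first_failure
```

A CI job could call `sys.exit(run_layers())` as its test step. Passing the runner in as a parameter keeps the ordering logic testable without launching pytest.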

Fixing Test Data: The Seed Data Pattern

Messy test data is one of the main causes of flakiness. I designed a Seed Data Pattern to deal with it:

# test/data/seed_data.py
from uuid import uuid4


class SeedDataManager:
    """
    Central test data management.
    Each test uses its own seed data, which is cleaned up when the test ends.
    """

    def __init__(self, db_connection):
        self.db = db_connection
        self.created_records = []

    def create_user(self, **overrides):
        """Create a test user."""
        defaults = {
            "email": f"test_{uuid4().hex[:8]}@example.com",  # unique email
            "name": "Test User",
            "tier": "standard"
        }
        user_data = {**defaults, **overrides}

        user = self.db.users.create(**user_data)
        self.created_records.append(("users", user.id))
        return user

    def create_product(self, **overrides):
        """Create a test product."""
        defaults = {
            "sku": f"SKU-{uuid4().hex[:8]}",  # unique SKU
            "name": "Test Product",
            "price": 99.99,
            "stock": 100
        }
        product_data = {**defaults, **overrides}

        product = self.db.products.create(**product_data)
        self.created_records.append(("products", product.id))
        return product

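    def create_cart_with_items(self, user, item_count=3):
        """Create a cart that already contains items."""
        cart = self.db.carts.create(user_id=user.id)
        # Track the cart before its contents so that reverse-order cleanup
        # deletes cart items and products first, then the cart itself
        self.created_records.append(("carts", cart.id))
        products = [self.create_product() for _ in range(item_count)]

        for product in products:
            item = self.db.cart_items.create(cart_id=cart.id, product_id=product.id)
            # Cart items must be tracked too, or they are left behind
            self.created_records.append(("cart_items", item.id))

        return cart

    def cleanup(self):
        """Delete every record that was created, newest first."""
        for table, record_id in reversed(self.created_records):
            # Tables are attributes on the db object (db.users, db.carts, ...)
            getattr(self.db, table).delete(record_id)
        self.created_records = []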

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.cleanup()

Usage:

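def test_shopping_cart_with_fixture(db):
    """Use the seed data manager."""
    with SeedDataManager(db) as seed:
        # Everything created here is cleaned up automatically when the test ends
        # Keep the default unique email; a hard-coded address can collide across runs
        user = seed.create_user(name="Cart Test User")
        cart = seed.create_cart_with_items(user, item_count=5)

        # Test logic...
        # Even if the test fails, cleanup() still runs in __exit__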
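
The property that matters here is that cleanup runs in reverse creation order even when the test body raises. The standard library's `contextlib.ExitStack` gives the same guarantee, and the sketch below demonstrates it with a plain list standing in for the database. The `run_test_with_cleanup` helper and the log strings are illustrative only.

```python
from contextlib import ExitStack
from typing import List


def run_test_with_cleanup() -> List[str]:
    """Simulate a failing test and record the order of create/delete steps."""
    log: List[str] = []
    try:
        with ExitStack() as stack:
            log.append("create user")
            stack.callback(log.append, "delete user")

            log.append("create product")
            stack.callback(log.append, "delete product")

            # Simulated assertion failure inside the test body
            raise AssertionError("test failed")
    except AssertionError:
        pass  # the cleanup callbacks have already run by this point
    return log


print(run_test_with_cleanup())
```

The printed list is `['create user', 'create product', 'delete product', 'delete user']`: callbacks run last-in, first-out, so dependent records are removed before the records they reference.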

Fixing Wait Times: Smart Wait Utilities

Fixed wait times are the other major cause of flakiness. I replaced them with smart waits:

# test/utils/smart_wait.py
from contextlib import contextmanager
from urllib.parse import urlparse

from playwright.sync_api import Page, TimeoutError

class SmartWait:
    """Smart wait helpers."""

    @staticmethod
    def for_element_visible(page: Page, selector: str, timeout: int = 10000):
        """Wait until an element is visible, handling the loading state automatically."""
        try:
            element = page.locator(selector)
            element.wait_for(state="visible", timeout=timeout)
            return element
        except TimeoutError:
            # On timeout, print the page state to help with debugging
            print(f"Timeout waiting for {selector}")
            print(f"Page URL: {page.url}")
            print(f"Page title: {page.title()}")
            print(f"Visible elements: {page.locator(':visible').count()}")
            raise

    @staticmethod
    def for_network_idle(page: Page, timeout: int = 30000):
        """Wait for the network to go idle."""
        page.wait_for_load_state("networkidle", timeout=timeout)

    @staticmethod
    @contextmanager
    def for_api_response(page: Page, url_pattern: str, timeout: int = 10000):
        """Wait for a successful response whose URL path starts with url_pattern."""
        def check_response(response):
            # response.url is absolute, so compare against its path component
            path = urlparse(response.url).path
            return path.startswith(url_pattern) and response.status < 400

        with page.expect_response(check_response, timeout=timeout) as response_info:
            # The caller triggers the API request inside its own `with` body;
            # response_info.value is only available after that body has finished
            yield response_info

Usage:

def test_async_operation_complete(page: Page):
    """An async operation runs to completion."""
    # Trigger the async operation
    page.click('[data-testid="start-export"]')

    # Smart wait: block until the status API responds
    with SmartWait.for_api_response(page, "/api/export/status"):
        page.click('[data-testid="check-status"]')

    # Wait for the result element to appear
    SmartWait.for_element_visible(
        page,
        '[data-testid="export-complete"]',
        timeout=30000
    )
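
For conditions that a browser-level wait does not cover, the same idea (poll until ready, fail with a useful message at a deadline) can be written as a small generic helper. This is a minimal sketch; `wait_until` and its parameters are hypothetical names and not part of Playwright.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(
    condition: Callable[[], T],
    timeout: float = 10.0,
    interval: float = 0.1,
    description: str = "condition",
) -> T:
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value as soon as it appears, so a fast environment
    does not pay for the full timeout the way a fixed sleep would.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Timed out after {timeout}s waiting for {description}")
        time.sleep(interval)
```

A call such as `wait_until(lambda: export_path.exists(), timeout=30, description="export file")` returns the moment the condition holds, where `export_path` stands for whatever the test is waiting on.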

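The Supporting Role of the AI Agent

Throughout the rebuild, the AI agent helped me with several things:

1. Test translation: Selenium → Playwright

The old system used Selenium and the new one uses Playwright. The AI agent handled the code translation for me.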

# Playwright version generated by the AI (translated from Selenium)
# Old Selenium version
driver.find_element(By.CSS_SELECTOR, ".product-card .add-btn").click()

# AI-generated Playwright version
page.locator('.product-card .add-btn').click()
# Or, more robustly
page.get_by_test_id("add-to-cart").click()

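The AI suggested replacing CSS selectors with data-testid attributes for stability. That suggestion was right.

2. Test case generation

Given an API endpoint, the AI agent can generate basic test cases: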

# Test case template generated by the AI
@pytest.mark.parametrize("input_data,expected", [
    ({"quantity": 1}, "success"),
    ({"quantity": 0}, "validation_error"),
    ({"quantity": -1}, "validation_error"),
    ({"quantity": 999999}, "stock_error"),
])
def test_add_to_cart_quantity_boundaries(
    page: Page,
    authenticated_user,
    input_data,
    expected
):
    """Boundary test: cart quantity."""
    # PRODUCT_ID is assumed to be defined elsewhere (e.g. a seeded product)
    page.goto(f"/product/{PRODUCT_ID}")

    quantity_input = page.get_by_test_id("quantity-input")
    quantity_input.fill(str(input_data["quantity"]))

    page.get_by_test_id("add-to-cart").click()

    if expected == "success":
        expect(page.get_by_test_id("cart-badge")).to_have_text("1")
    elif expected == "validation_error":
        expect(page.get_by_test_id("error-message")).to_be_visible()
    elif expected == "stock_error":
        expect(page.get_by_test_id("stock-error")).to_be_visible()

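The AI-generated tests are not perfect, but they are a good starting point. I reviewed them and adjusted the boundary values.

3. Flaky test analysis

The hardest part was working out why tests were flaky. The AI agent analyzed the logs and looked for patterns: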

# AI analysis output
def analyze_flaky_tests(flaky_test_logs: list) -> dict:
    """
    Count the failure patterns that appear in flaky test logs.
    """

    patterns = {
        "timeout": 0,
        "selector_not_found": 0,
        "network_error": 0,
        "data_conflict": 0,
        "state_leak": 0,
    }

    for log in flaky_test_logs:
        if "Timeout" in log:
            patterns["timeout"] += 1
        if "selector" in log.lower():
            patterns["selector_not_found"] += 1
        if "Network" in log:
            patterns["network_error"] += 1
        if "unique constraint" in log.lower():
            patterns["data_conflict"] += 1
        if "previous test" in log.lower():
            patterns["state_leak"] += 1

    return patterns

# Analysis result:
# {'timeout': 45, 'selector_not_found': 30, 'network_error': 15,
#  'data_conflict': 8, 'state_leak': 2}
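
Keyword counting explains why tests fail, but it needs a set of flaky tests to start from. The article does not say how that set or the flaky rate was measured. One common convention, assumed here, is to call a test flaky when it both passed and failed on the same commit. The `RunRecord` fields and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class RunRecord(NamedTuple):
    test_name: str
    commit: str
    passed: bool


def find_flaky_tests(runs: Iterable[RunRecord]) -> Set[str]:
    """Return tests that both passed and failed on at least one commit."""
    outcomes: Dict[Tuple[str, str], Set[bool]] = defaultdict(set)
    for run in runs:
        outcomes[(run.test_name, run.commit)].add(run.passed)
    # Two distinct outcomes for the same (test, commit) means unchanged code
    # produced different results
    return {name for (name, _), seen in outcomes.items() if len(seen) == 2}


def flaky_rate(runs: Iterable[RunRecord]) -> float:
    """Share of distinct tests flagged as flaky (0.0 when there are no runs)."""
    records = list(runs)
    all_tests = {record.test_name for record in records}
    if not all_tests:
        return 0.0
    return len(find_flaky_tests(records)) / len(all_tests)
```

A test that fails on one commit and passes on the next is not counted, because the code changed in between; only disagreement on identical code is treated as flakiness.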

Based on the analysis, I set the priorities:
1. Timeout issues → switch to smart waits
2. Selector issues → standardize on data-testid
3. Network issues → add a retry mechanism
4. Data conflicts → fix seed data management
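
For the network item, a retry wrapper with exponential backoff is one way to do it. A minimal sketch, assuming transient failures surface as `ConnectionError`; the `retry` decorator and its parameters are hypothetical and not the helper used in the original project.

```python
import functools
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    exceptions: Tuple[Type[BaseException], ...] = (ConnectionError,),
    attempts: int = 3,
    base_delay: float = 0.5,
) -> Callable[[Callable[..., T]], Callable[..., T]]:
    """Retry the wrapped call on the given exceptions, doubling the delay each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> T:
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: surface the real error
                    # Delays grow as base_delay, 2x, 4x, ...
                    time.sleep(base_delay * 2 ** (attempt - 1))
            raise AssertionError("unreachable")

        return wrapper

    return decorator
```

Wrapping only the network call, rather than the whole test, means a genuine product bug in the assertions still fails on the first attempt instead of being retried away.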

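The Numbers One Year Later

One year on, here is how the architecture is doing:

Metric                          Before     After        Change
Pass rate                       60%        94%          +34 points
Run time                        3 hours    40 minutes   -78%
Flaky rate                      15%        2%           -13 points
Test coverage (critical path)   N/A        100%         -
Mean time to fix                2 hours    20 minutes   -83%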

The critical path tests (Layer 4) gate every deploy in CI, and test coverage of critical business flows reached 100%.

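Pitfalls

A few pitfalls I hit during the rebuild:

Pitfall 1: Too many Layer 1 tests

At first Layer 1 had 1500 UI component tests. They took up 40% of the run time but contributed very little to quality assurance.

Fix: make Layer 1 opt-in and skip it by default. It runs only when UI components actually need regression testing.

Pitfall 2: Seed data under concurrency

With CI jobs running in parallel, seed data emails could still collide (they were being distinguished by timestamp).

Fix: add a UUID suffix to the email so that every address is effectively unique: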

email = f"{base_email}_{uuid4().hex[:8]}@example.com"
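
Eight hex characters give 16^8 = 2^32 possible suffixes, so uniqueness is probabilistic rather than guaranteed. The birthday approximation gives a rough feel for the odds. The `collision_probability` helper is illustrative, and the email counts are made-up inputs, not measurements from the project.

```python
import math


def collision_probability(n_values: int, hex_chars: int = 8) -> float:
    """Approximate chance that at least two of n random suffixes collide.

    Uses the birthday approximation p = 1 - exp(-n(n-1) / (2N)),
    where N = 16 ** hex_chars is the number of possible suffixes.
    """
    space = 16 ** hex_chars
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x
    return -math.expm1(-n_values * (n_values - 1) / (2 * space))


for n in (1_000, 100_000):
    print(f"{n} emails: p ~ {collision_probability(n):.4f}")
```

At around a thousand emails the chance of any collision is on the order of one in ten thousand. At a hundred thousand a collision is more likely than not, at which point a longer suffix or the full `uuid4().hex` is the safer choice.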

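Pitfall 3: Relying too much on AI-generated tests

The boundary values in AI-generated boundary tests were sometimes wrong.

Fix: boundary values need human review, and AI output cannot be used as-is. AI is the starting point, not the finish line.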

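Key Lessons

After an 8-week rebuild and a year in production, these are my key lessons:

1. Technical problems are often symptoms of organizational problems

A high flaky rate and E2E tests nobody trusts are not a technical problem. They are a symptom of the team disagreeing about how much priority tests deserve. Before rebuilding the architecture, get the team to agree on one thing: what are the tests for?

2. Pass rate is not the goal; stability is

Early on the team chased a 100% pass rate, and as a result many "fixes" were deleted tests rather than fixed problems. Stability matters more than pass rate.

3. AI is an accelerator, not a replacement

AI can help you write tests, but it cannot think through your test strategy for you. Architecture design, prioritization, and data management strategy still need a human.

4. Layering and isolation are the foundation of trust

Without layers, all the tests are mixed together and nobody knows which failure to look at first. Once there is layering and isolation, developers become willing to read the test reports.

The architecture is on GitHub (link omitted). Help yourself, and issues are welcome.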
