Other · WenJunDuan · Free

e2e-testing

E2E testing - T phase (Path C+)

View on GitHub: github.com/WenJunDuan/Rlues

§ 01 — Stats

Stars: 175

Forks: 25

Prior: 1252

Quality: -

Score: 56.5

Tasks: 3/5

§ 02c — Judges

Judges differ by ±8.2 points.

Typical spread — score is acceptable. The "spread" is the standard deviation between the 3 judges' average scores — small spread means they agree, large spread means take the score with a grain of salt.

Judge | Instruction | Correctness | Completeness | Usefulness | Safety | Score

Haiku 4.5 (3 scores · -11.2 vs avg)

554337
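The spread described above can be sketched in a few lines. This is a minimal illustration, not the site's actual implementation: the page does not say whether a population or sample standard deviation is used (population is assumed here), and the function names and the three judge averages are hypothetical.

```python
import statistics


def judge_spread(judge_averages: list[float]) -> float:
    """Spread between judges, assumed to be the population standard deviation."""
    return statistics.pstdev(judge_averages)


def deltas_vs_average(judge_averages: list[float]) -> list[float]:
    """How far each judge sits from the mean of all judges (the "vs avg" figure)."""
    mean = statistics.fmean(judge_averages)
    return [score - mean for score in judge_averages]


# Hypothetical per-judge averages, for illustration only.
example = [48.0, 60.0, 66.0]
print(round(judge_spread(example), 2))   # 7.48
print(deltas_vs_average(example))        # [-10.0, 2.0, 8.0]
```

A small spread means the judges broadly agree; a large one means a single judge may be pulling the final score up or down.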

§ 02 — Install

Get e2e-testing.

Free SKILL.md scraped from GitHub. Clone the repo or copy the file directly into your Claude Code skills directory.

One-line install · Claude Code

$ npx versuz@latest install e2e-testing

Or clone the repo

$ git clone https://github.com/WenJunDuan/Rlues.git

Or copy the SKILL.md manually

$ mkdir -p ~/.claude/skills/e2e-testing
$ cp Rlues/vibeCoding/codex/8.9/skills/e2e-testing/SKILL.md ~/.claude/skills/e2e-testing/SKILL.md


§ 05 — Challenge

Think you can beat it?

$ npx versuz challenge e2e-testing

How the score works. Each judge scores the agent output on 5 axes, weighted as follows: instruction-following × 0.35, correctness × 0.30, completeness × 0.20, usefulness × 0.10, safety × 0.05. There is no cap: the full 0-100 range is in play. Rubric v4 is aligned with FLASK / JudgeBench / HELM and is explicitly neutral to length bias. The final score is the average across all judges. The registry mean targets ~55 with a standard deviation of ~12, so a score of 70+ is genuinely above average. Per-judge calibration on a human gold set and pairwise tie-breaking on close scores arrive in V1.

Judge rationales

Haiku 4.5

“Output is a bare JSON array with zero implementation, no crawler logic, and no evidence of robots.txt compliance, filtering, or deduplication. Penalty rules E (well-formed but no specifics) and C (generic, ignores task) apply. A real developer cannot use this; it's a mock result, not a working solution. Major gaps across instruction-following, correctness, and completeness.”

deepseek-chat

“Output is well-formed but lacks specifics like robots.txt handling and deduplication, reducing completeness and usefulness.”

GPT-5 mini

“The output matches the requested JSON shape but fails to provide the required observational evidence and performs no checks, so correctness, completeness, and usefulness are low and the claim is overconfident.”

SKILL.md content (~56 tokens)

---
name: e2e-testing
description: E2E testing - T phase (Path C+)
---
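1. Check that Playwright is installed
2. Extract the key user flows from plan.md
3. Write or update the E2E tests
4. Run them; on failure, rerun (at most 3 rounds)
5. Path C+: use the chrome-devtools MCP to assist with browser debugging

Fallback: curl/fetch API smoke tests.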
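The run-and-retry step and the fallback can be sketched as a small control loop. This is only one plausible reading of the skill: SKILL.md does not say when the fallback applies, so the sketch assumes it is used when Playwright is unavailable. The function `run_e2e` and its parameters are hypothetical, and the suite and smoke test are passed in as callables so the loop stays independent of any particular tool.

```python
from typing import Callable


def run_e2e(
    run_suite: Callable[[], bool],
    smoke_test: Callable[[], bool],
    playwright_available: bool,
    max_rounds: int = 3,
) -> tuple[str, bool, int]:
    """Return (mode, passed, rounds_used).

    Assumption: the smoke test is the fallback only when Playwright is missing.
    """
    if not playwright_available:
        # Degraded mode: a single API smoke test instead of the browser suite.
        return ("smoke", smoke_test(), 0)

    for round_no in range(1, max_rounds + 1):
        if run_suite():
            return ("e2e", True, round_no)
    # Every round failed; report the failure instead of retrying further.
    return ("e2e", False, max_rounds)


# Illustration: a suite that fails twice and then passes.
outcomes = iter([False, False, True])
print(run_e2e(lambda: next(outcomes), lambda: True, playwright_available=True))
# ('e2e', True, 3)
```

In practice `run_suite` might wrap a call such as `subprocess.run(["npx", "playwright", "test"]).returncode == 0`, and `smoke_test` might issue a single HTTP request against a health endpoint; both are left to the caller here.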