
Experiment Hypothesis Design

Generate testable hypotheses from product data, rank by expected impact, and calculate required sample sizes

PostHog · Anthropic · Attio
$ npx gtm-skills add drill/experiment-hypothesis-design

What this drill teaches


This drill turns product data into testable experiment hypotheses. It prevents the common failure mode of A/B testing programs: running random tests with no theory of change. Every experiment starts with data, produces a structured hypothesis, and has a pre-calculated sample size so you know whether the test is feasible before you build anything.

Prerequisites

  • PostHog with at least 30 days of product usage events
  • At least 100 weekly active users (below this, most experiments cannot reach statistical significance within a reasonable timeframe)
  • Anthropic API key for Claude (hypothesis generation)
  • Attio configured for experiment logging

Input

  • A product area or metric the team wants to improve (e.g., "activation rate", "feature X adoption", "trial-to-paid conversion")
  • Current baseline value for that metric
  • Minimum detectable effect (the smallest improvement worth investing in)

Steps

1. Extract opportunity signals from PostHog

Query PostHog for data that reveals improvement opportunities:

Funnel drop-offs: Using posthog-funnels, build the funnel for the target metric. Identify the step with the largest absolute drop-off. A step where 40% of users abandon is a bigger opportunity than one where 5% abandon, regardless of how "broken" the latter feels.

Cohort divergence: Using posthog-cohorts, compare high-performing user cohorts (retained 60+ days) against churned cohorts. What features or actions differentiate them? The gap between these cohorts reveals what to test.

Session patterns: Query event sequences for users who completed the target action vs. those who did not. Look for friction signals: repeated attempts at the same step, navigation loops, or long pauses between steps.
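The drop-off analysis above can be sketched as a small helper. The step names and counts here are illustrative, standing in for the output of a posthog-funnels query against your project:

```python
# Sketch: compute per-step drop-off rates from ordered funnel step counts
# and surface the largest absolute drop-off first.

def funnel_drop_offs(steps):
    """steps: ordered list of (step_name, user_count) pairs.
    Returns (transition, drop_off_rate) pairs, largest drop-off first."""
    drops = []
    for (name_a, n_a), (name_b, n_b) in zip(steps, steps[1:]):
        rate = (n_a - n_b) / n_a if n_a else 0.0
        drops.append((f"{name_a} -> {name_b}", round(rate, 2)))
    return sorted(drops, key=lambda d: d[1], reverse=True)

# Illustrative counts matching the brief below (62% and 18% drop-offs)
steps = [
    ("pricing_page_viewed", 1000),
    ("checkout_started", 380),
    ("payment_completed", 312),
]
print(funnel_drop_offs(steps))
```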

Compile the raw data into a structured opportunity brief:

{
  "target_metric": "trial_to_paid_conversion",
  "current_baseline": "8.2%",
  "funnel_drop_offs": [
    {"step": "pricing_page_viewed -> checkout_started", "drop_off_rate": "62%"},
    {"step": "checkout_started -> payment_completed", "drop_off_rate": "18%"}
  ],
  "cohort_differences": [
    "Converted users viewed pricing page 2.4x more often before converting",
    "Converted users used feature X within first 3 days (72% vs 31%)"
  ],
  "friction_signals": [
    "38% of users who view pricing leave within 5 seconds",
    "Users toggle between plan comparison 4+ times before selecting"
  ]
}

2. Generate ranked hypotheses

Pass the opportunity brief to the hypothesis-generation fundamental. Request 5 hypotheses, each structured as:

{
  "hypothesis": "If we add a plan recommendation quiz to the pricing page, then trial-to-paid conversion will increase by 2 percentage points",
  "reasoning": "62% drop-off at pricing->checkout and users toggling between plans 4+ times suggests confusion, not price resistance. A quiz reduces decision effort.",
  "target_metric": "trial_to_paid_conversion",
  "expected_lift": "2pp (8.2% -> 10.2%)",
  "risk_level": "low",
  "implementation_effort": "medium",
  "dependencies": ["pricing page can render dynamic content via feature flag"],
  "estimated_impact_score": 8.5
}
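A hedged sketch of the generation call, assuming the official Anthropic Python SDK; the prompt wording and model id are illustrative, not part of the drill spec:

```python
import json

# Field names mirror the hypothesis structure above.
FIELDS = ["hypothesis", "reasoning", "target_metric", "expected_lift",
          "risk_level", "implementation_effort", "dependencies",
          "estimated_impact_score"]

def build_prompt(brief, n=5):
    """Turn the opportunity brief into a hypothesis-generation prompt."""
    return (
        f"Given this opportunity brief, generate {n} experiment hypotheses "
        f"as a JSON array. Each object must have exactly these fields: "
        f"{', '.join(FIELDS)}.\n\n{json.dumps(brief, indent=2)}"
    )

def generate_hypotheses(brief):
    import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY
    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # substitute any current Claude model
        max_tokens=2000,
        messages=[{"role": "user", "content": build_prompt(brief)}],
    )
    return json.loads(msg.content[0].text)
```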

Rank hypotheses by: (expected_lift * confidence) / implementation_effort, deriving confidence from the risk level (lower risk implies higher confidence the effect materializes). The top-ranked hypothesis should have the best ratio of potential impact to effort.
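The ranking can be sketched as follows; the confidence mapping from risk level, the effort scale, and the numeric `expected_lift_pp` field are assumptions to adapt to your own scoring:

```python
# Sketch: rank hypotheses by (expected_lift * confidence) / effort.
RISK_TO_CONFIDENCE = {"low": 0.9, "medium": 0.6, "high": 0.3}  # assumed mapping
EFFORT_SCALE = {"low": 1, "medium": 2, "high": 3}              # assumed scale

def rank_hypotheses(hypotheses):
    def score(h):
        lift = float(h["expected_lift_pp"])  # lift in percentage points
        confidence = RISK_TO_CONFIDENCE[h["risk_level"]]
        effort = EFFORT_SCALE[h["implementation_effort"]]
        return (lift * confidence) / effort
    return sorted(hypotheses, key=score, reverse=True)
```

A bold, cheap, risky test can outrank a safe, expensive one: a 4pp high-risk/low-effort idea scores 1.2 against 0.9 for a 2pp low-risk/medium-effort one.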

3. Calculate sample size for each hypothesis

For each hypothesis, compute the required sample size:

  • Inputs: baseline rate, minimum detectable effect (from the hypothesis), significance level (0.05), power (0.80)
  • Formula: Use PostHog's built-in experiment calculator or compute the per-variant size manually (two-sided test): n = (Z_alpha/2 + Z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / (p1 - p2)^2, where p1 is the baseline rate and p2 = p1 + minimum detectable effect
  • Feasibility check: Given your current traffic, how many days will this experiment take? If longer than 28 days, either increase the expected effect size (test a bolder change) or pick a higher-traffic surface to test on.

Mark each hypothesis as feasible (can run within 28 days) or infeasible (requires more traffic than available). Drop infeasible hypotheses.
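The formula and feasibility check above can be sketched with the standard library; the weekly traffic figure is illustrative:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided, two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 at alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 at power=0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

def estimated_days(n_per_variant, weekly_eligible_users, variants=2):
    """Days to fill all variants at current traffic."""
    daily = weekly_eligible_users / 7
    return ceil(n_per_variant * variants / daily)

# Baseline 8.2% -> target 10.2% (the 2pp lift from the example hypothesis).
n = sample_size_per_variant(0.082, 0.102)
days = estimated_days(n, weekly_eligible_users=500)
print(n, days)  # ~3275 per variant; ~92 days at 500 weekly users — infeasible, so
                # test a bolder change or a higher-traffic surface
```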

4. Log the experiment backlog

Using the attio-notes fundamental, create an experiment record in Attio for each feasible hypothesis:

  • Hypothesis statement
  • Expected lift and confidence
  • Required sample size and estimated duration
  • Implementation dependencies
  • Status: "queued"

This creates the experiment backlog that the ab-test-orchestrator drill pulls from.
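A hedged sketch of the record creation, assuming Attio's v2 records endpoint and a custom "experiments" object; the object slug and attribute slugs are assumptions to match to your workspace schema:

```python
import json
import os
import urllib.request

def build_experiment_record(h, sample_size, est_days):
    """Map a feasible hypothesis onto the Attio record fields listed above."""
    return {"data": {"values": {
        "name": h["hypothesis"][:80],
        "hypothesis": h["hypothesis"],
        "expected_lift": h["expected_lift"],
        "required_sample_size": sample_size,
        "estimated_duration_days": est_days,
        "dependencies": ", ".join(h.get("dependencies", [])),
        "status": "queued",
    }}}

def log_experiment(record):
    req = urllib.request.Request(
        "https://api.attio.com/v2/objects/experiments/records",  # assumed object slug
        data=json.dumps(record).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['ATTIO_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```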

5. Prioritize and schedule

From the feasible backlog, select the top hypothesis for the next experiment. Criteria:

  • Highest impact score among feasible hypotheses
  • No dependency conflicts with currently running experiments
  • Implementation can start within the current sprint

Log the selected hypothesis as "next" in Attio. Archive hypotheses that become stale (older than 90 days without being tested).

Output

  • Ranked list of 3-5 testable hypotheses with sample sizes and feasibility assessments
  • Top hypothesis selected and logged in Attio as the next experiment to run
  • Opportunity brief documented for future reference

Triggers

Run this drill:

  • At the start of each experiment cycle (before running ab-test-orchestrator)
  • When a previous experiment completes (to select the next one)
  • When the team identifies a new metric to improve