Boosting A/B Testing Efficiency for 1M+ Users
Explore how Stripe rolled out A/B testing at scale
TL;DR
Situation
Stripe users wanted to test how new payment methods impacted conversions and revenue but faced complex, resource-intensive A/B testing processes
Task
Stripe set out to build a no-code A/B testing tool, enabling users to experiment with payment methods and analyze outcomes effortlessly
Action
The team took three actions:
Statistical Significance: Introduced a time-window feature that lets the same customer appear in both control and treatment groups over time.
Avoiding Dilution: Filtered transactions so that only eligible sessions were analyzed.
Event Linking: Built a pipeline to connect payment "render" and "confirm" events for precise metric tracking.
Result
Indiegogo (Stripe customer) saw a 2% increase in conversion
Use Cases
Payment Method Optimization, Checkout Experience Improvement
Tech Stack/Framework
A/B Testing
Explained Further
Improving Statistical Significance
A/B tests take a long time to reach significance for businesses with smaller customer bases. Stripe addressed this by increasing the number of data points without artificially boosting transaction volumes. Traditionally, each checkout session counts as one data point. To reach statistical significance sooner, Stripe introduced a time-window component that allows the same customer to appear in both control and treatment groups over time. Within a given time interval, a customer sees one set of payment methods; after the interval elapses, they may see a different set on their next purchase. This increases the effective sample size and shortens experiment durations.
To implement this, Stripe used a deterministic hash function that takes inputs such as UserAgent, IP address, and the time window component and produces a pseudo-random number between 1 and 10,000. That number determines the customer’s group according to the experiment’s split (e.g., 90/10). Because the hash is deterministic, the same inputs always yield the same group within a time window, which keeps the payment experience consistent across Stripe's interfaces while still randomizing assignment across customers and windows.
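The assignment scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Stripe's implementation: the choice of SHA-256, the experiment_id salt, the one-day window length, and the rule that the lowest buckets form the treatment group are all hypothetical.

```python
import hashlib

BUCKETS = 10_000  # the 1..10,000 range described above


def time_window_index(timestamp_s: int, window_s: int) -> int:
    """Return the index of the fixed-length window containing the timestamp."""
    return timestamp_s // window_s


def bucket(experiment_id: str, user_agent: str, ip: str, window: int) -> int:
    """Deterministically map the inputs to a number in 1..10,000."""
    # Hypothetical key layout; SHA-256 is an assumption, not the hash Stripe uses.
    key = "\x1f".join([experiment_id, user_agent, ip, str(window)])
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS + 1


def assign(experiment_id: str, user_agent: str, ip: str, timestamp_s: int,
           window_s: int = 86_400, treatment_pct: float = 10.0) -> str:
    """Return "treatment" or "control" for a session."""
    window = time_window_index(timestamp_s, window_s)
    # Assumption: the lowest buckets form the treatment group (10% -> 1..1000).
    cutoff = round(BUCKETS * treatment_pct / 100)
    n = bucket(experiment_id, user_agent, ip, window)
    return "treatment" if n <= cutoff else "control"
```

Within one window the same inputs always map to the same group; once the window index changes, the hash input changes, so the same customer can land in the other group.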
Avoiding Data Dilution
Stripe addressed dilution in A/B tests by ensuring only eligible transactions are included. Dilution occurs when customers in the treatment group see no difference from the control group; their sessions add noise to the dataset and delay statistical significance. For example, if a buy now, pay later (BNPL) option is restricted to transactions over $50, treatment customers with smaller transactions see the same payment methods as control, which reduces the measured effect size and the statistical power.
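A rough back-of-the-envelope sketch shows why dilution is costly. It rests on simplifying assumptions that are not from the source: only a fraction of treatment sessions can actually show a different set of payment methods, the remaining sessions behave exactly like control, and the required sample size scales with the inverse square of the effect size (variance held roughly constant). The function names are illustrative.

```python
def diluted_effect(true_effect: float, eligible_fraction: float) -> float:
    """Absolute lift observed when only a fraction of sessions can differ."""
    return true_effect * eligible_fraction


def sample_size_inflation(eligible_fraction: float) -> float:
    """Approximate factor by which the required sample size grows."""
    # Required sample size is roughly proportional to 1 / effect**2,
    # so shrinking the effect by a factor f inflates it by 1 / f**2.
    return 1.0 / eligible_fraction ** 2
```

Under these assumptions, if only a quarter of sessions are over the $50 threshold, a 2-point lift among eligible sessions appears as a 0.5-point lift overall, and the experiment needs roughly 16 times as many sessions to detect it.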
To prevent this, Stripe validates a transaction's eligibility before including it in the experiment, so that only sessions in which the treatment group's payment methods can actually be displayed are analyzed. All control and treatment payment methods are validated synchronously for the session, and an experiment outcome is randomly assigned only if the session is eligible.
Algorithm:
Create a combined superset of control and treatment payment methods.
Remove payment methods failing general constraints (e.g., transaction type, currency).
Split into control and treatment subsets, filtering each by specific rule constraints.
Compare subsets; if at least one payment method differs, mark the session as eligible.
Assign an outcome and display the respective payment methods.
This reduces dilution and ensures faster, accurate results by maintaining statistical power, even for experiments with complex payment constraints.
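The steps above can be sketched as a small function. This is a minimal illustration, not Stripe's code: the Session fields, the representation of constraints as per-method predicates, and the pick_arm callback are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Session:
    amount: int    # minor units, e.g. cents (hypothetical field)
    currency: str  # hypothetical field


Rule = Callable[[Session], bool]


def _passes(rules: Dict[str, Rule], method: str, session: Session) -> bool:
    """A method with no rule registered is treated as unconstrained."""
    rule = rules.get(method)
    return rule is None or rule(session)


def shown_methods(session: Session, control: List[str], treatment: List[str],
                  general: Dict[str, Rule],
                  specific: Dict[str, Rule]) -> Tuple[List[str], List[str]]:
    # Step 1: combined superset of control and treatment payment methods.
    superset = set(control) | set(treatment)
    # Step 2: drop methods that fail general constraints.
    allowed = {m for m in superset if _passes(general, m, session)}
    # Step 3: split back into arms and filter each by specific rules.
    control_shown = [m for m in control
                     if m in allowed and _passes(specific, m, session)]
    treatment_shown = [m for m in treatment
                       if m in allowed and _passes(specific, m, session)]
    return control_shown, treatment_shown


def resolve(session: Session, control: List[str], treatment: List[str],
            general: Dict[str, Rule], specific: Dict[str, Rule],
            pick_arm: Callable[[], str]) -> Tuple[Optional[str], List[str]]:
    control_shown, treatment_shown = shown_methods(
        session, control, treatment, general, specific)
    # Step 4: eligible only if at least one payment method differs.
    if set(control_shown) == set(treatment_shown):
        return None, control_shown  # both arms identical; nothing to assign
    # Step 5: assign an outcome and display that arm's methods.
    arm = pick_arm()
    return arm, (treatment_shown if arm == "treatment" else control_shown)
```

With a BNPL rule such as `{"bnpl": lambda s: s.amount > 5000}`, a $20 session returns no arm and is left out of the experiment, while an $80 session is assigned and counted.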
Scalability and Experiment Design
Stripe designed its A/B testing system to work seamlessly for all user integration types, including those finalizing payments via their own servers. Unlike client-side payments using Stripe.js, server-side transactions lack key data such as UserAgent or IP address, complicating the connection between two critical events:
Render Event: Logs the displayed payment methods and assigns the session to control or treatment.
Confirm Event: Logs the chosen payment method when a payment is completed.
To bridge this gap, Stripe uses the unique session ID included in the metadata of the PaymentMethod object. When payment methods are rendered, a "render" event with the session ID is logged. Later, the server-side confirmation references the PaymentMethod ID, and because that PaymentMethod carries the session ID, Stripe’s data pipeline can join the "render" and "confirm" events via the session ID.
The pipeline aggregates results into a unified table for experiment summaries, enabling accurate reporting across all integration types.
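As a rough illustration of that join, the sketch below links render and confirm events on a shared session ID and aggregates them per experiment arm. The event shapes and field names (session_id, arm, payment_method_type) are hypothetical, and it assumes one render event per session; a production pipeline would run over a data warehouse rather than in-memory lists.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def summarize(render_events: Iterable[Mapping[str, str]],
              confirm_events: Iterable[Mapping[str, str]]) -> Dict[str, dict]:
    """Join render and confirm events by session ID and aggregate per arm."""
    # Index confirmations by session ID for constant-time lookup.
    confirmed = {e["session_id"]: e["payment_method_type"]
                 for e in confirm_events}

    summary = defaultdict(lambda: {"sessions": 0, "conversions": 0})
    for event in render_events:
        row = summary[event["arm"]]
        row["sessions"] += 1
        # A session converts if a confirm event shares its session ID.
        if event["session_id"] in confirmed:
            row["conversions"] += 1

    return {
        arm: {**row, "conversion_rate": row["conversions"] / row["sessions"]}
        for arm, row in summary.items()
    }
```

Keying the join on the session ID rather than on UserAgent or IP address is what lets the same aggregation work for server-side confirmations, where those fields are unavailable.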
Lessons Learned
Enhancing Sample Sizes: Time-window sampling increases the number of data points, so experiments reach statistical significance sooner.
Data Integrity: Filtering out ineligible transactions before assignment keeps the data free of dilution and the test results accurate.
Comprehensive Event Tracking: Pipelines that link otherwise separate events, such as "render" and "confirm", enable precise analysis of user behavior and payment method performance.
The Full Scoop
To learn more about the update, check the Stripe Blog post on this topic.