Streaming Recommendation Engine: Did the New One Move the Needle?
A streaming service ran an A/B test on a new recommendation engine and wanted to know whether to roll it out. The naive comparison said yes. The careful comparison agreed, but with an important condition attached: the lift was real, and it was concentrated among the users who were already engaged. The headline number alone would have hidden where the engine actually earned its keep.
The problem
The service ran a controlled experiment: Group A on the previous recommender, Group B on the new one, hours watched as the engagement metric. The brief was to decide rollout. The hidden problem was that the random assignment was not as clean as it looked. Group A skewed older than Group B by a statistically significant margin, and age has a strong negative correlation with hours watched in the underlying dataset. Because the younger group was also the one on the new engine, a two-sample test on the raw data would have given the engine credit that belonged to age.
The analytical question, properly stated, was: after controlling for the demographic imbalance that random assignment failed to remove, does the new recommendation engine still move hours watched, and if so, for whom?
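The direction of the bias described above follows from simple arithmetic. As a minimal sketch, assuming hours watched are roughly linear in age, the raw gap between groups is the engine's effect plus the age slope times the age gap. The function name and every number below are hypothetical, chosen only to show the sign of the distortion; none of them come from the experiment.

```python
def naive_difference(true_effect, age_slope, mean_age_a, mean_age_b):
    """Expected raw B-minus-A gap in hours when age is left uncontrolled.

    Assumes a linear model:
        hours = base + age_slope * age + true_effect * in_group_b
    """
    age_gap = mean_age_b - mean_age_a        # negative when Group B is younger
    age_contribution = age_slope * age_gap   # positive when slope and gap are both negative
    return true_effect + age_contribution, age_contribution


# Hypothetical inputs, for illustration only.
raw_gap, from_age = naive_difference(
    true_effect=0.30,   # assumed engine effect, in hours
    age_slope=-0.05,    # assumed hours lost per extra year of age
    mean_age_a=40.0,    # assumed: Group A is the older group
    mean_age_b=37.0,    # assumed: Group B is the younger group
)
print(f"raw gap: {raw_gap:.2f} h, of which age explains {from_age:.2f} h")
# prints: raw gap: 0.45 h, of which age explains 0.15 h
```

Under these made-up inputs a third of the raw gap comes from age alone, which is why the raw comparison cannot settle the rollout question by itself.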
The approach
The analysis was staged, building up from the simplest comparison to a model that accounted for the variables doing real work.
Welch two-sample t-test. Headline check of hours watched between groups. The test returned a significant difference (p < 0.001) with Group B watching 4.81 hours on average versus 4.34 hours in Group A. Useful as a first signal, insufficient as a basis for rollout.
Initial regression model. Hours watched against age, social-engagement score, group assignment, gender, demographic category, and tenure on the platform. Age dominated the coefficient table with a large negative effect. Social engagement and group both showed strong positive effects. Gender and demographic category did not.
Age imbalance check. A targeted t-test confirmed that Group B was younger on average than Group A (p = 0.005). The imbalance was real and large enough to matter.
Adjusted regression with interaction terms. The model was rerun including interactions between group and age, and between group and social engagement. The interaction with age was not significant, meaning the data showed no evidence that the engine's effect varied with age. The interaction with social engagement was significant (p = 0.009), and that is the finding that matters for rollout.
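Both t-tests in the stages above (raw hours watched, and the age balance check) are Welch comparisons, which do not assume equal variances between groups. The following is a stdlib-only sketch of that test, not the code used in the analysis. The synthetic samples are centered on the reported group means, but the spread of 1.5 hours, the sample size of 500 per group, and the helper names are assumptions. The p-value uses a normal approximation, which is only reasonable when the degrees of freedom are large.

```python
import math
import random
from statistics import NormalDist, mean, variance


def welch_t_test(sample_a, sample_b):
    """Welch's unequal-variance t statistic (B minus A) and Welch-Satterthwaite df."""
    n_a, n_b = len(sample_a), len(sample_b)
    se2_a = variance(sample_a) / n_a   # squared standard error of mean A
    se2_b = variance(sample_b) / n_b   # squared standard error of mean B
    t_stat = (mean(sample_b) - mean(sample_a)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, df


def approx_two_sided_p(t_stat):
    """Two-sided p-value from the standard normal; a large-df approximation."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


# Synthetic data: means taken from the write-up, spread and n are assumed.
rng = random.Random(0)
hours_a = [rng.gauss(4.34, 1.5) for _ in range(500)]
hours_b = [rng.gauss(4.81, 1.5) for _ in range(500)]

t_stat, df = welch_t_test(hours_a, hours_b)
p_value = approx_two_sided_p(t_stat)
print(f"t = {t_stat:.2f}, df = {df:.0f}, approx p = {p_value:.2g}")
```

The same function applies to the age check by passing the two groups' ages instead of hours watched.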

What the model said
The new engine works, and it works hardest on the users who were already most socially engaged with the platform. For a user with low social engagement, the engine moves the needle modestly. For a user with high social engagement, the engine moves the needle substantially. The two effects compound: socially engaged users were already heavier viewers, and the engine makes them heavier still.
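The conditional reading above comes straight from the interaction term: the engine's lift at a given engagement level is the group coefficient plus the interaction coefficient times the social-engagement score. Below is a minimal sketch on synthetic data, using plain least squares written with the standard library. Every coefficient in the data generator, the 0 to 10 engagement scale, and the function names are invented for illustration. For brevity the sketch keeps only age, social engagement, group, and the group × social-engagement term from the specification described earlier.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [x - factor * y for x, y in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def engine_lift(beta, social):
    """Engine effect at a given social score: group term plus interaction term."""
    return beta[3] + beta[4] * social


# Synthetic users; all generating coefficients are hypothetical.
rng = random.Random(42)
rows, hours = [], []
for _ in range(2000):
    age = rng.uniform(18, 65)
    social = rng.uniform(0, 10)       # assumed 0-10 engagement scale
    group_b = rng.choice([0, 1])      # 1 = new engine
    y = (6.0 - 0.05 * age + 0.10 * social
         + 0.10 * group_b + 0.06 * group_b * social
         + rng.gauss(0, 0.5))
    # Design columns: intercept, age, social, group, group x social
    rows.append([1.0, age, social, float(group_b), group_b * social])
    hours.append(y)

beta = fit_ols(rows, hours)
print(f"lift at social = 1: {engine_lift(beta, 1.0):.2f} h")
print(f"lift at social = 9: {engine_lift(beta, 9.0):.2f} h")
```

With a positive interaction coefficient the lift grows with the engagement score, which is the pattern the write-up reports. The actual magnitudes depend on the fitted coefficients, which are not given here.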
The rollout recommendation, on the strength of this finding, was not just "yes, roll out the engine." It was "yes, and the lift will show up most clearly in the socially engaged segment, which has implications for how to monitor the rollout and which cohort to watch first."
Evidence
- Group B mean hours watched: 4.81, Group A: 4.34, raw difference significant at p < 0.001
- Age imbalance between groups: Group A mean 38.9 years, Group B mean 36.2 years, difference significant at p = 0.005
- Adjusted regression confirmed group effect remains positive and significant after age control
- Group × social-engagement interaction term significant at p = 0.009, indicating differential lift by engagement level
- Model R² = 0.40 on the adjusted specification; the model explains about 40% of the variance in hours watched
- Conclusion: roll out the new engine, monitor the socially engaged segment for confirmation in production
The framing shift is what graduates the work from a test-result table into actionable analysis. A new recommender does not just have an average effect. It has a population of effects, and naming where the lift lives is what makes the rollout decision defensible.