Most growth-minded product managers will tell you that retention is the most powerful force in a business's growth model. When users stick around they create value not just for a single transaction, but for their whole lifetime. High retention can justify expensive user acquisition tactics. Variance in retention across cohorts can be an indicator for opportunity to optimize.
Most analytics tools include retention visualizations to help an analyst understand if users are sticking around. Mixpanel includes a useful retention report that looks something like this.
At the top you can see the average retention curve describing the product's overall stickiness for the average user acquired in the last 12 months. The table view shows monthly retention for every cohort. We can see that every month, 3,000 - 7,000 users are acquired, or enter a funnel for the first time. About 26% convert, or activate, within 1 month. In some months activation rate is higher than others, which can be a lagging indicator that we did something right and we should think about how to replicate that success in the future. After month 1, fewer and fewer users convert again.
Churn is a law of nature. All retention curves look something like what's shown above, but the most successful products flatline early after acquisition, keeping users retained at a high rate and increasing lifetime value.
Stack Overflow User Retention
We can reproduce this report with SQL. Here's an example using Stack Overflow answer data.
The cornerstone of any retention query is a log of events where we have a unique identifier for the user who did the event. In the example above we use the 14-year history of answers posted by Stack Overflow users as our event log. This dataset can be found in the table posts_answers. Here's some pseudocode that describes what the code above is up to:
1. Get all years where the user posted an answer
2. For each answer year, append the year where the user submitted their first answer
3. Use step 1 & 2 to determine interval since first answer
4. Get all first answer years, and count up users who continued to answer in subsequent years
5. Turn it into a percentage
When we run the code, we get a table of results that looks a lot like the table in the Mixpanel report. Here's that result exported into Sheets and styled a little bit.
A retention table is worth a thousand words. A few observations...
The data only goes up to 2022. But we see that new answerers seems to peak around 2015, and then again in 2020 (covid?)
Year 1 retention has dropped steadily since Stack Overflow was founded. Fewer and fewer users are contributing answers after the first year when they contribute an answer.
After 10 years on Stack Overflow, only about 1% of answerers continue to answer questions.
In fact, 10-year answerer retention is much lower than this, but the number is inflated by the earliest cohorts of users. Nearly 10% of 2008 answerers continued to answer questions in 2022!
London Bike Share Lifespan
Let's generate another retention table, this time using London bike share data.
This is basically the same code that was shared in the first example, but we're using a different data set: bigquery-public-data.london_bicycles.cycle_hire.
Here we see that by year 4 about 20% of bikes churn out of the system. And in 2022, a lot more bikes were added to the system – many of them are ebikes, which you can explore through the data.
This is just a tease of the things you can do with SQL to analyze retention and churn. These queries can be extended to analyze retention across segments, eg. regular bikes vs. ebikes, or to identify activities that correlate with increased lifetime value. (Do answerers stick around longer if one of their first answers was accepted? Yes.) No matter what data you're working on, retention analysis will come in handy.
Did you learn something from this post? See an error or have an observation? Leave a comment!