Best BigQuery Datasets
A list of the most useful public datasets available to query through BigQuery.
BigQuery is a tremendously powerful tool. Not just because you can use it as a SQL client for your own data (you can run SQL on your own CSVs, for example), but also because it's a tremendously useful tool for learning. BigQuery includes hundreds of public datasets that you can query for free (up to 1 TB per month). This is a great playground where you can practice your skills and test new functionality such as AI features which are regularly released by the Google team.
But not all of the datasets are useful. I would know. I've spent many hours poking at what's available. A lot of these datasets are outdated or thin. So many of them only include aggregated data, which limits the potential for fun and insights. A lot of these datasets are interesting, but they don't look anything like the data you might use at work.
There are some keepers in here! In this post I list my favorite datasets that you can find in BigQuery.
IMBD Reviews
100,000 real movie reviews written by IMDB users. This dataset includes unstructured free response text and also numeric scores on a scale of 1-10. This dataset is great for natural language processing and generative AI models which you can access through BigQuery.
Example query - Create a sample table, from Sentiment Analysis using AI
CREATE OR REPLACE TABLE `[project].[dataset].[table]` AS
SELECT review, movie_url, label, reviewer_rating
FROM `bigquery-public-data.imdb.reviews`
WHERE reviewer_rating IS NOT NULL AND RAND() < 0.001;
Stack Overflow
Probably the most usable dataset to practice product analysis. That's because this dataset looks a lot like what you might use at work. It includes a huge users
table and a few tables – questions, answers, and votes – that resemble event logs associated with users, so that we can query the data to understand product fundamentals like activation, retention, and user segmentation. Many of the tutorials shared through this website use Stack Overflow data as an example.
Example query – Signup >> Ask Funnel, from Funnels in SQL
DECLARE yr, conversion_window INT64;
SET (yr, conversion_window) = (2021, 30);
WITH users AS (
SELECT *
FROM `bigquery-public-data.stackoverflow.users`
WHERE EXTRACT(YEAR FROM creation_date) = yr
),
users_questions as (
SELECT u.display_name, u.id user_id, u.creation_date signup, COUNT(q.id) questions, MIN(q.creation_date) first_question
FROM users u
LEFT JOIN `bigquery-public-data.stackoverflow.posts_questions` q on q.owner_user_id = u.id
AND DATE_DIFF(q.creation_date, u.creation_date, day) <= conversion_window
GROUP BY 1,2,3
)
select
EXTRACT(MONTH FROM signup) month,
COUNT(user_id) new_users,
COUNT(DISTINCT CASE WHEN questions >0 THEN user_id ELSE NULL END) asked,
ROUND(COUNT(DISTINCT CASE WHEN questions >0 THEN user_id ELSE NULL END) * 1.0 / COUNT(user_id), 3) pct_asked
FROM users_questions
GROUP BY 1
ORDER BY 1 ASC
NYC Taxi Fares
A record of every yellow taxi ride in NYC since 2011, including trip distance, passenger count, fare, and tip amount. These datasets are pretty big, with 36 million rows for the 2022 dataset. Read more about this dataset.
Example query – NYC taxi fare deciles, from Histograms in SQL
SELECT APPROX_QUANTILES(total_amount, 9)
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022`
Machine Learning Datasets
Includes classic datasets used in machine learning 101 classes such as Irises, Credit Card Defaults, ULB Fraud Detection, and Penguins.
Example query – Distribution of penguin height, from Histograms in SQL
DECLARE bin INT64 DEFAULT 200;
DECLARE outlier FLOAT64 DEFAULT null;
SET outlier = (SELECT PERCENTILE_CONT(body_mass_g, outlier) OVER() from `bigquery-public-data.ml_datasets.penguins` limit 1);
WITH cleaned as (
SELECT
body_mass_g, outlier, species,
(case when body_mass_g > outlier then outlier else body_mass_g end) clean
FROM `bigquery-public-data.ml_datasets.penguins`
)
SELECT
FLOOR(clean / bin * 1.0) * bin flr,
CASE WHEN FLOOR(outlier) > (FLOOR(clean / bin * 1.0) * bin) AND FLOOR(outlier) < (FLOOR(clean / bin * 1.0) * bin + bin) then FLOOR(clean / bin * 1.0) * bin || ' - max' else FLOOR(clean / bin * 1.0) * bin || ' - ' || (FLOOR(clean / bin * 1.0) * bin + bin) end weight_range,
COUNT(*) AS penguins,
COUNT(CASE WHEN species = 'Adelie Penguin (Pygoscelis adeliae)' then 1 else null end) adelie,
COUNT(CASE WHEN species = 'Chinstrap penguin (Pygoscelis antarctica)' then 1 else null end) chinstrap,
ROUND(COUNT(*) * 1.0 / (SELECT COUNT(*) FROM cleaned),3) percent
FROM cleaned
GROUP BY 1,2
ORDER BY 1 ASC
Wikipedia
A massive database that includes traffic data for every single page on Wikipedia. This is a great dataset for you to practice PARTITION BY
. You'll need it, or else you'll run out of your free 1 TB fast.
Example query – Yesterday's most popular Wikipedia pages, from How to Use BigQuery
SELECT title, sum(views) views
FROM `bigquery-public-data.wikipedia.pageviews_2023`
WHERE DATE(datehour) > CURRENT_DATE() - 1
AND LOWER(title) NOT LIKE '%wikipedia%' AND LOWER(title) NOT LIKE '%cookie%' AND LOWER(title) NOT LIKE '%main%' AND LOWER(title) NOT LIKE '%search%'
GROUP BY 1
ORDER BY 2 desc
LIMIT 100
Hacker News
Like Stack Overflow and Wikipedia, this is another corpus of user generated content. There used to be a few more tables in here (the comments were especially fun), but now it's just a record of posts and votes.
Example query – Most popular posts about Steve Jobs
SELECT
f.id, f.title story_title, f.timestamp story_time, f.score story_score
FROM `bigquery-public-data.hacker_news.full` f
WHERE LOWER(f.title) like '%steve jobs%'
ORDER BY f.score DESC NULLS LAST
LIMIT 10
Google Trends
See what the world is searching for. Actually, what they searched for – this massive dataset is usually a few days stale. Still, it can be useful for real life use cases like planning a content calendar or establishing an SEO strategy.
Example query – Top 25 search terms in the UK
SELECT DISTINCT t.term, min(t.rank) rnk
FROM `bigquery-public-data.google_trends.international_top_terms` t
WHERE refresh_date = DATE_SUB(CURRENT_DATE(), INTERVAL 4 DAY) AND country_code = 'GB'
GROUP BY 1
ORDER BY 2 ASC
Baby Names
A comprehensive record of names given to babies in the United States, grouped on states. A fun dataset for trendspotting and prediction. Which grandma name will be popular next year?
Example query
SELECT
YEAR,
SUM(n.number) number,
COUNT(DISTINCT n.name) names,
SUM(case when n.name in ('Sam','Samuel','Samantha') then n.number else null end) sams
FROM `bigquery-public-data.usa_names.usa_1910_current` n
GROUP BY 1
ORDER BY 1 DESC
Iowa Liquor Sales
A record of every bottle of liquor sold in the state of Iowa since 2012, including details about the manufacturer and the store where the bottle was sold. Fun dataset for time series analysis and can be used for machine learning and forecasting. Read more.
Sample query – Products purchased per month
SELECT
DATE_TRUNC(date, month) month,
COUNT(*) items
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY 1
ORDER BY 1 DESC
NYC Citi Bike Trips
A record of all rides that occur on the largest bike share system in the US. At the time of this writing, it appears this dataset only goes through 2016.
Example query – Most popular Citi Bike stations
SELECT
start_station_name, start_station_latitude, start_station_longitude,
COUNT(*) AS num_trips
FROM `bigquery-public-data.new_york.citibike_trips`
GROUP BY 1,2,3
ORDER BY num_trips DESC
LIMIT 10
The Look
This is a fake ecommerce dataset that is somewhat useful for practice working on analysis problems you might find if you worked for a business that sells widgets online. This dataset includes a few tables such as inventory items, orders, users and distribution centers. Read more.
Example query – Distribution of orders per user
WITH oo as (
SELECT u.id, count(o.order_id) orders
FROM `bigquery-public-data.thelook_ecommerce.users` u
LEFT JOIN `bigquery-public-data.thelook_ecommerce.orders` o on o.user_id = u.id
GROUP BY 1
ORDER BY 2 DESC
)
SELECT orders, count(*) users
FROM oo
GROUP BY 1
ORDER BY 1 ASC