Best BigQuery Datasets

A list of the most useful public datasets available to query through BigQuery.

Best BigQuery Datasets

BigQuery is a tremendously powerful tool. Not just because you can use it as a SQL client for your own data (you can run SQL on your own CSVs, for example), but also because it's a tremendously useful tool for learning. BigQuery includes hundreds of public datasets that you can query for free (up to 1 TB per month). This is a great playground where you can practice your skills and test new functionality such as AI features which are regularly released by the Google team.

But not all of the datasets are useful. I would know. I've spent many hours poking at what's available. A lot of these datasets are outdated or thin. So many of them only include aggregated data, which limits the potential for fun and insights. A lot of these datasets are interesting, but they don't look anything like the data you might use at work.

There are some keepers in here! In this post I list my favorite datasets that you can find in BigQuery.

IMBD Reviews

100,000 real movie reviews written by IMDB users. This dataset includes unstructured free response text and also numeric scores on a scale of 1-10. This dataset is great for natural language processing and generative AI models which you can access through BigQuery.

Example query - Create a sample table, from Sentiment Analysis using AI

CREATE OR REPLACE TABLE `[project].[dataset].[table]` AS
SELECT review, movie_url, label, reviewer_rating
FROM `bigquery-public-data.imdb.reviews`
WHERE reviewer_rating IS NOT NULL AND RAND() < 0.001;

Stack Overflow

Probably the most usable dataset to practice product analysis. That's because this dataset looks a lot like what you might use at work. It includes a huge users table and a few tables – questions, answers, and votes – that resemble event logs associated with users, so that we can query the data to understand product fundamentals like activation, retention, and user segmentation. Many of the tutorials shared through this website use Stack Overflow data as an example.

Example query – Signup >> Ask Funnel, from Funnels in SQL

DECLARE yr, conversion_window INT64;
SET (yr, conversion_window) = (2021, 30);

WITH users AS (
SELECT *
FROM `bigquery-public-data.stackoverflow.users`
WHERE EXTRACT(YEAR FROM creation_date) = yr
),

users_questions as (
SELECT u.display_name, u.id user_id, u.creation_date signup, COUNT(q.id) questions, MIN(q.creation_date) first_question
FROM users u
LEFT JOIN `bigquery-public-data.stackoverflow.posts_questions` q on q.owner_user_id = u.id 
  AND DATE_DIFF(q.creation_date, u.creation_date, day) <= conversion_window
GROUP BY 1,2,3
)

select 
  EXTRACT(MONTH FROM signup) month,
  COUNT(user_id) new_users,
  COUNT(DISTINCT CASE WHEN questions >0 THEN user_id ELSE NULL END) asked,
  ROUND(COUNT(DISTINCT CASE WHEN questions >0 THEN user_id ELSE NULL END) * 1.0 / COUNT(user_id), 3) pct_asked
FROM users_questions
GROUP BY 1
ORDER BY 1 ASC

NYC Taxi Fares

A record of every yellow taxi ride in NYC since 2011, including trip distance, passenger count, fare, and tip amount. These datasets are pretty big, with 36 million rows for the 2022 dataset. Read more about this dataset.

Example query – NYC taxi fare deciles, from Histograms in SQL

SELECT APPROX_QUANTILES(total_amount, 9)
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2022`

Machine Learning Datasets

Includes classic datasets used in machine learning 101 classes such as Irises, Credit Card Defaults, ULB Fraud Detection, and Penguins.

Example query – Distribution of penguin height, from Histograms in SQL

DECLARE bin INT64 DEFAULT 200;
DECLARE outlier FLOAT64 DEFAULT null;
SET outlier = (SELECT PERCENTILE_CONT(body_mass_g, outlier) OVER() from `bigquery-public-data.ml_datasets.penguins` limit 1);

WITH cleaned as (
SELECT 
  body_mass_g, outlier, species,
  (case when body_mass_g > outlier then outlier else body_mass_g end) clean
FROM `bigquery-public-data.ml_datasets.penguins`
)

SELECT
  FLOOR(clean / bin * 1.0) * bin flr,
  CASE WHEN FLOOR(outlier) > (FLOOR(clean / bin * 1.0) * bin) AND FLOOR(outlier) < (FLOOR(clean / bin * 1.0) * bin + bin) then FLOOR(clean / bin * 1.0) * bin || ' - max' else FLOOR(clean / bin * 1.0) * bin || ' - ' || (FLOOR(clean / bin * 1.0) * bin + bin) end weight_range,
  COUNT(*) AS penguins,
  COUNT(CASE WHEN species = 'Adelie Penguin (Pygoscelis adeliae)' then 1 else null end) adelie,
  COUNT(CASE WHEN species = 'Chinstrap penguin (Pygoscelis antarctica)' then 1 else null end) chinstrap,
  ROUND(COUNT(*) * 1.0 / (SELECT COUNT(*) FROM cleaned),3) percent
FROM cleaned
GROUP BY 1,2
ORDER BY 1 ASC

Wikipedia

A massive database that includes traffic data for every single page on Wikipedia. This is a great dataset for you to practice PARTITION BY. You'll need it, or else you'll run out of your free 1 TB fast.

Example query – Yesterday's most popular Wikipedia pages, from How to Use BigQuery

SELECT title, sum(views) views
FROM `bigquery-public-data.wikipedia.pageviews_2023`
WHERE DATE(datehour) > CURRENT_DATE() - 1
AND LOWER(title) NOT LIKE '%wikipedia%' AND LOWER(title) NOT LIKE '%cookie%' AND LOWER(title) NOT LIKE '%main%' AND LOWER(title) NOT LIKE '%search%'
GROUP BY 1
ORDER BY 2 desc
LIMIT 100

Hacker News

Like Stack Overflow and Wikipedia, this is another corpus of user generated content. There used to be a few more tables in here (the comments were especially fun), but now it's just a record of posts and votes.

Example query – Most popular posts about Steve Jobs

SELECT
 f.id, f.title story_title, f.timestamp story_time, f.score story_score
FROM `bigquery-public-data.hacker_news.full` f
WHERE LOWER(f.title) like '%steve jobs%'
ORDER BY f.score DESC NULLS LAST
LIMIT 10

See what the world is searching for. Actually, what they searched for – this massive dataset is usually a few days stale. Still, it can be useful for real life use cases like planning a content calendar or establishing an SEO strategy.

Example query – Top 25 search terms in the UK

SELECT DISTINCT t.term, min(t.rank) rnk
FROM `bigquery-public-data.google_trends.international_top_terms` t
WHERE refresh_date = DATE_SUB(CURRENT_DATE(), INTERVAL 4 DAY) AND country_code = 'GB'
GROUP BY 1
ORDER BY 2 ASC

Baby Names

A comprehensive record of names given to babies in the United States, grouped on states. A fun dataset for trendspotting and prediction. Which grandma name will be popular next year?

Example query

SELECT
  YEAR,
  SUM(n.number) number,
  COUNT(DISTINCT n.name) names,
  SUM(case when n.name in ('Sam','Samuel','Samantha') then n.number else null end) sams
FROM `bigquery-public-data.usa_names.usa_1910_current` n
GROUP BY 1 
ORDER BY 1 DESC

Iowa Liquor Sales

A record of every bottle of liquor sold in the state of Iowa since 2012, including details about the manufacturer and the store where the bottle was sold. Fun dataset for time series analysis and can be used for machine learning and forecasting. Read more.

Sample query – Products purchased per month

SELECT  
DATE_TRUNC(date, month) month,
COUNT(*) items
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY 1
ORDER BY 1 DESC

NYC Citi Bike Trips

A record of all rides that occur on the largest bike share system in the US. At the time of this writing, it appears this dataset only goes through 2016.

Example query – Most popular Citi Bike stations

SELECT
  start_station_name, start_station_latitude, start_station_longitude,
  COUNT(*) AS num_trips
FROM `bigquery-public-data.new_york.citibike_trips`
GROUP BY 1,2,3
ORDER BY num_trips DESC
LIMIT 10

The Look

This is a fake ecommerce dataset that is somewhat useful for practice working on analysis problems you might find if you worked for a business that sells widgets online. This dataset includes a few tables such as inventory items, orders, users and distribution centers. Read more.

Example query – Distribution of orders per user

WITH oo as (
SELECT u.id, count(o.order_id) orders
FROM `bigquery-public-data.thelook_ecommerce.users` u
LEFT JOIN `bigquery-public-data.thelook_ecommerce.orders` o on o.user_id = u.id
GROUP BY 1 
ORDER BY 2 DESC
)

SELECT orders, count(*) users
FROM oo
GROUP BY 1 
ORDER BY 1 ASC