SELECT FROM WHERE - Basic SQL Queries

Start here to run your first SQL queries on free public datasets.

SELECT FROM WHERE - Basic SQL Queries

SELECT *

Run this. Return a list of all users on Stack Overflow.

SELECT *
FROM `bigquery-public-data.stackoverflow.users`

You may want to read this first: How to use BigQuery

Congratulations, you know SQL.

Seriously. SQL is copy / pasting and troubleshooting. This query is pretty simple so you probably won't need to troubleshoot, at least not yet.

The syntax for SQL is fairly straightforward. SELECT determines the columns you want to return. FROM accesses the dataset.

In this specific example, we are using the asterisk: *. This is a shortcut to return all of the columns in the dataset. The dataset we are using here is the users table in the stackoverflow database which is included in the bigquery-public-data dataset. 

Let's try the same query but with a different dataset.

SELECT *
FROM `bigquery-public-data.hacker_news.stories`

We chose these 2 tables for first queries because they're relatively small, so you won't burn through your free 1 TB using SELECT *. But in general be careful querying all columns. The more columns you return the more data you burn through.

For any query, you can analyze the output at a glance in the Results that are returned below your query in the BigQuery console. Or you can save the results to a CSV or to Google Sheets and then do calculations there.

SELECT Columns

SELECT display_name, reputation
FROM `bigquery-public-data.stackoverflow.users`

This query becomes about 10x cheaper now that we're only returning the columns we care about: display_name and reputation.

It is important that you have some “business sense” for what it is that you're querying. Data is never just data. It's a real record of an event or an entity that exists in real life. In this case, we're looking at user records for Stack Overflow, an online community website where users post questions and answers about software development (disclosure: I used to work there). 

Stack Overflow licenses all content under Creative Commons so anybody can use it for (almost) any purpose. And periodically they make a full data dump of all posts available for anyone to access. Occasionally this dump is uploaded to Google Cloud. That's what we're querying now.

We'll be using the Stack Overflow database a lot on this website, not because I used to work there, but because this is probably the most robust and realistic dataset publicly available through BigQuery. This dataset looks a lot like what you'll find at work. It has a real users table and we can analyze their behavior through their questions, answers, and votes. We can do segmentation and cohorting to identify user traits that correlate with levels of activity. We can measure activation and retention and chart trends. These are real analysis techniques that I regularly use at work.

SELECT FROM WHERE

OK so you've returned an entire table. Now let's filter that result. Add a WHERE statement to constrain the scope of what your query returns.

SELECT display_name, reputation
FROM `bigquery-public-data.stackoverflow.users`
WHERE display_name = 'samthebrand'

WHERE is how we filter. I know my username so I can find myself in this dataset.

We can include additional columns in the result to see even more user traits like name, reputation, location, and more.

SELECT display_name, reputation, location
FROM `bigquery-public-data.stackoverflow.users`
WHERE display_name = 'samthebrand'

WHERE: Comparison operators

We can use comparison operators to filter a dataset based on logic.

SELECT display_name, reputation, location
FROM `bigquery-public-data.stackoverflow.users`
WHERE reputation > 1000000

The query above returns all users where reputation is greater than one million. There are only a handful of generous Stack Overflow users who ever reached one million rep.

We found samthebrand with the = comparison operator. There are several other operators we can use.

>

Greater than

Work with numbers (normal comparison),

dates (date comparison),

and text (alphabetic comparison)





Special NULL value means "nothing here yet".  More on Null [here]

<

Less than

>=

Greater than or equal to

=<

Less than or equal to

=

Equal to

<> or !=

Not equal to

LIKE

Matches pattern

IS NULL

Is equal to NULL

IS NOT NULL

Is not equal to NULL

Here’s another example. This time we use a date to define the filter.

SELECT display_name, reputation, location, creation_date
FROM `bigquery-public-data.stackoverflow.users`
WHERE creation_date < ‘2008-08-01’

This query returns all 9 users who registered earlier than August 2008. I see some old friends here.

ORDER BY

Here is another fundamental building block of SQL. You will use ORDER BY all the time.

SELECT display_name, reputation, location
FROM `bigquery-public-data.stackoverflow.users`
ORDER BY reputation DESC

ORDER BY sorts a result by a column of our choice. It's just like Excel or Google Sheets when you click on the top of a column to sort a list alphabetically or chronologically. In the example above we sort the result by user reputation in descending order. This returns a list of the users with the most reputation. We could also sort by reputation ascending: ASC. (Nobody on Stack Overflow has less than 1 rep — that'd just be cruel.)

Here's another example.

SELECT display_name, creation_date, reputation, views, location
FROM `bigquery-public-data.stackoverflow.users`
ORDER BY creation_date ASC

Sorting can also be done on dates, or really any datatype. The query above returns all user accounts with the oldest up top.

LIMIT

Here's another bit of SQL you'll use all the time.

SELECT display_name, reputation, location
FROM `bigquery-public-data.stackoverflow.users`
ORDER BY reputation DESC
LIMIT 10

It's good practice to always use LIMIT when writing queries. This puts a cap on the number of rows in your result. In some environments this might save your butt in case you were to otherwise accidentally download a massive dataset that crashes your computer.

In the example above, we use ORDER BY and LIMIT to return the “top ten” users by rep. ORDER BY and LIMIT is a common query pattern that delivers the most or least or best records in a dataset.

SELECT WHERE ORDER BY LIMIT

Putting it all together...

SELECT display_name, reputation, location, about_me
FROM `bigquery-public-data.stackoverflow.users`
WHERE creation_date > ‘2022-01-01’
ORDER BY reputation DESC
LIMIT 10

We can combine what we've learned so far to write a query that returns the top users who meet all sorts of criteria. In the above query we filter to find the top 10 users by rep who registered their account since 2022.

AND & OR

We can combine comparison operators to build complex filters. 

SELECT display_name, reputation, location
FROM `bigquery-public-data.stackoverflow.users`
WHERE location = ‘Pakistan' OR location = ‘Mongolia'

In the query above we find all users whose Location is stored exactly as Pakistan or Mongolia (there’s more than 20,000 users who fit this description).

We can go even further, combining as many comparison operators as we want. And we can include filters across properties.

SELECT display_name, reputation, location
FROM `bigquery-public-data.stackoverflow.users`
WHERE (location = 'NYC' OR location = 'New York City') AND reputation > 10000

Here we find all users with more than 10,000 rep who say they live in NYC or New York City. Now you can see how it might be useful for a technical recruiter to learn SQL.

Be careful when using comparison operators. The computer reads operators in a specific order according to logical precedence, similar to PEMDAS. It is useful to always use parentheses to group similar statements.

And always check your work. Run the query and scan your result. Does the result match the parameters of your filter? If you see users in your result who have less than 10k rep then you did something wrong.

LIKE and NOT LIKE

LIKE and NOT LIKE are useful operators that let you specify results that do not exactly match a lookup value.

SELECT display_name, reputation, location, about_me
FROM `bigquery-public-data.stackoverflow.users`
WHERE location LIKE ‘%City’

The query above returns all users who have a location that ends in "City". So in your result you’ll see users from New York City, Ho Chi Minh City, Salt Lake City, etc.

When using LIKE you’ll probably want to also include a wildcard character: % to indicate where the matching starts and ends.

IN & NOT IN

Use IN to pattern match for a list of values.

SELECT display_name, reputation, location, about_me
FROM `bigquery-public-data.stackoverflow.users`
WHERE location IN ('Houston’,'houston')
ORDER BY reputation DESC
LIMIT 10

The query above will find the top users by reputation whose location is either Houston or houston.

LOWER

SELECT display_name, reputation, location, about_me
FROM `bigquery-public-data.stackoverflow.users`
WHERE LOWER(location) = 'houston'
ORDER BY reputation DESC
LIMIT 10

Here’s another way to get the same result.

Use LOWER to lowercase all values in the column. In the example above, we get the users whose lower-cased location is equal to ‘houston’. This will catch users whose location is Houston, housTon, HousTon, etc. This is a great tool to use when you are dealing with user generated content or unstructured data.

Combine operators LIKE, LOWER, and NOT IN to create a compound filter and clean up your result to your needs.

SELECT display_name, reputation, location, about_me
FROM `bigquery-public-data.stackoverflow.users`
WHERE location LIKE '%ston' AND LOWER(location) NOT IN ('houston','boston')

NULL

NULL is a special value that means “does not exist”. NULL is not equal to zero. Zero is a number. NULL is the absence of a number or a string or any other data type.

SELECT display_name, reputation, location, about_me
FROM `bigquery-public-data.stackoverflow.users`
WHERE location is NULL
ORDER BY reputation DESC
LIMIT 10

In the query above we find the top 10 users by reputation who have no location in their profile.

As you use SQL more and more you'll learn that NULL is cool in all sorts of ways, such as when it is safe to use COUNT or SUM (zeros count but they don't add up, whereas NULLs don't count and they don’t add up).

That's it! If you've made it this far you're about halfway done with SQL 101. Next up: aggregates (using group by & order by), but I haven't yet published this tutorial. Do you want it? Lemme know if so.

Other articles you can find on this website:

How to use BigQuery
Follow these instructions to run your first query.
Funnels with SQL
Don’t build funnels in SQL. Unless you have to.
Histograms with SQL
Even if all you do is sit around and send histograms through Slack, you’ll be a decent enough product manager.
Best BigQuery Datasets
BigQuery is a tremendously powerful tool. Not just because you can use it as a SQL client for your own data (for instance, you can use it to query a CSV that you upload), but also because it’s a tremendous tool for teaching. BigQuery includes hundreds of public datasets that