SQL is an essential tool in data science for managing and analyzing structured data. It allows data scientists to efficiently retrieve, filter, aggregate, and manipulate large datasets stored in relational databases. By writing SQL queries, data professionals can extract meaningful insights, clean and preprocess data, and integrate it with other tools like Python and R for further analysis. Mastering SQL is crucial for handling real-world datasets, as it enables efficient data extraction and transformation, forming the foundation for advanced data science workflows. The following are a few of the tasks a data scientist can perform using SQL,
Data extraction - Most real-world data is stored in relational databases.
Data cleaning - SQL helps in filtering, transforming, and handling missing data.
Feature engineering - You can derive new features and insights using SQL queries.
Exploratory data analytics (EDA) - Aggregations, joins, and window functions help in data analysis.
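As a quick illustration of data extraction, a query like the one below (against a hypothetical orders table with an order_date column) pulls only the rows and columns needed for analysis instead of exporting the whole table,
SELECT customer_id, order_date, total_price
FROM orders
WHERE order_date >= '2024-01-01';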
Let's take a look at the SQL features that are commonly used for data science,
Data Aggregation (GROUP BY, HAVING, COUNT, SUM, AVG, etc.)
Filtering & Cleaning Data (WHERE, CASE, COALESCE, NULL handling)
Joins for Merging Data (INNER, LEFT, RIGHT, FULL JOIN, UNION)
Subqueries & CTEs (Common Table Expressions)
Window Functions (RANK, DENSE_RANK, LAG, LEAD, ROW_NUMBER)
Handling Time-Series Data (DATE functions, time intervals, rolling averages)
Data Aggregation
We have discussed data aggregation in detail in our previous SQL articles, so I'm only going to include an example here. Data aggregation helps summarize large datasets. The example below finds the average salary and the maximum salary per department.
SELECT department, AVG(salary) AS avg_salary, MAX(salary) AS max_salary
FROM employees
GROUP BY department;
Let's see how to use the HAVING clause in this scenario,
SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department
HAVING COUNT(*) > 5;
The above example returns the departments that have more than 5 employees, along with their employee counts. The important thing to remember about filter conditions is that HAVING filters aggregated data while WHERE filters raw rows, so you cannot use WHERE on aggregated values.
Filtering and Cleaning Data
Filtering raw data can be done with the WHERE clause.
SELECT *
FROM employees
WHERE salary > 50000 AND department = 'IT';
Data cleaning is a crucial step in preparing your dataset for better analysis. Some of the most important data cleaning methods include,
Finding empty values
Replacing empty values with a default value
Removing duplicate values
Finding null or empty values is easy with a simple SELECT query.
SELECT * FROM employees WHERE name IS NULL;
You can do the same with any other column to check for null values. If you have a really large dataset, you can use COUNT() to find out whether a considerable portion of the data you need actually has values.
SELECT COUNT(*) FROM employees WHERE name IS NULL;
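A handy shortcut here is that COUNT(column_name) ignores NULLs while COUNT(*) counts every row, so the difference between the two gives the number of missing values directly,
SELECT COUNT(*) - COUNT(name) AS missing_names
FROM employees;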
Once you know whether there are any null values in your dataset, you can either remove those records or replace the null values with a default value. For the latter we use COALESCE.
SELECT name, COALESCE(email, 'No Email') AS email_status
FROM employees;
Duplicate values can be filtered out of query results using the DISTINCT keyword.
SELECT DISTINCT id FROM employees;
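Keep in mind that DISTINCT only removes duplicates from the query result; the underlying table is unchanged. To actually delete duplicate rows, one common approach (assuming id is a unique key and duplicates are defined by matching name and email; exact syntax support varies by database) is to keep only the lowest id in each group,
DELETE FROM employees
WHERE id NOT IN (
    SELECT MIN(id)
    FROM employees
    GROUP BY name, email
);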
Another useful feature is the CASE expression, which can be used to build custom logic for data transformation.
SELECT name,
CASE
WHEN age < 18 THEN 'Minor'
ELSE 'Adult'
END AS age_group
FROM users;
JOINS
Joins are a powerful feature in SQL, as combining tables is essential for relational database analysis. We have already discussed joins in detail in our previous article, which you can refer to here - https://themathlab.hashnode.dev/beginners-guide-to-sql-joins-and-table-relationships
For example, the query below finds customers who have made a purchase.
SELECT customers.name, orders.total_price
FROM customers
INNER JOIN orders ON customers.customer_id = orders.customer_id;
INNER JOIN combines rows from two tables only where there is a match. Now let's say we want to find the customers who have not made a purchase yet.
SELECT customers.name
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id
WHERE orders.order_id IS NULL;
LEFT JOIN returns all records from the first table, in this case the customers table, regardless of whether there's a match in the second table. By adding the WHERE clause here, we filter down to the customers who haven't made a purchase yet.
Subqueries and CTEs
We already discussed subqueries in our previous article - https://themathlab.hashnode.dev/what-is-a-sql-subquery-learn-with-examples
An example of a subquery would be,
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
An alternative to a subquery is a CTE (Common Table Expression), which gives better readability and organization.
WITH high_salary_employees AS (
SELECT name, salary
FROM employees
WHERE salary > 60000
)
SELECT * FROM high_salary_employees;
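To see how a CTE improves readability, here is the earlier average-salary subquery rewritten with a CTE,
WITH avg_sal AS (
    SELECT AVG(salary) AS avg_salary
    FROM employees
)
SELECT name, salary
FROM employees
WHERE salary > (SELECT avg_salary FROM avg_sal);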
Window Functions
Window functions are powerful SQL functions that perform calculations across a set of table rows related to the current row. Unlike GROUP BY, window functions do not collapse rows; they maintain individual rows while adding additional calculations. A window function has the following structure,
function_name() OVER (
PARTITION BY column_name
ORDER BY column_name
)
Here PARTITION BY divides the data into groups (this clause is optional), and ORDER BY defines the sequence in which the function is applied.
There are a few types of window functions, including ranking functions, aggregate window functions, LAG and LEAD, FIRST_VALUE and LAST_VALUE, and NTILE. Let's take a look at each of these.
Ranking Functions
There are three ranking functions which we are going to look at today,
RANK()
DENSE_RANK()
ROW_NUMBER()
These functions assign a rank to each row based on the ORDER BY clause in the window function. Now let's look at an example that uses all three ranking functions.
SELECT name, department, salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rankk,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rankk,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num
FROM employees;
Let's take a look at the results first and then discuss each function,
Let's look at the RANK() function first. We can see that the dataset has been partitioned, or grouped, by department and ordered by salary in descending order, and each row has been given a rank. You can see that Max and George have the same salary, which means their ranking is a tie. In that case, Carlos has been given rank 3 instead of 2, because RANK() skips ranking numbers after a tie.
If you observe the DENSE_RANK() results, the difference is that it doesn't skip ranking numbers when there is a tie.
Now if you observe the ROW_NUMBER() results, you can see that it always assigns a unique number to each row, even when there's a tie.
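A common data science use of ROW_NUMBER() is answering top-N-per-group questions. For example, combining it with a CTE gives us the top 3 earners in each department,
WITH ranked AS (
    SELECT name, department, salary,
           ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num
    FROM employees
)
SELECT name, department, salary
FROM ranked
WHERE row_num <= 3;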
Aggregate Window Functions
We already know what aggregate functions are in SQL, so what's the difference between them and aggregate window functions? Aggregate window functions compute the same results, but they don't collapse rows when displaying them. Let's see the difference with an example. Say we want to see the average salary per department. The normal aggregate function would look like this,
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
Now let's write the same thing as a window function,
SELECT department, name, salary,
AVG(salary) OVER (PARTITION BY department) AS avg_salary,
SUM(salary) OVER (PARTITION BY department) AS total_salary
FROM employees;
Now let's see the results of each query side by side,
So aggregate window functions compute aggregates without losing individual rows.
LAG() and LEAD()
These two functions help to compare the current row's value with the previous and next rows' values.
SELECT name, salary,
LAG(salary, 1) OVER (ORDER BY salary DESC) AS prev_salary,
LEAD(salary, 1) OVER (ORDER BY salary DESC) AS next_salary
FROM employees;
It's very straightforward as you can see: the above query returns the previous row's salary and the next row's salary along with the current row's salary. So LAG() retrieves the previous row's value and LEAD() retrieves the next row's value.
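A common use of LAG() is computing the difference between consecutive rows, for example the gap between each employee's salary and the previous (higher) one,
SELECT name, salary,
       salary - LAG(salary, 1) OVER (ORDER BY salary DESC) AS gap_to_previous
FROM employees;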
FIRST_VALUE() and LAST_VALUE()
These are similar to LAG() and LEAD(), but instead of the previous and next values, FIRST_VALUE() returns the first value and LAST_VALUE() returns the last value in the window. Note that the first and last values are taken per partition, not over the entire dataset.
SELECT employee_name, salary,
FIRST_VALUE(salary) OVER (PARTITION BY department ORDER BY salary DESC) AS highest_salary,
LAST_VALUE(salary) OVER (PARTITION BY department ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS lowest_salary
FROM employees;
The FIRST_VALUE() part of the query is basic and simple as you can see, but the LAST_VALUE() part includes a few keywords we haven't seen before.
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING defines the window frame in which LAST_VALUE() will operate. UNBOUNDED PRECEDING means it starts from the first row in the partition (the highest salary because of ORDER BY salary DESC). UNBOUNDED FOLLOWING means it extends to the last row in the partition (the lowest salary). This means the function sees all rows within the department's partition.
NTILE()
This function divides rows into (roughly) equal-sized buckets. NTILE(4), for example, splits the rows into quartiles numbered 1 through 4.
SELECT name, salary,
NTILE(4) OVER (ORDER BY salary DESC) AS salary_quartile
FROM employees;
Handling Time Series Data
Analyzing time-based data is crucial for forecasting. There are many SQL functions for managing date and datetime values. Some common functions include,
NOW(), CURRENT_DATE to get current date/time.
DATEADD(interval, value, date) and DATEDIFF(unit, date1, date2) for date arithmetic (exact names and argument order vary by dialect).
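For example, filtering to recent rows combines date functions with interval arithmetic. In PostgreSQL syntax, a last-30-days filter on a hypothetical orders table looks like this (interval syntax varies by dialect),
SELECT *
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days';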
Let's take a look at an example of monthly sales aggregation (DATE_TRUNC here is PostgreSQL syntax),
SELECT DATE_TRUNC('month', order_date) AS month, SUM(sales) AS total_sales
FROM orders
GROUP BY month;
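Rolling averages, mentioned earlier, combine ORDER BY with a window frame. For example, assuming the orders table holds one sales row per day, a 7-day rolling average could look like,
SELECT order_date,
       AVG(sales) OVER (
           ORDER BY order_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS rolling_7_day_avg
FROM orders;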
To summarize, in today's article we looked at some features available in SQL to support data analysis. Some of these are basic SQL features we covered in our earlier articles, but others, like window functions, are more advanced concepts we discussed for the first time today. In short, the following are the most crucial features of SQL for data science,
Data Extraction
Data Aggregation
Data Cleaning and Filtering
Feature Engineering
Exploratory Data Analytics
Subqueries and CTE
Window Functions
Handling Time Series Data
If you have a basic knowledge of these techniques, you can build from there: learn more advanced techniques, practice with large datasets, and soon you will be comfortable using SQL for your data science tasks.