In this article I’m going to delve deeper into data science functions in Excel. Excel has a ton of features to help data analytics as we’ve discussed before. Sure, there are plenty of other options out there today that we can use to accomplish all these tasks, but I always find Excel to be such a simple and basic tool that everyone should know, even if we rarely use it anymore.
Today we are going to look at statistical and data science functions in Excel.
Descriptive Statistics
Descriptive statistics summarize key aspects of data, such as mean, variability, and dispersion.
Mean is the average of a dataset.
Sales |
5000 |
7000 |
8000 |
6500 |
In order to find the average sales we can use AVERAGE() function.
Standard deviation measures by how much data varies from the mean. There are two functions to calculate the standard deviation. STDEV.P() and STDEV.S()
Its crucial to know the difference. In STDEV.P(), the P stands for population. That means this function should be used if you have the data representing the entire population you are interested in studying. On the other hand STDEV.S() means standard deviation of a sample taken from that population.
Lets calculate the standard deviation of the above sales dataset,
If you get a higher value for standard deviation (away from zero) that means your dataset is spread out. If you get a lower value for standard deviation, closer to zero that means you data stays around the mean value.
Variance is the square of the standard deviation and shows the degree of spread.
Again high variance means data is widely spread and low variance means data points are closer to the mean.
Correlation
Correlation helps identify relationships between variables. Lets take the below dataset and see if we can find a relationship between the hours studies and the exam score
Hours Studied | Exam Score |
5 | 70 |
6 | 75 |
7 | 80 |
8 | 90 |
The range of value for correlation is from -1 to +1
If the value you get is closer to -1 that means there’s a strong negative connection between the two variables. If the value is 0 then there’s no connection. If the value is closer to +1 then there’s a strong positive connection.
The above dataset returns 0.98 value which means there’s a positive connection between the two variables.
Forecasting and Predictive Analytics
Forecasting helps to predict future data based on historical data. For example if we wanted to predict the sales for 2024 based on previous years sales. There are two functions in Excel that we can use for predicting future values.
Predicting Future Values Using Exponential Smoothing: FORECAST.ETS()
Year | Sales ($) |
2020 | 5000 |
2021 | 7000 |
2022 | 9000 |
2023 | ??? |
Here we use the target date (2023), values (known sales data range), timeline (year data range). Based on that we get the value 11000 as the sales for 2023.
Predicting Future Values Using TREND()
Year | Revenue ($) |
2019 | 10,000 |
2020 | 15,000 |
2021 | 20,000 |
2022 | ??? |
Lets use this function to predict the revenue for 2022 in the above dataset.
Based on the existing data we get the value 25000 as the revenue for 2022.
Lets do a practice exercise to summarize what we learned today,
Year | Revenue ($) | Ad Spend ($) |
2020 | 50,000 | 10,000 |
2021 | 65,000 | 15,000 |
2022 | 80,000 | 20,000 |
2023 | ? | 25,000 |
Tasks:
Find Mean, Variance, and Standard Deviation of revenue
Calculate the correlation between Ad Spend & Revenue
\=CORREL(C2:C5,B2:B5)
\=1
There’s a strong positive correlation between the two variables
Predict 2023 revenue using TREND() & FORECAST.ETS()
\=TREND(B2:B4,A2:A4,A5)
\= 95,000
\=FORECAST.ETS(A5,B2:B4,A2:A4)
\=95,000
That’s all for todays article. Hope you learned something new :)