August 31, 2021
SQL Server Data warehouse? Sounds boring. Who would want to learn this? But what if I tell you that data analysis is one of the sought-after skills today?
Netflix raised its value to $50 billion in 2020 despite the pandemic because of data-driven decisions. Even more, 40% of companies are planning to increase their budgets on data-driven marketing. And what does this all mean to you and me? Cha-ching! Yes, more jobs are available for data analysts and scientists. And you know what? A data warehouse is at the core of all this. And learning this is also the start of your journey to these worthwhile careers.
In this article, you’re going to learn about making a data warehouse using SQL Server. Moreover, SQL Server is one of the best choices for a data warehouse in case most of your transactional data sources use relational databases. If you’ve also been doing SQL database stuff for the past years, this should be easy for you. You can also try to follow the example described below on your own PC. Simply download and install the SQL Server Community Edition for free and let us begin.
A data warehouse is the central repository of information for data analysis, artificial intelligence, and machine learning. Data flows from different data sources like transactional databases. The data is also updated regularly to make informed decisions on time.
The illustration for a typical data warehouse environment is shown below.
The first part of the diagram is the sources of data. These are databases from transactional systems. It can be in SQL Server or another relational database. It can also be from flat files like CSVs, Excel, XML, and text files.
Afterwards you consolidate all the needed data from the source into a single format called the staging area. For simplicity, you can also implement the staging area in SQL Server.
Then, the SQL Server database with a dimensional model is the data warehouse. We will discuss how to make one with an example later.
The final part of the diagram is different data marts. A data mart focuses on one aspect of the business, like sales, purchasing, and more. We are going to make a data warehouse with one data mart about sales of insurance policies later.
SQL Server data warehouse needs to be modeled for efficient processing. The next topic will be about this.
Operational system databases are designed to be normalized for efficient storage and retrieval. But a data warehouse is structured a bit differently. Before we proceed with the structures or schema of data warehouses, let us discuss a few key terms in the model.
Fact table contains all the facts about a business entity or process. It is at the center of the schema surrounded by dimensions. A fact table may be about sales, tickets support, projects, and more. You can implement this as a SQL database table. Columns include the ID keys of dimensions and measures.
Each record in the fact table will determine how detailed a fact table is. There can be several fact tables in a data warehouse defining different business processes in one data warehouse. Each of them can share dimensions about location, date, and more.
Dimension categorizes facts and measures in a fact table. For example, city or region dimension describes the location of a customer in a sales transaction. Other examples of dimensions are customer and product in a sales business. Dimensions also enable users to answer a business question. For example, “how much did we earn from Product X this month?” In this question, Product is the dimension of a Sales fact.
Dimension is implemented as a table referenced by the fact table. It includes a primary key and the key description or name, for example, a product ID and a product name. Though, more can be defined within a dimension to categorize it and further build a hierarchy. For example, product category and subcategory describe a product.
Dimension’s primary key can be different from the primary key of the source table. This happens when a table of customers from one database is combined with a table of customers from another. It is also called a surrogate key.
Measure is a property of the fact table that allows calculation. This can be sum, average, count, minimum, or maximum. For example, you can sum sales amounts to form total sales.
Measures can be additive, non-additive, semi-additive, or calculated. The sales amount is an additive measure. You can sum or average it. But unit price is non-additive. It may not make sense if you sum it. Meanwhile, a calculated or computed measure is like its name. Total sales amount, for example, is calculated based on product unit price + tax.
The simplest and the most widely used dimensional model is a star schema. It has the fact table at the center and the dimensions surrounding it. It can also be described as a parent-child table design. The fact table is the parent while the dimensions are the children. But since it’s so simple, there are no grandchildren.
Common characteristics of star schema include:
Advantages of star schema include:
In a snowflake schema, dimension tables are normalized. The physical structure resembles a snowflake shape. Compared to a parent-child design, snowflake schemas can have grandchildren.
Common characteristics of snowflake schema include:
Advantages of snowflake schema include:
Time to put the concepts above to practical use. In this example, we will use a fictitious company called ABC Insurance Co. The company sells fire insurance policies for residential houses, apartments, and business structures.
Our data warehouse example will have these simple characteristics:
Output for this step:
Your stakeholders have questions in mind. Your role is to provide the answers to those questions so they can make informed decisions.
In our example, we only need to answer how many sales were made in a particular period. Of course, there are more. But to make a simple demonstration of the concepts we have learned, we will only answer this question. I leave it to your analytical minds how to apply it to others.
To get the answers, pay attention to the current state of the system and the desired outcome. Ask for report formats they need. Then, proceed to the next step, which is discussed next.
Output for this step:
The transactional database contains all the currently available information. For this example, we assume that all the information we need can be found in the source database. If there is missing information, you must go back to your stakeholders. Then, resolve the matter separately. Then, go back to this step.
After seeing the source database, identify what tables and columns you need. You don’t need everything. If you need to clean the data, identify the steps you need to do it. You may need to clarify some parts of the data from the stakeholders.
Now, let’s assume that we already have what we need. Below you can find a diagram of the database staging area.
At this point, you need to plan on how to get the data to the staging area. After this, you’re ready for the next step. But before we do that, I think this question deserves to be answered. Why create a separate database for the staging area?
Good point. You may ask what’s wrong with getting the data straight from the transactional database? Our example uses only 1 database source. In the real world, you don’t just deal with sales. You can have other systems for purchasing, petty cash, payroll, and more. If these have separate databases, and you want to analyze them as well, this staging area may be good for them too.
How would you know? Ask yourself whether there is information that these systems can share. If yes, consolidating them into one staging area will be an advantage. One example of something that they can share is an employee list.
Another point is data cleansing. You don’t want to touch a working transactional system. So, you clean the data in the staging area. And one more point is the precalculation of aggregates. Do you need to do some complex calculations or summarization before reaching the data warehouse? You can also do that in the staging area.
Finally, we have reached the focal point of this article. And here’s what we are going to do: we are going to create a new database for the data warehouse.
Output for this step:
To create a new database for the data warehouse, launch SQL Server Management Studio. Then, in the Object Explorer, right-click the Databases folder and select New Database. Name your database and set the database options. We named ours as fire_insurance_DW.
Now, the empty database needs new tables. And the first table you create is the fact table. For our fire insurance sales example, we have the structure as shown below.
The fact table above includes 3 additive measures: premium, other_charges, and total_amount_paid. Meantime, total_charges is a computed measure based on premium + other_charges.
Please pay also attention to the foreign keys client_id, building_city_id, product_id, and statement_date. They will reference dimension tables later.
Next, create the dimension tables. We have product, client, city, and date dimensions. Each serves a purpose in reporting. The below table shows all the dimensions in our data warehouse example.
Let us describe each dimension in more detail.
Check the final database diagram of our data warehouse below.
Doing data analysis doesn’t end in creating the database for the data warehouse. So, what are the next several steps?
What we mean here is extracting data from the source database to the staging area and, finally, to the data warehouse. Before you extract data, do not forget to create the field mappings from the source and target. You can find an example of fact table mappings below.
For the date dimension, you also need a script to generate data. The sample SQL code below will build a date table from 2020 to 2021. It uses the dimDate dimension table that we have in the data warehouse.
DECLARE @StartDate date = '01/01/2020'; DECLARE @EndDate date = '12/31/2021'; ;WITH seq(n) AS ( SELECT 0 UNION ALL SELECT n + 1 FROM seq WHERE n < DATEDIFF(DAY, @StartDate, @EndDate) ), d(d) AS ( SELECT DATEADD(DAY, n, @StartDate) FROM seq ), src AS ( SELECT [transaction_date] = CONVERT(date, d), [year] = DATEPART(YEAR, d), [month_number] = FORMAT(d,'MM'), [year_month_number] = FORMAT(d,'yyyy-MM'), [year_month_short] = FORMAT(d, 'yyyy-MMM'), [month_name_short] = FORMAT(d,'MMM'), [month_name_long] = FORMAT(d,'MMMM'), [day_of_week_number]= DATEPART(WEEKDAY, d), [day_of_week] = DATENAME(WEEKDAY, d), [day_of_week_short] = FORMAT(d,'ddd'), [quarter] = 'Q' + CAST(DATEPART(QUARTER,d) AS NCHAR(1)), [year_quarter] = CAST(YEAR(d) AS NCHAR(4)) + '-Q' + CAST(DATEPART(QUARTER,d) AS NCHAR(1)), [week_number] = DATEPART(WEEK, d) FROM d ) INSERT INTO dimDate SELECT * FROM src ORDER BY transaction_date OPTION (MAXRECURSION 0);
If you need more years, simply change the start and end dates in the script.
Then, you need an ETL tool for creating the workflow of the extraction, and a scheduling tool to automate the extraction. You can use the SQL Server Integration Services with SQL Server Agent or a cloud solution like Skyvia.
Finally, you can build the reports and dashboards your stakeholders asked for. You may use Excel because they are probably familiar with it. You can also use Power BI or SQL Server Reporting Services.
A possible report output for the data warehouse we've built is shown below. It uses Power BI to show product sales per period. A few more reports are possible with the data warehouse, like client sales or sales based on location.
Analyzing your data is a journey. It can be a long journey depending on the current state of your corporate information. But like Netflix, it will be worth it.
In this article, you have learned how to build a SQL Server data warehouse from scratch. The example is simple, however, it covers most basic needs of the data warehouse.
Was our article useful? If yes, then please share it on your favorite social media platforms.