Data Warehouse Incremental Load - Part 3

by Nicolae Guse 4. November 2023 13:22

Propagating the changes at Data Warehouse Level

What can I say, time flies and here we are at the 3rd and last part of the article series dedicated to Incremental Load.

The good news is that, thanks to using tools like a Delta Engine, we can now know what changes have occurred in the Staging table and need to be propagated to the Data Warehouse Fact table. Now, in theory, things should be quite straightforward:

  • In an IDEAL world, your Staging transaction table is the one which gives the granularity to your fact table

    • This would mean that any records which are marked with OPERATION_TYPE = ‘D’ in our example DELTA_STG_INVOICE_DTL should be directly deleted from the corresponding FT_INVOICE table

    • And ALL records from the DELTA_STG_INVOICE_DTL with OPERATION_TYPE IN (‘I’,’U’) should go through the regular loading script for the Fact table, which will apply all necessary business rules and, when everything is ready, load them via a MERGE statement into the final FT_INVOICE table, and that’s it (a minimal sketch of these two steps follows this list)

    • Sounds deceivingly simple, doesn’t it? 😊

  • In the REAL world, things are NEVER that simple, and here are some reasons why:  More...
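To make the ideal-world scenario above a bit more concrete, here is a minimal T-SQL sketch of the two steps. Only DELTA_STG_INVOICE_DTL, FT_INVOICE and OPERATION_TYPE come from the example above; the key column INVOICE_DTL_ID and the measure AMOUNT are hypothetical, and the business-rule transformation step is deliberately left out:

    -- Step 1: remove from the fact table the rows deleted at source
    DELETE FT
    FROM FT_INVOICE AS FT
    INNER JOIN DELTA_STG_INVOICE_DTL AS D
        ON D.INVOICE_DTL_ID = FT.INVOICE_DTL_ID      -- hypothetical key column
    WHERE D.OPERATION_TYPE = 'D';

    -- Step 2: insert or update the rows flagged as 'I' / 'U' at source
    MERGE FT_INVOICE AS FT
    USING (
        SELECT INVOICE_DTL_ID, AMOUNT                -- in reality, the output of the regular loading script
        FROM DELTA_STG_INVOICE_DTL
        WHERE OPERATION_TYPE IN ('I', 'U')
    ) AS SRC
        ON FT.INVOICE_DTL_ID = SRC.INVOICE_DTL_ID
    WHEN MATCHED THEN
        UPDATE SET FT.AMOUNT = SRC.AMOUNT
    WHEN NOT MATCHED THEN
        INSERT (INVOICE_DTL_ID, AMOUNT)
        VALUES (SRC.INVOICE_DTL_ID, SRC.AMOUNT);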

Tags:

Data Warehouse Best Practices

Data Warehouse Incremental Load - Part 2

by Nicolae Guse 16. September 2023 13:32

Looking at the Delta Engine approach in more detail

 

Some General Considerations

First and foremost, one prerequisite when talking about Delta Engines is to already have an Atomic Level implementation, as we’ve already discussed in the Data Warehouse Atomic Level article.

• As soon as you start aggregating data, it becomes much more difficult to push up changes from the underlying transaction tables, as you may end up having a single aggregated line that combines both data that changed and data that didn’t

    • Therefore, in order to make the Incremental Load approach also work for aggregated tables, you would need a link table containing the mapping between:

        • The Primary Key from the Detailed Transaction Table (N)

        • The Primary Key of the Aggregated table (1)

    • This type of approach based on Link tables would still allow you to use the DELTA tables to determine the impacted lines from the Aggregated table that need to be reprocessed (see the sketch below)  More...
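As a rough illustration of the Link table idea above, here is a minimal T-SQL sketch. All object and column names (LNK_INVOICE_AGG, INVOICE_DTL_ID, INVOICE_AGG_ID) are hypothetical; DELTA_STG_INVOICE_DTL and OPERATION_TYPE are reused from the same article series. The sketch only shows how the DELTA table and the Link table can be joined to find the aggregated rows to reprocess:

    -- Hypothetical link table: N detailed keys map to 1 aggregated key
    CREATE TABLE LNK_INVOICE_AGG (
        INVOICE_DTL_ID  BIGINT NOT NULL PRIMARY KEY,  -- PK of the Detailed Transaction Table (N)
        INVOICE_AGG_ID  BIGINT NOT NULL               -- PK of the Aggregated table (1)
    );

    -- Aggregated rows impacted by the current delta, to be reprocessed
    SELECT DISTINCT L.INVOICE_AGG_ID
    FROM DELTA_STG_INVOICE_DTL AS D
    INNER JOIN LNK_INVOICE_AGG AS L
        ON L.INVOICE_DTL_ID = D.INVOICE_DTL_ID
    WHERE D.OPERATION_TYPE IN ('I', 'U', 'D');

The resulting list of INVOICE_AGG_ID values is what drives the reprocessing of the Aggregated table.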

Tags:

Data Warehouse Best Practices

Data Warehouse Incremental Load - Part 1

by Nicolae Guse 20. August 2023 21:32

Today we are embarking on a journey to explore the challenges related to building a Data Warehouse Incremental Load solution. But let's take it step by step, and see where it goes.

General Context

A while back, in the article Real Time Data Warehouse - Is it really possible and do we really need it?, we discussed some of the challenges faced when moving towards a Real Time Data Warehouse architecture. This is still looked upon as the Holy Grail of Data Warehouse design: the ability to pick up changes in your Source Application and to immediately propagate them to the Data Warehouse system and to your reporting Application.

Before moving forward, let's clarify a bit where this need for Real Time Data Warehousing comes from:

• Is it useful for real Business Intelligence decision making by a company's Top Management?

    • On that I would typically answer with a BIG NO. You need to properly understand the data you are looking at, you need to discuss it, and take informed decisions. And whichever way you look at it, this is not a Real Time process.

    • Having the latest invoice from your ERP system continuously flowing into your Data Warehouse shouldn't have a significant impact on the decisions your Top Management is taking.

• Or is it rather useful for Operational Activities?  More...

Tags:

Data Warehouse Best Practices

Custom Row Versioning - Part 2

by Nicolae Guse 15. August 2014 18:16

Together we've discovered in Custom Row Versioning - Part 1 the main challenges we were facing in order to address the very hot topic of concurrent data access. Now we will focus on the technical solution that we've implemented for the SQL Server 2008 R2 Database Engine. However, as you will see, the solution is pretty straightforward and a similar approach could be applied under different database engines.

What I've realized while working on this data concurrency problem is that the final solution would involve a Row Versioning mechanism that would have to work with the Partition Switch approach (sketched below). This meant that we should be able to handle any INSERT/UPDATE/DELETE operations that would be reflected in the final table, as a result of the table Partition Switch, without:

- being blocked by an ongoing user query on that FACT / DIMENSION table

- blocking any new user report launched while the refresh operation was in progress or completing

 More...
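Since the excerpt above mentions the Partition Switch approach without showing it, here is a minimal T-SQL sketch of the switch itself, with hypothetical table names (FT_SALES, FT_SALES_STAGING, FT_SALES_OLD) that are assumed to share the same structure, indexes and partition scheme. This is not the full Custom Row Versioning solution described in the article, only the operation that the solution has to wrap:

    BEGIN TRAN;

    -- Move the current content of partition 3 out of the fact table
    -- into an empty archive table with an identical structure
    ALTER TABLE FT_SALES SWITCH PARTITION 3 TO FT_SALES_OLD PARTITION 3;

    -- Move the freshly loaded partition from staging into the fact table
    ALTER TABLE FT_SALES_STAGING SWITCH PARTITION 3 TO FT_SALES PARTITION 3;

    COMMIT TRAN;

The catch, and the reason this article exists, is that the SWITCH requires a schema modification lock on the fact table, so without an additional versioning layer it can be blocked by, and can block, the user queries described above.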

Tags:

Data Warehouse Best Practices

Custom Row Versioning - Part 1

by Nicolae Guse 13. August 2014 14:30

 

I'll never stop being amazed by the importance of stepping outside of your comfort zone. Even if most of us don't like doing this, often without realizing it, once you start having that warm feeling of comfort, the technical progress starts slowing down and, sooner or later, you're bound to be surprised by what's just around the corner. But let’s dig into this and you'll better understand what I mean.

 

A few years back we were using Microsoft Analysis Services as a core backbone component of our Business Intelligence system, on top of a Data Warehouse built on the Microsoft SQL Server Database Engine. We were using MOLAP cubes containing aggregated data, built especially for decision making. And for those users needing to perform operational reporting, we were building a SQL Server database containing all the details they needed. On top of both worlds we were using an in-house-built reporting solution, based on Office Web Components. Perhaps not the most versatile reporting solution, but it was getting the job done.

 

And I would have said that everyone lived happily ever after, but it wasn't quite so: some of the users were finding the custom reporting solution subpar compared to other reporting solutions on the market. As time went by, and while using the tool in more and more complex ways, some of the customers found the lack of a more advanced user interface a very serious limitation. In addition to this, Microsoft stopped their support for Office Web Components, which was mainly designed for Analysis Services 2000. Since back then we had already migrated to Analysis Services 2005, some of the more advanced features were no longer supported.

 

More...

Tags:

Data Warehouse Best Practices

Microsoft Analysis Services Tips and Tricks

by Nicolae Guse 10. February 2013 15:02

Recently I've encountered some pretty nasty errors in Microsoft Analysis Services 2008, which gave me the inspiration for this article. It's about "Operation has been cancelled due to memory pressure" and sometimes its close companion “The lock operation ended unsuccessfully because of deadlock”. And they are exactly the kind of errors which really make you think "how deep the rabbit hole goes".

Well, you'd say, what more is there to dig into? Isn't it obvious that the server doesn't have enough memory? Indeed, but before rushing to buy some more memory and hoping that everyone lives happily ever after, I'd say it's pretty useful to understand the various scenarios which may lead there, since buying more RAM will not always solve your problems.

First things first. These errors are usually the last symptoms of more complex problems occurring in the background, so you should really spend some time identifying the actual scenarios which lead to these problems. For example, it's one thing to start getting them during User Querying and a totally different thing when you are performing dimension processing using the ProcessUpdate option. And here is why:

More...

Tags:

Data Warehouse Best Practices

Data Warehouse Atomic Level

by Nicolae Guse 1. September 2012 11:43

There is an old saying which states "You can't unscramble scrambled eggs". Surprisingly, the same thing applies also to Data Warehouse Atomic level.

Before taking a deep dive into the subject, we should clarify what this Atomic Level stands for: it means adding to the fact table the finest level of granularity available in the Source application, in order to allow full tracing from the Data Warehouse back to the originating data source (see the illustration below).

I think this is the moment where we should take a step back and start thinking about how the Data Warehouse appeared in the first place.

Usually, it all starts with a business request coming from one department, which asks for a report containing certain information. Slowly these report requests start to pile up and they form the foundation for what will become the company Data Warehouse. From a technical point of view, very different paths are available:

1.      Design the fact tables just to contain information for the dimensions that the business user requested. Everything related to having the atomic level in the fact table is additional work, which in the end translates into additional costs that the users won't necessarily agree to pay, especially if the disadvantages aren't clearly stated or if there is huge business pressure to deliver the end result very fast. You typically end up in this scenario when the IT department starting the Data Warehouse doesn't have enough "weight" or experience to really influence the end solution, and you end up having the customer dictate the technical solution. Which may come back to haunt you later on.

More...
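Purely as an illustration of what keeping the Atomic Level looks like, here is a hypothetical fact table definition (all names are invented for the example) where the finest grain of the source application, the invoice line, is preserved:

    -- One row per source invoice line, with the source keys kept for tracing
    CREATE TABLE FT_INVOICE (
        INVOICE_LINE_SK      BIGINT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
        SRC_INVOICE_ID       VARCHAR(50)   NOT NULL,  -- invoice number in the source application
        SRC_INVOICE_LINE_NO  INT           NOT NULL,  -- line number in the source application
        CUSTOMER_ID          INT           NOT NULL,
        PRODUCT_ID           INT           NOT NULL,
        INVOICE_DATE         DATE          NOT NULL,
        AMOUNT               DECIMAL(19,4) NOT NULL
    );

Because SRC_INVOICE_ID and SRC_INVOICE_LINE_NO are kept, any aggregated figure can be traced back to the originating source rows; drop them, and the eggs from the saying above can no longer be unscrambled.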

Tags:

Data Warehouse Best Practices

Data Warehouse Coding Guidelines and More - Part 2

by Nicolae Guse 15. January 2012 11:57

In the first part of the article we’ve talked quite a bit about Data Warehouse Coding Guidelines. We all go through a love and hate relationship with rules in general, and Coding Guidelines are no exception. But the whole point is that they are really necessary if you want to have a consistent Data Warehouse. Having consistent coding rules is just a part of the solution, since, in order to ensure consistency, elements like ETL (in our case SSIS) development rules, Source Control and an overall BI Department Development Strategy come into play. Always remember that a chain is only as strong as its weakest link, and Data Warehouses are no exception. And now, let’s go into detail about each one of the above components:

SSIS DEVELOPMENT

1.       Implement extensive usage of Configuration Files and Tables. As long as you have different environments to manage, one of the worst ideas is to have the package connection strings contain hard-coded values, since this will involve manual modifications before each deployment to a new environment. Instead, you should go for the following approach:

a.       Use a configuration file or an environment variable which contains the connection string to the server holding the SSIS Configuration table corresponding to that environment. This configuration will be used when you launch the SSIS package.

b.      In order to initialize all connection managers declared at package level, we should use the SSIS configuration table, which contains the values corresponding to that environment. In this way, each time we need to retrieve the value of a connection manager or variable, we just add a new line to this table. The SSIS configuration table can hold values for different SSIS packages, and therefore greatly improves the flexibility of the solution (the default structure of such a configuration table is sketched below).

More...
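For reference, the configuration table mentioned in points 1.a and 1.b is typically the one generated by the SSIS Package Configurations wizard; its default structure is shown below, followed by a purely hypothetical sample row (the filter name, server, database and connection manager are invented for the example):

    -- Default structure generated by the SSIS Package Configurations wizard
    CREATE TABLE [dbo].[SSIS Configurations] (
        ConfigurationFilter  NVARCHAR(255) NOT NULL,
        ConfiguredValue      NVARCHAR(255) NULL,
        PackagePath          NVARCHAR(255) NOT NULL,
        ConfiguredValueType  NVARCHAR(20)  NOT NULL
    );

    -- Hypothetical entry: one environment-specific connection string
    INSERT INTO [dbo].[SSIS Configurations]
        (ConfigurationFilter, ConfiguredValue, PackagePath, ConfiguredValueType)
    VALUES
        ('LOAD_FT_INVOICE',
         'Data Source=MyDwServer;Initial Catalog=DWH;Integrated Security=SSPI;',
         '\Package.Connections[DWH].Properties[ConnectionString]',
         'String');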

Tags:

Data Warehouse Best Practices

Data Warehouse Coding Guidelines and More - Part 1

by Nicolae Guse 6. January 2012 16:19

Since I’ve started the series of articles dedicated to Data Warehouse Best Practices, we've been talking a lot about focusing on reusable components like the Partitioning Mechanism, Implementing a Business Rule Engine or a Generic Import Mechanism. And since the most reusable component is the code itself, I thought it would be a good idea to prepare an article dedicated to the Coding Guidelines to be used for Data Warehouse solutions based on Microsoft SQL Server.

We will start by discussing SQL Server development in Part 1 of the article, continue in Part 2 with SSIS development and Source Control, and end up with a Development Strategy for a Business Intelligence department.

SQL SERVER DEVELOPMENT

First things first. There are quite a few SQL Server Coding Guidelines out there, and one of the best I’ve come across is the SQLServerGuidelines written by Pinal Dave. I highly suggest you read it, since it contains a lot of valuable information for a developer.

Besides these excellent guidelines, which are applicable to any SQL Server development, I’ve tried to add some of my own, based on my experience so far and more specific to Data Warehouse developments based on Microsoft SQL Server.

1.       Normalize the database structure to the 3rd normal form, at least for the Actual Data Warehouse database. This approach allows greater flexibility when you need to regroup data at Data Mart level. For the Data Mart level, the key to obtaining better performance is de-normalization, since it greatly reduces the number of joins to be performed, especially when using a ROLAP Presentation Layer.

 

2.       Avoid using INSERT INTO / SELECT * in queries. Instead, use specific column names in the SELECT statement. In this way we avoid any unintended impact caused by adding new columns to the source table.

 

3.       Avoid using an INSERT statement without a column list, since this could cause significant problems when new columns are added or the order of the existing columns is changed in the table (both rules are illustrated below).

 More...
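To make rules 2 and 3 concrete, here is a short example with hypothetical table names (STG_INVOICE, FT_INVOICE), contrasting the fragile pattern with the recommended one:

    -- Fragile: breaks, or silently shifts data, when columns are added
    -- or reordered in STG_INVOICE
    INSERT INTO FT_INVOICE
    SELECT *
    FROM STG_INVOICE;

    -- Recommended: explicit column lists on both sides
    INSERT INTO FT_INVOICE (INVOICE_ID, CUSTOMER_ID, INVOICE_DATE, AMOUNT)
    SELECT INVOICE_ID, CUSTOMER_ID, INVOICE_DATE, AMOUNT
    FROM STG_INVOICE;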

Tags:

Data Warehouse Best Practices

What is Master Data Management?

by Nicolae Guse 18. December 2011 22:54

Before going into the Master Data Management topic in detail, I think we should take a step back and try to understand how the concept of Master Data appeared, and where it fits inside a company's IT infrastructure.

Since I’ve always liked a good story, let’s think a bit about the following one:

• There was once a small company, driven by a few enthusiastic people who really believed they could make a great product which could really have an impact. At the beginning, the IT system of the company was very simple, covering just their basic needs, and allowed them to move fast on the market, since it had to cover just a few scenarios.

Since money was a scarce commodity at the beginning, part of the IT tools used were freeware products, part of them bought, and part of them built by the very people who put the company together. It's that stage where everyone is doing everything in order to kick-start the whole business. This means that, at this initial stage, the IT system doesn't have to be perfect, but only good enough: just the bare necessities which you strictly need to cover in order to build and sell a great product.

More...

Tags:

Data Warehouse Best Practices


Disclaimer

The ideas you'll find on this site represent the opinion of their owner. They are neither about nor endorsed by a specific company. This site is a place for sharing ideas and trying to find, together, best practices for Data Warehouse and Business Intelligence projects.

Some of the technical approaches will be praised, some of them will be criticized, based on the concepts presented by their owners. You shouldn't feel offended if some of the ideas which either you or your company consider best practices are criticized here. You should also realize that what some people consider best practices on a particular system and business scenario can be worst practices on another. If you don't agree with a particular idea, feel free to comment on it in a civilized manner.

Only by clashing different ideas and experiences will we find a better way of doing Data Warehouse and Business Intelligence projects.