To Have or Not to Have a Physical Data Mart

by Nicolae Guse 16. October 2011 21:51

Some time ago I was involved in a rather heated debate with a BO Universe Designer about whether a Physical Data Mart is needed at all. That debate gave me the idea for this article.

First of all, let's compile a list of arguments against having a Physical Data Mart when the Presentation Layer uses ROLAP technology:

  1. It adds an additional layer on top of the Data Warehouse, which slows down the development process when you need to add a few new details at Presentation Layer level.
    1. Normally, as long as you don't have a Physical Data Mart, you end up with a growing snowflake design, where you can add all the tables you need, from all Data Warehouse layers, and, after performing the necessary joins, get the details you want very fast. This approach removes the dependency between the Data Warehouse development team and the Reporting team for every small development, and increases the speed with which user requests are resolved.
  2. It creates unnecessary data duplication by de-normalizing the data from the Actual Data Warehouse, mostly in order to generate the flat dimension tables required for development (see the sketch after this list).
  3. It moves the design of the Presentation Layer away from the Conformed Dimension principle, in which dimensions are shared between Data Marts, by focusing instead on de-normalized dimension tables designed around each Data Mart's specific needs. By contrast, the same REF table can be used within different Data Marts, since each one encapsulates an atomic business concept, as a result of the Normalization process.
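To make the trade-off concrete, here is a hypothetical sketch (schema, table and column names are all invented) of how the same report question looks against a snowflaked Logical Data Mart versus a flat dimension in a Physical Data Mart:

    -- Snowflaked Logical Data Mart: the report walks a chain of normalized REF tables
    SELECT rf.FAMILY_NAME,
           SUM(f.SALES_AMOUNT) AS SALES_AMOUNT
    FROM   dwh.FACT_SALES      f
    JOIN   dwh.REF_PRODUCT     rp ON rp.PRODUCT_ID     = f.PRODUCT_ID
    JOIN   dwh.REF_SUBCATEGORY rs ON rs.SUBCATEGORY_ID = rp.SUBCATEGORY_ID
    JOIN   dwh.REF_CATEGORY    rc ON rc.CATEGORY_ID    = rs.CATEGORY_ID
    JOIN   dwh.REF_FAMILY      rf ON rf.FAMILY_ID      = rc.FAMILY_ID
    GROUP BY rf.FAMILY_NAME;

    -- Physical Data Mart: the same question against a single de-normalized dimension
    SELECT d.FAMILY_NAME,
           SUM(f.SALES_AMOUNT) AS SALES_AMOUNT
    FROM   dm.FACT_SALES  f
    JOIN   dm.DIM_PRODUCT d ON d.PRODUCT_KEY = f.PRODUCT_KEY
    GROUP BY d.FAMILY_NAME;

The snowflake version can be extended with one more join whenever a new detail is requested, which is exactly why it feels so fast to develop against; the flat dimension needs an ETL change first, but keeps the Presentation Layer query a simple star join.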

Well, then we're in a pickle, since these are pretty strong arguments. Except that they are not the full picture. Which really made me think: what are the arguments in favour of a Physical Data Mart?

  1. Having a Snowflake design, implemented in the Presentation Layer application as a Logical Data Mart, is a deceptively simple solution. And I would say that the key word here is deceptively. Because, while it solves a lot of problems, mainly by reducing the time in which a change request is implemented, it also generates a whole lot of problems which are not as obvious at first:
    1. You don't have a distinct separation between the Presentation Layer and the Data Warehouse.

As long as you are mixing data from the Import Layer, the Work Layer and the Actual Data Warehouse in the Presentation Layer application, you will no longer have a clear line of separation between the area which is accessible to the users and the internal Data Warehouse processes which are refreshing this data. And you can bet that this is a recipe for disaster in the mid and long term.

Sooner or later, as your user population grows, you will start to notice that when the users are running their reports, SELECT locks appear at row level and intersect directly with the internal Data Warehouse processes in charge of refreshing the data. And the very next effect will be deadlocks. And here is the beginning of the road to perdition, which will involve:

    i. Endless discussions with the DBA team, which will say that they can do nothing on their side, since this is an application design problem
    ii. Upset users, irritated by the fact that their reports randomly freeze during refreshes, who want the problem fixed yesterday
    iii. Crashed refresh processes on the Data Warehouse, which, when re-launched, can cause even more downtime for the users

Does it sound like a plausible scenario? I would say so, and the bad thing about it is that you'll have no magical fix for this problem. It will involve going back to the drawing board to re-design the whole solution, which will:

    iv. Take even more time
    v. Upset the users even more, because not only is there no quick fix for their problem, but they'll have to struggle with the issues even longer, until IT finds and implements a solution
    vi. Sharply decrease user confidence in the competency of the IT department. If they are the experts, how could they let this situation happen?
    vii. Confirm once more that a quick fix in the short term is a problem waiting to happen in the long term
    viii. Generate all sorts of philosophical questions inside the IT department:

      1. How could this happen, when all we've done is design a system which answers user development requests faster than anything else? Do these users really not know what they've requested? Can't they understand that there is no progress without some sacrifice?

And now comes the heavy answer:

  • IT has to move on from the "customer is the king" mentality. The customer will always see the direct problems he's confronted with. IT's role is to have the vision and to always see the bigger picture, to predict which problems a particular design will lead to, and to be able to explain the advantages and disadvantages of each solution to the customer. And then you may find that the customer is not as absurd as you might think.

On this subject, the best quote I could find comes from Henry Ford: "If I'd asked customers what they wanted, they would have said 'a faster horse'."

  • This will require a commitment from the IT department management to this approach, which might make them unpopular in the short term, but will result in increased user confidence in the IT department in the mid and long term.

But what about the magic brought by the aura of those quick fixes and bonuses? Well, that's the tricky part: you'll have to let it go. And this is why it's so hard to do the right thing.

  2. Having a Physical Data Mart creates a clear separation between the Presentation Layer and the rest of the Data Warehouse. It involves a degree of data duplication with the Actual Data Warehouse, but you will no longer have interference between the users' reports and the internal processes of the Data Warehouse.
    1. The only interference will occur when you try to refresh the Data Mart while the users are querying it. If you go with the standard way of refreshing it, using INSERT / UPDATE / DELETE or MERGE, you will still end up with data consistency issues. But, in order to fix this, there is a magical solution just around the corner: Table Partitioning. It's far easier to implement Table Partitioning on the smaller number of tables in the Data Mart than to implement it throughout all the layers of the Data Warehouse. But more on Table Partitioning in a separate article.
    2. It simplifies the design of the Presentation Layer a lot by implementing a Star Design. It's by far simpler to have 1 dimension table for a dimension with 10 levels than to have 10 joined REF tables.

And, by the way, the conformed dimension is not really an issue, since nothing prevents you from creating a generic dimension (ex. DIM_TIME) which is shared by all Data Marts. You should have a separate dimension for each Data Mart only when the same business concept doesn't have exactly the same meaning between Data Marts. For example, when you have 2 distinct Data Marts, like Sales and Marketing, you might notice that the Customer concept isn't quite the same. If there is no way of convincing them to have a shared Customer dimension, then you will end up creating a DIM_SALES_CUSTOMER and a DIM_MARKETING_CUSTOMER. And that's OK, since this is a real-life scenario.
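As a purely hypothetical illustration (again, all names are invented), DIM_TIME stays conformed and is joined by both Data Marts, while each mart keeps its own view of the Customer:

    -- Sales Data Mart: conformed DIM_TIME plus a mart-specific Customer dimension
    SELECT t.FISCAL_MONTH, SUM(s.SALES_AMOUNT) AS SALES_AMOUNT
    FROM   dm.FACT_SALES         s
    JOIN   dm.DIM_TIME           t ON t.TIME_KEY     = s.TIME_KEY
    JOIN   dm.DIM_SALES_CUSTOMER c ON c.CUSTOMER_KEY = s.CUSTOMER_KEY
    GROUP BY t.FISCAL_MONTH;

    -- Marketing Data Mart: the same shared DIM_TIME, but its own Customer dimension
    SELECT t.FISCAL_MONTH, SUM(m.CAMPAIGN_COST) AS CAMPAIGN_COST
    FROM   dm.FACT_MARKETING          m
    JOIN   dm.DIM_TIME                t ON t.TIME_KEY     = m.TIME_KEY
    JOIN   dm.DIM_MARKETING_CUSTOMER  c ON c.CUSTOMER_KEY = m.CUSTOMER_KEY
    GROUP BY t.FISCAL_MONTH;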

    3. By having the data grouped in Physical Data Marts, it's much easier to ensure data consistency. If, for example, you have 2 different Business Modules, like Accounting and Sales, one of the biggest challenges will be to keep their data consistent with each other.

Let’s suppose that, due to the implemented business logic, some of the Accounting data must reach the Sales module. As long as you have a classical precedence, where Accounting data is processed first, and Sales afterwards, you will have a moment in time when those fact tables are out of sync. And, since the Data Warehouse is used for Decision Support, this can be a little annoying. Or very, very bad, depending on whom you’re asking.

In order to keep the data in sync, one of the best solutions I've come across is the wide use of partitioning. If you have a Physical Data Mart, you can perform all the calculations at Data Mart level in TMP tables which are not exposed to the users, and then perform, in a single transaction, all the meta-data switches to the final tables. This is possible because all the data you need for the Presentation Layer is spread horizontally, in a single layer. Since this is a meta-data operation, it is very fast and ensures that the users always access consistent data sets.
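A minimal sketch of such a refresh, assuming SQL Server-style partition switching and invented table names (the TMP, final and holding tables must share the same structure and partition scheme, and the target partitions must be empty):

    BEGIN TRANSACTION;

        -- 1. Switch the current contents of the refreshed partition out of the
        --    user-facing fact tables into empty holding tables.
        ALTER TABLE dm.FACT_ACCOUNTING
            SWITCH PARTITION 3 TO dm.FACT_ACCOUNTING_OLD PARTITION 3;
        ALTER TABLE dm.FACT_SALES
            SWITCH PARTITION 3 TO dm.FACT_SALES_OLD PARTITION 3;

        -- 2. Switch the freshly calculated data from the TMP tables into the
        --    now-empty partitions of the user-facing tables.
        ALTER TABLE dm.TMP_FACT_ACCOUNTING
            SWITCH PARTITION 3 TO dm.FACT_ACCOUNTING PARTITION 3;
        ALTER TABLE dm.TMP_FACT_SALES
            SWITCH PARTITION 3 TO dm.FACT_SALES PARTITION 3;

    COMMIT TRANSACTION;

Since only meta-data changes hands, the switches are near-instant, and because both fact tables flip inside the same transaction, a user query sees either the old Accounting and Sales data or the new set, never a mix of the two.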

When implementing a Logical Data Mart, the problem is much more complicated; in fact, so complicated that not even the mother of all table partitioning can help you anymore. And this happens because the Presentation Layer constantly accesses data from all Data Warehouse levels, from Import to Work to the Actual Data Warehouse. As long as you have independent refresh processes for the various Business Modules, it's almost impossible to use partitioning to ensure Data Consistency, since data moves both horizontally and vertically in the Data Warehouse. You can see this whenever data moves vertically, from the Import Layer to the Work Layer: while the Import tables have already been updated, the associated Work Layer data has not yet been. Use both in the Presentation Layer, and you have a data consistency issue.

Instead of a conclusion on this topic, I'll point to the refrain of one of Faithless' songs: "You don't need eyes to see, you need vision".

And now back to you: which path would you choose? And what are the pros and cons of each approach?

Tags:

Data Warehouse Best Practices

