At the crossroads: MOLAP vs HOLAP vs ROLAP - Part 2

by Nicolae Guse 20. November 2011 23:01

I've been doing some research lately regarding MOLAP, ROLAP and HOLAP storage modes under Microsoft Analysis Services, and I've found 2 very interesting articles on the Microsoft site that I highly recommend:

Now, regarding the 3 major storage modes, let's go into detail about the scenarios for which each of them is best suited, based on their implementation in Microsoft Analysis Services. Most of the characteristics listed here are generic and match most implementations from various vendors, while others, like the management of ROLAP aggregations, are missing from other vendors' products:

MOLAP

Advantages:

  1. Very good query performance, obtained by pre-aggregating data into the multi-dimensional structure during cube processing.
  2. Lower I/O, due to data compression applied when data is transferred from the RDBMS to the MOLAP storage.
  3. The Analysis Services engine can use the Partition Slice to determine, during the execution plan phase, which cube partitions need to be queried in order to satisfy the user query.
  4. High independence from the relational database which holds the fact and dimension tables. Since the relational database is touched only during cube processing, the entire fact table can be re-created using minimally logged operations like SELECT INTO in order to speed up ETL processes, without any impact on the queries users run against the cube.
  5. The cube partitioning feature works excellently in tandem with the SQL Server table partitioning feature, which allows the design of high-performance and elegant ETL solutions. Gone are the days when you had to create a FACT_SALES table for each day; now you can have a single FACT_SALES table, with a partition for each day (see the sketch after this list).
  6. Incremental processing can limit the performance problems of a full process on a large cube, by processing just the data that has changed.
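
To make points 4 and 5 concrete, here is a minimal T-SQL sketch of that daily load pattern; the table, column and partition function names are illustrative, and in practice the staging table must also match the indexes and filegroup of the target partition:

    -- Re-create one day of data with a minimally logged SELECT INTO (point 4),
    -- then switch it into the partitioned FACT_SALES table (point 5).
    SELECT *
    INTO   dbo.FACT_SALES_STAGE
    FROM   dbo.SALES_SOURCE
    WHERE  Sales_Date = '20111120';

    -- The check constraint proves to SQL Server that the staged rows fit the target partition.
    ALTER TABLE dbo.FACT_SALES_STAGE
        ADD CONSTRAINT CK_FACT_SALES_STAGE_Date
        CHECK (Sales_Date = '20111120' AND Sales_Date IS NOT NULL);

    -- Metadata-only operation: the staged day becomes one partition of the single fact table.
    ALTER TABLE dbo.FACT_SALES_STAGE
        SWITCH TO dbo.FACT_SALES PARTITION $PARTITION.pf_Sales_Date('20111120');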

Disadvantages:

  1. Performance is not that good when handling non-aggregatable data.
  2. Scalability issues when going beyond 5 TB; full processing starts to take days at this size.
  3. Cube processing time is the highest of the 3 storage modes, since both the leaf data and the aggregations are held in the MOLAP storage.
  4. High latency, since a MOLAP cube is connected to the relational database only during cube processing. If the fact table data changes after the cube is processed, the refreshed data won't be visible to the users until the next cube process.

In practice, I've seen excellent performance on MOLAP cubes while handling aggregated data, and I think that the key word here is aggregated. When built to satisfy the information needs of management, who work with highly aggregated data, the MOLAP storage is the best choice available. In practice, in order to reach such a good level of performance, often even the Data Mart feeding the cube contains aggregated fact tables, calculated using SUM() and GROUP BY over the keys that will form the cube's main axes of analysis. As long as you're using hierarchical dimensions with just a few levels and tens of thousands of members, rather than millions, the MOLAP cube will be very fast.
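
As an illustration, such an aggregated fact table could be built with a statement along these lines (a hedged sketch; the table and column names are hypothetical):

    -- Pre-aggregate the detailed fact table over the keys that will form the cube's main axes of analysis.
    SELECT  Sales_Date_Key,
            Product_Key,
            Store_Key,
            SUM(Sales_Amount)   AS Sales_Amount,
            SUM(Sales_Quantity) AS Sales_Quantity
    INTO    dbo.FACT_SALES_AGG      -- minimally logged load, as in the MOLAP advantages list above
    FROM    dbo.FACT_SALES
    GROUP BY Sales_Date_Key, Product_Key, Store_Key;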

MOLAP cubes are inherently slow when handling non-aggregatable data, or when working with Junk Dimensions. A Junk Dimension is a dimension which contains no hierarchies, only attributes, without any precedence constraints between them. One such example would be an Invoice Detail dimension, with attributes like Invoice No, Invoice Line, Business Department and Group Account. If you want a single cube that serves everyone's needs, from Top Management to the Accounting department, you will definitely need such Junk Dimensions, and they will slow down performance just from the sheer volume of data that has to be read and returned to the users' Front End applications.
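
For illustration only, such an Invoice Detail Junk Dimension could be stored as a plain attribute-only table (hypothetical DDL, not a prescribed design):

    -- No hierarchy levels, just independent attributes at invoice line grain.
    CREATE TABLE dbo.DIM_INVOICE_DETAIL
    (
        Invoice_Detail_Key  bigint      NOT NULL PRIMARY KEY,
        Invoice_No          varchar(30) NOT NULL,
        Invoice_Line        int         NOT NULL,
        Business_Department varchar(50) NOT NULL,
        Group_Account       varchar(30) NOT NULL
    );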

New features like Proactive Caching have been implemented in order to reduce data latency issues, by triggering an automatic process of the affected cube partitions as soon as the underlying fact table changes. To reduce data latency even more, user queries can be diverted directly to the relational fact table while the affected partition is being processed. As soon as the partition process finishes, the traffic is redirected back to the MOLAP storage. Such behavior is a very good example of how the storage modes used today are more and more flexible, and allow the Backend server to decide when it is best to switch from MOLAP to ROLAP and back to MOLAP.

Additionally, MOLAP storage under Analysis Services 2008 R2 is being pushed beyond Microsoft's traditional limits. While Microsoft itself declared that MOLAP may not be the best choice over 5 TB, I've seen a video from Day 1 of the SQLPass 2010 conference where Yahoo presented a 12 TB MOLAP cube, which they intended to scale up to 40 TB. All of this with 80% of the queries running under 10 seconds. At this point, I can only say: RESPECT!!!

ROLAP

Advantages:

  1. Lowest data latency of the 3 storage modes. Users access the relational fact table data directly when browsing the cube.
  2. Aggregations are implemented automatically as indexed views in MSAS when an Aggregation Design is applied. Here we have the advantage of automatically creating indexed views which match the user query patterns. Additionally, DWH developers can create generic indexed views which can be used by different cubes; in fact, MSAS is not aware of these indexed views and lets the SQL Server engine decide when to use a particular one (see the sketch after this list). The indexed views created as a result of commands issued by MSAS are re-created during ROLAP cube processing, which does add a processing overhead.
  3. A viable option for simple cubes, with large data volumes and low dimensional complexity.
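
To give an idea of what such an aggregation looks like at the SQL Server level, here is a minimal sketch of a generic indexed view (illustrative names; SCHEMABINDING and COUNT_BIG(*) are mandatory for indexed views, and the summed column is assumed NOT NULL):

    -- A generic aggregation over day and product, materialized by the unique clustered index.
    CREATE VIEW dbo.V_SALES_BY_DAY_PRODUCT
    WITH SCHEMABINDING
    AS
    SELECT  Sales_Date_Key,
            Product_Key,
            SUM(Sales_Amount) AS Sales_Amount,
            COUNT_BIG(*)      AS Row_Count
    FROM    dbo.FACT_SALES
    GROUP BY Sales_Date_Key, Product_Key;
    GO

    CREATE UNIQUE CLUSTERED INDEX IX_V_SALES_BY_DAY_PRODUCT
        ON dbo.V_SALES_BY_DAY_PRODUCT (Sales_Date_Key, Product_Key);

On editions that support automatic indexed view matching, aggregate queries over these keys against dbo.FACT_SALES can be answered from the view, whether they come from MSAS or from a hand-written report.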

Disadvantages:

  1. Limited to SQL syntax, which is not very well suited to handling complex multi-dimensional queries. This can result in MDX queries being translated on SQL Server into nested subqueries using SELECT SUM()...GROUP BY, which can lead to significant performance problems (illustrated after this list).
  2. The best performance is obtained by building partitions using table binding instead of query binding, which does result in significant maintenance overhead. This is necessary in order to take advantage of the indexed views. We cannot use table partitioning on a single fact table to add / remove partitions, since doing so automatically triggers the re-creation of all affected indexed views.
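
As a rough illustration of point 1, the SQL pushed down to the relational engine often ends up shaped like nested aggregations of the following kind (the real statements generated by Analysis Services will be more complex):

    -- The outer query re-aggregates an already aggregated inner result set.
    SELECT  Product_Key,
            SUM(Sales_Amount) AS Sales_Amount
    FROM
    (
        SELECT  Product_Key,
                Store_Key,
                SUM(Sales_Amount) AS Sales_Amount
        FROM    dbo.FACT_SALES
        GROUP BY Product_Key, Store_Key
    ) AS Sales_By_Store
    GROUP BY Product_Key;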

Overall, ROLAP cubes can be a very good alternative to MOLAP cubes, especially when talking about large data volumes, low dimensional complexity, and data that does not aggregate well. SQL Server is more scalable than Analysis Services, especially in terms of processing power (256 vs 64 logical CPUs). When properly tuned, performance can be on par with, or better than, the equivalent MOLAP implementation, while having the big advantage of better scalability over large volumes of data.

You should be aware that there are very different implementations of ROLAP storage out there, and some of them will hit the limits of this storage type more quickly than others. In the end, browsing a ROLAP cube will result, no matter the implementation, in an SQL query. Practically, the ROLAP server component (Microsoft Analysis Services, BO Server) encapsulates a semantic layer which must be able to translate the user's actions in the Front End component into an SQL query. This often results in sub-optimal queries being generated, which brings its own performance problems. The MDX language used for querying MOLAP cubes in Analysis Services is better suited to handling complex dimensional queries than SQL is.

Handling data aggregations in different ROLAP implementations can be a very hot topic, not necessarily because of the way it is implemented, but rather because of the lack of it. For example, Business Objects doesn't support automatic aggregation creation, which generates a BIG administrative overhead in order to avoid the performance drop driven by the lack of aggregations. However, Business Objects, like many other products, does have some tricks up its sleeve, and one of them is scheduling the report refresh. In this way, data is processed at given intervals, and the users connect to the processed reports. From this moment on, as long as the users remain within the data perimeter of the report, all data requests are handled directly from the BO Server cache rather than from SQL Server, which results in a BIG performance increase. Overall, this sort of approach sounds a bit like MOLAP to me, where the users are no longer connected to the relational database.

However, pre-processing reports is not always a viable option. If you need to refresh data very often and data latency is an issue, then scheduling won't help. Also, using features like Business Objects Scheduled Report Refresh to improve performance will not help the users who really need to perform ad-hoc querying. Remember the Business Analysts? The reality is that it's up to the IT department to determine the structure of the user base, and the balance between the usage of pre-processed reports and ad-hoc usage. Real-life scenarios also show that you will have a lot more users consuming standard / pre-processed reports, refreshed on a daily basis, than users performing ad-hoc reporting, who need the latest possible data.

HOLAP - looks like the best compromise between the MOLAP and ROLAP storage modes. Data is kept in the relational database, while aggregations are calculated and stored in multi-dimensional storage in order to improve browsing performance. As BI products evolve, more vendors implement HOLAP-specific features in their ROLAP products in order to improve performance. Therefore, even if I haven't seen large implementations of HOLAP cubes under Microsoft Analysis Services, it could prove itself to be the hidden gem in the OLAP trunk.

Beyond these traditional OLAP storage modes, very interesting things have started to happen. A new type of storage, called In-Memory OLAP, has taken the BI world by storm. While most BI products use in-memory calculations to speed up data retrieval, the big change brought by In-Memory OLAP is that the actual data processing is performed on the client rather than on the source server. Once the necessary data set is retrieved from the source server, all subsequent processing is done in the client application, which can work in disconnected mode. This is very different from working with traditional cubes, where each user action results in an MDX statement which is sent to the server, processed there, and the data returned to the client application.

Examples of such applications are Tableau and Microsoft Excel PowerPivot. Microsoft uses Excel PowerPivot as a pilot project for a new column-oriented storage engine called Vertipaq, which uses column storage instead of the traditional row storage and can result in great performance gains when working with big volumes of data (e.g. hundreds of millions of rows). This type of column storage has already found its way into SQL Server 2012. The biggest promise of products like Microsoft Excel PowerPivot is Self-Service BI. As we discussed in Part 1 of the article, there is always a Shadow BI system running parallel to the official BI. Users will always have needs that cannot be quickly satisfied by the BI department, since satisfying them involves budgets and departmental resources. Tools like Microsoft Excel PowerPivot, which are based on SharePoint connectivity, allow these solutions to be developed by the people who need them and then shared with others, while access can be managed via the existing SharePoint administration infrastructure, which allows such tools to expand in a controlled manner.
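
For reference, in SQL Server 2012 this column storage surfaces as the nonclustered columnstore index; below is a minimal sketch with illustrative names (note that in the 2012 release the table becomes read-only while such an index exists):

    -- Column-oriented storage on the fact table, aimed at large scan-and-aggregate queries.
    CREATE NONCLUSTERED COLUMNSTORE INDEX IX_FACT_SALES_COLUMNSTORE
        ON dbo.FACT_SALES (Sales_Date_Key, Product_Key, Store_Key, Sales_Amount, Sales_Quantity);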

Does this Self-Service BI sound a bit threatening for traditional BI departments everywhere? Are we all going to lose our jobs? Well, not quite, I would say. This Shadow BI system existed anyway, and there was very little IT control over it. Now at least we have a better way to establish its dimensions, and to decide what and when to integrate into the official system. And I'm pretty sure that there will always be a need for the Data Accuracy and One Version of the Truth provided by official BI projects.

Now, back to the crossroads where we started: which one of the 3 storage modes is the best? Well, in a PERFECT world, where all vendors would have implemented all 3 of them, I would say that the best storage mode is the one which suits your project. And that is because different projects have different:

· Categories of users, and different proportions between them

· Balance between ad-hoc reporting and scheduled reports

· Data volume - what is suited for a cube / Data Mart of a few GB is not necessarily suited for one of a few TB

· Data latency requirements - for some projects it's OK to have the data refreshed on a daily basis, while for others the latency of the whole process, from the source system to the Data Warehouse and cube refresh, must be absolutely minimal, as close to Near Real-time OLAP as possible

· What-If analysis needs - for some projects, like Forecasting and Budgeting, the ability to perform What-If analysis via an integrated writeback feature is a MUST, while for others it is completely unnecessary.

In the REAL world, companies buy BI platforms which have both Front End and Backend servers. Far too often you are severely limited by the storage mode implemented in that Backend server, with little to no room to move around. And this is why you should be aware of all the topics we've discussed, in order to make the right choice. Also, take into account that bio-diversity is important: I don't think there is a perfect BI tool out there which is best at doing everything. Therefore, you should choose the tool which gives you the most flexibility, since projects can be very different. If such a choice is not possible, then you should buy the one which best suits your current needs, and don't hesitate to add other BI platforms in the future, as long as you make sure there is perfect interoperability between them, so that you can put all the pieces together.

And now, back to you: what are your experiences with the various OLAP storage modes? What do you think about the new In-Memory OLAP trend? Do you prefer to have different BI platforms within the same company, or just One Tool to rule them all?

Tags:

Data Warehouse Best Practices

Comments (2) -

Ramdas
3/26/2013 8:00:03 PM #

Excellent article, detailed info on different cube storage especially addressing cubes with Terabytes of data.

geuth
6/18/2013 4:51:36 PM #

Thanks a lot for this article ! It is a real pleasure to read about issues we face everyday as BI managers.


