Sunday, July 22, 2012

Configurations should be easy...

Surprisingly they are not.

Last week David and I spent about two and a half hours discussing how we wanted to set up the infrastructure of our data warehouse.  We are only on our second iteration, and we figured the earlier we could get the structure in place, the better.  Part of that infrastructure will be capturing row counts and transfer speeds so we can establish our baselines.  For example, after this iteration we should know the fastest our data warehouse processes will ever be, because right now those processes are at their simplest and smallest.
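As a rough illustration of what that capture could look like, here is a minimal sketch of a load-audit table and the kind of row an SSIS package might log when a load step finishes.  The table, column, and package names here are assumptions for the example only, not something we have actually designed yet.

-- Hypothetical audit table for baseline metrics (all names are placeholders)
CREATE TABLE dbo.EtlLoadAudit
(
    LoadAuditId     INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    PackageName     NVARCHAR(255)     NOT NULL,  -- SSIS package that performed the load
    TargetTable     NVARCHAR(255)     NOT NULL,  -- table the rows landed in
    RowsTransferred BIGINT            NOT NULL,
    StartTime       DATETIME2(0)      NOT NULL,
    EndTime         DATETIME2(0)      NOT NULL
);

-- Example of the kind of row a package could write at the end of a load step
INSERT INTO dbo.EtlLoadAudit (PackageName, TargetTable, RowsTransferred, StartTime, EndTime)
VALUES (N'LoadFactSales.dtsx', N'dbo.FactSales', 250000000,
        '2012-07-22T01:00:00', '2012-07-22T02:15:00');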

The way our old system is set up is as follows:
SSIS parent packages are called by SQL Server Agent jobs.  Any configuration settings we want applied to those packages are set through the job step's dialog box, which allows command-line modifications, setting values, overriding connection strings, and so on.  Most of the child packages, however, have hard-coded connection strings and inherit few, if any, values from their parents.  That was fine when we were developing and deploying on the same production server.  Moving those packages to other servers, however, was a bit of a fiasco.

Dave Rodabaugh suggested using XML configuration files, so David and I did a bit of research.  Microsoft's "Understanding Integration Services Package Configurations" page shows up on Google as the second result.  It provides a good bit of information, but it didn't seem comprehensive enough, in my opinion; I would have liked to see more on the limitations of the different configuration types and comparisons between them.  We aren't against XML configurations, but we have limited access to the actual server and its file system, and we thought it would be difficult to deploy changes through a third party.  I know.  At this point you are probably thinking, "Wait a second. You are building a data warehouse, but you have limited access to the machines you are building it on?  What does that even mean?"  We can access SQL Server, SSAS, and SSIS through SSMS, but we can't RDP into the box itself.  That makes troubleshooting hardware a bit of a problem, but we work with what we have.  It doesn't make sense, but it is what it is.

So the only good alternative to XML configurations for us is the SQL Server configuration option.  We wanted one location to manage the configurations for every SSIS package that would run, and we thought this option would give us that, until we were configuring our first package and realized we had to specify a Configuration Filter at design time.  The filter doesn't appear to be something a package can inherit from its parent, so we began to lose our optimism.  If we wanted to point a package at development or at production, we would have to set that filter at design time, and that just doesn't work for us.  This is the type of thing I would have expected Microsoft to cover in their documentation.  Perhaps we could manage it with one or two SSIS packages, but with the 10-20 we expect, it's a no-go.  By the way, below is the syntax for the configuration table.  You can also find it at the link mentioned above.

CREATE TABLE [dbo].[SSIS Configurations]
(
    ConfigurationFilter NVARCHAR(255) NOT NULL,
    ConfiguredValue     NVARCHAR(255) NULL,
    PackagePath         NVARCHAR(255) NOT NULL,
    ConfiguredValueType NVARCHAR(20)  NOT NULL
)


So we decided to take the second-best path and keep one of these configuration tables on our production server and one on our development server.  The configuration filter will actually be the same on both (we don't want to have to touch our SSIS packages when deploying to different servers), and the connection string for the table's database can be inherited from the parent packages.  This table and other configuration settings will be segregated into their own ETL databases, which we are managing with Visual Studio's Database Project.  You can see David's post on Database Projects if you are interested in reading a bit more about it.  We've found this VS project type exceptionally useful.  "Exceptionally" may be a bit strong; it has its pros and cons, and perhaps one of us will write about our experiences with it later on down the road.
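To make the idea concrete, here is a rough sketch of what the same entry might look like on each server.  The filter name, package path, and connection strings are made up for the example; only the pattern matters: identical ConfigurationFilter and PackagePath values on both servers, with the ConfiguredValue differing per environment.

-- On the development server (hypothetical values)
INSERT INTO [dbo].[SSIS Configurations] (ConfigurationFilter, ConfiguredValue, PackagePath, ConfiguredValueType)
VALUES (N'DW_Common',
        N'Data Source=DEVSQL01;Initial Catalog=ODS;Integrated Security=SSPI;',
        N'\Package.Connections[ODS].Properties[ConnectionString]',
        N'String');

-- On the production server: same filter and package path, different value
INSERT INTO [dbo].[SSIS Configurations] (ConfigurationFilter, ConfiguredValue, PackagePath, ConfiguredValueType)
VALUES (N'DW_Common',
        N'Data Source=PRODSQL01;Initial Catalog=ODS;Integrated Security=SSPI;',
        N'\Package.Connections[ODS].Properties[ConnectionString]',
        N'String');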

Once we've implemented iteration 2, I'll write more about our findings, what worked and what didn't, and perhaps that will help you if you stumble upon this blog early in your data warehouse design.

Below are some links you may find helpful.  I added them after the initial posting of this blog to help you determine the order in which configurations are applied, among other things:
Defining a Configuration Approach for Integration Services Packages
Best Practices for Integration Services Configurations



Thursday, July 5, 2012

Got Doc?

Still working on documentation.  Granted, I have a lot of other hats to wear at work, but I'm still documenting.  There is a lot of value in documentation, and I don't know if you can really over-document an engineering feat of this magnitude.  So what do you document for a data warehouse?  Don't you just start coding in T-SQL and casually declare that it is self-documenting? ...  I hope not.  It would be nice.  But for the sake of those who might have to pick up the pieces after you leave, I hope not.

So let me give you an example of the kind of documentation I have in mind, as well as some of the documentation I've already put together.  I know you'd like to see the documents themselves, but that's not going to happen.  Hopefully this will still give you an idea of what I think is good.  I would also be interested in hearing others' thoughts on documentation.

I've laid out my documentation in roughly four major sections: Standards & Overview, plus Conceptual, Logical, and Physical models.  The Standards & Overview section gives a brief introduction and declares the standards we are going to abide by, including table naming standards, SSIS package conventions, staging database conventions, and so on.  Obviously this section will continue to grow and change as we agree on more suitable standards for our warehouse.  Each of the Conceptual, Logical, and Physical model sections contains the definitions and diagrams for the sources, staging areas, the ODS, the UDM, and SSAS for the current iteration.  We are only on the first iteration of "Long and Thin", so there is currently only one iteration of documentation, but as we move to the second and third iterations, we will copy the previous iteration's documentation and add to it as necessary.

Along with the iterations, we're going to capture and document baseline loading metrics for full and incremental loads at every stage of the data warehouse loads.  My hope is to find a meaningful trend that indicates, for our volume of data, how adding dimensions or measures impacts our load speeds.  I understand that facts and dimensions can vary widely in size, and that the complexity of the ETL processes will vary too.  But a guy can hope there is at least a modest correlation between the number of dimensions and facts we are loading and the time it takes to load them.
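If we end up logging loads into something like the hypothetical dbo.EtlLoadAudit table sketched earlier on this page (again, those names were assumptions, not our final design), pulling a baseline per target table could be as simple as a query like this:

-- Rough baseline query over the hypothetical dbo.EtlLoadAudit table:
-- total rows, total duration, and rows per second for each target table
SELECT
    TargetTable,
    SUM(RowsTransferred)                      AS TotalRows,
    SUM(DATEDIFF(SECOND, StartTime, EndTime)) AS TotalSeconds,
    SUM(RowsTransferred) * 1.0
        / NULLIF(SUM(DATEDIFF(SECOND, StartTime, EndTime)), 0) AS RowsPerSecond
FROM dbo.EtlLoadAudit
GROUP BY TargetTable
ORDER BY TargetTable;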

So far, the documentation I've outlined above has taken longer than the implementation itself.  In fact, it only took David and me about two hours of pair programming to implement our first "Long and Thin".  I've still got a couple of hours of documentation ahead of me before we proceed to iteration 2, but it's a great start.

One more thing to mention.  When we were pair programming on Tuesday, we built our fact table, which, admittedly, is very basic since it only contains one measure and two date dimensions.  We decided not to enforce the foreign key constraints on the date dimensions, assuming there would be a negative impact on performance while trying to load a quarter of a billion records.  Over the holiday I thought about it and mentioned to David that we ought to add the constraints, and only remove them when it is absolutely necessary for performance reasons.  One thing we continue to struggle with is assuming something will be slow without evidence that it is, so we've been reminding each other to do it the way it should be done and dumb it down only when necessary.  Adding the constraints had no noticeable effect on the performance of the load.  Perhaps when we have 15 dimensions and 5 measures we may revisit it, but for now, better safe than sorry.
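For anyone curious, the constraints themselves are nothing exotic.  Here is a minimal sketch of what adding them looks like, assuming a hypothetical fact table dbo.FactSales with two date-role columns and a dbo.DimDate dimension keyed on DateKey; these are placeholder names, not our actual schema.

-- Enforce the two date dimension relationships (hypothetical names)
ALTER TABLE dbo.FactSales
    ADD CONSTRAINT FK_FactSales_OrderDate
        FOREIGN KEY (OrderDateKey) REFERENCES dbo.DimDate (DateKey);

ALTER TABLE dbo.FactSales
    ADD CONSTRAINT FK_FactSales_ShipDate
        FOREIGN KEY (ShipDateKey) REFERENCES dbo.DimDate (DateKey);

-- If a specific load ever proves too slow with the constraints enabled,
-- they can be disabled for that load and re-validated afterwards:
-- ALTER TABLE dbo.FactSales NOCHECK CONSTRAINT FK_FactSales_OrderDate;
-- ALTER TABLE dbo.FactSales WITH CHECK CHECK CONSTRAINT FK_FactSales_OrderDate;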

Some would say it's a bit overboard to turn the constraints on, because you don't need them when you have absolute control of the data, or when the data comes from an already-constrained source.  In our case the data does not come from a constrained source, and we're starting from scratch after a failed year-and-a-half attempt at a data warehouse, so we're a bit nervous about leaving it unconstrained.
-- (I just put a SQL comment indicator there so you'd know this is an aside.)  Neither David nor I was originally involved in the previous attempt.  If you are reading this, you are probably chuckling or crying, because you are, or were, in similar shoes.

I would welcome any additional feedback on what you think our documentation should look like.  I would love to post it with all of its diagrams and beauty, but like I said earlier: "That's not going to happen."

I can't wait to blog about the Conformed Dimensional Bus I've been reading about in my Kimball book.  I heard Dave Rodabaugh mention it while he was here as a consultant, but I didn't quite get what he meant at the time.  Chapter 3 of Kimball's "The Data Warehouse Toolkit", mentioned in a previous post, has a good bit of information on conforming dimensions and facts across data marts.  I'm looking forward to chapter 4.

ETL,
Josh

Monday, July 2, 2012

You're going to do what?!?

Today I spent most of my time on documentation and preparation for the new data warehouse.  It's hard not to skip right over the planning and just start building.  However, I'm certain the outcome will be remarkably better if we spend a good deal of time up front on design.

The consultant we had on-site last week, Dave Rodabaugh with Atlas Analytics, was an excellent resource and really helped us redefine the way we viewed the data warehouse.  Our original methodology was what we thought was a great way to approach the situation: we were planning on building the OLTP database in a very relational, normalized form, and once we were satisfied with its design and had it fully populated, we were going to move the data into a de-normalized schema for SSAS and then create and populate the cube.  Sounds great, but as it turns out, that rarely works the first time.  Dave's suggestion?  Build it "Long and Thin" and "Reduce".

So what do those terms mean?  Even in the short amount of time we were able to spend with Mr. Rodabaugh, it was clear that he had an almost religious conviction about his data warehousing philosophies.  Not to the extent that he dictated exactly what must be done, but he really hammered in some best practices he has found very successful in DW work.  I'll probably write more about his philosophies later, but I think it would help at this stage to expound on what he meant by "Long and Thin" and "Reduce".

"Long and Thin" refers to end to end data development.  If you are familiar with ETL and DW, then it's pretty obvious we're pulling data from multiple sources, massaging the data or, as I've heard it called, "Munging", and serving it up via SSAS or some other multi-dimensional reporting tool.  The idea is to choose a small number ("Thin") of measures and dimensions from the source and move them all of the way ("Long") to the reporting tool.  By building the staging structures, the OLTP, which we are calling the ODS(not pronounced "Odious"), the UDM, and the cube itself, we are able to guage performance, maintainability, and complexity without investing an overwhelming amount of time in any one area.  We'll continue to lay "Long and Thin" layers on top of one another, encountering things we like and dis-like, and fixing previous erroneous methodologies and layers along the way.

"Reduce" refers to looking at and working with the data in smaller pieces.  If you can break a process down into smaller pieces, do it.  Not in excess.  We don't need a pile of crumbs to maintain.  But if an SSIS package is doing many different or potentially unrelated tasks, break it into more than one package.  Create smaller, more maintainable packages that have predictable performance.  More on predictable performance later.

It turns out it is difficult to do "Long and Thin" and "Reduce" if you are used to trying to merge SSIS packages into each other.  It's easy to think "Big Picture" when you are facing an overwhelmingly large DW project.  We found ourselves adding more and more to our first iteration of "Long and Thin" and having to rein ourselves in.  I think we're starting to get it, though.  Hopefully at some point we'll stop having to remind each other, but we're not there yet.

I get my Ralph Kimball book tomorrow.  I'm looking forward to diving into it tomorrow night.  It will be a good read during my July 4th holiday.

ETL,
Josh