Thursday, July 5, 2012

Got Doc?

Still working on documenation.  Granted, I have a lot of other hats to wear at work, but I'm still documenting.  There is a lot of value in documentation, and I don't know if you can really over-document an engineering feat of this magnitude. So what do you document for a data warehouse?  Don't you just start coding in T-SQL and casually declare that it is self documenting? ...  I hope not.  It would be nice.  But for the sake of those who might have to pick up the pieces after you leave, I hope not.

So let me give you an example of the kind of documentation I have in mind, as well as some of the documentation I've already put together.  I know you'd like to see it, but that's not going to happen.  But hopefully this will give you an idea of what I think is good.  I would be interested in hearing from other, their thoughts on documentation as well.  

I've laid out my documentation in roughly four major sections: Standards & Overview, and Conceptual, Logical, and Physical models.  The Standards & Overview section gives a brief overview and introduction, and declares the standards we are going to abide by including table naming standards, SSIS package conventions, Staging Database conventions etc...  Obviously this section will continue to grow and change as we find and agree to more suitable standards for our warehouse.  Each of the Conceptual, Logical, and Physical models contains the definitions and diagrams for sources, staging areas, the ODS, the UDM, and SSAS for the current iteration.  We are only on the first iteration of "Long and Thin", so there is currently only one iteration of documentation, but as we move to the second and third iterations, we will copy and paste previous iterations and add to them as necessary.

Along with the iterations, we're going to capture and document baseline loading metrics for full and incremental loads at all stages of the data warehouse data loads.  My hope is to find a meaningful trend to indicate, for our volume of data, how adding additional dimensions or measures impacts our load speeds or swiftness.  I understand some facts and dimensions might be vastly larger or significantly smaller in size when compared one to another, and the complexity of the ETL processes may vary.  But a guy can hope there is at least a slightly significant correlation to the number of dimensions and facts we are loading to the time it takes to load them.

For now, the documentation I've outlined above has taken longer than the actual implementation of it.  In fact it only took David and I about 2 hours of pair programming to implement our first long and thin.  I've still got a couple of hours of documentation before we proceed to iteration 2, but it's a great start.

One more thing to mention.  When we were pair programming on Tuesday, we were building our fact table which, admittedly, is very basic since it only contained 1 measure and 2 date dimensions.  We decided not to  enforce the foreign key constraints on the date dimensions.  We assumed there would be a negative impact on performance while trying to load a quarter of a billion records.  Over the holiday I was thinking about it, and mentioned to him, David, that we ought to add the constraint, and only remove it when it is absolutely necessary for performance reasons.  One thing we continue to struggle with is assuming something will be slow without evidence that it is.  So we've been reminding each other to do it like it should be done, and dumb down only when necessary.  Adding the constraint had zero noticeable affect on the performance of the load.  Perhaps when we have 15 dimensions and 5 measures we may address it, but for now, better safe than sorry.

Some would say that it's a bit aggressive to turn the constraints on because you don't need them when you have absolute control of the data, or if the data comes from a constrained data source.  In our case, it does not come from a constrained data source, and we're starting from scratch after a 1.5 year failed attempt at a data warehouse, so we're a bit nervous about leaving it unconstrained.  
-- (I just put a SQL comment indicator as if you would know this is a comment) Neither David, nor myself were originally involved in the previous attempt.  If you are reading this, you are probably chuckling or crying, because you are or were previously in similar shoes.

I would welcome any additional feedback on what you think our documentation should look like.  I would love to post it with all of its diagrams and beauty, but like I said earlier: "That's not going to happen.".

I can't wait to blog about the Conformed Dimensional Bus I've been reading about in my Kimball book.  I heard Dave Rodabaugh mention it while he was here as a consultant, but didn't quite get what he meant.  But chapter 3 of Kimball's "The Datawarehouse Toolkit", mentioned in a previous blog, has a good bit of info on conforming dimensions and facts across data marts.  I'm looking forward to chapter 4.

ETL,
Josh

No comments:

Post a Comment