Wednesday, October 10, 2012

SSIS log reporting and benchmarking


We've been using SSIS packages for quite some time and have always set up the packages to use the embedded SSIS logging.  If you haven't noticed, SSIS logging can get pretty verbose, so I decided to write some queries and a sproc to display package component execution times over time.  It is by no means the end-all solution to SSIS log reporting, and it does have some shortcomings, which I will mention along the way.  But overall, I've found it to be helpful, and I would encourage any readers to submit their own SSIS log reporting methodologies.

Before I start, I will mention a tool I happened upon a couple of days ago: SSIS Log Analyzer on codeplex.com.  I downloaded it on Monday, installed it, and tried running it.  It ran on my laptop, but it seemed slow and unresponsive; that may have been my network connection or the database I was trying to report off of.  The reporting capability it provides is worth further investigation, and it certainly deserves an honorable mention when discussing SSIS log reporting.

Our Setup:
We log to the default SSIS Log table: dbo.sysssislog.  I've listed the structure below in Figure 1.0.

Figure 1.0
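
In case the screenshot doesn't come through, here is a sketch of the table as SSIS creates it.  I'm writing this from memory of the SQL Server 2008 version, so verify the column types and sizes against your own instance:

CREATE TABLE [dbo].[sysssislog]
(
    [id]          INT IDENTITY(1,1) NOT NULL
   ,[event]       SYSNAME           NOT NULL
   ,[computer]    NVARCHAR(128)     NOT NULL
   ,[operator]    NVARCHAR(128)     NOT NULL
   ,[source]      NVARCHAR(1024)    NOT NULL
   ,[sourceid]    UNIQUEIDENTIFIER  NOT NULL
   ,[executionid] UNIQUEIDENTIFIER  NOT NULL
   ,[starttime]   DATETIME          NOT NULL
   ,[endtime]     DATETIME          NOT NULL
   ,[datacode]    INT               NOT NULL
   ,[databytes]   IMAGE             NULL
   ,[message]     NVARCHAR(2048)    NOT NULL
)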


To minimize SSIS logging's verbosity (is that a word?), we only log certain events: OnError, OnPostExecute, OnPreExecute, OnTaskFailed, and OnWarning.  See Figure 1.1 for a screenshot of the SSIS log event details.

Figure 1.1
Every package has the same events logged to the same table in the same database, so we can easily (I suppose "easily" is a relative term) see what our packages are doing at any given time.  If you've ever looked at the table mentioned in Figure 1.0, you will probably agree that it is not very helpful in and of itself.
Hence, this article...

Basic Querying of the sysssislog table:
If you have a lot of packages, or a lot of components, or a lot of both, I recommend you trim down your result set a bit with dates.  A common query I run if I'm watching a specific package or viewing what is current is as follows:

SELECT [event]
      ,computer
      ,operator
      ,source
      ,executionid
      ,starttime
      ,endtime
      ,message
FROM   dbo.sysssislog
WHERE  starttime >= DATEADD(d, -1, GETDATE())
ORDER  BY starttime DESC




I'm not going to explain this query; if you don't know what it's doing, you are probably on the wrong blog.

Querying the sysssislog table for durations:
We use the OnPreExecute and OnPostExecute events to determine execution time.  The following query displays duration in seconds for all packages that have started since DATEADD(d,-1,GETDATE()), in other words, this time yesterday or later.  NOTE: This also differentiates between multiple executions of the same package, so if a package runs more than once, you will see multiple entries in the result set of this query.

SELECT pre.starttime
      ,Cast(Datepart(YY, pre.starttime) AS VARCHAR(4)) + '-' + Cast(Datepart(MM, pre.starttime) AS VARCHAR(2)) + '-' + Cast(Datepart(DD, pre.starttime) AS VARCHAR(2)) AS dtm
      ,pre.source
      ,Datediff(s,pre.starttime,post.endtime) AS duration
FROM       (SELECT starttime
                   ,endtime
                   ,executionid
                   ,source
            FROM   dbo.sysssislog
            WHERE  [event] = 'OnPreExecute') pre
INNER JOIN (SELECT starttime
                   ,endtime
                   ,executionid
                   ,source
            FROM   dbo.sysssislog
            WHERE  [event] = 'OnPostExecute') post
ON pre.executionid = post.executionid
  AND pre.source = post.source
WHERE pre.starttime > Dateadd(d, -1, Getdate())

This is great, but I'm really interested in seeing total execution times regardless of the number of times a package was executed.  Emphasis on that last sentence: keep in mind the next query sums duration per day regardless of the number of executions.

SELECT dtm,source,Sum(duration) duration
FROM  (SELECT pre.starttime
             ,Cast(Datepart(YY, pre.starttime) AS VARCHAR(4)) + '-' + Cast(Datepart(MM, pre.starttime) AS VARCHAR(2)) + '-' + Cast(Datepart(DD, pre.starttime) AS VARCHAR(2)) AS dtm
             ,pre.source
             ,Datediff(s, pre.starttime, post.endtime) AS duration
       FROM       (SELECT starttime
                          ,endtime
                          ,executionid
                          ,source
                   FROM   dbo.sysssislog
                   WHERE  [event] = 'OnPreExecute') pre
       INNER JOIN (SELECT starttime
                          ,endtime
                          ,executionid
                          ,source
                   FROM   dbo.sysssislog
                   WHERE  [event] = 'OnPostExecute') post
          ON pre.executionid = post.executionid
             AND pre.source = post.source
       WHERE      pre.starttime > DATEADD(d,-1,GETDATE())) details
GROUP  BY dtm ,source
ORDER  BY dtm ,source


Notice I'm grouping (and ordering) by start date and source, so I'll have one entry per day for each package or component.

That was all well and good, but it will return a very tall result set if there are a lot of packages and a lot of days to report on.  Without pulling it into Excel, turning it into a pivot query, or writing some CASE WHEN statements, I still can't see what my packages are doing over time.
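
If you only need a small, fixed date range, the CASE WHEN route looks something like this (the dates are just placeholders, and CAST AS DATE assumes SQL Server 2008 or later; the sproc in the next section builds the equivalent columns dynamically):

SELECT pre.source AS component
      ,SUM(CASE WHEN CAST(pre.starttime AS DATE) = '20121008'
                THEN DATEDIFF(s, pre.starttime, post.endtime) END) AS [10/8/2012]
      ,SUM(CASE WHEN CAST(pre.starttime AS DATE) = '20121009'
                THEN DATEDIFF(s, pre.starttime, post.endtime) END) AS [10/9/2012]
      ,SUM(CASE WHEN CAST(pre.starttime AS DATE) = '20121010'
                THEN DATEDIFF(s, pre.starttime, post.endtime) END) AS [10/10/2012]
FROM       dbo.sysssislog pre
INNER JOIN dbo.sysssislog post
        ON pre.executionid = post.executionid
           AND pre.source = post.source
           AND post.[event] = 'OnPostExecute'
WHERE  pre.[event] = 'OnPreExecute'
   AND pre.starttime >= '20121008'
GROUP  BY pre.source
ORDER  BY pre.source

That gets old fast, which is why I wrote the stored procedure below.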

Querying the sysssislog table for durations over time:
I created a stored procedure for this which I have included below.  The sproc accepts a start date and end date, and basically stores the results of the above query to a temporary table.  It then builds dynamic SQL to create a column for each day between my start and end date.  Finally it executes the dynamic SQL and provides a nice display of my SSIS package durations over time.

CREATE PROCEDURE [audit].[get_ssis_durations]
   @debug     CHAR(1) = 'N',
   @start_dtm SMALLDATETIME,
   @end_dtm   SMALLDATETIME
AS
   BEGIN
      SET XACT_ABORT ON

      IF @debug = 'Y'
         SET NOCOUNT OFF
      ELSE
         SET NOCOUNT ON

      IF @debug = 'Y'
         BEGIN
            EXEC debug.Logmsg 'BEGIN PROCEDURE audit.get_ssis_durations'

            EXEC debug.Logval '@start_dtm', @start_dtm

            EXEC debug.Logval '@end_dtm', @end_dtm

            EXEC debug.Logmsg 'CREATING TABLE #SSIS_DETAILS...'
         END

      CREATE TABLE #SSIS_DETAILS
         (
             dtm       DATETIME
             ,source   VARCHAR(200)
             ,duration BIGINT
         )

      IF @debug = 'Y'
         BEGIN
            EXEC debug.Logmsg 'Inserting into #SSIS_DETAILS...'
         END

      --Get the details into a temporary table
      INSERT INTO #SSIS_DETAILS
      SELECT dtm
             ,source
             ,CAST(Sum(duration) AS BIGINT) duration
      FROM   (SELECT Cast(Datepart(YY, pre.starttime) AS VARCHAR(4)) + '-' + Cast(Datepart(MM, pre.starttime) AS VARCHAR(2)) + '-' + Cast(Datepart(DD, pre.starttime) AS VARCHAR(2)) AS dtm
                    ,pre.source
                    ,CAST(Datediff(s, pre.starttime, post.endtime) AS BIGINT) AS duration
              FROM       (SELECT starttime
                                 ,endtime
                                 ,executionid
                                 ,source
                          FROM   dbo.sysssislog
                          WHERE  [event] = 'OnPreExecute') pre
              INNER JOIN (SELECT starttime
                                 ,endtime
                                 ,executionid
                                 ,source
                          FROM   dbo.sysssislog
                          WHERE  [event] = 'OnPostExecute') post
                 ON pre.executionid = post.executionid
                    AND pre.source = post.source
              WHERE      pre.starttime >= @start_dtm) details
      GROUP  BY
         dtm
         ,source
      ORDER  BY
         dtm

      IF @debug = 'Y'
         BEGIN
            EXEC debug.Logmsg 'GENERATING dynamic SQL...'
         END

      --Setup my dynamic SQL
      DECLARE @sql VARCHAR(MAX) = 'SELECT DISTINCT source AS component'
      DECLARE @dayAdd INT = 0;
      DECLARE @strDate VARCHAR(20);

      --Loop through each day and build my sql statement.
      WHILE Dateadd(d, @dayAdd, @start_dtm) <= @end_dtm
         BEGIN
            SELECT @strDate = Cast(Datepart(m, Dateadd(d, @dayAdd, @start_dtm)) AS VARCHAR(2))

            SELECT @strDate = @strDate + '/' + Cast(Datepart(d, Dateadd(d, @dayAdd, @start_dtm)) AS VARCHAR(2))
            SELECT @strDate = @strDate + '/' + Cast(Datepart(yyyy, Dateadd(d, @dayAdd, @start_dtm)) AS VARCHAR(4))
            SELECT @sql = @sql + ',(SELECT duration  FROM #SSIS_DETAILS i WHERE i.dtm = ''' + @strDate + ''' and i.source = o.source) [' + @strDate + ']' + Char(13)
            SELECT @dayAdd = @dayAdd + 1;
         END

      SELECT @sql = @sql + ' FROM #SSIS_DETAILS o ORDER BY component'
      --Display the sql I'll be executing
      PRINT @sql

      IF @debug = 'Y'
         BEGIN
            EXEC debug.Logmsg 'EXECUTING dynamic SQL...'
         END

      --Execute the query
      EXEC (@sql)

      IF @debug = 'Y'
         BEGIN
            EXEC debug.Logmsg 'DROPPING #SSIS_DETAILS...'
         END

      --Drop the temp table
      DROP TABLE #SSIS_DETAILS

      IF @debug = 'Y'
         BEGIN
            EXEC debug.Logmsg 'END PROCEDURE audit.get_ssis_durations'
         END
   END
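
For reference, calling the sproc looks something like this (the dates here are just an example):

EXEC [audit].[get_ssis_durations]
   @debug     = 'N',
   @start_dtm = '2012-10-01',
   @end_dtm   = '2012-10-10'

The result is one row per package component, with a duration column for each day in the range.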


Apologies for the poor formatting.  Unfortunately, Blogger doesn't handle formatted SQL pasted directly from SSMS 2012 very well.  Pasting the above queries into SSMS and running the Beautifier formatting tool against them is a small price to pay.  These queries are relatively new, and I certainly invite any feedback or criticism you may have.  Over time, I'll have benchmark data on all of our packages, so I can see how changes to the database or to the SSIS packages affect performance.

Regards,
Josh




Friday, August 10, 2012

DB change scripts... Blah..

So, it's 5 am, and I'm supposed to be exercising.  But blogging sounds more interesting to me. Nothing like stretching your mind in the morning.

I've got a few topics I'd like to go over:
1.  Our SSIS Package design/infrastructure.
2.  Database Projects in Microsoft Visual Studio.
3.  Configurations...continued...

But I think I'll go with #2.  Database Projects in Microsoft Visual Studio.  Since you looked at the title before reading this, I'm sure you felt a little prophetic and instinctively knew I would choose #2. Perhaps I should mix things up a bit and go with #1 just to kill your buzz...  Cue Debbie Downer music.

SQL Server 2008 Database Projects in Microsoft Visual Studio 2010 are great.  Maybe I should have said that with a bit more umph, or at least used a better word than "great", like "wonderful" or "magnificent", but I think great will suffice.  It's not like it does my job for me, but it does make my job a lot easier.

So what's the point of Database Projects?  Why would I want to migrate from something that I know works, like maintaining change scripts between releases, to something that is new to me?  And my answer is, forcefully: because you don't have to maintain change scripts between releases.  Did you get that?  You don't have to maintain change scripts between releases.  I feel like I should say it again, because I know you are still four sentences behind thinking, "why change what I've been doing?"  I need to blog about change sometime.  Why do people hate change so much?  I digress...

If you've finally heard me, you are probably skeptical.  "You mean I don't have to meticulously add alter statements to a release script and hope everyone is documenting their database changes?" Yes.  "I don't have to keep telling my developers to 'Update the change script when you make db changes'?"  That is correct.  "How is that possible?"  Uh, that's what this blog is about, keep reading.  

In Database Projects (I'll refer to them as DB Projects from here on out), you build your database the way you want it to be.  You write the CREATE TABLE, CREATE PROCEDURE, and CREATE INDEX statements, make sure the project builds, and check in your code to a source code repository like TFS.  As database changes are requested or needed, you update the DB Project itself, incrementally.  If you need to add a column to a table, you simply add a column to a CREATE TABLE definition.  You don't need to check if it already exists, just add the column.  If you come from a developer background, this will seem normal to you because this is how application development works.  If your background is in the db world, this is probably a tough pill to swallow.  You don't have to be mindful of what the destination database will look like.  You don't have to keep in mind that this database project will be deployed to 5 different servers.  All you need to know is what the database should look like when you are done.
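
To make that concrete, here is a minimal sketch using a made-up table (none of this is from our actual project).  Adding the column is just a matter of editing the table's definition file in the DB Project:

-- Before:
CREATE TABLE [dbo].[Customer]
(
    [CustomerID]   INT            NOT NULL PRIMARY KEY
   ,[CustomerName] NVARCHAR(100)  NOT NULL
)

-- After the change request, the same file simply becomes:
CREATE TABLE [dbo].[Customer]
(
    [CustomerID]   INT            NOT NULL PRIMARY KEY
   ,[CustomerName] NVARCHAR(100)  NOT NULL
   ,[CreatedDate]  DATETIME       NOT NULL DEFAULT (GETDATE())
)

No IF EXISTS checks, no ALTER statements; the deployment engine figures that part out, as described below.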

So how does it work behind the scenes?  I mentioned earlier that your db project has to build, so what happens during a build?
  • Syntax is checked across the entire database project.
  • All object references in the database (e.g., the objects in SELECT * FROM [schema].[table]) are verified.
  • Proper data type conversions are checked.
  • A .dbschema file is generated.  The .dbschema file is basically an XML file that contains the structure or schema of the database you intend to deploy.  If you open the file in Notepad or some other text editor of your choice, you will find definitions for users, stored procedures, tables, etc.
     
"But I still don't see where change scripts are magically appearing." 

After a build, the next step is to deploy the db project to a database on a server somewhere.  This can be done both from the command line, and from the VS IDE.  Essentially what happens is this:
  1. The .dbschema file that was generated during the build is compared against a newly created .dbschema file of the database you are trying to deploy to.  
  2. Microsoft's deployment engine generates a delta script which contains all of the changes necessary to make the deployment database match the db project exactly (a sketch of such a script follows this list).
  3. If you have the Deployment Action property in the project property pages set to create script and deploy changes, then the script that was generated in step 2 will be applied to the destination database immediately by the deployment engine.
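
For the hypothetical Customer change above, the interesting part of the generated delta script would look something like this (real scripts also contain a preamble, transaction handling, and whatever data-motion logic the engine decides it needs):

ALTER TABLE [dbo].[Customer]
    ADD [CreatedDate] DATETIME NOT NULL DEFAULT (GETDATE())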

Before I go further, a very necessary assumption needs to be made.  The database project is, and will remain, exactly what you want your database to look like at any given time.  It is the Gospel.  It is the Holy Grail.  It is the Truth.  No existing database, production or development, supersedes the db project.  It is the blueprint with which all of your databases must come into alignment.  If you deviate from this, you will wake up screaming at 3 in the morning because of the nightmare you just had, and that's only if you can actually fall asleep.

I say this because if you look at the database script that is generated, it may contain DROP Table statements if someone has created tables manually rather than through the database project.  It will remove columns, indexes, schemas, etc... during the deployment process if those columns, indexes, and schemas do not exist in the db project.  In other words, check your deployment script before deploying...  "But you said I didn't have to maintain change scripts anymore."  That's correct, no more meticulous maintenance of them, but reviewing the generated change scripts is a must.  I suppose this is why db projects are "great", but not "Magnificent".
"So what do I do if someone makes changes manually to our development server behind my back?"  
There are two options, the first of which is probably something someone unfamiliar with db projects would recommend. The second, is neat.

1.  Manually change the db project itself by changing your object definitions.
2.  Use Visual Studio's Schema Compare option.

If you want to spend a lot of time, choose option 1.  If you want Visual Studio to do the heavy lifting, choose option 2, which is what I'm about to explain.

To use VS's Schema Compare, you need to add a Schema Compare object to your db project. Right-click the project and choose Add->Schema Compare.
This will open a dialog box asking you to pick a source and a target (destination).  A source and a target can be any combination of the following:  a .dbschema file, a database project, or an actual database.  Since you are trying to re-sync your db project to development to incorporate some naughty developer's manual changes, you would want your source to be the development database, and your target to be your db project.  At this point you can choose additional options, like what types of objects to ignore during the compare, but typically I just continue on.  

Visual Studio's deployment engine will create a delta script that could be used to bring your target (db project) into alignment with your source (the development server).  If you don't see the delta/change script, there are two small icons in the toolbar that look like scrolls from Lord of the Rings.  One opens the window to display the change script, and one updates the change script as you modify your requirements through the user interface.  It also gives you a nice user interface that can be filtered to show you missing objects, different objects, equal objects, objects marked as skip, and objects not marked as skip.  You will probably see a left and right panel which contain your source and destination, and then a tree of objects.  Now you go through each object and choose to Update, Drop, or Skip it if there is a difference.

Something David and I found out this past week is that rather than manually choosing an action for each difference, you can right-click on the root folder of the objects and choose an action there.  This is great if you only want to change one thing when there are hundreds of differences.  In this case, if you followed the assumption I gave earlier, you should have very few changes to propagate to your db project.  But this same technique can be used to generate your deployment scripts manually, where you may not want to deploy everything at once.  NOTE: This is also how you will create your db project to begin with.  "You mean I don't have to type it all in?"  That is correct.  The Schema Compare will do it all for you.  I think you can also import a .dbschema file to do the same thing, but I have not tried this.

Once you have chosen what to change, there is an option on the toolbar, forgive me for not knowing the name of it, but it is something along the lines of "Execute Deployment Script".  This will cause your target to match your source based on the changes you have made to your deployment script through the user interface.

A couple of final thoughts to leave you with and then a link at the end to help you learn more about database projects in Visual Studio 2010.
  1. Database projects can be set up with unit tests.  I haven't done this.  I don't know much about it, but I like the idea, and I'll probably get into it later and will perhaps blog about it when the time comes.
  2. Database projects can be set up in TFS to automatically build on a schedule, after each check-in, through continuous integration, etc.  This is good because now we can ensure consistent quality over time and have a backup of previous builds that we can always fall back on.  Old .dbschema files that are generated during scheduled builds can be the target of a Schema Compare to migrate the database structure back to a solid state should any issues arise.

Well, it's time to head to work, so take a look at the link below, and let me know what you think,

ETL,
Josh


Useful link(s) of study:
You'll probably want to navigate to the Documentation tab and grab the following:
  1. Visual Studio 2010 Database Projects Guidance Document_(xps) file
  2. Visual Studio 2010 Database Projects HOL Documentation_(xps_pdf_docx)
I would hyperlink them, but I'm afraid the links might die.  I don't want to turn into a hospice for links on my blog.


Sunday, July 22, 2012

Configurations should be easy...

Surprisingly they are not.

Last week, David and I spent about two and a half hours discussing how we wanted to set up the infrastructure of our data warehouse.  We are only on our second iteration, and figured the earlier we could get the structure in place, the better.  Part of that infrastructure will be the capturing of row counts and transfer speeds in order to determine our baselines.  For example, after this iteration, we should know exactly how fast our data warehouse processes will ever be, because right now, those processes are at their simplest and smallest forms.
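
Just to illustrate the idea (this is a sketch, not our actual implementation), the kind of structure we have in mind for capturing those baselines is along these lines:

CREATE TABLE [audit].[load_baseline]
(
    load_dtm     DATETIME     NOT NULL DEFAULT (GETDATE())
   ,package_name VARCHAR(200) NOT NULL
   ,rows_loaded  BIGINT       NOT NULL
   ,duration_sec INT          NOT NULL
   ,rows_per_sec AS (CASE WHEN duration_sec = 0 THEN rows_loaded
                          ELSE rows_loaded / duration_sec END)
)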

The way our old system is setup is as follows:
SSIS parent packages are called by SQL Server Agent jobs.  Any configuration settings we want applied to those packages are configured through the SQL job's dialog box, which allows command line modifications, setting values, overriding connection strings, etc.  Most of the child packages, however, have hard-coded connection strings and inherit little, if anything, from their parents.  That was fine when we were developing and deploying onto the same production server.  Allowing those packages to be transferred to other servers, however, was a bit of a fiasco.

Dave Rodabaugh suggested using XML configuration files, so David and I did a bit of research.  Microsoft's "Understanding Integration Services Package Configurations" page shows up on Google as the number 2 result.  It provides a good bit of information, but unfortunately, it didn't seem comprehensive enough, in my opinion.  I would have liked to see more of the limitations, comparisons, etc. of the different configuration types.  We aren't against the use of XML configurations; however, we have limited access to the actual server and its file system, and thought it would be difficult to deploy changes via a third party.  I know.  At this point you are probably thinking, "Wait a second.  You are building a data warehouse, but you have limited access to the machines you are building it on?  What does that even mean?"  We can access SQL Server, SSAS, and SSIS through SSMS, but we can't RDP into the box itself.  This makes troubleshooting hardware a bit of a problem, but we work with what we have.  It doesn't make sense, but it is what it is.

So the only good alternative to XML configurations is the SQL Server configuration option.  We wanted one location to manage all of the configurations for every SSIS package that would run.  We thought this option would cater to that, until we were configuring our first SSIS package and realized we had to specify a Configuration Filter at design time.  It doesn't appear to be an option that can inherit from its parent, so we were beginning to lose our optimism.  We found out that if we wanted to deploy to development or to production, we would have to specify that filter at design time, and that just doesn't work.  This is the type of thing I would have expected Microsoft to include in their documentation.  Perhaps if we only had 1 or 2 SSIS packages we could manage it, but with the 10-20 we expect, it's a no-go.  By the way, below is the syntax for the configuration table.  You can also find it at the link mentioned above.

CREATE TABLE [dbo].[SSIS Configurations]
(
    ConfigurationFilter NVARCHAR(255) NOT NULL,
    ConfiguredValue     NVARCHAR(255) NULL,
    PackagePath         NVARCHAR(255) NOT NULL,
    ConfiguredValueType NVARCHAR(20)  NOT NULL
)
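
For a feel of what ends up in that table, a single entry overriding a connection string looks something like this (the filter name, server, and connection manager name are made up):

INSERT INTO [dbo].[SSIS Configurations]
    (ConfigurationFilter, ConfiguredValue, PackagePath, ConfiguredValueType)
VALUES
    ('ETL_Common'
    ,'Data Source=DEVSQL01;Initial Catalog=ODS;Integrated Security=SSPI;'
    ,'\Package.Connections[ODS].Properties[ConnectionString]'
    ,'String')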


So we decided to take the second-best path and have one of these configuration tables on our production server and one on our development server.  The configuration filter would actually be the same for both (we don't want to have to touch our SSIS packages when deploying to different servers), and the connection string for the table's database can be inherited from parent packages.  This table and other configuration settings will be included and segregated into their own ETL databases, which we are managing with Visual Studio's Database Project.  You can see David's post on Database Projects if you are interested in reading a bit more about it.  We've found this VS project type exceptionally useful.  Exceptional may be a bit strong.  It has its pros and cons, and perhaps one of us will write a bit about our experiences with it later on down the road.

Once we've implemented iteration 2, I'll write more on our findings, what worked and what didn't, and perhaps help you if you stumble upon this blog early in your data warehouse design.

Below are some links you may find helpful:
I've included these links after the initial posting of this blog to assist you in determining the order in which configurations are applied, etc.
Defining a Configuration Approach for Integration Services Packages
Best Practices for Integration Services Configurations



Thursday, July 5, 2012

Got Doc?

Still working on documentation.  Granted, I have a lot of other hats to wear at work, but I'm still documenting.  There is a lot of value in documentation, and I don't know if you can really over-document an engineering feat of this magnitude.  So what do you document for a data warehouse?  Don't you just start coding in T-SQL and casually declare that it is self-documenting?  ...I hope not.  It would be nice.  But for the sake of those who might have to pick up the pieces after you leave, I hope not.

So let me give you an example of the kind of documentation I have in mind, as well as some of the documentation I've already put together.  I know you'd like to see it, but that's not going to happen.  Hopefully this will give you an idea of what I think is good.  I would be interested in hearing from others on their thoughts about documentation as well.

I've laid out my documentation in roughly four major sections: Standards & Overview, and Conceptual, Logical, and Physical models.  The Standards & Overview section gives a brief overview and introduction, and declares the standards we are going to abide by including table naming standards, SSIS package conventions, Staging Database conventions etc...  Obviously this section will continue to grow and change as we find and agree to more suitable standards for our warehouse.  Each of the Conceptual, Logical, and Physical models contains the definitions and diagrams for sources, staging areas, the ODS, the UDM, and SSAS for the current iteration.  We are only on the first iteration of "Long and Thin", so there is currently only one iteration of documentation, but as we move to the second and third iterations, we will copy and paste previous iterations and add to them as necessary.

Along with the iterations, we're going to capture and document baseline loading metrics for full and incremental loads at all stages of the data warehouse data loads.  My hope is to find a meaningful trend to indicate, for our volume of data, how adding additional dimensions or measures impacts our load speeds or swiftness.  I understand some facts and dimensions might be vastly larger or significantly smaller in size when compared one to another, and the complexity of the ETL processes may vary.  But a guy can hope there is at least a slightly significant correlation to the number of dimensions and facts we are loading to the time it takes to load them.

For now, the documentation I've outlined above has taken longer than the actual implementation.  In fact, it only took David and me about 2 hours of pair programming to implement our first long and thin.  I've still got a couple of hours of documentation before we proceed to iteration 2, but it's a great start.

One more thing to mention.  When we were pair programming on Tuesday, we were building our fact table, which, admittedly, is very basic since it only contains 1 measure and 2 date dimensions.  We decided not to enforce the foreign key constraints on the date dimensions.  We assumed there would be a negative impact on performance while trying to load a quarter of a billion records.  Over the holiday I was thinking about it, and I mentioned to David that we ought to add the constraints, and only remove them when it is absolutely necessary for performance reasons.  One thing we continue to struggle with is assuming something will be slow without evidence that it is.  So we've been reminding each other to do it like it should be done, and dumb it down only when necessary.  Adding the constraints had zero noticeable effect on the performance of the load.  Perhaps when we have 15 dimensions and 5 measures we may revisit it, but for now, better safe than sorry.
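
For what it's worth, the change amounts to something like this (the table and column names here are hypothetical, not our actual schema):

ALTER TABLE [dbo].[FactSales] WITH CHECK
    ADD CONSTRAINT [FK_FactSales_DimDate_OrderDateKey]
    FOREIGN KEY ([OrderDateKey]) REFERENCES [dbo].[DimDate] ([DateKey])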

Some would say that it's a bit aggressive to turn the constraints on because you don't need them when you have absolute control of the data, or if the data comes from a constrained data source.  In our case, it does not come from a constrained data source, and we're starting from scratch after a 1.5 year failed attempt at a data warehouse, so we're a bit nervous about leaving it unconstrained.  
-- (I just put a SQL comment indicator as if you would know this is a comment) Neither David, nor myself were originally involved in the previous attempt.  If you are reading this, you are probably chuckling or crying, because you are or were previously in similar shoes.

I would welcome any additional feedback on what you think our documentation should look like.  I would love to post it with all of its diagrams and beauty, but like I said earlier: "That's not going to happen.".

I can't wait to blog about the Conformed Dimensional Bus I've been reading about in my Kimball book.  I heard Dave Rodabaugh mention it while he was here as a consultant, but didn't quite get what he meant.  But chapter 3 of Kimball's "The Data Warehouse Toolkit", mentioned in a previous blog, has a good bit of info on conforming dimensions and facts across data marts.  I'm looking forward to chapter 4.

ETL,
Josh

Monday, July 2, 2012

You're going to do what?!?

Today I spent most of my time on documentation and preparation for the new data warehouse.  It's hard not to skip right over the planning and commence building.  However, I'm certain the outcome will be remarkably better if we spend a lot of time up front on design.

The consultant we had on-site last week, Dave Rodabaugh with Atlas Analytics, was an excellent resource and really helped us redefine the way we viewed the data warehouse.  Our original methodology was, we thought, a great way to approach the situation.  We were planning on building the OLTP database in a very relational and normalized form.  After we were satisfied with its design, and had it fully populated, we were going to move the data into a denormalized schema for use with SSAS to create and populate the cube.  Sounds great, but as it turns out, that rarely works the first time.  Dave's suggestion?  Build it "Long and Thin" and "Reduce".

So what do those terms mean?  For the short amount of time we were able to spend with Mr. Rodabaugh, it was clear that he had an almost religious conviction about his philosophies on data warehousing.  Not to the extent that he was dictating exactly what must be done, but he really hammered in some best practices that he has found very successful in DW work.  I'll probably speak more of his philosophies later, but I think it would help at this stage to expound on what he meant by "Long and Thin" and "Reduce".

"Long and Thin" refers to end to end data development.  If you are familiar with ETL and DW, then it's pretty obvious we're pulling data from multiple sources, massaging the data or, as I've heard it called, "Munging", and serving it up via SSAS or some other multi-dimensional reporting tool.  The idea is to choose a small number ("Thin") of measures and dimensions from the source and move them all of the way ("Long") to the reporting tool.  By building the staging structures, the OLTP, which we are calling the ODS(not pronounced "Odious"), the UDM, and the cube itself, we are able to guage performance, maintainability, and complexity without investing an overwhelming amount of time in any one area.  We'll continue to lay "Long and Thin" layers on top of one another, encountering things we like and dis-like, and fixing previous erroneous methodologies and layers along the way.

"Reduce" refers to looking at and working with the data in smaller pieces.  If you can break a process down into smaller pieces, do it.  Not in excess.  We don't need a pile of crumbs to maintain.  But if an SSIS package is doing many different or potentially unrelated tasks, break it into more than one package.  Create smaller, more maintainable packages that have predictable performance.  More on predictable performance later.

It turns out it is difficult to do "Long and Thin" and "Reduce" if you are used to trying to merge SSIS packages into each other.  It's easy to think "Big Picture" when you are facing an overwhelmingly large DW project.  We found ourselves adding more and more to our first iteration of "Long and Thin" and having to rein ourselves in.  I think we're starting to get it, though.  Hopefully at some point we'll stop having to remind each other, but we're not there yet.

I get my Ralph Kimball book tomorrow.  I'm looking forward to diving into it tomorrow night.  It will be a good read during my July 4th holiday.

ETL,
Josh



Saturday, June 30, 2012

What is It?

Ever heard of Business Intelligence?  Data Warehousing?  How about the definition of a "tuple"? Even dictionary.reference.com doesn't have a good definition of the word in terms of BI.  This past week I've heard phrases like
"An attribute must have a direct or transitive relationship to the key attribute", 
or
"What is your philosophy in terms of UDM?".  
What does that even mean?  The challenge, to think at a higher intellectual level and to work on articulating those thoughts, has given me the motivation to move up to the next level in my career.

I'm a Sr. Software Engineer who is starting to get, or perhaps has already acquired, a passion for working with large data.  It's a Saturday afternoon and I'm blogging about work.  Either I'm crazy, or I've found an itch that I can't help but scratch.  Either way, this blog is to document what it is I'm doing, what I have learned, how I learned it, and how it can help you.

This will probably be a bit deep for those that are not familiar with BI.  And, since I'm just starting out, it will probably be a bit shallow for those that are.  Perhaps one day I'll reach a level of understanding in the field and think exceptionally highly of myself.  Then I'll call a mentor in the field and be reminded of my inferiority.  Either way, when I'm done, I'll have a clear path to success, or a clear path to failure.  Both of which, I'm told, are valuable.


ETL,
Josh