I had hoped to jump right in to the next installment to focus on the technical aspect of where we have been with the EDW but got sidetracked by production problems.
Our Banner data warehouse is populated using Informatica as the ETL tool. I've been impressed with the performance the tool has for moving data around. I don't know how much the initial design of the ETL process came from the consultants that were used during construction but we ended up with the following architecture:
BANNER ---(Extract)---->Staging DB-----(Transform)---->Staging DB---(Load)---->EDW
The staging DB is the same database with what we like to call 2 sides, one side looks exactly like the Banner source tables and the other looks exactly like the EDW tables. Indexes, constraints, security, etc. is excluded of course but the table structures, PKs and such are all present.
Besides the 14 hour ETL window I mentioned earlier the other obvious problem with this design is space. We've duplicated somewhere around 1600 Banner and homegrown tables in the staging DB and all 700 tables in the EDW are duplicated as well. Our Banner DB is 275 GB, our staging DB is 500 GB, and our EDW DB is 350 GB. That middle half terabyte is almost pure duplication.
Other issues working left to right include difficulty in extracting only recently modified data from Banner. As you probably know many of the Banner tables have no activity date and if they do have one it is sometimes unreliable. Because of that we have some extract maps that truncate the table on the staging DB and do a complete reload of the Banner table. This was not too bad when we first started out but it got bad quickly. Some extracts were coded that way when they could have probably used some activity date logic. I don't know all the reasons why but I suspect lack of knowledge on if the activity date was reliable and looming deadlines probably were big factors when first developed. The ETL developers have steadily replaced those complete refreshes where the data has grown to large volumes and where they now have good logic to ensure they don't miss something. Lack of useful indexes has also made extracts perform poorly. Most tables have no indexes on activity date so you have to do full table scans to find the data you want to process. Take the NHRDIST table for example, ours is over 87 million rows. If you strictly use activity date you have to scan that entire table in order to extract the recently added rows on the end. Once you realize that the table has an Oracle sequence number on it which is indexed you can store the last sequence number (with a doc code or interfaced ind in our case) so the next extract can simple do an index scan from that point forward. Or the FGBTRND and FGBTRNH tables, ours are 121 million and 36 million respectively, neither of which have a good index to extract new data with. Since we partitioned our tables on fiscal year and fiscal period we worked with the ETL developers to determine which periods are open or have recently closed and they supply that as a predicate to their extract so they only scan the sub-partitions which makes things considerably faster. In general the problem with the extract portion of the ETL, other than the duplication mentioned earlier, is determining the delta (changed rows) since the last extract. For some reason our EDW consultants did not focus on trying to design something sustainable here so we're all trying to fix that. We've got some things in place and another technology in the works to solve that permanently but that is the next installments.
On the transform side, we also struggled with processing only delta. If you do a full extract each night in most cases you have to do a full transform each night. Writing to the same staging DB is also a duplication of space. Even where there is delta logic in place for the transform in many cases there is no suitable index in place to use so full table scans and poor performance have plagued this area. For a number of reasons there has just never been time for a full out performance improvement project that focuses only on the transform SQL, indexes, and delta logic, although it is in the not too distant future. Look for the last installment of where we're going.
On the load side it is a simple process, a straight copy of data from the staging DB to the EDW DB in most of the cases. There are some marts that are populated as well but they are a fraction of the load end. In many cases where the transform logic was changed to process delta the load logic was not, so a good deal of our load step is truncate and reload of the EDW objects. Not efficient at all.
Many of you may wonder why we don't use triggers to detect delta in Banner. We have had a prime directive of not changing the baseline product since day 1 that has stuck pretty well except when we're in dire straights. You may think a 14 hour ETL window is dire straights but remember that the EDW department is separate from the department that maintains the ERP so there is political turf there. The thought is that triggers on the ERP violate the prime directive of not changing the baseline but it also add an extra workload to every transaction against those 1600 tables. Yes, they would be very simple triggers, but how much load would that add to a shop our size and remember the prime directive and that the EDW department is an outside department. Lastly, triggers have been given an evil persona here. Too many cases of triggers run amok in other previous applications that have forever tainted the name "trigger" here. OK, so you should get the picture now, right.
So, we cannot use triggers and we cannot always rely on date logic and even when we can there are no indexes. Whatever will we do!!! Stay tuned for the next installment of what we're doing now to find out and even more exciting, where we're going next.
Recent comments
2 days 11 hours ago
1 week 6 days ago
2 weeks 16 hours ago
2 weeks 1 day ago
3 weeks 3 days ago
4 weeks 12 hours ago
5 weeks 5 days ago
5 weeks 6 days ago
6 weeks 1 day ago
6 weeks 1 day ago