One of the things I have appreciated about my “new job” is getting to work with a technically sophisticated, large-scale, mission-critical enterprise architecture. Part of this includes data warehousing and business intelligence, two areas I have had interest in since reading “Super Crunchers” by Ian Ayres (a very interesting look at data mining to find extremely useful and actionable information).
OLTP vs. OLAP
When first introduced to data warehousing, my mental model was of a super complex, massive farm of database servers. The reality is that data warehouses sprang up because their needs differ from those of transactional operations:
- Online Transaction Processing (OLTP). This mirrors how a database would be used in a typical web-based production environment such as placing orders within Amazon: a large number of short-lived transactions in a write-oriented database that is probably barely keeping up with its capacity needs. With limited capacity, do you really want to be executing in-depth analysis queries against your production transaction-oriented database? Even if you did, how could you possibly tune it to be as fast as possible from both a read and a write perspective? You cannot.
- Online Analysis Processing (OLAP). Instead of doing write-oriented update transactions, OLAP focuses on the more read-oriented queries and statistical analysis. For this, you want your database to be optimized for read operations as these queries can take a while to execute, and the organization will want to be able to identify trends or other issues from this near real-time data. For example, think about how Amazon mines your purchase patterns to make suggestions, or to increase or decrease the price of books and other offerings. This is actionable business intelligence leading to greater agility and returns.
Thus, your data warehouse is typically a separate read-oriented database containing near real-time information (typically anywhere from less than an hour to a day stale) with a much larger time horizon.
One of the inputs can be your OLTP database, but a data warehouse is typically composed of numerous data feeds.
This is where Extract Transform and Load (ETL) typically comes in:
- Extract. Extract the data from multiple data sources
- Transform. Transform and clean the data. Different data sources can have different representations of the same conceptual entity. Furthermore, they can contain errors (e.g., data entry mistakes) and other problems that need to be addressed when assembling an integrated picture from many different sources
- Load. Load the transformed data into the data warehouse
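The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a real pipeline: the feed formats, field names, and cleanup rules are all invented for the example.

```python
def extract():
    """Extract: pull raw rows from two hypothetical data feeds."""
    crm_feed = [{"cust": "ACME Corp.", "amount": "100.50"}]  # amount is a string
    web_feed = [{"customer_name": "acme corp", "total": 42.0}]
    return crm_feed, web_feed

def transform(crm_feed, web_feed):
    """Transform: unify field names, normalize casing, and coerce types so
    both feeds' versions of the same customer line up."""
    rows = []
    for r in crm_feed:
        rows.append({"customer": r["cust"].rstrip(".").upper(),
                     "amount": float(r["amount"])})
    for r in web_feed:
        rows.append({"customer": r["customer_name"].upper(),
                     "amount": float(r["total"])})
    return rows

def load(rows, warehouse):
    """Load: append the cleaned rows to the warehouse table."""
    warehouse.extend(rows)

warehouse = []
load(transform(*extract()), warehouse)
# Both feeds' spellings of the customer now resolve to "ACME CORP".
```

Note how the transform step is where the "different representations of the same conceptual entity" problem gets handled: both `"ACME Corp."` and `"acme corp"` end up as one canonical name.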
Business Intelligence (BI) and Data Mining
Business Intelligence is the use of the data in the data warehouse to derive actionable business level information. This can include analysis along a number of different dimensions (e.g., sales per region, sales across timing, trending) as well as forecasting the future based upon the past. Analysts can use ad hoc queries to see what is going on and to do “what-if” kind of scenarios.
Data Mining can be employed to identify trends and other relationships within the data that would not be so readily obvious.
For update-oriented operations such as a web site taking orders, you want a normalized schema for greater write efficiency. However, for the analysis queries performed in a data warehouse, you typically want something between a fully normalized schema and a fully denormalized one: the star schema. So what is a star schema?
I am going to use an example from the book “Oracle Essentials”, which I just finished reading. Here is a typical query (which shows the advantages of a star schema):
Show me how many widgets (a product type) were sold by a store chain (a sales channel) in Louisiana (a geography) over the past 3 months (a time period)
This query involves four dimensions:
- product type
- sales channel
- geography
- time
In the star schema, you would have a central fact table (representing sales transactions) with four connected dimension tables (e.g., Product, Channel, Geography and Time).
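To make the shape concrete, here is the example query against a toy star schema in plain Python: one fact table whose rows hold foreign keys into four dimension tables. All table contents (product names, the store chain, the quarter label) are made up for illustration; a real warehouse would express this as a SQL join.

```python
# Dimension tables: small lookup tables, keyed by surrogate ids.
product_dim   = {1: {"name": "widget",  "type": "widgets"}}
channel_dim   = {1: {"name": "BigMart", "kind": "store chain"}}
geography_dim = {1: {"state": "Louisiana"}}
time_dim      = {1: {"quarter": "2024Q1"}}

# Fact table: each row references the dimensions and carries the measure.
sales_fact = [
    {"product_id": 1, "channel_id": 1, "geo_id": 1, "time_id": 1, "units": 5},
    {"product_id": 1, "channel_id": 1, "geo_id": 1, "time_id": 1, "units": 3},
]

# "How many widgets were sold by a store chain in Louisiana this quarter?"
# Filter on each dimension, then sum the measure from the fact table.
total = sum(
    row["units"]
    for row in sales_fact
    if product_dim[row["product_id"]]["type"] == "widgets"
    and channel_dim[row["channel_id"]]["kind"] == "store chain"
    and geography_dim[row["geo_id"]]["state"] == "Louisiana"
    and time_dim[row["time_id"]]["quarter"] == "2024Q1"
)
```

The "star" is visible in the structure: the fact table sits at the center, and every constraint in the query is a lookup into one of the surrounding dimension tables.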
For efficiency, data within these dimensions is usually hierarchical (e.g., for the time dimension, day rolls up into week, which rolls up into month, which rolls up into quarter, which rolls up into year). If your query is looking for a particular quarter, it can be executed against the quarterly summary as opposed to scanning all the more granular week- and day-level data. Hence these are referred to as summary tables.
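The roll-up idea can be sketched as well: precompute a quarterly summary from day-level facts, so a quarter query reads one summary row instead of every daily row. The dates and sales figures here are invented for the example.

```python
from collections import defaultdict

# Day-level facts (date, units sold).
daily_sales = [
    ("2024-01-15", 10),
    ("2024-02-03", 7),
    ("2024-04-20", 4),
]

def quarter_of(date_str):
    """Map an ISO date to its year/quarter bucket, e.g. '2024Q1'."""
    year, month, _ = date_str.split("-")
    return f"{year}Q{(int(month) - 1) // 3 + 1}"

# Roll the daily rows up into a quarterly summary table.
quarterly_summary = defaultdict(int)
for date, units in daily_sales:
    quarterly_summary[quarter_of(date)] += units

# A query for Q1 now hits one summary row rather than every daily row.
print(quarterly_summary["2024Q1"])  # 17
```

In a real warehouse these summaries are often maintained by the database itself (e.g., as materialized views) rather than rebuilt by application code, but the space-for-speed trade is the same.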
Once again, I recommend reading “Super Crunchers” by Ian Ayres for more real world uses of data mining.