This article
attempts to explain the rudimentary concepts of data warehousing in the form of
typical data warehousing interview questions along with their standard answers.
After reading this article, you should gain good amount of knowledge on various
concepts of data warehousing.
Let us begin
with the simplest questions first, we will gradually move towards more complex
concepts later.
What is data warehouse?
Ø A data warehouse is a electronic storage of
an Organization's historical data for the purpose of reporting, analysis and
data mining or knowledge discovery.
Ø Other than that a data warehouse can also be
used for the purpose of data integration, master data management etc.
Ø According to Bill Inmon, a datawarehouse
should be subject-oriented, non-volatile, integrated and time-variant.
Explanatory
Note
Note here,
Non-volatile means that the data once loaded in the warehouse will not get
deleted later. Time-variant means the data will change with respect to time.
What is the benefit of data
warehouse?
A data
warehouse helps to integrate data and store them historically so that we can
analyze different aspects of business including, performance analysis, trend,
prediction etc. over a given time frame and use the result of our analysis to
improve the efficiency of business processes.
Why Data Warehouse is used?
For a long
time in the past and also even today, Data warehouses are built to facilitate
reporting on different key business processes of an organization, known as KPI.
Data warehouses also help to integrate data from different sources and show a
single-point-of-truth values about the business measures.
What is the difference
between OLTP and OLAP?
OLTP is the
transaction system that collects business data. Whereas OLAP is the reporting
and analysis system on that data.
OLTP systems
are optimized for INSERT, UPDATE operations and therefore highly normalized. On
the other hand, OLAP systems are deliberately denormalized for fast data
retrieval through SELECT operations.
Explanatory
Note:
In a
departmental shop, when we pay the prices at the check-out counter, the sales
person at the counter keys-in all the data into a "Point-Of-Sales"
machine. That data is transaction data and the related system is a OLTP system.
On the other hand, the manager of the store might want to view a report on out-of-stock materials, so that he can place purchase order for them. Such report will come out from OLAP system
On the other hand, the manager of the store might want to view a report on out-of-stock materials, so that he can place purchase order for them. Such report will come out from OLAP system
What is data mart?
Data marts
are generally designed for a single subject area. An organization may have data
pertaining to different departments like Finance, HR, Marketting etc. stored in
data warehouse and each department may have separate data marts. These data
marts can be built on top of the data warehouse.
What is ER model?
ER model or
entity-relationship model is a particular methodology of data modelling wherein
the goal of modelling is to normalize the data by reducing redundancy. This is
different than dimensional modelling where the main goal is to improve the data
retrieval mechanism.
What is dimensional modelling?
Dimensional
model consists of dimension and fact tables. Fact tables store different
transactional measurements and the foreign keys from dimension tables that qualify
the data. The goal of Dimensional model is not to achieve high degree of
normalization but to facilitate easy and faster data retrieval.
Ralph Kimball
is one of the strongest proponents of this very popular data modelling
technique which is often used in many enterprise level data warehouses.
What is dimension?
A dimension
is something that qualifies a quantity (measure).
For an
example, consider this: If I just say… “20kg”, it does not mean anything. But
if I say, "20kg of Rice (Product) is sold to Ramesh (customer) on 5th
April (date)", then that gives a meaningful sense. These product,
customer and dates are some dimension that qualified the measure -
20kg.
Dimensions
are mutually independent. Technically speaking, a dimension is a data element
that categorizes each item in a data set into non-overlapping regions.
What is Fact?
A fact is
something that is quantifiable (Or measurable). Facts are typically (but not
always) numerical values that can be aggregated.
What are additive,
semi-additive and non-additive measures?
Non-additive Measures
Non-additive
measures are those which can not be used inside any numeric aggregation
function (e.g. SUM(), AVG() etc.). One example of non-additive fact is any kind
of ratio or percentage. Example, 5% profit margin, revenue to asset ratio etc.
A non-numerical data can also be a non-additive measure when that data is stored
in fact tables, e.g. some kind of varchar flags in the fact table.
Semi Additive Measures
Semi-additive
measures are those where only a subset of aggregation function can be applied.
Let’s say account balance. A sum() function on balance does not give a useful
result but max() or min() balance might be useful. Consider price rate or
currency rate. Sum is meaningless on rate; however, average function might be
useful.
Additive Measures
Additive
measures can be used with any aggregation function like Sum(), Avg() etc.
Example is Sales Quantity etc.
What is Star-schema?
This schema
is used in data warehouse models where one centralized fact table references
number of dimension tables so as the keys (primary key) from all the dimension
tables flow into the fact table (as foreign key) where measures are stored.
This entity-relationship diagram looks like a star, hence the name.
Consider a
fact table that stores sales quantity for each product and customer on a
certain time. Sales quantity will be the measure here and keys from customer,
product and time dimension tables will flow into the fact table.
What is snow-flake schema?
This is
another logical arrangement of tables in dimensional modelling where a
centralized fact table references number of other dimension tables; however,
those dimension tables are further normalized into multiple related tables.
Consider a
fact table that stores sales quantity for each product and customer on a
certain time. Sales quantity will be the measure here and keys from customer,
product and time dimension tables will flow into the fact table. Additionally
all the products can be further grouped under different product families stored
in a different table so that primary key of product family tables also goes
into the product table as a foreign key. Such construct will be called a
snow-flake schema as product table is further snow-flaked into product family.
Note
Snow-flake increases degree of normalization in the design.
Snow-flake increases degree of normalization in the design.
What are the different
types of dimension?
In a data
warehouse model, dimension can be of following types,
1.
Conformed
Dimension
2.
Junk
Dimension
3.
Degenerated
Dimension
4.
Role Playing
Dimension
Based on how
frequently the data inside a dimension changes, we can further classify
dimension as
1.
Unchanging or
static dimension (UCD)
2.
Slowly
changing dimension (SCD)
3.
Rapidly
changing Dimension (RCD)
What is a 'Conformed
Dimension'?
A conformed
dimension is the dimension that is shared across multiple subject area.
Consider 'Customer' dimension. Both marketing and sales department may use the
same customer dimension table in their reports. Similarly, a 'Time' or 'Date'
dimension will be shared by different subject areas. These dimensions are
conformed dimension.
Theoretically,
two dimensions which are either identical or strict mathematical subsets of one
another are said to be conformed.
What is degenerated
dimension?
A degenerated
dimension is a dimension that is derived from fact table and does not have its
own dimension table.
A dimension
key, such as transaction number, receipt number, Invoice number etc. does not
have any more associated attributes and hence can not be designed as a
dimension table.
What is junk dimension?
A junk
dimension is a grouping of typically low-cardinality attributes (flags,
indicators etc.) so that those can be removed from other tables and can be
junked into an abstract dimension table.
These junk
dimension attributes might not be related. The only purpose of this table is to
store all the combinations of the dimensional attributes which you could not
fit into the different dimension tables otherwise. Junk dimensions are often
used to implement Rapidly Changing Dimensions in data warehouse.
What is a role-playing
dimension?
Dimensions
are often reused for multiple applications within the same database with
different contextual meaning. For instance, a "Date" dimension can be
used for "Date of Sale", as well as "Date of Delivery", or
"Date of Hire". This is often referred to as a 'role-playing
dimension'
What is SCD?
SCD stands
for slowly changing dimension, i.e. the dimensions where data is slowly
changing. These can be of many types, e.g. Type 0, Type 1, Type 2, Type 3 and
Type 6, although Type 1, 2 and 3 are most common.
What is rapidly changing
dimension?
This is a
dimension where data changes rapidly.
Describe different types of
slowly changing Dimension (SCD)
Type 0:
A Type 0
dimension is where dimensional changes are not considered. This does not mean
that the attributes of the dimension do not change in actual business
situation. It just means that, even if the value of the attributes change,
history is not kept and the table holds all the previous data.
Type 1:
A type 1
dimension is where history is not maintained and the table always shows the
recent data. This effectively means that such dimension table is always updated
with recent data whenever there is a change, and because of this update, we
lose the previous values.
Type 2:
A type 2 dimension
table tracks the historical changes by creating separate rows in the table with
different surrogate keys. Consider there is a customer C1 under group G1 first
and later on the customer is changed to group G2. Then there will be two
separate records in dimension table like below,
Key
|
Customer
|
Group
|
Start Date
|
End Date
|
1
|
C1
|
G1
|
1st Jan 2000
|
31st Dec 2005
|
2
|
C1
|
G2
|
1st Jan 2006
|
NULL
|
Note that
separate surrogate keys are generated for the two records. NULL end date in the
second row denotes that the record is the current record. Also note that,
instead of start and end dates, one could also keep version number column (1, 2
… etc.) to denote different versions of the record.
Type 3:
A type 3
dimension stored the history in a separate column instead of separate rows. So
unlike a type 2 dimension which is vertically growing, a type 3 dimension is
horizontally growing. See the example below,
Key
|
Customer
|
Previous Group
|
Current Group
|
1
|
C1
|
G1
|
G2
|
This is only
good when you need not store many consecutive histories and when date of change
is not required to be stored.
Type 6:
A type 6
dimension is a hybrid of type 1, 2 and 3 (1+2+3) which acts very similar to
type 2, but only you add one extra column to denote which record is the current
record.
Key
|
Customer
|
Group
|
Start Date
|
End Date
|
Current Flag
|
1
|
C1
|
G1
|
1st Jan 2000
|
31st Dec 2005
|
N
|
2
|
C1
|
G2
|
1st Jan 2006
|
NULL
|
Y
|
What is a mini dimension?
Mini
dimensions can be used to handle rapidly changing dimension scenario. If a
dimension has a huge number of rapidly changing attributes it is better to
separate those attributes in different table called mini dimension. This is
done because if the main dimension table is designed as SCD type 2, the table
will soon outgrow in size and create performance issues. It is better to
segregate the rapidly changing members in different table thereby keeping the
main dimension table small and performing.
What is a fact-less-fact?
A fact table
that does not contain any measure is called a fact-less fact. This table will
only contain keys from different dimension tables. This is often used to
resolve a many-to-many cardinality issue.
Explanatory
Note:
Consider a
school, where a single student may be taught by many teachers and a single
teacher may have many students. To model this situation in dimensional model,
one might introduce a fact-less-fact table joining teacher and student keys.
Such a fact table will then be able to answer queries like,
1.
Who are the
students taught by a specific teacher.
2.
Which teacher
teaches maximum students.
3.
Which student
has highest number of teachers.etc. etc.
What is a coverage fact?
A
fact-less-fact table can only answer 'optimistic' queries (positive query) but cannot
answer a negative query. Again consider the illustration in the above example. A
fact-less fact containing the keys of tutors and students cannot answer a query
like below,
1.
Which teacher
did not teach any student?
2.
Which student
was not taught by any teacher?
Why not?
Because fact-less fact table only stores the positive scenarios (like student
being taught by a tutor) but if there is a student who is not being
taught by a teacher, then that student's key does not appear in this table,
thereby reducing the coverage of the table.
Coverage fact
table attempts to answer this - often by adding an extra flag column. Flag = 0
indicates a negative condition and flag = 1 indicates a positive condition. To
understand this better, let's consider a class where there are 100 students and
5 teachers. So coverage fact table will ideally store 100 X 5 = 500 records
(all combinations) and if a certain teacher is not teaching a certain student,
the corresponding flag for that record will be 0.
What are incident and
snapshot facts
A fact table
stores some kind of measurements. Usually these measurements are stored (or
captured) against a specific time and these measurements vary with respect to
time. Now it might so happen that the business might not able to capture all of
its measures always for every point in time. Then those unavailable
measurements can be kept empty (Null) or can be filled up with the last
available measurements. The first case is the example of incident fact and the
second one is the example of snapshot fact.
What is aggregation and
what is the benefit of aggregation?
A data
warehouse usually captures data with same degree of details as available in
source. The "degree of detail" is termed as granularity. But all
reporting requirements from that data warehouse do not need the same degree of
details.
To understand
this, let's consider an example from retail business. A certain retail chain
has 500 shops accross Europe. All the shops record detail level transactions
regarding the products they sale and those data are captured in a data
warehouse.
Each shop
manager can access the data warehouse and they can see which products are sold
by whom and in what quantity on any given date. Thus the data warehouse helps
the shop managers with the detail level data that can be used for inventory
management, trend prediction etc.
Now think
about the CEO of that retail chain. He does not really care about which certain
sales girl in London sold the highest number of chopsticks or which shop is the
best seller of 'brown breads'. All he is interested is, perhaps to check the
percentage increase of his revenue margin across Europe. Or maybe year to year
sales growth on eastern Europe. Such data is aggregated in nature. Because
Sales of goods in East Europe is derived by summing up the individual sales
data from each shop in East Europe.
Therefore, to
support different levels of data warehouse users, data aggregation is needed.
What is slicing-dicing?
Slicing means
showing the slice of a data, given a certain set of dimension (e.g. Product)
and value (e.g. Brown Bread) and measures (e.g. sales).
Dicing means
viewing the slice with respect to different dimensions and in different level
of aggregations.
Slicing and
dicing operations are part of pivoting.
What is drill-through?
Drill through
is the process of going to the detail level data from summary data.
Consider the
above example on retail shops. If the CEO finds out that sales in East Europe
has declined this year compared to last year, he then might want to know the
root cause of the decrease. For this, he may start drilling through his report
to more detail level and eventually find out that even though individual shop
sales has actually increased, the overall sales figure has decreased because a
certain shop in Turkey has stopped operating the business. The detail level of
data, which CEO was not much interested on earlier, has this time helped him to
pin point the root cause of declined sales. And the method he has followed to
obtain the details from the aggregated data is called drill through.
3 comments:
Thanks for sharing interview questions and also answers swathi
It was really a nice article and I was really impressed by reading this article We are also giving all software Course Online Training. The Data Warehousing Courses Online Training is one of the leading Online Training institute in the world.
Great article,Thank You,,,
Keep Updating...
MSBI Online Training Hyderabad
Post a Comment