Data Cleansing Does Not Create Lasting Quality
This entry was posted on 8 Nov 2006 and is filed under Data Quality.
I have seen it many, many times. Organizations in the middle of implementing a
new system find they are unable to load, migrate, or integrate data from their
existing systems. After a little research they find that the system is working
as designed; the problem is that their data is “dirty” and is causing the
system to fail. Suddenly the organization realizes it has no choice but to
“fix” the data. Now, in a panic to maintain the project schedule and minimize
costs, they scramble to find a tool or method to clean the data. A variety of
options exist here: using de-duplication tools, developing algorithms and
scripts to “fix” the data, manually cleaning the data, or changing
functionality or business rules to eliminate the need for some of the problem
data. So the organization pushes through with one or more approaches and
eventually massages the data into a usable form. But does this really fix the
data quality problem? Or does this approach solve the immediate crisis while
creating an additional time bomb set to go off during the next system
implementation?
Sure, cleansing the data to address the immediate problem does offer some
relief, but it does not actually solve the underlying problem. Merely cleansing
the data does nothing to correct the organizational and user behavior issues
that caused the data problems in the first place. What happens to our data
quality after the new system has been live for a few months? Sure, we cleaned
the data for the initial load, but does that mean our data stays clean? NO WAY!
The problem is that our data
cleansing efforts merely treated the symptom and left the underlying problem
untouched. Data cleansing did nothing to
change user behavior that caused the data problems, thus ensuring users will
continue to pollute the data going forward.
We need to solve the root problem.
We need to adjust the organizational forces so that users act in a manner
that results in high-quality data.
While there is indeed a time and a place to clean dirty data to meet an
immediate need, we should not fool ourselves into thinking that this is a
lasting solution for our data quality problems. Whenever we undertake a data
cleansing effort, we need to make sure we also adopt a data quality program
that focuses on changing user behavior. Otherwise, we will find ourselves in an
infinite loop of cleaning-corrupting-cleaning our data.