Dave Langton, VP Product, Matillion
With more data generated by individuals and enterprises each day, its importance to businesses looms ever larger. When used and managed correctly, data is now the world’s most precious commodity. But because of its sheer volume, organisations often run the risk of incomplete or inconsistent datasets, which ultimately means businesses miss out on significant market opportunities and profits. Indeed, research from the DAMA Data Management Body of Knowledge found that businesses spend 10 to 30% of revenue on solving these data quality problems.
Modern data teams, now recognising the importance of data integrity, are increasingly focusing their efforts on preserving it as they prepare data for analysis. If you’re not familiar with the term, ‘Data Integrity’ encompasses the accuracy, completeness, consistency, and compliance of data within systems. It’s an aspirational state that data teams aim for, and the term also covers the processes used to achieve it. The definition comprises several aspects of data, from its physical integrity (how safely it is stored), to its logical integrity (its accuracy, completeness and correctness) and matters of compliance (whether it meets necessary standards, such as GDPR). Many modern distributed data systems have actually relaxed built-in support for checking logical integrity in the interests of maximising performance, leaving teams to explore other strategies for ensuring correctness.
Achieving data integrity is ultimately a way to ensure better performance, reliability, and access for an organisation. As teams embark on data integrity initiatives, there are four key risks that they should be aware of:
- Assigning accountability – Without uniform standards, inputting and working with data can create inconsistencies throughout the data system. Accountability is key to any organisation’s success and is especially important when it comes to managing data. Without it, there will likely be uncertainty about who is ultimately responsible for the integrity of your data.
- Outdated and inconsistent systems – Consistency is another tenet of data integrity, most often compromised by overlapping and outdated systems. Are important details stored in a standardised format across the database? Are different groups within your organisation working with the same datasets? Inconsistent data inhibits quality by creating duplicate records, data that is invalid for certain criteria, or data that is inaccessible at a given time.
- Inaccurate or incomplete data – Taking on more data can increase the difficulty of spotting incomplete or inaccurate records. Unifying data that was captured from multiple disparate systems at different points in time can also leave blind spots or inaccuracies that become buried deeper and deeper in the growing data pool. Integrity requires not only that data be correct today, but that it can withstand the demands placed on it further down the line.
- Keeping track of data – The complications brought on by trying to track those mistakes down and resolve them weeks, months, or years in the future can be even more costly than the original errors. Not having reliable audit trails for your data means uncertainty about who made changes and when. Some organisations establish audit trails but never review them, rendering them just as ineffective as having none at all.
Once teams are aware of the areas to watch when it comes to maintaining data integrity, implementing a plan to achieve and maintain it is critical. Because data touches every aspect of the organisation — and data teams are under pressure to manage and deliver it properly — establishing a comprehensive plan to keep data clean is essential. There are four strategic steps to a data integrity plan that modern data teams should adopt:
- Invest in integration as your datasets grow – As a long-term investment, the time and resources required to integrate data now can pale in comparison to the money and manpower an organisation can save as datasets grow. Solutions such as data preparation and ETL applications can improve consistency by not only organising data, but cleansing it in the process to help remove inconsistencies. ETL is a critical step as data volumes increase and data types vary more widely.
- Factor in a ‘data steward’ – Give employees a place to turn by appointing a ‘data steward’ to oversee a specific set of data – or the organisation’s data system as a whole. In addition, regular training sessions with employees can minimise errors at the point of entry, and establish a system of accountability and a clear framework for managing data. As data teams grow, a data catalogue can help democratise data usage further by building trust in the datasets that matter.
- Audit trails and validation – Stewards can also monitor audit trails and take quick corrective action. Audit trails reveal what changes have been made, and by whom, tracking alterations down to the date they were made. Inaccurate or incomplete data is not only identified, but tracked to its source. Through this process, stewards can also confidently validate the data being relied upon to guide the organisation’s future.
- Create a robust testing system – Audit trails are far less effective when they aren’t reviewed on a regular basis. Avoid guessing at data accuracy by creating a regular testing system that augments a strong validation process. This helps ensure, for example, that data hasn’t been entered into conflicting field types for weeks or months before being discovered. Just like going to the doctor, finding the problem early is often the best way to tackle it.
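To make the validation, audit-trail, and testing steps above concrete, here is a minimal illustrative sketch in Python. The field names (`customer_id`, `email`, `signup_date`) and the rules are hypothetical examples invented for illustration, not a reference to any particular product or dataset; a real pipeline would draw its rules from the data steward’s standards.

```python
from datetime import datetime, timezone

# Hypothetical required fields for an example customer record.
REQUIRED_FIELDS = {"customer_id", "email", "signup_date"}

def validate_record(record: dict) -> list[str]:
    """Return a list of integrity problems found in one record."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    email = record.get("email")
    if email is not None and "@" not in str(email):
        problems.append("email is not a valid address")
    date = record.get("signup_date")
    if date is not None:
        try:
            # Enforce one standardised date format across the dataset.
            datetime.strptime(date, "%Y-%m-%d")
        except (TypeError, ValueError):
            problems.append("signup_date is not in YYYY-MM-DD format")
    return problems

def audit_batch(records: list[dict]) -> list[dict]:
    """Validate a batch, emitting a simple audit-trail entry per failure
    so each problem is timestamped and traceable to its source record."""
    trail = []
    for i, record in enumerate(records):
        for problem in validate_record(record):
            trail.append({
                "checked_at": datetime.now(timezone.utc).isoformat(),
                "record_index": i,
                "problem": problem,
            })
    return trail
```

Run regularly (for example on every load), a check like this catches incomplete records and conflicting field formats at the point of entry rather than months later, and the timestamped trail gives a steward something concrete to review.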
Planning for the future is all about spotting and tackling potential obstacles before they grow into major issues. If data represents the new way of doing business, then success and profitability in that environment require organisations to spend time making sure that data integrity is prioritised alongside the tools needed to navigate a changing world – securing your organisation’s place in it.