Written by Tomaz Lago, Data Engineer
4-minute read
Your data science project cannot succeed if you don’t invest in data engineering
Data engineering is the foundation for all of today’s data-driven projects.
In the book “Big Tech”, Evgeny Morozov takes an alarmist view of how technology giants such as Google and Facebook monopolize data and the transmission of information. Although the book’s approach is dystopian, those who work in this field know that it is not just Big Tech that takes advantage of the volume of data collected and stored. Even the cost of data storage is no longer a financial or technical obstacle, because cloud computing has become commonplace for organizations.
As a result, executives from many different industries have begun to invest in big data and data science so that they can make decisions based on data, stay ahead of competitors and perhaps discover a new business model or proposition along the way. Why, then, does Nick Heudecker, a data analyst at Gartner, say that 85% of these types of projects fail? Mostly, the issue is the lack of a data engineering expert or team on the project.
When we talk about failures in big data and data science projects, there are four main reasons, according to Andy Patrizio: poor data integration; technical and business skills gaps; technology generation gaps and software architecture limitations; and poorly defined goals. Of these, only ‘poorly defined goals’ can be solved without a data engineering expert. The other three can be corrected with the help of data engineering, ideally with people on your project dedicated exclusively to it.
Before we go any further, we need to clarify what data engineering does. Let us use the Gartner definition (free translation) so we have a reliable source: “The data engineer’s primary responsibility is to create data pipelines that provide quality information ready for analysis. You often need several iterations to create a data pipeline. It involves enriching and combining datasets, so that you have meaningful data to input when you develop a model. You have to have a strong focus on combining, modeling and optimizing the data, as well as on quality, governance and security.”
Let us go back to our issues, examine them and see how data engineering can help in a project.
Poor Data Integration
The job of enriching data from different systems (such as ERP, accounting, HR, social media, etc.) can be simple, such as cross-linking information between two or more systems, or it can be complex, such as using machine learning algorithms to create data clusters.
It is also extremely important to ensure the quality of the data being delivered, including dealing with blank and invalid values, depending on the format. This sort of activity is technically complex and it can distract the data science team, who should be thinking about the business.
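To make the simple end of that spectrum concrete, here is a minimal sketch of cross-linking two systems while handling blank and invalid values. The HR and ERP extracts, the field names and the rejection rules are all assumptions made up for illustration; a real pipeline would read from the actual source systems and apply rules agreed with the business.

```python
# Hypothetical extracts from two systems; every field name here is illustrative.
hr_rows = [
    {"employee_id": "E1", "name": "Ana", "department": "Sales"},
    {"employee_id": "E2", "name": "", "department": "Finance"},  # blank name
    {"employee_id": "E3", "name": "Caio", "department": "Sales"},
]
erp_rows = [
    {"employee_id": "E1", "monthly_cost": "4200.50"},
    {"employee_id": "E2", "monthly_cost": "3900"},
    {"employee_id": "E3", "monthly_cost": "n/a"},  # not a number
]


def parse_cost(raw):
    """Return the cost as a non-negative float, or None if it is unusable."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    # The comparison is False for NaN, so NaN is rejected as well.
    return value if value >= 0 else None


def integrate(hr_rows, erp_rows):
    """Cross-link HR and ERP rows on employee_id and separate out bad records."""
    cost_by_id = {
        row["employee_id"]: parse_cost(row.get("monthly_cost")) for row in erp_rows
    }
    clean, rejected = [], []
    for row in hr_rows:
        emp_id = row["employee_id"]
        name = (row.get("name") or "").strip()
        cost = cost_by_id.get(emp_id)
        if not name:
            rejected.append((emp_id, "blank name"))
        elif cost is None:
            rejected.append((emp_id, "missing or invalid cost"))
        else:
            clean.append(
                {
                    "employee_id": emp_id,
                    "name": name,
                    "department": row["department"],
                    "monthly_cost": cost,
                }
            )
    return clean, rejected


clean, rejected = integrate(hr_rows, erp_rows)
print(clean)
print(rejected)
```

Keeping the rejected records and the reason for each rejection, rather than silently dropping them, is one way to let the data science team trust what they receive without having to do this cleaning themselves.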
Technical and Business Skills Gaps
A data science expert is considered something of a unicorn, because they need to combine three important areas of expertise: the business, statistics and programming. However, once the data science team’s hypothesis has been processed, we need to hand the result over to the product team, management, and the directors.
CI/CD (continuous integration/continuous deployment) is a well-known process for development and operations teams and is part of the principles behind DevOps. However, when we are talking about CI/CD in relation to data, we are talking about a pipeline that turns the data science team’s dataset deliverable into something that can be used by other areas, in the types of repositories that these teams need and in an automated, controlled and managed way.
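As a rough illustration of what such a pipeline step might look like, the sketch below runs automated checks on a dataset deliverable and only then converts it into the formats other teams consume. The function names, the checks and the two output formats (CSV and JSON) are assumptions chosen for the example, not a prescribed design.

```python
import csv
import io
import json


def check_dataset(rows, required_columns):
    """Return a list of problems found; an empty list means the gate passes."""
    problems = []
    if not rows:
        problems.append("dataset is empty")
    for i, row in enumerate(rows):
        for col in required_columns:
            if row.get(col) in (None, ""):
                problems.append(f"row {i}: missing {col}")
    return problems


def to_csv(rows, columns):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def publish(rows, columns):
    """Validate the deliverable, then build one artifact per consuming team."""
    problems = check_dataset(rows, columns)
    if problems:
        # Failing loudly stops a bad dataset from reaching other areas.
        raise ValueError("; ".join(problems))
    return {
        "csv": to_csv(rows, columns),               # e.g. for spreadsheet users
        "json": json.dumps(rows, sort_keys=True),   # e.g. for an application API
    }


# Hypothetical deliverable from the data science team.
rows = [{"segment": "A", "score": 0.82}, {"segment": "B", "score": 0.41}]
artifacts = publish(rows, ["segment", "score"])
print(artifacts["csv"])
```

In an automated setup, a step like this would typically run on every new version of the dataset, so that nothing reaches the product or management teams without passing the same checks.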
Technology Generation and Software Architecture Gaps
There are plenty of technical development and software architecture issues that have to be considered constantly. Data engineering deals with these technical issues and maintains and develops the pipelines and the governance of the data, which ensures that the deliverables are always consistent and up to date. This frees up the data science team to find answers to the business issues they are investigating. It is a division of work designed to ensure that the business continues to run smoothly.
Poorly Defined Goals
It is quite common in projects of this type not to know what to ask of the data. Many companies want Big Data, Analytics and Data Science projects, but once they have them, they do not know how to use them.
It is extremely important, therefore, that the data scientists are connected to the organization’s strategy and that they understand the product roadmaps, and are in close communication with the business teams. In this instance, the data engineering team acts as support, ensuring that the goals are achieved from data that is available, reliable and usable.
Data engineering is the foundation for all of today’s data-driven projects. This field can guarantee that your projects are successful. In short: there is no successful data science project without good engineering.