Being a Data Engineer at SpazioDati

Andrea Scarpino
Nov 2, 2022
Reading time: 5 min

Updated: Mar 6, 2023

Frontend developers write programs that interact with the user, whereas backend developers write programs that interact with other programs and work behind the scenes, making use of databases to organize the data of the application. If an application is an e-commerce platform, the data it works with would describe the list of customers of the platform, the list of goods being sold, the list of orders made by the customers, and so on. What about data engineers then? Let’s see how this role appeared and evolved in SpazioDati.

In SpazioDati’s early days our motto was “all you need is data”. We had a dream of creating an easily accessible marketplace of data of different kinds. The data we were collecting represented knowledge about the world – the enterprise world, to be precise. At first, it was the official information about companies in Italy, then also publicly accessible data (like companies’ websites). We would produce a new unified database of knowledge about such companies, which would power (and still powers) our main product – Atoka.

Data engineer in the wild: we used to take photos of our branded “All you need is data” hoodies all over the world (here in Quito)

At this point, we were describing our jobs as “Backend Engineers” (some of us also added “& Data Scientist”) on the team page of the website (check it out on WebArchive if you are curious!). For us, this was just another way of writing programs that are not facing the user – what the backend engineers usually do.

Then our CTO Davide Setti, while updating the company website a few years ago, said: “Look, guys, you are not actually Backend Engineers, but rather Data Engineers!”. He was right, and this is how, with a simple git commit and git push by Davide in the code repository of our website, a few of us officially became Data Engineers! Did we realize at the time what that role meant? Most certainly not.

Over the years our data system grew to include more than a hundred different datasets, coming from dozens of data providers: our main database now occupies almost 2 terabytes of disk space. By comparison, the database of a typical e-commerce website may occupy hundreds of megabytes, maybe a few gigabytes, but not 2 TB! We had to come up with a system that is easy to scale, that allows us to add new datasets with the least possible effort, and whose code is unified as much as possible. We also had to learn to use ugly, disposable scripts wisely: that is, not to over-engineer short-lived code.

We were truly standing on the shoulders of giants. On the technology side, I must mention, first of all, AWS, which has been the backbone of all our systems at SpazioDati. In particular, we have been heavily using Postgres, Elasticsearch, Spark, Kafka, Kubernetes, and our favorite programming languages: Python and Java/Scala. Working with data experts (often known as Data Owners) was also crucial: at SpazioDati we even have a special role called Data Officer.

Although we have been using Agile from the beginning, we developed a special knack for pair programming, where n (usually 2) people would work on the same problem at the same time, switching sides. Pair programming would allow us to get through the toughest parts of the tasks, which we would struggle to do individually. In any case, every task would usually be spread among several members of the team, because different phases of the work would normally be done by different people, thus allowing for knowledge and responsibility sharing.

We had to face imperfections in the data. Sometimes we would find an active company director who was supposedly more than 100 years old, which would happen when a company had not been operating for many years and the data simply wasn’t updated. Sometimes the data would contain typos (like a surname written Fragoletii instead of Fragoletti). Sometimes the data would be incomplete or inconsistent, such as a person's record with the name/surname set to “DA INDICARE” (“to be specified”). Sometimes we would even find blatantly wrong data. For example, the VAT ID of our company, SpazioDati S.r.l., was used many times by people documenting public tenders to indicate the VAT ID of a tender participant (which was in fact not SpazioDati). How did we discover that? Simple: in Atoka we have a page listing all the public tenders a company has participated in, and, surprise, according to that data SpazioDati appeared to have participated in dozens of them (which is not true).
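To make these kinds of checks concrete, here is a minimal sketch in Python of how placeholder names, implausible ages and likely typos could be flagged. It is an illustration only, not the actual SpazioDati code: the record layout (`name`, `surname`, `birth_year`), the function names and the similarity threshold are all assumptions.

```python
from datetime import date
from difflib import SequenceMatcher

# Assumed placeholder values; the text only mentions "DA INDICARE".
PLACEHOLDERS = {"DA INDICARE"}
# Threshold taken from the "more than 100 years old" example in the text.
MAX_PLAUSIBLE_AGE = 100


def find_issues(person, today=None):
    """Return a list of issues found in one person record.

    `person` is a hypothetical dict with the keys name, surname, birth_year.
    """
    today = today or date.today()
    issues = []
    for field in ("name", "surname"):
        value = (person.get(field) or "").strip()
        if not value:
            issues.append(f"{field} is missing")
        elif value.upper() in PLACEHOLDERS:
            issues.append(f"{field} is a placeholder")
    birth_year = person.get("birth_year")
    if birth_year is not None and today.year - birth_year > MAX_PLAUSIBLE_AGE:
        issues.append("age is implausible")
    return issues


def looks_like_typo(a, b, threshold=0.85):
    """True if two different spellings are suspiciously similar.

    The threshold is an arbitrary assumption and would need tuning on real data.
    """
    a, b = a.lower(), b.lower()
    return a != b and SequenceMatcher(None, a, b).ratio() >= threshold
```

With this sketch, `looks_like_typo("Fragoletii", "Fragoletti")` flags the pair from the example above, while a check like the VAT ID one would still need a human looking at the result and noticing that something is off.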

Size of the data in our master Postgres database

But what does a workday look like today? Let’s look at a day when a new dataset arrives and the data engineers have to incorporate it into our systems.

First of all, a few questions must be answered. What’s the format of the data? What does it mean? What is the ID of each record (and is there an identifier at all)? How do we match it with the existing knowledge base? What are the data types of particular fields? A data engineer must acquire some domain knowledge to answer these questions. For example, when we had to integrate RNA data, in particular information about COVID-related funding, we had to work closely with data experts at Cerved Group to come up with a set of rules to extract the relevant information. The most important question that the customer was asking was: does this company still have this kind of state funding available (or has it used it up to the limit)? That was not the easiest thing to grasp 😅
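Two of the questions above, whether every record has a usable ID and how the records match the existing knowledge base, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the real integration code: the records are plain dicts, the ID field name is passed in by the caller, and the knowledge base is reduced to a set of known IDs.

```python
from collections import Counter


def check_ids(records, id_field):
    """Report how many records lack an ID and which IDs occur more than once."""
    ids = [r.get(id_field) for r in records]
    missing = sum(1 for i in ids if not i)
    counts = Counter(i for i in ids if i)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return {"missing": missing, "duplicates": duplicates}


def match_to_knowledge_base(records, known_ids, id_field):
    """Split records into those whose ID is already known and the rest."""
    matched, unmatched = [], []
    for record in records:
        if record.get(id_field) in known_ids:
            matched.append(record)
        else:
            unmatched.append(record)
    return matched, unmatched


# Hypothetical sample: `vat_id` is an assumed field name.
new_records = [
    {"vat_id": "IT001", "amount": 1000},
    {"vat_id": "IT002", "amount": 250},
    {"vat_id": "IT001", "amount": 300},
    {"vat_id": None, "amount": 75},
]
report = check_ids(new_records, "vat_id")
matched, unmatched = match_to_knowledge_base(new_records, {"IT001"}, "vat_id")
```

A duplicated ID is not necessarily an error (one company may receive several grants), which is exactly the kind of thing that requires domain knowledge to interpret.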

Besides, a data engineer often needs to know what the possible set of values is for a given field in a dataset. In this case, we perform an analysis using – surprise! – data science tools like Jupyter notebooks with pandas (in addition to plain database queries).
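As an illustration of this kind of analysis, the following sketch shows how the set of values of one field might be profiled with pandas in a notebook. The column name `status` and the sample values are made up for the example; only `value_counts` and `nunique` are standard pandas calls.

```python
import pandas as pd


def value_profile(series):
    """Count each distinct value of a column, keeping missing values visible."""
    return series.value_counts(dropna=False)


# Hypothetical sample; in practice the frame would be loaded from the new dataset.
df = pd.DataFrame(
    {"status": ["ACTIVE", "active", "CLOSED", None, "ACTIVE", "N/A"]}
)

profile = value_profile(df["status"])
distinct_non_missing = df["status"].nunique()  # missing values are not counted
```

Even on a toy sample like this, the profile surfaces typical surprises: the same value in different casing, a textual "N/A" next to a real missing value, and so on.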

Finally! The data is organized, safely stored in our databases, and even integrated as a field or group of fields in the Atoka knowledge database. It is really satisfying when these new data start bringing value to Atoka users!

Once the new dataset is integrated into the system, engineers return to the more common routines – refactoring of the data processing scripts and components, adding monitoring, continuous integration, unit testing, keeping tools and libraries updated, tackling tech debt, and so on.

A screenshot of one of our dashboards in Grafana, a staple tool for monitoring and analysing performance

Then eventually some of the systems start struggling with performance, and the engineers need to gather performance data, analyze the metrics, identify the bottlenecks and the weak points in the design, and come up with an improvement: it may be adding more resources, or adding a workaround, or, if nothing else works, developing an alternative, more performant solution. Some improvements may take months of development, even years, like the migration from an FTP-based data transfer to a streaming, Kafka-based one.

Understanding the data, importing it, processing it, adding something from it to the Atoka knowledge base, and devising the machinery to keep the data safe and sound. That’s what keeps Atoka’s data alive and kicking, and that’s what SpazioDati’s Data Engineers do on a daily basis.

That, and also discovering yet another way a human can make a typo during data entry 🙈 What can I say, no dataset is perfect!

Are you a data engineer who loved this description of the role at SpazioDati? Are you looking for a challenging but very interesting work environment? Then please send us your CV. Or check out our other open positions!
