B2B lead generation through Data Science and Machine Learning

A few years ago, when we received our first request to perform Business-to-Business lead generation, we knew we had to come up with something completely new, taking advantage of both our data and a novel data science/ML approach. I am going to describe our solution and the challenges we faced. But first, let's make clear what problem we want to solve.

Imagine you are company A. You already have a list of companies that are your clients: let's call it C. You would like to get a list L of "leads", i.e. other companies which are very likely to become new clients of yours. Your goal, of course, is to reach out to the companies in L, propose your services/products, and gain new clients: you want not just new contacts, but contacts who are very likely to become your future clients (otherwise you would waste your effort). So you go to SpazioDati and say: "Here is C (the list of my current clients). Based on your company data, can you figure out by yourselves (studying C) what the ideal profile of my clients is, and then suggest new leads L to me?". Quite an interesting problem to solve!

A high-level view of our approach: the current clients of company A are used to train a Machine Learning model, which is then used to generate a list of promising leads for A.

Traditionally, this problem was tackled by human experts, who had the domain knowledge to evaluate possible contacts and identify the most promising ones. While the results may have been good, the process could be lengthy and costly; besides, an expert could only consider a limited number of companies. The advent of digitalization opened up new possibilities (e.g. inbound marketing techniques); however, these still required considerable effort from marketing/sales teams, and the universe of companies to pick from was often only a small subset of the existing ones.

In contrast, we wanted a solution with the following characteristics:

  • L must be tailored to company A and its list C;

  • The process must be as automated as possible, and yet provide good, useful leads;

  • The contacts in L must be as numerous as possible, therefore our solution must analyze every existing company to expand L;

  • As an added benefit, we return the leads as a list on Atoka, so the sales team has all the possible information to quickly get an idea of the lead, and the phone numbers/emails/addresses to contact it.

Our approach was refined over several iterations of performing B2B lead generation and combines Data Science (to gain insights from the input dataset) and Machine Learning (to create a model which takes information about a company as input and returns a score which "quantifies" how good a lead it would be). Let's go into more detail and look at the steps that our data scientists/ML engineers go through.

The main steps of our workflow.

The process starts with matching the input dataset C (usually a CSV file) against the Atoka database. This step is trivial if a unique identifier (like the VAT ID) is available for each company, but it may be challenging otherwise, both because real-life datasets can contain errors and because the information for each company in C may not be enough to easily match it with Atoka: for example, the company name contained in the CSV file may differ from the official one, or many different companies may have the same name. Therefore, more data analysis and pre-processing may be needed here.
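To make the matching step more concrete, here is a minimal sketch of one possible strategy: match on the VAT ID when it is present, otherwise fall back to a normalized company name and accept the match only if it is unambiguous. The field names (`id`, `vat_id`, `name`), the list of legal-form suffixes and both helper functions are assumptions made for illustration; they are not the Atoka schema or the actual matching logic.

```python
import re
import unicodedata

# Hypothetical legal-form suffixes to ignore when comparing names (assumption).
LEGAL_FORMS = {"srl", "spa", "snc", "sas", "ltd", "inc", "gmbh"}


def normalize_name(name):
    """Lowercase, strip accents and punctuation, drop legal-form tokens."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().replace(".", "")       # "S.r.l." -> "srl"
    text = re.sub(r"[^a-z0-9]+", " ", text)    # other punctuation -> space
    tokens = [t for t in text.split() if t not in LEGAL_FORMS]
    return " ".join(tokens)


def match_clients(rows, reference):
    """Match CSV rows to reference companies.

    Returns (matched, unresolved): `matched` maps row index -> reference id,
    `unresolved` lists row indexes with no match or an ambiguous name.
    """
    by_vat = {c["vat_id"]: c["id"] for c in reference if c.get("vat_id")}
    by_name = {}
    for c in reference:
        by_name.setdefault(normalize_name(c["name"]), []).append(c["id"])

    matched, unresolved = {}, []
    for i, row in enumerate(rows):
        vat = (row.get("vat_id") or "").strip()
        if vat in by_vat:
            matched[i] = by_vat[vat]           # exact identifier: trivial case
            continue
        candidates = by_name.get(normalize_name(row.get("name") or ""), [])
        if len(candidates) == 1:
            matched[i] = candidates[0]         # unique name match
        else:
            unresolved.append(i)               # needs further analysis
    return matched, unresolved
```

Rows that end up in `unresolved` are exactly the cases where more analysis and pre-processing would be needed, for example by also comparing addresses.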

Once the matching is done, we analyze the companies in C by leveraging the rich and diverse data about them available in Atoka. In this phase the objective is to create a profile of the input companies and to discover insights that will become useful later in the feature engineering of the ML model. The analyses range from simple, common ones (e.g. the geographical distribution) to more specific, tailored ones which depend on the input dataset (e.g. identifying outliers, or understanding the seasonality of purchases by companies in C).
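Two of the analyses mentioned above, the geographical distribution and outlier detection, can be sketched in a few lines. This is only an illustration: the `region` field is a hypothetical name, and the interquartile-range rule is one common outlier criterion, not necessarily the one used in practice.

```python
import statistics
from collections import Counter


def share_by(companies, field):
    """Return {value: fraction of companies}, in descending order of share."""
    counts = Counter(c.get(field) or "unknown" for c in companies)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.most_common()}


def iqr_outliers(values, k=1.5):
    """Return the values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]
```

For instance, `share_by(clients, "region")` shows at a glance whether the client base is concentrated in a single region, which is the kind of finding worth discussing with the stakeholders.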

We share the results of such analyses with the stakeholders in A: they always get some insights on aspects that they never considered about their client base, while we get useful feedback about possible features to include/exclude from the ML model. As an example, let's say that A has most of its current clients in a specific Italian region, but the stakeholders tell us that they would be happy to extend their business to other regions: if we are not careful when engineering the model, then it could learn to suggest companies in the same region exclusively.

Thanks to the previous analysis we have enough information and we can start creating the ML model. But a big issue comes up: for the training and testing of a model we need both positive and negative examples, i.e. companies that are good leads and companies that are not. While the positive case is easy (we can use the current clients for that), the negative one is definitely not. In order to identify such negative examples, we developed general heuristics which need to be adapted to the specific input dataset C.
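The heuristics themselves are not detailed here. Purely as an illustration of the problem, a generic baseline for this kind of "positives only" setting is to draw a random sample of non-client companies and treat them as presumed negatives. The sketch below assumes this baseline; the function name, the `ratio` parameter and the string identifiers are hypothetical, and this is not the approach described in this post.

```python
import random


def sample_presumed_negatives(all_ids, client_ids, ratio=3, seed=0):
    """Sample non-client companies to use as presumed negative examples.

    Assumption: a random non-client is unlikely to be a good lead, so it can
    stand in for a negative example. Some sampled companies may actually be
    good leads, which makes these labels noisy.
    """
    clients = set(client_ids)
    pool = sorted(set(all_ids) - clients)      # sorted for reproducibility
    k = min(len(pool), ratio * len(clients))   # e.g. 3 negatives per positive
    return random.Random(seed).sample(pool, k)
```

The noise in such labels is one reason why dataset-specific heuristics are preferable to blind sampling.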

Then we need to build a model. For each company in Atoka, we have a wealth of information: economic data, social data, info about their activity type, etc. (look here for more info). This rich information can be encoded as input features of the model: we have boolean features (e.g. whether the company has a website), numerical features (e.g. its latest revenue), and so on. In fact, it has already been encoded for each company in Atoka and is available for the ML engineer to use: at the time of writing we can extract about 600 generic features for each company. Of course, not every company has a value for all the features (e.g. for certain categories of companies the economic data are quite limited), therefore some features may be missing. And while there are some features that we specifically want to add or exclude after the feedback received from A (e.g. the geographical location in the previous example), we also want to make this step as automated as possible, without having to manually select features one by one, relying instead on the model to work out which are the most important ones. In other words, we built a model that can perform well both when some features are missing for an input company, and when it is fed several features which are not useful for the prediction.

Information about a company is encoded as input features of the ML model, whose result is a score in the interval [0, 1]
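As a rough sketch of the encoding idea, the function below turns a company record into a fixed-length numeric vector, maps booleans to 0/1, marks missing values as NaN, and skips features that were explicitly excluded. The feature names are hypothetical (they are not among the actual Atoka features), and using NaN for missing values is an assumption: it is a convention that some model families, such as certain gradient-boosted tree implementations, can consume directly.

```python
import math

# Hypothetical feature names, for illustration only.
FEATURES = ["has_website", "last_revenue", "employees", "in_home_region"]


def encode(company, feature_names=FEATURES, excluded=()):
    """Encode a company dict as a list of floats, with NaN for missing values."""
    vector = []
    for name in feature_names:
        if name in excluded:
            continue                       # dropped after stakeholder feedback
        value = company.get(name)
        if value is None:
            vector.append(math.nan)        # missing feature
        else:
            vector.append(float(value))    # True/False become 1.0/0.0
    return vector
```

Calling `encode(company, excluded={"in_home_region"})` mirrors the regional example: a geographic feature is left out so the model cannot learn to suggest only companies from the region where most current clients are located.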

Once the model is trained and ready, we feed all the existing companies to it. For each one, we get as output a score which estimates how likely the company is to be a good lead. We sort the companies by score, select the top ones and voilà: B2B lead generation is done!
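The final ranking step can be sketched as follows, assuming the scores are already available as a mapping from a company identifier to a value in [0, 1]. Excluding the current clients from the result is an assumption made here (they are already customers); the function and variable names are hypothetical.

```python
import heapq


def top_leads(scores, client_ids, k):
    """Return the ids of the k highest-scoring companies that are not clients."""
    clients = set(client_ids)
    candidates = [(cid, s) for cid, s in scores.items() if cid not in clients]
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[1])
    return [cid for cid, _ in best]
```

Using `heapq.nlargest` avoids sorting the whole universe of companies when only a short list of leads is needed.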

How useful are the generated leads? The only way to know for sure is for company A to contact those leads and see whether they convert into clients. We can proudly say that we have always received excellent feedback: all the companies for which we performed B2B lead generation found it very useful, the model produces high-quality leads, and it performs well even when the input dataset is relatively small (i.e. around 300 companies in C).


If you are a Machine Learning engineer and you would like to work on challenging problems like the one just described, send us your CV. Or have a look at our other open positions. We face interesting problems every day: you definitely won't get bored at SpazioDati!
