
Data Science, Data Mining, Knowledge Discovery... What are we talking about?



If there are many terms whose definitions overlap almost completely, there is clearly some fundamental confusion. The difference between Data Science, Data Mining, Data Analytics, and Knowledge Discovery is usually associated with the set of tools or languages used (do we program? do we use visual tools that facilitate modelling? open-source or proprietary software?), the profiles of those who perform the tasks (do they come from the hard sciences? are they developers? or analysts?), and minor details of the work to be done (is it predictive or prescriptive? does it use machine learning algorithms?).


Those of us who have been doing this work for some time consider that none of this really matters and that even the differences people raise are not real: in a single project, a little of everything mentioned above is probably used. And if not, it is not that important. So, unfortunately for the reader, we will not clear up the confusion.


Now, what is sought under this umbrella of names? Simply to extract implicit knowledge from data, which can be as simple as the correlation between two variables or as complex as predicting real-time demand across every city and neighborhood in the world.


If you are feeling a bit dizzy, you can use the following diagram to orient your ideas:




Basic Concepts


Although we can identify two main methodologies for this kind of work (SEMMA and CRISP-DM, both originating in Data Mining), the scheme below is what is currently used to summarize the activities carried out in analytical projects. It is important to emphasize that this is not a replacement for those methodologies, but rather the current way of identifying the typical components of an analytical project.


Collection, Processing, and Cleaning


"80% of the time is spent processing the data; 20% understanding it, and the rest of the time

modelling." Famous quote, unknown author.


The scenario for data collection and processing has changed decisively over the past few years. The landscape is now completed by the rise and massive adoption of social networks and Web 2.0, thanks to which we have additional sources that characterize our customers, as well as those of the competition!


Open Data initiatives, public data, and web applications increasingly oriented towards exposing their information to anyone who wishes to consume it help us complement our own data with contextual information. The constant, real-time flow of information, whether from our machinery on the production floor, the alarm system, our fleet's GPS, microcomputers used for specific purposes, or applications such as Twitter that send us data continuously, keeps us informed of what is happening at the precise moment it happens.
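To make this concrete, here is a minimal sketch in Python of what consuming one of these exposed sources can look like; the endpoint URL and field layout are placeholders for illustration only, and the example assumes the requests and pandas libraries are installed.

import requests
import pandas as pd

# Placeholder endpoint: replace it with the Open Data API you actually consume.
OPEN_DATA_URL = "https://example.org/api/v1/records"

def fetch_open_data(url: str) -> pd.DataFrame:
    # Download a JSON payload from a public endpoint and return it as a table.
    response = requests.get(url, params={"limit": 1000}, timeout=30)
    response.raise_for_status()   # fail loudly if the service is unavailable
    records = response.json()     # assumes the API returns a list of JSON objects
    return pd.DataFrame(records)

if __name__ == "__main__":
    df = fetch_open_data(OPEN_DATA_URL)
    print(df.head())              # quick sanity check of the downloaded records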


This new landscape would be incomplete if we did not mention the current availability of new tools and technologies that enable the processing and analysis of such sources. Together, they open up a range of new possible analyses. It is what we do with them that will ultimately differentiate us from our competition or point the way to new opportunities. If we want to consider information as the most important asset, we must start by processing and storing it as such.


Data Exploration


The goal of this phase is to achieve a complete understanding of the data: what they represent, how they represent it, and all their subtleties.


- What do they represent? Understanding the object of analysis from the perspective of the available data.

- How do they represent it? Understanding the granularity of the data, their keys and conditions.

- Their subtleties. Identifying and discovering problems associated with data quality and availability and making the necessary decisions to overcome them.


To achieve this, different techniques are used, the most classic being graphical analysis, univariate analysis (descriptive statistics), and multivariate analysis.


The importance of graphical analysis relative to purely descriptive metrics has already been mentioned, so we can summarize that it is through charts that conclusions are extracted most easily: conclusions about the data distribution (are the data skewed?), the behavior of outliers (do we have outliers? are they valid? what do we do with null values?), seasonality (does the phenomenon we are analyzing show strong increases at specific times of the year, repeated year after year?), value trends, and more.
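As a minimal sketch of this graphical inspection, assuming pandas and matplotlib are available and using an invented daily series (the column names "date" and "sales" are chosen only for the example):

import pandas as pd
import matplotlib.pyplot as plt

# Invented daily series, used only to illustrate the three classic views.
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=730, freq="D"),
    "sales": pd.Series(range(730)).sample(frac=1, random_state=0).values,
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Distribution: is the variable skewed?
df["sales"].plot.hist(bins=30, ax=axes[0], title="Distribution")

# Outliers: the box plot makes extreme values immediately visible.
df["sales"].plot.box(ax=axes[1], title="Outliers")

# Seasonality and trend: aggregate by month and look at the curve over time.
df.set_index("date")["sales"].resample("M").sum().plot(ax=axes[2], title="Monthly trend")

plt.tight_layout()
plt.show()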


It is through univariate analysis that we focus on each variable independently to observe its own characteristics. To do this, we mainly use frequency distributions, measures of central tendency (mean, median, mode), measures of position (maximum, minimum), and measures of dispersion (variance, standard deviation).
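By way of illustration, a few lines of Python with pandas cover that classic univariate summary (the values below are invented; in practice the series would be one column of your dataset):

import pandas as pd

# Invented values standing in for a single column of the dataset.
values = pd.Series([12, 15, 15, 18, 21, 21, 21, 35, 40, 250])

print(values.describe())               # count, mean, std, min, quartiles, max
print("mode:", values.mode().tolist()) # most frequent value(s)
print("variance:", values.var())       # dispersion
print("std dev:", values.std())
print(values.value_counts(bins=5))     # a simple frequency distribution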


Finally, through multivariate analysis we analyze the relationship between two or more variables. In this way, we seek to confirm known relationships (validating the available data) and discover new ones, in every case quantifying the degree of the relationship.
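A small sketch of how that quantification can look with pandas, using invented columns (advertising spend, unit price, and units sold) purely for illustration:

import pandas as pd

# Invented monthly figures: advertising spend, unit price, and units sold.
df = pd.DataFrame({
    "ad_spend":   [10, 12, 15, 18, 20, 25, 30, 32],
    "unit_price": [9.9, 9.9, 9.5, 9.5, 9.0, 9.0, 8.5, 8.5],
    "units_sold": [100, 110, 130, 150, 165, 190, 230, 240],
})

# Pairwise Pearson correlations: confirm expected relationships, look for new ones.
print(df.corr(method="pearson"))

# A single pair, quantified: the degree of relationship between spend and sales.
print(df["ad_spend"].corr(df["units_sold"]))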


Model Development


The development of models will iterate with data processing and exploratory analysis, adjusting and resolving data issues as they are discovered during modelling.

We can group the types of models to be built as follows:


Descriptive Models

These models aim to quantify existing relationships in a dataset, ideally identifying previously unknown characteristics and behaviors.


Examples of these models include clustering and segmentation models, where the goal is to obtain groups whose members are as homogeneous as possible within each group and as heterogeneous as possible across groups. These models can involve a few variables (two-way tables) or be fully multivariate (using many combined variables).
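As a sketch only, a customer segmentation with k-means from scikit-learn could look like this; the two features and the choice of three clusters are assumptions made for the example, not recommendations:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented customer features: annual spend and number of purchases.
X = np.array([
    [1200, 4], [1500, 5], [300, 30], [280, 28],
    [5000, 2], [4800, 3], [1100, 6], [320, 25],
])

# Scale the features so neither one dominates the distance computation.
X_scaled = StandardScaler().fit_transform(X)

# Three segments, chosen here only for illustration; in practice the number of
# clusters is selected with criteria such as the elbow method or the silhouette.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

print(labels)   # cluster assignment for each customer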


Predictive Models

The purpose is to estimate the probability of a future behavior or attribute based on past behavior. Examples include churn prediction models (identifying customers likely to cancel in the future), offer acceptance models (identifying the customers most receptive to a specific commercial action), credit scoring models (predicting which customers are most likely to default on payments), and fraud detection models.
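A minimal sketch of a churn model of this kind, assuming a labelled history is available; the two features, the data, and the choice of a logistic regression are illustrative assumptions, not the only possible approach:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Invented history: [months as a customer, support calls last quarter] and
# whether the customer ended up cancelling (1) or not (0).
X = np.array([[24, 0], [3, 5], [36, 1], [2, 7], [18, 2], [1, 6],
              [30, 0], [4, 4], [12, 3], [48, 1], [5, 5], [40, 0]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

# Probability of churn for each held-out customer: the score a retention campaign would use.
print(model.predict_proba(X_test)[:, 1])
print("accuracy:", model.score(X_test, y_test))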


Prescriptive Models

These models build upon the previous ones: they not only predict possible outcomes but also associate each outcome with a predetermined action. By attaching an action to each possible scenario, we can evaluate the consequences of each one, predicting not only the results but also their implications, optimizing them, and suggesting courses of action. An example could be a system for managing simultaneous, multi-channel, multi-wave commercial actions: we aim to select the action with the best expected result, taking into account the acceptance rate of every possible action, the rules for managing each channel, customer preferences and behavior, and customer rest periods.
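A toy sketch of that prescriptive step, assuming the acceptance probabilities already come from predictive models; the actions, channel rules, and figures are all invented for illustration:

# Candidate actions for one customer, with the acceptance probability predicted
# by a model and the expected margin if accepted (all figures are invented).
actions = [
    {"name": "email_offer",  "channel": "email", "p_accept": 0.08, "margin": 40.0},
    {"name": "call_offer",   "channel": "phone", "p_accept": 0.20, "margin": 35.0},
    {"name": "sms_discount", "channel": "sms",   "p_accept": 0.12, "margin": 25.0},
]

# Channel rules: for example, the customer is in a rest period for phone contacts.
blocked_channels = {"phone"}

def best_action(actions, blocked_channels):
    # Pick the allowed action with the highest expected value (probability x margin).
    allowed = [a for a in actions if a["channel"] not in blocked_channels]
    if not allowed:
        return None
    return max(allowed, key=lambda a: a["p_accept"] * a["margin"])

print(best_action(actions, blocked_channels))   # -> email_offer in this toy example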


Recommendations

If data is the most important asset of this century, it is essential to treat it as such. Activities related to Data Analytics are complex and highly specific, and any mistake can have a direct impact on the outcome (wrong data used to drive decisions) or on the scalability of the solution (short-term fixes that require starting from scratch every so often).


That is why it is advisable to use purpose-specific tools and to complement an organization's existing team with specialists who bring specific know-how.

 

By Minimalistech's editorial team.


Minimalistech has more than 10 years of experience providing a wide range of technology solutions based on the latest standards. We have built successful partnerships with several top-performing SF Bay Area companies, enhancing their potential and growth by providing highly skilled IT engineers to work on their developments and projects.


