Data Mining

What is it?


Data Mining is a set of techniques used to extract the useful, unknown, implicit and hidden information contained in data.


QUESTION: If the mine is the data and the mineral is the information, what is the miner's name?



Set of techniques

Data mining is a set of techniques because it uses resources from various disciplines such as logic, mathematics, computer science and physics. Data mining is neither a science nor a technology.


Data is not information. Data are numbers recorded in a register. Information is what allows us to understand the data and the relationships contained in it.


The information that we get from the data should be useful to find what we are looking for.


The information obtained is usually unknown, i.e., there is no a priori evidence of the conclusions found. They are hidden behind the data.

Implicit information is information that is not easy to deduce; it is not obvious. Data mining reveals what is not obvious.

Data is usually obtained from databases, which allows us to compute and process large amounts of data.









Classification

In this stage, the data is classified into categories according to the numerical values of its attributes. Each category includes a number of examples from which the relevant characteristics of the category will be determined.

Training

The core of data mining is the mathematical training algorithm, also known as machine learning. The algorithm finds the relevant properties of the examples in each category.

Prediction

After training, the algorithm can automatically determine the category to which a new case belongs.
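
The classification, training and prediction stages can be sketched in a few lines of Python. This is only a minimal illustration, assuming a nearest-centroid rule (each category is summarized by the mean of its examples); the function names `train` and `predict` and the toy measurements are hypothetical and not taken from any particular data mining tool.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two attribute vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train(examples):
    """Training: compute one centroid (mean attribute vector) per category."""
    sums, counts = {}, {}
    for attributes, category in examples:
        acc = sums.setdefault(category, [0.0] * len(attributes))
        for i, value in enumerate(attributes):
            acc[i] += value
        counts[category] = counts.get(category, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def predict(centroids, attributes):
    """Prediction: return the category whose centroid is closest to the new case."""
    return min(centroids, key=lambda c: squared_distance(centroids[c], attributes))

# Classification: hypothetical examples already assigned to categories.
# Each example is (attribute values, category label).
examples = [
    ((1.4, 0.2), "short"),
    ((1.5, 0.3), "short"),
    ((4.7, 1.4), "long"),
    ((4.5, 1.5), "long"),
]
model = train(examples)
label = predict(model, (1.6, 0.2))
```

Real training algorithms are usually more elaborate, but the flow is the same: labeled examples go in, a model comes out, and the model assigns a category to each new case.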



Data Mining can study what was done in the past, whether a decision was right or not and what the current decision should be. This is especially useful when you need to make decisions that have to take into account a large number of variables.



Data Mining can collect and systematically organize the skills and knowledge of an individual (or a team). This is useful to record all the knowledge that has been learned throughout a career.

Machine learning

Data Mining can be used to design algorithms with which a computer can learn automatically. It's all about giving examples to the computer and letting the algorithm learn so that the computer develops intelligent behavior.

Automating tedious tasks

In some cases, we need to make repetitive decisions which are well defined by a protocol but certainly tedious. Data Mining can automate those tasks that must take into account multiple variables simultaneously.

Revealing rules

Why do the members of a group look alike? Is there a rule or a set of rules that determine belonging to the group? Data mining can reveal what the members of a group have in common.
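
As a rough illustration of revealing what a group has in common, the sketch below keeps only the attribute values shared by every member. The function name `common_traits` and the sample group are hypothetical, and real rule-discovery methods also tolerate exceptions and noise, which this sketch does not.

```python
def common_traits(members):
    """Return the attribute/value pairs shared by every member of the group."""
    shared = set(members[0].items())
    for member in members[1:]:
        shared &= set(member.items())
    return dict(shared)

# Hypothetical group: each member is described by a few attributes.
group = [
    {"color": "red", "size": "small", "shape": "round"},
    {"color": "red", "size": "large", "shape": "round"},
    {"color": "red", "size": "small", "shape": "round"},
]
rule = common_traits(group)
```

Here the shared traits are the color and the shape, while size varies and therefore cannot be part of the rule.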


What is the origin of a problem, an illness or a machine breakdown? This question is often answered by gathering evidence or symptoms to determine where the problem lies. Data Mining can be used to match the current situation with previously recorded ones in order to make a diagnosis.
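
Matching the current situation against previously recorded ones could look like the sketch below. It assumes, purely for illustration, that each case is a set of symptoms and that similarity is measured by the Jaccard index (shared symptoms divided by all symptoms); the records and the names `jaccard` and `diagnose` are hypothetical.

```python
def jaccard(a, b):
    """Share of symptoms two cases have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def diagnose(records, symptoms):
    """Return the diagnosis of the recorded case most similar to `symptoms`."""
    best = max(records, key=lambda record: jaccard(record[0], symptoms))
    return best[1]

# Hypothetical history of (observed symptoms, confirmed diagnosis).
records = [
    ({"overheating", "noise"}, "worn bearing"),
    ({"vibration", "noise"}, "misaligned shaft"),
    ({"no power", "burnt smell"}, "blown fuse"),
]
result = diagnose(records, {"vibration", "noise", "oil leak"})
```

The new case shares two of its three symptoms with the second record, so that record's diagnosis is returned.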


Classification without theory

The first time we see new data, we might not know how to deal with it. Data mining can classify the data into categories without any theoretical explanation behind it. This lets us lay the basis for future theories that will describe the data.
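
Grouping data without any prior theory is commonly called clustering. The sketch below is one simple way to do it, assuming one-dimensional measurements and exactly two groups (a basic k-means with k = 2); the function name `two_means` and the sample values are hypothetical.

```python
def two_means(values, iterations=10):
    """Split numbers into two groups with a basic 1-D k-means (k = 2)."""
    # Start with the two most extreme values as group centers.
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups

# Hypothetical unlabeled measurements.
measurements = [1.0, 1.2, 0.9, 5.1, 4.8, 5.3]
low, high = two_means(measurements)
```

No category labels are supplied; the two groups emerge from the values themselves and can later be given an interpretation.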


There are examples of data mining applications in several fields. Here are some of them.

  •  Science and technology 
    1. Iris flower data set. A classical example (Fisher, 1936) is the classification of three species of iris based on the length and the width of their petals and sepals.
    2. Image filtering. Analysis of images is a tedious task for humans. Data mining is effective in detecting oil slicks at sea from satellite or radar images. The detection algorithms analyze how the shape and size of candidate regions evolve in order to detect oil slicks.
    3. Electricity supply. Electric energy is hard to store on a large scale, so it must largely be consumed as soon as it is generated. The electricity supply industry needs to predict the demand for power every single day. Data Mining can estimate the electricity demand using the records of energy consumed in the past.
    4. Fault diagnosis. Data mining is also useful in determining the origin of the breakdown of industrial machines. Combining the experience of mechanical engineers with measurements, it is possible to obtain a protocol to diagnose faults.
    5. Diagnosis of diseases. Data Mining can be used to diagnose new patients, based on the symptoms of the previously diagnosed ones.
    6. Cataloging of celestial objects. In astronomy, we can use the known objects (stars, planets, etc.) to identify new objects, looking for lights in the sky that have not been registered yet.
    7. Structure of molecules. The structure of new molecules can be inferred from NMR or X-ray images of already known molecules. This facilitates the discovery of new drugs.
    8. Risk factors. Data mining can be used to perform environmental or genetic studies to discover what might cause the development of a disease.
  •  Web 
    1. Spam. An email is classified as spam or not by analyzing different aspects of it: the words it contains, the sender, the time at which it was sent, etc.
    2. Preferences. Based on the preferences of Internet users, the content of a webpage can be automatically adapted to other users.
  •  Finance, Marketing and Sales 
    1. Customer attrition. The purchasing frequency and volume of customers, and how these evolve over time, can help us determine whether we will lose them or whether they will continue purchasing. We can then personalize offers to avoid losing them.
    2. Stock level. Data Mining can be used to analyze past sales over time, new trends, etc., in order to predict the inventory that will be needed at any time of the year.
    3. Shopping Cart. Some products are often bought together by customers. Data Mining can find those products and suggest ways to offer them together.
    4. Solvency. By analyzing financial and personal information about a loan applicant, Data Mining can be used to determine whether the applicant will be able to repay their loans.
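
To make one of these applications concrete, the shopping cart idea can be sketched as counting how often each pair of products appears in the same purchase. This is only a simplified illustration: the baskets are made-up data, and `frequent_pairs` and its `min_count` threshold are hypothetical names, not part of any specific product.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_count=2):
    """Return product pairs bought together at least `min_count` times."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and makes (a, b) and (b, a)
        # count as the same pair.
        counts.update(combinations(sorted(set(basket)), 2))
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]

# Hypothetical purchase records, one list of products per customer visit.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "coffee"],
    ["bread", "butter", "coffee"],
]
pairs = frequent_pairs(baskets)
```

In this toy data, bread and butter appear together in three of the four baskets, so they are the only pair that passes the threshold and would be a candidate for a joint offer.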