Definitions

Data mining is a methodical approach to identifying patterns in data. In the past, a good business analyst would look through data for trends, but with the size of modern databases it is hard to work with data manually. Data mining lets you instruct the computer to comb through that data and identify patterns of interest. Data mining tools offer a number of data discovery techniques, such as data manipulation, auditing, visualization, and hypothesis testing, that help you apply expertise to the data and identify a relevant set of attributes.

An Extract, Transform, and Load (ETL) tool is useful for implementing workflow processes in which data is moved and changed along the way, for example by consolidating it into a denormalized design or by cleansing it.

A data warehouse is a system that performs ETL operations itself: it extracts, cleans, conforms, and delivers source data into a dimensional data store, and then supports querying and analysis for the purpose of decision making.

ETL in data mining consists of the construction of new data subsets derived from existing data sources.

ETL stands for the whole process of taking data from various sources, combining it, transforming it, and loading it, often in big data volumes, using database tools.

  • Extract means getting data out of different data sources.
  • Transform means changing the data format to better support querying and analysis.
  • Load means getting this data into a target storage.
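The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production pipeline: the CSV content, the `orders` table, and the function names `extract`, `transform`, `load`, `run_etl`, and `total_by_customer` are all made up for this example, and an in-memory SQLite database stands in for the target storage.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real job would read files, APIs, or other databases.
SOURCE_CSV = """order_id,customer,amount,order_date
1,alice,10.50,2023-01-05
2,bob,20.00,2023-01-06
3,alice,5.25,2023-02-01
"""


def extract(text):
    """Extract: read raw rows out of a source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types and reshape rows to suit querying."""
    return [
        (
            int(r["order_id"]),
            r["customer"].title(),   # normalize the customer name
            float(r["amount"]),      # text to number
            r["order_date"][:7],     # keep only YYYY-MM for monthly analysis
        )
        for r in rows
    ]


def load(records, conn):
    """Load: write the transformed records into the target storage."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, month TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", records)
    conn.commit()


def run_etl(conn, text=SOURCE_CSV):
    """Run the three steps in order against the given connection."""
    load(transform(extract(text)), conn)


def total_by_customer(conn):
    """Example analysis query against the loaded data."""
    cur = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    )
    return dict(cur.fetchall())


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    run_etl(conn)
    print(total_by_customer(conn))
```

Keeping each step in its own function makes it easier to swap one of them out later, for example replacing the CSV source with a database query, without touching the other two.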

Transport is an indirect element of the process, and it is safe to assume that it becomes important as data volumes grow.

Always plan the ETL phase properly!

Transport is particularly relevant when extracting big data and then moving it to the location of the new database. Geographically dispersed organizations face challenges in transporting large quantities of data. Transport can also matter between any two of the other ETL process elements.

ETL process

  • The extraction part of the process is very important because it influences all the other steps. It is also called reading, since in many cases the data is read from one database in order to be stored in another.
  • The transformation part of the process is considered difficult because data is converted into a new format and, in many cases, combined with additional data. The head of the project should therefore review the design of the target format, which must give business operations the data they need.
  • The loading part of the process, on the other hand, is easier. For operational effectiveness, though, make sure the data is stored using a proper database management tool. Once the ETL process is finished, the big data stored in the database is used daily for data analysis.
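To make the transformation step more concrete, here is a small Python sketch of combining original data with additional data into one flat (denormalized) record per row, with a bit of cleansing on the way. The field names (`customer_id`, `region`, `amount`), the sample rows, and the helper names `clean` and `denormalize` are assumptions made up for this example. Skipping rows without a match is just one possible policy.

```python
def clean(value):
    """Normalize whitespace and case so keys from different sources match."""
    return value.strip().lower()


def denormalize(orders, customers):
    """Combine order rows with customer attributes into one flat record per order."""
    by_id = {clean(c["customer_id"]): c for c in customers}
    flat = []
    for order in orders:
        key = clean(order["customer_id"])
        customer = by_id.get(key)
        if customer is None:
            # Assumption: orders without a matching customer are skipped.
            # A real project might log or quarantine them instead.
            continue
        flat.append(
            {
                "order_id": order["order_id"],
                "customer_id": key,
                "region": customer["region"].strip().title(),
                "amount": round(float(order["amount"]), 2),
            }
        )
    return flat


# Hypothetical sample data with the kind of inconsistencies cleansing deals with.
orders = [
    {"order_id": 1, "customer_id": " C1 ", "amount": "19.990"},
    {"order_id": 2, "customer_id": "c2", "amount": "5"},
    {"order_id": 3, "customer_id": "c9", "amount": "1.00"},
]
customers = [
    {"customer_id": "C1", "region": " north "},
    {"customer_id": "C2", "region": "SOUTH"},
]

if __name__ == "__main__":
    for row in denormalize(orders, customers):
        print(row)
```

Agreeing on the shape of the flat record before writing any code is exactly the kind of format design that deserves a review up front.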

It is worth noting that one of the Microsoft products, SSIS (SQL Server Integration Services), is useful for ETL operations. That said, ETL jobs can be written in almost any programming language (we wrote ours in Python). These three operations are considered to be the front end of many DW (data warehousing) and BI (business intelligence) solutions. By a common estimate, 70-80% of a BI (or DW) project is building a reliable ETL process.

Data mining usually implies using data from integrated sources to infer information that would not be obvious from transactional data alone (integrating multiple sources gives the data more "dimensions"). It usually focuses on using a large quantity of data either to predict future outcomes or to better understand patterns in existing data. Heads of small projects, on the other hand, use SSIS as a convenient way to load legacy data or data from other repositories or files.

To summarize, it's definitely a great area to take up, but not something you can catch up on without some intensive study of math and algorithms. ETL in data mining is an approach to discovering data behavior in large data sets by exploring the data, fitting different models, and investigating different relationships in vast repositories. The information extracted with a data mining tool can be used in many different areas.
