Any business decision rests on data that must be collected, analyzed, and fed into business analysis. The success of a company often depends on how relevant and correct its data is and how well it is systematized. For this reason, the ETL concept is becoming increasingly popular in business.
In light of recent trends, business is moving rapidly toward blockchain technology. This means a growing number of different infrastructure solutions and, as a result, a growing need to configure and use ETL processes effectively. Thanks to the development of cloud technologies, any company or user can avoid spending time and resources on building ETL pipelines from scratch and instead use a managed ETL service.
Problems with the initial data
At the input of any process of collecting and analyzing information there is one, and most often several, data sources from which data must be extracted and processed. In other words, there are repositories that contain the “raw” (unrefined) data we need. Raw data used in its original form will almost always create problems for the analysts who work with it. First of all, it will contain many errors, which are inevitable because of bugs of all kinds, as well as mistakes made when entering or transferring information. In addition, the problems can be exacerbated by the fact that data sources often differ from one another, as does the level of detail of the data. Experts identify two frequently occurring tasks that must be solved for successful information processing:
– the task associated with data volume,
– the task associated with different data storage locations.
Below are two examples to illustrate these tasks.
Imagine that a certain company, ABC, has the data you need, stored in a certain class of storage with fixed read and write performance characteristics. As an analyst, you need to collect and analyze this data. The data is constantly updated in real time, its volume keeps growing, while the characteristics of the storage do not change. Eventually there comes a point when the computing power can no longer cope with the load created by your requests. As a result, either the response time to a query grows many times over, or the queries are simply rejected by the storage. You are now facing the task associated with data volume.
Now let’s consider another example. You are an analyst for a company that runs online trainings. Most students register for such trainings using previously created accounts. A personal account contains, among other things, the user’s age. Very often the account information is created much earlier than the database of training results. As an analyst, you are asked to produce a report on how a student’s age affects their training results. This means you need to collect data from different sources, which are, moreover, separated in time. This is the task associated with different data storage locations.
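The second task can be sketched in a few lines of code. The example below is purely illustrative: the table names, column names, and values (accounts, results, user_id, age, score) are assumptions invented for this sketch, not details from the scenario above. Two independent in-memory databases stand in for the two physically separate stores, and the join happens outside of either source:

```python
import sqlite3

# Extract: two independent sources, populated at different times.
# Each in-memory database stands in for a separate physical store.
accounts_db = sqlite3.connect(":memory:")
accounts_db.execute("CREATE TABLE accounts (user_id INTEGER, age INTEGER)")
accounts_db.executemany("INSERT INTO accounts VALUES (?, ?)",
                        [(1, 25), (2, 34), (3, 41)])

results_db = sqlite3.connect(":memory:")
results_db.execute("CREATE TABLE results (user_id INTEGER, score INTEGER)")
results_db.executemany("INSERT INTO results VALUES (?, ?)",
                       [(1, 78), (2, 91), (3, 65)])

ages = dict(accounts_db.execute("SELECT user_id, age FROM accounts"))
scores = dict(results_db.execute("SELECT user_id, score FROM results"))

# Transform: join the two extracts on the shared key user_id,
# producing (user_id, age, score) rows for the age-vs-result report.
report = [(uid, ages[uid], score)
          for uid, score in scores.items()
          if uid in ages]
```

Because neither source can answer the question on its own, the join has to live in a separate processing step; in a real pipeline that step is exactly what the transform stage of an ETL process does.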
The ETL process makes it possible to solve both the first and the second task. Let’s take a closer look at its architecture.

At the very beginning there are the available data stores, the first layer: OLTP, or OnLine Transaction Processing. These are the databases that store the original information. Next, as a rule, an ODS (Operational Data Store) is created. In essence, it is a direct replica of all the original OLTP data. This is done so as not to “drop” the transactional system when queries are sent: a query goes not to OLTP but to the replica, while ODS and OLTP operate independently of each other.

The next layer of the architecture is the DDS, or Detailed Data Store. In this layer all data is kept in a form convenient for analysis. There are several paradigms for the format in which data is best added to the DDS, that is, different models for building data warehouses are used.

After the data has been arranged according to a chosen DDS model, a layer consisting of data marts is formed. Data marts take different forms depending on the goals of the analysis: data can be exported to files, or a data mart can be built for a specific request or a specific report.

It should be noted that, in general, any more or less mature data warehouse will consist of approximately the same layers described above. Between each of these layers there are tools or scripts responsible for either uploading and transferring the data or processing it.
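The layered flow described above can be condensed into a minimal sketch. The layer names (OLTP, ODS, DDS, data mart) follow the text; the sample records, field names, and functions are invented for illustration and do not come from any specific system:

```python
# Hypothetical source records, as a transactional (OLTP) system might
# write them: amounts are still strings, nothing is cleaned yet.
oltp = [
    {"order_id": 1, "user": "alice", "amount": "10.50", "ts": "2023-01-05"},
    {"order_id": 2, "user": "bob",   "amount": "7.00",  "ts": "2023-01-06"},
]

def replicate(source):
    """ODS layer: a direct copy, so analytical reads never touch OLTP."""
    return [dict(row) for row in source]

def to_detailed(ods):
    """DDS layer: clean and retype the data into an analysis-friendly form."""
    return [
        {"order_id": r["order_id"], "user": r["user"],
         "amount": float(r["amount"]), "ts": r["ts"]}
        for r in ods
    ]

def build_mart(dds):
    """Data mart: an aggregate prepared for one specific report,
    here total amount per user."""
    totals = {}
    for r in dds:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals

mart = build_mart(to_detailed(replicate(oltp)))
# mart -> {"alice": 10.5, "bob": 7.0}
```

The three functions play the role of the scripts that sit between the layers: each one only reads the previous layer and writes the next, which is what keeps the transactional system isolated from analytical load.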