This is a brief guide about how to implement dependent replication jobs.
One major application is to improve the running time and system resource consumption for queries and views that are either often required or involve data sources with a huge amount of data (several millions of rows).
This is an example based on the Adventure Works sample database. In this scenario, there are two data sources hosted on a MySQL server and we want to improve the performance of a simple join between two tables from these sources.
The guide is a step by step walkthrough and by no means complete. In the environment used for the explanation, there is a fresh setup of Data Virtuality Server and Studio with just the DWH connection and two data sources. If you are reading this and have an environment already in use with several data sources, materialized tables and more, you will find some steps that you might skip (tables already materialized or replication jobs already existing and they just need some tweaking).
FROM "mysql_aw_two.product" Prod INNER JOIN "mysql_aw_one.purchaseorderdetail" Ord ON
Prod.ProductID = Ord.ProductId
The product table and the purchaseorderdetail table reside on different data sources.
Keep in mind that a join on two tables from the same data source could be done on the source server. Materializing one table but not the other one would almost certainly reduce the query performance. If the source server can be burdened, then only materialize the finished join. Otherwise, materialize both tables and let the DWH perform the join.
We have executed the select statement and our goal is to materialize the whole join with dependent replication jobs.
1. Find the query in the query tab and see the actual query plan. This is also useful if you want to optimize a query that has already been executed several times and
2. you need to find the tables involved.
This guide's pictures are based on a small environment. You may have to use the filter in order to get to the desired query more quickly. You can then find out which tables are involved in two ways:
Either you select the properties tab at the bottom
or you double click the table scan node from the graph and search through the query part
After all the tables are identified, the simples (not always best!) approach is to order the tables bottom-up. Tables on the same level of the execution graph are equal and you can order them whichever way you like. This guide sorts the query parts as follows:
- INNER JOIN (mysql_aw_one.purchaseorderdetail, mysql_aw_two.product)
3. If the tables are not part of any optimization, you have to enable the materialization and create a replication job. You need to do this step and the following repeatedly for the tables involved. Locate the table in the data source and enable the materialization. This guide will start with the table mysql_aw_one.purchaseorderdetail.
Now the table can be found in the optimizations tab. Alternatively, you can right-click on the table and select 'find optimizations', which will immediately open the optimizations tab and filter for everything that contains this table.
4. Now you create a replication job with any desired configuration. For the sake of the guide, we will decide to have the replication without any further modification from the defaults. The replication job creation is achieved by selecting the entry from the optimizations tab which contains only the table itself (note the table symbol in the materialization type column). In the context menu, you will find the option to 'Create replication job', which you have to select.
5. The next step simply consists of finding this replication job and add a schedule to it. Since the Data Virtuality Studio makes it easy to use, the following three pictures should be self-explanatory:
The selected schedule is just an example of what you can choose. But it is advisable to think about the scheduled jobs that are stored in the system. You should try to minimize the number of parallel jobs. This is also true for jobs that are run sequentially but the preceding job takes too much time and will still be running once the next one has started. This is another reason why dependent jobs are such a useful means of optimizing the system's resource consumption.
6. Now you repeat the steps 3-4 for the next table (or expression, depending on the optimization you want to perform yourself). The main interesting part is the schedule for the replication job. In this scenario, the replication job for the table materialization of mysql_aw_two.product. When creating the schedule, you select the tab labeled 'Dependent schedule'. You should filter for the job you just created (or that already exists) and then chose a specific schedule from the job. That's right, one job may have several schedules. You select the job from the first table that was set to start at 15:00.
This concludes the schedule creation. This job will be automatically triggered after the selected job and schedule run with the status finished. If the other job fails then this job will stay in sleep mode. You may, alternatively, set this schedule to run only, if the selected schedule fails. This is the major part of the whole guide. You can, of course, enable materialization of the tables first and you can set up the replication jobs without schedules parallel. But this, nevertheless is where you have to be very careful to mix up the schedules with the execution plan you have in mind. We encourage you to go through everything provided in this guide in order to have a clear order of this task.
You can now start again at step 5. Since the JOIN itself is found in the optimizations tab after the query was executed once. If you want to optimize your own queries and views, you might even have everything you need in the optimizations tab or you might even have already replication jobs that you can use. After you created the three replication jobs with their according schedules, the job tab might look something like this: