Friday, October 21, 2016

ETL - Pentaho PDI, parallelizing database input steps

There is plenty of documentation on the web about parallelizing and clustering Pentaho PDI destination steps, but not much information about parallelizing database input steps.

For example, if we want to improve the performance of a Pentaho PDI ETL process at a MongoDB Input step by using parallelism, we can set "Number of Copies to Start" (right-click on the MongoDB Input step) to 10:



With this configuration alone we would have 10 step copies consuming the same dataset, so we need something to partition it. We can apply the "mod" function to a numeric field and match the result against the step copy number at execution time. We do this in the "Query" tab of the step:


{$where:"this.field%10 == ${Internal.Step.CopyNr}"}

So, at execution time, this JavaScript criterion makes each copy retrieve a disjoint subset of the original dataset, based on a mod partition over the degree of parallelism and the step copy number (the Internal.Step.CopyNr variable).
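The partitioning can be sketched in plain Python (no Pentaho or MongoDB required; the dataset and field values here are hypothetical): each of the 10 copies keeps only the documents whose field value satisfies field % 10 == copy number, which is what the {$where: ...} criterion does on the server side.

```python
COPIES = 10  # matches "Number of Copies to Start"

# Hypothetical dataset: documents with a numeric "field" value.
dataset = [{"_id": i, "field": i * 7} for i in range(100)]

def subset_for_copy(docs, copy_nr, copies=COPIES):
    """Emulates the {$where: "this.field % copies == copy_nr"} criterion
    that each step copy evaluates with its own Internal.Step.CopyNr."""
    return [d for d in docs if d["field"] % copies == copy_nr]

partitions = [subset_for_copy(dataset, n) for n in range(COPIES)]

# Every document lands in exactly one partition: the subsets are
# pairwise disjoint and their union is the original dataset.
total = sum(len(p) for p in partitions)
print(total)  # 100
```

Note that $where runs JavaScript per document and cannot use indexes, so on a real collection this criterion implies a collection scan per copy; it trades index usage for easy, even partitioning.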

We need a numeric field with high cardinality to make this technique effective.
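To see why cardinality matters, here is a small illustrative sketch (plain Python, hypothetical values): with few distinct values in the partition field, the mod buckets become unbalanced and most step copies sit idle while one or two do all the work.

```python
from collections import Counter

COPIES = 10

def bucket_sizes(values, copies=COPIES):
    """Counts how many documents each step copy would receive
    under the field % copies == copy_nr partition."""
    c = Counter(v % copies for v in values)
    return [c.get(n, 0) for n in range(copies)]

# High-cardinality field: work is spread evenly across the 10 copies.
even = bucket_sizes(range(1000))
print(even)    # [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]

# Low-cardinality field (only the values 0 and 5 occur):
# 8 of the 10 copies receive nothing at all.
skewed = bucket_sizes([0, 5] * 500)
print(skewed)  # [500, 0, 0, 0, 0, 500, 0, 0, 0, 0]
```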