Friday, October 21, 2016

ETL - Pentaho PDI, parallelizing database input steps

There is plenty of documentation on the web about parallelizing and clustering Pentaho PDI destination (output) steps, but not much information about parallelizing database input steps.

For example, if we want to improve the performance of a Pentaho PDI ETL process at a MongoDB Input step by using parallelism, we could set the "Number of Copies to Start" (right-click on the MongoDB Input step) to 10:



With this configuration alone we would have 10 step copies consuming the same dataset, so we need something to partition it. We can use the "mod" function over a field, matching the result against the step copy number at execution time. We do this in the "Query" tab of the step:


{$where:"this.field%10 == ${Internal.Step.CopyNr}"}

So, at execution time, this JavaScript criterion retrieves a subset of the original dataset, based on a mod partition over the configured parallelism and the step copy number (the Internal.Step.CopyNr variable).

We need a numeric field with high cardinality for this technique to be effective.
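As a sanity check outside PDI, the mod partitioning can be simulated in the shell. The field values and the copy count of 3 below are made-up examples, not taken from the transformation above:

```shell
# Simulate the $where mod partition: each of COPIES hypothetical step
# copies receives only the documents where field % COPIES == copy_nr.
COPIES=3
for COPY_NR in $(seq 0 $((COPIES - 1))); do
  printf 'copy %d gets:' "$COPY_NR"
  for FIELD in 10 11 12 13 14 15; do   # sample high-cardinality values
    if [ $((FIELD % COPIES)) -eq "$COPY_NR" ]; then
      printf ' %d' "$FIELD"
    fi
  done
  echo
done
```

Every value lands in exactly one copy, so the union of the subsets is the original dataset with no overlap or duplication.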

Wednesday, July 27, 2016

Script - Kill inactive non-dedicated connections

This procedure allows us to close inactive, non-dedicated connections. (To kill active or dedicated connections we already have database/OS tools such as db.killOp and kill.)

    1) Locate the process:
  
netstat -lnap|grep 127

tcp        0      0 127.0.0.1:27017  127.0.0.1:55798 ESTABLISHED 2516/mongod


    2) Locate the socket's FD (file descriptor):

lsof -np 2516|grep 127

COMMAND  PID   USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
mongod  2516 mongod   98u  IPv4 458536      0t0    TCP 127.0.0.1:27017->127.0.0.1:55798 (ESTABLISHED)
  
    3) Attach to the process, close the socket and exit:
  
gdb -p 2516
call close(98)
quit
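Step 2 can be scripted: the FD column in the lsof output ends with an access-mode letter (u = read/write) that is not part of the descriptor number and must be stripped before calling close(). A minimal sketch, parsing the sample lsof line from above:

```shell
# Extract the numeric FD from a captured lsof line; the trailing
# access-mode letter (r/w/u) is not part of the descriptor number.
LSOF_LINE='mongod  2516 mongod   98u  IPv4 458536      0t0    TCP 127.0.0.1:27017->127.0.0.1:55798 (ESTABLISHED)'
FD=$(printf '%s\n' "$LSOF_LINE" | awk '{print $4}' | tr -d 'rwu')
echo "$FD"
# With the FD in hand, the gdb step can run non-interactively
# (requires ptrace permission on the target process):
#   gdb -p 2516 -batch -ex "call (int)close($FD)"
```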

Tuesday, March 22, 2016

Script - Check another session's parameters

We can check the parameter values of any session in our Oracle database using the v$ses_optimizer_env view. In the example below we can see that for session 183 a value of 500 MB has been set for the hash_area_size parameter:

select * from v$ses_optimizer_env where sid=183;

SID  ID NAME                           ISDEFAULT  VALUE
--- --- ------------------------------ ---------- --------------
183   2 parallel_execution_enabled     YES        true
183   9 optimizer_features_enable      YES        10.2.0.5
183  11 cpu_count                      YES        24
183  12 active_instance_count          YES        1
183  13 parallel_threads_per_cpu       YES        2
183  14 hash_area_size                 NO         500000000
183  15 bitmap_merge_area_size         YES        1048576
183  16 sort_area_size                 YES        65536
183  17 sort_area_retained_size        YES        0
183  24 pga_aggregate_target           YES        3145728 KB
183  25 _pga_max_size                  NO         629140 KB
183  35 parallel_query_mode            YES        enabled
183  36 parallel_dml_mode              YES        disabled
183  37 parallel_ddl_mode              YES        enabled
183  38 optimizer_mode                 YES        all_rows
183  48 cursor_sharing                 NO         force
183  50 star_transformation_enabled    YES        false
183  66 optimizer_index_cost_adj       YES        100
183  67 optimizer_index_caching        YES        0
183  70 query_rewrite_enabled          YES        true
183  71 query_rewrite_integrity        YES        enforced
183 101 workarea_size_policy           YES        auto
183 105 optimizer_dynamic_sampling     YES        2
183 112 statistics_level               NO         basic
183 114 skip_unusable_indexes          YES        true
183 165 optimizer_secure_view_merging  YES        true
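When a session has many parameters, it's usually the overridden ones (ISDEFAULT = NO) that matter; in SQL that is just adding "and isdefault = 'NO'" to the where clause. From a captured listing, the same filter can be applied in the shell (sample rows taken from the output above):

```shell
# Keep only the rows where ISDEFAULT (column 4) is NO and print the
# parameter name (column 3).
cat <<'EOF' | awk '$4 == "NO" {print $3}'
183  14 hash_area_size                 NO         500000000
183  25 _pga_max_size                  NO         629140 KB
183  35 parallel_query_mode            YES        enabled
EOF
```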