lunes, 4 de septiembre de 2017

Páginas a la Carta - Lectópolis

El 8 de sept a partir de las 14:00 se abren las votaciones de la Plataforma Talento de ElPais.com, si te gusta la idea, no dudes en votarla en el siguiente enlace: https://t.co/1WTvLi7Kh6 (debes registrarte previamente en elpais.com).


gracias!!!

jueves, 22 de junio de 2017

Script - PostgreSQL multiple async parallel execution in PL/pgSQL

If you need to split and parallelize a pl/pgsql procedure execution to get better performance (due to the fact that a single pl/pgsql executes only in 1 thread cpu, it´s constraint to 1 single thread cpu), you can try the use of dblink async calls, which will save you a lot of time, without need to do more complicated solutions.

By example, we need to do a  massive insert into a table with a insert as select query, the server has a lot of IO capacity that isn´t used by a single pl/pgsql execution. Having a method to do a partition (by index date ranges or similar), you could do this:

-- Create the dblink extension
CREATE EXTENSION dblink;
GRANT EXECUTE ON FUNCTION dblink_connect_u(text) TO <user>;
GRANT EXECUTE ON FUNCTION dblink_connect_u(text, text) TO <user>;

select dblink_connect_u('dblink1','dbname=<database> user=<user>');
select dblink_connect_u('dblink2','dbname=<database> user=<user>');
select dblink_connect_u('dblink3','dbname=<database> user=<user>');



-- These are the aync calls, they will return immediatly and will begin to execute in backend
select * from dblink_send_query('dblink1','INSERT INTO <dest_table> SELECT * FROM <orig_table> WHERE AUD_FEC BETWEEN <FEC_INI1> AND <FEC_FIN1>');
select * from dblink_send_query('dblink2','INSERT INTO <dest_table> SELECT * FROM <orig_table> WHERE AUD_FEC BETWEEN <FEC_INI2> AND <FEC_FIN2>');
select * from dblink_send_query('dblink3','INSERT INTO <dest_table> SELECT * FROM <orig_table> WHERE AUD_FEC BETWEEN <FEC_INI3> AND <FEC_FIN3>');

-- Here we wait for each finished results
select * from dblink_get_result('dblink1') as t1(a text);
select * from dblink_get_result('dblink2') as t1(a text);
select * from dblink_get_result('dblink3') as t1(a text);

SELECT dblink_disconnect('dblink1');
SELECT dblink_disconnect('dblink2');
SELECT dblink_disconnect('dblink3');


If we have a date interval (fec_ini, fec_fin) and orig_table is indexed by aud_fec, we can use this to do an automatic date partition to parallelize up to maxParall variable inside a PL/pgSQL procedure:

SELECT cast((fec_fin-fec_ini)/maxParall as text) INTO vInterval;

FOR numParall IN 1..maxParall LOOP
    select dblink_connect_u('dblink'||numParall,'dbname=database user=user') into vText;
    select * from dblink_send_query('dblink'||numParall,'INSERT INTO dest_table SELECT * FROM orig_table WHERE AUD_FEC >= '||fec_ini||'+'||numParall-1||'*CAST('||chr(39)||vInterval||chr(39)||' AS interval) AND AUD_FEC < '||fec_ini||'+'||numParall||'*CAST('||chr(39)||vInterval||chr(39)||' AS interval)') into vInt;
END LOOP;       
FOR numParall IN 1..maxParall LOOP
    select * from dblink_get_result('dblink'||numParall) as t1(a text) into vText;
    SELECT dblink_disconnect('dblink'||numParall) into vText;
END LOOP;       

Hope this can help you.

miércoles, 22 de febrero de 2017

Script - MongoDB mongorestore performance


The following script is necessary if we want a great level of parallelism in the full restoration of a mongodb database.

For mongodb databases (without sharding) we can use the following technique to improve the performance of full type restorations.

It is based on a vertical scaling for mongodb databases without sharding (I know that mongo was not designed for vertical scaling, but sometimes you need this ...), so, if we deploy a server (aws or similar) with a lot of cpu threads, memory and iops capacity, we can get very low recovery times.

In this case we will avoid that mongorestore rebuild the indexes, so that it is the system itself that updates them as records are recovered, having them created before mongorestore.


  • Index definition extraction (from same database or another with similar dbname, collections and indexes)

echo "db.getCollectionNames().forEach(function(c) { if (c.indexOf('system.') == -1)
{ind=db.getCollection(c).getIndexes();
print('db.runCommand({ createIndexes:\"'+c+'\",indexes:'+JSON.stringify(ind)+'});');
}
});"|mongo <server>:<port>/<database> --username=<username> --password=<password> --authenticationDatabase=admin >> indexes.txt


  • Collections drop

echo "db.getCollectionNames().forEach(function(c) { if (c.indexOf('system.') == -1) db[c].drop();});"|mongo <server>:<port>/<database> --username=<username> --password=<password> --authenticationDatabase=admin

  • Index creation (without collections, empty database)

cat indexes.txt|mongo <server>:<port>/<database> --username=<username> --password=<password> --authenticationDatabase=admin

  • MongoRestore command without indexes and with high degree of parallelism

mongorestore -u <username> -p <password> --authenticationDatabase=admin <backup_dir>/<database> -d <database> --numParallelCollections 5 --numInsertionWorkersPerCollection 20 --gzip --noIndexRestore

viernes, 21 de octubre de 2016

ETL - Pentaho PDI, parallelizing database input steps

We have plenty documentation in the web about parallelizing and clustering Pentaho PDI destination steps, but no much information about performing database input steps.

By example, if we want to improve performance of a Pentaho PDI ETL process in a MongoDB Input step by using parallelism, we could specify the "Number of Copies to Start" (right click on MongoDB step) to 10:



Only with this configuration we will have 10 steps consuming the same dataset, we need something to partition the dataset, so we could use the "mod" function over a field matching with the step instance copy at execution time, we´ll do this in the "Query" tab of the step:


{$where:"this.field%10 == ${Internal.Step.CopyNr}"}

So, at execution time, this javascript criteria will allow to retrieve subsets of the original dataset based on a mod partition over the parallelism specified and the step instance copy (parameter Internal.Step.CopyNr) .

Whe need a field with numbers with high cardinality to make this technique effective.

miércoles, 27 de julio de 2016

Script - Kill inactive non dedicated connections

This killing procedure will allow us to close connections from inactive and non dedicated connections, (if we have to kill active or dedicated connections we already have database/so tools to do this (db.killOp, kill)):

     1) Locate the process :
  
netstat -lnap|grep 127

tcp        0      0 127.0.0.1:27017  127.0.0.1:55798 ESTABLISHED 2516/mongod


    2) Locate the the socket's FD (file descriptor)

lsof -np 2516|grep 127

COMMAND  PID   USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
mongod  2516 mongod   98u  IPv4 458536      0t0    TCP 127.0.0.1:27017->127.0.0.1:55798 (ESTABLISHED)
  
    3) Connect to the master process, close the socket and exit:
  
gdb -p 2516
call close(98u)
quit

martes, 22 de marzo de 2016

Script - Comprobar parámetros de otra sesión

Podemos comprobar los valores de los parámetros de cualquier sesión de nuestra base de datos Oracle haciendo uso de la vista v$ses_optimizer_env, en el ejemplo podemos ver que para la sesión 183 se ha establecido un valor de 500Mb para el parámetro hash_area_size:

select * from v$ses_optimizer_env where sid=183

SID  ID NAME                           ISDEFAULT  VALUE
--- --- ------------------------------ ---------- --------------
183   2 parallel_execution_enabled     YES        true
183   9 optimizer_features_enable      YES        10.2.0.5
183  11 cpu_count                      YES        24
183  12 active_instance_count          YES        1
183  13 parallel_threads_per_cpu       YES        2
183  14 hash_area_size                 NO         500000000
183  15 bitmap_merge_area_size         YES        1048576
183  16 sort_area_size                 YES        65536
183  17 sort_area_retained_size        YES        0
183  24 pga_aggregate_target           YES        3145728 KB
183  25 _pga_max_size                  NO         629140 KB
183  35 parallel_query_mode            YES        enabled
183  36 parallel_dml_mode              YES        disabled
183  37 parallel_ddl_mode              YES        enabled
183  38 optimizer_mode                 YES        all_rows
183  48 cursor_sharing                 NO         force
183  50 star_transformation_enabled    YES        false
183  66 optimizer_index_cost_adj       YES        100
183  67 optimizer_index_caching        YES        0
183  70 query_rewrite_enabled          YES        true
183  71 query_rewrite_integrity        YES        enforced
183 101 workarea_size_policy           YES        auto
183 105 optimizer_dynamic_sampling     YES        2
183 112 statistics_level               NO         basic
183 114 skip_unusable_indexes          YES        true
183 165 optimizer_secure_view_merging  YES        true