30 March 2015

Ignoring duplicate inserts with Postgres when processing a batch

I'm busy on a project which involves importing fairly large datasets, about 3.3GB at a time. I have to read a CSV file, process each line, and generate a number of database records from the results of that process.

Users are expected to be able to rerun batches, and there is overlap between different datasets. For example, the dataset of "last year" overlaps with the dataset of "all time". This means that we need an elegant way to handle duplicate inserts.

Checking whether a record exists (by primary key) before inserting it is fine until the row count in the table gets significant. At just over 2 million records it was taking my development machine 30 seconds to process 10,000 records, and that time steadily increased as the row count increased.

I had to find a better way to do this and happened across the option of using a database rule to ignore duplicates. With the rule in place there is a marked improvement in performance, as I no longer need to run a separate lookup query for each record before inserting it.
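
The rule itself isn't shown above, so here is a minimal sketch of what such a rule might look like. The table name records, the primary key column id, the database name import_db and the helper name ignore_duplicates_rule_sql are all made up for illustration. The helper only prints a CREATE RULE statement, so you can read it before piping it to psql.

```shell
#!/bin/sh
# Print a CREATE RULE statement that tells Postgres to silently skip an
# INSERT when a row with the same primary key already exists.
# The table and column names passed in are hypothetical examples.
ignore_duplicates_rule_sql() {
    table="$1"   # table to protect, e.g. records
    pk="$2"      # primary key column, e.g. id
    cat <<SQL
CREATE OR REPLACE RULE ${table}_ignore_duplicates AS
    ON INSERT TO ${table}
    WHERE EXISTS (SELECT 1 FROM ${table} WHERE ${table}.${pk} = NEW.${pk})
    DO INSTEAD NOTHING;
SQL
}

# Usage (assumes psql is installed and a database called import_db exists):
#   ignore_duplicates_rule_sql records id | psql import_db
```

One thing to watch for: as far as I can tell the rule only looks at rows that are already in the table, so two identical keys inside the same multi-row INSERT may still collide.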

17 March 2015

Adding info to Laravel logs

I am coding a queue worker that is handling some pretty large (2GB+) datasets and so wanted some details in my logs that vanilla Laravel didn't offer.

Reading the documentation at http://laravel.com/docs/4.2/errors wasn't much help until I twigged that I could manipulate the log object returned by Log::getMonolog().

Here is an example of adding memory usage to Laravel logs.

In app/start/global.php make the following changes:
 $log = Log::getMonolog();  
 $log->pushProcessor(new Monolog\Processor\MemoryUsageProcessor);  

You'll find the Monolog documentation in the Monolog repository.

12 March 2015

Support for Postgres broken in HHVM 3.6.0

On my desktop machine I run my package upgrades every day. The other day my HHVM (HipHop) version got updated to 3.6.0 and suddenly my Postgres support died.

Running HHVM gave a symbol not found error in the postgres.so file (undefined symbol: _ZTIN4HPHP11PDOResourceE), exactly like the issue reported on the driver repository.

I tried to recompile the Postgres driver against HHVM 3.6.0 but hit a number of problems, which seemed to be mostly to do with hhvm-pgsql-master/pdo_pgsql_statement.cpp.

Unfortunately, the fix for the incompatibility was to roll back to my previous version of HHVM. To do this on Mint/Ubuntu:

  1. Run cat /etc/*-release to get your release information
  2. Download the appropriate package for your distro from http://dl.hhvm.com/ubuntu/pool/main/h/hhvm/
  3. Remove your 3.6.0 installation of hhvm: sudo apt-get remove hhvm
  4. Install the package you downloaded: sudo dpkg -i <deb package>
After that everything should be installed properly and you can start up hhvm without a problem.
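
Steps 3 and 4 above can be sketched as a small script. This is only a sketch under assumptions: the function names run and rollback_hhvm and the DRY_RUN switch are my own invention, and you still have to download the right package yourself. The final apt-mark hold is not part of the steps above; it is an optional extra that should stop the next daily upgrade from pulling 3.6.0 straight back in.

```shell
#!/bin/sh
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# rollback_hhvm: remove the installed hhvm and install an older .deb.
# The argument is the path to the package you downloaded in step 2.
rollback_hhvm() {
    deb="$1"
    if [ -z "$deb" ]; then
        echo "usage: rollback_hhvm <deb package>" >&2
        return 1
    fi
    run sudo apt-get remove hhvm     # step 3: remove the 3.6.0 install
    run sudo dpkg -i "$deb"          # step 4: install the older package
    run sudo apt-mark hold hhvm      # optional: keep apt from upgrading it again
}

# Usage: preview first, then run for real.
#   DRY_RUN=1 rollback_hhvm <deb package>
#   rollback_hhvm <deb package>
```

When you are ready to try a newer HHVM again, sudo apt-mark unhold hhvm releases the hold.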