30 March 2015

Ignoring duplicate inserts with Postgres when processing a batch

I'm busy on a project which involves importing fairly large datasets of about ~3.3GB at a time.  I have to read a CSV file, process each line, and generate a number of database records from the results of that process.

Users are expected to be able to rerun batches and there is overlap between different datasets.  For example: the dataset of "last year" overlaps with the dataset of "all time".  This means that we need an elegant way to handle duplicate updates.

Searching if a record exists (by PK) is fine until the row count in the table gets significant.  At just over 2 million records it was taking my development machine 30 seconds to process 10,000 records.  This number steadily increased as the row count increased.

I had to find a better way to do this and happened across the option of using a database rule to ignore duplicates.  While using the rule there is a marked improvement in the performance as I no longer need to search the database for a record.

No comments:

Post a Comment

Note: only a member of this blog may post a comment.