I'm busy with a project that involves importing fairly large datasets of roughly 3.3 GB at a time. I have to read a CSV file, process each line, and generate a number of database records from the results of that process.
Users are expected to be able to rerun batches, and there is overlap between different datasets. For example, the dataset for "last year" overlaps with the dataset for "all time". This means we need an elegant way to handle duplicates.
Checking whether a record already exists (by primary key) works fine until the row count in the table gets significant. At just over 2 million records it was taking my development machine 30 seconds to process 10,000 records, and that time steadily increased as the row count grew.
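For context, the per-row approach amounted to something like the following (the table and column names here are just placeholders, not the real schema):

    -- Check whether the record is already there...
    SELECT 1 FROM import_records WHERE id = :id;

    -- ...and only insert when the SELECT returns nothing.
    INSERT INTO import_records (id, amount, created_at)
    VALUES (:id, :amount, :created_at);

That's two round trips per CSV line, and the SELECT gets slower as the table grows.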
I had to find a better way to do this and happened across the option of using a database rule to ignore duplicates. With the rule in place there is a marked improvement in performance, as I no longer need to search the database for an existing record before every insert.
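Assuming PostgreSQL (where this is done with CREATE RULE), a minimal sketch of such an ignore-duplicates rule looks like this, again with placeholder table and column names:

    -- Turn an INSERT of an existing primary key into a no-op.
    CREATE RULE ignore_duplicate_inserts AS
        ON INSERT TO import_records
        WHERE EXISTS (SELECT 1 FROM import_records WHERE id = NEW.id)
        DO INSTEAD NOTHING;

With something like this in place the importer can simply fire a plain INSERT for every row and let the database silently drop the ones whose primary key is already present, instead of asking first.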