Bulk RDBMS Upserts with Spring

Posted on Aug 28, 2020


Upsert is a fairly common term in databases, meaning update the record if it exists or insert it as a new record. Upserts make obvious sense in the case of simple object save requests with new information.

Why Bulk?

If we talk about any data sync, data migration or bulk data update job, we are bound to have a bulk upsert scenario: update whatever we already have in the database and insert all the new rows.

Solutions

We will be discussing the solutions available in the Spring Boot environment and the inferences we can make out of them.

For the testing, I will be using MS SQL Server as the database and will be limited to its functionality, but the concepts are fairly generalisable.

#1 Standard saveAll() Solution using Spring JPA

Consider we have a price table with a composite unique key/constraint, with a structure like:

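Something like the following, where the table and column names (price, upc, store_id) are assumptions based on the rest of this post:

create table price (
    upc      varchar(20)    not null,
    store_id int            not null,
    price    decimal(10, 2),
    constraint pk_price primary key (upc, store_id)  -- composite unique key on the business columns
);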

Then, our entity’s composite primary key and entity classes would look like:

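A minimal sketch of the two classes, assuming the key is the (upc, store_id) combination; constructors, getters, equals() and hashCode() are omitted for brevity:

import java.io.Serializable;
import java.math.BigDecimal;
import javax.persistence.*;

// PriceId.java -- the composite primary key class
@Embeddable
public class PriceId implements Serializable {
    private String upc;
    private Integer storeId;
    // a key class must implement equals() and hashCode() over both fields
}

// Price1.java -- the entity mapped onto the price table
@Entity
@Table(name = "price")
public class Price1 {

    @EmbeddedId
    private PriceId id;

    private BigDecimal price;
}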

Assume we are consuming a stream of record batches to upsert. To mock this, we have a supplier that generates batches of 1,000 random Price1 objects. The ingestion code:

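A sketch of that loop, assuming Price1Repository is a plain Spring Data JpaRepository and the random batch supplier is injected from outside:

import java.util.List;
import java.util.function.Supplier;
import org.springframework.stereotype.Service;

@Service
public class PriceIngestionService {

    private final Price1Repository repository;           // assumed: extends JpaRepository<Price1, PriceId>
    private final Supplier<List<Price1>> batchSupplier;   // generates batches of 1,000 random Price1 objects

    public PriceIngestionService(Price1Repository repository, Supplier<List<Price1>> batchSupplier) {
        this.repository = repository;
        this.batchSupplier = batchSupplier;
    }

    public void ingest(int numberOfBatches) {
        for (int i = 0; i < numberOfBatches; i++) {
            List<Price1> batch = batchSupplier.get();
            long start = System.currentTimeMillis();
            repository.saveAll(batch);                    // the standard Spring Data JPA bulk save
            System.out.println("batch " + i + " persisted in " + (System.currentTimeMillis() - start) + " ms");
        }
    }
}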

Let’s see the results:

[Screenshot: time taken to persist each batch of 1,000 records]

Now, on looking closely, the time taken to persist each batch increases as the table gets filled up! What is going on?

The culprit is the .save() method of the SimpleJpaRepository class:

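Its logic boils down to roughly this:

@Transactional
@Override
public <S extends T> S save(S entity) {
    // a "new" entity is persisted (INSERT), anything else is merged (lookup + UPDATE)
    if (entityInformation.isNew(entity)) {
        em.persist(entity);
        return entity;
    } else {
        return em.merge(entity);
    }
}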

Since save() performs both the insert and the update operation, it has to check whether an entity is “new” or not. For that, it either checks the persistence context or queries the database, and that lookup gets more expensive as the table fills up.

#2 Optimisations for Bulk save

Our bottleneck in the previous approach was continuous reads from the DB for a primary key combination to check whether to perform .persist() / insert or .merge() / update.

To avoid the additional querying on the table for .isNew(), we could have another auto-generated unique field (column) which is independent of the business logic. That way, every new object gets a unique id and we always do a .persist() for it.

Let’s do the modifications on a completely new table (keeping the business columns as they are), with an additional auto-incremented “identity” column:

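A sketch of the new table, with an assumed name of price_2 and the same business columns as before:

create table price_2 (
    id       bigint identity(1, 1) primary key,  -- auto-incremented surrogate key, independent of the business columns
    upc      varchar(20)    not null,
    store_id int            not null,
    price    decimal(10, 2)
);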

Here is our entity:

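A sketch under the same assumptions (getters and setters omitted); since the identity column is the @Id, a freshly created object always has a null id and is therefore always treated as new:

import java.math.BigDecimal;
import javax.persistence.*;

@Entity
@Table(name = "price_2")
public class Price2 {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)  // the database assigns the id on insert
    private Long id;

    private String upc;
    private Integer storeId;
    private BigDecimal price;
}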

Testing out similar ingestion code yields results like the ones below:

[Screenshot: per-batch persist times, now roughly constant]

Hooray, constant time and faster inserts! But wait,


There are now duplicate rows for the same upc, store_id combinations:

[Query result: multiple rows sharing the same upc, store_id combination]

Which is understandable, as we are doing no updates.

#3 Plain inserts then merge

The above two experiments encourage us to keep the insert-only model for faster inserts and somehow merge (apply the updates) later.

To achieve this, we can have a “stage” table with a unique auto-generated id for inserts, separate from our main target table, and a post-ingestion job that merges the records after de-duplicating them.

[Diagram: ingestion inserts into the stage table; a post-ingestion merge job de-duplicates into the main table]

Plain inserts can go as done above; we will now try to write the “merge” step. This step can very well be a database stored procedure, for the following reasons:

  1. To avoid movement of data and process the bulk data where it resides.
  2. Database specific optimisations are built-in.

Following is a naive example implementation of a merge stored procedure for MS SQL; similar merge query features are present in most of the mainstream databases.

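A naive sketch of such a procedure, assuming a stage table named price_stage shaped like the identity table above and the price table from #1 as the target:

create procedure dbo.merge_price_stage
as
begin
    -- keep only the latest stage row per (upc, store_id) ...
    with latest as (
        select upc, store_id, price,
               row_number() over (partition by upc, store_id order by id desc) as rn
        from dbo.price_stage
    )
    -- ... and merge those rows into the target table
    merge dbo.price as target
    using (select upc, store_id, price from latest where rn = 1) as source
        on target.upc = source.upc and target.store_id = source.store_id
    when matched then
        update set target.price = source.price
    when not matched then
        insert (upc, store_id, price)
        values (source.upc, source.store_id, source.price);

    -- naive cleanup; see the notes below for a batch-range delete instead
    truncate table dbo.price_stage;
end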

Executing this took less than a second!

[Screenshot: merge procedure execution time]

Important notes: the above stored procedure implementation is a very naive approach, just for demonstration, and misses a lot of aspects that need to be taken care of in a production environment, like:

  • Over-utilising the transaction log due to the huge amount of data in a single merge statement; batch-wise merging over id ranges could be implemented here.
  • Monitoring and logging procedure failures; a TRY-CATCH based procedure that logs failures into a procedure-audit table could be used.
  • Deleting successfully merged batch ranges instead of truncating the whole stage table.

#Extra: Further Improving batch inserts

If we log the Hibernate-generated SQL statements for our .saveAll() operation, we will see something like this:

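Roughly one insert statement per entity (column names as assumed for the entity above):

insert into price_2 (price, store_id, upc) values (?, ?, ?)
insert into price_2 (price, store_id, upc) values (?, ?, ?)
insert into price_2 (price, store_id, upc) values (?, ?, ?)
...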

Here, we are firing single-row insert statements, each one going over the network separately.

There are some improvements that can be done here:

  1. Batch the queries and fire them in one call to the database over the network.
  2. Rewrite the single-row queries into one multi-row query.

For the first, we can make use of the Hibernate properties:

spring.jpa.properties.hibernate.jdbc.batch_size=1000
spring.jpa.properties.hibernate.order_inserts=true

For the second, bulk (multi-row) insert queries are available in most of the mainstream databases (Postgres, MySQL, Oracle), with syntax like:

insert into myschema.my_table (col1, col2, col3)
values
(val11, val12, val13),
(val21, val22, val23),
....
(valn1, valn2, valn3);

Postgres and MySQL JDBC drivers do support rewriting batched inserts into this multi-row form with the help of a JDBC flag (e.g. reWriteBatchedInserts=true for Postgres).

But unfortunately, according to this resource, the MS SQL JDBC driver does not support this multi-row rewrite of the queries. So, if we want it, we have to write the insert queries manually.

Manual insert query creation could look like:

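A sketch of such a builder over the assumed Price2 fields (a method of the ingestion service shown next); concatenating values directly is only for demonstration, real code would need to escape or bind the values:

// builds: insert into price_2 (upc, store_id, price) values ('..', .., ..), ('..', .., ..), ...
private String buildMultiRowInsert(List<Price2> batch) {
    return batch.stream()
            .map(p -> String.format("('%s', %d, %s)", p.getUpc(), p.getStoreId(), p.getPrice()))
            .collect(Collectors.joining(", ", "insert into price_2 (upc, store_id, price) values ", ""));
}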

The ingestion code would use the entityManager.createNativeQuery() method:

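A sketch of that service, with the builder above as one of its methods; the class and method names are assumptions:

import java.util.List;
import java.util.stream.Collectors;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class NativeBatchInsertService {

    @PersistenceContext
    private EntityManager entityManager;

    @Transactional
    public void insertBatch(List<Price2> batch) {
        String sql = buildMultiRowInsert(batch);                // the multi-row insert built above
        entityManager.createNativeQuery(sql).executeUpdate();   // one round trip per 1,000-row batch
    }

    // buildMultiRowInsert(...) as sketched in the previous snippet
}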

Let’s test it out:

[Screenshot: per-batch insert times, each under a second]

Wow!! Those batches of 1,000 records took less than a second each; let’s look at the milliseconds now:

[Screenshot: per-batch insert times in milliseconds]

Conclusion

We can safely infer from our above experiments that:

  1. Achieving bulk upserts is really complicated in relational databases.
  2. The out-of-the-box save methods provided by Spring Data JPA and Hibernate won’t scale for huge loads.
  3. An independent primary key, decoupled from the business columns/fields, helps improve insert performance, but will eventually let duplicate records in.
  4. Separating the insert-only (stage) table from the read (actual) table can improve ingestion performance.
  5. A post-insert merge strategy can be used for de-duplicating the inserted records; stored procedures work well in this scenario.
  6. Native SQL queries give the most performant results; where performance is critical, use solutions that are as close to the database as possible.

Testing Environment

  • MacBook Pro 2016 model 15" with 16 GB RAM.
  • Application uses Java 1.8 with Spring Boot 2.3.3.
  • MS SQL server 2017 database in a Docker container.

Thanks for Reading :)