A few months ago, one of our clients switched from Citus DB to Snowflake DB for most of their workflows.
If you are not familiar with it, Snowflake DB has gained quite a bit of popularity in recent years, at least in the data warehousing / data lake business. They describe their product as:
Snowflake’s Data Cloud is powered by an advanced data platform provided as Software-as-a-Service (SaaS). Snowflake enables data storage, processing, and analytic solutions that are faster, easier to use, and far more flexible than traditional offerings.
The Snowflake data platform is not built on any existing database technology or “big data” software platforms such as Hadoop. Instead, Snowflake combines a completely new SQL query engine with an innovative architecture natively designed for the cloud.
Given that our client is in the data analytics business, most of their data workflows don't require real-time modifications and can work with periodic mass inserts/merges. Other analytic and business logic workflows build on that fact, usually processing the data in bulk once it becomes available.
Snowflake also decouples data processing from data storage, so one can easily add more processing capacity without disturbing the rest of the system or already running queries. Apart from that, Snowflake offers some nice features when it comes to data sharing and data governance, so the decision was made to do the switch.
But where does Rails fit into all of this? Well, the main application where users and admins log in to see reports/dashboards and do administrative work is a Rails monolith, which now needed to learn to speak Snowflake. That app is also responsible for scheduling and executing lots of business-oriented workflows and logic, such as data ETLs, bulk number crunching, report generation, and a lot more.
The switch
Note: All timings that we talk about in this blog post were observed while using Snowflake in a Rails 5.2 application, with the ruby-odbc gem under the hood. Your timings might differ a bit, but the gist of it is there.
To say that it is different from Postgres (on which Citus DB builds) would be an understatement 😅 For starters, it's a SaaS product that you can't run locally, as you would Postgres (or Citus DB, with their docker image). If you want a development environment, every dev needs to have their own dev account set up, with their test + development database. Same goes for the CI environment.
Also, every query comes with a predefined cost you kinda can't escape from - want to insert just one row into the DB? Yes you can, but it will take anywhere from 0.3-2s to do it. Compare that to Postgres, where it takes milliseconds (to be fair, Snowflake was not designed for these workflows, but still, it's a paradigm shift for applications that use those databases).
Want to speed up the test DB so queries are faster (accepting that data could be lost if the DB crashes), like using fsync=off, wal_level=minimal etc.? Nope, there is no way to avoid that 0.3-2s penalty, even if you use temporary/transient tables, as explained here.
Used to running EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON) and plugging that info into a tool like pev2? Well, Snowflake does not have anything like ANALYZE or BUFFERS (it does show some data, like partitionsTotal/partitionsAssigned), and the format of the returned data is totally different (last I checked) - a short comparison follows after this list. There is a query profiler built into their web UI that even looks a bit like pev2, but for complex queries it can get tedious to scroll around the UI to see the entire flow. And apart from that, sometimes a query gets split into more than one stage, each stage having its own UI.
Want to know the id of the item you just inserted (even if you don't, Rails will)? Tough luck, Snowflake does not support the RETURNING SQL keyword, so you are left doing something like this, which can give you wrong results, so don't use it!
And the list goes on.
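To make the query-plan point concrete, here is roughly how the two sides compare (the query itself is just a placeholder). Postgres gives you runtime statistics you can feed into pev2, while Snowflake's EXPLAIN only returns the compile-time plan, with runtime details living in the query profile UI:

-- Postgres: runtime stats (timings, buffers), output pluggable into pev2
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON) SELECT * FROM payments WHERE country_id = 1;

-- Snowflake: compile-time plan only, different JSON shape (no ANALYZE/BUFFERS equivalent)
EXPLAIN USING JSON SELECT * FROM payments WHERE country_id = 1;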
In order for the Rails app to talk to Snowflake DB, it has to use their ODBC driver, as they offer native drivers/connectors for Python, Go, Java, .NET and Node.js, but not for Ruby. There is an option of using SQL over their REST API, but at the time of writing (August 2022) it's still kinda new, so we didn't want to lean on it too much.
So, ODBC driver it is.
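For reference, wiring that up roughly means installing Snowflake's ODBC driver, defining a DSN for it, and pointing the Rails connection config at that DSN. A minimal sketch - every name below is a placeholder, and the exact database.yml keys depend entirely on how your Snowflake adapter registers itself:

# ~/.odbc.ini - example DSN for the Snowflake ODBC driver (values are placeholders)
[snowflake_dev]
Driver    = /path/to/libSnowflake.so
server    = youraccount.snowflakecomputing.com
database  = DEV_DB
schema    = PUBLIC
warehouse = DEV_WH
role      = DEVELOPER

# config/database.yml - hypothetical entry; keys depend on your adapter implementation
snowflake_development:
  adapter: snowflake
  dsn: snowflake_dev
  username: <%= ENV['SNOWFLAKE_USER'] %>
  password: <%= ENV['SNOWFLAKE_PASSWORD'] %>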
If you google around a bit using "Rails" and "Snowflake" keywords, chances are you will bump into this blog post from 2017. Following their example, we started out making a SnowflakeAdapter work in our app. But the road was bumpy: Rails required lots of additional changes beyond the ones described, and on top of that some Rails methods changed their signatures starting from version 5+. That part deserves a separate blog post in itself.
Luckily, the Rails monolith was not doing many one-by-one data manipulation workflows, at least not on data that ended up living in Snowflake. It was already mostly using the COPY command for those workflows, and other types of workflows were using SQL (either in raw form, or coming from ActiveRecord).
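To give an idea of what those bulk workflows look like, data typically lands in a Snowflake stage as files and then gets loaded with COPY INTO instead of row-by-row INSERTs. A simplified, hypothetical example (stage, path and table names are made up):

# Hypothetical bulk load: CSV previously uploaded to an internal stage named etl_stage
Snowflake::Base.connection.execute(<<~SQL)
  COPY INTO payments
  FROM @etl_stage/payments/2022-08-01.csv
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
SQL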
But there was one part of the app that was changing data in-place, one-by-one: tests.
Running cucumber-based scenarios on Snowflake
The Rails monolith we are talking about here has lots of feature specs, written as gherkin scenarios of the following form:
Given that the following people exist in DB:
  | id       | country_id | is_inactive | entered_date |
  | PERSON_A | 1          | false       | 2016-03-01   |
  | PERSON_B | 1          | false       | 2016-03-01   |
And that the following products exist in DB:
  | id        | country_id | creation_date | ...
  | PATIENT_A | 1          | 2016-02-01    | ...
  | PATIENT_B | 1          | 2016-02-01    | ...
They are this detailed because we have more than a thousand different scenarios covering lots of different features, each scenario trying to cover some edge case - changing just one column can result in a different number. They are also integration tests: the steps backing them use real DBs, and only external systems outside of our control (cloud services, various API endpoints) are mocked. That way we can be sure that, given a certain state in the DB and certain responses from external services, we would actually get those results.
So basically every scenario had to first insert from 5 to over 100 different records, execute business logic and then check the results.
Once we started running them, we noticed that they were taking a veeerrrry loooong time to finish. We added some timing logic around test data persistence, and noticed the costs associated with doing OLTP workflows on an OLAP database:
Creating Snowflake::TransactionTag took 827ms
Creating Snowflake::TransactionTag took 942ms
Creating Snowflake::TransactionTag took 1022ms
... (lot more lines) ...
Creating Snowflake::Payment took 1282ms
Creating Snowflake::Payment took 1130ms
Creating Snowflake::Payment took 951ms
... (lot more lines) ...
Creating Account took 12ms
Creating Account took 13ms
Creating Account took 6ms
... (lot more lines) ...
8 scenarios (8 passed)
100 steps (100 passed)
17m22.919s
You can kinda guess that Account still lives in Postgres, while Payment and TransactionTag live in Snowflake, as it takes them ~50-100x longer to get inserted. Sometimes it even takes more than 6 seconds for one insert - when Snowflake sneezes, I guess. Just to reiterate, Snowflake can't be run locally, so this test is connecting to a remote Snowflake test account - not that we have a choice: locally or on the CI environment, tests have to connect to a remote instance. These are just timings for raw inserts; the DB cleaning strategy is the same for both entities, and it proved not to take much time compared to the inserts.
So we are left with a choice:
1) run tests in Postgres, hoping that once things go live in production Snowflake will behave the same
2) run tests locally in Postgres but connect to Snowflake on CI
3) run both local and CI tests on Snowflake
Option 2 looked like the best of both worlds: we get quick and, in 95%+ of cases, correct results from testing locally, while also having a safety net on the CI environment, where we can tolerate longer test runs (which at least run in parallel). But even with option 2, we still need tests to finish on CI, and with no optimisations in place they would take hours.
Speeding up gherkins, part I
One thing we noticed is that we already have a generic step for creating records in the DB, something along the lines of:
def add_to_db(entity_class, table)
  table.hashes.each do |attrs|
    attributes = create_attributes_from_hash(entity_class, attrs)
    final_attributes = fabricator_attributes(entity_class, attributes).merge(timestamp_values(entity_class))

    start = Process.clock_gettime(Process::CLOCK_MONOTONIC, :millisecond)
    entity_class.new(final_attributes).save!
    puts "Creating #{entity_class} took #{Process.clock_gettime(Process::CLOCK_MONOTONIC, :millisecond) - start}ms"
  end
end
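For context, a step definition along these lines (the regex and constant lookup here are simplified and hypothetical) is what routes each gherkin table into that helper:

# Hypothetical step definition - a real one would also handle namespaced models like Snowflake::Payment
Given(/^that the following (.+) exist in DB:$/) do |entity_name, table|
  entity_class = entity_name.classify.constantize # e.g. "people" -> Person, "products" -> Product
  add_to_db(entity_class, table)
end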
If there is a predefined cost to every Snowflake insert statement, let's try to insert an entire table of records in one go, something along the lines of:
def bulk_add_to_db(entity_class, table)
  attributes_array = create_attributes_from(table.hashes)

  # here we create string n-tuple with all attributes, with NULL instead of nil
  all_column_values = attributes_array.map do |single_entity_attributes|
    quoted_values = single_entity_attributes.map { |column_name, column_value| cast_value(entity_class, column_name, column_value) }
    "(#{quoted_values.join(',')})"
  end.join(',')

  # attribute hash will be the same for entire array, we can take first to figure out column names
  column_names = prune(attributes_array.first.keys).join(',')

  start = Process.clock_gettime(Process::CLOCK_MONOTONIC, :millisecond)
  bulk_insert_expression = "INSERT INTO #{entity_class.table_name} (#{column_names}) VALUES #{all_column_values}"
  Snowflake::Base.connection.execute(bulk_insert_expression)
  puts "Creating #{table.hashes.size} #{entity_class}-s took #{Process.clock_gettime(Process::CLOCK_MONOTONIC, :millisecond) - start}ms"
end
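cast_value is one of those helpers we don't show in full; conceptually it just turns a Ruby value into a SQL literal. A rough, hypothetical sketch of what it might look like:

# Hypothetical sketch - the real helper is app-specific
def cast_value(entity_class, column_name, column_value)
  # NULL instead of nil, numbers/booleans as-is, everything else quoted
  return 'NULL' if column_value.nil?

  column = entity_class.columns_hash[column_name.to_s]
  case column&.type
  when :integer, :float, :decimal, :boolean
    column_value.to_s
  else
    entity_class.connection.quote(column_value)
  end
end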
The remaining housekeeping/utility methods are omitted, but the bulk (no pun intended) of it is there. Instead of inserting values one-by-one, we transform the entire gherkin table into one INSERT INTO {TABLE} ({COLUMNS}) VALUES ({VALUES}). When we check the logs, we can see something like:
... (lot more lines) ...
Creating 21 Snowflake::TransactionTag-s took 995ms
Creating 12 Snowflake::Payment-s took 699ms
... (lot more lines) ...
8 scenarios (8 passed)
100 steps (100 passed)
3m33.389s
Looks like a big win: we went from 17min 22s to 3min 33s, around 5x faster.
Speeding up gherkins, part II
We can expand this idea further: what if we collected entities from all tables and only inserted them once we hit the When step? Speed improvements probably won't be as drastic as the previous one, as we're mostly saving on network round trips (just a guess), but it might be worth a shot.
In order to do that, we need a way of detecting, for any gherkin, when we are about to execute the When step. Also, the change should be localised in one place, so we don't have to touch all existing gherkins. Luckily, we found something that can be used for that: custom filters.
If we change our code to look something like:
def bulk_add_to_db(entity_class, table)
  # ... pretty much same as above ...
  column_names = prune(attributes_array.first.keys).join(',')

  @snowflake_to_insert ||= {}
  insert_statement_values = { expression: "INSERT INTO #{entity_class.table_name} (#{column_names}) VALUES #{all_column_values}", size: attributes_array.size }

  if @snowflake_to_insert.key?(entity_class)
    @snowflake_to_insert[entity_class][:expression] = "#{@snowflake_to_insert[entity_class][:expression]};#{insert_statement_values[:expression]}"
    @snowflake_to_insert[entity_class][:size] = @snowflake_to_insert[entity_class][:size] + insert_statement_values[:size]
  else
    @snowflake_to_insert[entity_class] = insert_statement_values
  end
end
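After a couple of Given steps, @snowflake_to_insert ends up looking roughly like this (expressions abbreviated, numbers made up):

{
  Snowflake::TransactionTag => { expression: "INSERT INTO transaction_tags (...) VALUES (...),(...)", size: 21 },
  Snowflake::Payment        => { expression: "INSERT INTO payments (...) VALUES (...),(...)", size: 12 }
}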
With this in place, we save insert expressions for every step, while also keeping track of how many entities of each type we are about to insert. After that, we define the filter:
class SnowflakeBulkInsertFilter < Cucumber::Core::Filter.new
  # From all test steps, find 'When'/'Examples' one and before calling its action, spill data to snowflake
  def test_case(test_case)
    intercepted_when_steps = test_case.test_steps.map do |test_step|
      action = test_step.instance_variable_get('@action')

      test_step.with_action do
        if when_step?(test_step) && !snowflake_to_insert.blank?
          start = Process.clock_gettime(Process::CLOCK_MONOTONIC, :millisecond)
          setup_unlimited_statements
          Snowflake::Base.connection.execute(bulk_insert_expression)
          puts "Creating #{bulk_insert_entity_list} took #{Process.clock_gettime(Process::CLOCK_MONOTONIC, :millisecond) - start}ms"
          clear_import_expressions
        end

        original_action = action.instance_variable_get('@block')
        raise ArgumentError, "Could not find '#{test_step.to_s}' action defined" if original_action.nil?

        original_action.call
      end
    end

    test_case.with_steps(intercepted_when_steps).describe_to receiver
  end

  # gets the value of World object... this is where all instance variables defined in steps live
  def current_world
    receiver.instance_variable_get('@hooks').instance_variable_get('@registry').current_world
  end

  def current_ivar_value(name)
    current_world.instance_variable_get("@#{name}")
  end

  # we cache this one, so that we only call this once during each scenario
  def setup_unlimited_statements
    @setup_unlimited_statements ||= Snowflake::Base.connection.execute('alter session set MULTI_STATEMENT_COUNT = 0;')
  end

  def clear_import_expressions
    current_world.instance_variable_set("@snowflake_to_insert", {})
  end

  def snowflake_to_insert
    current_ivar_value('snowflake_to_insert')
  end

  def bulk_insert_expression
    snowflake_to_insert.values.map { |hash| hash[:expression] }.join(";")
  end

  def bulk_insert_entity_list
    snowflake_to_insert.map { |entity_class, values| "#{values[:size]} #{entity_class}-s" }.join(',')
  end

  # Is this step a point where we need to spill data to database?
  def when_step?(test_step)
    step_keyword = test_step.instance_variable_get('@source')[2].keyword
    %w[When Examples].any? { |target_word| step_keyword.include?(target_word) }
  end
end
Note that we need to issue alter session set MULTI_STATEMENT_COUNT = 0; in order to be able to send multiple statements in one query. This only needs to be done once per session though, as it's a session-wide parameter. Here we basically add custom logic around the normal execution of the When step, cleaning up any insert expressions after the step has finished. One could use the bulk-insert gem for something like this, but beware that it omits id columns by default - I'm guessing it expects an auto-incrementing id column to already exist.
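To make it concrete: by the time the When step runs, bulk_insert_expression is just the per-entity INSERTs glued together with semicolons, and setting MULTI_STATEMENT_COUNT to 0 is what allows that whole string to be sent in a single execute call - something roughly like this (columns and values abbreviated/hypothetical):

INSERT INTO transaction_tags (id, ...) VALUES ('TAG_A', ...), ('TAG_B', ...);
INSERT INTO payments (id, ...) VALUES ('PAYMENT_A', ...), ('PAYMENT_B', ...)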
The only thing left to do is to register the filter, which you do by adding
AfterConfiguration do |config|
  config.filters << SnowflakeBulkInsertFilter.new
end
to your test configuration file.
Let's check how the tests perform with this newest change:
... (lot more lines) ...
Creating 21 Snowflake::TransactionTag-s,12 Snowflake::Code-s,12 Snowflake::Payment-s ... (104 records in total).. took 5320ms
... (lot more lines) ...
8 scenarios (8 passed)
100 steps (100 passed)
3m16.142s
Looks like it's faster, but not by a huge margin - around 8%.
Apart from using one seed file to populate the DB once and reusing that data for all tests, this is the most we could achieve with bulk inserts. A seed file was not an option for us because of the sheer number of tests that sometimes have totally different preconditions, but maybe it could work for some other use cases.
Having done this, we could now focus our attention on fixing another thing that we noticed is considerably slower on Snowflake compared to Postgres/Citus: fetching large amounts of data (for example, 2 years' worth of payments for a single client).
But that will be part of another blog post, stay tuned!