Rails + Snowflake DB = why so slow? (Part II - bulk data fetching)

Important note:
All techniques described here are ok to do with app running locally, do not use them in production environment! They require changing permissions and/or capabilities that you may not want to do in production. There are other ways of doing these in production if you need to (like running separate, privileged container in same namespace as the original one) that are more involved, so I did not cover them here.


In the previous post, we described a way to speed up snowflake-based integration tests in a typical Rails app. This did help us a lot, but we were left with some other issues related to snowflake, and one of them was fetching large amounts of data from it in one go. We will give a brief overview of several methods of looking into a running ruby process, which tools are used for doing that and so on. 

Because of some historical reasons, lot of our number-crunching workflows are being done by pulling in large amounts of data (for a single tenant, so number-crunching is per-tenant) and then using that data in-memory. If we discover that we need additional data in order to calculate some numbers, we would be pulling in and memoizing that additional data, and so on. In general we wanted to have as few queries per tenant as possible, and have those queries be as simple as possible, because we encountered many issues trying to scale DB layer to handle many complex queries. It does sound counterintuitive, but it worked for us.

So a typical workflow would pull in something like 5 years worth of data from the biggest table we had for that tenant (anywhere from 100k to 3m records), calculate hundreds of business-relevant numbers, something in lines of:

  • how many new clients did this tenant observe during last month?
  • out of all active clients this month, how many came after X months of inactivity?
  • how does that number compare with same month previous year?
  • etc...

To get a feeling on how this performed, most of our jobs finished in 10-25 minutes while we were on citus DB, there were some outliers that were taking hours, but they were usually related to biggest tenants. As this was happening on a month-to-month basis, we could handle this workload in one day (for couple of thousands of tenants) by scaling our background worker fleet.

One thing that we noticed immediately were absolutely ABYSMAL times that our bulk-fetch queries were taking: for example, getting ~913.000 records while plucking 12 columns from a given table would take 1059 seconds 😱. And when we checked snowflake admin UI, it was reporting:

- Bytes Scanned: 101.8MB
- Total Duration: 3.7s

So something is clearly wrong here. Either snowflake is reporting really optimistic numbers, (like, maybe its just the time taken to find the data, but not the time of from receiving a query from the wire and putting last resulting byte back to client) or our Rails app is doing something weird. Hint: it's not snowflake 🙂

Components of interest

As explained in previous post, our Rails app is connecting to Snowflake using ODBC and somewhat modified version of Rails adapter described here. That adapter uses ruby-odbc gem under the hood. Also, in order to connect to Snowflake you need their proprietary ODBC driver, which is an .so library that you register with your OS's ODBC driver manager, in our case we were using unixODBC implementation.
Essentially most of the work is being done by the ruby-odbc gem, which talked to the driver we registered for our data source, in this case libSnowflake.so library. Data flow goes something like this:

Investigating with rbspy

This was the first thing that we thought of when trying to figure out what's going on: lets run the bulk fetching query, attach rbspy to the running process and get a flamegraph of what is getting executed and for how long. If you are not familiar, rbspy is an amazing tool that you can plug into your running ruby process and it will record what is going on in it. It's a sampling profiler, to be more precise.

In order to run rbspy inside our dockerized app, we need to install it in our docker image. We can do this by adding:

RUN curl -LJO https://github.com/rbspy/rbspy/releases/download/v0.7.0/rbspy-x86_64-unknown-linux-gnu.tar.gz \
    && tar -xzf rbspy-x86_64-unknown-linux-gnu.tar.gz \
    && install rbspy-x86_64-unknown-linux-gnu /usr/bin/rbspy \
    && setcap 'cap_sys_ptrace+eip' /usr/bin/rbspy \
    && rm -f rbspy-x86_64-unknown-linux-gnu \
    && rm -f rbspy-x86_64-unknown-linux-gnu.tar.gz

Note that this is an older version, but it worked for us out of the box, as we are also using older version of Debian. You can install newer versions if its an option for you. There is only one more thing to do, that is to add:

      - SYS_PTRACE

to our docker-compose file. This will signal docker to run our container with tracing on, otherwise we would get something in lines of bash: /usr/bin/rbspy: Operation not permitted when we run rbspy. 

After we exec into running container and run bulk data fetch on it, we attach rbspy to that process by issuing (for example) rbspy record -p {PID} --rate 50 -f flamegraph-odbc.svg and we get the following (after ~30s of execution):

Not very informative 🤔. Ok, we are in bulk fetch method (block in tenant_data - /myproject/lib/lotr/data_container.rb:29 part), we are doing plucking + we are using our version of snowflake rails adapter, but we can't even see ODBC driver being called. Everything interesting is being hidden behind  start [c function] - (unknown):0 section. 
Let's try to poke around a bit more using some other tools, maybe we will find something interesting along the way.

Investigating with GDB

GDB is a debugger tool that we can use to see what's going on inside a running program, stop its execution, set breakpoints, add print statements to it without recompilation, etc. Its a MASSIVE and very useful tool,  and people have been working on it for last 36 years 😯. If one takes a look at its online documentation one can get a feeling of what can be done with it.

In ruby/rails community, its often used to debug gnarly issues originating in native gems written in C, or even in ruby VM itself (also written in C). If you are interested, you can check out following articles:

For our purposes, we just wanted to check what's going on at arbitrary points in time, which may help us see what the driver is doing. So, for research purposes, we installed it using apt install gdb and added privileged: true to our compose file. Attaching gdb to our rails app and calling info threads while doing bulk fetching gives us:

    Id   Target Id         Frame
  34   Thread 0x7f130affd700 (LWP 54) "jemalloc_bg_thd" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  33   Thread 0x7f1309dff700 (LWP 53) "jemalloc_bg_thd" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  32   Thread 0x7f13197fa700 (LWP 52) "jemalloc_bg_thd" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  28   Thread 0x7f130bfff700 (LWP 48) "ruby" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  25   Thread 0x7f1319ffb700 (LWP 45) "ruby" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  24   Thread 0x7f131a7fc700 (LWP 44) "ruby" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  23   Thread 0x7f131affd700 (LWP 43) "ruby" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  20   Thread 0x7f131b7fe700 (LWP 40) "ruby" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  17   Thread 0x7f131bfff700 (LWP 37) "jemalloc_bg_thd" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  16   Thread 0x7f1365ddc700 (LWP 8) "ruby-timer-thr" 0x00007f1364f9aaed in poll ()
    at ../sysdeps/unix/syscall-template.S:81
  15   Thread 0x7f135b284700 (LWP 11) "connection_poo*" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  14   Thread 0x7f135ae5f700 (LWP 12) "ruby" 0x00007f1364f9c893 in select ()
    at ../sysdeps/unix/syscall-template.S:81
  13   Thread 0x7f135ac5d700 (LWP 13) "ruby" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  12   Thread 0x7f135ab5c700 (LWP 14) "connection_poo*" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  11   Thread 0x7f135a6aa700 (LWP 15) "connection_poo*" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  10   Thread 0x7f135a5a9700 (LWP 16) "connection_poo*" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  9    Thread 0x7f135a4a8700 (LWP 17) "connection_poo*" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  8    Thread 0x7f135a3a7700 (LWP 18) "connection_poo*" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  7    Thread 0x7f135aa5b700 (LWP 19) "ruby" 0x00007f1364f9c893 in select ()
    at ../sysdeps/unix/syscall-template.S:81
  6    Thread 0x7f1359e9c700 (LWP 20) "ruby" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  5    Thread 0x7f1359d9b700 (LWP 21) "ruby" 0x00007f1364f9c893 in select ()
    at ../sysdeps/unix/syscall-template.S:81
  4    Thread 0x7f1359c9a700 (LWP 22) "ruby" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  3    Thread 0x7f1359b99700 (LWP 23) "ruby" 0x00007f1364f9c893 in select ()
    at ../sysdeps/unix/syscall-template.S:81
  2    Thread 0x7f1359a98700 (LWP 24) "ruby" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
* 1    Thread 0x7f1365dce700 (LWP 1) "ruby" rgengc_check_relation (objspace=0x5611c5848ad0,
    obj=94634328621240) at gc.c:4200

Background memory allocation threads, connection pools...we poked around, and when issuing backtrace while switched to Thread #1, we would be getting something in lines of:

 	(gdb) backtrace
#0  rgengc_check_relation (objspace=0x5611c5848ad0, obj=94634532961120) at gc.c:4200
#1  0x00005611c4244d5d in rgengc_check_relation (obj=94634532961120, objspace=0x5611c5848ad0)
    at gc.c:4291
#2  gc_mark_ptr (objspace=0x5611c5848ad0, obj=94634532961120) at gc.c:4283
#3  0x00005611c4245f09 in gc_mark_children (objspace=objspace@entry=0x5611c5848ad0,
    obj=<optimized out="">) at gc.c:4416
#4  0x00005611c4246f3d in gc_mark_stacked_objects (incremental=<optimized out="">,
    count=<optimized out="">, objspace=<optimized out="">) at gc.c:4534
#5  gc_mark_stacked_objects_all (objspace=<optimized out="">) at gc.c:4574
#6  gc_marks_rest (objspace=objspace@entry=0x5611c5848ad0) at gc.c:5397
#7  0x00005611c4247545 in gc_marks (full_mark=<optimized out="">, objspace=0x5611c5848ad0)
    at gc.c:5457
#8  gc_start (objspace=objspace@entry=0x5611c5848ad0, full_mark=full_mark@entry=1,
    immediate_mark=immediate_mark@entry=1, reason=<optimized out="">, reason@entry=1024,
    immediate_sweep=<optimized out="">) at gc.c:6242
#9  0x00005611c42499c7 in gc_start (reason=1024, immediate_sweep=<optimized out="">,
    immediate_mark=1, full_mark=1, objspace=0x5611c5848ad0) at gc.c:6169
#10 garbage_collect (reason=1024, immediate_sweep=<optimized out="">, immediate_mark=1, full_mark=1,
    objspace=0x5611c5848ad0) at gc.c:6160
#11 gc_start_internal (argc=<optimized out="">, argv=<optimized out="">, self=<optimized out="">)
    at gc.c:6475
#12 0x00005611c437d3c7 in vm_call0_cfunc_with_frame (ci=<optimized out="">, cc=<optimized out="">,
    argv=<optimized out="">, calling=<optimized out="">, th=<optimized out="">) at vm_eval.c:131
#13 vm_call0_cfunc (argv=<optimized out="">, cc=<optimized out="">, ci=<optimized out="">,
    calling=<optimized out="">, th=<optimized out="">) at vm_eval.c:148
#14 vm_call0_body (th=0x5611c58486c0, calling=0x3, ci=0x5611c5882a70, cc=0x3c,
    cc@entry=0x7fff4f03b260, argv=0x0) at vm_eval.c:186
#15 0x00005611c437e423 in vm_call0 (me=<optimized out="">, argv=0x0, argc=0, id=20785,
    recv=94634328468080, th=0x5611c58486c0) at vm_eval.c:61
#16 rb_call0 (recv=94634328468080, mid=20785, argc=0, argv=0x0, scope=<optimized out="">,
    self=<optimized out="">) at vm_eval.c:351
#17 0x00005611c437ed42 in rb_call (scope=CALL_FCALL, argv=<optimized out="">, argc=<optimized out="">,
    mid=<optimized out="">, recv=<optimized out="">) at vm_eval.c:637
#18 rb_funcall (recv=<optimized out="">, mid=<optimized out="">, n=<optimized out="">) at vm_eval.c:835
#19 0x00007f135c8a42e3 in start_gc () at ../odbc.c:1836
#20 do_fetch (q=0x5611d73b95f0, mode=-776438944, mode@entry=0) at ../odbc.c:6930
#21 0x00007f135c8a73b8 in stmt_fetch1 (bang=0, self=94634560151280) at ../odbc.c:7477
#22 stmt_fetch_many (arg=8, self=94634560151280) at ../odbc.c:7616
#23 stmt_fetch_all (self=94634560151280) at ../odbc.c:7629
#24 0x00005611c436e607 in vm_call_cfunc_with_frame (ci=<optimized out="">, cc=<optimized out="">,
    calling=<optimized out="">, reg_cfp=0x7f1365dcb450, th=0x5611c58486c0) at vm_insnhelper.c:1638
.... (other lines)...
#94 0x00005611c422465b in main (argc=3, argv=0x7fff4f03f528) at main.c:36

One thing sticks out, the lines 19-24. Looks like this is the part of the code that calls ruby-odbc gem, and seems that its doing garbage collection, while inside do_fetch C function. Also, at line 10 we can see that GC is indeed called. Poking around a bit, we can see that snowflake driver is probably downloading data in arrow format:

 	(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f9f313ba479 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f9f313ba2a0 in __GI___pthread_mutex_lock (mutex=0x555ade8d8158)
    at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f9eed088ac3 in sf::Mutex::lock() () from /usr/lib/snowflake/odbc/lib/libSnowflake.so
#4  0x00007f9eed0f85eb in sf::ResultChunkDownloaderArrow::getNextChunkToConsume(unsigned long&amp;) ()
   from /usr/lib/snowflake/odbc/lib/libSnowflake.so
#5  0x00007f9eed0fe403 in sf::ResultSetArrow::next() ()
   from /usr/lib/snowflake/odbc/lib/libSnowflake.so
#6  0x00007f9eecfd3bcc in Simba::Snowflake::SFQueryResult::MoveToNextRow() ()
   from /usr/lib/snowflake/odbc/lib/libSnowflake.so
#7  0x00007f9eec615374 in Simba::DSI::DSISimpleResultSet::Move (this=<optimized out="">,
    in_direction=<optimized out="">, in_offset=<optimized out="">) at Client/DSISimpleResultSet.cpp:110
#8  0x00007f9eec90fd6c in Simba::ODBC::ForwardOnlyCursor::FetchRowset (this=0x555adf65a9c0,
    in_orientation=<optimized out="">, in_offset=<optimized out="">, in_rowsetSize=<optimized out="">,
    in_ard=<optimized out="">, in_rowStatusPtr=<optimized out="">, in_rowsProcessedPtr=0x0)
    at Cursor/ForwardOnlyCursor.cpp:352
#9  0x00007f9eec94542b in Simba::ODBC::QueryManager::FetchRowset (this=0x555ade539750,
    in_orientation=1, in_offset=0, in_rowsetSize=1, in_rowStatusPtr=0x0, in_rowsProcessedPtr=0x0)
    at QueryManager/QueryManager.cpp:87
#10 0x00007f9eec963fdb in Simba::ODBC::StatementStateCursor::DoFetchScroll (this=<optimized out="">,
    in_fetchOrientation=1, in_fetchOffset=0) at Statement/StatementStateCursor.cpp:823
#11 0x00007f9eec96cafc in Simba::ODBC::StatementState6::SQLFetchScroll (this=<optimized out="">,
    in_fetchOrientation=<optimized out="">, in_fetchOffset=<optimized out="">)
    at Statement/StatementState6.cpp:56
#12 0x00007f9eec94ee55 in Simba::ODBC::Statement::SQLFetchScroll (this=0x555addb692f0,
    FetchOrientation=1, FetchOffset=0) at Statement/Statement.cpp:1019
#13 0x00007f9eec8d78c6 in DoSynchronously (in_params=..., in_stmt=...)
    at CInterface/SQLFetchScrollTask.h:82
#14 DoTask<simba::odbc::sqlfetchscrolltask> (in_functionName=0x7f9eedad1e4f "SQLFetchScroll",
    in_handle=0x3, in_parameters=...) at CInterface/CInterface.cpp:684
#15 0x00007f9eec8a45ff in SQLFetchScroll (StatementHandle=<optimized out="">,
    FetchOrientation=<optimized out="">, FetchOffset=<optimized out="">) at CInterface/CInterface.cpp:1799
#16 0x00007f9f286e7dc5 in SQLFetchScroll () from /usr/lib/x86_64-linux-gnu/libodbc.so.2
#17 0x00007f9f282a90f4 in F_SQLFETCHSCROLL (args=<optimized out="">) at ../odbc.c:1186
#18 0x0000555acb5982e9 in call_without_gvl (fail_if_interrupted=0, data2=<optimized out="">,
    ubf=<optimized out="">, data1=0x7fff851080b0, func=0x7f9f282a90e0 <f_sqlfetchscroll>)
    at thread.c:1289
#19 rb_thread_call_without_gvl (func=func@entry=0x7f9f282a90e0 <f_sqlfetchscroll>,
    data1=data1@entry=0x7fff851080b0, ubf=ubf@entry=0x7f9f282a7f90 <empty_ubf>,
    data2=data2@entry=0x7fff851080b0) at thread.c:1399
#20 0x00007f9f282b52ba in SQLFETCHSCROLL (offs=0, dir=1, hstmt=<optimized out="">) at ../odbc.c:1199
#21 stmt_fetch1 (bang=0, self=93848698222600) at ../odbc.c:7454
...(more lines)...

It would be great if we could access that data and do in-memory calculations directly on those...another time, perhaps 😁. At this point we could dig deeper into ruby-odbc code to check line 7616 of odbc.c file to see when and why GC is being called, or set some counters in GDB so that we can see how many times it gets called. But we can also try to get a better flamegraph of execution, so that we can get a feeling how much time is actually spent in GC during bulk data fetching. Lets use one last tool called perf to determine this.

Investigating with Perf

Perf is another tool that is often used to figure out what's going on while some code is running on a system. In other words, it can help us troubleshoot and answer the following questions (taken from Brendan Gregg's blog, has some amazing content, do check it out) :

  • Why is the kernel on-CPU so much? What code-paths?
  • Which code-paths are causing CPU level 2 cache misses?
  • Are the CPUs stalled on memory I/O?
  • Which code-paths are allocating memory, and how much?
  • What is triggering TCP retransmits?
  • Is a certain kernel function being called, and how often?
  • What reasons are threads leaving the CPU?

Our use-case will not involve answering any of these questions though, we are only interested to see if its true that our app is spending lots of time in GC. For that we will be using perf in a similar way we did as when we used rbspy: we will attach it to a running process and record what's going on. We will then pass those recorded results to some python tools, which in turn will spit out SVG file containing flamegraph.

We install it by issuing apt update && apt install linux-perf, but note that it can happen that the versions might get mixed up, so calling perf will try to find it as a bin with host kernel version in the name, not the one from container. If that happens try doing this. You then attach to running Rails container and issue command looking something like perf record -F 99 -p 1 --call-graph dwarf where the number after p flag indicates process id, number behind F indicates sampling rate (99 Hz in this case) and in this case we use dwarf format for unwinding call stacks.

When running perf you may encounter following message:

    perf_event_open(..., PERF_FLAG_FD_CLOEXEC) failed with unexpected error 1 (Operation not permitted)
    perf_event_open(..., 0) failed unexpectedly with error 1 (Operation not permitted)
    No permission to enable cycles event.
You might need to issue echo 0 > /proc/sys/kernel/kptr_restrict, while also keeping changes we did to docker compose file described in the GDB section (if not, you might get bash: /proc/sys/kernel/kptr_restrict: Read-only file system message).

After letting perf run for ~60s, we stop it and then run perf report to see the call stack information. Perf can create lot of events so be gentle with sampling rate and with time you record, as it might take a lot of time to analyze and present all of that information. Once summary is complete, user is shown something like this:

It does show that more than 90% of time is being spent in GC, and looks like its coming from odbc_utf8.so shared library. First appearance of Snowflake driver (libSnowflake.so shared library) shows up  later on, with 2.07%, so it shows it's not the culprit. If we have left perf record entire execution, percentage of time spent in GC would probably be even higher. To get to flamegraph, we need grab one of the utilities Brendan created, that takes output of perf tool and spits out flamegraph SVG file. So we do the following commands, after perf has run:

 root@2b6ee3b64358:/myproject# git clone https://github.com/brendangregg/FlameGraph
 root@2b6ee3b64358:/myproject# mv perf.data FlameGraph/ && cd FlameGraph/
 root@2b6ee3b64358:/myproject/FlameGraph# perf script | ./stackcollapse-perf.pl > out.perf-folded && cat out.perf-folded | ./flamegraph.pl > perf-output.svg
Here is what the flamegraph looks like:

Those tiny rectangles in the beginning that are so small that we need to zoom-in to even see what they are, and they are indeed data fetching part, rest is just do_fetch -> start_gc part.
Ok, so now that we have confirmed it from multiple angles, lets try to fix the issue.


Looking at https://github.com/larskanis/ruby-odbc/blob/master/ext/odbc.c#L5930 (I'm no C expert by any means, so take this analysis with a big grain of salt), we see that there is a counter incremented every time do_fetch is called, and when we hit 500 we set it back to 0 and do a GC. It was probably done to keep memory consumption to a minimum, but our Rails monolith is quite big and each GC pause is felt in a big way. 

When we look at our use case (fetching large amounts of data), it would make more sense to just let it run its course and call GC once after everything is fetched. And because we already monkey-patched snowflake ODBC adapter-s exec_query method, we can as well add explicit GC there:

 def exec_query(sql, name = 'SQL', binds = [], prepare: false)
        log(sql, name) do
          sql = bind_params(binds, sql) if prepared_statements
          stmt =  @connection.run(sql)

          columns = stmt.columns
          values  = stmt.fetch_all || []

          values = dbms_type_cast(columns.values, values)
          column_names = columns.keys.map { |key| format_case(key) }
          GC.start # this was added
          ActiveRecord::Result.new(column_names, values)

and remove all instances of start_gc() from the the gem itself. One can probably omit it entirely and let Ruby VM take care of itself, but we went with the more cautious route. Note that this will still issue a GC for every query: even if they take 100ms, it can add up, especially in tests. You can conditionally start it if query fetches more than, say, 100 records: that way tests are probably not going to trigger it at all. Another thing that we noticed while upgrading our app from Ruby 2.3.1 to Ruby 2.5.9 is that code for this gem (from github) no longer compiles: the line that does rb_scan_args(argc, argv, (op == -1) ? "11" : "01", &val, &val2); is now raising compilation error, so we applied same fix as codevolumes folks did.

And now when we test it, flamegraph looks lot more healthier:

And the best part is: it fetches data in ~27s now, so we can even let perf run all the way until we get results back, nice.

We managed to go from 1059s for fetching ~913k records to ~27s, a 40x improvement in speed! 
Snowflake DB cache can skew the results, so we were careful to use same cluster with results already cached there.

Future considerations

Turns out, slowness that we were seeing when fetching data from Snowflake was caused by excessive garbage collection pauses, called explicitly from ruby-odbc gem. In general, that gem is a bit weird, as the code that is currently on github no longer compiles under Ruby 2.5, which itself is EOL-ed. Newer code can be downloaded as a tarball from  http://www.ch-werner.de/rubyodbc/, but that is not really a good method of distributing source code anymore - we would need to download tarball every time we build docker image, and I have more confidence in availability of sites like github, as opposed to some random website. Website in question is also HTTP only.

It is also not quite clear where the development of it takes place, as last commit on github shows that it happened on 
Nov 18, 2011 (almost exactly 11 years before this blog post is written), but the newest version on rubygems was uploaded in 2020. It is also very complex, as it has single odbc.c file that has ~8500 LOC. We are yet to see if it will need any additional changes when we bump our app to use Ruby 2.6+. Not trying to blame creators/maintainers here as they did all the heavy lifting and probably nobody payed them to do it anyway, but we see this part of our app as something that is a risk for future upgrades. Finding/creating a replacement for it would be quite beneficial.

One option would be to create a wrapper for a odbc connectivity library written in Rust, like this one, using rutie or magnus libraries. Rust is picking up momentum quite a bit in recent years due to its performance and memory safety guarantees (and CTO of Azure agrees). There are more and more examples of mixing Ruby with Rust, like this initiative in rubygems world to allow building native gems using Rust, or Shopify's porting of their YJIT compiler from C to it.