Streaming GeoJSON with GDAL, Rack::Chunked, and Rails Live Streaming
TL;DR
For you busy folks who just want it to work, here is the gist.
Background
A few weeks ago at NationBuilder, I was building a GeoJSON API for organizers. The end goal of this project is to provide accurate political districting information through this API and to expose a map management interface for district updates.
Application Architecture
This application is built on top of the popular Ruby on Rails framework and uses PostGIS as the geographic data storage backend. GeoJSON conversion is done with the rgeo-geojson gem, since it fits nicely with activerecord-postgis-adapter.
Performance Issues
The application initially ran smoothly, but problems started to emerge as the generated GeoJSON data grew over time. The API server would often time out, stop responding, or run out of memory when the GeoJSON response size approached a couple of megabytes, even under light load. As a result, map rendering became so slow and unstable that it was painful to use.
Memory Bloat
Our GeoJSON data was initially loaded from the database by ActiveRecord and converted into RGeo::Geometry objects, with the entire dataset buffered in memory. These objects were then converted into hashes with rgeo-geojson, and the converted hashes take up a lot of memory as well. Finally, the resulting hashes are merged with other database properties and pushed onto an array. All buffered in memory!
Response Time
The application spends the majority of its time converting RGeo::Geometry objects into GeoJSON hashes. The CPU hits 100% and memory steadily climbs. Upon closer inspection, I found out that the rgeo-geojson gem does the conversion entirely in Ruby! When the dataset gets too big, either the browser or our proxy runs out of patience and issues a timeout, or our application process consumes so much memory that it gets killed by monit.
Understanding Root Cause
As any seasoned software engineer would do, I first analyzed the symptoms at hand and concluded the following root causes:
- GeoJSON rendering is too slow.
- Dataset buffering causes memory bloat.
Research, Research, Research!
My initial effort was focused on solving the rendering speed issue. I knew I could not convert GeoJSON in Ruby, since Ruby isn't exactly suitable for this kind of heavy-duty data crunching when speed is a core requirement.
The PostGIS route
The first thought that came to mind was to have PostGIS generate the GeoJSON for me, and I found ST_AsGeoJSON. However, this doesn't solve my problem, since I would still have to convert the generated GeoJSON back into hashes and merge in the database properties. There's gotta be a better way!
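For illustration, the route I ruled out would have looked roughly like this (a sketch, assuming a hypothetical Region model with name and geom columns):

# A sketch of the PostGIS route: ST_AsGeoJSON converts each geometry to a
# GeoJSON string inside the database, but the properties still have to be
# merged back in Ruby, row by row, all buffered in memory.
features = Region.limit(100).pluck('name', 'ST_AsGeoJSON(geom)').map do |name, geometry|
  {
    type: 'Feature',
    geometry: JSON.parse(geometry),
    properties: { name: name }
  }
end
feature_collection = { type: 'FeatureCollection', features: features }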
GDAL
After some more research I found a little command-line tool called ogr2ogr from the wonderful GDAL library. This tool can execute a supplied SQL statement and convert the results into a GeoJSON FeatureCollection. This is perfect!
Here is an example on how to use it:
ogr2ogr -f GeoJSON /vsistdout/ \
"PG:host=<host> dbname=<dbname> user=<user> password=<password>" \
-sql "SELECT name, geom FROM regions LIMIT 100"
The -f flag indicates the output format; in this case I used GeoJSON. /vsistdout/ means I want to write the GeoJSON directly to STDOUT.
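Note that /vsistdout/ is just the destination argument, so pointing ogr2ogr at a regular file works the same way (paths and connection details are placeholders):

ogr2ogr -f GeoJSON /tmp/regions.json \
"PG:host=<host> dbname=<dbname> user=<user> password=<password>" \
-sql "SELECT name, geom FROM regions LIMIT 100"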
I gave it a spin, and the rendering speed is blazing fast! Ok, I’m sticking with it!
Interfacing with Rails
In order to construct the command, I need to convert an ActiveRecord scope into a SQL statement.
SQL Statement
This is easy enough. The #to_sql method on an ActiveRecord scope will do the trick. If you are using Rails 4, don't forget to wrap this method inside a scope.connection.unprepared_statement block to generate the full statement.
Something simple like this would do:
def sql
  @scope.connection.unprepared_statement { @scope.to_sql }
end
DB Connection String
The next step is to fetch the Rails database configuration and convert it into the connection string expected by ogr2ogr. Here is a snippet:
def conn_str
  db_config = Rails.configuration.database_configuration[Rails.env]

  host = db_config['host']
  port = db_config['port']
  database = db_config['database']
  username = db_config['username']
  password = db_config['password']

  args = []
  args.push "host=#{host}" if host
  args.push "port=#{port}" if port
  args.push "dbname=#{database}" if database
  args.push "user=#{username}" if username
  args.push "password=#{password}" if password

  # the escaped quotes keep the shell from splitting the connection
  # string apart when the full command is run through IO.popen
  "PG:\"#{args.join(' ')}\""
end
Stitching It Together
The entire command is constructed as follows:
def command
  [
    # the command name
    'ogr2ogr',
    # output geojson to stdout
    '-f', 'GeoJSON', '/vsistdout/',
    # postgres db config
    conn_str,
    # SQL statement to run
    '-sql', "\"#{sql}\""
  ].join(' ')
end
Running the command
Run the generated command as a sub-process and get the STDOUT as an IO object.
def run
  IO.popen(command)
end
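Assuming the methods above live in the GeojsonCommand class that shows up later in the controller (the exact class layout is my assumption), a quick sanity check in the Rails console looks something like this:

io = GeojsonCommand.new(Region.limit(10)).run
puts io.read # buffers the whole output; fine for a quick test
io.close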
Yay! Fast GeoJSON rendering!
Rails Live Streaming
You could read to the end of the IO stream and just send it off, but where is the fun in that? Let's stream the response back with HTTP 1.1 chunked encoding, 4 KB at a time! This improves response time and reduces the memory footprint. Sweet deal!
Add Chunked Encoding Support to Rails
In order to support chunked encoding, we need to add this to our config/application.rb file:
# config/application.rb
module MyApp
  class Application < Rails::Application
    #
    # other rails application configurations
    #
    # ...

    # Add Rack::Chunked before Rack::Sendfile
    config.middleware.insert_before(Rack::Sendfile, Rack::Chunked)
  end
end
This inserts the Rack::Chunked middleware into the correct position in the Rails middleware stack to support HTTP 1.1 chunked encoding.
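You can verify the placement with rake middleware; the relevant part of the output (truncated here) should show Rack::Chunked sitting right before Rack::Sendfile:

$ rake middleware
...
use Rack::Chunked
use Rack::Sendfile
...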
Aligning The Stars, Er… I mean Interfaces
ActionController::Metal Streaming Interface
Since I was building an API, I used ActionController::Metal instead of ActionController::Base, which is a much smaller abstraction over the Rack interface. The response body is set using the #response_body= method.
In order to use chunked encoding, we need to give the #response_body= method an object that responds to #each and optionally #close, where #each yields data in chunks, and #close is required if you need to close underlying file descriptors or perform any other cleanup. The goal is to wrap the ogr2ogr IO stream in an object that behaves the way #response_body= expects.
Ruby IO Interface
Ruby's IO class already implements #each and #close. However, the behavior of #each is not ideal for our use case: IO#each is an alias for IO#each_line, which yields data line by line, and ogr2ogr generates its GeoJSON as a single gigantic line. We need to split chunks by byte size instead of by line.
Luckily, Ruby's IO offers a nice method called #readpartial that takes a maxlen argument, which tells the IO stream the maximum number of bytes to read. When invoked, the IO object blocks until at least some data is available, then returns a string of at most the requested size (or raises EOFError once the end of the stream has been reached).
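Here is a quick illustration of that behavior with an in-process pipe:

reader, writer = IO.pipe
writer.write('{"type": "FeatureCollection"')
writer.close

reader.readpartial(4096) # => "{\"type\": \"FeatureCollection\"" (fewer than 4096 bytes)
reader.readpartial(4096) # raises EOFError, the stream is exhausted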
Chunking IO Stream
With the long explanation above, we are now equipped with enough knowledge to create a wrapper that handles the chunking. It turns out to be pretty simple:
# lib/chunked_stream.rb
#
# ChunkedStream
#
# It takes any IO object and reads the output in chunks.
# It implements #each and #close and is designed to interface
# with Rails live streaming or Rack::Chunked.
#
class ChunkedStream
  CHUNK_SIZE = 1024 * 4 # read in 4 KB chunks

  attr_reader :io, :chunk_size

  def initialize(io, chunk_size = CHUNK_SIZE)
    @io = io
    @chunk_size = chunk_size
  end

  def each
    # #readpartial never returns nil; it raises EOFError when the
    # stream is exhausted, which is how this loop terminates
    loop { yield io.readpartial(chunk_size) }
  rescue EOFError
    nil
  ensure
    close
  end

  def close
    # guard against a double close: #each already closes the stream,
    # and Rack will call #close on the response body as well
    io.close unless io.closed?
  end
end
Now we have all the Lego pieces in place! Let's put them to good use:
# In your ActionController::Metal subclass
# that generates the GeoJSON
#
def index
  scope = Region.all
  io = GeojsonCommand.new(scope).run

  self.status = :ok
  self.content_type = 'application/json'
  self.response_body = ChunkedStream.new(io)
end
Bonus Stage: On The Fly Compression for Chunked Encoding
Somewhere during the research, I came across Rack::Deflater. This is a middleware that checks for Accept-Encoding in the request headers and compresses your response on the fly!
The trick to getting it working with Rack::Chunked is to insert this middleware right after Rack::Chunked. This is because we need to compress each data chunk before it is wrapped in chunked encoding, instead of compressing the entire response at once.
To enable compression, all you need to do is add the following line to your config/application.rb:
config.middleware.insert_after(Rack::Chunked, Rack::Deflater)
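To confirm both middlewares are doing their job, you can hit the endpoint with curl (the path here is hypothetical) and check the response headers:

$ curl -si -H 'Accept-Encoding: gzip' http://localhost:3000/regions
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Encoding: gzip
...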
Caveats
Although the speed has improved significantly, there are still some caveats to watch out for.
ogr2ogr Sub-Process Error Handling
Currently the code assumes the ogr2ogr sub-process runs successfully every time, and we all know this is unrealistic. Better error handling is needed to make it more robust.
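As a starting point, one option (a sketch, not what the code above does) is a more defensive ChunkedStream#close that checks the sub-process exit status. IO#close on an IO created by IO.popen waits for the child process and populates $? with its Process::Status:

# Sketch: log non-zero exits instead of silently serving truncated JSON
def close
  io.close unless io.closed?
  status = $? # Process::Status of the ogr2ogr sub-process
  if status && !status.success?
    Rails.logger.error "ogr2ogr exited with status #{status.exitstatus}"
  end
end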
Under Large Load
Under high load, many ogr2ogr sub-processes will be created, and the behavior in that situation is unknown. But this is an internal API with very light traffic, so this is not a concern for me (yet). I think some kind of in-process worker pool and monitor could mostly solve this issue, but I have no plan to implement it for now.
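For the curious, the kind of throttle I have in mind would look roughly like this. It is purely hypothetical and untested, and it would need adapting so the slot is released when the response stream closes rather than at the end of the block:

require 'thread'

# Hypothetical throttle: allow at most max_processes concurrent ogr2ogr
# sub-processes. SizedQueue acts as a counting semaphore; #pop blocks
# when every slot is taken.
class SubprocessPool
  def initialize(max_processes = 4)
    @slots = SizedQueue.new(max_processes)
    max_processes.times { @slots << :slot }
  end

  # Runs the command, yields its output IO, and frees the slot afterwards.
  def run(command)
    @slots.pop
    io = IO.popen(command)
    yield io
  ensure
    io.close if io && !io.closed?
    @slots << :slot
  end
end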
Go Try It Out!
Here is a gist for you to try it out. Let me know your results / opinions / rants! All is welcome!