This gem provides support for asynchronously updating a Solr index with the sunspot gem.
Since Solr runs as a different part of your application infrastructure, there is always the chance that it isn’t working right while the rest of your application is working fine. By queueing up changes for Solr when your models change, the update parts of your application can remain up even during a Solr outage. This could be critical if it keeps your staff working or your customer placing orders.
If your application stores data in a relational database, there’s always the chance that a record update could succeed while the transaction it was in fails. This can result in inconsistent data between your search index and your database.
If you use the tactic of batching Solr updates to commit them at once, you could have a problem if an exception is encountered prior to the batch being committed.
Queueing the updates in the same datastore that is used for persisting your models can prevent these sorts of inconsistencies from happening.
If you get a particularly large spike of updates to your indexed models, you could be taxing your Solr server with lots of single document updates. This could lead to downstream performance issues in your application. By queueing the Solr updates, they will be sent to the server in larger batches providing better performance. Furthermore, if Solr gets backed up updating the index, the slow down will be isolated to the background job processing the queue.
This library uses a dedicated work queue for processing Solr requests instead of building off of delayed_job or another background processing library in order to take advantage of efficiently batching requests to Solr. This can give orders of magnitude better performance over indexing and committing individual documents.
To use asynchronous indexing, you’ll have to set up three things.
Your application should be initialized with a Sunspot::IndexQueue::SessionProxy session. This will send queries immediately to Solr queries but send updates to a queue for processing later.
This will set up a session proxy using the default Solr session:
Sunspot.session = Sunspot::IndexQueue::SessionProxy.new
If you have custom configuration settings for your Solr session or need to wrap it with additional proxies, you can pass it to the constructor. For example, if you have a master/slave setup running on specific ports:
master_session = Sunspot::Session.new{|config| config.solr.url = 'http://master.solr.example.com/solr'} slave_session = Sunspot::Session.new{|config| config.solr.url = 'http://slave01.solr.example.com/solr'} master_slave_session = Sunspot::SessionProxy::MasterSlaveSessionProxy(master_session, slave_session) queue = Sunspot::IndexQueue.new(master_slave_session) Sunspot.session = Sunspot::IndexQueue::SessionProxy.new(queue)
The queue component is designed to be modular so that you can plugin a datastore that fits with your application architecture. To set the implementation, you can set it to one of the included implementations:
# Queue implementation backed by ActiveRecord Sunspot::IndexQueue::Entry.implementation = :active_record # Queue implementation backed by DataMapper Sunspot::IndexQueue::Entry.implementation = :data_mapper # Queue implementation backed by MongoDB Sunspot::IndexQueue::Entry.implementation = :mongo # You can also provide your own queue implementation Sunspot::IndexQueue::Entry.implementation = MyQueueImplementation
You’ll need to make sure you have the data structures set up properly for the implementation you choose. See the documentation for the implementations for more details
-
Sunspot::IndexQueue::Entry::ActiveRecordImpl
-
Sunspot::IndexQueue::Entry::DataMapperImpl
-
Sunspot::IndexQueue::Entry::MongoImpl
Note that as of version 1.1.0 the data structure for the ActiveRecord and the DataMapper implementations assumes the primary key on the indexed records is an integer. This is done since it is the usual case and far more efficient that using a string index. If your records use a primary key that is not an integer, you’ll need to add an additional migration to change the record_id
column type.
To process the queue:
queue = Sunspot::IndexQueue.new queue.process
This will process all entries currently in the queue. Of course, you’ll probably want to wrap this up in some sort of daemon process. Here is a sample daemon script you could run for a Rails application. You’ll need to customize it if your setup is more complicated.
#!/usr/bin/env ruby require 'rubygems' gem 'daemons-mikehale' require 'daemons' # Assumes the scrip is located in a subdirectory of the Rails root directory rails_root = File.expand_path(File.join(File.dirname(__FILE__), '..')) Daemons.run_proc(File.basename($0), :dir_mode => :normal, :dir => File.join(rails_root, 'log'), :force_kill_wait => 30) do require File.join(rails_root, 'config', 'environment') # Use the default queue settings. queue = Sunspot::IndexQueue.new # Don't want the daemon to fill up the log with SQL queries in development mode Rails.logger.level = Logger::INFO if Rails.logger.level == Logger::DEBUG loop do begin queue.process sleep(2) rescue Exception => e # Make sure we can exit the loop on a shutdown raise e if e.is_a?(SystemExit) || e.is_a?(Interrupt) # If Solr isn't responding, wait a while to give it time to get back up if e.is_a?(Sunspot::IndexQueue::SolrNotResponding) sleep(30) else Rails.logger.error(e) end end end end
The logic in the queue is designed to allow concurrent processing of the queue by multiple processes, however, some documents may end up getting submitted to Solr multiple times. This can speed up the processing of a large number of documents if you need to index a large data set. Forking too many processes to handle your queue, however, will result in more documents being processed multiple times.
If you have multiple models segmented into multiple Solr indexes, you can set up multiple queues. They will share the same persistence backend, but can be configured with different Solr sessions.
# Index all your content in one index content_queue = Sunspot::IndexQueue.new(:session => content_session, :class_names => [BlogPost, Review]) content_session_proxy = Sunspot::IndexQueue::SessionProxy.new(content_queue) # And all your products in another product_queue = Sunspot::IndexQueue.new(:session => product_session, :class_names => Product) product_session_proxy = Sunspot::IndexQueue::SessionProxy.new(product_queue)
When you have updates coming from multiple sources, it is often good to set a priority on when they should get processed so that when there is a backlog, it becomes less noticeable. For example, in an application that has models updated both by human interaction and by an automated feed, you probably want the human updated items to be indexed first. That way when their is a backlog, your human customers won’t notice it. You can control the priority of updates inside a block with the Sunspot::IndexQueue.set_priority method.
When an entry in the queue cannot be sent to Solr, it will be automatically rescheduled to be tried again later. The amount of time is controlled by the retry_interval
setting on IndexQueue (defaults to 1 minute). Every time it is tried and fails, the interval will be increased yet again (i.e. first try wait 1 minute, second try wait 2 minutes, third try wait 3 minutes, etc.).
Error messages and stack traces are stored with the queue entries so the can be debugged.
sThe exception to this is that if Solr is down altogether, the queue will stop processing and entries will be restored to try again immediately. This error is not logged on the entries but is rather thrown as a Sunspot::IndexQueue::SolrNotResponding exception.