Skip to content
This repository has been archived by the owner on Dec 7, 2018. It is now read-only.

Actor crash when reads and writes both block #132

Open
fionawhim opened this issue Feb 12, 2015 · 15 comments
Open

Actor crash when reads and writes both block #132

fionawhim opened this issue Feb 12, 2015 · 15 comments
Labels
Milestone

Comments

@fionawhim
Copy link

Hi folks! I'm coming to Celluloid IO via the Krakow NSQ client gem. I have a reproduction of a bug that we've been seeing there, but I'm not sure if it's a bug in Celluloid IO or a bug in Krakow's assumptions about Celluloid IO.

The error we're seeing is “ArgumentError: this IO is already registered with selector” come from the monitor = @selector.register(io, set) line in reactor.rb’s wait method.

The code opens a TCP socket and has a permanent read loop, run via async. It then receives writes (usually synchronously). The bug seems to occur when, during write, an :IO::WaitWritable exception is thrown (see stream.rb's syswrite method). This causes Celluloid IO to wait_writeable on the actor, leading to a call to reactor.rb's wait method.

At this point, the reactor's selector is already registered on the io due to the recv in the read loop. When it registers again for the blocked write, the selector implementation raises an exception due to the double-registration of the same IO object.

Here are two files that demonstrate the problem. It takes maybe 10 iterations on my machine to see the exception.

# server.rb
require 'socket'

server = TCPServer.new 2000 # Server bind to port 2000

loop do
  server.accept
end
# client.rb
require 'celluloid/io'

class Client
  include Celluloid::IO

  def init!
    @socket = Celluloid::IO::TCPSocket.new("localhost", 2000)
    async.read_loop
  end

  def read_loop
    while true
      puts "READING"
      @socket.recv(8)
    end
  end

  def write
    @socket.write (0...100000).map { ('a'..'z').to_a[rand(26)] }.join
  end
end

client = Client.new()
client.init!

loop do
  puts "WRITING"
  client.write
end

The error trace on my box is:

$ ~/c/celluloid-test [1]> ruby ./client.rb
READING
WRITING
WRITING
WRITING
WRITING
WRITING
WRITING
WRITING
D, [2015-02-12T13:05:01.978070 #62944] DEBUG -- : Terminating 1 actor...
E, [2015-02-12T13:05:01.978205 #62944] ERROR -- : Actor crashed!
ArgumentError: this IO is already registered with selector
    /Users/phopkins/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/celluloid-io-0.16.1/lib/celluloid/io/reactor.rb:42:in `register'
    /Users/phopkins/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/celluloid-io-0.16.1/lib/celluloid/io/reactor.rb:42:in `wait'
    /Users/phopkins/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/celluloid-io-0.16.1/lib/celluloid/io/reactor.rb:26:in `wait_writable'
    /Users/phopkins/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/celluloid-io-0.16.1/lib/celluloid/io.rb:65:in `wait_writable'
    /Users/phopkins/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/celluloid-io-0.16.1/lib/celluloid/io/stream.rb:33:in `wait_writable'
    /Users/phopkins/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/celluloid-io-0.16.1/lib/celluloid/io/stream.rb:63:in `rescue in block in syswrite'
    /Users/phopkins/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/celluloid-io-0.16.1/lib/celluloid/io/stream.rb:60:in `block in syswrite'
    /Users/phopkins/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/celluloid-io-0.16.1/lib/celluloid/io/stream.rb:386:in `synchronize'
    /Users/phopkins/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/celluloid-io-0.16.1/lib/celluloid/io/stream.rb:58:in `syswrite'
    /Users/phopkins/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/celluloid-io-0.16.1/lib/celluloid/io/stream.rb:352:in `do_write'
    /Users/phopkins/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/celluloid-io-0.16.1/lib/celluloid/io/stream.rb:246:in `write'
    ./client.rb:20:in `write'

Is this considered a bug in Celluloid IO, or is this simultaneous read/write on the same TCPSocket that Krakow is doing not a good pattern?

Thanks for any help on this issue. I'm happy to give a fix a try if someone can sketch out what they think would be good; I'm wondering about having both a “read” and a “write” selector inside the reactor for the waits (since latches elsewhere in the code ensure that those are exclusive) but there are some details there that don’t quite fit.

@fionawhim
Copy link
Author

I should add that my expected behavior for my repro script is for the writes to eventually block (since the server isn't consuming any of the data).

This is coming up in production because of network latency or some other reason causing the syswrite call to occasionally receive an :IO::WaitWritable exception.

@tarcieri
Copy link
Member

It's a bug in Celluloid::IO. Here was the initial attempt at a fix:

a4e953c

This was reverted because the way it's implemented, monitors accumulate in @monitors and are never garbage collected. Since monitors hold a reference to an IO object, this resulted in a file descriptor leak.

The correct implementation would be for Celluloid::IO objects to instead track the reactors they're attached to, and be able to look up the monitor they're currently using. That is, rather than reactors tracking their monitors (which is at best O(log(n)) complexity for n monitors), IO objects should track the reactors and their monitors.

There's a missing API in nio4r that would help quite a bit as well: right now nio4r does not expose an API for changing interest ops.

@fionawhim
Copy link
Author

@tarcieri Do you think that commit is close enough to a solution that I should try to fix the leak issue? Or is there a reason why cleaning them up is not tractable?

@tarcieri
Copy link
Member

It needs to be completely redone in the manner I outlined in my last comment

@fionawhim
Copy link
Author

OK. I'd like to take a stab at it, then, even without the nio4r change.

@tarcieri
Copy link
Member

Go for it.

@fionawhim
Copy link
Author

Thanks for your time explaining this. Just to follow up, I don't think I will able to be successful making the change you describe. The surface area of code I'm unfamiliar with is I think too high.

(I actually wonder if fixing this in nio4r and making it tolerant of 2 monitors for 1 io if one is read and the other is write is the cleanest way; I worried about losing the nice cleanup stuff of e.g. the monitor.close call in the ensure block in Reactor that was added recently when trying to re-use the same monitor across multiple wait calls.)

I've figured out a workaround that unblocks me for my specific use of Krakow so I will have to be content with that. I won't try to make a PR to fix this in Celluloid-IO or nio4r.

@tarcieri
Copy link
Member

I might take the time soon to add the ability to change interest ops to nio4r.

@ioquatix
Copy link
Contributor

Can you explain briefly what needs to be done?

@tarcieri
Copy link
Member

In nio4r, or in Celluloid::IO?

nio4r needs an API like this one from Java:

http://docs.oracle.com/javase/7/docs/api/java/nio/channels/SelectionKey.html#interestOps(int)

@fedenusy
Copy link

+1, I'm seeing this bug in my project as well.

@digitalextremist digitalextremist modified the milestone: 0.18.0 Aug 9, 2015
@HoneyryderChuck
Copy link

@tarcieri I saw that nio4r already has an API to change an interest on an IO object. Is this the missing link?

@tarcieri
Copy link
Member

@TiagoCardoso1983 yep

@HoneyryderChuck
Copy link

awesome! should this/is this already handled transparently in that nio4r version? If not, what needs to be adapted? I saw that the code didn't change and registering already registered io objects fails. Or should the relationship IO/monitor/reactor change first?

@tarcieri
Copy link
Member

@TiagoCardoso1983 I think each IO object should keep a registry of which selectors it's been registered with, so it can look up the associated monitor and change the interests.

Certainly not transparent, but interests are at least configurable now...

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

6 participants