Writing / reading to / from file descriptor or memory directly #12

Closed
gaborcsardi opened this issue Jul 10, 2019 · 14 comments

Comments

@gaborcsardi

gaborcsardi commented Jul 10, 2019

Do you think it would be possible to add support for this? It would be great to be able to use a pipe/socket and also memory directly.

@traversc
Collaborator

Yep, I was looking at pipes for the next version :)

@gaborcsardi
Author

gaborcsardi commented Jul 10, 2019 via email

@traversc
Collaborator

traversc commented Jul 22, 2019

I've added two new functions, qsave_pipe and qread_pipe for writing to file descriptors or R connections.

Writing to R connections does not seem to be allowed by CRAN normally (e.g. tidyverse/readr#856 (comment)), but it can be enabled when compiling.

I'm going to test it out a bit more and submit it to CRAN.

@gaborcsardi
Author

Thanks! Unfortunately, R connections are not a good fit for my use case; a simple Unix fd or a Windows HANDLE would be much better.

@traversc
Collaborator

traversc commented Jul 22, 2019

I have this set up in two ways -- one way using R connections, the other way using FILE pointers created by popen (declared in stdio.h). So for example, you could do this:

> qsave_pipe(1:10, "cat > C:/temp.qc") # cat.exe comes from Rtools installation
> qread_pipe("cat C:/temp.qc")
 [1]  1  2  3  4  5  6  7  8  9 10

On the C++ side, this looks something like this:

  std::unique_ptr<FILE, decltype(&pclose)> pipe(popen(scon.c_str(), "wb"), pclose);
  if (!pipe) {
    throw std::runtime_error("popen() failed!");
  }
  FILE * con = pipe.get();
  fwrite(data, 1, length, con);
 ...

Is that what you had in mind? Working with Windows handles or even Unix fds (which I understand are wrapped by FILE * pointers) is a bit beyond my current expertise, and Google isn't being particularly helpful. But I am happy to learn if you could give some tips or pointers on implementation.
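For readers who want to try the popen approach outside the package, here is a minimal self-contained sketch for a POSIX system. The helper names write_to_command and read_from_command are hypothetical and are not part of qs. It uses mode "w"/"r" because POSIX specifies only those two modes; the "wb" in the snippet above is a Windows-style mode.

```cpp
#include <stdio.h>     // FILE, fwrite, fread; popen/pclose are POSIX extensions
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Deleter so that pclose() runs automatically when the pipe goes out of scope.
struct PipeCloser {
  void operator()(FILE* f) const { if (f) pclose(f); }
};
using PipePtr = std::unique_ptr<FILE, PipeCloser>;

// Hypothetical helper: send `data` to the standard input of a shell command.
void write_to_command(const std::string& command,
                      const std::vector<unsigned char>& data) {
  PipePtr pipe(popen(command.c_str(), "w"));
  if (!pipe) throw std::runtime_error("popen() failed");
  if (fwrite(data.data(), 1, data.size(), pipe.get()) != data.size())
    throw std::runtime_error("short write to pipe");
}  // pclose() waits here for the command to finish

// Hypothetical helper: collect everything a shell command writes to stdout.
std::vector<unsigned char> read_from_command(const std::string& command) {
  PipePtr pipe(popen(command.c_str(), "r"));
  if (!pipe) throw std::runtime_error("popen() failed");
  std::vector<unsigned char> out;
  unsigned char buf[4096];
  size_t n;
  while ((n = fread(buf, 1, sizeof buf, pipe.get())) > 0)
    out.insert(out.end(), buf, buf + n);
  return out;
}
```

Calling write_to_command("cat > some_file", bytes) followed by read_from_command("cat some_file") mirrors the R example above, with the file name chosen by the caller.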

@gaborcsardi
Author

Thanks! Well, almost. :) FILE * is still too difficult, it has its own buffering, etc.

The best for us would be file descriptors, i.e. the integers returned by open() on Unix, and the HANDLE returned by CreateFile() on Windows. You would probably put these into an external pointer, to be able to handle them the same way on both platforms.

Then we could use mmap() on Unix and MapViewOfFile() on Windows to serialize an R object into shared memory, and this would really speed up sharing data between processes.

@traversc
Collaborator

Hi @gaborcsardi, I have a short toy example using file descriptors:

*nix version: https://gist.github.com/traversc/e04911a86c8d581b058815d4aa7e7366
Windows version: https://gist.github.com/traversc/b531a4932e87cca2aa324c6a015c80a4

Do you mind looking it over and seeing if it's what you had in mind?

Some questions for you:

Since we can use file descriptors on both Windows and Unix-like systems, which would simplify things, do you think there is still a need to use a Windows HANDLE? The Windows version worked with surprisingly little modification.

I'm still not quite clear on how mmap would come into play. Could you elaborate with an example of how you would use it?

@gaborcsardi
Author

gaborcsardi commented Jul 25, 2019

That's a good start! Unfortunately, I don't think we can use integer file descriptors on Windows: not everything is a file there, and shared memory handles, for example, will not work. But I am actually not completely sure about this.

Re. mmap, we will do this:

  1. open a temp file for writing (or CreateFileMappingA() on Windows) to get an fd
  2. delete the file
  3. resize the file to the "correct size"
  4. call mmap() on the fd to create a memory area in shared memory
  5. copy the data we want to share to shared memory, e.g. serialize into the fd.

Then we pass the fd to subprocesses, and they do an mmap on it as well, and unserialize.

Some bits of this are in r-lib/processx#201, but it still needs quite a bit of rewriting. It has something like a serialization that only works for a list of atomic, non-character vectors. But it does have the advantage that the subprocesses do not need to unserialize; they can create the objects "within" the serialized data. This is something we would probably lose with a proper serialization, unless we design a serialization format that explicitly supports it.
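As a toy illustration of creating objects "within" the serialized data, the sketch below uses a made-up layout (an 8-byte count followed by that many doubles) and hands back a pointer into the buffer instead of copying. This is not the processx format or the qs format; DoubleView, encode_doubles, and view_doubles are hypothetical names, and the buffer is assumed to be 8-byte aligned, as a mapped region would be.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>

// Hypothetical layout: [uint64_t count][count doubles]
struct DoubleView {
  const double* data;    // points into the buffer; nothing is copied
  std::uint64_t count;
};

// Writer side: lay out the count and the values in `dst`.
void encode_doubles(void* dst, const double* src, std::uint64_t count) {
  std::memcpy(dst, &count, sizeof count);
  std::memcpy(static_cast<unsigned char*>(dst) + sizeof count,
              src, count * sizeof(double));
}

// Reader side: validate the header, then return a view of the payload in place.
DoubleView view_doubles(const void* buf, std::size_t len) {
  if (len < sizeof(std::uint64_t))
    throw std::runtime_error("buffer too small for header");
  std::uint64_t count;
  std::memcpy(&count, buf, sizeof count);
  if ((len - sizeof count) / sizeof(double) < count)
    throw std::runtime_error("truncated buffer");
  const unsigned char* payload =
      static_cast<const unsigned char*>(buf) + sizeof count;
  return {reinterpret_cast<const double*>(payload), count};
}
```

A compressed or otherwise transformed format cannot be viewed this way, which is the trade-off described above.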

@traversc
Collaborator

traversc commented Aug 13, 2019

Hi @gaborcsardi, I think I've put together all the requests in the latest commit. I had to do a bunch of re-factoring to use templates instead of assuming std::fstream.

I have the following new functions:

  • qsave_fd -- save data to a fd (an int)
  • qread_fd -- read data from a fd
  • qsave_handle -- save data to a handle (an external void pointer since HANDLE is #defined as void*; Windows only)
  • qread_handle -- read data from a handle
  • qserialize -- save data to a RawVector
  • qdeserialize -- read data from a RawVector
  • qread_ptr -- read data from a memory void pointer (you also have to provide the length)

I also have the following helper functions:

  • openFd -- open a file descriptor with open
  • closeFd -- close a file descriptor
  • openMmap -- open a mmap from a file descriptor (Linux/Mac)
  • closeMmap -- close a mmap
  • openHandle -- open a file handle with CreateFileA (Windows)
  • closeHandle -- close a file handle
  • openWinFileMapping -- open a handle to a file mapping with CreateFileMapping (Windows)
  • openWinMapView -- open a map view of a file mapping with MapViewOfFile; returns an external void pointer
  • closeWinMapView -- close a map view

qsave and its variants now also invisibly return the number of bytes written (as a double; an int is too small for large data).
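The reason for returning a double can be checked with a few lines. R integers are 32-bit signed, so they top out just under 2 GiB, while a double holds every integer exactly up to 2^53. The constant and function names below are only for illustration.

```cpp
#include <cstdint>
#include <limits>

// Largest byte count a 32-bit signed integer can hold (just under 2 GiB).
constexpr std::int64_t kMaxInt32 = std::numeric_limits<std::int32_t>::max();

// A double represents every integer exactly up to 2^53.
constexpr std::int64_t kMaxExactDouble = std::int64_t{1} << 53;

// True if `bytes` survives a round trip through double unchanged.
bool fits_exactly_in_double(std::int64_t bytes) {
  return static_cast<std::int64_t>(static_cast<double>(bytes)) == bytes;
}
```

A 3 GiB file, for example, exceeds kMaxInt32 but is far below kMaxExactDouble, so its size is reported exactly.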

Here are some examples:

Data:

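library(qs)  # provides the starnames dataset used below
n <- 5e6
data <- data.frame(a = rnorm(n),
                   b = rpois(n, 100),  # n draws with mean 100 (arguments were swapped)
                   c = sample(starnames$IAU, n, TRUE),
                   d = sample(state.name, n, TRUE),
                   stringsAsFactors = FALSE)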

On Linux/Mac:

library(qs)
fd <- qs:::openFd("/tmp/test.z", "wr")
unlink("/tmp/test.z")
length <- qsave_fd(data, fd, preset = "high")
mptr <- qs:::openMmap(fd, length)
data2 <- qread_ptr(mptr, length)
qs:::closeMmap(mptr, length)
qs:::closeFd(fd)
identical(data, data2)

On Windows:

fh <- qs:::openHandle("N:/test.z", "wr")
unlink("N:/test.z")
length <- qsave_handle(data, fh, preset = "high")
fmh <- qs:::openWinFileMapping(fh, length)
ptr <- qs:::openWinMapView(fmh, length)
data2 <- qread_ptr(ptr, length)
qs:::closeWinMapView(ptr)
qs:::closeHandle(fmh)
qs:::closeHandle(fh)
identical(data, data2)

Serialize to raw vector:

qd <- qserialize(data)
data2 <- qdeserialize(qd)
identical(data, data2)

Anyway, lmk what you think. Thanks.

@gaborcsardi
Author

Awesome! Thanks for doing this. I'll take a good look very soon, sorry for the delay.

@traversc traversc closed this as completed Dec 2, 2019
@artemklevtsov

Can I use this feature (qsave_fd) to append data? I think not, but I want to clarify.

@traversc
Collaborator

traversc commented Mar 4, 2020

@artemklevtsov Technically, yes. You would have to open a file descriptor in append mode:

https://stackoverflow.com/questions/7136416/opening-file-in-append-mode-using-open-api

But I don't recommend doing this, as I don't guarantee being able to correctly deserialize data if there are extra bytes at the end of a file.
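For reference, opening a descriptor in append mode on a POSIX system looks roughly like this. The helper name append_bytes is hypothetical, and the sketch only shows the open() flags; it does not make appended qs data readable, for the reason given above.

```cpp
#include <fcntl.h>     // open, O_WRONLY, O_CREAT, O_APPEND
#include <stdexcept>
#include <string>
#include <unistd.h>    // write, lseek, close

// Append `text` to `path`, creating the file if needed; returns the new file size.
off_t append_bytes(const std::string& path, const std::string& text) {
  // With O_APPEND, every write goes to the current end of the file.
  int fd = open(path.c_str(), O_WRONLY | O_CREAT | O_APPEND, 0644);
  if (fd < 0) throw std::runtime_error("open() failed");
  ssize_t n = write(fd, text.data(), text.size());
  if (n != static_cast<ssize_t>(text.size())) {
    close(fd);
    throw std::runtime_error("write() failed or was partial");
  }
  off_t size = lseek(fd, 0, SEEK_END);
  close(fd);
  return size;
}
```

A descriptor opened this way could in principle be handed to a function that writes to an fd, but each call would add a complete new stream after the previous one rather than extend the existing object.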

@artemklevtsov

@traversc thank you for the explanation. Do you have any plans to add a feature like that? I'm looking for an alternative to data.table::fwrite with append for fetching bulk data.

@traversc
Collaborator

traversc commented Mar 5, 2020

@artemklevtsov No plans for that feature as the format isn't set up for that, sorry.

I know that with the fst package you can do that with data.frames, and I definitely support using fst for that purpose.

Alternatively, you can save two separate data.frame objects and use rbind after reading. That should also be pretty fast.
