Read-ahead or concurrent fetching? #13
Are you able to draw a waterfall graph of each file that needed to be fetched, or provide some details as to the size of the files and the relative time to first byte? There are many ways to pre-compute parts; yes, this would include using an intermediary layer to concurrently prepare some responses, etc.
I think this reproduces the problem in a reasonable experimental way, a livebook:

# Packmatic starvation

```elixir
Mix.install([:packmatic, :kino, :kino_vega_lite])

alias VegaLite, as: Vl
```

## Section

```elixir
delay = Kino.Input.number("Delay, ms", default: 300) |> Kino.render()
file_size = Kino.Input.number("File size, kb", default: 512) |> Kino.render()
entry_count = Kino.Input.number("Files", default: 200)
```

```elixir
chart =
  Vl.new(width: 400, height: 400)
  |> Vl.mark(:line)
  |> Vl.encode_field(:x, "x", type: :quantitative)
  |> Vl.encode_field(:y, "y", type: :quantitative)
  |> Kino.VegaLite.new()
```

```elixir
t1 = System.os_time(:millisecond)

log_event = fn event ->
  t2 = System.os_time(:millisecond)
  offset = t2 - t1

  case event do
    %Packmatic.Event.EntryUpdated{stream_bytes_emitted: bytes} ->
      seconds = offset / 1000

      if seconds > 0 do
        kb_per_second = bytes / 1024 / seconds
        IO.inspect(kb_per_second, label: "kb/s")
        point = %{x: seconds, y: kb_per_second}
        Kino.VegaLite.push(chart, point)
      end

    %Packmatic.Event.EntryCompleted{} ->
      IO.inspect(event)

    _ ->
      nil
  end

  :ok
end
```

```elixir
latency = Kino.Input.read(delay)
size = Kino.Input.read(file_size)

small_remote_file = fn ->
  # Overhead latency for request
  :timer.sleep(latency)
  # size 512 kb
  {:ok, {:random, size * 1024}}
end
```

```elixir
count = Kino.Input.read(entry_count)

entries =
  1..count
  |> Enum.map(fn num ->
    [
      source: {:dynamic, small_remote_file},
      path: "#{num}.txt"
    ]
  end)

{t, _} =
  :timer.tc(fn ->
    entries
    |> Packmatic.build_stream(on_event: log_event)
    |> Stream.run()
  end)

IO.inspect(t / 1000, label: "took ms")
IO.inspect(count * latency, label: "entries * delay, ms")
```
Pasting a livebook in GitHub is kinda weird :D
@lawik Revisiting the problem, there are some solutions around this.

It would depend on:

- Whether the sources are on different hosts that resolve to different IP/port pairs, which would require separate connections
- Whether the individual files are large or small
- Etc.

There is also another solution, which is to keep the encoding entries hot-addable, so you have a producer and the consumer just goes on and on until it gets an end message. Then the intermediary layer can be added.
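For illustration, the hot-addable idea could be modeled as a lazy stream of entries fed by a producer process. This is only a sketch using the Elixir standard library; the message shapes are hypothetical, and whether `Packmatic.build_stream/2` accepts a lazy enumerable of entries is an assumption that would need checking:

```elixir
# Hypothetical hot-addable entry source: entries arrive as messages to
# the consuming process, and the stream halts when :done is received.
entry_stream =
  Stream.resource(
    # No setup state needed
    fn -> :ok end,
    # Emit each entry as it arrives; halt on the end message
    fn state ->
      receive do
        {:entry, entry} -> {[entry], state}
        :done -> {:halt, state}
      end
    end,
    # No cleanup needed
    fn _state -> :ok end
  )
```

A producer elsewhere would then `send(consumer_pid, {:entry, [source: ..., path: ...]})` as fetches complete, finishing with `send(consumer_pid, :done)`.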
I no longer have the problem because we optimized away the need for about 7000 files, and suddenly things are quite snappy. The last option you mention would let the developer determine their own level of look-ahead. I assume this could be modeled as a stream of entries instead of a finalized list?
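For illustration, a rough way to approximate that kind of look-ahead with Packmatic as-is: start each fetch as a `Task` up front and have the `:dynamic` source merely await its own task, so the per-entry fetch latency overlaps instead of accumulating serially. This is a sketch against the same simulated sources as the livebook above (`count`, `latency`, `size` are the livebook's variables), not a confirmed Packmatic feature:

```elixir
# Hypothetical look-ahead: kick off all fetches as Tasks, then let each
# :dynamic source only await its result when Packmatic reaches it.
entries =
  1..count
  |> Enum.map(fn num ->
    task =
      Task.async(fn ->
        # Simulated per-request latency, as in the livebook
        :timer.sleep(latency)
        {:ok, {:random, size * 1024}}
      end)

    [
      source: {:dynamic, fn -> Task.await(task, :infinity) end},
      path: "#{num}.txt"
    ]
  end)
```

With thousands of entries you would want to bound this (for example with `Task.async_stream/3` and `max_concurrency` as the look-ahead window) rather than start everything at once.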
I have an archive that has a lot of small files, and while I haven't measured to confirm, I'm pretty sure the streaming is slowing down a lot (download drops from multiple MB/s to a few KB/s) because the overhead/latency of each file fetch is more significant than the transfer time.
Thousands of files in this case.
It does complete eventually, but it would be neat to be able to ask Packmatic to buffer at least X bytes ahead of the current need, or similar.
Is there a way to do it that I've missed or would this be a good addition in your eyes?