
added remove indexes before bulk insert and add indexes after #12

Open
wants to merge 17 commits into main

Conversation

@aishek (Contributor) commented Sep 29, 2014

I've improved bulk insert performance: remove all indexes before the first insert and recreate them after all inserts, to avoid index rebuilds on each insert operation.
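For readers skimming the thread, here is a minimal sketch of the idea (not the gem's actual implementation; the helper name with_disabled_indexes shows up later in this discussion, everything else in the snippet is assumed):

require 'active_record'

# Sketch only: capture the table's index definitions, drop them, run the bulk
# inserts in the block, then recreate the indexes once at the end.
def with_disabled_indexes(table)
  connection = ActiveRecord::Base.connection
  indexes = connection.indexes(table)
  indexes.each { |index| connection.remove_index(table, :name => index.name) }
  yield
ensure
  (indexes || []).each do |index|
    connection.add_index(table, index.columns, :name => index.name, :unique => index.unique)
  end
end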

@toy (Owner) commented Dec 5, 2014

Doesn't work for Rails < 4.0 (see the Travis build).

@aishek force-pushed the ignore-indexes-on-restore branch from eb114e3 to ce0fd35 on December 7, 2014 16:57
@aishek (Contributor, Author) commented Dec 7, 2014

Sorry, didn't notice that – fixed now.

@toy (Owner) commented Dec 8, 2014

It would be better to use ActiveRecord::ConnectionAdapters::IndexDefinition.members instead of INDEX_FIELDS (for example, because a new Rails version can add or change fields).
Why do you use all_indexes.select {|index| index.table == table} if ActiveRecord::Base.connection.indexes(table) should already return the indexes for one table?
There are no tests.

More generally — I assume that removing indexes during insertion should make the whole process faster, but my attempts with and without removing indexes showed no difference using SQLite and PostgreSQL, and the opposite effect using MySQL. Could you please create an example (a schema and a script to generate data) that will be faster to populate with the indexes removed?
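To illustrate the first point, a sketch under the assumption that IndexDefinition is still the Struct subclass it was in Rails 3/4:

require 'active_record'

# The Struct behind IndexDefinition knows its own fields, so they don't have to
# be hard-coded in the gem:
ActiveRecord::ConnectionAdapters::IndexDefinition.members
# => something like [:table, :name, :unique, :columns, :lengths, :orders, ...]
#    (the exact list depends on the Rails version, which is exactly the point)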

@toy (Owner) commented Dec 24, 2014

@aishek Did you get a notification about my comment?

@aishek (Contributor, Author) commented Dec 24, 2014

@toy yeah, I've got it. A lot of work these days, sorry :)

@toy (Owner) commented Dec 24, 2014

@aishek No worries :)

@aishek (Contributor, Author) commented Jan 10, 2015

Let's discuss the speed impact of removing indexes. Here is an example for sqlite3:

macbook-pro-2:temp aishek$ time sqlite3 test.sqlite3 < create_index_first.sql
real   0m1.150s
user   0m0.351s
sys    0m0.179s

macbook-pro-2:temp aishek$ time sqlite3 test.sqlite3 < create_index_after.sql
real   0m0.611s
user   0m0.299s
sys    0m0.123s

create_index_first.sql
create_index_after.sql

I suppose the high selectivity of the age column in my example is what makes the difference.

What do you think: should this feature be present in dump? If so, I'll improve my pull request based on your comments.

@toy (Owner) commented Jan 10, 2015

I am convinced.

Reworked example for multiple databases:

#!/usr/bin/env ruby

require 'benchmark'

table_cmd = 'echo "create table people (age int);"'
index_cmd = 'echo "create index people_age on people(age);"'
drop_cmd = 'echo "drop table people;"'

# Create data.sql inserting `n` rows
n = 100_000
per_insert = 500
values = Array(20..30)
File.open('data.sql', 'w') do |f|
  (n / per_insert).times do
    values_sql = Array.new(per_insert){ "('#{values.sample}')" }.join(',')
    f.puts("insert into people (age) VALUES #{values_sql};")
  end
end
data_cmd = 'cat data.sql'

# For each database, time bulk-loading with the index created after the data vs. before it
Benchmark.bmbm do |x|
  {
    :sqlite => 'sqlite3 test.sqlite',
    :mysql => 'mysql5 -D test',
    :postgres => 'psql -q test',
  }.each do |db_name, db_cmd|
    x.report("#{db_name}: data then index") do
      system "(#{[table_cmd, data_cmd, index_cmd, drop_cmd] * ';'}) | #{db_cmd}"
    end

    x.report("#{db_name}: index then data") do
      system "(#{[table_cmd, index_cmd, data_cmd, drop_cmd] * ';'}) | #{db_cmd}"
    end
  end
end

Results:

Rehearsal -------------------------------------------------------------
sqlite: data then index     0.000000   0.000000   0.780000 (  0.878953)
sqlite: index then data     0.000000   0.000000   1.060000 (  1.185402)
mysql: data then index      0.000000   0.000000   0.030000 (  0.232668)
mysql: index then data      0.000000   0.000000   0.030000 (  0.717944)
postgres: data then index   0.000000   0.000000   0.040000 (  0.368946)
postgres: index then data   0.000000   0.000000   0.040000 (  0.575253)
---------------------------------------------------- total: 1.980000sec

                                user     system      total        real
sqlite: data then index     0.000000   0.000000   0.780000 (  0.933262)
sqlite: index then data     0.000000   0.000000   1.090000 (  1.398851)
mysql: data then index      0.000000   0.000000   0.030000 (  0.261416)
mysql: index then data      0.000000   0.000000   0.030000 (  0.731750)
postgres: data then index   0.000000   0.000000   0.040000 (  0.376046)
postgres: index then data   0.000000   0.000000   0.040000 (  0.584572)

@aishek force-pushed the ignore-indexes-on-restore branch from 1255e19 to 5df6df8 on January 11, 2015 18:53
@aishek force-pushed the ignore-indexes-on-restore branch from cd89333 to b6e666f on January 11, 2015 19:13
@aishek force-pushed the ignore-indexes-on-restore branch from 78318c7 to 6adfbcf on January 11, 2015 19:25
@aishek (Contributor, Author) commented Jan 11, 2015

I am using INDEX_FIELDS because of this approach. I've slightly improved the naming in the latest commits: the array is now called VALID_INDEX_OPTIONS.

I also removed the additional select.

I also added tests, rebased the code, and fixed the pg and i18n gem versions for Ruby 1.8.7.

@toy (Owner) commented Jan 16, 2015

After looking at the Rails DB adapter code, it seems to me that not all index options supported by add_index (like internal or algorithm) will be present in the instances of IndexDefinition returned by ActiveRecord::Base.connection.indexes(table). So there should be an environment variable to disable removing/adding indexes, and an explanation in the readme.
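A minimal sketch of such a guard (the variable name REBUILD_INDEXES is taken from the later discussion in this thread; the gem's actual Env-based handling will differ, and bulk_insert_rows is a placeholder):

# Sketch only: skip the drop/recreate behaviour unless it is explicitly enabled.
def rebuild_indexes?
  %w[1 true yes].include?(ENV['REBUILD_INDEXES'].to_s.downcase)
end

if rebuild_indexes?
  with_disabled_indexes(table) { bulk_insert_rows } # helper sketched earlier in the thread
else
  bulk_insert_rows
end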

Your current approach to adding indexes ignores at least lengths and orders. To check, try adding an index with a length to a table in a MySQL DB and compare the indexes returned by ActiveRecord::Base.connection.indexes(table) before and after restoring the table.
So maybe it is better to grab all options provided by IndexDefinition and convert those which are arrays to hashes, using the column names as keys.

The test you've added only checks that with_disabled_indexes is called (it also doesn't call the block) and doesn't check what happens to the indexes.

The changes to reader.rb in b6e666f create long lines, so they are in fact introducing code-style issues. They are not reported because some style checks are still disabled by .rubocop_todo.yml.

I've already cherry-picked your commits about pg and i18n into the version with the updated progress gem dependency.

@aishek (Contributor, Author) commented Jan 16, 2015

I'll fix these issues later. Thanks for the detailed remarks!

@aishek (Contributor, Author) commented Jan 21, 2015

OK, I've fixed everything:

  • added a disabled-by-default environment variable, with an explanation in the README
  • added MySQL lengths support
  • added with_disabled_indexes specs
  • fixed code-style issues

There are two things to discuss.

First, I've extracted the assets-related logic from Reader into a separate class to fix the RuboCop warning about the long class definition. Now there is some inconsistency in the code: the Summary and Assets class specs should be extracted to separate files, and the DB-related code should be extracted from Reader into a separate class.

Second, 7966b03#diff-89f6960ada41447910198cd9a526b3dcR73 – the ugly send.

What do you think?

@toy (Owner) commented Feb 1, 2015

Sorry for the long wait.

About extracting the assets-related logic: better not to do it as part of this pull request — just increase Metrics/ClassLength in .rubocop_todo.yml for now. The code of this gem needs cleanup, but not as part of adding features.

For the readme entry — there is script/update_readme which adds the environment variable descriptions from the Env class.
You should also add :rebuild_indexes to :restore_options in variable_names_for_command.

index_options is problematic:
Try adding an index like add_index :chickens, [:string_col, :text_col], :length => 10 to spec/db/schema.rb and running REBUILD_INDEXES=1 rspec.
It would be better to add such an index to the schema and add a spec to cycle_spec comparing the indexes before and after restore, to test on a real database.
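A rough sketch of what that spec could look like (dump_and_restore is a placeholder; the gem's actual cycle_spec helpers and its RSpec syntax may differ):

# in spec/db/schema.rb: a MySQL index with a length, as suggested above
add_index :chickens, [:string_col, :text_col], :length => 10

# in cycle_spec: compare index definitions around a full dump/restore cycle
it 'restores indexes together with their options' do
  before_indexes = ActiveRecord::Base.connection.indexes('chickens')
  dump_and_restore                # placeholder for whatever helper cycle_spec uses
  after_indexes = ActiveRecord::Base.connection.indexes('chickens')
  after_indexes.should == before_indexes
end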

Also, you handle lengths but not orders, and the PostgreSQL adapter returns orders. So I still think that instead of using VALID_INDEX_OPTIONS it would be better to go through index.members: if a value is a hash, assign it to the "singularised" option name, and if it's an array, create a hash by zipping it with the column names and also assign it to the "singularised" option name. Do you see any problems with such an approach?
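For what it's worth, a sketch of that members-driven conversion (assuming IndexDefinition is a Struct as in Rails 3/4; the method name index_options comes from the PR, the body here is only an illustration):

# Sketch only: build add_index options from whatever members the current Rails
# version exposes, instead of a hard-coded VALID_INDEX_OPTIONS list.
def index_options(index)
  options = {}
  index.each_pair do |member, value|
    next if [:table, :name, :columns].include?(member)
    next if value.nil? || (value.respond_to?(:empty?) && value.empty?)
    key = member.to_s.sub(/s\z/, '').to_sym   # :lengths => :length, :orders => :order
    options[key] = value.is_a?(Array) ? Hash[index.columns.zip(value)] : value
  end
  options
end

# usage: add_index(index.table, index.columns, index_options(index).merge(:name => index.name))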
