Clickhouse import process (#59)
* Scripts, property file, docs for Clickhouse import process

* add sling copy scripts, properties adjustments

* Add preliminary drop tables and create_derived scripts

also bugfixes added and TODO notes

* updates, fixes, and new set_update_process_state

* updates and fixes - partial update of create_derived

* complete create_derived_tables script

* add db schema script processing and derived tables

* separate get_database_currently_in_production.sh

* synchronize users to clickhouse as well as mysql and also renamed overloaded functions that clashed in mysql/sling/clickhouse shell scripts

* cleanup synchronize_user_tables script and working create_derived_tables_in_clickhouse_database_by_profile

* further development

- focus on derived table construction and wrap up steps

Co-authored-by: Manda Wilson <[email protected]>
Co-authored-by: Robert Sheridan <[email protected]>

* refinements

- add memory use target and batching to create_derived_tables
- add download of current ".sql" derived table statements from github

* add support for single-db (not green/blue) ops

* add arg checking

* fix argument processing bug

* fix continue bug

---------

Co-authored-by: importer system account <cbioportal_importer@knowledgesystems-importer.cbioportal.aws.mskcc.org>
Co-authored-by: Manda Wilson <[email protected]>
3 people authored Feb 20, 2025
1 parent 64176ae commit c369485
Showing 16 changed files with 3,064 additions and 0 deletions.
90 changes: 90 additions & 0 deletions scripts/clickhouse_import_support/README.md
@@ -0,0 +1,90 @@
# cBioPortal Import Process Database Management Tools
These tools support a blue-green deployment approach to cBioPortal database updates.
This strategy was adopted to support a coupled ClickHouse database, which will be used
alongside the existing MySQL database to improve the runtime performance of the
cBioPortal study view page.

Import of cancer studies is now directed into a not-in-production copy of the production
MySQL database using the existing import codebase. The newly populated MySQL database is
used as a datasource for populating a not-in-production ClickHouse database. Using this
approach, the production databases remain consistent because no changes occur to either
database during import operations. Once the ClickHouse database has been fully populated
and related derived tables and persistent views have been created in ClickHouse, the
cBioPortal web server backend can be switched over quickly to use the newly populated
database and make the newly imported cancer studies available in production.

## clone\_mysql\_database.sh
This bash script uses the *mysql* command line interface tool to make a complete copy
of the current production database into a separate database on the same MySQL server.
This is done to initialize the not-in-production database and prepare it for cancer
study import.

## drop\_tables\_in\_mysql\_database.sh
This bash script uses the *mysql* command line interface tool to drop all tables which
exist in a mysql database. This is done at the end of an import process to clear the
data from the prior production database (or the backup copy database), leaving it empty
and available for reuse during the next cycle of cancer study import.

## copy\_mysql\_database\_tables\_to\_clickhouse.sh
This bash script uses the *sling* command line interface tool to copy data from all tables
present in the selected mysql database (green or blue) into the corresponding clickhouse
database. Each failed copy attempt is retried multiple times. Copy results are
validated by record counts.
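
The retry and validation behavior described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names `copy_one_table`, `count_source_records`, `count_destination_records`, and `copy_table_with_retries` are hypothetical, the stubs simulate one transient failure instead of calling *sling*, and treating a record count mismatch as a reason to retry is an assumption.

```shell
#!/usr/bin/env bash

# Hypothetical stand-ins for the real sling copy and the record count queries.
copy_attempts_made=0
function copy_one_table() {
    copy_attempts_made=$((copy_attempts_made+1))
    # simulate a transient failure on the first attempt only
    [ "$copy_attempts_made" -ge 2 ]
}
function count_source_records() { echo 42 ; }
function count_destination_records() { echo 42 ; }

# Copy one table, retrying up to max_attempts times, and accept the copy
# only when the source and destination record counts agree.
function copy_table_with_retries() {
    local table_name=$1
    local max_attempts=$2
    local attempt=1
    local source_count=""
    local destination_count=""
    while [ "$attempt" -le "$max_attempts" ] ; do
        if copy_one_table "$table_name" ; then
            source_count="$(count_source_records "$table_name")"
            destination_count="$(count_destination_records "$table_name")"
            if [ "$source_count" == "$destination_count" ] ; then
                return 0
            fi
            echo "Warning : record count mismatch for table $table_name (source $source_count, destination $destination_count)" >&2
        else
            echo "Warning : copy attempt $attempt failed for table $table_name" >&2
        fi
        attempt=$((attempt+1))
    done
    echo "Error : giving up on table $table_name after $max_attempts attempts" >&2
    return 1
}
```

With the stubs above, `copy_table_with_retries sample 3` fails once, succeeds on the second attempt, and returns 0 after the counts match.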

## create\_derived\_tables\_in\_clickhouse\_database.sh
This bash script uses the *clickhouse* command line interface tool to generate derived
tables in clickhouse from the newly copied tables in clickhouse. It takes in an ordered
list of SQL files and splits them into a set of files that each contain one SQL statement.
It then iterates through the SQL statements sequentially. For most statements, it uses
the *clickhouse* command line interface tool to run the SQL statement. If it finds an
insert statement into either the *genetic_alteration_derived* or *generic_assay_data_derived*
table, it executes the *create_derived_tables_in_clickhouse_database_by_profile.py* script
instead of executing the SQL statement directly.
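
The split-then-dispatch flow above could look something like the sketch below. It is illustrative only: the function names and the `statement_N.sql` file naming are hypothetical, and splitting on every `;` is a simplifying assumption that would break on semicolons inside string literals or comments.

```shell
#!/usr/bin/env bash

# Split a SQL file into one file per statement (naive split on ';').
# Prints the number of statement files written.
function split_sql_file_into_statement_files() {
    local sql_filepath=$1
    local output_dir=$2
    local statement_count=0
    local statement=""
    while IFS='' read -r -d ';' statement ; do
        # skip chunks that hold only whitespace
        if [[ "$statement" =~ ^[[:space:]]*$ ]] ; then
            continue
        fi
        statement_count=$((statement_count+1))
        printf '%s;\n' "$statement" > "$output_dir/statement_$statement_count.sql"
    done < "$sql_filepath"
    echo "$statement_count"
}

# Succeeds when the statement inserts into one of the two derived tables
# that are handled once per genetic profile.
function statement_requires_per_profile_handling() {
    local statement_filepath=$1
    grep -q -i -E 'INSERT[[:space:]]+INTO[[:space:]]+(genetic_alteration_derived|generic_assay_data_derived)' "$statement_filepath"
}
```

A caller would loop over the statement files in order, handing a file to the per-profile script when `statement_requires_per_profile_handling` succeeds and running it through the *clickhouse* client otherwise.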

## create\_derived\_tables\_in\_clickhouse\_database\_by\_profile.py
This python 3 script uses the *clickhouse* command line interface tool to modify two
SQL insert statements so that instead of running for all genetic profiles at once,
the queries are run once per genetic profile. This is done to reduce memory usage and
also so that if there is an error for one genetic profile, it doesn't prevent all following
genetic profiles from being handled. The two insert statements are for the
*genetic_alteration_derived* and *generic_assay_data_derived* tables.
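
The per-profile idea can be illustrated with a short sketch. The real script is written in python; bash is used here only to match the other scripts in this directory. The `__PROFILE_ID__` placeholder, the `run_statement` stub, and the function names are all hypothetical, and how the real script rewrites the insert statements is not shown in this README.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for running one statement through the clickhouse client.
function run_statement() {
    echo "would run : $1"
}

# Run the statement template once per profile id. A failure for one profile
# is reported and counted, but the remaining profiles are still processed.
function run_statement_for_each_profile() {
    local statement_template=$1
    shift
    local failure_count=0
    local profile_id=""
    local statement=""
    for profile_id in "$@" ; do
        statement="${statement_template//__PROFILE_ID__/$profile_id}"
        if ! run_statement "$statement" ; then
            echo "Warning : statement failed for genetic profile $profile_id" >&2
            failure_count=$((failure_count+1))
        fi
    done
    # succeed only if every profile was handled without error
    [ "$failure_count" -eq 0 ]
}
```

Because each invocation touches only one profile's rows, the working set per query stays small, which is the memory benefit described above.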

## synchronize\_user\_tables\_between\_databases.sh
This bash script uses both the *mysql* and *clickhouse* command line interface tools
to update both mysql and clickhouse databases with any users that have been added to
the mysql database they were cloned from. For example, if the 'green' databases
have been cloned from the 'blue' databases, and the 'blue' mysql database now contains
users not in the 'green' databases, this script can copy those new users from the 'blue'
mysql database to both the 'green' mysql database and the 'green' clickhouse database.
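
Finding the users to copy amounts to a set difference between the two user lists. A minimal sketch, assuming each database's users have already been exported to a text file with one user identifier per line (the export step, the file layout, and the function name are hypothetical):

```shell
#!/usr/bin/env bash

# Print the users present in the source list but absent from the destination list.
function list_users_missing_from_destination() {
    local source_users_filepath=$1
    local destination_users_filepath=$2
    # comm -23 keeps only the lines unique to the first (sorted) input
    comm -23 <(sort -u "$source_users_filepath") <(sort -u "$destination_users_filepath")
}
```

The resulting list would then drive the inserts into the destination mysql and clickhouse databases.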

## get\_database\_currently\_in\_production.sh
This bash script uses the *mysql* command line interface to get the current production database
from the management database, either 'green' or 'blue'.
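
Once the production color is known, the import target is simply the other color. A small sketch of that mapping, with a hypothetical function name:

```shell
#!/usr/bin/env bash

# Given the color currently in production, print the color that is free
# to receive the next import.
function get_not_in_production_color() {
    local in_production_color=$1
    case "$in_production_color" in
        blue)
            echo "green" ;;
        green)
            echo "blue" ;;
        *)
            echo "Error : expected 'blue' or 'green' but received '$in_production_color'" >&2
            return 1 ;;
    esac
}
```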

## set\_update\_process\_state.sh
This bash script uses the *mysql* command line interface to set the update process status
in the management database to either 'running' or 'complete'. The script takes one of the
following options: 'running', 'complete', or 'abandoned'. The status can only be set to
'running' if it is currently 'complete'. If the script is passed 'complete' and the status
is currently 'running', the *time_of_last_update_process_completion* is set to the current
timestamp and the *current_database_in_production* is switched either from blue -> green
or from green -> blue. If the script is passed 'abandoned' and the current status is
'running', the *time_of_last_update_process_completion* and *current_database_in_production*
are unchanged but the status is set to 'complete'.
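
The transitions described above form a small state machine, sketched below as a pure function with no database access. The function name and output format are hypothetical, and rejecting every combination not listed above (for example 'complete' requested while already 'complete') is an assumption.

```shell
#!/usr/bin/env bash

# Prints "<new status> <new database in production> <timestamp updated: yes|no>"
# or fails when the requested transition is not one of the three described.
function compute_next_update_process_state() {
    local current_status=$1
    local current_database=$2
    local requested_state=$3
    local other_database=""
    if [ "$current_database" == "blue" ] ; then
        other_database="green"
    elif [ "$current_database" == "green" ] ; then
        other_database="blue"
    else
        echo "Error : current database must be 'blue' or 'green', not '$current_database'" >&2
        return 1
    fi
    case "$requested_state:$current_status" in
        running:complete)
            # a new update process begins; production is untouched
            echo "running $current_database no" ;;
        complete:running)
            # the update succeeded; switch production and record the time
            echo "complete $other_database yes" ;;
        abandoned:running)
            # the update was dropped; production and timestamp are untouched
            echo "complete $current_database no" ;;
        *)
            echo "Error : cannot apply '$requested_state' while status is '$current_status'" >&2
            return 1 ;;
    esac
}
```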

## drop\_tables\_in\_clickhouse\_database.sh
This bash script uses the *clickhouse* command line interface tool to drop all tables which
exist in a clickhouse database. This is done at the end of an import process to clear the
data from the prior production database (or the backup copy database), leaving it empty
and available for reuse during the next cycle of cancer study import.

## Libraries:
* *mysql_command_line_functions.sh* contains functions for interacting with the *mysql* command
line interface.
* *sling_command_line_functions.sh* contains functions for interacting with the *sling* command
line interface.
* *clickhouse_client_command_line_functions.sh* contains functions for interacting with the
*clickhouse* command line interface.
* *parse_property_file_functions.sh* contains functions for parsing a *\*.properties* file.
@@ -0,0 +1,233 @@
#!/usr/bin/env bash

unset configured_clickhouse_config_file_path
unset sql_data_field_value
unset sql_data_array
configured_clickhouse_config_file_path=""
sql_data_field_value=""
declare -a sql_data_array
clickhouse_client_database_exists_filepath="$(pwd)/ccclf_database_exists.txt"
clickhouse_client_database_table_list_filepath="$(pwd)/ccclf_database_table_list.txt"

function write_clickhouse_config_file() {
    local selected_database=$1
    configured_clickhouse_config_file_path="$(pwd)/clickhouse_client_config_$(date "+%Y-%m-%d-%H-%M-%S").yaml"
    if ! rm -f "$configured_clickhouse_config_file_path" || ! touch "$configured_clickhouse_config_file_path" ; then
        echo "Error : unable to create clickhouse_client_config file $configured_clickhouse_config_file_path" >&2
        return 1
    fi
    # the config file holds the server password, so restrict access to the owner
    chmod 600 "$configured_clickhouse_config_file_path"
    local db_name
    # with no argument, use the single-database property; otherwise select the blue or green database
    if [ -z "$selected_database" ] ; then
        db_name="${my_properties['clickhouse_database_name']}"
    elif [ "$selected_database" == "blue" ] ; then
        db_name="${my_properties['clickhouse_blue_database_name']}"
    elif [ "$selected_database" == "green" ] ; then
        db_name="${my_properties['clickhouse_green_database_name']}"
    else
        echo "Error : selected_database (when specified) must be passed as either 'blue' or 'green'. The argument passed was : '$selected_database'" >&2
        return 1
    fi
    echo "user: ${my_properties['clickhouse_server_username']}" >> "$configured_clickhouse_config_file_path"
    echo "password: ${my_properties['clickhouse_server_password']}" >> "$configured_clickhouse_config_file_path"
    echo "host: ${my_properties['clickhouse_server_host_name']}" >> "$configured_clickhouse_config_file_path"
    echo "port: ${my_properties['clickhouse_server_port']}" >> "$configured_clickhouse_config_file_path"
    echo "database: $db_name" >> "$configured_clickhouse_config_file_path"
    # each of the 5 properties must have been written as its own line
    if ! [ "$(wc -l < "$configured_clickhouse_config_file_path")" -eq 5 ] ; then
        echo "Error : could not successfully write clickhouse_client config properties to file $configured_clickhouse_config_file_path" >&2
        return 1
    fi
    return 0
}

function initialize_clickhouse_client_command_line_functions() {
    local selected_database=$1
    write_clickhouse_config_file "$selected_database"
}

function shutdown_clickhouse_client_command_line_functions() {
    rm -f "$configured_clickhouse_config_file_path"
    rm -f "$clickhouse_client_database_exists_filepath"
    rm -f "$clickhouse_client_database_table_list_filepath"
    unset configured_clickhouse_config_file_path
    unset sql_data_field_value
    unset sql_data_array
    unset clickhouse_client_database_exists_filepath
    unset clickhouse_client_database_table_list_filepath
}

function execute_sql_statement_via_clickhouse_client() {
    local statement=$1
    local output_filepath=$2
    # remove any stale output so results from an earlier query cannot be mistaken for this one
    if [ -e "$output_filepath" ] && ! rm -f "$output_filepath" ; then
        echo "Error : could not overwrite existing output file $output_filepath when executing clickhouse statement $statement" >&2
        return 1
    fi
    (
        clickhouse client --config-file="$configured_clickhouse_config_file_path" --format=TabSeparatedWithNames <<< "$statement" > "$output_filepath"
    )
}

function execute_sql_statement_from_file_via_clickhouse_client() {
    local statement_filepath=$1
    local output_filepath=$2
    # remove any stale output so results from an earlier query cannot be mistaken for this one
    if [ -e "$output_filepath" ] && ! rm -f "$output_filepath" ; then
        echo "Error : could not overwrite existing output file $output_filepath when executing clickhouse statements from file $statement_filepath" >&2
        return 1
    fi
    (
        clickhouse client --config-file="$configured_clickhouse_config_file_path" --format=TabSeparatedWithNames --queries-file="$statement_filepath" > "$output_filepath"
    )
}

function set_clickhouse_sql_data_field_value_from_record() {
    local record_string=$1
    local column_number=$2
    unset sql_data_field_value
    local record_string_length=${#record_string}
    local LF=$'\n'
    local TAB=$'\t'
    local BACKSLASH=$'\\'
    local NULL_MARKER='NULL_CHARACTER_CANNOT_BE_REPRESENTED'
    local BACKSPACE=$'\b'
    local FF=$'\f'
    local CR=$'\r'
    local APOSTROPHE="'"
    local ENCODED_LF='\n'
    local ENCODED_TAB='\t'
    local ENCODED_BACKSLASH='\\'
    local ENCODED_NULL='\0'
    local ENCODED_BACKSPACE='\b'
    local ENCODED_FF='\f'
    local ENCODED_CR='\r'
    local ENCODED_APOSTROPHE="\'"
    local pos=0
    local field_index=0
    local parsed_value=""
    while [ "$pos" -lt "$record_string_length" ] ; do
        local character_at_position="${record_string:$pos:1}"
        # a newline should occur at the end of the read line, and only there. Embedded newlines are encoded with '\n'
        if [ "$character_at_position" == "$LF" ] ; then
            field_index=$((field_index+1))
            if [ "$field_index" -gt "$column_number" ] ; then
                # field has been completely parsed
                sql_data_field_value="$parsed_value"
                return 0
            fi
            echo "Error : unable to locate column $column_number while parsing returned database record : $record_string" >&2
            return 1
        fi
        # a tab character delimits the beginning of a new field, and is not part of the field. Embedded tabs are encoded with '\t'
        if [ "$character_at_position" == "$TAB" ] ; then
            field_index=$((field_index+1))
            if [ "$field_index" -gt "$column_number" ] ; then
                # field has been completely parsed
                sql_data_field_value="$parsed_value"
                return 0
            fi
            pos=$((pos+1))
            continue
        fi
        # a backslash must begin one of 8 possible escape sequences, all of which are made up of 2 characters : {'\n', '\t', '\\', '\0', '\b', '\f', '\r', "\'"}. No "plain" backslashes should be encountered.
        if [ "$character_at_position" == "$BACKSLASH" ] ; then
            local candidate_escape_string="${record_string:$pos:2}"
            local decoded_character=""
            if [ "$candidate_escape_string" == "$ENCODED_LF" ] ; then
                decoded_character="$LF"
            elif [ "$candidate_escape_string" == "$ENCODED_TAB" ] ; then
                decoded_character="$TAB"
            elif [ "$candidate_escape_string" == "$ENCODED_BACKSLASH" ] ; then
                decoded_character="$BACKSLASH"
            elif [ "$candidate_escape_string" == "$ENCODED_NULL" ] ; then
                decoded_character="$NULL_MARKER"
            elif [ "$candidate_escape_string" == "$ENCODED_BACKSPACE" ] ; then
                decoded_character="$BACKSPACE"
            elif [ "$candidate_escape_string" == "$ENCODED_FF" ] ; then
                decoded_character="$FF"
            elif [ "$candidate_escape_string" == "$ENCODED_CR" ] ; then
                decoded_character="$CR"
            elif [ "$candidate_escape_string" == "$ENCODED_APOSTROPHE" ] ; then
                decoded_character="$APOSTROPHE"
            fi
            # pass over the escape sequence
            pos=$((pos+2))
            if [ "$field_index" -eq "$column_number" ] ; then
                if [ "$decoded_character" == "$NULL_MARKER" ] ; then
                    echo "Warning : discarding encoded NULL character (\\0) encountered at position $pos while parsing returned database record : $record_string" >&2
                    continue
                fi
                if [ -z "$decoded_character" ] ; then
                    echo "Error : unrecognized backslash escape sequence encountered at position $pos while parsing returned database record : $record_string" >&2
                    return 1
                fi
                parsed_value+="$decoded_character"
            fi
            continue
        fi
        # pass over the current (plain) character
        pos=$((pos+1))
        if [ "$field_index" -eq "$column_number" ] ; then
            parsed_value+="$character_at_position"
        fi
    done
    # end of record reached without a trailing delimiter : whatever was collected is the field value
    sql_data_field_value="$parsed_value"
}

function set_clickhouse_sql_data_array_from_file() {
    local filepath=$1
    local column_number=$2
    unset sql_data_array
    if ! [ -r "$filepath" ] ; then
        echo "Error : could not read output clickhouse query results from file : $filepath" >&2
        return 1
    fi
    local headers_have_been_parsed=0
    sql_data_array=()
    while IFS='' read -r line ; do
        # the first line holds the column names (TabSeparatedWithNames) and is skipped
        if [ "$headers_have_been_parsed" -eq 0 ] ; then
            headers_have_been_parsed=1
        else
            set_clickhouse_sql_data_field_value_from_record "$line" "$column_number"
            sql_data_array+=("$sql_data_field_value")
        fi
    done < "$filepath"
}

function clickhouse_database_exists() {
    local database_name=$1
    local statement="SELECT COUNT(*) FROM system.databases WHERE name = '$database_name'"
    if ! execute_sql_statement_via_clickhouse_client "$statement" "$clickhouse_client_database_exists_filepath" ; then
        echo "Warning : unable to determine if database $database_name exists using : $statement" >&2
        return 1
    fi
    set_clickhouse_sql_data_array_from_file "$clickhouse_client_database_exists_filepath" 0
    if [[ "${sql_data_array[0]}" -ne 1 ]] ; then
        echo "Warning : database $database_name not present on database server, or there are multiple listings for that name" >&2
        return 2
    fi
    return 0
}

function clickhouse_database_is_empty() {
    local database_name=$1
    local statement="SELECT COUNT(*) FROM INFORMATION_SCHEMA.tables WHERE table_schema='$database_name'"
    if ! execute_sql_statement_via_clickhouse_client "$statement" "$clickhouse_client_database_table_list_filepath" ; then
        echo "Warning : unable to retrieve table/view list from database $database_name using : $statement" >&2
        return 1
    fi
    set_clickhouse_sql_data_array_from_file "$clickhouse_client_database_table_list_filepath" 0
    if [[ "${sql_data_array[0]}" -ne 0 ]] ; then
        echo "Warning : database $database_name has tables or views (is not empty as required)" >&2
        return 2
    fi
    return 0
}