Fixtures #255

Open

wants to merge 6 commits into develop
1 change: 1 addition & 0 deletions .gitignore
@@ -8,3 +8,4 @@ scripts/*.json
conda*
input.json
*.swp
patterns.tsv
522 changes: 522 additions & 0 deletions fixtures/requests/09603_I/09603_I.file.json

Large diffs are not rendered by default.

3,182 changes: 3,182 additions & 0 deletions fixtures/requests/09603_I/09603_I.filemetadata.json

Large diffs are not rendered by default.

56 changes: 56 additions & 0 deletions scripts/README.md
@@ -0,0 +1,56 @@
# Helper Scripts

## Dumping Database Fixtures

In order to dump JSON copies of database fixtures, you should first have loaded a copy of a production database backup into your Beagle PostgreSQL database instance.

The included `dump_db_fixtures.py` script can be used to dump some types of fixtures. See `dump_db_fixtures.py -h` for the most up-to-date details.
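
Based on the subparsers defined in the script's `parse()` function, the available subcommands currently include at least `request`, `run`, `pipeline`, `port_files`, and `filegroup`; for example:

```
$ ./dump_db_fixtures.py request -h
$ ./dump_db_fixtures.py run -h
$ ./dump_db_fixtures.py pipeline -h
```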

### Dump Requests

A request can be dumped with `dump_db_fixtures.py request <request_id>`. This will output one JSON file for the File entries associated with the request and another for the FileMetadata entries. Use the `--sanitize` flag to replace personal information in the fixtures (e.g. people's names and email addresses) with fake values. Currently, `--sanitize` only scrubs the lab head, data analyst, investigator, and project manager name and email fields in the FileMetadata entries.

Example:

```
$ ./dump_db_fixtures.py request 09603_I --sanitize

$ ls
09603_I.file.json
09603_I.filemetadata.json
```

- NOTE: `dump_db_fixtures.py` should be run from your Beagle dev environment, since it requires Python 3 and Django. If you installed with the `Makefile` in the `beagle` root directory, you can use `make bash` from there to load your dev instance environment and then run the script.
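
For example, assuming `make bash` drops you into a shell with the dev environment activated (the paths here are illustrative):

```
$ cd beagle
$ make bash
$ cd scripts
$ ./dump_db_fixtures.py request 09603_I --sanitize
```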

## Sanitize Fixtures

Fixtures for use in the repo should be further sanitized to remove sample IDs and replace CMO IDs.

Some easy helper scripts have been included for this purpose.

Use the script `get_fields_to_sanitize.sh` to print a list of all recognized CMO IDs and sample IDs in the JSON you generated with `dump_db_fixtures.py`:

```
$ ./get_fields_to_sanitize.sh 09603_I.filemetadata.json
# a list of id's
```

Use the list of IDs that are printed out to create a `patterns.tsv` file in this same directory; it will be used by the `sanitize.sh` script later. The `patterns.tsv` file should be formatted like this:

```
old_pattern1 new_pattern1
old_pattern2 new_pattern2
```

A quick & easy way to do this is to use both Excel and a raw text editor (`nano`, Atom, Sublime, etc.). Copy the terminal output from `get_fields_to_sanitize.sh` into Excel, then fill the adjacent column with dummy identifiers such as `Sample1`, `Sample2`, etc. For CMO IDs, you can generate fake ones with the included `generate_cmo_id.py` script. Once you have two columns, simply highlight them in Excel (just the two columns, not the entire sheet) and paste them into `nano`/Atom/Sublime/etc.; they should come through as tab-delimited text, which you can save as `patterns.tsv`.

- TODO: write a script to do this instead
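
In the meantime, a minimal sketch of what such a script could look like is below. It assumes the output of `get_fields_to_sanitize.sh` has been saved one ID per line to a file (here called `ids.txt`) and that plain numbered placeholders are acceptable replacements; the script name `make_patterns.py` is hypothetical:

```
#!/usr/bin/env python3
# Hypothetical helper: build a patterns.tsv from a list of IDs.
# Usage: ./make_patterns.py ids.txt > patterns.tsv
import sys

with open(sys.argv[1]) as fin:
    # one old pattern per line, skipping blank lines
    old_patterns = [line.strip() for line in fin if line.strip()]

for i, old_pattern in enumerate(old_patterns, start=1):
    # map each old ID to a numbered dummy identifier, tab-delimited
    print("{}\t{}".format(old_pattern, "Sample{}".format(i)))
```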

Once `patterns.tsv` is present, you can run the `sanitize.sh` script on your JSON to replace all the old patterns with new ones.


```
./sanitize.sh 09603_I.filemetadata.json
```

For good measure, double-check the contents of the JSON to verify that it looks correct before committing it. The JSON should then be moved to the appropriate `fixtures` directory.
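
One way to do a quick spot check (assuming `patterns.tsv` is still in the current directory) is to confirm that none of the old patterns remain in the sanitized JSON:

```
$ cut -f1 patterns.tsv | while read old; do grep -c "$old" 09603_I.filemetadata.json; done
# every count printed should be 0
```
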
194 changes: 176 additions & 18 deletions scripts/dump_db_fixtures.py
@@ -17,6 +17,8 @@

$ dump_db_fixtures.py port_files fd41534c-71eb-4b1b-b3af-e3b1ec3aecde

$ dump_db_fixtures.py run --onefile ca18b090-03ad-4bef-acd3-52600f8e62eb

Output
------

@@ -34,7 +36,7 @@
import json
import argparse
import django
from django.db.models import Prefetch
from django.db.models import Prefetch, Max
from django.core import serializers
from pprint import pprint

@@ -44,21 +46,100 @@
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "beagle.settings")
django.setup()
from file_system.models import File, FileMetadata, FileGroup, FileType
from runner.models import Run, RunStatus, Port, PortType, Pipeline
from runner.models import Run, RunStatus, Port, PortType, Pipeline, OperatorRun
sys.path.pop(0)

def sanitize_metadata(queryset_data_dicts):
"""
Strip personal data from the FileMetadata queryset instances that have been converted to dicts

Parameters
----------
queryset_data_dicts: list
a list of dicts that represent the FileMetadata instances.

Output
------
list
a list of modified dicts

Notes
-----
Modifies dicts in place; pass by reference

Example of FileMetadata dict data structure:

[{
"model": "file_system.filemetadata",
"fields": {
"metadata": {
"patientId": "C-ABCD",
"sampleName": "C-ABCD-P001-d",
"labHeadName": "Roslyn Franklin",
"labHeadEmail": "[email protected]",
"dataAnalystName": "",
"dataAnalystEmail": "",
"externalSampleId": "AB-123-XYZ",
"investigatorName": "Dr. Smith",
"investigatorEmail": "[email protected]",
"projectManagerName": "Dr. Jones",
"investigatorSampleId": "AB-123-XYZ",
},
}
}, ... ]
"""
for queryset_data_dict in queryset_data_dicts:
if 'metadata' in queryset_data_dict['fields']:
if 'labHeadName' in queryset_data_dict['fields']['metadata']:
queryset_data_dict['fields']['metadata']['labHeadName'] = 'Roslyn Franklin'

if 'labHeadEmail' in queryset_data_dict['fields']['metadata']:
queryset_data_dict['fields']['metadata']['labHeadEmail'] = '[email protected]'

if 'dataAnalystName' in queryset_data_dict['fields']['metadata']:
queryset_data_dict['fields']['metadata']['dataAnalystName'] = 'E Wilson'

if 'dataAnalystEmail' in queryset_data_dict['fields']['metadata']:
queryset_data_dict['fields']['metadata']['dataAnalystEmail'] = '[email protected]'

if 'investigatorName' in queryset_data_dict['fields']['metadata']:
queryset_data_dict['fields']['metadata']['investigatorName'] = 'Dr. Smith'

if 'investigatorEmail' in queryset_data_dict['fields']['metadata']:
queryset_data_dict['fields']['metadata']['investigatorEmail'] = '[email protected]'

if 'projectManagerName' in queryset_data_dict['fields']['metadata']:
queryset_data_dict['fields']['metadata']['projectManagerName'] = 'Dr. Jones'

return(queryset_data_dicts)

def get_file_filemetadata_from_port(port_instance):
"""
Get the queryset of all File and FileMetadata entries for a given Port entry
returns a tuple of type (files_queryset, filemetadata_queryset)
"""
files_queryset = port_instance.files.all()
# filemetadata_queryset = FileMetadata.objects.filter(file__in = [i for i in files_queryset])
filemetata_instances = []
for file in files_queryset:
filemetata_instances.append(FileMetadata.objects.filter(file = file).order_by('-version').first())
return(files_queryset, filemetata_instances)

def dump_request(**kwargs):
"""
Dump re-loadable fixtures for File and FileMetadata items from a given request
"""
requestID = kwargs.pop('requestID')
sanitize = kwargs.pop('sanitize', False)
output_file_file = "{}.file.json".format(requestID)
output_filemetadata_file = "{}.filemetadata.json".format(requestID)

# get FileMetadata entries that match the request ID
file_instances = FileMetadata.objects.filter(metadata__requestId = requestID)
print(json.dumps(json.loads(serializers.serialize('json', file_instances)), indent=4), file = open(output_filemetadata_file, "w"))
data = json.loads(serializers.serialize('json', file_instances))
if sanitize == True:
data = sanitize_metadata(queryset_data_dicts = data)
print(json.dumps(data, indent=4), file = open(output_filemetadata_file, "w"))

# get the File entries that corresponds to the request ID
queryset = File.objects.prefetch_related(
@@ -73,22 +154,75 @@ def dump_run(**kwargs):
Dump re-loadable Django database fixtures for a Run entry and its associated input and output Port entries
"""
runID = kwargs.pop('runID')
onefile = kwargs.pop('onefile')
output_run_file = "{}.run.json".format(runID)
output_port_input_file = "{}.port.input.json".format(runID)
output_port_output_file = "{}.port.output.json".format(runID)
output_port_input_file = "{}.run.port.input.json".format(runID)
output_port_output_file = "{}.run.port.output.json".format(runID)
output_port_file_input_file = "{}.run.port_file.input.json".format(runID)
output_port_filemetadata_input_file = "{}.run.port_filemetadata.input.json".format(runID)
output_port_file_output_file = "{}.run.port_file.output.json".format(runID)
output_port_filemetadata_output_file = "{}.run.port_filemetadata.output.json".format(runID)
output_operator_run_file = "{}.run.operator_run.json".format(runID)

all_data = []
input_files = []
input_filemetadatas = []
output_files = []
output_filemetadatas = []

# get the parent Run instance
run_instance = Run.objects.get(id = runID)
print(json.dumps(json.loads(serializers.serialize('json', [run_instance])), indent=4), file = open(output_run_file, "w"))

# get the Run input and output Port instances
input_queryset = run_instance.port_set.filter(port_type=PortType.INPUT)
print(json.dumps(json.loads(serializers.serialize('json', input_queryset.all())), indent=4), file = open(output_port_input_file, "w"))
operator_run_instance = run_instance.operator_run

output_queryset = run_instance.port_set.filter(port_type=PortType.OUTPUT)
print(json.dumps(json.loads(serializers.serialize('json', output_queryset.all())), indent=4), file = open(output_port_output_file, "w"))
for item in output_queryset:
pprint((item, item.files.all()))
# get the Run input and output Port instances
input_port_queryset = run_instance.port_set.filter(port_type=PortType.INPUT)
output_port_queryset = run_instance.port_set.filter(port_type=PortType.OUTPUT)

for item in input_port_queryset:
files_queryset, filemetata_instances = get_file_filemetadata_from_port(item)
for file in files_queryset:
input_files.append(file)
for filemetadata in filemetata_instances:
input_filemetadatas.append(filemetadata)

for item in output_port_queryset:
files_queryset, filemetata_instances = get_file_filemetadata_from_port(item)
for file in files_queryset:
output_files.append(file)
for filemetadata in filemetata_instances:
output_filemetadatas.append(filemetadata)

# save each set of items to individual files by default
if onefile == False:
print(json.dumps(json.loads(serializers.serialize('json', [run_instance])), indent=4), file = open(output_run_file, "w"))
print(json.dumps(json.loads(serializers.serialize('json', [operator_run_instance])), indent=4), file = open(output_operator_run_file, "w"))
print(json.dumps(json.loads(serializers.serialize('json', input_port_queryset.all())), indent=4), file = open(output_port_input_file, "w"))
print(json.dumps(json.loads(serializers.serialize('json', output_port_queryset.all())), indent=4), file = open(output_port_output_file, "w"))

print(json.dumps(json.loads(serializers.serialize('json', input_files)), indent=4), file = open(output_port_file_input_file, "w"))
print(json.dumps(json.loads(serializers.serialize('json', input_filemetadatas)), indent=4), file = open(output_port_filemetadata_input_file, "w"))

print(json.dumps(json.loads(serializers.serialize('json', output_files)), indent=4), file = open(output_port_file_output_file, "w"))
print(json.dumps(json.loads(serializers.serialize('json', output_filemetadatas)), indent=4), file = open(output_port_filemetadata_output_file, "w"))

# save all items to a single file
if onefile == True:
all_data.append(run_instance)
all_data.append(operator_run_instance)
for item in input_port_queryset:
all_data.append(item)
for item in output_port_queryset:
all_data.append(item)
for item in input_files:
all_data.append(item)
for item in input_filemetadatas:
all_data.append(item)
for item in output_files:
all_data.append(item)
for item in output_filemetadatas:
all_data.append(item)
print(json.dumps(json.loads(serializers.serialize('json', all_data)), indent=4), file = open(output_run_file, "w"))


def dump_port_files(**kwargs):
@@ -113,13 +247,23 @@ def dump_pipeline(**kwargs):
Dump re-loadable Django database fixtures for Pipeline entries and related table fixtures
"""
pipelineName = kwargs.pop('pipelineName')
output_pipeline_file = "{}.pipeline.json".format(pipelineName)
output_pipeline_filegroup_file = "{}.pipeline.output_file_group.json".format(pipelineName)
if pipelineName == "all":
output_pipeline_file = "all_pipeline.json".format(pipelineName)
output_pipeline_filegroup_file = "all_pipeline.output_file_group.json".format(pipelineName)

pipelines = Pipeline.objects.all()
print(json.dumps(json.loads(serializers.serialize('json', pipelines)), indent=4), file = open(output_pipeline_file, "w"))

file_groups = FileGroup.objects.all()
print(json.dumps(json.loads(serializers.serialize('json', file_groups)), indent=4), file = open(output_pipeline_filegroup_file, "w"))
else:
output_pipeline_file = "{}.pipeline.json".format(pipelineName)
output_pipeline_filegroup_file = "{}.pipeline.output_file_group.json".format(pipelineName)

pipeline_instance = Pipeline.objects.get(name = pipelineName)
print(json.dumps(json.loads(serializers.serialize('json', [pipeline_instance])), indent=4), file = open(output_pipeline_file, "w"))
pipeline_instance = Pipeline.objects.get(name = pipelineName)
print(json.dumps(json.loads(serializers.serialize('json', [pipeline_instance])), indent=4), file = open(output_pipeline_file, "w"))

print(json.dumps(json.loads(serializers.serialize('json', [pipeline_instance.output_file_group])), indent=4), file = open(output_pipeline_filegroup_file, "w"))
print(json.dumps(json.loads(serializers.serialize('json', [pipeline_instance.output_file_group])), indent=4), file = open(output_pipeline_filegroup_file, "w"))

def get_files(value, type):
"""
@@ -178,6 +322,14 @@ def dump_file(**kwargs):
output_file = "all.file_filemetadata.json"
print(json.dumps(all_data, indent=4), file = open(output_file, "w"))

def dump_file_group(**kwargs):
"""
Dump the FileGroup fixtures
"""
fileGroupId = kwargs.pop('fileGroupID')
output_file_group_file = "{}.file_group.json".format(fileGroupId)
filegroup_instance = FileGroup.objects.get(id = fileGroupId)
print(json.dumps(json.loads(serializers.serialize('json', [filegroup_instance])), indent=4), file = open(output_file_group_file, "w"))

def parse():
"""
@@ -189,10 +341,12 @@ def parse():
# subparser for dumping requests
request = subparsers.add_parser('request', help = 'Dump File and FileMetadata based on a requestId')
request.add_argument('requestID', help = 'requestID to dump items for')
request.add_argument('--sanitize', action = "store_true", help = 'Attempt to scrub personal data from the FileMetadata entries')
request.set_defaults(func = dump_request)

run = subparsers.add_parser('run', help = 'Dump output data for pipeline run')
run.add_argument('runID', help = 'Run ID to dump items for')
run.add_argument('--onefile', action = "store_true", help = 'Put all the outputs into a single file ')
run.set_defaults(func = dump_run)

pipeline = subparsers.add_parser('pipeline', help = 'Dump pipeline fixture')
@@ -210,6 +364,10 @@ def parse():
port_files.add_argument('portID', help = 'Port ID to dump files for')
port_files.set_defaults(func = dump_port_files)

file_group = subparsers.add_parser('filegroup', help = 'Dump filegroup fixture')
    file_group.add_argument('fileGroupID', help = 'FileGroup ID to dump items for')
file_group.set_defaults(func = dump_file_group)

args = parser.parse_args()
args.func(**vars(args))
