This repository has been archived by the owner on Jan 29, 2022. It is now read-only.
Having the processors write the sources causes a problem for #502. In that issue, we want to refactor how we handle identifiers so that we can add identifiers from unknown sources, and multiple identifiers from the same source. We want to create an identifiers table like:
| trial_id | record_id | source_id | identifier |
| --- | --- | --- | --- |
| TRIAL_1 | RECORD_1 | SOURCE_1 | 'NCT0000000' |
| TRIAL_1 | RECORD_2 | SOURCE_2 | 'ICTRP00000' |
| ... | ... | ... | ... |
With that in place, the process of writing a new trial would be:
1. Write the trial in the trials table
2. Write the record in the records table
3. Write the trial identifiers in the identifiers table
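The three steps above could be sketched roughly as follows. This is only an illustration using an in-memory SQLite database; the column layout, the `write_trial()` helper, and the EUCTR placeholder value are assumptions for the example, not the actual schema or processor code.

```python
import sqlite3

# Hypothetical, simplified schema: only the columns needed for the example.
SCHEMA = """
CREATE TABLE trials (id TEXT PRIMARY KEY);
CREATE TABLE records (
    id TEXT PRIMARY KEY,
    trial_id TEXT NOT NULL REFERENCES trials(id)
);
CREATE TABLE identifiers (
    trial_id TEXT NOT NULL REFERENCES trials(id),
    record_id TEXT NOT NULL REFERENCES records(id),
    source_id TEXT NOT NULL,
    identifier TEXT NOT NULL,
    PRIMARY KEY (record_id, source_id, identifier)
);
"""


def write_trial(conn, trial_id, record_id, identifiers):
    """Write a trial, its record, and its identifiers in one transaction.

    `identifiers` is a list of (source_id, identifier) pairs.
    """
    with conn:  # commits on success, rolls back if any statement fails
        # Step 1: the trial
        conn.execute("INSERT INTO trials (id) VALUES (?)", (trial_id,))
        # Step 2: the record
        conn.execute(
            "INSERT INTO records (id, trial_id) VALUES (?, ?)",
            (record_id, trial_id),
        )
        # Step 3: every identifier found in the record
        conn.executemany(
            "INSERT INTO identifiers (trial_id, record_id, source_id, identifier)"
            " VALUES (?, ?, ?, ?)",
            [(trial_id, record_id, src, ident) for src, ident in identifiers],
        )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
write_trial(
    conn,
    "TRIAL_1",
    "RECORD_1",
    [("nct", "NCT0000000"), ("euctr", "EUCTR-PLACEHOLDER")],  # placeholder values
)
```

The composite primary key is one possible way to allow several identifiers from the same source on a single record.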
Imagine that we have a trial from NCT that has two identifiers: one from NCT and another from EUCTR. Currently, when running the NCT processor, there's no guarantee that the EUCTR processor has ever run before, so we can't rely on having a euctr source already created. This means that, if we try to create a row with source_id == 'euctr' in the identifiers table, we'll receive an Exception complaining that this source_id doesn't exist.
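To make the failure concrete, here is a minimal reproduction with an in-memory SQLite database. The two-table schema and the `try_insert()` helper are hypothetical and much smaller than the real database; the point is only that the foreign key rejects a source that no processor has written yet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE sources (id TEXT PRIMARY KEY);
CREATE TABLE identifiers (
    source_id TEXT NOT NULL REFERENCES sources(id),
    identifier TEXT NOT NULL
);
""")

# Only the NCT processor has run so far, so only the "nct" source exists.
conn.execute("INSERT INTO sources (id) VALUES ('nct')")


def try_insert(source_id, identifier):
    """Return True if the identifier row was written, False on an FK violation."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO identifiers (source_id, identifier) VALUES (?, ?)",
                (source_id, identifier),
            )
        return True
    except sqlite3.IntegrityError:
        return False


nct_ok = try_insert("nct", "NCT0000000")
euctr_ok = try_insert("euctr", "EUCTR-PLACEHOLDER")  # placeholder identifier
print(nct_ok, euctr_ok)  # prints: True False
```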
We have a couple of options to solve this. The easy way is to remove the foreign key constraint on source_id, so it won't complain. This means we lose the DB-level guarantee that every source_id exists, so we would have to handle this case at the app level.
My proposed solution is to move the responsibility of writing the sources to the API migrations, as a data migration. The source data is constant: it doesn't depend on whatever we have extracted from our sources. It also doesn't change often, so it is a great candidate for a data migration. With this, we have the guarantee that all the sources will be in the DB.
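As a rough sketch of the idea, a data migration could seed every known source up front and be safe to re-run. The example below uses Python and SQLite purely for illustration; the real API migrations may use a different tool and database, and the `SOURCES` list, the `seed_sources()` name, and the schema are assumptions.

```python
import sqlite3

# Hypothetical list of known sources as (id, name) pairs; the real list
# and its columns would live in the migration itself.
SOURCES = [
    ("nct", "NCT"),
    ("euctr", "EUCTR"),
    ("ictrp", "ICTRP"),
]


def seed_sources(conn):
    """Insert every known source, skipping rows that already exist."""
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO sources (id, name) VALUES (?, ?)", SOURCES
        )


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE sources (id TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE identifiers (
    source_id TEXT NOT NULL REFERENCES sources(id),
    identifier TEXT NOT NULL
);
""")

seed_sources(conn)
seed_sources(conn)  # running the seed twice does not duplicate rows

# With the sources seeded, the NCT processor can reference "euctr"
# even though the EUCTR processor has never run.
with conn:
    conn.execute(
        "INSERT INTO identifiers (source_id, identifier) VALUES (?, ?)",
        ("euctr", "EUCTR-PLACEHOLDER"),  # placeholder identifier
    )
```

This keeps the foreign key constraint in place, so the DB-level guarantee is preserved.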
The drawback is that the knowledge of the IDs of each source is spread across the API and the Processors. We can (and should) limit this exposure to a single place in the processor, treating the IDs as constants, so that we have a single place to change whenever we need to.
For context, we currently write the sources via our processors. For example, the NCT processor has an extract_source() extractor on https://github.com/opentrials/processors/blob/cd3c32822f24914e7ef0d48a8c2618b64b05c1cc/processors/nct/extractors.py#L12-L20, which is used by the base.processors.trial.process_trials() method to create/update the source (https://github.com/opentrials/processors/blob/cd3c32822f24914e7ef0d48a8c2618b64b05c1cc/processors/base/processors/trial.py#L26-L28).
We did something very similar when building the data categories. Their IDs also need to be shared between the API and the Processor, so we added a simple dict constant in https://github.com/opentrials/processors/blob/cd3c32822f24914e7ef0d48a8c2618b64b05c1cc/processors/base/config.py#L104-L139, and we reuse it everywhere we need it.
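A constant for the source IDs could follow the same pattern. The sketch below is hypothetical: the `SOURCE_IDS` name, its keys and values, and the `get_source_id()` helper are assumptions for illustration, not the contents of the existing config.py.

```python
# Hypothetical constant: maps the name a processor uses for a source to the
# ID of the row created by the API data migration. The values must match
# the migration, and this dict would be the only place to change them.
SOURCE_IDS = {
    "nct": "nct",
    "euctr": "euctr",
    "ictrp": "ictrp",
}


def get_source_id(source_name):
    """Return the DB ID for a known source, or None if it is not known.

    How identifiers from unknown sources should be stored is left open here.
    """
    return SOURCE_IDS.get(source_name)


# A processor would look IDs up through the helper instead of hard-coding them.
euctr_id = get_source_id("euctr")
unknown_id = get_source_id("some-other-registry")
```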