This repository has been archived by the owner on Jan 29, 2022. It is now read-only.

Move sources creation out of the processors into our API migrations #813

Open
vitorbaptista opened this issue Apr 27, 2017 · 0 comments

Comments

@vitorbaptista
Contributor

We currently write the sources via our processors. For example, the NCT processor has an extract_source() extractor at https://github.com/opentrials/processors/blob/cd3c32822f24914e7ef0d48a8c2618b64b05c1cc/processors/nct/extractors.py#L12-L20, which is used by the base.processors.trial.process_trials() method to create/update the source (https://github.com/opentrials/processors/blob/cd3c32822f24914e7ef0d48a8c2618b64b05c1cc/processors/base/processors/trial.py#L26-L28).

This causes a problem for #502. In that issue, we want to refactor how we handle identifiers so that we can add identifiers from unknown sources, as well as multiple identifiers from the same source. We want to create an identifiers table like:

| trial_id | record_id | source_id | identifier |
| --- | --- | --- | --- |
| TRIAL_1 | RECORD_1 | SOURCE_1 | 'NCT0000000' |
| TRIAL_1 | RECORD_2 | SOURCE_2 | 'ICTRP00000' |
| ... | ... | ... | ... |

With that in place, the process of writing a new trial would be:

  1. Write the trial in the trials table
  2. Write the record in the records table
  3. Write the trial identifiers in the identifiers table
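
The three steps above can be sketched as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database instead of our real database, the column definitions are simplified guesses, the write_trial() helper is hypothetical, and the identifier values are placeholders.

```python
import sqlite3


def create_schema(conn):
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    # Simplified, assumed schema: the real tables have more columns.
    conn.executescript("""
        CREATE TABLE sources (id TEXT PRIMARY KEY);
        CREATE TABLE trials (id TEXT PRIMARY KEY);
        CREATE TABLE records (
            id TEXT PRIMARY KEY,
            trial_id TEXT NOT NULL REFERENCES trials(id)
        );
        CREATE TABLE identifiers (
            trial_id TEXT NOT NULL REFERENCES trials(id),
            record_id TEXT NOT NULL REFERENCES records(id),
            source_id TEXT REFERENCES sources(id),
            identifier TEXT NOT NULL
        );
    """)


def write_trial(conn, trial_id, record_id, identifiers):
    """Hypothetical helper: `identifiers` is a list of (source_id, identifier)."""
    with conn:  # one transaction: commit on success, roll back on error
        # 1. Write the trial
        conn.execute("INSERT INTO trials (id) VALUES (?)", (trial_id,))
        # 2. Write the record
        conn.execute(
            "INSERT INTO records (id, trial_id) VALUES (?, ?)",
            (record_id, trial_id),
        )
        # 3. Write the trial identifiers
        conn.executemany(
            "INSERT INTO identifiers (trial_id, record_id, source_id, identifier)"
            " VALUES (?, ?, ?, ?)",
            [(trial_id, record_id, source_id, value)
             for source_id, value in identifiers],
        )


conn = sqlite3.connect(":memory:")
create_schema(conn)
# Both sources must already exist for step 3 to succeed.
conn.executemany("INSERT INTO sources (id) VALUES (?)", [("nct",), ("euctr",)])
write_trial(
    conn,
    "TRIAL_1",
    "RECORD_1",
    [("nct", "NCT0000000"), ("euctr", "EUCTR0000000")],  # placeholder values
)
```

Note that step 3 only works here because both source rows were inserted beforehand.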

Imagine that we have a trial from NCT that has two identifiers: one from NCT and another from EUCTR. Currently, when running the NCT processor, there's no guarantee that the EUCTR processor has ever run before, so we can't rely on having a euctr source already created. This means that, if we try to create a row with source_id == 'euctr' in the identifiers table, we'll receive an exception complaining that this source_id doesn't exist.
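
A minimal reproduction of that failure, assuming a simplified two-table schema in an in-memory SQLite database (our real database and the exact exception type may differ; try_insert() is a hypothetical helper and the identifier values are placeholders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE sources (id TEXT PRIMARY KEY);
    CREATE TABLE identifiers (
        source_id TEXT REFERENCES sources(id),
        identifier TEXT NOT NULL
    );
    -- Only the NCT processor has run, so only its source exists.
    INSERT INTO sources (id) VALUES ('nct');
""")


def try_insert(source_id, identifier):
    """Return True if the identifier row was written, False on an FK violation."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO identifiers (source_id, identifier) VALUES (?, ?)",
                (source_id, identifier),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised because no row in `sources` has this source_id.
        return False


nct_ok = try_insert("nct", "NCT0000000")
euctr_ok = try_insert("euctr", "EUCTR0000000")
```

Here nct_ok ends up True and euctr_ok ends up False, because the euctr source row was never created.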

We have a couple of options to solve this. The easy way is to remove the foreign key constraint on source_id, so the database won't complain. This means we lose the DB-level guarantee that every source_id exists, so we would have to handle this case at the app level.

My proposed solution is to move the responsibility of writing the sources to the API migrations, as a data migration. The source data is constant: it doesn't depend on whatever we have extracted from our sources. It also doesn't change often, so it is a great candidate for a data migration. With this, we have the guarantee that all the sources will be in the DB.
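
To illustrate the idea, here is a rough sketch of an idempotent seed step. It is written in Python against an in-memory SQLite database purely for illustration; the actual data migration would live in the API's own migration tooling, which is not shown here. The seed_sources() function and the list of source IDs are assumptions.

```python
import sqlite3

# Assumed list of source IDs; the real list would cover every source we use.
SOURCES = [("nct",), ("euctr",), ("ictrp",)]


def seed_sources(conn):
    """Insert every known source, skipping the ones that already exist."""
    with conn:
        # INSERT OR IGNORE makes the seed safe to run more than once.
        conn.executemany("INSERT OR IGNORE INTO sources (id) VALUES (?)", SOURCES)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sources (id TEXT PRIMARY KEY)")

seed_sources(conn)
seed_sources(conn)  # running it again changes nothing

source_ids = [row[0] for row in conn.execute("SELECT id FROM sources ORDER BY id")]
```

Once a step like this has run, any processor can reference any source_id regardless of which processors ran before it.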

The drawback is that the knowledge of each source's ID is spread across the API and the Processors. We can (and should) limit this exposure to a single place in the processors, treating the IDs as constants, so that we have a single place to change whenever we need to.

We did something very similar when building the data categories. Their IDs also need to be shared between the API and the Processors, so we added a simple dict constant in https://github.com/opentrials/processors/blob/cd3c32822f24914e7ef0d48a8c2618b64b05c1cc/processors/base/config.py#L104-L139, and we reuse it everywhere we need it.
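
Applied to sources, the same pattern could look roughly like this. The SOURCE_IDS name, its keys and values, and the get_source_id() helper are all hypothetical; only the 'euctr' style of ID comes from the example above.

```python
# Hypothetical constant, kept in a single place in the processors.
# The values must match the IDs written by the API data migration.
SOURCE_IDS = {
    "nct": "nct",
    "euctr": "euctr",
    "ictrp": "ictrp",
}


def get_source_id(key):
    """Look up a source ID, failing loudly on names we don't know about."""
    try:
        return SOURCE_IDS[key]
    except KeyError:
        raise ValueError("Unknown source: %r" % (key,))


euctr_source_id = get_source_id("euctr")
```

Extractors would then call get_source_id() instead of hard-coding the ID strings, so a change in the API migration only needs one matching change in the processors.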
