Move sources creation out of the processors into our API migrations #813

vitorbaptista · 2017-04-27T18:21:59Z

We currently write the sources via our processors. For example, the NCT processor has an extract_source() extractor at https://github.com/opentrials/processors/blob/cd3c32822f24914e7ef0d48a8c2618b64b05c1cc/processors/nct/extractors.py#L12-L20, which is used by the base.processors.trial.process_trials() method to create/update the source (https://github.com/opentrials/processors/blob/cd3c32822f24914e7ef0d48a8c2618b64b05c1cc/processors/base/processors/trial.py#L26-L28).

This causes a problem for #502. In that issue, we want to refactor how we handle identifiers, to be able to add identifiers from unknown sources, and multiple identifiers from the same source. We want to create an identifiers table like:

trial_id	record_id	source_id	identifier
TRIAL_1	RECORD_1	SOURCE_1	'NCT0000000'
TRIAL_1	RECORD_2	SOURCE_2	'ICTRP00000'
...	...	...	...

With that in place, the process of writing a new trial would be:

1. Write the trial in the trials table
2. Write the record in the records table
3. Write the trial identifiers in the identifiers table
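
As a rough illustration of that write order, here is a minimal sketch using Python's built-in sqlite3 module. The schema, column names and the write_trial() helper are assumptions made for this example, not the actual OpenTrials schema or processor code.

```python
import sqlite3


def create_schema(conn):
    """Create a simplified, hypothetical version of the tables discussed here."""
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE sources (id TEXT PRIMARY KEY);
        CREATE TABLE trials (id TEXT PRIMARY KEY);
        CREATE TABLE records (
            id TEXT PRIMARY KEY,
            trial_id TEXT NOT NULL REFERENCES trials(id),
            source_id TEXT NOT NULL REFERENCES sources(id)
        );
        CREATE TABLE identifiers (
            trial_id TEXT NOT NULL REFERENCES trials(id),
            record_id TEXT NOT NULL REFERENCES records(id),
            source_id TEXT NOT NULL REFERENCES sources(id),
            identifier TEXT NOT NULL
        );
    """)


def write_trial(conn, trial_id, record_id, record_source_id, identifiers):
    """Write a trial, its record, and its identifiers in a single transaction.

    `identifiers` is a list of (source_id, identifier) pairs.
    """
    with conn:  # commits on success, rolls back if any insert fails
        conn.execute("INSERT INTO trials (id) VALUES (?)", (trial_id,))
        conn.execute(
            "INSERT INTO records (id, trial_id, source_id) VALUES (?, ?, ?)",
            (record_id, trial_id, record_source_id),
        )
        conn.executemany(
            "INSERT INTO identifiers (trial_id, record_id, source_id, identifier) "
            "VALUES (?, ?, ?, ?)",
            [(trial_id, record_id, src, ident) for src, ident in identifiers],
        )
```

Note that every source_id passed in must already exist in the sources table, which is where the problem described next comes from.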

Imagine that we have a trial from NCT that has two identifiers: one from NCT and another from EUCTR. Currently, when running the NCT processor, there's no guarantee that the EUCTR processor has ever run before, so we can't rely on having a euctr source already created. This means that, if we try to create a row with source_id == 'euctr' in the identifiers table, we'll receive an Exception complaining that this source_id doesn't exist.
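
The failure can be reproduced in isolation. The sketch below uses an in-memory SQLite database with a cut-down, hypothetical schema (only sources and identifiers); the real database and exception type will differ, but the foreign key behaviour is the same idea.

```python
import sqlite3


def open_db():
    """Open an in-memory DB with a hypothetical sources/identifiers schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite does not enforce foreign keys unless this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE sources (id TEXT PRIMARY KEY);
        CREATE TABLE identifiers (
            source_id TEXT NOT NULL REFERENCES sources(id),
            identifier TEXT NOT NULL
        );
    """)
    return conn


def add_identifier(conn, source_id, identifier):
    """Try to insert an identifier; return False if the source is missing."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO identifiers (source_id, identifier) VALUES (?, ?)",
                (source_id, identifier),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised when source_id has no matching row in sources.
        return False
```

With only the nct source present, adding an identifier for 'nct' succeeds, while adding one for 'euctr' is rejected by the foreign key constraint.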

We have a couple of options to solve this. The easy way is to remove the foreign key constraint on source_id, so it won't complain. This means we lose the DB-level guarantee that every source_id exists, so we would have to handle this case at the app level.

My proposed solution is to move the responsibility of writing the sources to the API migrations, as a data migration. The source data is constant: it doesn't depend on whatever we have extracted from our sources. It also doesn't change often, so it is a great candidate for a data migration. With this, we have the guarantee that all the sources will be in the DB before any processor runs.
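
To make the idea concrete, here is a minimal sketch of what such a data migration could look like. It is written in Python against sqlite3 purely for illustration; the real migration would use whatever migration tooling the API already has. The up()/down() names, the SOURCES list and its IDs are assumptions for this example.

```python
import sqlite3

# Hypothetical seed data: (id, name) pairs for the sources mentioned in this issue.
SOURCES = [
    ("nct", "ClinicalTrials.gov"),
    ("euctr", "EU Clinical Trials Register"),
    ("ictrp", "WHO ICTRP"),
]


def up(conn):
    """Insert the constant sources; safe to run more than once."""
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO sources (id, name) VALUES (?, ?)",
            SOURCES,
        )


def down(conn):
    """Remove only the sources this migration created."""
    with conn:
        conn.executemany(
            "DELETE FROM sources WHERE id = ?",
            [(source_id,) for source_id, _ in SOURCES],
        )
```

Because the inserts ignore rows that already exist, running the migration on a database where a processor has already created some of the sources should not fail.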

The drawback is that the knowledge of the IDs of each source is spread across the API and the Processors. We can (and should) limit this exposure to a single place in the processor, treating the IDs as constants, so that we have a single place to change whenever we need.

We did something very similar when building the data categories. Their IDs also need to be shared between the API and the Processor, so we added a simple dict constant in https://github.com/opentrials/processors/blob/cd3c32822f24914e7ef0d48a8c2618b64b05c1cc/processors/base/config.py#L104-L139, and reuse that everywhere we need.
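
For sources, the equivalent could be as small as the sketch below. The SOURCE_IDS name, its keys and the get_source_id() helper are hypothetical and only show the shape of the idea; they are not taken from the existing config.py.

```python
# Hypothetical constant: the only place in the processors that knows the
# source IDs created by the API data migration.
SOURCE_IDS = {
    "nct": "nct",
    "euctr": "euctr",
    "ictrp": "ictrp",
}


def get_source_id(key):
    """Return the DB ID for a known source, failing loudly for unknown ones."""
    try:
        return SOURCE_IDS[key]
    except KeyError:
        raise ValueError(
            "Unknown source {!r}: add it to SOURCE_IDS and to the API "
            "data migration".format(key)
        )
```

Failing loudly on an unknown key means a missing source shows up as a clear error in the processor instead of a foreign key violation at write time.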

vitorbaptista added API Database debt labels Apr 27, 2017

vitorbaptista mentioned this issue Apr 27, 2017

Refactor how we handle identifiers #502

Open

vitorbaptista added the 0. Ready for Analysis label Apr 27, 2017
