Skip to content

Duplicate Harvest Source Cleanup and Other Nuclear Harvest Fixes

James Brown edited this page Feb 21, 2020 · 2 revisions

Multiple Harvest Sources

The DOI harvest source was locked up and could not be restarted or cleared. This was due to a number of harvest sources being created to access the same source, and causing packages (datasets) to be linked to both harvest sources. When trying to clear 1, the sql commands would fail on foreign keys still being linked. To allow for a clear and re-harvest, both sql commands have to be run simultaneously. To accomplish this, we used the sql commands here.

These commands are directly from the harvester clear commands, with the final delete statement removed and added as a final sql command. From this point, a formal clear of each harvest source should be run using the command line paster tool:

/usr/lib/ckan/bin/paster --plugin=ckanext-harvest harvester clearsource 34ce571b-cb98-4e0b-979f-30f9ecc452c5 -c /etc/ckan/production.ini

/usr/lib/ckan/bin/paster --plugin=ckanext-harvest harvester clearsource 1f3ebea5-58d2-4aaa-85a0-e7ebda3074e7 -c /etc/ckan/production.ini

/usr/lib/ckan/bin/paster --plugin=ckanext-harvest harvester clearsource d8ef5c4f-b0fe-4338-afe9-089c029ea378 -c /etc/ckan/production.ini

Now we need to remove all but one harvest source. We will keep the first id, and remove the others:

/usr/lib/ckan/bin/paster --plugin=ckanext-harvest harvester rmsource d8ef5c4f-b0fe-4338-afe9-089c029ea378 -c /etc/ckan/production.ini

/usr/lib/ckan/bin/paster --plugin=ckanext-harvest harvester rmsource 1f3ebea5-58d2-4aaa-85a0-e7ebda3074e7 -c /etc/ckan/production.ini

If a harvest source exists but the package for the harvest source doesn't exist (the harvest source doesn't show up through the website and the above clear or remove source fails), then it needs to be removed manually. To do this, run the SQL commands detailed here:

BEGIN;
DELETE FROM harvest_object_error where harvest_object_id in (select id from harvest_object where harvest_source_id = '1f3ebea5-58d2-4aaa-85a0-e7ebda3074e7' OR harvest_source_id = 'd8ef5c4f-b0fe-4338-afe9-089c029ea378');
DELETE FROM harvest_object_extra where harvest_object_id in (select id from harvest_object where harvest_source_id = '1f3ebea5-58d2-4aaa-85a0-e7ebda3074e7' OR harvest_source_id = 'd8ef5c4f-b0fe-4338-afe9-089c029ea378');
DELETE FROM harvest_object where harvest_source_id = '1f3ebea5-58d2-4aaa-85a0-e7ebda3074e7' OR harvest_source_id = 'd8ef5c4f-b0fe-4338-afe9-089c029ea378';
DELETE FROM harvest_gather_error where harvest_job_id in (select id from harvest_job where source_id = '1f3ebea5-58d2-4aaa-85a0-e7ebda3074e7' OR source_id = 'd8ef5c4f-b0fe-4338-afe9-089c029ea378');
DELETE FROM harvest_job where source_id = '1f3ebea5-58d2-4aaa-85a0-e7ebda3074e7' OR source_id = 'd8ef5c4f-b0fe-4338-afe9-089c029ea378';
DELETE FROM harvest_source WHERE id = '1f3ebea5-58d2-4aaa-85a0-e7ebda3074e7' OR id = 'd8ef5c4f-b0fe-4338-afe9-089c029ea378';
COMMIT;

For DOI harvest clear on 10/18/2018, both of these steps were needed to be completed for a successful harvest to run. To create a new harvest job, run this final command:

/usr/lib/ckan/bin/paster --plugin=ckanext-harvest harvester job 34ce571b-cb98-4e0b-979f-30f9ecc452c5 -c /etc/ckan/production.ini

Harvest stop (Only when every other restart/cleanup option for harvester is exhausted!)

There are 2 pieces to stopping a harvest: the DB and the Redis server. The process

Clear Redis

There is a ckan harvest command to clear the redis queue: ckan --plugin=ckanext-harvest harvester purge_queues, documented here. The following was used to simply remove all items from the queue using the redis cli. It may be more prudent to remove specific items in the future.

redis-cli
flushall

Clear DB

The following queries mark all created harvest objects as errors, and updates the harvest job to finished. The first needs to be run before the second, otherwise the first will have nothing to update.

UPDATE harvest_object
SET state = 'ERROR'
WHERE harvest_job_id IN (SELECT id FROM harvest_job WHERE status = 'Running')
AND state <> 'COMPLETE';
UPDATE harvest_job
SET status = 'Finished'
WHERE status = 'Running';
Clone this wiki locally