Integrate TACOS with external lambda #242

Closed
wants to merge 3 commits into from
3 changes: 3 additions & 0 deletions .env.test
@@ -3,3 +3,6 @@ LINKRESOLVER_BASEURL=https://mit.primo.exlibrisgroup.com/discovery/openurl?insti
[email protected]
LIBKEY_KEY=FAKE_LIBKEY_KEY
LIBKEY_ID=FAKE_LIBKEY_ID
DETECTOR_LAMBDA_URL=http://localhost:3000
DETECTOR_LAMBDA_PATH=/foo
DETECTOR_LAMBDA_CHALLENGE_SECRET=secret_phrase
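
The three new variables are all required by the lookup code later in this diff. As a rough sketch of how a missing setting could be reported by name, the helper below checks a Hash-like environment for each one. The constant `REQUIRED_DETECTOR_VARS` and the method `missing_detector_vars` are hypothetical names used for illustration and do not appear in the PR.

```ruby
# Hypothetical helper, not part of the PR: lists which detector settings are absent or blank.
REQUIRED_DETECTOR_VARS = %w[
  DETECTOR_LAMBDA_URL
  DETECTOR_LAMBDA_PATH
  DETECTOR_LAMBDA_CHALLENGE_SECRET
].freeze

# Accepts ENV or any Hash-like object that responds to fetch(key, default).
def missing_detector_vars(env = ENV)
  REQUIRED_DETECTOR_VARS.select do |name|
    value = env.fetch(name, nil)
    value.nil? || value.strip.empty?
  end
end

# Example with a partial configuration: only the secret is reported as missing.
partial_env = {
  'DETECTOR_LAMBDA_URL' => 'http://localhost:3000',
  'DETECTOR_LAMBDA_PATH' => '/foo'
}
p missing_detector_vars(partial_env)
```

Passing the environment in as an argument keeps the check easy to exercise in tests without touching the real `ENV`.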
96 changes: 12 additions & 84 deletions app/models/detector/citation.rb
@@ -10,26 +10,11 @@ class Detector
# hallmarks of being a citation.
# Phrases whose score is higher than the REQUIRED_SCORE value can be registered as a Detection.
class Citation
attr_reader :score, :subpatterns, :summary
attr_reader :features, :score, :patterns, :summary

# shared singleton methods
extend Detector::BulkChecker

# Citation patterns are regular expressions which attempt to identify structures that are part of many citations.
# This object is used as part of the pattern_checker method. Some of these patterns may get promoted to the Detector
# model if they prove useful beyond a Citation context.
CITATION_PATTERNS = {
apa_volume_issue: /\d+\(\d+\)/,
no: /no\.\s\d+/,
pages: /\d+-+\d+/,
pp: /pp\.\s\d+/,
vol: /vol\.\s\d+/,
year_parens: /\(\d{4}\)/,
brackets: /\[.*?\]/,
lastnames: /[A-Z][a-z]+[.,]/,
quotes: /".*?"/
}.freeze

# The required score value is the threshold needed for a phrase to be officially recorded with a Detection via its
# associated Term.
# Hint: set this to 0 in development environments if you want to temporarily see all output
@@ -55,29 +40,29 @@ def detection?
@score >= REQUIRED_SCORE
end

# The initializer handles the parsing of a phrase, and subsequent population of the @subpatterns, @summary,
# and @score instance variables. @subpatterns contains all the citation components which have been flagged by the
# The initializer handles the parsing of a phrase, and subsequent population of the @patterns, @summary,
# and @score instance variables. @patterns contains all the citation components which have been flagged by the
# Detector::Features::CITATION_PATTERNS hash. @summary contains counts of how often certain characters or words appear in the phrase.
# Finally, the @score value is a summary of how many elements in the subpatterns or summary report were detected.
# Finally, the @score value is a summary of how many elements in the patterns or summary report were detected.
#
# @note This method can be called directly via Detector::Citation.new(phrase) when a Detection is not desired. It is
# also called indirectly via the Detector::Citation.record(Term) class method.
# @param phrase String. Often a `Term.phrase`.
# @return Nothing intentional. Data is written to Hashes `@subpatterns`, `@summary`,
# @return Nothing intentional. Data is written to Hashes `@patterns`, `@summary`,
# and `@score` during processing.
def initialize(phrase)
@subpatterns = {}
@summary = {}
pattern_checker(phrase)
summarize(phrase)
f = Detector::Features.new(phrase)
@features = f.features
@patterns = f.patterns
@summary = f.summary
@score = calculate_score
end

def detections
return unless detection?

[@summary, @subpatterns, @score]
[@summary, @patterns, @score]
end

# The record method first runs all of the parsers by running the initialize method. If the resulting score is higher
@@ -99,7 +84,7 @@ def self.record(term)

private

# This combines the two reports generated by the Citation detector (subpatterns and summary), and calculates the
# This combines the two reports generated by the Citation detector (patterns and summary), and calculates the
# final score value from their contents.
#
# Any detected subpattern is counted toward the score (multiple detections do not get counted twice). For example,
@@ -116,64 +101,7 @@ def calculate_score
SUMMARY_THRESHOLDS.key?(key) && value >= SUMMARY_THRESHOLDS[key]
end

summary_score + @subpatterns.length
end

# This calculates the number of characters in the search phrase. It is called by the summarize method.
def characters(phrase)
phrase.length
end

# This counts the number of colons that appear in the search phrase, because they tend to appear more often in
# citations than in other searches. It is called by the summarize method.
def colons(phrase)
phrase.count(':')
end

# This counts the number of commas in the search phrase. It is called by the summarize method.
def commas(phrase)
phrase.count(',')
end

# This builds one of the two main components of the Citation detector - the subpattern report. It uses each of the
# regular expressions in the CITATION_PATTERNS constant, extracting all matches using the scan method.
#
# @return hash
def pattern_checker(phrase)
CITATION_PATTERNS.each_pair do |type, pattern|
@subpatterns[type.to_sym] = scan(pattern, phrase) if scan(pattern, phrase).present?
end
end

# This counts the number of periods in the search phrase. It is called by the summarize method.
def periods(phrase)
phrase.count('.')
end

# This is a convenience method for the scan method, which is used by pattern_checker.
def scan(pattern, phrase)
phrase.scan(pattern).map(&:strip)
end

# This counts the semicolons in the search phrase. It is called by the summarize method.
def semicolons(phrase)
phrase.count(';')
end

# This builds one of the two main components of the Citation detector - the summary report. It calls each of the
# methods in the first line - which all return integers - and puts the result as a key-value pair in the @summary
# instance variable.
#
# @return hash
def summarize(phrase)
%w[characters colons commas periods semicolons words].each do |check|
@summary[check.to_sym] = send(check, phrase)
end
end

# This counts the number of words in the search phrase. It is called by the summarize method.
def words(phrase)
phrase.split.length
summary_score + @patterns.length
end
end
end
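
The scoring rule described in the comments above (one point per summary value that meets its threshold, plus one point per detected pattern type) can be sketched in plain Ruby. The threshold values below are hypothetical, since the real `SUMMARY_THRESHOLDS` values are not shown in this diff, and `example_score` is an illustrative name rather than a method from the PR.

```ruby
# Hypothetical thresholds for illustration only; the real SUMMARY_THRESHOLDS are not visible in this diff.
EXAMPLE_THRESHOLDS = { commas: 2, periods: 2, words: 5 }.freeze

# Counts summary entries that meet their threshold, then adds one point per matched pattern type.
def example_score(summary, patterns, thresholds = EXAMPLE_THRESHOLDS)
  summary_score = summary.count { |key, value| thresholds.key?(key) && value >= thresholds[key] }
  summary_score + patterns.length
end

summary = { characters: 40, commas: 3, periods: 1, words: 7 }
patterns = { year_parens: ['(2019)'], pages: ['12-34'] }

# commas and words meet their thresholds (2 points); two pattern types matched (2 points).
puts example_score(summary, patterns) # prints 4
```

Because the pattern half of the score uses the Hash length, a pattern that matches several times in one phrase still contributes a single point.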
91 changes: 91 additions & 0 deletions app/models/detector/features.rb
@@ -0,0 +1,91 @@
# frozen_string_literal: true

class Detector
class Features
attr_reader :features, :patterns, :summary

# Citation patterns are regular expressions which attempt to identify structures that are part of many citations.
# This object is used as part of the pattern_checker method. Some of these patterns may get promoted to the Detector
# model if they prove useful beyond a Citation context.
CITATION_PATTERNS = {
apa_volume_issue: /\d+\(\d+\)/,
no: /no\.\s\d+/,
pages: /\d+-+\d+/,
pp: /pp\.\s\d+/,
vol: /vol\.\s\d+/,
year_parens: /\(\d{4}\)/,
brackets: /\[.*?\]/,
lastnames: /[A-Z][a-z]+[.,]/,
quotes: /".*?"/
}.freeze

def initialize(phrase)
@features = {}
@patterns = {}
@summary = {}
pattern_checker(phrase)
summarize(phrase)
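# @features holds a count for every pattern (including zeroes) merged with the summary counts.
@features = @patterns.transform_values(&:length).merge(summary)
# @patterns keeps only the patterns that actually matched.
@patterns.delete_if { |_, v| v.empty? }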
end

private

# This calculates the number of characters in the search phrase. It is called by the summarize method.
def characters(phrase)
phrase.length
end

# This counts the number of colons that appear in the search phrase, because they tend to appear more often in
# citations than in other searches. It is called by the summarize method.
def colons(phrase)
phrase.count(':')
end

# This counts the number of commas in the search phrase. It is called by the summarize method.
def commas(phrase)
phrase.count(',')
end

# This builds one of the two main components of the Citation detector - the patterns report. It uses each of the
# regular expressions in the CITATION_PATTERNS constant, extracting all matches using the scan method.
#
# @return hash
def pattern_checker(phrase)
CITATION_PATTERNS.each_pair do |type, pattern|
@patterns[type.to_sym] = scan(pattern, phrase)
end
end

# This counts the number of periods in the search phrase. It is called by the summarize method.
def periods(phrase)
phrase.count('.')
end

# This is a convenience method for the scan method, which is used by pattern_checker.
def scan(pattern, phrase)
phrase.scan(pattern).map(&:strip)
end

# This counts the semicolons in the search phrase. It is called by the summarize method.
def semicolons(phrase)
phrase.count(';')
end

# This builds one of the two main components of the Citation detector - the summary report. It calls each of the
# methods in the first line - which all return integers - and puts the result as a key-value pair in the @summary
# instance variable.
#
# @return hash
def summarize(phrase)
%w[characters colons commas periods semicolons words].each do |check|
@summary[check.to_sym] = send(check, phrase)
end
end

# This counts the number of words in the search phrase. It is called by the summarize method.
def words(phrase)
phrase.split.length
end
end
end
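
To illustrate the difference between the two Hashes built in the initializer above, the sketch below uses plain Ruby (no Rails helpers) and a reduced set of the patterns. The names `EXAMPLE_PATTERNS` and `example_features` are hypothetical; the point is only that the features Hash keeps a zero for a pattern that did not match, while the patterns Hash drops it.

```ruby
# A reduced copy of three CITATION_PATTERNS entries, for illustration.
EXAMPLE_PATTERNS = {
  apa_volume_issue: /\d+\(\d+\)/,
  pages: /\d+-+\d+/,
  year_parens: /\(\d{4}\)/
}.freeze

# Returns [features, patterns]: numeric counts for every check, and only the non-empty matches.
def example_features(phrase)
  matches = EXAMPLE_PATTERNS.transform_values { |pattern| phrase.scan(pattern).map(&:strip) }
  summary = { commas: phrase.count(','), periods: phrase.count('.'), words: phrase.split.length }
  features = matches.transform_values(&:length).merge(summary)
  patterns = matches.reject { |_, found| found.empty? }
  [features, patterns]
end

features, patterns = example_features('Smith, J. (2019). Title. Journal, 12(3).')
p features[:pages]      # 0, because no page range appears in the phrase
p patterns.key?(:pages) # false, because empty match lists are dropped
p patterns[:year_parens]
```

Keeping the zero counts gives a fixed set of numeric keys for every phrase, which is presumably what a prediction service would expect as input.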
97 changes: 97 additions & 0 deletions app/models/lookup_citation.rb
@@ -0,0 +1,97 @@
# frozen_string_literal: true

class LookupCitation
# The info method reports whether a given phrase is a citation. It consults an external lambda service (address in
# env) and returns true or false. If anything goes wrong (a missing environment variable or a non-200 response), it
# returns nil.
#
# @return Boolean or nil
def info(phrase)
return unless expected_env?

external_data = fetch(phrase)
return if external_data == 'Error'

external_data
end

private

def lambda_path
ENV.fetch('DETECTOR_LAMBDA_PATH', nil)
end

def lambda_secret
ENV.fetch('DETECTOR_LAMBDA_CHALLENGE_SECRET', nil)
end

def lambda_url
ENV.fetch('DETECTOR_LAMBDA_URL', nil)
end

# define_lambda builds the Faraday connection used to reach the detector lambda. No request is sent at this point.
#
# @return Faraday connection
def define_lambda
Faraday.new(
url: lambda_url,
params: {}
)
end

# define_payload defines the Hash that will be sent to the lambda.
#
# @return Hash
def define_payload(phrase)
{
action: 'predict',
features: extract_features(phrase),
challenge_secret: lambda_secret
}
end

# expected_env? confirms that all three required environment variables are defined.
#
# @return Boolean
def expected_env?
Rails.logger.error('No lambda URL defined') if lambda_url.nil?

Rails.logger.error('No lambda path defined') if lambda_path.nil?

Rails.logger.error('No lambda secret defined') if lambda_secret.nil?

[lambda_url, lambda_path, lambda_secret].all?(&:present?)
end

# extract_features passes the search phrase through the citation detector, and massages the resulting features object
# to correspond with what the lambda expects.
#
# @return Hash
def extract_features(phrase)
features = Detector::Citation.new(phrase).features
features[:apa] = features.delete :apa_volume_issue
features[:year] = features.delete :year_parens
features.delete :characters
features
end

# Fetch handles the communication with the detector lambda: defining the connection, building the payload, and any
# error handling with the response.
#
# @return Boolean or 'Error'
def fetch(phrase)
lambda = define_lambda
payload = define_payload(phrase)

response = lambda.post(lambda_path, payload.to_json)

if response.status == 200
JSON.parse(response.body)['response'] == 'true'
else
# response.body is an unparsed String here, so log it whole rather than indexing it like a Hash.
Rails.logger.error(response.body)

'Error'
end
end
end
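
The payload construction and response handling above can be exercised without a network call. The sketch below is a plain-Ruby approximation under a few assumptions: `lambda_features`, `build_payload`, and `parse_prediction` are hypothetical names, the key renames mirror `extract_features`, and the response shape (`{"response": "true"}`) is taken from the `fetch` method. Returning nil for an unparseable body is an added safeguard, not behavior from the PR.

```ruby
require 'json'

# Renames feature keys the way extract_features does, working on a copy of the input.
def lambda_features(features)
  renamed = features.dup
  renamed[:apa] = renamed.delete(:apa_volume_issue)
  renamed[:year] = renamed.delete(:year_parens)
  renamed.delete(:characters)
  renamed
end

# Builds the Hash that would be serialized and posted to the lambda.
def build_payload(features, secret)
  { action: 'predict', features: lambda_features(features), challenge_secret: secret }
end

# Interprets a response body: true or false for a JSON object, nil when the body is not usable.
def parse_prediction(body)
  parsed = JSON.parse(body)
  return nil unless parsed.is_a?(Hash)

  parsed['response'] == 'true'
rescue JSON::ParserError
  nil
end

payload = build_payload({ apa_volume_issue: 1, year_parens: 0, characters: 40, commas: 2 }, 'secret_phrase')
puts payload.to_json
p parse_prediction('{"response":"true"}')
p parse_prediction('<html>Bad Gateway</html>')
```

Separating the parsing step from the HTTP call makes the true, false, and failure cases straightforward to cover in unit tests.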