adding Luxembourg administrative sanctions #849

bgmello · 2024-05-14T20:20:37Z

Fixes opensanctions/crawler-planning#157

bgmello · 2024-05-16T13:45:02Z

There are some sanctions where the title doesn't mention the company, such as "Sanction administrative prononcée à l’encontre d’un réviseur d’entreprises agréé" (Administrative sanction pronounced against an approved company auditor). We are currently ignoring those.

There also appears to be a bug where the lookup doesn't recognize 2 titles:

Sanction administrative prononcée à l’encontre de l’établissement de crédit BEMO EUROPE – Banque Privée S.A.
Sanctions et mesures administratives prononcées à l’encontre de quatorze professionnels agréés ou enregistrés soumis à la surveillance de la Commission de Surveillance du Secteur Financier (la « CSSF ») en matière de lutte contre le blanchiment et le financement du terrorisme

Thus they raise warnings even though they are already in the YAML.

jbothma

Thanks! Some potential improvements

jbothma · 2024-05-20T14:52:49Z

datasets/lu/administrative_sanctions/crawler.py

+    subtitle = (
+        response.find('.//*[@class="single-news__subtitle"]').text_content().strip()
+    )
+    res = context.lookup("subtitle_to_names", subtitle)
+    if res:
+        names = cast("List[str]", res.names)
+    else:
+        context.log.warning("Can't find the name of the company", text=subtitle)
+        names = [subtitle]
+
+    # If the subtitle doesn't contain any names
+    if not names or len(names) == 0:
+        return


Instead of maintaining all items in the YAML, I think we can come up with an algorithm that's fairly safe against accidental nonsense names:

strip r"(Sanction|amende) administrative prononcée à l’encontre d[eu] (gestionnaire de fonds d’investissement)?" from the start

if the rest starts with upper case, that looks like a name, so use it

else if there's a match in lookups, use the values from that

else use the rest as a name and warn to check
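The steps above could be sketched roughly as follows. This is a minimal illustration, not the code that was merged: the function name `names_from_subtitle`, the plain-dict `lookups` argument, the `re.IGNORECASE` flag, and the company name "Example Bank S.A." are all assumptions made for the example. One deliberate choice in the sketch: the upper-case rule is applied only when the prefix actually matched, since otherwise every unmatched subtitle (which itself starts with a capital "S") would be treated as a name.

```python
import re
from typing import Dict, List, Tuple

# Prefix from the review comment. IGNORECASE and the trailing \s* are
# assumptions added for this sketch.
PREFIX = re.compile(
    r"^(Sanction|amende) administrative prononcée à l’encontre d[eu] "
    r"(gestionnaire de fonds d’investissement)?\s*",
    re.IGNORECASE,
)


def names_from_subtitle(
    subtitle: str, lookups: Dict[str, List[str]]
) -> Tuple[List[str], bool]:
    """Return (names, needs_review) for one subtitle.

    `lookups` is a hypothetical stand-in for the YAML lookup table,
    keyed by the full subtitle text.
    """
    subtitle = subtitle.strip()
    match = PREFIX.match(subtitle)
    rest = subtitle[match.end():].strip() if match else subtitle

    # 1. Prefix stripped and the remainder starts with upper case:
    #    that looks like a name, so use it.
    if match and rest[:1].isupper():
        return [rest], False

    # 2. Otherwise fall back to a manually maintained lookup.
    if subtitle in lookups:
        return lookups[subtitle], False

    # 3. Otherwise use the remainder as a name and flag it for review.
    return [rest], True
```

A caller would log a warning whenever the second element of the result is `True`, so that a maintainer can add the subtitle to the lookups.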

I added more rules to the regex and was able to get more names, but there were still some that required lookups.

yeah totally fine - it's just good to automatically handle the common cases so we don't literally have to add each and every record.

I realise now, the instances where they don't mention a name, we'll have to manually pull the name from PDF. How about the following logic to do a second lookup:

if not res.names:
    pdf_res = lookups(detail_url)
    if pdf_res is None:
        # warn
    names = pdf_res.names
else:
    names = res.names
for name in names:
    ...

The first lookup handles the case where we can't parse a name from the subtitle.
The second lookup happens only if the first lookup gave us an empty list; then we use the detail URL to look up a name that someone found in the PDF and added, for cases like these: https://github.com/opensanctions/opensanctions/pull/849/files#diff-31dfffcd8054c60130382e3389c526fa63b3e350581059737aaf4f8ab5c12834R55
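As a rough, self-contained sketch of that two-stage idea (not the merged implementation): the function name `resolve_names`, the two plain dicts standing in for the YAML lookups, the logger name, and the example URL are all hypothetical. An empty list from the first lookup is treated as "the subtitle names nobody", which triggers the second lookup keyed by the detail URL.

```python
import logging
from typing import Dict, List

log = logging.getLogger("lu_sanctions_sketch")  # hypothetical logger name


def resolve_names(
    subtitle: str,
    detail_url: str,
    subtitle_lookup: Dict[str, List[str]],
    pdf_lookup: Dict[str, List[str]],
) -> List[str]:
    """Resolve entity names using two lookups in sequence."""
    names = subtitle_lookup.get(subtitle)
    if names is None:
        # No entry at all: warn and fall back to the raw subtitle,
        # as the crawler excerpt earlier in the thread does.
        log.warning("Can't find the name of the company: %s", subtitle)
        return [subtitle]
    if names:
        # The first lookup produced names; nothing more to do.
        return names

    # Empty list: the subtitle is known to contain no name, so check
    # whether someone recorded a name taken from the PDF.
    pdf_names = pdf_lookup.get(detail_url)
    if pdf_names is None:
        log.warning("No name recorded for PDF at %s", detail_url)
        return []
    return pdf_names
```

Keying the second lookup by URL rather than by subtitle avoids assigning the same name to two different sanctions that happen to share a generic subtitle.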

I don't want to assume that we can give the name AIFM Law for Sanction administrative prononcée à l’encontre d’un gestionnaire de fonds d’investissement - it's totally conceivable that they'd use the same string again for a different sanction, right?

I checked all links and could only find 3 with a company name in the PDF. From what I understood, AIFM Law is not a name of a company, it's the Alternative Investment Fund Managers Law.

lol shows how well I read that. sorry. But this, looks good!

jbothma · 2024-05-22T16:36:49Z

Thanks! looking good! Just a couple of changes then we're there

bgmello added 2 commits May 14, 2024 17:19

adding Luxembourg administrative sanctions

a641cd1

adding new line at the end of YAML

2c00d22

bgmello marked this pull request as draft May 16, 2024 13:41

jbothma reviewed May 20, 2024

bgmello and others added 5 commits May 21, 2024 13:21

Apply suggestions from code review

1993701

Co-authored-by: JD Bothma <[email protected]>

Passing list of names in name property of entity

7bd6af5

Improving date parser

7a479b3

Changing spacing to four spaces in publisher description

55e2f9c

Improving name parser by using REGEX

1d6f09f

bgmello added 3 commits May 23, 2024 09:06

Using only listing page to crawl data

8c1e914

Changing topic to reg.warn

bfc6ede

Improving lookup by checking names on PDFs

2fd3a3c

jbothma merged commit 19e9606 into opensanctions:main May 24, 2024
