Detected in EXPENSE_ROW but not as ITEM #385

Open
arsher-b opened this issue Aug 9, 2024 · 1 comment

Comments

arsher-b commented Aug 9, 2024

The expected item was not parsed as an ITEM, even though its text appears in the EXPENSE_ROW. We expected “SURF CB 6+1” to be parsed as the ITEM so that it could be recognized as the item name.

image

Receipt Used:
photo_2024-08-06_21-27-04

@athewsey
Contributor

Unless I'm mistaken, this is an issue with the detection in Amazon Textract itself, rather than with the processing in the Textractor library, right?

It looks like the preceding record should've captured the whole "Zonrox\nLemon 1000ml", but instead the newline caused Textract to treat "Lemon 1000ml" as the item for the following record. I tentatively think fixing this behaviour is outside the scope of the Textractor library, because it'd probably require fine-tuning a receipt-parsing ML model.

Today, Amazon Textract's expense model doesn't support fine-tuning - so I think the question is whether these kinds of errors are common enough to be worth you A) documenting examples & raising them as a support case with the service team, and/or B) implementing a post-processing model with open-source tooling such as Hugging Face, to try and correct the results?
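
As a rough starting point for option A (collecting examples), here is a minimal sketch of a rule-based check, not the ML post-processing model suggested above. It flags line items whose EXPENSE_ROW text contains tokens that no other field (ITEM, PRICE, and so on) accounts for. It assumes the raw AnalyzeExpense response dict (`ExpenseDocuments` → `LineItemGroups` → `LineItems` → `LineItemExpenseFields`); the helper names and the sample values are hypothetical and only mirror the shape of this report.

```python
def leftover_row_text(line_item):
    """Return EXPENSE_ROW tokens not covered by any other field of the line item."""
    row_tokens = []
    covered = set()
    for field in line_item.get("LineItemExpenseFields", []):
        field_type = field.get("Type", {}).get("Text", "")
        text = field.get("ValueDetection", {}).get("Text", "")
        tokens = text.split()  # also splits on the newlines inside a row
        if field_type == "EXPENSE_ROW":
            row_tokens = tokens
        else:
            covered.update(tokens)
    return [token for token in row_tokens if token not in covered]


def flag_suspect_line_items(response):
    """Collect the unexplained EXPENSE_ROW text of every suspicious line item."""
    suspects = []
    for document in response.get("ExpenseDocuments", []):
        for group in document.get("LineItemGroups", []):
            for line_item in group.get("LineItems", []):
                leftover = leftover_row_text(line_item)
                if leftover:
                    suspects.append(" ".join(leftover))
    return suspects


def make_field(field_type, text):
    """Hypothetical helper: build one expense field in the assumed response shape."""
    return {"Type": {"Text": field_type}, "ValueDetection": {"Text": text}}


# Made-up sample: the prices are invented, only the item texts come from the report.
sample = {
    "ExpenseDocuments": [{
        "LineItemGroups": [{
            "LineItems": [
                {"LineItemExpenseFields": [
                    make_field("ITEM", "Zonrox"),
                    make_field("PRICE", "25.00"),
                    make_field("EXPENSE_ROW", "Zonrox 25.00"),
                ]},
                {"LineItemExpenseFields": [
                    make_field("ITEM", "Lemon 1000ml"),
                    make_field("PRICE", "12.00"),
                    make_field("EXPENSE_ROW", "Lemon 1000ml\nSURF CB 6+1 12.00"),
                ]},
            ]
        }]
    }]
}

print(flag_suspect_line_items(sample))
```

A check like this only surfaces candidates to document; it does not decide which text is the true item name, so whether it helps depends on how the EXPENSE_ROW text actually looks in the affected receipts.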
