Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent parsing of narration field with respect to generating tags #1557

Open
mondjef opened this issue Feb 26, 2023 · 5 comments
Open

Comments

@mondjef
Copy link

mondjef commented Feb 26, 2023

In my csv importer, if I import the following record:
02/18/2023,RETAILER #89 TORONTO ON,92.78,

I get the following entry..

2023-02-18 * "RETAILER #89 TORONTO, ON"
  Expenses:Someaccount
  Assets:BankAccount                       -92.78 CAD

Which is ok...however if I import the same record and add a tag (or make any other edit) to the narration field during the fava import process things change. For example if I add the tag '#test' I now get the following:

2023-02-18 * "RETAILER NEPEAN ON" #89 #test
  Expenses:Someaccount
  Assets:BankAccount                       -92.78 CAD

What I would like is that the #89 (store number/identifier) to not be parsed as a tag but I get that the fava process does not know what is a real tag versus one that should be not considered a tag. I had thought maybe I could circumvent this by simply escaping the # but this is not a great way to do it. Using \ results in \ in my beancount file which then has downstream consequences and I also noticed that inserting a single space after the # also prevents it being parsed as a tag but again it does leave the narrative intact as read from csv which is what I would like.

I think the fava process that parses this string should only parse and collect tags starting from the end of the string and stop when it encounters a non-tag token. This would prevent the issues I am noticing and would also make consistent regardless if the narrative text was modified or not.

@mondjef
Copy link
Author

mondjef commented Feb 26, 2023

ok I believe I found where the logic is taking place which confirms my observed behaviour...just don't know how to fix it (of if it should be fixed...). Currently the only way I can think of escaping or avoiding my issue is in my csv importer is to have a custom extract that strips these (or modifies in other ways) prior to it getting to fava which is less then ideal IMO.

https://github.com/beancount/fava/tree/main/frontend/src/entry-forms)/Transaction.svelte

lines 48-61

  /// Extract tags and links that can be provided in the narration <input>.
  function onNarrationChange({
    currentTarget,
  }: {
    currentTarget: HTMLInputElement;
  }) {
    const { value } = currentTarget;
    entry.tags = [...value.matchAll(TAGS_RE)].map((a) => a[1] ?? "");
    entry.links = [...value.matchAll(LINKS_RE)].map((a) => a[1] ?? "");
    entry.narration = value
      .replaceAll(TAGS_RE, "")
      .replaceAll(LINKS_RE, "")
      .trim();
  }

@mondjef
Copy link
Author

mondjef commented Feb 27, 2023

I have tried to implement a workaround for this in my csv importers as I am already overriding the extract function their.

I have the following test code where I let the base extract method parse the entries then I loop over the entries and do some matching to then handle in a way that eliminates the issues I am having by putting the establishment reference numbers into a metadata tag and then remove from the narration field. The issue I am having is nothing changes in the narration field when fava displays it. Does anyone know what I might be missing to get this to work? It is like fava is capturing its own narration field and can't change it from my csv importer.

        entries = super().extract(file, existing_entries)
        
        def fix_store_numbers(narration):
            if not narration:
                return (None, narration)
            
            store_num = re.search(r'(?:^|\s)(#[A-Za-z0-9\-_/.]+)\b', narration)
            
            if not store_num:
            	return (None, narration)
            store_num = store_num.group(1)
            # remove store number from narration string
            narration = re.sub(r'(?:^|\s)(#[A-Za-z0-9\-_/.]+)(:?\b)', '', narration, 1)
            # cleanup narration by compressing spaces and stripping spaces
            narration = re.sub(r'\s\s+', ' ', narration).strip()
            
            return (store_num, narration)

        for entry in entries:
            if isinstance(entry, Transaction):
                store_num, new_narration = fix_store_numbers(entry.narration)
                if store_num:
                    entry.meta['store number'] = store_num
                    entry._replace(narration = new_narration)

        return entries

@yagebu
Copy link
Member

yagebu commented May 26, 2023

I think the fava process that parses this string should only parse and collect tags starting from the end of the string and stop when it encounters a non-tag token.

Sounds good - PR welcome :)

The issue I am having is nothing changes in the narration field when fava displays it. Does anyone know what I might be missing to get this to work?

The issue is that you never modify any of the entries ;) The line entry._replace(narration = new_narration) creates a copy of entry but leaves entry untouched - you'd need to replace it in the list entries as well.

@mondjef
Copy link
Author

mondjef commented May 27, 2023

thank you,

I'd gladly do a PR but Java is not my swim lane at all and I am afraid to admit that I don't know how to do a PR.... I have been learning Python on my own and have git on the list to also get familiar with so I can be more useful for these kinds of things but have not gotten around to it yet.

I have some ideas for a fix though....either 1) we merge the tags and links regexs into a single regex but this loses some flexibility and not sure if this would impact other things or 2) keep the tag and link patterns separated but strip them down to the essential and then dynamically combine them in the final regex that does the matching and removal.

option 2 would be something like...

const TAGS_RE = \s#([A-Za-z0-9\-_/.]+);
const LINKS_RE = \s\^([A-Za-z0-9\-_/.]+);

  /// Extract tags and links that can be provided in the narration <input>.
  function onNarrationChange({
    currentTarget,
  }: {
    currentTarget: HTMLInputElement;
  }) {
    const { value } = currentTarget;
    matches = [...value.match(/(?:+TAGS_RE+|+LINK_RE)\s*$/
    entry.tags = [...matches.matchAll(TAGS_RE)].map((a) => a[1] ?? "");
    entry.links = [...matches.matchAll(LINKS_RE)].map((a) => a[1] ?? "");
    entry.narration = value
      .replaceFirst(/(?:+TAGS_RE+|+LINK_RE)\s*$/, "").trim(), entry);
  }

@mondjef
Copy link
Author

mondjef commented Jul 2, 2023

posting this for others in case they to need a way to sanitize the narration field for descriptions with store numbers in them to avoid them becoming tags. Below is my current working solution in the csv importer where I over-ride the extract function and use some regex's to detect and remove store numbers. I chose to store these store numbers as metadata as to not loose this information but certainly not necessary. This is a workaround for the time being until maybe a solution is implemented for more precise parsing of tags and links from the narration field.


    def extract(self, file, existing_entries=None):
        """
        Function used to retrieve orignal csv data to populate __source_ metadata field
        """
        def get_source_line(file, lineno):
            skip_lines = self.skip_lines
            row = 'Unavailable'
            iconfig, has_header = csv_importer.normalize_config(self.config, file.head(), self.csv_dialect, self.skip_lines)

            with open(file.name, encoding=self.encoding) as csvfile:
                reader = csv.reader(csvfile, dialect=self.csv_dialect)
                if has_header:
                    skip_lines =+ 1

                row = ','.join(next(itertools.islice(reader,(lineno - 1) + skip_lines, lineno + skip_lines)))

            return row

        mapped_account = self.file_account(file)
        entries = super().extract(file, existing_entries)

        def fix_store_numbers(narration):
            if not narration:
                return (None, narration)

            store_num = re.search(r'(?:#\s*)([A-Za-z0-9\-_/.]+)\b', narration)

            if not store_num:
            	return (None, narration)

            store_num = store_num.group(1)
            narration = re.sub(r'(#\s*[A-Za-z0-9\-_/.]+)(?:\b)', ' ', narration, 1)
            narration = re.sub(r'\s\s+', ' ', narration).strip()

            return (store_num, narration)

        for idx, entry in enumerate(entries):
            if isinstance(entry, Transaction):
                src = get_source_line(file, entry.meta['lineno'])

                entry.meta['ingestfingerprint'] = hashlib.md5(','.join([src, mapped_account, self.institution]).encode('utf-8')).hexdigest()
                entry.meta['__source__'] = src

                store_num, new_narration = fix_store_numbers(entry.narration)

                if store_num:
                    entry.meta['store_number'] = store_num
                    entry = entry._replace(**{'narration': new_narration})
                entries[idx] = entry

        return entries

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants