Invalid JSON generated during the categorization process #9

Open
nasuka opened this issue Jan 9, 2025 · 0 comments

nasuka commented Jan 9, 2025

I encountered an issue where the JSON output gets truncated partway when running sensemaking on a file that contains long Japanese text. Here is a snippet of the truncated output:

[
  {
    "id": "1870821933711040512",
    "topics": [
      {
        "name": "M1 Grand Prix",
        "subtopics": [
          {
            "name": "General Appreciation"
          }
        ]
      }
    ]
  },
  {
    "id": "1870821933648163328",
    "topics": [
      {
        "name": "M1 Grand Prix",
        "subtopics": [
          {
            "name": "Performance Analysis"
          }
        ]
      }
    ]
  },
...(many rows)..
  {
    "id": "1870821922046710016",
    "topics": [
      {
        "name":

It appears that categorizationBatchSize is currently fixed at 100. With that many comments in a single batch, the model may exceed its output token limit, especially for languages like Japanese that consume more tokens or for comments that are very long.
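One way to confirm that a response was cut off, rather than malformed in some other way, is to try parsing it. The following is a minimal sketch, assuming the raw model response is available as a string; `isParsableJson` is a hypothetical helper name and not part of the library.

```typescript
// Hypothetical helper (not part of the library): reports whether a raw
// model response is complete, parsable JSON.
function isParsableJson(raw: string): boolean {
  try {
    JSON.parse(raw);
    return true;
  } catch {
    // JSON.parse throws a SyntaxError on truncated or malformed input.
    return false;
  }
}

// A response cut off mid-object, shaped like the snippet above.
const truncated = '[{"id": "1870821922046710016", "topics": [{"name":';
// A complete response with the same shape.
const complete = '[{"id": "1870821933711040512", "topics": []}]';

console.log(isParsableJson(truncated)); // false
console.log(isParsableJson(complete)); // true
```

A check like this only detects the problem. It does not recover the missing rows, so the batch would still need to be rerun with fewer comments.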

Proposed Solution
It would be helpful if categorizationBatchSize could be passed as a parameter upon invocation, so users can adjust it according to their language or the size of their dataset. This way, we can avoid hitting the model’s output token limit and prevent truncated JSON outputs.
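As a rough illustration of the request, the batch size could be read from an options object and fall back to the current value of 100. This is only a sketch: `CategorizeOptions`, `toBatches`, and `DEFAULT_CATEGORIZATION_BATCH_SIZE` are hypothetical names, and the library's real batching code may look quite different.

```typescript
// Hypothetical sketch: only the name categorizationBatchSize and its
// current value of 100 come from this issue; everything else is assumed.
interface CategorizeOptions {
  categorizationBatchSize?: number;
}

const DEFAULT_CATEGORIZATION_BATCH_SIZE = 100;

// Splits items into consecutive batches of at most the configured size.
function toBatches<T>(items: T[], options: CategorizeOptions = {}): T[][] {
  const size =
    options.categorizationBatchSize ?? DEFAULT_CATEGORIZATION_BATCH_SIZE;
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("categorizationBatchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const comments = Array.from({ length: 250 }, (_, i) => `comment ${i}`);

// Default behaviour is unchanged: batches of 100, 100, and 50.
console.log(toBatches(comments).length); // 3

// A smaller batch size keeps each response shorter.
console.log(toBatches(comments, { categorizationBatchSize: 20 }).length); // 13
```

Keeping 100 as the default would leave existing callers unaffected, while users with long or token-heavy comments could pass a smaller value.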

Would it be possible to make categorizationBatchSize configurable? If you have any suggestions or alternative approaches, I'd be happy to hear them. Thank you in advance!
