Lark can't match µ character even though it is defined in the input #1478

Open
adi12345-crypto opened this issue Oct 21, 2024 · 2 comments

Comments

@adi12345-crypto

I am trying to write a Lark parser that extracts parts of text of the form `number unit`, e.g. 10 grams or 10gm. I am trying to parse the following input:
"10 μl"
I get the error
`10 μl
^
Expected one of:
* DECIMAL_HAHNEMANNIAN
* WEIGHT_GM
* DOSIS_TABLET
* WEIGHT_MG
* LENGTH_CM
* CELL_CULTURE_INFECTIOUS_DOSE
* MILLION_COLONY_FORMING_UNIT
* WEIGHT_NG
* SEPARATOR
* MICROKATAL
* SPEYWOOD_UNIT
* DOSIS_BLISTER
* DOSIS_GENERATOR
* VOLUME_L
* EFFECTIVE_RESPONSE_50
* EFFECTIVE_RESPONSE_60
* WEIGHT_MCG_BASE
* TISSUE_CULTURE_INFECTIOUS_DOSE
* SPACES
* VOLUME_ML
* HIGH_ACTIVATION_UNIT
* ANTIGENIC_UNIT
* LOG_ELISA_UNIT
* EFFECTIVE_RESPONSE_25
* MILI_EQUIVALENTS_UNIT
* BECQUEREL
* BIOEQUIVALENT_ALLERGY_UNIT
* DOSIS_SRT
* HOMEOPATHIC_POTENCY_C_UNIT
* IU_MIU
* DOSIS_GUM
* MOLE_MCMOL
* TIME_MIN
* GIGA_BECQUEREL
* HAEMAGGLUTINATION_INHIBITION_UNIT
* PERCENT
* DOSIS_BAG
* DOSIS_PACK
* VECTOR_GENOME_UNIT
* ANTI_XA_UNIT
* ALLERGAN_UNIT
* PROTEIN_NITROGEN_UNIT
* PERCENT_WEIGHT_PER_WEIGHT
* IU_IU
* TIME_DAY
* MOLE_MOL
* MILLION_CELL
* HOMEOPATHIC_POTENCY_X_UNIT
* DOSIS_CAPSULE
* MICROCURIE
* DOSIS_DOS
* D_ANTEGEN_UNIT
* GALACTOSIDASE_UNIT
* DOSIS_CONTAINER
* WEIGHT_PG
* KATAL
* WEIGHT_LBS
* VOLUME_DROP
* GERMS_UNIT
* CELL
* ANTIBODY_MICRO_AGGLUTINATION_LYTIC_REACTION_UNIT
* LOG_TISSUE_CULTURE_INFECTIOUS_DOSE
* TIME_HOUR
* HOMEOPATHIC_POTENCY_M_UNIT
* WEIGHT_MG_BASE
* LENGTH_MM
* OOCYST_UNIT
* RELATIVE_POTENCY_UNIT
* VOLUME_MCL
* KALLIDINOGENASE_INACTIVATOR_UNIT
* IU_KIU
* PERCENT_VOLUME_PER_VOLUME
* KALLIKREIN_INACTIVATOR_UNIT
* PERCENT_WEIGHT_PER_VOLUME
* LIMIT_OF_FLOCCULATION_UNIT
* EFFECTIVE_RESPONSE_120
* WEIGHT_TON
* WEIGHT_MCG
* LOG_CELL_CULTURE_INFECTIOUS_DOSE
* PARTS_PER_MILLION_UNIT
* TCP_UNIT
* HOMEOPATHIC_POTENCY_Q_UNIT
* DOSIS_KIT
* KILO_BECQUEREL
* WEIGHT_KGS
* TUBERCULIN_UNIT
* MILICURIE
* LOG_HAEMAGGLUTINATION_INHIBITION_UNIT
* DOSIS_LOZENGE
* RANGE_SEPARATOR
* MEGA_BECQUEREL
* USP_UNIT
* FLUORESCENT_FOCUS_UNIT
* DOSIS_SACHET
* DOSIS_BOTTLE
* EQUIVALENTS_UNIT
* ELISA_UNIT
* PERCENT_VOLUME_PER_WEIGHT
* DOSIS_VIAL
* SPORULATED_OOCYST_UNIT
* COLONY_FORMING_UNIT
* PULSATION_UNIT
* MOLE_MMOL
* CURIE
* AREA_CM_SQ
* DOSIS_STRIP
* BILLION_COLONY_FORMING_UNIT
* HOMEOPATHIC_POTENCY_UNIT
* PLAQUE_FORMING_UNIT
* WEIGHT_GM_BASE
* LOG_COLONY_FORMING_UNIT
* MOLE_NMOL
* DOSIS_SYRINGE
* DOSIS_CARTRIDGE
* EFFECTIVE_RESPONSE_70
* DOSIS_ACT
* DOSIS_CYLINDER
* LOG_PLAQUE_FORMING_UNIT
* VOLUME_GAL
* DOSIS_PIECE

None`

My token VOLUME_MCL is defined as:
VOLUME_MCL: "mikroliter" | "microl" | "mcl" | "µl" | "ul"
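One thing worth ruling out here is a code point mismatch. Unicode has two near-identical characters for this prefix, MICRO SIGN (U+00B5) and GREEK SMALL LETTER MU (U+03BC), and a string literal in a grammar matches only the exact code point it was typed with. The sketch below is a hedged diagnostic, not a confirmed diagnosis of this issue: it assumes the grammar literal uses one of the two and the input uses the other. The helper names `describe` and `normalize_micro` are hypothetical and are not part of Lark.

```python
import unicodedata

MICRO_SIGN = "\u00b5"  # MICRO SIGN
GREEK_MU = "\u03bc"    # GREEK SMALL LETTER MU, visually near-identical


def describe(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


def normalize_micro(text):
    """Map Greek mu to the micro sign so only one spelling reaches the parser.

    Assumption: the grammar literals use the micro sign. Swap the two
    constants if the grammar uses Greek mu instead.
    """
    return text.replace(GREEK_MU, MICRO_SIGN)


if __name__ == "__main__":
    sample = "10 " + GREEK_MU + "l"
    print(describe(sample))                   # shows which code point is present
    print(describe(normalize_micro(sample)))  # after normalization
```

Running `describe` on both the input string and the relevant line of the grammar file shows whether the two sides agree. If they do not, either normalize the input before parsing or list both spellings as alternatives in the terminal.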

@adi12345-crypto
Author

For reference, here is the grammar:

// define rules for strength and strength ranges when they appear alone and with additional text
start: strength_only | strength_range_only | strength_additional_text
| strength_range_additional_text | concatenated_strengths_only
| words_only

// define structure for rules with no additional text
strength_only.2: number unit | number unit? separator number? unit
concatenated_strengths_only.2: strength_only concatenator strength_only
strength_range_only.2: strength_only range_separator strength_only | number range_separator number unit
// define structure for rules with additional text
strength_additional_text: (skippable* strength_only skippable*)+
strength_range_additional_text: (skippable* strength_range_only skippable*)+
// define structure for words only
words_only: extended_word+

// define non-terminals units as all possible unit rules then for each unit rule specify units
unit: base_unit | base_unit separator base_unit
base_unit.2: weight_unit | limit_of_flocculation_unit | volume_unit | length_unit | mole_unit | iu_unit
| percent_unit | time_unit | percent_weight_volume | dosis_unit | colony_forming_unit
| kallikrein_inactivator_unit | plaque_forming_unit | becquerel_unit | cell_unit | area_unit
| curie_unit | tissue_culture_infectious_dose_unit | hahnemannian_unit | parts_per_million_unit
| vector_genome_unit | anti_xa_unit | cell_culture_infectious_dose_unit | equivalents_unit | d_antegen_unit
| elisa_unit | allergan_unit | effective_response_unit | relative_potency_unit | tcp_unit
| haemagglutination_inhibition_unit | oocyst_unit | high_activation_unit | antigenic_unit
| tuberculin_unit | antibody_micro_agglutination_lytic_reaction_unit | speywood_unit | galactosidase_unit
| pulsation_unit | katal_unit | germs_unit | kallidinogenase_inactivator_unit | usp_unit
| homeopathic_potency_unit | bioequivalent_allergy_unit | fluorescent_focus_unit | protein_nitrogen_unit

// define all weight units
weight_unit: WEIGHT_TON | WEIGHT_KGS | WEIGHT_LBS | WEIGHT_GM | WEIGHT_GM_BASE | WEIGHT_MG
| WEIGHT_MG_BASE | WEIGHT_MCG | WEIGHT_MCG_BASE | WEIGHT_NG | WEIGHT_PG
WEIGHT_TON: "tonelada" | "tons" | "ton" | "tne" | "mts"
WEIGHT_KGS: "kilograms" | "kilogramm" | "Kilogramm" | "kilogram" | "kilo" | "kgs" | "kgm" | "kga" | "kg"
WEIGHT_LBS: "lbs"
WEIGHT_GM: "gramm" | "grams" | "gramo" | "gram" | "gms" | "gm" | "g." | "g"
WEIGHT_GM_BASE: "g base"
WEIGHT_MG: "milligramm" | "mili.gram" | "miligramo" | "milig" | "mgs" | "mg"
WEIGHT_MG_BASE: "mg base"
WEIGHT_MCG: "micrpgrammes" | "microgrammes" | "microgramme"| "mikrogramm" | "microgramo"
| "mikrograma" | "mikrogramów" | "microgram" | "microg" | "mcg" | "µg" | "ug"
WEIGHT_MCG_BASE: "mcg base"
WEIGHT_NG: "nanogramm" | "ng"
WEIGHT_PG: "picogramm"

// define all volume units
volume_unit: VOLUME_GAL | VOLUME_L | VOLUME_DROP | VOLUME_ML | VOLUME_MCL
VOLUME_GAL: "gal"
VOLUME_L: "liter" | "l"
VOLUME_DROP: "drop"
VOLUME_ML: "millilitre" | "milliliter" | "ml"
VOLUME_MCL: "mikroliter" | "microl" | "mcl" | "µl" | "ul"

// define all length units
length_unit: LENGTH_MM | LENGTH_CM
LENGTH_MM: "millimeter" | "mm"
LENGTH_CM: "zentimeter" | "cm"

//define all mole units
mole_unit: MOLE_NMOL | MOLE_MMOL | MOLE_MCMOL | MOLE_MOL
MOLE_NMOL: "nanomol" | "nmol"
MOLE_MMOL: "millimol" | "mmole" | "mmol"
MOLE_MCMOL: "micromol" | "mcmol" | "µmol"
MOLE_MOL: "mole" | "mol"

// define all iu units
iu_unit: IU_IU | IU_KIU | IU_MIU
IU_IU: "internationale einheit(en)" | "internationale einheit" | "pressor units" | "unités" | "u.i."
| "i.u." | "i.e." | "unit" | "u." | "iu" | "[iu]" | "j.m." | "ie" | "ui" | "u"
IU_KIU: "kilo UI" | "kiu"
IU_MIU: "million international units (iu)" | "millones ui" | "millions ui" | "million i.e." | "mill. ui"
| "million u" | "m.ui" | "miu" | "mu"

// define all percent units
percent_unit: PERCENT
PERCENT: "porcentaje" | "porciento" | "%"

// define all time units
time_unit: TIME_MIN | TIME_HOUR | TIME_DAY
TIME_MIN: "min"
TIME_HOUR: "hour" | "h"
TIME_DAY: "day"

// define all percent weight volume units
percent_weight_volume: PERCENT_WEIGHT_PER_WEIGHT | PERCENT_WEIGHT_PER_VOLUME
| PERCENT_VOLUME_PER_VOLUME | PERCENT_VOLUME_PER_WEIGHT
PERCENT_WEIGHT_PER_WEIGHT: "% (w/w)" | "% w/w" | "% / w/w" | "%w/w" | "porcentaje peso/peso"
PERCENT_WEIGHT_PER_VOLUME: "% / w/v" | "% w/v" | "% (w/v)"
PERCENT_VOLUME_PER_VOLUME: "prozentgehalt volumen in volumen" | "% / v/v" | "% v/v" | "% (v/v)"
PERCENT_VOLUME_PER_WEIGHT: "Prozentgehalt Volumen in Masse"

// define all dosis units
dosis_unit: DOSIS_DOS | DOSIS_VIAL | DOSIS_BAG | DOSIS_BOTTLE | DOSIS_SACHET | DOSIS_SYRINGE| DOSIS_CONTAINER
| DOSIS_SRT | DOSIS_ACT | DOSIS_GUM | DOSIS_BLISTER | DOSIS_STRIP | DOSIS_CAPSULE
| DOSIS_KIT | DOSIS_TABLET | DOSIS_PACK | DOSIS_LOZENGE | DOSIS_CARTRIDGE | DOSIS_PIECE | DOSIS_GENERATOR
| DOSIS_CYLINDER
DOSIS_DOS: "dawkę odmierzoną" | "dos(es)" | "dose(s)" | "dosis" | "dose" | "dos"
DOSIS_VIAL: "flacon" | "fiolkę" | "vial"
DOSIS_BAG: "bag" | "zak"
DOSIS_BOTTLE: "bottle"
DOSIS_SACHET: "sachet"
DOSIS_SYRINGE: "syr"
DOSIS_CONTAINER: "container"
DOSIS_SRT: "srt"
DOSIS_ACT: "act"
DOSIS_GUM: "gum"
DOSIS_BLISTER: "blister"
DOSIS_STRIP: "strip" | "pasek"
DOSIS_CAPSULE: "capsules" | "capsule" | "cap"
DOSIS_KIT: "kit"
DOSIS_TABLET: "tablet" | "tab"
DOSIS_PACK: "pck"
DOSIS_LOZENGE: "loz"
DOSIS_CARTRIDGE: "cartridge"
DOSIS_PIECE: "stück" | "piece" | "stuk"
DOSIS_GENERATOR: "gen"
DOSIS_CYLINDER: "cylr"

// list all colony forming units
colony_forming_unit: COLONY_FORMING_UNIT | LOG_COLONY_FORMING_UNIT
| MILLION_COLONY_FORMING_UNIT | BILLION_COLONY_FORMING_UNIT
COLONY_FORMING_UNIT: "koloniebildende einheit(en)" | "cfu" | "[cfu]"
LOG_COLONY_FORMING_UNIT: "log10 cfu"
MILLION_COLONY_FORMING_UNIT: "millionkeime" | "million cfu"
BILLION_COLONY_FORMING_UNIT: "b"

// list all kallikrein inactivator units
kallikrein_inactivator_unit: KALLIKREIN_INACTIVATOR_UNIT
KALLIKREIN_INACTIVATOR_UNIT: "kallikrein-inhibitor-einheit" | "kui"

// list all plaque forming units
plaque_forming_unit: PLAQUE_FORMING_UNIT | LOG_PLAQUE_FORMING_UNIT
PLAQUE_FORMING_UNIT: "unidades formadoras de placa (ufp)" | "pfu" | "[pfu]"
LOG_PLAQUE_FORMING_UNIT: "log10 pfu"

// list all becquerel unit
becquerel_unit: BECQUEREL | KILO_BECQUEREL | MEGA_BECQUEREL | GIGA_BECQUEREL
BECQUEREL: "becquerel"
KILO_BECQUEREL: "kilobequerelio" | "kbq"
MEGA_BECQUEREL: "megabecquerelio" | "megabecquerel" | "mbq"
GIGA_BECQUEREL: "gigabecquerelios" | "gigabecquerel" | "gbq"

// list all cell unit
cell_unit: CELL | MILLION_CELL
CELL: "celulas" | "komorek" | "cellen" | "cells"
MILLION_CELL: "millions de cellules" | "million cells" | "mln komórek"

// list all area unit
area_unit: AREA_CM_SQ
AREA_CM_SQ: "quadratzentimeter" | "sq cm" | "cm2"

// list all curie unit
curie_unit: CURIE | MILICURIE | MICROCURIE
CURIE: "ci"
MILICURIE: "mci" | "millicurie"
MICROCURIE: "mcci" | "mikrocurie"

// list all tissue culture infectious dose unit
tissue_culture_infectious_dose_unit : TISSUE_CULTURE_INFECTIOUS_DOSE | LOG_TISSUE_CULTURE_INFECTIOUS_DOSE
TISSUE_CULTURE_INFECTIOUS_DOSE: "gewebekultur-infektiöse-dosis 50%" | "tcid50" | "[tcid_50]"
LOG_TISSUE_CULTURE_INFECTIOUS_DOSE: "log10 tcid50"

// list all hahnemannian units
hahnemannian_unit: DECIMAL_HAHNEMANNIAN
DECIMAL_HAHNEMANNIAN: "dh"

// list all ppm units
parts_per_million_unit: PARTS_PER_MILLION_UNIT
PARTS_PER_MILLION_UNIT: "ppm" | "[ppm]"

// list all vector_genome units
vector_genome_unit: VECTOR_GENOME_UNIT
VECTOR_GENOME_UNIT: "vg"

// list all anti xa unit
anti_xa_unit: ANTI_XA_UNIT
ANTI_XA_UNIT: "anti-blutgerinnungsfaktor xa aktivität" | "anti xa units" | "u.i. antixa" | "ui anti-xa"
| "ul anti-xa" | "anti-xa iu" | "anti-xa ui" | "ui antixa" | "anti-xa" | "j.m. a.xa"
| "unidades antigenicas"

//list all cell_culture_infectious_dose unit
cell_culture_infectious_dose_unit: CELL_CULTURE_INFECTIOUS_DOSE | LOG_CELL_CULTURE_INFECTIOUS_DOSE
CELL_CULTURE_INFECTIOUS_DOSE: "ccid50" | "cid50"| "[ccid_50]"
LOG_CELL_CULTURE_INFECTIOUS_DOSE: "log10 ccid50" | "log10 cid50"

// list all equivalents unit
equivalents_unit: EQUIVALENTS_UNIT | MILI_EQUIVALENTS_UNIT
EQUIVALENTS_UNIT: "eq"
MILI_EQUIVALENTS_UNIT: "meq"

// list all d antegen units
d_antegen_unit: D_ANTEGEN_UNIT
D_ANTEGEN_UNIT: "unités antigènes d" | "D-UNITS" | "[d'ag'u]" | "d-au" | "du"

// list all elisa unit
elisa_unit: ELISA_UNIT | LOG_ELISA_UNIT
ELISA_UNIT: "elisa unit" | "elisa u" | "u elisa"
LOG_ELISA_UNIT: "log10 elisa u"

// list all allergan unit
allergan_unit: ALLERGAN_UNIT
ALLERGAN_UNIT: "allergan-einheit"| "allergen-einheit" | "alergen units" | "allergan unit" | "Allergan units"
| "[au]" | "au"

// list all effective response
effective_response_unit: EFFECTIVE_RESPONSE_25 | EFFECTIVE_RESPONSE_50 | EFFECTIVE_RESPONSE_60 | EFFECTIVE_RESPONSE_70
| EFFECTIVE_RESPONSE_120
EFFECTIVE_RESPONSE_25: "% er25"
EFFECTIVE_RESPONSE_50: "% er50"
EFFECTIVE_RESPONSE_60: "% er60"
EFFECTIVE_RESPONSE_70: "% er70"
EFFECTIVE_RESPONSE_120: "% er120"

// list all relative potency unit
relative_potency_unit: RELATIVE_POTENCY_UNIT
RELATIVE_POTENCY_UNIT: "rp"

// list all tcp unit
tcp_unit: TCP_UNIT
TCP_UNIT: "tcp units"

// list all haemagglutination_inhibition_unit
haemagglutination_inhibition_unit: HAEMAGGLUTINATION_INHIBITION_UNIT | LOG_HAEMAGGLUTINATION_INHIBITION_UNIT
HAEMAGGLUTINATION_INHIBITION_UNIT: "hai" | "hiu"
LOG_HAEMAGGLUTINATION_INHIBITION_UNIT: "log10 hai" | "log10 hi" | "log10 hiu"

// list all oocysts unit
oocyst_unit: OOCYST_UNIT | SPORULATED_OOCYST_UNIT
OOCYST_UNIT: "oocysts"
SPORULATED_OOCYST_UNIT: "sporulated oocysts"

// list all high activation unit
high_activation_unit: HIGH_ACTIVATION_UNIT
HIGH_ACTIVATION_UNIT: "hau"

// list all antigenic unit
antigenic_unit: ANTIGENIC_UNIT
ANTIGENIC_UNIT: "antigenic units" | "unidades antigénicas"

// list all tuberculin_units
tuberculin_unit: TUBERCULIN_UNIT
TUBERCULIN_UNIT: "tu"

// list all antibody_micro_agglutination_lytic_reaction unit
antibody_micro_agglutination_lytic_reaction_unit: ANTIBODY_MICRO_AGGLUTINATION_LYTIC_REACTION_UNIT
ANTIBODY_MICRO_AGGLUTINATION_LYTIC_REACTION_UNIT: "alr"

// list all speywood unit
speywood_unit: SPEYWOOD_UNIT
SPEYWOOD_UNIT: "unités speywood" | "speywood units"

// list all galactosidase_units
galactosidase_unit: GALACTOSIDASE_UNIT
GALACTOSIDASE_UNIT: "galu"

// list all pulsation units
pulsation_unit: PULSATION_UNIT
PULSATION_UNIT: "pulsación"

// list all katal unit
katal_unit: KATAL | MICROKATAL
KATAL: "katal" | "katals"
MICROKATAL: "microkatals" | "mikrokatal"

// list all limit_of_flocculation_unit
limit_of_flocculation_unit: LIMIT_OF_FLOCCULATION_UNIT
LIMIT_OF_FLOCCULATION_UNIT: "lf"

// list all germs unit
germs_unit: GERMS_UNIT
GERMS_UNIT: "keime"

// list all kallidinogenase_inactivator_unit
kallidinogenase_inactivator_unit: KALLIDINOGENASE_INACTIVATOR_UNIT
KALLIDINOGENASE_INACTIVATOR_UNIT: "kallidinogenase-inaktivator-einheit"

// list all usp unit
usp_unit: USP_UNIT
USP_UNIT: "usp-einheiten" | "[usp'u]" | "[usp]"

homeopathic_potency_unit: HOMEOPATHIC_POTENCY_UNIT | HOMEOPATHIC_POTENCY_X_UNIT | HOMEOPATHIC_POTENCY_C_UNIT
| HOMEOPATHIC_POTENCY_M_UNIT | HOMEOPATHIC_POTENCY_Q_UNIT
HOMEOPATHIC_POTENCY_UNIT: "hp"
HOMEOPATHIC_POTENCY_X_UNIT: "[hp_x]"
HOMEOPATHIC_POTENCY_C_UNIT: "[hp_c]"
HOMEOPATHIC_POTENCY_M_UNIT: "[hp_m]"
HOMEOPATHIC_POTENCY_Q_UNIT: "[hp_q]"

// list all bioequivalent_allergy_units
bioequivalent_allergy_unit: BIOEQUIVALENT_ALLERGY_UNIT
BIOEQUIVALENT_ALLERGY_UNIT: "[bau]"

// list all fluorescent_focus_units
fluorescent_focus_unit: FLUORESCENT_FOCUS_UNIT
FLUORESCENT_FOCUS_UNIT: "[ffu]" | "ffu"

// list all protein_nitrogen_units
protein_nitrogen_unit: PROTEIN_NITROGEN_UNIT
PROTEIN_NITROGEN_UNIT: "[pnu]" | "pnu"

// list all helper terminals used to define strength rules
not_important: extended_word | ignorable_special_chars
skippable: not_important+
spaces: SPACES*
extended_word: spaces* EXTENDED_WORD spaces*
ignorable_special_chars: spaces* IGNORABLE_SPECIAL_CHARS spaces*
number.2: spaces* NUMBER spaces*
separator.2: spaces* SEPARATOR spaces*
concatenator.2: spaces* CONCATENATOR spaces*
range_separator.2: spaces* RANGE_SEPARATOR spaces*

// define general terminals
EXTENDED_WORD: /[a-zA-ZöäüéÖÄÜß]+/
IGNORABLE_SPECIAL_CHARS: /[,.;:\-\[\]()+&%*]/
SEPARATOR: "in" | "per" | "\\" | "/" | "per dose of"
CONCATENATOR: "+" | ","
RANGE_SEPARATOR: "-" | "–" | "to"
NUMBER: /\d+([ \s.,]\d*)*|[.,]?\d+([eE^xX][-]?\d+)?/
SPACES: /\s+/

Any tips on improving the grammar are also appreciated :)
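On the grammar itself, one point that can be checked with Python's `re` module alone is how the `NUMBER` pattern behaves. The sketch below copies the pattern verbatim and assumes that Lark passes regex terminals to `re` unchanged; `number_prefix` is a hypothetical helper for illustration. It suggests two things: the first alternative also consumes a trailing space or separator, and because regex alternation takes the first branch that succeeds, the exponent branch is only reached when the number starts with `.` or `,`.

```python
import re

# NUMBER pattern copied verbatim from the grammar above.
NUMBER = re.compile(r"\d+([ \s.,]\d*)*|[.,]?\d+([eE^xX][-]?\d+)?")


def number_prefix(text):
    """Return the text NUMBER consumes at the start of `text` ('' if none)."""
    match = NUMBER.match(text)
    return match.group(0) if match else ""


if __name__ == "__main__":
    print(repr(number_prefix("10 mg")))  # '10 '  (trailing space is included)
    print(repr(number_prefix("1e5")))    # '1'    (exponent branch not reached)
    print(repr(number_prefix(".5e3")))   # '.5e3' (second branch applies)
```

If that is not the intended behavior, a possible adjustment is to require a digit after each separator and to put the exponent form first. That is a design choice for the grammar author rather than a required fix.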
