Refine/llm api op unittest #528

Open · wants to merge 178 commits into base: main

Commits (178)
63d430a
add api call
drcege Oct 23, 2024
6720da4
add call_api ops
drcege Oct 24, 2024
8daa6e1
clean
drcege Oct 29, 2024
ef11951
minor update
drcege Oct 29, 2024
5597d5c
more tests
drcege Oct 29, 2024
4b6e769
update tests
drcege Oct 29, 2024
835be22
Merge branch 'main' into dev/api_model
drcege Oct 29, 2024
325a753
update prompts
drcege Oct 29, 2024
4f04bdd
fix unittest
drcege Oct 30, 2024
0adbdcd
update tests
drcege Oct 30, 2024
0aa4069
add docs
drcege Nov 1, 2024
f007532
minor fix
drcege Nov 1, 2024
9aa7390
Merge branch 'main' into dev/api_model
drcege Nov 5, 2024
ee4f461
add API processor
drcege Nov 5, 2024
9bbfe47
Merge branch 'main' into dev/api_model
drcege Nov 5, 2024
b00b182
refine API processor
drcege Nov 5, 2024
b718de7
refine
drcege Nov 5, 2024
6d1d433
chunk and extract events
BeachWang Nov 6, 2024
4d1670f
fix bugs
drcege Nov 6, 2024
9e11aa3
fix tests
drcege Nov 6, 2024
cc40fc0
extract attribute
BeachWang Nov 7, 2024
4c262ad
Merge branch 'dev/api_model' of github.com:alibaba/data-juicer into d…
BeachWang Nov 7, 2024
347bc0f
refine tests
drcege Nov 7, 2024
c9d5051
extract nickname
BeachWang Nov 8, 2024
8a128ca
Merge branch 'dev/api_model' of github.com:alibaba/data-juicer into d…
BeachWang Nov 8, 2024
9262777
nickname test done
BeachWang Nov 8, 2024
58fc020
merge main
BeachWang Nov 8, 2024
c7dc28e
lightRAG to OP
BeachWang Nov 11, 2024
238869e
merge main
BeachWang Nov 11, 2024
0e51a43
doc done
BeachWang Nov 11, 2024
6d9d8a5
remove extra test
BeachWang Nov 11, 2024
a637a64
relavant -> relevant
BeachWang Nov 11, 2024
56e7988
fix minor error
BeachWang Nov 11, 2024
03880b7
group by op done
BeachWang Nov 12, 2024
23174fd
ValueError -> Exception
BeachWang Nov 12, 2024
e82cc06
merge main
BeachWang Nov 12, 2024
20a8dee
fix config_all error
BeachWang Nov 12, 2024
38a9511
fix prepare_api_model
BeachWang Nov 13, 2024
35f0eb3
fix rank sample None
BeachWang Nov 13, 2024
155d3dd
constant fix key
BeachWang Nov 13, 2024
f862897
aggregator op
BeachWang Nov 14, 2024
2d4da5e
merge llm_info_extract
BeachWang Nov 14, 2024
7e66057
init python_lambda_mapper
drcege Nov 20, 2024
a61859b
set default arg
drcege Nov 20, 2024
8031a31
fix init
drcege Nov 21, 2024
67711f9
add python_file_mapper
drcege Nov 21, 2024
cdeb692
support text & most relavant entities
BeachWang Nov 22, 2024
125a8f3
coverage ignore_errors
drcege Nov 25, 2024
0c68089
index sample
BeachWang Nov 25, 2024
651789d
role_playing_system_prompt_yaml
BeachWang Nov 25, 2024
c5d7b9e
merge python_file_mapper
BeachWang Nov 26, 2024
cf6a53a
Merge branch 'main' of github.com:alibaba/data-juicer into dev/group_…
BeachWang Nov 26, 2024
222790e
system_prompt begin
BeachWang Nov 27, 2024
75f2911
support batched
drcege Nov 27, 2024
11fa852
remove unforkable
BeachWang Nov 27, 2024
4af2bfb
support batched & add docs
drcege Nov 27, 2024
8867580
Merge branch 'main' into op/python_lambda
drcege Nov 28, 2024
553d5ad
add docs
drcege Nov 28, 2024
470ca19
fix docs
drcege Nov 28, 2024
399a238
update docs
drcege Nov 28, 2024
706365f
Merge branch 'main' into op/python_file
drcege Nov 28, 2024
115ab9a
pre-commit done
BeachWang Nov 28, 2024
ecb8635
fix batch bug
BeachWang Dec 2, 2024
03e3469
fix batch bug
BeachWang Dec 2, 2024
1788fa6
merge fix_batch_bug
BeachWang Dec 3, 2024
735ff4d
Merge branch 'main' of github.com:alibaba/data-juicer into debug/fix_…
BeachWang Dec 3, 2024
00ff624
fix filter batch
BeachWang Dec 3, 2024
8601519
fix filter batch
BeachWang Dec 3, 2024
eeefcab
system prompt recipe done
BeachWang Dec 3, 2024
6eaa50c
Merge branch 'main' of github.com:alibaba/data-juicer into dev/group_…
BeachWang Dec 3, 2024
1575717
not rank for filter
BeachWang Dec 5, 2024
2c5c4a1
limit pyav version
BeachWang Dec 5, 2024
5c96dd5
Merge branch 'debug/fix_batch_bug' of github.com:alibaba/data-juicer …
BeachWang Dec 5, 2024
49be467
add test for op
BeachWang Dec 5, 2024
9ab02fe
tmp
BeachWang Dec 5, 2024
ba086de
tmp
BeachWang Dec 5, 2024
f712131
doc done
BeachWang Dec 5, 2024
12b7616
Merge branch 'op/python_lambda' of github.com:alibaba/data-juicer int…
BeachWang Dec 5, 2024
e57b64a
merge python_lambda
BeachWang Dec 5, 2024
5f463cd
merge python_lambda
BeachWang Dec 5, 2024
a786070
skip api test
BeachWang Dec 6, 2024
73f4e77
merge main
BeachWang Dec 6, 2024
4b6f0b9
merge main
BeachWang Dec 6, 2024
788a212
add env dependency
BeachWang Dec 6, 2024
10242c4
install by recipe
BeachWang Dec 10, 2024
6a43eec
dialog sent intensity
BeachWang Dec 12, 2024
621a693
add query
BeachWang Dec 12, 2024
b46d105
change to dj_install
BeachWang Dec 12, 2024
a0da444
change to dj_install
BeachWang Dec 12, 2024
02f8dda
developer doc done
BeachWang Dec 12, 2024
635a8a9
merge dj_install
BeachWang Dec 12, 2024
083b665
+ add auto mode for analyzer: load all filters that produce stats to …
HYLcool Dec 12, 2024
662df5e
+ add default mem_required for those model-based OPs
HYLcool Dec 13, 2024
3b04908
query sent_int mapper
BeachWang Dec 13, 2024
6b4d525
query sentiment test done
BeachWang Dec 13, 2024
926c3da
- support wordcloud drawing for str or str list fields in stats
HYLcool Dec 13, 2024
27347c0
- take the minimum one of dataset length and auto num
HYLcool Dec 13, 2024
d19f92f
* update default export path
HYLcool Dec 13, 2024
fbd6726
* set version limit for wandb to avoid exception
HYLcool Dec 13, 2024
58288f7
change meta pass
BeachWang Dec 13, 2024
9f9f85b
+ add docs for auto mode
HYLcool Dec 13, 2024
b665c10
doc done
BeachWang Dec 13, 2024
07be552
merge main
BeachWang Dec 13, 2024
8ba4156
sentiment detection
BeachWang Dec 16, 2024
48b1761
diff label
BeachWang Dec 16, 2024
8160725
sentiment
BeachWang Dec 16, 2024
01846d1
test done
BeachWang Dec 16, 2024
566eb5b
+ support t-test for Measure
HYLcool Dec 16, 2024
7b8ee5c
* fix some bugs
HYLcool Dec 16, 2024
a76d975
dialog intent label
BeachWang Dec 17, 2024
2fb9fe4
fix typo
BeachWang Dec 17, 2024
324467f
prompt adjust
BeachWang Dec 17, 2024
4a3ad39
add more test
BeachWang Dec 17, 2024
937b3f1
query intent detection
BeachWang Dec 17, 2024
d4ca87b
for test
BeachWang Dec 17, 2024
8109c71
for test
BeachWang Dec 17, 2024
c749dcd
change model
BeachWang Dec 17, 2024
c7df0bc
fix typo
BeachWang Dec 17, 2024
c7662cb
fix typo
BeachWang Dec 17, 2024
6f44ec0
for test
BeachWang Dec 17, 2024
9b6652d
for test
BeachWang Dec 17, 2024
fa306dc
doc done
BeachWang Dec 17, 2024
601d9a2
- support analyze a dataset object
HYLcool Dec 17, 2024
34f2ab6
- support analysis on tags in meta
HYLcool Dec 17, 2024
8531a01
- support analysis with tagging OPs
HYLcool Dec 17, 2024
4d6b701
- move tags into the meta field
HYLcool Dec 18, 2024
767b2f0
dialog topic detection
BeachWang Dec 18, 2024
c088cb1
dialog topic detection
BeachWang Dec 18, 2024
12351db
dialog topic detection
BeachWang Dec 18, 2024
4b4e946
dialog topic detection
BeachWang Dec 18, 2024
4506a8e
dialog topic detection
BeachWang Dec 18, 2024
d21db85
dialog topic detection
BeachWang Dec 18, 2024
6f394ee
query topic detection
BeachWang Dec 18, 2024
abee815
query topic detection
BeachWang Dec 18, 2024
0494741
query topic detection
BeachWang Dec 18, 2024
38523a1
query topic detection
BeachWang Dec 18, 2024
b03a33a
query topic detection
BeachWang Dec 18, 2024
35aa6bd
- do not tell tags using their suffix
HYLcool Dec 18, 2024
ad226b1
doc done
BeachWang Dec 18, 2024
85e1392
- add insight mining
HYLcool Dec 18, 2024
b02745b
meta tags aggregator
BeachWang Dec 19, 2024
f2654f1
meta tags aggregator
BeachWang Dec 19, 2024
23e5d6f
meta tags aggregator
BeachWang Dec 19, 2024
1c74709
meta tags aggregator
BeachWang Dec 19, 2024
a997726
meta tags aggregator
BeachWang Dec 19, 2024
2642847
meta tags aggregator
BeachWang Dec 19, 2024
2dae3b8
meta tags aggregator
BeachWang Dec 19, 2024
8bb2509
meta tags aggregator
BeachWang Dec 19, 2024
90303ee
meta tags aggregator
BeachWang Dec 19, 2024
e4c6ff1
meta tags aggregator
BeachWang Dec 19, 2024
12f8946
meta tags aggregator
BeachWang Dec 19, 2024
09b1599
meta tags aggregator
BeachWang Dec 19, 2024
203bc64
naive reverse grouper
BeachWang Dec 19, 2024
cf01e7e
naive reverse grouper
BeachWang Dec 19, 2024
e3d7b8b
* resolve the bugs when running insight mining in multiprocessing mode
HYLcool Dec 19, 2024
3ca9994
Merge branch 'main' into feat/insight_mining
HYLcool Dec 19, 2024
16ca358
* update unittests
HYLcool Dec 20, 2024
dfb0bca
* update unittests
HYLcool Dec 20, 2024
f8b9539
* update unittests
HYLcool Dec 20, 2024
0ba6459
tags specified field
BeachWang Dec 20, 2024
45259e5
* update readme for analyzer
HYLcool Dec 20, 2024
174ee05
Merge branch 'main' into feat/insight_mining
HYLcool Dec 20, 2024
4ad8b8d
merge main
BeachWang Dec 20, 2024
9f098bd
doc done
BeachWang Dec 20, 2024
51f53dc
* use more detailed key
HYLcool Dec 20, 2024
58001ca
+ add reference
HYLcool Dec 20, 2024
892cb48
Merge branch 'feat/insight_mining' of github.com:alibaba/data-juicer …
BeachWang Dec 20, 2024
19fd15b
move mm tags
BeachWang Dec 20, 2024
8fec0f7
move meta key
BeachWang Dec 24, 2024
6fdc95b
done
BeachWang Dec 30, 2024
8e01f7e
merge main
BeachWang Dec 30, 2024
af9e14d
test done
BeachWang Dec 31, 2024
f57f454
rm nested set
BeachWang Dec 31, 2024
4188150
enable op error for unittest
BeachWang Jan 2, 2025
fad48f5
merge main
BeachWang Jan 2, 2025
e6f4564
enhance api unittest
BeachWang Jan 3, 2025
a572f5a
merge main
BeachWang Jan 3, 2025
97f4642
expose skip_op_error
BeachWang Jan 3, 2025
2 changes: 2 additions & 0 deletions configs/config_all.yaml
@@ -13,6 +13,8 @@ np: 4
text_keys: 'text' # the key name of field where the sample texts to be processed, e.g., `text`, `instruction`, `output`, ...
# Note: currently, we support specify only ONE key for each op, for cases requiring multiple keys, users can specify the op multiple times. We will only use the first key of `text_keys` when you set multiple keys.
suffixes: [] # the suffix of files that will be read. For example: '.txt', 'txt' or ['txt', '.pdf', 'docx']
turbo: false # Enable Turbo mode to maximize processing speed when batch size is 1.
skip_op_error: true # Skip errors in OPs caused by unexpected unvalid samples.
Review comment (Collaborator): unvalid --> invalid

use_cache: true # whether to use the cache management of Hugging Face datasets. It might take up lots of disk space when using cache
ds_cache_dir: null # cache dir for Hugging Face datasets. In default, it's the same as the environment variable `HF_DATASETS_CACHE`, whose default value is usually "~/.cache/huggingface/datasets". If this argument is set to a valid path by users, it will override the default cache dir
open_monitor: true # Whether to open the monitor to trace resource utilization for each OP during data processing. It's True in default.
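For orientation, the two new global options sit alongside the existing recipe keys; a minimal fragment (a sketch only, limited to keys visible in this diff):

```yaml
# Minimal recipe fragment (sketch) showing the two options added by this PR.
np: 4                 # number of subprocesses
text_keys: 'text'     # key of the field holding the sample texts
turbo: false          # enable Turbo mode to maximize speed when batch size is 1
skip_op_error: true   # skip errors in OPs caused by unexpected invalid samples
```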
10 changes: 8 additions & 2 deletions data_juicer/config/config.py
@@ -219,8 +219,13 @@ def init_configs(args: Optional[List[str]] = None, which_entry: object = None):
'--turbo',
type=bool,
default=False,
help='Enable Turbo mode to maximize processing speed. Stability '
'features like fault tolerance will be disabled.')
help='Enable Turbo mode to maximize processing speed when batch size '
'is 1.')
parser.add_argument(
'--skip_op_error',
type=bool,
default=True,
help='Skip errors in OPs caused by unexpected unvalid samples.')
Review comment (Collaborator): same typo

parser.add_argument(
'--use_cache',
type=bool,
@@ -550,6 +555,7 @@ def init_setup_from_cfg(cfg: Namespace):
'video_key': cfg.video_key,
'num_proc': cfg.np,
'turbo': cfg.turbo,
'skip_op_error': cfg.skip_op_error,
'work_dir': cfg.work_dir,
}
cfg.process = update_op_attr(cfg.process, op_attrs)
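The wiring above pushes global settings into every OP's arguments through `update_op_attr`. A standalone sketch of that propagation (a simplified, hypothetical implementation; the real one lives in `data_juicer/config/config.py`):

```python
def update_op_attr(process_list, op_attrs):
    """Propagate global attrs (e.g. `skip_op_error`) into each OP's kwargs.

    Values set explicitly on an OP are kept; only missing keys are filled
    in from the global attrs.
    """
    updated = []
    for op in process_list:
        (name, args), = op.items()           # each entry is {op_name: args}
        merged = dict(op_attrs, **(args or {}))
        updated.append({name: merged})
    return updated


# Global attrs fill in defaults, but per-OP settings win.
ops = [{'clean_links_mapper': None},
       {'whitespace_normalization_mapper': {'skip_op_error': False}}]
ops = update_op_attr(ops, {'skip_op_error': True, 'turbo': False})
```

This mirrors why the unittest configs below now all expect `'skip_op_error': True` in each OP's nested dict: the global default is injected everywhere it is not overridden.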
42 changes: 30 additions & 12 deletions data_juicer/ops/base_op.py
@@ -48,7 +48,7 @@ def wrapper(sample, *args, **kwargs):
return wrapper


def catch_map_batches_exception(method):
def catch_map_batches_exception(method, skip_op_error=False):
"""
For batched-map sample-level fault tolerance.
"""
@@ -59,6 +59,8 @@ def wrapper(samples, *args, **kwargs):
try:
return method(samples, *args, **kwargs)
except Exception as e:
if not skip_op_error:
raise
from loguru import logger
logger.error(
f'An error occurred in mapper operation when processing '
@@ -72,7 +74,9 @@ def wrapper(samples, *args, **kwargs):
return wrapper


def catch_map_single_exception(method, return_sample=True):
def catch_map_single_exception(method,
return_sample=True,
skip_op_error=False):
"""
For single-map sample-level fault tolerance.
The input sample is expected batch_size = 1.
@@ -100,6 +104,8 @@ def wrapper(sample, *args, **kwargs):
else:
return [res]
except Exception as e:
if not skip_op_error:
raise
from loguru import logger
logger.error(
f'An error occurred in mapper operation when processing '
@@ -156,6 +162,10 @@ def __init__(self, *args, **kwargs):
self.batch_size = kwargs.get('batch_size', 1000)
self.work_dir = kwargs.get('work_dir', None)

# For unittests, errors are not skipped by default;
# config init sets this to True.
self.skip_op_error = kwargs.get('skip_op_error', False)

# whether the model can be accelerated using cuda
_accelerator = kwargs.get('accelerator', None)
if _accelerator is not None:
@@ -277,9 +287,11 @@ def __init__(self, *args, **kwargs):

# runtime wrappers
if self.is_batched_op():
self.process = catch_map_batches_exception(self.process_batched)
self.process = catch_map_batches_exception(
self.process_batched, skip_op_error=self.skip_op_error)
else:
self.process = catch_map_single_exception(self.process_single)
self.process = catch_map_single_exception(
self.process_single, skip_op_error=self.skip_op_error)

# set the process method is not allowed to be overridden
def __init_subclass__(cls, **kwargs):
@@ -366,13 +378,16 @@ def __init__(self, *args, **kwargs):
# runtime wrappers
if self.is_batched_op():
self.compute_stats = catch_map_batches_exception(
self.compute_stats_batched)
self.process = catch_map_batches_exception(self.process_batched)
self.compute_stats_batched, skip_op_error=self.skip_op_error)
self.process = catch_map_batches_exception(
self.process_batched, skip_op_error=self.skip_op_error)
else:
self.compute_stats = catch_map_single_exception(
self.compute_stats_single)
self.process = catch_map_single_exception(self.process_single,
return_sample=False)
self.compute_stats_single, skip_op_error=self.skip_op_error)
self.process = catch_map_single_exception(
self.process_single,
return_sample=False,
skip_op_error=self.skip_op_error)

# set the process method is not allowed to be overridden
def __init_subclass__(cls, **kwargs):
@@ -481,9 +496,11 @@ def __init__(self, *args, **kwargs):

# runtime wrappers
if self.is_batched_op():
self.compute_hash = catch_map_batches_exception(self.compute_hash)
self.compute_hash = catch_map_batches_exception(
self.compute_hash, skip_op_error=self.skip_op_error)
else:
self.compute_hash = catch_map_single_exception(self.compute_hash)
self.compute_hash = catch_map_single_exception(
self.compute_hash, skip_op_error=self.skip_op_error)

def compute_hash(self, sample):
"""
@@ -619,7 +636,8 @@ def __init__(self, *args, **kwargs):
queries and responses
"""
super(Aggregator, self).__init__(*args, **kwargs)
self.process = catch_map_single_exception(self.process_single)
self.process = catch_map_single_exception(
self.process_single, skip_op_error=self.skip_op_error)

def process_single(self, sample):
"""
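The fault-tolerance pattern threaded through `base_op.py` can be sketched in isolation (simplified: the real wrapper logs via loguru and reports the failing sample):

```python
def catch_map_batches_exception(method, skip_op_error=False):
    """Batched-map fault tolerance: re-raise when skip_op_error is False
    (the unittest default); otherwise log and return the batch unchanged."""
    def wrapper(samples, *args, **kwargs):
        try:
            return method(samples, *args, **kwargs)
        except Exception as e:
            if not skip_op_error:
                raise
            print(f'An error occurred in mapper operation: {e}')
            return samples  # keep the batch untouched and continue
    return wrapper


def flaky_mapper(samples):
    raise ValueError('unexpected invalid sample')


strict = catch_map_batches_exception(flaky_mapper, skip_op_error=False)
lenient = catch_map_batches_exception(flaky_mapper, skip_op_error=True)
```

With `skip_op_error=False` the wrapped OP propagates the exception, which is what lets unittests fail loudly instead of silently passing bad samples through.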
7 changes: 7 additions & 0 deletions tests/config/test_config_funcs.py
@@ -56,6 +56,7 @@ def test_yaml_cfg_file(self):
'turbo': False,
'batch_size': 1000,
'index_key': None,
'skip_op_error': True,
'work_dir': WORKDIR,
}
}, 'nested dict load fail, for nonparametric op')
@@ -79,6 +80,7 @@ def test_yaml_cfg_file(self):
'turbo': False,
'batch_size': 1000,
'index_key': None,
'skip_op_error': True,
'work_dir': WORKDIR,
}
}, 'nested dict load fail, un-expected internal value')
@@ -151,6 +153,7 @@ def test_mixture_cfg(self):
'turbo': False,
'batch_size': 1000,
'index_key': None,
'skip_op_error': True,
'work_dir': WORKDIR,
}
})
@@ -174,6 +177,7 @@ def test_mixture_cfg(self):
'turbo': False,
'batch_size': 1000,
'index_key': None,
'skip_op_error': True,
'work_dir': WORKDIR,
}
})
@@ -197,6 +201,7 @@ def test_mixture_cfg(self):
'turbo': False,
'batch_size': 1000,
'index_key': None,
'skip_op_error': True,
'work_dir': WORKDIR,
}
})
@@ -220,6 +225,7 @@ def test_mixture_cfg(self):
'turbo': False,
'batch_size': 1000,
'index_key': None,
'skip_op_error': True,
'work_dir': WORKDIR,
}
})
@@ -243,6 +249,7 @@ def test_mixture_cfg(self):
'turbo': False,
'batch_size': 1000,
'index_key': None,
'skip_op_error': True,
'work_dir': WORKDIR,
}
})
8 changes: 5 additions & 3 deletions tests/ops/aggregator/test_entity_attribute_aggregator.py
@@ -5,13 +5,13 @@
from data_juicer.core.data import NestedDataset as Dataset
from data_juicer.ops.aggregator import EntityAttributeAggregator
from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase, SKIPPED_TESTS
from data_juicer.utils.constant import Fields, MetaKeys
from data_juicer.utils.constant import Fields, BatchMetaKeys, MetaKeys


@SKIPPED_TESTS.register_module()
class EntityAttributeAggregatorTest(DataJuicerTestCaseBase):

def _run_helper(self, op, samples):
def _run_helper(self, op, samples, output_key=BatchMetaKeys.entity_attribute):

# before running this test, set below environment variables:
# export OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1/
@@ -23,6 +23,8 @@ def _run_helper(self, op, samples):
for data in new_dataset:
for k in data:
logger.info(f"{k}: {data[k]}")
self.assertIn(output_key, data[Fields.batch_meta])
self.assertNotEqual(data[Fields.batch_meta][output_key], '')

self.assertEqual(len(new_dataset), len(samples))

@@ -64,7 +66,7 @@ def test_input_output(self):
input_key='sub_docs',
output_key='text'
)
self._run_helper(op, samples)
self._run_helper(op, samples, output_key='text')

def test_max_token_num(self):
samples = [
@@ -6,13 +6,13 @@
from data_juicer.ops.aggregator import MostRelavantEntitiesAggregator
from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase, SKIPPED_TESTS

from data_juicer.utils.constant import Fields, MetaKeys
from data_juicer.utils.constant import Fields, BatchMetaKeys, MetaKeys


@SKIPPED_TESTS.register_module()
class MostRelavantEntitiesAggregatorTest(DataJuicerTestCaseBase):

def _run_helper(self, op, samples):
def _run_helper(self, op, samples, output_key=BatchMetaKeys.most_relavant_entities):

# before running this test, set below environment variables:
# export OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1/
@@ -24,6 +24,8 @@ def _run_helper(self, op, samples):
for data in new_dataset:
for k in data:
logger.info(f"{k}: {data[k]}")
self.assertIn(output_key, data[Fields.batch_meta])
self.assertNotEqual(data[Fields.batch_meta][output_key], '')

self.assertEqual(len(new_dataset), len(samples))

@@ -67,7 +69,7 @@ def test_input_output(self):
input_key='events',
output_key='relavant_roles'
)
self._run_helper(op, samples)
self._run_helper(op, samples, output_key='relavant_roles')

def test_max_token_num(self):
samples = [
6 changes: 4 additions & 2 deletions tests/ops/aggregator/test_nested_aggregator.py
@@ -12,7 +12,7 @@
@SKIPPED_TESTS.register_module()
class NestedAggregatorTest(DataJuicerTestCaseBase):

def _run_helper(self, op, samples):
def _run_helper(self, op, samples, output_key=MetaKeys.event_description):

# before running this test, set below environment variables:
# export OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1/
@@ -24,6 +24,8 @@ def _run_helper(self, op, samples):
for data in new_dataset:
for k in data:
logger.info(f"{k}: {data[k]}")
self.assertIn(output_key, data[Fields.batch_meta])
self.assertNotEqual(data[Fields.batch_meta][output_key], '')

self.assertEqual(len(new_dataset), len(samples))

@@ -61,7 +63,7 @@ def test_input_output(self):
input_key='sub_docs',
output_key='text'
)
self._run_helper(op, samples)
self._run_helper(op, samples, output_key='text')

def test_max_token_num_1(self):
samples = [
2 changes: 2 additions & 0 deletions tests/ops/mapper/test_dialog_intent_detection_mapper.py
@@ -28,6 +28,8 @@ def _run_op(self, op, samples, target_len, labels_key=None, analysis_key=None):
for analysis, labels in zip(analysis_list, labels_list):
logger.info(f'analysis: {analysis}')
logger.info(f'intent: {labels}')
self.assertNotEqual(analysis, '')
self.assertNotEqual(labels, '')

self.assertEqual(len(analysis_list), target_len)
self.assertEqual(len(labels_list), target_len)
2 changes: 2 additions & 0 deletions tests/ops/mapper/test_dialog_sentiment_detection_mapper.py
@@ -28,6 +28,8 @@ def _run_op(self, op, samples, target_len, labels_key=None, analysis_key=None):
for analysis, labels in zip(analysis_list, labels_list):
logger.info(f'analysis: {analysis}')
logger.info(f'sentiment: {labels}')
self.assertNotEqual(analysis, '')
self.assertNotEqual(labels, '')

self.assertEqual(len(analysis_list), target_len)
self.assertEqual(len(labels_list), target_len)
1 change: 1 addition & 0 deletions tests/ops/mapper/test_dialog_sentiment_intensity_mapper.py
@@ -28,6 +28,7 @@ def _run_op(self, op, samples, target_len, intensities_key=None, analysis_key=None):
for analysis, intensity in zip(analysis_list, intensity_list):
logger.info(f'analysis: {analysis}')
logger.info(f'sentiment intensity: {intensity}')
self.assertNotEqual(analysis, '')

self.assertEqual(len(analysis_list), target_len)
self.assertEqual(len(intensity_list), target_len)
2 changes: 2 additions & 0 deletions tests/ops/mapper/test_dialog_topic_detection_mapper.py
@@ -29,6 +29,8 @@ def _run_op(self, op, samples, target_len, labels_key=None, analysis_key=None):
for analysis, labels in zip(analysis_list, labels_list):
logger.info(f'analysis: {analysis}')
logger.info(f'topic: {labels}')
self.assertNotEqual(analysis, '')
self.assertNotEqual(labels, '')

self.assertEqual(len(analysis_list), target_len)
self.assertEqual(len(labels_list), target_len)
4 changes: 4 additions & 0 deletions tests/ops/mapper/test_extract_entity_attribute_mapper.py
@@ -49,6 +49,10 @@ def _run_op(self, api_model, response_path=None):
dataset = Dataset.from_list(samples)
dataset = op.run(dataset)
for sample in dataset:
self.assertIn(MetaKeys.main_entities, sample[Fields.meta])
self.assertIn(MetaKeys.attributes, sample[Fields.meta])
self.assertIn(MetaKeys.attribute_descriptions, sample[Fields.meta])
self.assertIn(MetaKeys.attribute_support_texts, sample[Fields.meta])
ents = sample[Fields.meta][MetaKeys.main_entities]
attrs = sample[Fields.meta][MetaKeys.attributes]
descs = sample[Fields.meta][MetaKeys.attribute_descriptions]
4 changes: 4 additions & 0 deletions tests/ops/mapper/test_extract_entity_relation_mapper.py
@@ -56,6 +56,10 @@ def _run_op(self, op):
dataset = Dataset.from_list(samples)
dataset = op.run(dataset)
sample = dataset[0]
self.assertIn(MetaKeys.entity, sample[Fields.meta])
self.assertIn(MetaKeys.relation, sample[Fields.meta])
self.assertNotEqual(len(sample[Fields.meta][MetaKeys.entity]), 0)
self.assertNotEqual(len(sample[Fields.meta][MetaKeys.relation]), 0)
logger.info(f"entities: {sample[Fields.meta][MetaKeys.entity]}")
logger.info(f"relations: {sample[Fields.meta][MetaKeys.relation]}")

3 changes: 2 additions & 1 deletion tests/ops/mapper/test_extract_event_mapper.py
@@ -59,8 +59,9 @@ def _run_op(self, api_model, response_path=None):

dataset = Dataset.from_list(samples)
dataset = op.run(dataset)
self.assertNotEqual(len(dataset), 0)
for sample in dataset:
self.assertIn(MetaKeys.event_description, sample[Fields.meta])
self.assertIn(MetaKeys.relevant_characters, sample[Fields.meta])
logger.info(f"chunk_id: {sample['chunk_id']}")
self.assertEqual(sample['chunk_id'], 0)
logger.info(f"event: {sample[Fields.meta][MetaKeys.event_description]}")
2 changes: 2 additions & 0 deletions tests/ops/mapper/test_extract_keyword_mapper.py
@@ -59,6 +59,8 @@ def _run_op(self, api_model, response_path=None):
dataset = Dataset.from_list(samples)
dataset = op.run(dataset)
sample = dataset[0]
self.assertIn(MetaKeys.keyword, sample[Fields.meta])
self.assertNotEqual(len(sample[Fields.meta][MetaKeys.keyword]), 0)
logger.info(f"keywords: {sample[Fields.meta][MetaKeys.keyword]}")

def test(self):
1 change: 1 addition & 0 deletions tests/ops/mapper/test_extract_nickname_mapper.py
@@ -38,6 +38,7 @@ def _run_op(self, api_model, response_path=None):

dataset = Dataset.from_list(samples)
dataset = op.run(dataset)
self.assertIn(MetaKeys.nickname, dataset[0][Fields.meta])
result = dataset[0][Fields.meta][MetaKeys.nickname]
result = [(
d[MetaKeys.source_entity],
2 changes: 2 additions & 0 deletions tests/ops/mapper/test_extract_support_text_mapper.py
@@ -62,7 +62,9 @@ def _run_op(self, api_model):
dataset = Dataset.from_list(samples)
dataset = op.run(dataset)
sample = dataset[0]
self.assertIn(MetaKeys.support_text, sample[Fields.meta])
logger.info(f"support_text: \n{sample[Fields.meta][MetaKeys.support_text]}")
self.assertNotEqual(sample[Fields.meta][MetaKeys.support_text], '')

def test(self):
# before running this test, set below environment variables:
2 changes: 2 additions & 0 deletions tests/ops/mapper/test_relation_identity_mapper.py
@@ -49,6 +49,8 @@ def _run_op(self, api_model, output_key=MetaKeys.role_relation):
for data in dataset:
for k in data:
logger.info(f"{k}: {data[k]}")
self.assertIn(output_key, data[Fields.meta])
self.assertNotEqual(data[Fields.meta][output_key], '')

def test_default(self):
self._run_op('qwen2.5-72b-instruct')