Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document using statistics in the Faker connector #24874

Merged
merged 1 commit into from
Feb 5, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
103 changes: 102 additions & 1 deletion docs/src/main/sphinx/connector/faker.md
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,15 @@ The following table details all general configuration properties:
* - `faker.locale`
- Default locale for generating character-based data, specified as a IETF BCP
47 language tag string. Defaults to `en`.
* - `faker.sequence-detection-enabled`
nineinchnick marked this conversation as resolved.
Show resolved Hide resolved
- If true, when creating a table using existing data, columns with the number
of distinct values close to the number of rows are treated as sequences.
Defaults to `true`.
* - `faker.dictionary-detection-enabled`
- If true, when creating a table using existing data, columns with a low
number of distinct values are treated as dictionaries, and get
the `allowed_values` column property populated with random values.
Defaults to `true`.
:::

The following table details all supported schema properties. If they're not
Expand All @@ -66,6 +75,15 @@ set, values from corresponding configuration properties are used.
them, in any table of this schema.
* - `default_limit`
- Default number of rows in a table.
* - `sequence_detection_enabled`
nineinchnick marked this conversation as resolved.
Show resolved Hide resolved
- If true, when creating a table using existing data, columns with the number
of distinct values close to the number of rows are treated as sequences.
Defaults to `true`.
* - `dictionary_detection_enabled`
- If true, when creating a table using existing data, columns with a low
number of distinct values are treated as dictionaries, and get
the `allowed_values` column property populated with random values.
Defaults to `true`.
:::

The following table details all supported table properties. If they're not set,
Expand All @@ -82,6 +100,15 @@ values from corresponding schema properties are used.
`null` in the table.
* - `default_limit`
- Default number of rows in the table.
* - `sequence_detection_enabled`
- If true, when creating a table using existing data, columns with the number
of distinct values close to the number of rows are treated as sequences.
Defaults to `true`.
* - `dictionary_detection_enabled`
- If true, when creating a table using existing data, columns with a low
number of distinct values are treated as dictionaries, and get
the `allowed_values` column property populated with random values.
Defaults to `true`.
:::

The following table details all supported column properties.
Expand Down Expand Up @@ -245,7 +272,7 @@ operation](sql-read-operations) statements to generate data.
To define the schema for generating data, it supports the following features:

- [](/sql/create-table)
- [](/sql/create-table-as)
- [](/sql/create-table-as), see also [](faker-statistics)
- [](/sql/drop-table)
- [](/sql/create-schema)
- [](/sql/drop-schema)
Expand Down Expand Up @@ -317,3 +344,77 @@ CREATE TABLE generator.default.customer (
group_id INTEGER WITH (allowed_values = ARRAY['10', '32', '81'])
);
```

nineinchnick marked this conversation as resolved.
Show resolved Hide resolved
(faker-statistics)=
### Using existing data statistics

The Faker connector automatically sets the `default_limit` table property, and
the `min`, `max`, and `null_probability` column properties, based on statistics
collected by scanning existing data read by Trino from the data source. The
connector uses these statistics to be able to generate data that is more similar
to the original data set, without using any of that data:


```sql
CREATE TABLE generator.default.customer AS
SELECT *
FROM production.public.customer
WHERE created_at > CURRENT_DATE - INTERVAL '1' YEAR;
```

Instead of using range, or other predicates, tables can be sampled,
see [](tablesample).

When the `SELECT` statement doesn't contain a `WHERE` clause, a shorter notation
can be used:

```sql
CREATE TABLE generator.default.customer AS TABLE production.public.customer;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah .. I still have to document the TABLE keyword somewhere as well...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi @kasiafi and @martint .. any suggestions where? .. just as a separate page?

```

The Faker connector detects sequence columns, which are integer column with the
number of distinct values almost equal to the number of rows in the table. For
nineinchnick marked this conversation as resolved.
Show resolved Hide resolved
such columns, Faker sets the `step` column property to 1.

Sequence detection can be turned off using the `sequence_detection_enabled`
table, or schema property or in the connector configuration file, using the
`faker.sequence-detection-enabled` property.

The Faker connector detects dictionary columns, which are columns of
nineinchnick marked this conversation as resolved.
Show resolved Hide resolved
non-character types with the number of distinct values lower or equal to 1000.
For such columns, Faker generates a list of random values to choose from, and
saves it in the `allowed_values` column property.

Dictionary detection can be turned off using the `dictionary_detection_enabled`
table, or schema property or in the connector configuration file, using
the `faker.dictionary-detection-enabled` property.

For example, copy the `orders` table from the TPC-H connector with
statistics, using the following query:

```sql
CREATE TABLE generator.default.orders AS TABLE tpch.tiny.orders;
```

Inspect the schema of the table created by the Faker connector:
```sql
SHOW CREATE TABLE generator.default.orders;
```

The table schema should contain additional column and table properties.
```
CREATE TABLE generator.default.orders (
orderkey bigint WITH (max = '60000', min = '1', null_probability = 0E0, step = '1'),
custkey bigint WITH (allowed_values = ARRAY['153','662','1453','63','784', ..., '1493','657'], null_probability = 0E0),
orderstatus varchar(1),
totalprice double WITH (max = '466001.28', min = '874.89', null_probability = 0E0),
orderdate date WITH (max = '1998-08-02', min = '1992-01-01', null_probability = 0E0),
orderpriority varchar(15),
clerk varchar(15),
shippriority integer WITH (allowed_values = ARRAY['0'], null_probability = 0E0),
comment varchar(79)
)
WITH (
default_limit = 15000
)
```
1 change: 1 addition & 0 deletions docs/src/main/sphinx/sql/select.md
Original file line number Diff line number Diff line change
Expand Up @@ -1039,6 +1039,7 @@ ORDER BY regionkey FETCH FIRST ROW WITH TIES;
(5 rows)
```

(tablesample)=
## TABLESAMPLE

There are multiple sample methods:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,7 @@ public boolean isSequenceDetectionEnabled()
@ConfigDescription(
"""
If true, when creating a table using existing data, columns with the number of distinct values close to
the number of rows will be treated as sequences""")
the number of rows are treated as sequences""")
public FakerConfig setSequenceDetectionEnabled(boolean value)
{
this.sequenceDetectionEnabled = value;
Expand All @@ -96,7 +96,7 @@ public boolean isDictionaryDetectionEnabled()
@ConfigDescription(
"""
If true, when creating a table using existing data, columns with a low number of distinct values
will have the allowed_values column property populated with random values""")
are treated as dictionaries, and get the allowed_values column property populated with random values""")
public FakerConfig setDictionaryDetectionEnabled(boolean value)
{
this.dictionaryDetectionEnabled = value;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -131,14 +131,14 @@ public List<PropertyMetadata<?>> getSchemaProperties()
SchemaInfo.SEQUENCE_DETECTION_ENABLED,
"""
If true, when creating a table using existing data, columns with the number of distinct values close to
the number of rows will be treated as sequences""",
the number of rows are treated as sequences""",
null,
false),
booleanProperty(
SchemaInfo.DICTIONARY_DETECTION_ENABLED,
"""
If true, when creating a table using existing data, columns with a low number of distinct values
will have the allowed_values column property populated with random values""",
are treated as dictionaries, and get the allowed_values column property populated with random values""",
null,
false));
}
Expand All @@ -163,14 +163,14 @@ public List<PropertyMetadata<?>> getTableProperties()
TableInfo.SEQUENCE_DETECTION_ENABLED,
"""
If true, when creating a table using existing data, columns with the number of distinct values close to
the number of rows will be treated as sequences""",
the number of rows are treated as sequences""",
null,
false),
booleanProperty(
TableInfo.DICTIONARY_DETECTION_ENABLED,
"""
If true, when creating a table using existing data, columns with a low number of distinct values
will have the allowed_values column property populated with random values""",
are treated as dictionaries, and get the allowed_values column property populated with random values""",
null,
false));
}
Expand Down
Loading