Filter on 'char' field was error while import data by copy #1465

happyjoblzp · 2018-07-17T12:23:04Z

default_database=# CREATE TABLE REGION  ( R_REGIONKEY  INTEGER NOT NULL,R_NAME       CHAR(25) NOT NULL,R_COMMENT    VARCHAR(152));
CREATE TABLE
default_database=# copy region from '/Users/liuliuzhipeng/tpch-dbgen/region.tbl' with(delimiter '|', null '');
COPY 5
default_database=# select * from region;
 r_regionkey |   r_name    |                                                      r_comment                                                      
-------------+-------------+---------------------------------------------------------------------------------------------------------------------
           0 | AFRICA      | lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to 
           2 | ASIA        | ges. thinly even pinto beans ca
           4 | MIDDLE EAST | uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl
           1 | AMERICA     | hs use ironic, even requests. s
           3 | EUROPE      | ly final courts cajole furiously final excuse
(5 rows)

default_database=# select * from region where r_name='ASIA';
 r_regionkey | r_name | r_comment 
-------------+--------+-----------
(0 rows)

default_database=# select * from region where r_regionkey=2;
 r_regionkey | r_name |            r_comment            
-------------+--------+---------------------------------
           2 | ASIA   | ges. thinly even pinto beans ca
(1 row)

The text was updated successfully, but these errors were encountered:

RonyMin · 2018-11-30T02:48:22Z

I reproduce the issue @happyjoblzp you faced, and analyze the internals of Peloton to solve the issue.
The problem is originated from how Peloton handles the end value of a string, i.e., varchar column.

The basic mechanism for every char-type value is to add '\0' (00000000 in a bitstring) at the end of string value.
For instance, a binary string for selection predicate, r_name = 'ASIA', in a query becomes:
01000001 01010011 01001001 01000001 00000000 (ASIA'\0')
In addition, its length is also increased by one, i.e., not four but five.

However, if we scan tables from CSV file (see void CSVScanner::ProduceCSV(char *line) { ... } in https://github.com/cmu-db/peloton/blob/master/src/codegen/util/csv_scanner.cpp), it does not add such null value.

In this case, if we read a varchar value with increased length by one, i.e., read five characters instead of four character, scanned char value becomes just a char value + its delimiter.

For instance, typical TPC-H and TPC-DS dataset generators use '|' as their column delimiter.
Thus, Peloton compares 'ASIA|' (by scanning column value of a table) to 'ASIA\0' (by parsing SQL query), and thus string comparator always returns false (please refer static inline int CompareStrings(const char* str1, int len1, const char* str2, int len2) { ... } function in (https://github.com/cmu-db/peloton/blob/master/src/include/type/type_util.h).

I create PR #1490 for solving the issue, and thus please refer the simple solution.
To solve the issue, as a simple test, I just replace all delimiters in CSV to '\0' while scanning CSV.
Then, the result of a query becomes correct.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Filter on 'char' field was error while import data by copy #1465

Filter on 'char' field was error while import data by copy #1465

happyjoblzp commented Jul 17, 2018

RonyMin commented Nov 30, 2018

Filter on 'char' field was error while import data by copy #1465

Filter on 'char' field was error while import data by copy #1465

Comments

happyjoblzp commented Jul 17, 2018

RonyMin commented Nov 30, 2018