Skip to content
This repository has been archived by the owner on Sep 27, 2019. It is now read-only.

Filter on 'char' field was error while import data by copy #1465

Open
happyjoblzp opened this issue Jul 17, 2018 · 1 comment
Open

Filter on 'char' field was error while import data by copy #1465

happyjoblzp opened this issue Jul 17, 2018 · 1 comment

Comments

@happyjoblzp
Copy link

default_database=# CREATE TABLE REGION  ( R_REGIONKEY  INTEGER NOT NULL,R_NAME       CHAR(25) NOT NULL,R_COMMENT    VARCHAR(152));
CREATE TABLE
default_database=# copy region from '/Users/liuliuzhipeng/tpch-dbgen/region.tbl' with(delimiter '|', null '');
COPY 5
default_database=# select * from region;
 r_regionkey |   r_name    |                                                      r_comment                                                      
-------------+-------------+---------------------------------------------------------------------------------------------------------------------
           0 | AFRICA      | lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to 
           2 | ASIA        | ges. thinly even pinto beans ca
           4 | MIDDLE EAST | uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl
           1 | AMERICA     | hs use ironic, even requests. s
           3 | EUROPE      | ly final courts cajole furiously final excuse
(5 rows)

default_database=# select * from region where r_name='ASIA';
 r_regionkey | r_name | r_comment 
-------------+--------+-----------
(0 rows)

default_database=# select * from region where r_regionkey=2;
 r_regionkey | r_name |            r_comment            
-------------+--------+---------------------------------
           2 | ASIA   | ges. thinly even pinto beans ca
(1 row)
@RonyMin
Copy link

RonyMin commented Nov 30, 2018

I reproduce the issue @happyjoblzp you faced, and analyze the internals of Peloton to solve the issue.
The problem is originated from how Peloton handles the end value of a string, i.e., varchar column.

The basic mechanism for every char-type value is to add '\0' (00000000 in a bitstring) at the end of string value.
For instance, a binary string for selection predicate, r_name = 'ASIA', in a query becomes:
01000001 01010011 01001001 01000001 00000000 (ASIA'\0')
In addition, its length is also increased by one, i.e., not four but five.

However, if we scan tables from CSV file (see void CSVScanner::ProduceCSV(char *line) { ... } in https://github.com/cmu-db/peloton/blob/master/src/codegen/util/csv_scanner.cpp), it does not add such null value.

In this case, if we read a varchar value with increased length by one, i.e., read five characters instead of four character, scanned char value becomes just a char value + its delimiter.

For instance, typical TPC-H and TPC-DS dataset generators use '|' as their column delimiter.
Thus, Peloton compares 'ASIA|' (by scanning column value of a table) to 'ASIA\0' (by parsing SQL query), and thus string comparator always returns false (please refer static inline int CompareStrings(const char* str1, int len1, const char* str2, int len2) { ... } function in (https://github.com/cmu-db/peloton/blob/master/src/include/type/type_util.h).

I create PR #1490 for solving the issue, and thus please refer the simple solution.
To solve the issue, as a simple test, I just replace all delimiters in CSV to '\0' while scanning CSV.
Then, the result of a query becomes correct.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants