Skip to content
This repository has been archived by the owner on Jun 29, 2021. It is now read-only.

dev/add datatypes to excel columns #230

Open
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

SociopathicPixel
Copy link

Updated:

  • poi-ooxml: version update from 4.0.1 to 4.1.0
  • ExcelDataContextTest: Updated test outcome String.

Jan Bob added 2 commits September 25, 2019 14:37
added eager datatype checker for first 1000 rows
Copy link
Contributor

@arjansh arjansh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next to the inline comments, can you also add some test cases to DefaultSpreadsheetReaderDelegateTest, which test the new functionality?

return table;
}

private void setColumnType(Iterator<Row> data, int rowLength, ColumnType[] columnTypes) {
while (data.hasNext()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially you're iterating over the complete data set now. I would expect you to only look at the first line (or maybe the first few lines of data) to get the column type.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a query to look up only the first 1000records, I see that that got removed when getting DataRow types instead of the "normal" Row type.
I've added a countdown from 1000 in the while-loop

int eagerness = 1000;
...
...
while(data.hasnext() && eagerness-- > 0) {
...
...

@kaspersorensen
Copy link
Contributor

It would have been nice to have a small discussion about this change before doing the PR because I feel like there's maybe a few things I would have liked to do differently:

  • The idea of detecting column type seems general enough that we should have a facility for it as a decorator pattern or such.
  • The PR does not do anything to ensure conversion of values to conform with the data types. I know we have a converter API for that in the core module.
  • I do like the change from VARCHAR to STRING. For the sake of separating concerns I would do that in a separate PR though.

@kaspersorensen
Copy link
Contributor

Oh, and column type detection should be optional. We don't want to break existing users and we also don't want it to potentially break while reading beyond the 1000th first records.

Copy link
Contributor

@arjansh arjansh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outside of inline comments, please also add unit tests for the added functionality.

if (currentRow.getLastCellNum() == 0) {
continue;
}
if (currentRow.getCell(index) == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change the logic a bit around here. I would propose you extract lines 189 through 218 into a separate method which returns the ColumnType of a cell in a row. And then move the logic from lines 225 through 231 into this method deciding whether or not to assign that value to the columnTypes array.

Copy link
Author

@SociopathicPixel SociopathicPixel Sep 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted the method: getColumnTypeFromRow(final ColumnType columnType, final Row currentRow, int index) from lines 189 through 218.

I haven't moved the logic of
checkColumnTypes(final ColumnType expecetedColumnType, ColumnType columnType)
to
getColumnTypeFromRow(final ColumnType columnType, final Row currentRow, int index)
Cause it gets used multiple times.

@arjansh
Copy link
Contributor

arjansh commented Sep 27, 2019

It would have been nice to have a small discussion about this change before doing the PR because I feel like there's maybe a few things I would have liked to do differently:

* The idea of detecting column type seems general enough that we should have a facility for it as a decorator pattern or such.

I'm not sure how you mean that this is general, because, yes it is general because all other data stores which support data types already have their own mechanism for detecting column types, but they're all implemented in different manners, because each data store provides a different manner to get the data types. For all these data stores it's default behavior to determine the column types, so why not for Excel too?

* The PR does not do anything to ensure conversion of values to conform with the data types. I know  we have a converter API for that in the core module.

* I  do like  the change from VARCHAR to STRING. For the sake of separating concerns I would  do that in a separate PR though.

Oh, and column type detection should be optional. We don't want to break existing users and we also don't want it to potentially break while reading beyond the 1000th first records.

I completely agree with these three last remarks. I'm also not sure if scanning a 1000 rows in an Excel file may be too much, because of possible performance issues.

@kaspersorensen
Copy link
Contributor

I'm not sure how you mean that this is general, because, yes it is general because all other data stores which support data types already have their own mechanism for detecting column types, but they're all implemented in different manners, because each data store provides a different manner to get the data types. For all these data stores it's default behavior to determine the column types, so why not for Excel too?

Maybe just some "column type detector" reusable class that you could use as a sort of builder to add data type observations to. Basically just components that we could derive a general pattern from. And potentially have the 1000 constant in here incapsulated (and eventually made parameterized) from etc.

@kaspersorensen
Copy link
Contributor

Wishful thinking code:

ColumnTypeDetector ctd = new ColumnTypeDetector(tableName, columnNames);
for (Object[] record : nativeDataSet) {
  for (int i=0: i<columnNames.length; i++) {
    Object value = record[i];
    ctd.registerValue(i, value);
  }
  if (ctd.sampledEnough()) {
    break;
  }
}
Table table = ctd.createTable();

@kaspersorensen
Copy link
Contributor

That thing could conceivably then also detect whether something is nullable etc.

@arjansh
Copy link
Contributor

arjansh commented Sep 27, 2019

I'm not sure how you mean that this is general, because, yes it is general because all other data stores which support data types already have their own mechanism for detecting column types, but they're all implemented in different manners, because each data store provides a different manner to get the data types. For all these data stores it's default behavior to determine the column types, so why not for Excel too?

Maybe just some "column type detector" reusable class that you could use as a sort of builder to add data type observations to. Basically just components that we could derive a general pattern from. And potentially have the 1000 constant in here incapsulated (and eventually made parameterized) from etc.

I see what you mean now. I think the code here can be organized in a manner so that the column type detection is implemented on a separate class. I personally don't think it should be generalized right away in the context of this PR, but more likely in a separate PR, because I'm afraid otherwise the scope of this PR will become a bit too broad.

@kaspersorensen
Copy link
Contributor

I see what you mean now. I think the code here can be organized in a manner so that the column type detection is implemented on a separate class. I personally don't think it should be generalized right away in the context of this PR, but more likely in a separate PR, because I'm afraid otherwise the scope of this PR will become a bit too broad.

Yes that's a fine approach.

@SociopathicPixel
Copy link
Author

I'm also not sure if scanning a 1000 rows in an Excel file may be too much, because of possible performance issues.

Maybe we can take a percentage of the file row count? Like check 10% with a max cap of 1000 rows.

@kaspersorensen
Copy link
Contributor

Maybe we can take a percentage of the file row count? Like check 10% with a max cap of 1000 rows.

I don't think it's a case where we can set a default that our users will be happy with. I think 1000 is an OK default for when the feature is enabled. I think you should add the number into ExcelConfiguration along with a boolean that enables/disables the feature. And the default should be to have it OFF so that we retain backwards compatibility.

SociopathicPixel and others added 7 commits September 30, 2019 13:52
* I  do like  the change from VARCHAR to STRING. For the sake of separating concerns I would  do that in a separate PR though. - changed STRING back to VARCHAR
* And potentially have the 1000 constant in here incapsulated - Added EAGER_READ to ExcelConfiguration including boolean validateColumnTypes
* I  do like  the change from VARCHAR to STRING. For the sake of separating concerns I would  do that in a separate PR though. - changed STRING back to VARCHAR
* And potentially have the 1000 constant in here incapsulated - Added EAGER_READ to ExcelConfiguration including boolean validateColumnTypes
corrected a few tests cause these where breaking on the datatypes
Copy link
Contributor

@arjansh arjansh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @kaspersorensen mentioned earlier, when the data type detection is enabled, you want to use it when writing to an Excel sheet, so for example in the ExcelInsertBuilder, you probably want to check if the class of the value that is inserted matches the ColumnType for the cell its inserted into.

if (currentRow.getLastCellNum() == 0) {
continue;
}
columnTypes[index] = getColumnTypeFromRow(columnTypes[index], currentRow, index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it better if you don't pass the current value of the columnType at the current index into the getColumnTypeFromRow method. I would move the logic from the checkColumnType(ColumnType, ColumnType) here, so that method can be removed and the getColumnTypeFromRow` just returns the ColumnType it determines for its inspected cell.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a comment about this earlier on which I didn't get an answer. The logic in checkColumnType(expected, columnType) gets used multiple times in this method so why not reuse the code in a seperate method like now?

Jan Bob and others added 7 commits October 2, 2019 16:33
switching devices and apearantly autosave was off -_-
…om/SociopathicPixel/metamodel into dev/add-datatypes-to-excel-columns

# Conflicts:
#	excel/src/test/java/org/apache/metamodel/excel/ExcelDataContextTest.java
coudn't find any finals
final Column column = new MutableColumn(columnNamingSession.getNextColumnName(namingContext),
ColumnType.STRING, table, j, true);
final Column column;
if (_configuration.isDetectColumnTypes()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do this check inside the getColumnTypes method, because you only want to execute the logic which scans a number of rows in the Excel sheet for data types when this is activated, otherwise you'll only lose performance for something which you don't use. In case isDetectColumnTypes returns false, just have the getColumnTypes method return an array filled with ColumnType.STRING objects.

Copy link
Author

@SociopathicPixel SociopathicPixel Oct 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean the whole part? (lines 135 - 185).
Also appearantly I'm not in beta mode on this repo which is a bummer for selecting code in review comments.

}

ColumnType columnType = columnTypes[index];
ColumnType expecetedColumnType = getColumnTypeFromRow(currentRow, index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename this variable to expectedColumnType and make both this and the columnType variable final?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done with next commit

if (currentRow.getCell(index) == null) {
return ColumnType.STRING;
} else {
CellType cellType = currentRow.getCell(index).getCellType();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this variable final?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the currentRow.getCell(index).getCellType() into the switch statement with the next commit

Comment on lines 144 to 145
public int getEagerness() {
return eagerReader;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename this method and the underlying field to something different? When I see "eager" in combination with "reader", I interpret it as the opposite of "lazy reading", which is not what this method is about. I would rather see something like getNumberOfLinesToScan, getScanLines or something even better.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done with next commit

@@ -69,12 +69,33 @@ public void execute() throws MetaModelException {
private List<Row> updateRows(List<Row> rows) {
for (ListIterator<Row> it = rows.listIterator(); it.hasNext();) {
final Row original = (Row) it.next();
validateUpdateType(original);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add this here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed it from the class with next commit, I think it was to prevent the insertion of for example a String value into a Column which only contains Integers and therefore has the columnType of Integer.

@@ -149,8 +151,33 @@ protected CellStyle fetch() {
cell.setCellStyle(cellStyle.get());
}
}
validateUpdateType(row);
Copy link
Contributor

@arjansh arjansh Oct 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the value is of adding this here. I think that for now it may be best to skip the "validation" part.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see: ExcelDataContextTest.testUpdateDifferentDataTypes

It looks if the Column has a specific ColumnType and if so it will validate the given row.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SociopathicPixel I'm sorry, but I still don't get why this has been added here.

@arjansh
Copy link
Contributor

arjansh commented Oct 11, 2019

Outside of the comments the (Excel) tests currently fail, that should be addressed.

@SociopathicPixel
Copy link
Author

Outside of the comments the (Excel) tests currently fail, that should be addressed.

I see indeed that I still had one change staged which wasn't commited (validateColumnType vs detectColumnType)
Will be fixed in the next commit

@arjansh
Copy link
Contributor

arjansh commented Oct 21, 2019

@SociopathicPixel Can you also add a unit test for an .xlsx Excel file, because MetaModel uses the XlsxSpreadsheetReaderDelegate to read these Excel files (see line 192 in ExcelDataContext), so you may need to build in a check which ensure that in case detectColumnTypes is set to true on the ExcelConfiguration, the DefaultSpreadsheetReaderDelegate is always used for reading the Excel file.

@arjansh
Copy link
Contributor

arjansh commented Oct 21, 2019

@SociopathicPixel The values in a Row of an Excel DataSet are all still String objects. They don't use the detected column type to return an object of the expected type. This is implemented in the ExcelUtils#createRow(Workbook, Row, DataSetHeader) method.

Copy link
Contributor

@arjansh arjansh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned before, the values in a Row of an Excel DataSet are all still String objects. They don't use the detected column type to return an object of the expected type. This is implemented in the ExcelUtils#createRow(Workbook, Row, DataSetHeader) method.

final ColumnType[] columnTypes = new ColumnType[rowLength];
if (_configuration.isDetectColumnTypes()) {

int eagerness = _configuration.getEagerness();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change eagerness to something like numberOfLinesToScan?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with next commit.


public ExcelConfiguration() {
this(DEFAULT_COLUMN_NAME_LINE, true, false);
}

public ExcelConfiguration(int columnNameLineNumber, boolean skipEmptyLines, boolean skipEmptyColumns) {
this(columnNameLineNumber, null, skipEmptyLines, skipEmptyColumns);
this(columnNameLineNumber, null, skipEmptyLines, skipEmptyColumns, false, 1000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if the default value of a "1000" was in a constant, and maybe you want to set the default to zero in case you pass "false" as an argument for the detectColumnTypes property.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with next commit.

}

public ExcelConfiguration(int columnNameLineNumber, ColumnNamingStrategy columnNamingStrategy,
boolean skipEmptyLines, boolean skipEmptyColumns) {
boolean skipEmptyLines, boolean skipEmptyColumns, boolean detectColumnTypes, int eagerness) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename eagerness parameter to numberOfLinesToScan?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with next commit.

@@ -38,25 +38,38 @@
public static final int NO_COLUMN_NAME_LINE = 0;
public static final int DEFAULT_COLUMN_NAME_LINE = 1;

private final int getNumberOfLinesToScan;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename getNumberOfLinesToScan parameter to numberOfLinesToScan?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with next commit.

+ detectColumnTypes + "]";
}

public int getEagerness() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename getEagerness to getNumberOfLinesToScan?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with next commit.

@Override
protected void decorateIdentity(List<Object> identifiers) {
identifiers.add(columnNameLineNumber);
identifiers.add(skipEmptyLines);
identifiers.add(skipEmptyColumns);
identifiers.add(detectColumnTypes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add numberOfLinesToScan to identifiers?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with next commit.

}

@Override
public String toString() {
return "ExcelConfiguration[columnNameLineNumber="
+ columnNameLineNumber + ", skipEmptyLines=" + skipEmptyLines
+ ", skipEmptyColumns=" + skipEmptyColumns + "]";
+ ", skipEmptyColumns=" + skipEmptyColumns +", detectColumnTypes="
+ detectColumnTypes + "]";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add numberOfLinesToScan?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with next commit.

@@ -152,10 +160,150 @@ public void close() throws IOException {
});
}

private ColumnType[] getColumnTypes(final Sheet sheet, final Row row) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are copy/pasting a lot of code into this class from the DefaultSpreadsheetReaderDelegate class, that seems like a bad idea to me, why not always use the DefaultSpreadsheetReaderDelegate in case you need to detect column types and don't change anything in this class?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with next commit

@arjansh
Copy link
Contributor

arjansh commented Oct 24, 2019

@SociopathicPixel It may also be a good idea to merge master branch (from apache/metamodel) into this branch, because that contains a general fix for the failing unit tests.

@@ -149,8 +151,33 @@ protected CellStyle fetch() {
cell.setCellStyle(cellStyle.get());
}
}
validateUpdateType(row);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SociopathicPixel I'm sorry, but I still don't get why this has been added here.

final Style[] styles = new Style[size];
if (row != null) {
for (int i = 0; i < size; i++) {
final int columnNumber = header.getSelectItem(i).getColumn().getColumnNumber();
final Cell cell = row.getCell(columnNumber);
final String value = ExcelUtils.getCellValue(workbook, cell);
final Object value = ExcelUtils.getCellValue(workbook, cell);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning an Object doesn't do the trick, we know based on the column type, what type of object should be returned, so use that to either return a String, Integer, Boolean or Date object.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants