get_hz_data_from_NASIS_db - issues with join to phsample #120

brownag · 2020-01-17T22:44:59Z

The LEFT OUTER JOIN to phsample used in the main NASIS horizon data query has the potential to duplicate horizon records in the event of a many:one relationship.

There are some data in NASIS that have duplicate entries in phsample -- same labsampnum, sample bag number, depth intervals etc. These are not true subsamples but are rather are the result of duplicate lab pedon entries both being in the database, or a data entry error.

In the rare event of true subsamples (i.e. multiple labsampnum per phiid), the same duplication will occur.

Currently the LEFT OUTER JOIN works for 99% of data and is extremely convenient for getting labsampnum into @horizons for fetchNASIS result. That said, it can cause spurious data quality issues -- and exclusion of pedons with valid data with default argument rmHzErrors=TRUE.

Doing all of the joins in SQL severely limits options for programmatically dealing with the child table phsample. I think this code should be re-vamped to utilize fewer assumptions (see related issue #52 RE: MIN(texcl)) and other SQL acrobatics.

Preference should be to return the unique labsampnum for subsample spanning the whole depth interval of horizon in question -- this would deal cleanly with straight duplications of records in phsample.

Subsequent preference could be given to the largest proportional subsample interval. An additional field could contain a comma-separated string of non-dominant labsampnums.

SELECT peiid, phiid, upedonid as pedon_id,
  hzname, dspcomplayerid as genhz, hzdept, hzdepb,
  claytotest AS clay, CASE WHEN silttotest IS NULL THEN 100 - (claytotest + sandtotest) ELSE silttotest END AS silt, 
  sandtotest AS sand, fragvoltot, texture, texcl, lieutex, phfield, effclass, phs.labsampnum, rupresblkdry, stickiness, plasticity, ksatpedon
  
  FROM 

  pedon_View_1 p 
  INNER JOIN phorizon_View_1 ph ON ph.peiidref = p.peiid 
  LEFT OUTER JOIN phsample_View_1 phs ON phs.phiidref = ph.phiid
  LEFT OUTER JOIN 
  (
  SELECT phiidref, MIN(texcl) AS texcl, MIN(lieutex) as lieutex
  FROM phtexture_View_1 
  GROUP BY phiidref
  ) AS pht ON pht.phiidref = ph.phiid
  
  ORDER BY p.upedonid, ph.hzdept ASC;

The text was updated successfully, but these errors were encountered:

brownag · 2020-01-17T22:53:44Z

Recently resolved issue #119 has some lab pedon userpedonIDs that would be a good test case for this.

NOTICE: multiple labsampnum values / horizons; see pedon IDs:
98IN045006,F2008MI121016,F2010MI117001,F2010MI117008,S07MI121014,S07MI139011,S08MI139004,S1968MI057002,S1970IN141001,S1978MI085002,S1979MI081002,S1979MI081005,S1979MI107004,S1980MI073005,S1980MI073006,S1980MI081002,S1980MI081003,S1980MI081007,S1992IN035011,S1994IN035005,S1994IN035005a,S1994IN039021,S1994IN039022,S1994IN039023,S2001MI101005,S2007MI139011,S2009MI063001,S2009MI123002,S2010MI117001,S2010MI117008,S2011MI117102,S2011OH003117,S2012IN003002,S2013MI091028,S2013MI149001,S2013MI161021,S2014MI067001A,S2014MI067001B,S2014MI067001C,S2014MI067001D,S2014MI067001E,S2014MI081001A,S2014MI081001B,S2014MI081001C,S2014MI081001D,S2014MI081001E,S2014MI117001A,S2014MI117001B,S2014MI117001C,S2014MI117001D,S2014MI117001E,S2014MI123001A,S2014MI123001B,S2014MI123001C,S2014MI123001D,S2014MI123001E,S2015MI015001A,S2015MI015001B,S2015MI015001C,S2015MI015001D,S2015MI015001E,S2015MI037001A,S2015MI037001B,S2015MI037001C,S2015MI037001D,S2015MI037001E,S2015MI037002A,S2015MI045001A,S2015MI045001B,S2015MI045001C,S2015MI045001D,S2015MI045001E,S2015MI045002A,S2015MI045002B,S2015MI045002C,S2015MI045002D,S2015MI045002E,S2015MI107001A,S2015MI107001B,S2015MI107001C,S2015MI107001D,S2015MI107001E,S2015OH065002,S2015OH065003,S2015OH109001,S2016OH065004,S2017MI127001

brownag · 2020-01-21T18:06:24Z

Breadcrumbs for picking up this issue in a bit. Thanks to @phytoclast and Matt Bromley in Grand Rapids for highlighting the severity of this issue.

It is unclear why this type of result is causing duplication -- as the phiidrefs are indeed different.

> RODBC::sqlQuery(channel, "select * from phsample_View_1 where labsampnum = '02N00379';")
  phiidref seqnum labsampnum fldsampid layerdepthtop layerdepthbottom numberofbulksampbags
1   595448     NA   02N00379        NA             0               20                   NA
2  5408125     NA   02N00379        NA             0               20                   NA
  numberofbulkdensityclods numberofnaturalfabricclods numberofothersamples
1                       NA                         NA                   NA
2                       NA                         NA                   NA
  wt20to76mmdiscardedfragments wtlt20mm        recwlupdated recuseriidref phlabsampiid row_state
1                           NA       NA                <NA>            NA       120287         0
2                           NA       NA 2015-03-18 14:16:27          1727       929543         0
  root_row_state GroupPerms SitePerms
1              0      30423       105
2              0      24639       153

brownag · 2020-01-21T21:21:44Z

@phytoclast

I determined the instances of duplication are related to unique values of labsampnum that are valid because they are, at least in some cases, subsamples.

SoilProfileCollections do not currently support analysis of combination horizons/ multi-sample horizons, so returning these as "horizon errors" is expected behavior -- currently only the pedon ID is reported to the console when this "error" occurs which might hamper figuring out the exact problem.

Here is a local pedon S2001MI101005 that "correctly" returns a horizon errro.

Here is the corresponding record in the lab pedon (note 02N00385 is absent) -- where duplication is not occuring becasue labsampnum and phiid are 1:1

This is another type of data entry error causing duplication (98IN045006) -- null labsampnums should be removed.

smroecker · 2020-05-28T19:05:43Z

In other instances, I've dealt with this type of situation by concatenating the results, which to my knowledge can't be done in T-SQL. Therefore I'd suggest creating a separate query to return the results, concatenate the results, and then join them to the horizon table. It would also seem advisable to generate an error message.

brownag · 2020-05-29T01:58:03Z

I don't know that it was ever appropriate to assume phorizon and phsample were 1:1 -- it just works most of the time. CONCAT might be possible but not easily able to handle combining non-null with NULL cases -- that would I think nuke legit entries. I think the main issue with doing concatenation, via SQL or R, is "backwards compatibility"

do we have scripts that rely on @horizons labsampnum containing a single labsampnum ?
would we eliminate the labsampnum column, that typically would not be found at the phorizon level, and name it something else if it could contain concatenated result?

As I suggested above, we could define some sort of calculation .pickBestLabsampnum that picks the best labsampnum (i.e. most complete data with depths, dominant, unique after eliminating NULL etc) as the "RV" that retains current convenience of having labsampnum easily accessible at horizon level for most pedons that have lab data -- as true many:1 relationships are pretty rare. A second field could have all labsampnum concatenated for parsing if needed.

In principle, there needs to be a defined way of supporting records that have actual subsamples or other reasons why repeated measures were made. Perhaps this can be achieved by allowing for weighted averaging overlapping portions as some sort of a pre-filter on the horizon data. A parallel problem is the "disaggregated glossic horizons" @phytoclast brought up in #122 -- which would benefit from a defined way of resolving overlap where the overlapping layers are unique and have one or more attributes filled out that defines the weighting. This probably would involve concatenation or dominant condition aggregation for categoricals.

brownag · 2023-04-28T22:42:47Z

This issue is now resolved since the default for rmHzErrors has been FALSE for several months and the aqp SoilProfileCollection object has been made more robust to overlapping horizons. So, in the event of multiple labsampnum per genetic horizon, you get duplication of the standard morphologic data, but as many labsampnum as you have child phsample records. A QC note is printed to the console, which would allow finding/fixing cases where this is not intended.

There may still be room for a .pickBestLabsampnum()-type function for picking the dominant or most complete lab data for a layer, but I think that is outside the scope of default get_hz_data_from_NASIS_db() and fetchNASIS() behavior.
We can also probably concatenate the labsampnum and provide a comma-separated character as a new column

brownag · 2024-01-31T23:02:03Z

There may still be room for a .pickBestLabsampnum()-type function for picking the dominant or most complete lab data for a layer, but I think that is outside the scope of default get_hz_data_from_NASIS_db() and fetchNASIS() behavior.
We can also probably concatenate the labsampnum and provide a comma-separated character as a new column

Looks like probably both of these things need to be implemented, especially in light of the current DSP lab sampling guidance. I think there may be value in having some concept of a 1:1 relationship with phorizon--in current labsampnum column for backward compatibility--but then also a list column, or a comma-separated character column with all the labsampnum for a particular horizon.

brownag · 2024-02-01T00:54:25Z

Related test added to aqp RE: overlapping horizons: ncss-tech/aqp@37d6e81

Full sample SPC c/o @andypaolucci
ajp 1.rds.zip

brownag self-assigned this Jan 17, 2020

brownag added the NASIS-local This tag is used for pull requests, issues, discussions etc. for soilDB local NASIS functions label Jan 16, 2021

This was referenced Jan 16, 2021

consolidate / merge / coordinate all "flattening" code #135

Open

Automatic QC/filtering of NASIS records w/ fetchNASIS rmHzErrors ETC. #159

Closed

hammerly mentioned this issue Jul 1, 2021

fetchNASIS phlabresults join issue #192

Closed

brownag closed this as completed Apr 28, 2023

brownag reopened this Jan 31, 2024

brownag mentioned this issue Feb 24, 2024

Add collapseHz() ncss-tech/aqp#307

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

get_hz_data_from_NASIS_db - issues with join to phsample #120

get_hz_data_from_NASIS_db - issues with join to phsample #120

brownag commented Jan 17, 2020 •

edited

Loading

brownag commented Jan 17, 2020 •

edited

Loading

brownag commented Jan 21, 2020

brownag commented Jan 21, 2020 •

edited

Loading

smroecker commented May 28, 2020

brownag commented May 29, 2020 •

edited

Loading

brownag commented Apr 28, 2023

brownag commented Jan 31, 2024

brownag commented Feb 1, 2024

get_hz_data_from_NASIS_db - issues with join to phsample #120

get_hz_data_from_NASIS_db - issues with join to phsample #120

Comments

brownag commented Jan 17, 2020 • edited Loading

brownag commented Jan 17, 2020 • edited Loading

brownag commented Jan 21, 2020

brownag commented Jan 21, 2020 • edited Loading

smroecker commented May 28, 2020

brownag commented May 29, 2020 • edited Loading

brownag commented Apr 28, 2023

brownag commented Jan 31, 2024

brownag commented Feb 1, 2024

brownag commented Jan 17, 2020 •

edited

Loading

brownag commented Jan 17, 2020 •

edited

Loading

brownag commented Jan 21, 2020 •

edited

Loading

brownag commented May 29, 2020 •

edited

Loading