Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_hz_data_from_NASIS_db - issues with join to phsample #120

Open
brownag opened this issue Jan 17, 2020 · 8 comments
Open

get_hz_data_from_NASIS_db - issues with join to phsample #120

brownag opened this issue Jan 17, 2020 · 8 comments
Assignees
Labels
NASIS-local This tag is used for pull requests, issues, discussions etc. for soilDB local NASIS functions

Comments

@brownag
Copy link
Member

brownag commented Jan 17, 2020

The LEFT OUTER JOIN to phsample used in the main NASIS horizon data query has the potential to duplicate horizon records in the event of a many:one relationship.

There are some data in NASIS that have duplicate entries in phsample -- same labsampnum, sample bag number, depth intervals etc. These are not true subsamples but are rather are the result of duplicate lab pedon entries both being in the database, or a data entry error.

In the rare event of true subsamples (i.e. multiple labsampnum per phiid), the same duplication will occur.

Currently the LEFT OUTER JOIN works for 99% of data and is extremely convenient for getting labsampnum into @horizons for fetchNASIS result. That said, it can cause spurious data quality issues -- and exclusion of pedons with valid data with default argument rmHzErrors=TRUE.

Doing all of the joins in SQL severely limits options for programmatically dealing with the child table phsample. I think this code should be re-vamped to utilize fewer assumptions (see related issue #52 RE: MIN(texcl)) and other SQL acrobatics.

Preference should be to return the unique labsampnum for subsample spanning the whole depth interval of horizon in question -- this would deal cleanly with straight duplications of records in phsample.

Subsequent preference could be given to the largest proportional subsample interval. An additional field could contain a comma-separated string of non-dominant labsampnums.

SELECT peiid, phiid, upedonid as pedon_id,
  hzname, dspcomplayerid as genhz, hzdept, hzdepb,
  claytotest AS clay, CASE WHEN silttotest IS NULL THEN 100 - (claytotest + sandtotest) ELSE silttotest END AS silt, 
  sandtotest AS sand, fragvoltot, texture, texcl, lieutex, phfield, effclass, phs.labsampnum, rupresblkdry, stickiness, plasticity, ksatpedon
  
  FROM 

  pedon_View_1 p 
  INNER JOIN phorizon_View_1 ph ON ph.peiidref = p.peiid 
  LEFT OUTER JOIN phsample_View_1 phs ON phs.phiidref = ph.phiid
  LEFT OUTER JOIN 
  (
  SELECT phiidref, MIN(texcl) AS texcl, MIN(lieutex) as lieutex
  FROM phtexture_View_1 
  GROUP BY phiidref
  ) AS pht ON pht.phiidref = ph.phiid
  
  ORDER BY p.upedonid, ph.hzdept ASC;
@brownag brownag self-assigned this Jan 17, 2020
@brownag
Copy link
Member Author

brownag commented Jan 17, 2020

Recently resolved issue #119 has some lab pedon userpedonIDs that would be a good test case for this.

NOTICE: multiple labsampnum values / horizons; see pedon IDs:
98IN045006,F2008MI121016,F2010MI117001,F2010MI117008,S07MI121014,S07MI139011,S08MI139004,S1968MI057002,S1970IN141001,S1978MI085002,S1979MI081002,S1979MI081005,S1979MI107004,S1980MI073005,S1980MI073006,S1980MI081002,S1980MI081003,S1980MI081007,S1992IN035011,S1994IN035005,S1994IN035005a,S1994IN039021,S1994IN039022,S1994IN039023,S2001MI101005,S2007MI139011,S2009MI063001,S2009MI123002,S2010MI117001,S2010MI117008,S2011MI117102,S2011OH003117,S2012IN003002,S2013MI091028,S2013MI149001,S2013MI161021,S2014MI067001A,S2014MI067001B,S2014MI067001C,S2014MI067001D,S2014MI067001E,S2014MI081001A,S2014MI081001B,S2014MI081001C,S2014MI081001D,S2014MI081001E,S2014MI117001A,S2014MI117001B,S2014MI117001C,S2014MI117001D,S2014MI117001E,S2014MI123001A,S2014MI123001B,S2014MI123001C,S2014MI123001D,S2014MI123001E,S2015MI015001A,S2015MI015001B,S2015MI015001C,S2015MI015001D,S2015MI015001E,S2015MI037001A,S2015MI037001B,S2015MI037001C,S2015MI037001D,S2015MI037001E,S2015MI037002A,S2015MI045001A,S2015MI045001B,S2015MI045001C,S2015MI045001D,S2015MI045001E,S2015MI045002A,S2015MI045002B,S2015MI045002C,S2015MI045002D,S2015MI045002E,S2015MI107001A,S2015MI107001B,S2015MI107001C,S2015MI107001D,S2015MI107001E,S2015OH065002,S2015OH065003,S2015OH109001,S2016OH065004,S2017MI127001

@brownag
Copy link
Member Author

brownag commented Jan 21, 2020

Breadcrumbs for picking up this issue in a bit. Thanks to @phytoclast and Matt Bromley in Grand Rapids for highlighting the severity of this issue.

It is unclear why this type of result is causing duplication -- as the phiidrefs are indeed different.

> RODBC::sqlQuery(channel, "select * from phsample_View_1 where labsampnum = '02N00379';")
  phiidref seqnum labsampnum fldsampid layerdepthtop layerdepthbottom numberofbulksampbags
1   595448     NA   02N00379        NA             0               20                   NA
2  5408125     NA   02N00379        NA             0               20                   NA
  numberofbulkdensityclods numberofnaturalfabricclods numberofothersamples
1                       NA                         NA                   NA
2                       NA                         NA                   NA
  wt20to76mmdiscardedfragments wtlt20mm        recwlupdated recuseriidref phlabsampiid row_state
1                           NA       NA                <NA>            NA       120287         0
2                           NA       NA 2015-03-18 14:16:27          1727       929543         0
  root_row_state GroupPerms SitePerms
1              0      30423       105
2              0      24639       153

@brownag
Copy link
Member Author

brownag commented Jan 21, 2020

@phytoclast

I determined the instances of duplication are related to unique values of labsampnum that are valid because they are, at least in some cases, subsamples.

SoilProfileCollections do not currently support analysis of combination horizons/ multi-sample horizons, so returning these as "horizon errors" is expected behavior -- currently only the pedon ID is reported to the console when this "error" occurs which might hamper figuring out the exact problem.

Here is a local pedon S2001MI101005 that "correctly" returns a horizon errro.
image

Here is the corresponding record in the lab pedon (note 02N00385 is absent) -- where duplication is not occuring becasue labsampnum and phiid are 1:1
image

This is another type of data entry error causing duplication (98IN045006) -- null labsampnums should be removed.
image

@smroecker
Copy link
Member

In other instances, I've dealt with this type of situation by concatenating the results, which to my knowledge can't be done in T-SQL. Therefore I'd suggest creating a separate query to return the results, concatenate the results, and then join them to the horizon table. It would also seem advisable to generate an error message.

@brownag
Copy link
Member Author

brownag commented May 29, 2020

I don't know that it was ever appropriate to assume phorizon and phsample were 1:1 -- it just works most of the time. CONCAT might be possible but not easily able to handle combining non-null with NULL cases -- that would I think nuke legit entries. I think the main issue with doing concatenation, via SQL or R, is "backwards compatibility"

  • do we have scripts that rely on @horizons labsampnum containing a single labsampnum ?
  • would we eliminate the labsampnum column, that typically would not be found at the phorizon level, and name it something else if it could contain concatenated result?

As I suggested above, we could define some sort of calculation .pickBestLabsampnum that picks the best labsampnum (i.e. most complete data with depths, dominant, unique after eliminating NULL etc) as the "RV" that retains current convenience of having labsampnum easily accessible at horizon level for most pedons that have lab data -- as true many:1 relationships are pretty rare. A second field could have all labsampnum concatenated for parsing if needed.

In principle, there needs to be a defined way of supporting records that have actual subsamples or other reasons why repeated measures were made. Perhaps this can be achieved by allowing for weighted averaging overlapping portions as some sort of a pre-filter on the horizon data. A parallel problem is the "disaggregated glossic horizons" @phytoclast brought up in #122 -- which would benefit from a defined way of resolving overlap where the overlapping layers are unique and have one or more attributes filled out that defines the weighting. This probably would involve concatenation or dominant condition aggregation for categoricals.

@brownag brownag added the NASIS-local This tag is used for pull requests, issues, discussions etc. for soilDB local NASIS functions label Jan 16, 2021
@brownag
Copy link
Member Author

brownag commented Apr 28, 2023

This issue is now resolved since the default for rmHzErrors has been FALSE for several months and the aqp SoilProfileCollection object has been made more robust to overlapping horizons. So, in the event of multiple labsampnum per genetic horizon, you get duplication of the standard morphologic data, but as many labsampnum as you have child phsample records. A QC note is printed to the console, which would allow finding/fixing cases where this is not intended.

There may still be room for a .pickBestLabsampnum()-type function for picking the dominant or most complete lab data for a layer, but I think that is outside the scope of default get_hz_data_from_NASIS_db() and fetchNASIS() behavior.
We can also probably concatenate the labsampnum and provide a comma-separated character as a new column

@brownag brownag closed this as completed Apr 28, 2023
@brownag
Copy link
Member Author

brownag commented Jan 31, 2024

There may still be room for a .pickBestLabsampnum()-type function for picking the dominant or most complete lab data for a layer, but I think that is outside the scope of default get_hz_data_from_NASIS_db() and fetchNASIS() behavior.
We can also probably concatenate the labsampnum and provide a comma-separated character as a new column

Looks like probably both of these things need to be implemented, especially in light of the current DSP lab sampling guidance. I think there may be value in having some concept of a 1:1 relationship with phorizon--in current labsampnum column for backward compatibility--but then also a list column, or a comma-separated character column with all the labsampnum for a particular horizon.

@brownag brownag reopened this Jan 31, 2024
@brownag
Copy link
Member Author

brownag commented Feb 1, 2024

Related test added to aqp RE: overlapping horizons: ncss-tech/aqp@37d6e81

Full sample SPC c/o @andypaolucci
ajp 1.rds.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NASIS-local This tag is used for pull requests, issues, discussions etc. for soilDB local NASIS functions
Projects
None yet
Development

No branches or pull requests

2 participants