Include P tags within imported text #26

Open
waldoj opened this issue Jul 22, 2016 · 13 comments

Comments

@waldoj
Member

waldoj commented Jul 22, 2016

Lexis-Nexis' XML includes paragraph tags, but we're not using them. That's creating problems, e.g. definitions that all run together. Stop stripping these out.

@waldoj
Member Author

waldoj commented Jul 30, 2016

The lack of paragraph tags is also causing text to run together, e.g.:

of Health, willfully fail to comply with a confinement, isolation, or quarantine order.</p>
<p>Any person violating the provisions of this section shall be guilty of a Class 2 misdemeanor.</p>

is rendered as:

of Health, willfully fail to comply with a confinement, isolation, or quarantine order.Any person violating the provisions of this section shall be guilty of a Class 2 misdemeanor.
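
A minimal sketch of the effect, in Python rather than XSLT so it is easy to run. It assumes the two paragraphs sit inside a single wrapper element (called `bodytext` here purely for illustration); `flatten` and `paragraphs` are hypothetical helper names, not part of the importer.

```python
import xml.etree.ElementTree as ET

# Fragment modeled on the sample above; the <bodytext> wrapper is an assumption.
SAMPLE = (
    "<bodytext>"
    "<p>of Health, willfully fail to comply with a confinement, isolation, "
    "or quarantine order.</p>"
    "<p>Any person violating the provisions of this section shall be guilty "
    "of a Class 2 misdemeanor.</p>"
    "</bodytext>"
)


def flatten(xml_text):
    """Concatenate every text node and discard all tags (the run-together case)."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


def paragraphs(xml_text):
    """Return the text of each <p> as its own string, keeping the boundaries."""
    root = ET.fromstring(xml_text)
    return ["".join(p.itertext()) for p in root.iter("p")]


run_together = flatten(SAMPLE)
separate = paragraphs(SAMPLE)
```

`run_together` contains "order.Any person" with no separator, while `separate` holds the two paragraphs as distinct items.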

@krusynth
Collaborator

krusynth commented Mar 8, 2017

This is also happening with references, where the section number runs straight into the text that follows it: we get "15.2-2141.Notwithstanding" in § 15.2-2143.

@waldoj
Member Author

waldoj commented Mar 8, 2017

Argh. I think this is some fundamental misunderstanding of XSLT on my part.

@krusynth
Collaborator

krusynth commented Mar 9, 2017

Here's my proposed change to the XSLT and a couple of results. I'm explicitly including P tags, dropping locator and heading, and extracting the contents of citation.

https://gist.github.com/krues8dr/7d6c40f944e3b6abe07edfd1dc73f72d
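
The gist isn't reproduced in this thread, so what follows is not the proposed XSLT. It's a rough Python sketch (standard library only) of the three rules as described: keep `p` tags, drop `locator` and `heading` along with their contents, and keep the text of `citation` without its tag. The element names come from the comment above; the sample fragment, the `bodytext` wrapper, and the names `render`, `DROP`, and `KEEP` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

DROP = {"locator", "heading"}  # removed together with their contents
KEEP = {"p"}                   # tag is written to the output
# Any other element (e.g. citation) is unwrapped: its text stays, its tag goes.


def render(elem):
    """Serialize elem according to the DROP / KEEP / unwrap rules above."""
    if elem.tag in DROP:
        return ""
    inner = escape(elem.text or "")
    for child in elem:
        # The tail is text that follows the child, so it is kept even
        # when the child itself is dropped.
        inner += render(child) + escape(child.tail or "")
    if elem.tag in KEEP:
        return "<p>" + inner + "</p>"
    return inner


# Hypothetical input, not taken from the Lexis-Nexis export.
SAMPLE = (
    "<bodytext>"
    "<heading>Penalty.</heading>"
    "<p>As provided in <citation>15.2-2141</citation>.</p>"
    "<p>Notwithstanding any other provision of law.</p>"
    "</bodytext>"
)

result = render(ET.fromstring(SAMPLE))
```

With the paragraph tags kept, the section number and the following "Notwithstanding" end up in separate `p` elements instead of being fused.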

@krusynth
Collaborator

krusynth commented Mar 9, 2017

Ah. Looks like pre and br tags are needed too, per § 16.1-69.16.

@krusynth
Collaborator

krusynth commented Mar 9, 2017

Oh no, they're misusing the pre tag.

<p><pre/>Juvenile and Domestic<pre/><br/><pre/>General District Court<pre/>Relations District<pre/><br/><pre/>Judges<pre/>Court Judges<pre/><br/>First<pre/>4<pre/>4<pre/><br/>Second<pre/>7<pre/>7<pre/><br/>Two-A<pre/>1<pre/>1<pre/><br/>Third<pre/>2<pre/>3<pre/><br/>Fourth<pre/>6<pre/>5<pre/><br/>Fifth<pre/>2<pre/>2<pre/><br/>Sixth<pre/>4<pre/>2<pre/><br/>Seventh<pre/>4<pre/>4<pre/><br/>Eighth<pre/>3<pre/>3<pre/><br/>Ninth<pre/>3<pre/>4<pre/><br/>Tenth<pre/>3<pre/>4<pre/><br/>Eleventh<pre/>3<pre/>3<pre/><br/>Twelfth<pre/>5<pre/>6<pre/><br/>Thirteenth<pre/>6<pre/>4<pre/><br/>Fourteenth<pre/>5<pre/>5<pre/><br/>Fifteenth<pre/>8<pre/>10<pre/><br/>Sixteenth<pre/>4<pre/>6<pre/><br/>Seventeenth<pre/>3<pre/>2<pre/><br/>Eighteenth<pre/>2<pre/>2<pre/><br/>Nineteenth<pre/>11<pre/>8<pre/><br/>Twentieth<pre/>4<pre/>3<pre/><br/>Twenty-first<pre/>1<pre/>2<pre/><br/>Twenty-second<pre/>2<pre/>4<pre/><br/>Twenty-third<pre/>4<pre/>5<pre/><br/>Twenty-fourth<pre/>3<pre/>6<pre/><br/>Twenty-fifth<pre/>4<pre/>5<pre/><br/>Twenty-sixth<pre/>5<pre/>7<pre/><br/>Twenty-seventh<pre/>5<pre/>5<pre/><br/>Twenty-eighth<pre/>2<pre/>3<pre/><br/>Twenty-ninth<pre/>2<pre/>3<pre/><br/>Thirtieth<pre/>2<pre/>2<pre/><br/>Thirty-first<pre/>5<pre/>5</p>
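
To make the structure of that paragraph easier to see, here is a small Python sketch that treats each `<br/>` as a row break and each `<pre/>` as a cell break. That reading of the markup is an assumption, and `split_rows` is a hypothetical helper. The sample is a shortened copy of the paragraph above (header lines plus the first two data rows).

```python
import xml.etree.ElementTree as ET

# Shortened copy of the quoted paragraph: header lines and two data rows.
SAMPLE = (
    "<p><pre/>Juvenile and Domestic<pre/><br/>"
    "<pre/>General District Court<pre/>Relations District<pre/><br/>"
    "<pre/>Judges<pre/>Court Judges<pre/><br/>"
    "First<pre/>4<pre/>4<pre/><br/>"
    "Second<pre/>7<pre/>7</p>"
)


def split_rows(p):
    """Split a <p> into rows at <br/> and each row into cells at <pre/>."""
    rows, row, cell = [], [], (p.text or "")
    for child in p:
        row.append(cell.strip())  # every <pre/> or <br/> ends the current cell
        if child.tag == "br":
            rows.append(row)
            row = []
        cell = child.tail or ""   # text after the separator starts the next cell
    row.append(cell.strip())
    rows.append(row)
    # Drop the empty cell left by a <pre/> that directly precedes a <br/>.
    for r in rows:
        while r and r[-1] == "":
            r.pop()
    return rows


rows = split_rows(ET.fromstring(SAMPLE))
widths = [len(r) for r in rows]
```

Under this reading the data rows come out as three cells each (`["First", "4", "4"]`), but the first header line has only two cells while the lines below it have three, so the markup alone doesn't say which column "Juvenile and Domestic" belongs over.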

@waldoj waldoj closed this as completed in 5e753b6 Mar 10, 2017
@waldoj
Member Author

waldoj commented Mar 10, 2017

Wow, that's...that's really something.

@krusynth
Collaborator

So, the issue here is that in the export they're translating a nicely-formatted table into a garbage pile of pre tags: http://law.lis.virginia.gov/vacode/title16.1/chapter4.1/section16.1-69.6:1/

Since they didn't bother to preserve the top-left blank spot, I have no reasonable way of reassembling this sanely. This "works":

	<xsl:template match="p">
		<p>
			<xsl:choose>
				<!--We must handle preformatted tables specially-->
				<xsl:when test="pre">
					<xsl:for-each select="node()">
						<xsl:choose>
							<!--If this is a text node, wrap it in pre tags-->
							<xsl:when test="self::text()">
								<pre><xsl:value-of select="current()"/></pre>
							</xsl:when>
							<!--If it's an element, parse it as usual-->
							<xsl:when test="self::*">
								<xsl:apply-templates select="."/>
							</xsl:when>
							<!--Everything else gets removed-->
						</xsl:choose>
					</xsl:for-each>
				</xsl:when>

				<xsl:otherwise>
					<xsl:apply-templates />
				</xsl:otherwise>
			</xsl:choose>
		</p>
	</xsl:template>

However, there are other places where they use these self-closing <pre/> tags for no apparent reason, and the code above ends up formatting these incorrectly. E.g., § 15.2-2272.

Also noteworthy in § 15.2-2272 is that the second paragraph of paragraph 2 is incorrectly being stuck into a new bodytext and level. This means our parsing pushes it into a separate, higher-level section entirely, which is incorrect.

TLDR: I give up.

@krusynth
Collaborator

I'm going to preserve the <br> and broken <pre/> tags in the XML we're generating, and maybe someone can do something useful with them down the road; we can just ignore them ourselves.

krusynth pushed a commit to krusynth/va-decoded that referenced this issue Mar 10, 2017
krusynth pushed a commit to statedecoded/statedecoded that referenced this issue Mar 11, 2017
@krusynth
Collaborator

[Screenshot, 2017-03-11 7:07:36 pm]

This'll do. Also shows #29.

waldoj added a commit that referenced this issue Mar 12, 2017
Add template for pre and br tags.  Per #26
@krusynth
Collaborator

I'm still cleaning this up a bit, on the parser side – I'm getting duplicate text from statedecoded/statedecoded#113

@waldoj waldoj reopened this Mar 12, 2017
@waldoj
Member Author

waldoj commented Mar 12, 2017

I'm still amazed at how poorly Lexis handles this. They had tabular data in the SGML, and they just punted on it with their move to XML.

krusynth pushed a commit to statedecoded/statedecoded that referenced this issue Mar 15, 2017
@krusynth
Collaborator

That should do it. There's still some seriously broken XML that Lexis is putting out, like 38.2-1705, 13.1-514, etc., where prefixes aren't marked up as such, but we can't do anything about that.
