[jsword-devel] JSword and Map/Image modules

DM Smith dmsmith555 at yahoo.com
Wed Jan 28 18:24:18 MST 2009


Brian Fernandes wrote:
> DM Smith wrote:
>> Brian Fernandes wrote:
>>> To answer my own questions:
>>>
>>> 1) Map modules can be both hierarchical and simple lists. I should 
>>> have realized that all the modules I talked about earlier were in 
>>> fact Map modules; but two (FarsiBibleAtlas and AbsMaps) were stored 
>>> in gen book format? (not sure of the right terminology here). The 
>>> other two were in dictionary format.
>>>
>>> So basically we'll have to make a small change to BD code to see 
>>> which of these formats is being used and use a list or a tree 
>>> accordingly.
>>
>> This should be handled already. All dictionary modules are treated as 
>> lists and all gen book modules are treated as trees.
> What's happening is slightly different. BD is looking at the stated 
> module *category* and not the actual module type to decide between 
> tree and list. Both FarsiBibleAtlas and Epiphany Maps are in the Map 
> *category*. However, for FarsiBibleAtlas the actual Book object is an 
> instance of SwordGenBook. For Epiphany Maps, the Book object is an 
> instance of SwordDictionary.
>
> Here is relevant info from the corresponding .conf files.
>
> [EpiphanyMaps]
> DataPath=./modules/lexdict/rawld4/epiphany-maps/maps
> ModDrv=RawLD4
> Category=Maps
>
> [ABSMaps]
> DataPath=./modules/genbook/rawgenbook/absmaps/maps
> ModDrv=RawGenBook
> Category=Maps
>
>
> Since BD is taking a decision based on category, it chooses to use a 
> tree for both, ergo FarsiBibleAtlas works just fine, but Epiphany 
> fails to show any sort of list.
>
> The options I see are:
> a) Change logic (just for maps) to make a list/tree decision based on 
> the type of book object.
I think this is the right choice.
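
A minimal sketch of option (a), assuming nothing about the real BD code. 
The Book, SwordGenBook and SwordDictionary types below are hypothetical 
stand-ins that only share names with the JSword classes mentioned above; 
KeyView and MapViewChooser are invented for illustration.

```java
// Hypothetical stand-ins for the JSword types named in this thread.
// The real classes carry far more than a name.
interface Book {
    String getName();
}

class SwordDictionary implements Book {
    private final String name;
    SwordDictionary(String name) { this.name = name; }
    public String getName() { return name; }
}

class SwordGenBook implements Book {
    private final String name;
    SwordGenBook(String name) { this.name = name; }
    public String getName() { return name; }
}

// Which widget should present the keys of a book (invented for this sketch).
enum KeyView { LIST, TREE }

final class MapViewChooser {
    // Decide from the actual Book implementation, not from the Category
    // entry in the .conf file.
    static KeyView choose(Book book) {
        if (book instanceof SwordGenBook) {
            return KeyView.TREE; // gen books are hierarchical
        }
        return KeyView.LIST;     // dictionaries (and anything else) are flat
    }
}
```

With this, a RawLD4 module such as EpiphanyMaps would get a list and a 
RawGenBook module such as ABSMaps would get a tree, even though both 
declare Category=Maps.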

>
> b) Decide that all maps should be either hierarchical or linear and 
> make sure BD works for that decision. This decision can only be taken 
> if there is some consensus on what a map module should be (which I 
> haven't found yet) and then the "losing type" of modules would need to 
> be rebuilt.
>
> I prefer the former option, of course :) The fix would be pretty 
> simple; my development version of FireBible now supports both types.
>
>
>>> 2) NetMaps was not working because its content is something like this:
>>>
>>> <br><br>Journey of Paul (JP) #1, grid D2<br><img 
>>> src="/images/jp1.jpg"/><br><br>Journey of Paul (JP) #2, grid 
>>> D2<br><img src="/images/jp2.jpg"/><br><br>Journey of Paul (JP) #3, 
>>> grid D2<br><img src="/images/jp3.jpg"/><br><br>Journey of Paul (JP) 
>>> #4, grid D2<br><img src="/images/jp4.jpg"/>
>>>
>>>
>>> This is not valid XML as the <br> tags are not closed and the 
>>> fallback code simply removes all tags in an effort to display 
>>> something.
>>>
>>> If you replace <br> with <br/>, it works just fine. The THMLFilter 
>>> class makes 3 attempts to parse the text.
>>>
>>> a) The first attempt is made after removing invalid '&' characters 
>>> in the text.
>>>
>>> b) If the above fails, it does some further character clean up, 
>>> removing disallowed characters from the XML.
>>>
>>> c) If this still fails, it simply removes all tags.
>>>
>>> Maybe we can add an additional step between b and c which would 
>>> replace "<br>" with "<br/>"?  Or perhaps do it as part of step b. 
>>> Any other tags like this which we may want to clean up?
>>>
>>> DM, what do you think? 
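
The three attempts described above, plus the proposed extra step, could be 
sketched roughly as follows. This is not the actual THMLFilter code: the 
class and method names are made up, the cleanup rules are guesses at what 
steps (a) and (b) do, and the well-formedness check simply wraps the 
fragment in a dummy root element and hands it to the JDK's DOM parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class FallbackSketch {
    // True if the fragment parses once wrapped in a dummy root element.
    // Any failure (including parser configuration problems) counts as "no".
    static boolean isWellFormed(String fragment) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(
                new StringReader("<div>" + fragment + "</div>")));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    // Step (a), assumed: escape every '&' that does not start a predefined
    // entity or a numeric character reference.
    static String escapeStrayAmpersands(String s) {
        return s.replaceAll(
            "&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)", "&amp;");
    }

    // Step (b), assumed: drop characters that XML 1.0 does not allow.
    static String dropIllegalChars(String s) {
        return s.replaceAll(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]",
            "");
    }

    static String clean(String raw) {
        String text = escapeStrayAmpersands(raw);
        if (isWellFormed(text)) return text;              // attempt (a)
        text = dropIllegalChars(text);
        if (isWellFormed(text)) return text;              // attempt (b)
        String closed = text.replaceAll("(?i)<br\\s*>", "<br/>");
        if (isWellFormed(closed)) return closed;          // proposed extra step
        return text.replaceAll("<[^>]*>", "");            // attempt (c): strip tags
    }
}
```

Fed the NetMaps content quoted above, a chain like this would stop at the 
extra step and keep the img elements instead of stripping them.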
>> The behavior of JSword certainly could be improved. The typical 
>> problem that JSword encounters is a verse that is not well-formed 
>> XML. This can readily happen in modules. I have tracked the problem 
>> to the following:
>> - Modules that are built from IMP format and are not validated against 
>> the spec. For ThML and OSIS, the input should be both well-formed XML 
>> and valid against the schema. Further, for ThML it should contain only 
>> the SWORD-supported elements. For GBF, it should match the spec. When 
>> the input is in IMP format, there are no validation tools. Also, 
>> osis2mod transforms its input into what SWORD can handle. This step is 
>> sidestepped, which can cause problems.
>> - OSIS modules in BSP structure can build verses that are not 
>> well-formed. This causes problems for all front-ends. But for JSword 
>> it is worse than all others. The version of osis2mod in SVN fixes 
>> this by using milestoned versions of all
>> - The module building tools do not validate input. It is expected 
>> that the module creator does that first. In fact the module creators 
>> are relatively brain-dead and merely look for start and end of verses 
>> and pass everything in between as
>
> Appreciate the insight & experience. So this really is a bad ThML module.
>
>>
>> Here is how I think the fallback mechanism should be changed: add 
>> another step (before the tag stripping):
>> As each un-matched end element is encountered, an opening tag for 
>> that element should be prefixed.
>> As each un-matched begin tag is encountered, a closing tag for that 
>> element should be suffixed.
>> This is not trivial: essentially, a quasi-XML streaming parser needs 
>> to be written that uses a stack to know what is opened and what is 
>> closed. And the insertions need to be written in the correct order.
>> I say quasi because the XML standard requires a parser to fail on 
>> bad input.
> Agree, malformed XML is usually a fatal error which generally kills 
> the parsing.
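
A rough sketch of such a stack-based repair pass, under heavy assumptions: 
TagBalancer is a made-up name, tags are recognised with a regular 
expression (so attribute values containing '<' or '>' will confuse it), 
and comments, CDATA and processing instructions are passed through as 
plain text.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagBalancer {
    // group 1: "/" for an end tag, group 2: element name, group 3: the rest
    private static final Pattern TAG =
        Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9:._-]*)([^<>]*)>");

    static String balance(String fragment) {
        Deque<String> open = new ArrayDeque<>();    // currently open elements
        StringBuilder prefix = new StringBuilder(); // start tags added in front
        StringBuilder body = new StringBuilder();
        Matcher m = TAG.matcher(fragment);
        int last = 0;
        while (m.find()) {
            body.append(fragment, last, m.start()); // text before this tag
            last = m.end();
            String name = m.group(2);
            boolean isEnd = !m.group(1).isEmpty();
            boolean selfClosed = m.group(3).endsWith("/");
            if (!isEnd) {
                body.append(m.group());
                if (!selfClosed) open.push(name);
            } else if (open.contains(name)) {
                // Close anything opened after the matching start tag.
                while (!open.peek().equals(name)) closeTop(open, body);
                open.pop();
                body.append(m.group());
            } else {
                // Un-matched end tag: close what is open, then add the
                // missing start tag in front of everything emitted so far.
                while (!open.isEmpty()) closeTop(open, body);
                prefix.insert(0, "<" + name + ">");
                body.append(m.group());
            }
        }
        body.append(fragment, last, fragment.length());
        // Un-matched start tags: suffix their end tags, innermost first.
        while (!open.isEmpty()) closeTop(open, body);
        return prefix.append(body).toString();
    }

    private static void closeTop(Deque<String> open, StringBuilder body) {
        body.append("</").append(open.pop()).append('>');
    }
}
```

Note that a bare <br> would be "repaired" here by nesting the rest of the 
text inside it, so empty elements are better self-closed before a pass 
like this runs.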
>
> I suggested replacing "<br>" with "<br/>" because whenever you try to 
> parse say HTML as XML, the <br> tag is the primary cause of failure. 
> Most of the other tags are closed, even in simple HTML and once this 
> "fix" is made, parsing succeeds most of the time. I thought we could 
> achieve a similar quick fix by using this approach only because it's 
> cheap. If it fails, we do move on to stripping all tags out anyway.
>
> My experience however, is limited to parsing HTML as XML, and I have 
> *no experience* with the actual content of Bible modules. So if they 
> are prone to more failures from malformed XML where other tags are 
> involved, then just fixing <br> does not make sense.
>
> Given the option of either making the parser more "accommodating" or 
> insisting on well-formed input, I will choose the latter and agree 
> with your bottom line below - let's get Karl to fix the module :)
Is <br> the only element in HTML defined to have no content? If we 
have a complete list, I'd be happy for your suggested change to be added.
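
For reference, HTML 4.01 declares these elements EMPTY: area, base, 
basefont, br, col, frame, hr, img, input, isindex, link, meta and param. 
A sketch of the suggested change extended to that list might look like 
the following; VoidElementCloser is a hypothetical name, not an existing 
JSword class, and the regular expression assumes attribute values contain 
no '<' or '>'.

```java
import java.util.regex.Pattern;

final class VoidElementCloser {
    // Matches a start tag of an HTML 4.01 EMPTY element that is not already
    // self-closed. group 1: element name, group 2: its attributes (if any).
    private static final Pattern UNCLOSED_VOID = Pattern.compile(
        "<(area|base|basefont|br|col|frame|hr|img|input|isindex|link|meta|param)"
            + "\\b([^<>]*?)\\s*(?<!/)>",
        Pattern.CASE_INSENSITIVE);

    // Rewrites <br> as <br/>, <img src="x"> as <img src="x"/>, and so on.
    // Tags that already end in "/>" are left alone.
    static String selfClose(String html) {
        return UNCLOSED_VOID.matcher(html).replaceAll("<$1$2/>");
    }
}
```

This only helps with elements that can never have content; a module with 
other unclosed tags would still fall through to tag stripping.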


>
>
>>
>> One of the reasons I advocate using an XML parser for our module 
>> creator is that it would not allow input that is not well-formed.
>
> Agreed.
>
>>
>> The other thing would be to change the ThML filter to not use an XML 
>> parser. This too is not trivial.
>
> Sticking with XML seems to be the way to go, especially since that is 
> what ThML is supposed to contain. Does JSword already use another 
> parsing mechanism for some other source formats?
Yes. For plain text, new lines are replaced with <lb/> in the 
transformation to OSIS. This is a simple substitution. GBF looks a lot 
like XML but only superficially. We have a custom parser for that.

>
>>
>> Bottom Line: The definition of ThML is that it is not a superset of 
>> html but of xhtml. I don't think we should handle invalid ThML but 
>> only valid ThML.
>>
>> Karl is very responsive to fixing problems in his modules. (Yeah!!!) 
>> I think that Karl should fix his module to be valid ThML.
>>
> I'll make a post to sword-devel about this. Unless he's already 
> listening here too ;)
>
> Brian.

Brian,
If you can work up a patch for any of this it would be appreciated.

In Him,
    DM



