Sunday, October 14, 2007

So, I'm looking for a bit of help. It's not a particularly hard task; it'll just take me a while, and I need to go do some study. Basically, I need a method which parses XML-like data and adds each element name and its contents to a Dictionary. It is fairly trivial, but I don't like the idea of having to think through all the string mangling required ;) The first person to get me a method which works gets 10 brownie points.

You need to be able to handle something like this:
http://monoport.com/5078 (sorry for pasting there, but blogger borks on the XML tags)

Thing is, this is, technically speaking, invalid XML. The tag-like text inside the data should be escaped, and so should the '&' in ObjectFileName. Unfortunately, I can't rely on the source of this XML being fixed any time soon, so I need to be able to hand-parse it. I've outlined a procedure at the end which should handle most of the mess that will be thrown at it. The only thing I'll add is that the manual parser doesn't have to cope with every eventuality. If something goes wrong (for example, an artist gives themselves a tag-like name which breaks the parsing), then don't worry, just abort. This is a last-ditch effort to parse. It should succeed 99% of the time if you follow the procedure below.

Here's some pseudo code:
while (currentPosition < text.Length)
{
    int nextTag = text.IndexOf('<', currentPosition);
    int tagEnd = text.IndexOf('>', nextTag);
    string elementName = text.Substring(nextTag + 1, tagEnd - nextTag - 1); // read the element name

    currentPosition = tagEnd + 1; // skip the characters I've just 'parsed'
    string closingTag = "</" + elementName + ">";
    string data = text.Substring(currentPosition, text.IndexOf(closingTag, currentPosition) - currentPosition); // The contents are between currentPosition and the end tag.
    currentPosition += data.Length;
    currentPosition += closingTag.Length;

    if (nextCharacter != '<') AbortParse(); // If the next character is not '<', then something has gone wrong, so give up.
    myDictionary.Add(elementName, data);
}
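For anyone who wants something compilable, the procedure above could be sketched in C# roughly like this. It is only a sketch under a few assumptions: the input is a flat run of `<Name>contents</Name>` elements with nothing but whitespace between them, a repeated name overwrites the earlier value, and "abort" means returning null. The names `LooseXmlParser` and `Parse` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

static class LooseXmlParser
{
    // Hypothetical helper: returns null as soon as the input stops looking
    // like a flat sequence of <Name>contents</Name> elements ("just abort").
    public static Dictionary<string, string> Parse(string text)
    {
        var result = new Dictionary<string, string>();
        int pos = 0;
        while (true)
        {
            // Skip whitespace between elements (assumption: nothing else sits there).
            while (pos < text.Length && char.IsWhiteSpace(text[pos])) pos++;
            if (pos == text.Length) return result;   // consumed everything
            if (text[pos] != '<') return null;       // next character is not '<': give up

            int nameEnd = text.IndexOf('>', pos);
            if (nameEnd < 0) return null;
            string name = text.Substring(pos + 1, nameEnd - pos - 1);

            // The contents run from just after the opening tag to the closing tag.
            string closingTag = "</" + name + ">";
            int dataStart = nameEnd + 1;
            int dataEnd = text.IndexOf(closingTag, dataStart, StringComparison.Ordinal);
            if (dataEnd < 0) return null;

            // No unescaping is attempted, so a raw '&' is kept as-is.
            result[name] = text.Substring(dataStart, dataEnd - dataStart);
            pos = dataEnd + closingTag.Length;
        }
    }
}
```

Because the contents are taken verbatim up to the matching closing tag, an unescaped '&' passes straight through, and anything that doesn't fit the pattern makes the whole parse return null rather than guess.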

11 comments:

Anonymous said...

Out of curiosity, what was wrong with the regex I gave you on #mono last friday?

*lt;\s(\S+?)[^>]*?>(.*?)</\s*\1\s*>

Anonymous said...

Er:

<\s(\S+?)[^>]*?>(.*?)</\s*\1\s*>

Alan said...

That doesn't seem to match the XML, it gets no results ;)
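
One possible reason, judging only from the pattern as posted: the `\s` right after `<` has no quantifier, so it demands exactly one whitespace character before the element name. The sketch below compares the posted pattern with a variant that uses `\s*` instead. The variant, the `RegexCheck` class and the sample input in the usage note are assumptions for illustration, not something taken from the #mono discussion.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexCheck
{
    // The pattern exactly as posted in the comment above.
    public const string Posted = @"<\s(\S+?)[^>]*?>(.*?)</\s*\1\s*>";

    // Assumed variant: whitespace after '<' is optional.
    public const string Relaxed = @"<\s*(\S+?)[^>]*?>(.*?)</\s*\1\s*>";

    // Counts matches; Singleline lets '.' span line breaks inside the contents.
    public static int CountMatches(string pattern, string input)
    {
        return Regex.Matches(input, pattern, RegexOptions.Singleline).Count;
    }
}
```

On a made-up input such as `<Artist>M83</Artist>`, `CountMatches(RegexCheck.Posted, ...)` finds nothing while the relaxed variant finds the element, with the name in group 1 and the contents in group 2.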

Anonymous said...

SgmlReader is an XmlReader implementation that is very forgiving.
http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC

Alan said...

I don't want to drag in another dependency for something as trivial as this ;) To be honest, it needs to be fixed at the source, but until it is, a simple handwritten parser would be perfect.

Anonymous said...

BeautifulSoup (in Python) seems to cope too. I just tried it under IronPython and it works fine for that pasted snippet.

It treats the artist as an unterminated tag and therefore inserts a closing tag for it.

Cetin Sert said...

For a super naive solution I've come up with and tested on your sample, see: http://tenkatext.sourceforge.net/stuff/snxp.cs

1) It is based on the assumption that there will be no 'nested elements' (element in element) in the input xml. That was the case in your sample file.

2) XML entity resolution is not implemented. '&amp;' also remains '&amp;'.

Best Regards,
Cetin Sert

Alan said...

Cetin: It works a charm.

I've pushed that block of code into the codebase as my last-ditch effort to decode the XML. So far, it hasn't failed.

No entity unquoting has to be done: the quoting was a mistake from a copy/paste into blogger, and there will never be nested tags (unless some ****** of an artist creates a song with an XML-tag-like name).
So, all in all, this seems like the winning solution. You get 10 brownie points.

Cetin Sert said...

I'm so happy with my 10 brownie points *^o^* thanx

What I actually meant by 'no xml entity resolution is done' was that an XML character entity such as &amp; remains as is, whereas it may be desirable to convert it to the plain single character it stands for, like &.

That's why my first solution leaves you with something like:

objectfilename=M83 - 04 - Dead Cities, Red Seas &amp; Lost Ghosts.mp3

instead of the more desirable:

objectfilename=M83 - 04 - Dead Cities, Red Seas & Lost Ghosts.mp3

If the latter is indeed what you want, then there is already a method in which you can provide the proper entity2character conversion logic. I left that part unimplemented in my first post because I did not know how many and which character entities were present in XML.

I just checked Wikipedia: http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entities_in_XML and slightly improved my suggestion so that it now handles the above-mentioned case properly:

http://tenkatext.sourceforge.net/stuff/snxp2.cs
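
As a rough illustration of what such conversion logic might look like (this is not the contents of snxp2.cs, and the `EntityDecoder` name is made up), here is a sketch that handles only the five predefined XML entities and ignores numeric references like `&#38;`:

```csharp
using System;

static class EntityDecoder
{
    // Decodes only the five predefined XML entities; anything else is left alone.
    public static string Decode(string s)
    {
        return s.Replace("&lt;", "<")
                .Replace("&gt;", ">")
                .Replace("&quot;", "\"")
                .Replace("&apos;", "'")
                // &amp; goes last so that "&amp;lt;" ends up as "&lt;", not "<".
                .Replace("&amp;", "&");
    }
}
```

With this, `EntityDecoder.Decode("Dead Cities, Red Seas &amp; Lost Ghosts.mp3")` gives the file name with a plain `&` in it.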

I do not expect to get any additional brownie points with this but just want to make sure that my 10 points are well-earned...

Best Regards,
Cetin

Alan said...

What I was saying was that the 'XML' I'm getting has the '&' symbol (which is one reason why that block is invalid ;) ). I know it showed up in the pastebin as '&'. That happened when I copied it out of blogger into pastebin. So, as it stands, it works 100%.

Alan said...

Damnit. That should read: I know it showed up in pastebin as '& amp;', but that quoting was done by blogger.
