pellelil Posted March 11, 2021 Report Posted March 11, 2021 Hi Pete, I have an issue where I cannot read the runways.xml generated by MkRwy version 5.11 as the generated XML contains an illegal char. It occurs for the airport EDLP. When loading the xml into Notepad++ the name of the city contains (shown in reversed colors, as its a "special" char) "xF6" where it below is show as the char "ö" ... <ICAO id="EDLP"> <ICAOName>Paderborn-Lippstadt</ICAOName> <Country>Germany</Country> <State>North Rhine-Westphalia</State> <City>B ören</City> <File>G:\FS\2020Packages\Official\OneStore\fs-base\scenery\0601\APX50130.bgl</File> <SceneryName>fs-base scenery 0601</SceneryName> ...
Pete Dowson Posted March 12, 2021 Report Posted March 12, 2021 Hi Pelle What should it be? What would you suggest MakeRwys do about it? If Notepad++ can read it, why can’t your program? Are XML files limited to ASCII ? For string names MakeRwys simply provides what is in the BGL. Do you want them all scanned and non ASCII replaced by . or something? Pete
pellelil Posted March 12, 2021 Author Report Posted March 12, 2021 Hi Pete I am guessing you are forming your own XML instead of using a framework/library to do it, as a framework/library would normally take care of encoding/escaping the values. There are many values that cannot be written directly into XML fields and need to be encoded/escaped, e.g. an ampersand "&" should be encoded as "&amp;" and less-than "<" should be encoded as "&lt;" (which is why the ampersand on its own needs to be encoded).
pellelil Posted March 12, 2021 Author Report Posted March 12, 2021 In stead of "proper" encoding/escaping the content of the field-values, you could consider simply removing the "diacritics" (accents and so on) so an "ô" is simply changed into an "o". Here some C# code that does exactly that: EDIT: Sorry ... bad idea
pellelil Posted March 12, 2021 Author Report Posted March 12, 2021 Just an idea? Since Notepad++ did not show the char correctly (reversed "xF6" instead of "ö"), perhaps it's just a matter of a missing/incorrect BOM (Byte Order Mark) at the beginning of the file? EDIT: I'm leaning towards the BOM. Just tried entering "ö" into a text-field and saved it to XML. The letter is not escaped in any way, and it shows up just fine in Notepad++ (so BOM and/or the actual char encoding): <AirlineInfo ICAO="MOA" IATA="" Name="My öwn airline" ...>
Pete Dowson Posted March 12, 2021 Report Posted March 12, 2021 2 hours ago, pellelil said: Since Notepad++ did not show the char correctly (reversed "xF6" instead of "ö"), perhaps it's just a matter of a missing/incorrect BOM (Byte Order Mark) at the beginning of the file? EDIT: I'm leaning towards the BOM. Just tried entering "ö" into a text-field and saved it to XML. The letter is not escaped in any way, and it shows up just fine in Notepad++ (so BOM and/or the actual char encoding): Sorry, I don't understand. What change do you actually want me to do? What's a BOM and where is it? I know nothing about XML I'm afraid. I just created it as a text file with the keys and data folks asked for. Pete
pellelil Posted March 12, 2021 Author Report Posted March 12, 2021 Back in the days with ASCII things were easy (as long as you didn't have to use special characters). Now there are multiple ways to encode text-files (ASCII, UTF-7, -8, -32 and what not). A BOM (Byte Order Mark) is a 2, 3, or 4 byte binary code at the beginning of a text-file that can be used by the programs reading those files (to know how the characters within that file were encoded). So when Notepad++ shows this character wrong, it's a result of it not knowing the correct encoding of this character, and that is also why the .Net framework (that I am using to read/parse the XML) is complaining about illegal characters. Do you know how the XML-files you generate are encoded? I will later today try to do a test where I plan to add the BOM for UTF-8 (0xEF, 0xBB, 0xBF) at the front of the file and then see how it looks in Notepad++ and/or if it can be loaded by the .Net framework without it complaining - I'll let you know if/what I see.
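For anyone following along, here is a rough sketch in C of how a reader could check for the BOM described above. The function name `detect_bom` and the returned labels are made up for illustration; only the byte sequences themselves are the standard ones for UTF-8, UTF-16 and UTF-32.

```c
#include <stddef.h>
#include <string.h>

/* Standard BOM byte sequences. */
static const unsigned char BOM_UTF32_LE[] = {0xFF, 0xFE, 0x00, 0x00};
static const unsigned char BOM_UTF32_BE[] = {0x00, 0x00, 0xFE, 0xFF};
static const unsigned char BOM_UTF8[]     = {0xEF, 0xBB, 0xBF};
static const unsigned char BOM_UTF16_LE[] = {0xFF, 0xFE};
static const unsigned char BOM_UTF16_BE[] = {0xFE, 0xFF};

/* Returns a label for the BOM found at the start of buf, or "none".
   Hypothetical helper, not part of MakeRwys or any library. */
const char *detect_bom(const unsigned char *buf, size_t len)
{
    /* Test the 4-byte UTF-32 marks first: the UTF-32 LE mark starts
       with the same two bytes as the UTF-16 LE mark. */
    if (len >= 4 && memcmp(buf, BOM_UTF32_LE, 4) == 0) return "UTF-32 LE";
    if (len >= 4 && memcmp(buf, BOM_UTF32_BE, 4) == 0) return "UTF-32 BE";
    if (len >= 3 && memcmp(buf, BOM_UTF8, 3) == 0)     return "UTF-8";
    if (len >= 2 && memcmp(buf, BOM_UTF16_LE, 2) == 0) return "UTF-16 LE";
    if (len >= 2 && memcmp(buf, BOM_UTF16_BE, 2) == 0) return "UTF-16 BE";
    return "none";
}
```

A file written with plain fprintf calls would normally come back as "none", since nothing writes those leading bytes unless the program does so explicitly.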
Pete Dowson Posted March 12, 2021 Report Posted March 12, 2021 11 minutes ago, pellelil said: Do you know how the XML-files you generate are encoded? I just write them out as text files. No encoding. Should I change the header (first line)? Currently it's just <?xml version="1.0"?> Each character is one byte as far as I know. At least all those I generate are. The strings are just copied from the BGL as I said. Pete
pellelil Posted March 12, 2021 Author Report Posted March 12, 2021 Some kind of encoding have to/will take place when you have special characters in the file. Text-files are seldom ASCII these days, so special chars gets encoded into 2, 3 or 4 byte values, which is just fine as long as the program generating the file and the program reading the file agree on the encoding in use (why BOM is a thing). I fully understand that you are only using the value you get from the BGL, and in this case I only saw this issue with Paderborn so I am guessing its the Aerosoft airport that either have a "strange" city-name or that you for some reason see it as such (don't know if its possible for them to generate/write the BGL in a way, where MkRwy have trouble reading/decoding the BGL). I don't think changing the XML header will change any thing. I lean more towards BOM or the encoding used when generating the XML-file. But I will see if I can figure something out, trying to see if anything changes by adding a BOM
Pete Dowson Posted March 12, 2021 Report Posted March 12, 2021 1 hour ago, pellelil said: I don't think changing the XML header will change any thing. I lean more towards BOM or the encoding used when generating the XML-file. But I will see if I can figure something out, trying to see if anything changes by adding a BOM So a "BOM" isn't something in the header? 1 hour ago, pellelil said: Some kind of encoding have to/will take place when you have special characters in the file. Text-files are seldom ASCII these days, so special chars gets encoded into 2, 3 or 4 byte values Well when I look at text files in a hex editor (in my case either NotePad++ or more usually UEDIT), ones I created by simple "fprintf" function calls and the like, they always seem to have a one-byte to one character relationship to the text. The editor just puts dots in the positions for those it doesn't know. Same when debugging in Visual Studio, but i suppose all this depends on VS settings? I think UEDIT allows yo to change to other encodings but I leave it at default, whatever that is. Anyway, let me know what you think is needed. i've no idea where to put a bom! (Hope these exchanges aren't picked out as suspicious by the security services! 😉 Pete
Luke Kolin Posted March 13, 2021 Report Posted March 13, 2021 To paraphrase Dorothy, "Toto, I don't think we're in ASCII any more." You're likely to see that 1:1 byte/character mapping for most ASCII characters, but once you get out of that world all bets are off. I'd check the SDK to see what character encoding they are using in the BGLs - I suspect it is UTF-8 but you really need to be sure. If it is, you can probably just do the standard copy of bytes but I expect there are a number of Visual C constructs that treat them as Unicode strings rather than ASCII byte arrays. From there, instead of a BOM I would just add the encoding to the XML header. A parser should probably ignore the BOM and focus on the declaration; that's why it's there. I'll also echo pellelil - you really don't want to be rolling your own XML (or JSON or other structured text format). Cheers! Luke
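As a small illustration of adding the encoding to the XML header: the declaration can carry an encoding attribute next to the version. The sketch below is only an assumption about how that line might be produced; `write_xml_declaration` is a made-up helper, and the encoding name passed in has to match whatever the bytes in the file really are, which nobody in this thread has confirmed yet.

```c
#include <stdio.h>

/* Formats an XML declaration that names the file's encoding.
   Hypothetical helper; returns what snprintf returns. */
int write_xml_declaration(char *out, size_t size, const char *encoding)
{
    return snprintf(out, size,
                    "<?xml version=\"1.0\" encoding=\"%s\"?>\n", encoding);
}
```

Called with "windows-1252", for example, it yields `<?xml version="1.0" encoding="windows-1252"?>` as the first line. The declaration only labels the bytes; it does not convert them.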
pellelil Posted March 13, 2021 Author Report Posted March 13, 2021 Yes, the BOM is just the very first 2, 3 or 4 bytes in the file before the "text-content". But I've tried adding it and it does not fix it. I should have known - as the rest of the file is not UTF-8 encoded (at least not these two chars causing problems), so naturally it's a bad idea to add a BOM saying "here is some UTF-8 encoded text" while the text is not actually UTF-8. The "correct way" to fix it, as Luke writes, would be to find out how text is encoded in BGLs (and I would guess UTF-8 as well) and then use a library/framework to generate the XML, as it takes care of encoding the chars and escaping those that need escaping. But I am guessing that will NOT be a quick fix. No doubt this issue is caused by some weird city-name in a BGL. I see it for Paderborn only, and I am guessing it's the free Aerosoft version, so I raised the question on their forum if something strange is going on with this city name, but so far no replies. I tried analyzing the generated runways.xml and there are many chars with a byte value above 0xA0 that work just fine. In this city name I see two values (shown reversed in Notepad++) "xA1xF6" where the "0xF6" is the only value at/above 0xF0 in the file. So I tried removing only 0xF6 and leaving "0xA1" in the file, but both Notepad++ and the XML parser in .Net still complained. So for a quick fix I would scan the text line and see if it contains bytes at- or greater than 0xF0, then remove this and any prefixing/suffixing chars greater than 0X7F. It's not pretty, it's not perfect and if anything more a hack ... but at least it would allow the .Net XML parser to do its work and Notepad++ would also stop complaining.
Pete Dowson Posted March 14, 2021 Report Posted March 14, 2021 15 hours ago, pellelil said: So for a quick fix I would scan the text line and see if it contains bytes at- or greater than 0xF0, then remove this and any prefixing/suffixing chars greater than 0X7F. So, to make sure I understand this, only remove 0x7F - 0xEF if next to >=0xF0, then remove all >=0xF0. Shouldn't I replace at least the >=0xF0 with another character, like '.' or '?', or even something more obviously out of place like '#'? And this is for any true text content, of course excepting ICAO airport and frequency IDs. Right? Pete
pellelil Posted March 14, 2021 Author Report Posted March 14, 2021 I know it's a hack. I would first look if the string contains any char/byte values equal or greater than 0xF0. If it doesn't I would use the string as it is. If however it does contain a char/byte equal or greater than 0xF0 I would remove (or replace) this char, and if there are any pre-/suffixing chars greater than 0x7F I would remove them as well. If a string contains: "0xA1, 0x12, 0x34, 0xCD, 0xF0, 0xBC, 0x56" I would remove (or replace) "0xCD, 0xF0, 0xBC" so the resulting string would be "0xA1, 0x12, 0x34, 0x56". It's probably better to just replace the chars with a dummy-char (e.g. "*") to indicate that something should have been there, but was removed.
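A minimal sketch of that hack in C, for illustration only. It assumes "pre-/suffixing" means the whole unbroken run of bytes above 0x7F around the offending byte, and it uses '#' as the placeholder; the function name `mask_suspect_runs` is invented here and is not from MakeRwys.

```c
#include <stddef.h>

/* Replaces, in place, every unbroken run of bytes above 0x7F that
   contains at least one byte >= 0xF0 with '#' characters.
   Runs of high bytes without such a byte are left untouched.
   Returns the number of bytes replaced. */
size_t mask_suspect_runs(unsigned char *s, size_t len)
{
    size_t replaced = 0;
    size_t i = 0;

    while (i < len) {
        if (s[i] <= 0x7F) {          /* plain ASCII: nothing to do */
            i++;
            continue;
        }
        size_t start = i;            /* first byte of a high-byte run */
        int suspect = 0;
        while (i < len && s[i] > 0x7F) {
            if (s[i] >= 0xF0)
                suspect = 1;         /* run contains an offending byte */
            i++;
        }
        if (suspect) {
            for (size_t j = start; j < i; j++)
                s[j] = '#';
            replaced += i - start;
        }
    }
    return replaced;
}
```

With the bytes from the example above, 0xA1 at the start stays as it is because the 0x12 after it ends its run, while 0xCD, 0xF0, 0xBC each become '#'. In the Paderborn city name the 0xA1 sits directly next to 0xF6, so both would be replaced.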
Pete Dowson Posted March 14, 2021 Report Posted March 14, 2021 2 hours ago, pellelil said: Its probably better to just replace the chars with a dummy-char (e.g. "*") to indicate that something should have been there, but was removed. '*' is a more normally used character. Wouldn't perhaps '#' be better as I suggested? Pete
pellelil Posted March 14, 2021 Author Report Posted March 14, 2021 I don't know which of the two are more likely to appear natural, but "#" is fine be me
Luke Kolin Posted March 14, 2021 Report Posted March 14, 2021 4 hours ago, pellelil said: I know it's a hack. I would first look if the string contains any char/byte values equal or greater than 0xF0. If it doesn't I would use the string as it is. If however it does contain a char/byte equal or greater than 0xF0 I would remove (or replace) this char, and if there are any pre-/suffixing chars greater than 0x7F I would remove them as well. If a string contains: "0xA1, 0x12, 0x34, 0xCD, 0xF0, 0xBC, 0x56" I would remove (or replace) "0xCD, 0xF0, 0xBC" so the resulting string would be "0xA1, 0x12, 0x34, 0x56". It's probably better to just replace the chars with a dummy-char (e.g. "*") to indicate that something should have been there, but was removed. Pete doesn't realize it yet, but he's going to go insane - or in a few years, John will. 😄 These are WCHAR or wchar_t typedefs, not byte arrays or char pointers. I'd work with them as such if possible. Cheers Luke
Pete Dowson Posted March 14, 2021 Report Posted March 14, 2021 1 hour ago, Luke Kolin said: These are WCHAR or wchar_t typedefs, not byte arrays or char pointers. I'd work with them as such if possible. Sorry, you've lost me. What are you suggesting I actually do, other than go insane? MakeRunways is a freeware program which i hope not to maintain much further (or even any further). John says he might help me publish the source in GitHub (I know nothing about GitHub), but the source has become so mangled over the years since FS98 days that I'd be worried about questions arriving to try to explain things I wouldn't remember. I'm afraid the comments (or lack of them, mostly) won't help. Pete
Pete Dowson Posted March 14, 2021 Report Posted March 14, 2021 I've just checked the source. I "convert" the string fields using this own-coded function (I don't remember actually doing it though -- must have been an explicit request from a while back). What would you suggest I do instead?

    /******************************************************************************
        StringXML
    ******************************************************************************/
    char *StringXML(char *pszTo, char *pszFrom)
    {
        char *pszNow = pszTo;

        while (*pszFrom)
        {
            if (*pszFrom == '&')
            {   strcpy(pszNow, "&amp;");
                pszNow += 5;
            }
            else if (*pszFrom == 0x22)
            {   strcpy(pszNow, "&quot;");
                pszNow += 6;
            }
            else if (*pszFrom == 0x27)
            {   strcpy(pszNow, "&apos;");
                pszNow += 6;
            }
            else if (*pszFrom == '<')
            {   strcpy(pszNow, "&lt;");
                pszNow += 4;
            }
            else if (*pszFrom == '>')
            {   strcpy(pszNow, "&gt;");
                pszNow += 4;
            }
            else
            {   *pszNow = *pszFrom;
                pszNow++;
            }
            pszFrom++;
        }

        *pszNow = 0;
        return pszTo;
    }

Pete
Luke Kolin Posted March 14, 2021 Report Posted March 14, 2021 3 hours ago, Pete Dowson said: What would you suggest I do instead? Stop mucking about with char*. Seriously.... these aren't ASCII strings and probably everything is a wchar_t or LPWSTR/LPCWSTR (which is a pointer to a wchar_t). What is the character encoding in the BGLs? Given how P3D has moved to Unicode I suspect that they are UTF-8 but you'll need to ensure that you are reading them as Unicode strings and then doing your translations on wchar_t variables, not chars. Cheers Luke
pellelil Posted March 15, 2021 Author Report Posted March 15, 2021 Can't comment on the char/wchar side of things, but this should take care of escaping these 5 chars in XML. So I would not change this part. My suggested hack was regarding looking at multiple chars at the same time (e.g. finding a char equal-to/greater-than 0xF0 and if found look at the previous/next char), so I don't think it can be handled by this method only looking at a single char at a time. Anyway what I suggested is a hack, so perhaps make it optional via a command-line argument.
Pete Dowson Posted March 15, 2021 Report Posted March 15, 2021 12 hours ago, Luke Kolin said: Stop mucking about with char*. Seriously.... these aren't ASCII strings and probably everything is a wchar_t or LPWSTR/LPCWSTR (which is a pointer to a wchar_t). What is the character encoding in the BGLs? Given how P3D has moved to Unicode I suspect that they are UTF-8 but you'll need to ensure that you are reading them as Unicode strings and then doing your translations on wchar_t variables, not chars. Sorry, I'm not learning any new programming methods at my age. I am retired, and this isn't even a hobby now, it's a favour. If you want to re-program it all I'll send you the source. 2 hours ago, pellelil said: My suggested hack was regarding looking at multiple chars at the same time (e.g. finding a char equal-to/greater-than 0xF0 and if found look at the previous/next char), so I don't think it can be handled by this method only looking at a single char at a time. I can work it into the same loop, no worry. Better than multiple loops over the same strings. 2 hours ago, pellelil said: Anyway what I suggested is a hack, so perhaps make it optional via a command-line argument. Thanks, but it shouldn't need to be optional. Pete
pellelil Posted March 15, 2021 Author Report Posted March 15, 2021 Pete regarding it being optional, I was just thinking that if this hack have some side-effects, people at least have a chance to disable it, or only those who noticed an issue can enable it. As of now it appears I'm the only one who have run into this issue, and currently I have added my own "pre-processor" to pre-process the XML-content before I hand it over to the .Net XML-parser.
Pete Dowson Posted March 15, 2021 Report Posted March 15, 2021 5 minutes ago, pellelil said: Pete regarding it being optional, I was just thinking that if this hack have some side-effects, people at least have a chance to disable it, or only those who noticed an issue can enable it. But I think safer from a support point of view to have it enabled and only disabled by option. However, really I can't think of any unwanted side effect it could have. It's only simple character removal/substitution. Pete
pellelil Posted March 15, 2021 Author Report Posted March 15, 2021 It appears to me the chars in the XML are encoded using Windows-1252. And if you look at the chart from the following link there are (should be) "valid" chars above 0xF0, so my suggested hack will fix the issue I am seeing, but it might result in other "valid" chars being replaced: https://en.wikipedia.org/wiki/Windows-1252 So I don't know why these two chars are being flagged as invalid, as there are many other "special chars" that seem to work just fine (e.g. some of the Danish letters seem to work fine). The two chars that are flagged as invalid for me are the 0xA1 followed by 0xF6. That is why I suggested the hack that I did. But to be honest I don't know at byte level how UTF-8 chars get encoded (the .Net framework takes care of this for me). I am guessing that 0xA1,0xF6 is a two-byte code for a single char, but when encoded into Windows-1252 they appear as two chars.
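On the byte-level question: in UTF-8, bytes 0x80 to 0xBF only ever appear as continuation bytes after a lead byte, and 0xF5 to 0xFF never appear at all, so 0xA1 followed by 0xF6 is not a valid UTF-8 sequence. If the bytes really are Windows-1252 (an assumption, not something confirmed in this thread), each one could instead be re-encoded as UTF-8. The sketch below shows the idea in C; `cp1252_to_utf8` is an invented name, and it deliberately skips the 0x80 to 0x9F range, where Windows-1252 needs a lookup table, by writing '#'.

```c
#include <stddef.h>

/* Re-encodes single-byte text as UTF-8.
   - 0x00..0x7F: copied unchanged (ASCII).
   - 0xA0..0xFF: Windows-1252 matches Latin-1 here, so the byte value
     is the code point and becomes a two-byte UTF-8 sequence.
   - 0x80..0x9F: needs a mapping table; written as '#' in this sketch.
   Writes a terminating 0. Returns the number of bytes written
   (excluding the terminator), or (size_t)-1 if out is too small. */
size_t cp1252_to_utf8(const unsigned char *in, size_t in_len,
                      unsigned char *out, size_t out_size)
{
    size_t o = 0;

    for (size_t i = 0; i < in_len; i++) {
        unsigned char b = in[i];
        if (b < 0xA0) {
            /* one output byte plus room for the terminator */
            if (o + 1 >= out_size) return (size_t)-1;
            out[o++] = (b < 0x80) ? b : (unsigned char)'#';
        } else {
            /* two output bytes plus room for the terminator */
            if (o + 2 >= out_size) return (size_t)-1;
            out[o++] = (unsigned char)(0xC0 | (b >> 6));
            out[o++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    if (o >= out_size) return (size_t)-1;
    out[o] = 0;
    return o;
}
```

Under that assumption the single byte 0xF6 ("ö") becomes the pair 0xC3, 0xB6, and 0xA1 becomes 0xC2, 0xA1. Whether 0xA1 belongs in the city name at all is a separate question about the BGL.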