missing escape \t
6 months 4 weeks ago #3786
by brian clark
missing escape \t - Post(3786) was created by brian clark
I did some testing with CouchDB, and it seems the tab escape is missing from your JSON Adapter. The following characters are reserved in JSON and must be properly escaped before they can be used in strings:
- Backspace to be replaced with \b
- Form feed to be replaced with \f
- Newline to be replaced with \n
- Carriage return to be replaced with \r
- Tab to be replaced with \t
- Double quote to be replaced with \"
- Backslash to be replaced with \\
I would check this. I have HTML containing all of the above and tested it with and without escaping; when all of the above are escaped, it works fine.
Ho hum, now I have to go back through 87 million HTML files and fix the JSON files.
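The escapes listed above can be sketched in code. The snippet below is a minimal illustration in Python, not the adapter's actual implementation; the helper name `escape_json_string` is made up for this example. It maps each listed character to its escape and checks the result with the standard `json` module. JSON also requires every other control character below U+0020 to be escaped, which the sketch writes as `\uXXXX`.

```python
import json

# Two-character escapes from the list above.
_ESCAPES = {
    "\b": "\\b",
    "\f": "\\f",
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
    '"': '\\"',
    "\\": "\\\\",
}


def escape_json_string(text: str) -> str:
    """Return text as a quoted JSON string literal (hypothetical helper)."""
    out = []
    for ch in text:
        if ch in _ESCAPES:
            out.append(_ESCAPES[ch])
        elif ord(ch) < 0x20:
            # Any other control character must be written as \uXXXX.
            out.append(f"\\u{ord(ch):04x}")
        else:
            out.append(ch)
    return '"' + "".join(out) + '"'


html = '<p class="x">\tTab, C:\\path\r\n</p>'
literal = escape_json_string(html)

# A compliant parser must give back exactly the original text.
assert json.loads(literal) == html
```

In practice `json.dumps(html)` produces an equivalent literal, so hand-written escaping is rarely needed when a JSON library is available.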
6 months 4 weeks ago #3787
by brian clark
Replied by brian clark on topic missing escape \t - Post(3787)
Another issue is that null is used instead of a plain old empty string:
[
{
"filename": "file 81\\original_https_knighz.cfd_.html",
"HTML": "<p>Coming Soon !</p>",
"Original Url": null
}
]
when it should be
{
"filename": "file 81\\original_https_knighz.cfd_.html",
"HTML": "<p>Coming Soon !</p>",
"Original Url": ""
}
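Where the importing system prefers empty strings, the null values can also be replaced after export. Note that `null` is itself valid JSON, so this is a preference of the importing side rather than a syntax fix. A small sketch in Python, with `nulls_to_empty` as a hypothetical helper name:

```python
import json


def nulls_to_empty(value):
    """Recursively replace JSON null (Python None) with an empty string."""
    if value is None:
        return ""
    if isinstance(value, list):
        return [nulls_to_empty(item) for item in value]
    if isinstance(value, dict):
        return {key: nulls_to_empty(item) for key, item in value.items()}
    return value


# Raw string so the JSON text keeps its own \\ escape untouched.
raw = r'[{"filename": "file 81\\original_https_knighz.cfd_.html", "HTML": "<p>Coming Soon !</p>", "Original Url": null}]'

records = nulls_to_empty(json.loads(raw))
print(json.dumps(records, indent=2))
```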
6 months 4 weeks ago #3788
by brian clark
Replied by brian clark on topic missing escape \t - Post(3788)
https://jsonlint.com/
Use that to validate, or add your own JSON validator with preview, to save people's sanity!
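For a quick local check without a website, something along these lines might do. It is only a sketch using Python's standard `json` module, and `check_json` is a hypothetical name; it reports where the first syntax error sits rather than offering a full preview.

```python
import json


def check_json(text: str) -> str:
    """Return 'valid' or a message pointing at the first syntax error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"invalid at line {err.lineno}, column {err.colno}: {err.msg}"
    return "valid"


print(check_json('{"HTML": "<p>ok</p>"}'))
# The Python literal below puts a real tab character inside the JSON string,
# which a strict parser rejects.
print(check_json('{"HTML": "a\tb"}'))
```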
6 months 4 weeks ago #3789
by brian clark
Replied by brian clark on topic missing escape \t - Post(3789)
Also, please remove or escape
[ ]
{ }
in any content. They are really messing with importing across lots of databases with JSON output. The HTML will have JavaScript content within the JSON string.
6 months 4 weeks ago #3790
by brian clark
Replied by brian clark on topic missing escape \t - Post(3790)
Recap:
original url: null
when it should be
original url: ""
[ ] and { } in content
\t tabs to be escaped
\ or any content containing it messes everything up for most import engines unless they correct it on their end, which a lot do not.
"HTML": "\n<!DOCTYPE HTML>\n<html lang=\"en\">\n\n\n <head> \n <!--\if IE\>
It won't pass validation because of the \> at the end, or \ followed by any other character.
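JSON allows a backslash inside a string only before one of `" \ / b f n r t`, or before `u` followed by four hex digits, which is why `\>` fails. One possible repair, sketched here in Python under the assumption that every stray backslash was meant literally, doubles any backslash that does not start a valid escape. The function name `fix_stray_backslashes` is hypothetical.

```python
import json
import re

# A backslash, optionally followed by something that makes it a valid escape.
_BACKSLASH = re.compile(r'\\(["\\/bfnrt]|u[0-9a-fA-F]{4})?')


def fix_stray_backslashes(raw_json: str) -> str:
    """Double every backslash that does not begin a valid JSON escape."""
    def repl(match):
        if match.group(1) is not None:
            return match.group(0)  # already a valid escape, keep as-is
        return "\\\\"              # lone backslash: make it a literal one
    return _BACKSLASH.sub(repl, raw_json)


broken = r'{"HTML": "\n<!--\if IE\>"}'
fixed = fix_stray_backslashes(broken)

# The repaired text parses, and the backslashes survive as literal characters.
assert json.loads(fixed)["HTML"] == "\n<!--\\if IE\\>"
```

This cannot recover ambiguous cases: a path such as `C:\temp` already looks like a valid `\t` escape and would still be read as a tab, so escaping at export time remains the safer fix.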
I have literally replaced all of the above just for it to be valid with CouchDB. MongoDB sucks with its 16 MB limits, but it does correct this on import.
CouchDB is better, but wow, is it fussy on import.
https://jsoneditoronline.org/#left=local.nuxaci&right=local.mipino
Try this validator; at least it highlights the issues. Try out the zip file I sent... your end needs to be far stricter for it to be truly valid JSON (when dealing with HTML data).
Cheers
Brian
6 months 4 weeks ago #3791
by brian clark
Replied by brian clark on topic missing escape \t - Post(3791)
Also, " in the HTML is not escaped for some odd reason. That messes with validation too.
6 months 4 weeks ago #3792
by brian clark
Replied by brian clark on topic missing escape \t - Post(3792)
[\[|\]|\{|\}|\\|\t]*
As you have it, the structure is correct; what is missing is full checking of the content.
I tested one file with this regex, and now the JSON is valid and imports into CouchDB and anything else just fine.
So I would correct this issue, as it grinds the workflow to a halt.
I will test further on the most bizarre files.
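For readers trying the pattern above: inside a character class the `|` is a literal pipe rather than alternation, so the class as written also matches `|`. Below is a sketch of how such a clean-up might look in Python. The choice of Python is an assumption, since the thread does not say which tool ran the regex, and `strip_problem_chars` is a made-up name.

```python
import re

# Same characters as the pattern in the post, minus the literal pipes.
STRIP = re.compile(r"[\[\]{}\\\t]")


def strip_problem_chars(html: str) -> str:
    """Remove brackets, braces, backslashes and tabs from the text."""
    return STRIP.sub("", html)


sample = "<script>var a = [1, 2];\tif (a) { run(); }</script>"
print(strip_problem_chars(sample))

# The pattern as posted also removes pipes, which may not be intended.
as_posted = re.compile(r"[\[|\]|\{|\}|\\|\t]*")
assert as_posted.sub("", "a|b") == "ab"
```

Removing brackets and braces alters any embedded JavaScript, so this only suits cases where the scripts do not need to work afterwards.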
6 months 4 weeks ago #3793
by brian clark
Replied by brian clark on topic missing escape \t - Post(3793)
The same goes for CSV or any other adapter.
After multiple regexes it all works now: CouchDB = 0 errors.
I thought these adapters escaped by default according to the standard for each format.
6 months 3 weeks ago #3794
by FlowHeater-Team
Replied by FlowHeater-Team on topic missing escape \t - Post(3794)
Hi Brian,
Thank you for your notification about escaping special characters. I've fixed the JSON Adapter in the latest Beta version. You can download the fixed version here: Download Beta Version
For the TextFile Adapter (CSV files) and some other Adapters, this escaping doesn't make sense. If you have problems with other Adapters, please open a new topic with detailed information. Thanks.
Please note: The JSON Adapter is still under development and only available as a Beta version.
Best wishes
Robert Stark
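As a rough illustration of why JSON-style escapes do not carry over to CSV: CSV conventionally protects special characters by quoting the field and doubling embedded quotes, not with backslash sequences. The sketch below uses Python's standard `csv` module and says nothing about how the TextFile Adapter is implemented.

```python
import csv
import io

row = ["file 81", '<p class="x">\tCell with tab,\ncomma and newline</p>']

buffer = io.StringIO()
# Default dialect: comma-delimited, fields quoted only when needed.
csv.writer(buffer).writerow(row)
text = buffer.getvalue()

# Reading it back returns the original fields unchanged.
parsed = next(csv.reader(io.StringIO(text, newline="")))
assert parsed == row
```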
6 months 3 weeks ago #3795
by brian clark
Replied by brian clark on topic missing escape \t - Post(3795)
In any normal data sense, yes, as most data is text and very basic. But this is raw HTML, the whole page. So a tick box to allow complex JavaScript, as well as anything out of the ordinary, would help.
I had to use regex to make sure it's all escaped, or at least the minimal oddball things like tabs within the HTML itself (they mess up CSV if it is tabular).
In my case \t is not just escaped but completely removed, as I have also compressed all the HTML to strip whitespace. It makes a massive difference over 500 million+ HTML pages. :0)
Anyway, I will report anything else I find about JSON.
Cheers
Brian