Accept non-characters when validating Unicode

Unicode Corrigendum #9 clarifies that the non-characters U+nFFFE (for n in the range 0 to 0x10), U+nFFFF (for n in the same range), and U+FDD0..U+FDEF are valid for interchange, and their presence does not make a string ill-formed. GLib 2.36 made the corresponding change in its definition of UTF-8 as used by g_utf8_validate() and similar functions.

Bug: https://bugs.freedesktop.org/show_bug.cgi?id=63072
Signed-off-by: Simon McVittie <simon.mcvittie@collabora.co.uk>
author: Simon McVittie <simon.mcvittie@collabora.co.uk> 2013-04-22 15:36:32 +0100
committer: Simon McVittie <simon.mcvittie@collabora.co.uk> 2013-04-22 15:36:32 +0100
commit: 6b2add5e70252c513f506f84cc386f47953df48d (patch)
tree: cb5390549936a81565de69ff5ce5039511a99db8 /test
parent: 540e5692e07d48fb41a4e977e0c9078fa19bd677 (diff)
1 files changed, 4 insertions, 2 deletions
diff --git a/test/syntax.c b/test/syntax.c
index 88db9638..e26b3643 100644
--- a/test/syntax.c
+++ b/test/syntax.c
@@ -178,12 +178,14 @@ const char * const invalid_single_signatures[] = {
 
 const char * const valid_strings[] = {
     "",
-    "\xc2\xa9",
+    "\xc2\xa9",       /* UTF-8 (c) symbol */
+    "\xef\xbf\xbe",   /* U+FFFE is reserved but Corrigendum 9 says it's OK */
     NULL
 };
 
 const char * const invalid_strings[] = {
-    "\xa9",
+    "\xa9",           /* Latin-1 (c) symbol */
+    "\xed\xa0\x80",   /* UTF-16 surrogates are not valid in UTF-8 */
     NULL
 };
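
The distinction the two new test strings exercise can be sketched in a few lines of C. This is only an illustration, not code from this repository or from GLib: the function names below are hypothetical, and the decoder assumes its input is already a structurally correct three-byte UTF-8 sequence. It shows that "\xef\xbf\xbe" decodes to the non-character U+FFFE, which remains acceptable, while "\xed\xa0\x80" decodes to the surrogate U+D800, which does not.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true for the 66 non-characters, i.e.
 * U+FDD0..U+FDEF plus U+nFFFE and U+nFFFF for n in 0..0x10. */
bool
codepoint_is_noncharacter (uint32_t cp)
{
  if (cp >= 0xFDD0 && cp <= 0xFDEF)
    return true;

  /* The low 16 bits are FFFE or FFFF, in any of the 17 planes. */
  return cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE;
}

/* Hypothetical helper: true for UTF-16 surrogate code points. */
bool
codepoint_is_surrogate (uint32_t cp)
{
  return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Hypothetical helper: after Corrigendum 9, every code point up to
 * U+10FFFF except the surrogates may appear in a well-formed string,
 * so non-characters are deliberately not rejected here. */
bool
codepoint_is_valid_for_interchange (uint32_t cp)
{
  return cp <= 0x10FFFF && !codepoint_is_surrogate (cp);
}

/* Hypothetical helper: decode one three-byte UTF-8 sequence.
 * Assumes s[0] is 1110xxxx and s[1], s[2] are 10xxxxxx; a real
 * validator would check this and reject overlong encodings. */
uint32_t
decode_three_byte_sequence (const unsigned char *s)
{
  return ((uint32_t) (s[0] & 0x0F) << 12) |
         ((uint32_t) (s[1] & 0x3F) << 6) |
         (uint32_t) (s[2] & 0x3F);
}
```

Under these assumptions, passing the result of decode_three_byte_sequence() to codepoint_is_valid_for_interchange() accepts the new entry in valid_strings and rejects the new entry in invalid_strings.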