Fix JSON problems with non-ASCII characters, by explicitly decoding test output.

Due to a design flaw in Python 2.x, if an 8-bit string containing a non-ASCII character is ever used in a context requiring a unicode string, the Python interpreter will raise an exception. This happens because Python lazily decodes 8-bit strings to unicode as needed, and when it does the decoding it assumes the 8-bit string is in ASCII format. Because of this lazy decoding behavior, this means that 8-bit strings containing non-ASCII characters are unsafe in any Python program that also uses unicode. Since Python 3.x doesn't have 8-bit strings (it has "byte arrays", which don't autoconvert to unicode strings), the most forward-compatible way to address this problem is to find the source of the unsafe 8-bit strings, and introduce an explicit decoding step that translates them to unicode strings and handles errors properly. This problem manifested in Piglit if the output of a test ever contained a non-ASCII character. Due to Python 2.x's lazy string decoding semantics, the exception wouldn't occur until attempting to serialize the test output as JSON. This patch introduces an explicit step, right after running a test, which decodes the test output from 8-bit strings into unicode. The decoding assumes UTF-8, and silently replaces invalid UTF-8 sequences with the unicode "replacement character" rather than raise an exception.
author: Paul Berry <stereotype441@gmail.com> 2011-08-08 08:47:45 -0700
committer: Paul Berry <stereotype441@gmail.com> 2011-08-11 13:09:30 -0700
commit: 43542f5a3b6cd8d08b430aef960fe04e61075e5b (patch)
tree: 2fb0e1b4dd14e3c1e2a9a64b460b7a178a51ff59
parent: da3126ce20538c0906226c62c089be4965aa2e41 (diff)
1 files changed, 17 insertions, 0 deletions
diff --git a/framework/exectest.py b/framework/exectest.py
index 91a674e0..e3327b10 100644
--- a/framework/exectest.py
+++ b/framework/exectest.py
@@ -55,6 +55,23 @@ class ExecTest(Test):
 				)
 			out, err = proc.communicate()
 
+			# proc.communicate() returns 8-bit strings, but we need
+			# unicode strings.  In Python 2.x, this is because we
+			# will eventually be serializing the strings as JSON,
+			# and the JSON library expects unicode.  In Python 3.x,
+			# this is because all string operations require
+			# unicode.  So translate the strings into unicode,
+			# assuming they are using UTF-8 encoding.
+			#
+			# If the subprocess output wasn't properly UTF-8
+			# encoded, we don't want to raise an exception, so
+			# translate the strings using 'replace' mode, which
+			# replaces erroneous charcters with the Unicode
+			# "replacement character" (a white question mark inside
+			# a black diamond).
+			out = out.decode('utf-8', errors='replace')
+			err = err.decode('utf-8', errors='replace')
+
 			results = TestResult()
 
 			out = self.interpretResult(out, results)
author	Paul Berry <stereotype441@gmail.com>	2011-08-08 08:47:45 -0700
committer	Paul Berry <stereotype441@gmail.com>	2011-08-11 13:09:30 -0700
commit	43542f5a3b6cd8d08b430aef960fe04e61075e5b (patch)
tree	2fb0e1b4dd14e3c1e2a9a64b460b7a178a51ff59
parent	da3126ce20538c0906226c62c089be4965aa2e41 (diff)