Character Encoding


PHP create invalid UTF-8 character

Call chr() with any value from 128 to 255 to create a single byte that is not valid UTF-8 on its own (values outside 0 to 255 wrap around modulo 256). This is useful for testing a UTF-8 validation script. The example below prints ‘UTF-8’ if the input is valid UTF-8; otherwise mb_detect_encoding() returns false and nothing is printed.

echo mb_detect_encoding(chr(128), 'UTF-8', true);
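For context, here is a minimal sketch of the kind of validation check such a byte could be used to test. It assumes the mbstring extension is loaded and relies on mb_check_encoding(); the helper name is_valid_utf8() and the sample inputs are made up for illustration.

```php
<?php
// Hypothetical helper (not from the original post): true when $bytes is well-formed UTF-8.
function is_valid_utf8(string $bytes): bool {
    return mb_check_encoding($bytes, 'UTF-8');
}

$samples = [
    'plain ASCII'        => 'abc',
    'lone byte 128'      => chr(128),              // continuation byte with no lead byte
    'two-byte sequence'  => chr(0xC3) . chr(0xA9), // UTF-8 encoding of U+00E9
    'truncated sequence' => chr(0xC3),             // lead byte with no continuation byte
];

foreach ($samples as $label => $bytes) {
    echo $label . ': ' . (is_valid_utf8($bytes) ? 'valid' : 'invalid') . "\n";
}
```

The truncated sequence shows that a byte above 127 is only invalid when the bytes around it do not complete a proper sequence.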

PHP detect encoding of multibyte characters

This script splits a string into individual characters and tests whether each one is valid UTF-8. To test against a different character set, change the encoding list passed to mb_detect_encoding().

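// PHP 7.4 and later ship a built-in mb_str_split(); redeclaring it is a fatal error,
// so define this fallback only when the function does not already exist.
if (!function_exists('mb_str_split')) {
	function mb_str_split( $string ) {
		// Split at every position between two characters, but not at the start or end.
		return preg_split('/(?<!^)(?!$)/u', $string );
	}
}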
 
$string = "更多學習";
 
$charlist = mb_str_split($string);
echo '<table>';
foreach($charlist as $char) {
	echo '<tr>';
	echo '<td>' . $char . '</td>';
	echo '<td>' . mb_detect_encoding($char, 'UTF-8', true) . '</td>';
	echo '</tr>';
}
echo '</table>';
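As a rough illustration of why the string has to be split into characters rather than bytes, the sketch below compares the byte count with the character count. It assumes the source file is saved as UTF-8 and uses preg_split() with the PREG_SPLIT_NO_EMPTY flag as an alternative splitting pattern; the function name utf8_chars() is hypothetical.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
function utf8_chars(string $string): array {
    // An empty pattern with the /u modifier matches between every character;
    // PREG_SPLIT_NO_EMPTY drops the empty pieces at the start and end.
    $parts = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false when the input is not valid UTF-8.
    return $parts === false ? [] : $parts;
}

$string = "更多學習";
$chars  = utf8_chars($string);

echo 'bytes: ' . strlen($string) . "\n";     // strlen() counts bytes
echo 'characters: ' . count($chars) . "\n";  // count() counts the split characters
foreach ($chars as $char) {
    echo $char . ' uses ' . strlen($char) . " byte(s)\n";
}
```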

PHP detect encoding of each character in a string

Here is a little script I wrote that checks each character in a string. If a character is not valid UTF-8, the script converts it to UTF-8 from each of the encodings listed below and prints every result. If the character appears as it should in one of the columns, then it fits that encoding. Note that a character may match more than one encoding, and that the script reads the string one byte at a time, so it will not work with multibyte characters.

$encodings = array("UTF-8", "UTF-16", "ASCII",
		"Windows-1250", "Windows-1251", "Windows-1252", "Windows-1253", "Windows-1254", "Windows-1255", "Windows-1256", "Windows-1257", "Windows-1258",
		"ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-10",
		"ISO-8859-11", "ISO-8859-12", "ISO-8859-13", "ISO-8859-14", "ISO-8859-15", "ISO-8859-16",
		"CP1256", "CP1250", "CP1252", 'CP437', 'CP737', 'CP850', 'CP852', 'CP855', 'CP857', 'CP858', 'CP860', 'CP861', 'CP862', 'CP863', 'CP865',
		'CP866', 'CP869', 'CP37', 'CP930', 'CP1047', 'MIK', 'ISCII', 'TSCII', 'VISCII', 'JIS X 0208', 'EUC-JP', 'GB 2312', 'GBK', 'Big5',
		'HKSCS', 'KS X 1001', 'EUC-KR', 'ISO-2022-KR', 'Mac OS Roman', 'KOI7', 'KOI8-U', 'KOI8-R', 'GB18030', 'GB2312 80'
);
 
$string = 'This is a test string';
echo '<table>';
$len = strlen($string);
for ($i = 0; $i < $len; $i++) {
	$encoding = mb_detect_encoding($string[$i], 'UTF-8', true);
	echo '<tr><td>' . $i . '</td><td>' . $string[$i] . '</td><td>' . $encoding . '</td>';
	if($encoding != 'UTF-8') {
		foreach ($encodings as $j) {
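			// iconv() returns false and raises a notice or warning for names it does not support
			$converted = @iconv($j, 'UTF-8', $string[$i]);
			echo '<td>' . ($converted === false ? '' : $converted) . '</td>';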
		}
	}
	echo '</tr>';
}
echo '</table>';
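To make the idea concrete, here is a small sketch that decodes one non-UTF-8 byte under a few candidate encodings. The byte 0xE9 and the three candidate names are chosen for illustration only, and the helper try_decode() is hypothetical. Which encoding names iconv() accepts depends on the iconv library PHP was built against, so a failed conversion is reported rather than assumed away.

```php
<?php
// Hypothetical helper: convert $bytes from $from to UTF-8, or return null on failure.
function try_decode(string $bytes, string $from): ?string {
    // iconv() returns false (and raises a notice or warning) for unknown
    // encoding names or unconvertible input; @ suppresses that message.
    $converted = @iconv($from, 'UTF-8', $bytes);
    return $converted === false ? null : $converted;
}

$byte = chr(0xE9); // a single byte that is not valid UTF-8 on its own
$candidates = ['ISO-8859-1', 'Windows-1251', 'ISO-8859-7'];

foreach ($candidates as $name) {
    $result = try_decode($byte, $name);
    echo $name . ': ' . ($result === null ? '(not supported or not convertible)' : $result) . "\n";
}
```

Each supported candidate typically prints a different character for the same byte, which is why the output has to be judged by eye.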

PHP character encodings array

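This isn’t an exhaustive list, but it covers the most common encodings. Not every name is necessarily accepted by iconv() or mbstring, since the supported names depend on the libraries PHP was built with. Let me know of any that should be added.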

$encodings = array("UTF-8", "UTF-16", "ASCII",
		"Windows-1250", "Windows-1251", "Windows-1252", "Windows-1253", "Windows-1254", "Windows-1255", "Windows-1256", "Windows-1257", "Windows-1258",
		"ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-10",
		"ISO-8859-11", "ISO-8859-12", "ISO-8859-13", "ISO-8859-14", "ISO-8859-15", "ISO-8859-16",
		"CP1256", "CP1250", "CP1252", 'CP437', 'CP737', 'CP850', 'CP852', 'CP855', 'CP857', 'CP858', 'CP860', 'CP861', 'CP862', 'CP863', 'CP865',
		'CP866', 'CP869', 'CP37', 'CP930', 'CP1047', 'MIK', 'ISCII', 'TSCII', 'VISCII', 'JIS X 0208', 'EUC-JP', 'GB 2312', 'GBK', 'Big5',
		'HKSCS', 'KS X 1001', 'EUC-KR', 'ISO-2022-KR', 'Mac OS Roman', 'KOI7', 'KOI8-U', 'KOI8-R', 'GB18030', 'GB2312 80'		
);
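One way to see which of these names a given PHP build understands is to compare them with what mbstring reports. The sketch below is an assumption-laden illustration: it needs the mbstring extension, it says nothing about iconv() support, and the helper mbstring_known_names() and the short sample list are hypothetical.

```php
<?php
// Hypothetical helper: every encoding name and alias mbstring reports,
// lowercased and stored as array keys for case-insensitive lookup.
function mbstring_known_names(): array {
    $known = [];
    foreach (mb_list_encodings() as $name) {
        $known[strtolower($name)] = true;
        $aliases = mb_encoding_aliases($name);
        if (is_array($aliases)) {
            foreach ($aliases as $alias) {
                $known[strtolower($alias)] = true;
            }
        }
    }
    return $known;
}

// A short sample of names taken from the array above.
$sample = ['UTF-8', 'Windows-1252', 'ISO-8859-1', 'KOI8-R', 'Mac OS Roman'];
$known  = mbstring_known_names();

foreach ($sample as $name) {
    $status = isset($known[strtolower($name)]) ? 'recognized' : 'not recognized';
    echo $name . ': ' . $status . " by mbstring\n";
}
```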