PHP


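PHP create invalid UTF-8 character

Use the chr() function with any number from 128 to 255 to create a single byte that is not valid UTF-8 on its own (chr() wraps values above 255, so chr(256) is the same as chr(0)). This may be useful if you want to test your UTF-8 validation script. See the code below as an example. It'll print 'UTF-8' if the character is valid UTF-8, or nothing otherwise, because mb_detect_encoding() returns false in strict mode when no listed encoding matches.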

echo mb_detect_encoding(chr(128), 'UTF-8', true);
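
If all you need is a yes/no answer, mb_check_encoding() may be a simpler fit than mb_detect_encoding(). The sketch below is only an illustration; is_valid_utf8() is a hypothetical wrapper name, not something from the snippet above.

```php
<?php
// Hypothetical wrapper: true when $bytes is a well-formed UTF-8 string.
function is_valid_utf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

var_dump(is_valid_utf8(chr(65)));               // ASCII 'A': bool(true)
var_dump(is_valid_utf8(chr(128)));              // lone continuation byte: bool(false)
var_dump(is_valid_utf8(chr(0xC3) . chr(0xA9))); // two-byte sequence for 'é': bool(true)
```

Bytes 128 to 255 only become valid when they appear inside a complete multibyte sequence, which is why the last call passes.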

PHP detect encoding of multibyte characters

This script splits a string into individual characters and tests whether each one is valid UTF-8. You may have to modify the character set passed to mb_detect_encoding() to fit your needs.

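// PHP 7.4+ ships its own mb_str_split(), so only define this fallback on older versions.
if (!function_exists('mb_str_split')) {
	function mb_str_split( $string ) {
		// Split at every position that is neither the start nor the end of the string.
		return preg_split('/(?<!^)(?!$)/u', $string );
	}
}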
 
$string = "更多學習";
 
$charlist = mb_str_split($string);
echo '<table>';
foreach($charlist as $char) {
	echo '<tr>';
	echo '<td>' . $char . '</td>';
	echo '<td>' . mb_detect_encoding($char, 'UTF-8', true) . '</td>';
	echo '</tr>';
}
echo '</table>';
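
To see why a character-aware split is needed, here is a minimal sketch that prints how many bytes each character of the same string occupies. It assumes the source file is saved as UTF-8, and it uses preg_split() with the /u flag directly so it runs on any PHP version.

```php
<?php
$string = "更多學習";

// The /u flag makes PCRE treat the subject as UTF-8, so each piece is one character.
$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

foreach ($chars as $char) {
    // strlen() counts bytes, not characters.
    echo $char . ' => ' . strlen($char) . ' bytes (' . bin2hex($char) . ')' . PHP_EOL;
}

echo 'Characters: ' . count($chars) . ', bytes: ' . strlen($string) . PHP_EOL;
```

Each of these Chinese characters takes three bytes in UTF-8, so the four-character string is twelve bytes long.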

PHP detect encoding of each character in a string

Here is a little script I wrote that detects the encoding of each byte in a string. If the byte is not valid UTF-8, it tries to convert it from each of the character encodings below. Each result is printed to the screen, and if the character appears as it should, then the byte fits that encoding. Note that a byte may match more than one encoding, and this will not work with multibyte characters because $string[$i] reads one byte at a time.

$encodings = array("UTF-8", "UTF-16", "ASCII",
		"Windows-1250", "Windows-1251", "Windows-1252", "Windows-1253", "Windows-1254", "Windows-1255", "Windows-1256", "Windows-1257", "Windows-1258",
		"ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-10",
		"ISO-8859-11", "ISO-8859-12", "ISO-8859-13", "ISO-8859-14", "ISO-8859-15", "ISO-8859-16",
		"CP1256", "CP1250", "CP1252", 'CP437', 'CP737', 'CP850', 'CP852', 'CP855', 'CP857', 'CP858', 'CP860', 'CP861', 'CP862', 'CP863', 'CP865',
		'CP866', 'CP869', 'CP37', 'CP930', 'CP1047', 'MIK', 'ISCII', 'TSCII', 'VISCII', 'JIS X 0208', 'EUC-JP', 'GB 2312', 'GBK', 'Big5',
		'HKSCS', 'KS X 1001', 'EUC-KR', 'ISO-2022-KR', 'Mac OS Roman', 'KOI7', 'KOI8-U', 'KOI8-R', 'GB18030', 'GB2312 80'
);
 
$string = 'This is a test string';
echo '<table>';
$len = strlen($string);
for ($i = 0; $i < $len; $i++) {
	$encoding = mb_detect_encoding($string[$i], 'UTF-8', true);
	echo '<tr><td>' . $i . '</td><td>' . $string[$i] . '</td><td>' . $encoding . '</td>';
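	if($encoding !== 'UTF-8') {
		foreach ($encodings as $j) {
			// iconv() returns false (with a warning) for encoding names it does not recognize.
			$converted = @iconv($j, 'UTF-8', $string[$i]);
			echo '<td>' . $j . ': ' . ($converted === false ? 'n/a' : $converted) . '</td>';
		}
	}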
	echo '</tr>';
}
echo '</table>';
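
The reason a byte can "fit" more than one encoding is that single-byte code pages assign different characters to the same value. A small sketch of that idea, assuming your iconv build recognizes the names ISO-8859-1 and CP1251:

```php
<?php
$byte = "\xE9"; // one byte, not valid UTF-8 on its own

// The same byte decodes to a different character under each code page.
$latin1   = iconv('ISO-8859-1', 'UTF-8', $byte); // 'é' (U+00E9)
$cyrillic = iconv('CP1251', 'UTF-8', $byte);     // 'й' (U+0439)

echo 'ISO-8859-1: ' . $latin1 . ' (' . bin2hex($latin1) . ')' . PHP_EOL;
echo 'CP1251: ' . $cyrillic . ' (' . bin2hex($cyrillic) . ')' . PHP_EOL;
```

Both conversions succeed, so only a human (or some knowledge of where the data came from) can say which result is the intended one.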

PHP character encodings array

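This isn't an exhaustive list, but it covers the most common encodings. Not every name here is accepted by every iconv or mbstring build, so check what your installation supports before relying on one. Let me know of any that should be added.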

$encodings = array("UTF-8", "UTF-16", "ASCII",
		"Windows-1250", "Windows-1251", "Windows-1252", "Windows-1253", "Windows-1254", "Windows-1255", "Windows-1256", "Windows-1257", "Windows-1258",
		"ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-10",
		"ISO-8859-11", "ISO-8859-12", "ISO-8859-13", "ISO-8859-14", "ISO-8859-15", "ISO-8859-16",
		"CP1256", "CP1250", "CP1252", 'CP437', 'CP737', 'CP850', 'CP852', 'CP855', 'CP857', 'CP858', 'CP860', 'CP861', 'CP862', 'CP863', 'CP865',
		'CP866', 'CP869', 'CP37', 'CP930', 'CP1047', 'MIK', 'ISCII', 'TSCII', 'VISCII', 'JIS X 0208', 'EUC-JP', 'GB 2312', 'GBK', 'Big5',
		'HKSCS', 'KS X 1001', 'EUC-KR', 'ISO-2022-KR', 'Mac OS Roman', 'KOI7', 'KOI8-U', 'KOI8-R', 'GB18030', 'GB2312 80'		
);
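
One rough way to see which of these names your mbstring build knows about is to compare them with mb_list_encodings(). This is only a sketch: partition_encodings() is a hypothetical helper, it ignores aliases, and iconv keeps its own separate list of names.

```php
<?php
// Hypothetical helper: splits candidate names into those mbstring lists and those it does not.
function partition_encodings(array $candidates): array
{
    // Compare case-insensitively against the canonical names mbstring reports.
    $known = array_map('strtolower', mb_list_encodings());
    $supported = [];
    $unsupported = [];
    foreach ($candidates as $name) {
        if (in_array(strtolower($name), $known, true)) {
            $supported[] = $name;
        } else {
            $unsupported[] = $name;
        }
    }
    return [$supported, $unsupported];
}

[$ok, $missing] = partition_encodings(['UTF-8', 'ASCII', 'JIS X 0208']);
echo 'Listed by mbstring: ' . implode(', ', $ok) . PHP_EOL;
echo 'Not listed: ' . implode(', ', $missing) . PHP_EOL;
```

A name that is not listed may still work through an alias, so treat the second group as "needs checking" rather than "definitely unsupported".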

Writing Chinese, French, and other multibyte characters to Excel using PHPExcel

If you want to export multibyte characters, such as the accented letters used in French and German or Chinese characters, you'll want to export them to Excel rather than a simple delimited file. There are a number of Excel reader/writer libraries out there, but my experience has been with PHPExcel. You may run into problems when writing these characters to the worksheet, such as your string being truncated when a multibyte character is encountered or the Chinese characters turning into question marks (???).

  1. Ensure your MySQL database collation uses UTF-8. Also make sure the field you're storing the text in is set to the same collation. Either utf8_general_ci or utf8_unicode_ci works, with utf8_general_ci being more commonly used. More on the difference here http://stackoverflow.com/questions/2344118/utf-8-general-bin-unicode
  2. Connect to the database and ensure your connection is set to UTF-8. If you're using the older built-in mysql functions, this will do the trick: mysql_set_charset("utf8", $dbLink); . Those functions were removed in PHP 7; with mysqli the equivalent is mysqli_set_charset($dbLink, "utf8"); .
  3. Verify the characters coming out of your database are UTF-8 using mb_detect_encoding($str, 'UTF-8', true); . If the data is not UTF-8, then convert it from its current character set to UTF-8 using iconv(). You can also use iconv('UTF-8', 'UTF-8//IGNORE', $data) to simply strip out any invalid UTF-8 byte sequences.
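
Step 3 can be sketched as a small helper. This is a minimal illustration, not a complete solution: to_utf8() is a hypothetical name, and the ISO-8859-1 default is purely an assumption, since you need to know (or guess) what encoding the data actually uses.

```php
<?php
// Hypothetical helper: returns $str unchanged if it is already valid UTF-8,
// otherwise converts it from $assumedSource (an assumption about your data).
function to_utf8(string $str, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    $converted = iconv($assumedSource, 'UTF-8', $str);
    if ($converted === false) {
        throw new RuntimeException('Could not convert from ' . $assumedSource);
    }
    return $converted;
}

// "caf\xE9" is 'café' in ISO-8859-1; the helper returns the UTF-8 bytes for the same text.
echo bin2hex(to_utf8("caf\xE9")) . PHP_EOL;
```

Strings that are already valid UTF-8 pass through untouched, so it should be safe to run every value through the helper before handing it to the spreadsheet writer.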

Writing the data to PHPExcel will work just fine as long as your data is UTF-8. Everything about PHPExcel is UTF-8 out of the box, so no worries there.