PHP detect encoding of each character in a string


Here is a little script I wrote that checks the encoding of each character in a string. If a character is not valid UTF-8, the script tries to convert it from each of the character encodings listed below and prints every result in a table. If the character appears as it should in a given column, then it fits that encoding. Note that a character may match more than one encoding, and that this will not work with multibyte characters, because the string is examined one byte at a time.
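To illustrate the multibyte caveat, here is a minimal sketch (not part of the script below). It assumes PHP 7 or later for the `\u{...}` escape and the mbstring extension, which the script needs anyway. It shows that `strlen()` and `$string[$i]` operate on bytes, so a two-byte UTF-8 character is seen as two separate invalid pieces.

```php
<?php
// "é" (U+00E9) encoded as UTF-8 occupies two bytes: 0xC3 0xA9.
$utf8 = "\u{00E9}";

// strlen() counts bytes, so a per-index loop would visit this character twice.
$byteCount = strlen($utf8);

// Each byte on its own fails a UTF-8 validity check,
// even though the complete string is valid UTF-8.
$firstByteValid   = mb_check_encoding($utf8[0], 'UTF-8');
$wholeStringValid = mb_check_encoding($utf8, 'UTF-8');

printf(
    "bytes: %d, first byte valid: %s, whole string valid: %s\n",
    $byteCount,
    $firstByteValid ? 'yes' : 'no',
    $wholeStringValid ? 'yes' : 'no'
);
// prints "bytes: 2, first byte valid: no, whole string valid: yes"
```

In other words, the per-byte approach is best suited to strings that are expected to be in a single-byte encoding.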

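// Candidate source encodings to try; which names iconv() accepts depends on the platform's iconv implementation.
$encodings = array("UTF-8", "UTF-16", "ASCII",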
		"Windows-1250", "Windows-1251", "Windows-1252", "Windows-1253", "Windows-1254", "Windows-1255", "Windows-1256", "Windows-1257", "Windows-1258",
		"ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-10",
		"ISO-8859-11", "ISO-8859-12", "ISO-8859-13", "ISO-8859-14", "ISO-8859-15", "ISO-8859-16",
		"CP1256", "CP1250", "CP1252", 'CP437', 'CP737', 'CP850', 'CP852', 'CP855', 'CP857', 'CP858', 'CP860', 'CP861', 'CP862', 'CP863', 'CP865',
		'CP866', 'CP869', 'CP37', 'CP930', 'CP1047', 'MIK', 'ISCII', 'TSCII', 'VISCII', 'JIS X 0208', 'EUC-JP', 'GB 2312', 'GBK', 'Big5',
		'HKSCS', 'KS X 1001', 'EUC-KR', 'ISO-2022-KR', 'Mac OS Roman', 'KOI7', 'KOI8-U', 'KOI8-R', 'GB18030', 'GB2312 80'
);
 
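$string = 'This is a test string';
echo '<table>';
$len = strlen($string); // length in bytes, not characters
for ($i = 0; $i < $len; $i++) {
	$byte = $string[$i];
	// Strict check: returns false unless this single byte is valid UTF-8 on its own
	$encoding = mb_detect_encoding($byte, 'UTF-8', true);
	// Escape everything that is echoed so bytes such as "<" cannot break the table
	echo '<tr><td>' . $i . '</td><td>' . htmlspecialchars($byte, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8') . '</td><td>' . ($encoding ?: 'not UTF-8') . '</td>';
	if ($encoding !== 'UTF-8') {
		foreach ($encodings as $j) {
			// iconv() returns false (and raises a notice or warning, suppressed here)
			// for encoding names it does not know or bytes it cannot convert
			$converted = @iconv($j, 'UTF-8', $byte);
			echo '<td>' . ($converted === false ? '' : htmlspecialchars($converted, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8')) . '</td>';
		}
	}
	echo '</tr>';
}
echo '</table>';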
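Since a character may match more than one encoding, it can help to collect the matches in code instead of reading them off the table. The following is only a sketch under a few assumptions: `matchingEncodings` is a hypothetical helper name, you already know which character the byte is supposed to be, and the short candidate list is supported by your platform's iconv.

```php
<?php
/**
 * Return the names in $candidates under which $byte converts to $expected.
 * (matchingEncodings is a hypothetical helper, not part of the script above.)
 */
function matchingEncodings(string $byte, string $expected, array $candidates): array
{
    $matches = [];
    foreach ($candidates as $name) {
        // iconv() returns false for unknown encoding names or unconvertible
        // bytes; @ suppresses the notice or warning it raises in that case.
        $converted = @iconv($name, 'UTF-8', $byte);
        if ($converted === $expected) {
            $matches[] = $name;
        }
    }
    return $matches;
}

// Byte 0xE9 is "é" in several Latin code pages, but a different letter
// in the Cyrillic (ISO-8859-5) and Greek (ISO-8859-7) ones.
$candidates = ['ISO-8859-1', 'ISO-8859-2', 'ISO-8859-5', 'ISO-8859-7', 'Windows-1252'];
$matches = matchingEncodings("\xE9", "\u{00E9}", $candidates);

echo implode(', ', $matches), "\n";
```

Which names actually appear in the output depends on the iconv build, so treat the result as a list of candidates, not as a definitive answer.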

One thought on “PHP detect encoding of each character in a string”

  • hannes kraft

    +1! Very good idea 🙂 … I made some modifications to make the table easier to read …

    $encodings = array("UTF-8", "UTF-16", "ASCII",
        "Windows-1250", "Windows-1251", "Windows-1252", "Windows-1253", "Windows-1254", "Windows-1255", "Windows-1256", "Windows-1257", "Windows-1258",
        "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-10",
        "ISO-8859-11", "ISO-8859-12", "ISO-8859-13", "ISO-8859-14", "ISO-8859-15", "ISO-8859-16",
        "CP1256", "CP1250", "CP1252", 'CP437', 'CP737', 'CP850', 'CP852', 'CP855', 'CP857', 'CP858', 'CP860', 'CP861', 'CP862', 'CP863', 'CP865',
        'CP866', 'CP869', 'CP37', 'CP930', 'CP1047', 'MIK', 'ISCII', 'TSCII', 'VISCII', 'JIS X 0208', 'EUC-JP', 'GB 2312', 'GBK', 'Big5',
        'HKSCS', 'KS X 1001', 'EUC-KR', 'ISO-2022-KR', 'Mac OS Roman', 'KOI7', 'KOI8-U', 'KOI8-R', 'GB18030', 'GB2312 80'
    );

    $filename = "2018_01_klein.TXT";
    $txt = file_get_contents($filename);

    $string = 'This is a test string';
    $string = $txt;

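    print '<table>';
    // Header row: three empty cells so the encoding names line up with the data columns
    print "<tr><th></th><th></th><th></th>";
    foreach ($encodings as $j)
    {
        print '<th>' . $j . '</th>';
    }
    print "</tr>";

    $len = strlen($string);
    for ($i = 0; $i < $len; $i++)
    {
        $encoding = mb_detect_encoding($string[$i], 'UTF-8', true);
        print '<tr><td>' . $i . '</td><td>' . $string[$i] . '</td><td>' . $encoding . '</td>';
        if($encoding != 'UTF-8')
        {
            foreach ($encodings as $j)
            {
                print '<td>' . iconv($j, 'UTF-8', $string[$i]) . '</td>';
            }
        }
        else
        {
            // Pad valid UTF-8 rows with empty cells so every row has the same width
            foreach ($encodings as $j)
            {
                print '<td> </td>';
            }
        }
        print '</tr>';
    }
    print '</table>';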