Unicode to Chinese conversion notes

A Unicode escape is written like “\u4a44”.

Chinese characters are located from 0x3400 to 0x9fa5, covering both simplified and traditional characters.
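As a quick illustration of the range above, here is a minimal Python sketch that tests whether a character falls between 0x3400 and 0x9fa5. The function name `in_table_range` is hypothetical and only mirrors the bounds stated here.

```python
# Bounds taken from the range stated above.
FIRST = 0x3400  # 13312 in decimal
LAST = 0x9FA5   # 40869 in decimal


def in_table_range(ch: str) -> bool:
    """Return True if the single character ch lies inside the range."""
    return FIRST <= ord(ch) <= LAST


print(in_table_range(chr(0x4E07)))  # a character inside the range
print(in_table_range("A"))          # a Latin letter, outside the range
```

The lower bound, 13312, is the same offset that the lookup code subtracts from a code point to get a table index.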

Online tool: Unicode to Chinese converter

C#

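Store the Unicode data as a single string of characters in code-point order. For each Unicode escape in the text, convert its hex digits to an integer and use that number, minus the offset 13312 (0x3400), as an index into the string to get the character that replaces the escape.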

string ZHTable = "㐀㐁㐂㐃㐄㐅㐆㐇㐈㐉㐊㐋㐌㐍㐎㐏㐐... ...";

 

private string GetChar(string unicode)
{
    unicode = unicode.Replace("\\u", "");
    var code = Convert.ToInt32(unicode, 16);
    return ZHTable[code - 13312].ToString();
}
var matches = Regex.Matches(text, @"\\u[a-f0-9]{4}");

string result = text;
if (matches.Count > 0) {
    foreach (var m in matches) {
        var key = m.ToString();
        result = result.Replace(key, GetChar(m.ToString()));
    }
}

 

public string ToChinese(string text) {
    var matches = Regex.Matches(text, @"\\u[a-f0-9]{4}");

    string result = text;
    if (matches.Count > 0) {
        foreach (var m in matches) {
            var key = m.ToString();
            result = result.Replace(m.ToString(), GetChar(key));
        }
    }
    return result;
}
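The same table-lookup idea can be sketched in Python for comparison. This is only an illustration under stated assumptions: the table is rebuilt here with `chr` so the example is self-contained (the original stores it as a literal string), it is assumed to hold every code point from 0x3400 to 0x9fa5 in order, and the names `ZH_TABLE` and `to_chinese` are hypothetical.

```python
import re

BASE = 0x3400  # 13312, the code point of the first character in the table

# Assumption: the table contains every code point from 0x3400 to 0x9fa5, in order.
ZH_TABLE = "".join(chr(c) for c in range(BASE, 0x9FA5 + 1))

# A literal backslash, the letter u, then exactly four hex digits (captured).
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def to_chinese(text):
    """Replace each \\uXXXX escape with the character found in the table."""
    def lookup(match):
        index = int(match.group(1), 16) - BASE
        if 0 <= index < len(ZH_TABLE):
            return ZH_TABLE[index]
        return match.group(0)  # outside the table: leave the escape unchanged
    return ESCAPE.sub(lookup, text)


print(to_chinese(r"\u4e07\u4eba and \u0041"))
```

Unlike the loop with repeated `Replace` calls, a single substitution pass visits each escape once, and the bounds check avoids an index error for escapes outside the table.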

PHP

Store the Unicode data in a dictionary-like array keyed by code point. For each Unicode escape in the text, look up the array to get the character that replaces it.

$unicodedata = [
… …
0x3437 => '㐷',
0x3438 => '㐸',
0x3439 => '㐹',
0x343a => '㐺',
0x343b => '㐻',
0x343c => '㐼',
… …
];

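To convert the four hex characters to an integer, handle each character by position:

For a digit 0-9, multiply its value by the matching power of 16.

For a letter a-f, subtract 87 from its ASCII code so that a=10 and f=15, then multiply by the matching power of 16.

Hex letters only go from a to f; for any other character, throw an exception.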

if($this->is_number($u[$i])){
	$val += $u[$i] * pow(16, 3-$i);
}
elseif($this->is_hexletter($u[$i])){
	// Parentheses are required: * binds tighter than -, so without them 87 is multiplied first.
	$val += (ord($u[$i])-87) * pow(16, 3-$i);
}
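The digit-by-digit conversion can also be sketched in Python, which makes the arithmetic easy to check. The function name `hex4_to_int` is hypothetical, and lowercasing the input first is an added assumption so that uppercase A-F also works with the "minus 87" rule.

```python
def hex4_to_int(u):
    """Convert four hex characters, e.g. "4e07", to an integer."""
    if len(u) != 4:
        raise ValueError("expected exactly 4 hex characters")
    val = 0
    for i, ch in enumerate(u.lower()):
        if "0" <= ch <= "9":
            digit = ord(ch) - 48  # '0' is ASCII 48
        elif "a" <= ch <= "f":
            digit = ord(ch) - 87  # 'a' is ASCII 97, so a=10 ... f=15
        else:
            raise ValueError("not a hex digit: " + repr(ch))
        # The leftmost character (i = 0) carries the highest power, 16^3.
        val += digit * 16 ** (3 - i)
    return val


print(hex4_to_int("3400"))  # prints 13312
```

For "3400" the sum is 3 × 4096 + 4 × 256 + 0 × 16 + 0 = 13312, which matches the table offset used in the C# version.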

Match every Unicode escape, which starts with \u and is followed by 4 characters that are digits or the letters a to f.
Replace each match with its final character value.

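// Single quotes plus four backslashes give the regex \\u: a literal backslash followed by u.
// PCRE does not support a bare \u escape, so the backslash itself must be escaped.
$pattern = '/\\\\u[\da-f]{4}/i';
preg_match_all($pattern, $text, $matches);

foreach($matches[0] as $m){
	$char = $this->get_char($m);
	// $m already contains the leading backslash, so replace it directly.
	$text = str_replace($m, $char, $text);
}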

Web API

For a web API, it is best for performance to accept the Unicode escapes as a JSON array and return a JSON dictionary string to the consumer. The front end can then replace the escapes in an AJAX callback.
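The request and response shape described above can be sketched in Python as follows. This is only an illustration: the function name `to_chinese_map` is hypothetical, and Python's built-in `chr` stands in for the lookup table to keep the example short.

```python
import json
import re

# A literal backslash, the letter u, then exactly four hex digits (captured).
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def to_chinese_map(unicodes):
    """Map each \\uXXXX escape in the list to its character, as a JSON string."""
    result = {}
    for uc in unicodes:
        match = ESCAPE.fullmatch(uc)
        if match is None:
            raise ValueError("not a Unicode escape: " + repr(uc))
        # Assigning by key means repeated escapes simply overwrite each other.
        result[uc] = chr(int(match.group(1), 16))
    # ensure_ascii=False keeps the Chinese characters readable in the output.
    return json.dumps(result, ensure_ascii=False)


print(to_chinese_map([r"\u4e07", r"\u4eba", r"\u4e07"]))
```

In the JSON output the backslash in each key is itself escaped, so a consumer that parses the response gets back the literal escape text as the key and the character as the value.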

C#

public Dictionary<string,string> ToChinese(string[] unicodes) {
    // Anchor the pattern so each entry must be exactly one \uXXXX escape.
    foreach (var uc in unicodes) {
        if (!Regex.IsMatch(uc, @"^\\u[a-f0-9]{4}$"))
            throw new ArgumentException("The array contains non-unicode characters.");
    }

    Dictionary<string, string> dict = new Dictionary<string, string>();

    foreach(var uc in unicodes) {
        // The indexer tolerates repeated escapes; Add would throw on a duplicate key.
        dict[uc] = GetChineseWord(uc).ToString();
    }
    return dict;
}
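// Best fed with the smallest possible input, e.g. @"\u4e07\u9526\u534e\u4eba"
public string ToChineseAPI(string unicodejson) {
    // Cast<Match>() (System.Linq) also works where MatchCollection is not IEnumerable<Match>.
    var matches = Regex.Matches(unicodejson, @"\\u[a-f0-9]{4}").Cast<Match>().Select(x => x.Value);

    var dict = ToChinese(matches.ToArray());
    // Quote the value and escape the backslash in the key so the output is valid JSON.
    var entries = dict.Select(d => string.Format("\"{0}\": \"{1}\"", d.Key.Replace("\\", "\\\\"), d.Value));
    return "{" + string.Join(",", entries) + "}";
}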

PHP

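function convertjson($text){
    // Single quotes plus four backslashes give the regex \\u: a literal backslash followed by u.
    $pattern = '/\\\\u[\da-f]{4}/i';
    preg_match_all($pattern, $text, $matches);

    $arr = array();
    foreach($matches[0] as $m){
        // $m already contains the leading backslash; json_encode escapes it in the output.
        $arr[$m] = $this->get_char($m);
    }

    return json_encode($arr, JSON_UNESCAPED_UNICODE|JSON_UNESCAPED_SLASHES);
}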


Author: Albert
