
An Algorithm for Detecting Whether Text Is UTF8 Encoded

April 12, 2018 ⁄ General

Reference: http://www.cnblogs.com/powertoolsteam/archive/2010/09/20/1831638.html

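Problem: UTF8 text comes in two flavors, with a BOM header and without one. The BOM case is easy to detect: if the first three bytes are 0xEF 0xBB 0xBF, the data is UTF8 with a BOM. Without a BOM there is no quick way to tell whether the data is UTF8 or an ANSI (legacy code page) encoding, because neither has a header and pure English (ASCII) text is encoded identically in both. That is the crux of the problem.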
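The BOM check described above can be sketched in a few lines of C#. The names `BomSketch` and `HasUtf8Bom` are made up for illustration and do not come from the referenced article; the sketch only inspects the first three bytes of a buffer.

```csharp
using System;

public static class BomSketch
{
    // Hypothetical helper: returns true when the buffer starts with
    // the UTF-8 BOM bytes 0xEF 0xBB 0xBF.
    public static bool HasUtf8Bom(byte[] data)
    {
        return data != null
            && data.Length >= 3
            && data[0] == 0xEF
            && data[1] == 0xBB
            && data[2] == 0xBF;
    }
}
```

For a buffer such as `{ 0xEF, 0xBB, 0xBF, 0x41 }` the helper reports a BOM. For a buffer holding only `0x41` it does not, which is exactly the ambiguous no-BOM case the rest of this post deals with.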

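Solution:

A few days ago I happened to see a forum post asking "how to automatically detect whether a Chinese parameter in a URL is encoded as GB2312 or Utf-8".

I also read wcwtitxu's algorithm, which detects UTF8 encoding with a very impressive regular expression.

However, a regular expression built from a long chain of alternations does not perform well in practice.

I happened to have a similar requirement in a past project, so here are the approach and the cleaned-up source code for your reference.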

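First, the principle:

The UTF8 encoding rules are shown in the following table.

[Table: UTF8 Encoding Rule]

It looks complicated, but it boils down to the following: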

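ASCII characters (U+0000 - U+007F) are not transformed; each one is stored as a single byte.

All other characters are encoded as follows:

•The first byte, in binary, consists of n ones followed by a zero (n >= 2). The bits after the zero store part of the actual character code, and n is the total number of bytes in the multi-byte sequence (including the first byte). 

•The first byte is followed by n - 1 bytes that each start with 10; the low 6 bits of each store the rest of the actual character code. 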
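To make the two rules concrete, here is a minimal sketch that reads the sequence length from a lead byte and recognizes a continuation byte. The names `Utf8LeadByteSketch`, `SequenceLength` and `IsContinuation` are hypothetical. The sketch follows the general rule as stated here, so it also reports lengths above 4, although the current UTF-8 standard (RFC 3629) limits a sequence to 4 bytes.

```csharp
using System;

public static class Utf8LeadByteSketch
{
    // Returns the total sequence length announced by a lead byte:
    // 1 for an ASCII byte, n for a byte made of n ones followed by a zero (n >= 2),
    // and 0 for a byte that cannot start a sequence
    // (a 10xxxxxx continuation byte, or 0xFF, which has no zero bit).
    public static int SequenceLength(byte lead)
    {
        if ((lead & 0x80) == 0)
        {
            return 1;
        }

        // Count the leading one bits, starting from the highest bit.
        int ones = 0;
        for (int mask = 0x80; mask != 0 && (lead & mask) != 0; mask >>= 1)
        {
            ones++;
        }

        return (ones >= 2 && ones < 8) ? ones : 0;
    }

    // A continuation byte has the bit pattern 10xxxxxx.
    public static bool IsContinuation(byte b)
    {
        return (b & 0xC0) == 0x80;
    }
}
```

As an example, the character '中' (U+4E2D) is encoded in UTF-8 as 0xE4 0xB8 0xAD. The lead byte 0xE4 is 11100100 in binary, so `SequenceLength(0xE4)` gives 3, and both 0xB8 and 0xAD pass `IsContinuation`.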

Therefore, by analyzing the whole byte stream we can decide whether it is UTF8 encoded.

Based on these rules, my C# code is as follows:

public static class EncodingHelper
{
    /// <summary>
    /// Determines whether the given <paramref name="inputStream"/> contains UTF8 encoded bytes.
    /// </summary>
    /// <param name="inputStream">The input bytes.</param>
    /// <returns>
    /// <see langword="true"/> if the given byte stream is in UTF8 encoding; otherwise, <see langword="false"/>.
    /// </returns>
    /// <remarks>
    /// Input that consists only of ASCII chars is not regarded as UTF8 encoding.
    /// Only the bit structure is checked: overlong forms and sequences longer than 4 bytes are not rejected.
    /// </remarks>
    public static bool IsTextUTF8(ref byte[] inputStream)
    {
        // Number of following (10xxxxxx) bytes still expected for the current character.
        int encodingBytesCount = 0;
        bool allTextsAreASCIIChars = true;

        for (int i = 0; i < inputStream.Length; i++)
        {
            byte current = inputStream[i];

            if ((current & 0x80) == 0x80)
            {
                allTextsAreASCIIChars = false;
            }

            // First byte
            if (encodingBytesCount == 0)
            {
                if ((current & 0x80) == 0)
                {
                    // ASCII chars, from 0x00-0x7F
                    continue;
                }

                if ((current & 0xC0) == 0xC0)
                {
                    encodingBytesCount = 1;
                    current <<= 2;

                    // More than two bytes are used to encode a unicode char.
                    // Calculate the real length.
                    while ((current & 0x80) == 0x80)
                    {
                        current <<= 1;
                        encodingBytesCount++;
                    }
                }
                else
                {
                    // Invalid bits structure for UTF8 encoding rule.
                    return false;
                }
            }
            else
            {
                // Following bytes, must start with 10.
                if ((current & 0xC0) == 0x80)
                {
                    encodingBytesCount--;
                }
                else
                {
                    // Invalid bits structure for UTF8 encoding rule.
                    return false;
                }
            }
        }

        if (encodingBytesCount != 0)
        {
            // Invalid bits structure for UTF8 encoding rule.
            // Wrong following bytes count.
            return false;
        }

        // Although UTF8 supports encoding for ASCII chars, an input stream whose
        // contents are all ASCII is regarded as being in the default encoding.
        return !allTextsAreASCIIChars;
    }
}
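As a cross-check for a hand-written scanner like the one above, the framework's own strict decoder can be used. This is a different technique from the bit scan, shown only as a sketch: `StrictUtf8Sketch` and `DecodesAsUtf8` are hypothetical names, and the approach relies on `UTF8Encoding` being constructed with `throwOnInvalidBytes` set to true, so that decoding raises `DecoderFallbackException` on invalid input.

```csharp
using System;
using System.Text;

public static class StrictUtf8Sketch
{
    // Returns true when the bytes decode as UTF-8 without hitting an invalid sequence.
    // Unlike IsTextUTF8, pure ASCII input counts as valid UTF-8 here.
    public static bool DecodesAsUtf8(byte[] data)
    {
        // First argument: do not emit a BOM when encoding.
        // Second argument: throw on invalid bytes instead of substituting U+FFFD.
        UTF8Encoding strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

The two approaches answer slightly different questions: the strict decoder applies the framework's full validity rules and accepts all-ASCII input, while `IsTextUTF8` deliberately returns false for all-ASCII input. The decoder also builds a string that is then discarded, so the byte scan is likely the lighter option for large inputs.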

 

 

The unit test code is attached as well:

 

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;

/// <summary>
/// This is a test class for EncodingHelperTest and is intended
/// to contain all EncodingHelperTest Unit Tests
/// </summary>
[TestClass()]
public class EncodingHelperTest
{
    /// <summary>
    /// Normal test for this method.
    /// </summary>
    [TestMethod()]
    public void IsTextUTF8Test()
    {
        for (int i = 0; i < 1000; i++)
        {
            // Always start with a non-ASCII char so the text is never all ASCII.
            List<char> chars = new List<char>();
            chars.Add('中');

            List<UnicodeCategory> temp = new List<UnicodeCategory>();
            Random rd = new Random((int)(DateTime.Now.Ticks & 0x7FFFFFFF));

            for (int j = 0; j < 255; j++)
            {
                char ch = (char)rd.Next(0xFFFF);
                UnicodeCategory uc = System.Globalization.CharUnicodeInfo.GetUnicodeCategory(ch);
                if (uc == UnicodeCategory.Surrogate ||   // A single surrogate cannot be encoded correctly.
                    uc == UnicodeCategory.PrivateUse ||  // Private use blocks should be excluded.
                    uc == UnicodeCategory.OtherNotAssigned)
                {
                    // Retry this position with another random char.
                    j--;
                }
                else
                {
                    chars.Add(ch);
                    temp.Add(uc);
                }
            }

            string str = new string(chars.ToArray());

            byte[] inputStream = Encoding.UTF8.GetBytes(str);
            bool expected = true;
            bool actual;
            actual = EncodingHelper.IsTextUTF8(ref inputStream);
            Assert.AreEqual(expected, actual, string.Format("UTF8_Assert Fails at:{0}", str));

            // Code page 932 is Shift-JIS. On .NET Core and later, code page encodings must first be
            // registered with Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
            inputStream = Encoding.GetEncoding(932).GetBytes(str);
            expected = false;

            actual = EncodingHelper.IsTextUTF8(ref inputStream);
            Assert.AreEqual(expected, actual, string.Format("ShiftJIS_Assert Fails at:{0}", str));
        }
    }

    /// <summary>
    /// Check with All ASCII chars
    /// </summary>
    [TestMethod]
    public void IsTextUTF8Test_AllASCII()
    {
        string str = "ABCDEFGHKLHSJKLDFHJKLHAJKLSHJKLHAJKLSHDJKLAHSDJKLHAJKLSDHJKLASHDJKLHASJKLDHJKLASD";

        byte[] inputStream = Encoding.UTF8.GetBytes(str);
        bool expected = false;
        bool actual;
        actual = EncodingHelper.IsTextUTF8(ref inputStream);
        Assert.AreEqual(expected, actual, string.Format("UTF8_Assert Fails at:{0}", str));
    }
}

 

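Also:

If the goal is to check whether a file is UTF8 encoded, this method is not always necessary, because a file saved in UTF8 format often begins with a three-byte BOM header (0xEF 0xBB 0xBF) that marks the file as UTF8 encoded.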
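For the file case, a rough sketch of the header check could read the first bytes of a stream and compare them with the preamble reported by `Encoding.UTF8.GetPreamble()`. The names `FileBomSketch` and `StartsWithUtf8Bom` are hypothetical, and the sketch assumes the stream is positioned at the start of the file.

```csharp
using System;
using System.IO;
using System.Text;

public static class FileBomSketch
{
    // Reads as many bytes as the UTF-8 preamble has (0xEF 0xBB 0xBF)
    // from the current position and compares them with it.
    public static bool StartsWithUtf8Bom(Stream stream)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        byte[] head = new byte[bom.Length];

        // Stream.Read may return fewer bytes than requested, so loop until full.
        int read = 0;
        while (read < head.Length)
        {
            int n = stream.Read(head, read, head.Length - read);
            if (n == 0)
            {
                return false; // The stream is shorter than the BOM.
            }
            read += n;
        }

        for (int i = 0; i < bom.Length; i++)
        {
            if (head[i] != bom[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

With a real file this would be called on something like `File.OpenRead(path)`, where `path` is a placeholder. When no BOM is found the file may still be UTF8, so the byte-level analysis is still needed as a fallback.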
