Automatically Detecting the Character Set of Text: URL Page Encoding + Text File Encoding

February 10, 2018 ⁄ General

Automatically detecting the character set of text
(How to use the Java port of Mozilla charset detector)

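Java port of Mozilla charset detector is a Java implementation of the tool developed by Mozilla that lets browsers automatically identify the character set of a web page. Mozilla's page for the tool is: http://www.mozilla.org/projects/intl/chardet.html .
The Java implementation is hosted at: http://sourceforge.net/projects/jchardet/ .
Character set problems come up often in Java development (they are really a historical legacy of the software industry rather than a Java problem). Sometimes the encoding of a piece of text is unknown, and processing it produces unexpected results. This tool cannot detect the character set with 100% accuracy, but it is good enough to get by. Detection is not very efficient and has some impact on system performance, so use it with caution in systems with strict performance requirements.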
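One way to keep the cost bounded is to hand the detector only a prefix of the input. The following is a minimal sketch under that assumption, using only the JDK; `SampleReader`, `readSample` and the 8192-byte limit are hypothetical names and values, not part of jchardet, and a shorter sample may make the guess less reliable.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class SampleReader {
    // Hypothetical helper: reads at most maxBytes from the stream and returns
    // exactly the bytes that were read (fewer if the stream ends early).
    static byte[] readSample(InputStream in, int maxBytes) {
        byte[] sample = new byte[maxBytes];
        int total = 0;
        try {
            while (total < maxBytes) {
                int n = in.read(sample, total, maxBytes - total);
                if (n == -1) {
                    break; // end of stream before the limit was reached
                }
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Arrays.copyOf(sample, total);
    }

    public static void main(String[] args) {
        byte[] data = new byte[20000]; // stands in for a large page or file
        byte[] sample = readSample(new ByteArrayInputStream(data), 8192);
        System.out.println("bytes to hand to the detector: " + sample.length);
    }
}
```

The returned array could then be passed to the detector in one call instead of streaming the whole input.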

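Below is a brief explanation of how to use the tool:

Step 1:
In the class that needs to detect the character set, implement the nsICharsetDetectionObserver interface, which has a single method to implement: Notify(). The method is called when detection finishes and receives the detected character set as its argument (whether or not the guess is correct).

The interface is defined as follows:

Java code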

package org.mozilla.intl.chardet ;

import java.lang.* ;

public interface nsICharsetDetectionObserver {
  public void Notify(String charset) ;
}
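As an illustration of the callback pattern, here is a minimal sketch of an observer that stores the reported charset in a field instead of printing it. `CharsetObserver` is a local stand-in with the same shape as the interface above, declared only so the sketch compiles without jchardet on the classpath; `ObserverSketch` and `CharsetHolder` are hypothetical names.

```java
import java.util.Optional;

public class ObserverSketch {
    // Stand-in with the same shape as nsICharsetDetectionObserver (assumption:
    // with jchardet available, the real interface would be implemented instead).
    interface CharsetObserver {
        void Notify(String charset);
    }

    // Hypothetical helper: remembers the last charset passed to Notify().
    static final class CharsetHolder implements CharsetObserver {
        private String charset; // stays null until Notify() is called

        @Override
        public void Notify(String charset) {
            this.charset = charset;
        }

        Optional<String> result() {
            return Optional.ofNullable(charset);
        }
    }

    public static void main(String[] args) {
        CharsetHolder holder = new CharsetHolder();
        System.out.println("reported yet? " + holder.result().isPresent());
        holder.Notify("UTF-8"); // simulates the detector calling back
        System.out.println("charset = " + holder.result().orElse("unknown"));
    }
}
```

Keeping the result in an instance field avoids a shared static flag, which matters if several detections run in the same process.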

Step 2:
Create an nsDetector instance and scan the byte stream. If non-ASCII bytes are found, call nsDetector's DoIt() method to run detection.
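To make the "ASCII only" test concrete, here is a minimal JDK-only sketch of such a check: a byte is treated as ASCII when its high bit is clear. `AsciiCheckSketch` and `isAsciiOnly` are hypothetical names for illustration, and this is not presented as the library's own implementation.

```java
import java.nio.charset.StandardCharsets;

public class AsciiCheckSketch {
    // Hypothetical helper: true if none of the first len bytes has the high bit set.
    static boolean isAsciiOnly(byte[] buf, int len) {
        for (int i = 0; i < len; i++) {
            if ((buf[i] & 0x80) != 0) {
                return false; // found a byte outside the 7-bit ASCII range
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] plain = "hello".getBytes(StandardCharsets.US_ASCII);
        byte[] mixed = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println("plain is ASCII: " + isAsciiOnly(plain, plain.length));
        System.out.println("mixed is ASCII: " + isAsciiOnly(mixed, mixed.length));
    }
}
```

Only buffers that fail a check like this need to go through the more expensive detection step.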

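Step 3:
Once the stream has been fully processed, call nsDetector's DataEnd() method, and the detection engine reports the detected character set to the caller. Below is an example that uses the tool to detect the page encoding of a given URL:

Java code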

package org.mozilla.intl.chardet ;
import java.io.* ;
import java.net.* ;
import java.util.* ;
import org.mozilla.intl.chardet.* ;

public class HtmlCharsetDetector {

    public static boolean found = false ;

    public static void main(String argv[]) throws Exception {

	if (argv.length != 1 && argv.length != 2) {

	  System.out.println("Usage: HtmlCharsetDetector <url> [<languageHint>]");

	  System.out.println("");
	  System.out.println("Where <url> is http://...");
	  System.out.println("For optional <languageHint>. Use following...");
	  System.out.println("		1 => Japanese");
	  System.out.println("		2 => Chinese");
	  System.out.println("		3 => Simplified Chinese");
	  System.out.println("		4 => Traditional Chinese");
	  System.out.println("		5 => Korean");
	  System.out.println("		6 => Dont know (default)");

	  return ;
	} 


	// Initialize the nsDetector
	int lang = (argv.length == 2)? Integer.parseInt(argv[1]) : nsPSMDetector.ALL ;
	nsDetector det = new nsDetector(lang) ;

	// Set the observer...
	// Notify() is called if a character set is detected.

	det.Init(new nsICharsetDetectionObserver() {
		public void Notify(String charset) {
		    HtmlCharsetDetector.found = true ;
		    System.out.println("CHARSET = " + charset);
		}
    	});

	URL url = new URL(argv[0]);
	BufferedInputStream imp = new BufferedInputStream(url.openStream());
	
	byte[] buf = new byte[1024] ;
	int len;
	boolean done = false ;
	boolean isAscii = true ;
	   
	while( (len=imp.read(buf,0,buf.length)) != -1) {

		// Check whether the stream so far contains only ASCII
		if (isAscii)
		    isAscii = det.isAscii(buf,len);

		// If non-ASCII bytes are present, call DoIt() to run detection
		if (!isAscii && !done)
		    done = det.DoIt(buf,len, false);
	}
	det.DataEnd();

	if (isAscii) {
	   System.out.println("CHARSET = ASCII");
	   found = true ;
	}

	if (!found) {
	   String prob[] = det.getProbableCharsets() ;
	   for(int i=0; i<prob.length; i++) {
		System.out.println("Probable Charset = " + prob[i]);
	   }
	}
    }
}

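Below is an example that detects the character encoding of a text file. It is almost the same as the URL example; only the way the input stream is obtained differs:

Java code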

package com.iteye.sheng.util.tools;

import java.io.* ;
import org.mozilla.intl.chardet.* ;

public class FileCharsetDetector {

    public static boolean found = false ;

    public static void main(String argv[]) throws Exception {

        if (argv.length != 1 && argv.length != 2) {
            System.out.println("Usage: FileCharsetDetector <file> [<languageHint>]");
            System.out.println("");
            System.out.println("Where <file> is C:/xxxxx.txt");
            System.out.println("For optional <languageHint>. Use following...");
            System.out.println(" 1 => Japanese");
            System.out.println(" 2 => Chinese");
            System.out.println(" 3 => Simplified Chinese");
            System.out.println(" 4 => Traditional Chinese");
            System.out.println(" 5 => Korean");
            System.out.println(" 6 => Dont know (default)");
            return ;
        }

        // Initialize the nsDetector
        int lang = (argv.length == 2)? Integer.parseInt(argv[1]) : nsPSMDetector.ALL ;
        nsDetector det = new nsDetector(lang) ;

        // Set the observer...
        // Notify() is called if a character set is detected.
        det.Init(new nsICharsetDetectionObserver() {
            public void Notify(String charset) {
                FileCharsetDetector.found = true ;
                System.out.println("CHARSET = " + charset);
            }
        });

        String filename = argv[0];
        BufferedInputStream imp = new BufferedInputStream(new FileInputStream(filename));

        byte[] buf = new byte[1024] ;
        int len;
        boolean done = false ;
        boolean isAscii = true ;

        while( (len=imp.read(buf,0,buf.length)) != -1) {

            // Check whether the stream so far contains only ASCII
            if (isAscii)
                isAscii = det.isAscii(buf,len);

            // If non-ASCII bytes are present, call DoIt() to run detection
            if (!isAscii && !done)
                done = det.DoIt(buf,len, false);
        }
        imp.close();   // release the file handle once reading is finished
        det.DataEnd();

        if (isAscii) {
            System.out.println("CHARSET = ASCII");
            found = true ;
        }

        if (!found) {
            String prob[] = det.getProbableCharsets() ;
            for(int i=0; i<prob.length; i++) {
                System.out.println("Probable Charset = " + prob[i]);
            }
        }
    }
}
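Because the detected name is only a guess, it can be worth sanity-checking it before use. The sketch below, which assumes nothing beyond the JDK, checks that the JVM supports the name and that the bytes decode under it without errors. `CharsetSanityCheck` and `decodesCleanly` are hypothetical names; a clean decode does not prove the guess is right (single-byte charsets such as ISO-8859-1 accept any byte sequence), it only rules out some wrong ones.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetSanityCheck {
    // Hypothetical helper: true if charsetName is supported by this JVM and
    // data decodes under it with no malformed or unmappable input.
    static boolean decodesCleanly(byte[] data, String charsetName) {
        if (charsetName == null || !isKnown(charsetName)) {
            return false;
        }
        try {
            Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false; // the bytes are not valid in this charset
        }
    }

    private static boolean isKnown(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false; // syntactically illegal charset name
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        System.out.println("as UTF-8:    " + decodesCleanly(utf8, "UTF-8"));
        System.out.println("as US-ASCII: " + decodesCleanly(utf8, "US-ASCII"));
    }
}
```

A check like this could be applied to the name received in Notify() or to each entry returned by getProbableCharsets() before the text is actually decoded.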

Author: 07168236991

This post was published by 07168236991 6 years ago in the General category and was last updated on February 10, 2018.