Replacing blocked words in an article with a regex: 1,500 blocked words, a 6 KB article, replacement takes 1 ms


May 11, 2012 ⁄ General

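I have been using a regex to replace blocked words in articles for a long time. Since it never caused any noticeable load, I had never optimized its performance.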

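Today, at my team lead's request, I benchmarked it and made some improvements. The improved version is more than 100 times faster: replacing the blocked words in one article used to take over 130 ms, and now takes less than 1 ms.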

The main differences between the two versions are how the regex is built and how many times the article text is scanned.
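The slower original version is not shown in this post, so the following is only a rough sketch of why the number of scans matters. It assumes the old approach ran one replacement per blocked word, while the new approach builds a single alternation once and scans the text once. The class name, method names, and the three-word list are hypothetical placeholders, not the real word list.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ScanCountSketch
{
    // Hypothetical word list; the post uses about 1,500 words from a database.
    private static readonly string[] Words = { "foo", "bar", "baz" };

    // Combined approach: build one alternation once and reuse it for every article.
    private static readonly Regex Combined = new Regex(
        string.Join("|", Words.Select(w => Regex.Escape(w))),
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Assumed slow approach: one full pass over the text for each blocked word.
    public static string ReplacePerWord(string text)
    {
        foreach (string word in Words)
        {
            text = Regex.Replace(text, Regex.Escape(word), "***", RegexOptions.IgnoreCase);
        }
        return text;
    }

    // Single pass over the text, no matter how many words are in the list.
    public static string ReplaceCombined(string text)
    {
        return Combined.Replace(text, "***");
    }

    public static void Main()
    {
        string text = "Foo met bar and baz.";
        Console.WriteLine(ReplacePerWord(text));   // *** met *** and ***.
        Console.WriteLine(ReplaceCombined(text));  // *** met *** and ***.
    }
}
```

Both methods give the same result here. With 1,500 words the per-word loop scans the article 1,500 times, while the combined regex scans it once.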

The main code is posted below for reference.

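        // Requires: using System.Collections.Generic; using System.Text; using System.Text.RegularExpressions;
        // reg_b: positions inside a word, i.e. between two adjacent word characters
        private static readonly Regex reg_b = new Regex(@"\B", RegexOptions.Compiled);
        // reg_en: the word contains Latin letters
        private static readonly Regex reg_en = new Regex(@"[a-zA-Z]+", RegexOptions.Compiled);
        // reg_num: the word consists only of digits, '-', '.', and whitespace
        private static readonly Regex reg_num = new Regex(@"^[\-\.\s\d]+$", RegexOptions.Compiled);

        private static Regex reg_word = null; // single regex combining all blocked words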

        private static Regex GetRegex()
        {
            if (reg_word == null)
            {
                reg_word = new Regex(GetPattern(), RegexOptions.Compiled | RegexOptions.IgnoreCase);
            }
            return reg_word;
        }

        /// <summary>
        /// Checks whether the input contains any blocked word (returns true if it does).
        /// </summary>
        public static bool HasBlockWords(string raw)
        {
            return GetRegex().Match(raw).Success;
        }
        /// <summary>
        /// Replaces each blocked word with "***".
        /// </summary>
        public static string WordsFilter(string raw)
        {
            return GetRegex().Replace(raw, "***");
        }
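        /// <summary>
        /// Returns the blocked words found in the input.
        /// </summary>
        public static IEnumerable<string> GetBlockWords(string raw)
        {
            // Use GetRegex() rather than reg_word directly: reg_word is still null
            // if this method is the first one called.
            foreach (Match mat in GetRegex().Matches(raw))
            {
                yield return (mat.Value);
            }
        }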
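        /// <summary>
        /// Builds one alternation pattern that covers every blocked word.
        /// </summary>
        private static string GetPattern()
        {
            StringBuilder patt = new StringBuilder();
            string s;
            foreach (string word in GetBlockWords())
            {
                if (word.Length == 0) continue;
                if (word.Length == 1 || reg_num.IsMatch(word))
                {
                    // Single characters and numeric words are matched literally.
                    // Escape them so that characters such as '.' are not read as regex syntax.
                    patt.AppendFormat("|({0})", Regex.Escape(word));
                }
                else if (reg_en.IsMatch(word))
                {
                    // Words with Latin letters: allow up to 3 non-letter characters between letters.
                    // The word itself is not escaped here, so it must not contain regex metacharacters.
                    s = reg_b.Replace(word, @"(?:[^a-zA-Z]{0,3})");
                    patt.AppendFormat("|({0})", s);
                }
                else
                {
                    // Other words (e.g. Chinese): allow up to 3 non-Chinese characters between characters.
                    s = reg_b.Replace(word, @"(?:[^\u4e00-\u9fa5]{0,3})");
                    patt.AppendFormat("|({0})", s);
                }
            }
            if (patt.Length == 0)
            {
                // An empty pattern would match at every position; (?!) never matches.
                return "(?!)";
            }
            patt.Remove(0, 1); // drop the leading '|'
            return patt.ToString();
        }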

        /// <summary>
        /// Returns all blocked words.
        /// </summary>
        public static string[] GetBlockWords()
        {
            return new string[]{"国民党","fuck","110"}; // in production this should be loaded from the database
        }
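The combined regex above is built on first use and then kept for the lifetime of the process, without any locking. If the word list really does come from a database, it may need to be rebuilt when the list changes. The following is a minimal sketch of one way to do that with `Lazy<Regex>`. `LoadWords`, `Invalidate`, and the class name are hypothetical, and the gap handling from `GetPattern` is left out to keep the example short.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CachedFilterSketch
{
    // Lazy<T> with a factory defaults to thread-safe, run-once initialization.
    // volatile so that a swap made by Invalidate is visible to other threads.
    private static volatile Lazy<Regex> cached = new Lazy<Regex>(Build);

    // Hypothetical stand-in for the database lookup mentioned in the post.
    private static string[] LoadWords()
    {
        return new[] { "foo", "bar" };
    }

    private static Regex Build()
    {
        string[] escaped = Array.ConvertAll(LoadWords(), w => Regex.Escape(w));
        // (?!) never matches, so an empty word list filters nothing.
        string pattern = escaped.Length == 0 ? "(?!)" : string.Join("|", escaped);
        return new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    }

    public static string Filter(string raw)
    {
        return cached.Value.Replace(raw, "***");
    }

    // Call after the word list changes; the next Filter call rebuilds the regex.
    public static void Invalidate()
    {
        cached = new Lazy<Regex>(Build);
    }

    public static void Main()
    {
        Console.WriteLine(Filter("Foo and bar"));  // *** and ***
        Invalidate();
        Console.WriteLine(Filter("Foo and bar"));  // *** and ***
    }
}
```

Compiling a large pattern with `RegexOptions.Compiled` is relatively expensive, so rebuilding only on an explicit `Invalidate` keeps that cost off the per-article path.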

This code replaces the following:

国民党

国-民-党

国o民o党

fuck

f.u.c.k

110 (obfuscated variants of 110 are not replaced)
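The cases listed above can be checked with a compact, self-contained sketch of the same gap technique. It follows the approach in `GetPattern` (insert an optional gap of up to three filler characters at every `\B` position), but `GapPatternSketch`, `WithGaps`, and `Filter` are names made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GapPatternSketch
{
    // Same idea as reg_b in the post: \B matches between two adjacent word characters.
    private static readonly Regex InnerPositions = new Regex(@"\B");

    // One alternation for the three sample words; "110" is matched literally.
    private static readonly Regex Combined = new Regex(
        string.Join("|", new[]
        {
            WithGaps("国民党", @"[^\u4e00-\u9fa5]"),
            WithGaps("fuck", "[^a-zA-Z]"),
            "110"
        }),
        RegexOptions.IgnoreCase);

    // Inserts an optional gap of up to 3 filler characters between the characters of the word.
    public static string WithGaps(string word, string fillerClass)
    {
        return InnerPositions.Replace(word, "(?:" + fillerClass + "{0,3})");
    }

    public static string Filter(string raw)
    {
        return Combined.Replace(raw, "***");
    }

    public static void Main()
    {
        Console.WriteLine(Filter("国-民-党"));   // ***
        Console.WriteLine(Filter("国o民o党"));   // ***
        Console.WriteLine(Filter("f.u.c.k"));    // ***
        Console.WriteLine(Filter("call 110"));   // call ***
        Console.WriteLine(Filter("1-1-0"));      // 1-1-0 (numeric words get no gaps)
    }
}
```

The gap is deliberately capped at three characters. A larger limit would catch more obfuscated spellings, but it would also raise the chance of matching unrelated text that happens to contain the same characters in order.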


Author: cc19861106

Posted by cc19861106 under General; last updated May 11, 2012.