找出一个文件中出现次数最多的10个单词

现在的位置: 首页 > 综合 > 正文

RSS

找出一个文件中出现次数最多的10个单词

2013年10月14日 ⁄ 综合 ⁄ 共 5965字 ⁄ 字号小中大 ⁄ 评论关闭

记得当初面试的时候，一哥们给我出的上机题。

咋一下感觉很简单，可做着觉得有了些问题，反正当时做的挺糟糕的。

这几天又时间，又把这个问题总结了下。

总体来说，它涉及到两点主要的算法问题：查找（相同单词，统计出现次数），排序（按次数排序）

这个题目有几点很容易忽略：

1)单靠出现次数能唯一确定前10个单词么？如果有个数重复的怎么办？随机抽取？那么程序的输出岂不是也随机？但算法的一个很重要的概念就是对于特定的输入，应该有固定的输出。否则谁敢用你的算法。

2)既然一个纬度不足够唯一确定结果，则需要再添加一个纬度。比如，当出现次数一致的情况下，按字母顺序排序。这样就可以了。

3)所以，一定要排序。我当时的想法时，首先统计出所有出现的单词以及他们出现的次数。（这个可以有很多种方法，但用hashtable是最方便的应该）。然后呢？选定前10个作为基准值，按顺序排好，从第11个开始，逐个同10个当中按顺序比较，如果遇见比它小的，则从当前位置把元素往后移动（最后一个被挤出去），然后把它加在这个空出的位置上。虽然实现了，但是结果是不确定的。取决与他们出现在hashtable中的次序。而且，反复移动，效率也不高

4)现在觉得，算法一定得排序，然后才能得到唯一确定的输出结果。

下面是用C#实现的算法，基本需求的是根据输入文件，把统计后的单词按出现次数从小到大写到输入文件中

    class Program
    {
        static void Main(string[] args)
        {
            QuickSortUtil sortUitl = new QuickSortUtil();
            sortUitl.CountWord(@"../data/input.txt", @"../data/output.txt");
            Console.WriteLine("Done!");
            Console.Read();
        }
    }

class QuickSortUtil
    {
        public void CountWord(string inputFilePath, string outputFilePath)
        {
            Hashtable allWords = getAllWords(inputFilePath);
            List<WordInfo> allWordInfos = new List<WordInfo>();
            foreach (string key in allWords.Keys)
            {
                allWordInfos.Add(new WordInfo(key, (int)allWords[key]));
            }
            qucikSort(allWordInfos, 0, allWordInfos.Count - 1);
            writeToFile(allWordInfos, outputFilePath);
        }

        private Hashtable getAllWords(string filePath)
        {
            //since if the hashtable need to increase it's capacity frequently,
            //it will affect the efficence,so give it a large enough basic capacity
            Hashtable allWords = new Hashtable(10240);
            using (StreamReader sr = new StreamReader(filePath, Encoding.Default))
            {
                string line = null;
                char[] seperators = new char[] { ' ', ',', ';', '.','!','"' };
                string[] words = null;
                while ((line = sr.ReadLine()) != null)
                {
                    //you can decide whether to add the ToLower() function
                    line = line.ToLower();
                    //do use the StringSplitOptions.RemoveEmptyEntries when use Split
                    words = line.Split(seperators, StringSplitOptions.RemoveEmptyEntries);
                    if (words != null && words.Length > 0)
                    {
                        for (int i = 0; i < words.Length; i++)
                        {
                            if (allWords.ContainsKey(words[i]))
                            {
                                allWords[words[i]] = (int)allWords[words[i]] + 1;
                            }
                            else
                            {
                                allWords.Add(words[i], 1);
                            }
                        }
                    }
                }
            }

return allWords;
}

        private void qucikSort(List<WordInfo> allWordInfos, int low, int high)
        {
            if (low >= high)
            {
                return;
            }

            //use the local varaiable for index.
            //do not change the original low and high because we will use them later
            int pLow = low;
            int pHigh = high;
            //record the stand value
            WordInfo value = allWordInfos[low];

            while (pLow < pHigh)
            {
                //note: if two value equals, do not move.
                //we only swap the value if it's bigger than the stand value
                while ((WordInfo.Compare(allWordInfos[pHigh],value) >=0) && pHigh > pLow)
                {
                    pHigh--;
                }
                //so here, if pHigh == pLow, the "<" check will be false
                if (WordInfo.Compare(allWordInfos[pHigh], value) < 0)
                {
                    allWordInfos[pLow] = allWordInfos[pHigh];
                    allWordInfos[pHigh] = value;
                    pLow++;
                }

                while ((WordInfo.Compare(allWordInfos[pLow],value) <= 0) && pHigh > pLow)
                {
                    pLow++;
                }
                if (WordInfo.Compare(allWordInfos[pLow], value) > 0)
                {
                    allWordInfos[pHigh] = allWordInfos[pLow];
                    allWordInfos[pLow] = value;
                    pHigh--;
                }
            }

//pLow and pHigh is expected to be equal here
System.Diagnostics.Trace.Assert(pLow == pHigh);

            qucikSort(allWordInfos, low, pLow - 1);
            qucikSort(allWordInfos, pLow + 1, high);
        }

        private void writeToFile(List<WordInfo> allWordInfos, string outputFilePath)
        {
            using (StreamWriter sw = new StreamWriter(outputFilePath, false, Encoding.Default))
            {
                foreach (WordInfo wi in allWordInfos)
                {
                    sw.WriteLine("{0}:{1}", wi.Word, wi.Count);
                }
            }
        }
    }

    public class WordInfo
    {
        private string _word = string.Empty;

        public string Word
        {
            get { return _word; }
            set { _word = value; }
        }
        private int _count = 0;

        public int Count
        {
            get { return _count; }
            set { _count = value; }
        }

        public WordInfo()
        {
        }

        public WordInfo(string word, int count)
        {
            this._word = word;
            this._count = count;
        }

        public WordInfo(WordInfo info)
        {
            this._word = info.Word;
            this._count = info.Count;
        }

public static int Compare(WordInfo info1, WordInfo info2)
{

            if (info1 == null || info2 == null)
            {
                if (info1 == null)
                {
                    if (info2 == null)
                    {
                        return 0;
                    }
                    else
                    {
                        return -1;
                    }
                }

return 1;
}

            if (info1.Count > info2.Count)
            {
                return 1;
            }

            if (info1.Count < info2.Count)
            {
                return -1;
            }

            return string.Compare(info1.Word, info2.Word);
        }
    }

算法中几点注意的：

1)使用hashtable时给它设置个相对合理的初始容量。否则hashtable扩容时需要重新对以前的元素计算位置，很影响效率

2)排序的时候用快速排序

3)Split字符串的时候，自动的去掉空内容StringSplitOptions.RemoveEmptyEntries

4)是否考虑字符串的大小写。即大小写不一样的单词是否视为同一单词

【上篇】字符数组输出的问题
【下篇】dynamic做反射提升性能

作者: hammam

该日志由 hammam 于11年前发表在综合分类下，最后更新于 2013年10月14日.
转载请注明: 找出一个文件中出现次数最多的10个单词 | 学步园 +复制链接

抱歉!评论已关闭.

学步园

找出一个文件中出现次数最多的10个单词

作者: hammam

书签

最新文章New

本站推荐

返回首页