《编程之美》学习笔记——计算字符串的距离

现在的位置: 首页 > 综合 > 正文

《编程之美》学习笔记——计算字符串的距离

2018年03月30日 ⁄ 综合 ⁄ 共 9543字 ⁄ 字号小中大 ⁄ 评论关闭

一、题目

摘自书中:

许多程序会大量使用字符串。对于不同的字符串，我们希望能够有办法判断其相似程度。我们定义了一套操作方法来把两个不相同的字符串变得相同，具体的操作方法为：
1.修改一个字符（如把“a”替换为“b”）。
2.增加一个字符（如把“abdd”变为“aebdd”）。
3.删除一个字符（如把“travelling”变为“traveling”）。
比如，对于“abcdefg”和“abcdef”两个字符串来说，我们认为可以通过增加/减少一个“g“的方式来达到目的。上面的两种方案，都仅需要一次操作。把这个操作所需要的次数定义为两个字符串的距离，给定任意两个字符串，你是否能写出一个算法来计算出它们的距离？

问题分析:

修改、增加、删除是进行字符串处理的三种的基本操作。我想之所以设定这三种基本操作，所考虑的一个因素就是我们可以通过这三种最简单的基本操作，把任何一个字符串变成我们所设想的目标字符串。

衡量两个字符串的相似度，我们需要找到一种计算两个字符串距离值的表示方法，并且这个距离值与两个字符串的相似度成反比例关系。题目中利用了使两个字符串相同的最少基本操作次数作为两个字符串的距离d，定义字符相似度为s = 1 / (d + 1)。这恰好符合我们衡量两个字符串相似度的标准。

二、递归算法分析和实现

算法思考:

首先自己通过设计一些例子尝试了下三种数据操作对计算两个字符相似度距离的影响：

设字符串A = “abcdef” 字符串B = “abcef”:

1.如果先采用删除操作，那么只需把字符串A中的’d‘字符删去即可，字符串相似度s = 1 / 2；

2.如果先采用增加操作，若在字符串B中的c‘字符后面增加‘d'字符，字符串相似度s = 1 / 2；若在字符串A中的字符'c'后面增加‘e'，会导致计算出来的字符串距离值增加；

3.如果先采用修改操作，若把字符串A中的’d‘字符修改为’e‘，那么字符相似度s小于1 / 2；若把字符串B中的’e‘字符修改为’d‘，那么字符相似度s也是小于1 / 2。

通过这个例子，可以发现了字符串相似度比较中以下一些特点：1.我们只需要从两个字符串中出现首个不相同字符的位置开始进行比较，而忽略前面相同的字符。2.对于增加、删除、修改三种基本操作，不同的处理顺序和处理对象会导致不同的相似度结果，当两个字符串相对复杂时，得到的这些相似度结果应该会更多。根据我们的定义，我们取这些相似度结果中的最大值作为两个字符串的相似度，对应的是我们保留三种操作中计算出来的字符距离的最小值结果作为最终结果。3.（来自书中）通过这些基本操作之后得到的字符串内容是什么并不在我们的考虑范围之内，我们可以把操作合并为：

    1.一步操作之后，再将A[2,...,lenA]和B[1,...,lenB]变成相字符串。
    2.一步操作之后，再将A[2,...,lenA]和B[2,...,lenB]变成相字符串。
    3.一步操作之后，再将A[1,...,lenA]和B[2,...,lenB]变成相字符串。

基于以上这些考虑，可以利用递归算法（1.需要计算不同字符串基本操作顺序下(即多种情况)的相似度结果 2.只保留相似度最大的结果）解决这个问题。

下面为递归算法实现程序：

版本一：递归法

/**
 * @file count_string_similarity_v1.c
 * @brief count the similarity between two strings by recursion.
 * @author chenxilinsidney
 * @version 1.0
 * @date 2014-12-31
 */

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

// min macro, warning: can not use with '++' and '--' operator
#ifndef MIN
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#endif

typedef int TYPE;

/**
 * @brief calculate string distance by recursion.
 *
 * @param[in]      str_a    string a
 * @param[in]      a_begin  begin index of a(included)
 * @param[in]      a_end    end index of a(included)
 * @param[in]      str_b    string b
 * @param[in]      b_begin  begin index of b(included)
 * @param[in]      b_end    end index of b(included)
 *
 * @return distance between string a and b
 */
TYPE calculate_string_distance(char* str_a, TYPE a_begin, TYPE a_end,
        char* str_b, TYPE b_begin, TYPE b_end)
{
    /// check if index of string a or b exceeds their length.
    if(a_begin > a_end) {
        if(b_begin > b_end)
            return 0;
        else
            return b_end - b_begin + 1;
    }
    if(b_begin > b_end) {
        if(a_begin > a_end)
            return 0;
        else
            return a_end - a_begin + 1;
    }
    /// check if the begin value by index of a and b are equal
    if(str_a[a_begin] == str_b[b_begin]) {
        return calculate_string_distance(str_a, a_begin + 1, a_end,
                str_b, b_begin + 1, b_end);
    } else {
        /// get the minimum distance for sub-string by three methods.
        TYPE d_1 = calculate_string_distance(str_a, a_begin + 1, a_end,
                str_b, b_begin, b_end);
        TYPE d_2 = calculate_string_distance(str_a, a_begin, a_end,
                str_b, b_begin + 1, b_end);
        TYPE d_3 = calculate_string_distance(str_a, a_begin + 1, a_end,
                str_b, b_begin + 1, b_end);
        return MIN(MIN(d_1, d_2), d_3) + 1;
    }
}

TYPE calculate_string_distance_by_recursion(char* str_a, TYPE a_length,
        char* str_b, TYPE b_length)
{
    return calculate_string_distance(str_a, 0, a_length - 1,
            str_b, 0, b_length - 1);
}

int main(void)
{
    char* string_a = "d";
    char* string_b = "s";
    TYPE distance = calculate_string_distance_by_recursion(string_a,
            strlen(string_a),
            string_b,
            strlen(string_b));
    printf("string %s and string %s\ndistance: %d similarity: %f\n",
            string_a, string_b, distance, 1 / ((double)distance + 1));
    return EXIT_SUCCESS;
}

分析递归算法问题：递归过程中有些数据或迭代过程重复出现导致重新计算，能否通过改善算法来减少重复计算部分从而提高算法效率。

三、动态规划算法分析和实现

采用动态规划算法主要是为了解决分治算法中重复子问题的情况。这个题目满足了动态规划要求的重叠子问题和最优子结构两个性质，因此可以用动态规划的方法去解决问题。关于动态规划的学习可以查看《算法导论》中动态规划学习笔记博客，这里不再讲解。

(文章链接：
blog.csdn.net/chensilly8888/article/details/42396603)

版本二：动态规划——带备忘的自顶向下法(Top-down with memoization)

这里把把子问题的解存放到一个数组里面，存放不同的索引值下的两个字符串的最小距离。关于初始化这个内存空间要注意的地方是：对于两个字符串索引值超过他们长度时返回的最小距离是可以事先计算的，我们事先把它们放入数组从而减少递归里的计算量。另外通过输出递归信息可以方便调试和分析递归过程。

下面实现中实际存放子问题解的数组是一维数组，并传递一个数组宽度辅助。（可以修改为真正的二维数组：TYPE** cache = (TYPE**)malloc(a * sizeof(TYPE*)) ... ; *cache = (TYPE*)malloc(b*sizeof(TYPE)）

/**
 * @file count_string_similarity_by_memoized.c
 * @brief count the similarity between two strings by dynamic programming
 * with top-down with memoization method.
 * @author chenxilinsidney
 * @version 1.0
 * @date 2014-12-31
 */

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
// #define NDEBUG
#include <assert.h>

#include "memory.h"
// #define NDBG_PRINT
#include "debug_print.h"

// min macro, warning: can not use with '++' and '--' operator
#ifndef MIN
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#endif

typedef int TYPE;

/**
 * @brief calculate string distance by dynamic programming.
 *
 * @param[in]      str_a    string a
 * @param[in]      a_begin  begin index of a(included)
 * @param[in]      a_end    end index of a(included)
 * @param[in]      str_b    string b
 * @param[in]      b_begin  begin index of b(included)
 * @param[in]      b_end    end index of b(included)
 * @param[in]      cache    save sub-problem result by memoization.
 * @param[in]      cache_width width of cache for two dimensional array.
 *
 * @return distance between string a and b
 */
TYPE calculate_string_distance(char* str_a, TYPE a_begin, TYPE a_end,
        char* str_b, TYPE b_begin, TYPE b_end, TYPE* cache, TYPE cache_width)
{
    DEBUG_PRINT_STRING("In Recursion\n");
    DEBUG_PRINT_VALUE("%d", a_begin);
    DEBUG_PRINT_VALUE("%d", b_begin);
    /// try to get distance from cache
    TYPE value_temp;
    if((value_temp = cache[a_begin * cache_width + b_begin]) > INT_MIN) {
        DEBUG_PRINT_STRING("get distance from cache!\n");
        DEBUG_PRINT_VALUE("%d", a_begin);
        DEBUG_PRINT_VALUE("%d", b_begin);
        DEBUG_PRINT_STRING("Out of Recursion\n");
        return value_temp;
    }
    /// check if the begin value by index of a and b are equal
    if(str_a[a_begin] == str_b[b_begin]) {
        DEBUG_PRINT_STRING("have same word\n");
        DEBUG_PRINT_STRING("set distance to cache!\n");
        DEBUG_PRINT_VALUE("%d", a_begin);
        DEBUG_PRINT_VALUE("%d", b_begin);
        DEBUG_PRINT_STRING("Out of Recursion\n");
        return cache[a_begin * cache_width + b_begin] =
            calculate_string_distance(str_a, a_begin + 1, a_end,
                str_b, b_begin + 1, b_end, cache, cache_width);
    } else {
        /// get the minimum distance for sub-string by three methods.
        TYPE d_1 = calculate_string_distance(str_a, a_begin + 1, a_end,
                str_b, b_begin, b_end, cache, cache_width);
        TYPE d_2 = calculate_string_distance(str_a, a_begin, a_end,
                str_b, b_begin + 1, b_end, cache, cache_width);
        TYPE d_3 = calculate_string_distance(str_a, a_begin + 1, a_end,
                str_b, b_begin + 1, b_end, cache, cache_width);
        /// set distance to cache
        DEBUG_PRINT_STRING("set distance to cache!\n");
        DEBUG_PRINT_VALUE("%d", a_begin);
        DEBUG_PRINT_VALUE("%d", b_begin);
        DEBUG_PRINT_STRING("Out of Recursion\n");
        return cache[a_begin * cache_width + b_begin] =
            MIN(MIN(d_1, d_2), d_3) + 1;
    }
}

TYPE calculate_string_distance_by_memoized(char* str_a, TYPE a_length,
        char* str_b, TYPE b_length)
{
    assert(a_length >= 1 && b_length >= 1);
    /// initialize cache first
    TYPE memoized_length = (a_length + 1) * (b_length + 1);
    TYPE* memoized_cache = SMALLOC(memoized_length, TYPE); 
    TYPE i;
    for(i = 0; i < memoized_length; i++)
        memoized_cache[i] = INT_MIN;
    /// set edge elements first to reduce calculation in program.
    for(i = 0; i < b_length + 1; i++)
        memoized_cache[a_length * (b_length + 1) + i] = b_length - i;
    for(i = 0; i < a_length + 1; i++)
        memoized_cache[i * (b_length + 1) + b_length] = a_length - i;
    /// get distance by memoized method
    TYPE value = calculate_string_distance(str_a, 0, a_length - 1,
            str_b, 0, b_length - 1, memoized_cache, b_length + 1);
    /// free memory
    SFREE(&memoized_cache);
    return value;
}

int main(void)
{
    char* string_a = "ddsag";
    char* string_b = "sdsg";
    TYPE distance = calculate_string_distance_by_memoized(string_a,
            strlen(string_a),
            string_b,
            strlen(string_b));
    printf("string %s and string %s\ndistance: %d similarity: %f\n",
            string_a, string_b, distance, 1 / ((double)distance + 1));
    return EXIT_SUCCESS;
}

拓展：我们可以建立一个二维数组的表格，通过表格中逐步填入信息，可以快速分析并得到结果。

版本三：动态规划——自底向上法(bottom-up method)

自底向上的动态规划方法避免了自顶向下法的递归调用消耗。考虑到简化算法，这里存放的数组和版本二的数组不一致的地方是他们是属于镜像关系，因为版本二中索引值i,j含义是原字符串a,b分别从i,j开始索引到结尾的子字符串，而版本三中索引值i,j含义是原字符串a,b分别从结尾向前保留i,j个值字符的子字符串。当然也可以修改算法使他们一致，这样数组表格代表的含义就一致了。

/**
 * @file count_string_similarity_by_bottom_up.c
 * @brief count the similarity between two strings by dynamic programming
 * with bottom-up method.
 * @author chenxilinsidney
 * @version 1.0
 * @date 2014-12-31
 */

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
// #define NDEBUG
#include <assert.h>

#include "memory.h"
// #define NDBG_PRINT
#include "debug_print.h"

// min macro, warning: can not use with '++' and '--' operator
#ifndef MIN
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#endif

typedef int TYPE;

TYPE calculate_string_distance_by_bottom_up(char* str_a, TYPE a_length,
        char* str_b, TYPE b_length)
{
    assert(a_length >= 1 && b_length >= 1);
    /// initialize cache first
    TYPE* memoized_cache = SMALLOC((a_length + 1) * (b_length + 1), TYPE); 
    TYPE* cache_offset;
    TYPE i, j;
    /// set edge elements first to reduce calculation in program.
    for(i = 0; i < b_length + 1; i++)
        memoized_cache[i] = i;
    for(i = 1; i < a_length + 1; i++)
        memoized_cache[i * (b_length + 1)] = i;
    /// get distance by bottom-up method
    /// i, j to index string from end to begin from the string now
    for (i = 0; i < a_length; i++) {
        for (j = 0; j < b_length; j++) {
            /// save offet pointer
            cache_offset = memoized_cache + (b_length + 1) * i + j;
            if (str_a[a_length - i - 1] == str_b[b_length - j - 1]) {
                /// use old distance if same word
                cache_offset[b_length + 1 + 1] =
                    cache_offset[0];
            } else {
                /// use minimum distance if not same word
                cache_offset[b_length + 1 + 1] = MIN(
                        MIN(cache_offset[b_length + 1],
                            cache_offset[1]),
                        cache_offset[0]) + 1;
            }
        }
    }
    /// get final distance
    TYPE distance = memoized_cache[a_length * (b_length + 1) + b_length];
    /// free memory
    SFREE(&memoized_cache);
    return distance;
}

int main(void)
{
    char* string_a = "ddsag";
    char* string_b = "sdsg";
    TYPE distance = calculate_string_distance_by_bottom_up(string_a,
            strlen(string_a),
            string_b,
            strlen(string_b));
    printf("string %s and string %s\ndistance: %d similarity: %f\n",
            string_a, string_b, distance, 1 / ((double)distance + 1));
    return EXIT_SUCCESS;
}

拓展：我把代码中str_a[a_length - i - 1] == str_b[b_length - j - 1]语句改为str_a[i] == str_b[j]语句后算法的解一致，但这时二维数组表格（由算法一维数组获得）中的内容含义不太一致，实际上一个是从字符串尾部开始计算字符串距离，一个是从字符串头部开始计算字符串距离。

四、总结

1.递归算法中若含有重叠子问题现象，又满足最优子结构性质，可利用动态规划算法取得更好的效果。

2.动态规划方法中自底向上的方法实现比带备忘的自顶向下的方法简便的多，不会产生递归开销，但需注意存放子问题解的内存的初始化。自底向上的另一个好处是子问题的解的内存可以根据实际问题变化，使内存的开销更加小。

备注：

1.转载请注明出处：http://blog.csdn.net/chensilly8888/

2.全文源码均开源（在UBUNTU + GCC4.8.2下编译并测试通过），可下载或查看：https://github.com/chenxilinsidney/funnycprogram/tree/master/beauty_of_programming/count_string_similarity

3.有写错或写漏之处请指正，谢谢！