Building a Compiler: The Lexer



May 27, 2019

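The lexer code has been uploaded to my personal resources page.

When a source file enters the compiler, the first stage it meets is the lexer.

The lexer's job is to scan the source file, pick out the lexemes it contains, and hand them to the parser as an ordered sequence.

Following on from the previous post, here are the lexemes, that is, the terminal symbols: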

if else while ( ) { } cpreop bitop logiop armtcop number literal id NUL new [ ] basetype class private public static return break continue . this

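cpreop: comparison operators, > < >= <= == !=

bitop: bitwise operators, << >> & | ^

logiop: logical operators, && ||

armtcop: arithmetic operators, + - * /

number: numeric constants, for example the integer 12345 or the decimal 1.2345

id: identifiers, following Java's rules

literal: string constants such as "ROgerwong"

NUL: the empty string

basetype: the three basic types int, char, and double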
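To make the number, id, and literal categories concrete, here is a minimal sketch of how a candidate string could be classified. The class `LexemeShapes` and its method names are hypothetical and not part of the lexer in this post; the number pattern assumes plain digits with an optional fraction, and the identifier check does not exclude keywords, which would have to be matched first.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original lexer.
final class LexemeShapes {
    // Assumed shape of a numeric constant: digits with an optional fraction, e.g. 12345 or 1.2345.
    private static final Pattern NUMBER = Pattern.compile("\\d+(\\.\\d+)?");

    static boolean isNumber(String s) {
        return NUMBER.matcher(s).matches();
    }

    // Identifier by Java's rules: a valid start character followed by valid part characters.
    static boolean isIdentifier(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // String constant: wrapped in double quotes, with no other quote in between.
    static boolean isLiteral(String s) {
        return s.length() >= 2
                && s.charAt(0) == '"'
                && s.indexOf('"', 1) == s.length() - 1;
    }

    public static void main(String[] args) {
        System.out.println(isNumber("1.2345"));
        System.out.println(isIdentifier("testclass"));
        System.out.println(isLiteral("\"ROgerwong\""));
    }
}
```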

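To keep things simple, I will not go into the theory of nondeterministic and deterministic finite automata or the algorithms for converting between them. Instead I use the most naive method: keep reading characters into a buffer, compare the buffer against these lexemes, and append each lexeme that is found to an ArrayList.

With that approach in mind, define a few data structures:

First the lexeme data structure. It has two fields, one for the type and one for the concrete value; the possible values of the type are listed in the class as well.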

```java
package ravaComplier.lexer;

public class Lexeme {
    public int type;      // one of the type constants below
    public Object value;  // the concrete text of the lexeme

    public Lexeme(int t, Object v) {
        type = t;
        value = v;
    }

    @Override
    public String toString() {
        // String concatenation copes with a null value without throwing.
        return "<" + type + ":" + value + ">";
    }

    public static final int IF = 0;          // if
    public static final int ELSE = 1;        // else
    public static final int WHILE = 2;       // while
    public static final int BRACKET = 3;     // any kind of bracket
    public static final int CPREOP = 4;      // comparison operator
    public static final int BITOP = 5;       // bitwise operator
    public static final int LOGIOP = 6;      // logical operator
    public static final int ARMTOP = 7;      // arithmetic operator
    public static final int NUMBER = 8;      // numeric constant
    public static final int LITERAL = 9;     // string constant
    public static final int ID = 10;         // identifier
    public static final int NUL = 11;        // empty
    public static final int NEW = 12;        // the new operator
    public static final int BASETYPE = 13;   // basic data type
    public static final int CLASS = 14;      // keyword class
    public static final int ACCESSFLAG = 15; // public or private
    public static final int STATIC = 16;     // keyword static
    public static final int RETURN = 17;     // keyword return
    public static final int BREAK = 18;      // break
    public static final int CONTINUE = 19;   // continue
    public static final int DOT = 20;        // .
    public static final int THIS = 21;       // keyword this
    public static final int SEMI = 22;       // semicolon
    public static final int EQUAL = 23;      // assignment sign =
}
```

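Next, since this is the naive brute-force approach, we need to set up some rules:

Define the delimiters: space, tab, newline, +, -, *, /, ., ;, the various brackets, the operators, and so on.

When a delimiter is encountered, whatever is in the buffer before it is one lexeme and the delimiter itself is another (except for space, tab, and newline, which are discarded).

Watch out for special cases, though: for pairs such as > and >=, or & and &&, we have to look one character ahead to decide which lexeme it is.

Each lexeme split off this way is then instantiated as a Lexeme and added to the result.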
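The lookahead rule is easiest to see in isolation. The sketch below is a simplified, index-based variant rather than the buffer-based lexer of this post: the class `OperatorLookahead` and its `split` method are hypothetical, operands are assumed to be plain runs of letters and digits, and every other non-blank character is treated as an operator of one or two characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: decides between > and >=, & and &&, and so on
// by peeking at the character that follows an operator character.
final class OperatorLookahead {
    static List<String> split(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            if (Character.isLetterOrDigit(c)) {
                // An operand: take the whole run of letters and digits.
                int j = i;
                while (j < src.length() && Character.isLetterOrDigit(src.charAt(j))) {
                    j++;
                }
                out.add(src.substring(i, j));
                i = j;
                continue;
            }
            // Look one character ahead; '\0' stands for "no more input".
            char next = i + 1 < src.length() ? src.charAt(i + 1) : '\0';
            boolean twoChar =
                    (next == '=' && "<>=!".indexOf(c) >= 0)    // <= >= == !=
                    || (next == c && "&|<>".indexOf(c) >= 0);  // && || << >>
            int len = twoChar ? 2 : 1;
            out.add(src.substring(i, i + len));
            i += len;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("a>=b"));
        System.out.println(split("x&&y&z"));
    }
}
```

The point of the sketch is only the `twoChar` decision: the operator character alone is never enough, and the single extra character settles which lexeme is meant.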

The code is simple, but tedious to write:

```java
package ravaComplier.lexer;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Lexer {
    private static final List<String> TWO_CHAR_OPS =
            Arrays.asList(">=", "<=", "==", "!=", "&&", "||", "<<", ">>");

    private static ArrayList<Lexeme> result; // the lexemes to return
    private static BufferedReader br;
    private static StringBuilder buffer;     // characters read but not yet emitted

    public static ArrayList<Lexeme> getLexerOutput(InputStream is) {
        result = new ArrayList<Lexeme>();
        br = new BufferedReader(new InputStreamReader(is));
        buffer = new StringBuilder();
        while (Read()) {
            addLexeme();
        }
        // Treat end of input as a trailing space so the last lexeme is not lost.
        buffer.append(' ');
        addLexeme();
        return result;
    }

    // Try to split lexemes off the buffer and add them to the result.
    private static void addLexeme() {
        String str = buffer.toString();
        int n = str.length();
        // A string literal runs up to its closing quote, delimiters included.
        if (str.charAt(0) == '"') {
            if (n >= 2 && str.charAt(n - 1) == '"') {
                emit(str);
                buffer.setLength(0);
            }
            return;
        }
        char last = str.charAt(n - 1);
        if (n >= 2) {
            String two = str.substring(n - 2);
            char prev = str.charAt(n - 2);
            // Two-character delimiters.
            if (TWO_CHAR_OPS.contains(two)) {
                emit(str.substring(0, n - 2));
                emit(two);
                buffer.setLength(0);
                return;
            }
            // One-character operator: the lookahead character is not part of it,
            // so keep that character in the buffer and examine it on its own.
            if ("=><&|".indexOf(prev) >= 0) {
                emit(str.substring(0, n - 2));
                emit(String.valueOf(prev));
                buffer.setLength(0);
                buffer.append(last);
                addLexeme();
                return;
            }
        }
        // Single-character delimiters. A dot that follows digits is not a
        // delimiter: it starts the fraction of a number such as 1.5.
        boolean fraction = last == '.' && str.substring(0, n - 1).matches("\\d+");
        if (!fraction && (Character.isWhitespace(last) || ";{}()[]+-*/^.".indexOf(last) >= 0)) {
            emit(str.substring(0, n - 1));
            emit(String.valueOf(last));
            buffer.setLength(0);
        }
    }

    // Classify a piece of text and add it to the result unless it is blank.
    private static void emit(String text) {
        Lexeme lex = getLexeme(text);
        if (lex != null) {
            result.add(lex);
        }
    }

    // Build the lexeme for a string; returns null for empty or whitespace-only text.
    private static Lexeme getLexeme(String lex) {
        if (lex == null || lex.trim().isEmpty()) {
            return null;
        }
        switch (lex) {
            case "if":       return new Lexeme(Lexeme.IF, lex);
            case "else":     return new Lexeme(Lexeme.ELSE, lex);
            case "while":    return new Lexeme(Lexeme.WHILE, lex);
            case "{": case "}": case "[": case "]": case "(": case ")":
                return new Lexeme(Lexeme.BRACKET, lex);
            case ">": case "<": case "==": case ">=": case "<=": case "!=":
                return new Lexeme(Lexeme.CPREOP, lex);
            case "&": case "|": case "^": case "<<": case ">>":
                return new Lexeme(Lexeme.BITOP, lex);
            case "&&": case "||":
                return new Lexeme(Lexeme.LOGIOP, lex);
            case "+": case "-": case "*": case "/":
                return new Lexeme(Lexeme.ARMTOP, lex);
            case "new":      return new Lexeme(Lexeme.NEW, lex);
            case "int": case "char": case "double":
                return new Lexeme(Lexeme.BASETYPE, lex);
            case "class":    return new Lexeme(Lexeme.CLASS, lex);
            case "private": case "public":
                return new Lexeme(Lexeme.ACCESSFLAG, lex);
            case "static":   return new Lexeme(Lexeme.STATIC, lex);
            case "return":   return new Lexeme(Lexeme.RETURN, lex);
            case "break":    return new Lexeme(Lexeme.BREAK, lex);
            case "continue": return new Lexeme(Lexeme.CONTINUE, lex);
            case ".":        return new Lexeme(Lexeme.DOT, lex);
            case "this":     return new Lexeme(Lexeme.THIS, lex);
            case ";":        return new Lexeme(Lexeme.SEMI, lex);
            case "=":        return new Lexeme(Lexeme.EQUAL, lex);
            default:
                if (isNumber(lex)) {
                    return new Lexeme(Lexeme.NUMBER, lex);
                }
                if (isStr(lex)) {
                    return new Lexeme(Lexeme.LITERAL, lex);
                }
                return new Lexeme(Lexeme.ID, lex);
        }
    }

    // A string constant is wrapped in double quotes and has no quote inside.
    private static boolean isStr(String lex) {
        if (lex.length() < 2 || lex.charAt(0) != '"' || lex.charAt(lex.length() - 1) != '"') {
            return false;
        }
        return lex.indexOf('"', 1) == lex.length() - 1;
    }

    // An integer or a decimal. A pattern is used instead of Double.valueOf,
    // which would also accept text such as NaN, Infinity, or 1d.
    private static boolean isNumber(String str) {
        return str.matches("\\d+(\\.\\d*)?");
    }

    // Read one character from the stream into the buffer;
    // returns false at end of input or on an I/O error.
    private static boolean Read() {
        try {
            int d = br.read();
            if (d == -1) {
                return false;
            }
            buffer.append((char) d);
            return true;
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
    }
}
```

Now write a small program and see whether it is parsed correctly:

```java
class testclass{
   private static int j=0;
   public int i=1;
   public testclass()
  {
     double c=1;
     char[] d="123456";
  }
  private static double func1()
  {
     if(j==0)
     {
     return 1.5
     }
     else
     {
       while(i<=10)
       {
         i=i+1;
       }
       return i;
     }
   }
}
```

And here is the output:

```
<14:class>
<10:testclass>
<3:{>
<15:private>
<16:static>
<13:int>
<10:j>
<23:=>
<8:0>
<22:;>
<15:public>
<13:int>
<10:i>
<23:=>
<8:1>
<22:;>
<15:public>
<10:testclass>
<3:(>
<3:)>
<3:{>
<13:double>
<10:c>
<23:=>
<8:1>
<22:;>
<13:char>
<3:[>
<3:]>
<10:d>
<23:=>
<9:"123456">
<22:;>
<3:}>
<15:private>
<16:static>
<13:double>
<10:func1>
<3:(>
<3:)>
<3:{>
<0:if>
<3:(>
<10:j>
<4:==>
<8:0>
<3:)>
<3:{>
<17:return>
<8:1.5>
<3:}>
<1:else>
<3:{>
<2:while>
<3:(>
<10:i>
<4:<=>
<8:10>
<3:)>
<3:{>
<10:i>
<23:=>
<10:i>
<7:+>
<8:1>
<22:;>
<3:}>
<17:return>
<10:i>
<22:;>
<3:}>
<3:}>
<3:}>
```

The output looks correct.
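Checking output like this by eye means remembering that 14 is class and 23 is the assignment sign. As a small convenience, a name table could be used when printing. The class `LexemeNames` and its `describe` method below are hypothetical additions; the names simply follow the constants 0 to 23 declared in Lexeme, in order.

```java
// Hypothetical helper: readable names for the type codes 0..23 declared in Lexeme.
final class LexemeNames {
    private static final String[] NAMES = {
        "IF", "ELSE", "WHILE", "BRACKET", "CPREOP", "BITOP", "LOGIOP", "ARMTOP",
        "NUMBER", "LITERAL", "ID", "NUL", "NEW", "BASETYPE", "CLASS", "ACCESSFLAG",
        "STATIC", "RETURN", "BREAK", "CONTINUE", "DOT", "THIS", "SEMI", "EQUAL"
    };

    // Formats a lexeme as <NAME:value>; unknown codes are shown as UNKNOWN.
    static String describe(int type, Object value) {
        String name = (type >= 0 && type < NAMES.length) ? NAMES[type] : "UNKNOWN";
        return "<" + name + ":" + value + ">";
    }

    public static void main(String[] args) {
        System.out.println(describe(14, "class"));
        System.out.println(describe(23, "="));
    }
}
```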


Author: JacquelynV05
