How do you tackle something as complex as understanding how to create an interpreter or compiler? In the beginning it all looks pretty much like a tangled mess of yarn that you need to untangle to get that perfect ball.

如何处理像理解如何创建解释器或编译器这样复杂的事情?一开始,它看起来就像一团纠结的纱线,你需要解开它才能得到那个完美的球。

The way to get there is to just untangle it one thread, one knot at a time. Sometimes, though, you might feel like you don’t understand something right away, but you have to keep going. It will eventually “click” if you’re persistent enough, I promise you (Gee, if I put aside 25 cents every time I didn’t understand something right away I would have become rich a long time ago 😃.

到达那里的方法只是解开它的一根线,一次一个结。但有时候,你可能会觉得自己不能马上理解某些事情,但是你必须继续前进。如果你足够坚持,它最终会豁然开朗(click有豁然开朗,突然明白的意思~),我向你保证(哎呀,如果我每次不明白一件事情就存25美分,我早就发财了:)。

Probably one of the best pieces of advice I could give you on your way to understanding how to create an interpreter and compiler is to read the explanations in the articles, read the code, and then write code yourself, and even write the same code several times over a period of time to make the material and code feel natural to you, and only then move on to learn new topics. Do not rush, just slow down and take your time to understand the basic ideas deeply. This approach, while seemingly slow, will pay off down the road. Trust me.

也许在你理解如何创建解释器和编译器的过程中,我能给你的最好的建议之一就是阅读文章中的解释,阅读代码,然后自己编写代码,甚至在一段时间内多次编写相同的代码,使材料和代码对你来说感觉自然,然后再继续学习新的主题。不要匆忙,只是放慢速度,花时间深入理解基本思想。这种方法虽然看似缓慢,但最终会得到回报。相信我。

You will eventually get your perfect ball of yarn in the end. And, you know what? Even if it is not that perfect it is still better than the alternative, which is to do nothing and not learn the topic or quickly skim over it and forget it in a couple of days.

最终你会得到一个完美的线团。你知道吗?即使它不是那么完美,也比另一种选择要好,那就是什么都不做,不学习主题,或者快速浏览一下,几天后就忘记了。

Remember - just keep untangling: one thread, one knot at a time and practice what you’ve learned by writing code, a lot of it:

记住——只要保持理顺: 一条线,一次一个结,通过编写代码来练习你所学到的东西,很多:

Today you’re going to use all the knowledge you’ve gained from previous articles in the series and learn how to parse and interpret arithmetic expressions that have any number of addition, subtraction, multiplication, and division operators. You will write an interpreter that will be able to evaluate expressions like “14 + 2 * 3 - 6 / 2”.

今天,你将使用从本系列前几篇文章中获得的所有知识,并学习如何解析和解释任意数量的加、减、乘和除运算符的算术表达式。你将编写一个能够计算像“14 + 2 * 3-6/2”这样的表达式的解释器。

Before diving in and writing some code let’s talk about the associativity and precedence of operators.

在深入编写一些代码之前,让我们先讨论一下操作符的结合性和优先级。

By convention 7 + 3 + 1 is the same as (7 + 3) + 1 and 7 - 3 - 1 is equivalent to (7 - 3) - 1. No surprises here. We all learned that at some point and have been taking it for granted since then. If we treated 7 - 3 - 1 as 7 - (3 - 1) the result would be unexpected 5 instead of the expected 3.

根据约定,7 + 3 + 1等于(7 + 3) + 1,7-3-1等于(7-3)-1。这里没有惊喜。我们都在某个时刻学到了这一点,并且从那时起就一直认为这是理所当然的。如果我们把7-3-1当作7-(3-1) ,结果将是意外的5,而不是预期的3。

In ordinary arithmetic and most programming languages addition, subtraction, multiplication, and division are left-associative:

在普通算术和大多数编程语言中,加、减、乘、除都是左结合的:

7 + 3 + 1 is equivalent to (7 + 3) + 1
7 - 3 - 1 is equivalent to (7 - 3) - 1
8 * 4 * 2 is equivalent to (8 * 4) * 2
8 / 4 / 2 is equivalent to (8 / 4) / 2

What does it mean for an operator to be left-associative?

一个操作符是左结合的,这意味着什么?

When an operand like 3 in the expression 7 + 3 + 1 has plus signs on both sides, we need a convention to decide which operator applies to 3. Is it the one to the left or the one to the right of the operand 3? The operator + associates to the left because an operand that has plus signs on both sides belongs to the operator to its left and so we say that the operator + is left-associative. That’s why 7 + 3 + 1 is equivalent to (7 + 3) + 1 by the associativity convention.

当表达式7 + 3 + 1中类似于3的操作数两边都有加号时,我们需要一个约定来决定哪个运算符应用于3。是操作数3左边的那个还是右边的那个?运算符 + 关联到左边,因为两边都有加号的操作数属于左边的运算符,所以我们说运算符 + 是左关联的。这就是为什么7 + 3 + 1等价于(7 + 3) + 1的结合性约定。

Okay, what about an expression like 7 + 5 * 2 where we have different kinds of operators on both sides of the operand 5? Is the expression equivalent to 7 + (5 * 2) or (7 + 5) * 2? How do we resolve this ambiguity?

好的,对于像7 + 5 * 2这样的表达式,操作数5的两边都有不同类型的运算符怎么办?表达式是否等价于7 + (5 * 2)或(7 + 5) * 2?我们如何解决这个模棱两可的问题?

In this case, the associativity convention is of no help to us because it applies only to operators of one kind, either additive (+, -) or multiplicative (*, /). We need another convention to resolve the ambiguity when we have different kinds of operators in the same expression. We need a convention that defines relative precedence of operators.

在这种情况下,结合性约定对我们没有帮助,因为它只适用于一种操作符,加法(+ ,-)或乘法(* ,/)。当同一表达式中有不同类型的运算符时,需要另一个约定来解决这种歧义。我们需要一个约定来定义运算符的相对优先级。

And here it is: we say that if the operator * takes its operands before + does, then it has higher precedence. In the arithmetic that we know and use, multiplication and division have higher precedence than addition and subtraction. As a result the expression 7 + 5 * 2 is equivalent to 7 + (5 * 2) and the expression 7 - 8 / 4 is equivalent to 7 - (8 / 4).

这就是: 我们说如果运算符 * 在 + 之前取得了它的操作数,那么它的优先级就更高。在我们所熟悉和使用的算术中,乘除法的优先级高于加减法。结果表达式7 + 5 * 2等价于7 + (5 * 2) ,表达式7-8/4等价于7-(8/4)。

In a case where we have an expression with operators that have the same precedence, we just use the associativity convention and execute the operators from left to right:

如果我们有一个具有相同优先级的操作符的表达式,我们只需要使用结合性约定并从左到右执行操作符:

7 + 3 - 1 is equivalent to (7 + 3) - 1
8 / 4 * 2 is equivalent to (8 / 4) * 2

I hope you didn’t think I wanted to bore you to death by talking so much about the associativity and precedence of operators. The nice thing about those conventions is that we can construct a grammar for arithmetic expressions from a table that shows the associativity and precedence of arithmetic operators. Then, we can translate the grammar into code by following the guidelines I outlined in Part 4, and our interpreter will be able to handle the precedence of operators in addition to associativity.

我希望你们不会认为我想通过过多地谈论操作符的结合律和优先级来让你们感到厌烦。这些约定的好处是,我们可以从一个表中为算术表达式构造一种grammar,它显示了算术运算符的结合性和优先级。然后,我们可以按照第4部分中概述的准则将grammar转换为代码,我们的解释器将能够处理除了结合性之外的操作符的优先级。

Okay, here is our precedence table:

好的,这是我们的优先级表:

From the table, you can tell that operators + and - have the same precedence level and they are both left-associative. You can also see that operators * and / are also left-associative, have the same precedence among themselves but have higher-precedence than addition and subtraction operators.

从表中可以看出,运算符 + 和-具有相同的优先级,它们都是左关联的。还可以看到运算符 * 和/也是左关联的,它们之间的优先级相同,但是比加减运算符的优先级高。

Here are the rules for how to construct a grammar from the precedence table:

以下是如何从优先表构造grammar的规则:

  1. For each level of precedence define a non-terminal. The body of a production for the non-terminal should contain arithmetic operators from that level and non-terminals for the next higher level of precedence.
    • 对于每个优先级,定义一个非终结符。非终结符的产生式应该包含该级别的算术运算符和下一个更高优先级的非终结符
  2. Create an additional non-terminal factor for basic units of expression, in our case, integers. The general rule is that if you have N levels of precedence, you will need N + 1 non-terminals in total: one non-terminal for each level plus one non-terminal for basic units of expression.
    • 为基本的表达式单位(在我们的例子中是整数)创建一个额外的非终结factor。一般的规则是,如果你有N个优先级,你将总共需要N + 1个非终结符:每个层次一个非终结符加上一个基本表达式单元的非终结符。

Onward!

前进!

Let’s follow the rules and construct our grammar.

让我们按照规则来构建我们的grammar。

According to Rule 1 we will define two non-terminals: a non-terminal called expr for level 2 and a non-terminal called term for level 1. And by following Rule 2 we will define a factor non-terminal for basic units of arithmetic expressions, integers.

根据规则1,我们将定义两个非终结符: 一个用于第2级的称为 expr 的非终结符,一个用于第1级的称为term的非终结符。通过遵循规则2,我们将为算术表达式的基本单位,即整数,定义一个factor非终结符。

The start symbol of our new grammar will be expr and the expr production will contain a body representing the use of operators from level 2, which in our case are operators + and - , and will contain term non-terminals for the next higher level of precedence, level 1:

新grammar的起始符号是 expr,expr 产生式将包含一个body,表示来自第2级的运算符的使用,在我们的例子中是运算符 + 和-,并且将包含下一个更高级别优先级1的非终结符:

The term production will have a body representing the use of operators from level 1, which are operators * and / , and it will contain the non-terminal factor for the basic units of expression, integers:

term产生式将有一个body,代表第1级操作符的使用,这些操作符是操作符 * 和/,它将包含表达式基本单位的非终结factor,即整数:

And the production for the non-terminal factor will be:

非终结factor的产生式是:

You’ve already seen above productions as part of grammars and syntax diagrams from previous articles, but here we combine them into one grammar that takes care of the associativity and the precedence of operators:

你已经在以前文章的grammar和syntax图表中看到过上面的结果,但是在这里我们把它们组合成一种grammar,这种grammar考虑了结合性和操作符的优先级:

Here is a syntax diagram that corresponds to the grammar above:

下面是一个与上面的grammar相对应的syntax图:

Each rectangular box in the diagram is a “method call” to another diagram. If you take the expression 7 + 5 * 2 and start with the top diagram expr and walk your way down to the bottommost diagram factor, you should be able to see that higher-precedence operators * and / in the lower diagram execute before operators + and - in the higher diagram.

图中的每个矩形框都是对另一个图的“方法调用”。如果使用表达式7 + 5 * 2并从顶部图 expr 开始,然后一直向下到最底部的图factor,那么应该能够看到在较高的图中,较高优先级的运算符 * 和/在较低的图中,在运算符 + 和-之前执行。

To drive the precedence of operators point home, let’s take a look at the decomposition of the same arithmetic expression 7 + 5 * 2 done in accordance with our grammar and syntax diagrams above. This is just another way to show that higher-precedence operators execute before operators with lower precedence:

为了使运算符的优先级更明确,让我们看一下根据上面的grammar和syntax图对同一个算术表达式7 + 5 * 2进行的分解。这只是显示高优先级运算符先于低优先级运算符执行的另一种方式:

Okay, let’s convert the grammar to code following guidelines from Part 4 and see how our new interpreter works, shall we?

好的,让我们按照第4部分中的指导方针将grammar转换为代码,看看我们的新解释器是如何工作的,好吗?

Here is the grammar again:

这里再次提到grammar:

And here is the complete code of a calculator that can handle valid arithmetic expressions containing integers and any number of addition, subtraction, multiplication, and division operators.

下面是一个计算器的完整代码,它可以处理包含整数和任意数量的加、减、乘、除运算符的有效算术表达式。

The following are the main changes compared with the code from Part 4:

以下是与第4部分代码相比的主要变化:

  • The Lexer class can now tokenize +, -, *, and / (Nothing new here, we just combined code from previous articles into one class that supports all those tokens)
    • Lexer类现在可以标记+、-、*和/(这里没有什么新内容,我们只是将前几篇文章中的代码组合成一个支持所有这些标记的类)
  • Recall that each rule (production), R, defined in the grammar, becomes a method with the same name, and references to that rule become a method call: R(). As a result the Interpreter class now has three methods that correspond to non-terminals in the grammar: expr, term, and factor.
    • 回想一下,grammar中定义的每个规则(产生式)R变成了同名的方法,而对该规则的引用变成了方法调用R()。因此,Interpreter类现在有三个方法对应语法中的非终结符:expr、term和factor。

Source code:

源代码:

# Token types
#
# EOF (end-of-file) token is used to indicate that
# there is no more input left for lexical analysis
INTEGER, PLUS, MINUS, MUL, DIV, EOF = (
    'INTEGER', 'PLUS', 'MINUS', 'MUL', 'DIV', 'EOF'
)


class Token(object):
    def __init__(self, type, value):
        # token type: INTEGER, PLUS, MINUS, MUL, DIV, or EOF
        self.type = type
        # token value: non-negative integer value, '+', '-', '*', '/', or None
        self.value = value

    def __str__(self):
        """String representation of the class instance.

        Examples:
            Token(INTEGER, 3)
            Token(PLUS, '+')
            Token(MUL, '*')
        """
        return 'Token({type}, {value})'.format(
            type=self.type,
            value=repr(self.value)
        )

    def __repr__(self):
        return self.__str__()


class Lexer(object):
    def __init__(self, text):
        # client string input, e.g. "3 * 5", "12 / 3 * 4", etc
        self.text = text
        # self.pos is an index into self.text
        self.pos = 0
        self.current_char = self.text[self.pos]

    def error(self):
        raise Exception('Invalid character')

    def advance(self):
        """Advance the `pos` pointer and set the `current_char` variable."""
        self.pos += 1
        if self.pos > len(self.text) - 1:
            self.current_char = None  # Indicates end of input
        else:
            self.current_char = self.text[self.pos]

    def skip_whitespace(self):
        while self.current_char is not None and self.current_char.isspace():
            self.advance()

    def integer(self):
        """Return a (multidigit) integer consumed from the input."""
        result = ''
        while self.current_char is not None and self.current_char.isdigit():
            result += self.current_char
            self.advance()
        return int(result)

    def get_next_token(self):
        """Lexical analyzer (also known as scanner or tokenizer)

        This method is responsible for breaking a sentence
        apart into tokens. One token at a time.
        """
        while self.current_char is not None:

            if self.current_char.isspace():
                self.skip_whitespace()
                continue

            if self.current_char.isdigit():
                return Token(INTEGER, self.integer())

            if self.current_char == '+':
                self.advance()
                return Token(PLUS, '+')

            if self.current_char == '-':
                self.advance()
                return Token(MINUS, '-')

            if self.current_char == '*':
                self.advance()
                return Token(MUL, '*')

            if self.current_char == '/':
                self.advance()
                return Token(DIV, '/')

            self.error()

        return Token(EOF, None)


class Interpreter(object):
    def __init__(self, lexer):
        self.lexer = lexer
        # set current token to the first token taken from the input
        self.current_token = self.lexer.get_next_token()

    def error(self):
        raise Exception('Invalid syntax')

    def eat(self, token_type):
        # compare the current token type with the passed token
        # type and if they match then "eat" the current token
        # and assign the next token to the self.current_token,
        # otherwise raise an exception.
        if self.current_token.type == token_type:
            self.current_token = self.lexer.get_next_token()
        else:
            self.error()

    def factor(self):
        """factor : INTEGER"""
        token = self.current_token
        self.eat(INTEGER)
        return token.value

    def term(self):
        """term : factor ((MUL | DIV) factor)*"""
        result = self.factor()

        while self.current_token.type in (MUL, DIV):
            token = self.current_token
            if token.type == MUL:
                self.eat(MUL)
                result = result * self.factor()
            elif token.type == DIV:
                self.eat(DIV)
                result = result // self.factor()

        return result

    def expr(self):
        """Arithmetic expression parser / interpreter.

        calc>  14 + 2 * 3 - 6 / 2
        17

        expr   : term ((PLUS | MINUS) term)*
        term   : factor ((MUL | DIV) factor)*
        factor : INTEGER
        """
        result = self.term()

        while self.current_token.type in (PLUS, MINUS):
            token = self.current_token
            if token.type == PLUS:
                self.eat(PLUS)
                result = result + self.term()
            elif token.type == MINUS:
                self.eat(MINUS)
                result = result - self.term()

        return result


def main():
    while True:
        try:
            text = input('calc> ')
        except EOFError:
            break
        if not text:
            continue
        lexer = Lexer(text)
        interpreter = Interpreter(lexer)
        result = interpreter.expr()
        print(result)


if __name__ == '__main__':
    main()

Save the above code into the calc5.py file or download it directly from GitHub. As usual, try it out and see for yourself that the interpreter properly evaluates arithmetic expressions that have operators with different precedence.

将上面的代码保存到 calc5.py 文件中,或者直接从 GitHub 下载。像往常一样,试一试,亲自看看解释器是否正确地计算优先级不同的运算符的算术表达式。

Here is a sample session on my laptop:

下面是我笔记本电脑上的一个例子:

$ python calc5.py
calc> 3
3
calc> 2 + 7 * 4
30
calc> 7 - 8 / 4
5
calc> 14 + 2 * 3 - 6 / 2
17

Here are new exercises for today:

以下是今天的新练习:

  • Write an interpreter as described in this article off the top of your head, without peeking into the code from the article. Write some tests for your interpreter, and make sure they pass.
    • 根据本文的描述写一个解释器,不要偷看文章中的代码。为你的解释器编写一些测试,并确保它们通过。
  • Extend the interpreter to handle arithmetic expressions containing parentheses so that your interpreter could evaluate deeply nested arithmetic expressions like: 7 + 3 * (10 / (12 / (3 + 1) - 1))
    • 扩展解释器以处理包含括号的算术表达式,这样你的解释器就可以计算深嵌套的算术表达式,比如: 7 + 3 * (10/(12/(3 + 1)-1))

Check your understanding.

检查你的理解。

  1. What does it mean for an operator to be _left-associative ? _

当两边都有同一个操作符的操作数属于左边的操作符时,说该操作符是左关联的

  1. Are operators + and - _left-associative _ or _right-associative _? What about * and / ?

+和-都是左关联,*和/也是左关联

  1. Does operator + have _higher precedence _than operator * ?

*有更高的优先级

Hey, you read all the way to the end! That’s really great. I’ll be back next time with a new article - stay tuned, be brilliant, and, as usual, don’t forget to do the exercises.

嘿,你从头读到尾!真是太好了。下次我会带着一篇新文章回来——继续关注,保持精彩,像往常一样,不要忘记做练习。

Here is a list of books I recommend that will help you in your study of interpreters and compilers:

以下是我推荐的一些书籍,它们可以帮助你学习解释器和编译器:

  1. Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages (Pragmatic Programmers)
    语言实现模式: 创建您自己的特定领域和通用编程语言(实用程序员)
  2. Writing Compilers and Interpreters: A Software Engineering Approach
    编写编译器和解释器: 一种软件工程方法
  3. Modern Compiler Implementation in Java
    现代编译器在 Java 中的实现
  4. Modern Compiler Design
    现代编译器设计
  5. Compilers: Principles, Techniques, and Tools (2nd Edition)
    编译器: 原理、技术和工具(第二版)

If you want to get my newest articles in your inbox, then enter your email address below and click "Get Updates!"

All articles in this series: