In their amazing book “The 5 Elements of Effective Thinking” the authors Burger and Starbird share a story about how they observed Tony Plog, an internationally acclaimed trumpet virtuoso, conduct a master class for accomplished trumpet players. The students first played complex music phrases, which they played perfectly well. But then they were asked to play very basic, simple notes. When they played the notes, the notes sounded childish compared to the previously played complex phrases. After they finished playing, the master teacher also played the same notes, but when he played them, they did not sound childish. The difference was stunning. Tony explained that mastering the performance of simple notes allows one to play complex pieces with greater control. The lesson was clear - to build true virtuosity one must focus on mastering simple, basic ideas.1

在著作《有效思考的五个要素》中,作者Burger 和 Starbird分享了一个关于他们如何观察Tony Plog的故事,Tony Plog是一个国际知名的小号演奏家,他在一个大师班指导成功的小号演奏家们。学生们首先演奏复杂的乐句,他们演奏得非常好。但之后他们被要求演奏非常基本、简单的音符时,与之前演奏的复杂乐句相比,这些音符听起来更像孩子气。演奏结束后,大师也演奏了同样的音符,但当他演奏时,这些音符听起来并不幼稚。这种差异令人震惊。Tony解释说,掌握简单音符的演奏能让人更好地控制复杂的乐章。这个教训很明显——要建立真正的精湛技艺,你必须专注于掌握简单、基本的思想。

The lesson in the story clearly applies not only to music but also to software development. The story is a good reminder to all of us to not lose sight of the importance of deep work on simple, basic ideas even if it sometimes feels like a step back. While it is important to be proficient with a tool or framework you use, it is also extremely important to know the principles behind them. As Ralph Waldo Emerson said:

这个故事中的教训显然不仅适用于音乐,也适用于软件开发。这个故事很好地提醒了我们所有人,不要忽视在简单基本的想法上进行深入工作的重要性,即使有时感觉像是后退了一步。虽然熟练使用您所使用的工具或框架很重要,但了解它们背后的原则也极为重要。正如拉尔夫·沃尔多·爱默生所说:

“If you learn only methods, you’ll be tied to your methods. But if you learn principles, you can devise your own methods.”

“如果你只学习方法,你就会受制于你的方法。但如果你学会了原理,你就可以设计出自己的方法。”

On that note, let’s dive into interpreters and compilers again.

关于这一点,让我们再次深入到解释器和编译器中。

Today I will show you a new version of the calculator from Part 1 that will be able to:

今天我将向您展示来自第1部分的新版计算器,它将能够:

  1. Handle whitespace characters anywhere in the input string 处理输入字符串中任意位置的空格字符
  2. Consume multi-digit integers from the input 在输入中使用多位整数
  3. Subtract two integers (currently it can only add integers) 减去整数(目前只能加上整数)

Here is the source code for your new version of the calculator that can do all of the above:

下面是你的新版计算器的源代码,它可以做到以上所有的事情:

# Token types
# EOF (end-of-file) token is used to indicate that
# there is no more input left for lexical analysis
INTEGER, PLUS, MINUS, EOF = 'INTEGER', 'PLUS', 'MINUS', 'EOF'


class Token(object):
    def __init__(self, type, value):
        # token type: INTEGER, PLUS, MINUS, or EOF
        self.type = type
        # token value: non-negative integer value, '+', '-', or None
        self.value = value

    def __str__(self):
        """String representation of the class instance.

        Examples:
            Token(INTEGER, 3)
            Token(PLUS '+')
        """
        return 'Token({type}, {value})'.format(
            type=self.type,
            value=repr(self.value)
        )

    def __repr__(self):
        return self.__str__()


class Interpreter(object):
    def __init__(self, text):
        # client string input, e.g. "3 + 5", "12 - 5", etc
        self.text = text
        # self.pos is an index into self.text
        self.pos = 0
        # current token instance
        self.current_token = None
        self.current_char = self.text[self.pos]

    def error(self):
        raise Exception('Error parsing input')

    def advance(self):
        """Advance the 'pos' pointer and set the 'current_char' variable."""
        self.pos += 1
        if self.pos > len(self.text) - 1:
            self.current_char = None  # Indicates end of input
        else:
            self.current_char = self.text[self.pos]

    def skip_whitespace(self):
        while self.current_char is not None and self.current_char.isspace():
            self.advance()

    def integer(self):
        """Return a (multidigit) integer consumed from the input."""
        result = ''
        while self.current_char is not None and self.current_char.isdigit():
            result += self.current_char
            self.advance()
        return int(result)

    def get_next_token(self):
        """Lexical analyzer (also known as scanner or tokenizer)

        This method is responsible for breaking a sentence
        apart into tokens.
        """
        while self.current_char is not None:

            if self.current_char.isspace():
                self.skip_whitespace()
                continue

            if self.current_char.isdigit():
                return Token(INTEGER, self.integer())

            if self.current_char == '+':
                self.advance()
                return Token(PLUS, '+')

            if self.current_char == '-':
                self.advance()
                return Token(MINUS, '-')

            self.error()

        return Token(EOF, None)

    def eat(self, token_type):
        # compare the current token type with the passed token
        # type and if they match then "eat" the current token
        # and assign the next token to the self.current_token,
        # otherwise raise an exception.
        if self.current_token.type == token_type:
            self.current_token = self.get_next_token()
        else:
            self.error()

    def expr(self):
        """Parser / Interpreter

        expr -> INTEGER PLUS INTEGER
        expr -> INTEGER MINUS INTEGER
        """
        # set current token to the first token taken from the input
        self.current_token = self.get_next_token()

        # we expect the current token to be an integer
        left = self.current_token
        self.eat(INTEGER)

        # we expect the current token to be either a '+' or '-'
        op = self.current_token
        if op.type == PLUS:
            self.eat(PLUS)
        else:
            self.eat(MINUS)

        # we expect the current token to be an integer
        right = self.current_token
        self.eat(INTEGER)
        # after the above call the self.current_token is set to
        # EOF token

        # at this point either the INTEGER PLUS INTEGER or
        # the INTEGER MINUS INTEGER sequence of tokens
        # has been successfully found and the method can just
        # return the result of adding or subtracting two integers,
        # thus effectively interpreting client input
        if op.type == PLUS:
            result = left.value + right.value
        else:
            result = left.value - right.value
        return result


def main():
    while True:
        try:
            text = input('calc> ')
        except EOFError:
            break
        if not text:
            continue
        interpreter = Interpreter(text)
        result = interpreter.expr()
        print(result)


if __name__ == '__main__':
    main()

Save the above code into the calc2.py file or download it directly from GitHub. Try it out. See for yourself that it works as expected: it can handle whitespace characters anywhere in the input; it can accept multi-digit integers, and it can also subtract two integers as well as add two integers.

将上面的代码保存到 calc2.py 文件中,或者直接从 GitHub 下载。尝试一下。你可以亲自查看它是否正常工作::它可以处理输入中的任何地方的空格字符;它可以接受多位整数,还可以做两个整数的减法以及加法。

Here is a sample session that I ran on my laptop:

下面是我在笔记本电脑上运行的一个示例会话:

$ python calc2.py
calc> 27 + 3
30
calc> 27 - 7
20
calc>

The major code changes compared with the version from Part 1 are:

与第1部分的版本相比,代码的主要变化如下:

  1. The get_next_token method was refactored a bit. The logic to increment the pos pointer was factored into a separate method advance.
    • 对get_next_token方法进行了一点重构。pos指针增量的逻辑被分解为一个单独的方法。
  2. Two more methods were added: skip_whitespace to ignore whitespace characters and integer to handle multi-digit integers in the input.
    • 另外添加了两个方法:忽略空白字符的skip_whitespace和处理输入中的多数字整数的integer。
  3. The expr method was modified to recognize INTEGER -> MINUS -> INTEGER phrase in addition to INTEGER -> PLUS -> INTEGER phrase. The method now also interprets both addition and subtraction after having successfully recognized the corresponding phrase.
    • 改进了expr方法,除INTEGER -> PLUS -> INTEGER短语外,还可以识别INTEGER -> MINUS -> INTEGER短语。该方法在识别出相应的短语后,还能同时解释加减运算。

In Part 1 you learned two important concepts, namely that of a token and a lexical analyzer. Today I would like to talk a little bit about lexemes, parsing, and parsers.

在第1部分中,你学习了两个重要的概念,即token和词法分析器。今天,我想稍微谈谈词位、解析和解析器。

You already know about tokens. But in order for me to round out the discussion of tokens I need to mention lexemes. What is a lexeme? A lexeme is a sequence of characters that form a token. In the following picture you can see some examples of tokens and sample lexemes and hopefully it will make the relationship between them clear:

你已经知道token的事了。但是为了使关于token的讨论更加圆满,我需要提到词位(lexemes)。什么是词位?词位是构成token的字符序列。在下面的图片中,你可以看到一些token和词位的例子,希望它们之间的关系能够清晰起来:

Now, remember our friend, the expr method? I said before that that’s where the interpretation of an arithmetic expression actually happens. But before you can interpret an expression you first need to recognize what kind of phrase it is, whether it is addition or subtraction, for example. That’s what the expr method essentially does: it finds the structure in the stream of tokens it gets from the get_next_token method and then it interprets the phrase that is has recognized, generating the result of the arithmetic expression.

还记得我们的朋友 expr 方法吗?我之前说过,这就是解释算术表达式实际发生的地方。但是在你解释一个表达式之前,你首先需要知道它是什么类型的短语,例如,是加法还是减法。这就是 expr 方法的本质: 它在从 get _ next _ token 方法获得的token流中找到结构,然后解释已被识别的短语,生成算术表达式的结果。

The process of finding the structure in the stream of tokens, or put differently, the process of recognizing a phrase in the stream of tokens is called parsing. The part of an interpreter or compiler that performs that job is called a parser.

在token流中查找结构的过程,或者换一种说法,在token流中识别一个短语的过程称为解析。解释器或编译器中执行该任务的部分称为解析器

So now you know that the expr method is the part of your interpreter where both parsing and interpreting happens - the expr method first tries to recognize (parse) the INTEGER -> PLUS -> INTEGER or the INTEGER -> MINUS -> INTEGER phrase in the stream of tokens and after it has successfully recognized (parsed) one of those phrases, the method interprets it and returns the result of either addition or subtraction of two integers to the caller.

现在你知道了 expr 方法是解释器的一部分,解析和解释都在其中进行—— expr 方法首先尝试在token流中识别(解析) INTEGER-> PLUS-> INTEGER 或 INTEGER-> MINUS-> INTEGER 短语,并在成功识别(解析)这些短语中的一个之后,该方法对其进行解释,并将两个整数的加法或减法结果返回给调用者。

And now it’s time for exercises again.

现在又到了练习的时间了。

  1. Extend the calculator to handle multiplication of two integers 扩展计算器以处理两个整数的乘法
  2. Extend the calculator to handle division of two integers 扩展计算器以处理两个整数的除法

前两个问题的答案:

# Token types
# EOF (end-of-file) token is used to indicate that
# there is no more input left for lexical analysis
INTEGER, PLUS, MINUS, MUL, DIV,  EOF = 'INTEGER', 'PLUS', 'MINUS', 'MUL', 'DIV', 'EOF'


class Token(object):
    def __init__(self, type, value):
        # token type: INTEGER, PLUS, MINUS, or EOF
        self.type = type
        # token value: non-negative integer value, '+', '-', or None
        self.value = value

    def __str__(self):
        """String representation of the class instance.

        Examples:
            Token(INTEGER, 3)
            Token(PLUS '+')
        """
        return 'Token({type}, {value})'.format(
            type=self.type,
            value=repr(self.value)
        )

    def __repr__(self):
        return self.__str__()


class Interpreter(object):
    def __init__(self, text):
        # client string input, e.g. "3 + 5", "12 - 5", etc
        self.text = text
        # self.pos is an index into self.text
        self.pos = 0
        # current token instance
        self.current_token = None
        self.current_char = self.text[self.pos]

    def error(self):
        raise Exception('Error parsing input')

    def advance(self):
        """Advance the 'pos' pointer and set the 'current_char' variable.
        pos指针前进一步
        """
        self.pos += 1
        if self.pos > len(self.text) - 1:
            self.current_char = None  # Indicates end of input
        else:
            self.current_char = self.text[self.pos]

    def skip_whitespace(self):
        """
            跳过空格
        """
        while self.current_char is not None and self.current_char.isspace():
            self.advance()

    def integer(self):
        """Return a (multidigit) integer consumed from the input."""
        result = ''
        while self.current_char is not None and self.current_char.isdigit():
            result += self.current_char
            self.advance()
        return int(result)

    def get_next_token(self):
        """Lexical analyzer (also known as scanner or tokenizer)

        This method is responsible for breaking a sentence
        apart into tokens.
        """
        while self.current_char is not None:

            if self.current_char.isspace():
                self.skip_whitespace()
                continue

            if self.current_char.isdigit():
                return Token(INTEGER, self.integer())

            if self.current_char == '+':
                self.advance()
                return Token(PLUS, '+')

            if self.current_char == '-':
                self.advance()
                return Token(MINUS, '-')

            if self.current_char == '*':
                self.advance()
                return Token(MUL, '*')

            if self.current_char == '/':
                self.advance()
                return Token(DIV, '/')
            self.error()

        return Token(EOF, None)

    def eat(self, token_type):
        # compare the current token type with the passed token
        # type and if they match then "eat" the current token
        # and assign the next token to the self.current_token,
        # otherwise raise an exception.
        if self.current_token.type == token_type:
            self.current_token = self.get_next_token()
        else:
            self.error()

    def expr(self):
        """Parser / Interpreter

        expr -> INTEGER PLUS INTEGER
        expr -> INTEGER MINUS INTEGER
        """
        
        # set current token to the first token taken from the input
        self.current_token = self.get_next_token()

        # we expect the current token to be an integer
        left = self.current_token
        self.eat(INTEGER)

        # we expect the current token to be either a '+' or '-'
        op = self.current_token
        if op.type == PLUS:
            self.eat(PLUS)
        elif op.type == MINUS:
            self.eat(MINUS)
        elif op.type == MUL:
            self.eat(MUL)
        else:
            self.eat(DIV)

        # we expect the current token to be an integer
        right = self.current_token
        self.eat(INTEGER)
        # after the above call the self.current_token is set to
        # EOF token

        # at this point either the INTEGER PLUS INTEGER or
        # the INTEGER MINUS INTEGER sequence of tokens
        # has been successfully found and the method can just
        # return the result of adding or subtracting two integers,
        # thus effectively interpreting client input
        if op.type == PLUS:
            result = left.value + right.value
        elif op.type == MINUS:
            result = left.value - right.value
        elif op.type == MUL:
            result = left.value * right.value
        elif op.type == DIV:
            if right.value == 0:
                result = 'ERROR'
            else:
                result = left.value / right.value
        return result


def main():
    while True:
        try:
            text = input('calc> ')
        except EOFError:
            break
        if not text:
            continue
        interpreter = Interpreter(text)
        result = interpreter.expr()
        print(result)


if __name__ == '__main__':
    main()

  1. Modify the code to interpret expressions containing an arbitrary number of additions and subtractions, for example “9 - 5 + 3 + 11” 修改代码以解释任意数目的加减法,例如“9-5 + 3 + 11”

Check your understanding.

检查你的理解。

  1. What is a lexeme? 什么是词位?

构成token的字符序列

  1. What is the name of the process that finds the structure in the stream of tokens, or put differently, what is the name of the process that recognizes a certain phrase in that stream of tokens? 在token流中查找结构的过程的名称是什么,或者换句话说,在token流中识别特定短语的过程的名称是什么?

解析(parsing)

  1. What is the name of the part of the interpreter (compiler) that does parsing? 解释器(编译器)进行解析(parsing)的那部分的名称是什么?

解析器(parsers)

I hope you liked today’s material. In the next article of the series you will extend your calculator to handle more complex arithmetic expressions. Stay tuned.

我希望你喜欢今天的材料。在本系列的下一篇文章中,您将扩展计算器以处理更复杂的算术表达式。请继续关注。

And here is a list of books I recommend that will help you in your study of interpreters and compilers:

下面是我推荐的一些书籍,它们可以帮助你学习解释器和编译器:

  1. Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages (Pragmatic Programmers)
    语言实现模式: 创建您自己的特定领域和通用编程语言(实用程序员)
  2. Writing Compilers and Interpreters: A Software Engineering Approach
    编写编译器和解释器: 一种软件工程方法
  3. Modern Compiler Implementation in Java
    现代编译器在 Java 中的实现
  4. Modern Compiler Design
    现代编译器设计
  5. Compilers: Principles, Techniques, and Tools (2nd Edition)
    编译器: 原理、技术和工具(第二版)

All articles in this series:

本系列的所有文章: