Structure of Python's marshaled code object (or .pyc file)

Tags: reverse-engineering, python
2021-06-25 08:31:29

Help me decipher Python's marshaled code object. A .pyc file is almost the same thing: Structure of a .pyc file

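I have:

  1. A code object compiled from source code.
  2. The marshaled representation of this code object.
  3. A recursive disassembly of its (the code object's) code part.
  4. All of its field values.

The main point:

I want to find out how different code objects are stored and how they reference each other. That is, how are the links to child code objects stored? The module should have references to all of its functions. A function should have references to all other functions that can be called from it, and so on. Does the virtual machine preserve the code object id when storing it to the .pyc? I don't think so, because no ids are visible in the .pyc file.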

For example, I have this instruction in the disassembled source:

LOAD_CONST        2 (<code object baz at 0x7f380995e5d0, file "foo.py", line 7>)

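So:

  • How will the virtual machine find the baz code object? I can't see all this information (0x7f380995e5d0, file "foo.py", line 7) in the marshaled string. Is the object id 0x7f380995e5d0 stored in the marshaled code, or is it created on every program run?
  • If it is not stored, how are the connections between objects preserved in the marshaled code object (.pyc file)?

I guess I will go for gdb to investigate further, but maybe this approach (deciphering the .pyc file) will do the job as well.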

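Current results:

I used all of this information to create the following file: the first column is the binary representation of the marshaled code object, the second column is the meaning of each byte sequence as far as I have been able to determine it.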

b'
\xe3                    <don't know>
\x00\x00\x00\x00        <foo.py: co_argcount: 0>
\x00\x00\x00\x00        <foo.py: co_kwonlyargcount: 0>
\x00\x00\x00\x00        <foo.py: co_nlocals: 0>
\x03\x00\x00\x00        <foo.py: co_stacksize: 3>               
@\x00\x00\x00           <foo.py: co_flags = '@' = 0x40 = 64>
s.\x00\x00\x00          <foo.py: number of bytes for module instructions = '.' = 46>
d\x00                   <foo.py: co_code:  0 LOAD_CONST        0 (1)
Z\x00                   <foo.py: co_code:  2 STORE_NAME        0 (a)
d\x01                   <foo.py: co_code:  4 LOAD_CONST        1 (2)
Z\x01                   <foo.py: co_code:  6 STORE_NAME        1 (b)
e\x00                   <foo.py: co_code:  8 LOAD_NAME         0 (a)
e\x01                   <foo.py: co_code: 10 LOAD_NAME         1 (b)
\x17\x00                <foo.py: co_code: 12 BINARY_ADD
Z\x02                   <foo.py: co_code: 14 STORE_NAME        2 (c)
d\x02                   <foo.py: co_code: 16 LOAD_CONST        2 (<code object baz at 0x7f380995e5d0, file "foo.py", line 7>)
d\x03                   <foo.py: co_code: 18 LOAD_CONST        3 ('baz')
\x84\x00                <foo.py: co_code: 20 MAKE_FUNCTION     0
Z\x03                   <foo.py: co_code: 22 STORE_NAME        3 (baz)
e\x03                   <foo.py: co_code: 24 LOAD_NAME         3 (baz)
e\x00                   <foo.py: co_code: 26 LOAD_NAME         0 (a)
e\x01                   <foo.py: co_code: 28 LOAD_NAME         1 (b)
\x83\x02                <foo.py: co_code: 30 CALL_FUNCTION     2
Z\x04                   <foo.py: co_code: 32 STORE_NAME        4 (multiplication)
e\x04                   <foo.py: co_code: 34 LOAD_NAME         4 (multiplication)
d\x01                   <foo.py: co_code: 36 LOAD_CONST        1 (2)
\x13\x00                <foo.py: co_code: 38 BINARY_POWER
Z\x05                   <foo.py: co_code: 40 STORE_NAME        5 (square)
d\x04                   <foo.py: co_code: 42 LOAD_CONST        4 (None)
S\x00                   <foo.py: co_code: 44 RETURN_VALUE
)\x05                   <foo.py: co_const: size>
\xe9\x01\x00\x00\x00    <foo.py: co_const[0]: 1>
\xe9\x02\x00\x00\x00    <foo.py: co_const[1]: 2>
c                       <TYPE_CODE>
\x02\x00\x00\x00        <baz: co_argcount: 2>
\x00\x00\x00\x00        <baz: co_kwonlyargcount: 0>
\x02\x00\x00\x00        <baz: co_nlocals: 2>
\x02\x00\x00\x00        <baz: co_stacksize: 2>               
C\x00\x00\x00           <baz: co_flags = 'C' = 0x43 = 67>
s\x08\x00\x00\x00       <baz: co_code: size = 8 bytes>
|\x00                   <baz: co_code: 0 LOAD_FAST                0 (x) 
|\x01                   <baz: co_code: 2 LOAD_FAST                1 (y) 
\x14\x00                <baz: co_code: 4 BINARY_MULTIPLY                
S\x00                   <baz: co_code: 6 RETURN_VALUE                   
)\x01                   <baz: co_const: size>
N                       <baz: co_const[0]: None>
\xa9\x00                <don't know> 
)\x02                   <baz: co_varnames: size>
\xda\x01                <baz: number of characters of next item>
x                       <baz: co_varnames[0]: x>
\xda\x01                <baz: number of characters of next item>
y                       <baz: co_varnames[1]: y>
r\x03\x00\x00\x00       <baz: don't know. But the 'r' = 'TYPE_REF'>
r\x03\x00\x00\x00       <baz: don't know. But the 'r' = 'TYPE_REF'>
\xfa\x06                <baz: next item length>
foo.py                  <baz: co_filename>
\xda\x03                <baz: number of characters of next item>
baz                     <baz: co_name: 'baz'>
\x07\x00\x00\x00        <baz: co_firstlineno: 7>
s\x02\x00\x00\x00       <baz: co_lnotab: size = 2 >
\x00\x01                <baz: co_lnotab>
r\x07\x00\x00\x00       <foo.py: co_const[3]: reference to baz>
N                       <foo.py: co_const[4]: None>
)\x06                   <foo.py: co_names: size> 
\xda\x01                <foo.py: number of characters of next item>
a                       <foo.py: co_names[0]: a>
\xda\x01                <foo.py: number of characters of next item>
b                       <foo.py: co_names[1]: b>
\xda\x01                <foo.py: number of characters of next item>
c                       <foo.py: co_names[2]: c>
r\x07\x00\x00\x00       <foo.py: co_names[3]: reference to baz>
Z\x0e                   <foo.py: number of characters of next item>
multiplication          <foo.py: co_names[4]: multiplication>
Z\x06                   <foo.py: number of characters of next item>
square                  <foo.py: co_names[5]: square>
r\x03\x00\x00\x00       <foo.py: don't know>     
r\x03\x00\x00\x00       <foo.py: don't know>     
r\x03\x00\x00\x00       <foo.py: don't know>     
r\x06\x00\x00\x00       <foo.py: don't know>     
\xda\x08                <foo.py: number of characters of next item>
<module>                <foo.py: co_name>
\x03\x00\x00\x00        <foo.py: co_firstlineno>
s\n\x00\x00\x00         <foo.py: co_lnotab: size = '\n' = 0A>
\x04\x01                <foo.py: co_lnotab>
\x04\x01                <foo.py: co_lnotab>
\x08\x02                <foo.py: co_lnotab>
\x08\x07                <foo.py: co_lnotab>
\n\x01'                 <foo.py: co_lnotab>

Code snippets needed to reproduce:

1) Source code of foo.py

a = 1 
b = 2 
c = a + b 

def baz(x,y):
    return x * y

multiplication = baz(a,b)
square = multiplication ** 2

2) Marshaled representation of foo.py

import marshal

source_py = "foo.py"

with open(source_py) as f_source:
    source_code = f_source.read()

code_obj_compile = compile(source_code, source_py, "exec")

data = marshal.dumps(code_obj_compile)

print(data)
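As a side note on the ".pyc is almost the same" remark: a minimal sketch of how the marshaled bytes relate to a .pyc file, assuming Python 3.7 or newer, where the file starts with a 16-byte header (magic number, flags, then two more 4-byte fields) followed by the marshaled module code object. The temporary file names are made up for illustration.

```python
import importlib.util
import marshal
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical throwaway source file, only for this illustration.
    src = os.path.join(tmp, "foo.py")
    with open(src, "w") as f:
        f.write("a = 1\n")
    pyc_path = py_compile.compile(src, cfile=os.path.join(tmp, "foo.pyc"))
    with open(pyc_path, "rb") as f:
        raw = f.read()

# Assumption: Python 3.7+ layout, i.e. a 16-byte header before the payload.
header, payload = raw[:16], raw[16:]
magic_ok = header[:4] == importlib.util.MAGIC_NUMBER
code_obj = marshal.loads(payload)
print(magic_ok, code_obj.co_name)
```

On older interpreters the header is shorter, so the slice offset would have to be adjusted.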

3) Full (recursive) disassembly of the code object

import dis
import types

dis.dis(code_obj_compile)

for x in code_obj_compile.co_consts:
    if isinstance(x, types.CodeType):
        sub_byte_code = x
        func_name = sub_byte_code.co_name
        print('\nDisassembly of %s:' % func_name)
        dis.dis(sub_byte_code)

4) Field values of all code objects

def print_co_obj_fields(code_obj):
    # Iterate through all attributes of the code object
    # and print those that have the 'co_' prefix
    for name in dir(code_obj):
        if name.startswith('co_'):
            co_field = getattr(code_obj, name)
            print(f'{name:<20} = {co_field}')

print_co_obj_fields(code_obj_compile)
2 Answers

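The answer below refers to Python 2.7.

How will the virtual machine find the baz code object? I can't see all this information (0x7f380995e5d0, file "foo.py", line 7) in the marshaled string. Is the object id 0x7f380995e5d0 stored in the marshaled code, or is it created on every program run?

The baz code object lives inside the co_consts member. Following your example: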

>>> import marshal
>>> import dis
>>> 
>>> source_py = "foo.py"
>>> 
>>> with open(source_py) as f_source:
...     source_code = f_source.read()
...
>>> code_obj_compile = compile(source_code, source_py, "exec")

If you disassemble it, you can find the reference to the newly generated code object baz:

>>> dis.dis(code_obj_compile)
  1           0 LOAD_CONST               0 (7)
              3 STORE_NAME               0 (a)

  2           6 LOAD_CONST               1 (5)
              9 STORE_NAME               1 (b)

  3          12 LOAD_NAME                0 (a)
             15 LOAD_NAME                1 (b)
             18 BINARY_ADD
             19 STORE_NAME               2 (c)

  5          22 LOAD_CONST               2 (<code object baz at 0x7f1dcdb06bb0, file "foo.py", line 5>)
             25 MAKE_FUNCTION            0
... snip...

The baz code object sits inside the co_consts array of the parent code object, as shown below.

>>> code_obj_compile.co_consts[2]
<code object baz at 0x7f1dcdb06bb0, file "foo.py", line 5>

You can disassemble it as well.

>>> dis.dis(code_obj_compile.co_consts[2])
  6           0 LOAD_FAST                0 (x)
              3 LOAD_FAST                1 (y)
              6 BINARY_MULTIPLY
              7 RETURN_VALUE

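The object is created anew each time the program runs, so the address changes accordingly. The address is not part of the marshaled data.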
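A minimal Python 3 sketch of that point (the answer itself targets Python 2.7, so this is an illustration rather than the answer's own code): loading the same marshaled bytes twice yields two distinct baz code objects with identical content, which suggests the address shown by dis is only a property of the live object.

```python
import marshal
import types

source = "def baz(x, y):\n    return x * y\n"
module_code = compile(source, "foo.py", "exec")
data = marshal.dumps(module_code)

def child_code(code):
    # The only code-object constant in this tiny module is baz.
    return next(c for c in code.co_consts if isinstance(c, types.CodeType))

# Unmarshal the same bytes twice.
baz_a = child_code(marshal.loads(data))
baz_b = child_code(marshal.loads(data))

print(baz_a is baz_b)                  # distinct objects, distinct addresses
print(baz_a.co_code == baz_b.co_code)  # same bytecode
```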

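If it is not stored, how are the connections between objects preserved in the marshaled code object (.pyc file)?

That was just explained. If you look closely at the instructions, you will notice that the LOAD_CONST instruction takes an offset (an index) as its argument, the operand.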

  5          22 LOAD_CONST               2 (<code object baz at 0x7f1dcdb06bb0, file "foo.py", line 5>)

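The offset here is 2, which instructs the Python virtual machine to load the item at index 2 of the co_consts array (the third item, since indexing starts at zero) onto the evaluation stack. So the "connection" is preserved through offsets into other metadata members rather than through object addresses.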
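The same relationship can be checked programmatically. This is a small Python 3 sketch (not the answer's Python 2.7 session) that looks for the instruction whose resolved argument is a code object and confirms that its raw operand is just an index into co_consts.

```python
import dis
import types

source = "def baz(x, y):\n    return x * y\n"
module_code = compile(source, "foo.py", "exec")

# Collect (opname, raw operand, resolved constant) for code-object constants.
hits = [
    (instr.opname, instr.arg, instr.argval)
    for instr in dis.get_instructions(module_code)
    if isinstance(instr.argval, types.CodeType)
]

for opname, index, const in hits:
    # The operand is only a position in the parent's co_consts tuple.
    print(opname, index, module_code.co_consts[index] is const)
```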

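The purpose of marshaling code objects is to store programs in files and restore them later. It therefore needs an encoding scheme for everything a Python program consists of: objects, bytecode, names and so on. Otherwise the program could not be restored from the file.

For that reason it uses a number of type identifiers, which can be divided into four groups: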

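  • single TYPE: {type identifier}, 1 byte in total.

     Example: TYPE_NONE = 'N', TYPE_TRUE = 'T'.

  • short TYPE: {type identifier} + 1-byte size, followed by the data

     Example: TYPE_SHORT_ASCII_INTERNED = 'Z'.

  • long TYPE: {type identifier} + 4-byte size, followed by the data

     Example: TYPE_STRING = 's'.

  • object TYPE: {type identifier} + a combination of all the other types, including object TYPE itself. In other words, it has a recursive structure.

     Example: TYPE_CODE = 'c'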
    

All the types can be seen here: cpython/Python/marshal.c
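The first three groups can be observed directly with marshal.dumps. The sketch below assumes a Python 3 interpreter using marshal format version 4 or later (the default since Python 3.4); the helper type_char is a name made up here. It clears the 0x80 bit (FLAG_REF) because the writer may or may not set it, depending on how many references the object has.

```python
import marshal

def type_char(data: bytes) -> str:
    """Return the type identifier with the FLAG_REF bit (0x80) cleared."""
    return chr(data[0] & 0x7F)

none_bytes = marshal.dumps(None)        # single TYPE: one byte, 'N'
true_bytes = marshal.dumps(True)        # single TYPE: one byte, 'T'
text_bytes = marshal.dumps("baz")       # short TYPE: id + 1-byte size + data
blob_bytes = marshal.dumps(b"\x00\x01") # long TYPE: id + 4-byte size + data

print(none_bytes, true_bytes)
print(type_char(text_bytes), text_bytes[1], text_bytes[2:])
print(type_char(blob_bytes), int.from_bytes(blob_bytes[1:5], "little"))
```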

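In addition, a code object has several int fields. They have no type identifier in the marshaled string; each is written as a bare four-byte value.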

    int co_argcount;            /* #arguments, except *args */
    int co_kwonlyargcount;      /* #keyword only arguments */
    int co_nlocals;             /* #local variables */
    int co_stacksize;           /* #entries needed for evaluation stack */
    int co_flags;               /* CO_..., see below */
    int co_firstlineno;         /* first source line number */
    

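The full code object structure is here: cpython/Include/code.h

It is useful to know the order in which a code object is dumped, because then we can compute the offset of each field in the resulting string. For example, the first four bytes after the type byte are co_argcount, the next four are co_kwonlyargcount, and so on.

Order in which a code object is dumped: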

    /* PyCodeObject *co - pointer to the code object
       p                - pointer to the file object that accumulates
                          the marshaled code object before it is
                          written to the file. */
    
    W_TYPE(TYPE_CODE, p);
    w_long(co->co_argcount, p);
    w_long(co->co_kwonlyargcount, p);
    w_long(co->co_nlocals, p);
    w_long(co->co_stacksize, p);
    w_long(co->co_flags, p);
    w_object(co->co_code, p);
    w_object(co->co_consts, p);
    w_object(co->co_names, p);
    w_object(co->co_varnames, p);
    w_object(co->co_freevars, p);
    w_object(co->co_cellvars, p);
    w_object(co->co_filename, p);
    w_object(co->co_name, p);
    w_long(co->co_firstlineno, p);
    w_object(co->co_lnotab, p);
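Applied to the dump in this post, the order above means the five leading int fields can be read at fixed offsets right after the type byte. The following is a minimal sketch that works only on the first 21 bytes copied from the dump (a layout without co_posonlyargcount, so from an interpreter older than Python 3.8); it does not depend on the layout used by the interpreter running it.

```python
import struct

# First 21 bytes of the marshaled foo.py module shown in this post.
header = (
    b"\xe3"                    # type byte: 'c' with the 0x80 FLAG_REF bit set
    + b"\x00\x00\x00\x00" * 3  # co_argcount, co_kwonlyargcount, co_nlocals
    + b"\x03\x00\x00\x00"      # co_stacksize
    + b"@\x00\x00\x00"         # co_flags
)

has_ref_flag = bool(header[0] & 0x80)
type_code = chr(header[0] & 0x7F)

# Five little-endian 32-bit ints, in the order they are written by w_long.
argcount, kwonlyargcount, nlocals, stacksize, flags = struct.unpack(
    "<5i", header[1:21]
)
print(type_code, has_ref_flag, argcount, kwonlyargcount, nlocals, stacksize, flags)
```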

Result: the fully deciphered marshaled string of foo.py:

b'
\xe3                    <foo.py: '\xe3' & ~0x80 (clear FLAG_REF) = 0x63 = 'c' (TYPE_CODE)>
\x00\x00\x00\x00        <foo.py: co_argcount: 0>
\x00\x00\x00\x00        <foo.py: co_kwonlyargcount: 0>
\x00\x00\x00\x00        <foo.py: co_nlocals: 0>
\x03\x00\x00\x00        <foo.py: co_stacksize: 3>               
@\x00\x00\x00           <foo.py: co_flags = '@' = 0x40 = 64>
s.\x00\x00\x00          <foo.py: number of bytes for module instructions = '.' = 46>
d\x00                   <foo.py: co_code:  0 LOAD_CONST        0 (1)
Z\x00                   <foo.py: co_code:  2 STORE_NAME        0 (a)
d\x01                   <foo.py: co_code:  4 LOAD_CONST        1 (2)
Z\x01                   <foo.py: co_code:  6 STORE_NAME        1 (b)
e\x00                   <foo.py: co_code:  8 LOAD_NAME         0 (a)
e\x01                   <foo.py: co_code: 10 LOAD_NAME         1 (b)
\x17\x00                <foo.py: co_code: 12 BINARY_ADD
Z\x02                   <foo.py: co_code: 14 STORE_NAME        2 (c)
d\x02                   <foo.py: co_code: 16 LOAD_CONST        2 (<code object baz at 0x7f380995e5d0, file "foo.py", line 7>)
d\x03                   <foo.py: co_code: 18 LOAD_CONST        3 ('baz')
\x84\x00                <foo.py: co_code: 20 MAKE_FUNCTION     0
Z\x03                   <foo.py: co_code: 22 STORE_NAME        3 (baz)
e\x03                   <foo.py: co_code: 24 LOAD_NAME         3 (baz)
e\x00                   <foo.py: co_code: 26 LOAD_NAME         0 (a)
e\x01                   <foo.py: co_code: 28 LOAD_NAME         1 (b)
\x83\x02                <foo.py: co_code: 30 CALL_FUNCTION     2
Z\x04                   <foo.py: co_code: 32 STORE_NAME        4 (multiplication)
e\x04                   <foo.py: co_code: 34 LOAD_NAME         4 (multiplication)
d\x01                   <foo.py: co_code: 36 LOAD_CONST        1 (2)
\x13\x00                <foo.py: co_code: 38 BINARY_POWER
Z\x05                   <foo.py: co_code: 40 STORE_NAME        5 (square)
d\x04                   <foo.py: co_code: 42 LOAD_CONST        4 (None)
S\x00                   <foo.py: co_code: 44 RETURN_VALUE
)\x05                   <foo.py: co_const: size>
\xe9\x01\x00\x00\x00    <foo.py: co_const[0]: 1; '\xe9' & ~0x80 (clear FLAG_REF) = 'i' (TYPE_INT)>
\xe9\x02\x00\x00\x00    <foo.py: co_const[1]: 2; '\xe9' & ~0x80 (clear FLAG_REF) = 'i' (TYPE_INT)>
c                       <foo.py: co_const[2]: 'c' = TYPE_CODE>
\x02\x00\x00\x00        <baz: co_argcount: 2>
\x00\x00\x00\x00        <baz: co_kwonlyargcount: 0>
\x02\x00\x00\x00        <baz: co_nlocals: 2>
\x02\x00\x00\x00        <baz: co_stacksize: 2>               
C\x00\x00\x00           <baz: co_flags = 'C' = 0x43 = 67>
s\x08\x00\x00\x00       <baz: co_code: size = 8 bytes>
|\x00                   <baz: co_code: 0 LOAD_FAST                0 (x) 
|\x01                   <baz: co_code: 2 LOAD_FAST                1 (y) 
\x14\x00                <baz: co_code: 4 BINARY_MULTIPLY                
S\x00                   <baz: co_code: 6 RETURN_VALUE                   
)\x01                   <baz: co_const: size>
N                       <baz: co_const[0]: None>
\xa9\x00                <baz: co_names: size = 0; '\xa9' & ~0x80 (clear FLAG_REF) = ')' (TYPE_SMALL_TUPLE)>
)\x02                   <baz: co_varnames: size>
\xda\x01                <baz: number of characters of next item; '\xda' & ~0x80 (clear FLAG_REF) = 'Z'>
x                       <baz: co_varnames[0]: x>
\xda\x01                <baz: number of characters of next item; '\xda' & ~0x80 (clear FLAG_REF) = 'Z'>
y                       <baz: co_varnames[1]: y>
r\x03\x00\x00\x00       <baz: co_freevars: reference to empty tuple '()'>     
r\x03\x00\x00\x00       <baz: co_cellvars: reference to empty tuple '()'>
\xfa\x06                <baz: next item length; '\xfa' & ~0x80 (clear FLAG_REF) = 'z' (TYPE_SHORT_ASCII)>
foo.py                  <baz: co_filename>
\xda\x03                <baz: number of characters of next item>
baz                     <baz: co_name: 'baz'>
\x07\x00\x00\x00        <baz: co_firstlineno: 7>
s\x02\x00\x00\x00       <baz: co_lnotab: size = 2 >
\x00\x01                <baz: co_lnotab>
r\x07\x00\x00\x00       <foo.py: co_const[3]: reference to the string 'baz' (ref index 7)>
N                       <foo.py: co_const[4]: None>
)\x06                   <foo.py: co_names: size> 
\xda\x01                <foo.py: number of characters of next item>
a                       <foo.py: co_names[0]: a>
\xda\x01                <foo.py: number of characters of next item>
b                       <foo.py: co_names[1]: b>
\xda\x01                <foo.py: number of characters of next item>
c                       <foo.py: co_names[2]: c>
r\x07\x00\x00\x00       <foo.py: co_names[3]: reference to the string 'baz' (ref index 7)>
Z\x0e                   <foo.py: number of characters of next item>
multiplication          <foo.py: co_names[4]: multiplication>
Z\x06                   <foo.py: number of characters of next item>
square                  <foo.py: co_names[5]: square>
r\x03\x00\x00\x00       <foo.py: co_varnames: reference to empty tuple '()'>     
r\x03\x00\x00\x00       <foo.py: co_freevars: reference to empty tuple '()'>     
r\x03\x00\x00\x00       <foo.py: co_cellvars: reference to empty tuple '()'>
r\x06\x00\x00\x00       <foo.py: co_filename: reference to 'foo.py'>     
\xda\x08                <foo.py: number of characters of next item>
<module>                <foo.py: co_name>
\x03\x00\x00\x00        <foo.py: co_firstlineno>
s\n\x00\x00\x00         <foo.py: co_lnotab: size = '\n' = 0A>
\x04\x01                <foo.py: co_lnotab>
\x04\x01                <foo.py: co_lnotab>
\x08\x02                <foo.py: co_lnotab>
\x08\x07                <foo.py: co_lnotab>
\n\x01'                 <foo.py: co_lnotab>
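The 'r' (TYPE_REF) entries in the table can be reproduced on a much smaller input. This is a minimal sketch assuming marshal format version 3 or later: when the same object is written a second time, marshal emits 'r' plus a four-byte index into the list of objects that were written earlier with the 0x80 flag, instead of repeating the data.

```python
import marshal

# Build the string at run time so it is an ordinary object shared by
# several references (the variable and both tuple slots).
name = "".join(["b", "a", "z"])
data = marshal.dumps((name, name))

last_type = chr(data[-5])                        # type byte of the 2nd element
ref_index = int.from_bytes(data[-4:], "little")  # index into the refs list
loaded = marshal.loads(data)

print(data)
print(last_type, ref_index, loaded[0] is loaded[1])
```

The string data appears once in the output; the second tuple slot is only a back-reference, which is the same mechanism that lets foo.py's co_names[3] point at the 'baz' string already written for baz's co_name.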

Useful information:

How to create a code object in Python?