02 - Batch Summary, Filtering, and Combination
In [1]:
from __future__ import annotations

from pathlib import Path

from molop.io import AutoParser


def find_repo_root(start: Path) -> Path:
    # Walk upward from `start` until a directory containing pyproject.toml is found.
    current = start.resolve()
    for candidate in (current, *current.parents):
        if (candidate / "pyproject.toml").exists():
            return candidate
    raise RuntimeError("Could not find repository root (pyproject.toml).")


repo_root = find_repo_root(Path.cwd())
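The helper above searches the starting directory and then each parent until it finds a `pyproject.toml`. The following is a minimal, self-contained check of that behavior in a throwaway directory tree; the nested folder names (`docs/notebooks`) are made up for illustration and are not part of the MolOP repository layout.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def find_repo_root(start: Path) -> Path:
    """Return the nearest directory at or above `start` that holds pyproject.toml."""
    current = start.resolve()
    for candidate in (current, *current.parents):
        if (candidate / "pyproject.toml").exists():
            return candidate
    raise RuntimeError("Could not find repository root (pyproject.toml).")


with tempfile.TemporaryDirectory() as tmp:
    # Build a fake repository: <tmp>/pyproject.toml and <tmp>/docs/notebooks/
    fake_root = Path(tmp).resolve()
    (fake_root / "pyproject.toml").touch()
    nested = fake_root / "docs" / "notebooks"  # hypothetical sub-folders
    nested.mkdir(parents=True)

    # Starting two levels down still resolves to the fake root.
    found_root = find_repo_root(nested)
    root_matches = found_root == fake_root
```

Because the search starts at the current working directory, the notebook can be run from any sub-folder of the repository.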
Batch summary
In [2]:
mix_batch = AutoParser((repo_root / "tests/test_files/mix_format/*").as_posix())
mix_batch
Out[2]:
FileBatchModelDisk(10)
In [33]:
mix_batch.to_summary_df()
Out[33]:
Column groups: DiskStorage, General, Calc Parameter, Environment, Status

| | FilePath | FileFormat | Charge | Multiplicity | CanonicalSMILES | NumAtoms | FrameID | Software | Version | Method | ... | Functional | Keywords | SolventModel | Solvent | Temperature (kelvin) | Pressure (standard_atmosphere) | IsError | IsNormal | IsTS | IsOptimized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /Users/tmj/Documents/proj/MolOP/tests/test_fil... | .log | -1 | 1 | C/C=C(\[O-])[N-](c1ccccc1)->[Cu+3]1(<-[O-]c2cc... | 77 | 131 | Gaussian | ES64L-G16RevC.01 | DFT | ... | RB3LYP | #N Geom=AllCheck Guess=TCheck SCRF=Check GenCh... | None | None | 298.15 kelvin | 1.0 standard_atmosphere | False | True | True | False |
| 1 | /Users/tmj/Documents/proj/MolOP/tests/test_fil... | .gjf | 0 | 1 | CCOC(=O)c1cc(OC)no1 | 21 | 0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | /Users/tmj/Documents/proj/MolOP/tests/test_fil... | .gjf | 0 | 1 | C=[N+](C)[N-]C.COC(=O)[C@@]1(OC)C#CC(Br)(Br)CCCC1 | 43 | 0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | /Users/tmj/Documents/proj/MolOP/tests/test_fil... | .log | 0 | 1 | C=[N+](C)[N-]C.COC(=O)[C@@]1(OC)C#CC(Br)(Br)CCCC1 | 43 | 73 | Gaussian | ES64L-G16RevB.01 | DFT | ... | RB3LYP | #N Geom=AllCheck Guess=TCheck SCRF=Check Test ... | PCM | Water | 298.15 kelvin | 1.0 standard_atmosphere | False | True | True | False |
| 4 | /Users/tmj/Documents/proj/MolOP/tests/test_fil... | .fchk | 0 | 2 | [CH2]C | 7 | 0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | /Users/tmj/Documents/proj/MolOP/tests/test_fil... | .gjf | 0 | 2 | [CH2]C | 7 | 0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | /Users/tmj/Documents/proj/MolOP/tests/test_fil... | .log | 1 | 1 | C[N+]1=C=CC([N+](=O)[O-])=C1 | 14 | 21 | Gaussian | ES64L-G16RevC.01 | DFT | ... | RB3LYP-GD3BJ | #p b3lyp/chkbas pop=nbo geom=allcheck guess=tc... | None | None | 298.15 kelvin | 1.0 standard_atmosphere | False | True | False | False |
| 7 | /Users/tmj/Documents/proj/MolOP/tests/test_fil... | .gjf | 0 | 1 | C=CCC.CC(=O)O | 20 | 0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | /Users/tmj/Documents/proj/MolOP/tests/test_fil... | .out | 0 | 1 | C=CCC.CC(=O)O | 20 | 43 | Gaussian | Gaussian 09: AS64L-G09RevD.01 24-Apr-2013\n ... | DFT | ... | | #P irc=(calcfc, MaxPoints=20) ub3lyp/6-31g(d) | None | None | NaN | NaN | True | False | False | False |
| 9 | /Users/tmj/Documents/proj/MolOP/tests/test_fil... | .gjf | 0 | 1 | C | 5 | 0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 rows × 21 columns
In [3]:
mix_batch.file_names
Out[3]:
['RE_BOX-Anion-Real_Cu-III-Phenol_Major-Amide-Anion_From-IP_C-O-190_TS_Opt.log', 'S_Ph_Ni_TS.gjf', 'TS_Zy0fwX_ll_ad_14-19_15-16_optts_g16.gjf', 'TS_Zy0fwX_ll_ad_14-19_15-16_optts_g16.log', 'dsgdb9nsd_000007-6.fchk', 'dsgdb9nsd_000007-6.gjf', 'dsgdb9nsd_131941-4+.log', 'irc.gjf', 'irc.out', 'opt_point_charge_xtb.gjf']
A batch can be indexed like a list to retrieve a specific file.
In [4]:
mix_batch[0].file_path
Out[4]:
'/Users/tmj/Documents/proj/MolOP/tests/test_files/mix_format/RE_BOX-Anion-Real_Cu-III-Phenol_Major-Amide-Anion_From-IP_C-O-190_TS_Opt.log'
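The index can also be a file path, which is more useful for large batches where a file's position is hard to know in advance. For example: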
In [5]:
mix_batch[mix_batch[0].file_path].file_path
Out[5]:
'/Users/tmj/Documents/proj/MolOP/tests/test_files/mix_format/RE_BOX-Anion-Real_Cu-III-Phenol_Major-Amide-Anion_From-IP_C-O-190_TS_Opt.log'
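To make the two indexing styles concrete, here is a toy container that accepts either an integer position or a file path. It is only a sketch of the idea: `IndexedBatch` and the example paths are hypothetical, and nothing here reflects how MolOP implements `FileBatchModelDisk` internally.

```python
from __future__ import annotations


class IndexedBatch:
    """Toy batch (hypothetical): look up records by position or by file path."""

    def __init__(self, file_paths: list[str]) -> None:
        # A dict keeps insertion order, so positions stay stable.
        self._by_path = {path: {"file_path": path} for path in file_paths}

    def __len__(self) -> int:
        return len(self._by_path)

    def __getitem__(self, key: int | str) -> dict[str, str]:
        if isinstance(key, int):
            # Positional access, like a list.
            return list(self._by_path.values())[key]
        # Path access, like a dict.
        return self._by_path[key]


toy_batch = IndexedBatch(["/data/irc.gjf", "/data/irc.out"])  # made-up paths
first_by_position = toy_batch[0]
first_by_path = toy_batch[first_by_position["file_path"]]
same_record = first_by_position is first_by_path
```

Path lookup avoids scanning for a position, which is the convenience the text describes for large batches.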
For a quick visual check, draw the structures as a grid image:
In [6]:
mix_batch.draw_grid_image(molsPerRow=2, subImgSize=(500, 500))
Out[6]:
[Image: grid of molecular structures, two per row]
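If the calculation results contain information that can be attached to atoms or bonds, use the qm_embedded_rdmol method to obtain an RDKit mol object with that information embedded. SDF/MOL files saved from such an object will include it.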
In [7]:
mix_batch[0][-1].qm_embedded_rdmol()
Out[7]:
| atom.dprop.MULLIKEN_CHARGES_BY_GAUSSIAN | 0.118843 -0.52778700000000001 -0.51229000000000002 -0.53101200000000004 -0.51256599999999997 0.61099899999999996 0.0034429999999999999 0.139241 0.61233300000000002 0.115479 -0.39646199999999998 0.114855 0.0024759999999999999 0.125336 -0.057921 0.13408999999999999 0.12804299999999999 -0.50508900000000001 0.26441100000000001 0.111973 0.15039 0.122109 -0.45417600000000002 0.11734899999999999 0.13606699999999999 0.12819800000000001 -0.47093800000000002 0.11650099999999999 0.18501600000000001 0.137873 -0.065138000000000001 0.12690599999999999 0.15543299999999999 -0.476964 0.12060999999999999 0.120382 0.20754500000000001 -0.57657999999999998 -0.152752 -0.129075 -0.20485700000000001 0.41112599999999999 -0.20932500000000001 -0.14534900000000001 0.080785999999999997 0.084764000000000006 0.108117 0.16168299999999999 0.099557999999999994 -0.48423300000000002 0.120835 0.23432 0.13422200000000001 -0.45259300000000002 0.128138 0.13276499999999999 0.12720999999999999 0.55761000000000005 -0.52698400000000001 -0.57556099999999999 -0.084725999999999996 0.19952400000000001 -0.481348 0.178619 0.14002800000000001 0.18781600000000001 0.27968700000000002 -0.168215 -0.17993300000000001 -0.14568900000000001 0.17444000000000001 -0.141733 0.166348 -0.13463900000000001 0.098386000000000001 0.100982 0.091067999999999996 |
|---|---|
| atom.dprop.APT_CHARGES_BY_GAUSSIAN | 0.443965 -0.792883 -0.83162499999999995 -0.79886199999999996 -0.77464900000000003 1.2313970000000001 0.30693799999999999 -0.051416000000000003 1.167859 0.055232999999999997 -0.83615399999999995 0.040263 0.3377 -0.066701999999999997 0.38441500000000001 -0.050312000000000003 -0.081948999999999994 0.017576000000000001 0.048186 -0.056041000000000001 -0.0068040000000000002 0.062031000000000003 0.058993999999999998 -0.038065000000000002 -0.020482 -0.043746 0.028738 -0.045996000000000002 0.0098879999999999992 -0.010717000000000001 0.34742400000000001 -0.046920000000000003 -0.042823 0.038242999999999999 -0.031461000000000003 -0.038001 0.013105 -0.84482800000000002 -0.199792 0.099024000000000001 -0.18670300000000001 0.568415 -0.18141299999999999 0.061751 -0.039863999999999997 -0.049185 0.015035 0.047211999999999997 -0.015221 0.0093130000000000001 -0.053851999999999997 0.032781999999999999 -0.015025 0.050227000000000001 -0.050072999999999999 -0.025194000000000001 -0.027827999999999999 0.40771000000000002 -0.65421099999999999 -0.62746400000000002 0.75768800000000003 -0.028715000000000001 -0.16567699999999999 0.067944000000000004 -0.0081110000000000002 0.038531999999999997 0.388457 -0.182672 -0.149811 0.062414999999999998 0.099127000000000007 0.039350999999999997 0.047967000000000003 -0.142039 -0.032569000000000001 -0.012616 -0.026436000000000001 |
Filtering files
When there are many files, enabling parallel processing can significantly speed up these operations.
In [8]:
mix_batch.groupby(lambda x: x.pure_filename)
Out[8]:
{'RE_BOX-Anion-Real_Cu-III-Phenol_Major-Amide-Anion_From-IP_C-O-190_TS_Opt': FileBatchModelDisk(1),
'S_Ph_Ni_TS': FileBatchModelDisk(1),
'TS_Zy0fwX_ll_ad_14-19_15-16_optts_g16': FileBatchModelDisk(2),
'dsgdb9nsd_000007-6': FileBatchModelDisk(2),
'dsgdb9nsd_131941-4+': FileBatchModelDisk(1),
'irc': FileBatchModelDisk(2),
'opt_point_charge_xtb': FileBatchModelDisk(1)}
In [24]:
mix_batch.groupby(lambda x: x.file_format)
Out[24]:
{'.log': FileBatchModelDisk(3),
'.gjf': FileBatchModelDisk(5),
'.fchk': FileBatchModelDisk(1),
'.out': FileBatchModelDisk(1)}
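The two outputs above show `groupby` returning a dict that maps each key to a sub-batch. The same grouping pattern can be sketched with the standard library alone. Here `group_by` is a hypothetical helper working on plain lists, and the file names are a subset of those listed in Out[3]; MolOP's own implementation may differ.

```python
from collections import defaultdict
from pathlib import Path


def group_by(items, key):
    """Group `items` into a dict of lists, keyed by `key(item)`."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)


file_names = [
    "irc.gjf",
    "irc.out",
    "S_Ph_Ni_TS.gjf",
    "dsgdb9nsd_000007-6.fchk",
    "dsgdb9nsd_000007-6.gjf",
]

# Group by extension, analogous to grouping on file_format.
by_suffix = group_by(file_names, lambda name: Path(name).suffix)

# Group by name without extension, analogous to grouping on pure_filename.
by_stem = group_by(file_names, lambda name: Path(name).stem)
```

Grouping on the stem pairs each input file with its output file, which is why `irc` and `dsgdb9nsd_000007-6` each hold two entries in Out[8].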
In [22]:
ts_batch = mix_batch.filter_state("ts")
ts_batch
Out[22]:
FileBatchModelDisk(2)
In [25]:
opt_batch = mix_batch.filter_state("opt")
opt_batch
Out[25]:
FileBatchModelDisk(3)
In [26]:
cation_batch = mix_batch.filter_value("charge", 1)
cation_batch
Out[26]:
FileBatchModelDisk(1)
In [27]:
anion_batch = mix_batch.filter_value("charge", 0, "<")
anion_batch
Out[27]:
FileBatchModelDisk(1)
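The two `filter_value` calls above pass a field name, a reference value, and optionally a comparison string. One plausible way to support such a string argument is a lookup table over the `operator` module, sketched below. `filter_records` and `COMPARATORS` are hypothetical names, and the set of comparison strings is an assumption: the notebook only demonstrates the default equality and `"<"`. The charges are the ten values printed by the `parallel_execute` cell in this notebook.

```python
import operator

# Assumed mapping from comparison strings to functions (only "==" and "<"
# are confirmed by the notebook; the rest are illustrative).
COMPARATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def filter_records(records, field, value, compare="=="):
    """Keep the records whose `field` satisfies `record[field] <compare> value`."""
    op = COMPARATORS[compare]
    return [record for record in records if op(record[field], value)]


charges = [-1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
records = [{"index": i, "charge": c} for i, c in enumerate(charges)]

cations = filter_records(records, "charge", 1)       # charge == 1
anions = filter_records(records, "charge", 0, "<")   # charge < 0
```

Each filter keeps exactly one record here, matching the `FileBatchModelDisk(1)` results above.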
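A parallel executor is built in: the parallel_execute method runs a user-defined function on every file in parallel and returns a list containing the result of each call. In the example below the function only prints, and print returns None, so the returned list holds ten None values.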
In [28]:
mix_batch.parallel_execute(lambda x: print(x.charge))
-1 0 0 0 0 0 1 0 0 0
Out[28]:
[None, None, None, None, None, None, None, None, None, None]
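The result list is more useful when the function returns a value instead of printing it. The sketch below shows the same map-and-collect pattern with `concurrent.futures` from the standard library. `parallel_map` is a hypothetical stand-in, not MolOP's executor, and the thread pool is an assumption about the execution model made only to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(func, items, max_workers=4):
    """Apply `func` to every item concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


charges = [-1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
records = [{"charge": c} for c in charges]

# Returning the value yields a usable result list.
returned = parallel_map(lambda record: record["charge"], records)

# A function that only prints returns None for every item.
side_effect_only = parallel_map(lambda record: print(record["charge"]), records)
```

By analogy, passing a function that returns the charge to parallel_execute should give a list of charges rather than a list of None.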
Combination
Batches support simple Boolean (set-style) operations, for example:
In [29]:
ion_batch = cation_batch + anion_batch
ion_batch
Out[29]:
FileBatchModelDisk(2)
In [30]:
opt_batch_without_ts = opt_batch - ts_batch
opt_batch_without_ts
Out[30]:
FileBatchModelDisk(1)
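The `+` and `-` operators above behave like union and difference on the files in each batch. A minimal sketch of that behavior follows. `SetLikeBatch` is a hypothetical class, the file names are placeholders, and dropping duplicates on `+` is an assumption that the notebook does not confirm for MolOP.

```python
class SetLikeBatch:
    """Toy batch (hypothetical) keyed by file path, supporting + and -."""

    def __init__(self, paths):
        # dict.fromkeys removes duplicates while keeping the first-seen order.
        self._paths = list(dict.fromkeys(paths))

    def __add__(self, other):
        # Union: everything from both batches (assumed to be de-duplicated).
        return SetLikeBatch([*self._paths, *other._paths])

    def __sub__(self, other):
        # Difference: drop every path that also appears in `other`.
        excluded = set(other._paths)
        return SetLikeBatch([p for p in self._paths if p not in excluded])

    def __len__(self):
        return len(self._paths)

    def __repr__(self):
        return f"SetLikeBatch({len(self)})"


# Placeholder file names chosen to reproduce the batch sizes shown above.
opt_files = SetLikeBatch(["a.log", "b.log", "c.log"])
ts_files = SetLikeBatch(["a.log", "b.log"])
cation_files = SetLikeBatch(["c.log"])
anion_files = SetLikeBatch(["a.log"])

ion_files = cation_files + anion_files
opt_without_ts = opt_files - ts_files
```

With these placeholders the sizes mirror the notebook: two files in the combined ion batch and one optimized file left after removing transition states.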