Contributing
Thank you for your interest in MolOP! We welcome contributions in all forms.
- Reporting Bugs: How to submit an effective Issue.
- Feature Suggestions: Share your ideas for improving MolOP.
- Code Contribution Workflow: Detailed steps from Fork to Pull Request.
- Development Environment Setup: How to configure your local development environment.
- Code Style Guidelines: Follow Ruff and type checking requirements.
- Documentation Style Guide: Use the unified four-layer documentation standard (`style_guide.md`).
Documentation Quality Assurance
To ensure high-quality documentation, we enforce the following policies:
- CI Verification: Every Pull Request triggers a documentation build using `mkdocs build --strict`. This ensures that there are no broken internal links and that the configuration is valid.
- Translation Policy:
    - We use `TODO(translate):` as a placeholder for content that has not yet been translated between English and Chinese.
    - Placeholders are allowed in the `main` branch and Pull Requests to enable staged parity.
    - Release Blockers: Placeholders are strictly forbidden in release tags (`v*`). The CI will fail if any are detected during a release build.
- Notebooks: Jupyter notebooks in the documentation are optional and are not executed by the CI. Please ensure they are pre-executed if you want the output to be visible.
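The release-blocker rule above can be approximated locally before tagging. The following is a minimal sketch, not the project's CI script: `find_placeholders` and `release_allowed` are hypothetical helper names, and the sketch assumes that documentation sources are `.md` files and that release tags match the `v*` pattern.

```python
from fnmatch import fnmatchcase
from pathlib import Path

PLACEHOLDER = "TODO(translate):"


def find_placeholders(docs_root: Path) -> list[tuple[Path, int]]:
    """Return (file, line number) pairs for every placeholder under docs_root."""
    hits: list[tuple[Path, int]] = []
    for path in sorted(docs_root.rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if PLACEHOLDER in line:
                hits.append((path, lineno))
    return hits


def release_allowed(ref_name: str, hits: list[tuple[Path, int]]) -> bool:
    """Placeholders block only refs that look like release tags (v*)."""
    is_release_tag = fnmatchcase(ref_name, "v*")
    return not (is_release_tag and hits)
```

Running `find_placeholders(Path("docs"))` before creating a tag lists the files that still need translation; the `docs` directory name is itself an assumption about the repository layout.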
Developing IO plugins
MolOP’s IO stack is deliberately split into three layers: parsing, storage, and registration. Plugins are easiest to maintain when they follow that split instead of putting everything into one class.
Design principles
- Files own frame collections and file-level transforms. `BaseChemFile` is a `Sequence` of frames, stores file-wide metadata such as `charge`, `multiplicity`, and `file_content`, and exposes file-level `format_transform(...)` through `FormatTransformMixin`. Multi-frame behavior such as frame selection and `embed_in_one_file` belongs here.
- Frames own per-structure data and single-frame rendering. `BaseChemFileFrame` carries `frame_id`, `frame_content`, and neighbor links (`prev`/`next`). Specializations like `BaseQMInputFrame` and `BaseCalcFrame` add QM metadata, energies, vibrations, and other computed properties.
- Storage models should preserve raw data first, then normalize. File/frame models keep raw text (`file_content`, `frame_content`) and parsed fields side-by-side. This is important for round-tripping, debugging, and format-specific rendering like Gaussian fakeG.
- Rendering should be domain-correct. File transforms go through `codec_registry.write(...)`; frame transforms go through `codec_registry.write_frame(...)`. Do not fake file rendering by wrapping one frame as a one-frame file, and do not expose a frame writer when the format is only meaningful for whole files.
Relevant runtime files:
- `src/molop/io/base_models/ChemFile.py`
- `src/molop/io/base_models/ChemFileFrame.py`
- `src/molop/io/base_models/Mixins.py`
- `src/molop/io/base_models/_format_transform.py`
- `src/molop/io/codec_registry.py`
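To illustrate the "preserve raw data first, then normalize" principle, here is a toy sketch that does not use MolOP's base classes. `ToyFrame` and `parse_toy_frame` are hypothetical names, and the input is assumed to be plain `symbol x y z` lines; only the field names `frame_id` and `frame_content` are borrowed from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFrame:
    """Hypothetical frame: raw text is stored next to the fields parsed from it."""

    frame_id: int
    frame_content: str  # raw text, kept verbatim for round-tripping
    atom_symbols: list[str] = field(default_factory=list)
    coords: list[tuple[float, float, float]] = field(default_factory=list)


def parse_toy_frame(frame_id: int, text: str) -> ToyFrame:
    """Parse 'symbol x y z' lines while leaving the raw text untouched."""
    frame = ToyFrame(frame_id=frame_id, frame_content=text)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank lines and anything that is not an atom line
        symbol, x, y, z = parts
        frame.atom_symbols.append(symbol)
        frame.coords.append((float(x), float(y), float(z)))
    return frame
```

Because the raw text survives parsing, a renderer can fall back to it when the parsed fields do not capture a format-specific detail.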
Required behaviors for plugin classes
A new parser or renderer plugin must provide the following behaviors to work in MolOP.
- The plugin module MUST expose `register(registry)`. Builtin and third-party codec discovery depends on this function.
- A reader plugin MUST register at least one `Registry.reader_factory(...)` entry. That reader must be able to return a parsed file object through its `read(...)` path.
- A file renderer plugin MUST implement `FileMixin._render_frames_in_one_file(...)`. This is the entrypoint used when `embed_in_one_file=True`.
- A file renderer plugin MUST implement `FileMixin._render_frames(...)`. This is the entrypoint used when a file transform needs per-frame outputs.
- A frame renderer plugin MUST implement `frame._render(**kwargs)` if it registers `domain="frame"`. `FrameRendererWriter` depends on the frame class owning single-frame rendering semantics.
- A file-only format MUST register only `domain="file"`. If the format has no valid single-frame semantics, do not add a frame writer.
- A plugin SHOULD preserve raw source text when the format contains meaningful directives or output blocks. This keeps round-tripping and format-specific rendering possible.
Reference implementations:
- Dual file/frame rendering: `src/molop/io/logic/coords_models/XYZFile.py` and `src/molop/io/logic/coords_frame_models/XYZFileFrame.py`
- File-only rendering: `src/molop/io/logic/QM_models/G16LogFile.py`
- Reader registration: `src/molop/io/logic/qminput_parsers/GJFFileParser.py`
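The `register(registry)` contract can be pictured with a self-contained toy. This is a sketch under assumptions, not MolOP's `Registry`: `ToyRegistry` is a hypothetical stand-in that only mimics the decorator-factory shape and accepts just `format_id` and `domain`.

```python
from collections.abc import Callable
from typing import Any

Factory = Callable[[], Any]


class ToyRegistry:
    """Hypothetical stand-in for a codec registry; not MolOP's Registry."""

    def __init__(self) -> None:
        self.writers: dict[tuple[str, str], Factory] = {}

    def writer_factory(self, *, format_id: str, domain: str) -> Callable[[Factory], Factory]:
        """Return a decorator that stores the factory without calling it."""

        def decorator(factory: Factory) -> Factory:
            self.writers[(format_id, domain)] = factory
            return factory

        return decorator


def register(registry: ToyRegistry) -> None:
    """Module-level entry point that codec discovery would call."""

    @registry.writer_factory(format_id="myfmt", domain="file")
    def _file_writer() -> str:
        return "file writer for myfmt"

    @registry.writer_factory(format_id="myfmt", domain="frame")
    def _frame_writer() -> str:
        return "frame writer for myfmt"
```

Storing factories rather than writer instances keeps registration cheap: nothing is constructed until a writer is actually selected.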
Minimal plugin example
The following example shows the smallest realistic shape for a plugin that supports file and frame rendering. If your format is file-only, omit the frame writer factory, exactly as `fakeg` does in `G16LogFile.py`.
from __future__ import annotations

from collections.abc import Sequence
from typing import TYPE_CHECKING, cast

from molop.io.base_models.ChemFile import BaseCoordsFile
from molop.io.base_models.ChemFileFrame import BaseCoordsFrame, _HasCoords
from molop.io.base_models.Mixins import (
    DiskStorageMixin,
    FileMixin,
    MemoryStorageMixin,
    _HasRenderableFrames,
)

if TYPE_CHECKING:
    from molop.io.codec_registry import Registry


class MyFmtFrameMixin:
    # Single-frame rendering lives on the frame class.
    def _render(self, **kwargs) -> str:
        typed_self = cast(_HasCoords, self)
        return "\n".join(
            [
                str(len(typed_self.atoms)),
                f"charge {typed_self.charge} multiplicity {typed_self.multiplicity}",
                *(
                    f"{atom} {x:.6f} {y:.6f} {z:.6f}"
                    for atom, (x, y, z) in zip(
                        typed_self.atom_symbols,
                        typed_self.coords.m,
                        strict=True,
                    )
                ),
            ]
        )


class MyFmtFrameMemory(MemoryStorageMixin, MyFmtFrameMixin, BaseCoordsFrame["MyFmtFrameMemory"]): ...


class MyFmtFrameDisk(DiskStorageMixin, MyFmtFrameMixin, BaseCoordsFrame["MyFmtFrameDisk"]): ...


class MyFmtFileMixin(FileMixin):
    # Used when embed_in_one_file=True: all selected frames go into one string.
    def _render_frames_in_one_file(self, frameID: Sequence[int], **kwargs) -> str:
        typed_self = cast(_HasRenderableFrames, self)
        return "\n\n".join(
            frame._render(**kwargs) for frame in typed_self.frames if frame.frame_id in frameID
        )

    # Used when a file transform needs one output per selected frame.
    def _render_frames(self, frameID: Sequence[int], **kwargs) -> list[str]:
        typed_self = cast(_HasRenderableFrames, self)
        return [
            frame._render(**kwargs) for frame in typed_self.frames if frame.frame_id in frameID
        ]


class MyFmtFileMemory(MemoryStorageMixin, MyFmtFileMixin, BaseCoordsFile[MyFmtFrameMemory]): ...


class MyFmtFileDisk(DiskStorageMixin, MyFmtFileMixin, BaseCoordsFile[MyFmtFrameDisk]): ...


def register(registry: Registry) -> None:
    from molop.io.codecs._shared.writer_helpers import (
        FileRendererWriter,
        FrameRendererWriter,
        StructureLevel,
    )

    # File-domain writer: renders whole files.
    @registry.writer_factory(
        format_id="myfmt",
        required_level=StructureLevel.COORDS,
        domain="file",
        default_graph_policy="coords",
        priority=100,
    )
    def _file_writer():
        return FileRendererWriter(
            format_id="myfmt",
            required_level=StructureLevel.COORDS,
            file_cls=MyFmtFileDisk,
            frame_cls=MyFmtFrameDisk,
            priority=100,
        )

    # Frame-domain writer: omit this factory for file-only formats.
    @registry.writer_factory(
        format_id="myfmt",
        required_level=StructureLevel.COORDS,
        domain="frame",
        default_graph_policy="coords",
        priority=100,
    )
    def _frame_writer():
        return FrameRendererWriter(
            format_id="myfmt",
            required_level=StructureLevel.COORDS,
            frame_cls=MyFmtFrameDisk,
            priority=100,
        )
Data-structure dependency graph
The diagram below shows the dependency direction that plugin authors should preserve. Base file/frame models define storage and traversal semantics, parsers populate those models, and the registry plus writer helpers expose reader/writer behavior on top of them.
graph TD
Registry[Registry]
ReaderFactory[reader_factory / writer_factory]
ReaderHelpers[reader helpers]
WriterHelpers[FileRendererWriter / FrameRendererWriter]
FileMixin[FileMixin]
FTM[FormatTransformMixin]
FrameFTM[FrameFormatTransformMixin]
BaseChemFile[BaseChemFile]
BaseCoordsFile[BaseCoordsFile]
BaseQMInputFile[BaseQMInputFile]
BaseCalcFile[BaseCalcFile]
BaseFrame[BaseChemFileFrame]
BaseCoordsFrame[BaseCoordsFrame]
BaseQMInputFrame[BaseQMInputFrame]
BaseCalcFrame[BaseCalcFrame]
FileParser[FileParser / FileBatchParserDisk]
FrameParser[FrameParser]
PluginFile[Plugin File Model]
PluginFrame[Plugin Frame Model]
PluginReader[Plugin Reader / Parser]
BaseChemFile --> FTM
BaseChemFile --> FileMixin
BaseCoordsFile --> BaseChemFile
BaseQMInputFile --> BaseCoordsFile
BaseCalcFile --> BaseQMInputFile
BaseFrame --> FrameFTM
BaseCoordsFrame --> BaseFrame
BaseQMInputFrame --> BaseCoordsFrame
BaseCalcFrame --> BaseQMInputFrame
PluginFile --> BaseCalcFile
PluginFrame --> BaseCalcFrame
FileParser --> PluginFile
FrameParser --> PluginFrame
PluginReader --> FileParser
Registry --> ReaderFactory
ReaderFactory --> ReaderHelpers
ReaderFactory --> WriterHelpers
WriterHelpers --> PluginFile
WriterHelpers --> PluginFrame
Read it as follows:
- inheritance flows from generic base models to format-specific models
- parsers depend on the models they populate
- registration depends on callable factories, not direct model imports from the core runtime
- file renderers may depend on both file and frame classes, but frame renderers should depend only on frame semantics
Runtime dataflow sequence
The next diagram captures the runtime order for parsing and rendering. This is the sequence you should preserve when adding new codecs or plugin models.
sequenceDiagram
participant User
participant API as AutoParser / format_transform
participant Registry
participant Catalog as builtin/plugin catalog
participant Reader as ReaderCodec
participant FileParser as File parser
participant FrameParser as Frame parser
participant FileModel as File model
participant FrameModel as Frame model
participant Writer as File/Frame writer
User->>API: parse path or request transform
API->>Registry: ensure_default_codecs_registered()
Registry->>Catalog: load_builtin_codecs() / load_plugin_codecs()
alt parsing
API->>Registry: select_reader(path, parser_detection)
Registry-->>API: ordered ReaderCodec candidates
API->>Reader: read(path)
Reader->>FileParser: parse raw file content
FileParser->>FrameParser: split file and parse frame payloads
FrameParser->>FrameModel: build parsed frame objects
FileParser->>FileModel: assemble file object and append frames
FileModel-->>API: parsed file model
else file transform
API->>Registry: write(file_obj, format, frameID, embed_in_one_file)
Registry->>Writer: select file-domain writer
Writer->>FileModel: _render(...)
FileModel-->>Writer: rendered text or file output
Writer-->>API: transform result
else frame transform
API->>Registry: write_frame(frame_obj, format)
Registry->>Writer: select frame-domain writer
Writer->>FrameModel: _render(...)
FrameModel-->>Writer: rendered text or file output
Writer-->>API: transform result
end
Plugin rule of thumb: add logic at the earliest stable layer that owns the behavior. Parsing belongs in parser modules, file assembly belongs in file models, single-frame semantics belong in frame models, and user-visible availability belongs in registry registration.
What the base classes already provide
File models
Use one of the existing file bases unless you have a very strong reason not to:
- `BaseCoordsFile`: coordinate-only formats
- `BaseQMInputFile`: input formats with coordinates plus lightweight route/resource metadata
- `BaseCalcFile`: calculation/result formats with coordinates, QM metadata, and output properties
These file bases already provide:
- re-iterable `Sequence` behavior over frames
- `append(...)`, `frames`, `__getitem__`, and `__iter__`
- summary helpers (`to_summary_dict`, `to_summary_df`)
- file-content lifecycle helpers (`release_file_content`)
- file-level `format_transform(...)`
If your file model is renderable through the generic registry path, implement the FileMixin contract:
- `_render_frames_in_one_file(frameID, **kwargs) -> str`
- `_render_frames(frameID, **kwargs) -> list[str]`
Reference patterns:
- `src/molop/io/logic/qminput_models/GJFFile.py`
- `src/molop/io/logic/QM_models/G16LogFile.py`
Frame models
Use one of the frame bases that matches the data level:
- `BaseCoordsFrame`
- `BaseQMInputFrame`
- `BaseCalcFrame`
These frame bases already provide:
- frame identity (`frame_id`)
- frame linkage (`prev`, `next`)
- raw frame text preservation (`frame_content`)
- molecule-level behavior inherited from `Molecule`
- frame-level `format_transform(...)` when a frame writer exists for the target format
Frame renderers should implement format-specific single-frame logic only. File assembly, frame selection, and multi-frame packaging stay on the file model.
Parser and storage conventions
- Parsers build models; models do not parse files on demand. Keep extraction logic in parser modules and model validation/aggregation logic in the model classes.
- Preserve raw directives where possible. For QM input/output formats, keep route/resources/title text available in model fields rather than only storing derived semantic values.
- Normalize with validators, not ad hoc post-processing scripts. MolOP already relies on model validators to fill derived fields such as method, basis set, and functional information.
- Containers must remain stable under repeated iteration. Do not introduce shared cursor state into file/frame collections.
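The last convention is easiest to see by contrast. The sketch below uses two hypothetical toy containers, neither taken from MolOP: `ToyFrames` gets a fresh iterator on every pass through the `Sequence` protocol, while `CursorFrames` keeps a cursor on the container itself and is exhausted after one pass.

```python
from collections.abc import Iterator, Sequence


class ToyFrames(Sequence[str]):
    """Re-iterable container: every iteration starts from a fresh index."""

    def __init__(self, frames: list[str]) -> None:
        self._frames = list(frames)

    def __getitem__(self, index):
        return self._frames[index]

    def __len__(self) -> int:
        return len(self._frames)


class CursorFrames:
    """Anti-pattern: the cursor lives on the container and is shared."""

    def __init__(self, frames: list[str]) -> None:
        self._frames = list(frames)
        self._cursor = 0

    def __iter__(self) -> Iterator[str]:
        return self  # every caller advances the same cursor

    def __next__(self) -> str:
        if self._cursor >= len(self._frames):
            raise StopIteration
        frame = self._frames[self._cursor]
        self._cursor += 1
        return frame
```

Iterating `ToyFrames` twice yields the same frames both times; a second pass over `CursorFrames` yields nothing, which is the kind of hidden state the convention rules out.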
Registration rules for new formats
Renderers become available only after a module exposes a callable register(registry) function. Builtin codec loading scans parser/model packages and invokes that function lazily.
For readers:
- register through `Registry.reader_factory(...)`
- provide a format id, extensions, and priority
- keep parsing logic in parser modules under `src/molop/io/logic/*_parsers`
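One way to picture how extensions and priority could interact during reader selection is the toy below. It is a sketch under assumptions, not MolOP's `select_reader`: `ToyReader` and `select_readers` are hypothetical names, and the ordering (higher priority first) is assumed purely for illustration.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ToyReader:
    """Hypothetical reader record holding the three registration fields."""

    format_id: str
    extensions: frozenset[str]
    priority: int


def select_readers(path: str, readers: list[ToyReader]) -> list[ToyReader]:
    """Return readers whose extensions match the path, highest priority first."""
    suffix = Path(path).suffix.lower()
    matches = [reader for reader in readers if suffix in reader.extensions]
    # Assumption: a larger priority value means the reader is tried earlier.
    return sorted(matches, key=lambda reader: reader.priority, reverse=True)
```

Returning an ordered list rather than a single reader leaves room for a fallback when the first candidate cannot parse the file.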
For writers/renderers:
- register through `Registry.writer_factory(...)`
- choose the correct `domain` explicitly:
    - `domain="file"` for file-level writers
    - `domain="frame"` for frame-level writers
- use `FileRendererWriter` for file renderers and `FrameRendererWriter` for frame renderers
- pick `required_level` carefully:
    - `StructureLevel.COORDS` for coordinate-driven formats
    - `StructureLevel.GRAPH` for graph-preserving formats
- set `default_graph_policy` intentionally rather than relying on guesses
Reference registration files:
- `src/molop/io/logic/qminput_models/GJFFile.py`
- `src/molop/io/logic/coords_models/XYZFile.py`
- `src/molop/io/logic/QM_models/G16LogFile.py`
File-only vs frame-only support
Not every format should support both file and frame transforms.
- If a format is only meaningful as a whole-file render (for example, a synthetic multi-frame Gaussian log), register only `domain="file"`.
- If a format has valid single-frame semantics, add a separate `domain="frame"` writer.
- If you omit the frame writer, `frame.format_transform(...)` should fail with `UnsupportedFormatError`, which is the correct behavior.
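The expected failure mode can be sketched with a toy lookup keyed by format and domain. Everything here is illustrative: the `WRITERS` table and `render` function are hypothetical, and `UnsupportedFormatError` is redefined locally so the snippet runs on its own rather than importing MolOP's exception.

```python
class UnsupportedFormatError(Exception):
    """Local stand-in; MolOP defines its own exception with this name."""


# Hypothetical writer table keyed by (format_id, domain).
WRITERS = {
    ("xyz", "file"): lambda obj: f"file:{obj}",
    ("xyz", "frame"): lambda obj: f"frame:{obj}",
    ("fakeg", "file"): lambda obj: f"file:{obj}",  # file-only: no frame entry
}


def render(obj: str, format_id: str, domain: str) -> str:
    """Look up a writer for the requested domain or fail loudly."""
    try:
        writer = WRITERS[(format_id, domain)]
    except KeyError:
        raise UnsupportedFormatError(f"no {domain} writer for {format_id!r}") from None
    return writer(obj)
```

A targeted test for a file-only format can then assert that the frame-domain request raises instead of silently wrapping the frame in a one-frame file.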
Generated surfaces you must keep in sync
Registration changes affect generated stubs and CLI typing. If you add or change readers/writers, regenerate or check the generated artifacts in the same work session:
- `uv run python scripts/generate_io_typing_catalog.py`
- `uv run python scripts/generate_chemfile_format_transform_stubs.py`
- `uv run python scripts/generate_cli_transform_stubs.py`
Useful verification commands:
- `uv run pytest <targeted-test>`
- `uv run python scripts/generate_io_typing_catalog.py --check`
- `uv run python scripts/generate_chemfile_format_transform_stubs.py --check`
- `uv run python scripts/generate_cli_transform_stubs.py --check`
Practical checklist for plugin authors
Before opening a PR for a new parser/renderer plugin, verify that:
- file and frame responsibilities are separated cleanly
- raw input/output text is preserved where useful
- file collections remain re-iterable and stateless
- registration uses the correct `domain`
- file-only formats do not accidentally expose frame writers
- generated stubs and CLI typing are updated
- at least one targeted test proves the new format is actually parseable or renderable