The PNG specification allows arbitrary text metadata through tEXt, iTXt, and zTXt chunks. ComfyUI uses this mechanism to embed two large JSON structures — the workflow graph and the prompt execution data — under specific keywords. Stable Diffusion Web UI (A1111) uses a different approach: a plain-text “parameters” string with a specific formatting convention. InvokeAI uses yet another format. And tools like Midjourney do not embed metadata in images at all.
Part of our AI-Native DAM Architecture
An asset management system that claims to support AI-generated content must extract metadata from all of these formats. This is not a simple parsing problem — each format has edge cases, size limits, encoding variations, and failure modes that a production extractor must handle gracefully. Getting extraction right is the foundation of everything that follows: search, lineage, reproducibility, and compliance all depend on correct, complete metadata extraction.
The Forces at Work
- Format diversity: Even within the PNG specification, tools use different chunk types (tEXt vs. iTXt), different keywords (workflow, prompt, parameters, Dream, invokeai_metadata), and different encoding strategies (plain text, JSON, base64). The extractor must probe for multiple formats and identify which tool produced the file before attempting to parse its metadata.
- Large payload sizes: ComfyUI workflow JSON can exceed 200 kilobytes for complex workflows with forty or more nodes. Many PNG metadata libraries have fixed buffer sizes that silently truncate large text chunks, producing corrupted JSON that fails to parse. Both JSON blobs can be very large, and both must be extracted in full.
- Custom node contamination: ComfyUI's extensibility means custom nodes can inject arbitrary data into the workflow and prompt structures. Some custom nodes add fields with non-standard types, circular references, or extremely large values. The extractor must handle these gracefully — extracting what it can and flagging what it cannot — without crashing on malformed input.
- Non-image sources: Not all generation metadata comes from image files. Midjourney metadata comes from Discord messages. Some tools produce metadata in sidecar files (JSON or XML alongside the image). Some export metadata through APIs. The extraction layer must support file-based, message-based, and API-based sources.
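The truncation risk above is why a production extractor often walks the PNG chunk stream itself rather than trusting a library's buffer limits. The sketch below is a minimal, stdlib-only tEXt/zTXt reader under that assumption; it handles only the two chunk types ComfyUI and A1111 commonly use (iTXt and CRC validation are omitted for brevity), and the function name is illustrative, not from any particular library.

```python
import struct
import zlib

PNG_SIG = b"\x89PNG\r\n\x1a\n"

def read_text_chunks(data: bytes) -> dict[str, str]:
    """Walk a PNG byte stream and collect tEXt/zTXt chunks into a dict.

    Each chunk body is read in full, so 200 KB+ ComfyUI blobs are never
    truncated. iTXt and CRC validation are omitted for brevity.
    """
    if not data.startswith(PNG_SIG):
        raise ValueError("not a PNG file")
    chunks: dict[str, str] = {}
    pos = len(PNG_SIG)
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"tEXt":
            # tEXt layout: keyword, NUL separator, latin-1 text
            key, _, value = body.partition(b"\x00")
            chunks[key.decode("latin-1")] = value.decode("latin-1")
        elif ctype == b"zTXt":
            # zTXt layout: keyword, NUL, compression-method byte, zlib data
            key, _, rest = body.partition(b"\x00")
            chunks[key.decode("latin-1")] = zlib.decompress(rest[1:]).decode("latin-1")
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
        if ctype == b"IEND":
            break
    return chunks
```

Reading chunk-by-chunk like this also makes probing cheap: the extractor can stop at IEND having seen every text keyword without decoding pixel data.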
The Problem
Most metadata extraction tools are built for traditional photography metadata: EXIF data measured in bytes, with well-defined field types and standard value ranges. AI generation metadata breaks these assumptions in every dimension:
Traditional vs. AI Generation Metadata
| Dimension | Photography EXIF | AI Generation Metadata |
|---|---|---|
| Size per image | 1-10 KB | 10-400 KB |
| Format | Standardized (EXIF/XMP) | Tool-specific, non-standard |
| Structure | Flat key-value pairs | Nested JSON, graphs, free text |
| Stability | Mature, rarely changes | Evolves with each tool update |
| Validation | Schema-defined types | Arbitrary, unvalidated content |
A metadata extraction library designed for EXIF will fail on ComfyUI output — not with an error, but with silent data loss. It will read the standard EXIF fields (resolution, color space) and ignore the ComfyUI-specific text chunks that contain the generation parameters. The result is an image with “extracted metadata” that is missing everything that matters for AI asset management.
The most dangerous extraction failure is the silent one. The tool reports success, returns metadata, and the user assumes the extraction was complete. But the generation parameters — the prompt, the seed, the model, the workflow — were never read because the extractor did not know to look for them.
The Solution: Probing Extractors with Tool Detection
A robust extraction system uses a probe-first architecture: before attempting to parse metadata in any specific format, it probes the file to determine which tool produced it and which metadata format to expect.
Tool Detection
The first extraction step is tool identification. The system checks for signature metadata markers: a “workflow” text chunk indicates ComfyUI. A “parameters” text chunk with the A1111 formatting pattern indicates Stable Diffusion Web UI. An “invokeai_metadata” chunk indicates InvokeAI. The absence of any AI-specific text chunks, combined with characteristic EXIF patterns, may indicate DALL-E or Midjourney output. The detection step routes the file to the correct tool-specific parser.
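The detection logic above reduces to a keyword check over the file's text chunks. A minimal sketch, assuming the chunks have already been read into a dict; the returned labels are illustrative names, not a fixed taxonomy, and a real router would also verify the “parameters” chunk matches the A1111 formatting pattern before committing to that parser.

```python
def detect_tool(text_chunks: dict[str, str]) -> str:
    """Route a file to a parser based on signature text-chunk keywords."""
    if "workflow" in text_chunks or "prompt" in text_chunks:
        return "comfyui"
    if "parameters" in text_chunks:
        # A production check would also validate the A1111 text layout here.
        return "a1111"
    if "invokeai_metadata" in text_chunks or "Dream" in text_chunks:
        # "Dream" is the keyword used by older InvokeAI releases.
        return "invokeai"
    return "generic"  # fall back to EXIF/XMP/IPTC extraction
```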
Format-Specific Parsers
Each tool gets a dedicated parser that understands its specific metadata format:
- ComfyUI parser: Reads “workflow” and “prompt” text chunks. Validates both as JSON. Handles large payloads (no fixed buffer). Extracts node types, connections, widget values from the workflow blob. Extracts resolved parameters, seeds, model paths from the prompt blob. Handles custom node fields gracefully.
- A1111 parser: Reads “parameters” text chunk. Parses the structured text format: first line is positive prompt, “Negative prompt:” line is negative prompt, remaining lines are key-value parameters (Steps, Sampler, CFG scale, Seed, Model, etc.). Handles multi-line prompts and BREAK tokens.
- Midjourney parser: Operates on Discord message data rather than image file metadata. Extracts prompt text, parameter flags, job ID, variation/upscale relationships from message content and formatting.
- Generic parser: For images with no recognized AI-specific metadata, extracts standard EXIF, XMP, and IPTC data. Captures technical metadata (dimensions, color space, camera info if present) and flags the asset as having no AI generation metadata detected.
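To make the A1111 case concrete, here is a simplified sketch of parsing the “parameters” string: first line(s) are the positive prompt, a “Negative prompt:” line starts the negative prompt, and the final line carries comma-separated key-value settings. The function name is illustrative, and quoted values, BREAK tokens, and extension-specific fields are not handled.

```python
def parse_a1111_parameters(text: str) -> dict:
    """Split an A1111 'parameters' string into prompt, negative prompt,
    and key-value settings. Simplified: no quoted values or extensions."""
    lines = text.strip().split("\n")
    # The settings line is conventionally the last line ("Steps: 20, ...").
    settings_line = lines[-1] if ":" in lines[-1] else ""
    body = lines[:-1] if settings_line else lines

    prompt_lines, negative_lines, in_negative = [], [], False
    for line in body:
        if line.startswith("Negative prompt:"):
            in_negative = True
            negative_lines.append(line[len("Negative prompt:"):].strip())
        elif in_negative:
            negative_lines.append(line)  # multi-line negative prompt
        else:
            prompt_lines.append(line)    # multi-line positive prompt

    settings = {}
    for pair in settings_line.split(", "):
        key, _, value = pair.partition(": ")
        if key and value:
            settings[key] = value

    return {
        "prompt": "\n".join(prompt_lines).strip(),
        "negative_prompt": "\n".join(negative_lines).strip(),
        **settings,
    }
```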
Extraction Quality Reporting
Each extractor reports what it found and what it could not parse. The extraction result includes a quality assessment: full extraction (all expected fields parsed successfully), partial extraction (some fields parsed, some failed), or minimal extraction (only basic technical metadata). This quality signal flows into the normalization pipeline and eventually to the user, who can see which assets have rich metadata and which have gaps.
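One way to model the quality signal described above is a result object that carries both parsed fields and recorded failures, deriving the full/partial/minimal assessment from them. The type names and derivation rule here are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from enum import Enum

class Quality(Enum):
    FULL = "full"        # all expected fields parsed successfully
    PARTIAL = "partial"  # some fields parsed, some failed
    MINIMAL = "minimal"  # only basic technical metadata

@dataclass
class ExtractionResult:
    tool: str
    fields: dict = field(default_factory=dict)   # successfully parsed values
    errors: list = field(default_factory=list)   # what could not be parsed

    @property
    def quality(self) -> Quality:
        if self.fields and not self.errors:
            return Quality.FULL
        if self.fields:
            return Quality.PARTIAL
        return Quality.MINIMAL
```

Because the errors list travels with the result, the normalization pipeline can surface per-asset gaps to the user instead of collapsing them into a single pass/fail flag.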
Consequences
- Extractor maintenance burden: Each tool-specific parser must be maintained as tools evolve. When ComfyUI adds new node types or changes its JSON structure, the parser must be updated. When Midjourney changes its Discord message formatting, the parser must adapt. This is a continuous maintenance cost, but it is isolated to the extraction layer — downstream systems work with the normalized schema and are unaffected by extractor changes.
- Extensibility: New tools can be supported by adding a new detection signature and a new parser. The architecture is designed for this — the probe-first approach means the system gracefully handles unknown formats (falling back to the generic parser) while providing rich extraction for recognized tools.
- Testing complexity: Each parser needs its own test suite with real-world examples from its tool. Edge cases differ per tool: ComfyUI edge cases involve custom nodes and large workflows; A1111 edge cases involve multi-line prompts and extension-specific parameters; Midjourney edge cases involve message threading and variation chains. A comprehensive test suite requires a library of real metadata samples from each tool.
- Foundation for everything: Correct extraction is the prerequisite for every other capability. Search quality depends on extraction completeness. Reproducibility depends on extracting the right parameters. Lineage tracking depends on extracting relationships. Investing in extraction quality pays compound returns across the entire system.
Related Patterns
- The Normalization Pipeline consumes extractor output and translates it into a unified schema.
- The Two JSON Blobs details the specific extraction challenges for ComfyUI's dual metadata structure.
- Midjourney Metadata describes extraction from Discord messages rather than image files.
- The Two Metadata Problem provides the broader context for why tool-specific extraction is necessary.
