ai-multimodal

Process and generate multimedia content using Google Gemini API. Capabilities include analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understand images (captioning, object detection, OCR, visual Q&A, segmentation), process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extract from documents (PDF tables, forms, charts, diagrams, multi-page), generate images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.

下载完整技能包查看接入说明

downloads

updated

2026/05/12

author

Admin

visibility

已公开

downloads.trend.tsxlast 7 days

当前技能最近 7 天下载趋势

用和首页一致的趋势图，快速判断这个 skill 最近是否还在被持续下载和使用。

7d total

quickstart.shinstall

安装命令

npx skills add ai-multimodal

使用建议

先看趋势和左侧结构化信息，再决定是直接下载、复制安装命令，还是继续阅读原始 `SKILL.md`。

打开 Docs 返回搜索页

overview.tsdecision summary

Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.

| Task | Audio | Image | Video | Document | Generation | |------|:-----:|:-----:|:-----:|:--------:|:----------:| | Transcription | ✓ | - | ✓ | - | - | | Summarization | ✓ | ✓ | ✓ | ✓ | - | | Q&A | ✓ | ✓ | ✓ | ✓ | - | | Object Detection | - | ✓ | ✓ | - | - | | Text Extraction | - | ✓ | - | ✓ | - | | Structured Output | ✓ | ✓ | ✓ | ✓ | - | | Creation | TTS | - | - | - | ✓ | | Timestamps | ✓ | - | ✓ | - | - | | Segmentation | - | ✓ | - | - | - |

SKILL.md previewcollapsible

name

ai-multimodal

description

---
name: ai-multimodal
description: Process and generate multimedia content using Google Gemini API. Capabilities include analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understand images (captioning, object detection, OCR, visual Q&A, segmentation), process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extract from documents (PDF tables, forms, charts, diagrams, multi-page), generate images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.
license: MIT
allowed-tools:
  - Bash
  - Read
  - Write
  - Edit
---

# AI Multimodal Processing Skill

Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.

## Core Capabilities

### Audio Processing
- Transcription with timestamps (up to 9.5 hours)
- Audio summarization and analysis
- Speech understanding and speaker identification
- Music and environmental sound analysis
- Text-to-speech generation with controllable voice

### Image Understanding
- Image captioning and description
- Object detection with bounding boxes (2.0+)
- Pixel-level segmentation (2.5+)
- Visual question answering
- Multi-image comparison (up to 3,600 images)
- OCR and text extraction

### Video Analysis
- Scene detection and summarization
- Video Q&A with temporal understanding
- Transcription with visual descriptions
- YouTube URL support
- Long video processing (up to 6 hours)
- Frame-level analysis

### Document Extraction
- Native PDF vision processing (up to 1,000 pages)
- Table and form extraction
- Chart and diagram analysis
- Multi-page document understanding
- Structured data output (JSON schema)
- Format conversion (PDF to HTML/JSON)

### Image Generation
- Text-to-image generation
- Image editing and modification
- Multi-image composition (up to 3 images)
- Iterative refinement
- Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
- Controllable style and quality

## Capability Matrix

| Task | Audio | Image | Video | Document | Generation |
|------|:-----:|:-----:|:-----:|:--------:|:----------:|
| Transcription | ✓ | - | ✓ | - | - |
| Summarization | ✓ | ✓ | ✓ | ✓ | - |
| Q&A | ✓ | ✓ | ✓ | ✓ | - |
| Object De

预览已截断。下载完整技能包可查看全部文件内容。

next-steps.mdrecommended flow

1. 先判断它是否匹配你的任务、运行环境和依赖边界。

2. 再结合最近 7 天下载趋势，决定是直接安装还是先下载完整包审阅。

3. 需要程序化集成时，再去 Docs 查看 API 和 OpenAPI 描述。

related.tssame category

frontend-skill

Use when the task asks for a visually strong landing page, website, app, prototype, demo, or game UI. This skill enforces restrained composition, image-led hierarchy, cohesive content structure, and tasteful motion while avoiding generic cards, weak branding, and UI clutter.

★ 330

超能模式

超能模式：作为主 skill，超能模式负责理解复杂需求、维护统一任务状态并统筹整条执行链路，它会把任务分别交给 “复杂任务分处理器”、"深度网页搜索"、"多内容生成器 "等子 skill 协同完成，适合多步骤、多交付物、研究+生成一体化的复杂任务场景。

★ 275