Process and generate multimedia content using the Google Gemini API. Capabilities include analyzing audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understanding images (captioning, object detection, OCR, visual Q&A, segmentation), processing videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extracting data from documents (PDF tables, forms, charts, diagrams, multi-page), and generating images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.
Use the same trend chart as on the home page to quickly judge whether this skill is still being actively downloaded and used.
---
name: ai-multimodal
description: Process and generate multimedia content using the Google Gemini API. Capabilities include analyzing audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understanding images (captioning, object detection, OCR, visual Q&A, segmentation), processing videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extracting data from documents (PDF tables, forms, charts, diagrams, multi-page), and generating images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.
license: MIT
allowed-tools:
  - Bash
  - Read
  - Write
  - Edit
---

# AI Multimodal Processing Skill

Process audio, images, videos, and documents, and generate images, using Google Gemini's multimodal API. The skill provides a unified interface for multimedia content understanding and generation.
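As a rough illustration of what a "unified interface" can look like at the request level, the sketch below builds a JSON body that pairs a text prompt with one inline media file. It is not taken from the skill itself: the endpoint pattern, the `inline_data` field layout, the model name `gemini-2.5-flash`, and the helper `build_request` are assumptions based on the public Gemini REST API and should be checked against the official documentation. Nothing is sent over the network.

```python
import base64
import json

# Assumed REST endpoint pattern for the Gemini API (verify against the official docs).
ENDPOINT = "https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent"


def build_request(prompt: str, media: bytes, mime_type: str) -> dict:
    """Hypothetical helper: build a request body with one text part and one inline media part."""
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {
                        "inline_data": {
                            "mime_type": mime_type,
                            # Inline media is sent as base64-encoded text.
                            "data": base64.b64encode(media).decode("ascii"),
                        }
                    },
                ]
            }
        ]
    }


# Placeholder bytes stand in for a real image file.
body = build_request("Describe this image.", b"\x89PNG placeholder bytes", "image/png")
url = ENDPOINT.format(model="gemini-2.5-flash")  # model name is an assumption
print(url)
print(json.dumps(body, indent=2))
```

The same body shape would apply to audio, video, or PDF input by changing `mime_type`; larger files would normally go through a separate upload step rather than inline data.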
## Core Capabilities

### Audio Processing
- Transcription with timestamps (up to 9.5 hours)
- Audio summarization and analysis
- Speech understanding and speaker identification
- Music and environmental sound analysis
- Text-to-speech generation with controllable voice

### Image Understanding
- Image captioning and description
- Object detection with bounding boxes (2.0+)
- Pixel-level segmentation (2.5+)
- Visual question answering
- Multi-image comparison (up to 3,600 images)
- OCR and text extraction

### Video Analysis
- Scene detection and summarization
- Video Q&A with temporal understanding
- Transcription with visual descriptions
- YouTube URL support
- Long video processing (up to 6 hours)
- Frame-level analysis

### Document Extraction
- Native PDF vision processing (up to 1,000 pages)
- Table and form extraction
- Chart and diagram analysis
- Multi-page document understanding
- Structured data output (JSON schema)
- Format conversion (PDF to HTML/JSON)

### Image Generation
- Text-to-image generation
- Image editing and modification
- Multi-image composition (up to 3 images)
- Iterative refinement
- Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
- Controllable style and quality

## Capability Matrix

| Task | Audio | Image | Video | Document | Generation |
|------|:-----:|:-----:|:-----:|:--------:|:----------:|
| Transcription | ✓ | - | ✓ | - | - |
| Summarization | ✓ | ✓ | ✓ | ✓ | - |
| Q&A | ✓ | ✓ | ✓ | ✓ | - |
| Object De
Preview truncated. Download the full skill package to view the complete file contents.
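The size limits quoted in the capability list (9.5 hours of audio, 6 hours of video, 1,000 PDF pages, 3,600 images) lend themselves to a simple pre-flight check before a file is submitted. The sketch below is illustrative only: `LIMITS` and `check_limits` are hypothetical names, the numbers are copied from the list above, and actual limits may vary by model and API version.

```python
# Limits as quoted in the capability list; treat them as assumptions, not guarantees.
LIMITS = {
    "audio_hours": 9.5,
    "video_hours": 6.0,
    "pdf_pages": 1000,
    "images_per_request": 3600,
}


def check_limits(**usage: float) -> list[str]:
    """Return the sorted names of usage values that exceed the quoted limits."""
    unknown = set(usage) - set(LIMITS)
    if unknown:
        raise KeyError(f"unknown limit name(s): {sorted(unknown)}")
    return sorted(name for name, value in usage.items() if value > LIMITS[name])


# A 2-hour recording is within the audio limit; a 1,200-page PDF is not.
print(check_limits(audio_hours=2, pdf_pages=1200))
```

Inputs that exceed a limit would need to be split (for example, by chapter or page range) before processing.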
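1. First, check whether the skill matches your task, runtime environment, and dependency constraints.
2. Then use the download trend for the last 7 days to decide whether to install it directly or download the full package and review it first.
3. If you need programmatic integration, see Docs for the API and the OpenAPI description.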