As needs for voice transcription and subtitle generation expand, Alibaba Cloud's Qwen team offers "Qwen3-ASR," an automatic speech recognition (ASR) service that combines multilingual capability, high accuracy, and noise robustness. Delivered as an API, it supports singing (lyrics) and dialects, along with practical features such as streaming output and context injection. Pricing is per second, with a total free tier of 36,000 seconds (10 hours) during the 90 days after account activation.
This article summarizes Qwen3-ASR's overview, key features, pricing, usage, application areas, and implementation considerations.
Table of Contents
- Qwen3-ASR Overview
- Key Features Supporting Speech Recognition
- Tools and Pricing Structure
- API and SDK Usage
- Main Business Application Areas
- Requirements to Verify Before Implementation
- Usage Considerations and Risks
- Summary
Qwen3-ASR Overview

Source: https://qwen.ai/blog?id=824c40353ea019861a636650c948eb8438ea5cf2&from=home.latest-research-list
Qwen3-ASR is an audio-to-text conversion service built on Qwen's multimodal platform model. It is designed to cover multiple languages, dialects, accents, and even singing transcription with a single model.
It is delivered as a cloud API (HTTP/SDK) and supports streaming output, punctuation insertion, and ITN (inverse text normalization of numbers and dates). Supported languages include Chinese (Mandarin, Sichuan dialect, Minnan, Wu, Cantonese) plus English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, and Spanish. Automatic language identification is also available.
This “one-size-fits-all” design contributes to operational simplicity and faster deployment.
Key Features Supporting Speech Recognition
The core value is handling “multilingual, singing, noise resistance, real-time, terminology adaptation” comprehensively through a single API.
Feature Summary (Implementation Parameter Quick Reference)
| Feature | Key Settings / Returns | Representative Use Cases |
|---|---|---|
| Multilingual/dialect support (11 languages + Chinese dialects) | Explicit language specification / enable_lid: true (auto language ID, returns language annotation) | International conference, multilingual content subtitles |
| Singing (lyrics) support | Standard “Singing recognition” support | Song/BGM-included video transcription |
| Noise/non-human voice resistance | Noise rejection / non-human voice filter | Call logs, outdoor recording, distant microphone |
| Context injection (contextual enhancement) | Up to 10,000 tokens of reference text via the system message | Proper-noun and industry-term adaptation |
| Streaming output | Sequential partial result returns | Live subtitles, interactive UI |
| ITN & punctuation | enable_itn: true (ITN for Chinese/English), punctuation estimation | Readable transcript output |
| Input/output specs | 16kHz mono, 3min/10MB per request, major file formats supported | Design prerequisites and split strategy |
Based on the table above, the following sections walk through each feature's key points and implementation considerations.
- Multilingual and Singing Support
A single model covers Chinese (Mandarin, Sichuan, Minnan, Wu, Cantonese) plus English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, and Spanish. Specify the language explicitly when it is known, and enable enable_lid: true for automatic identification in mixed-language scenarios; "explicit specification for the expected language, auto-identification for the unknown" is a stable design.
Singing recognition is also provided as standard, handling vocal tracks, rap, and voices over BGM in the same workflow, which makes it directly applicable to video editing and streaming subtitles.
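As a rough sketch of this "explicit specification first, auto-identification as fallback" policy, the helper below builds asr_options accordingly. The language and enable_lid keys follow the pattern of the official examples but should be verified against the current documentation; everything else is a placeholder.

```python
# Sketch: build asr_options depending on whether the audio language is known.
# The "language"/"enable_lid" keys follow the official examples; verify in the docs.
from typing import Optional

def pick_asr_options(expected_language: Optional[str]) -> dict:
    """Explicit language specification when known, automatic identification otherwise."""
    if expected_language:
        return {
            "language": expected_language,                      # e.g. "ja", "en", "zh"
            "enable_itn": expected_language in ("zh", "en"),    # ITN covers Chinese/English
        }
    return {"enable_lid": True, "enable_itn": True}             # mixed or unknown languages
```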
- Context Injection and Single-Turn Design
The system message accepts up to 10,000 tokens of reference text. It flexibly handles lists of names, product names, and industry terms as well as paragraph text, offering more freedom than hotword dictionaries. The model is single-turn and retains no conversation history, so the necessary terms and background should be injected as text with every request.
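For example, a glossary can be flattened into the system message as in the sketch below; the term list and audio URL are placeholders, and the message shape matches the SDK example later in this article.

```python
# Sketch: inject domain terms as context. Because the model is single-turn,
# the same context must be sent with every request.
terms = ["Qwen3-ASR", "DashScope", "Model Studio", "ITN"]   # hypothetical glossary
context_text = "Glossary of terms that may appear: " + ", ".join(terms)

messages = [
    {"role": "system", "content": [{"text": context_text}]},
    {"role": "user", "content": [{"audio": "https://example.com/meeting.mp3"}]},
]
```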
- Streaming Output and Implementation Ease
The API and the Python/Java SDKs support streaming (sequential partial output). Results can be reflected in subtitles or a UI without waiting for the final transcript, which suits meeting streams and live use. The official documentation provides invocation examples and response structures, making integration straightforward.
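A minimal consumption pattern might look like the sketch below, assuming the Python SDK's stream=True option returns partial responses incrementally; confirm parameter support and the exact response shape in the official documentation. The audio URL is a placeholder.

```python
import os
import dashscope

dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"

# Assumed streaming call: stream=True yields partial responses as they arrive.
responses = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=[{"role": "user", "content": [{"audio": "https://example.com/live.mp3"}]}],
    result_format="message",
    stream=True,
    asr_options={"enable_lid": True},
)
for partial in responses:
    print(partial)   # push each partial transcript to subtitles/UI as it arrives
```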
- Noise and Non-Human Voice Resistance and Text Formatting
Noise rejection and a non-human-voice filter suppress quality degradation caused by background noise, low-quality microphones, and distant recording, which is effective for call logs and outdoor recordings.
In addition, ITN (inverse text normalization, supported for Chinese and English) converts numbers and dates into natural notation, and punctuation estimation automatically produces readable transcripts. For strict formatting in other languages, combining the output with post-processing rules gives more stable results.
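As an illustration of such post-processing, the sketch below applies a simple, hypothetical rule for a language outside ITN coverage (normalizing full-width digits and stray whitespace in Japanese output); real rules should be tailored to your own formatting requirements.

```python
import re

# Hypothetical post-processing rule for languages not covered by ITN.
FULLWIDTH_DIGITS = str.maketrans("０１２３４５６７８９", "0123456789")

def normalize(transcript: str) -> str:
    text = transcript.translate(FULLWIDTH_DIGITS)   # full-width -> half-width digits
    return re.sub(r"\s+", " ", text).strip()        # collapse stray whitespace

print(normalize("会議は ２０２５年 ９月 ８日に 開催"))   # -> "会議は 2025年 9月 8日に 開催"
```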
- Input/Output Specs and Operational Design
Input is assumed to be 16 kHz mono, with a limit of 3 minutes / 10 MB per request. Major audio and video formats are supported, but long or large files require a split-based design. In practice, extract the audio track from video beforehand.
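For example, a 16 kHz mono track can be extracted from a video with ffmpeg before upload, as in the sketch below (ffmpeg must be on PATH; file names are placeholders).

```python
import subprocess

# Extract a 16 kHz mono audio track from a video file using ffmpeg.
subprocess.run(
    [
        "ffmpeg", "-i", "input_video.mp4",
        "-vn",            # drop the video stream
        "-ac", "1",       # mono
        "-ar", "16000",   # 16 kHz sampling rate
        "extracted_audio.wav",
    ],
    check=True,
)
```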
No online trial UI is provided, so API-based usage is standard (a demo is available on Hugging Face Spaces). The official CLI handles long-media splitting, parallel processing, and auto-resampling end to end.
In short, the strength lies in covering challenging real-world requirements such as language mixing, BGM, and noise through a single API with minimal parameters.
Tools and Pricing Structure
Qwen publishes an official CLI tool and demos alongside the API. Pricing is per second, with rates varying by region.
Pricing (Official Documentation Published Values)
| Region | Model | Rate (per second) | Free Tier |
|---|---|---|---|
| Singapore | qwen3-asr-flash (as of 2025-09-08) | $0.000035 | 36,000 sec (10 hours) *90 days from activation |
| Beijing | qwen3-asr-flash (as of 2025-09-08) | $0.000032 | Not listed |
The rates actually applied may vary with account settings, exchange rates, taxes, and so on, so verify the latest values in the management console and documentation before use.
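As a rough back-of-the-envelope estimate using the Singapore rate from the table (actual billing depends on your account, taxes, and current rates):

```python
# Rough cost estimate at the published Singapore rate.
RATE_PER_SECOND = 0.000035      # USD per second, qwen3-asr-flash (Singapore)
FREE_TIER_SECONDS = 36_000      # 10 hours, within 90 days of activation

audio_seconds = 50 * 3600       # e.g. 50 hours of recordings
billable = max(0, audio_seconds - FREE_TIER_SECONDS)
print(f"Estimated cost: ${billable * RATE_PER_SECOND:.2f}")   # -> Estimated cost: $5.04
```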
Provided Tools (Official)
- Qwen3-ASR-Toolkit (Official CLI, MIT)
Automatically splits long files (VAD-based) and issues parallel API calls. Installable via pip install qwen3-asr-toolkit. Supports working around the 3-minute limit, specifying the thread count, and log output.
- Hugging Face Spaces Demo
Official space for browser-based audio upload trials.
API and SDK Usage
Implementation is simple: obtain an API key from Model Studio, then call the service via SDK or HTTP, specifying the audio file. Below is a minimal Python SDK example (endpoints vary by region).
```python
import os
import dashscope

# International (Singapore) endpoint; for Beijing use https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"

messages = [
    # Context injection (optional): proper nouns, terminology, background text
    {"role": "system", "content": [{"text": "Context text like proper nouns (optional)"}]},
    {"role": "user", "content": [{"audio": "https://example.com/audio.mp3"}]},
]

resp = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={"enable_lid": True, "enable_itn": True},
)
print(resp)
```

As shown above, you simply specify the audio URL, an optional context text, and output settings (ITN, language detection, etc.) and make the call. Local files can be placed on a public URL or processed via the official CLI.
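To pull the transcript text out of the response above, something like the following sketch can be used; the exact field layout should be confirmed against the official response examples.

```python
# Sketch: extract the recognized text from the message-format response.
content = resp.output["choices"][0]["message"]["content"]
transcript = "".join(part.get("text", "") for part in content if isinstance(part, dict))
print(transcript)
```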
Main Business Application Areas
Given its feature set (multilingual, singing, noise resistance, streaming, context injection), it is suited to scenarios such as:
- Real-time subtitle display and meeting-minute transcription for meetings and webinars (streaming + automatic language detection)
- Transcription and subtitle generation for video and music content (lyric transcription and recognition of voices over BGM)
- Transcription, search, and analysis of voice-support call logs (robustness in noisy environments)
- e-Learning and training material creation (multilingual support + specialized term enhancement via context injection)
Combining language specification and context injection at the use-case design stage makes it easier to improve accuracy.
Requirements to Verify Before Implementation
Verifying API specs and limits before use avoids operational troubles.
- File Limits
Audio is limited to 3 minutes / 10 MB per request, assuming 16 kHz mono. Long audio is handled via the CLI's automatic VAD-based splitting (see the splitting sketch after this list).
- Model Characteristics
Qwen3-ASR is a single-turn model; it does not retain conversation history or support multi-turn prompts.
- Context Injection
Up to 10,000 tokens can be injected, from hotword lists to paragraph-length text.
- ITN Application Scope
ITN currently applies to Chinese and English only; design post-processing for other languages as needed.
- Endpoints
Endpoint URLs and API keys differ between the international (Singapore) and mainland China (Beijing) regions, so manage environment variables and billing configuration per region.
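The splitting sketch referenced under "File Limits" above: a fixed-length split with pydub (pip install pydub; ffmpeg required) that keeps each chunk under the 3-minute limit. The official CLI instead splits at speech boundaries using VAD, so treat this only as a simple illustration; file names are placeholders.

```python
from pydub import AudioSegment

CHUNK_MS = 170 * 1000   # stay safely under the 3-minute (180 s) per-request limit

audio = AudioSegment.from_file("long_meeting.mp4")
audio = audio.set_frame_rate(16000).set_channels(1)   # 16 kHz mono, per the input spec
for i, start in enumerate(range(0, len(audio), CHUNK_MS)):
    audio[start:start + CHUNK_MS].export(f"chunk_{i:03d}.wav", format="wav")
```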
Usage Considerations and Risks
Understanding the considerations that follow from the specification helps ensure both quality and compliance.
No online trial UI is provided, so API-based invocation is standard (a demo is available on Hugging Face Spaces; verify via scripts or the CLI). Audio passed to the API should preferably be provided via public URLs, so give sufficient consideration to data rights, confidentiality, and consent. In addition, punctuation, number normalization (ITN), and proper-noun handling change the output depending on settings, so establishing a review and post-editing flow enables stable usage.
Summary
Qwen3-ASR is an ASR service that covers multiple languages, dialects, and singing with a single model, equipped with practical implementation features such as streaming and context injection. Pricing is per second, with a 36,000-second free tier during the 90 days after account activation. Before implementation, verify the 3-minute / 10 MB per-request limit, the single-turn model behavior, ITN coverage, and regional differences in endpoints and billing. First check operational feel and accuracy with the free tier and the official CLI/demo, then refine settings (language specification, context design, post-editing flow) to match production requirements for a smooth deployment.