Siqi Ouyang (欧阳思琦)

Hi, I’m currently a PhD student at the Language Technologies Institute in the School of Computer Science at Carnegie Mellon University, advised by Prof. Lei Li. Before coming to CMU, I spent two years as a PhD student in the Computer Science Department at UC Santa Barbara, also with Lei. Before my PhD, I received my B.Eng. from the Institute for Interdisciplinary Information Sciences at Tsinghua University (a.k.a. Yao Class), advised by Prof. Yi Wu.

My research aims to build the foundation for real-time communication across languages. I study simultaneous translation with large language models, with the goal of enabling systems that can listen, understand, and translate as people speak, making multilingual communication as natural and immediate as conversation itself.

Office: GHC 6715, 4902 Forbes Ave, Pittsburgh, PA 15213

Email: siqiouya@andrew.cmu.edu

[Twitter/X] [GitHub] [LinkedIn] [Google Scholar]

News

Mar 11, 2026	Give a lecture in the Speech Technology for Conversational AI course at CMU.
Feb 26, 2026	Give a talk at Speech Lunch. Here are the slides.
Sep 05, 2025	Give a talk at the TTIC Summer Workshop on Foundations of Speech and Audio Foundation Models.
May 12, 2025	Intern at NVIDIA NeMo again, advised by Shuoyang Ding, Oleksii Hrinchuk, and Vitaly Lavrukhin.
Apr 30, 2025	Presented Anticipating Future with Large Language Model for Simultaneous Machine Translation orally at NAACL 2025.
Oct 03, 2024	Give a talk at Speech Lunch. Here are the slides.
May 13, 2024	Intern at NVIDIA NeMo, advised by Zhehuai Chen, Oleksii Hrinchuk, and Vitaly Lavrukhin.
Jan 16, 2024	Join the Language Technologies Institute at Carnegie Mellon University as a PhD student.

Selected Papers

Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech

Siqi Ouyang, Shuoyang Ding, Oleksii Hrinchuk, Vitaly Lavrukhin, Brian Yan, Boris Ginsburg, and Lei Li

In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics, Jul 2026

Oral

Simultaneous speech translation (SST) generates translations while receiving partial speech input. Recent advances show that large language models (LLMs) can substantially improve SST quality, but at the cost of high computational overhead. To reduce this cost, prior work reformulates SST as a multi-turn dialogue task, enabling full reuse of the LLM’s key-value (KV) cache and eliminating redundant feature recomputation. However, this approach relies on supervised fine-tuning (SFT) data in dialogue form, for which few human annotations exist, and existing synthesis methods cannot guarantee data quality. In this work, we propose a Hierarchical Policy Optimization (HPO) approach that post-trains models trained on imperfect SFT data. We introduce a hierarchical reward that balances translation quality and latency objectives. Experiments on English to Chinese/German/Japanese demonstrate improvements of over +7 COMET score and +1.25 MetricX score at a latency of 1.5 seconds. Comprehensive ablation studies further validate the effectiveness of different quality rewards, hierarchical reward formulations, and segmentation strategies.

RASST: Fast Cross-modal Retrieval-Augmented Simultaneous Speech Translation

Jiaxuan Luo, Siqi Ouyang, and Lei Li

Jan 2026

Simultaneous speech translation (SST) produces target text incrementally from partial speech input. Recent speech large language models (Speech LLMs) have substantially improved SST quality, yet they still struggle to correctly translate rare and domain-specific terminology. In this paper, we introduce RASST, which integrates cross-modal retrieval into the SST pipeline using a lightweight speech-text retriever and sliding-window retrieval to provide terminology hints and enhance translation accuracy. Experiments demonstrate improvements of up to 16% in terminology translation accuracy and up to 3 BLEU points in overall quality across three language directions.

InfiniSST: Simultaneous Translation of Unbounded Speech with Large Language Model

Siqi Ouyang, Xi Xu, and Lei Li

In Findings of the Association for Computational Linguistics: ACL 2025, Jul 2025

Simultaneous translation of unbounded streaming speech remains a challenging problem due to the need for effectively processing the historical speech context and past translations so that quality and latency, including computation overhead, can be balanced. Most prior works assume pre-segmented speech, limiting their real-world applicability. In this paper, we propose InfiniSST, a novel approach that formulates SST as a multi-turn dialogue task, enabling seamless translation of unbounded speech. We construct translation trajectories and robust segments from MuST-C with multi-latency augmentation during training and develop a key-value (KV) cache management strategy to facilitate efficient inference. Experiments on MuST-C En-Es, En-De, and En-Zh demonstrate that InfiniSST reduces computation-aware latency by 0.5 to 1 second while maintaining the same translation quality compared to baselines. Ablation studies further validate the contributions of our data construction and cache management strategy. Code is released at https://github.com/LeiLiLab/InfiniSST.

Anticipating Future with Large Language Model for Simultaneous Machine Translation

Siqi Ouyang, Oleksii Hrinchuk, Zhehuai Chen, Vitaly Lavrukhin, Jagadeesh Balam, Lei Li, and Boris Ginsburg

In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Apr 2025

Simultaneous machine translation (SMT) takes streaming input utterances and incrementally produces target text. Existing SMT methods only use the partial utterance that has already arrived at the input and the generated hypothesis. Motivated by human interpreters’ technique to forecast future words before hearing them, we propose Translation by Anticipating Future (TAF), a method to improve translation quality while retaining low latency. Its core idea is to use a large language model (LLM) to predict future source words and opportunistically translate without introducing too much risk. We evaluate our TAF and multiple baselines of SMT on four language directions. Experiments show that TAF achieves the best translation quality-latency trade-off and outperforms the baselines by up to 5 BLEU points at the same latency (three words).

CA*: Addressing Evaluation Pitfalls in Computation-Aware Latency for Simultaneous Speech Translation

Xi Xu, Wenda Xu, Siqi Ouyang, and Lei Li

In Findings of the Association for Computational Linguistics: NAACL 2025, Apr 2025

Simultaneous speech translation (SimulST) systems must balance translation quality with response time, making latency measurement crucial for evaluating their real-world performance. However, there has been a longstanding belief that current metrics yield unrealistically high latency measurements in unsegmented streaming settings. In this paper, we investigate this phenomenon, revealing its root cause in a fundamental misconception underlying existing latency evaluation approaches. We demonstrate that this issue affects not only streaming but also segment-level latency evaluation across different metrics. Furthermore, we propose a modification to correctly measure computation-aware latency for SimulST systems, addressing the limitations present in existing metrics.

FASST: Fast LLM-based Simultaneous Speech Translation

Siqi Ouyang, Xi Xu, Chinmay Dandekar, and Lei Li

Aug 2024

Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly. Existing methods either have high latency due to recomputation of input representations, or fall behind offline ST in translation quality. In this paper, we propose FASST, a fast large language model based method for streaming speech translation. We propose blockwise-causal speech encoding and a consistency mask, so that streaming speech input can be encoded incrementally without recomputation. Furthermore, we develop a two-stage training strategy to optimize FASST for simultaneous inference. We evaluate FASST and multiple strong prior models on the MuST-C dataset. Experimental results show that FASST achieves the best quality-latency trade-off. It outperforms the previous best model by an average of 1.5 BLEU under the same latency for English to Spanish translation.

CMU’s IWSLT 2024 Simultaneous Speech Translation System

Xi Xu, Siqi Ouyang, Brian Yan, Patrick Fernandes, William Chen, Lei Li, Graham Neubig, and Shinji Watanabe

In Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), Aug 2024

Top 1 Human Rating

This paper describes CMU’s submission to the IWSLT 2024 Simultaneous Speech Translation (SST) task for translating English speech to German text in a streaming manner. Our end-to-end speech-to-text (ST) system integrates the WavLM speech encoder, a modality adapter, and the Llama2-7B-Base model as the decoder. We employ a two-stage training approach: initially, we align the representations of speech and text, followed by full fine-tuning. Both stages are trained on MuST-C v2 data with cross-entropy loss. We adapt our offline ST model for SST using a simple fixed hold-n policy. Experiments show that our model obtains an offline BLEU score of 31.1 and a BLEU score of 29.5 under 2 seconds latency on MuST-C v2 tst-COMMON.

Selected Awards

Waibel Presidential Fellowship 2024

Tsinghua University Yao Recognition Prize 2021

Gold Medal, Chinese National Olympiad in Informatics 2016