STELLAR: Scene Text Editor for Low-Resource Languages and Real-World Data

1Pukyong National University, 2Tomocube Inc.
sod7050@pukyong.ac.kr, hsmin@tomocube.com, sc82.choi@pknu.ac.kr
[STELLAR teaser figure]

Abstract

Scene Text Editing (STE) is the task of modifying text content in an image while preserving its visual style, such as font, color, and background. While recent diffusion-based approaches have shown improvements in visual quality, key limitations remain: lack of support for low-resource languages, a domain gap between synthetic and real data, and the absence of appropriate metrics for evaluating text style preservation. To address these challenges, we propose STELLAR (Scene Text Editor for Low-resource LAnguages and Real-world data). STELLAR enables reliable multilingual editing through a language-adaptive glyph encoder and a multi-stage training strategy that first pre-trains on synthetic data and then fine-tunes on real images. We also construct a new dataset, STIPLAR (Scene Text Image Pairs of Low-resource lAnguages and Real-world data), for training and evaluation. Furthermore, we propose Text Appearance Similarity (TAS), a novel metric that assesses style preservation by independently measuring font, color, and background similarity, enabling robust evaluation even without ground truth. Experimental results demonstrate that STELLAR outperforms state-of-the-art models in visual consistency and recognition accuracy, achieving an average TAS improvement of 2.2% across languages over the baselines.

Key Contributions

  • 1️⃣ Enable accurate text editing in low-resource languages (Korean, Arabic, and Japanese).
  • 2️⃣ Develop an STE model that performs strongly on real-world images.
    [Dataset] STIPLAR: Scene Text Image Pairs of Low-resource lAnguages and Real-world data
    [Model] STELLAR: Scene Text Editor for Low-resource LAnguages and Real-world data
  • 3️⃣ Propose a new metric to evaluate style preservation in text images without GT.
    [Metric] TAS: Text Appearance Similarity

Framework

STELLAR guides a diffusion generator using two signals: glyph features for accurate text rendering and style features for preserving color, font, and background. The glyph features come from a language-adaptive glyph encoder trained with language-specific OCR recognizers, making it effective for scripts with different structural properties. The model is trained in two stages: synthetic pre-training followed by real-world fine-tuning on STIPLAR.
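At a high level, the two conditioning signals are combined into a single context that guides the generator. The sketch below is purely illustrative: the token counts, feature widths, random projection matrices, and concatenation scheme are assumptions for demonstration, not STELLAR's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_conditions(glyph_feats, style_feats, w_glyph, w_style):
    """Project glyph and style features to a shared width and stack them
    into one conditioning context for the generator (sketch only)."""
    return np.concatenate([glyph_feats @ w_glyph, style_feats @ w_style], axis=0)

# Hypothetical sizes: 16 glyph tokens and 8 style tokens of width 512,
# projected to a shared context width of 768.
glyph = rng.standard_normal((16, 512))
style = rng.standard_normal((8, 512))
ctx = fuse_conditions(glyph, style,
                      rng.standard_normal((512, 768)),
                      rng.standard_normal((512, 768)))
print(ctx.shape)  # (24, 768)
```

The point of the sketch is only that glyph and style information remain separate signals until they are jointly presented to the generator, which is what lets each be supervised independently (the glyph encoder by language-specific OCR recognizers, the style features by appearance reconstruction).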

STIPLAR Dataset

STIPLAR is a real-world paired scene text image dataset for Korean, Arabic, and Japanese, built to support real-world adaptation and evaluation. Pairs are constructed from real images through text detection and cropping, filtering, similarity-based pairing, and manual verification of matched style attributes. In STELLAR’s multi-stage training, STIPLAR is used for Stage 2 fine-tuning to adapt the model to real-world scenes. We also release representative samples to illustrate the diversity of real-world conditions.
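The similarity-based pairing step can be sketched as a greedy match over style embeddings of detected text crops. Everything below is a hypothetical illustration: the embedding source, the threshold value, and the greedy matching rule are assumptions, and in STIPLAR construction candidate pairs are additionally verified by hand.

```python
import numpy as np

def pair_crops(crop_feats, threshold=0.85):
    """Greedy similarity-based pairing of text crops (illustrative sketch).

    crop_feats: (N, D) array of L2-normalized style embeddings of detected
    text crops. Returns index pairs whose cosine similarity exceeds
    `threshold`; each crop is used in at most one pair.
    """
    sims = crop_feats @ crop_feats.T      # pairwise cosine similarities
    np.fill_diagonal(sims, -1.0)          # exclude self-pairs
    pairs, used = [], set()
    for i in range(len(crop_feats)):
        if i in used:
            continue
        j = int(np.argmax(sims[i]))       # best remaining candidate
        if sims[i, j] >= threshold and j not in used:
            pairs.append((i, j))
            used.update({i, j})
    return pairs
```

Pairs surviving this automatic filter would then go through the manual verification of matched style attributes described above.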

To the best of our knowledge, STIPLAR is the first real-world paired scene text image dataset covering three languages.

TAS Metric

Conventional image similarity metrics are unreliable for scene text editing because the edited text content differs from the source and a ground-truth target image is usually unavailable. TAS instead evaluates appearance consistency along three components: color, font, and background.

TAS relies on a pretrained Text Style Encoder \(S\). When a text image \(I\) is fed into \(S\), it yields disentangled style conditions: texture features \(c_\text{tex}\) and spatial features \(c_\text{spa}\). Component-specific outputs are then produced via its task heads \(F_\text{clr}\) (color), \(F_\text{fnt}\) (font), and \(F_\text{rmv}\) (background removal), with \(F_\text{seg}\) providing a text mask when needed.

To compare two images \(I_A\) and \(I_B\), we pass both through \(S\) and obtain \(\tilde{i}^{A}_\text{clr}\), \(\tilde{i}^{B}_\text{clr}\), \(\tilde{i}^{A}_\text{fnt}\), \(\tilde{i}^{B}_\text{fnt}\), and text-removed backgrounds \(\tilde{i}^{A}_\text{bg}\), \(\tilde{i}^{B}_\text{bg}\). TAS then computes color \(s_\text{clr}\), font \(s_\text{fnt}\), and background \(s_\text{bg}\) similarities from these outputs, and reports the final score as their average, \(\text{TAS} = (s_\text{clr} + s_\text{fnt} + s_\text{bg})/3\).
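Putting the pieces together, the metric can be sketched as below. The per-component similarity function is an assumption here (plain cosine similarity, chosen for illustration); only the three components and their averaging follow the description above.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two flattened component outputs."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def tas(outs_a, outs_b):
    """Text Appearance Similarity (illustrative sketch).

    outs_a / outs_b: dicts holding the style encoder's component outputs
    for images I_A and I_B, keyed 'clr', 'fnt', and 'bg'. Each component
    is scored independently and the three scores are averaged.
    """
    s_clr = cosine(outs_a['clr'], outs_b['clr'])
    s_fnt = cosine(outs_a['fnt'], outs_b['fnt'])
    s_bg = cosine(outs_a['bg'], outs_b['bg'])
    return (s_clr + s_fnt + s_bg) / 3.0
```

Because each component is compared through the encoder's task-head outputs rather than raw pixels, the score stays meaningful even though the rendered characters in the two images differ.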

[Text Style Encoder]

[Examples of TAS Component Similarities]

Results on Real-world STE

We evaluate STELLAR on real-world scene text editing and compare it with prior methods in terms of visual fidelity and style preservation. Qualitative results show that STELLAR better maintains key appearance factors such as text color, font style, and background consistency while producing cleaner and more coherent edited text. These examples highlight STELLAR’s ability to perform realistic edits in challenging real-world conditions.

Additional Editing Results

BibTeX

@article{seo2025stellar,
  title={STELLAR: Scene Text Editor for Low-Resource Languages and Real-World Data},
  author={Seo, Yongdeuk and Min, Hyun-seok and Choi, Sungchul},
  journal={arXiv preprint arXiv:2511.09977},
  year={2025}
}