Conventional image similarity metrics are unreliable for scene text editing: the edited image intentionally differs from the source in its text content, and a ground-truth target image is usually unavailable. TAS instead evaluates appearance consistency in terms of color, font, and background.
TAS relies on a pretrained Text Style Encoder \(S\). Given a text image \(I\), \(S\) yields disentangled style conditions, namely texture features \(c_\text{tex}\) and spatial features \(c_\text{spa}\), and produces component-specific outputs through its task heads
\(F_\text{clr}\) (color), \(F_\text{fnt}\) (font), and \(F_\text{rmv}\) (background), with \(F_\text{seg}\) providing a text mask when needed.
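For concreteness, the sketch below shows one way such an encoder and its task heads could be wrapped in code. The class, method, and field names (TextStyleEncoder, StyleOutputs) are illustrative assumptions, not an interface defined here; only the heads \(F_\text{clr}\), \(F_\text{fnt}\), \(F_\text{rmv}\), \(F_\text{seg}\) and the conditions \(c_\text{tex}\), \(c_\text{spa}\) come from the text.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class StyleOutputs:
    """Hypothetical container for what S returns for one input image."""
    c_tex: np.ndarray   # texture condition c_tex
    c_spa: np.ndarray   # spatial condition c_spa
    i_clr: np.ndarray   # color output decoded by the head F_clr
    i_fnt: np.ndarray   # font output decoded by the head F_fnt
    i_bg: np.ndarray    # text-removed background decoded by F_rmv
    mask: np.ndarray    # text mask decoded by F_seg (used when needed)


class TextStyleEncoder:
    """Stub for the pretrained encoder S: backbone -> (c_tex, c_spa) -> task heads."""

    def __call__(self, image: np.ndarray) -> StyleOutputs:
        # A real implementation would run the pretrained backbone to obtain
        # (c_tex, c_spa) and decode each component through its task head.
        raise NotImplementedError("plug in the pretrained Text Style Encoder here")
```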
To compare two images \(I_A\) and \(I_B\), we pass both through \(S\) and obtain
the color outputs \(\tilde{i}^{A}_\text{clr}\), \(\tilde{i}^{B}_\text{clr}\),
the font outputs \(\tilde{i}^{A}_\text{fnt}\), \(\tilde{i}^{B}_\text{fnt}\),
and the text-removed backgrounds \(\tilde{i}^{A}_\text{bg}\), \(\tilde{i}^{B}_\text{bg}\).
From these pairs, TAS computes a color similarity \(s_\text{clr}\), a font similarity \(s_\text{fnt}\), and a background similarity \(s_\text{bg}\),
and reports their average as the final score, \(\text{TAS} = \tfrac{1}{3}\left(s_\text{clr} + s_\text{fnt} + s_\text{bg}\right)\).
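A minimal sketch of this scoring step follows. The section specifies only that the three similarities are averaged, so the per-component measure used below (cosine similarity over flattened outputs) is an illustrative stand-in, not the measure actually adopted.

```python
import numpy as np


def component_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two corresponding outputs of S.

    Cosine similarity over flattened arrays is an assumed stand-in;
    the text does not fix the per-component measure.
    """
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def tas_score(clr_a, clr_b, fnt_a, fnt_b, bg_a, bg_b) -> float:
    """TAS sketch: average the color, font, and background similarities."""
    s_clr = component_similarity(clr_a, clr_b)
    s_fnt = component_similarity(fnt_a, fnt_b)
    s_bg = component_similarity(bg_a, bg_b)
    return (s_clr + s_fnt + s_bg) / 3.0


# Toy usage with random stand-ins for the six outputs of S (H x W x 3 images):
rng = np.random.default_rng(0)
clr_a, clr_b, fnt_a, fnt_b, bg_a, bg_b = (rng.random((64, 256, 3)) for _ in range(6))
print(round(tas_score(clr_a, clr_b, fnt_a, fnt_b, bg_a, bg_b), 4))
```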