EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

Better VLM initialization for VLAs via mid-training on a VLA-aligned data distribution

Yiyang Du1, Zhanqiu Guo1, Xin Ye2, Liu Ren2, Chenyan Xiong1
1Language Technologies Institute, Carnegie Mellon University
2Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI)
{yiyangd, zhanqiug, cx}@cs.cmu.edu   {xin.ye3, liu.ren}@us.bosch.com

TL;DR — Most VLAs initialize from off-the-shelf VLMs that aren't tailored to embodied domains. EmbodiedMidtrain bridges this gap with a lightweight data engine that selects VLM samples closest to the VLA distribution for mid-training, consistently boosting VLA performance across benchmarks and backbones.

[Figure 1 panels: (a) VLM-VLA data distribution gap; (b) VLM data distribution shift; (c) VLM mid-training; (d) downstream VLA gains.]
Figure 1: Overview of EmbodiedMidtrain. We analyze the data distribution gap between VLMs and VLAs, and select VLM samples with higher proximity to the VLA domain for mid-training, yielding a stronger initialization for downstream VLA fine-tuning.

Abstract

Vision-Language-Action Models (VLAs) inherit their visual and linguistic capabilities from Vision-Language Models (VLMs), yet most VLAs are built from off-the-shelf VLMs that are not adapted to the embodied domain, limiting their downstream performance. In this work, we propose EmbodiedMidtrain to bridge the gap between VLMs and VLAs. We first characterize the data distribution gap between them, showing that VLA data occupy compact regions that are largely separated from the broader VLM distribution, while the degree of alignment varies substantially both across and within VLM data sources. We then build a mid-training data engine that leverages a lightweight learnable proximity estimator to select the most VLA-aligned candidates from a large VLM pool, and mid-trains the VLM on this curated mixture before downstream VLA fine-tuning. Experiments on three robot manipulation benchmarks show that mid-training consistently improves performance across different VLM backbones, achieving results competitive with expert VLAs and with off-the-shelf VLMs of larger model scale and training budget. Further analysis reveals that mid-training provides a stronger initialization for VLA fine-tuning, with gains emerging from the earliest steps and widening throughout training. Moreover, the data engine captures both dataset-level and sample-level alignment signals, favoring spatial reasoning over text-centric tasks while preserving the diversity of the VLM data.

Key Contributions

Data Distribution Gap between VLMs and VLAs

Since most VLAs inherit their visual and linguistic representations from VLM pretraining, the quality of this initialization is fundamentally shaped by the data the VLM was trained on. We analyze VLM and VLA data in a shared representation space using the last hidden states of a VLM as the feature representation $h(\cdot)$ for each data sample, and quantify the distance between each pair of datasets with Maximum Mean Discrepancy (MMD):

$\operatorname{MMD}^2(P,Q) = \mathbb{E}_{x,x' \sim P}[k(x,x')] - 2\,\mathbb{E}_{\substack{x \sim P \\ y \sim Q}}[k(x,y)] + \mathbb{E}_{y,y' \sim Q}[k(y,y')]$

where $k(x,y) = \exp(-\|h(x) - h(y)\|_2^2 / 2\sigma^2)$ is a Gaussian RBF kernel with bandwidth $\sigma$ set via the median heuristic.
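
To make the metric concrete, the following is a minimal PyTorch sketch of this computation on precomputed features $h(x)$. The helper name `mmd2_rbf` and the choice of the biased V-statistic estimator are our assumptions; the paper specifies only the RBF kernel and the median-heuristic bandwidth.

```python
import torch

def mmd2_rbf(feats_p: torch.Tensor, feats_q: torch.Tensor) -> torch.Tensor:
    """Squared MMD between two feature sets of shape [n, d] and [m, d],
    using a Gaussian RBF kernel with median-heuristic bandwidth."""
    z = torch.cat([feats_p, feats_q], dim=0)
    pdist = torch.cdist(z, z)              # all pairwise Euclidean distances
    sigma = pdist[pdist > 0].median()      # median heuristic for the bandwidth
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma**2))
    # Biased V-statistic estimate: E_P[k] - 2 E_{P,Q}[k] + E_Q[k]
    return (k(feats_p, feats_p).mean()
            - 2 * k(feats_p, feats_q).mean()
            + k(feats_q, feats_q).mean())
```

With per-dataset features stacked row-wise, calling `mmd2_rbf` on every dataset pair produces the (unnormalized) distance matrix visualized in Figure 2(a).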

[Figure 2 panels: (a) pairwise normalized MMD distance matrix between VLM and VLA datasets; (b) t-SNE visualization of visual feature distributions across VLM and VLA datasets.]
Figure 2: Distribution analysis of VLM and VLA datasets. (a) Pairwise MMD distances quantify the distribution gap, with cross-group distances larger than within-group distances. (b) VLA datasets form compact, concentrated clusters that are separated from the broader VLM distributions.

Key Findings

The distribution gap between VLM and VLA data is large and clear. The distributions of VLM and VLA data are largely separated, with only a few near neighbors. MMD distances are generally smaller within the VLM group and within the VLA group than across the two groups, quantitatively confirming a clear distributional mismatch. VLA datasets form compact clusters that are mostly detached from the main regions occupied by VLM datasets, with only a small subset of VLM data lying nearby.

The distribution gap exhibits substantial internal heterogeneity. Although VLM and VLA data are separated overall, their mismatch is not uniform within each dataset. Some VLM sources lie noticeably closer to VLA domains than others, with a few local regions exhibiting clear cross-domain proximity despite the global separation. This suggests that the gap is better characterized as a spectrum of alignment rather than a binary distinction, motivating sample-level selection within each dataset.

These insights suggest that enhancing VLMs for embodied tasks requires reshaping the training data distribution toward the VLA domain — not through coarse dataset-level mixture adjustment, but through fine-grained sample-wise selection.

Data Engine for EmbodiedMidtrain

We propose a mid-training data engine that selects VLM samples whose distribution best aligns with the VLA domain. The data mixture spans both general and embodied-oriented VLM sources to preserve diversity, and selection operates at the sample level rather than the dataset level.

Proximity-Based Data Selection

Let $\mathcal{D}_{\mathrm{VLM}}$ and $\mathcal{D}_{\mathrm{VLA}}$ denote the candidate VLM pool and target VLA corpus, with densities $p_{\mathrm{VLM}}$ and $p_{\mathrm{VLA}}$ over a shared representation space. Our goal is to select a size-$K$ subset whose distribution best aligns with VLA data:

$\mathcal{D}_{\mathrm{VLM}}^{*} = \underset{\mathcal{D}' \subseteq \mathcal{D}_{\mathrm{VLM}},\; |\mathcal{D}'| = K}{\operatorname{argmin}} \; d(P_{\mathcal{D}'},\; P_{\mathrm{VLA}})$

where $d$ is a distributional divergence. Solving this exactly is intractable, so we relax it to per-sample scoring and top-$K$ selection: $\mathcal{D}_{\mathrm{VLM}}^{*} = \operatorname{top\text{-}K}_{x_i \in \mathcal{D}_{\mathrm{VLM}}} s(x_i)$. The key question is how to define the scoring function $s$. A natural choice is the density ratio $p_{\mathrm{VLA}}(x)/p_{\mathrm{VLM}}(x)$, but estimating it directly in high-dimensional feature spaces is difficult. We instead leverage a classical result from density ratio estimation: a binary classifier trained to distinguish the two distributions recovers, at optimality (under balanced class sampling), a monotone transform of this ratio:

$s^{*}(x) = \dfrac{p_{\mathrm{VLA}}(x)}{p_{\mathrm{VLA}}(x) + p_{\mathrm{VLM}}(x)}$

Since $s^{*}$ is monotonically increasing in the density ratio, ranking by the classifier output is equivalent to ranking by the density ratio.
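
Spelling out this step: writing $r(x) = p_{\mathrm{VLA}}(x)/p_{\mathrm{VLM}}(x)$ and dividing the numerator and denominator of $s^{*}$ by $p_{\mathrm{VLM}}(x)$ gives

$s^{*}(x) = \dfrac{r(x)}{1 + r(x)}, \qquad \dfrac{d}{dr}\left(\dfrac{r}{1+r}\right) = \dfrac{1}{(1+r)^{2}} > 0$

so a top-$K$ cut on the classifier output selects exactly the top-$K$ samples by density ratio.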

Proximity Estimator

We instantiate this as a lightweight proximity estimator on frozen VLM features. The estimator applies a learnable scoring function $f$ on top of the frozen VLM's last hidden state $\phi(\mathbf{x})$, followed by a sigmoid:

$s(\mathbf{x}) = \sigma\!\big(f(\phi(\mathbf{x}))\big)$

We train with VLA samples as positives and VLM samples as negatives, using binary cross-entropy loss:

$\mathcal{L}_{\mathrm{cls}} = -\mathbb{E}_{y \sim \mathcal{D}_{\mathrm{VLA}}} [\log s(y)] - \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{VLM}}} [\log (1-s(x))]$

After training, we rank all candidate VLM samples by $s(x)$ and retain the top-$K$ to form $\mathcal{D}_{\mathrm{VLM}}^{*}$ for mid-training. This procedure turns a broad candidate pool into a more targeted corpus for embodied adaptation, preserving the useful diversity of VLM data while shifting the training distribution toward the VLA domain.
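
Below is a minimal sketch of the estimator and the resulting top-$K$ selection, assuming features $\phi(x)$ are precomputed from the frozen VLM and cached. The two-layer MLP head, its hidden width, the Adam optimizer, and the `FEAT_DIM` constant are illustrative assumptions; the paper specifies only a lightweight learnable $f$ followed by a sigmoid.

```python
import torch
import torch.nn as nn

FEAT_DIM = 2048  # hypothetical dimensionality of the frozen VLM's last hidden state

# Lightweight scoring function f over frozen features phi(x); s(x) = sigmoid(f(phi(x))).
scorer = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-3)

def train_step(vla_feats: torch.Tensor, vlm_feats: torch.Tensor) -> float:
    """One binary cross-entropy step: VLA samples are positives, VLM samples negatives."""
    logits = torch.cat([scorer(vla_feats), scorer(vlm_feats)]).squeeze(-1)
    labels = torch.cat([torch.ones(len(vla_feats)), torch.zeros(len(vlm_feats))])
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def select_top_k(candidate_feats: torch.Tensor, k: int) -> torch.Tensor:
    """Return indices of the top-K candidates by s(x); since the sigmoid is
    monotone, ranking raw logits f(phi(x)) gives the same ordering."""
    logits = scorer(candidate_feats).squeeze(-1)
    return torch.topk(logits, k).indices
```

Ranking by raw logits rather than probabilities is numerically cleaner and, by the monotonicity argument above, equivalent to ranking by the density ratio.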

Main Results

We evaluate on three simulated manipulation benchmarks: Calvin ABC-D, SimplerEnv Bridge, and LIBERO-10. Our mid-trained models achieve consistent and substantial improvements, competitive with expert VLAs and much larger off-the-shelf VLMs.

| Model | Size | # Samples Seen (Calvin / Simpler / Libero) | 1↑ | 2↑ | 3↑ | 4↑ | 5↑ | Avg. Len.↑ | Simpler↑ | Libero↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Expert VLA Baselines* | | | | | | | | | | |
| OpenVLA (Llama-2) | 7.7B | 7.7M / 25.6M / 25.6M | 0.792 | 0.644 | 0.499 | 0.368 | 0.245 | 2.548 | 4.2 | 53.7 |
| π0 (Paligemma-1) | 3.1B | 7.7M / 25.6M / 25.6M | 0.896 | 0.785 | 0.686 | 0.610 | 0.532 | 3.509 | 60.4 | 46.0 |
| Off-the-shelf VLM Baselines* | | | | | | | | | | |
| Qwen2.5VL-3B | 3.8B | 7.7M / 25.6M / 25.6M | 0.922 | 0.842 | 0.766 | 0.700 | 0.626 | 3.856 | 48.0 | 43.0 |
| Qwen2.5VL-7B | 8.3B | 7.7M / 25.6M / 25.6M | 0.935 | 0.864 | 0.807 | 0.758 | 0.693 | 4.057 | 46.8 | 45.0 |
| Qwen3VL-2B | 2.1B | 7.7M / 25.6M / 25.6M | 0.943 | 0.882 | 0.831 | 0.776 | 0.710 | 4.142 | 49.0 | 55.8 |
| Qwen3VL-4B | 4.4B | 7.7M / 25.6M / 25.6M | 0.933 | 0.857 | 0.790 | 0.719 | 0.644 | 3.943 | 56.3 | 44.4 |
| Qwen3VL-8B | 8.8B | 7.7M / 25.6M / 25.6M | 0.940 | 0.868 | 0.797 | 0.746 | 0.684 | 4.035 | 58.3 | 46.2 |
| Qwen3VL-30B-A3B | 30B-A3B | 7.7M / 25.6M / 25.6M | 0.939 | 0.877 | 0.820 | 0.757 | 0.682 | 4.075 | 44.8 | 46.8 |
| Paligemma-1 | 2.9B | 7.7M / 25.6M / 25.6M | 0.914 | 0.813 | 0.692 | 0.599 | 0.488 | 3.506 | 55.3 | 44.2 |
| Paligemma-2 | 3.0B | 7.7M / 25.6M / 25.6M | 0.901 | 0.775 | 0.669 | 0.575 | 0.486 | 3.406 | 57.3 | 46.2 |
| KosMos-2 | 1.7B | 7.7M / 25.6M / 25.6M | 0.878 | 0.721 | 0.591 | 0.498 | 0.408 | 3.096 | 60.4 | 55.0 |
| VLM with EmbodiedMidtrain (Ours) | | | | | | | | | | |
| InternVL3.5-1B | 1.1B | 1.0M / 4.1M / 4.1M | 0.909 | 0.754 | 0.606 | 0.498 | 0.406 | 3.173 | 36.5 | 39.0 |
| + EmbodiedMidtrain | 1.1B | 1.0M / 4.1M / 4.1M | 0.935 | 0.838 | 0.737 | 0.653 | 0.551 | 3.714 | 56.3 | 54.2 |
| Qwen3VL-2B | 2.1B | 1.0M / 4.1M / 4.1M | 0.887 | 0.747 | 0.612 | 0.527 | 0.432 | 3.205 | 38.5 | 33.8 |
| + EmbodiedMidtrain | 2.1B | 1.0M / 4.1M / 4.1M | 0.922 | 0.808 | 0.700 | 0.623 | 0.533 | 3.584 | 45.8 | 40.2 |

Table 1: Main results across Calvin ABC-D, SimplerEnv-Bridge, and Libero-10. Columns 1–5 report the Calvin success rate of completing at least that many tasks in a row; Avg. Len.↑ is the average number of consecutive tasks completed. # Samples Seen is reported as the training budgets on Calvin / SimplerEnv / Libero. * Results for Expert VLA Baselines and Off-the-shelf VLM Baselines are reproduced and reported by VLM4VLA.

Analysis

Training Dynamics

Figure 3: Training dynamics across VLA tasks for VLMs with and without EmbodiedMidtrain. The mid-trained model already achieves higher performance in the early stages of fine-tuning, and the gap widens over time — indicating that mid-training provides a fundamentally better initialization.

The mid-trained model outperforms its off-the-shelf counterpart from the earliest stages of fine-tuning, and the gap widens over the course of training, providing direct evidence that proximity-based mid-training yields a better initialization for VLA learning.

Analysis of Selected VLM Data

[Figure 4 panels:
(a) Per-dataset proximity score distributions.
(b) High-scoring sample. Q: "You are standing at the point marked by the coordinate point at point (0.878, 0.780). Which object is directly in front of you?" A: "The white matte truck at lower right." Q: "Locate a point on the yellow metallic crane at upper right. (format instructions...)" A: "[(0.976, 0.244)]"
(c) Low-scoring sample. Q: "Who wrote this book?" A: "Charles P. McKeague." Q: "What is the title of this book?" A: "Trigonometry."]
Figure 4: Analysis of proximity-based data selection. (a) Distribution of proximity scores across VLM data sources. (b) A high-scoring sample from RefSpatial requiring spatial grounding and reasoning. (c) A low-scoring sample: a book cover with text-only VQA.

While all datasets concentrate in the low-to-moderate score range, the distribution shapes vary noticeably across datasets. Among them, RefSpatial achieves the highest average scores while VCR receives the lowest, indicating that the estimator assigns clear dataset-level preferences. At the same time, the within-dataset score spread shows that the estimator also performs fine-grained sample-level selection, retaining only the most VLA-aligned samples even from high-scoring datasets.

Ablation Studies

We ablate two central design choices: the advantage of proximity-based selection over random sampling, and the effectiveness of different proximity measurements, in both cases mid-training the InternVL3.5-1B backbone.

| Setting | Calvin (Avg. Len.)↑ | Simpler↑ | Libero↑ |
|---|---|---|---|
| Random Selection | 3.398 | 43.8 | 48.4 |
| Proximity Measurements | | | |
| Feat.-space Avg. Dist. | 3.126 | 53.1 | 51.2 |
| VLA-cond. Perplexity | 3.159 | 55.2 | 48.0 |
| Delta Perplexity | 1.527 | 39.6 | 54.2 |
| Learned Estimator (Ours) | 3.714 | 56.3 | 54.2 |

Table 2: Ablation results for random selection and different proximity measurements when mid-training the InternVL3.5-1B backbone. The learned proximity estimator matches or outperforms every alternative on each benchmark.

Both random sampling and the three hand-crafted alternatives (feature-space average distance, VLA-conditioned perplexity, and delta perplexity) fall short of our learned estimator: none surpasses it on any benchmark, and each trails clearly on at least two. This confirms that mid-training's gains come from identifying VLA-aligned samples rather than from additional data alone, and that a learned proximity signal captures this alignment more robustly than heuristic metrics.