Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models (2025)

Chenrui Tie∗1  Shengxiang Sun∗2  Jinxuan Zhu1  Yiwei Liu4  Jingxiang Guo1
Yue Hu5  Haonan Chen1  Junting Chen1  Ruihai Wu3  Lin Shao1
1National University of Singapore  
2University of Toronto  3Peking University  4Sichuan University  5Zhejiang University

Abstract

Humans possess an extraordinary ability to understand and execute complex manipulation tasks by interpreting abstract instruction manuals. For robots, however, this capability remains a substantial challenge, as they cannot interpret abstract instructions and translate them into executable actions. In this paper, we present Manual2Skill, a novel framework that enables robots to perform complex assembly tasks guided by high-level manual instructions. Our approach leverages a Vision-Language Model (VLM) to extract structured information from instructional images and then uses this information to construct hierarchical assembly graphs. These graphs represent parts, subassemblies, and the relationships between them. To facilitate task execution, a pose estimation model predicts the relative 6D poses of components at each assembly step, while a motion planning module generates actionable sequences for real-world robotic implementation. We demonstrate the effectiveness of Manual2Skill by successfully assembling several real-world IKEA furniture items. This application highlights its ability to manage long-horizon manipulation tasks with both efficiency and precision, significantly enhancing the practicality of robot learning from instruction manuals. This work marks a step forward in advancing robotic systems capable of understanding and executing complex manipulation tasks in a manner akin to human capabilities.

[Figure 1]

∗ Equal contribution.

I Introduction

Humans can learn manipulation skills from instructions in images or texts; for example, people can assemble IKEA furniture or LEGO models by following a manual's instructions. This ability enables humans to efficiently acquire long-horizon manipulation skills from sketched instructions. In contrast, robots typically learn such skills through imitation learning [59] or reinforcement learning [43], both of which require significantly more data and computation. Replicating the human ability to transfer abstract manuals to real-world actions remains a significant challenge for robots. Manuals are typically designed for human understanding, using simple schematic diagrams and symbols to convey manipulation processes. This abstraction makes it difficult for robots to comprehend such instructions and derive actionable manipulation strategies [32, 49, 48]. Developing a method for robots to effectively utilize human-designed manuals would greatly expand their capacity to tackle complex, long-horizon tasks while reducing the need to collect extensive demonstration data.

Manuals inherently encode the structural information of complex tasks. They decompose high-level goals into mid-level subgoals and capture task flow and dependencies, such as sequential steps or parallelizable subtasks. For example, furniture assembly manuals guide the preparation and combination of components and ensure that all steps follow the correct order [32]. Extracting this structure is crucial for robots to replicate human-like understanding and manage complex tasks effectively [19, 33]. After decomposing the task, robots need to infer the specific information for each step, such as the involved components and their spatial relationships. For example, in cooking tasks, the instruction images and texts may involve selecting ingredients, tools, and utensils and arranging them in a specific order [38]. Finally, robots need to generate a sequence of actions to complete the task, such as grasping, placing, and connecting components. Previous works have tried to leverage sketched pictures [42] or trajectories [15] to learn manipulation skills, but these approaches are typically limited to relatively simple tabletop tasks.

In this paper, we propose Manual2Skill, a novel robot learning framework that is capable of learning manipulation skills from visual instruction manuals. This framework can be applied to automatically assemble IKEA furniture, a challenging and practical task that requires complex manipulation skills. As illustrated in Figure 1, given a set of manual images and the real furniture parts, we first leverage a vision-language model to understand the manual and extract the assembly structure, represented as a hierarchical graph. Then, we train a model to estimate the assembly poses of all involved components in each step. Finally, a motion planning module generates action sequences to move selected components to target poses and executes them on robots to assemble the furniture.

In summary, our main contributions are as follows:

  • We propose Manual2Skill, a novel framework that leverages a VLM to learn complex robotic skills from manuals, enabling a generalizable assembly pipeline for IKEA furniture.

  • We introduce a hierarchical graph generation pipeline that utilizes a VLM to extract structured information for assembly tasks. Our pipeline facilitates real-world assembly and extends to other assembly applications.

  • We define a novel assembly pose estimation task within the learning-from-manual framework. We predict the 6D poses of all involved components at each assembly step to meet real-world assembly requirements.

  • We perform extensive experiments to validate the effectiveness of our proposed system and modules.

  • We evaluate our method on four real items of IKEA furniture, demonstrating its effectiveness and applicability in real-world assembly tasks.

II Related Work

II-A Furniture Assembly

Part assembly is a long-standing challenge, with extensive research exploring how to construct a complete shape from individual components or parts [6, 13, 20, 27, 29, 36, 53, 46, 45]. Broadly, we can categorize part assembly into geometric assembly and semantic assembly. Geometric assembly relies solely on geometric cues, such as surface shapes or edge features, to determine how parts mate together [6, 53, 37, 10]. In contrast, semantic assembly primarily leverages high-level semantic information about the parts to guide the assembly process [13, 20, 27, 29, 45].

Furniture assembly is a representative semantic assembly task, where each part has a predefined semantic role (e.g., a chair leg or a tabletop), and the assembly process follows intuitive, common-sense relationships (e.g., a chair leg must be attached to the chair seat). Previous studies on furniture assembly have tackled different aspects of the problem, including motion planning [41], multi-robot collaboration [25], and assembly pose estimation [29, 58, 30]. Researchers have developed several datasets and simulation environments to facilitate research in this domain. For example, Wang et al. [49] and Liu et al. [32] introduced IKEA furniture assembly datasets containing 3D models of furniture and structured assembly procedures derived from instruction manuals. Additionally, Lee et al. [27] and Yu et al. [58] developed simulation environments for IKEA furniture assembly, while Heo et al. [16] provide a reproducible benchmark for real-world furniture assembly. However, existing works typically focus on specific subproblems rather than addressing the entire assembly pipeline. In this work, we aim to develop a comprehensive framework that learns the sequential process of furniture assembly from manuals and deploys it in real-world experiments.

II-B VLM Guided Robot Learning

Vision-Language Models (VLMs) [57] have been widely used in robotics to understand the environment [17] and interact with humans [39]. Recent advancements highlight VLMs' potential to enhance robot learning by integrating vision and language information, enabling robots to perform complex tasks with greater adaptability and efficiency [18]. One direction is the development of Vision-Language-Action (VLA) models that generate actions from vision and language inputs [2, 23, 3, 44]. However, training such models requires vast amounts of data, and they struggle with long-horizon or complex manipulation tasks. Another direction is to leverage VLMs to guide robot learning by providing high-level instructions and perceptual understanding. VLMs can assist with task descriptions [17, 18], environment comprehension [19], task planning [47, 56, 62], and even direct robot control [28]. Additionally, Goldberg et al. [14] demonstrate how VLMs can assist in designing robot assembly tasks. Building on these insights, we explore how VLMs can interpret abstract manuals and extract structured information to guide robotic skill learning for long-horizon manipulation tasks.

II-C Learning from Demonstrations

Learning from demonstration (LfD) has achieved promising results in acquiring robot manipulation skills [12, 64, 7]; for a broader review of LfD in robotic assembly, we refer to Zhu and Hu [65]. The key idea is to learn a policy that imitates the expert's behavior. However, previous learning methods often require fine-grained demonstrations, such as robot trajectories [7] or videos [22, 40, 21]. Collecting these demonstrations is often labor-intensive and may not always be feasible. Some works propose to learn from coarse-grained demonstrations, such as hand-drawn sketches of desired scenes [42] or rough trajectory sketches [15]. These approaches reduce dependence on expert demonstrations and improve the practicality of LfD. However, they are mostly limited to tabletop manipulation tasks and do not generalize well to more complex, long-horizon assembly problems. In this work, we aim to extend LfD beyond these constraints by tackling a more challenging assembly task using abstract instruction manuals.

III Problem Formulation

Given a complete set of 3D assembly parts and its assembly manual, our goal is to generate a physically feasible sequence of robotic assembly actions for autonomous furniture assembly. Manuals typically use schematic diagrams and symbols designed to depict step-by-step instructions in an abstract format that is universally understandable. We define the manual as a set of $N$ images $\mathcal{I} = \{I_1, I_2, \cdots, I_N\}$, where each image $I_i$ illustrates a specific step in the assembly process, such as the merging of certain parts or subassemblies.

The furniture consists of $M$ individual parts $\mathcal{P} = \{P_1, P_2, \cdots, P_M\}$. A part is an individual element of $\mathcal{P}$ that remains disconnected from other parts until assembly. A subassembly is any partially or fully assembled structure that forms a proper subset of $\mathcal{P}$ (for example, $\{P_1, P_2\}$). The term component encompasses both parts and subassemblies.

Given the manual and 3D parts, the system generates an assembly plan. Each step corresponds to a manual image and specifies the involved parts and subassemblies, their spatial 6D poses, and the assembly actions or motion trajectories required for execution.

IV Technical Approach

Our approach automates furniture assembly by leveraging a VLM to interpret IKEA-style manuals and guide robotic execution. Given a visual manual and physical parts in a pre-assembly scene, the VLM generates a hierarchical assembly graph, defining which parts and subassemblies are involved in each step. Next, a per-step pose estimation model predicts 6D poses for each component using a manual image and the point clouds of the involved components. Finally, for assembly execution, the estimated poses are transformed into the robot's world frame, and a motion planner generates a collision-free trajectory for part mating.

Figure 2 shows an overview of our framework. We describe the VLM-guided hierarchical assembly graph generation in Section IV-A, followed by per-step assembly pose estimation in Section IV-B and assembly action generation based on component relationships in Section IV-C.

[Figure 2]

IV-A VLM Guided Hierarchical Assembly Graph Generation

This section demonstrates how VLMs can interpret IKEA-style manuals to generate high-level assembly plans. Given a manual and a real-world image of the furniture parts (the pre-assembly scene image), a VLM predicts a hierarchical assembly graph. We show one example in Figure 2. In this graph, leaf nodes represent atomic parts, while non-leaf nodes denote subassemblies. We structure the graph in multiple layers, where each layer contains nodes representing parts or subassemblies involved in a single assembly step (corresponding to one manual image). The directed edges from the children to a parent node indicate that the system assembles the parent node from all its children nodes. Additionally, we add edges between equivalent parts, denoting that these parts are identical (e.g., the four legs of a chair). Representing the assembly process as a hierarchical graph decomposes the assembly into sequential steps while specifying the necessary parts and subassemblies. We give the formal definition of the hierarchical graph in Section-J. We achieve this in two stages: associating manuals with real parts and identifying the parts needed in each image.
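For illustration, the snippet below sketches one possible in-memory encoding of such a hierarchical assembly graph; the class name, fields, and the chair example are illustrative assumptions rather than the exact data structure used in our implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AssemblyNode:
    """One node of the hierarchical assembly graph.

    Leaf nodes carry a part label from the labeled pre-assembly scene image;
    non-leaf nodes represent the subassembly produced from their children.
    """
    label: str                                               # e.g., "part_2" or "step_1_subassembly"
    children: List["AssemblyNode"] = field(default_factory=list)
    equivalent_to: List[str] = field(default_factory=list)   # labels of geometrically identical parts

    def is_leaf(self) -> bool:
        return len(self.children) == 0

# Hypothetical chair: two identical side frames (0, 1) are first joined with the
# seat frame (2) in step 1; the backrest (3) is then attached in step 2.
side0 = AssemblyNode("part_0", equivalent_to=["part_1"])
side1 = AssemblyNode("part_1", equivalent_to=["part_0"])
seat = AssemblyNode("part_2")
back = AssemblyNode("part_3")
step1 = AssemblyNode("step_1_subassembly", children=[side0, side1, seat])
chair = AssemblyNode("assembled_chair", children=[step1, back])
```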

IV-A1 VLM Capabilities and General Prompt Structure

The task is inherently complex due to the diverse nature of input images. Manuals are typically abstract sketches, whereas pre-assembly scene images are high-resolution real-world images. Such diversity requires advanced visual recognition and spatial reasoning across varied image domains, which are strengths of VLMs due to their training on extensive, internet-scale datasets. We demonstrate the effectiveness of VLMs for this task in Section V-A and Section-D.

Every VLM prompt consists of two components:

  • Image Set: This includes all manual pages and the real-world pre-assembly scene image. Unlike traditional VLM applications in robotics [23, 18], which process a single image, our method requires multi-image reasoning.

  • Text Instructions: These instructions provide a task-specific context, guiding the model in interpreting the image set. The instructions range from simple directives to Chain-of-Thought reasoning [51]. All instructions incorporate in-context learning examples, specifying the required output format—be it JSON, Python code, or natural language. This structure is essential to our multi-stage pipeline, ensuring well-structured, interpretable outputs that seamlessly integrate into subsequent stages.

IV-A2 Stage I: Associating Real Parts with Manuals

Given the manual's cover sketch of the assembled furniture and the pre-assembly scene image, the VLM aims to associate the physical parts with the manual. The VLM achieves this by predicting the role of each physical part through semantically interpreting the manual's illustrations. This process involves analyzing spatial, contextual, and functional cues in the manual illustrations to enable a comprehensive understanding of each physical part. This design mimics human assembly cognition: people first map abstract manual images to physical parts before assembling. Our method follows CoT [51] and Least-to-Most [63] prompting, reducing cognitive load and improving accuracy. We considered pairwise matching of parts between manuals and scene images, but found it impractical because manuals do not depict each part independently.

To enhance part identification, we employ Set of Marks [55] and GroundingDINO [31] to automatically label parts on the pre-assembly scene image with numerical indices. The labeled scene image and manual sketch form the Image Set. Text instructions consist of a brief context explanation for the association task of predicting the roles of each physical part, accompanied by in-context examples of the output structure:

{name, label, role}

For example, in the Stage I Output of Figure 2, we describe the chair's seat as name: seat frame, label: [2], role: for people sitting on a chair, the seat offers essential support and comfort and is positioned centrally within the chair's frame. Here, [2] indicates that this triplet corresponds to the physical part labeled with index 2 in the pre-assembly scene image. This triplet format enhances interpretability and ensures consistency by structuring all outputs into the same data format. We use the Image Set and Text Instructions as the input prompt for the VLM (specifically GPT-4o [1]) and query it once to generate role assignments for all physical parts. We then use these labels as leaf nodes in the hierarchical assembly graph.
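As a concrete illustration, the sketch below shows how the Stage I query could be issued with the OpenAI Python client; the file names, the abbreviated instruction text, and the assumption that the model returns raw JSON triplets are illustrative, and the full prompts appear in Section-K.

```python
import base64
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Image Set: the manual's cover sketch and the labeled pre-assembly scene (hypothetical file names).
image_set = [encode_image("manual_cover.png"), encode_image("labeled_scene.png")]

# Text Instructions: task context plus an in-context example of the {name, label, role} output.
instructions = (
    "The first image is the manual's sketch of the assembled furniture; the second is the real "
    "pre-assembly scene with numbered part labels. For every labeled part, return a JSON list of "
    'triplets of the form {"name": ..., "label": [...], "role": ...}. Return raw JSON only.'
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [{"type": "text", "text": instructions}]
        + [{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img}"}}
           for img in image_set],
    }],
)
triplets = json.loads(response.choices[0].message.content)  # one {name, label, role} per physical part
```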

We can identify equivalent parts from these triplets: when two physical parts share the same geometric shape, their triplets differ only by label. For example, in the Stage I Output of Figure 2, {name: side frame, label: [0], role: ...} and {name: side frame, label: [1], role: ...} are considered equivalent. Understanding equivalent part relationships is crucial for downstream modules, as demonstrated by our ablation experiments (see Section-C).

IV-A3 Stage II: Identifying Involved Parts in Each Step

This stage focuses on identifying the particular parts and subassemblies involved in each manual page. The VLM achieves this by reasoning through the illustrated assembly steps, using the triplets and the labeled pre-assembly scene from the previous stage as supporting hints.

In practice, we observe that irrelevant elements in the manual (e.g., nails, human figures) can distract the VLM. Following [49], we manually crop the illustrated parts and subassemblies in each manual step to focus the VLM's attention (Figure 2, Stage II Image Set), significantly improving performance (see Ablation Study for details). Automating Region-of-Interest (ROI) detection remains an open problem beyond the scope of this work and is left for future research.

The manual pages, combined with the labeled pre-assembly scene from the previous stage, form the Image Set. The Text Instructions use a Chain-of-Thought prompt to guide the VLM in identifying parts and subassemblies step by step and include in-context examples that clarify the structured output format: a pair consisting of (Step N, Labeled Parts Involved). The bottom-left output of Figure 2 provides an example of this format. Together, the Image Set and Text Instructions compose the input prompt for GPT-4o, which generates pairs for all assembly steps in a single query.

As shown in Figure 2, the system outputs nested lists. We then transform these lists, along with the equivalent-part relations, into a hierarchical graph. Using this assembly graph, we traverse all non-leaf nodes and explore various assembly orders. Formally, a feasible assembly order is an ordered set of non-leaf nodes in which a parent node appears only after all of its child nodes. A key advantage of the hierarchical graph representation is its flexibility: since the assembly sequence is not unique, it allows for parallel assembly or strategic sequencing.
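To make the ordering constraint concrete, the sketch below enumerates one feasible assembly order by post-order traversal, reusing the illustrative AssemblyNode structure sketched in Section IV-A; any ordering in which a parent appears only after all of its children is equally valid.

```python
from typing import List

def feasible_assembly_order(root: "AssemblyNode") -> List["AssemblyNode"]:
    """Return the non-leaf nodes in an order where every parent node
    appears only after all of its children (post-order traversal)."""
    order: List["AssemblyNode"] = []

    def visit(node: "AssemblyNode") -> None:
        for child in node.children:
            visit(child)
        if not node.is_leaf():
            order.append(node)

    visit(root)
    return order

# For the chair sketched earlier: ['step_1_subassembly', 'assembled_chair']
print([n.label for n in feasible_assembly_order(chair)])
```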

IV-B Per-step Assembly Pose Estimation

Given an assembly order, we train a model to estimate the poses of components (parts or subassemblies) at each step of the assembly process. At each step, the model takes as input the manual image and the point clouds of the involved components and predicts their target poses to ensure proper alignment. To support this task, we construct a dataset for sequential pose estimation; for a detailed description, see Section-A.

Given each component’s point cloud (obtained from real-world scans or our dataset), we first center it by translating its centroid to the origin. Next, we apply Principal Component Analysis (PCA) to identify the dominant object axes, which define a canonical coordinate frame. The most dominant axes serve as the reference frame, ensuring a shape-driven and consistent orientation that remains independent of arbitrary coordinate systems.
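A minimal numpy sketch of this canonicalization is given below, assuming a dense point-cloud sample of the component; the sign convention for the axes is an implementation detail that may differ from ours.

```python
import numpy as np

def canonicalize(points: np.ndarray) -> np.ndarray:
    """Center an (N, 3) point cloud and rotate it into its PCA frame.

    The eigenvectors of the covariance matrix define a shape-driven coordinate
    frame that is independent of the original scan orientation.
    """
    centered = points - points.mean(axis=0)       # translate centroid to the origin
    cov = np.cov(centered.T)                      # 3x3 covariance of the cloud
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    axes = eigvecs[:, ::-1]                       # most dominant axis first
    if np.linalg.det(axes) < 0:                   # keep a right-handed frame
        axes[:, -1] *= -1
    return centered @ axes                        # coordinates in the canonical frame
```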

The dataset we create provides manual images, point clouds, and target poses for each component in the camera frame of the corresponding manual image (following [29]). For an assembly step depicted in the manual image $I_i$, the inputs to our model are (1) the manual image $I_i$ and (2) the point clouds of all involved components. The output is the target pose $T \in SE(3)$ for each component, represented in the camera frame of $I_i$.

IV-B1 Model Architecture

Note that the number of components at each step is not fixed, depending on the subassembly division of the furniture. Our pose estimation model consists of four parts: an image encoder $\mathcal{E}_I$, a point cloud encoder $\mathcal{E}_P$, a cross-modality fusion module $\mathcal{E}_G$, and a pose regressor $\mathcal{R}$.

We first feed the manual image $I$ into the image encoder to obtain an image feature map $\mathbf{F}_I$:

$\mathbf{F}_I = \mathcal{E}_I(I)$  (1)

Then, we feed the point clouds into the point cloud encoder to get the point cloud feature for each component.

$\{\mathbf{F}_j\} = \mathcal{E}_P(\{P_j\})$  (2)

To fuse the multi-modal information from the manual image and the point clouds, we leverage a GNN [54]: we treat the manual image feature and the component-wise point cloud features as nodes of a complete graph and employ the GNN to update the information at each node.

$\mathbf{F}_I^{\prime}, \{\mathbf{F}_j^{\prime}\} = \mathcal{E}_G(\mathbf{F}_I, \{\mathbf{F}_j\})$  (3)

where $\mathbf{F}_I^{\prime}$ and $\{\mathbf{F}_j^{\prime}\}$ are the updated image and point cloud features, respectively.

Finally, we feed the updated point cloud features as input into the pose regressor to get the target pose for each component.

$T_j = \mathcal{R}(\mathbf{F}_j^{\prime})$  (4)
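For reference, a schematic PyTorch-style forward pass corresponding to Equations (1)-(4) is sketched below; the four submodules are placeholders standing in for the DeepLabV3+ encoder, PointNet++, the graph transformer, and the MLP regressor detailed in Section V-B, and the tensor shapes in the comments are assumptions.

```python
import torch
import torch.nn as nn

class PerStepPoseEstimator(nn.Module):
    """Schematic per-step pose estimation model (Eqs. 1-4)."""

    def __init__(self, image_encoder, pc_encoder, fusion_gnn, pose_regressor):
        super().__init__()
        self.E_I = image_encoder    # manual image  -> F_I               (Eq. 1)
        self.E_P = pc_encoder       # point cloud   -> F_j               (Eq. 2)
        self.E_G = fusion_gnn       # fuse features on a complete graph  (Eq. 3)
        self.R = pose_regressor     # updated F_j'  -> target pose       (Eq. 4)

    def forward(self, manual_image, component_pcds):
        f_img = self.E_I(manual_image)                                  # (1, 256)
        f_parts = torch.stack([self.E_P(p) for p in component_pcds])    # (K, 256)
        nodes = torch.cat([f_img, f_parts], dim=0)                      # image + K component nodes
        updated = self.E_G(nodes)                                       # message passing over the graph
        return [self.R(f) for f in updated[1:]]                         # one predicted pose per component
```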

IV-B2 Loss Function

We adopt a loss function that jointly considers pose prediction accuracy and point cloud alignment, following [60, 30]. The first term penalizes errors in the predicted SE(3) transformation, while the second measures the distance between the predicted and ground-truth point clouds. To account for interchangeable components, we compute the loss across all possible permutations of equivalent parts and select the minimum as the final training objective. We provide further details on the loss formulation and training strategy in Section-B.
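The permutation handling can be sketched as below; `per_part_loss` is a placeholder for the combined pose and point-cloud terms detailed in Section-B, and the grouping of equivalent parts comes from Stage I.

```python
import itertools
import torch

def permutation_min_loss(pred_poses, gt_poses, equiv_groups, per_part_loss):
    """Training loss under the best assignment of interchangeable components.

    equiv_groups lists index groups of geometrically identical components,
    e.g. [[0, 1], [2]] when components 0 and 1 are interchangeable.
    """
    best = None
    # Permute ground-truth poses only within each group of equivalent parts.
    for perms in itertools.product(*(itertools.permutations(g) for g in equiv_groups)):
        gt_order = list(range(len(gt_poses)))
        for group, perm in zip(equiv_groups, perms):
            for src, dst in zip(group, perm):
                gt_order[src] = dst
        loss = sum(per_part_loss(pred_poses[i], gt_poses[gt_order[i]])
                   for i in range(len(pred_poses)))
        best = loss if best is None else torch.minimum(best, loss)
    return best
```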

IV-C Robot Assembly Action Generation

IV-C1 Align Predicted Poses with the World Frame

At each assembly step, the previous stage predicts each component's pose in the camera frame of the manual image. However, real-world robotic systems operate in their own world frame, requiring a 6D transformation between these coordinates. Consider two components, A and B. The predicted target poses in the camera frame are denoted as ${}^{I_i}\hat{\mathcal{T}}_a$ and ${}^{I_i}\hat{\mathcal{T}}_b$. Meanwhile, our system can observe the current 6D pose of part A in the world frame, represented as ${}^{W}\mathcal{T}_a$. To align ${}^{I_i}\hat{\mathcal{T}}_a$ with ${}^{W}\mathcal{T}_a$, we compute the 6D transformation matrix ${}^{W}_{I_i}\mathcal{T}$, which maps the camera frame to the world frame:

${}^{W}\mathcal{T}_a = {}^{W}_{I_i}\mathcal{T} \; {}^{I_i}\hat{\mathcal{T}}_a$  (5)

Using the same transformation ${}^{W}_{I_i}\mathcal{T}$, we compute the assembled target pose of part B (and all remaining components) in the world frame:

${}^{W}\mathcal{T}_b = {}^{W}_{I_i}\mathcal{T} \; {}^{I_i}\hat{\mathcal{T}}_b$  (6)

This transformation accurately maps predicted poses from the manual image frame to the robot’s world frame, ensuring precise assembly execution.
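Written out with 4x4 homogeneous transforms, Equations (5)-(6) amount to the short sketch below; the variable names mirror the notation in the text.

```python
import numpy as np

def align_to_world(T_W_a, T_Ii_a_hat, T_Ii_b_hat):
    """Map predicted camera-frame target poses into the robot's world frame.

    T_W_a:      current world-frame pose of part A observed by the robot.
    T_Ii_a_hat: predicted target pose of part A in the manual image's camera frame.
    T_Ii_b_hat: predicted target pose of part B in the same camera frame.
    """
    T_W_Ii = T_W_a @ np.linalg.inv(T_Ii_a_hat)   # camera-to-world transform, solved from Eq. (5)
    T_W_b = T_W_Ii @ T_Ii_b_hat                  # part B's world-frame target pose, Eq. (6)
    return T_W_Ii, T_W_b
```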

IV-C2 Assembly Execution

Once our system determines the target poses of each component in the world frame for the current assembly step, it grasps each component and generates the required action sequences for assembly.

Part Grasping

After scanning each real-world part, we obtain its corresponding 3D mesh. We employ FoundationPose [52] and the Segment Anything Model (SAM) [24] to obtain the initial poses of all parts in the scene.

Given the pose and shape of each part, we design heuristic grasping methods tailored to the geometry of individual components. While general grasping algorithms such as GraspNet [11] are viable, grasping is beyond the scope of this work. Instead, we employ heuristic grasping strategies specifically designed for structured components in assembly tasks. For stick-shaped components, we grasp the centroid of the object after identifying its longest axis for stability. For flat and thin-shaped components, we use fixtures or staging platforms to securely position the object, allowing the robot to grasp along the thin boundary for improved stability. We provide further details on these grasping methods in Section-G.
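The stick-shaped heuristic can be summarized by the sketch below, assuming a point-cloud sample of the part; gripper width, approach direction, and collision checks are omitted.

```python
import numpy as np

def stick_grasp(points: np.ndarray):
    """Heuristic grasp for stick-shaped parts: grasp at the centroid,
    closing the gripper perpendicular to the part's longest axis."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    longest_axis = vt[0]                                   # direction of largest extent
    closing_dir = np.cross(longest_axis, np.array([0.0, 0.0, 1.0]))
    if np.linalg.norm(closing_dir) < 1e-6:                 # part aligned with z; fall back to x
        closing_dir = np.array([1.0, 0.0, 0.0])
    closing_dir /= np.linalg.norm(closing_dir)
    return centroid, closing_dir                           # grasp point and gripper closing axis
```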

Part Assembly Trajectory

Once the robot arm grasps a component, it finds a feasible, collision-free path to predefined robot poses (anchor poses). At these poses, the 6D pose of the grasped component is recalculated in the world frame, leveraging FoundationPose [52] and the Segment Anything Model (SAM) [24]. The system then plans a collision-free trajectory to the component's target pose. We use RRT-Connect [26] as our motion planning algorithm. All collision objects in the scene are represented as point clouds and fed into the planner. Once the planner finds a collision-free path, the robot moves along the planned trajectory.

Assembly Insertion Policy

Once the robot arm moves a component near its target pose, the assembly insertion process begins. Assembly insertions are contact-rich tasks that require multi-modal sensing (e.g., force sensors and closed-loop control) to ensure precise alignment and secure connections. However, developing closed-loop assembly insertion skills is beyond the scope of this work and will be addressed in future research. In our current approach, human experts manually perform the insertion action.

V Experiments

In this section, we perform a series of experiments aimed at addressing the following questions.

  • Q1: Can our proposed hierarchical assembly graph generation module effectively extract structured information from manuals? (see Section V-A)

  • Q2: Is the per-step pose estimation applicable to different categories of furniture, and does it outperform previous approaches? (see Section V-B)

  • Q3: How effective is the proposed framework in assembling furniture with manual guidance? (see Section V-C)

  • Q4: Can this pipeline be applied to real-world scenarios? (see Section V-D)

  • Q5: Can this pipeline be extended to other assembly tasks? (see Section V-E)

  • Q6: How should we determine and evaluate the key design choices of each module? (ablation experiments, see Sections-E and -C)

In addition, we include the comprehensive set of prompts used in the VLM-guided hierarchical graph generation process in Section-K.

V-A Hierarchical Assembly Graph Generation

In this section, we evaluate the performance of our VLM-guided hierarchical assembly graph generation approach. Specifically, we assess Stage II: Identifying Parts in Each Image using the IKEA-Manuals dataset [49]. We provide the rationale for excluding Stage I evaluation in Section-H.

TABLE I: Hierarchical assembly graph generation results.

Method       Precision  Recall  F1 Score  Success Rate
SingleStep   0.220      0.220   0.220     0.220
GeoCluster   0.197      0.201   0.196     0.080
Ours         0.690      0.680   0.684     0.620
[Figure 3]

Experiment Setup. The IKEA-Manuals dataset [49] includes 102 furniture items, each with IKEA manuals, 3D parts, and assembly plans represented as trees in nested lists. We load each item's 3D parts into Blender and render an image of the pre-assembly scene. Moreover, we split the 102 furniture items into two sets. The first set consists of 50 furniture items with six or fewer parts, and the second set contains 52 furniture items with seven or more parts. We observe that current VLMs can effectively handle the first set, and a significant portion of real-world furniture also contains fewer than seven parts (as seen in our real-world experiments). Here, we report the results on the first set; please refer to Sections-D and -K for complete results and prompts. The rendered pre-assembly image, along with the manual, is processed by the VLM through the stages outlined in Section IV-A to generate a hierarchical assembly graph. Since we represent our graph as a nested list, we align our notation with the assembly tree notation used in IKEA-Manuals [49]. In this subsection, we refer to our generated assembly graph as the predicted tree.

Evaluation Metrics. We use the same metrics as IKEA-Manuals [49], which include precision, recall, and F1 score to compare predicted and ground-truth nodes of the assembly tree. For detailed descriptions of these metrics, we refer readers to [49].

The Matching criterion for each node is defined as follows: we consider a predicted non-leaf node correct only if its set of leaf and non-leaf child nodes exactly matches that of the corresponding ground-truth node (taking equivalent parts into account). In other words, the predicted node must have the same children as its ground-truth counterpart. We compute precision, recall, and F1 scores based on this criterion.

The Success Rate criterion measures the proportion of the predicted tree that exactly matches the ground-truth tree. We consider a predicted tree exactly matched if all its non-leaf nodes satisfy the Matching criterion.
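For clarity, the two criteria can be phrased as the simplified sketch below, which reduces every child node to the set of leaf parts it covers; handling of equivalent parts is omitted for brevity.

```python
from typing import FrozenSet, List, Set, Union

Tree = Union[int, List]  # a leaf part index, or a nested list of subtrees

def leaves(t: Tree) -> FrozenSet[int]:
    return frozenset([t]) if isinstance(t, int) else frozenset().union(*(leaves(c) for c in t))

def node_signatures(t: Tree) -> Set[FrozenSet[FrozenSet[int]]]:
    """Describe each non-leaf node by the set of leaf-sets of its children."""
    if isinstance(t, int):
        return set()
    sig = frozenset(leaves(c) for c in t)
    return {sig}.union(*(node_signatures(c) for c in t))

def evaluate_tree(pred: Tree, gt: Tree):
    p, g = node_signatures(pred), node_signatures(gt)
    tp = len(p & g)                                   # predicted nodes matching a ground-truth node
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    success = p == g                                  # Success Rate counts exactly matched trees
    return precision, recall, f1, success
```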

Baselines. We compare our VLM-based method against two heuristic approaches introduced in IKEA-Manuals [49].

  • SingleStep predicts a flat, one-level tree with a single parent node and $n$ leaf nodes.

  • GeoCluster employs a pre-trained DGCNN [50] to iteratively group furniture parts with similar geometric features into a single assembly step. Compared to SingleStep, it generates deeper trees with more parent nodes and multiple hierarchical levels.

Results. As shown in Table I, quantitative results demonstrate that both baseline methods struggle to generate accurate assembly trees under the Matching and Success Rate criteria. In contrast, our VLM-guided method achieves significantly superior performance, with a success rate of 62%. These findings underscore the robust generalization capability of VLMs when guided by well-structured prompts. Figure 3 provides qualitative results for two furniture items, illustrating the advantages of our approach in greater detail. With the ongoing development of more advanced VLMs, we expect further improvements in assembly planning accuracy. Please refer to Section-E for ablation results.

V-B Per-step Assembly Pose Estimation

TABLE II: Per-step assembly pose estimation results.

                 GD ↓                   RMSE ↓                 CD ↓                   PA ↑
Method           Chair  Lamp   Table    Chair  Lamp   Table    Chair  Lamp   Table    Chair  Lamp   Table
Li et al. [29]   1.847  1.865  1.894    0.247  0.278  0.318    0.243  0.396  0.519    0.268  0.121  0.055
Mean-Max Pool    0.434  1.118  1.059    0.087  0.187  0.200    0.046  0.229  0.280    0.457  0.199  0.107
Ours             0.202  0.826  0.953    0.042  0.153  0.172    0.027  0.189  0.276    0.868  0.240  0.184

Data Preparation. We select three categories of furniture items from PartNet [34]: chair, table, and lamp. For each category, we select 100 furniture items and generate 10 part selections and subassembly divisions for each piece of furniture. To generate the assembly manual images, we render diagrammatic images of parts at 20 random camera poses using Blender's Freestyle functionality. We provide more details in Section-A. In total, we generate 12,000 training and 5,200 testing samples for each category.

Training Details. For the image encoder $\mathcal{E}_I$, we select the encoder component of DeepLabV3+, which includes MobileNetV2 as the backbone and the atrous spatial pyramid pooling (ASPP) module. We make this choice because DeepLabV3+ applies atrous convolutions on top of an encoder-decoder architecture, enabling the model to capture multi-scale structures and spatial information effectively [4, 5]. It generates a multi-channel feature map from the image $I$, and we use a mean-max pool [61] to derive a global vector $\mathbf{F}_I \in \mathbb{R}^{256}$ from the feature map. For the point cloud encoder $\mathcal{E}_P$, we use the encoder part of PointNet++ [35]. For each part and subassembly, we extract a part-wise feature $\mathbf{F}_j \in \mathbb{R}^{256}$. For the GNN $\mathcal{E}_G$, we use a three-layer graph transformer [8]. The pose regressor $\mathcal{R}$ is a three-layer MLP. We provide more details of the mean-max pooling for the image feature and our training hyperparameter settings in Section-B.

Baselines. We evaluate the performance of our method on our proposed per-step assembly pose estimation dataset and compare it with two baselines:

  • Li et al. [29] proposed a pipeline for single-image-guided 3D object pose estimation.

  • Mean-Max Pool is a variant of our method that replaces the GNN with a mean-max pooling trick, similar to how we obtain a one-dimensional vector from the multi-channel feature map; see Section-B for details.

Evaluation Metrics. We adopt comprehensive evaluation metrics to assess the performance of our method and the baselines; minimal sketches of the rotation and point-cloud metrics follow the list below.

  • Geodesic Distance (GD), which measures the shortest path distance on the unit sphere between the predicted and ground-truth rotations.

  • Root Mean Squared Error (RMSE), which measures the Euclidean distance between the predicted and ground-truth poses.

  • Chamfer Distance (CD), which calculates the holistic distance between the predicted and the ground-truth point clouds.

  • Part Accuracy (PA), which computes the Chamfer Distance between the predicted and the ground-truth point clouds; if the distance is smaller than 0.01 m, we count the part as "correctly placed".
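As referenced above, minimal numpy sketches of these metrics are given below, assuming rotation matrices for GD and sampled point clouds for CD and PA; conventions (e.g., squared versus unsquared Chamfer terms) vary across works, so the exact formulas in our evaluation may differ slightly.

```python
import numpy as np

def geodesic_distance(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Shortest rotation angle (radians) between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds of shape (N, 3) and (M, 3)."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # pairwise distances
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def part_accuracy(p: np.ndarray, q: np.ndarray, threshold: float = 0.01) -> bool:
    """A part counts as correctly placed if its Chamfer distance is below 1 cm."""
    return chamfer_distance(p, q) < threshold
```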

Results. As shown in Table II, our method outperforms Li et al. [29] and the mean-max pool variant on all evaluation metrics and all three furniture categories. We attribute this to the effectiveness of our multi-modal feature fusion and the GNN in capturing the spatial relationships between parts. We also provide qualitative results for each furniture category in Figure 4.

Ablation. To assess the impact of equivalent parts, the guided image, and per-step data about subassemblies, we perform ablation studies on these components. We present the details and results in Section-C.

[Figure 4]
[Figure 5]

V-C Overall Performance Evaluation

We evaluate the overall performance of our method by assembling furniture models in a simulation environment. We implement the evaluation process in the PyBullet [9] simulation environment and test the entire pipeline. We source all test furniture models from the IKEA-Manuals dataset [49]. Given these manuals along with 3D parts, we generate the pre-assembly scene images as described in Section IV-C, and our pipeline generates the hierarchical graphs. Then, we traverse the hierarchical graph to determine the assembly order. Following this sequence and the predicted 6D poses of each component, we implement RRT-Connect [26] in simulation to plan feasible motion paths for the 3D parts and subassemblies, ensuring they move toward their target poses. Note that, in this experiment, we focus on object-centric motion planning and omit robotic execution from our framework.

Baselines. Since ours is the first comprehensive pipeline for furniture assembly, there is no direct baseline for comparison. We therefore design a baseline that uses previous work [29] to estimate the poses of all parts, guided by an image of the fully assembled furniture, and adopts a heuristic order to assemble them. Specifically, given the predicted poses of all parts, we calculate the distance between each pair of parts. The heuristic order is defined as follows: starting from a random part, we find the part nearest to it and assemble it, then successively find the part nearest to the assembled set until all parts are assembled.
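This heuristic ordering reduces to a greedy nearest-part loop, sketched below over the predicted part centers; the starting part is chosen at random in our experiments, and index 0 is used here only for illustration.

```python
import numpy as np

def heuristic_order(part_centers: np.ndarray, start: int = 0) -> list:
    """Greedy baseline ordering: repeatedly attach the unassembled part whose
    predicted center is closest to the already-assembled set."""
    remaining = set(range(len(part_centers))) - {start}
    order = [start]
    while remaining:
        nxt = min(remaining, key=lambda j: min(
            np.linalg.norm(part_centers[j] - part_centers[i]) for i in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```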

Evaluation Metrics. We adopt the assembly success rate as the evaluation metric and define the following situations as failures: 1) a part is placed at a pose that is too far from the ground-truth pose; 2) a part collides with other parts when moving to the estimated pose, i.e., the RRT-Connect algorithm [26] finds no feasible path when mating it with other parts; 3) a part is placed away from any other component, leaving it suspended in midair after an assembly step.

TABLE III: Overall assembly success rates in simulation.

Method                       Bench  Chair  Table  Misc   Average
Li et al. [29] + Heuristic   0.00   0.39   0.11   0.00   0.30
Ours                         0.67   0.61   0.44   0.50   0.58

Results. We evaluate the overall performance on 50 furniture items from the IKEA-Manuals dataset [49], each consisting of fewer than seven parts. These items fall into four categories (Bench, Chair, Table, Misc), and we report the success rate for each in Table III.

Our system successfully assembles 29 out of 50 furniture pieces, whereas the baseline method assembles only 15. Our framework achieves a success rate of 58%, demonstrating the effectiveness of our proposed framework. The most common failure occurs when the VLM fails to generate a fully accurate assembly graph, leading to misalignment between the point cloud and the instruction manual images used for pose estimation.

[Figure 6]

V-D Real-world Assembly Experiments

To evaluate the feasibility and performance of our pipeline, we conducted experiments in the real world using four IKEA furniture items: Flisat (Wooden Stool), Variera (Iron Shelf), Sundvik (Chair), and Knagglig (Box). Figure 6 illustrates our real-world experiment setup. We show the manual images, per-step pose estimation results, and real-world assembly process in Figure 5. We also attach videos of the real-world assembly process in the supplementary material. For detailed implementation of our real-world experiments, please see Section-G. We evaluated all the assembly tasks with target poses provided by three different methods: Oracle Pose, Mean-Max Pool (see Section V-B), and our proposed approach. The Oracle Pose method uses the ground-truth pose of each part to assemble the furniture. We use the Average Completion Rate (ACR) as the evaluation criterion, calculated as follows:

$ACR = \frac{1}{N}\sum_{j=1}^{N}\frac{S_j}{S_{\text{total}}}$  (7)

where $N$ is the total number of trials, $S_j$ is the number of steps completed in trial $j$, and $S_{\text{total}}$ denotes the total number of steps in the task.

We perform each task over 10 trials with varying initial 3D part poses. We present the results in Table IV, showing that our method outperforms the baseline and achieves a high completion rate in real-world assembly tasks.

These findings underscore the practicality and effectiveness of our approach for real-world implementation. The primary failure mode arises from planning limitations, particularly in handling complex obstacles. Failures occur when the RRT-Connect algorithm cannot find a feasible trajectory, when the planned path results in collisions with the robotic arm or surrounding objects, or when grasping poses are suboptimal. To improve robustness in real-world scenarios, we plan to develop a low-level policy for adaptive motion refinement, a topic we leave for future work.

TABLE IV: Average Completion Rate (%) in real-world assembly.

Method         FLISAT  VARIERA  SUNDVIK  KNAGGLIG
Oracle Pose    72.5    85.0     80.0     90.0
Mean-Max Pool  52.5    61.7     40.0     70.0
Ours           60.0    80.0     68.0     85.0

V-E Generalization to Other Assembly Tasks

[Figure 7]

We design Manual2Skill as a generalizable framework capable of handling diverse assembly tasks with manual instructions. To assess its versatility, we evaluate the VLM-guided hierarchical graph generation method across three distinct assembly tasks, each varying in complexity and application domain. These include: (1) Assembling a Toy Car Axle (a low-complexity task with standardized components, representing consumer product assembly), (2) Assembling an Aircraft Model (a medium-complexity task, representing consumer product assembly), and (3) Assembling a Robotic Arm (a high-complexity task involving non-standardized components, representing research & prototyping assembly).

For the toy car axle and aircraft model, we sourced 3D parts from [46] and reconstructed pre-assembly scene images using Blender. We manually crafted the manuals in their signature style, with each page depicting a single assembly step through abstract illustrations. For the robotic arm assembly, we used the Zortrax robotic arm [66], which includes pre-existing 3D parts and a structured manual. These inputs were then processed through the VLM-guided hierarchical graph generation pipeline (described in Section V-A), yielding assembly graphs as shown in Figure 7. This zero-shot generalization achieves a success rate of 100% over five trials per task. The generated graphs align with ground-truth assembly sequences, confirming the generalization of our VLM-guided hierarchical graph generation across diverse manual-based assembly tasks and highlighting its potential for broader applications.

VI Limitations

This paper explores the acquisition of complex manipulation skills from manuals and introduces a method for automated IKEA furniture assembly. Despite this progress, several limitations remain. First, our approach mainly identifies the objects that need assembly but overlooks other details, such as grasping position markings and precise connector locations (e.g., screws). Integrating a vision-language model (VLM) module to extract this information could significantly enhance robotic insertion capabilities. Second, the method does not cover the automated execution of fastening mechanisms, like screwing or insertion actions, which depend heavily on force and tactile sensing signals. We leave these challenges as directions for future work.

VII Conclusion

In this paper, we address the problem of learning complex manipulation skills from manuals, which is essential for robots to execute such tasks based on human-designed instructions. We propose Manual2Skill, a novel framework that leverages VLMs to understand manuals and learn robotic manipulation skills from them. We design a pipeline for assembling IKEA furniture and validate its effectiveness in real-world scenarios. We also demonstrate that our method extends beyond the task of furniture assembly. This work represents a significant step toward enabling robots to learn complex manipulation skills with human-like understanding, and it could potentially unlock new avenues for robots to acquire diverse complex manipulation skills from human instructions.

References

  • Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Black et al. [2024] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi_0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
  • Brohan et al. [2023] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
  • Chen et al. [2017] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
  • Chen et al. [2018] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.
  • Chen et al. [2022] Yun-Chun Chen, Haoda Li, Dylan Turpin, Alec Jacobson, and Animesh Garg. Neural shape mating: Self-supervised object assembly with adversarial shape priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12724–12733, 2022.
  • Chi et al. [2023] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023.
  • Costa et al. [2021] Allan Costa, Manvitha Ponnapati, Joseph M. Jacobson, and Pranam Chatterjee. Distillation of MSA embeddings to folded protein structures with graph transformers. bioRxiv, 2021. doi: 10.1101/2021.06.02.446809. URL https://www.biorxiv.org/content/early/2021/06/02/2021.06.02.446809.
  • Coumans [2015] Erwin Coumans. Bullet physics simulation. In ACM SIGGRAPH 2015 Courses, page 1. ACM, 2015.
  • Du et al. [2024] Bi'an Du, Xiang Gao, Wei Hu, and Renjie Liao. Generative 3D part assembly via part-whole-hierarchy message passing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20850–20859, 2024.
  • Fang et al. [2023] Hao-Shu Fang, Chenxi Wang, Hongjie Fang, Minghao Gou, Jirong Liu, Hengxu Yan, Wenhai Liu, Yichen Xie, and Cewu Lu. AnyGrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Transactions on Robotics (T-RO), 2023.
  • Fu et al. [2024] Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024.
  • Funkhouser et al. [2011] Thomas Funkhouser, Hijung Shin, Corey Toler-Franklin, Antonio García Castañeda, Benedict Brown, David Dobkin, Szymon Rusinkiewicz, and Tim Weyrich. Learning how to match fresco fragments. Journal on Computing and Cultural Heritage (JOCCH), 4(2):1–13, 2011.
  • Goldberg et al. [2024] Andrew Goldberg, Kavish Kondap, Tianshuang Qiu, Zehan Ma, Letian Fu, Justin Kerr, Huang Huang, Kaiyuan Chen, Kuan Fang, and Ken Goldberg. Blox-Net: Generative design-for-robot-assembly using VLM supervision, physics simulation, and a robot with reset. arXiv preprint arXiv:2409.17126, 2024.
  • Gu et al. [2023] Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, et al. RT-Trajectory: Robotic task generalization via hindsight trajectory sketches. arXiv preprint arXiv:2311.01977, 2023.
  • Heo et al. [2023] Minho Heo, Youngwoon Lee, Doohyun Lee, and Joseph J. Lim. FurnitureBench: Reproducible real-world benchmark for long-horizon complex manipulation. arXiv preprint arXiv:2305.12821, 2023.
  • Huang et al. [2024a] Haoxu Huang, Fanqi Lin, Yingdong Hu, Shengjie Wang, and Yang Gao. CoPa: General robotic manipulation through spatial constraints of parts with foundation models. arXiv preprint arXiv:2403.08248, 2024a.
  • Huang et al. [2024b] Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652, 2024b.
  • Jiang et al. [2024] Hanxiao Jiang, Binghao Huang, Ruihai Wu, Zhuoran Li, Shubham Garg, Hooshang Nayyeri, Shenlong Wang, and Yunzhu Li. RoboEXP: Action-conditioned scene graph via interactive exploration for robotic manipulation. arXiv preprint arXiv:2402.15487, 2024.
  • Jones et al. [2021] Benjamin Jones, Dalton Hildreth, Duowen Chen, Ilya Baran, Vladimir G. Kim, and Adriana Schulz. AutoMate: A dataset and learning approach for automatic mating of CAD assemblies. ACM Transactions on Graphics (TOG), 40(6):1–18, 2021.
  • Jonnavittula et al. [2024] Ananth Jonnavittula, Sagar Parekh, and Dylan P. Losey. VIEW: Visual imitation learning with waypoints. arXiv preprint arXiv:2404.17906, 2024.
  • Kareer et al. [2024] Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. EgoMimic: Scaling imitation learning via egocentric video. arXiv preprint arXiv:2410.24221, 2024.
  • Kim et al. [2024] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment Anything. arXiv:2304.02643, 2023.
  • Knepper et al. [2013] Ross A. Knepper, Todd Layton, John Romanishin, and Daniela Rus. IkeaBot: An autonomous multi-robot coordinated furniture assembly system. In 2013 IEEE International Conference on Robotics and Automation, pages 855–862. IEEE, 2013.
  • Kuffner and LaValle [2000] James J. Kuffner and Steven M. LaValle. RRT-Connect: An efficient approach to single-query path planning. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), volume 2, pages 995–1001. IEEE, 2000.
  • Lee et al. [2021] Youngwoon Lee, Edward S. Hu, and Joseph J. Lim. IKEA furniture assembly environment for long-horizon complex manipulation tasks. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 6343–6349. IEEE, 2021.
  • Li et al. [2024a] Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. ManipLLM: Embodied multimodal large language model for object-centric robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18061–18070, 2024a.
  • Li et al. [2020] Yichen Li, Kaichun Mo, Lin Shao, Minhyuk Sung, and Leonidas Guibas. Learning 3D part assembly from a single image. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16, pages 664–682. Springer, 2020.
  • Li et al. [2024b] Yichen Li, Kaichun Mo, Yueqi Duan, He Wang, Jiequan Zhang, and Lin Shao. Category-level multi-part multi-joint 3D shape assembly. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3281–3291, 2024b.
  • Liu et al. [2025] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2025.
  • Liu et al. [2024] Yunong Liu, Cristobal Eyzaguirre, Manling Li, Shubh Khanna, Juan Carlos Niebles, Vineeth Ravi, Saumitra Mishra, Weiyu Liu, and Jiajun Wu. IKEA Manuals at Work: 4D grounding of assembly instructions on internet videos. arXiv preprint arXiv:2411.11409, 2024.
  • Mo et al. [2019a] Kaichun Mo, Paul Guerrero, Li Yi, Hao Su, Peter Wonka, Niloy Mitra, and Leonidas J. Guibas. StructureNet: Hierarchical graph networks for 3D shape generation. arXiv preprint arXiv:1908.00575, 2019a.
  • Mo et al. [2019b] Kaichun Mo, Shilin Zhu, Angel X. Chang, Li Yi, Subarna Tripathi, Leonidas J. Guibas, and Hao Su. PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 909–918, 2019b.
  • Qi et al. [2017] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 30, 2017.
  • Scarpellini et al. [2024] Gianluca Scarpellini, Stefano Fiorini, Francesco Giuliari, Pietro Moreiro, and Alessio Del Bue. DiffAssemble: A unified graph-diffusion model for 2d and 3d reassembly. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28098–28108, 2024.
  • Sellán et al. [2022] Silvia Sellán, Yun-Chun Chen, Ziyi Wu, Animesh Garg, and Alec Jacobson. Breaking Bad: A dataset for geometric fracture and reassembly. Advances in Neural Information Processing Systems, 35:38885–38898, 2022.
  • Shi et al. [2023] Haochen Shi, Huazhe Xu, Samuel Clarke, Yunzhu Li, and Jiajun Wu. RoboCook: Long-horizon elasto-plastic object manipulation with diverse tools. arXiv preprint arXiv:2306.14447, 2023.
  • Shi et al. [2024] Lucy Xiaoyang Shi, Zheyuan Hu, Tony Z. Zhao, Archit Sharma, Karl Pertsch, Jianlan Luo, Sergey Levine, and Chelsea Finn. Yell at your robot: Improving on-the-fly from language corrections. arXiv preprint arXiv:2403.12910, 2024.
  • Sontakke et al. [2024] Sumedh Sontakke, Jesse Zhang, Séb Arnold, Karl Pertsch, Erdem Bıyık, Dorsa Sadigh, Chelsea Finn, and Laurent Itti. RoboCLIP: One demonstration is enough to learn robot policies. Advances in Neural Information Processing Systems, 36, 2024.
  • Suárez-Ruiz et al. [2018] Francisco Suárez-Ruiz, Xian Zhou, and Quang-Cuong Pham. Can robots assemble an IKEA chair? Science Robotics, 3(17):eaat6385, 2018.
  • Sundaresan et al. [2024] Priya Sundaresan, Quan Vuong, Jiayuan Gu, Peng Xu, Ted Xiao, Sean Kirmani, Tianhe Yu, Michael Stark, Ajinkya Jain, Karol Hausman, Dorsa Sadigh, Jeannette Bohg, and Stefan Schaal. RT-Sketch: Goal-conditioned imitation learning from hand-drawn sketches, 2024. URL https://arxiv.org/abs/2403.02709.
  • Tang et al. [2024] Chen Tang, Ben Abbatematteo, Jiaheng Hu, Rohan Chandra, Roberto Martín-Martín, and Peter Stone. Deep reinforcement learning for robotics: A survey of real-world successes. Annual Review of Control, Robotics, and Autonomous Systems, 8, 2024.
  • Team et al. [2024] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.
  • Tian et al. [2022] Yunsheng Tian, Jie Xu, Yichen Li, Jieliang Luo, Shinjiro Sueda, Hui Li, Karl D. D. Willis, and Wojciech Matusik. Assemble them all: Physics-based planning for generalizable assembly by disassembly. ACM Transactions on Graphics (TOG), 41(6):1–11, 2022.
  • Tian et al. [2024] Yunsheng Tian, Karl D. D. Willis, Bassel Al Omari, Jieliang Luo, Pingchuan Ma, Yichen Li, Farhad Javid, Edward Gu, Joshua Jacob, Shinjiro Sueda, et al. ASAP: Automated sequence planning for complex robotic assembly with physical feasibility. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4380–4386. IEEE, 2024.
  • Vemprala et al. [2024] Sai H. Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor. ChatGPT for robotics: Design principles and model abilities. IEEE Access, 2024.
  • Wang et al. [2022a] Ruocheng Wang, Yunzhi Zhang, Jiayuan Mao, Chin-Yi Cheng, and Jiajun Wu. Translating a visual LEGO manual to a machine-executable plan. In European Conference on Computer Vision, pages 677–694. Springer, 2022a.
  • Wang et al. [2022b] Ruocheng Wang, Yunzhi Zhang, Jiayuan Mao, Ran Zhang, Chin-Yi Cheng, and Jiajun Wu. IKEA-Manual: Seeing shape assembly step by step. Advances in Neural Information Processing Systems, 35:28428–28440, 2022b.
  • Wang et al. [2019] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):1–12, 2019.
  • Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • Wen et al. [2024] Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. FoundationPose: Unified 6d pose estimation and tracking of novel objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17868–17879, 2024.
  • Wu et al. [2023] Ruihai Wu, Chenrui Tie, Yushi Du, Yan Zhao, and Hao Dong. Leveraging SE(3) equivariance for learning 3d geometric shape assembly. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14311–14320, 2023.
  • Wu et al. [2020] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S. Yu Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24, 2020.
  • Yang et al. [2023] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441, 2023.
  • Yao et al. [2022] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
  • Yin et al. [2023] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.
  • Yu et al. [2021] Mingxin Yu, Lin Shao, Zhehuan Chen, Tianhao Wu, Qingnan Fan, Kaichun Mo, and Hao Dong. RoboAssembly: Learning generalizable furniture assembly policy in a novel multi-robot contact-rich simulation environment. arXiv preprint arXiv:2112.10143, 2021.
  • Zare et al. [2024] Maryam Zare, Parham M. Kebria, Abbas Khosravi, and Saeid Nahavandi. A survey of imitation learning: Algorithms, recent developments, and challenges. IEEE Transactions on Cybernetics, 54(12):7173–7186, 2024. doi: 10.1109/TCYB.2024.3395626.
  • Zhang et al. [2024] Jiahao Zhang, Anoop Cherian, Cristian Rodriguez, Weijian Deng, and Stephen Gould. Manual-PA: Learning 3d part assembly from instruction diagrams. arXiv preprint arXiv:2411.18011, 2024.
  • Zhang et al. [2018] Minghua Zhang, Yunfang Wu, Weikang Li, and Wei Li. Learning universal sentence representations with mean-max attention autoencoder. arXiv preprint arXiv:1809.06590, 2018.
  • Zhao et al. [2024] Zirui Zhao, Wee Sun Lee, and David Hsu. Large language models as commonsense knowledge for large-scale task planning. Advances in Neural Information Processing Systems, 36, 2024.
  • Zhou et al. [2022] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
  • Zhu et al. [2023] Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. Viola: Imitation learning for vision-based manipulation with object proposal priors. In Conference on Robot Learning, pages 1199–1210. PMLR, 2023.
  • Zhu and Hu [2018] Zuyuan Zhu and Huosheng Hu. Robot learning from demonstration in robotic assembly: A survey. Robotics, 7(2):17, 2018.
  • Zortrax Library [n.d.] Zortrax Library. Zortrax robotic arm, n.d. URL https://library.zortrax.com/project/zortrax-robotic-arm/. Accessed: 2025-02-01.

-A Per-step Assembly Pose Estimation Dataset

We build a dataset for our proposed manual-guided per-step assembly pose estimation task. Each data piece is a tuple $(I_i, \{P\}_j, \{T\}_j, \mathbf{R}_i)$, where $I_i$ is the manual image, $\{P\}_j$ are the point clouds of all components involved in the assembly step, $\{T\}_j$ are the target poses of these components, and $\mathbf{R}_i$ encodes the spatial and geometric relationships between components.

[Figure 8: Variations of instruction manuals covered by the per-step pose estimation dataset.]

Instruction manuals in the real world come in a wide variety. To cover as many scenarios as possible, we consider three variations of instruction manuals when constructing the dataset, as shown in Figure 8. Our dataset encompasses a variety of furniture shapes. For each piece of furniture, we randomly select connected parts to form different subassemblies, and for each subassembly there are multiple possible camera perspectives from which a manual image can be rendered. Together, these variations allow the dataset to cover the kinds of manuals we are likely to encounter in real-world scenarios.

Formally, for a piece of furniture consisting of $M$ parts, we randomly select $m$ connected parts to form a subassembly, denoted $P_{\text{sub}} = \{P_1, P_2, \cdots, P_m\}$, where each $P_i$ is an atomic part. We then randomly group the $m$ atomic parts into $n$ components such that all parts within the same group are connected, denoted $P_{\text{sub}} = \{\{P_{11}, \cdots, P_{1\alpha_1}\}, \cdots, \{P_{n1}, \cdots, P_{n\alpha_n}\}\}$, where $\alpha_i$ is the number of atomic parts in the $i$-th component and $\sum_i \alpha_i = m$. We sample a point cloud for each component to form the point clouds of the data piece, and we render the subassembly from different camera perspectives.
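The sampling procedure can be viewed as a random walk over the part-connectivity graph. The sketch below illustrates one way to implement it, assuming connectivity is given as an adjacency list; the function names and data layout are ours, not the released dataset code.

```python
import random

def sample_subassembly(adjacency, m):
    """Sample m connected atomic parts from a part-connectivity graph.

    adjacency: dict mapping part id -> set of directly connected part ids.
    Assumes the furniture graph is connected and has at least m parts.
    """
    start = random.choice(list(adjacency))
    selected = {start}
    frontier = set(adjacency[start])
    while len(selected) < m and frontier:
        nxt = random.choice(sorted(frontier))
        selected.add(nxt)
        frontier |= adjacency[nxt]
        frontier -= selected
    return sorted(selected)

def group_into_components(adjacency, subassembly, n):
    """Randomly split a connected subassembly into n internally connected groups."""
    seeds = random.sample(subassembly, n)
    groups = {s: [s] for s in seeds}
    unassigned = [p for p in subassembly if p not in seeds]
    random.shuffle(unassigned)
    while unassigned:
        for p in list(unassigned):
            for members in groups.values():
                # attach p to a group it is directly connected to
                if any(p in adjacency[q] for q in members):
                    members.append(p)
                    unassigned.remove(p)
                    break
    return list(groups.values())
```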

We also annotate equivalent parts as auxiliary information. In this paper, we propose new techniques that leverage this auxiliary information at each assembly step, significantly enhancing the precision and robustness of our pose estimation model.

-B Pose Estimation Implementation

-B1 Loss Functions for Pose Estimation

Rotation Geodesic Loss: In 3D pose prediction tasks, we commonly use the rotation geodesic loss to measure the distance between two rotations [53]. Formally, given the ground-truth rotation matrix $R \in SO(3)$ and the predicted rotation $\hat{R} \in SO(3)$, the rotation geodesic loss is defined as:

$\mathcal{L}_{\text{rot}} = \arccos\left(\frac{\operatorname{tr}(R^{T}\hat{R}) - 1}{2}\right)$   (8)

where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix and $R^{T}$ is the transpose of $R$.
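For reference, a PyTorch sketch of this geodesic loss might look as follows; the clamping of the cosine for numerical stability is our addition.

```python
import torch

def rotation_geodesic_loss(R, R_hat, eps=1e-7):
    """Geodesic distance between ground-truth and predicted rotations.

    R, R_hat: tensors of shape (B, 3, 3), assumed to lie in SO(3).
    """
    m = torch.matmul(R.transpose(1, 2), R_hat)         # R^T R_hat, shape (B, 3, 3)
    trace = m.diagonal(dim1=1, dim2=2).sum(-1)          # per-sample trace
    # clamp to the valid domain of arccos to avoid NaNs from rounding
    cos = torch.clamp((trace - 1.0) / 2.0, -1.0 + eps, 1.0 - eps)
    return torch.arccos(cos).mean()
```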

Translation MSE Loss: Following [29], we use the mean squared error (MSE) loss to measure the distance between the ground-truth translation $t$ and the predicted translation $\hat{t}$:

$\mathcal{L}_{\text{trans}} = \|t - \hat{t}\|_{2}$   (9)

Chamfer Distance Loss: This loss minimizes the holistic distance between the predicted and ground-truth point clouds. Given the ground-truth point cloud $S_1 = RP + t$ and the predicted point cloud $S_2 = \hat{R}P + \hat{t}$, it is defined as:

$\mathcal{L}_{\text{cham}} = \frac{1}{|S_1|}\sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 + \frac{1}{|S_2|}\sum_{x \in S_2} \min_{y \in S_1} \|y - x\|_2^2$   (10)

where $S_1$ is the point cloud after applying the ground-truth 6D pose transformation and $S_2$ is the point cloud after applying the predicted 6D pose transformation.
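A brute-force Chamfer distance is straightforward to sketch in PyTorch (dedicated CUDA kernels are typically used in practice for large point clouds):

```python
import torch

def chamfer_distance(S1, S2):
    """Symmetric Chamfer distance between point clouds S1 (N, 3) and S2 (M, 3)."""
    d = torch.cdist(S1, S2, p=2) ** 2                  # pairwise squared distances (N, M)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```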

Pointcloud MSE Loss: We supervise the predicted rotation by applying it to the point cloud of the component and using the MSE loss to measure the distance between the rotated points and the ground-truth points:

$\mathcal{L}_{\text{pc}} = \|RP - \hat{R}P\|_{2}$   (11)

Equivalent Parts: Given a set of components, we might encounter geometrically equivalent parts that must be assembled in different locations. Inspired by [60], we group these geometrically equivalent components and add an extra loss term to encourage them to be placed in different locations. For each group of equivalent components, we apply the predicted transformation to the point cloud of each component. For all pairs $(j_1, j_2)$ within the same group, we compute the Chamfer distance (CD) between the transformed point clouds $\hat{P}_{j_1}$ and $\hat{P}_{j_2}$ and encourage this distance to be large:

$\mathcal{L}_{\text{equiv}} = -\sum_{\text{group}} \sum_{(j_1, j_2)} \text{CD}(\hat{P}_{j_1}, \hat{P}_{j_2})$   (12)
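A sketch of this repulsive term, reusing the chamfer_distance helper from the previous snippet; the grouping of equivalent parts is assumed to be given as lists of indices into the predicted point clouds.

```python
from itertools import combinations
import torch

def equivalence_loss(transformed_pcs, equivalent_groups):
    """Encourage geometrically equivalent parts to end up in different places.

    transformed_pcs: list of (N_j, 3) tensors, each already moved by its
        predicted 6D pose.
    equivalent_groups: list of index lists, e.g. [[0, 1, 2, 3]] for four legs.
    """
    loss = transformed_pcs[0].new_zeros(())
    for group in equivalent_groups:
        for j1, j2 in combinations(group, 2):
            # negative Chamfer distance: larger separation -> lower loss
            loss = loss - chamfer_distance(transformed_pcs[j1], transformed_pcs[j2])
    return loss
```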

Finally, we define the overall loss function as a weighted sum of the above loss terms:

$\mathcal{L}_{\text{total}} = \lambda_{1}\mathcal{L}_{\text{rot}} + \lambda_{2}\mathcal{L}_{\text{trans}} + \lambda_{3}\mathcal{L}_{\text{cham}} + \lambda_{4}\mathcal{L}_{\text{pc}} + \lambda_{5}\mathcal{L}_{\text{equiv}}$   (13)

where $\lambda_{1} = 1$, $\lambda_{2} = 1$, $\lambda_{3} = 1$, $\lambda_{4} = 20$, and $\lambda_{5} = 0.1$.
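Combining the terms with the stated weights is then a one-liner (a sketch; the individual loss values are assumed to be computed as defined above):

```python
LOSS_WEIGHTS = dict(rot=1.0, trans=1.0, cham=1.0, pc=20.0, equiv=0.1)

def total_loss(losses, weights=LOSS_WEIGHTS):
    """losses: dict with scalar tensors under keys 'rot', 'trans', 'cham', 'pc', 'equiv'."""
    return sum(weights[k] * losses[k] for k in weights)
```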

-B2 Mean-Max Pool

The core mechanism of the mean-max pool is to take the mean and maximum values along one dimension of a set of vectors or matrices of the same shape, each with feature dimension $\mathbb{R}^{C}$, and concatenate the two results into a single vector in $\mathbb{R}^{2C}$ that serves as a global feature. For one-dimensional vectors, we take the mean and maximum along the sequence-length dimension; for two-dimensional feature maps, we take them along the height $\times$ width dimensions:

$\mathbf{F}_{global} = [\mathbf{avg}; \mathbf{max}] \in \mathbb{R}^{2F}$   (14)

In our work, we set $F$ to 128.

We use this trick twice in this work. The first instance is when we collapse a multi-channel feature map into a one-dimensional feature vector for the image. In this case, we can express the mean-max pool as follows:

$\mathbf{X} = (\mathbf{X}_{c,h,w})^{C,H,W}_{c=1,h=1,w=1}, \quad \mathbf{avg} = \Big(\frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W} \mathbf{X}_{c,h,w}\Big)^{C}_{c=1} \in \mathbb{R}^{C}, \quad \mathbf{max} = \big(\max_{h,w} \mathbf{X}_{c,h,w}\big)^{C}_{c=1} \in \mathbb{R}^{C}$   (15)

where $\mathbf{X}$ is the multi-channel feature map of image $I_i$ with dimensions channels $(C)$ $\times$ height $(H)$ $\times$ width $(W)$, and $\mathbf{avg}$ and $\mathbf{max}$ are one-dimensional vectors of length $C$. Thus, $\mathbf{F}_{global}$ of the multi-channel feature map is a $2C$-dimensional vector.
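In PyTorch terms, the feature-map variant of the mean-max pool can be sketched as follows (tensor shapes and names are our assumptions):

```python
import torch

def mean_max_pool_2d(feature_map):
    """Mean-max pool a (B, C, H, W) feature map into a (B, 2C) global feature."""
    avg = feature_map.mean(dim=(2, 3))   # (B, C)
    mx = feature_map.amax(dim=(2, 3))    # (B, C)
    return torch.cat([avg, mx], dim=1)   # (B, 2C)
```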

The second instance appears in the baseline comparison, where we aggregate point cloud features on a per-part basis to obtain a one-dimensional global feature for the shape. In this case, we express the mean-max pool in the following form:

$\mathbf{avg} = \frac{1}{M}\sum_{j=1}^{M} \mathbf{F}_{j} \in \mathbb{R}^{F}, \quad \mathbf{max} = \max_{j}\{\mathbf{F}_{j}\} \in \mathbb{R}^{F}$   (16)

Here, $M$ denotes the number of parts in a shape. For each part in this baseline, we concatenate the one-dimensional image feature $\mathbf{F}_{I}$, the global point cloud feature $\mathbf{F}_{global}$ (both obtained by mean-max pooling), and the part-wise point cloud feature $\mathbf{F}_{j}$ to form a one-dimensional cross-modality feature, which we then feed into the pose regressor MLP.
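The per-part variant and the cross-modality concatenation used in this baseline could be sketched as (shapes and names are again assumptions):

```python
import torch

def mean_max_pool_parts(part_feats):
    """Pool per-part features (M, F) into a (2F,) global shape feature."""
    return torch.cat([part_feats.mean(dim=0), part_feats.amax(dim=0)], dim=0)

def cross_modality_feature(image_feat, part_feats, j):
    """Concatenate image feature, global shape feature, and part j's feature."""
    global_feat = mean_max_pool_parts(part_feats)
    return torch.cat([image_feat, global_feat, part_feats[j]], dim=0)
```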

-B3 Hyperparameters in Training of Pose Estimation

We train our pose estimation model on a single NVIDIA A100 40GB GPU with a batch size of 32. Each experiment runs for 800 epochs (approximately 46 hours). We set the learning rate to $1\mathrm{e}{-5}$ with a 10-epoch linear warm-up phase, after which a cosine annealing schedule decays the learning rate. We set the weight decay to $1\mathrm{e}{-7}$. The optimizer configuration for each component of the model is shown in Table V.

TABLE V: Optimizer used for each model component.
Component | Optimizer
Image Encoder | RMSprop
Point Cloud Encoder | AdamW
GNN | AdamW
Pose Regressor | RMSprop
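A minimal sketch of this configuration, with linear warm-up followed by cosine annealing; the module attribute names are placeholders, and the PyTorch schedulers shown are one possible way to realize the schedule described above.

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

def build_optimizers(model, lr=1e-5, weight_decay=1e-7, epochs=800, warmup=10):
    # per-component optimizers, matching Table V
    optims = {
        "image_encoder": torch.optim.RMSprop(model.image_encoder.parameters(),
                                             lr=lr, weight_decay=weight_decay),
        "pointcloud_encoder": torch.optim.AdamW(model.pointcloud_encoder.parameters(),
                                                lr=lr, weight_decay=weight_decay),
        "gnn": torch.optim.AdamW(model.gnn.parameters(),
                                 lr=lr, weight_decay=weight_decay),
        "pose_regressor": torch.optim.RMSprop(model.pose_regressor.parameters(),
                                              lr=lr, weight_decay=weight_decay),
    }
    # 10-epoch linear warm-up, then cosine annealing for the remaining epochs
    scheds = {
        name: SequentialLR(
            opt,
            schedulers=[LinearLR(opt, start_factor=0.1, total_iters=warmup),
                        CosineAnnealingLR(opt, T_max=epochs - warmup)],
            milestones=[warmup],
        )
        for name, opt in optims.items()
    }
    return optims, scheds
```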

-C Pose Estimation Ablation Studies

To evaluate the effectiveness of each component in our pipeline, we conduct an ablation study on the chair category. We show the quantitative results in Table VI and the qualitative results in Figure 9. First, we remove the image input and use only the point cloud input to predict the pose. The performance drops significantly, indicating that the image input is crucial for pose estimation. Second, we remove the permutation mechanism for equivalent parts (Equation 12). As shown in the visualizations, the model then fails to distinguish between equivalent parts, placing two legs in similar positions.

TABLE VI: Ablation study on the chair category.
Method | GD ↓ | RMSE ↓ | CD ↓ | PA ↑
w/o Image | 1.797 | 0.234 | 0.227 | 0.138
w/o Permutations | 0.252 | 0.051 | 0.029 | 0.783
Ours | 0.202 | 0.042 | 0.027 | 0.868
[Figure 9: Qualitative results of the pose estimation ablation study.]

Previous works usually train and predict only on fully assembled shapes. In contrast, our pose estimation dataset includes per-step data (i.e., subassemblies). We conduct an ablation study comparing two settings:

  • w/o Per-step: Training and testing on a dataset of fully assembled shapes.

  • Per-step: Training on a dataset with per-step data and testing on fully assembled shapes.

TABLE VII: Effect of per-step training data.
Method | GD ↓ | RMSE ↓ | CD ↓ | PA ↑
w/o Per-step | 0.233 | 0.046 | 0.015 | 0.753
Per-step (Ours) | 0.064 | 0.016 | 0.004 | 0.983

As shown in Table VII, adding per-step data improves assembly prediction accuracy, demonstrating that per-step inference enhances robot assembly performance.

-D Complete VLM Plan Generation Results

We provide the complete analysis for VLM plan generation. In addition to the results for all 50 furniture items with six or fewer parts shown in the main paper, we include results for all 52 furniture items with seven or more parts (denoted as ≥ 7 Parts) and for the complete dataset of 102 furniture items spanning all part counts (denoted as All Parts) in Table VIII. Furthermore, we categorize the full set of 102 furniture items in greater detail, reporting Hard Matching results for individual part counts ranging from 2 to 16 parts in Table IX. For detailed descriptions of Simple Matching and Hard Matching, we refer readers to [49].

For the GeoCluster baseline, we could not replicate the exact results reported with the IKEA-Manuals dataset [49]. Thus, we use the scores from our own experiments for the ≤ 6 Parts and ≥ 7 Parts categories while retaining the original scores from the dataset [49] for the All Parts category.

To obtain our scores, we ran the experiment five times using the same input and a temperature of 0. We repeated the sampling to account for slight variations in GPT-4o's [1] outputs, which persist even at temperature 0, and to capture the range of possible outcomes; this provides a better estimate of the model's true performance. When taking the maximum of precision, recall, and F1, the average score for ≤ 6 parts on Hard Matching is 63.7%, the worst score is 57.2%, and the best score is 69.0%. Since the average and best scores are similar, we report the best score in all tables related to assembly plan generation.

To compare the trees generated by GPT-4o [1] with the ground-truth trees in the dataset, we account for equivalence relationships among parts, which can result in multiple valid ground-truth trees. For instance, if parts 1 and 2 are equivalent and [[1, 3], 2] is a valid tree, then so is [[2, 3], 1]. Since the dataset does not account for this tree isomorphism, we manually define all equivalent parts for each of the 102 furniture items. We then permute the predicted tree using the equivalent parts, compare each permutation to the ground truth, and select the highest score. For furniture with 13 or more parts (6 items), we perform manual verification due to the computational cost of the permutations. Overall, this permutation-based evaluation increases our scores across all metrics by around 5%. To ensure fairness, we also apply the same permutation procedure to the two baselines but observe no effect.
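The permutation-based evaluation can be sketched as follows: relabel equivalent part indices in the predicted tree and keep the best-scoring variant. The nested-list tree encoding and the scoring callback are assumptions on our side, and for brevity the sketch permutes one equivalence group at a time rather than their full cross-product.

```python
from itertools import permutations

def relabel(tree, mapping):
    """Apply a part-index relabeling to a nested-list assembly tree."""
    if isinstance(tree, list):
        return [relabel(child, mapping) for child in tree]
    return mapping.get(tree, tree)

def best_score(pred_tree, gt_tree, equivalent_groups, score_fn):
    """Score pred_tree against gt_tree under relabelings of equivalent parts."""
    best = score_fn(pred_tree, gt_tree)
    for group in equivalent_groups:
        for perm in permutations(group):
            mapping = dict(zip(group, perm))
            best = max(best, score_fn(relabel(pred_tree, mapping), gt_tree))
    return best
```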

As shown in Table VIII, tasks with ≥ 7 parts exhibit a significant drop in performance (Hard Matching reaches at most 13.36%, compared to 69.0% for tasks with ≤ 6 parts), indicating that the model's performance declines as the number of parts increases. This decline is likely driven by increased task complexity and occlusion in manual drawings as the number of furniture parts grows, causing GPT-4o [1] to misinterpret out-of-distribution images and fail in the plan generation stage. As noted in [49], SingleStep always outputs the root node and selects all other nodes as its children, achieving perfect precision in Simple Matching for all cases. Beyond this, our GPT-4o-based method outperforms both baselines across all categories in Table VIII, which highlights the effectiveness of VLMs in interpreting manuals and designing reliable hierarchical assembly graphs.

Similarly, in Table IX, our method holds a significant advantage over the two baselines across all part counts. Mask Seg is an additional method we evaluate, which overlays segmentation masks from the IKEA-Manuals dataset [49] onto manual pages (prompt 3.a in Section -K), improving part identification, image clarity, and comprehension of assembly steps. Although Mask Seg slightly outperforms the original version without mask segmentations, we choose the latter for all reported tables because such masks are costly to obtain in real-world scenarios. Overall, the trend observed in Table VIII persists here, with higher scores for furniture with fewer parts and lower scores as the number of parts increases.

TABLE VIII: Assembly plan generation results (Precision / Recall / F1, %).
Method | Simple Matching (All Parts) | Hard Matching (All Parts) | Simple Matching (≥ 7 Parts) | Hard Matching (≥ 7 Parts)
SingleStep | 100.00 / 35.77 / 48.64 | 10.78 / 10.78 / 10.78 | 100.00 / 21.96 / 35.09 | 0.00 / 0.00 / 0.00
GeoCluster | 44.90 / 48.46 / 43.53 | 16.54 / 16.50 / 16.30 | 31.99 / 28.88 / 29.66 | 7.31 / 6.91 / 6.92
Ours | 58.11 / 55.98 / 56.84 | 40.63 / 39.94 / 40.22 | 33.72 / 31.95 / 32.65 | 13.36 / 12.96 / 13.11
TABLE IX: Hard Matching scores (%) by number of furniture parts.
Number of Parts | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16
SingleStep | 100 | 50 | 12.50 | 31.58 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
GeoCluster | 100 | 25 | 10.42 | 14.04 | 21.76 | 14.40 | 6.99 | 15.00 | 4.17 | 2.22 | 0 | 16.67 | 0 | 0 | 0
Ours (Mask Seg) | 100 | 100 | 75.00 | 72.81 | 56.08 | 29.64 | 24.17 | 19.05 | 16.67 | 9.63 | 3.33 | 37.50 | 0.00 | 0.00 | 0.00
Ours | 100 | 100 | 72.92 | 78.51 | 45.59 | 25.24 | 13.05 | 16.67 | 27.78 | 0 | 9.33 | 6.25 | 0.00 | 0.00 | 0.00
Furniture Count | 2 | 4 | 8 | 19 | 17 | 14 | 10 | 3 | 4 | 9 | 5 | 2 | 1 | 2 | 1

-E Assembly Graph Generation Ablation Studies

We assess the effectiveness of our VLM plan generation pipeline, emphasizing the critical role of cropped manual pages as input. The visual content of the manual pages, which details the parts and subassemblies for each step, directly influences GPT-4o's output. We therefore ablate the strategy of inputting cropped pages. For furniture requiring $N$ assembly steps, instead of providing $N$ cropped manual pages corresponding to each step, we input the entire manual consisting of $M \geq N$ pages. As shown in Table X, this "no-crop" method leads to a 7% accuracy drop in the Simple Matching category and a 25% drop in the more important Hard Matching category. The decrease is likely due to irrelevant details in full manual pages, such as the nails, people, and speech bubbles in prompt 2.a), which divert GPT-4o's focus from the furniture parts critical to each step. Overall, Table X underscores the importance of cropping manual pages to simplify the input and direct GPT-4o's attention to the most relevant details.

TABLE X: Ablation on cropped manual pages for assembly plan generation (Precision / Recall / F1, %).
Method | Simple Matching | Hard Matching
Ours (no crop) | 69.13 / 81.13 / 73.05 | 42.37 / 45.50 / 43.45
Ours | 83.47 / 80.97 / 81.99 | 69.00 / 68.00 / 68.41

-F Failure Cases Analysis

We highlight failure cases of GPT-4o in plan generation for complex furniture in Figure 10. The figure shows that while GPT-4o surpasses previous baselines in assembly planning, it still struggles with complex structures, often producing entirely incorrect results.

[Figure 10: Failure cases of GPT-4o plan generation on complex furniture.]

-G Real-World Experiment Details

This section provides the details of the real-world experiment.

-G1 Pose Estimation in the Real World

We utilize FoundationPose [52] to estimate the 6D pose and point cloud of components in the real-world scene. First, a mobile app, ARCode, is used to scan the mesh of all atomic parts of the furniture. During each step of the assembly process, the mesh, along with the RGB and depth images and an object mask, is input into the FoundationPose model, which then generates the precise 6D pose and point cloud of the component within the scene. This information is crucial for subsequent tasks, including camera pose alignment, grasping, and collision-free planning.

-G2 Camera Frame Alignment

After we obtain the estimated target poses, we first canonicalize them using the PCA procedure mentioned earlier. To accurately map these target poses to the real world, we need to align the camera frame of the manual page image, denoted $P_{m_i}$, with the real-world camera frame, denoted $P_{w_i}$, for each step $i$. This section introduces how we calculate the 6D transformation matrix $T_{mw}$ between these two frames.

To achieve this, we use the VLM to designate a stable part of the scene as a base in the world frame and utilize FoundationPose to extract the point cloud of this part. We then canonicalize the point cloud using the same PCA algorithm, ensuring that the relative 6D pose of the same component remains consistent. We denote the canonical base pose in the real world as $P_{B_w}$, which remains static during this step. From the model's predictions, we can also determine the pose of the same part used as the base in the manual, denoted $P_{B_m}$. Let $T_{mw}$ be the transformation matrix between these two frames. Using this transformation, we map the target pose in the manual frame, $P_{T_m}$, to the corresponding target pose in the real-world frame, $P_{T_w}$, for subsequent motion planning. We compute the transformation as follows:

$T_{mw} = P_{B_w} P_{B_m}^{-1}$

We then calculate the target pose in the real-world frame using:

$P_{T_w} = T_{mw} P_{T_m}$
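With 4×4 homogeneous pose matrices, this alignment reduces to a single matrix product; a minimal NumPy sketch:

```python
import numpy as np

def align_target_pose(P_B_w, P_B_m, P_T_m):
    """Map a target pose from the manual frame into the real-world frame.

    P_B_w: canonical base pose observed in the real world (4x4).
    P_B_m: pose of the same base part predicted from the manual (4x4).
    P_T_m: predicted target pose in the manual frame (4x4).
    """
    T_mw = P_B_w @ np.linalg.inv(P_B_m)   # manual frame -> world frame
    return T_mw @ P_T_m                   # target pose in the world frame
```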

[Figure 11: Aligning poses between the manual frame and the real-world frame, illustrated on a stool.]

As illustrated in Figure 11, the stool example clearly demonstrates the process of aligning poses between the manual and real-world frames, ensuring a consistent and reliable foundation for motion planning.

-G3 Heuristic Grasping Policy

For general grasping tasks, pre-trained models such as GraspNet [11] are commonly used to generate grasping poses. However, in furniture assembly, where components are often large and flat, we need to grasp specific regions of the object that are suitable for the subsequent assembly action. This requirement poses challenges for GraspNet, as it does not always estimate the best pose for the subsequent action. To address this, in addition to GraspNet, we utilize the poses generated by FoundationPose and consider the shapes of the furniture components in corner cases. These shapes fall into two categories, as shown in Figure 12:

[Figure 12: The two shape categories of furniture components handled by the heuristic grasping policy.]

Stick-Shaped Components: For stick-shaped furniture parts, such as stool legs, we select the center of the point cloud as the grasping position and use a top-down grasping pose.

Flat, Thin Board-Shaped Components: We first estimate the pose of flat, thin, board-shaped furniture parts using a bounding box. Based on this estimate, we determine the grasping pose by aligning it with the bounding box's orientation, and we set the grasping position approximately 3 cm below the top surface.
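The two heuristics can be summarized in a short sketch. The 3 cm offset follows the description above, while the shape classification threshold, the PCA-based extent estimate, and the pose parameterization are our simplifications rather than the exact policy used on the robot.

```python
import numpy as np

def heuristic_grasp(points, aspect_ratio_threshold=3.0, board_offset=0.03):
    """Pick a grasp position/approach for a stick-like or board-like part.

    points: (N, 3) point cloud of the component in the world frame.
    Returns (grasp_position, approach_description).
    """
    # rough oriented extents via PCA (SVD of the centered point cloud)
    centered = points - points.mean(axis=0)
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    extents = singular_values / np.sqrt(len(points))

    if extents[0] > aspect_ratio_threshold * extents[1]:
        # stick-shaped: grasp the center of the point cloud, top-down approach
        return points.mean(axis=0), "top-down"

    # flat board: align with the principal axis, grasp ~3 cm below the top surface
    grasp = points.mean(axis=0).copy()
    grasp[2] = points[:, 2].max() - board_offset
    return grasp, f"aligned with principal axis {components[0]}"
```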

-H Rationale for Excluding Performance Evaluation of Stage I in Hierarchical Assembly Graph Generation

Stage I, Associating Real Parts with Manuals, matches the real parts in the scene to their depictions in the manual. However, since IKEA manuals lack isolated images of individual parts, direct quantitative evaluation of this stage is challenging. Instead, Stage II implicitly reflects the quality of these associations by outputting the indices of the identified real parts. We therefore report Stage II results as an intermediate measure of how effectively our approach aligns manual images with real components.

-I Justification for Hierarchical Assembly Graph

Using a hierarchical structure to represent assembly steps provides several advantages over simple linear data structures or unstructured step-by-step plans in plain text.

  • Hierarchical structures align naturally with the assembly process where multiple parts and subassemblies combine into larger subassemblies.

  • Lists or text plans struggle to store the geometric and spatial relationships between the parts or subassemblies involved in each step, which are crucial in real assembly tasks.

  • The hierarchical graph clearly shows the dependencies between steps, revealing which steps can be performed in parallel and which must be completed before others, thereby providing flexibility for parallel construction or strategic sequencing.

-J Formal Definition of Hierarchical Assembly Graph

Inspired by Mo et al. [33], we represent the assembly process as a hierarchical graph $S = (\mathbf{P}, \mathbf{H}, \mathbf{R})$. A set of nodes $\mathbf{P}$ represents the parts and subassemblies in the assembly process, and a structure $(\mathbf{H}, \mathbf{R})$ describes how these nodes are assembled and related to each other. The structure consists of two edge sets: $\mathbf{H}$ describes the assembly relationships between nodes, and $\mathbf{R}$ represents the geometric and spatial relationships between nodes.

Node. Each node $v \in \mathbf{P}$ is an atomic part or a subassembly, consisting of a non-empty subset of parts $p(v) \subset \mathcal{P}$. The root node $v_N$ represents the fully assembled furniture, with $p(v_N) = \mathcal{P}$. A non-root, non-leaf node $v_i$ represents a subassembly, with $p(v_i)$ a non-empty proper subset of $\mathcal{P}$. All leaf nodes $v_l$ represent atomic parts, containing exactly one element of $\mathcal{P}$. Additionally, each non-leaf node corresponds to a manual image $I$ that describes how to merge smaller parts and subassemblies to form the node.

Assembly relationship. We formulate the assembly process as a tree, with all atomic parts serving as leaf nodes. The atomic parts are recursively combined into subassemblies, forming non-leaf nodes, until reaching the root node, which represents the fully assembled furniture. The edge set $\mathbf{H}$ contains directed edges from each child node to its parent node, indicating the assembly relationship. For a non-leaf node $v_i$ with child nodes $C_i$, the following properties hold:

  (a) $\forall v_j \in C_i$, $p(v_j)$ is a non-empty subset of $\mathcal{P}$.

  (b) All child nodes contain distinct elements:

    $p(v_j) \cap p(v_k) = \emptyset, \quad \forall v_j, v_k \in C_i, \ j \neq k$   (17)

  (c) The union of all child subsets equals $p(v_i)$:

    $\bigcup_{v_j \in C_i} p(v_j) = p(v_i)$   (18)

Equivalence relationship. In addition to the hierarchical decomposition of the assembly process, we also consider the equivalence relationship between nodes. We label two parts as equivalent if they share a similar shape and can be used interchangeably in the assembly process. We represent this relationship with undirected edges $\mathbf{R}_i$ among the child nodes $C_i$ of node $v_i$. An edge $\{v_a, v_b\} \in R_i$ connects two nodes $v_a \in C_i$ and $v_b \in \mathcal{P}$ if the shapes represented by $v_a$ and $v_b$ are geometrically equivalent and can therefore be interchanged during assembly. Note that $v_b$ is not constrained to be a child of $v_i$, since any two nodes can be equivalent regardless of their hierarchical positions.

The assembly structure is a hierarchical graph, where the nodes represent parts or subassemblies, and the edges represent the assembly and equivalence relationships.We consider this structured representation to be a more informative and interpretable way to formulate the assembly process than a flat list of parts.
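As a concrete illustration, the hierarchical graph can be encoded with a small node class plus a recursive check of properties (a)-(c); this is a sketch of one possible encoding, not the implementation used in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    parts: frozenset                                       # subset of atomic part ids p(v)
    children: List["Node"] = field(default_factory=list)   # empty for leaf nodes
    manual_image: Optional[str] = None                     # manual image I for non-leaf nodes

def is_valid(node: Node) -> bool:
    """Recursively check properties (a)-(c) of the assembly relationship."""
    if not node.children:                                  # leaf: exactly one atomic part
        return len(node.parts) == 1
    union, total = set(), 0
    for child in node.children:
        if not child.parts or not is_valid(child):         # (a) non-empty subsets
            return False
        union |= set(child.parts)
        total += len(child.parts)
    # (b) child part sets are pairwise disjoint, and (c) their union equals p(v)
    return total == len(union) and union == set(node.parts)
```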

-K Prompts

We offer a comprehensive set of prompts utilized in the VLM-guided hierarchical graph generation process. The process involves four distinct prompts, divided into two stages. The first two prompts, which are slight variations of each other, are used in Stage I: Associating Real Parts with Manuals. The remaining two prompts, also slight variations of each other, are employed in Stage II: Identifying Involved Parts in Each Step.

  1. 1.

    The first prompt is part of Stage I, and it initializes the JSON file’s structure and consists of two sections:

    • 1.a): Image Set: An image of the scene with furniture parts labeled using GroundingDINO [31], alongside an image of the corresponding manual’s front page.

    • 1.b) Text Instructions: A few sentences explaining the JSON file generation, supported by an example of the desired structure via in-context learning.

    This prompt is passed into GPT-4o to generate a JSON file with the name and label for each part.

  2. 2.

    The second prompt belongs in Stage I as well, and it populates the JSON file with detailed descriptions of roles. It includes:

    • 2.a): Image Set: Images of all manual pages (replacing the front page), which provide context about the function of each part, together with the scene image from the first prompt.

    • 2.b): Text Instructions: a simple text instruction explaining the context and output.

    We combine the JSON output from the first prompt with the second prompt, then query GPT-4o to generate the populated JSON file.

  3. 3.

    The third prompt is part of Stage II, and it generates a step-by-step assembly plan using:

    • 3.a): Image Set: The scene image and cropped manual pages that highlight relevant parts and subassemblies, helping GPT-4o focus on key details. Each cropped image also carries a highlighted black number on the left, indicating the current assembly step. Our ablation studies demonstrate the effectiveness of these cropped images.

    • 3.b): Text Instructions: A text instruction combining chain-of-thought and in-context learning to describe the assembly plan generation process and guide the VLM. The JSON file from Step 2 is concatenated with the third prompt as input, guiding GPT-4o to produce the final text-based assembly plan.

  4. 4.

    Stage II also includes the fourth prompt, which converts the text-based plan into a traversable tree structure for action sequencing in robotic assembly. We achieve this conversion using a simple text input with in-context learning examples.
