Editable representations for 3D reconstruction and novel view synthesis

Lazova, Verica

Please use this identifier to cite or link to this item: doi:10.22028/D291-47134

Title:	Editable representations for 3D reconstruction and novel view synthesis
Author(s):	Lazova, Verica
Language:	English
Year of Publication:	2025
DDC notations:	004 Computer science, internet; 500 Science; 510 Mathematics
Publication type:	Dissertation
Abstract:	Learning in 3D often requires a suitable representation of the three-dimensional world. In this thesis, we focus on three specific tasks: 3D reconstruction from a single image; editable (controllable) novel view synthesis; and text-based 3D generation with model and symmetry priors. Each of these tasks relies on a different underlying 3D representation (UV maps, 3D volumes, and feature point clouds), but we impose the same requirement on all of them: they must be editable and easy to use and manipulate. In the first part of this thesis, we focus on predicting a full 3D avatar of a person from a single image. We redefine this difficult 3D reconstruction problem as a set of image completion tasks by inferring texture and geometry in the UV space of the SMPL model. From the input image and the estimated segmentation, we derive partial texture and segmentation layout maps. Our model predicts a complete segmentation map, a complete texture map, and a displacement map. The predicted displacement and texture maps can be applied to the canonical SMPL model, and the segmentation map can be used to further customize the 3D avatar. With this approach, we can naturally generalize to novel poses, shapes, and even new clothing. The results on real images from the DeepFashion dataset show that our method can reconstruct plausible 3D avatars from a single image. In the second part of the thesis, we focus on editable novel view synthesis. Methods based on neural radiance fields (NeRF) are effective for novel view synthesis; however, they memorize the radiance for every scene point within the parameters of the neural network. Because these models are scene-specific, classical editing, such as shape manipulation or scene composition, is challenging or not possible at all. We present Control-NeRF, a method for performing flexible, 3D-aware image content manipulation while enabling high-quality novel view synthesis from a set of posed input images. 
This is a hybrid approach that combines NeRF with an external scene representation. Our model couples learned scene-specific 3D feature volumes with a general NeRF rendering network. Hence, we generalize to new scenes by optimizing only the scene-specific 3D feature volume, while keeping the parameters of the rendering network fixed. The learned feature volumes are independent of the rendering model, and we can modify and combine scenes by editing their corresponding feature volumes. We demonstrate scene manipulations, including scene mixing, applying rigid and non-rigid transformations, and inserting, moving, and deleting objects in a scene, while also being able to render the new scene from many camera views. Generating 3D renderable representations of humans is another challenging task that many contemporary generative models are trying to solve. Many of these methods rely on recent developments in 2D text-to-image generative models, usually using the score distillation approach of diffusion-based generative models. However, they often lack the inherent constraints necessary to ensure consistency across the entire 3D shape, which makes the generation process very slow, often taking several hours of optimization per subject. In the third part of this thesis, we introduce FirstSight3D, a novel approach that uses a structural and symmetry prior based on an attention mechanism. This prior allows us to produce a comprehensive 3D representation from only two conditionally generated RGB images. We use a point cloud representation with shared and subject-specific features together with the aforementioned attention mechanism, coupled with a PointNeRF-based neural radiance field model pretrained on a dataset of human scans. We generate two views using an image-to-image ControlNet model conditioned on a rendered image, a 2D pose, and a text description. Then, we optimize the subject-specific features of our model using the generated views for supervision. 
Our method can generate the final 3D human representation after approximately 20 minutes of optimization.
Learning in 3D often requires a suitable representation of the three-dimensional world. In this thesis, we concentrate on three specific tasks: 3D reconstruction from a single image, editable (controllable) novel view synthesis, and text-based 3D generation using model and symmetry priors. Each of these tasks is based on a different 3D representation: UV maps, 3D volumes, and feature point clouds. Nevertheless, they all share one requirement: they must be editable and easy to use and manipulate. The first part of this thesis focuses on predicting a complete 3D avatar of a person from a single image. We formulate this demanding 3D reconstruction problem as a series of image completion tasks by inferring texture and geometry in the UV space of the SMPL model. From the input image and the estimated segmentation, we produce partial texture and segmentation layout maps. Our model predicts a complete segmentation map, a complete texture map, and a displacement map. The predicted displacement and texture maps can be applied to the canonical SMPL model, while the segmentation map can be used to further edit the 3D avatar. With this approach, the method generalizes naturally to new poses, body shapes, and even new clothing. Our results on real images from the DeepFashion dataset show that our method can reconstruct 3D avatars from a single image. In the second part of the thesis, we address editable novel view synthesis (NVS). Methods based on neural radiance fields (NeRF) [Mildenhall et al., 2020] store the radiance for every point of a scene within the parameters of the artificial neural network (ANN). 
However, these models are scene-specific, so classical editing operations such as shape manipulation or scene composition are difficult or not possible at all. We present Control-NeRF, a method for flexible, 3D-aware image manipulation that also delivers high-quality NVS from a set of input images. Our hybrid model combines NeRF with an external scene representation: it couples learned, scene-specific 3D feature volumes with a general NeRF network. We therefore generalize to new scenes by optimizing only the scene-specific 3D volume while the parameters of the network remain unchanged. The learned volumes are independent of the NeRF network, and we can modify and combine scenes by editing their corresponding volumes. We demonstrate various scene manipulations, including scene mixing, applying rigid and non-rigid transformations, and inserting, moving, and removing objects within a scene, all while retaining the ability to render the new scene from different camera viewpoints. Generating 3D renderable representations of humans is a further challenge that many current generative models attempt to solve. Many of these methods build on recent advances in 2D text-to-image generative models, mostly using the score distillation approach of diffusion-based generative models. However, these methods often lack the inherent coherence required to ensure a consistent 3D shape. This makes the generation process very slow, since optimization for a single subject often takes several hours. In the third part of this thesis, we introduce FirstSight3D, a novel approach that uses an attention-based structural and symmetry prior, as introduced in [Wewer et al., 2023]. 
This prior enables us to produce a comprehensive 3D representation from only two conditionally generated RGB images. Our method uses a point cloud representation with shared and subject-specific features together with the aforementioned attention mechanism. In addition, we use a PointNeRF-based neural radiance field (NeRF) model [Xu et al., 2022b] that was pretrained on a dataset of human scans. Two views are generated with an image-to-image ControlNet model conditioned on a rendered image, a 2D pose, and a text description. We then optimize the subject-specific features of our model, using the generated images for supervision. Our method can generate the final 3D representation of the human after about 20 minutes of optimization.
Link to this record:	urn:nbn:de:bsz:291--ds-471348 hdl:20.500.11880/41291 http://dx.doi.org/10.22028/D291-47134
Advisor:	Pons-Moll, Gerard; Schiele, Bernt; Steimle, Jürgen; Kukleva, Anna
Date of oral examination:	13-Oct-2025
Date of registration:	10-Mar-2026
Faculty:	MI - Fakultät für Mathematik und Informatik
Department:	MI - Informatik
Professorship:	MI - Not assigned to a professorship
Collections:	SciDok - Der Wissenschaftsserver der Universität des Saarlandes

Files for this record:

File	Description	Size	Format
Editable_Representations_for_3D_Reconstruction_and_Novel_View_Synthesis_PhD_Thesis.pdf	PhD dissertation	68.14 MB	Adobe PDF
