XITASO GmbH IT & Software Solutions | Germany | 76xxx Karlsruhe | Part time - flexible | Published since: 08.05.2026 on stepstone.de
Master's Thesis – Semantic 4D Occupancy Forecasting (m/f/d)
Semantic 4D occupancy forecasting is crucial for safe autonomous driving because it allows vehicles to anticipate future scene dynamics and geometry. However, training current state-of-the-art models relies heavily on fully supervised methods, which require massive amounts of extremely expensive, dense 3D voxel annotations.
To overcome this data bottleneck, cutting-edge research is increasingly moving towards self-supervised and weakly supervised paradigms that build on pre-trained 2D foundation models (e.g. DINOv2, CLIP or SAM). By aligning these rich, open-vocabulary 2D semantic features with 3D/4D spatial representations using advanced transformer architectures, it becomes possible to achieve robust spatio-temporal understanding without dense 3D ground-truth data.
Building on these breakthroughs, this master's thesis focuses on developing a foundation-model-based framework for vision-based 4D occupancy forecasting. Your task will be to design an architecture that distills rich multi-view semantics into a 4D prediction pipeline, closing the gap between scalable, purely camera-based inputs and high-fidelity predictions of the environment.
Your tasks • Your profile • What we offer
Your tasks
- Develop a Transformer-based network that predicts future semantic 4D occupancy from sequential multi-view camera data using weak or self-supervision.
- Build and train the PyTorch pipeline, and design alignment mechanisms that distill semantic features from 2D foundation models into the spatio-temporal 4D representation.
- Benchmark against fully supervised baselines on large datasets (e.g. nuScenes, OpenOccupancy), with a particular focus on prediction accuracy (IoU), semantic precision and label efficiency.
Your profile
- You are enrolled in a master's program in Computer Science, Artificial Intelligence, Robotics or a comparable field.
- You have very good Python programming skills and solid experience with deep learning frameworks (especially PyTorch).
- You bring a sound background in 3D computer vision. Practical experience with semantic segmentation, occupancy networks or 3D Gaussian Splatting is a strong plus.
- You are familiar with Vision Transformers (ViT), foundation models (DINO, CLIP) and self-supervised or weakly supervised learning paradigms.
- You work independently and in a solution-oriented way, are highly motivated, and have very good English and German skills (C1 level) for clear communication within the team and with our partners.
What we offer
- New Work & Culture: self-organized teams with plenty of creative freedom; responsibility and active participation; an open culture around mistakes and feedback
- Mentoring & personal development: individual mentoring from the first day; regular development meetings (catch-ups); leadership on an equal footing, based on trust and respect
- Lifelong learning: technical and interdisciplinary training; internal TechTalks, external training and conferences
- High-end Software Engineering: demanding, innovative and varied projects; cross-functional teams working with modern technologies; a lived expert culture and knowledge sharing
- Family friendly: childcare subsidy of up to €250 per child; continued pay for children's sick days
- Community & Events: regular events (e.g. retreats, summer festivals); personal encounters & team content; part of a diverse, well-connected community from day one
- Working hours and flexibility: free choice of working hours and location; flexible working-time accounts; 30 days of vacation; part-time option; sabbatical and workation
- Health and well-being: Mental Health Taskforce; JobRad and other offers
- Diversity & Inclusion: Diversity Taskforce for Prospective Diversity; a culture of belonging in which everyone should feel accepted
Location
XITASO GmbH IT & Software Solutions, 76131 Karlsruhe, Germany