Object perception in images and videos lets machines decipher the visual world. Computer vision systems scan pixels to recognize, track, and understand the many objects that make up digital scenes. This capability, powered by deep learning, opens doors to transformative applications – from self-driving cars…
NVFi tackles the intricate challenge of comprehending and predicting the dynamics within 3D scenes evolving over time, a task critical for applications in augmented reality, gaming, and cinematography. While humans effortlessly grasp the physics and geometry of such scenes, existing computational models struggle to explicitly learn these properties from multi-view videos. The core issue lies…
Video super-resolution, aiming to elevate the quality of low-quality videos to high fidelity, faces the daunting challenge of addressing diverse and intricate degradations commonly found in real-world scenarios. Unlike previous focuses on synthetic or specific camera-related degradations, the complexity arises from multiple unknown factors like downsampling, noise, blur, flickering, and video compression. While recent CNN-based…
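To make the degradation problem concrete, here is a minimal sketch of how training pairs for real-world video super-resolution are often synthesized: a high-quality frame is blurred, downsampled, and corrupted with noise to mimic the unknown real-world pipeline. This is a generic illustration, not the specific method of any model mentioned above; the function names and parameters are hypothetical, and NumPy is assumed.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2D Gaussian kernel used to simulate camera/optical blur."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def degrade_frame(frame, scale=2, sigma=1.0, noise_std=5.0, rng=None):
    """Blur -> downsample -> add noise for one grayscale frame (H, W) in [0, 255].

    Real pipelines also include compression artifacts and temporal effects
    such as flickering, which are omitted here for brevity.
    """
    rng = rng or np.random.default_rng(0)
    k = gaussian_kernel(sigma=sigma)
    pad = k.shape[0] // 2
    padded = np.pad(frame, pad, mode="reflect")
    blurred = np.zeros_like(frame)
    h, w = frame.shape
    # direct 2D convolution with 'same' output size (slow but dependency-free)
    for i in range(h):
        for j in range(w):
            blurred[i, j] = (padded[i:i + k.shape[0], j:j + k.shape[1]] * k).sum()
    low = blurred[::scale, ::scale]                     # naive downsampling
    noisy = low + rng.normal(0, noise_std, low.shape)   # sensor noise
    return np.clip(noisy, 0, 255)
```

In practice, methods targeting real-world footage randomize the order and strength of these degradations during training so the network cannot overfit to one fixed corruption model.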
3D human motion reconstruction involves accurately capturing and modeling a human subject's movements in three dimensions. The task becomes even more challenging with videos captured by a moving camera in real-world settings, where reconstructions often suffer from artifacts such as foot sliding. However, a team of researchers from…
The field of pose estimation, which involves determining the position and orientation of an object in space, is a rapidly evolving area, with researchers continuously developing new methods to improve its accuracy and performance. Researchers from three highly regarded institutions – Tsinghua Shenzhen International Graduate School, Shanghai AI Laboratory, and Nanyang Technological University – have…
The Segment Anything Model (SAM) is an AI-powered model that segments images for object detection and recognition. It is an effective solution for various computer vision tasks. However, SAM is not optimized for edge devices, which can lead to degraded performance and high resource consumption. Researchers from S-Lab Nanyang Technological University and Shanghai Artificial Intelligence…
Researchers from Carnegie Mellon University and Google DeepMind have collaborated to develop RoboTool, a system leveraging Large Language Models (LLMs) to imbue robots with the ability to creatively use tools in tasks involving implicit physical constraints and long-term planning. The system comprises four key components:
Analyzer for interpreting natural language
Planner for generating strategies
Calculator…
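The components above can be sketched as a chain of separately prompted LLM roles, where each stage's output feeds the next. This is a hypothetical illustration of the modular design, not RoboTool's actual implementation: `query_llm` is a stub standing in for a real LLM API call, and the Calculator's role description here is an assumption.

```python
from dataclasses import dataclass

def query_llm(role_prompt: str, message: str) -> str:
    """Hypothetical stub for an LLM API call; a real system would call a model here."""
    return f"[{role_prompt}] response to: {message}"

@dataclass
class ToolUsePipeline:
    """Sketch of a modular LLM pipeline: each stage is a distinct prompted role."""

    def analyze(self, task: str) -> str:
        # Analyzer: interpret the natural-language task and surface implicit constraints
        return query_llm("Analyzer", task)

    def plan(self, analysis: str) -> str:
        # Planner: propose a tool-use strategy given the analysis
        return query_llm("Planner", analysis)

    def calculate(self, plan: str) -> str:
        # Calculator (assumed role): turn the strategy into concrete parameters
        return query_llm("Calculator", plan)

    def run(self, task: str) -> str:
        return self.calculate(self.plan(self.analyze(task)))
```

Chaining specialized roles this way lets each prompt stay focused, at the cost of errors propagating downstream if an early stage misreads the task.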
Generative foundation models are a class of artificial intelligence models designed to generate new data that resembles the data they were trained on. These models are employed in various fields, including natural language processing, computer vision, and music generation. They learn the underlying patterns and structures from the training data…
Many branches of biology, including ecology, evolutionary biology, and biodiversity, are increasingly turning to digital imagery and computer vision as research tools. Modern technology has greatly improved researchers' capacity to analyze large volumes of images from museums, camera traps, and citizen science platforms. This data can then be used for species delineation, understanding adaptation mechanisms,…
Large Vision-Language Models (LVLMs) combine computer vision and natural language processing to generate text descriptions of visual content. These models have shown remarkable progress in various applications, including image captioning, visual question answering, and image retrieval. However, despite their impressive performance, LVLMs still face some challenges, particularly when it comes to specialized tasks that require…