Multimodal AI agents are designed to process and integrate various data types, such as images, text, and videos, to perform tasks in digital and physical environments. They are used in robotics, virtual assistants, and user interface automation, where they need to understand and act based on complex multimodal inputs. These systems aim to bridge verbal…
Previously we discussed applying reinforcement learning to Ordinary Differential Equations (ODEs) by integrating ODEs within gymnasium. ODEs are a powerful tool that can describe a wide range of systems but are limited to a single variable. Partial Differential Equations (PDEs) are differential equations involving derivatives of multiple variables that can cover a far broader range…
Open-vocabulary object detection (OVD) aims to detect arbitrary objects with user-provided text labels. Although recent progress has enhanced zero-shot detection ability, current techniques handicap themselves with three important challenges. They heavily depend on expensive and large-scale region-level annotations, which are hard to scale. Their captions are typically short and not contextually rich, which makes them…
Machine learning and AI are among the most popular topics nowadays, especially within the tech space. I am fortunate enough to work and develop with these technologies every day as a machine learning engineer!
In this article, I will walk you through my journey to becoming a machine learning engineer, shedding some light and advice…
When Algorithms Dream of Photons: Can AI Redefine Reality Like Einstein? | by Manik Soni | Jan, 2025
In 1905, Albert Einstein published a paper on the photoelectric effect — a deceptively simple observation that light could eject electrons from metals. This work, which later won him the Nobel Prize, didn’t just explain an oddity in physics. It shattered classical mechanics, birthing quantum theory and reshaping our understanding of reality. But here’s a…
This AI Paper Introduces MAETok: A Masked Autoencoder-Based Tokenizer for Efficient Diffusion Models
Diffusion models generate images by progressively refining noise into structured representations. However, the computational cost associated with these models remains a key challenge, particularly when operating directly on high-dimensional pixel data. Researchers have been investigating ways to optimize latent space representations to improve efficiency without compromising image quality.
A critical problem in diffusion models is…
In December, we kicked off the agentic era by releasing an experimental version of Gemini 2.0 Flash — our highly efficient workhorse model for developers with low latency and enhanced performance. Earlier this year, we updated 2.0 Flash Thinking Experimental in Google AI Studio, which improved its performance by combining Flash’s speed with the ability…
Robots are usually unsuitable for altering different tasks and environments. General-purpose models of robots are devised to circumvent this problem. They allow fine-tuning these general-purpose models for a wide scope of robotic tasks. However, it is challenging to maintain the consistency of shared open resources across various platforms. Success in real-world environments is far from…
Which Outcome Matters?
Here is a common scenario : An A/B test was conducted, where a random sample of units (e.g. customers) were selected for a campaign and they received Treatment A. Another sample was selected to receive Treatment B. “A” could be a communication or offer and “B” could be no communication or no…