Gemini-Powered DataFrame Agent for Natural Language Data Analysis with Pandas and LangChain
Welcome to DataPro #138 - Where AI Acceleration Meets Practical Insight

This week’s edition dives into the cutting edge of data science, AI tooling, and intelligent automation, highlighting breakthroughs that are reshaping how we build, reason, and scale.

From a staggering 10,000x speed-up in Bayesian inference to OpenAI’s battle against malicious AI use, this issue captures the pulse of innovation across MLOps, LLM infrastructure, and trustworthy deployment. Google’s new MCP Toolbox integrations promise seamless AI-assisted development on Cloud Databases, while Tekton and Buildpacks simplify model automation with no Dockerfile in sight.

We also explore research frontiers, from advanced molecular design powered by ether0’s RL-tuned 24B model, to VeBrain’s leap in embodied AI, letting language models perceive, reason, and act in physical environments. On the tooling side, Alchemist shows how to distill open datasets into generative gold, and Meta’s LlamaRL raises the bar on scalable RL fine-tuning for LLMs.

Looking ahead, our preview spotlights a Gemini-powered Pandas agent capable of transforming natural language queries into statistical and visual insights, no code required. Plus, you’ll find a walkthrough on automating customer support with Bedrock and Mistral, and even a guide to running DeepSeek-R1 locally at home (if your GPU can handle it).
Whether you're in research, ops, or product, this edition offers powerful perspectives and hands-on resources to keep your stack smart and future-ready.

Cheers,
Merlyn Shelley
Growth Lead, Packt

Get Chapter 1 of Learning Tableau 2025 – Free!
Explore Tableau’s newest AI-powered capabilities with a free PDF of Chapter 1 from the latest edition of the bestselling series, Learning Tableau 2025. Written by Tableau Visionary Joshua Milligan, this hands-on guide helps you build smarter dashboards, master data prep, and apply AI-driven insights. Sign up to download your free chapter!

Top Tools Driving New Research 🔧📊

🔳 ether0: A 24B LLM Trained with Reinforcement Learning (RL) for Advanced Chemical Reasoning Tasks. ether0 is a 24B-parameter language model developed by FutureHouse to tackle advanced chemical reasoning tasks. Trained using a blend of reinforcement learning and behavior distillation, it generates molecular structures as SMILES strings and significantly outperforms both general-purpose and chemistry-specific models. ether0 demonstrates exceptional accuracy and data efficiency, achieving 70% accuracy with only 60,000 training reactions, surpassing models trained on full datasets. Its architecture includes novel training strategies like GRPO, curriculum learning, and expert initialization, making it a new benchmark in scientific LLM development for molecular design and synthesis.

🔳 OpenGVLab/VeBrain: Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces. Visual Embodied Brain (VeBrain) is a unified framework designed to extend multimodal large language models (MLLMs) into physical environments, enabling them to perceive, reason, and control in real-world spaces. By translating robotic tasks into text-based interactions within a 2D visual context, VeBrain simplifies multimodal objectives. It introduces a robotic adapter to convert MLLM-generated text into actionable control for physical systems.
The accompanying VeBrain-600k dataset, meticulously curated with multimodal chain-of-thought reasoning, supports this integration. VeBrain significantly outperforms models like Qwen2.5-VL across multimodal and spatial benchmarks, and demonstrates superior adaptability and compositional reasoning in legged robot and robotic arm control tasks.

🔳 Alchemist: Turning Public Text-to-Image Data into Generative Gold. Alchemist introduces a novel strategy for curating high-quality supervised fine-tuning (SFT) datasets to enhance text-to-image generation. By using a pre-trained generative model to identify impactful samples, the authors created a compact, diverse 3,350-sample dataset that significantly boosts the performance of five public T2I models. Unlike existing narrow-domain datasets, Alchemist is general-purpose and openly available, addressing limitations of proprietary data reliance. The approach offers a cost-effective and scalable alternative for dataset creation while improving image quality and stylistic variation in generative outputs. Fine-tuned model weights are also publicly released to support broader research and application.

🔳 Meta Introduces LlamaRL: A Scalable PyTorch-Based Reinforcement Learning (RL) Framework for Efficient LLM Training at Scale. Meta’s LlamaRL is a new PyTorch-based framework designed to make reinforcement learning more scalable for training large language models. It uses an asynchronous, distributed architecture where components like generation and training run in parallel, reducing GPU idle time and improving memory efficiency. LlamaRL supports massive models, up to 405B parameters, with significant speedups, achieving over 10× faster RL step times compared to traditional methods. Features such as dedicated executors, NVLink-based synchronization, and offloading enable modularity and fine-grained parallelism.
LlamaRL offers a flexible, high-performance infrastructure for aligning large models through RL at industrial scale.

Topics Catching Fire in Data Circles 🔥💬

🔳 Automate Model Training: An MLOps Pipeline with Tekton and Buildpacks. This tutorial introduces an automated MLOps pipeline for training GPT-2 models using Tekton and Buildpacks, without writing a Dockerfile. It demonstrates how to containerize training workflows and orchestrate CI/CD pipelines in Kubernetes. Using Buildpacks, the training code is converted into a secure container image, while Tekton Pipelines manages sequential tasks for building and executing training. A shared PersistentVolume ensures smooth data flow across steps. The pipeline is lightweight, reproducible, and well suited to integrating experimentation into production-grade ML workflows. This example highlights the growing importance of efficient, code-light automation in model development.

🔳 Prescriptive Modeling Unpacked: A Complete Guide to Intervention with Bayesian Modeling. This guide explores how prescriptive modeling, using Bayesian methods, enables data-driven intervention in complex systems rather than just prediction. Moving beyond forecasting, it identifies causal drivers in systems and quantifies the effects of changes. With hands-on examples in predictive maintenance and Bayesian networks via the bnlearn Python library, the article walks through building causal models, inferring interventions, and applying them to real-world scenarios like water infrastructure. It also covers structure learning, synthetic data generation, and practical cost-benefit considerations, making it a comprehensive resource for actionable analytics in operations and engineering.

🔳 How is OpenAI responding to The New York Times’ data demands in order to protect user privacy? OpenAI is actively resisting a legal demand from The New York Times to indefinitely retain ChatGPT and API user data, a move it argues undermines its privacy commitments.
The order excludes Enterprise and Zero Data Retention API users. OpenAI is appealing the decision, maintaining that data will remain securely stored, restricted to legal teams, and used only to meet legal obligations. Deleted chats, normally erased within 30 days, are affected by the hold, but OpenAI vows to fight further access requests and uphold user privacy throughout the legal process. Training policies and business data protections remain unchanged.

🔳 What do execs want to know about multi-agentic systems with AI? This field report highlights key lessons from enterprise adoption of Multi-Agent Systems (MAS). While MAS can transform complex processes through coordinated AI agents, many leaders limit its value by simply automating legacy workflows. Success requires reimagining processes, designing thoughtful agent collaboration, and embedding governance and ethics from the start. Common missteps include neglecting collaboration logic, delaying ethical safeguards, and underestimating the shift needed to harness MAS fully. Executives most often ask how to measure ROI beyond cost, how to balance human and AI roles, and how to manage ethical risks. Effective MAS design relies on clear goals, rigorous testing, and human-AI orchestration.

New Case Studies from the Tech Titans 🚀💡

🔳 10,000x Faster Bayesian Inference: Multi-GPU SVI vs. Traditional MCMC. Bayesian inference has traditionally been limited by high computational demands, especially in large-scale applications. This guide demonstrates how Stochastic Variational Inference (SVI) on multi-GPU setups can dramatically accelerate Bayesian modeling, achieving up to a 10,000x speedup over traditional CPU-based MCMC. Using JAX and NumPyro, data is efficiently sharded and replicated across GPUs, enabling scalable inference for millions of observations and parameters. Benchmarks show multi-GPU SVI reduces training time from days to minutes, making large hierarchical Bayesian models feasible for production.
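The article’s pipeline uses JAX and NumPyro across many GPUs, but the core idea of SVI, replacing sampling with optimization of a variational distribution’s parameters, fits in a toy example. Below is a minimal single-CPU sketch in plain NumPy (a hypothetical normal-normal model with made-up numbers, not the article’s code) that fits q(θ) = N(m, s²) by gradient ascent on the ELBO:

```python
import numpy as np

# Toy model: x_i ~ N(theta, 1), prior theta ~ N(0, 1).
# Variational family q(theta) = N(m, s^2); for this conjugate model the
# exact posterior is N(n*xbar/(n+1), 1/(n+1)), so we can check the fit.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)
n, xbar = len(x), x.mean()

m, log_s = 0.0, 0.0      # variational parameters (optimize log s so s stays positive)
lr = 0.5 / (n + 1)       # step size for the mean
for _ in range(5000):
    # Closed-form ELBO gradients for this model:
    #   dELBO/dm     = n*xbar - (n+1)*m
    #   dELBO/dlog s = 1 - (n+1)*s^2
    s2 = np.exp(2 * log_s)
    m += lr * (n * xbar - (n + 1) * m)
    log_s += 0.1 * (1 - (n + 1) * s2)

print(m, np.exp(2 * log_s))             # fitted posterior mean and variance
print(n * xbar / (n + 1), 1 / (n + 1))  # exact values for comparison
```

The multi-GPU version the article describes swaps these hand-derived gradients for autodiff and shards `x` across devices; the optimization loop itself is what makes SVI so much faster than MCMC.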
This approach is ideal for practitioners seeking rapid, scalable, and approximate Bayesian solutions in real-world settings.

🔳 BenchmarkQED: Automated benchmarking of RAG systems: BenchmarkQED is an automated benchmarking suite designed to rigorously evaluate retrieval-augmented generation (RAG) systems. Developed to support tools like GraphRAG, it includes components for query generation (AutoQ), evaluation (AutoE), and dataset structuring (AutoD). BenchmarkQED enables consistent testing across local-to-global query types, using synthetic queries and LLM-based judgments. LazyGraphRAG, evaluated with this suite, consistently outperforms traditional and advanced RAG methods, even those with massive 1M-token contexts, across comprehensiveness, diversity, empowerment, and relevance. BenchmarkQED and its datasets, now open-source, offer a scalable, structured path for testing next-gen RAG capabilities in real-world QA applications.

🔳 OpenAI on Countering Malicious AI – June 2025: OpenAI’s June 2025 report highlights how its teams are actively detecting and disrupting malicious uses of AI. In line with its mission to ensure AI benefits humanity, the company outlines efforts to block harmful applications such as cyber espionage, social engineering, scams, and influence operations. By leveraging AI to augment internal investigative teams, OpenAI has rapidly identified and neutralized threats over the past three months. The report reinforces the importance of democratic AI governance and common-sense safeguards to prevent misuse by authoritarian regimes and bad actors while supporting global safety and accountability.

🔳 Deploying Llama4 and DeepSeek on AI Hypercomputer: Google has released new optimized recipes for deploying Meta’s Llama4 and DeepSeek models using its AI Hypercomputer platform. These guides streamline the setup of powerful MoE-based LLMs like Llama-4-Scout and DeepSeek-R1 across Trillium TPUs and A3 GPUs.
Using inference engines like JetStream, MaxText, vLLM, and SGLang, developers can now efficiently run large models with multi-host support, minimal configuration, and reproducible performance. Recipes cover tasks such as model checkpoint conversion, TPU/GPU provisioning, and benchmarking (e.g., MMLU), enabling scalable, high-throughput inference for cutting-edge open-source LLMs in production-grade environments.

🔳 New MCP integrations for Google Cloud Databases: Google Cloud has announced new MCP Toolbox integrations for databases, designed to supercharge AI-assisted development. The open-source Model Context Protocol (MCP) server now supports seamless connections between AI coding assistants (like Claude Code, Cline, and Cursor) and databases such as BigQuery, AlloyDB, Cloud SQL, Spanner, and others. These new capabilities enable developers to perform tasks like schema design, data exploration, code refactoring, and integration testing using natural language prompts within their IDEs. The result: faster, smarter development workflows, with AI handling the SQL and schema logic, dramatically reducing setup and iteration time.

Blog Pulse: What’s Moving Minds 🧠✨

🔳 Mastering SQL Window Functions: This article offers a clear and practical introduction to using window functions for powerful row-wise analysis without collapsing data. Unlike traditional aggregations, window functions (like SUM() OVER or RANK() OVER) preserve individual records while enabling calculations across partitions. Examples include calculating totals per brand, ranking by price, and computing year-wise averages, all while retaining full row-level detail. These functions are essential for tasks like ranking, comparisons, and cumulative metrics, making them a vital tool in modern analytics workflows.
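The preserve-every-row behavior of window functions is easy to see outside SQL as well. Here is a small pandas analogue (hypothetical product data) of SUM() OVER (PARTITION BY brand) and RANK() OVER (PARTITION BY brand ORDER BY price DESC):

```python
import pandas as pd

# Hypothetical product table
df = pd.DataFrame({
    "brand": ["A", "A", "A", "B", "B"],
    "price": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# SUM() OVER (PARTITION BY brand): per-brand total, repeated on every row
df["brand_total"] = df.groupby("brand")["price"].transform("sum")

# RANK() OVER (PARTITION BY brand ORDER BY price DESC)
df["price_rank"] = (
    df.groupby("brand")["price"].rank(method="min", ascending=False).astype(int)
)

print(df)
```

Unlike `df.groupby(...).sum()`, `transform` returns one value per input row, which is exactly what distinguishes a window function from an ordinary aggregation.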
However, they may incur performance costs on large datasets, so use them judiciously.

🔳 Automate customer support with Amazon Bedrock, LangGraph, and Mistral models: This walkthrough demonstrates how to build an intelligent, multimodal customer support workflow using Amazon Bedrock, LangGraph, and Mistral models. By combining large language models with structured orchestration and image-processing capabilities, the solution automates tasks such as ticket categorization, transaction and order extraction, damage assessment, and personalized response generation. LangGraph enables complex, stateful agent workflows while Amazon Bedrock provides secure, scalable access to LLMs and Guardrails for responsible AI. With integrations for Jira, SQLite, and vision models like Pixtral, this framework delivers real-time, context-aware support automation with observability and safety built in.

🔳 Run the Full DeepSeek-R1-0528 Model Locally: DeepSeek-R1-0528, a powerful reasoning model requiring 715GB of disk space, is now runnable locally thanks to Unsloth's 1.78-bit quantization, reducing its size to 162GB. This guide explains how to deploy the quantized version using Ollama and Open WebUI. With at least 64GB RAM (CPU-only) or a 24GB GPU (for better speed), users can serve the model via ollama run, launch Open WebUI in Docker, and interact with the model through a local browser. While GPU usage offers ~5 tokens/sec, CPU-only fallback is much slower (~1 token/sec). Setup is demanding, but viable with persistence.

🔳 How to Build an Asynchronous AI Agent Network Using Gemini for Research, Analysis, and Validation Tasks: The Gemini Agent Network Protocol offers a modular framework for building cooperative AI agents (Analyzer, Researcher, Synthesizer, and Validator) using Google’s Gemini models.
This tutorial walks through creating asynchronous workflows where each agent performs role-specific tasks such as breaking down complex queries, gathering data, synthesizing information, and verifying results. By using Python's asyncio for concurrency and google.generativeai for model interaction, the network dynamically routes tasks and messages. With detailed role prompts and shared memory for dialogue context, it allows for efficient multi-agent collaboration. Users can simulate scenarios such as analyzing quantum computing’s impact on cybersecurity and observe real-time agent participation metrics.

🔳 Build a Gemini-Powered DataFrame Agent for Natural Language Data Analysis with Pandas and LangChain: This tutorial demonstrates how to combine Google’s Gemini models with Pandas and LangChain to create an intelligent, natural-language-driven data analysis agent. Using the Titanic dataset as a case study, the setup allows users to query the data conversationally, eliminating the need for repetitive boilerplate code. The Gemini-Pandas agent can answer simple questions such as dataset size, compute survival rates, or identify correlations. It can also handle advanced analyses like age-fare correlation, survival segmentation, and multi-DataFrame comparisons. Custom analyses, such as building passenger risk scores or evaluating deck-wise survival trends, are also supported.
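Under the hood such an agent is a thin wrapper: a question like “what was the survival rate?” is translated by the model into ordinary pandas expressions, which are then executed. A sketch of the kind of operations the agent generates, run on a tiny synthetic stand-in for the Titanic frame (hypothetical values; the tutorial itself wires these through LangChain’s experimental pandas agent and a Gemini chat model, which requires an API key):

```python
import pandas as pd

# Tiny synthetic stand-in for the Titanic dataset (hypothetical values)
df = pd.DataFrame({
    "survived": [1, 0, 1, 0, 1, 0],
    "age":      [22, 38, 26, 35, 28, 54],
    "fare":     [7.25, 71.3, 7.9, 53.1, 8.05, 51.9],
    "sex":      ["f", "m", "f", "m", "f", "m"],
})

# "How many passengers are in the dataset?"
n_passengers = len(df)

# "What was the overall survival rate?"
survival_rate = df["survived"].mean()

# "Did survival differ by sex?"  (survival segmentation)
by_sex = df.groupby("sex")["survived"].mean()

# "Is age correlated with fare?"  (age-fare correlation)
age_fare_corr = df["age"].corr(df["fare"])

print(n_passengers, survival_rate)
print(by_sex)
print(age_fare_corr)
```

Wrapped in `create_pandas_dataframe_agent` (from `langchain_experimental`) with a Gemini chat model, the same operations are produced from plain-English questions instead of being typed by hand.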
With just a few lines of Python and LangChain tooling, analysts can turn datasets into a conversational playground for insight discovery.

See you next time!