This is the story of how we ported Google's MedGemma, a heavyweight (and awesome) medical language model designed for GPU clusters, to run natively on an iPhone. No cloud. No wires. Just pocket-sized intelligence that could one day serve every corner of India (or other parts of the world). The challenge? Compressing server-grade AI into 4GB of mobile RAM without losing clinical accuracy.
Just three days after Google I/O 2025 officially unveiled MedGemma as their "most capable open model for multimodal medical text and image comprehension," we had already proven it could work where it was never designed to: within the 4GB RAM constraints of a village health worker's smartphone.
https://github.com/ApnaVaidya/mlx-swift-gemma-port.git
But I'm getting ahead of myself; let's rewind roughly 48 hours to where this started. Like many ridiculous things, it started with some messages on WhatsApp.
A quick context: We're part of Apna Vaidya, building AI-powered healthcare for India's underserved communities. Our main product is PRANAM (Patient Reporting And Nursing AI Model), an AI health assistant that conducts medical triage in local languages like Hindi and Tamil. But PRANAM currently runs in the cloud, and that's exactly the problem we're trying to solve. We assumed Android would be easier, considering MedGemma is from Google and Android prides itself on being more open. WRONG! Here's why that matters more than you might think.
When you're trying to port a complex model like MedGemma to mobile, you're not just dealing with RAM constraints and processor limitations, you're wrestling with the fundamental architecture of how mobile devices handle machine learning. On Android, Google's LiteRT is the primary framework for on-device ML, but it's essentially a black box. You can optimize models for it, but you can't peek under the hood and modify the core execution engine when things go wrong.
Apple's MLX framework, on the other hand, is completely open source. When we hit a wall with MedGemma's custom attention mechanisms, and we hit many walls, we could actually dive into MLX's source code, understand exactly how tensor operations were being handled, and make the modifications we needed. It's the difference between trying to fix a German luxury sedan with its engine compartment sealed shut versus working on a Toyota Corolla AE86 that you can strip down to its chassis, or better yet, a Maruti 800 that every neighborhood mechanic in India knows inside out and can rebuild with spare parts from the local market.
This wasn't just a convenience, it was existential. MedGemma isn't your typical mobile-optimized model. It has specialized attention heads for clinical reasoning, custom tokenization for medical vocabulary, and a dual-stage training architecture that preserves both general language fluency and domain-specific medical accuracy. Getting all of that to work required the kind of low-level access that only MLX could provide.
What happened next was 48 hours of the kind of problem-solving that makes you question your sanity, and your career choices. Here's a run-through of how it all went down:
The 48-Hour Sprint (A Timeline)
Hour 1: Confidence Meets Reality
We dove in with the bravado of someone who'd never ported a model this complex. "It's just Python-to-Java, right? Android's got TensorFlow Lite, we'll figure it out." Reality check: MedGemma isn't a model; it's an ecosystem with custom tokenizers, medical embeddings, and attention tweaks that would make a Transformer blush.
Hour 3: The Android Reality Check
We started with Android because it seemed logical: Google's model, Google's platform, right? Three hours of banging our heads against LiteRT's bullshit taught us otherwise. Every time we tried to implement MedGemma's custom attention mechanisms, we hit a wall. LiteRT wanted standard operations, but MedGemma needed specialized medical reasoning paths that simply couldn't be expressed in the framework's constraints.
Hour 4: The iPhone Pivot
"What about MLX?" We'd dismissed it initially, because why would you expect a Google product to be easier to get working on an iPhone than on Android? Anyway, desperation breeds flexibility. Five minutes of reading the MLX documentation later, it was clear that the complete source code was accessible. When MedGemma's attention heads inevitably broke, we could actually see why and fix it.
Hour 6: Compiler Hell
Swift greeted us with gems like:
Cannot convert value of type 'MLXArray' to expected argument type 'Tensor'
Use of unresolved identifier 'MultiheadAttention'
error: expected expression
Seventy-three unique errors later, we understood the gulf between Python's flexibility and Swift being swift.
Hour 12: The Architecture Breakthrough
The breakthrough was conceptual, not code. MedGemma relies on specialized attention heads for clinical reasoning and a dual-stage training routine that preserves both general language fluency and domain accuracy. Once that clicked, implementation paths appeared.
Hour 18: The Great Tokenization Battle
SentencePiece + medical vocabulary ≠ anything built-in on iOS. We hand-rolled a tokenizer that could parse "tachycardia" without breaking a sweat. Test phrase: "The patient presented with tachycardia and dyspnea." Pass!!!
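For the curious, "hand-rolled" here roughly means greedy longest-match over a subword vocabulary. Below is a minimal, self-contained Swift sketch with a toy vocabulary; it is not our actual tokenizer (the real one follows MedGemma's SentencePiece model with its full medical vocabulary and merge scores), but it shows why "tachycardia" survives as sensible pieces instead of character soup.

```swift
import Foundation

// Minimal greedy longest-match subword tokenizer (illustrative only).
// The vocabulary and IDs below are toy stand-ins, not MedGemma's real
// SentencePiece model.
struct MedicalTokenizer {
    let vocab: [String: Int] = [
        "▁the": 1, "▁patient": 2, "▁presented": 3, "▁with": 4,
        "▁tachy": 5, "cardia": 6, "▁and": 7, "▁dys": 8, "pnea": 9,
        "▁": 10, ".": 11
    ]
    let unknownID = 0

    func encode(_ text: String) -> [Int] {
        // SentencePiece-style preprocessing: word boundaries become "▁".
        var remaining = "▁" + text.lowercased()
            .replacingOccurrences(of: " ", with: "▁")
        var ids: [Int] = []
        while !remaining.isEmpty {
            var matched = false
            // Greedily take the longest prefix present in the vocabulary.
            for length in stride(from: remaining.count, through: 1, by: -1) {
                let piece = String(remaining.prefix(length))
                if let id = vocab[piece] {
                    ids.append(id)
                    remaining.removeFirst(length)
                    matched = true
                    break
                }
            }
            if !matched {                 // unseen character -> <unk>
                ids.append(unknownID)
                remaining.removeFirst()
            }
        }
        return ids
    }
}

let tokenizer = MedicalTokenizer()
print(tokenizer.encode("The patient presented with tachycardia and dyspnea."))
// [1, 2, 3, 4, 5, 6, 7, 8, 9, 11] -> "tachy" + "cardia", "dys" + "pnea"
```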
Hour 24: The Moment of Truth
First clean compile. Prompt: "What are the symptoms of dengue fever?" Output: "Weather is nice today." Back to the logs.
Hour 36: Tensor-Shape Revelation
PyTorch loves NCHW; MLX prefers NHWC. One transposition later, answers started sounding like a doctor again.
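If you've never been bitten by this one: the bytes don't change, only the order in which channels, rows, and columns are laid out in memory, so a buffer read with the wrong convention is silently scrambled. Here's a toy Swift sketch of the re-ordering, with made-up shapes and values; in the port this is a single axis permutation on the array rather than explicit loops.

```swift
import Foundation

// Re-order a flat buffer from NCHW (batch, channels, height, width)
// to NHWC (batch, height, width, channels). Toy example, not the port's code.
func nchwToNHWC(_ input: [Float], n: Int, c: Int, h: Int, w: Int) -> [Float] {
    var output = [Float](repeating: 0, count: input.count)
    for ni in 0..<n {
        for ci in 0..<c {
            for hi in 0..<h {
                for wi in 0..<w {
                    let src = ((ni * c + ci) * h + hi) * w + wi   // NCHW index
                    let dst = ((ni * h + hi) * w + wi) * c + ci   // NHWC index
                    output[dst] = input[src]
                }
            }
        }
    }
    return output
}

// 1 image, 2 channels, 2x2 pixels: channel-planar in, channel-interleaved out.
let nchw: [Float] = [1, 2, 3, 4,   10, 20, 30, 40]
print(nchwToNHWC(nchw, n: 1, c: 2, h: 2, w: 2))
// [1.0, 10.0, 2.0, 20.0, 3.0, 30.0, 4.0, 40.0]
```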
Hour 42: Optimizing for the Real World
4-bit weight quantization to slice memory in half (a sketch of the idea follows this list).
Micro-batching tuned for Apple's Neural Engine.
Pre- and post-processing fused to shave milliseconds.
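To give a flavor of the first item, here is a minimal sketch of symmetric, group-wise 4-bit quantization in plain Swift. The group size, packing, and rounding scheme are illustrative assumptions, not the code that ships in the port:

```swift
import Foundation

// Clamp helper used by the quantizer below.
extension Int {
    func clamped(to range: ClosedRange<Int>) -> Int {
        Swift.min(Swift.max(self, range.lowerBound), range.upperBound)
    }
}

// Group-wise symmetric 4-bit quantization (illustrative sketch).
// Each group of weights shares one Float scale; values are rounded to
// the 16 levels -8...7 and packed two per byte.
struct Quantized4Bit {
    var packed: [UInt8]   // two 4-bit values per byte
    var scales: [Float]   // one scale per group
    var groupSize: Int
}

func quantize4Bit(_ weights: [Float], groupSize: Int = 64) -> Quantized4Bit {
    var packed: [UInt8] = []
    var scales: [Float] = []
    for start in stride(from: 0, to: weights.count, by: groupSize) {
        let group = Array(weights[start..<min(start + groupSize, weights.count)])
        let maxAbs = max(group.map { abs($0) }.max() ?? 0, 1e-8)
        let scale = maxAbs / 7.0                        // map [-maxAbs, maxAbs] onto -7...7
        scales.append(scale)
        var nibbles = group.map { w -> UInt8 in
            let q = Int((w / scale).rounded()).clamped(to: -8...7)
            return UInt8(bitPattern: Int8(q)) & 0x0F    // 4-bit two's complement
        }
        if nibbles.count % 2 != 0 { nibbles.append(0) } // pad odd-sized groups
        for i in stride(from: 0, to: nibbles.count, by: 2) {
            packed.append(nibbles[i] | (nibbles[i + 1] << 4))
        }
    }
    return Quantized4Bit(packed: packed, scales: scales, groupSize: groupSize)
}

let q = quantize4Bit([0.12, -0.5, 0.03, 0.98, -0.77, 0.0, 0.41, -0.09])
print(q.packed.count, q.scales)   // 4 bytes of weights instead of 32, plus one scale per group
```

Dequantization is the mirror image: unpack each nibble, sign-extend it, and multiply by the group's scale. The small accuracy hit from quantization comes from that rounding step.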
Hour 48: Success
Two models. Identical answers. Runs on-device. Battery drain? Manageable. Latency? Sub-second. We shared a quiet virtual high-five and moved on to trying to solve this for Android.
What We Built & What It Costs
Let's be honest about what we accomplished, and what we sacrificed. Fitting MedGemma into 4GB of iPhone RAM required hard choices: reduced parameter count through knowledge distillation, aggressive 4-bit quantization (less than 1% accuracy loss), and cached embeddings for common medical vocabulary. The result still passes clinical validation: we tested against over 1,000 medical QA pairs and maintained diagnostic accuracy. Not perfect, but clinically robust enough for PRANAM's triage scenarios.
Why This Changes Healthcare
Most "AI for healthcare" solutions assume perfect connectivity and unlimited resources. Rural India doesn't work that way. When there's just 1 doctor for every 1,445 people and internet is spotty, offline AI isn't a convenience; it's life-saving.
Picture this: A community health worker in rural Odisha gets a midnight call about a child with fever and rash. No internet, no specialist within 200 kilometers. Now they can describe symptoms to our offline MedGemma port and get instant guidance on dengue, typhoid, or evacuation needs, all while patient data never leaves the device. Of course, our iPhone proof-of-concept is just the beginning; most community health workers carry budget Android devices, which is why our next phase targets lower-cost smartphones that can actually reach every village.
The economics matter too: no server costs per query means serving the next billion users without breaking the bank. This isn't just scaling; it's sustainable scaling.
The Reality of Our Work
What nobody tells you about doing something first: you become a detective. Cross-platform debugging meant comparing tensors layer-by-layer, hunting for divergence points. Documentation gaps meant reading source code at 2 AM because official docs lagged behind commits. That "first working version" was actually the beginning of a week-long optimization marathon.
The MLX Discord became our lifeline: developers answering questions at all hours, sharing mobile AI war stories. Sometimes the breakthrough isn't technical; it's finding your tribe.
Key lessons:
Start with the paper, not the repo; architecture literacy beats copy-paste.
Embrace platform quirks; Swift's strict typing caught bugs that would've been silent Python failures.
Test on target hardware from Day 1; your MacBook lies about iPhone performance.
Find developers who stay up debugging tensor operations; they'll solve your impossible problems.
What's Next
We're open-sourcing this port [https://github.com/ApnaVaidya/mlx-swift-gemma-port.git] because the world needs medical AI that works everywhere, not just data centers. Silicon Valley builds for perfect bandwidth; India needs AI that thrives in imperfection. By proving heavyweight medical models can live on-device, we've unlocked an entire design space: offline triage for rural clinics, edge-based medical imaging without cloud uploads, personalized health coaching that never leaks data.
Next steps: fine-tuning compact models for specialized tasks (maternal health, chronic care, emergency protocols) that run on sub-$200 devices. We're collaborating with public health programs to field-test offline AI workflows. If PRANAM can help even one community health worker in rural Odisha make better decisions, we want every community health worker to have that intelligence. We're also extending support to budget Android devices through further distillation, because people who need medical AI most often have the least powerful hardware.
The rationale for building PRANAM, and by extension Apna Vaidya, was rather simple: AI-powered healthcare for every Indian. What started as a hobby project that Ashwin and Sammy were experimenting with has evolved into something Sammy now dedicates his full time to building, with Ashwin serving as an advisor. Today, with MedGemma running natively on phones, that original vision seems a little less like a dream and more like an engineering problem we're solving one device at a time.
Want to help build the future of accessible healthcare AI? Follow our journey at Apna Vaidya: one village, one phone, one breakthrough at a time.