Qwen3-VL Image Processing Fails: Model Hallucinations

by Admin

Qwen3-VL Image Processing Failure: A Deep Dive

Hey guys, we've got a critical issue on our hands with the @defai.digital/mlx-serving package. The vision/image processing functionality, specifically with the Qwen3-VL model, is completely busted: instead of analyzing and describing images, the model spits out pure gibberish, full-blown hallucinations. Let's break down what's happening, what we've tried, and what we need to get this sorted out. Hopefully this write-up helps other developers hitting the same wall, and we can fix it together.

The Bug: Hallucinations Galore

The heart of the problem is that Qwen3-VL, when fed images, isn't actually processing them; it generates completely unrelated content. It's like asking someone to describe a mountain landscape photo and getting back a story about a vibrant, stylized eye surrounded by a swirling red vortex. That's the level of disconnect we're dealing with. The descriptions have absolutely no connection to the actual image content, and not just by a little: they're miles off. This makes any application requiring image analysis or document OCR, which is exactly what we're building, totally useless. The key point is not just that the model makes mistakes, but that the errors are so profound and consistent that it's clearly not attempting the task at all. If you're hitting this too, please chime in with your findings.

Keywords: @defai.digital/mlx-serving, Qwen3-VL, image processing, hallucinations, model failure. This matters because the broken feature is the package's core vision functionality.

Environment and Setup

To give you a better idea of the playing field, here's a detailed rundown of the environment:

  • Package: @defai.digital/mlx-serving@1.2.1
  • Model: mlx-community/Qwen3-VL-8B-Instruct-4bit
  • OS: macOS (Darwin 25.1.0)
  • Node.js: v24.11.1
  • Python: 3.12.9
  • MLX: 0.29.4
  • MLX-VLM: 0.28.3

We followed the documentation step by step, so the setup should be clean. If you have any additional information that could assist, please share it.
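As a quick sanity check, the Node.js side of this environment can be confirmed with a couple of lines (a minimal sketch; the Python, MLX, and MLX-VLM versions have to be verified separately from the Python environment):

```typescript
// Log the Node.js side of the environment for the bug report.
// (Python, MLX, and MLX-VLM versions have to be checked from the
// Python environment instead, e.g. with `pip show mlx mlx-vlm`.)
console.log(`Node.js:  ${process.version}`);
console.log(`Platform: ${process.platform} ${process.arch}`);
```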

Keywords: environment, package version, model version, macOS, Node.js, Python, MLX, MLX-VLM. Each part of the system has been confirmed to operate correctly.

Steps to Reproduce and Test Code

We've created a straightforward test to replicate this issue. Here's the TypeScript code we're using:

import { createEngine } from '@defai.digital/mlx-serving';

const engine = await createEngine();

await engine.loadModel({
  model: 'mlx-community/Qwen3-VL-8B-Instruct-4bit'
});

// Test with a mountain landscape photo
for await (const chunk of engine.createGenerator({
  model: 'mlx-community/Qwen3-VL-8B-Instruct-4bit',
  prompt: 'Describe this image in detail.',
  images: ['/absolute/path/to/mountains.jpg'],
  maxTokens: 200,
  temperature: 0.7
})) {
  if (chunk.type === 'token') {
    process.stdout.write(chunk.token);
  }
}

await engine.shutdown();

We load the model, provide an absolute path to an image (a mountain landscape photo in this case), and ask the model to describe it. We double-checked that the paths were correct and the files were accessible.
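For reference, this is the kind of pre-flight check we mean when we say the paths were verified (a minimal sketch using only Node's built-in fs module; assertReadableImage is our own helper, not part of the package):

```typescript
import { accessSync, constants, statSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Throws if the image path is missing or unreadable; returns the file size.
// (assertReadableImage is our own helper, not part of @defai.digital/mlx-serving.)
function assertReadableImage(path: string): number {
  accessSync(path, constants.R_OK); // throws if the file cannot be read
  const { size } = statSync(path);
  if (size === 0) throw new Error(`Empty file: ${path}`);
  return size;
}

// Demo with a throwaway file standing in for a real image path.
const demo = join(tmpdir(), 'mlx-serving-demo.jpg');
writeFileSync(demo, Buffer.from([0xff, 0xd8, 0xff])); // minimal JPEG header bytes
console.log(assertReadableImage(demo), 'bytes'); // → 3 bytes
```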

Keywords: TypeScript, test code, createEngine, loadModel, createGenerator, prompt, images, maxTokens, temperature. The provided code is concise and designed to replicate the issue directly.

Expected vs. Actual Behavior

Expected behavior: The model should analyze the mountain landscape photo and give an accurate description. We'd expect something like: "The image shows a mountain range under a cloudy sky…" Something that actually relates to the image.

Actual behavior: Instead, the model generates completely fabricated descriptions. For example, it might describe the image as “an artistic, abstract composition…a stylized human eye, rendered with bold, graphic lines…”. This is a complete mismatch with the actual content of the image.

Example 1: Mountain Photo: Expected result: accurate description of mountains, clouds, etc. Actual result: description of a stylized eye and red vortex.

Example 2: Business Document: Expected result: analysis of document content (text, layout, etc.). Actual result: description of a restaurant interior.

Example 3: Empty Token Generation: Sometimes, with document images, the model just generates a single, empty token and stops. This is the worst outcome.

This behavior is consistent across different images and prompts, and across repeated runs. Hopefully this is an easy fix for the developers.

Keywords: expected behavior, actual behavior, mountain photo, business document, empty token generation. This helps to highlight the discrepancy.

Investigation Findings

We did some digging and here's what we found:

  1. Image Paths: We confirmed that the image paths are correct and that the files exist and are readable.
  2. Model Loading: The model loads successfully from the cache, so that's not the problem.
  3. Text Generation: Text generation works fine when no images are provided. The model is fine with text.
  4. Vision Capabilities: The runtime info confirms that vision capabilities are advertised.
  5. Image Processing Failure: The crucial part: the model is not actually processing the images. It's generating random content regardless of the image input.

Keywords: image paths, model loading, text generation, vision capabilities, image processing failure. Each test has given us more clarity.

Diagnostic Test Results

We ran some diagnostic tests to confirm the image file paths and accessibility:

# Files confirmed to exist
/Users/defaiadmin/test-image.jpg (72KB JPEG)
/Users/defaiadmin/126-6.png (995KB PNG - business document)
/Users/defaiadmin/154-7.png (286KB PNG - underwriting document)

All three images – a JPEG and two PNGs – produced either hallucinated or empty outputs. The images themselves are valid, and the file system can access them.
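To double-check that the files are genuinely valid JPEG/PNG data and not just present on disk, a quick magic-byte check can be run against each one (a sketch of ours; sniffImageType is a hypothetical helper, not part of the package):

```typescript
// Identify an image by its leading magic bytes (JPEG: FF D8 FF, PNG: 89 50 4E 47).
// (sniffImageType is a hypothetical helper of ours, not part of the package.)
function sniffImageType(buf: Buffer): 'jpeg' | 'png' | 'unknown' {
  if (buf[0] === 0xff && buf[1] === 0xd8 && buf[2] === 0xff) return 'jpeg';
  if (buf[0] === 0x89 && buf[1] === 0x50 && buf[2] === 0x4e && buf[3] === 0x47) return 'png';
  return 'unknown';
}

// In practice you'd pass the first few bytes of the file, e.g.
// sniffImageType(readFileSync('/Users/defaiadmin/test-image.jpg').subarray(0, 4))
console.log(sniffImageType(Buffer.from([0xff, 0xd8, 0xff, 0xe0]))); // → jpeg
console.log(sniffImageType(Buffer.from([0x89, 0x50, 0x4e, 0x47]))); // → png
```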

Keywords: diagnostic test results, JPEG, PNG, hallucinations, empty outputs, valid image files. Both JPEG and PNG inputs have been tested.

Impact of the Issue

This is a major problem, guys. The impact is significant:

  • Document OCR is completely broken.
  • Image analysis is nonexistent.
  • The vision capabilities advertised in the documentation are non-functional.
  • All performance metrics are meaningless because the model isn't actually processing images.

This means that any project relying on this package for image processing or document analysis is effectively dead in the water. We need to fix this ASAP.

Keywords: impact, document OCR, image analysis, vision capabilities, performance metrics. Everything is broken when the key part doesn't work.

Hypothesis: The Image Passing Mechanism

Our current hypothesis is that the issue lies in how the TypeScript/Node.js layer passes images to the Python MLX-VLM backend. The images might not be:

  • Properly encoded/decoded.
  • Passed in the correct format expected by the Python runtime.
  • Reaching the model at all.

We suspect there's a problem with how the image data is handled as it moves between the different layers of the application: the images might not be correctly formatted or transmitted in a way that the Python backend can understand. We're happy to help debug this from our side.
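If the bridge passes images as base64 (we haven't confirmed the package's internal Node-to-Python protocol; this is purely illustrative), a round trip like the following is what the image bytes would need to survive intact:

```typescript
// Round-trip an image buffer through base64, the way a JSON-based
// Node -> Python bridge might serialize binary data. If the bytes don't
// survive this round trip intact, the Python side would see garbage.
const original = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a]); // PNG header stand-in
const encoded = original.toString('base64');
const decoded = Buffer.from(encoded, 'base64');
console.log(decoded.equals(original)); // → true
```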

Keywords: hypothesis, image passing mechanism, TypeScript, Node.js, Python MLX-VLM backend, encoding/decoding, format. This gives the best place to start.

Additional Context and Request

We were following the official documentation guide for