Qwen3-1.7B

  1. Manually download the model and upload it to the Raspberry Pi 5, or pull the model repository with the following command.
Note
If git lfs is not installed, please refer to the git lfs Installation Guide to install it.
git clone https://huggingface.co/AXERA-TECH/Qwen3-1.7B

File Description

m5stack@raspberrypi:~/rsp/Qwen3-1.7B$ ls -lh
total 21M
-rw-rw-r-- 1 m5stack m5stack    0 Aug 12 09:07 config.json
-rw-rw-r-- 1 m5stack m5stack 1.1M Oct 13 09:46 main_api_ax650
-rw-r--r-- 1 m5stack m5stack  132 Oct 13 11:45 main_api_axcl_aarch64
-rw-rw-r-- 1 m5stack m5stack 8.5M Oct 13 09:46 main_api_axcl_x86
-rw-rw-r-- 1 m5stack m5stack 963K Oct 13 09:46 main_ax650
-rw-rw-r-- 1 m5stack m5stack 1.7M Oct 13 09:46 main_axcl_aarch64
-rw-rw-r-- 1 m5stack m5stack 8.1M Oct 13 09:46 main_axcl_x86
-rw-rw-r-- 1 m5stack m5stack  277 Aug 12 09:07 post_config.json
drwxrwxr-x 2 m5stack m5stack 4.0K Aug 12 09:07 qwen2.5_tokenizer
drwxrwxr-x 2 m5stack m5stack 4.0K Oct 13 11:46 qwen3-1.7b-ax650
drwxrwxr-x 2 m5stack m5stack 4.0K Aug 12 09:10 qwen3_tokenizer
-rw-rw-r-- 1 m5stack m5stack 7.6K Aug 12 09:07 qwen3_tokenizer_uid.py
-rw-rw-r-- 1 m5stack m5stack  12K Oct 13 09:43 README.md
-rw-rw-r-- 1 m5stack m5stack 2.5K Oct 13 09:43 run_qwen3_1.7b_int8_ctx_ax650.sh
-rw-rw-r-- 1 m5stack m5stack 2.5K Oct 13 09:43 run_qwen3_1.7b_int8_ctx_axcl_aarch64.sh
-rw-rw-r-- 1 m5stack m5stack 2.5K Oct 13 09:43 run_qwen3_1.7b_int8_ctx_axcl_x86_api.sh
-rw-rw-r-- 1 m5stack m5stack 2.5K Oct 13 09:43 run_qwen3_1.7b_int8_ctx_axcl_x86.sh
Note
If the qwen virtual environment has already been created, there is no need to create it again; simply activate it.
  2. Create a virtual environment
python -m venv qwen
  3. Activate the virtual environment
source qwen/bin/activate
  4. Install dependencies
pip install transformers jinja2
  5. Start the tokenizer parser
python qwen3_tokenizer_uid.py --port 12345
  6. Run the tokenizer service. The host IP defaults to localhost and the port is set to 12345. After it starts, the output is:
(qwen) m5stack@raspberrypi:~/Qwen3-0.6B $ python qwen3_tokenizer_uid.py --port 12345
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Server running at http://0.0.0.0:12345
Note
The following operations require opening a new terminal on the Raspberry Pi.
  7. Set execution permissions
chmod +x main_axcl_aarch64 run_qwen3_1.7b_int8_ctx_axcl_aarch64.sh
  8. Start the Qwen3 model inference service
./run_qwen3_1.7b_int8_ctx_axcl_aarch64.sh

After a successful start, the output is:

m5stack@raspberrypi:~/rsp/Qwen3-1.7B$ ./run_qwen3_1.7b_int8_ctx_axcl_aarch64.sh
[I][                            Init][ 136]: LLM init start
[I][                            Init][  34]: connect http://127.0.0.1:12345 ok
[I][                            Init][  57]: uid: 95e7d5f3-fc8d-48ea-b489-1de9f37924d1
bos_id: -1, eos_id: 151645
  3% | ██                                |   1 /  31 [1.08s<33.54s, 0.92 count/s] tokenizer init ok[I][                            Init][  45]: LLaMaEmbedSelector use mmap
  6% | ███                               |   2 /  31 [1.08s<16.77s, 1.85 count/s] embed_selector init ok
[I][                             run][  30]: AXCLWorker start with devid 0
  100% | ████████████████████████████████ |  31 /  31 [64.75s<64.75s, 0.48 count/s] init post axmodel ok,remain_cmm(3788 MB)
[I][                            Init][ 237]: max_token_len : 2559
[I][                            Init][ 240]: kv_cache_size : 1024, kv_cache_num: 2559
[I][                            Init][ 248]: prefill_token_num : 128
[I][                            Init][ 252]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 252]: grp: 2, prefill_max_token_num : 512
[I][                            Init][ 252]: grp: 3, prefill_max_token_num : 1024
[I][                            Init][ 252]: grp: 4, prefill_max_token_num : 1536
[I][                            Init][ 252]: grp: 5, prefill_max_token_num : 2048
[I][                            Init][ 256]: prefill_max_token_num : 2048
________________________
|    ID| remain cmm(MB)|
========================
|     0|           3788|
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 1,
    "top_p": 0.8
}

[I][                            Init][ 279]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][          GenerateKVCachePrefill][ 335]: input token num : 21, prefill_split_num : 1 prefill_grpid : 2
[I][          GenerateKVCachePrefill][ 372]: input_num_token:21
[I][                            main][ 236]: precompute_len: 21
[I][                            main][ 237]: system_prompt: You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
prompt >> hello
[I][                      SetKVCache][ 628]: prefill_grpid:2 kv_cache_num:512 precompute_len:21 input_num_token:12
[I][                      SetKVCache][ 631]: current prefill_max_token_num:1920
[I][                             Run][ 869]: input token num : 12, prefill_split_num : 1
[I][                             Run][ 901]: input_num_token:12
[I][                             Run][1030]: ttft: 796.38 ms
<think>

</think>

Hello! How can I assist you today?

[N][                             Run][1182]: hit eos,avg 7.38 token/s

[I][                      GetKVCache][ 597]: precompute_len:46, remaining:2002
prompt >>

API Usage

  1. Make sure the tokenizer service is running
(qwen) m5stack@raspberrypi:~/Qwen3-0.6B $ python qwen3_tokenizer_uid.py --port 12345
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Server running at http://0.0.0.0:12345
  2. Copy run_qwen3_1.7b_int8_ctx_axcl_x86_api.sh to run_qwen3_1.7b_int8_ctx_axcl_aarch_api.sh and set execution permissions
cp run_qwen3_1.7b_int8_ctx_axcl_x86_api.sh run_qwen3_1.7b_int8_ctx_axcl_aarch_api.sh
chmod +x main_api_axcl_aarch64 run_qwen3_1.7b_int8_ctx_axcl_aarch_api.sh
  3. Modify the contents of run_qwen3_1.7b_int8_ctx_axcl_aarch_api.sh as follows:
./main_api_axcl_aarch64 \
--system_prompt "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
--template_filename_axmodel "qwen3-1.7b-ax650/qwen3_p128_l%d_together.axmodel" \
--axmodel_num 28 \
--url_tokenizer_model "http://127.0.0.1:12345" \
--filename_post_axmodel qwen3-1.7b-ax650/qwen3_post.axmodel \
--filename_tokens_embed qwen3-1.7b-ax650/model.embed_tokens.weight.bfloat16.bin \
--tokens_embed_num 151936 \
--tokens_embed_size 2048 \
--use_mmap_load_embed 1 \
--devices 0
Note
If you have installed the openai-api service provided by StackFlow, you need to manually execute sudo systemctl stop llm-openai-api to stop it.
  4. Start the Qwen3 model inference API service
./run_qwen3_1.7b_int8_ctx_axcl_aarch_api.sh

After a successful start, the output is:

m5stack@raspberrypi:~/rsp/Qwen3-1.7B $ ./run_qwen3_1.7b_int8_ctx_axcl_aarch_api.sh 
[I][                            Init][ 130]: LLM init start
[I][                            Init][  34]: connect http://127.0.0.1:12345 ok
[I][                            Init][  57]: uid: 3f3c54ef-ddfa-4fbc-bd2f-74523109857e
bos_id: -1, eos_id: 151645
  3% | ██                                |   1 /  31 [0.95s<29.33s, 1.06 count/s] tokenizer init ok[I]
[I][                            Init][ 221]: max_token_len : 2047
[I][                            Init][ 224]: kv_cache_size : 1024, kv_cache_num: 2047
[I][                            Init][ 232]: prefill_token_num : 128
[I][                            Init][ 236]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 236]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 236]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 236]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 236]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 236]: grp: 6, prefill_max_token_num : 640
[I][                            Init][ 236]: grp: 7, prefill_max_token_num : 768
[I][                            Init][ 236]: grp: 8, prefill_max_token_num : 896
[I][                            Init][ 236]: grp: 9, prefill_max_token_num : 1024
[I][                            Init][ 240]: prefill_max_token_num : 1024
________________________
|    ID| remain cmm(MB)|
========================
|     0|           3665|
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
[I][                     load_config][ 282]: load config: 
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 1,
    "top_p": 0.8
}

[I][                            Init][ 263]: LLM init ok
Server running on port 8000...

API List

Method | Path                   | Function
GET    | /api/stop              | Stop the current inference task
POST   | /api/reset             | Reset the context (a new system prompt can be set)
POST   | /api/generate          | Asynchronous text generation (streamed output retrieved via /api/generate_provider)
GET    | /api/generate_provider | Get the current incremental output (for polling)
POST   | /api/chat              | Synchronous Q&A (single turn)

1. POST /api/generate

curl -X POST "http://localhost:8000/api/generate" \
    -H "Content-Type: application/json" \
    -d '{
           "prompt": "Hello, please introduce yourself.",
           "temperature": 0.7,
           "top-k": 40
         }'

Response:

{"status": "ok"}

Notes:

  • prompt is required
  • temperature, top-k, top-p, repetition_penalty, etc. are optional sampling parameters
  • Returns "status": "ok" immediately; generation runs in the background

2. GET /api/generate_provider

Retrieve generation content and progress (streaming via polling):

curl "http://localhost:8000/api/generate_provider"

Response:

{"done":false,"response":"\n\nHello! I'm a large language model developed by Alibaba"}

When "done": true, it indicates the generation is complete.

You can poll every 200~500 ms to implement client-side streaming output, as illustrated in the sketch below.
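
As a rough illustration of this polling pattern, here is a minimal Python sketch that sends a prompt to /api/generate and then polls /api/generate_provider until "done" is true. It is only a sketch: it assumes the requests package is installed and that the service is reachable at http://localhost:8000 (as in the curl examples), and it treats the response field as incremental output, following the API list description above.

import time
import requests

BASE_URL = "http://localhost:8000"  # assumed address of the inference API service

# Kick off asynchronous generation; the server replies {"status": "ok"} immediately.
resp = requests.post(f"{BASE_URL}/api/generate",
                     json={"prompt": "Hello, please introduce yourself."})
print(resp.json())

# Poll for output until the server reports done:true.
text = ""
while True:
    chunk = requests.get(f"{BASE_URL}/api/generate_provider").json()
    # "response" is treated as incremental output here, per the API list description.
    text += chunk.get("response", "")
    if chunk.get("done"):
        break
    time.sleep(0.3)  # poll roughly every 300 ms, within the suggested 200~500 ms range

print(text)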

3. POST /api/reset

Reset LLM context (clear dialog history), optionally provide a new system prompt:

curl -X POST "http://localhost:8000/api/reset" \
    -H "Content-Type: application/json" \
    -d '{"system_prompt": "You are a helpful assistant."}'

Response:

{"status": "ok"}

Use this to clear the KV cache or to switch conversation scenarios.

4. GET /api/stop

Immediately interrupt the current generation task:

curl "http://localhost:8000/api/stop"

Response:

{"status": "ok"}

5. POST /api/chat

Send a message and return the result synchronously (non-streaming):

curl -X POST "http://localhost:8000/api/chat" \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [
            {"role": "user", "content": "Hello, please introduce yourself in one sentence."}
          ],
          "temperature": 0.7
        }'

Response:

{"done":true,"message":"<think>\n\n</think>\n\nHi there! I'm a large language model developed by Alibaba Cloud, designed to assist with a wide range of tasks and answer questions."}

Notes

/api/generate + /api/generate_provider form the asynchronous/streaming mode (suitable for UI scenarios)

/api/chat is the synchronous blocking mode (suitable for obtaining a complete answer in one request)

If the model is already running an inference task, the request will return:

{"error": "llm is running"}

If the model is not initialized, it will return:

{"error": "Model not init"}

Typical Call Flow (asynchronous)

  1. POST /api/generate sends the prompt
  2. The client polls GET /api/generate_provider every few hundred milliseconds
  3. Stop polling when done:true appears
