Computer Engineering and Applications

Human Pose Estimation with Semantic Enhancement and Adaptive Multi-Scale Feature Fusion

ZHANG Jiabo, HE Ajuan, TANG Shangsong

Computer Engineering and Applications 2025, 61 (23): 212-223. DOI: 10.3778/j.issn.1002-8331.2407-0177

Because keypoints are small in scale and sensitive to location, effectively extracting spatial and semantic information has long been the main challenge in pose estimation. To address this problem, this paper proposes a semantic-enhanced and adaptive multi-scale feature fusion network (SAMFFNet) for human pose estimation. SAMFFNet uses the lightweight MobileNetV2 as the backbone network to build the feature pyramid, and uses EfficientViT to generate scale-aware global semantics. In the designed deep semantic injection module, content-guided attention fuses global semantics with local features to enhance the semantic representation of keypoints. Furthermore, an adaptive multi-scale feature fusion module is proposed, which dynamically adjusts the large spatial receptive field according to the input features and enhances the information interaction between features at different scales by integrating the large selective kernel (LSK) module and a cross-layer interaction mechanism. Experimental results show that on the COCO validation set, SAMFFNet improves accuracy by 6.1 percentage points over the backbone network, reaching 70.7%. Although its accuracy is slightly lower than that of the larger SimpleBaseline model, it reduces the number of parameters by 85.0% and the computational complexity by 78.3%. On the MPII dataset, it also improves accuracy by 2.3 percentage points over the backbone network. The results on the COCO and MPII datasets confirm the effectiveness of SAMFFNet in enhancing human structural features and of its feature fusion strategy.

Small Object Detection Network Based on Step-by-Step Adaptively Feature Fusion Module

CHEN Peng, LIN Bin, BAI Yong, HUANG Weilun

Computer Engineering and Applications 2025, 61 (23): 224-232. DOI: 10.3778/j.issn.1002-8331.2409-0081

Small object detection plays a role in tasks such as driving assistance, smart healthcare, and drone inspections. Multi-scale feature learning is a commonly adopted strategy in designing small object detection networks. The classic feature pyramid structure achieves multiscale information transmission by integrating feature maps from different levels, thereby capturing key information about small objects across feature maps of varying resolutions. However, when fusing feature maps at different scales, semantic information conflicts often arise, leading to inconsistent gradient computations and causing the information of small objects to be overwhelmed. Therefore, a step-by-step adaptive feature fusion module (SAFF) is proposed, which divides the feature fusion process into three sequential stages. By progressively fusing adjacent scale feature maps, it resolves the issue of semantic conflict during the fusion process. Additionally, within each stage, adaptive feature fusion can alleviate the problem of inconsistent gradient calculations. The SAFF module is applied to general object detection networks to form the SAFF-RCNN and Cascade-SAFF-RCNN networks dedicated to small object detection. Experimental results show that the proposed networks achieve significant improvements in small object detection performance, reaching or surpassing other mainstream small object detection models, thus demonstrating the effectiveness of the proposed SAFF module in small object detection.

Lightweight and Synergistically Enhanced YOLOv8n Model for Traffic Sign Detection

FANG Tianrui, CHENG Guang, LIU Hailin, TANG Shaohu

Computer Engineering and Applications 2025, 61 (23): 233-247. DOI: 10.3778/j.issn.1002-8331.2507-0109

A lightweight object detection model, RACP-YOLO (reconstruction-aware compressed prediction YOLO), is proposed to address challenges in traffic sign detection, including missed detection of small targets, interference from complex backgrounds, and excessive model complexity. The backbone integrates a compact C2F-RVB module to improve low-level semantic representation and employs an ADown module for multi-scale downsampling, effectively balancing resolution and receptive field to enhance object perception. A channel-aware attention (CAA) mechanism is used to strengthen inter-channel dependencies and saliency response. The core improvement lies in the proposed SCConv detection head, composed of a spatial reconstruction unit (SRU) and a channel reconstruction unit (CRU) in a dual-branch design. Combined with an additional P2 branch, the resulting SCHead enhances spatial modeling for small-scale and local targets. Experimental results on the TT100K dataset demonstrate that RACP-YOLO achieves an mAP@0.5 of 0.685, surpassing YOLOv8n by 2.1%. The number of parameters is reduced from 3.01×10^6 to 1.12×10^6 (a reduction of 62.8%), and computational cost drops from 8.1×10^9 to 4.3×10^9 (a reduction of approximately 46.9%). Furthermore, generalization experiments on the CCSTB dataset confirm that the proposed model maintains stable detection performance and strong adaptability in complex scenarios, such as nighttime, strong light, and rainy conditions. This improvement enables higher detection accuracy while significantly enhancing model compactness and deployment efficiency, making it well-suited for real-time applications in in-vehicle and edge scenarios.
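
The reduction percentages quoted in this abstract can be checked with a few lines of arithmetic. The sketch below reads the parameter counts as 3.01×10^6 and 1.12×10^6 and the computational costs as 8.1×10^9 and 4.3×10^9; the helper name `relative_reduction` is hypothetical and not taken from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the percentage reduction when going from `before` to `after`."""
    return (before - after) / before * 100.0


# Figures as stated in the abstract (parameter count and computational cost).
params_reduction = relative_reduction(3.01e6, 1.12e6)
cost_reduction = relative_reduction(8.1e9, 4.3e9)

print(f"parameters:  {params_reduction:.1f}%")   # 62.8%
print(f"computation: {cost_reduction:.1f}%")     # 46.9%
```

Both values agree with the percentages reported in the abstract.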

LGM-YOLOv11: Underwater Object Detection Model Fusing Multi-Scale Attention Mechanism

CHEN Hui, YU Yongjie

Computer Engineering and Applications 2025, 61 (23): 248-263. DOI: 10.3778/j.issn.1002-8331.2506-0362

Underwater images play a crucial role in applications such as marine ecological environment monitoring and underwater resource development. However, underwater images are often affected by factors such as light scattering, suspended particles, and color attenuation, resulting in low contrast, blurred edges, and noise interference, which in turn reduce the accuracy and efficiency of underwater target detection. To address these challenges, an underwater object detection model integrating a multi-scale attention mechanism is proposed to enhance the detection performance of underwater objects. Firstly, the Laplacian-of-Gaussian stem (LoGStem) is introduced to replace the first two convolutional layers of the YOLOv11 backbone network, enhancing the extraction of edge and texture details in underwater images. Secondly, the gated activation convolution module (GSConv) is proposed and embedded in the feature pyramid network, using a gating mechanism to adapt features dynamically at each spatial position and channel, thereby enhancing the model's ability to capture details. Then, the multi-scale enhanced parallel attention module (MSEPA) is proposed and integrated into C3k2; through the combined effect of multi-scale feature fusion and multiple attention mechanisms, the receptive field is enlarged and the feature representation is enhanced. Finally, to improve the accuracy and stability of small target localization, the Shape-NWD loss function is used. Experiments on the UTDAC, DUO, RUOD and underwater garbage datasets show that the proposed method achieves the best detection accuracy among the compared models.

Detection Method of Railway Perimeter Intrusion Combined with Compact Features and Attention

WANG Hui, LI Zelong, YE Jiangang, TANG Xiaokun, XU Feng

Computer Engineering and Applications 2025, 61 (23): 264-273. DOI: 10.3778/j.issn.1002-8331.2411-0288

To address the issue of perimeter intrusions impacting train safety in railway environments, and to overcome the limitations of low accuracy and efficiency in existing methods, a perimeter intrusion foreign object detection approach is proposed based on the YOLOv9 model. The proposed feature aggregation module reduces the network's computational complexity by employing a compact architecture, thereby enhancing detection efficiency. A multi-channel attention mechanism with inverted residual is proposed by integrating the inverted residual structure with the designed multi-channel attention. This approach reduces the number of convolutional parameters, promotes extensive interaction of information across channels, captures key features of the detection target, enhances anomaly detection accuracy, and minimizes both false negatives and false positives. The modified auxiliary detection branch effectively extracts image feature information while reducing the model's parameter size. Experimental results demonstrate that the proposed model achieves an mAP@0.5 of 93.5% and a recall rate of 89.2% on the railway perimeter foreign object dataset, outperforming the YOLOv9 model by 6.1 and 4.6 percentage points, respectively, while reducing the model's parameter count by 54.5%. Compared to other mainstream models, the proposed model achieves superior performance across key evaluation metrics, including mAP@0.5, recall rate, false positive rate, and false negative rate, and demonstrates strong performance in perimeter intrusion detection tasks.

Multi-Target Segmentation Model for Aerial Remote Sensing Based on Global Reconstruction Semantics Aggregation

WU Xiaosuo, QIAO Yudong, HE Chenglong, LIU Xiaoming, YAN Haowen

Computer Engineering and Applications 2025, 61 (22): 215-225. DOI: 10.3778/j.issn.1002-8331.2408-0077

To address the challenges of multiple target scales, insufficient semantic information, and blurred feature boundaries in aerial remote sensing images, a segmentation model that aggregates global information and reconstructs semantic representations after feature classification is proposed. Swin-Transformer is employed as the encoder to capture contextual information and extract deep features. A designed deep-shallow semantic reconstruction module and a channel residual reconstruction module classify and reconstruct these features based on their information content. Subsequently, a regional upsampling and downsampling connection strategy is introduced to fuse the reconstructed features with the encoder features into a comprehensive feature aggregation block for final output. This approach enables the fine-grained reconstruction of multi-target features and the generation of accurate segmentation maps, thereby enhancing segmentation precision and achieving high-quality pixel-wise regression. Experimental results show that the model achieves mean intersection over union (mIoU) scores of 87.2% and 82.9%, and overall accuracy (OA) scores of 91.4% and 91.2% on the ISPRS Vaihingen and ISPRS Potsdam datasets, respectively.
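
For readers unfamiliar with the two metrics reported here, the following minimal sketch computes mIoU and OA from a confusion matrix using their standard definitions. The function names and the toy 2×2 matrix are illustrative assumptions; the abstract does not describe the paper's exact evaluation protocol (for example, which classes are included).

```python
def overall_accuracy(cm):
    """OA: correctly classified pixels divided by all pixels.

    `cm[i][j]` counts pixels of true class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def mean_iou(cm):
    """mIoU: average over classes of TP / (TP + FP + FN)."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, true class differs
        denom = tp + fp + fn
        if denom > 0:                               # skip classes absent from both maps
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy two-class confusion matrix (hypothetical pixel counts).
cm = [[8, 2],
      [1, 9]]
print(overall_accuracy(cm))  # 0.85
print(mean_iou(cm))          # mean of 8/11 and 9/12
```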

Feature Refinement Skeletal Action Recognition Method Based on GCN and CNN Fusion

CHEN Xingqi, SONG Tao, ZOU Yangyang

Computer Engineering and Applications 2025, 61 (22): 226-234. DOI: 10.3778/j.issn.1002-8331.2408-0007

Graph convolutional networks (GCNs) have been widely applied and have achieved significant results in current mainstream research due to their ability to effectively capture skeleton features. However, the size of the fixed temporal convolution kernel limits the receptive field in the temporal convolution process, and the problems of insufficient extraction of bone feature information, cross-scale feature refinement, and multi-layer semantic feature connection in the graph convolution process need to be further solved. Aiming at these problems, a fusion network is designed, which utilizes the advantage of GCN in retaining skeleton features and the strong ability of convolutional neural network (CNN) in extracting spatial features. In the network, the multi-branch temporal enhanced convolution (MTE Conv) is set up with different branches and temporal enhancement to obtain more diverse fine-grained features at different scales. The graph vertex enhanced module (GVEM) serves as a multi-level semantic feature connection between GCN and CNN, enabling the graph skeleton features to be better mapped to CNN for spatial-temporal feature extraction. The accuracy of 97.63% and 91.16% is achieved on the X-view of NTU-RGB+D 60 and X-set of NTU-RGB+D 120, indicating that the proposed method has superior performance.

Low Light Target Detection Algorithm Based on Feature Fusion Enhancement

CHEN Xiefa, LI Min, ZHAO Jingyu, HE Yujie, YANG Aitao

Computer Engineering and Applications 2025, 61 (22): 235-244. DOI: 10.3778/j.issn.1002-8331.2503-0108

A target detection algorithm that combines visible light and infrared images to enhance target features is proposed to address the insufficient features of targets in visible light images under adverse conditions such as low light. The algorithm uses YOLOv11n as the baseline and constructs a dual-branch multi-level feature extraction and fusion network to extract shallow and deep features from the visible light and infrared images of the target. Features at the same level are fused using a feature-level fusion method. Semantic and detail injection modules are introduced to integrate the features of visible light and infrared images at different scales, so that the target feature information in the two images is complementary and target detection performance under low light improves. The proposed algorithm is validated on the M3FD, LLVIP and FLIR aligned datasets. Compared with the baseline algorithm applied to visible light images and to infrared images, on the M3FD dataset mAP@0.5 improves by 10.2 and 12.4 percentage points, respectively, and mAP@0.5:0.95 improves by 7.1 and 6.0 percentage points, respectively; on the LLVIP dataset, mAP@0.5 improves by 4.8 and 0.3 percentage points, respectively, and mAP@0.5:0.95 improves by 12.8 and 1.8 percentage points, respectively; on the FLIR aligned dataset, mAP@0.5 improves by 11.3 and 3.5 percentage points, respectively, and mAP@0.5:0.95 improves by 8.2 and 2.0 percentage points, respectively. Compared with two other Transformer-based image fusion target detection algorithms, the algorithm achieves higher overall detection performance, which demonstrates that it is both advanced and effective.

Steel Surface Defect Detection Network Combining Element-Wise Multiplication Operators and Channel Pruning

YANG Chunlong, LYU Donghao, ZHANG Yong, TIAN Xu, WANG Chengzhi

Computer Engineering and Applications 2025, 61 (22): 245-256. DOI: 10.3778/j.issn.1002-8331.2408-0017

To address the challenge of real-time and high-precision defect detection on resource-constrained devices, a steel surface defect detection network is proposed which combines element-wise multiplication operators with channel pruning. To enhance the ability to capture defect characteristics, a feature space expansion module (FSEM) and an edge feature extraction module are designed, and a lightweight and efficient feature extraction network (LENet) is developed using a four-layer hierarchical architecture. To improve the effective fusion of multi-scale features, an adaptive multi-scale feature fusion network (AMFN) is constructed using the adaptive fusion (AW-Fusion) module based on channel-prior convolutional attention (CPCA) and FSEM within a feature pyramid architecture. To reduce network complexity and improve detection speed, channel pruning is employed for backend compression. Related experiments are conducted on the NEU-DET dataset to validate the effectiveness and superiority of the proposed network. Experimental results indicate that the pruned network achieves an accuracy of 78.1% and a speed of 179.8 FPS under low complexity, meeting practical application requirements.

Dynamic Context-Aware and Residual Attention Esophageal Cancer Lesion Segmentation

DING Nan, LI Xiaoxia, CAO Yaodan, MAO Yanhui, HE Qin, JIANG Kunyuan, CHENG Jie, ZHOU Yingyue

Computer Engineering and Applications 2025, 61 (22): 257-266. DOI: 10.3778/j.issn.1002-8331.2408-0166

A dynamic context-aware residual attention network is proposed to address small inter-class differences, large intra-class differences, and blurred edges in fine-grained segmentation of early esophageal cancer and precancerous lesions. The pyramid vision transformer v2 (PVTv2) is employed as the feature extraction network, forming the PVT branch to capture primary feature representations. A residual attention full convolution branch is designed by stacking residual blocks to enhance detailed features. A dynamic residual attention feature enhancement module is integrated into the encoder of this branch to reinforce key feature representations while preserving initial image information. During the decoding phase of the PVT branch, a dynamic context feature guidance module is designed to fuse local and global information using multi-scale features, enabling an adaptive progressive decoding process that retains details and enhances global context understanding. The validation on a self-built esophageal cancer dataset and public datasets Kvasir-SEG, CVC-ClinicDB, and ISIC2018 demonstrates Dice coefficients of 74.92%, 95.79%, 96.83% and 92.89%, respectively, outperforming mainstream segmentation networks.
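
The Dice coefficient used to report these results has a simple standard definition, 2|A∩B| / (|A| + |B|) for a predicted mask A and a ground-truth mask B. The sketch below applies it to flattened binary masks; the function name and the toy masks are illustrative assumptions, and returning 1.0 for two empty masks is a common convention rather than something the abstract specifies.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for 0/1 masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    size = sum(pred) + sum(target)
    # Convention (assumed): two empty masks count as a perfect match.
    return 1.0 if size == 0 else 2.0 * intersection / size


# Toy flattened masks (hypothetical): one overlapping pixel out of two each.
pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
print(dice_coefficient(pred, target))  # 0.5
```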

Affection Analysis of Social Media Avatars Enhanced by Imagery Knowledge

LIU Junling, AN Ning, SUN Huanliang, XU Jingke

Computer Engineering and Applications 2025, 61 (22): 267-277. DOI: 10.3778/j.issn.1002-8331.2408-0216

Images contain rich affective information, which can be obtained quickly and intuitively. As a special type of image, avatars have a strong correlation with user’s self-cognition. Users reflect their self-cognition through the imagery contained in their avatars. However, the existing image affection analysis work lacks consideration of imagery. Therefore, on the basis of VAD affection model, this paper expands the imagery affection dimension to represent the user’s self-cognition reflected by the imagery. To measure the user’s affection reflected by VAD affection and imagery affection, psychological energy measurement is introduced. This paper proposes an affections collaborative fusion for psychological energy prediction model to analyze the psychological energy of avatars. The model learns image features and imagery knowledge, and uses attention mechanism to learn the correlation between them, thereby analyzing the psychological energy of avatars. The model is tested on the real dataset, and the NDCG (normalized discounted cumulative gain) of the psychological energy dimension is 0.499, which is 5.50% better than other best performing baseline models, verifying the effectiveness of the proposed method.
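
NDCG, the metric reported above, compares the discounted cumulative gain of a predicted ranking with that of the ideal ranking. The following is a minimal sketch using the linear-gain form, DCG = Σ rel_i / log2(i + 1); an exponential-gain variant (2^rel − 1) also exists, and the abstract does not say which one the paper uses. Function names and the example relevance scores are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Relevance scores listed in the order the model ranked the items (toy data).
print(ndcg([3, 2, 1]))  # already ideally ordered, so 1.0
print(ndcg([1, 2, 3]))  # worst ordering of the same items, below 1.0
```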

Image Classification for Film Surface Defect Based on Contrastive Learning and Diffusion Model

DENG Haowen, WANG Hengsheng

Computer Engineering and Applications 2025, 61 (21): 242-252. DOI: 10.3778/j.issn.1002-8331.2407-0205

Surface defects arising in the manufacturing process of film materials usually fall into multiple categories. Small inter-class differences and imbalanced datasets are the main reasons for poor classification performance. This paper proposes a classification method based on contrastive learning and a diffusion model to address these issues. A diffusion model is trained on the image dataset, and noise features are obtained from the encoded outputs of the noise prediction network that is part of this diffusion model. In parallel, image features are extracted by a CNN model and fused with the noise features to obtain an enhanced representation of defect images (called the fused feature of the image) in the feature space. Label-embedding contrastive learning maps labels into the feature space to obtain prototype features, which are used to compute the contrastive loss with respect to the fused features during learning. Finally, the distribution of prototype features of different image categories is optimized in the feature space, which brings out the subtle differences among image classes. Experimental validation on classifying surface defects of lithium battery aluminum-plastic films achieves a maximum accuracy of 96.97%, surpassing current mainstream methods.

Pose-Guided Human Instance Segmentation Driven by Contour Prior

MA Junlong, ZHOU Jun, ZHAO Jinye, LI Yangyang

Computer Engineering and Applications 2025, 61 (21): 253-264. DOI: 10.3778/j.issn.1002-8331.2407-0334

In response to the challenges faced by person instance segmentation, such as the complexity and variability of background environments, occlusions and overlaps between individuals, as well as the inadequacy of traditional single-task person instance segmentation networks in integrating human body feature information, a method for instance segmentation that integrates prior human contour extraction and pose-guided strategies is proposed. A multi-task learning network architecture is constructed for this purpose. The multi-task network consists of three modules: prior processing module, human body pose estimation module, and pose-guided person instance segmentation module. The design of a portrait contour extraction network serves as a prior processing component to delineate the approximate outline of human figures, effectively mitigating background interference. For images with attached human contours, contour mapping is employed to thoroughly capture key point information of the human body, enriching structural cues during the segmentation process and enhancing the capability to handle occlusions and overlaps. The integration of prior semantic segmentation masks with instance segmentation masks generated through pose-guided methods aims to improve segmentation accuracy. Experimental results demonstrate that this method outperforms baseline methods in bottom-up multi-person human body pose estimation. Furthermore, experimental results on person instance segmentation tasks show an average precision improvement of 3.4% compared to baseline pose-guided instance segmentation networks.

Text-Attributes Select Visual Token for Generalized Zero-Shot Image Recognition

YAN Wenshang, ZHANG Guimei

Computer Engineering and Applications 2025, 61 (21): 265-275. DOI: 10.3778/j.issn.1002-8331.2407-0468

Current zero-shot learning methods struggle to effectively align semantic information with visual features, and the presence of redundant information within visual features leads to suboptimal accuracy in zero-shot and generalized zero-shot image recognition. To address this issue, this paper proposes a method in which text attributes select visual tokens for generalized zero-shot image recognition. Large language models are utilized to generate discriminative semantic information, namely text attributes. A class prior estimation module is introduced to compute the prior weight of each text attribute, enhancing its interpretability and optimizing model performance. The text attributes are used to select their corresponding visual features, effectively removing redundant information from the visual features. Under the guidance of the prior weights, the selected visual features are aligned with the text attributes in a cross-modal manner, enabling more precise and efficient visual-semantic interaction and thereby enhancing image recognition accuracy. Self-supervised generalized zero-shot image recognition experiments are conducted on three benchmark datasets (AWA2, CUB, SUN). The harmonic mean achieves state-of-the-art performance on AWA2 and SUN, surpassing the second-best result by 1.1 and 0.8 percentage points, respectively, and ranks second on the CUB dataset. The experimental results validate the efficacy of the proposed approach.
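
In generalized zero-shot recognition, the harmonic mean is conventionally computed from the accuracy on seen classes (S) and on unseen classes (U) as H = 2·S·U / (S + U), which rewards a balance between the two. A minimal sketch under that standard definition follows; the function name and the example accuracies are hypothetical and are not results from the paper.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """H = 2 * S * U / (S + U); defined as 0 when both accuracies are 0."""
    total = seen_acc + unseen_acc
    return 0.0 if total == 0 else 2.0 * seen_acc * unseen_acc / total


# Hypothetical accuracies in percent: strong on seen classes, weaker on unseen.
h = harmonic_mean(80.0, 60.0)
print(f"{h:.2f}")  # 68.57
```

Because H is dominated by the smaller of the two accuracies, a model that scores well only on seen classes gains little on this metric.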

Industrial Defect Detection Algorithm with Knowledge Distillation Integrating Multi-Scale Features

WANG Ling, WANG Minghui, WANG Peng, BAI Yan’e

Computer Engineering and Applications 2025, 61 (21): 276-286. DOI: 10.3778/j.issn.1002-8331.2407-0518

With the improvement of automation level in industrial production lines, enterprises have increasingly strict requirements for product quality, and defect detection has become an important task in assisting automated production. However, due to the complex structure of some industrial products, various unknown types of defects may occur during the production process, and it is difficult to obtain defect samples, which makes industrial defect detection still challenging. In order to improve the detection efficiency of industrial defects in structural types, a knowledge distillation industry defect detection algorithm MSFFKD_DD is proposed, which integrates multi-scale features. Firstly, the artificial synthesis anomaly module SAM_AB is proposed to generate pseudo defect sample images using normal sample images, simulate unknown types of defects, and introduce pseudo defect features in the knowledge distillation process to enhance the defect detection capability of the algorithm. Then, a feature fusion module MSFFM is designed to enhance the ability of the algorithm to extract product detail features by fusing shallow and deep features. At the same time, SSIM loss is introduced into the loss function to improve the segmentation accuracy of industrial product defects with complex structures. Experiments are conducted on the industrial dataset MVTec AD, and the image level AUROC, pixel level AUROC, and AUCPRO of the MSFFKD_DD algorithm reach 98.4%, 97.2%, and 95.1%, respectively, effectively improving the accuracy and segmentation precision of industrial defect detection for structural types.
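
For reference, the structural similarity index that an SSIM loss is usually built on can be written as follows. This is the standard definition, not necessarily the exact form used in MSFFKD_DD: the abstract does not state which maps are compared, the window size, or how the term is weighted. Here $x$ and $y$ denote the two maps being compared (an assumption), $\mu$ and $\sigma^2$ are local means and variances, $\sigma_{xy}$ is the local covariance, and $C_1$, $C_2$ are small stabilizing constants.

```latex
\mathrm{SSIM}(x, y) =
  \frac{(2\mu_x \mu_y + C_1)\,(2\sigma_{xy} + C_2)}
       {(\mu_x^2 + \mu_y^2 + C_1)\,(\sigma_x^2 + \sigma_y^2 + C_2)},
\qquad
\mathcal{L}_{\mathrm{SSIM}} = 1 - \mathrm{SSIM}(x, y)
```

Since SSIM equals 1 for identical inputs, the loss is zero at a perfect match and grows as structural similarity decreases.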

Multimodal Real-Time 3D Object Detection Based on Edge Differential Information Fusion

ZHANG Zhitian, ZHAO Hongdong, ZHANG Ke, CHEN Dan, LI Yanqi

Computer Engineering and Applications 2025, 61 (21): 287-296. DOI: 10.3778/j.issn.1002-8331.2407-0534

Multimodal 3D object detection makes full use of the geometric information of the point cloud and the semantic information of the image. To address problems in multimodal 3D object detection such as the inability to make full use of edge information, the difficulty of heterogeneous data fusion, and slow inference speed, an efficient multimodal 3D object detection algorithm based on edge differential information (multimodal real-time 3D object detection based on edge differential information fusion, EDMR-Net) is proposed. In the fusion stage, a differential feature enhancement fusion (DEF) module is proposed, which enhances the semantic expression of the point cloud by passing the differential information of the image through a diffusion function, making the heterogeneous data complementary, and precisely locates small objects using the rich edge information and the steady condition of the features. The multimodal features are further refined with multi-scale context information by the adaptive context awareness (ACA) network with adaptive weight assignment. To enhance the model's ability to capture detailed information, a multi-scale cross-axis attention mechanism is introduced into the shallow-layer features. Experimental results on the KITTI dataset show that the proposed method outperforms mainstream methods in terms of speed and accuracy and effectively addresses the inadequate utilization of edge information and slow multimodal inference. EDMR-Net greatly improves detection performance for difficult scenes while maintaining detection performance at the easy and moderate levels.

Industrial Surface Anomaly Detection Based on Reconstruction with Multiple Memory Enhancement Modules and Image Edge

YANG Yao, XU Xiangyun, ZHANG Linna, CHEN Jianqiang, CEN Yigang, HUANG Yansen

Computer Engineering and Applications 2025, 61 (20): 248-259. DOI: 10.3778/j.issn.1002-8331.2406-0293

Reconstruction-based anomaly detection in industrial images usually assumes that the model can reconstruct normal regions well but not abnormal regions. However, due to the over-generalization problem of deep neural networks, abnormal regions can also be reconstructed well, resulting in the leakage of abnormal regions. To solve this problem, this paper proposes an industrial surface anomaly detection model based on reconstruction with multiple memory enhancement modules and image edge (MMAERec). Specifically, multiple memory enhancement modules and image edge extraction modules are introduced on a UNet-type denoising autoencoder with skip connections. The memory features obtained from the multiple memory enhancement modules facilitate a good reconstruction of normal regions, while the extracted image edge features facilitate a good reconstruction of the image contour. These two kinds of features are fused by an attention mechanism and used for reconstruction, which improves the quality of the reconstructed image. The proposed method forces the network to learn normal low-frequency and high-frequency information, preventing the model from directly replicating abnormal regions and effectively alleviating the over-generalization problem. Experimental results on two industrial datasets, MVTec AD and BTAD, demonstrate the good detection and localization performance of the proposed method.

Generative Adversarial Double Decoupling Staged Shadow Removal Algorithm

ZENG Wenxian, ZHANG Manyu, SUN Lei

Computer Engineering and Applications 2025, 61 (20): 260-269. DOI: 10.3778/j.issn.1002-8331.2406-0307

To address the artifacts at shadow edges and the inconsistent color recovery between shadow and non-shadow regions that are common in current shadow removal algorithms, this paper proposes a generative adversarial double decoupling staged shadow removal algorithm. The algorithm builds a generative adversarial network as its main framework, adopts different image features in stages instead of the traditional single color image as the supervisory signal, and decouples the image in both the channel and spatial dimensions. Firstly, the image is decoupled into a luminance channel (L) and color channels (AB) in the channel dimension. For the luminance channel, a multi-scale attention-enhanced dilated network is proposed, which captures multi-scale contextual information through dilated convolutions with stepwise increasing dilation rates and accurately extracts local features by combining them with an attention mechanism, effectively restoring the brightness of shadow regions and avoiding edge artifacts. For the color channels, a color adjustment network combining channel and spatial attention is designed to focus on color correction. After luminance recovery and color adjustment, the L and AB channels are concatenated again and used as input to the second stage. Secondly, the image is decoupled into shadow and non-shadow regions in the spatial dimension, and a dynamic residual network is designed to extract features using standard convolution and depthwise separable convolution, respectively, to further reduce edge artifacts and color inconsistency. Experimental results show that the proposed algorithm performs well compared with other state-of-the-art algorithms and has significant advantages in avoiding edge artifacts, eliminating color inconsistency, and improving image quality.

Cigarette Butt Detection Algorithm Based on Small Target Occlusion-Aware

TIAN Yanping, JIN Miao, CHEN Xiwen, ZHANG Jun, LIU Li, YU Feng, JIANG Minghua

Computer Engineering and Applications 2025, 61 (20): 270-280. DOI: 10.3778/j.issn.1002-8331.2407-0111

In complex power operation scenarios, cigarette butts are typically small targets and are easily occluded by other objects. Existing small object detection methods face limitations in multi-scale feature fusion and occlusion awareness, and fail to fully address the specific needs of power scenarios. To address these challenges, the paper proposes the CBD-STOA algorithm for cigarette butt detection with occlusion awareness in power scenes. Firstly, a multi-scale sequence feature fusion module (SSFF) is introduced into the neck network to improve detection accuracy for small objects. Secondly, a triple feature fusion module (TFF) is designed to reduce false positive rates in dense scenes by enhancing feature details through multi-scale feature fusion. Finally, an occlusion-aware detection head (OADHead) is developed to improve detection under partial occlusion by leveraging multi-scale spatial propagation and contextual information. Experiments on the custom ElectricSmoke dataset show that CBD-STOA improves mAP50 and mAP50-95 by 2.0 and 4.1 percentage points, respectively, over the original YOLOv8n algorithm, and it also performs strongly on the TinyPerson dataset.

Multi-Scale Dual Cross Attention Transformer Network for Change Detection in Remote Sensing Images

DENG Wenhao, DUAN Zhongxing

Computer Engineering and Applications 2025, 61 (20): 281-294. DOI: 10.3778/j.issn.1002-8331.2407-0148

Existing deep learning-based methods tend to focus on extracting high-level change semantic features, making it challenging to capture changes in ground object details and resulting in fuzzy boundaries and vulnerability to pseudo changes. Meanwhile, the skip connection in the traditional U-shaped architecture struggles to narrow the semantic gap between the encoder and decoder. To solve these problems, a multi-scale dual cross attention transformer network (MDCATNet) is proposed for remote sensing image change detection. In the encoder part, MDCATNet utilizes a primary feature conservation strategy and convolutional blocks with residual structures to construct a Siamese network with shared weights that extracts multi-scale features of the dual-temporal images. In the decoder part, in order to narrow the semantic gap between the encoder and the decoder, and to fully integrate the remote channel and spatial information of the multi-scale features, a novel multi-scale multi-head channel-spatial cross fusion Transformer module is proposed as an alternative to the traditional skip connection. In order to further refine the features and obtain more detailed change regions and smoother boundary contours, a channel cross attention refinement module is proposed for refining the features layer by layer from bottom to top and generating high-quality prediction maps. Experiments on the LEVIR-CD and SYSU-CD datasets show that, compared with six other algorithms, MDCATNet achieves the best detection results in both quantitative evaluation and visualization, and has stronger generalization ability.

Dual-Branch Network for Remote Sensing Image Semantic Segmentation with Dynamic Weights and Semantic Filtering

FANG Liang, XIE Gang, XU Xinying, QU Tiezuo, XIE Xinlin

Computer Engineering and Applications 2025, 61 (20): 295-305. DOI: 10.3778/j.issn.1002-8331.2409-0070

A dual-branch semantic segmentation network for remote sensing imagery is proposed to address the issues of fragmentation in large target segmentation and misclassification caused by intra-class feature heterogeneity. The network merges global and local modeling capabilities of Transformers and CNN through layer-by-layer interaction. Initially, a dynamic perception fusion module based on a dynamic weight adjustment mechanism is constructed. This module aggregates the advantages of global and local modeling from both branches, alleviating misclassification issues caused by inconsistencies in target class features. Subsequently, a semantic filtering attention mechanism is introduced in the cross-feature selection module. This mechanism enhances the model’s capability for continuous segmentation of large targets by mining important semantic information and filtering out ineffective features. Lastly, a shallow space-semantic information coupling module is designed, featuring a dual-path coupling attention structure that reduces the disparity between deep semantic information and shallow spatial details in skip connections. Furthermore, a high-resolution satellite remote sensing dataset of the Taiyuan urban area in Shanxi is constructed, and segmentation experiments on nine types of ground objects are conducted. The proposed algorithm outperforms comparative algorithms in experiments conducted on the custom dataset and the public ISPRS dataset, effectively enhancing continuous segmentation capabilities for large targets while mitigating misclassification issues within the same type of targets.

DFC-YOLO: Multi-Scale and Similarity Defect Target Detection Method for Metal Surfaces

WANG Kun, LI Jinhua

Computer Engineering and Applications 2025, 61 (19): 167-178. DOI: 10.3778/j.issn.1002-8331.2502-0209

To reduce the problems of false positives and missed detections in metal surface defect detection task and enhance the model’s detection capability for multi-scale and similar defects, the DFC-YOLO detection approach is proposed. Firstly, the diversity feature extraction module (DFEM) is designed to extract more comprehensive features from the multi-scale information processed by SPPF, improving the detection accuracy of targets at different scales. Secondly, the feature processing module (FPM) is designed to integrate shallow and deep feature information, while the shallow feature extraction module (SFEM) is used to fully extract key information from the shallow layers, reducing interference from background noise and improving the model’s ability to recognize similar targets. Finally, in order to further enhance the model’s ability to extract and integrate multi-scale features without significantly increasing memory and computational costs, the C2f_RFEM module is proposed. This module uses the receptive field expansion module (RFEM) to enlarge the model’s receptive field, obtain more contextual information, and improve detection performance. The experimental results show that DFC-YOLO achieves a mAP of 77.60% on the GC10-DET dataset, which is 3.60 percentage points higher than the original network. On the NEU-DET dataset, the mAP increases by 2.55 percentage points. The experimental results validate the feasibility and generalization of the proposed method and demonstrate its effective application to metal surface defect detection.

Detection Algorithm Based on Hierarchical Residuals and Bidirectional Feature Fusion Mechanism

LENG Qiangkui, LU Jianxu, MENG Xiangfu

Computer Engineering and Applications 2025, 61 (19): 179-189. DOI: 10.3778/j.issn.1002-8331.2409-0219

Although the existing YOLO series of object detection algorithms demonstrate excellent speed and real-time performance, they still have shortcomings in handling multi-scale objects and preserving boundary details. To address these issues, an improved object detection algorithm based on YOLOv8, named Res-YOLO, is proposed. Res-YOLO consists of three core modules: the Res-SPPF for feature enhancement, the RSBA for bidirectional feature fusion, and the C2f_ODC for dynamic feature selection. Specifically, the Res-SPPF utilizes hierarchical residual connections and a multi-head attention mechanism to enhance the model’s multi-scale feature representation capability; the RSBA employs an adaptive deep-shallow level feature fusion mechanism to retain boundary details and semantic information; the C2f_ODC filters unnecessary features progressively through incremental learning, thereby reducing model complexity. Additionally, a linear deformable convolution (LDConv) is introduced to handle objects with complex boundaries and irregular shapes. Experimental results on the MS COCO 2017 dataset show that Res-YOLO improves mAP by 2.9 percentage points over the original algorithm, while its GFLOPs are 94% of those of the original algorithm. Comparative experiments with other state-of-the-art detection algorithms further validate the effectiveness and competitiveness of Res-YOLO.

Wafer Defect Detection Algorithm Incorporating Inverted Residuals and Expansion Reparameterization

WANG Quan, WANG Mengnan, SUN Jiadong, CHEN Deji, XIAO Shang

Computer Engineering and Applications 2025, 61 (19): 190-201. DOI: 10.3778/j.issn.1002-8331.2412-0032

Current wafer defect detection algorithms struggle to balance detection accuracy, the number of model parameters, and computational volume. To address this, a YOLOv8-based lightweight wafer defect detection algorithm (YOLOv8_LDW) is proposed. First, by fusing the inverted residual mechanism and the dilation reparameterization module, the C2f_IDR module is designed and introduced into the backbone network, which enhances the model’s ability to jointly model the global context information and local detail features of complex defects, while improving inference efficiency. Secondly, the high-level screening path aggregation network (HSPAN) is proposed for the first time. The neck network is reconstructed through a bidirectional screening and fusion mechanism, which achieves efficient aggregation of multi-scale features and effectively suppresses the interference of redundant features. Finally, in order to further improve the model’s attention to tiny defects and the regression accuracy of complex shape defects, the Focaler-Shape IoU loss function is used to replace the traditional CIoU loss function. Experimental results show that the F1 Score and mAP50 of the improved model on the real wafer defect dataset reach 97.2% and 98.3%, respectively, which are improvements of 1.4% and 0.8% compared with the baseline model. The number of parameters and computational volume are reduced by 42.5% and 22.2%, respectively, and the model size is only 3.69 MB. In addition, the improved model is validated on the public wafer defect dataset, where Recall, F1 Score, and mAP50 are improved by 7.2%, 1.8% and 2.0%, respectively, compared to the original model. These results demonstrate strong generalization ability and robustness, effectively adapting to the data distribution of different defect types. The improved algorithm thus significantly reduces the number of model parameters and computational costs while maintaining high detection accuracy, meeting the practical application requirements for high efficiency and lightweight design in wafer defect detection.

Fusion of Island Bi-Directional Feature Pyramid for Remote Sensing Image Object Detection

LIANG Liming, FENG Yao, LONG Pengwei, WANG Zexin

Computer Engineering and Applications 2025, 61 (19): 202-213. DOI: 10.3778/j.issn.1002-8331.2406-0063

Aiming at the problems of object detection in remote sensing images, such as complex background interference and multi-scale differences of targets, a remote sensing image object detection algorithm that integrates an island bi-directional feature pyramid network, referred to as IFD-YOLOv8s, is proposed. Firstly, an island bi-directional feature pyramid network is designed to enhance the adaptability of the model to target scale changes, reduce the information loss in the process of multilevel feature fusion, and contribute to the efficient propagation of deep semantic and fine-grained information. Then, a feature context incremental module is proposed to capture target features more comprehensively and improve the model's detection capability. Finally, a dual path pooling attention module is designed to suppress interference from non-target noise and enhance the discriminability of remote sensing target features. Ablation and comparison experiments are conducted on the public datasets RSOD and NWPU VHR-10, where the mean average precision values are 98.2% and 91.4%, respectively, which are improved by 1.8 and 2.1 percentage points compared with the baseline algorithm YOLOv8s. Compared with mainstream object detection algorithms, IFD-YOLOv8s is more effective in detecting complex background targets and multi-scale targets. Generalization experiments on the public dataset DOTA show a mean average precision of 78.7%, which is an improvement of 1.8 percentage points over the original model.

Weakly-Supervised Semantic Segmentation Method with Saliency Boundary Constraints

BAI Xuefei, ZHANG Lina, WANG Wenjian

Computer Engineering and Applications 2025, 61 (19): 214-225. DOI: 10.3778/j.issn.1002-8331.2406-0116

In order to solve the problems of insufficient class activation and unclear pseudo-label boundaries in the existing weakly-supervised semantic segmentation methods, a weakly-supervised semantic segmentation method with saliency boundary constraints is proposed. A twin network with shared parameters is used as the class activation map generation network, and the images before and after the affine transformation are used as the inputs of two branches of the twin network. After obtaining different class activation maps, the complementary information is fused by consistency loss function to generate a more complete class activation map. The saliency correction module is designed, and boundary constraints are introduced into the class activation map to suppress the wrong activation of background information. At the same time, the saliency affinity module is designed to learn the affinity matrix between pixels from the saliency map, which further refines the initial pseudo-labels and improves the semantic segmentation performance of the model. The experimental results show that the mIoU value of this method is 71.4% on PASCAL VOC 2012 validation set, and the performance is improved by 2.1 percentage points compared with the baseline, and the mIoU value on the test set is 70.8%. The mIoU value on the COCO 2014 validation set is 39.2%, which shows a good segmentation result, and the method can better complete the task of weakly-supervised semantic segmentation.
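
For readers unfamiliar with the mIoU metric quoted above, the following minimal sketch shows how it is commonly computed from per-pixel labels. It is not the authors' evaluation code: the function name `mean_iou` and the toy label lists are assumptions, and benchmark toolkits usually accumulate a confusion matrix over the whole dataset rather than a single flat list.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection-over-union across classes for flat lists of integer labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union > 0:  # skip classes absent from both ground truth and prediction
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six "pixels" (hypothetical data).
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(truth, pred, num_classes=3)
print(round(score, 3))
```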

Fine-Grained Class-Level Sketch-Based 3D Shape Retrieval Dataset

ZHENG Hu, BAI Jing, YAN Hao, SU Yawen

Computer Engineering and Applications 2025, 61 (19): 226-236. DOI: 10.3778/j.issn.1002-8331.2406-0143

Recently, fine-grained 3D shape classification has received growing attention in the community of computer graphics and computer vision. The paper constructs a sketch-based fine-grained class-level 3D shape retrieval dataset, named FGCL-SBSR, to support the use of sketches to retrieve 3D shapes of specific subclasses under a metaclass. The 3D shapes in this dataset are sourced from the FG3D dataset, while the sketches are collected through recruiting volunteers to draw them. After the sketches are collected, they are filtered and the dataset is accordingly divided rationally. Specifically, the Airplane sub-dataset in FGCL-SBSR contains 12 classes, including 1 388 3D shapes and 1 286 sketches, and the Chair sub-dataset contains 25 classes, including 2 321 3D shapes and 2 102 sketches. The dataset establishes a correspondence between sketches and 3D shapes in terms of subcategories and maintains the abstraction, sparseness, diversity and representativeness of the sketches. Finally, the experimental results of training models with different training sets and different methods on this paper’s dataset prove that this paper’s dataset can be well suited for the task of sketch-based fine-grained class-level 3D shape retrieval, fully demonstrating its necessity, consistency and effectiveness.

Adaptive Attention-Guided Conditional Diffusion Model for LDCT Image Denoising

WANG Shaoqi, JIANG Ailian, MA Jianfen

Computer Engineering and Applications 2025, 61 (19): 237-248. DOI: 10.3778/j.issn.1002-8331.2406-0150

Diffusion models have demonstrated higher image quality and a more stable training process compared to generative adversarial networks (GANs) and convolutional neural networks (CNNs) in the task of image generation. However, the diversity of their generation results can lead to distortions of details in low-dose computed tomography (LDCT) images. While classifier guidance (CG) and classifier-free guidance (CFG) control the diversity, they introduce additional training requirements and dependencies on external conditions. This study proposes an adaptive attention-guided conditional diffusion model (AACD) to ensure consistency of generated images with reduced extra training and lower dependence on external conditions. To further mitigate image detail distortion, this paper designs a multi-scale context-aware network (MSCAN) based on the downsampling layers of U-Net. MSCAN effectively enhances image details through downsampling processing, focusing on and fusing multi-scale information. Experimental results demonstrate that MSCAN and AACD exhibit strong competitiveness in terms of peak signal-to-noise ratio (PSNR), structural similarity (SSIM) and root mean square error (RMSE).
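
Two of the metrics named above, RMSE and PSNR, follow directly from their textbook definitions and can be sketched as below. This is only an illustration under stated assumptions: images are flattened to lists of pixel values, `data_range=255.0` is an assumed dynamic range (CT studies often use a different intensity window), and SSIM is omitted because it needs windowed statistics.

```python
import math


def rmse(reference, estimate):
    """Root mean square error between two equally sized pixel lists."""
    n = len(reference)
    return math.sqrt(sum((r - e) ** 2 for r, e in zip(reference, estimate)) / n)


def psnr(reference, estimate, data_range=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    err = rmse(reference, estimate)
    if err == 0:
        return math.inf
    return 20.0 * math.log10(data_range / err)


# Hypothetical 4-pixel "images": every pixel is off by exactly one grey level.
clean = [100, 120, 140, 160]
denoised = [101, 119, 141, 159]
print(rmse(clean, denoised), round(psnr(clean, denoised), 2))
```

Lower RMSE and higher PSNR both indicate a denoised image closer to the reference.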

Tunnel Bolt Rust Detection Combining Local Enhancement and Improvement of YOLOv8

WU Xiaochun, LI Luyu

Computer Engineering and Applications 2025, 61 (19): 249-259. DOI: 10.3778/j.issn.1002-8331.2406-0288

A tunnel bolt corrosion detection model based on a local enhancement algorithm (LEA) and improved YOLOv8 (YOLOv8s+LEA+MSSf+FL, YOLO-LMF) is proposed to address the low accuracy and high missed-detection rate of manual bolt inspection caused by insufficient lighting in the maintenance environment of subway tunnels, transforming manual maintenance into intelligent detection to improve maintenance efficiency. Firstly, a local enhancement algorithm with neighbor check is used to enhance the bolt corrosion location, enabling the model to better identify corrosion features. Secondly, multi-scale channel group shuffle convolution (MSCGSC) is proposed. Integrating MSCGSC into the C2f (cross stage partial network fusion) module of YOLOv8 yields a new module, MSSf (multi-scale shuffle fusion), which enables the model to better learn the different behaviors of corroded bolts and stains near the bolts, and improves the detection accuracy of the model. Finally, considering that difficult samples among corroded bolts limit detection accuracy and that the bolt samples are imbalanced, focal loss (FL) is introduced to reduce the weight of the large number of easy samples in training, allowing the model to concentrate on learning difficult samples for classification. The results show that the proposed model has increased by 0.032, 0.05, 0.011, and 0.003 respectively compared to the original model, and the number of parameters has decreased by 10.4%. The model performs better on the bolt dataset of subway tunnels, providing a reference for the development of inspection robots for subway tunnel maintenance operations, reducing the workload of tunnel maintenance workers, and improving work efficiency.
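
The way focal loss shifts training weight toward difficult samples can be seen in a small numeric sketch. It uses the standard binary form FL = -α_t (1 - p_t)^γ log(p_t); the values `gamma=2.0` and `alpha=0.25` are the commonly cited defaults rather than settings from this paper, and the function name `binary_focal_loss` is an assumption.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one sample; p is the predicted positive-class probability, y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, 1e-12), 1.0)             # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy positive (confident and correct) versus a hard positive (mostly wrong).
easy = binary_focal_loss(0.95, 1)
hard = binary_focal_loss(0.30, 1)
print(easy, hard)
```

The factor (1 - p_t)^γ shrinks the loss of well-classified samples far more than that of hard ones, so the many easy samples contribute little to the gradient.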

Coherent Semantic-Driven Approach for Thick Cloud Removal in Optical Remote Sensing Images

CHU Yuting, LUO Xiaobo, ZHOU Jianjun, GOU Yongcheng, GUO Haihong

Computer Engineering and Applications 2025, 61 (18): 187-197. DOI: 10.3778/j.issn.1002-8331.2406-0114

Thick cloud cover significantly impacts the quality of optical remote sensing images, limiting their practical applications. Deep learning methods have shown promise in addressing the challenging task of thick cloud removal. However, existing approaches often suffer from issues such as blurry textures and distorted structures due to their disregard for semantic correlations and feature continuity within cloud-covered areas. To tackle these challenges, a novel coherent semantic-based two-stage generative adversarial network method for cloud removal (CSTGAN-CR) is proposed. This method effectively models the semantic correlations between cloud-covered and cloud-free regions, as well as within the cloud-covered areas, preserving contextual structures and improving the accuracy of missing part prediction. The CSTGAN-CR utilizes a two-stage deep neural network with a coherent semantic module and a multi-scale feature aggregation module embedded in the second stage. Experimental evaluations on the 38-cloud synthetic dataset and the RICE2 real dataset demonstrate that the proposed method generates higher-quality images compared to existing approaches, offering significant support for optical remote sensing image applications.

Crack-YOLOv7: Road Crack Detection Based on Deep Feature Extraction and Multi-Scale Information Fusion

ZHANG Yongqi, WANG Jie, DENG Bin, ZHOU Yuhao, YANG Junni

Computer Engineering and Applications 2025, 61 (18): 198-208. DOI: 10.3778/j.issn.1002-8331.2412-0120

Existing road crack detection methods usually rely on local features, which provides insufficient structural information and contextual relevance for the target and thus reduces detection accuracy. To solve this problem, Crack-YOLOv7, a pavement crack detection method based on deep feature extraction and multi-scale information fusion, is proposed. Firstly, the PSA (pyramid split attention) module is introduced into the backbone network to enhance the context information and location awareness of the feature map and obtain richer feature information. At the same time, the SSPPF (spatial stage pyramid pooling fast) module is designed to improve the inference speed of the network and effectively enhance the transmission of feedforward information. Secondly, the S2DT-FPN (spatial-shift dilated transformer feature pyramid network) structure is proposed. Through multi-scale feature fusion and cross-layer dependency establishment, the feature information of different semantic depths is further captured, while the global context features are retained. Finally, because road cracks are diverse in morphology and often overlap, the soft non-maximum suppression (Soft-NMS) algorithm is used to improve the detection accuracy in dense crack scenarios. The experimental results on the RDD2020 dataset show that the proposed method can effectively detect pavement cracks in images of damaged roads. The detection accuracy reaches 89.7%, and the mean average precision (mAP) value reaches 65.5%.
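
Soft-NMS keeps overlapping detections but lowers their scores instead of deleting them, which is why it helps when cracks overlap. The sketch below implements the Gaussian-decay variant; which variant the paper uses is not stated, so the decay rule, `sigma=0.5`, `score_thresh=0.001`, and the toy boxes are all assumptions.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS; returns (box_index, final_score) pairs in selection order."""
    candidates = [(s, i) for i, s in enumerate(scores)]
    kept = []
    while candidates:
        best = max(candidates)  # highest remaining score
        candidates.remove(best)
        best_score, best_idx = best
        kept.append((best_idx, best_score))
        rescored = []
        for s, i in candidates:
            overlap = iou(boxes[best_idx], boxes[i])
            s *= math.exp(-(overlap ** 2) / sigma)  # decay by overlap, do not discard
            if s >= score_thresh:
                rescored.append((s, i))
        candidates = rescored
    return kept


# Two heavily overlapping boxes and one isolated box (hypothetical detections).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
print(kept)
```

Hard NMS would typically drop the second box outright at this overlap; here it survives with a reduced score and is ranked below the isolated box.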

Underwater Object Detection Algorithm with Anti-Aliasing and Multi-Scale Feature Fusion

WANG Shupeng, LI Fan

Computer Engineering and Applications 2025, 61 (18): 209-217. DOI: 10.3778/j.issn.1002-8331.2406-0282

To address the challenges of multi-scale object detection in complex underwater environments, an improved algorithm, WPS-YOLOv8, is proposed. The wavelet pooling convolution (WPConv) module is designed, which reduces the resolution of feature maps after channel compression through wavelet pooling technology. This effectively suppresses frequency aliasing artifacts caused by traditional downsampling, improving both feature extraction quality and expressiveness. The partial pointwise group shuffle convolution (PGConv) module is introduced. By combining partial convolution with pointwise convolution, this module reduces information redundancy while maintaining information exchange between channels, addressing the limitations of depthwise separable convolution and enhancing feature fusion. The ShapeLoss loss function is proposed, which comprehensively considers factors affecting the accuracy of multi-scale object detection. By integrating Shape-IoU and Shape-NWD loss measures, it effectively improves overall detection accuracy for multi-scale objects. Experimental results show that, compared to YOLOv8, WPS-YOLOv8 achieves a mean average precision (mAP) improvement of 8.6 and 4.4 percentage points on the URPC2018 and UTDAC2020 underwater datasets, respectively, demonstrating its outstanding performance in underwater multi-scale object detection.
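
As a rough illustration of wavelet-based downsampling, the sketch below halves the resolution of a single-channel feature map by keeping the low-frequency (LL) subband of a one-level 2D Haar transform. The abstract does not say which wavelet or which subbands WPConv uses, so the Haar basis, the function name `haar_ll_pool`, and the toy feature map are assumptions, not the paper's module.

```python
def haar_ll_pool(feature):
    """Return the LL subband of a one-level orthonormal 2D Haar transform (half resolution)."""
    rows, cols = len(feature), len(feature[0])
    pooled = []
    for i in range(0, rows - 1, 2):
        out_row = []
        for j in range(0, cols - 1, 2):
            # Low-pass in both directions: sum of the 2x2 block scaled by 1/2.
            block_sum = (feature[i][j] + feature[i][j + 1]
                         + feature[i + 1][j] + feature[i + 1][j + 1])
            out_row.append(block_sum / 2.0)
        pooled.append(out_row)
    return pooled


# Hypothetical 4x4 feature map made of four constant 2x2 blocks.
feature = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 4, 4],
    [3, 3, 4, 4],
]
pooled = haar_ll_pool(feature)
print(pooled)
```

Because the low-pass filtering happens before the resolution is reduced, high-frequency content is attenuated rather than folded back into the output, which is the general argument for wavelet pooling over strided sampling.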

Geometric Transformation Combined with Image Enhancement for Small Target Detection in Aerial Images

QI Xiangming, LI Xiaolong

Computer Engineering and Applications 2025, 61 (18): 218-230. DOI: 10.3778/j.issn.1002-8331.2412-0294

Aerial image small target detection is complex, leading to a decline in detection metrics. An algorithm combining geometric transformations and image enhancement is proposed. Using YOLOv8n as the baseline, DCNv2-CA-GEO extracts spatial and channel features in parallel, dynamically adjusting the convolutional and pooling kernels to quickly adapt to geometric changes. SPD-OK-CSP adjusts channel dimensions, capturing fine-grained features and enhancing image quality. Dysample optimizes upsampling, while Dyhead improves detection head performance. The Inner-Wise-MPD-IoU strategy balances sample features and optimizes generalization. Evaluated by mAP@0.5, mAP@0.5:0.95, Precision, and Recall, experiments on VisDrone2021 show improvements of 6.1 percentage points in mAP@0.5, 4.4 percentage points in mAP@0.5:0.95, 6.0 percentage points in Precision, and 5.0 percentage points in Recall. On LEVIR-Ship, the improvements are 3.4 percentage points in mAP@0.5, 2.2 percentage points in mAP@0.5:0.95, 3.1 percentage points in Precision, and 5.3 percentage points in Recall. Generalization tests on VOC2007+2012 demonstrate enhancements of 2.0 percentage points in mAP@0.5, 4.3 percentage points in mAP@0.5:0.95, 1.5 percentage points in Precision, and 3.1 percentage points in Recall, indicating good robustness.

Focus Meta R-CNN: Few-Shot Object Detection Algorithm for Underwater Debris

WANG Kun, SHAO Chongzhou

Computer Engineering and Applications 2025, 61 (18): 231-240. DOI: 10.3778/j.issn.1002-8331.2404-0472

The issue of underwater litter and its associated hazards has attracted global attention, while the advancement of underwater robotics and object detection technology offers potential for the automated management of underwater debris. However, because underwater data are costly and difficult to collect, applying deep learning methods to these tasks often results in a few-shot training environment, leading to a high risk of model overfitting. At the same time, the specificity of the underwater environment means that generic object detection algorithms are not well suited and require targeted improvements. Given these two challenges, an underwater debris object detection algorithm suitable for few-shot environments is proposed. Firstly, underwater data have a singular foreground and a large amount of redundant noise in the background. To effectively preserve valuable information, features are extracted from the focus part of the support set, enabling the model to focus more on the object itself while retaining appropriate contextual information. Secondly, to augment the model’s ability to extract information from the support set and maintain its generalization, a noise generator is introduced to send random perturbations to the focus region. Finally, considering that the support set and query set come from the same sampling domain, a joint meta-loss is proposed to make the model aware of this commonality, thereby enriching the information provided by the support set. Additionally, a diverse and contextually relevant underwater debris dataset is created, aligning more closely with real-world detection scenarios. The proposed approach achieves a precision of 16.9% under the one-shot condition on this dataset, marking a 4.5 percentage point improvement over baseline models. Moreover, an increase of over 10 percentage points on the general dataset PASCAL VOC validates its generalizability.

Three-Dimensional High-Frequency Surface Reconstruction Under Spatial Frequency Domain Feature Encoding

WEI Dong, ZHANG Jingtian, BAI Yifan, SUN He

Computer Engineering and Applications 2025, 61 (18): 241-251. DOI: 10.3778/j.issn.1002-8331.2406-0106

Aiming at the deficiencies of existing neural implicit surface reconstruction algorithms in capturing high-frequency details on complex object surfaces, representing 3D textures, and reconstruction accuracy, this paper proposes a spatial frequency domain feature encoding network specifically designed for extracting high-frequency details. This network utilizes spatial feature triplanes to effectively model the interdependencies among 3D points and employs a high-frequency feature encoding module based on the fast Fourier transform (FFT) to enhance deep frequency variations during feature extraction, resulting in object surfaces with richer high-frequency geometric details. To maintain stability during the reconstruction process, a lightweight MLP is constructed to improve the fidelity of surface reconstruction and suppress noise generated during spatial frequency domain encoding. The proposed method is tested on six scenes from the DTU and NeRF Synthetic 360° datasets. This paper compares it against other algorithms using chamfer distance (CD) and peak signal-to-noise ratio (PSNR) metrics, and quantitatively evaluates the foreground details of complex objects. Experimental results demonstrate that the introduction of the high-frequency feature encoding module and the lightweight MLP decoder significantly enhances the reconstruction accuracy of complex objects, with the reconstructed 3D geometry and texture surfaces recovering finer geometric details.
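
The chamfer distance (CD) used for the geometric comparison can be sketched as follows. Conventions differ between benchmarks (squared versus plain distances, sum versus mean of the two directions), and the abstract does not specify one, so this version, the function name `chamfer_distance`, and the toy point sets are assumptions for illustration.

```python
import math


def chamfer_distance(a, b):
    """Symmetric chamfer distance: mean nearest-neighbour distance a->b plus b->a."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)


# Hypothetical reconstructed and ground-truth point sets (3D coordinates).
reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
cd = chamfer_distance(reconstructed, ground_truth)
print(cd)
```

This brute-force version compares every pair of points; evaluation code for dense meshes normally uses a spatial index such as a k-d tree for the nearest-neighbour search.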

Knowledge-Guided Graph Conjoint Reasoning Object Detection Method

XIE Binhong, WANG Wenbo, ZHANG Rui

Computer Engineering and Applications 2025, 61 (18): 252-262. DOI: 10.3778/j.issn.1002-8331.2406-0113

Mainstream object detection methods typically handle each region in isolation, neglecting crucial global context information and inter-object category relationships. To this end, this paper proposes a knowledge-guided graph conjoint reasoning object detection method (GCRKG), which includes the global relational reasoning (GRR) module and the global knowledge mapping (GKM) module. This method aims to enhance detection performance by emulating the human reasoning process. Firstly, the GRR module employs graph conjoint attention networks (GCAT) to perform category relationship reasoning by comprehensively considering the relative importance of features, co-occurrence, and semantic relevance knowledge among categories. Secondly, the GKM module utilizes multi-label image classification probabilities and object detection classifier category probabilities to effectively map category relationship knowledge onto visual regions. Finally, the mapped features are concatenated with the original visual region features to enhance the prediction of more reasonable results. Comparative results with baseline models on the VOC and COCO datasets demonstrate the effectiveness and superiority of this method.


Improved Transformer Industrial Image Classification Algorithm Fusing Local and Global Feature

WANG Ling, CUI Zhiyu, HUANG Jing, WANG Peng, BAI Yan'e

Computer Engineering and Applications 2025, 61 (18): 263-272. DOI: 10.3778/j.issn.1002-8331.2407-0131


Because industrial images are characterized by limited data acquisition, complex environments, and variable lighting conditions, the classification accuracy of the ViT model on them remains suboptimal. To address this issue, an industrial image classification algorithm based on the CMT model is proposed. First, the Patch Embedding module is enhanced by incorporating affine transformations and sequential convolutional blocks, improving the generalization capability of the model on small datasets. Subsequently, the CMT Block is refined by introducing a parallel local feature extraction module, which enhances the model's ability to capture local features. The multi-head self-attention mechanism is replaced with a token interaction attention mechanism to improve the model's global feature representation. Deep convolution and channel attention are then integrated into the feedforward neural network, enabling the model to effectively capture neighboring features. Finally, a feature fusion module is proposed to integrate local and global features, enriching the feature representation and enhancing classification performance on small datasets. Experimental results on the self-made filling bucket dataset, the public Car Parts dataset, and the Tiny ImageNet dataset demonstrate that, relative to the CMT model, the improved CMT model raises Top-1 accuracy by 4.7, 6.9, and 5.2 percentage points and Macro F1 by 0.057, 0.071, and 0.048, respectively.
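The two reported metrics, Top-1 accuracy and Macro F1, can be computed as in the following minimal sketch. The function names are hypothetical and this is not the authors' evaluation code. It assumes single-label classification with integer class labels, and it assigns an F1 of 0 to any class that has no true or predicted samples.

```python
def top1_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Because Macro F1 weights every class equally, it is more sensitive than Top-1 accuracy to errors on rare classes, which is one reason to report both on small, imbalanced datasets.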


Textual Modality-Assisted RGB Salient Object Detection

HAN Chunyu, MA Jun, SHA Honghan, XIAO Xin, LU Chenkai, YAN Xin, ZHANG Xia

Computer Engineering and Applications 2025, 61 (17): 259-271. DOI: 10.3778/j.issn.1002-8331.2403-0179


Salient object detection is the process of identifying the most visually prominent objects in images or videos. To address the performance degradation in cluttered scenes with low foreground-background contrast, this paper proposes a salient object detection model assisted by text modality information. The proposed model enhances target representation by integrating RGB features with image captions generated by an image description network, capturing the semantic information of most objects in the scene and thereby suppressing background noise. A cross-modality guidance fusion module is introduced, which merges the text and RGB modalities through self-interaction and mutual interaction. To address the tendency of global attention mechanisms to overlook detailed information, a hybrid attention module is proposed that models contextual information at both global and local levels to further improve prediction accuracy. The effectiveness of the proposed model has been verified experimentally using standard evaluation metrics such as mean absolute error (MAE), structural similarity (S), and weighted F-score (FW).
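Of the metrics listed, mean absolute error is the simplest to state. The sketch below is a hypothetical helper, not the paper's code. It assumes the predicted saliency map and the ground-truth mask are equally sized 2D lists with values already normalized to [0, 1].

```python
def saliency_mae(pred, gt):
    """Mean absolute per-pixel difference between a saliency map and its mask."""
    total = 0.0
    count = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            total += abs(p - g)
            count += 1
    return total / count
```

Lower values are better, and a perfect prediction gives 0. The structural similarity and weighted F-score metrics involve region and weighting terms that are beyond this short sketch.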


Illumination Transformation and Depth Invariance Constraint Depth Estimation for Low-Light Scene

CAO Xiaoqian, WANG Yang, LIU Weifeng

Computer Engineering and Applications 2025, 61 (17): 272-281. DOI: 10.3778/j.issn.1002-8331.2501-0103


A novel depth estimation algorithm based on illumination transformation and a depth invariance constraint is proposed to address the significant performance degradation that occurs in low-light scenarios, such as assisted driving at night. The key idea is to improve the depth estimation network's low-light generalization ability through diverse low-illumination transformations and a depth invariance constraint on the same scene, which forces the network to extract underlying light-independent depth features. Specifically, a paired RGB-depth dataset under good illumination is first obtained with a SOTA depth estimation network. Then, for each RGB image captured under good light, its illumination component is estimated and transformed with reference to a large number of low-light scene images, generating a series of low-light images of the same scene. Finally, the depth estimation network is fine-tuned using the depth invariance constraint between the generated low-light images and the original RGB image. The experimental results indicate that the proposed algorithm outperforms the original depth estimation algorithm (Lite-Mono) and the SOTA low-light depth estimation algorithms (STEPS and ADDS) on all evaluation criteria. In addition, the algorithm can be conveniently embedded into other classical depth estimation networks to improve their adaptability.


Attention Tracking Algorithm for Efficient Tracking Heads

YANG Xiaoqiang, HU Hao

Computer Engineering and Applications 2025, 61 (17): 282-291. DOI: 10.3778/j.issn.1002-8331.2404-0128


To improve the accuracy and running speed of tracking networks, an improved attention-based tracking algorithm (MFATrack) is proposed. To address the interference that complex backgrounds cause when tracking a target, a dynamic tracking module (MFAM) combining depth-wise separable convolution, ECA, and low-pass filters is used to strengthen the network's ability to discover discriminative target features. The paper also designs a tracking head network based on MFAM to reduce information loss in deep layers and improve stability, thereby improving the network's running speed. The loss function combines classification and regression losses: the classification loss incorporates an IoU-aware (intersection over union) term, while the regression loss uses the generalized intersection over union (GIoU) loss, making the network focus more on the tracked target during training. The experimental results show that, compared with the baseline algorithm, accuracy on the GOT-10k dataset increases by 4.1 percentage points, the center-error-based score on the OTB dataset increases by 1.9 percentage points, and the tracking success rate on UAV123 increases by 3.6 percentage points.
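The regression loss mentioned in this abstract is commonly known as generalized intersection over union (GIoU). The following is a minimal sketch of the standard GIoU definition for a single pair of boxes, not the authors' implementation. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)` with positive area, and the function names are hypothetical.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes (x1, y1, x2, y2), in [-1, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union

    # Area of the smallest box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    # Penalize the empty space inside the enclosing box.
    return iou - (enclose - union) / enclose


def giou_loss(box_a, box_b):
    """Regression loss: 0 for identical boxes, approaching 2 for distant boxes."""
    return 1.0 - giou(box_a, box_b)
```

Unlike plain IoU, GIoU still varies when the boxes do not overlap, because the enclosing-box penalty depends on how far apart they are. That is what makes it usable as a regression loss.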


Content of Graphics and Image Processing in our journal