Table of Contents

    2024-11-15, Volume 60 Issue 22
    Research Hotspots and Reviews
    Research Progress on Designing Lightweight Deep Convolutional Neural Networks
    ZHOU Zhifei, LI Hua, FENG Yixiong, LU Jianguang, QIAN Songrong, LI Shaobo
    2024, 60(22):  1-17.  DOI: 10.3778/j.issn.1002-8331.2404-0372
    Lightweight design is a popular paradigm for reducing the dependence of deep convolutional neural networks (DCNNs) on device performance and hardware resources; its purpose is to increase computational speed and reduce memory footprint without sacrificing network performance. This paper presents an overview of lightweight design approaches for DCNNs, focusing on research progress in recent years. It covers two major lightweighting strategies, namely system design and model compression, compares the innovations, strengths and limitations of the two types of approaches in depth, and explores the underlying frameworks that support lightweight models. In addition, scenarios in which lightweight networks have been successfully applied are described, and future development trends of DCNN lightweighting are predicted, aiming to provide useful insights and references for research on lightweight deep convolutional neural networks.
    Survey of Multimodal Knowledge Graph Construction Technology and Its Application in Military Field
    YAO Yi, CHEN Zhaoyang, DU Xiaoming, YAO Tianlei, LI Qingshang, SUN Mingwei
    2024, 60(22):  18-37.  DOI: 10.3778/j.issn.1002-8331.2404-0285
    With increasingly rich types of data resources and the development of large language model technology, the multimodal knowledge graph (MMKG), which can handle multi-source heterogeneous data, has attracted wide attention because of its excellent data processing and management capabilities. Taking the requirements and characteristics of the military field into account, this paper surveys the construction technology of multimodal knowledge graphs and its application in the military field. Building on the relevant concepts of traditional text knowledge graphs, it summarizes the basic concepts and research status of multimodal knowledge graphs; analyzes the key technologies of multimodal knowledge graph construction, namely multimodal information extraction, multimodal entity linking and multimodal representation learning, together with the application of large language model technology in the construction process; and discusses the application scenarios of multimodal knowledge graphs in the military field. Finally, in light of current research on large language models and military requirements, the development prospects and military applications of multimodal knowledge graph construction technology are summarized.
    Survey on Temporal Knowledge Graph Completion Research
    XU Kaijia, LIU Lin, WANG Hailong, LIU Jing
    2024, 60(22):  38-57.  DOI: 10.3778/j.issn.1002-8331.2404-0331
    Temporal knowledge graphs commonly suffer from incompleteness, which severely restricts their application and development in downstream tasks. Temporal knowledge graph completion (TKGC) techniques aim to predict the missing links within these graphs to address the incompleteness issue. By incorporating the time dimension, TKGC seeks to capture temporal information and thus understand how entities and relationships change over time, which helps complete the temporal knowledge graph more accurately. This paper reviews the latest research advancements in TKGC, organized by the strategies used to apply temporal information. Firstly, it explains the research background of TKGC in detail, including problem definitions and key benchmark datasets. Secondly, it introduces and summarizes existing TKGC methods according to the proposed classification, and discusses the applications of TKGC in downstream tasks. Finally, it outlines current challenges and future research directions.
    Research Progress of Image Inpainting Methods Based on Deep Learning
    CHEN Wenxiang, TIAN Qichuan, LIAN Lu, ZHANG Xiaohang, WANG Haoji
    2024, 60(22):  58-73.  DOI: 10.3778/j.issn.1002-8331.2406-0100
    Image inpainting is the process of recovering and repairing damaged or missing parts of an image through algorithms or techniques, which is a significant research focus in the field of computer vision. This paper reviews the development trajectory of deep learning-based image inpainting methods in recent years, and categorizes them into single-modal and multi-modal methods. The single-modal image inpainting methods are divided into convolutional autoencoder-based methods, GAN-based methods, Transformer-based methods and diffusion model-based methods. Meanwhile, the multi-modal image inpainting methods include text-guided methods, audio-guided methods, video-guided methods and multi-modal fusion-based methods. Furthermore, this paper provides a comparative analysis of the principles, advantages and disadvantages of various methods. It also introduces commonly used datasets and evaluation metrics, assesses the performance of representative methods on standard datasets, and discusses current challenges and future directions in this domain.
    Survey of Image Detection Methods Generated by GAN Models
    XIE Tianqi, WU Yuanyuan, JING Chao, SUN Weiheng
    2024, 60(22):  74-86.  DOI: 10.3778/j.issn.1002-8331.2405-0346
    As a powerful tool for generating high-quality images, the generative adversarial network (GAN) has been widely used in the field of image synthesis in recent years. However, the rapid development of GAN technology has also raised serious concerns about image forgery and fraud, especially in key areas such as news reporting, identity authentication and judicial forensics. These fake images are not only difficult to identify, but may also be used to spread false information, commit fraud, or even cause irreparable damage in legal cases. To cope with this challenge, researchers have proposed a variety of methods for detecting GAN-generated images, which can be mainly divided into feature-based methods and data-driven methods. This paper systematically reviews the main current GAN image detection methods, and verifies their detection accuracy on different datasets through re-training experiments. Finally, future development trends of GAN image detection are discussed, and potential research directions are proposed to promote further innovation and development in this field.
    Theory, Research and Development
    Classification Multi-Strategy Predictive Dynamic Multi-Objective Optimization with Pareto Set Rotation
    LI Erchao, LIU Chenmiao
    2024, 60(22):  87-104.  DOI: 10.3778/j.issn.1002-8331.2401-0145
    To solve dynamic multi-objective optimization problems with Pareto set (PS) rotation more effectively, this paper proposes a classification multi-strategy prediction method based on PS rotation (RFM). Firstly, the rotation types of the PS are divided into PS center point rotation, PS origin rotation and non-standard rotation. Then, an appropriate prediction model is adaptively selected for each PS rotation type, and time series of different point sets are established to provide the initial population for the subsequent evolution. Finally, a random population generated by the Latin hypercube strategy (LHS) is combined with the predicted population to construct a new population and ensure population diversity. To verify the effectiveness of the algorithm, RFM is compared with the DNSGA-II, PPS, SPPS and MMP algorithms on eight standard dynamic test functions. The experimental results show that RFM achieves six optimal IGD values, seven optimal SP values and three optimal MS values, which indicates that RFM can solve dynamic multi-objective optimization problems based on PS rotation more effectively. The generality of RFM is also verified by experiments on the FDA series of functions, where the algorithm still performs better on non-rotating dynamic multi-objective optimization problems.
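    The Latin hypercube step mentioned above can be illustrated with a minimal sketch. It is not the RFM implementation: the function name `latin_hypercube`, the `bounds` argument and the use of Python's standard `random` module are assumptions made for illustration. Each dimension is cut into as many equal strata as there are samples, and every stratum is used exactly once per dimension, which is what spreads the random individuals over the search space.

```python
import random


def latin_hypercube(n_samples, bounds, rng=None):
    """Draw n_samples points inside the box given by bounds.

    bounds is a list of (low, high) pairs, one per decision variable
    (hypothetical interface, chosen for this sketch only).
    """
    rng = rng or random.Random()
    columns = []
    for low, high in bounds:
        width = (high - low) / n_samples
        # One uniform draw inside each of the n_samples strata.
        column = [low + (i + rng.random()) * width for i in range(n_samples)]
        # Shuffle so strata are paired at random across dimensions.
        rng.shuffle(column)
        columns.append(column)
    # Transpose per-dimension columns into a list of points.
    return [list(point) for point in zip(*columns)]


if __name__ == "__main__":
    for point in latin_hypercube(5, [(0.0, 1.0), (-2.0, 2.0)], random.Random(0)):
        print(point)
```

    How RFM mixes such random individuals with the predicted ones (for example, in what proportion) is not stated in the abstract, so the sketch stops at sampling.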
    Optimization Research on Quantum Circuit Scheduling Strategies for NISQ Devices
    LI Hui, LU Kai, HAN Zi’ao, JU Mingmei, LIU Shujuan, DU Zuoqiang
    2024, 60(22):  105-113.  DOI: 10.3778/j.issn.1002-8331.2401-0224
    In the noisy intermediate-scale quantum (NISQ) era, scheduling plays a critical role in the compilation of quantum circuits. Traditional scheduling strategies fail to fully exploit the parallelism of quantum computing and overlook the potential for parallel optimization within layers. To address this, two optimization strategies are designed: the topological layered scheduling strategy (TLSS) and the layerwise conflict optimization strategy (LCOS). TLSS uses greedy algorithms and topological sorting principles to allocate quantum gates within the layer structure, maximizing the parallel execution of quantum gate operations. LCOS inserts SWAP gates within layers to minimize conflicts and enhance parallelism, optimizing overall computational efficiency. Experimental results demonstrate that, in specific environments involving 4 to 22 qubits, planar topological structures, and an average lifespan of 67 μs for two qubits, TLSS and LCOS reduce the number of SWAP gates by 51.1% and 53.2%, respectively, and decrease hardware gate overheads by 14.7% and 15%. Combining both strategies reduces the number of SWAP gates by 51.6% and hardware gate overheads by 14.8%, a result attributed to the complexity of quantum circuits and the interference in inter-layer temporal relationships. However, the applicability of the results depends on circuit structure and hardware constraints.
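    To make the idea of greedy, dependency-respecting layering concrete, here is a minimal sketch of as-soon-as-possible layering. It is a generic textbook-style scheme, not the TLSS algorithm from the paper; the function name `layer_gates` and the `(name, qubits)` gate representation are assumptions for illustration, and hardware connectivity (the part LCOS addresses with SWAP gates) is ignored.

```python
def layer_gates(gates):
    """Group gates into layers of operations that can run in parallel.

    gates: list of (name, qubits) tuples in program order, where qubits
    is a tuple of qubit indices (hypothetical representation).
    Each gate goes into the earliest layer in which all of its qubits
    are free, so program order on every qubit is preserved.
    """
    next_free = {}  # qubit index -> first layer in which it is idle
    layers = []
    for name, qubits in gates:
        layer = max((next_free.get(q, 0) for q in qubits), default=0)
        if layer == len(layers):
            layers.append([])
        layers[layer].append((name, qubits))
        for q in qubits:
            next_free[q] = layer + 1
    return layers


if __name__ == "__main__":
    circuit = [
        ("h", (0,)),
        ("h", (1,)),
        ("cx", (0, 1)),
        ("x", (2,)),
        ("cx", (1, 2)),
    ]
    for depth, layer in enumerate(layer_gates(circuit)):
        print(depth, layer)
```

    In this example the single-qubit gates on qubits 0, 1 and 2 share the first layer, while the two CX gates occupy separate layers because both act on qubit 1.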
    Pattern Recognition and Artificial Intelligence
    Cross-Modal Semantic Alignment and Information Refinement for Multi-Modal Sentiment Analysis
    DING Meirong, CHEN Hongye, ZENG Biqing
    2024, 60(22):  114-125.  DOI: 10.3778/j.issn.1002-8331.2307-0431
    To address the heterogeneity gap, the semantic gap and the inability to fuse modalities effectively in multi-modal sentiment analysis, this paper proposes CM-SAIR, a multi-modal sentiment analysis framework based on a cross-modal Transformer for semantic alignment and information refinement, which can effectively mitigate problems such as multi-modal semantic misalignment and semantic noise and achieve better interactive fusion of multi-modal data. A multi-modal feature embedding module (MFE) is used to enhance the emotional information of the visual and audio modalities. A well-defined inter-modal semantic alignment module (ISA) is proposed to align bimodal temporal dimensions. Sentiment parsing and sentiment refinement are performed through an intra-modal information refinement module (IIR). Effective modal fusion is achieved through a multi-modal gated fusion module (MGF). Extensive experiments on popular multi-modal sentiment analysis datasets demonstrate the advantages of the CM-SAIR framework over state-of-the-art baselines.
    Multi-Source Intelligence Driving International Strategy Game Decision Analysis Based on Extenics
    ZHANG Wei, WEI Xinlei, NIE Yun, DU Yanshuang, NIU Pengfei, LIANG Jia, LEI Jiyue, WANG Jikun
    2024, 60(22):  126-136.  DOI: 10.3778/j.issn.1002-8331.2404-0369
    International strategy games bear on major national strategic issues such as national security and competition, military conflict and war, and crisis control. Such games are macroscopic, holistic, foreseeable and fuzzy, and the related elements, relationships, rules and indicators are correlated in complex ways, so quantitative simulation and modeling of international strategy games is difficult. Moreover, most decision analysis methods for international strategy games focus on the current strategic situation and lack prospective analysis based on the movement of contradictions, which makes it hard to avoid awkward positions arising in the game. To address these problems, a multi-source intelligence driven decision analysis method for international strategy games based on extenics is proposed. In this method, a nation-society structure extension model and a national interest structure extension model are constructed for the international strategy game. Meanwhile, a national power model is proposed based on five domains (political, economic, military, diplomatic and cognitive), and a national interest measure model is proposed based on the relative interests of the state. An international strategy game decision analysis algorithm is then designed. Verification on open source data shows that the nation-society structure extension model and the national interest structure extension model can effectively characterize the international strategy game situation and accurately identify the principal country that is the object of the game and the key domain, consistent with the real situation. The results demonstrate the effectiveness of the proposed decision analysis method.
    Low-Resource Knowledge Graph Completion by Combining Entity and Relation Message Passing
    ZHANG Ting, DU Fang, SONG Lijuan, SHI Yingjie, ZHAO Guodong, LI Ting
    2024, 60(22):  137-144.  DOI: 10.3778/j.issn.1002-8331.2312-0151
    Knowledge graph completion is a significant research topic in knowledge graphs. Most existing work assumes that sufficient triple instances are available. However, in vertical domains such as medicine and law, data are difficult to obtain, and the lack of adequate data and prior knowledge places these domains in low-resource scenarios. Therefore, studying knowledge graph completion methods in low-resource scenarios is crucial for solving practical problems. This paper proposes a low-resource knowledge graph completion method, SMKGC (similarity messages knowledge graph completion), which combines node and edge message passing to predict links by sensing local relations and entities. Unlike existing methods, SMKGC fuses the feature information of semantically similar nodes and semantically parallel edges with the Pathcon model, thereby enhancing the feature representation of message passing and improving link prediction accuracy. Specifically, it includes two modules: (1) entity-similarity-based messaging, which aggregates information surrounding similar entities on the basis of capturing the neighboring edges of a given entity pair; and (2) relation-similarity-based messaging, which obtains the relative positions of a given entity pair through relational paths and combines the similar edges of the paths to perform relation prediction. Experimental results demonstrate that this method significantly outperforms other methods on benchmark datasets commonly used in knowledge graphs. The results also show that similar entities and relation messaging based on similar nodes can improve prediction accuracy in low-resource knowledge graph completion tasks.
    Dual-Branch Feature Fusion Remote Sensing Building Detection Model
    CHENG Jiawei, GUO Rongzuo, WU Jiancheng, ZHANG Hao
    2024, 60(22):  145-153.  DOI: 10.3778/j.issn.1002-8331.2307-0270
    To address the low accuracy caused by varying building sizes and fuzzy edges in remote sensing building images, a dual-branch parallel fusion attention network model, TC-UNet++, is proposed. Firstly, because convolutional neural networks are good at extracting local features but struggle to capture global information, a Transformer structure is introduced to address the loss of global information. Secondly, to resolve the mismatch between the feature dimensions and channel numbers of the two structures, a TC (Transformer to CNN) module is designed to interactively integrate local and global features at different resolutions. Finally, a coordinate attention mechanism is introduced to locate and identify buildings according to the position information of pixels in the image. Experimental results show that the intersection over union (IoU), accuracy and total accuracy of TC-UNet++ on the WHU dataset reach 93.1%, 95.9% and 98.8%, respectively, showing good effectiveness without significantly increasing the number of parameters.
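    Assuming the first of the three reported segmentation metrics is intersection over union (IoU), which is the usual overlap measure for building extraction, it can be computed for a pair of binary masks as in the minimal sketch below. The function name `iou`, the nested-list mask format and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the paper.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks of equal shape.

    pred and truth are nested lists of 0/1 values (hypothetical format).
    Returns 1.0 when both masks are empty, a convention assumed here.
    """
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    return intersection / union if union else 1.0


if __name__ == "__main__":
    predicted = [[1, 1, 0],
                 [0, 1, 0]]
    reference = [[1, 0, 0],
                 [0, 1, 1]]
    print(iou(predicted, reference))  # 2 shared pixels / 4 covered pixels
```

    Here two building pixels are shared and four pixels are covered by either mask, so the overlap is 2/4.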
    Completion of Temporal Knowledge Graph for Historical Contrastive Learning
    XU Zhihong, QIU Penglin, WANG Liqin, DONG Yongfeng
    2024, 60(22):  154-161.  DOI: 10.3778/j.issn.1002-8331.2307-0291
    Existing temporal knowledge graph completion models depend heavily on events that have occurred in history and predict events that have not occurred in history inaccurately. To address this problem, a temporal knowledge graph completion model that contrasts historical and non-historical information (CHNH) using time series information is proposed. Firstly, the model captures long-term dependencies in the sequence through BiLSTM, ensuring accurate encoding of historical information. Then, a graph convolution operation is performed using RGCN to learn the global graph representation. In the prediction process, different scoring functions are applied to the separately encoded historical and non-historical information to determine how strongly the predicted entity depends on each type of information. In this way, the model can complete entities and relationships more effectively and improve its predictive performance. Experimental results on the ICEWS18, GDELT and YAGO datasets show that the CHNH model generally outperforms the baseline models in MRR, Hits@1, Hits@3 and Hits@10.
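    The ranking metrics named above have standard definitions, sketched here for reference. The helper names `mrr` and `hits_at_k` and the sample ranks are illustrative assumptions; each rank is the 1-based position of the correct entity in a model's sorted candidate list.

```python
def mrr(ranks):
    """Mean reciprocal rank: average of 1 / rank over all test queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of queries whose correct answer is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


if __name__ == "__main__":
    # Hypothetical ranks of the correct entity for five test queries.
    ranks = [1, 3, 2, 10, 50]
    print("MRR    ", mrr(ranks))
    print("Hits@1 ", hits_at_k(ranks, 1))
    print("Hits@3 ", hits_at_k(ranks, 3))
    print("Hits@10", hits_at_k(ranks, 10))
```

    MRR rewards placing the correct entity near the top of the list, while Hits@k only checks whether it appears among the first k candidates, which is why the two are usually reported together.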
    Chinese Named Entity Recognition Based on External Knowledge and Position Information
    LI Yuan, Luosang Gadeng, JIANG Weili
    2024, 60(22):  162-171.  DOI: 10.3778/j.issn.1002-8331.2307-0395
    Named entity recognition (NER) is an important and fundamental task in information retrieval and natural language processing. Unlike English NER, existing Chinese NER methods suffer from the Chinese word segmentation (CWS) problem and a lack of domain knowledge. To solve these problems, this paper proposes a Chinese NER method that combines knowledge graph embedding (KGE) and position information with a mask to enhance Lattice semantics. Lattice information lays a structural foundation for completing word-level information and solving the CWS problem. KGE supplements and locates the domain knowledge missing from pre-trained language models. Position information with a mask addresses the knowledge noise introduced by using knowledge graphs. The proposed method works well in both general and specific domains, and its F1 values on Weibo, Resume and CCKS 2017 reach 74.01%, 96.62% and 94.95%, respectively.
    Multimodal Aspect-Level Sentiment Analysis Based on Multi-Granularity View Dynamic Fusion
    YANG Ying, QIAN Xinyu, WANG Hening
    2024, 60(22):  172-183.  DOI: 10.3778/j.issn.1002-8331.2309-0082
    To address inadequate feature extraction, low utilization of data information, and neglect of the complex interactions in multimodal data for aspect sentiment analysis, a multi-granularity view dynamic fusion model (MVDFM) is proposed. Firstly, text and image data are encoded from both coarse-grained and fine-grained perspectives, so as to fully capture data features and enhance the information representation ability of the model. Secondly, multi-granularity view features of text and image are extracted, and a dynamic gated self-attention mechanism is designed to reduce the noise of fine-grained text and image views and further ensure the quality of feature extraction. Finally, to exploit the complementarity and consistency between multiple views at different granularities, a triple-view factorized bilinear pooling mechanism is proposed to carry out two-stage dynamic fusion of multi-granularity view features and obtain the final sentiment polarity of the target aspect. The experimental results show that the accuracy and F1 values of the model on the public datasets Twitter-2015 and Twitter-2017 reach 78.69% and 74.48%, and 72.77% and 71.61%, respectively. Compared with the best baseline model, the improvement is 0.55 and 0.88 percentage points, and 1.67 and 2.45 percentage points, respectively. This method can make full use of the information contained in the multimodal data, and effectively mine the key parts related to the target aspect words to improve aspect-level sentiment prediction.
    Multi-Hop Knowledge Base Question Answering with Pre-Trained Language Model Feature Enhancement
    WEI Qianqiang, ZHAO Shuliang, LU Danqi, JIA Xiaowen, YANG Shilong
    2024, 60(22):  184-196.  DOI: 10.3778/j.issn.1002-8331.2311-0459
    Knowledge base question answering (KBQA) is a challenging and popular research direction. The main challenge of multi-hop knowledge base question answering is the inconsistency between unstructured natural language questions and structured knowledge base reasoning paths. Multi-hop KBQA models based on graph retrieval are good at grasping the topological structure of the graph, but ignore the text information carried by the nodes and edges in the graph. To fully learn the text information of knowledge base triples, this paper constructs a text form of knowledge base triples and proposes three feature enhancement models based on non-graph retrieval: RBERT, CBERT and GBERT. The three models use feedforward neural networks, deep pyramid convolutional networks and graph attention networks, respectively, to enhance features, and they significantly improve feature representation capability and question answering accuracy. RBERT has the simplest structure, CBERT is the fastest to train, and GBERT has the best performance. In experimental comparisons on the MetaQA, WebQSP and CWQ datasets, the three models are significantly better than current mainstream models on both Hits@1 and F1, and are also significantly better than other improved BERT models.
    Lightweight and Efficient Human Pose Estimation Fusing Transformer and Attention
    WU Chengpeng, TAN Guangxing, CHEN Haifeng, LI Chunyu
    2024, 60(22):  197-208.  DOI: 10.3778/j.issn.1002-8331.2401-0173
    To address the heavy computational cost and large network scale of human pose estimation algorithms, a lightweight efficient vision transformer for human pose estimation (LEViTPose) is proposed. Firstly, a lightweight preprocessing module, LStem, is designed by introducing depthwise separable convolution, channel shuffle and parallel multi-scale convolution kernels. Then, a cascaded group spatial linear reduction attention (CGSLRA) is proposed, which uses feature grouping to divide multiple attention heads to improve memory efficiency, and uses intra-group feature dimension reduction to reduce computational redundancy. Finally, a lightweight feature recovery module (LFRM) is designed by introducing pointwise convolution and group transposed convolution. The experimental results show that the proposed method can improve the network performance and inference speed while reducing the network size and computational overhead compared to the baseline model. Compared with LiteHRNet-30 on the MPII and COCO validation sets, the average accuracy is improved by 2.6 and 3.4 percentage points, and the inference speed is increased by a factor of 1.
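    Why depthwise separable convolution helps a lightweight stem can be seen from a parameter count. The sketch below compares a standard convolution with its depthwise-plus-pointwise factorization; the function names and the 3×3, 64-to-128-channel example are assumptions chosen for illustration (biases are ignored) and are not the actual LStem configuration.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise conv followed by a 1x1 pointwise conv (no bias)."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    k, c_in, c_out = 3, 64, 128  # hypothetical layer sizes
    full = standard_conv_params(k, c_in, c_out)
    separable = depthwise_separable_params(k, c_in, c_out)
    print(full, separable, round(full / separable, 2))
```

    For this example layer the factorized form needs 8,768 weights instead of 73,728, roughly an eightfold reduction, at the cost of separating spatial filtering from channel mixing.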
    Graphics and Image Processing
    Small-Scale Hand Detection Method in Complex Backgrounds Based on Parallel Mixed Attention Mechanism
    LIANG Chao, WANG Yangping, WANG Wenrun
    2024, 60(22):  209-218.  DOI: 10.3778/j.issn.1002-8331.2307-0302
    In response to the challenges posed by unclear hand features and significant scale variations in complex backgrounds, this paper proposes a small-scale hand detection method based on YOLOv5. Firstly, a parallel mixed attention mechanism (PMAM) is designed and integrated into the backbone network to enhance the extraction of hand features. Secondly, a path bidirectional-feature pyramid network (PB-FPN) is introduced, combining the path aggregation network (PANet) and the bidirectional feature pyramid network (BiFPN), and incorporating new pathways for bottom-level feature fusion to improve the detection of small-scale hand objects. Furthermore, the spatial pyramid pooling-fast (SPPF) module from the backbone network is incorporated into the feature fusion network and connected with the prediction heads of the model to further enhance performance. FReLU is used as the activation function in the network model to improve spatial sensitivity and robustness. To validate the effectiveness of the proposed method, a new dataset named TV-COCO-Hand, tailored to the research context, is constructed and used for related experiments. The results show that the improved model achieves an mAP of 91.4% on the constructed dataset, a 3.8 percentage point improvement over the baseline network model, and outperforms current mainstream detection network models. Additionally, dataset comparison experiments on public datasets and detection experiments in real-world scenarios are conducted to verify the generalization of the model.
    Combining Dynamic Split Convolutions and Attention for Multi-Scale Human Pose Estimation
    FENG Mingwen, XU Yang, ZHANG Yongdan, XIAO Ci, HUANG Yiqian
    2024, 60(22):  219-229.  DOI: 10.3778/j.issn.1002-8331.2307-0301
    Human pose estimation has become increasingly important in many fields such as animation design, security monitoring, and motion analysis. However, current human pose estimation algorithms focus on accuracy, leading to complex networks with high computational costs, making it difficult to apply them on mobile devices and embedded platforms. To address this challenge, this paper proposes the DNSNet, a multi-scale human pose estimation network that combines dynamic split convolution and normalized attention. Firstly, the bottleneck layer DKASCneck of the high-resolution network is redesigned using dynamic split convolution and dynamic kernel aggregation operations. This avoids excessive use of large convolution kernels, reduces computational costs while enhancing the ability of the network to extract useful features. Secondly, the NAMPCblock, a basic module using partial convolution and normalization-based attention mechanism, is introduced. This module reduces computational redundancy and memory access while enhancing information interaction across channels and spatial dimensions. Finally, the output feature fusion method of the network is redesigned based on multi-resolution features and deconvolution to improve the accuracy of heatmap regression predictions. Experimental results show that compared to high-resolution networks, on the COCO validation set, the average accuracy of the proposed network model is increased by 2.1 percentage points, the computational complexity is reduced by 32.4% and the model parameters are reduced by 71.9%. On the MPII validation set, the computational complexity is reduced by 38.9%, and the model parameters are reduced by 71.9%. The experimental data demonstrate that the proposed network significantly reduces network complexity while slightly improving detection accuracy.
    Multi-Branch Thinning Congested Pedestrian Detection Algorithm
    YUAN Heng, WANG Jiali, ZHANG Shengchong
    2024, 60(22):  230-239.  DOI: 10.3778/j.issn.1002-8331.2307-0283
    Crowded pedestrian detection is a research hotspot in the field of small target detection. To address missed detections caused by dense crowds and occlusion in crowded pedestrian detection scenes, an improved SSD (single shot multibox detector) target detection algorithm is proposed. Firstly, batch normalization (BN) operations are used to add branch structures to the plain structure of the shallow VGG (visual geometry group) network, yielding what is named the multi-branch thinning network structure, which refines shallow semantic information, improves network generalization ability, and fully expresses pedestrian information. Secondly, an improved Ghost model replaces the 3×3 convolution in the multi-branch thinning network: the cheap_operation convolution in the Ghost model reduces the number of model parameters added by the multi-branch structure, and the primary_conv improves the feature extraction capability of shallow networks and strengthens the recognition capability of the network. Finally, the Huber loss function is improved by using the L2 norm instead of the squared difference, which enhances the stability of network training and leads to better convergence. The detection results on the Wider_Person crowded pedestrian detection dataset show that the mAP50 of the proposed improved SSD target detection algorithm reaches 72.9%, which is 7.4 percentage points ahead of the YOLO-X algorithm, 3.5 percentage points ahead of the baseline algorithm, and 14.4 percentage points ahead of other advanced algorithms on average. The feasibility of the algorithm in pedestrian detection is verified, and it meets the detection requirements of scenes with occluded pedestrians.
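    The abstract does not spell out the modified loss, so the sketch below shows only the standard Huber loss that it starts from: quadratic for small residuals and linear for large ones. The function name `huber` and the default threshold `delta=1.0` are assumptions for illustration and do not reproduce the paper's variant.

```python
def huber(residual, delta=1.0):
    """Standard Huber loss for a single residual (prediction minus target).

    Quadratic when |residual| <= delta, linear beyond it, so large
    errors (outliers) contribute less than under a squared loss.
    """
    abs_r = abs(residual)
    if abs_r <= delta:
        return 0.5 * residual * residual
    return delta * (abs_r - 0.5 * delta)


if __name__ == "__main__":
    for r in (0.5, 1.0, 3.0):
        print(r, huber(r))
```

    The two branches meet at |residual| = delta with the same value and slope, which is what keeps gradients bounded for badly localized boxes while staying smooth near zero.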
    Vehicle Detection Algorithm Based on Dual Branch Feature Aggregation Network
    LYU Meng, MAO Shenghui, CHAI Liang, GAO Pengfei, SHI Lei
    2024, 60(22):  240-250.  DOI: 10.3778/j.issn.1002-8331.2405-0401
    Vehicle target detection is an important part of autonomous driving. Existing vehicle target detection algorithms have not fully considered the respective advantages and disadvantages of CNNs (convolutional neural networks) and Transformers in feature extraction, which to some extent limits the overall performance of the network. This paper proposes a dual branch feature aggregation network consisting of a CNN and a Transformer. In the encoding stage, a dual branch backbone network is constructed, based on the respective advantages of the CNN and the Transformer, to extract the feature information of the original image. A multi-level spatial attention module and a dual branch feature aggregation module are designed to guide the two branches to learn feature information from each other. Finally, a dual branch attention module is constructed to further reduce the loss of feature information in deep neural networks. The effectiveness of the proposed algorithm is verified through ablation experiments and comparative experiments. Compared to mainstream object detection algorithms, the proposed algorithm improves the mAP (mean average precision) metric by about 3.5%.
    Improved MobileViT Algorithm for Small Samples
    ZHANG Bushi, FAN Hong
    2024, 60(22):  251-260.  DOI: 10.3778/j.issn.1002-8331.2307-0250
    To improve the classification ability, training speed, convergence, and inference speed of the Transformer-based MobileViT algorithm on small-sample data, two modules are proposed and inserted into MobileViT: convolutional maxpooling downsampling (CMP) and multi-branch residual feature fusion (MR-FF). These modules are used, respectively, to reduce model parameters and to reduce feature redundancy while preventing input feature loss. Taking the MobileViT variant with the fewest parameters as an example, comparative experiments are conducted on the Oxford Flower102 and Mini-ImageNet small-sample datasets. The MobileViT with the inserted CMP and MR-FF modules achieves increases of 12.9 and 9.4 percentage points in test accuracy, a 17% increase in training speed, and a 0.31 ms improvement in inference speed. Furthermore, it is found that when only the CMP module is inserted into MobileViT, higher classification accuracy and shorter inference time can be achieved on small-sample datasets with fewer than 60 000 images. Finally, a comparison is made with 5 advanced image classification algorithms, and the improved MobileViT achieves the best test results on small-sample data.
    YOLOv8 Crack Defect Detection Algorithm Based on Multi-Scale Features
    ZHAO Baiting, CHENG Ruifeng, JIA Xiaofen
    2024, 60(22):  261-270.  DOI: 10.3778/j.issn.1002-8331.2404-0332
    To solve the problems of low detection efficiency and missed detections caused by the complex background and large aspect ratio differences of shaft lining cracks, a crack defect detection model with multi-scale features, EDG-YOLO, is proposed. Firstly, the feature extraction module EIRBlock (efficient inverted residual block) is designed, and C2fEIR is constructed to enhance the ability of the backbone network to extract shallow crack feature information. Secondly, the CSP_EDRAN (CSP efficient dilated reparam aggregation network) is fused into the neck to reuse the crack feature information and promote the interaction between shallow and deep semantic information. Meanwhile, the DAM (dual attention module) attention mechanism is embedded to enhance the expression of shaft lining crack features. Finally, a lightweight detection head, GDetect, is constructed, and the network is made lighter still with the help of the GSConv module. Experimental results on the self-made shaft lining crack dataset show that the average detection accuracy of EDG-YOLO is 87.4%, which is 2.3 percentage points higher than that of YOLOv8, while the number of parameters and the amount of computation of the model are reduced by 33% and 47%, respectively. The inference time for a single image is 13.2 ms, which meets the real-time detection requirements of downhole scenes.
    Remote Sensing Image Super-Resolution Algorithm Based on LR Coding Network and Diffusion Model
    XU Xiaoyang, ZHANG Mengfei
    2024, 60(22):  271-281.  DOI: 10.3778/j.issn.1002-8331.2311-0166
    To address blurry results and the loss of detail texture in remote sensing image super-resolution reconstruction, a remote sensing image super-resolution network model suitable for multi-scale tasks, pDDPMSR, is proposed. Firstly, an efficient pixel shift convolution module, SCAM, is constructed by combining shift convolution with a serial multi-attention mechanism; it expands the receptive field to enhance the extraction of local features and thereby improves image clarity. At the same time, multi-attention is used to focus on the high-frequency information of the image in the channel and spatial dimensions to enhance the expression of contour detail information. Secondly, to prevent the loss of detailed texture, CA-ASPP is designed to fuse coordinate attention with a multi-scale atrous convolutional pyramid network, so as to capture context information at different scales. Finally, the denoising diffusion probabilistic model (DDPM) is introduced to generate the high-resolution image. Layer skip sampling is used to accelerate DDPM inference, and a nonlinear noise scheduling scheme is designed to solve the problem of excessive noise at the end of the DDPM noise-adding process. Experimental results on the public dataset RSSCN7 show that pDDPMSR outperforms the comparison algorithms in peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), and that layer skip sampling accelerates the inference process of the diffusion model by a factor of 10.
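The abstract does not define its nonlinear noise schedule or its skip-sampling rule. As a hedged illustration of the two ideas only, the sketch below swaps in the widely used cosine schedule as an example of a nonlinear schedule and a simple fixed-stride subsequence of timesteps as an example of skip sampling. All function names, the offset `s`, and the clipping value `max_beta` are assumptions, not the authors' design.

```python
import math


def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level alpha_bar(t) under a cosine schedule."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def cosine_betas(T, max_beta=0.999):
    """Per-step noise variances derived from alpha_bar, clipped at max_beta."""
    betas = []
    for t in range(1, T + 1):
        beta = 1 - cosine_alpha_bar(t, T) / cosine_alpha_bar(t - 1, T)
        betas.append(min(beta, max_beta))
    return betas


def skip_timesteps(T, stride):
    """Timesteps visited during sampling when only every stride-th step is kept."""
    return list(range(0, T, stride))[::-1]


if __name__ == "__main__":
    betas = cosine_betas(1000)
    steps = skip_timesteps(1000, 10)
    print(len(betas), len(steps), steps[:3], steps[-1])
```

With `T = 1000` and `stride = 10` the sampler visits 100 timesteps instead of 1000, which is the kind of tenfold reduction in denoising steps the abstract reports.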
    Lightweight Full-Flow Bidirectional Fusion Network for 6D Pose Estimation
    LIN Haotian, LI Yongchang, JIANG Jing, QIN Guangjun
    2024, 60(22):  282-291.  DOI: 10.3778/j.issn.1002-8331.2307-0335
    Six degrees of freedom (6D) pose estimation is a key step in applications such as robot grasping and manipulation, augmented reality, and autonomous driving. Conventional 6D pose estimation methods focus on designing complex networks to improve the estimation effect, while ignoring the practical deployment difficulties caused by high model complexity and a large number of parameters. Based on FFB6D, this paper designs a lightweight full-flow bidirectional fusion network (LFFB6D), a lightweight 6D pose estimation method based on RGBD. The method consists of two parallel encoder-decoder networks: a convolutional neural network (CNN) and a point cloud network (PCN). In the CNN part, the method introduces FasterNet to replace 3×3 convolution; the CNN encoding network is replaced and an upsampling module, FUPB (faster upsample block), is proposed to reduce network parameters. In the PCN part, the method introduces PoolFormer to process and aggregate point cloud features, and a new pooling module, PFPB (PoolFormer pooling block), is proposed to improve the performance of the network. Experiments show that the number of parameters of LFFB6D is reduced by 46% compared with FFB6D. When only 1/13 of the LineMOD training set and 1/9 of the YCB-Video training set are used, the 6D pose estimation results of LFFB6D surpass PoseCNN, DenseFusion and other methods, and are similar to those of PVN3D and FFB6D.
    Big Data and Cloud Computing
    Time Strategy-Proof Mechanism for Online Task Scheduling in Edge Computing
    LI Linjie, FU Xiaodong, FENG Yan
    2024, 60(22):  292-303.  DOI: 10.3778/j.issn.1002-8331.2312-0111
    Incentive mechanisms aim to motivate users to participate in task scheduling and report private information truthfully. However, existing studies mainly focus on ensuring that users submit true task valuations when bidding, overlooking the issue of time strategy in online scenarios. Thus, selfish users can increase their utility by manipulating time, which affects the participation motivation of edge users, the total value of successfully scheduled tasks, and the fairness of scheduling results. To this end, an online mechanism-based time strategy-proof task scheduling method is proposed. The means of false bidding are analyzed to establish a practical range of limited-time misreporting. Taking time strategy into account, an allocation function applicable to online scenarios is designed, which ensures monotonicity with respect to task types and allocates tasks sequentially according to the allocation probability, thus obtaining the task scheduling results. A critical payment pricing algorithm is derived that satisfies incentive compatibility and prevents users from increasing their expected utility through time strategy. It is theoretically proved that the task scheduling mechanism satisfies truthfulness and individual rationality. The experimental results show that the mechanism effectively prevents both price strategy and time strategy by users.
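The general idea of critical payment under a monotone allocation rule can be sketched in a much simpler setting than the one in the abstract. The example below assumes an offline auction in which `capacity` identical slots go to the highest bidders and every winner pays the highest losing bid, which is the smallest bid with which it would still have won. It does not model arrival times, deadlines, or the authors' probabilistic allocation; the function name and data layout are hypothetical.

```python
def allocate_and_price(bids, capacity):
    """Greedy monotone allocation with critical-value payments.

    bids: dict mapping user id to reported task value.
    Returns a dict mapping each winner to its payment.
    """
    order = sorted(bids, key=lambda user: bids[user], reverse=True)
    winners = order[:capacity]
    # Critical value: the highest losing bid, or 0 if nobody loses.
    critical = bids[order[capacity]] if len(order) > capacity else 0.0
    return {user: critical for user in winners}


if __name__ == "__main__":
    payments = allocate_and_price({"a": 5, "b": 3, "c": 1}, capacity=2)
    print(payments)
```

Because a winner's payment depends only on the other users' bids, raising or lowering its own bid cannot reduce what it pays, which is why reporting the true value is a dominant strategy in this simplified setting.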
    Duration-Aware for Short Video Sequential Recommendation
    WANG Hang, YIN Ling, SHI Zhicai, HUANG Bo, GAO Zhirong
    2024, 60(22):  304-313.  DOI: 10.3778/j.issn.1002-8331.2401-0030
    To address data sparsity in click data, noise in watch duration feedback, and bias in short video sequential recommendation, a duration-aware short video sequential recommendation model (DASR) is proposed. The model alleviates the data sparsity issue by deeply modeling user watch duration feedback. Additionally, an unbiased multi-semantic watch duration feedback label generation method is proposed. This method combines the K-nearest neighbors algorithm with percentile analysis of the training data to dynamically generate label thresholds adapted to different video durations, eliminating the impact of video duration bias. Furthermore, a noise extraction method based on a strong-weak attention network is introduced, which accurately extracts positive and negative interest signals from the watch duration and thus addresses the noise issue in watch duration feedback. Extensive experiments on open-source datasets demonstrate that the model outperforms other mainstream methods on multiple evaluation metrics.
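The idea of duration-adapted label thresholds can be illustrated with a simplified sketch. Instead of the K-nearest-neighbors grouping mentioned in the abstract, the example below assumes fixed-width duration buckets and a nearest-rank percentile; the bucket width, the percentile `q`, and all names are hypothetical choices for illustration.

```python
import math
from collections import defaultdict


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def duration_thresholds(records, bucket_seconds=10, q=50):
    """Watch-time threshold per video-duration bucket.

    records: iterable of (video_duration, watch_time) pairs in seconds.
    """
    buckets = defaultdict(list)
    for video_duration, watch_time in records:
        buckets[int(video_duration // bucket_seconds)].append(watch_time)
    return {b: percentile(times, q) for b, times in buckets.items()}


def label(video_duration, watch_time, thresholds, bucket_seconds=10):
    """1 (positive feedback) if watch time reaches its bucket's threshold."""
    bucket = int(video_duration // bucket_seconds)
    return 1 if watch_time >= thresholds[bucket] else 0


if __name__ == "__main__":
    data = [(8, 2), (9, 4), (7, 6), (8, 8), (55, 10), (58, 20), (52, 30), (57, 40)]
    th = duration_thresholds(data)
    print(th, label(8, 6, th), label(55, 6, th))
```

In the demo data, six seconds of watching counts as positive feedback for a short video but as negative feedback for a video close to a minute long, which is the duration bias the thresholds are meant to remove.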
    Engineering and Applications
    Improved RTMDet for SAR Ship Detection
    ZHANG Yuning, JIA Yuan, CHEN Yue
    2024, 60(22):  314-322.  DOI: 10.3778/j.issn.1002-8331.2307-0175
    A synthetic aperture radar (SAR) ship detection algorithm based on an improved RTMDet (real-time models for object detection) is proposed to address the low detection accuracy for small target ships and complex backgrounds in SAR images. Firstly, the basic building blocks in the backbone network structure are optimized, and the global attention mechanism SimAM (simple, parameter-free attention module) is introduced, which improves the ability of the model to extract key feature information without adding parameters. Secondly, to reduce the loss of small target feature information and increase its share in the shallow feature map during feature fusion, a new lightweight feature fusion module, SPD-RPAFPN (space to depth reverse path aggregation feature pyramid network), is constructed. Finally, the regression loss function is replaced with KFIoU (Kalman filter based intersection over union) in the prediction module to improve the detection capability of the model for small target ships. Experimental comparisons are conducted on the publicly available dataset RSDD. Compared with RTMDet, the improved model raises the inshore AP value by 14.6 percentage points and the total AP value by 2.7 percentage points to 90.7%, while the number of model parameters and the computational cost are decreased by 4.5% and 10.8%, respectively. Compared with current mainstream algorithms, the SAR ship detection accuracy is also significantly improved, which demonstrates the effectiveness of the improved RTMDet algorithm.
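The space-to-depth operation named in SPD-RPAFPN can be shown on its own: it halves the spatial resolution by moving each 2×2 neighbourhood into separate channels, so no pixel is discarded as it would be with strided convolution or pooling. The sketch below works on a single-channel map stored as nested lists; how the authors wire this into their feature pyramid is not described in the abstract and is not modelled here.

```python
def space_to_depth(feature_map, block=2):
    """Rearrange an H x W map into block*block maps of size H/block x W/block."""
    h, w = len(feature_map), len(feature_map[0])
    assert h % block == 0 and w % block == 0, "map size must be divisible by block"
    channels = []
    for dy in range(block):
        for dx in range(block):
            # Take every block-th pixel starting at offset (dy, dx).
            channels.append([
                [feature_map[y][x] for x in range(dx, w, block)]
                for y in range(dy, h, block)
            ])
    return channels


if __name__ == "__main__":
    fm = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]
    for channel in space_to_depth(fm):
        print(channel)
```

Every input value appears exactly once in the output, which is why the operation is attractive for small targets that occupy only a few pixels.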
    Winograd Neural Network Accelerator Using Dynamic Hardware Reconfiguration on FPGA Platform
    MEI Bingxiao, TENG Wenbin, ZHANG Chi, WANG Wenhao, LI Fuqiang, YUAN Fuli
    2024, 60(22):  323-334.  DOI: 10.3778/j.issn.1002-8331.2307-0257
    To address the low resource utilization and resource limitations of convolutional neural networks (CNNs) in FPGA-based hardware acceleration, this paper proposes a convolutional neural network accelerator based on FPGA dynamic partial reconfiguration and Winograd fast convolution. The accelerator multiplexes FPGA resources at runtime and dynamically configures the various calculation pipelines onto the FPGA in a pipelined manner. The convolutional computation cores corresponding to each pipeline segment are customized and optimized with the Winograd algorithm to maximize the utilization of computing resources while solving the resource limitation problem. For the proposed accelerator architecture, this paper further establishes a combinatorial optimization model to search for the optimal parallel strategy for deploying a specific network model on a particular FPGA hardware platform, using a genetic algorithm to explore the design space. Based on the Xilinx VC709 FPGA platform, the VGG-16 network model is deployed and analyzed. The comprehensive simulation results show that large-scale neural network models can be adaptively implemented on resource-limited FPGAs. The overall performance of the accelerator reaches 1 078.3 GOPS, and its performance and computing resource utilization efficiency are 2.2 times and 3.62 times better than those of previous accelerators, respectively.
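Winograd fast convolution trades multiplications for additions. A minimal one-dimensional sketch of the smallest common case, F(2,3), is given below: it produces two outputs of a 3-tap filter from four inputs with 4 multiplications instead of 6. The tile sizes used by the accelerator are not stated in the abstract, so this only illustrates the principle; the function names are assumptions.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap correlation over 4 inputs using 4 multiplications."""
    assert len(d) == 4 and len(g) == 3
    # Filter transform: depends only on the weights, so it can be precomputed.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform followed by the four element-wise multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, g):
    """Reference result: the same two outputs with 6 multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


if __name__ == "__main__":
    d, g = [1, 2, 3, 4], [2, 4, 6]
    print(winograd_f23(d, g), direct_f23(d, g))
```

On hardware where multipliers are the scarce resource, as with DSP slices on an FPGA, this reduction is what makes the transform worthwhile despite the extra additions.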
    Game Neural Network Algorithm for Generating Autonomous Driving Test Scenarios
    LI Wenli, LI Chao, ZHANG Yinan, SONG Yue, HU Xiong
    2024, 60(22):  335-346.  DOI: 10.3778/j.issn.1002-8331.2307-0320
    To improve the interpretability of virtual test scenarios for autonomous vehicles and the coverage of high-risk scenarios, SIG-GAN (social interactive gaming-generative adversarial network), a virtual test scenario generation algorithm combining game theory and neural networks, is proposed. Taking the high-speed ramp merging scenario as an example, a converging interaction game model is constructed by capturing the interaction characteristics of ramp converging vehicles and vehicles traveling in the main lane. The converging data are used to obtain the vehicle priority probability to calculate the Nash equilibrium solution of the game strategy, and are integrated into the S-GAN neural network model for trajectory generation. At the same time, the PICT (pairwise independent combinatorial testing) model is introduced to combine the real trajectories of interacting vehicles in the observation area; together with the SIG-GAN algorithm, it generates a large number of high-risk interaction trajectories with realistic game interaction behavior. Comparison experiments with LSTM, S-LSTM, S-GAN and other trajectory generation algorithms show that: (1) Compared with the other algorithms, the trajectories generated by the model show average decreases of 25.30%, 18.98% and 7.02% in ADE and 17.33%, 16.06% and 7.65% in FDE in the time domains of 3.2 s and 4.8 s, so the generated trajectories are more accurate. (2) The number of trajectories generated after the combination test is 150 times that of the original trajectories, giving higher coverage. The TTC (time to collision) of the generated and original trajectories is concentrated around 1.057 7 s and 3.513 5 s, respectively, indicating a greater degree of scene risk, which is of practical significance for virtual scene enhancement testing of autonomous vehicles.
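The abstract does not define the metrics it reports. Under their usual definitions, ADE (average displacement error) is the mean Euclidean distance between predicted and true positions over all timesteps, FDE (final displacement error) is that distance at the last timestep, and TTC is the gap between two vehicles divided by their closing speed. The sketch below implements these common definitions; the function names and the one-dimensional car-following form of TTC are assumptions.

```python
import math


def ade_fde(predicted, truth):
    """Average and final displacement error between two 2D trajectories."""
    errors = [math.dist(p, t) for p, t in zip(predicted, truth)]
    return sum(errors) / len(errors), errors[-1]


def time_to_collision(gap, follower_speed, leader_speed):
    """Seconds until the follower reaches the leader at constant speeds.

    Returns infinity when the follower is not closing the gap.
    """
    closing_speed = follower_speed - leader_speed
    if closing_speed <= 0:
        return math.inf
    return gap / closing_speed


if __name__ == "__main__":
    pred = [(0, 0), (1, 0), (2, 0)]
    true = [(0, 0), (1, 3), (2, 4)]
    print(ade_fde(pred, true))
    print(time_to_collision(20.0, 15.0, 10.0))
```

A lower TTC means less time to react, so a distribution of generated scenarios concentrated at a smaller TTC corresponds to riskier test cases.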
    Contrastive Feature Enhancement for Elevated Warehouse Small Target Detection Method
    ZHU He, BIAN Changzhi, ZHANG Jing, WANG Li, LI Xiaoxia, CHEN Yuling
    2024, 60(22):  347-354.  DOI: 10.3778/j.issn.1002-8331.2307-0273
    In response to the issues of limited target feature information and low classification accuracy in safety helmet detection in elevated warehouse scenarios, a small target contrastive feature enhancement network is proposed. Firstly, a spatial pyramid pooling fast cross layer fusion module is introduced to reduce the loss of target information in the spatial dimension. Then, a small target contrastive feature enhancement module is presented, which uses dual-path parallel dilated convolutions to capture different receptive fields and incorporates channel attention to obtain more precise feature information in the channel dimension. Additionally, the large object information in shallow feature maps is weakened by subtracting them from deep feature maps, with the aim of enhancing the expression of small object features. Finally, an efficient channel attention decoupled detection head is incorporated, which separates the detection head into classification and regression branches that learn the semantic and positional information of the targets, respectively. Experimental results on the TT100K dataset demonstrate that the proposed method improves mAP@0.5 by 6.4 percentage points over the YOLOv5 baseline network and outperforms YOLOv7 by 1.9 percentage points. Moreover, on a self-built elevated warehouse dataset, the method achieves a 4.9 percentage point improvement in mAP@0.5 over the baseline network, and a 6.9 percentage point increase in mAP@0.5 specifically for safety helmets.
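Dilated convolution enlarges the receptive field without adding weights, which is what lets two parallel paths with different dilation rates see context at different scales. The one-dimensional sketch below illustrates only that mechanism; the dilation rates, kernel sizes and two-dimensional layout of the proposed module are not given in the abstract, and the function names are assumptions.

```python
def receptive_field(kernel_size, dilation):
    """Number of input positions spanned by one dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (no padding) 1D correlation with gaps of `dilation` between taps."""
    span = receptive_field(len(kernel), dilation)
    return [
        sum(kernel[k] * signal[i + k * dilation] for k in range(len(kernel)))
        for i in range(len(signal) - span + 1)
    ]


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7]
    w = [1, 1, 1]
    print(dilated_conv1d(x, w, dilation=1))  # each output spans 3 inputs
    print(dilated_conv1d(x, w, dilation=2))  # same 3 weights, span of 5 inputs
```

Both calls use the same three weights; only the spacing between taps changes, so the wider context comes at no extra parameter cost.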