Computer Science, 2026, Vol. 53, Issue 6A: 250300107-16. doi: 10.11896/jsjkx.250300107
• Artificial Intelligence •
YANG Geer1,5, WANG Xin2, SUN Wei1, WANG Xinge3, HU Zhongrui3, MENG Wenjun3, ZHANG Junqiang3, WU Xinghui3, LIU Jinshan4, YAN Yuming3