Started in January 1974 (Monthly)
Supervised and Sponsored by Chongqing Southwest Information Co., Ltd.
ISSN 1002-137X
CN 50-1075/TP
CODEN JKIEBK
Current Issue
Volume 46 Issue 12, 15 December 2019
  
Big Data & Data Science
Research on Distributed ETL Tasks Scheduling Strategy Based on ISE Algorithm
WANG Zhuo-hao, YANG Dong-ju, XU Chen-yang
Computer Science. 2019, 46 (12): 1-7.  doi:10.11896/jsjkx.190100023
With the expansion of data warehouses, the number of ETL tasks involved in data integration has also increased, and stand-alone scheduling clearly cannot meet the needs of many complex ETL tasks. Aiming at improving the efficiency of ETL task scheduling, reducing the waiting time of critical tasks and improving resource utilization, this paper constructed a distributed ETL task scheduling framework consisting of a scheduler and several actuators, which completes ETL task scheduling through task preprocessing, task scheduling and task execution. In the task preprocessing stage, a weight model is established for each ETL task, and the scheduling priority is determined according to the weight. In the task scheduling stage, the scheduler constrains the choice of actuator nodes according to the performance and load conditions of each actuator node, and a greedy balance (GB) algorithm is designed to distribute ETL task execution requests so that the load of the actuator nodes stays relatively balanced. In the task execution phase, the execution priority of tasks in an actuator node's queue is determined by the Highest Response Ratio Next (HRRN) algorithm. Experimental results show that the distributed ETL task scheduling framework and the corresponding integrated scheduling execution (ISE) algorithm can effectively improve the utilization of cluster resources and shorten task scheduling and execution time.
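To make the queue-ordering step concrete, the sketch below computes HRRN priorities for queued ETL tasks; the task fields and the way waiting time is measured are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class EtlTask:
    name: str
    waiting_time: float   # time spent queued so far (seconds)
    service_time: float   # estimated execution time (seconds)

def hrrn_priority(task: EtlTask) -> float:
    # Highest Response Ratio Next: (waiting + service) / service.
    # Long waits raise priority, so short tasks cannot starve long ones.
    return (task.waiting_time + task.service_time) / task.service_time

queue = [EtlTask("extract_orders", 30.0, 10.0),
         EtlTask("load_dim_user", 5.0, 20.0)]
queue.sort(key=hrrn_priority, reverse=True)
print([t.name for t in queue])  # the long-waiting short task runs first
```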
AdaBoostRS: Ensemble Learning for High-dimensional Unbalanced Data
YANG Ping-an, LIN Ya-ping, ZHU Tuan-fei
Computer Science. 2019, 46 (12): 8-12.  doi:10.11896/jsjkx.180901813
The class imbalance problem in machine learning refers to a skewed distribution of data samples among different classes, which results in a learning bias toward the majority class. In high-dimensional data, the sparseness of the data makes this classification bias even more obvious. For high-dimensional unbalanced data, the two challenging problems of the curse of dimensionality and imbalanced class distribution are superimposed, making high-dimensional imbalance problems harder to solve. This paper proposed an AdaBoost ensemble method combining random subspaces with SMOTE oversampling, named AdaBoostRS (AdaBoost ensemble of Random subspace and SMOTE), to classify high-dimensional unbalanced data. AdaBoostRS trains each classifier on a subset of features selected in a random subspace, increasing the diversity of the classification samples and reducing the dimensionality of the high-dimensional data. The minority-class samples of the dimensionality-reduced data are then linearly interpolated by the SMOTE method to address the class imbalance. Experiments were conducted on 8 high-dimensional unbalanced standard time series datasets. The results show that AdaBoostRS is superior to traditional ensemble learning methods in terms of three performance indicators: F-measure, G-mean and AUC.
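The SMOTE step is plain linear interpolation between minority samples. A minimal sketch, assuming Euclidean nearest neighbors within the minority class (the paper's random subspace selection and AdaBoost loop are omitted):

```python
import numpy as np

def smote_interpolate(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by SMOTE-style
    linear interpolation between a sample and one of its k nearest
    minority neighbors: x_new = x + u * (x_nn - x), u ~ U(0, 1)."""
    rng = rng or np.random.default_rng(0)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # k nearest neighbors within the minority class (Euclidean)
        d = np.linalg.norm(minority - x, axis=1)
        nn = np.argsort(d)[1:k + 1]            # skip the sample itself
        x_nn = minority[rng.choice(nn)]
        out.append(x + rng.random() * (x_nn - x))
    return np.array(out)
```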
Column-oriented Store Based Sampling Query Processing on Big Data
QI Wen, BAO Yu-bin, SONG Jie
Computer Science. 2019, 46 (12): 13-19.  doi:10.11896/jsjkx.190500155
The era of big data brings performance challenges to traditional data queries: even if a query algorithm has O(n) linear complexity, when n is extremely large its time cost is still unbearable. In many practical applications, exact query results may be unnecessary but queries must be completed within a given time, so appropriately sacrificing query accuracy is acceptable to meet performance constraints. Sampling queries can improve query performance by reducing query ranges. Existing research often targets specific algorithms and specific application scenarios; there is a lack of research on general sampling and query methods in the big data environment, as well as on performance and accuracy guarantees. This paper studied sampling and query processing in the big data environment, improving the query efficiency of big data through data partitioning and data reduction. It proposed a sampling method based on speedup and potential distribution, which supports all kinds of sampling algorithms, achieves randomness guarantees, performance assurance and approximation evaluation of sampling queries in a distributed environment, and is compatible with precise queries. The method can be applied to column stores for big data with good extensibility and maintainability. The experimental results show that, taking Top-K queries as a case, the proposed method has better loading performance, while the sampling errors are less than 2% and the variances of query accuracy are between 0.1 and 0.12 under various sampling rates, data volumes and sampling algorithms. The sampling efficiency of the proposed partitioning is also higher than that of linear-partition-based or uniform-partition-based sampling.
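The paper's speedup-and-distribution-based sampler is not spelled out in the abstract; as a generic building block for approximate queries of this kind, a one-pass uniform reservoir sample looks like this:

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Uniform random sample of k items from a stream of unknown
    length in one pass (Algorithm R). Approximate aggregates over
    the sample are unbiased estimates of the exact query answer."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, n)       # inclusive bounds
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=1000)
est_mean = sum(sample) / len(sample)    # approximate answer from the sample
```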
Link Prediction Based on Correlation of Nodes’ Connecting Patterns
SHAN Na, LI Long-jie, LIU Yu-yang, CHEN Xiao-yun
Computer Science. 2019, 46 (12): 20-25.  doi:10.11896/jsjkx.190700057
As a research hotspot in complex network analysis, link prediction has a wide range of applications in many fields and has therefore captured much attention from researchers. Similarity-based methods, which compute similarity scores between unconnected node pairs from the known network structure and estimate their connection likelihood according to these scores, are commonly used. In general, different kinds of networks have diverse structural characteristics, so the correlation of characteristics between nodes has an important influence on the formation of links. To enhance the performance of link prediction, this paper defined the connecting pattern of a node and proposed a new link prediction model based on the Correlation of Nodes' Connecting Patterns (CNCP). By combining CNCP with a similarity-based method, this model takes both the similarity and the correlation between nodes into account. In this paper, four CNCP-based methods, i.e., CNCP-CN, CNCP-RA, CNCP-AA and CNCP-PA, are derived from the model, whose similarity indexes are CN (Common Neighbors), RA (Resource Allocation), AA (Adamic-Adar) and PA (Preferential Attachment), respectively. Experimental results on six networks show that the proposed methods are superior to the compared ones under the AUC and Precision criteria.
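The four base similarity indexes are standard and easy to state in code; the CNCP weighting itself is the paper's contribution and is not reproduced here. A sketch using networkx:

```python
import math
import networkx as nx

def similarity(G: nx.Graph, x, y, index: str) -> float:
    common = set(G[x]) & set(G[y])
    if index == "CN":   # common neighbors: |N(x) & N(y)|
        return len(common)
    if index == "RA":   # resource allocation: sum over z of 1/k(z)
        return sum(1.0 / G.degree(z) for z in common)
    if index == "AA":   # Adamic-Adar: sum over z of 1/log k(z)
        return sum(1.0 / math.log(G.degree(z)) for z in common if G.degree(z) > 1)
    if index == "PA":   # preferential attachment: k(x) * k(y)
        return G.degree(x) * G.degree(y)
    raise ValueError(index)

G = nx.karate_club_graph()
print(similarity(G, 0, 33, "RA"))
```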
Distinguishing Patterns Mining Based on Density Constraint
CHAI Xin, GAO Yi-han, WU You-xi, LIU Jing-yu
Computer Science. 2019, 46 (12): 26-30.  doi:10.11896/jsjkx.181202289
Sequential pattern mining finds patterns of interest in sequential data. Distinguishing pattern mining is one such mining task, characterized by finding feature information that discriminates between two or more classes of sequence databases, and it is widely used in real life and production. With the increasing size of data, the efficiency of mining algorithms becomes particularly important, yet current distinguishing pattern mining is too slow. In order to quickly mine distinguishing patterns that satisfy density and gap constraints, this paper proposed an approximate algorithm, ADMD (Approximate Distinguishing patterns Mining based on Density constraint). The algorithm allows a small number of patterns to be lost during mining in exchange for a large increase in mining speed. In this algorithm, the support of a pattern is calculated on a special structure, the Nettree; candidate patterns are generated by a pattern-growth approach; and patterns are pruned by a prejudgment pruning strategy to avoid generating a large number of redundant patterns. However, some non-redundant patterns may also be pruned, resulting in incomplete mining results, so the algorithm is approximate. Based on ADMD, the ADMD-k algorithm was further proposed by introducing a parameter k into the pruning strategy; adjusting k tunes the degree of pruning and achieves a balance between mining efficiency and completeness. Finally, on real protein datasets, the number of mined patterns and the mining speed were compared with other algorithms. The experimental results verify that when k is 1.5, the proposed algorithm costs no more than 13% of the time but finds more than 99% of the patterns. Therefore, the proposed algorithm is very effective, with a high approximation rate and high speed.
HMRF Semi-supervised Approximate Kernel k-means Algorithm
JIA Hong-jie, WANG Liang-jun, SONG He-ping
Computer Science. 2019, 46 (12): 31-37.  doi:10.11896/jsjkx.190600159
Massive data are produced with the development of information technology. Clustering can help to discover the intrinsic links within data and extract valuable information from them. In data analysis, it is easy to obtain some background knowledge about the data, and using such limited prior information to guide clustering can significantly improve the clustering results. Semi-supervised clustering based on Hidden Markov Random Fields (HMRF) uses pairwise constraints as supervision information. Although it achieves good clustering results in many applications, its time and space complexity are very high and cannot meet the needs of large-scale data processing. To solve this problem, this paper first analyzed the mathematical relationship between HMRF semi-supervised clustering and kernel k-means, and used the matrix trace to unify the objective functions of the two clustering methods. In order to reduce the complexity of HMRF semi-supervised clustering, this paper proposed an HMRF semi-supervised approximate kernel k-means algorithm (HMRF-AKKM), which constructs an approximate kernel matrix by sampling and uses approximate kernel k-means to optimize the clustering objective function. Finally, the HMRF-AKKM algorithm was compared with related clustering algorithms on several benchmark datasets, and the clustering performance of the different algorithms was analyzed. The experimental results show that HMRF-AKKM achieves clustering quality similar to the original HMRF semi-supervised clustering on the same tasks, but with shorter clustering time, indicating that it inherits the advantages of both HMRF semi-supervised clustering and approximate kernel k-means. On the one hand, HMRF-AKKM makes full use of pairwise constraints to achieve high clustering quality; on the other hand, it improves clustering efficiency through sampling and matrix approximation. Moreover, clustering quality and efficiency can be balanced by adjusting the sampling ratio and the number of pairwise constraints. Therefore, the proposed HMRF-AKKM algorithm has good scalability and is suitable for clustering large-scale nonlinear data.
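The abstract says the approximate kernel matrix is built by sampling; a Nystrom-style low-rank approximation is one common way to do that, sketched below under that assumption (the RBF kernel and landmark count are illustrative choices):

```python
import numpy as np

def nystrom_kernel(X, m, gamma=1.0, seed=0):
    """Nystrom-style low-rank approximation of an RBF kernel matrix.
    Sample m landmark points; then K ~= C @ pinv(W) @ C.T, where C
    holds kernel values between all points and the landmarks. The
    factors are returned so the full n x n matrix is never formed."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    L = X[idx]
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    C = rbf(X, L)                    # (n, m)
    W = rbf(L, L)                    # (m, m)
    return C, np.linalg.pinv(W)      # K ~= C @ W_pinv @ C.T

X = np.random.default_rng(1).normal(size=(500, 8))
C, W_pinv = nystrom_kernel(X, m=50)  # cost scales with m, not n^2
```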
Q-sample-based Local Similarity Join Parallel Algorithm
WANG Xiao-xia, SUN De-cai
Computer Science. 2019, 46 (12): 38-44.  doi:10.11896/jsjkx.190100240
Local similarity join quickly finds all locally similar pairs between sets; it is a basic operation in many areas such as gene sequence alignment, near-duplicate detection and data cleaning. This paper focused on designing a parallel similarity join algorithm with MapReduce and proposed a Q-sample-based algorithm to solve the locating problem of local similarity join. The proposed algorithm employs a filter-verify framework. In the filter stage, a Q-sample partition scheme is adopted to generate high-quality signatures without losing any true pairs, so that more dissimilar string pairs are discarded. In the verify stage, the backward-forward verification method of LS-Join is improved with techniques for removing redundant matches, combining consecutive matches and combining non-consecutive matches. In the experiments, the performance of the proposed algorithm is measured with different dataset sizes and different edit-distance thresholds. Experimental results show that the proposed algorithm outperforms the current state-of-the-art algorithm LS-Join on big datasets. Theoretical analysis and experimental results demonstrate that the proposed techniques improve the performance of local similarity join.
Community Detection Algorithm Based on Random Walk of Signal Propagation with Bias
YIN Xin-hong, ZHAO Shi-yan, CHEN Xiao-yun
Computer Science. 2019, 46 (12): 45-55.  doi:10.11896/jsjkx.190700051
Complex networks are abstracted from various complex systems, whose overall function is reflected in the interactions among nodes; community structure is one of the most significant structural properties present in many networks. Generally, communities correspond to the functional modules of a system, so extracting the communities of a network helps to explore its internal rules; community detection in complex networks thus has important theoretical significance and practical value. It has consequently received wide attention from researchers, and many community detection algorithms have been proposed, such as algorithms based on modularity optimization, label propagation and random walks. In signal propagation, the signal quantity decays slowly as the propagation distance increases. Building on a thorough study of existing algorithms and by simulating the random walk process, a community detection algorithm based on random walks with a biased signal propagation mechanism was proposed. The algorithm selects a node from the network as the signal source, randomly chooses a neighbor node as the next-hop node, transmits the attenuated signal quantity to that node, and iteratively selects next-hop nodes and transmits the signal. To model signal attenuation, an attenuation factor is attached to each edge to constrain the propagation process. By simulating signal propagation with each node of the network in turn as the signal source, a propagation matrix is obtained. Then a self-loop is added for each vertex, a new attribute matrix is built from the adjacency matrix and the similarity among vertices, and attributes for each vertex are constructed from this attribute matrix and the propagation matrix. Finally, the k-means algorithm is used for clustering to obtain a high-quality community structure at minimum cost. To verify the performance of this method, experiments were conducted on 10 real network datasets and on synthetic networks of different sizes. The experimental results fully demonstrate that the algorithm can extract high-quality community structure from networks, effectively providing a basis for the community detection field.
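A minimal sketch of how such a propagation matrix could be accumulated, assuming a single global decay factor and uniform next-hop choice (the paper's per-edge attenuation and bias are not reproduced):

```python
import numpy as np

def propagation_matrix(adj, walk_len=20, walks=200, decay=0.8, seed=0):
    """For each source node, simulate random walks that deposit an
    exponentially attenuated signal on every visited node; row i of
    the result is node i's propagation profile, later usable as
    features for k-means."""
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    P = np.zeros((n, n))
    for src in range(n):
        for _ in range(walks):
            node, signal = src, 1.0
            for _ in range(walk_len):
                nbrs = np.flatnonzero(adj[node])
                if nbrs.size == 0:
                    break
                node = rng.choice(nbrs)
                signal *= decay           # attenuation applied per edge
                P[src, node] += signal
    return P / walks
```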
Collaborative Filtering Recommendation Algorithm Based on Multi-relationship Social Network
BIN Sheng, SUN Geng-xin
Computer Science. 2019, 46 (12): 56-62.  doi:10.11896/jsjkx.181102189
Recommendation systems are one of the most common applications of big data. The traditional collaborative filtering recommendation algorithm works directly on the user-item rating matrix; for massive user and commodity data, its efficiency drops significantly. Aiming at this problem, this paper proposed a collaborative filtering recommendation algorithm based on a multi-relationship social network. An information propagation method is used to detect communities in the multi-relationship social network based on the multi-subnet composite complex network model, so that similar users are placed in the same community. The k-nearest-neighbor set of each user is then selected within the community to construct the user-item rating matrix, and the collaborative filtering algorithm performs recommendation on this new matrix, improving the efficiency of the recommendation algorithm without reducing its accuracy. Compared with the traditional collaborative filtering recommendation algorithm on the real dataset Epinions, the proposed algorithm achieves high recommendation efficiency and accuracy; in particular, for big data, its execution time improves by more than 10 times.
Community Detection Based Point-of-interest Recommendation
GONG Wei-hua, SHEN Song
Computer Science. 2019, 46 (12): 63-68.  doi:10.11896/jsjkx.190400440
In recent years, LBSNs (Location-Based Social Networks) have attracted more and more attention as typical heterogeneous information networks. In view of the sparsity of users' check-in information in LBSNs, this paper proposed a community-detection-based recommendation algorithm, CBR (Community-Based Recommendation). It first calculates the similarity between the target user and the clustered interest-topic clusters on the social media layer, then calculates the user's membership degree over the geographic clusters through the association matrix R between interest-topic clusters and geographic clusters. It further integrates the user's social relationships to obtain the user's preference score for each point of interest, and finally sorts by these scores to produce the Top-k recommendation. The experimental results show that the proposed algorithm can significantly improve the recommendation quality for points of interest.
Short Text Feature Expansion and Classification Based on Non-negative Matrix Factorization
HUANG Meng-ting, ZHANG Ling, JIANG Wen-chao
Computer Science. 2019, 46 (12): 69-73.  doi:10.11896/jsjkx.190400107
In this paper, a feature expansion method based on non-negative matrix factorization (NMFFE) was proposed to overcome the feature sparsity of short texts. The method considers only the data itself and does not rely on external resources for feature expansion. Firstly, the internal relationship between texts and words is taken into account when factorizing the text-word relationship matrix, and the word clustering indicator matrix is obtained by the graph dual regularization non-negative matrix tri-factorization (DNMTF) method. Then, the word clustering indicator matrix is reduced in dimensionality to obtain the feature space. Finally, according to the degree of correlation between words, features from the feature space are added to short texts, thereby alleviating the feature sparsity problem and improving the accuracy of text classification. The experimental data show that, compared with the better-performing of the BOW and Char-CNN algorithms, the accuracy of short text classification based on the NMFFE algorithm increases by 25.77%, 10.89% and 1.79% on the three datasets Web snippets, Twitter sports and AGnews, respectively. The experiments fully demonstrate that the NMFFE algorithm is superior to the BOW and Char-CNN algorithms in terms of classification accuracy and robustness.
Euler Kernel-based Data Stream Clustering Algorithm
ZHU Ying-wen, YANG Jun
Computer Science. 2019, 46 (12): 74-82.  doi:10.11896/jsjkx.190600158
With the advance of cloud computing and the Internet of Things, many applications generate huge amounts of data streams at high speed. Examples include real-time surveillance systems, vehicle traffic monitoring systems, electricity consumption recording, and network traffic monitoring. Data stream mining, whose goal is to extract hidden knowledge and patterns from continuous data streams, has become a hot research topic. Clustering, one of the most important problems in stream mining, has been explored intensively. Unlike traditional data clustering, where given datasets are generally static and can be read and processed repeatedly, clustering data streams faces more challenges, since it must satisfy constraints such as bounded memory, single-pass processing, real-time response and concept-drift detection. This paper presented a new clustering algorithm for data streams, called EG-Stream, which combines the Euler kernel method with the Growing Neural Gas (GNG) model. It not only retains the benefit of nonlinear modeling with kernel functions, but also substantially mitigates the large-scale computational problem of kernel-based clustering. The Euler kernel relies on a nonlinear, robust cosine metric that is less sensitive to outliers. More importantly, it intrinsically induces an empirical map onto a complex space of the same dimension, which allows the similarity between data to be measured robustly without increasing dimensionality and avoids the problem that other kernel clustering algorithms cannot handle data streams. Although the method is embarrassingly simple, just incorporating the Euler kernel into GNG, experimental results on a variety of UCI datasets show that it achieves performance comparable to or even better than the G-Stream algorithm and identifies structural information in stream data.
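The Euler kernel's explicit map and its cosine-style similarity can be stated directly; a minimal sketch, assuming features normalized to [0, 1] and an illustrative alpha:

```python
import numpy as np

def euler_map(X, alpha=1.9):
    """Explicit Euler-kernel feature map: each feature x is mapped to
    (1/sqrt(2)) * exp(i * alpha * pi * x), keeping the dimensionality
    unchanged while moving data onto a (scaled) complex unit circle."""
    return np.exp(1j * alpha * np.pi * X) / np.sqrt(2.0)

def euler_similarity(x, y, alpha=1.9):
    # Real part of the Hermitian inner product; equals a sum of
    # cos(alpha * pi * (x_d - y_d)) / 2 terms, a robust cosine metric.
    fx, fy = euler_map(x, alpha), euler_map(y, alpha)
    return float(np.real(np.vdot(fx, fy)))   # vdot conjugates fx

x = np.array([0.1, 0.5, 0.9])
print(euler_similarity(x, x))   # self-similarity = dim / 2
```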
Bi-directional Oversampling Method Based on Sample Stratification
ZHOU Xiao-min, CAO Fu-yuan, YU Li-qin
Computer Science. 2019, 46 (12): 83-88.  doi:10.11896/jsjkx.190400053
Resampling technology has gradually become an important direction for solving the classification problem on imbalanced data because of its simplicity and intuitiveness. However, on small datasets, under-sampling may lose important information, so oversampling is the focus of imbalanced-data classification. Although existing oversampling methods effectively overcome the imbalance between classes, they may make dense areas of the minority class even denser, and can even lead to overlapping samples. In addition, due to noise in the minority class, existing oversampling methods may generate new samples around the noise, which makes the distribution of the minority class more confusing. Aiming at these problems, this paper proposed a bi-directional oversampling method based on sample stratification. It first divides the minority samples into a dense area and a sparse area based on the highest-density point and the intra-class average distance, and then performs bi-directional oversampling in the boundary region of the dense area and in the sparse area. To verify the effectiveness of the proposed algorithm, comprehensive experiments were conducted on 9 datasets from the UCI repository. The experimental results and the Friedman test show the superiority of the proposed algorithm for imbalanced-data classification.
Domain Adaptation Algorithm Based on Tensor Decomposition
XU Shu-yan, HAN Li-xin, XU Guo-xia
Computer Science. 2019, 46 (12): 89-94.  doi:10.11896/jsjkx.190300095
Because training data tend to become outdated, training data and test data have different feature distributions in most cases. Therefore, when using source-domain information, it is necessary to minimize the difference in feature distributions between domains. Features represented as tensors can preserve the intrinsic structural information of high-dimensional spatial data. Naive tensor subspace learning is a domain adaptation method for tensor features, but it has high complexity and does not achieve a good knowledge transfer effect. For this reason, this paper proposed domain adaptation algorithms based on tensor decomposition, namely tensor-train subspace learning and tensor-ring subspace learning; the main ideas of the two methods are similar. Firstly, the features of the source and target domains are encoded as tensors. Using tensor decomposition, the feature tensor is decomposed into a series of third-order tensors that represent the subspace. Then, the features of the source and target domains are mapped into the subspace in turn. Finally, the feature tensor is reshaped into matrix form; a model trained on the mapped source-domain features is applied to the mapped target-domain features to complete the task in the new domain. Experiments show that tensor-train subspace learning and tensor-ring subspace learning improve both accuracy and running time for unsupervised image classification. Compared with naive tensor subspace learning, their accuracy improves by 1.68% and 2.08% respectively, the running time is also reduced significantly, and the algorithmic complexity is smaller. The experimental results show that domain adaptation based on tensor decomposition can reduce the difference between the source and target domains and realize the reuse of knowledge across domains.
Network & Communication
Node Resource Scheduling for Future Network Experimentation Facility
WANG Chen-xin, YANG Jia-hai, ZHUANG Yi, LUO Nian-long
Computer Science. 2019, 46 (12): 95-100.  doi:10.11896/jsjkx.190400106
With the expansion of the Internet industry, research and innovation on core network technologies are urgent. The Future Network Experimentation Facility (FNEF) was designed to provide efficient and convenient resources for network researchers to support innovative research and experiments on network technologies. Resource scheduling management is a very important task in the FNEF. Based on the resource scheduling and test service requirements of the FNEF, this paper designed an architecture combining centralized and distributed components: the central resource scheduling management system cooperates with the node resource scheduling management systems to schedule the backbone network resources and the computing resources at each dispersed site. This paper proposed a multi-objective optimized resource scheduling algorithm that considers the communication cost between virtual machines, physical machine resource utilization and resource balance. Simulation experiments show that the proposed algorithm can effectively optimize these multiple objectives.
Improved PBFT Consensus Mechanism Based on K-medoids
CHEN Zi-hao, LI Qiang
Computer Science. 2019, 46 (12): 101-107.  doi:10.11896/jsjkx.181002014
With the popularization and development of digital currency, blockchain technology has entered the public's vision and has been hailed as the fourth milestone in the history of credit and the cornerstone of future credit [1]. However, blockchain technology also faces problems such as low consensus efficiency and wasted computing power. In this work, the K-medoids clustering algorithm is used to cluster and hierarchically divide, according to selected features, the large-scale network nodes participating in blockchain consensus, and an improved multi-centered PBFT (Practical Byzantine Fault Tolerance) consensus algorithm is then applied to the clustered model. Moreover, in order to improve the controllability of clustering blockchain nodes under various scenarios, this paper improved the K-medoids algorithm. Simulation results show that when appropriate clustering features are selected to evaluate the similarity between nodes, the improved algorithm K-PBFT reduces single-consensus time consumption by 20%, and the number of communications in the consensus process can be reduced by three orders of magnitude. The experimental results show that K-PBFT optimizes consensus among large-scale nodes, so that blockchain technology can be applied to a wider range of scenarios.
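As a rough illustration of the grouping step, a plain k-medoids pass over a precomputed node-distance matrix is sketched below; the node features, the distance definition and the paper's K-medoids improvements are not reproduced:

```python
import numpy as np

def k_medoids(D, k, iters=100, seed=0):
    """Plain k-medoids on a precomputed distance matrix D (n x n).
    Returns medoid indices and cluster labels; each consensus group
    would then run PBFT around its medoid."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)
        new = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # pick the member minimizing total distance to its cluster
                costs = D[np.ix_(members, members)].sum(axis=1)
                new[c] = members[np.argmin(costs)]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, np.argmin(D[:, medoids], axis=1)
```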
Mobile Traffic Forecasting Model Based on Spatio-temporal Features
ZHANG Jie, BAI Guang-wei, SHA Xin-lei, ZHAO Wen-tian, SHEN Hang
Computer Science. 2019, 46 (12): 108-113.  doi:10.11896/jsjkx.181102207
Research shows that historical traffic data can be used to predict mobile network traffic, and that traffic information from surrounding areas can improve prediction accuracy. To this end, this paper proposed STFM, a mobile network traffic prediction model based on spatio-temporal features. STFM uses the historical mobile traffic of the target area and its surrounding areas to predict the traffic of the target area. Firstly, a 3D convolutional neural network (3D CNN) extracts the spatial features of the mobile network traffic; then a temporal convolutional network (TCN) extracts its temporal features. Finally, fully connected layers establish a mapping between the extracted features and the real traffic and generate the predicted traffic value. Experimental validation and analysis show that STFM reduces the normalized root mean square error (NRMSE) by 28%, 21.7% and 10% compared with TCN, CNN and CNN-LSTM, respectively. Consequently, STFM can effectively improve the accuracy of mobile network traffic prediction.
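A compact PyTorch skeleton in the spirit of STFM: a 3D convolution over (time, height, width) traffic grids, followed by dilated causal 1D convolutions and a fully connected head. Layer sizes and depths are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Dilated causal convolution: left-pad so the output at time t
    depends only on inputs at times <= t (the TCN building block)."""
    def __init__(self, c_in, c_out, kernel=2, dilation=1):
        super().__init__()
        self.pad = (kernel - 1) * dilation
        self.conv = nn.Conv1d(c_in, c_out, kernel, dilation=dilation)
    def forward(self, x):                        # x: (batch, c_in, T)
        x = nn.functional.pad(x, (self.pad, 0))  # pad on the left only
        return self.conv(x)

class STFMSketch(nn.Module):
    def __init__(self, grid=(5, 5), c3d=8, hidden=32):
        super().__init__()
        h, w = grid
        self.spatial = nn.Conv3d(1, c3d, kernel_size=3, padding=1)  # 3D CNN
        self.temporal = nn.Sequential(                               # TCN
            CausalConv1d(c3d * h * w, hidden, dilation=1), nn.ReLU(),
            CausalConv1d(hidden, hidden, dilation=2), nn.ReLU())
        self.head = nn.Linear(hidden, 1)                # fully connected map
    def forward(self, x):                         # x: (batch, 1, T, H, W)
        z = torch.relu(self.spatial(x))
        b, c, t, h, w = z.shape
        z = z.permute(0, 1, 3, 4, 2).reshape(b, c * h * w, t)
        z = self.temporal(z)
        return self.head(z[:, :, -1])             # next traffic value

y = STFMSketch()(torch.randn(2, 1, 12, 5, 5))     # -> shape (2, 1)
```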
Offline Static Virtual Network Mapping Algorithm Based on Tabu Search Genetic Optimization
YU Jian-jun, WU Chun-ming
Computer Science. 2019, 46 (12): 114-119.  doi:10.11896/jsjkx.181001981
The offline static virtual network mapping problem is NP-hard: it maps a subset of virtual networks onto the physical network with the aim of maximizing the profit of the physical network provider. This paper reviewed the offline static virtual network mapping problem and the current research progress, and pointed out that existing algorithms are only suitable for small-scale or special problems. Therefore, this paper proposed a general offline static virtual network mapping algorithm suitable for medium and large scales. A greedy algorithm for the problem is designed using a revenue-priority virtual network mapping order strategy, a node-rank-matching virtual node mapping strategy, and a resource-consumption-minimizing virtual link mapping strategy. A tabu search genetic algorithm for the problem is then designed using an optimization strategy based on a hybrid of the genetic algorithm and tabu search. Experiments show that the proposed algorithm achieves a higher virtual network construction completion rate and better physical network provider revenue; the completion rate and the revenue are 34% and 42% higher than the baseline algorithm, respectively.
RFID Indoor Positioning Algorithm Combining Grasshopper Optimization Algorithm and Extreme Learning Machine
WANG Zhe, ZHENG Jia-li, LI Li, YUAN Yuan, SHI Jing
Computer Science. 2019, 46 (12): 120-125.  doi:10.11896/jsjkx.181202381
With the rapid development of indoor positioning technology, radio frequency identification (RFID) has become a preferred solution due to its advantages of contactless and rapid identification. However, the accuracy of existing RFID indoor positioning algorithms is easily affected by tag density and algorithm efficiency, and their environmental adaptability is not strong enough. Therefore, this paper introduced an RFID indoor positioning algorithm based on the grasshopper optimization algorithm (GOA) fused with an extreme learning machine (ELM). GOA is used to tune the randomly generated input-layer weights and hidden-layer biases of the extreme learning machine, which reduces learning time in the offline phase. At the same time, the algorithm can effectively resist environmental interference and overcome the effect of signal strength variations on positioning accuracy. Experiments were carried out to study the influencing factors and validate the performance. Both simulation and test results show that, compared with the NN-based algorithm and the NMDS-RFID algorithm, the average positioning error of the proposed algorithm is reduced by 22.32% and 20.06% respectively, and the average execution time is reduced by 58.7% and 7.55% respectively. The GOA-ELM indoor positioning algorithm achieves more accurate positioning results and adapts well to environmental changes.
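A minimal ELM in closed form is shown below; in GOA-ELM the randomly drawn input weights and hidden biases would be tuned by the grasshopper optimizer, a step omitted here, and the RSSI/position shapes are illustrative assumptions:

```python
import numpy as np

def elm_train(X, T, n_hidden=64, seed=0):
    """Minimal extreme learning machine: random input weights and
    biases, sigmoid hidden layer, output weights solved in closed
    form as beta = pinv(H) @ T."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # hidden activations
    beta = np.linalg.pinv(H) @ T               # least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# e.g. map RSSI fingerprints to (x, y) coordinates
X = np.random.default_rng(1).normal(size=(200, 8))   # 8 reader RSSI values
T = np.random.default_rng(2).normal(size=(200, 2))   # tag positions
W, b, beta = elm_train(X, T)
coords = elm_predict(X[:5], W, b, beta)
```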
Task Offloading and Cooperative Load Balancing Mechanism Based on Mobile Edge Computing
YIN Jia, GUAN Xin-jie, BAI Guang-wei
Computer Science. 2019, 46 (12): 126-131.  doi:10.11896/jsjkx.181202453
Because of the delay and communication cost associated with using cloud services, mobile edge computing (MEC), which is closer to mobile users, has become the main technology for processing computing-intensive and delay-sensitive applications. Small data centers located at the network edge, called "cloudlets", can provide computing resources for nearby mobile devices, and using MEC can significantly reduce service delivery delay. However, in an edge network composed of mobile cloudlets, load balancing directly affects task response time. In order to improve the quality of service for users, this paper proposed a task offloading and cooperative load balancing mechanism. The mechanism includes a latency-aware target selection strategy (LATS) for mobile users and a collaborative load balancing strategy (CLB) for mobile cloudlets. LATS chooses the best task migration target for each mobile user according to the current load information of the cloudlets; CLB applies balls-in-bins theory and can balance task loads with extremely limited information. Simulations and evaluations demonstrate that the proposed mechanism can effectively reduce system delay and load gaps, as well as communication and computing cost.
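CLB's exact protocol is not given in the abstract; the balls-in-bins result it invokes is usually the "power of two choices" rule, sketched here:

```python
import random

def two_choice_assign(n_tasks, loads, rng=None):
    """Balls-in-bins with the 'power of two choices': for each task,
    probe two cloudlets at random and send the task to the less
    loaded one. Only two load queries per task are needed, yet the
    maximum load stays close to the average."""
    rng = rng or random.Random(0)
    loads = loads[:]
    for _ in range(n_tasks):
        a, b = rng.sample(range(len(loads)), 2)
        target = a if loads[a] <= loads[b] else b
        loads[target] += 1.0
    return loads

print(two_choice_assign(1000, [0.0] * 10))   # loads stay nearly uniform
```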
Information Security
Attention Mechanism Based Detection of Malware Call Sequences
ZHANG Lan, LAI Yao, YE Xiao-jun
Computer Science. 2019, 46 (12): 132-137.  doi:10.11896/jsjkx.181102171
Typical machine learning approaches, which learn a classifier from hand-crafted features, are not sufficiently robust: attackers can reorder malware code or insert useless code to avoid detection. Aiming at the problems of the large number of malware samples in the Internet environment, advances in obfuscation technology, and the cost of manually constructed features, this paper proposed a malware detection approach, G2ATT, based on API call sequences and the attention mechanisms of natural language processing. First, dynamic API call sequences are extracted in a sandbox environment and split into subsequences with a sliding window. Then, the concepts of multi-instance learning and attention are introduced to design hierarchical feature-extraction neural networks: recurrent neural networks produce API-level features, and two attention mechanisms are combined to extract window-level and sequence-level features, which are then used for malware detection. Finally, the model is trained and used to detect malware. Experimental results on a real dataset show that the window-level feature extraction layer learns effective attention scores within subsequences, and the sequence-level layer improves the precision and recall of the detection model by computing attention scores across subsequences. G2ATT achieves 98.19% detection accuracy, 98.78% precision, 97.60% recall and 99% AUC (area under the ROC curve), improving detection accuracy by 10% over other machine learning approaches based on API call sequences.
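Both attention layers can be built from the same additive attention pooling block; a minimal PyTorch sketch (the dimensions and the scoring network are assumptions):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Additive attention pooling: score each element of a sequence,
    softmax-normalize the scores, and return the weighted sum. Used
    once over API steps in a window (window-level features) and once
    over windows in a sample (sequence-level features)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                                   nn.Linear(dim, 1))
    def forward(self, h):                 # h: (batch, steps, dim)
        a = torch.softmax(self.score(h), dim=1)   # (batch, steps, 1)
        return (a * h).sum(dim=1), a.squeeze(-1)  # pooled vector + scores

pooled, scores = AttentionPool(64)(torch.randn(4, 10, 64))
```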
Trust Model for P2P Based on Blockchain
WU Dai-yue, LI Qiang, YU Xiang, HUANG Jun
Computer Science. 2019, 46 (12): 138-147.  doi:10.11896/jsjkx.181202307
At present, in the trust evaluation process of trust models, because the sources of evaluation data are not uniform, different nodes' ability to obtain evaluation data differs, and different nodes' recognition of the data also differs, the computed results have low accuracy, are subjective, and are difficult to use as a reference. Aiming at these problems, this paper proposed a blockchain-based peer-to-peer network trust model named ChainTrust. An evaluation sequence graph is defined, and the indirect trust weight is determined according to the reliability of the evaluating node's indirect trust degree. Meanwhile, this paper improved the current blockchain structure by using a Merkle Patricia tree and a binary Merkle tree to store the evaluation data, and gave the corresponding storage and reading algorithms. Simulation and analysis results show that ChainTrust resists a variety of malicious attacks well, reduces the impact of collusion attacks, and allows the sensitivity of the model to be changed by adjusting the model parameters. Therefore, ChainTrust is effective and has high flexibility and universality.
Big Data Plain Text Watermarking Based on Orthogonal Coding
LI Zhao-can, WANG Li-ming, GE Si-jiang, MA Duo-he, QIN Bo
Computer Science. 2019, 46 (12): 148-154.  doi:10.11896/jsjkx.181001972
Data leakage is one of the biggest challenges for big data applications, and digital watermarking is an effective means of data tracking and copyright protection. However, current digital watermarking methods mainly focus on multimedia files such as images, audio and video; there are few digital watermarking methods for protecting data in big data environments. Therefore, this paper proposed a plain-text watermarking method for big data based on orthogonal coding. First, the plain-text watermark is encoded into a binary byte stream, and an orthogonal watermarking method based on row hash values and row-sequence permutation is designed. The binary watermark string is divided into numbered segments; the watermark segment to embed in each line is selected according to the hash value of that line's content, and the corresponding segment is converted into an invisible string that is appended to the end of the line. The line order is then adjusted so that the per-line hash values correspond to the binary watermark string with flags added, which completes the embedding of the watermark. Watermark extraction is the inverse of embedding. The method resists watermark destruction by operations such as row-order permutation in big data environments, and it achieves text tamper detection by simultaneously embedding fragile watermarks. Based on the proposed method, a big data watermarking system was designed and implemented; Spark was adopted to address the performance of watermark embedding and extraction on massive texts, enabling fast tracing of data leaks and improving big data security. Experiments and theoretical analysis show that the method has good watermark capacity and concealment, as well as strong robustness against multiple content attacks and format attacks.
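The abstract does not specify how the invisible string is encoded; zero-width Unicode characters are a common choice, so the sketch below uses them, with the segment index drawn from the line's own hash as described:

```python
import hashlib

ZW = {"0": "\u200b", "1": "\u200c"}   # zero-width space / non-joiner

def embed_line(line, segments, n_seg):
    """Append one watermark segment, rendered as invisible zero-width
    characters, to a line; the segment index is derived from the
    line's own hash so extraction can relocate it after reordering."""
    h = int(hashlib.sha256(line.encode()).hexdigest(), 16)
    bits = segments[h % n_seg]
    return line + "".join(ZW[b] for b in bits)

def extract_bits(line):
    inv = {v: k for k, v in ZW.items()}
    return "".join(inv[c] for c in line if c in inv)

segments = ["0101", "1100", "0011"]   # toy 3-segment watermark
marked = embed_line("patient_id,diagnosis,date", segments, 3)
print(extract_bits(marked))
```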
Double Blockchain Based Base Station Power and Environment Monitoring System
FAN Jian-feng, LI Yi, WU Wen-yuan, FENG Yong
Computer Science. 2019, 46 (12): 155-164.  doi:10.11896/jsjkx.190300041
The power and environment monitoring system of base stations realizes network-wide power and environment monitoring and real-time alarms by building intelligent monitoring units on top of the underlying intelligent and non-intelligent devices, so the stable and safe operation of this monitoring system is a prerequisite. However, as the number of base stations increases, a system with a central-server architecture exhibits problems such as increased load and traffic overload, and the many-to-one mode is prone to security issues such as DoS attacks and data breaches. In addition, in multi-user mode, the existing system cannot achieve fine-grained access control. Aiming at the above problems and leveraging the unique advantages of blockchain technology in distributed architectures, this paper proposed a double-blockchain power and environment monitoring architecture for base stations based on an improved PBFT consensus algorithm, to solve the centralization, security and scalability problems of existing systems. Specifically, the system is a hierarchical information system in which each layer maintains a blockchain; it is a dual-chain system maintained and shared by multiple nodes. One chain, in the form of a consortium chain, is responsible for the flow of cross-domain information and for permission control; the other, in the form of a private chain, is responsible for access control of base station devices and the flow of base station transaction information. Fine-grained access control of devices is achieved through a PKI system and a key management system, and the block header is improved to store permission information. Finally, qualitative analysis shows that, compared with existing traditional systems, the proposed system offers multi-center service, resistance to DoS attacks, user-based fine-grained rights management, a high degree of information encryption and good scalability.
Extended Attack Graph Generation Method Based on Knowledge Graph
YE Zi-wei, GUO Yuan-bo, LI Tao, JU An-kang
Computer Science. 2019, 46 (12): 165-173.  doi:10.11896/jsjkx.190400092
Existing attack graph generation and analysis techniques depend mainly on vulnerability scores and cannot take external factors such as hardware and software into account to judge their impact and correct the scores. As a result, the generated attack graph can hardly reflect the real risk of nodes and attack paths accurately. Information extraction and knowledge reasoning in knowledge graph technology are effective means to integrate vulnerability information acquired from multiple sources, and can be used to calculate the risk of nodes and attack paths in the network more accurately. Firstly, a knowledge graph based on an atomic-attack ontology is designed to extend the input and display information of the attack graph. Then, an extended attack graph generation framework based on the knowledge graph is proposed. On this basis, the attack graph generation algorithm and the calculation of attack success rate and attack profit are given, achieving a more comprehensive and accurate evaluation of vulnerabilities. Finally, experimental results verify the effectiveness of the proposed method.
Medical Information Security Storage Model Based on Blockchain Technology
WANG Hui, ZHOU Ming-ming
Computer Science. 2019, 46 (12): 174-179.  doi:10.11896/jsjkx.181102034
Given the status quo of medical informatization in China, the traditional electronic data model has the following problems in information storage: first, electronic data have poor standardization and circulate slowly; second, electronic data security and privacy are threatened. Aiming at the security of information storage and at information sharing between medical information systems, and combining the consensus mechanism, encryption mechanism and peer-to-peer network of blockchain, a blockchain-based information management scheme was proposed to realize tamper-proof, decentralized storage and sharing of medical information. Consensus among network nodes is reached through an improved Byzantine protocol; medical information is shared through an access control mechanism; the public part of medical information is stored on the blockchain, while the real medical data are encrypted and stored in a database or in the cloud. This conveniently and effectively implements the storage of sensitive medical data and information sharing between systems. In a consortium-chain environment, several network nodes were selected for security and performance testing. The experimental analysis shows that the proposed scheme performs well in security protections such as tamper resistance and privacy protection, as well as in throughput. In addition, the indexing mechanism proposed in this scheme enables fast retrieval and improves retrieval efficiency. By establishing a test network, the feasibility of the scheme for medical information storage is demonstrated. Decentralized, tamper-proof management using the blockchain can greatly improve the efficiency and quality of medical services, and lays a good foundation for further research on underlying blockchain technology and for exploring blockchain applications in the medical information field.
Multi-stage Cascade Wireless Security Authentication Scheme Based on Blockchain Technology
HU Zhao-peng, DING Wei-ping, GAO Zhan, ZHU Xiao-hui, WANG Jie-hua
Computer Science. 2019, 46 (12): 180-185.  doi:10.11896/jsjkx.181102170
Blockchain technology has the advantages of decentralization, trustlessness, anonymity and tamper resistance. In order to more effectively ensure that users can safely identify and connect to wireless networks, this paper proposed a multi-stage cascade wireless security authentication scheme (MWSASB) based on blockchain technology. MWSASB designs a multi-stage cascade protocol consisting of a registration phase, a login and authentication phase, and a transaction phase, and records users' information transactions in the tamper-proof, decentralized blockchain ledger by means of proof-of-work and longest-chain extension. Firstly, during the registration phase, the user enters the registration information; cryptographic techniques and the consensus mechanism are then used to store it on each node of the blockchain in the decentralized network. During the login and authentication phase, the user enters the login information and authenticates with the blockchain server; after successful authentication, the login information is also stored on each node of the blockchain. Secondly, in the transaction phase, the registration information and the login and authentication information are recorded securely in the blockchain in the form of transactions. Finally, the security and computational cost of MWSASB are analyzed. The results show that MWSASB provides security attributes such as secure wireless authentication and can effectively prevent common network attacks such as man-in-the-middle attacks and DDoS attacks. In terms of computation, the tamper resistance of the blockchain, together with cryptographic algorithms and the consensus mechanism used for encryption verification, can effectively reduce the number of computations and improve the efficiency of security authentication.
Two-dimensional Code Encryption Based on Revocable Outsourced Attribute Encryption
GAO Dan, LING Jie, CHEN Jia-hui
Computer Science. 2019, 46 (12): 186-191.  doi:10.11896/jsjkx.181102187
Two-dimensional code technology has a wide range of applications and superior performance, but traditional two-dimensional codes have low security, are suitable only for single-privilege scanning of a single piece of information, and cannot provide different functions when different users scan the code. As a fine-grained data encryption method, ciphertext-policy attribute-based encryption (CP-ABE) can realize user access control while ensuring data security, and enables information transmission in a one-to-many mode. Combining the characteristics and advantages of two-dimensional codes and CP-ABE, a two-dimensional code encryption scheme based on revocable outsourced attribute-based encryption was proposed. Information blocks divided by privilege are encrypted twice and then outsourced to a server for the corresponding decryption and permission matching; the partially decrypted ciphertext is returned to the user, who obtains the private key by scanning the code and performs the second decryption to recover the plaintext. The generated two-dimensional code varies with the random keys. The security analysis proves that the scheme provides forward security, backward security and selective-plaintext-attack security (IND-CPA) under the bilinear q-BDHE assumption. The designed experiments verify that the scheme realizes one-to-many information delivery through two-dimensional codes while ensuring the security of the embedded information. The scheme has the advantages of low client-side computational overhead, revocable attributes, and randomly generated two-dimensional codes.
Software & Database Technology
Detection Method of Duplicate Defect Reports Fusing Text and Categorization Information
FAN Dao-yuan, SUN Ji-hong, WANG Wei, TU Ji-ping, HE Xin
Computer Science. 2019, 46 (12): 192-200.  doi:10.11896/jsjkx.181102232
Software defects are the root of software errors and failures; they are caused by unreasonable requirement analysis, imprecise programming languages and developers' lack of experience. Software defects are inevitable, and submitting defect reports is an important way to find and fix them. A defect report is the carrier that describes a defect, and resolving defect reports is a necessary means of improving software. Maintainers and users repeatedly submit reports for the same defect, resulting in a large number of redundant reports in the defect report repository, and manual triage cannot keep up with increasingly complex software systems. Duplicate defect report detection can filter redundant reports out of the repository so that human effort and time are invested in new defect reports. The prediction accuracy of current methods is not high, and the difficulty lies in finding a suitable and comprehensive way to measure the similarity between defect reports. Based on the idea of ensemble methods and implemented in Python, a new duplicate defect report detection method named BSO (combination of BM25F, LSI and One-Hot) using both text information and categorization information was proposed. After data preprocessing, a defect report is divided into a text information domain and a categorization information domain. The BM25F and LSI algorithms produce similarity scores in the text information domain, and the One-Hot algorithm produces similarity scores in the categorization information domain. A similarity fusion method combines the scores of the two domains, a recommendation list of candidate duplicates is produced for each defect report, and the accuracy of duplicate detection is calculated. The method was compared on OpenOffice with the baseline method and with state-of-the-art methods including REP and DBTM. The experimental results show that the accuracy of the proposed method is 4.7% higher than DBTM, 6.3% higher than REP, and higher than the baseline method, which fully proves the effectiveness of BSO.
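A hypothetical linear fusion of the two domains' scores, with illustrative weights (the paper's actual fusion rule is not given in the abstract):

```python
def fuse_scores(bm25f, lsi, onehot, w_text=0.7, w_cat=0.3):
    """Hypothetical fusion: the text-domain score averages the BM25F
    and LSI similarities, the categorization-domain score is the
    One-Hot similarity; candidates are ranked by the weighted sum."""
    text = (bm25f + lsi) / 2.0
    return w_text * text + w_cat * onehot

# rank candidate duplicates of one report by fused similarity
candidates = {"bug-101": (0.8, 0.6, 1.0), "bug-202": (0.4, 0.9, 0.0)}
ranked = sorted(candidates, key=lambda k: fuse_scores(*candidates[k]),
                reverse=True)
```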
Software Feature Extraction Method Based on Overlapping Community Detection
LIU Chun, ZHANG Guo-liang
Computer Science. 2019, 46 (12): 201-207.  doi:10.11896/jsjkx.181001856
Extracting software features from natural-language product descriptions has gained a lot of attention in recent years. Observing that the sentences in descriptions can express the semantics of software features more precisely, and that one sentence may concern more than one software feature, this paper proposed a feature identification method that detects overlapping clusters of the sentences in natural-language descriptions. Based on an overlapping community detection algorithm (LMF), the proposed method defines a metric for the similarity between each pair of sentences, builds a sentence similarity network accordingly, and then detects overlapping sentence communities in this network. Each sentence community is a cluster that implies one software feature and contains all the sentences potentially describing it. Further, to help people better understand the characteristics of sentence communities, the method designs corresponding algorithms to select the communities with the lowest entropy in turn, and from each selected community to pick the most representative sentences not already selected by other communities as descriptors of the feature the community contains. Natural-language product descriptions crawled from Softpedia.com served as experimental data. Experimental results show that the proposed method performs better in accuracy and time consumption.
Multi-objective Test Case Prioritization Method Combined with Dynamic Reduction
ZHANG Na, XU Hai-xia, BAO Xiao-an, XU Lu, WU Biao
Computer Science. 2019, 46 (12): 208-212.  doi:10.11896/jsjkx.181102106
Aiming at the shortcomings of the ant colony algorithm in solving the multi-objective test case prioritization (MOTCP) problem, such as slow convergence and a tendency to fall into local optima, a dynamic multi-objective test case prioritization method with online updating of ant colony pheromones was proposed. The method introduces a dynamic reduction idea. Firstly, the initial test case set covering the same requirements is reduced according to each test case's coverage of the requirements. Secondly, according to whether a test case can detect errors during execution and the severity of the detected errors, a method for judging the failure degree of a test case is designed; after each iteration of the ant colony, a second reduction removes the test cases that detected no errors, so as to reduce the number of test cases the colony must traverse in the next iteration. The two reductions greatly shorten the sorting time. At the same time, during each iteration of the ant colony, by considering the influence of the importance degree, the failure degree and the actual execution time on the next round of pheromone, an online pheromone-updating method that updates the colony simultaneously under these three factors is designed, enabling the colony to find the next test case faster and more accurately. Finally, this method, the traditional ant colony sorting method and a multi-objective optimization sorting method were applied to several open-source programs for experimental comparison. The simulation results show that the proposed dynamic-reduction, online-pheromone-updating prioritization method has clear advantages in performance indicators such as defect detection capability and effective execution time, and can detect errors of higher severity earlier.
Consistency Checking Algorithm for Distributed Key-Value Database Based on Operation History Graph
LIAO Bin, ZHANG Tao, LI Min, YU Jiong, GUO Bing-lei, LIU Yan
Computer Science. 2019, 46 (12): 213-219.  doi:10.11896/jsjkx.181102097
Abstract PDF(1753KB) ( 593 )   
References | Related Articles | Metrics
The replica mechanism of distributed database systems not only improves the reliability and performance of the overall system,but also introduces the consistency problem of the multi-replica data management mechanism.To keep data consistent,a consistency protocol model is needed to avoid inconsistency events,and consistency checking algorithms are also needed to detect inconsistent data.Firstly,the concepts of temporal relations,security consistency and concurrent consistency between read and write operations are defined.Secondly,according to the parallel and temporal relationships between read and write operations recorded in the operation set,the rules for transforming an operation record set into an operation record graph are extracted,and the corresponding transformation algorithm is designed.Then,taking the set of operation record graphs as input,a violation operation search algorithm is designed to find the set of inconsistent read operations that violate security and concurrent consistency.Finally,experiments are conducted on Cassandra with the read-write consistency set to ONE,and YCSB generates parallel read-write stress tests.Comparative experiments with similar algorithms verify the advantages of the proposed algorithm in both function and efficiency.
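A minimal,hypothetical sketch of one such check is given below:a read that starts after a write has completed but still returns an older value is flagged.The record fields and the rule itself are simplified assumptions,not the paper's graph-based algorithm.

```python
# Hypothetical sketch of one safety-consistency check: a read that
# starts after a write has finished but still returns an older value is
# flagged. Record fields and the rule are simplified assumptions; the
# paper's operation-record-graph algorithm is more general.
from collections import namedtuple

Op = namedtuple("Op", "kind key value start end")  # kind: 'r' or 'w'

def stale_reads(history):
    violations = []
    for r in (op for op in history if op.kind == "r"):
        finished_writes = [w for w in history
                           if w.kind == "w" and w.key == r.key and w.end < r.start]
        if finished_writes:
            latest = max(finished_writes, key=lambda w: w.end)
            if r.value != latest.value:   # read missed a completed write
                violations.append(r)
    return violations
```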
Artificial Intelligence
A Survey of Automated Negotiation and Its Fuzzy Set Based Models
LUO Xu-dong, HUANG Qiao-juan, ZHAN Jie-yu
Computer Science. 2019, 46 (12): 220-230.  doi:10.11896/jsjkx.190800129
Abstract PDF(1375KB) ( 997 )   
References | Related Articles | Metrics
Negotiation is a process in which two or more interested parties communicate with each other to reach an agreement,and automated negotiation uses artificial intelligence systems to automate this process.This paper first discussed the significance of studying automated negotiation.At the national level,the research and development of such systems is consistent with the national AI development strategy of China.It can not only help people live a better life in the e-commerce era,but also reduce corruption by using computer systems instead of manual negotiation.Secondly,the development of negotiation research in game theory,management science and computer science was outlined.The Nobel laureate in economics,Nash,started the study of negotiation in the field of game theory.However,the biggest problem in such study is the assumption that all information of the negotiating parties is essentially public,which is unrealistic.Therefore,researchers in management science focus on how to conduct manual negotiation when the information of the parties is not fully disclosed.In general,manual negotiation is inefficient and difficult,and often fails to achieve optimal results.As a result,computer scientists are trying to replace manual negotiation with machines.At the beginning,they focused on computer-to-computer automated negotiation;now their focus is turning to computer-to-human automated negotiation.Finally,methods of using fuzzy logic and fuzzy constraints to develop intelligent agents for automated negotiation were summarised.This paper focused on the application of these two fuzzy methods in automated negotiation systems and the practical applications of fuzzy automated negotiation systems.
Technology and Terminology Detection Oriented National Defense Science
FENG Luan-luan, LI Jun-hui, LI Pei-feng, ZHU Qiao-ming
Computer Science. 2019, 46 (12): 231-236.  doi:10.11896/jsjkx.190300069
Abstract PDF(1395KB) ( 1350 )   
References | Related Articles | Metrics
With the rapid development of natural language processing,constructing an oriented national defense science (ONDS) technology knowledge base has received more and more attention.The identification of technology and terminology is fundamental for constructing such a knowledge base.To recognize technology and terminology,this paper explored the application of subwords in the traditional Bi-LSTM+CRF sequence labeling model.In addition,this paper proposed linguistic features to boost the performance.Experimental results on the annotated dataset show that the proposed approach achieves an F1 score of 71.8%,an improvement of 3.04% over the baseline system,indicating its effectiveness in recognizing ONDS technology and terminology.Meanwhile,it also outperforms BERT-driven models in recognizing technology and terminology.
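A minimal PyTorch sketch of the subword-augmented encoder is shown below;each word representation concatenates a word embedding with a pooled subword embedding.All dimensions are assumptions,and the CRF layer is replaced by a per-token argmax for brevity.

```python
# Minimal PyTorch sketch of a Bi-LSTM tagger whose word representation
# concatenates a word embedding with a pooled subword embedding, as the
# abstract describes. The CRF decoder is replaced by per-token argmax
# here for brevity; all dimensions are assumptions.
import torch
import torch.nn as nn

class SubwordBiLSTMTagger(nn.Module):
    def __init__(self, n_words, n_subwords, n_tags, d_word=100, d_sub=50, d_h=128):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, d_word)
        self.sub_emb = nn.Embedding(n_subwords, d_sub)
        self.lstm = nn.LSTM(d_word + d_sub, d_h, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * d_h, n_tags)

    def forward(self, words, subwords):
        # words: (batch, seq); subwords: (batch, seq, n_sub_per_word)
        sub = self.sub_emb(subwords).mean(dim=2)          # pool subword units
        x = torch.cat([self.word_emb(words), sub], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h).argmax(-1)                     # CRF omitted in sketch
```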
Adaptive Integrated Method Based on Sorting Selection Metrics
SHEN Xian-bao, SONG Yu-qing, LIU Zhe
Computer Science. 2019, 46 (12): 237-241.  doi:10.11896/jsjkx.181102173
Abstract PDF(1179KB) ( 729 )   
References | Related Articles | Metrics
Aiming at the problems that model selection lacks rigor and system simplification is difficult to achieve,owing to the lack of an accurate measurement of the integration priority of base classifiers in the ensemble process,an integration method based on sorting selection metrics and adaptive weight setting was proposed.Firstly,K-fold cross-validation and a combined index metric,constructed from the designed error entropy and the complementarity of classifiers,are utilized to select the two classifiers with the highest integration priority.Then,considering the integration influence between the remaining candidate classifiers and the selected classifier subset,an overall combination index metric is constructed to realize the prioritization of the different models.Finally,the best weights are found for the different models by an adaptive weight method for ensemble classification.The experimental results on UCI datasets show that compared with other classification models,the classification evaluation indicators of the proposed method are improved,proving the feasibility of the method.The designed metrics provide a quantitative basis for model selection,and together with the adaptive weight setting mechanism they give the whole ensemble classification system stratified model selection and adaptive simplification characteristics.
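The greedy prioritization loop can be sketched as follows;combined_metric() is a stand-in that rewards accurate,complementary members,not the paper's exact error-entropy definition.

```python
# Schematic sketch of the greedy prioritization loop: classifiers are
# added one at a time by a combined metric. combined_metric() is a
# stand-in; the paper's error-entropy/complementarity definition is not
# reproduced here.
import numpy as np

def combined_metric(preds_subset, candidate_preds, y):
    """Accuracy of the candidate plus disagreement with the subset (assumed form)."""
    acc = np.mean(candidate_preds == y)
    if not preds_subset:
        return acc
    disagree = np.mean([np.mean(candidate_preds != p) for p in preds_subset])
    return acc + disagree  # reward complementary, accurate members

def greedy_select(all_preds, y, k):
    chosen, remaining = [], list(range(len(all_preds)))
    while len(chosen) < k and remaining:
        best = max(remaining,
                   key=lambda i: combined_metric([all_preds[j] for j in chosen],
                                                 all_preds[i], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen  # integration priority order
```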
Entity and Relationship Joint Extraction Method of Feedback Mechanism
MA Jian-hong, LI Zhen-zhen, ZHU Huai-zhong, WEI Zi-mo
Computer Science. 2019, 46 (12): 242-249.  doi:10.11896/jsjkx.181102117
Abstract PDF(1893KB) ( 1275 )   
References | Related Articles | Metrics
Entity and relation extraction are two core tasks in information extraction and an important cornerstone of knowledge graph construction.At present,entity recognition and relation extraction usually adopt manually extracted features and rules and are realized independently in two steps,which easily causes duplicated data preprocessing and error propagation.The two modules are interrelated:entity recognition is the basis of relation extraction,and the results of relation extraction can in turn feed back and verify entity information.Therefore,a hybrid neural network model (Mufeedback-Join Model),which requires no manual features and has a mutual feedback mechanism,was proposed to extract entities and their relations jointly and to realize the feedback checking of entity recognition by relation extraction.The model shares a Bi-LSTM feature extraction layer to extract text context features,and introduces an attention mechanism on top of the shared features to capture the key parts.After decoding,CRF is used to complete the entity sequence labeling task,and the shared features together with the entity features are input into a CNN model to realize relation extraction.Finally,the loss value of the relation extraction result is calculated and fed back to correct the parameters of the shared feature extraction layer and the entity recognition model.In the same hardware environment,this method shortens the model training time and improves the precision,recall and F1 value of entity and relation extraction.The F1 value of joint extraction is improved by 2.91%,the entity recognition sub-module F1 is increased by 1.34% on average,and the relation extraction F1 value is increased by 5.79%.The experimental data show that the joint extraction model can merge the two sub-modules to reduce data processing time and erroneous data transmission,and the mutual feedback mechanism can improve the overall recognition effect.
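The feedback mechanism can be sketched in PyTorch as two task heads over one shared encoder,with the summed loss back-propagated through the shared layer.Dimensions are assumed,and the attention,CRF and CNN components are simplified away.

```python
# Minimal PyTorch sketch of the mutual-feedback idea: entity tagging and
# relation classification share one Bi-LSTM encoder, and the relation
# loss is back-propagated into the shared layer, correcting the entity
# model too. Dimensions are assumed; attention, CRF and CNN parts are
# simplified away in this sketch.
import torch
import torch.nn as nn

class JointExtractor(nn.Module):
    def __init__(self, n_vocab, n_tags, n_rels, d=128):
        super().__init__()
        self.emb = nn.Embedding(n_vocab, d)
        self.shared = nn.LSTM(d, d, bidirectional=True, batch_first=True)
        self.ner_head = nn.Linear(2 * d, n_tags)       # CRF simplified away
        self.rel_head = nn.Linear(2 * d, n_rels)       # CNN simplified away

    def forward(self, tokens):
        h, _ = self.shared(self.emb(tokens))           # shared context features
        return self.ner_head(h), self.rel_head(h.mean(dim=1))

# Training step (schematic): summing both losses feeds the relation error
# back through the shared encoder -- the feedback correction, simplified.
# loss = ce(ner_logits.transpose(1, 2), tags) + ce(rel_logits, relation)
# loss.backward()
```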
Multi-scale Based Accelerator for Attribute Reduction
JIANG Ze-hua, WANG Yi-bo, XU Gang, YANG Xi-bei, WANG Ping-xin
Computer Science. 2019, 46 (12): 250-256.  doi:10.11896/jsjkx.181102031
Abstract PDF(2258KB) ( 681 )   
References | Related Articles | Metrics
The neighborhood rough set measures the similarity between samples by a radius;consequently,different radii naturally construct rough approximations at different scales.Traditional attribute reduction based on the neighborhood rough set is frequently executed over multiple radii,aiming to select an attribute subset with better generalization performance or to explore the trend of reduct performance across scales.However,if the traditional heuristic searching algorithm is used to compute reducts at multiple scales,the process must be re-executed from scratch for each scale;the resulting time consumption is unacceptably high,and it grows with the number of scales.To solve this problem,a multi-scale based accelerated strategy for attribute reduction was proposed by means of changing the radius.The strategy considers two directions of radius change:from smaller to greater and from greater to smaller.Moreover,the traversal over samples and attributes is reduced:the search for the current reduct starts from the reduct related to the previous radius,so that forward or backward heuristic searching for adding and deleting attributes can be realized.To validate the effectiveness of the accelerated strategy,8 UCI data sets,10-fold cross-validation and 20 radii were employed in the experiments,and the time consumption of computing different reducts as well as their classification performance were compared.The experimental results over the 8 UCI data sets show that the proposed forward or backward accelerated searching can significantly reduce the time consumption of finding reducts in the multi-scale case.Moreover,the proposed approach does not significantly decrease classification performance,and can reduce the degree of over-fitting.
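The acceleration idea can be sketched as follows:the reduct found at the previous radius seeds the search at the next radius.significance() and is_reduct() are placeholders for the neighborhood-rough-set measures,which are not reproduced here.

```python
# Sketch of the accelerated strategy: instead of restarting the heuristic
# search for every radius, the reduct found at the previous radius seeds
# the next search. significance() and is_reduct() stand for the
# neighborhood-rough-set measures and are placeholders.
def accelerated_reducts(attributes, radii, significance, is_reduct):
    """attributes: set of condition attributes; radii: iterable of scales."""
    reducts, seed = {}, set()
    for r in sorted(radii):                  # e.g. from smaller to greater radius
        red = set(seed)                      # start from the previous reduct
        while not is_reduct(red, r):         # forward addition of attributes
            red.add(max(attributes - red, key=lambda a: significance(a, red, r)))
        for a in sorted(red):                # backward deletion of redundancy
            if is_reduct(red - {a}, r):
                red.remove(a)
        reducts[r], seed = red, red
    return reducts
```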
Attribute Reduction in Inconsistent Decision Formal Contexts
LI Zhong-ling, MI Ju-sheng, XIE Bin
Computer Science. 2019, 46 (12): 257-260.  doi:10.11896/jsjkx.181102137
Abstract PDF(1194KB) ( 561 )   
References | Related Articles | Metrics
Formal concept analysis was proposed by Wille R. in 1982.It is a model for the study of formal concepts and conceptual hierarchies.As an effective tool in knowledge discovery,it has been applied in various research areas such as information retrieval,data mining and pattern recognition.In practical applications,there may be many redundant attributes in a formal context.Therefore,it is necessary to study attribute reduction in formal concept analysis,and finding more concise approaches to attribute reduction is an important aspect of the study of formal contexts.In this paper,inspired by the ideas of rough set theory,attribute reduction in inconsistent decision formal contexts was studied.Some scholars have proposed four definitions of distribution reduction,maximum distribution reduction,assignment reduction and approximation reduction based on equivalence classes in inconsistent information systems.As a formal context is a special information system,this paper substitutes the equivalence class with the granular set and proposes four new definitions of distribution reduction,maximum distribution reduction,assignment reduction and upper approximation reduction based on inclusion degree.It is proved that a distribution reduction must be a maximum distribution reduction,a distribution reduction must be an assignment reduction,and assignment reduction is equivalent to upper approximation reduction.As an example,the judgement theorem for assignment consistent sets was proved,and a Boolean method for assignment reduction was given.
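The proved relationships can be restated compactly;the LaTeX fragment below only abbreviates the reduct names and adds nothing beyond the statements above (D:distribution,MD:maximum distribution,A:assignment,UA:upper approximation).

```latex
% Proved relationships among the four reducts (abbreviations:
% D = distribution, MD = maximum distribution, A = assignment,
% UA = upper approximation reduction).
\[
  \text{D reduct} \Rightarrow \text{MD reduct},\qquad
  \text{D reduct} \Rightarrow \text{A reduct},\qquad
  \text{A reduct} \Leftrightarrow \text{UA reduct}.
\]
```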
Attribute Reduction Algorithm for Neighborhood Rough Sets with Variable Precision Based on Attribute Importance
ZHENG Wen-bin, LI Jin-jin, HE Qiu-hong
Computer Science. 2019, 46 (12): 261-265.  doi:10.11896/jsjkx.181102184
Abstract PDF(1274KB) ( 824 )   
References | Related Articles | Metrics
Neighborhood rough set theory is mainly used in knowledge discovery,attribute selection,decision analysis,data mining and other fields.It can choose an appropriate discretization strategy based on the characteristics of the data and performs well in dealing with fuzzy and uncertain knowledge,but traditional rough set attribute reduction algorithms have difficulty obtaining a reduct,and the attribute recognition accuracy after reduction is low.Therefore,this paper put forward an attribute reduction algorithm based on attribute importance.Considering the shortcomings of conditional information entropy in many aspects,the threshold parameters are re-selected by using the theory of variable precision neighborhood rough sets.With the new conditional information entropy as the measurement benchmark,the preference decision rule set is deduced according to the preference attributes in the decision information system.This paper extracted rough rules from the preference decision rule set and established a variable precision neighborhood rough set model by using the neighborhood granulation method.When dealing with large-scale rough set attribute data,this model takes a long time to compute and retains too many redundant attributes.Aiming at this problem,an evaluation strategy of attribute importance was given,and on this basis a variable precision neighborhood rough set attribute reduction algorithm was designed by fusing multi-trees.The experimental results show that compared with the traditional method,the attribute recognition accuracy of the proposed method is 92%,an improvement of 10%,which fully verifies that the proposed attribute reduction algorithm is effective and has higher application value.
Graphics,Image & Pattern Recognition
Non-cooperative Human Behavior Recognition Method Based on CSI
LI Xiao-wei, YU Jiang, CHANG Jun, YANG Jin-peng, RAN Ya-xin
Computer Science. 2019, 46 (12): 266-271.  doi:10.11896/jsjkx.190200349
Abstract PDF(3080KB) ( 1104 )   
References | Related Articles | Metrics
Currently,Wi-Fi-based wireless human perception technology is widely used in anti-intrusion security monitoring,human health care,gait recognition and other fields.In this regard,this paper proposed a non-cooperative human behavior recognition method.The channel state information (CSI) of Wi-Fi signals is used to recognize five dynamic activities:walking,sitting down and standing up,squatting,jumping and falling.The method uses a SIMO system to collect CSI data and,after pre-processing the CSI amplitude and phase respectively,implements a three-step computational cost reduction mechanism:subcarrier fusion,rejection of bad data links based on a moving variance threshold,and dynamic time window data segmentation based on wavelet transform.Then activity features are extracted and extended from the time domain to the frequency domain,and by analyzing the characteristics of the Doppler power spectrum,the utilization of the CSI signal is improved.Experiment results show that the overall recognition rate increases with the number of feature dimensions used.Optimized by two rounds of voting,the combined classifier weighted voting method increases the overall recognition rate of the five dynamic activities to 90.3%,and compared with RSSI,the advantages of CSI in the field of human behavior recognition are more prominent.
Fast Detection and Identification of Traffic Lights Based on Deep Learning
QIAN Hong-yi, WANG Li-hua, MOU Hong-lei
Computer Science. 2019, 46 (12): 272-278.  doi:10.11896/jsjkx.190400026
Abstract PDF(2494KB) ( 2144 )   
References | Related Articles | Metrics
Traffic light detection and recognition technology can help drivers make correct driving decisions,reduce traffic accidents,and provide security for unmanned driving.Aiming at technical difficulties such as complex and variable traffic light detection scenes and targets that typically account for a very small percentage of the dataset images,a fast detection and recognition algorithm for traffic lights based on deep learning was proposed.The overall framework consists of three parts:heuristic-based image pre-segmentation,which narrows the search range and improves the relative size and detection accuracy of the traffic light panel in the input images;detection and recognition based on deep learning,using convolutional neural networks to detect and identify traffic lights accurately;and the NMS (Non-Maximum Suppression) algorithm,which removes the repeated detections of the previous stage.The proposed Split-CS-Yolo model achieves 96.08% mAP and a 2.87% miss rate on the LISA dataset.Compared with other methods of the Yolo series,it not only has higher accuracy and a lower miss rate,but also reduces the model size to 8.6% of the original Yolov2,increasing the detection speed by 63%.
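The NMS stage is standard;the sketch below is a textbook greedy implementation over [x1,y1,x2,y2] boxes,not the paper's exact code.

```python
# Textbook greedy NMS, the post-processing stage used to remove repeated
# detections. Boxes are [x1, y1, x2, y2] arrays; this is a generic
# implementation, not the paper's exact code.
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    order = scores.argsort()[::-1]            # highest-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou < iou_thresh]   # drop highly overlapping boxes
    return keep
```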
Study on Image Classification of Capsule Network Using Fuzzy Clustering
ZHANG Tian-zhu, ZOU Cheng-ming
Computer Science. 2019, 46 (12): 279-285.  doi:10.11896/jsjkx.190200315
Abstract PDF(1422KB) ( 1093 )   
References | Related Articles | Metrics
The essence of dynamic routing in the capsule network is a clustering algorithm.Considering that the clustering method used in previous capsule networks requires the data to follow certain distributions to achieve the best effect while image features are complicated,this paper adopted a more universal fuzzy clustering algorithm as the feature integration scheme to replace the original one.Moreover,an activation value using information entropy to measure indeterminacy was added to the model,so as to distinguish the significance of capsule features at the same layer.Meanwhile,drawing on the idea of the feature pyramid network,the features of different capsule layers are sampled to the same size for fusion and then trained independently.Experimental results based on the Keras framework show that the capsule network with the new structure achieves higher recognition accuracy on MNIST and CIFAR-10 than the original capsule network.The contrast experiments demonstrate the great potential of applying fuzzy clustering to capsule networks,which alleviates the limitation of the clustering algorithm in the original capsule network.The results also prove that the features of different layers in the capsule network can be fused to become more informative and expressive.
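The fuzzy clustering step can be sketched in numpy as a fuzzy c-means iteration coupling lower-level capsule votes to higher-level centroids;the fuzzifier m and the iteration count are illustrative choices,not the paper's settings.

```python
# Numpy sketch of a fuzzy c-means iteration standing in for dynamic
# routing: soft memberships u couple lower-level capsule votes x to
# higher-level capsule centroids c. Fuzzifier m and iteration count are
# illustrative choices.
import numpy as np

def fuzzy_routing(x, n_caps, m=2.0, iters=3, eps=1e-9):
    """x: (n_votes, dim) capsule votes; returns centroids (n_caps, dim)."""
    c = x[np.random.choice(len(x), n_caps, replace=False)]   # init centroids
    for _ in range(iters):
        d = np.linalg.norm(x[:, None, :] - c[None, :, :], axis=-1) + eps
        u = 1.0 / (d ** (2 / (m - 1)))                       # inverse-distance weights
        u /= u.sum(axis=1, keepdims=True)                    # fuzzy memberships
        w = u ** m
        c = (w.T @ x) / w.sum(axis=0)[:, None]               # weighted centroids
    return c
```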
Word Vectors Fusion Based Remote Sensing Scenes Zero-shot Classification Algorithm
WU Chen, YUAN Yu-wei, WANG Hong-wei, LIU Yu, LIU Si-tong, QUAN Ji-cheng
Computer Science. 2019, 46 (12): 286-291.  doi:10.11896/jsjkx.181202257
Abstract PDF(3163KB) ( 952 )   
References | Related Articles | Metrics
Zero-shot classification algorithms do not need labeled samples of the unseen classes to be recognized,so they can greatly reduce the cost of practical application,which has attracted wide attention in recent years.The structural difference between word vectors and image feature prototypes seriously affects the zero-shot classification performance of remote sensing scenes.Based on the complementarity among different kinds of word vectors,a remote sensing scene zero-shot classification algorithm based on word vector fusion,named the coupled analysis dictionary learning method,was proposed.Firstly,the sparse coefficients of different kinds of word vectors are obtained by the more efficient analysis dictionary learning to reduce redundant information.Then,the sparse coefficients are concatenated and denoted as the fused word vectors,and a structure alignment operation is performed based on the image feature prototypes to reduce structural differences by embedding the fused word vectors into the image feature space.Finally,the image feature prototypes of the unseen scene classes are calculated,and the nearest neighbor classifier is employed to complete the classification in the image feature space.Quantitative experiments on the fusion of multiple semantic word vectors were carried out on the UCM and AID datasets,and two real remote sensing images were qualitatively tested with the RSSCN7 dataset as the seen dataset.The quantitative experiments obtain the highest overall classification accuracies of 48.40% and 60.23% on UCM and AID,exceeding the typical comparative methods by 4.80% and 6.98% respectively.In the qualitative experiments on the two real remote sensing images,the algorithm also obtains the best zero-shot classification performance.The experimental results show that the fused word vectors are more consistent with the prototypes in the image feature space,and the zero-shot classification accuracy of remote sensing scenes can be significantly improved.
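The final classification step can be sketched as follows,assuming the embedding matrix W learned by the structure alignment step is given.

```python
# Sketch of the final zero-shot step: fused word vectors are mapped into
# image-feature space to act as class prototypes, and test images are
# labeled by nearest neighbor. The embedding matrix W (learned by the
# structure-alignment step) is assumed to be given.
import numpy as np

def zero_shot_classify(img_feats, fused_word_vecs, W):
    """img_feats: (n, d_img); fused_word_vecs: (n_cls, d_sem); W: (d_sem, d_img)."""
    prototypes = fused_word_vecs @ W                 # class prototypes in image space
    d = np.linalg.norm(img_feats[:, None, :] - prototypes[None, :, :], axis=-1)
    return d.argmin(axis=1)                          # nearest-prototype labels
```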
Study on Patient-adaptive Algorithm for ECG Classification Based on Wearable Devices
FAN Min, WANG Xiao-feng, MENG Xiao-feng
Computer Science. 2019, 46 (12): 292-297.  doi:10.11896/jsjkx.190500181
Abstract PDF(1597KB) ( 1363 )   
References | Related Articles | Metrics
At present,cardiovascular diseases have become the leading cause of non-communicable death globally,accounting for about one third of all deaths worldwide,and the number of patients is increasing year by year.Wearable devices are used to automatically classify electrocardiograms,facilitating the early monitoring and prevention of cardiovascular diseases.With the rise of edge machine learning and federated learning,small machine learning models have become a hot issue.According to the characteristics of wearable electrocardiogram equipment,such as low configuration,low power consumption and personalization,this paper studied a lightweight network model based on LSTM,and used an adaptive algorithm to optimize the ECG classification model for individual patients.Experiments are conducted on the open MIT-BIH dataset.Compared with current studies on VEB and SVEB detection performance,the results show that the proposed algorithm has a simple model structure and high classification performance,and can meet the requirements of wearable ECG monitoring for patients.
End-to-End Retrieval Algorithm of Two-dimensional Engineering CAD Model Based on Unsupervised Learning
ZENG Fan-zhi, ZHOU Yan, YU Jia-hao, LUO Yue, QIU Teng-da, QIAN Jie-chang
Computer Science. 2019, 46 (12): 298-305.  doi:10.11896/jsjkx.190900003
Abstract PDF(2795KB) ( 1083 )   
References | Related Articles | Metrics
Aiming at the problem of efficiently retrieving massive computer aided design (CAD) models in the enterprise product manufacturing process,this paper studied a retrieval algorithm based on the content features of two-dimensional CAD models,and constructed a retrieval system prototype for model bases of DXF-format CAD source files.Firstly,through analysis of the DXF file structure of the two-dimensional CAD model,the organization rules of the primitives in the model are studied and shape reconstruction is carried out.Secondly,according to the features of the primitives,three kinds of content feature extraction methods are proposed,based on the statistical histogram,the two-dimensional shape distribution and the Fourier transform.Finally,a multi-feature fusion framework based on unsupervised learning and a similarity calculation method are designed to extract the fusion feature descriptor of the model and realize the retrieval of two-dimensional CAD models.Experiments show that the fusion features extracted in this paper contain richer content features and are more effective than single features.The system can be directly used in product customization,product design reuse and other applications to help enterprises further improve their intelligent manufacturing capability.
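The two-dimensional shape distribution feature can be sketched as a normalized histogram of distances between random point pairs sampled from the reconstructed primitives;the sample size and bin count below are assumptions.

```python
# Sketch of the two-dimensional shape distribution feature: a normalized
# histogram of distances between random point pairs sampled from the
# reconstructed primitives. Sample size and bin count are assumptions.
import numpy as np

def d2_shape_distribution(points, n_pairs=10000, bins=64, rng=None):
    """points: (n, 2) array of points sampled from the model's primitives."""
    rng = rng or np.random.default_rng(0)
    i = rng.integers(0, len(points), n_pairs)
    j = rng.integers(0, len(points), n_pairs)
    d = np.linalg.norm(points[i] - points[j], axis=1)
    hist, _ = np.histogram(d, bins=bins, range=(0, d.max() + 1e-9))
    return hist / hist.sum()                         # normalized descriptor
```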
Interdiscipline & Frontier
Semi-supervised Scene Recognition Method Based on Multi-mode Fusion
SHEN Hong, LIU Jun-fa, CHEN Yi-qiang, JIANG Xin-long, HUANG Zheng-yu
Computer Science. 2019, 46 (12): 306-312.  doi:10.11896/jsjkx.191200500C
Abstract PDF(1913KB) ( 1113 )   
References | Related Articles | Metrics
Scene recognition is an important part of pervasive computing.It aims to provide users with accurate personalized services and improve service quality by identifying the locations of smartphone users.In the actual environment,accurate scene recognition faces two problems.Firstly,classification based on single-mode sensor data or wireless signal data is not accurate enough and generalizes poorly.Secondly,scene recognition accuracy depends on a large amount of labeled data,resulting in high cost.In view of these problems,a semi-supervised scene recognition method based on multi-mode fusion was proposed.The method makes full use of the complementary information of Wi-Fi,Bluetooth and sensors to improve recognition accuracy.Compared with recognition based on single-mode data,the fused features increase static scene classification accuracy by 10%.A semi-supervised learning method was constructed to solve the problem of high data acquisition cost in dynamic scenes,achieving over 90% classification accuracy with only half of the labeled data.The results show that introducing a semi-supervised learning method based on the complementary advantages of Wi-Fi,Bluetooth and sensor information can reduce data collection cost and improve the accuracy and universality of scene recognition.
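Both ingredients can be sketched with scikit-learn:per-modality feature vectors are fused by concatenation,and unlabeled samples (marked -1) are exploited by self-training.The feature arrays below are random placeholders for the real Wi-Fi,Bluetooth and sensor features.

```python
# Sketch of the two ingredients: fusing per-modality feature vectors by
# concatenation, then semi-supervised training where unlabeled samples
# carry the label -1. The arrays are placeholders for the real Wi-Fi,
# Bluetooth and sensor features.
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

def fuse(wifi_feats, bt_feats, sensor_feats):
    return np.hstack([wifi_feats, bt_feats, sensor_feats])  # multi-mode fusion

X = fuse(np.random.rand(200, 8), np.random.rand(200, 4), np.random.rand(200, 12))
y = np.full(200, -1)                      # -1 marks unlabeled samples
y[:100] = np.random.randint(0, 5, 100)    # only half of the data is labeled

clf = SelfTrainingClassifier(SVC(probability=True)).fit(X, y)
pred = clf.predict(X)
```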
PPI Network Inference Algorithm for PCP-MS Data
CHEN Zheng, TIAN Bo, HE Zeng-you
Computer Science. 2019, 46 (12): 313-321.  doi:10.11896/jsjkx.181102215
Abstract PDF(3391KB) ( 1053 )   
References | Related Articles | Metrics
With the development of proteomics,scholars have begun to pay more attention to the construction of protein-protein interaction (PPI) networks.Mass spectrometry (MS) has become a representative method for PPI inference,and is one of the main experimental methods for constructing PPI networks.Based on mass spectrometry technology,a large amount of experimental protein MS data is generated,such as affinity purification-mass spectrometry (AP-MS) data and protein correlation profiling-mass spectrometry (PCP-MS) data,which provide important data support for the construction of PPI networks;however,constructing PPI networks by hand is impracticable and time consuming.Thus,PPI network inference algorithms for PCP-MS data have become a research hotspot in bioinformatics.This paper focused on the problem of PPI network inference for the two main types of mass spectral data (AP-MS data and PCP-MS data),and designed effective methods to address the current bottlenecks and construct high-quality PPI networks.Existing algorithms for PPI network inference from PCP-MS data are still in their infancy and few in number,and they have several problems:1)the results produced by different algorithms contain many erroneous interactions while omitting correct ones;2)different algorithms may produce very different results on the same data set;3)the performance of the same algorithm varies widely across data sets.For the problem of PPI network inference for PCP-MS data,this paper proposed a PPI scoring method based on correlation analysis and rank aggregation.The method is based on unsupervised learning and includes two steps.Firstly,correlation coefficients between protein pairs are computed,yielding multiple PPI score results.Secondly,the multiple results for each pair of proteins are combined via rank aggregation into a single PPI score.The experimental results show that this method is comparable with supervised learning methods using a standard reference set.
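The two-step scoring can be sketched as follows;the particular correlation measures and the mean-rank (Borda-style) aggregation are illustrative choices.

```python
# Sketch of the two-step scoring: several correlation measures per
# protein pair, then rank aggregation (mean rank, a Borda-style scheme)
# into one PPI score. The choice of measures is illustrative.
import numpy as np
from scipy.stats import pearsonr, spearmanr, rankdata

def ppi_scores(profiles, pairs):
    """profiles: dict protein -> fractionation profile; pairs: list of (a, b)."""
    per_measure = []
    for corr in (lambda a, b: pearsonr(a, b)[0],
                 lambda a, b: spearmanr(a, b)[0]):
        scores = np.array([corr(profiles[a], profiles[b]) for a, b in pairs])
        per_measure.append(rankdata(-scores))        # rank: best pair first
    mean_rank = np.mean(per_measure, axis=0)         # aggregate the rankings
    return dict(zip(pairs, mean_rank))               # lower rank = stronger PPI
```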
Following-degree Tree Algorithm to Detect Overlapping Communities in Complex Networks
FU Li-dong, LI Dan, LI Zhan-li
Computer Science. 2019, 46 (12): 322-326.  doi:10.11896/jsjkx.190200293
Abstract PDF(1523KB) ( 626 )   
References | Related Articles | Metrics
Overlapping community detection is a key and difficult issue in the complex network research field.Due to the widespread hierarchical structures in real networks,hierarchical overlapping community detection methods are well suited for studying and analyzing real-world complex networks;however,research on this kind of method is still limited.This paper proposed a new overlapping community detection algorithm named the following-degree tree,based on the concepts of node leadership and subordination in complex networks.Combined with the hierarchical characteristic,the algorithm constructs the following-degree tree by calculating the following degree of every node,and finally finds the overlapping nodes and overlapping communities by dividing the tree.The feasibility of the algorithm was demonstrated on an artificial network,and its effectiveness was verified by experiments on the Dolphin and Karate networks.The proposed algorithm achieves high extended modularity and a more reasonable division,and can find overlapping nodes that other algorithms cannot.
Analysis of SIR Model Based on Individual Heterogeneous Infectivity and State Transition
QU Qian-qian, HAN Hua
Computer Science. 2019, 46 (12): 327-333.  doi:10.11896/jsjkx.181001974
Abstract PDF(2309KB) ( 1093 )   
References | Related Articles | Metrics
To explore the phenomenon that infected individuals have different infection rates,based on the basic SIR epidemic model in complex networks,this paper proposed an epidemic model with two types of infected individuals and a transfer probability between them.Based on the existence of the endemic equilibrium point,the basic reproduction number R0 is obtained,and two common immunization strategies,random immunization and targeted immunization,are analyzed.Simulation experiments show that under the same conditions,diseases spread faster and wider in heterogeneous networks than in homogeneous networks when R0>1,and the network structure has little influence on the spread of diseases when R0<1.Further research shows that the greater the degree of the initial infected nodes,the faster the disease spreads and the greater the peak value;the greater the closeness centrality of the initial infected nodes,the faster and wider the disease spreads;the node clustering coefficient has little effect on the transmission process;the basic reproduction number decreases as the transfer probability increases,so increasing the transfer probability can effectively reduce the spread of disease;and with the same average immunization rate,targeted immunization is more effective than random immunization.
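A mean-field sketch of such a model is given below,with two infected classes of different infection rates and a transfer rate from one class to the other;all parameter values and the exact transfer rule are illustrative assumptions,not the paper's equations.

```python
# Mean-field sketch of an SIR model with two infected classes (different
# infection rates beta1, beta2) and a transfer rate delta from class 1
# to class 2. All parameter values and the transfer rule are
# illustrative assumptions, not the paper's exact equations.
import numpy as np
from scipy.integrate import odeint

def two_class_sir(y, t, beta1, beta2, gamma, delta):
    s, i1, i2, r = y
    infection = (beta1 * i1 + beta2 * i2) * s
    di1 = 0.5 * infection - delta * i1 - gamma * i1   # half of new cases enter class 1
    di2 = 0.5 * infection + delta * i1 - gamma * i2   # transfer from class 1 to 2
    return [-infection, di1, di2, gamma * (i1 + i2)]

t = np.linspace(0, 100, 1000)
sol = odeint(two_class_sir, [0.99, 0.01, 0.0, 0.0], t,
             args=(0.4, 0.2, 0.1, 0.05))              # S, I1, I2, R over time
```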
Method of Mining Hidden Transition of Business Process Based on Behavior Profiles
SONG Jian, FANG Xian-wen, WANG Li-li, LIU Xiang-wei
Computer Science. 2019, 46 (12): 334-340.  doi:10.11896/jsjkx.180901654
Abstract PDF(2243KB) ( 638 )   
References | Related Articles | Metrics
In business process optimization,mining hidden transitions from infrequent behaviors is one of the important tasks:it can better restore the process model and improve process efficiency.Based on behavioral profile theory,this paper mined relatively high-frequency behavior from logs and obtained an initial model.Firstly,the event log is filtered by a reasonableness threshold to obtain a valid low-frequency sequence log.Then,the low-frequency sequence log is used to optimize the initial model;the behavioral profile relations between activities are compared with those of the source model to find the changed regions,and possible hidden transitions are mined.Next,the hidden transitions are further verified through optimization indicators,and a complete and accurate process model with hidden transitions is obtained.Finally,the model is analyzed with concrete examples and simulations to verify the effectiveness of the method.