Computer Science

Research on Neural Network Clustering Algorithm for Short Text

SUN Zhao-ying,LIU Gong-shen

Computer Science 2018, 45 (6A): 392-395.

Short texts contain few words and carry weak descriptive information, so their representations are high-dimensional, sparse, and noisy. Existing clustering algorithms achieve low accuracy and efficiency on large-scale short text. To address this problem, a short text clustering algorithm based on a deep convolutional neural network was proposed. The algorithm uses the word2vec model, trained on a large-scale corpus, to learn the latent semantic associations between words and to represent each word as a multidimensional vector; each short text is then expressed as a multidimensional original vector. A convolutional neural network extracts features from this sparse, high-dimensional original vector, producing a low-dimensional text vector with more effective features. Finally, a traditional clustering algorithm is used to cluster the short texts. The proposed method is feasible and effective for reducing the dimensionality of text vectors, and it achieves good short text clustering results, with an F-measure of over 75%.

Construction Method of Domain Subject Thesaurus Based on Corpus

AN Ya-wei, CAO Xiao-chun, LUO Shun

Computer Science 2018, 45 (6A): 396-397.

To build a subject thesaurus from a massive domain corpus, a method based on a feature matrix constructed from word co-occurrence counts was proposed. By operating on this feature matrix, words are divided into clusters, and a central word is calculated for each word cluster. Lexical bundles are finally obtained by reorganizing the word clusters around their central words. The experiment indicates that the proposed method achieves good precision and recall.

Collaboration Filtering Recommendation Algorithm Based on Ratings Difference and Interest Similarity

WEI Hui-juan, DAI Mu-hong

Computer Science 2018, 45 (6A): 398-401.

To improve the quality of recommendation systems and address the inaccurate similarity calculation of traditional collaborative filtering algorithms, this paper put forward a method for calculating user similarity. Based on the users' common ratings, the method first calculates the information entropy of rating differences from the rating differences and time features. It then evaluates user similarity using this information entropy together with the attributes of the rated items. Finally, the nearest neighbors are determined according to the user similarity and used to predict the rating of the target item. The experimental results show that the proposed algorithm finds the nearest neighbors of the target user more accurately and effectively improves recommendation accuracy.
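
As a rough illustration of the entropy step only, the sketch below computes the Shannon entropy of the rating differences between two users over their co-rated items. The function name and the dict-based input format are assumptions made for this example; the time features and item attributes mentioned in the abstract are not modeled, and the paper's actual formula may differ.

```python
import math
from collections import Counter


def rating_difference_entropy(ratings_u, ratings_v):
    """Shannon entropy (in bits) of rating differences over co-rated items.

    ratings_u, ratings_v: dicts mapping item id -> rating (hypothetical format).
    Returns None when the two users share no rated items.
    """
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return None  # entropy is undefined without co-rated items
    # Count how often each rating difference occurs.
    diffs = Counter(ratings_u[i] - ratings_v[i] for i in common)
    n = len(common)
    return -sum((c / n) * math.log2(c / n) for c in diffs.values())
```

Under this reading, a low entropy means the two users differ in a consistent way (for example, one always rates one point higher), which a similarity measure could treat as agreement in taste.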

Spark Based Condensed Nearest Neighbor Algorithm

ZHANG Su-fang,ZHAI Jun-hai,WANG Ting-ting,HAO Pu,WANG Cong,ZHAO Chun-ling

Computer Science 2018, 45 (6A): 406-410.

K-nearest neighbors (K-NN) is a lazy learning algorithm: no classification model needs to be trained before K-NN is used to classify data. K-NN is simple and easy to implement. Its disadvantage is that it requires a large number of computations, because the distance between each testing instance and every training instance must be calculated. Condensed nearest neighbors (CNN) can overcome this drawback. However, CNN is an iterative algorithm, and its efficiency becomes very low in big data scenarios. To deal with this problem, this paper proposed an algorithm named Spark CNN, which can significantly improve the efficiency of CNN in big data settings. This paper experimentally compared Spark CNN with MapReduce CNN on 5 big data sets, and the experimental results show that Spark CNN is very effective.
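
For readers unfamiliar with condensing, the following is a minimal single-machine sketch of the classic condensed nearest neighbor idea (keep only the training instances that the current subset misclassifies under 1-NN). It is not the Spark CNN algorithm from the paper; the function names and the `(point, label)` tuple format are assumptions for illustration. It requires Python 3.8+ for `math.dist`.

```python
import math


def nearest_label(store, x):
    """Label of the stored instance closest to x (1-NN, Euclidean distance)."""
    return min(store, key=lambda s: math.dist(s[0], x))[1]


def condensed_nn(data):
    """Return a condensed subset of `data`, a list of (point_tuple, label).

    Starts from the first instance and repeatedly adds every instance that
    the current subset misclassifies, until a full pass adds nothing.
    """
    store = [data[0]]
    changed = True
    while changed:
        changed = False
        for x, y in data:
            # Skip instances already kept; this also avoids looping forever
            # on identical points that carry conflicting labels.
            if (x, y) in store:
                continue
            if nearest_label(store, x) != y:
                store.append((x, y))
                changed = True
    return store
```

The repeated passes over the full training set are what make the method iterative, which is the cost the abstract says becomes prohibitive on big data.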

Cloud Resource Selection Algorithm by Skyline under MapReduce Frame

QI Yu-dong,HE Cheng,SI Wei-chao

Computer Science 2018, 45 (6A): 411-414.

This paper studied a cloud resource selection algorithm under the MapReduce framework. It uses a probability filtering method to compute the probability that a resource node belongs to the Skyline result set. Information is filtered against a preset value in order to reduce the heartbeat frequency in the MapReduce framework and, ultimately, to optimize network traffic.
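
To make the notion of a Skyline result set concrete, here is a small sketch of a naive skyline computation: a point is kept if no other point dominates it. The assumption that smaller values are better in every dimension (for example cost and latency) and the function names are illustrative; the paper's probability filtering and MapReduce heartbeat handling are not modeled.

```python
def dominates(a, b):
    """True if a is no worse than b in every dimension and strictly better
    in at least one (assuming smaller values are better)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def skyline(points):
    """Naive O(n^2) skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For instance, with hypothetical `(cost, latency)` pairs, a node that is both more expensive and slower than another node can never be a sensible choice and is dropped from the result.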

Collaborative Filtering Personalized Recommendation Based on Similarity of Tag Information Feature

HE Ming, YAO Kai-sheng,YANG Peng,ZHANG Jiu-ling

Computer Science 2018, 45 (6A): 415-422.

Tag recommendation systems aim to provide users with personalized recommendations using tag data. Previous tag-based recommendation methods usually neglect the characteristics of users and items, and their similarity measures do not fully and effectively incorporate both user similarity and item similarity, which leads to deviations in the recommendation results. To address this issue, this paper proposed a collaborative filtering recommendation method that combines tag features and similarity for personalized recommendation. Two-dimensional matrices are used to define the user-tag and tag-item interactions, integrating information about users, tags and items. A tag feature representation is constructed, and user similarity and item similarity are calculated with a similarity measure based on tag features. User preferences for items are predicted from users' tag behaviors and a linear combination of user similarity and item similarity, and the recommendation list is generated according to the ranking of preferences. The experimental results on Last.fm show that the proposed method can improve recommendation accuracy and satisfy users' requirements.

Diversity Recommendation Approach Based on Social Relationship and User Preference

SHI Jin-ping,LI Jin,HE Feng-zhen

Computer Science 2018, 45 (6A): 423-427.

Traditional recommendation algorithms, represented by collaborative filtering, can provide users with highly accurate recommendation lists, but they ignore another important measure in recommendation systems: diversity. With the growth of social networks and the large amount of redundant and duplicated information they carry, information overload makes it harder to find user interests quickly and effectively. To recommend content that best matches users' interests, the recommended items need to be highly relevant to those interests while covering different aspects of them. Therefore, based on social relations and user preferences, this paper proposed a ranking framework that accounts for both diversity and relevance. Firstly, this paper introduced a social relation graph model that considers the relationships between users and items to better model their relevance. Then, a linear model was used to integrate the two key measures, diversity and relevance. Finally, the algorithm was implemented with the Spark GraphX parallel graph computing framework, and experiments were carried out on a real dataset to verify the feasibility and scalability of the proposed algorithm.

Study on Active Acquisition of Distributed Web Crawler Cluster

DONG Yu-long,YANG Lian-he,MA Xin

Computer Science 2018, 45 (6A): 428-432.

To address the problems of processing efficiency, scalability, task allocation and load balancing in existing distributed web crawler methods, an active task acquisition method for distributed web crawlers was proposed. A sub-control module is added to each sub-node to evaluate the node's load and operating status and to request task queues from the central control node. Based on this method and a dynamic bidirectional priority task allocation algorithm, a distributed web crawler model was designed. The model features load balancing, hierarchical task allocation, intelligent identification of abnormal nodes, and safe exit. Practical tests show that the active task acquisition method can be used to build large-scale distributed crawler clusters effectively.

Efficient Friend Recommendation Scheme for Social Networks

CHENG Hong-bing, WANG Ke, LI Bing, QIAN Man-yun

Computer Science 2018, 45 (6A): 433-436.

With the rapid development of modern network technology, human society has entered the information era. An increasing number of people prefer to talk and make friends with others through social networks. Besides the people or events that users actively follow, social networks also recommend other users. However, most of these recommended users are promotions by the social network. To improve the accuracy and reliability of social network recommendation, this paper proposed a new scheme based on tag matching. First, the words in the corpus are trained with Word2Vec to obtain a word vector space, and the similarity between words is computed with cosine similarity. Secondly, through similarity comparison experiments, an appropriate similarity value is chosen as the threshold for judging whether two words are similar. Finally, the similarity threshold is applied in the matching algorithm. The simulation experiments show that the recommended users are relatively reliable and accurate.
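
The thresholded matching step can be sketched as follows. The word vectors are assumed to be already available as a plain dict (in the paper they come from Word2Vec); the function names, the toy vectors and any particular threshold value are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def matching_tags(tags_a, tags_b, vectors, threshold):
    """Pairs (s, t) of tags from two users whose vectors are at least
    `threshold` similar. `vectors` maps each tag to its word vector."""
    return [(s, t) for s in tags_a for t in tags_b
            if cosine_similarity(vectors[s], vectors[t]) >= threshold]
```

Two users could then be considered candidates for recommendation when enough of their tag pairs pass the threshold; how the paper aggregates matched pairs is not stated in the abstract.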

Research on Data Mining Algorithm Based on Examination Process and Knowledge Structure

DAI Ming-zhu,GAO Song-feng

Computer Science 2018, 45 (6A): 437-441.

To study students' mastery of knowledge points at different stages, knowledge structure was combined with examination results using data mining theory. Based on the theory of educational measurement and the decision tree algorithm of data mining, an improved version of the original C4.5 algorithm was proposed. It uses the difficulty level of the knowledge points covered by the test papers together with the knowledge structure to refine that structure, in order to determine how well individual students or groups of students have mastered the knowledge and how the knowledge points relate to one another. The experimental results show that the improved algorithm is more efficient and that its formula is simpler and more practical than the original one. The remaining data were used to verify the improved formula against the decision tree model, which more quickly led to the conclusion that the knowledge points have a relatively important effect on programming. Test data were used to verify the decision tree, and the accuracy rate is 90%. Finally, a visual display of the decision tree can give an effective reference for students to plan their learning and for teachers to develop teaching programs and arrangements.

Algorithm for Mining Bipartite Network Based on Incremental Modularity

DAI Cai-yan, CHEN Ling, HU Kong-fa

Computer Science 2018, 45 (6A): 442-446.

To mine communities from bipartite networks, an algorithm based on incremental modularity was proposed. The algorithm starts by assuming that each vertex forms a community by itself with its own label. The vertices in one part copy their labels and pass them to vertices in the other part, so that they fall into the same community; the same operation is then performed from the vertices of the other part, and the iterations repeat until convergence. During label propagation, the algorithm chooses the edge with the largest modularity increment, so that the overall modularity keeps increasing. The experimental results on real datasets show that the proposed algorithm can mine high-quality communities from bipartite networks.

Influence Factors Mining of Traffic Accidents Based on Association Rules

JIA Xi-bin,YE Ying-jie,CHEN Jun-cheng

Computer Science 2018, 45 (6A): 447-452.

Road traffic safety is a public safety issue. Every year, deaths from traffic accidents account for the highest proportion of deaths from all accidents. With the development of intelligent big data analysis technology, traffic accident data are extensively used to trace causes, which helps in proposing specific measures to avoid and prevent traffic accidents. Given the diversity of causes of traffic accidents, this paper proposed using traffic accident news data, taking advantage of the wide coverage, authenticity and timeliness of news, to analyze the factors and the liability of traffic accidents. Taking the traffic accident news on Sina as the data source, the relevant factors of traffic accidents are extracted. Because the classic Apriori algorithm applies only to single-dimension association mining and needs to scan the database frequently, an improved multi-valued attribute Apriori algorithm was proposed. Focusing on the traffic accident data of provinces and cities, the paper mined a variety of factor combinations that lead to traffic accidents, and summarized the rules of frequent traffic accidents in provinces and cities as a basis for preventive and regulatory measures.

Scaling-up Algorithm of Multi-scale Classification Based on Fractal Theory

LI Jia-xing, ZHAO Shu-liang,AN Lei,LI Chang-jing

Computer Science 2018, 45 (6A): 453-459.

At present, research on multi-scale data mining mainly focuses on spatial image data. It has recently produced some results on general data, including multi-scale clustering and multi-scale association rules, but it has not yet been applied to classification mining. Combining fractal theory, this paper applied the theory, knowledge and methods of multi-scale data mining to classification mining, and proposed a similarity measure based on Hausdorff. Rather than defining the weight empirically, this paper defined it explicitly by the similarity of the generalized fractal dimension, to improve the precision of the similarity measure. Then, this paper proposed a multi-scale classification scaling-up algorithm named MSCSUA (Multi-Scale Classification Scaling-Up Algorithm). Finally, experiments were performed on four UCI benchmark data sets and one real data set (part of the population of H province). The experimental results show that the idea of multi-scale classification is feasible and effective, and that MSCSUA achieves better classification performance than the SLAD, KNN, Decision Tree and LIBSVM algorithms on different data sets.

Bisecting K-means Clustering Method Based on Cohesion and Coupling

YU Yong,KANG Qing-yi,CHEN Chang-geng,KAN Shi-lin,LUO Yong-jun

Computer Science 2018, 45 (6A): 460-464.

Clustering analysis is one of the most important techniques in data mining, and it plays an important role and is widely applied in many social and economic fields. K-means is a simple and widely used clustering method, but it depends on the initial conditions, and the number of clusters is difficult to determine. This paper introduced the cohesion and coupling of clusters and presented measures for both. Following the principle of “high cohesion and low coupling”, clusters are repeatedly divided and merged during the bisecting K-means clustering process. By judging whether the clustering results meet the requirements, the method can determine the number of clusters, thus improving the bisecting K-means clustering algorithm. The experimental results on the Iris data set show that the algorithm is not only more stable but also has higher clustering accuracy.

TEFRCF: Collaborative Filtering Personalized Recommendation Algorithm Based on Tag Entropy Feature Representation

HE Ming, YANG Peng, YAO Kai-sheng, ZHANG Jiu-ling

Computer Science 2018, 45 (6A): 465-470.

Tags serve as an effective means of information classification and retrieval in the Web 2.0 era. Tag recommendation systems aim to provide personalized recommendations for users by using tag data. Existing tag-based recommendation methods tend to assign larger weights to popular tags and their corresponding items when predicting users’ interest in items, which results in weight deviations, reduces the novelty of the results, and fails to fully reflect users’ personalized interests. To solve these problems, the concept of tag entropy was defined to measure the uncertainty of tags, and a collaborative filtering personalized recommendation algorithm based on tag entropy feature representation was proposed. The method solves the weight deviation problem by introducing tag entropy, and tripartite graphs are used to describe the relationships among users, tags and items. The representations of users and items are constructed from the tag entropy feature representation, and the similarity of items is calculated with a feature similarity measure. Finally, user preferences for items are predicted by a linear combination of tag behaviors and item similarity, and the recommendation list is generated according to the ranking of preferences. The experimental results on Last.fm show that the proposed algorithm can improve recommendation accuracy and novelty and satisfy users’ requirements.
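
The abstract does not give the formula for tag entropy. One plausible reading, shown here purely as an assumption, is the Shannon entropy of the distribution of items a tag is applied to: a popular, generic tag spread over many items gets a high entropy, while a specific tag concentrated on few items gets a low one. The function name and the `(user, tag, item)` triple format are hypothetical.

```python
import math
from collections import Counter


def tag_entropy(annotations, tag):
    """Shannon entropy (in bits) of the items that `tag` is applied to.

    annotations: iterable of (user, tag, item) triples (assumed format).
    A tag that never occurs, or occurs on a single item, has entropy 0.
    """
    counts = Counter(item for _, t, item in annotations if t == tag)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A weighting scheme could then down-weight high-entropy tags so that popular tags no longer dominate the predicted preferences, which is the weight deviation the abstract describes.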

Hash Join in MapReduce Distributed Environment Based on Column-store

ZHANG Bin, LE Jia-jin

Computer Science 2018, 45 (6A): 471-475.

The characteristics of big data are volume, variety, value and velocity, together with commodity hardware and open source. To address the inefficiency and limited scalability of traditional relational databases in big data analysis, this paper presented a Hash join algorithm for a column-store-based MapReduce distributed environment, introducing the MapReduce computing model. First, this paper proposed the design of a distributed computing model oriented to big data. Then, it proposed partition aggregation and a heuristic optimization strategy to implement the Hash join algorithm. Lastly, experiments evaluated execution time and load capacity. The results show that the proposed method is effective and can provide good scalability in big data analysis.
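
As background, the basic in-memory hash join that such systems distribute can be sketched in a few lines: build a hash table on one input, then probe it with the other. This is a single-process illustration with assumed names and a list-of-dicts row format; the paper's column store layout, partition aggregation and heuristic optimization are not represented.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Equi-join two lists of dict rows on the column `key`."""
    # Build phase: index the (ideally smaller) input by join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    # Probe phase: look up each row of the other input in the hash table.
    joined = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined
```

In a MapReduce setting, rows are typically partitioned by the hash of the join key so that each worker can run this build-and-probe step on its own partition.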

Improved XGBoost Model Based on Genetic Algorithm for Hypertension Recipe Recognition

LEI Xue-mei, XIE Yi-tong

Computer Science 2018, 45 (6A): 476-481.

A novel improved XGBoost (eXtreme Gradient Boosting) model based on a genetic algorithm was proposed for hypertension recipe recognition. The model consists of three steps. Firstly, data pre-processing handles missing values, removes duplicate data and analyzes data features. Then, the genetic algorithm is used to optimize the parameters of the XGBoost model adaptively. Finally, the hypertension recipe recognition model is trained with the optimal parameters. The results show that the parameters optimized by the genetic algorithm perform better than those found by grid search. Moreover, the proposed model outperforms four other models (Random Forest, GBDT, Bagging and AdaBooster) on average over four evaluation measures: accuracy, recall rate, F1 and the area under the curve (AUC), and enhances the interpretability of the credit scoring model.

Co-location Pattern Mining Algorithm Based on Data Normalization

ZENG Xin,LI Xiao-wei,YANG Jian

Computer Science 2018, 45 (6A): 482-486.

In practical applications, spatial features contain not only spatial information but also attribute information, which is important for knowledge discovery and scientific decision-making. When calculating the adjacent distance between two instances of different features, existing co-location pattern mining algorithms do not consider the weight of each attribute in that distance. As a result, the weight of some attributes is too large, which affects the result of co-location pattern mining. By standardizing the attribute values and giving equal weight to all attributes, a join-based data standardization algorithm, DNRA, was put forward. In addition, the problem that the distance threshold is difficult to determine was studied in depth. The range of the distance threshold is derived in the DNRA algorithm, helping users select an appropriate distance threshold. Finally, the performance of the DNRA algorithm was analyzed and compared through a large number of experiments.

Adaptive Stochastic Gradient Descent for Imbalanced Data Classification

TAO Bing-mo,LU Shu-xia

Computer Science 2018, 45 (6A): 487-492.

For imbalanced data classification, traditional stochastic gradient descent does not perform well when solving SVM problems. The adaptive stochastic gradient descent algorithm defines a distribution p, instead of the uniform distribution, for choosing examples, and uses a smoothed hinge loss function in the optimization problem. Because the training sets are imbalanced, sampling from the uniform distribution causes the algorithm to choose more majority-class examples, in proportion to the imbalance ratio, which would bias the classifier towards the minority class. The distribution p largely overcomes this issue. When to stop the program is also an important problem, because standard stochastic gradient descent has no stopping criterion, especially for large data sets. Here, the stopping criterion was set according to the classification accuracy on the training set or its subsets. This criterion can stop the program very early, especially for large data sets, if the parameters are chosen properly. Experiments on imbalanced data sets show that the proposed algorithm is effective.
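
The abstract does not specify the distribution p. As one simple possibility, assumed here only for illustration, p can give every class the same total probability mass, so that each minority example is drawn more often than each majority example. The function names are hypothetical, and the smoothed hinge loss and stopping criterion are not shown.

```python
import random
from collections import Counter


def class_balanced_distribution(labels):
    """Sampling probabilities that give each class equal total mass.

    For k classes, an example of a class with n_c members gets 1 / (k * n_c).
    This is an assumed choice of p, not the paper's definition.
    """
    counts = Counter(labels)
    k = len(counts)
    return [1.0 / (k * counts[y]) for y in labels]


def sample_index(p, rng=random):
    """Draw one training index according to the distribution p."""
    return rng.choices(range(len(p)), weights=p, k=1)[0]
```

With nine majority examples and one minority example, uniform sampling would pick the minority example about 10% of the time, while this distribution picks it about half of the time.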

Coordination Filtering Personalized Recommendation Algorithm Considering Average Preference Weight and Popularity Division

HE Ji-xing,CHEN Wen-bin,MOU Bin-hao

Computer Science 2018, 45 (6A): 493-496.

This paper presented a new recommendation algorithm that takes the average preference weight into account. The algorithm is divided into three stages: neighborhood computation, data set partitioning and preference prediction. In the neighborhood computation, KNN based on the Euclidean distance is used to determine the neighborhood. At the same time, the data set is divided into a popular data set and a non-popular data set according to the popularity threshold of the data set itself. When a score is predicted, the existing neighborhood selects part of the items according to their popularity, and the user’s average preference weight is predicted based on the preference similarity of the item set. The results show that, on the MovieLens 100K data set, the new algorithm outperforms in MAE the typical cosine recommendation algorithm, the Pearson recommendation algorithm, the collaborative filtering algorithm based on item preference, and the user-attribute-weighted active neighbor algorithm.

Correlative Factors about Elderly Disabled and Dementia in Big Data Environments

LI Han, LI Hui-jia, ZHANG Lin-zi, HUANG Yu-ying

Computer Science 2018, 45 (6A): 497-501.

Recently, as China’s population aging problem deepens, the rising family pension burden, the pressure on government and the decline of the demographic dividend have gained wide attention. Aging, Status, and Sense of Control (ASOC) is a continuous survey of the health of the elderly in the United States, conducted every three years from 1995 to 2001. Using three waves of ASOC, we identified factors related to disability and dementia by logistic regression, building on descriptive statistical analysis. After removing strongly correlated factors, we found that age, gender, smoking, drinking, exercise, heart disease, arthritis/rheumatism, and participation in community service are strongly correlated with disability in the elderly. Age, drinking, hypertension, participation in community service and marital status are significantly related to dementia in the elderly.
