Started in January 1974 (Monthly)
Supervised and Sponsored by Chongqing Southwest Information Co., Ltd.
ISSN 1002-137X
CN 50-1075/TP
CODEN JKIEBK
    Content of Big Data & Data Mining in our journal
    Research on Neural Network Clustering Algorithm for Short Text
    SUN Zhao-ying, LIU Gong-shen
    Computer Science    2018, 45 (6A): 392-395.  
    Short texts have small vocabularies and carry little descriptive information, which leads to high dimensionality, sparse features, and noise interference. Existing clustering algorithms have low accuracy and efficiency on large-scale short text. To solve this problem, a short text clustering algorithm based on a deep convolutional neural network was proposed. The algorithm uses the word2vec model, trained on a large-scale corpus, to learn the latent semantic associations between words in short texts and to represent each word as a multidimensional vector; each short text is then expressed as a multidimensional original vector. A convolutional neural network extracts features from the sparse, high-dimensional original vector to produce a low-dimensional text vector with more effective features. Finally, a traditional clustering algorithm is used to cluster the short texts. The proposed method is feasible and effective for reducing the dimensionality of text vectors, and achieves good short text clustering results, with an F-measure of over 75%.
    Construction Method of Domain Subject Thesaurus Based on Corpus
    AN Ya-wei, CAO Xiao-chun, LUO Shun
    Computer Science    2018, 45 (6A): 396-397.  
    To build a subject thesaurus for a massive domain corpus, a method based on a feature matrix constructed from word co-occurrence counts was proposed. By operating on this feature matrix, words are divided into clusters, and a central word is calculated for each word cluster. Lexical bundles are finally obtained by reorganizing the word clusters around their central words. Experiments indicate that the proposed method achieves good precision and recall.
    Collaboration Filtering Recommendation Algorithm Based on Ratings Difference and Interest Similarity
    WEI Hui-juan, DAI Mu-hong
    Computer Science    2018, 45 (6A): 398-401.  
    To improve the quality of recommendation systems and address the inaccurate similarity calculation of traditional collaborative filtering algorithms, this paper put forward a method for calculating user similarity. Based on users' common ratings, the method first calculates the information entropy of rating differentials from the rating differentials and time features. It then evaluates user similarity using the information entropy of rating differentials and the attributes of the rated items. Finally, the nearest neighbors are determined according to user similarity and used to predict the rating of the target item. The experimental results show that the proposed algorithm finds the target user's nearest neighbors more accurately and effectively improves recommendation accuracy.
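The abstract does not give the entropy formula, so the following is only a minimal sketch of one plausible reading: the Shannon entropy of the absolute rating differences between two users on their co-rated items. The function name `rating_difference_entropy` and the dict-based rating format are assumptions for illustration, and the time features and item attributes mentioned in the abstract are not modeled.

```python
import math
from collections import Counter

def rating_difference_entropy(ratings_u, ratings_v):
    """Shannon entropy (in bits) of absolute rating differences on co-rated items.

    ratings_u and ratings_v map item ids to numeric ratings (assumed format).
    """
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return None  # no co-rated items, so the entropy is undefined in this sketch
    # Count how often each absolute difference occurs across the co-rated items.
    counts = Counter(abs(ratings_u[i] - ratings_v[i]) for i in common)
    total = len(common)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under this reading, two users whose rating differences are always the same get an entropy of zero, while users whose differences vary widely get a higher value.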
    Spark Based Condensed Nearest Neighbor Algorithm
    ZHANG Su-fang, ZHAI Jun-hai, WANG Ting-ting, HAO Pu, WANG Cong, ZHAO Chun-ling
    Computer Science    2018, 45 (6A): 406-410.  
    K-nearest neighbors (K-NN) is a lazy learning algorithm: no classification model needs to be trained before using K-NN to classify data. The K-NN algorithm is simple and easy to implement. Its disadvantage is that it requires a large number of computations, because the distance between each testing instance and every training instance must be calculated. Condensed nearest neighbors (CNN) can overcome this drawback. However, CNN is an iterative algorithm, and its efficiency becomes very low when it is applied in big data scenarios. To deal with this problem, this paper proposed an algorithm named Spark CNN, which can significantly improve the efficiency of CNN in big data settings. This paper experimentally compared Spark CNN with MapReduce CNN on 5 big data sets, and the experimental results show that Spark CNN is very effective.
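For readers unfamiliar with condensing, here is a minimal single-machine sketch of the classic condensed nearest neighbor idea (in the style of Hart's rule with 1-NN). It is not the Spark CNN algorithm from the paper; the function names and the tuple-based data format are assumptions for illustration.

```python
import math

def nearest_label(store, x):
    """Return the label of the stored sample closest to x (1-NN)."""
    _, label = min(store, key=lambda item: math.dist(item[0], x))
    return label

def condense(points, labels):
    """Keep only the samples that the current condensed store misclassifies."""
    store = [(points[0], labels[0])]  # seed the store with the first sample
    changed = True
    while changed:  # repeat full passes until a pass adds nothing
        changed = False
        for x, y in zip(points, labels):
            if (x, y) in store:
                continue  # already kept; also guards against duplicate points
            if nearest_label(store, x) != y:
                store.append((x, y))
                changed = True
    return store
```

The repeated passes over the training data are the iterative part that the abstract identifies as the bottleneck on large data sets.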
    Cloud Resource Selection Algorithm by Skyline under MapReduce Frame
    QI Yu-dong, HE Cheng, SI Wei-chao
    Computer Science    2018, 45 (6A): 411-414.  
    This paper studied a cloud resource selection algorithm under the MapReduce framework. The algorithm uses a possibility-filtering method to estimate the possibility that a resource node belongs to the Skyline result set. It filters information against a preset value in order to reduce the heartbeat frequency in the MapReduce framework and thereby optimize network traffic.
    Collaborative Filtering Personalized Recommendation Based on Similarity of Tag Information Feature
    HE Ming, YAO Kai-sheng, YANG Peng, ZHANG Jiu-ling
    Computer Science    2018, 45 (6A): 415-422.  
    Tag recommendation systems aim to provide users with personalized recommendations using tag data. Previous tag-based recommendation methods usually neglect the characteristics of users and items, and their similarity measures do not effectively incorporate both user similarity and item similarity, which leads to deviations in the recommendation results. To address this issue, this paper proposed a collaborative filtering method that combines tag features and similarity for personalized recommendation. Two-dimensional matrices are used to define the user-tag and tag-item interactions, integrating information among users, tags and items. A tag feature representation is constructed, and user similarity and item similarity are calculated by a similarity measure based on tag features. User preferences for items are predicted from users' tag behaviors and a linear combination of user similarity and item similarity, and the recommendation list is generated according to the ranking of preferences. The experimental results on Last.fm show that the proposed method can improve recommendation accuracy and meet users' requirements.
    Diversity Recommendation Approach Based on Social Relationship and User Preference
    SHI Jin-ping, LI Jin, HE Feng-zhen
    Computer Science    2018, 45 (6A): 423-427.  
    Traditional recommendation algorithms, represented by collaborative filtering, can provide users with highly accurate recommendation lists, but they ignore another important measure in recommendation systems: diversity. With the growth of social networks, which carry a great deal of redundant and duplicated information, information overload makes it more difficult to find user interests quickly and effectively. To recommend content that best matches users' interests, recommendations must be highly relevant and cover different aspects of those interests. Therefore, based on social relations and user preferences, this paper proposed a ranking framework for diversity and relevance. First, this paper introduced a social relation graph model that considers the relationships between users and items to better model their relevance. Then, this paper used a linear model to integrate the two important indexes, diversity and relevance. Finally, the algorithm was implemented on the Spark GraphX parallel graph computation framework, and experiments were carried out on a real dataset to verify the feasibility and scalability of the proposed algorithm.
    Study on Active Acquisition of Distributed Web Crawler Cluster
    DONG Yu-long, YANG Lian-he, MA Xin
    Computer Science    2018, 45 (6A): 428-432.  
    To address the problems of processing efficiency, scalability, task allocation and load balancing in existing distributed web crawler methods, a distributed web crawler method with active task acquisition was proposed. In this method, a sub-control module is added to each sub-node to evaluate the node's load and operating status and to request task queues from the central control node. Based on this method and a dynamic bidirectional priority task allocation algorithm, a distributed web crawler model was designed that features load balancing, hierarchical task allocation, smart identification of abnormal nodes, and safe exit. Practical tests show that the method can be used to build large-scale distributed crawler clusters effectively.
    Efficient Friend Recommendation Scheme for Social Networks
    CHENG Hong-bing, WANG Ke, LI Bing, QIAN Man-yun
    Computer Science    2018, 45 (6A): 433-436.  
    With the rapid development of modern network technology, human society has entered the information age. An increasing number of people prefer to talk and make friends with others through social networks. Besides the people or events that users actively follow, social networks also recommend other users. However, most of these recommended users are promotions by the social networks. In this paper, a new scheme based on tag matching was proposed to improve the accuracy and reliability of social network recommendation. First, the words in the corpus are trained with Word2Vec to obtain a word vector space, and the similarity between words is computed using cosine similarity. Second, through similarity comparison experiments, this paper chose an appropriate similarity value as the threshold for judging whether two words are similar. Finally, the similarity threshold was applied to the matching algorithm. The simulation experiments show that the recommended users are relatively reliable and accurate.
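The thresholded matching step can be sketched as follows. This is only an illustration under stated assumptions: the word vectors are supplied as a plain dict rather than trained with Word2Vec, and the function names, the toy vectors, and the threshold value of 0.8 are all hypothetical, since the abstract does not give the chosen threshold or the exact matching rule.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

def count_matching_tags(tags_a, tags_b, vectors, threshold):
    """Count tags of user A that have at least one similar tag among user B's tags."""
    return sum(
        1
        for a in tags_a
        if any(cosine_similarity(vectors[a], vectors[b]) >= threshold for b in tags_b)
    )

# Toy two-dimensional vectors (hypothetical, not real Word2Vec output).
vectors = {"football": (1.0, 0.0), "soccer": (0.9, 0.1), "opera": (0.0, 1.0)}
matches = count_matching_tags(["football", "opera"], ["soccer"], vectors, 0.8)
```

In this toy data only "football" is close enough to "soccer" to count as a match, so a candidate user would be scored on one matching tag.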
    Research on Data Mining Algorithm Based on Examination Process and Knowledge Structure
    DAI Ming-zhu, GAO Song-feng
    Computer Science    2018, 45 (6A): 437-441.  
    To study students' mastery of knowledge points at different stages, knowledge structure was combined with examination results and studied using data mining theory. Based on the theory of educational measurement and the decision tree algorithm of data mining, an improved algorithm was proposed on the basis of the original C4.5 algorithm. It applies the difficulty level of the knowledge points involved in the test papers, together with the knowledge structure, to refine the knowledge structure, in order to determine how well individual students or groups of students have mastered the knowledge and the relationships between the knowledge points. The experimental results show that the improved algorithm is more efficient, and that its formula is simpler and more practical than the original formula. Using the decision tree model, the remaining data is used to verify the improved formula, which more quickly leads to the conclusion that the effect of knowledge points on programming is relatively important. Test data is used to verify the decision tree, and the accuracy is 90%. Finally, a visual display of the decision tree provides an effective reference for students arranging their study and for teachers developing teaching programs and arrangements.
    Algorithm for Mining Bipartite Network Based on Incremental Modularity
    DAI Cai-yan, CHEN Ling, HU Kong-fa
    Computer Science    2018, 45 (6A): 442-446.  
    To mine communities from bipartite networks, an algorithm based on incremental modularity was proposed. The algorithm initially treats each vertex as its own community with its own label. Each vertex in one part copies its label and passes it to a vertex in the other part, so that the two vertices lie in the same community; the same operation is then performed on the vertices of the other part, and the process iterates until convergence. During label propagation, the algorithm chooses the edge with the largest incremental modularity, so that the overall modularity keeps improving. The experimental results on real datasets show that the proposed algorithm can mine high-quality communities from bipartite networks.
    Influence Factors Mining of Traffic Accidents Based on Association Rules
    JIA Xi-bin, YE Ying-jie, CHEN Jun-cheng
    Computer Science    2018, 45 (6A): 447-452.  
    Road traffic safety is a public safety issue. Deaths due to traffic accidents account for the highest proportion of deaths in all accidents every year. With the development of big data intelligent analysis technology, traffic accident data are extensively used to trace causes, which helps in proposing specific measures to avoid and prevent traffic accidents. Because the causes of traffic accidents are diverse, this paper proposed using traffic accident news data, taking advantage of the broad coverage, authenticity and timeliness of news, to analyze the factors and liability in traffic accidents. Taking the traffic accident news on Sina as the data source, the relevant factors of traffic accidents are extracted. To address the limitations of the classic Apriori algorithm, which applies only to single-dimension association mining and needs to scan the database frequently, an improved multi-valued attribute Apriori algorithm was proposed. Focusing on the traffic accident data of provinces and cities, a variety of factor combinations that lead to these traffic accidents were mined, and the rules of frequent traffic accidents in provinces and cities were summarized as a basis for taking preventive and regulatory measures.
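As background for the baseline the abstract refers to, here is a minimal sketch of the classic level-wise Apriori search for frequent itemsets. It is not the paper's improved multi-valued attribute variant; the function name and the set-based transaction format are assumptions, and the repeated counting pass over all transactions at each level is the frequent database scanning the abstract mentions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}  # level 1 candidates
    frequent = {}
    k = 1
    while current:
        # One scan over all transactions per level to count candidate support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c
            for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent
```

Each transaction here would be the set of factors extracted from one accident report; items such as "rain" or "night" are hypothetical examples, not values from the paper's data.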
    Scaling-up Algorithm of Multi-scale Classification Based on Fractal Theory
    LI Jia-xing, ZHAO Shu-liang, AN Lei, LI Chang-jing
    Computer Science    2018, 45 (6A): 453-459.  
    At present, research on multi-scale data mining mainly focuses on spatial image data. It has recently produced some results on general data, including multi-scale clustering and multi-scale association rules, but it has not yet been applied to classification mining. Combining fractal theory, this paper applied the theory, knowledge and methods of multi-scale data mining to classification mining, and proposed a similarity measure based on Hausdorff. Rather than defining the weight by experience, this paper defined it explicitly by the similarity of the generalized fractal dimension to improve the precision of the similarity measure. Then, this paper proposed a multi-scale classification scaling-up algorithm named MSCSUA (Multi-Scale Classification Scaling-Up Algorithm). Finally, this paper performed experiments on four UCI benchmark data sets and one real data set (part of the population of H province). The experimental results show that the idea of multi-scale classification is feasible and effective, and that the MSCSUA algorithm achieves better classification performance than the SLAD, KNN, Decision Tree and LIBSVM algorithms on different data sets.
    Bisecting K-means Clustering Method Based on Cohesion and Coupling
    YU Yong, KANG Qing-yi, CHEN Chang-geng, KAN Shi-lin, LUO Yong-jun
    Computer Science    2018, 45 (6A): 460-464.  
    Clustering analysis is one of the most important techniques in data mining. It plays an important role and is widely applied in every field of the social economy. K-means is a simple and widely used clustering method, but its disadvantages are that it depends on the initial conditions and that the number of clusters is difficult to determine. This paper introduced the cohesion and coupling of clusters and presented measurements of both. Based on the principle of “high cohesion and low coupling”, clusters are repeatedly divided and merged during the bisecting K-means clustering process. By judging whether the clustering results meet the requirements, the method can determine the number of clusters, thus improving the bisecting K-means clustering algorithm. The experimental results on the Iris data set show that the algorithm is not only more stable, but also has higher clustering accuracy.
    TEFRCF: Collaborative Filtering Personalized Recommendation Algorithm Based on Tag Entropy Feature Representation
    HE Ming, YANG Peng, YAO Kai-sheng, ZHANG Jiu-ling
    Computer Science    2018, 45 (6A): 465-470.  
    Tags serve as an effective means of information classification and retrieval in the Web 2.0 era. Tag recommendation systems aim to provide personalized recommendations for users by using tag data. Existing tag-based recommendation methods tend to assign larger weights to popular tags and their corresponding items when predicting users’ interest in items, which causes weight deviations, reduces the novelty of the results, and fails to fully reflect users’ personalized interests. To solve these problems, the concept of tag entropy was defined to measure the uncertainty of tags, and a collaborative filtering personalized recommendation algorithm based on tag entropy feature representation was proposed. The method solves the weight deviation problem by introducing tag entropy, and tripartite graphs are used to describe the relationships among users, tags and items. Representations of users and items are constructed based on the tag entropy feature representation, and item similarity is calculated by a feature similarity measure. Finally, user preferences for items are predicted by a linear combination of tag behaviors and item similarity, and the recommendation list is generated according to the ranking of preferences. The experimental results on Last.fm show that the proposed algorithm can improve recommendation accuracy and novelty and meet users’ requirements.
    Hash Join in MapReduce Distributed Environment Based on Column-store
    ZHANG Bin, LE Jia-jin
    Computer Science    2018, 45 (6A): 471-475.  
    The characteristics of big data are volume, variety, value, and velocity, along with commodity hardware and open source. To address the inefficiency and limited scalability of traditional relational databases in big data analysis, this paper presented a hash join algorithm for a column-store-based MapReduce distributed environment by introducing the MapReduce computing model. First, this paper proposed the design of a distributed computing model oriented to big data. Then, it proposed partition aggregation and a heuristic optimization strategy to implement the hash join algorithm. Finally, experiments evaluated execution time and load capacity. The results show that the proposed method is effective and provides good scalability in big data analysis.
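The core of any hash join is a build phase followed by a probe phase, which can be sketched on a single machine as below. This is only an in-memory illustration; the paper's MapReduce partitioning, column-store layout, partition aggregation and heuristic optimization are not reproduced, and the function name and the dict-based row format are assumptions.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, key):
    """Equi-join two lists of dict rows on `key` with a build phase and a probe phase."""
    table = defaultdict(list)
    for row in build_rows:  # build phase: index one input (ideally the smaller) by key
        table[row[key]].append(row)
    joined = []
    for row in probe_rows:  # probe phase: look up each row of the other input
        for match in table.get(row[key], []):
            joined.append({**match, **row})  # merge the matching pair into one row
    return joined
```

In a distributed setting, rows with the same key hash would first be routed to the same worker, and each worker would then run a local join of this kind on its partition.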
    Improved XGBoost Model Based on Genetic Algorithm for Hypertension Recipe Recognition
    LEI Xue-mei, XIE Yi-tong
    Computer Science    2018, 45 (6A): 476-481.  
    A novel improved XGBoost (eXtreme Gradient Boosting) model based on a genetic algorithm was proposed for hypertension recipe recognition. The model consists of three steps. First, data pre-processing is employed to handle missing values, remove duplicate data and analyze data features. Then, the genetic algorithm is used to adaptively optimize the parameters of the XGBoost model. Finally, the hypertension recipe identification model is trained with the optimal parameters. The results show that the parameters optimized by the genetic algorithm perform better than those found by grid search. Moreover, the proposed model outperforms four other models (Random Forest, GBDT, Bagging and AdaBoost) on average over four evaluation measures: accuracy, recall, F1 and the area under the curve (AUC), and enhances the interpretability of the credit scoring model.
    Co-location Pattern Mining Algorithm Based on Data Normalization
    ZENG Xin, LI Xiao-wei, YANG Jian
    Computer Science    2018, 45 (6A): 482-486.  
    In practical applications, spatial features contain not only spatial information but also attribute information, which is important for knowledge discovery and scientific decision making. When calculating the neighbor distance between two instances of different features, existing co-location pattern mining algorithms do not consider the weights of the different attributes in that distance. As a result, the weights of some attributes are too large, which affects the result of co-location pattern mining. By standardizing the attribute values and giving equal weight to all attributes, a join-based data standardization algorithm, DNRA, was put forward. Meanwhile, the problem that the distance threshold is difficult to determine was studied in depth. The range of the distance threshold was derived in the DNRA algorithm, helping users select an appropriate distance threshold. Finally, the performance of the DNRA algorithm was analyzed and compared through a large number of experiments.
    Adaptive Stochastic Gradient Descent for Imbalanced Data Classification
    TAO Bing-mo, LU Shu-xia
    Computer Science    2018, 45 (6A): 487-492.  
    For imbalanced data classification, traditional stochastic gradient descent does not perform well when solving SVM problems. The adaptive stochastic gradient descent algorithm defines a distribution p, instead of using the uniform distribution, to choose examples, and a smoothed hinge loss function is used in the optimization problem. Because the training sets are imbalanced, using the uniform distribution causes the algorithm to choose more majority-class examples, in proportion to the imbalance ratio. That would bias the classifier towards the minority class. The distribution p largely overcomes this issue. When to stop the program is also an important problem, because standard stochastic gradient descent has no stopping criterion, especially for large data sets. The stopping criterion was set according to the classification accuracy on the training set or its subsets. This criterion can stop the program very early, especially for large data sets, if the parameters are chosen properly. Experiments on imbalanced data sets show that the proposed algorithm is effective.
    Coordination Filtering Personalized Recommendation Algorithm Considering Average Preference Weight and Popularity Division
    HE Ji-xing, CHEN Wen-bin, MOU Bin-hao
    Computer Science    2018, 45 (6A): 493-496.  
    This paper presented a new recommendation algorithm that takes the average preference weight into account. The algorithm is divided into three stages: neighborhood computing, data set partitioning and preference prediction. In neighborhood computing, KNN based on the Euclidean distance is used to determine the neighborhood. At the same time, the data set is divided into a popular data set and a non-popular data set according to a popularity threshold derived from the data set itself. When a score is predicted, the existing neighborhood selects some of the items according to their popularity, and the user’s average preference weight is predicted based on the preference similarity of the item set. The results show that, on the MovieLens 100K data set, the new algorithm is superior in MAE to existing algorithms: the typical cosine recommendation algorithm, the Pearson recommendation algorithm, the collaborative filtering algorithm based on item preference, and the user-attribute-weighted active neighbor algorithm.
    Correlative Factors about Elderly Disabled and Dementia in Big Data Environments
    LI Han, LI Hui-jia, ZHANG Lin-zi, HUANG Yu-ying
    Computer Science    2018, 45 (6A): 497-501.  
    Recently, with the deepening of China’s population aging problem, the rising family pension burden, government pressure, and the decline of the demographic dividend have gained wide attention. Aging, Status, and Sense of Control (ASOC) is a longitudinal survey of the health of the elderly in America, conducted every three years from 1995 to 2001. Using three waves of ASOC and building on descriptive statistical analysis, we identified factors related to disability and dementia by logistic regression. After removing strongly correlated factors, we found that age, gender, smoking, drinking, exercise, heart disease, arthritis/rheumatism, and participation in community service are strongly correlated with disability in the elderly. Age, drinking, hypertension, participation in community service and marital status are significantly related to dementia in the elderly.