IMPLEMENTATION OF TF-IDF AND COSINE SIMILARITY ALGORITHMS FOR CLASSIFICATION OF DOCUMENTS BASED ON ABSTRACT SCIENTIFIC JOURNALS

  • Paska Marto Hasugian, Software Engineering Study Program, STMIK Pelita Nusantara
  • Jonson Manurung, Software Engineering Study Program, STMIK Pelita Nusantara
  • Logaraz Logaraz, Student, Software Engineering Study Program, STMIK Pelita Nusantara
  • Uzitha Ram, Student, Software Engineering Study Program, STMIK Pelita Nusantara
Keywords: Text Mining, TF-IDF, Classification, Cosine Similarity

Abstract

Research is one of the dharmas of higher education. It is carried out by every lecturer, and producing new and useful findings is a challenge for lecturers. Research results are published in national and international journals, and Sinta, a website published by Ristekbrin, includes all research works in Indonesia. The problem addressed in this research is the growing accumulation of data, which needs to be analyzed with text mining by searching for the information contained in abstract documents and presenting part of it. The purpose of this study is to classify documents by their similarity to other documents, so that knowledge is discovered and each document is placed in a group according to existing topics. The problem is approached by classifying documents based on the abstracts of published scientific papers. The solution involves two mutually supporting algorithms with different tasks: TF-IDF and Cosine Similarity. TF-IDF determines the weight of each document, and Cosine Similarity measures the similarity between the weighted documents. This research uses text mining to search for related patterns and documents that have been tested. For the calculation, 1 document was used as test data and 15 documents were used as training data. With the TF-IDF calculation, the weights of the documents Q and D1 to D15 are 10,946; 28,050; 27,176; 39,043; 36,535; 30,696; 25,612; 12,581; 42,335; 29,661; 33,867; 31,706; 22,654; 15,450; 59,832; 42,127. Similarity was tested with k = 4, which results in similarity to both Expert System and Cryptography, while k = 5 gives the highest similarity to Expert System.
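
The pipeline described in the abstract (TF-IDF weighting, cosine similarity against the query, then a decision over the k most similar documents) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenizer, the base-10 logarithm in the IDF term, the majority vote over neighbours, and the toy abstracts and labels are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; the paper's preprocessing
    # (e.g. stop-word removal, stemming) is not given in the abstract.
    return text.lower().split()


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency counts each term once per doc
    # Assumption: idf = log10(N / df); other variants add smoothing.
    idf = {term: math.log10(n / count) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in docs
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def classify(query, training, k):
    """training: list of (abstract_text, label) pairs.

    Returns the majority label among the k most similar training documents,
    together with those k (similarity, label) pairs.
    """
    docs = [tokenize(query)] + [tokenize(text) for text, _ in training]
    vectors = tf_idf_vectors(docs)
    query_vec, train_vecs = vectors[0], vectors[1:]
    scored = sorted(
        (
            (cosine_similarity(query_vec, vec), label)
            for vec, (_, label) in zip(train_vecs, training)
        ),
        reverse=True,
    )
    neighbours = scored[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0], neighbours


# Toy data, invented for illustration only (not the paper's D1..D15).
training = [
    ("expert system diagnosis rules knowledge base", "Expert System"),
    ("expert system forward chaining diagnosis", "Expert System"),
    ("cryptography encryption key cipher", "Cryptography"),
    ("cryptography rsa encryption decryption key", "Cryptography"),
]
query = "diagnosis of disease with an expert system"

label, neighbours = classify(query, training, k=3)
print(label)
for score, neighbour_label in neighbours:
    print(round(score, 3), neighbour_label)
```

Because the result depends on a vote among neighbours, changing k can change the outcome, which is consistent with the abstract reporting different results for k = 4 and k = 5.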

Published
2021-08-31
How to Cite
Marto Hasugian, P., Manurung, J., Logaraz, L., & Ram, U. (2021). IMPLEMENTATION OF TF-IDF AND COSINE SIMILARITY ALGORITHMS FOR CLASSIFICATION OF DOCUMENTS BASED ON ABSTRACT SCIENTIFIC JOURNALS. INFOKUM, 9(2, June), 518-526. Retrieved from https://infor.seaninstitute.org/index.php/infokum/article/view/201