N-gram Based Lexical Sentence Similarity Score Using Modified Jaccard Algorithm
DOI:
https://doi.org/10.70112/ajcst-2025.14.2.4389
Keywords:
Natural Language Processing (NLP), Sentence Similarity, Answer Assessment, Jaccard Algorithm, Subjective Answer Evaluation
Abstract
In recent years, the Internet has evolved into a global phenomenon, making it nearly impossible to envision modern life without it. Among the vast forms of online content, textual data holds the greatest significance due to its abundance and informational value. However, managing and analysing such extensive text corpora poses several challenges, with sentence similarity emerging as one of the most complex problems in Natural Language Processing (NLP). Although existing sentence comparison techniques perform effectively in specific contexts, they often struggle in others and typically require substantial computational resources, including powerful hardware, extensive training datasets, and high processing capabilities. To address these limitations, this study introduces a lightweight approach that emphasizes word-level similarity through comprehensive n-gram comparisons. The proposed method incorporates semantic understanding and evaluates the longest common subsequence within sentences to generate a more accurate similarity score. It demonstrates superior efficiency over baseline methods by minimizing computational requirements and leveraging straightforward mathematical operations.
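As a rough illustration of the kind of computation the abstract describes, the following Python sketch combines plain Jaccard overlap over word n-grams with a longest-common-subsequence ratio. It is not the authors' modified Jaccard algorithm, whose details are not given here. The function names, the `max_n` and `lcs_weight` parameters, the equal-weight averaging, and the whitespace tokenization are all assumptions made for illustration, and the semantic component mentioned in the abstract is omitted.

```python
def word_ngrams(tokens, n):
    """Return the set of contiguous word n-grams of length n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Plain Jaccard index |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lcs_length(x, y):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(s1, s2, max_n=3, lcs_weight=0.5):
    """Hypothetical combined score in [0, 1]; the weighting is an assumption."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    if not t1 or not t2:
        return 0.0
    # Compare every n-gram order from 1 up to max_n (capped by sentence length).
    orders = range(1, min(max_n, len(t1), len(t2)) + 1)
    ngram_score = sum(
        jaccard(word_ngrams(t1, n), word_ngrams(t2, n)) for n in orders
    ) / len(orders)
    # Normalize the LCS length by the longer sentence so the ratio stays in [0, 1].
    lcs_score = lcs_length(t1, t2) / max(len(t1), len(t2))
    return (1 - lcs_weight) * ngram_score + lcs_weight * lcs_score


if __name__ == "__main__":
    print(similarity("the cat sat on the mat", "the cat lay on the mat"))
```

Everything here is set operations and a single dynamic-programming table, which matches the abstract's emphasis on simple arithmetic over trained models. A real scorer would also need the semantic step and a principled choice of weights.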
License
Copyright (c) 2025 Centre for Research and Innovation

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
