Time Series Data Mining Algorithms for Identifying Short RNA in Arabidopsis thaliana

Bagnall, Anthony, Moxon, Simon ORCID: https://orcid.org/0000-0003-4644-1816 and Studholme, David (2007) Time Series Data Mining Algorithms for Identifying Short RNA in Arabidopsis thaliana. Working Paper. University of East Anglia.

Abstract

Short RNAs (sRNAs) are a class of molecules known to play a key role in gene regulation. They are typically sequences of 21 to 25 nucleotides. The identification, clustering and classification of sRNA has recently become the focus of much research activity. The basic problem involves detecting regions of interest on the chromosome where the pattern of candidate matches is somehow unusual. Currently, there are no published algorithms for detecting regions of interest, and the unpublished methods that we are aware of involve bespoke rule-based systems designed for a specific organism. Work in this very new field has understandably focused on the outcomes rather than the methods used to obtain the results. In this paper we propose two generic approaches that place the specific biological problem in the wider context of time series data mining problems. Both methods are based on treating the occurrences on a chromosome, or “hit count” data, as a time series, then running a sliding window along the chromosome and measuring unusualness. This formulation means we can treat finding unusual areas of candidate RNA activity as a variety of time series anomaly detection problem. The first set of approaches is model based. We specify a null hypothesis distribution for not being an sRNA, then estimate the p-values along the chromosome. The second approach is instance based. We identify some typical shapes from known sRNA, then use dynamic time warping and Fourier transform based distance to measure how closely the candidate series matches. We demonstrate that these methods can find known sRNA on Arabidopsis thaliana chromosomes and illustrate the benefits of the added information provided by these algorithms.
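The two approaches described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the null distribution, so a Poisson null with a rate estimated from the whole series is assumed here purely for illustration. The dynamic time warping routine is the textbook unconstrained form with a squared pointwise cost, and the Fourier transform based distance is not shown. All function names (`window_sums`, `poisson_upper_tail`, `window_p_values`, `dtw_distance`) are hypothetical.

```python
import math


def window_sums(hits, width):
    """Total hit count inside each sliding window of the given width."""
    if width <= 0 or width > len(hits):
        raise ValueError("width must be between 1 and len(hits)")
    total = sum(hits[:width])
    sums = [total]
    for i in range(width, len(hits)):
        # Slide one position: add the entering count, drop the leaving one.
        total += hits[i] - hits[i - width]
        sums.append(total)
    return sums


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed from the cumulative sum."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def window_p_values(hits, width):
    """Model-based score: p-value of each window under an ASSUMED Poisson null.

    The null rate is the mean hit count per position times the window width.
    """
    lam = width * sum(hits) / len(hits)
    return [poisson_upper_tail(s, lam) for s in window_sums(hits, width)]


def dtw_distance(a, b):
    """Instance-based score: unconstrained dynamic time warping distance."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)  # row 0 of the cumulative cost matrix
    for x in a:
        curr = [inf]
        for j, y in enumerate(b, start=1):
            cost = (x - y) ** 2
            # Best of insertion, deletion, and match from the previous cells.
            curr.append(cost + min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return math.sqrt(prev[-1])
```

In such a sketch, windows with small values from `window_p_values` would be flagged as candidate regions of interest, while `dtw_distance` would compare a candidate window's hit-count profile against a template shape taken from a known sRNA locus. In practice the series would likely need normalising before the distance comparison, which is omitted here.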

Item Type: Monograph (Working Paper)
Faculty \ School: Faculty of Science > School of Computing Sciences
UEA Research Groups: Faculty of Science > Research Groups > Data Science and Statistics
Depositing User: Vishal Gautam
Date Deposited: 04 Apr 2011 12:36
Last Modified: 24 May 2023 05:37
URI: https://ueaeprints.uea.ac.uk/id/eprint/21585
