Pattern Matching for DNA Sequencing Data

Using Spring Batch and Karp Rabin

Author : Wadï Mami

E-mail : wmami@steg.com.tn/ didipostman77@gmail.com

Date : 17/06/2012

Abstract :

Processing large volumes of data has always been a challenge, and data volumes keep growing. Batch processing applies to many use cases, and pattern matching on DNA sequencing data is one of them. In this article, I demonstrate batch processing with Spring Batch, one of the Spring projects. Spring Batch provides functions for processing large volumes of data in batch jobs. In our case, the job reads DNA sequences from a file or a database table and searches for patterns, that is, it finds all the locations of a specified pattern inside a DNA sequence.

Spring Batch to process huge data :

Spring Batch is a lightweight, comprehensive batch framework designed to enable the development of robust batch applications vital for the daily operations of enterprise systems.

Spring Batch reads and processes DNA records one at a time until the commit-interval count is reached, then writes the transformed items (DNA) together as a single chunk.

Spring Batch uses a 'chunk-oriented' processing style within its most common implementation. Chunk-oriented processing refers to reading the data one item at a time and creating 'chunks' that will be written out within a transaction boundary. One item is read in from an `ItemReader`, handed to an `ItemProcessor`, and aggregated. Once the number of items read equals the commit interval, the entire chunk is written out via the `ItemWriter`, and then the transaction is committed.

Below is a code representation of the same concepts shown above:

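List<Object> items = new ArrayList<>();
for (int i = 0; i < commitInterval; i++) {
    Object item = itemReader.read();   // returns null once the input is exhausted
    if (item == null) {
        break;
    }
    Object processedItem = itemProcessor.process(item);
    items.add(processedItem);
}
itemWriter.write(items);               // the whole chunk is written at once, then committed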

A step is an object that encapsulates a sequential phase of a job and holds all the information necessary to define and control processing. The step is declared inside a Job (job.xml), which runs its steps to carry out the task.

<job id="dnaSeq">
    <step id="dnaSeqStep">
        <tasklet transaction-manager="transactionManager">
            <chunk reader="csvItemReader" writer="csvItemWriter"
                processor="DNA_SequenceProcessor" commit-interval="2">
            </chunk>
        </tasklet>
    </step>
</job>

Configuring ItemReader

We will now define the ItemReader for our model. It reads the data from a CSV file.

<bean:bean id="csvItemReader"
    class="org.springframework.batch.item.file.FlatFileItemReader"
    scope="step">
    <bean:property name="resource"
        value="classpath:ch02/data/DNA.csv"/>
    <bean:property name="lineMapper">
        <bean:bean
            class="org.springframework.batch.item.file.mapping.DefaultLineMapper">
            <bean:property name="lineTokenizer" ref="lineTokenizer"/>
            <bean:property name="fieldSetMapper">
                <bean:bean class="org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper">
                    <bean:property name="prototypeBeanName" value="DNA_Sequence">
                    </bean:property>
                </bean:bean>
            </bean:property>
        </bean:bean>
    </bean:property>
</bean:bean>

<bean:bean id="lineTokenizer"
    class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer">
    <bean:property name="delimiter" value=","/>
    <bean:property name="names">
        <bean:list>
            <bean:value>dna</bean:value>
            <bean:value>crissprArrays</bean:value>
        </bean:list>
    </bean:property>
</bean:bean>
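
To make the reader configuration more concrete, here is a rough plain-Java imitation of what the tokenizer and the field-set mapper do together: split one CSV line on "," and copy the two fields onto an item whose properties are named dna and crissprArrays. This is only a sketch. The class names DnaLineMappingSketch and DnaSequence and the method mapLine are hypothetical, and the real mapping is done by Spring Batch through bean setters, not by this code.

```java
public class DnaLineMappingSketch {

    /** Hypothetical item class; its property names mirror the tokenizer's "names" list. */
    public static class DnaSequence {
        private String dna;
        private String crissprArrays;

        public String getDna() { return dna; }
        public void setDna(String dna) { this.dna = dna; }
        public String getCrissprArrays() { return crissprArrays; }
        public void setCrissprArrays(String crissprArrays) { this.crissprArrays = crissprArrays; }
    }

    /** Splits one CSV line on "," and copies the two fields onto a new item. */
    public static DnaSequence mapLine(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 2) {
            throw new IllegalArgumentException("Expected 2 fields but got " + fields.length);
        }
        DnaSequence item = new DnaSequence();
        item.setDna(fields[0]);            // first name in the list: dna
        item.setCrissprArrays(fields[1]);  // second name in the list: crissprArrays
        return item;
    }

    public static void main(String[] args) {
        DnaSequence item = mapLine("ACGTACGT,ACG");
        System.out.println(item.getDna() + " / " + item.getCrissprArrays());
    }
}
```

The order of the names in the tokenizer matters: the first column of each line goes to dna and the second to crissprArrays.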

Configuring ItemProcessor

<bean:bean id="DNA_SequenceProcessor" scope="step"
    class="com.juxtapose.example.ch02.DNA_SequenceProcessor">
</bean:bean>

As you can see, I use a DNA_SequenceProcessor class that implements ItemProcessor and uses the Karp Rabin algorithm.
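
The processor class itself is not listed here, so as a rough illustration of the Karp Rabin idea, here is a minimal standalone sketch that returns every start position of a pattern inside a DNA string. The names RabinKarpSearch and findAll and the constants BASE and MOD are hypothetical choices for this example, and the class in the repository may be written differently. The sketch compares a rolling hash of each text window with the hash of the pattern and confirms every hash hit with a direct comparison, because different strings can share a hash.

```java
import java.util.ArrayList;
import java.util.List;

public class RabinKarpSearch {

    private static final int BASE = 256;             // radix of the polynomial hash (assumed)
    private static final long MOD = 1_000_000_007L;  // large prime modulus (assumed)

    /** Returns every start index at which pattern occurs in text (overlaps included). */
    public static List<Integer> findAll(String text, String pattern) {
        List<Integer> positions = new ArrayList<>();
        int n = text.length();
        int m = pattern.length();
        if (m == 0 || m > n) {
            return positions;
        }

        // highPow = BASE^(m-1) mod MOD, the weight of the window's leading character
        long highPow = 1;
        for (int i = 1; i < m; i++) {
            highPow = (highPow * BASE) % MOD;
        }

        // Hash the pattern and the first window of the text
        long patternHash = 0;
        long windowHash = 0;
        for (int i = 0; i < m; i++) {
            patternHash = (patternHash * BASE + pattern.charAt(i)) % MOD;
            windowHash = (windowHash * BASE + text.charAt(i)) % MOD;
        }

        for (int start = 0; start + m <= n; start++) {
            // Equal hashes only suggest a match, so verify character by character
            if (patternHash == windowHash && text.regionMatches(start, pattern, 0, m)) {
                positions.add(start);
            }
            if (start + m < n) {
                // Roll the window: drop text[start], then append text[start + m]
                windowHash = (windowHash - text.charAt(start) * highPow % MOD + MOD) % MOD;
                windowHash = (windowHash * BASE + text.charAt(start + m)) % MOD;
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(findAll("ACGTACGTTACG", "ACG")); // prints [0, 4, 9]
    }
}
```

An ItemProcessor could call a method like this from its process method and store the returned positions on the item before it is handed to the writer.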

ItemWriter

Once the data is processed, it needs to be stored in a file, as our requirement dictates.

<bean:bean id="csvItemWriter"
    class="org.springframework.batch.item.file.FlatFileItemWriter"
    scope="step">
    <bean:property name="resource" value="file:target/ch02/outputFile.csv"/>
    <bean:property name="lineAggregator">
        <bean:bean
            class="org.springframework.batch.item.file.transform.DelimitedLineAggregator">
            <bean:property name="delimiter" value="|"></bean:property>
            <bean:property name="fieldExtractor">
                <bean:bean
                    class="org.springframework.batch.item.file.transform.BeanWrapperFieldExtractor">
                    <bean:property name="names"
                        value="dna, seqDNA_Arrays">
                    </bean:property>
                </bean:bean>
            </bean:property>
        </bean:bean>
    </bean:property>
</bean:bean>

Conclusion

This article just scratched the surface of Spring Batch in general. The example used in this article is not production-ready code. You can define job configuration depending on your project requirements.

Here is the GitHub repository for the project:

https://github.com/didipostman/SBKarpRabin

DNA is a sequence of letters such as A, C, G, T. Searching for specific sequences is often difficult due to measurement errors, mutations, or evolutionary alterations. Thus, measuring the similarity of two sequences, for example with Levenshtein distance, is often more useful than looking for exact matches.
So instead of Karp Rabin, we can use Levenshtein distance or Jaro-Winkler similarity from the package org.apache.commons.text.similarity:

https://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/similarity/package-summary.html
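
To illustrate approximate matching, here is a minimal sketch in plain Java that slides a window over the DNA string and keeps every position whose window is within a given number of edits of the pattern. The names ApproximateDnaSearch, levenshtein, and findApproximate are hypothetical, and the fixed-length window is a simplifying assumption: it finds substitutions well but will miss matches whose length differs from the pattern. In a real job, the LevenshteinDistance class from the Commons Text package could replace the hand-written distance method.

```java
import java.util.ArrayList;
import java.util.List;

public class ApproximateDnaSearch {

    /** Classic dynamic-programming Levenshtein distance, keeping only two rows. */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a: j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b: i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Start indices of pattern-length windows within maxDistance edits of pattern. */
    public static List<Integer> findApproximate(String dna, String pattern, int maxDistance) {
        List<Integer> positions = new ArrayList<>();
        int m = pattern.length();
        if (m == 0) {
            return positions;
        }
        for (int start = 0; start + m <= dna.length(); start++) {
            String window = dna.substring(start, start + m);
            if (levenshtein(window, pattern) <= maxDistance) {
                positions.add(start);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // "AGGT" at index 4 differs from "ACGT" by one substitution
        System.out.println(findApproximate("ACGTAGGT", "ACGT", 1)); // prints [0, 4]
    }
}
```

With maxDistance set to 0 the search falls back to exact matching, so the threshold is the single knob that trades precision for tolerance to errors and mutations.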

So Spring Batch + Levenshtein distance (or Jaro-Winkler similarity) = how CRISPR-Cas9 works, based on https://www.tudelft.nl/en/2018/tu-delft/mathematics-explains-why-crispr-cas9-sometimes-cuts-the-wrong-dna

Nanorobots could be guided by Spring Batch + Karp Rabin (or Levenshtein distance or Jaro-Winkler similarity). This is just a suggestion, as I am not a nanotechnology expert, only a Java developer. Such robots might then help cure diseases like cancer, diabetes, etc.

For the definitive guide on how Spring Batch may explain how CRISPR-Cas9 works, read this link: https://didipostmanprojects.blogspot.com/2023/01/the-definitive-explanation-on-how.html
