- Assistant Professor
- Expertise:
1. Informatics
a. Algorithm design.
b. Software development and computer programming, both prototyping and machine-level optimization. Familiar with Perl, FORTRAN, and C.
c. Bioinformatics: design and analysis of high-throughput gene expression data from gene arrays (including GeneChip microarrays).
2. Biostatistics
a. Data analysis using a variety of statistical software; experienced in R (and S-Plus), SAS, Stata, and SPSS.
i. Experienced in survival analysis (including multiple events, competing risks, frailty, state-space and Markov chain), and repeated measures (random-effects models and marginal models).
b. Designing analysis plans for biomedical questions, drawing on both biomedical and statistical science.
c. Design of epidemiological surveys, cross-sectional studies, and laboratory animal studies; Design of questionnaires and forms; Proposal and grant writing.
3. Instructor (problem-based learning), Lecturer, Workshop facilitator. Group formation and management.
- Research Interests: I am interested in improving information retrieval from digital repositories. This can take the form of new data-mining methods as well as new search engines.
We have designed and implemented a dual-mining method in which biomedical patterns detected in a clinical data repository are cross-validated against a knowledgebase, in order to home in on interesting patterns and eliminate uninteresting ones. Mining both the database and the knowledgebase yields a surprise score: when the strength of the same pattern differs between the two repositories, the pattern receives a high surprise score and is considered more interesting. Details about the method are published in "Siadaty MS, Knaus WA. Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining method. BMC Med Inform Decis Mak. 2006 Mar 7;6(1):13". Here is an excerpt:
Subject: information retrieval systems, multi-repository data mining, interestingness measures:
Data mining can be utilized to automate analysis of substantial amounts of data produced in many organizations. However, data mining produces large numbers of rules and patterns, many of which are not useful. Existing methods for pruning uninteresting patterns have only begun to automate the knowledge acquisition step (which is required for subjective measures of interestingness), hence leaving a serious bottleneck. In this project we proposed a method, an automatic acquisition of knowledge, to shorten the pattern list by locating the novel and interesting ones.
The dual-mining method is based on automatically comparing the strength of patterns mined from a database with the strength of equivalent patterns mined from a relevant knowledgebase. When these two estimates of pattern strength do not match, a high "Surprise score" is assigned to the pattern, identifying the pattern as potentially interesting. The surprise score captures magnitude of novelty or interestingness of the mined pattern. In addition, we show how to compute p values for each surprise score, thus filtering out noise and attaching statistical significance.
We have implemented the dual-mining method using scripts written in Perl and R. We applied the method to a large patient database (University of Virginia's Clinical Data Repository) and a biomedical literature citation knowledgebase (MEDLINE).
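The comparison of pattern strengths described above can be illustrated with a small sketch. This is not the formula from the published paper: it assumes, purely for illustration, that pattern strength is measured as a log odds ratio from a 2x2 co-occurrence table, that the surprise score is the absolute difference between the two log odds ratios, and that the p value comes from a normal approximation. The function names and the example counts are hypothetical.

```python
import math

def log_odds_ratio(a, b, c, d):
    """Log odds ratio and its standard error for a 2x2 table.

    a, b, c, d are the cell counts (both items, first only, second only,
    neither). Adding 0.5 to each cell avoids division by zero.
    """
    a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    lor = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return lor, se

def surprise_score(db_table, kb_table):
    """Illustrative surprise score (an assumption, not the published method).

    Returns the absolute difference in pattern strength between the
    database and the knowledgebase, plus a two-sided p value for that
    difference under a normal approximation.
    """
    lor_db, se_db = log_odds_ratio(*db_table)
    lor_kb, se_kb = log_odds_ratio(*kb_table)
    diff = lor_db - lor_kb
    z = diff / math.sqrt(se_db ** 2 + se_kb ** 2)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return abs(diff), p_value

# Hypothetical counts: a strong association in the database,
# no association in the knowledgebase.
score, p = surprise_score((90, 10, 10, 90), (50, 50, 50, 50))
```

In this sketch, a pattern that is equally strong in both repositories scores zero, while a pattern that is strong in one and absent in the other scores high with a small p value.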
In a second project, we investigated why search engines for the MEDLINE database retrieve irrelevant articles for a given query. We have developed a method to retrieve more relevant articles, and have implemented the method in a new search engine, www.relemed.com . The method and the search engine are described in a published paper, "Siadaty MS, Shu JS, Knaus WA. Relemed: Sentence-level search engine with relevance score for the MEDLINE database of biomedical articles. BMC Med Inform Decis Mak. 2007 Jan 10;7:1". Here is an excerpt:
Subject: information retrieval systems, search engines, relevance metrics, natural language processing:
Encountering extraneous articles in response to a query submitted to MEDLINE/PubMed is not uncommon. However, every one of the articles retrieved contains all of the query words. This led us to the conclusion that the presence of query words in an article is not a sufficient condition for the article to be relevant to the user's query, although it is a necessary one. About 83% of queries sent to PubMed, NLM's search engine for MEDLINE, are multi-word queries. When submitting a query with multiple words, the user is usually interested in some type of relationship between the words, such that the "presence of relationship" between the query words in the article also becomes a necessary condition for relevance. We proposed that if two words occur within an article, the probability that a relation between them is explained is clearly higher when the words occur within the same sentence (or adjacent sentences) versus remote sentences.
We have developed "Relemed", a search engine for MEDLINE. Relemed increases specificity and precision of retrieval by searching for query words within sentences rather than the whole article. It uses sentence-level concurrence as a statistical surrogate for the existence of relationship between the words. It also estimates a relevance score and sorts the results on this basis, thus shifting irrelevant articles lower down the list. We used a distributed parallel search architecture to keep the response time short despite the heavy natural language processing required.
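The sentence-level idea can be sketched in a few lines. This is a simplified illustration, not Relemed's actual relevance score: the tier values (3, 2, 1, 0), the naive punctuation-based sentence splitter, and the function names are all assumptions made for the example.

```python
import re

def split_sentences(text):
    """Naive splitter: breaks after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_set(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def relevance_tier(article, query):
    """Hypothetical tiered score based on how closely query words co-occur.

    3: all query words in one sentence
    2: all query words within two adjacent sentences
    1: all query words somewhere in the article
    0: at least one query word missing
    """
    q = word_set(query)
    sentences = [word_set(s) for s in split_sentences(article)]
    if not q or not q <= set().union(*sentences):
        return 0
    if any(q <= s for s in sentences):
        return 3
    if any(q <= (s1 | s2) for s1, s2 in zip(sentences, sentences[1:])):
        return 2
    return 1

def rank(articles, query):
    """Sort articles so that closer co-occurrence comes first."""
    return sorted(articles, key=lambda a: relevance_tier(a, query), reverse=True)
```

A real system would need a sentence splitter that handles abbreviations and a finer-grained score; the tiers here only show why an article with the query words in one sentence would be listed above one where they are far apart.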