[R] Fwd: sqldf not joining all the fields
David Winsemius
dwinsemius at comcast.net
Fri Mar 12 20:51:08 CET 2010
On Mar 12, 2010, at 2:15 PM, Natalie Van Zuydam wrote:
> Hi,
> I did quote the string and as I told read.table to strip.white I
> also tried
> it with no spaces.
> "ASPIRIN DISP AAH" %in% tsf_data$name
> [1] FALSE
>
> "ASPIRINDISPAAH" %in% tsf_data$name
> [1] FALSE
>
> I have looked at the last lines of my y_data object and there is a
> problem
> with the file that I'm going to try to use perl to sort out
> max(tsf_data$item_code)
[1] 123612
Looking at the Firefox version of for this value I see line starting:
123612|MINIJET ADRENALINE 1.5"|INJ|100MCG/ML|2.7.3
... has unmatched '"'
--
David.
>
>
> if I y_data[2847:2848,]
> RONIC ACID|TABS|5MG|6.6.2\n652903|ALENDRONIC ACID Once
> Wee|TABS|70MG|1.1.1\n653000|AZATHIOPRINE|INJ|50MG|8.2.1\n653200|
> DORZOLAMIDE
> EYE|DROPS|2%|11.6\n653500|TOPIRAMATE|TABS|50MG|4.8.1\n653510|
> TOPIRAMATE|TABS|100MG|4.8.1\n653513|TOPIRAMATE|TABS|25MG|
> 4.8.1\n653600|GUAR
> GUM
> SF|SACH|5G|6.1.2.3\n653700|TACALCITOL|OINT|NULL|13.5.2\n654000|
> COCODAMOL|CAPS|30/500MG|4.7.1\n654010|COCODAMOL|TABS|12.8/500M|
> 4.7.1\n654020|COCODAMOL|SACH|30/500MG|4.7.1\n654300|ACAMPROSATE
> CALCIUM|TABS|333MG|4.10\n654400|ACECLOFENAC|TABS|100MG|
> 10.1.1\n654500|TILUDRONIC
> ACID|TABS|200MG|6.6.2\n654600|TAMSULOSIN HCL
> MR|CAPS|400MCG|7.4.1\n654800|PENCICLOVIR|CREAM|1%|13.10.3\......
> ........formulation_code strength bnf_code
> 2847 INJ 1MG/ML 3.4.3
> 2848
>
> The output looks like this so it must be the input files and not an
> sqldf
> problem. I didn't see this in the script editor I was using.
>
> Natalie
>
> On Fri, Mar 12, 2010 at 7:05 PM, Dennis Murphy <djmuser at gmail.com>
> wrote:
>
>> Hi:
>>
>> Part of the problem is that your string contains spaces; in your
>> example
>> below, you would need to quote the string, I believe, as in
>>
>> "ASPIRIN DISP AAH" %in% y$name [untested...]
>>
>> HTH,
>> Dennis
>>
>>
>> On Fri, Mar 12, 2010 at 10:59 AM, Natalie Van Zuydam <nvanzuydam at gmail.com
>>> wrote:
>>
>>> 842752|ASPIRIN DISP AAH|TABS|75MG|2.9 is taken directly from my
>>> y_data
>>> text
>>> file.
>>>
>>> If I search for ASPIRIN DISP AAH %in% y$name I get: FALSE.
>>> Despite the
>>> fact that it is there in the text file that I loaded into y. There
>>> must be
>>> a
>>> problem with my input.
>>>
>>> tsf_data <- read.table("feb09_267_presc_items_tsf.txt", header =
>>> TRUE, sep
>>> =
>>> "|", quote = "\"'",
>>> dec = ".",as.is = TRUE,na.strings = "NA",colClasses
>>> = NA,
>>> nrows = 3864284,
>>> skip = 0, check.names = TRUE,fill=TRUE,
>>> strip.white = TRUE, blank.lines.skip = TRUE,
>>> comment.char = "#", allowEscapes = FALSE, flush =
>>> FALSE,
>>> fileEncoding = "", encoding = "unknown")
>>>
>>> Would fill=TRUE and strip.white=TRUE affect how the item_code
>>> columns are
>>> matched? When I look at the file in a script editor I cannot see
>>> any
>>> strange symbols or formatting? I have had to use fill otherwise I
>>> get the
>>> following error message:
>>>
>>> Warning message:
>>> In scan(file, what, nmax, sep, dec, quote, skip, nlines,
>>> na.strings, :
>>> number of items read is not a multiple of the number of columns
>>>
>>> Not sure if correcting this in read.table is affecting the join
>>> function
>>> of
>>> sqldb?
>>>
>>> Thanks,
>>> Natalie
>>>
>>>
>>>
>>> I also tried 842752
>>>
>>> On Fri, Mar 12, 2010 at 6:05 PM, Gabor Grothendieck <
>>> ggrothendieck at gmail.com
>>>> wrote:
>>>
>>>> That is not so. 842752 does not exist in y$item_code and ASPIRIN
>>>> has
>>>> a code of 22730.
>>>>
>>>>> 842752 %in% y$item_code
>>>> [1] FALSE
>>>>
>>>>> subset(y, name == "ASPIRIN")
>>>> item_code name formulation_code strength bnf_code
>>>> 850 22730 ASPIRIN TABS 300MG 4.7.1
>>>> 855 22780 ASPIRIN PDR NULL 4.7.1
>>>> 856 22790 ASPIRIN MIXT $ 4.7.1
>>>>
>>>>
>>>> On Fri, Mar 12, 2010 at 12:51 PM, Natalie Van Zuydam
>>>> <nvanzuydam at gmail.com> wrote:
>>>>> ---------- Forwarded message ----------
>>>>> From: Natalie Van Zuydam <nvanzuydam at gmail.com>
>>>>> Date: Fri, Mar 12, 2010 at 5:49 PM
>>>>> Subject: Re: [R] sqldf not joining all the fields
>>>>> To: David Winsemius <dwinsemius at comcast.net>
>>>>>
>>>>>
>>>>> Dear David
>>>>>
>>>>> I'm not sure what the problem is as for every item code there is a
>>>>> corresponding information in the y_data. For example 842752
>>>>> from the
>>>> x_data
>>>>> corresponds to Aspirin in the y_data? Yet when I use sqldf to
>>>>> join
>>> the
>>>> two
>>>>> df's I get NA values in the columns from the y_data in z for
>>>>> 842752
>>> item
>>>>> code....is there something wrong with my sqldf code or something
>>>>> wrong
>>>> with
>>>>> the way I have inputed the data frames?
>>>>>
>>>>> Thanks for taking the time to help me,
>>>>> Natalie
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 12, 2010 at 5:42 PM, David Winsemius <
>>> dwinsemius at comcast.net
>>>>> wrote:
>>>>>
>>>>>> If I assign the file input to y_data and change you sqldf to
>>>>>>
>>>>>>> z <- sqldf("select * from x_data left join y_data using
>>>> (item_code)"); z
>>>>>>
>>>>>> I can replicate your result. Even after changing the types of
>>>>>> the two
>>>>>> item_code fields to match I still get the same result and when
>>>>>> I see
>>> to
>>>> what
>>>>>> degree they share values I get:
>>>>>>
>>>>>>> sum(x_data$item_code %in% y_data$item_code)
>>>>>> [1] 2
>>>>>>> sum(y_data$item_code %in% x_data$item_code)
>>>>>> [1] 2
>>>>>>
>>>>>>
>>>>>> So why are you so sure they are "complete" as you claimed in your
>>> first
>>>>>> email.
>>>>>>
>>>>>> --
>>>>>> David.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mar 12, 2010, at 12:29 PM, David Winsemius wrote:
>>>>>>
>>>>>> You have now given two different assignments to x_data and none
>>>>>> to
>>>> y_data:
>>>>>>>
>>>>>>> The str( from the file access offering:
>>>>>>>
>>>>>>>> str(x_data)
>>>>>>> 'data.frame': 2848 obs. of 5 variables:
>>>>>>> $ item_code : int 100 110 150 160 161 164 200 210 212
>>>>>>> 220 ...
>>>>>>> $ name : chr "NEONACLEX K" "NEONACLEX" "MESORB"
>>> "ABSORBENT
>>>>>>> CELLULOSE MESO" ...
>>>>>>> $ formulation_code: chr "TABS" "TABS" "DRESS" "DRESS" ...
>>>>>>> $ strength : chr "NULL" "5MG" "10CMX10CM"
>>>>>>> "10CMX10CM" ...
>>>>>>> $ bnf_code : chr "2.2.8" "2.2.1" "20.3.1" "20.3.1" ...
>>>>>>>
>>>>>>> The str from assignment from the dput offering
>>>>>>>> str(x_data)
>>>>>>> 'data.frame': 10 obs. of 10 variables:
>>>>>>> $ prochi : chr "CAO0000713" "CAO0000713" "CAO0000713"
>>>>>>> "CAO0000713" ...
>>>>>>> $ prescribed_date: chr "22/06/2001" "28/04/2000" "10/04/2000"
>>>>>>> "03/07/2000" ...
>>>>>>> $ dataMonth : chr "NULL" "NULL" "NULL" "NULL" ...
>>>>>>> $ item_code : chr "842752" "7800" "842652" "842652" ...
>>>>>>> $ res_seqno : chr "NULL" "NULL" "NULL" "NULL" ...
>>>>>>> $ quantity : chr "60" "100G" "60" "60" ...
>>>>>>> $ directions : chr "1/D" "A/TD" "1/D" "1/D" ...
>>>>>>> $ no_of_packs : chr "NULL" "NULL" "NULL" "NULL" ...
>>>>>>> $ datasource : chr "TSF" "TSF" "TSF" "TSF" ...
>>>>>>> $ scan_ref_no : chr "NULL" "NULL" "NULL" "NULL" ...
>>>>>>>
>>>>>>> This code "worked", but it is not clear that the x-y assignments
>>> were
>>>>>>> correct:
>>>>>>>
>>>>>>> x_data <- read.table(file="
>>>>>>> http://n4.nabble.com/file/n1590804/
>>>>>>> feb09_267_presc_items_tsf.txt",
>>>> header
>>>>>>> = TRUE, sep = "|", quote = "\"'",
>>>>>>> dec = ".",as.is = TRUE,na.strings =
>>>>>>> "NA",colClasses =
>>> NA,
>>>>>>> nrows = 3864284,
>>>>>>> skip = 0, check.names = TRUE,fill=TRUE,
>>>>>>> strip.white = TRUE, blank.lines.skip = TRUE,
>>>>>>> comment.char = "#", allowEscapes = FALSE, flush =
>>> FALSE,
>>>>>>> fileEncoding = "", encoding = "unknown")
>>>>>>>
>>>>>>> --
>>>>>>> David.
>>>>>>>
>>>>>>> On Mar 12, 2010, at 12:23 PM, Newbie19_02 wrote:
>>>>>>>
>>>>>>>
>>>>>>>> The y_data file has over 9000 rows in it so I thought it
>>>>>>>> would be
>>> more
>>>>>>>> practical to give you the file to download....
>>>>>>>> --
>>>>>>>> View this message in context:
>>>>>>>>
>>>>
>>> http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590833.html
>>>>>>>> Sent from the R help mailing list archive at Nabble.com.
>>>>>>>>
>>>>>>>> ______________________________________________
>>>>>>>> R-help at r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>> PLEASE do read the posting guide
>>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>>> and provide commented, minimal, self-contained, reproducible
>>>>>>>> code.
>>>>>>>>
>>>>>>>
>>>>>>> David Winsemius, MD
>>>>>>> West Hartford, CT
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide
>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible
>>>>>>> code.
>>>>>>>
>>>>>>
>>>>>> David Winsemius, MD
>>>>>> West Hartford, CT
>>>>>>
David Winsemius, MD
West Hartford, CT
More information about the R-help
mailing list